jo-jo

jo-jo is the final piece in a trio that includes "The Watsons, 2016" and "A.M.I, 2017". With jo-jo I sought to explore two themes: the dissemination of "false" information online and its often confusing influence on the impressionable, and the desire of those who are anonymous to test the boundaries of what is perceived as correct or tasteful once accountability is removed (trolling). At a time when the term 'fake news' is used daily by politicians, and attempts at foreign influence on elections via the spread of misinformation have become expected, it is becoming ever harder to decide what to believe online.

When you consider that those who are far more susceptible to influence (such as children) have access to this information, you have to question what this uncertainty will do to their beliefs. jo-jo exemplifies this; he is the archetypal vulnerable child. He will learn anything he is told and will openly use the internet to research any topics he might find useful and related to that information, just as a young child would during the curious stage of development. Unlike a child, however, he won't make any judgment calls on whether the information is legitimate, correct or up to date. He will simply adopt those values, representing the worst-case scenario, and an ever more likely one given that the spread of misinformation is becoming more advanced daily.

I'm hoping that seeing a child audibly recite beliefs that should not be common in someone his age will shock the audience into reconsidering both what they might put online (trolling) and the current state of misinformation we are in (fake news). "Fake news" and strongly politically leaning news (e.g., Infowars) have become so prevalent that they have vast online exposure to people with no prior knowledge of the biases these outlets might have, biases which can affect the viewpoint of the news. This will likely have a significant impact on the beliefs of the young, and it highlights that we need to do something to help children spot information they should treat skeptically.

During the development of jo-jo, I was influenced by research I was undertaking into censorship and contemporary art, specifically how some artists actively court censorship for either motivation or financial gain. Liberal democracies tend to oppose censorship by supporting those who are its victims; it could be argued that, because of this, censorship becomes a ‘joyous’ act for the artist, allowing them to receive a small amount of ‘power’ from the censor in the form of outcry and publicity. This is supported by Harris' discussion of Spinoza's conatus doctrine: "capitalist social relations can channel desire in such a way that the employee can ultimately desire their repression." I believe this can be applied to misinformation and its spread. Spreading, and ultimately falling for, misinformation provides a scapegoat for beliefs that are not seen as 'suitable' in the modern context. Those with far-right views might adopt "fake news" to support their argument, remaining willfully ignorant of its legitimacy for the reward of being "justified" in their beliefs: a kind of "Stockholm syndrome". They actively benefit from the spread of, and belief in, misinformation. This could also suggest why a lot of politically motivated fake news appears to come from hard-left or hard-right sources. This is just one of the reasons it is so crucial for the young to understand the motivation behind any given information.


Construction

At the heart of jo-jo is a powerful machine-learning chatbot, initially trained on a corpus of internet conversations scraped from Twitter and Reddit. This provides the initial influence for the child, allowing him to converse on a set of basic questions and topics (e.g., What's your name? What do you think of climate change?). He adopts the opinions of any conversation he can find, so he could be seen as a truly influenceable child, similar to a toddler. He can then take any question or statement spoken aloud by the audience (similar to Alexa), which in turn affects his opinions and his subsequent responses. In theory, if everyone in the gallery tells jo-jo he's "stupid," he will soon adopt the idea that he is stupid and will tell people so.


Physical sculpture

jo-jo is modeled after me at a few years old, sculpted using reference photos and videos of myself at that age. He is a polymer cast painted with a light matte white paint, which allows the projection to retain clarity rather than the light diffusing within the plastic. Using a polymer rather than the original clay sculpt allowed me to create a much lighter piece, which helps not only with transport but also with the projection mapping, as he can be attached directly to the chair via one of his legs to keep him stable. Because the cast is hollow, I filled the inside of the sculpture with expanding foam, adding density while adding as little weight as possible.

By using clay for the original sculpt, latex and modroc as the molding materials, and a custom-mixed polymer as the casting material, I was able to retain a shocking amount of detail. This included textile seams and individual stitching, which were hand sculpted. To achieve the fabric textures, I pressed fabric samples into the clay. Because of the nature of the mold, I was able to produce two jo-jo casts before the frame lost integrity and would no longer retain enough detail.



After sanding down imperfections and casting lines by hand, I took multiple photos of the child from different angles so I could later create the texture that I would use for the projection mapping. Overall, four pieces fit together: the hat, the head, the body and the hand.


Software

The piece is made up of a complex set of hardware and software that allows seamless interaction with the viewer at low hardware cost. It comprises: a Raspberry Pi Model B+ (powers the TV screen), an ASUS Tinkerboard (projection mapping, keyword detection, sound), a short-throw 1080p projector, speakers, microphones, a high-powered VPS (machine learning, chatbot, host control) and a MacBook Pro (speech).

Projection mapping on a tinkerboard

To projection-map the child I repurposed a piece of software I wrote for the artwork "A.M.I, (2016)". This software, built upon a 'cinder block', was written in C++ and used OpenGL to be as graphically efficient as possible. This was particularly important given I was running it on very minimal hardware.


(Above) The first test, projection mapping the sculpture. The textures were simplified to allow me to see the fit on the sculpture.

After some experimentation, I compiled cinder on the tinkerboard's Linux and was able to build my original project from "A.M.I,"; it ran surprisingly fast (at around 30fps). This caused the board to heat up considerably, so I positioned a small fan over the massive heatsink on this little raspberry-pi-like board. I reached the point where there was virtually no further optimization that could feasibly be done in code to speed up the mapping, so I had to turn to a bit of a physical hack. Using a very powerful short-throw projector, I positioned it several meters away from the sculpture. This meant the mapped projection took up only a tiny percentage of the screen, so the application was required to render significantly less. This method worked a charm and increased the framerate to around 40-50fps.

(Image) The final mapping texture - Why no eyes or lips? These were added dynamically to allow animation. The green spiral was also added in post as it was animated via the application.

Keyword detection, speech recognition, and response

Above: basic setup of the keyword detection

This was tough. The piece was due to be exhibited during an incredibly busy exhibition where it would be consistently loud, and the last thing I needed was jo-jo continually asking people what they had said. So I decided to create a keyword trigger that would prompt the listening process, increasing the chance he would only listen when people were talking to him, not around him. To do this, I turned to the pocketsphinx library. Using this open-source library, I could perform keyword detection locally on the board rather than through an API, which would not only cost me a significant amount of money but also require exceptional bandwidth and internet connectivity, which was doubtful when using eduroam.

Only issue - the speech models available via cmusphinx were all too restrictive. The range of accents at this show meant the keyword detection would likely not work whatsoever, especially in lousy mic conditions. So instead, I trained my own model using Mozilla's Common Voice project. This project collected real samples of thousands of people talking, all with varying accents. This worked a lot better, and with some testing, I decided on the phrase "hey" as a wake command.

Interestingly, I found an odd bias when using the pre-trained models provided with cmusphinx. With the keyword phrase "joe", the success rate for those identifying as men was ~80%, whereas for women it was only ~10%. There could be many reasons for this, including perhaps lower frequencies being picked up more efficiently by the microphone. But I think it most likely occurred due to an imbalance in the training data, where more 'male' samples were used. This is a commonly arising issue in machine learning. Come on people, this isn't hard to fix.
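A comparison like this comes down to counting detections per group. The snippet below is a minimal sketch of that bookkeeping, not the script used for the piece: the function name `detection_rate_by_group` and the trial log are invented for illustration, with the numbers chosen only to mirror the rough figures above.

```python
from collections import defaultdict


def detection_rate_by_group(trials):
    """Return {group: fraction of trials in which the keyword was detected}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, detected in trials:
        totals[group] += 1
        if detected:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical trial log: (self-identified group, was the keyword detected?).
trials = (
    [("men", True)] * 8 + [("men", False)] * 2
    + [("women", True)] * 1 + [("women", False)] * 9
)
rates = detection_rate_by_group(trials)
```

Logging every attempt this way, rather than noting failures informally, makes it easier to tell a real imbalance from a handful of unlucky recordings.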

Once the keyword is detected, my Python script begins listening to the audience member speaking. Once it detects that they have stopped (amplitude below a set threshold for 2.5 seconds), the piece responds with a waiting phrase such as "mmm, let me think" before preparing a payload for the logic server. Because speed is essential, I developed a strange but efficient method of creating the payload, performed using bash, as follows:

cd /home/linaro/Desktop/mic-jojo/server && mkdir zip_pl && sox payload.wav zip_pl/payload.mp3 && rm payload.wav && zip payload.zip -r zip_pl/ && rm -r zip_pl

The bash script

Step by step:

  1. Make directory zip_pl
  2. Use Sox to convert wav to lossy mp3 (large filesize reduction; it loses quality, but that is not an issue in the bad recording environment)
  3. Clean up old files (wav file)
  4. Zip the payload
  5. Remove the old zip_pl folder
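The zipping and clean-up steps can also be approximated in Python with the standard library. This is only a sketch under assumptions: the wav-to-mp3 conversion is left out because it depended on sox, the function name `zip_payload` is hypothetical, and a dummy file stands in for the converted recording.

```python
import os
import tempfile
import zipfile


def zip_payload(audio_path, zip_path, folder="zip_pl"):
    """Store one audio file under `folder/` in a new zip, then delete the original."""
    arcname = folder + "/" + os.path.basename(audio_path)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.write(audio_path, arcname=arcname)
    os.remove(audio_path)  # clean up, as the bash version does
    return arcname


# Demo with a dummy file standing in for the converted recording.
workdir = tempfile.mkdtemp()
audio_path = os.path.join(workdir, "payload.mp3")
with open(audio_path, "wb") as handle:
    handle.write(b"not real audio")
zip_path = os.path.join(workdir, "payload.zip")
zip_payload(audio_path, zip_path)
with zipfile.ZipFile(zip_path) as archive:
    names = archive.namelist()
```

Unlike the bash one-liner, this never creates a temporary folder on disk; the folder only exists as a path prefix inside the archive.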

I used a zipped folder rather than the individual file because my debugging stage involved sending more files to the server, so the zip had an added benefit. It also feels like a cleaner approach to me.
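Stepping back to the listening stage, the end-of-speech rule (amplitude below a threshold for 2.5 seconds) can be sketched as below. This is an illustration rather than the script that ran on the board: it assumes the microphone audio has already been reduced to one amplitude value per fixed-length frame, and the function name, frame length and threshold value are all hypothetical.

```python
def end_of_speech_index(amplitudes, frame_seconds, threshold, silence_seconds=2.5):
    """Return the index of the frame that completes the required silence, or None."""
    quiet_frames = 0
    for index, amplitude in enumerate(amplitudes):
        if amplitude < threshold:
            quiet_frames += 1
            if quiet_frames * frame_seconds >= silence_seconds:
                return index
        else:
            quiet_frames = 0  # any loud frame restarts the silence timer
    return None


# Half-second frames: speech, a short pause, more speech, then a long silence.
frames = [0.9, 0.8, 0.05, 0.04, 0.7, 0.02, 0.01, 0.03, 0.02, 0.01, 0.6]
stop_at = end_of_speech_index(frames, frame_seconds=0.5, threshold=0.1)
still_talking = end_of_speech_index([0.9, 0.8, 0.7], frame_seconds=0.5, threshold=0.1)
```

Resetting the counter on every loud frame is what stops a short pause mid-sentence from cutting the speaker off.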

Once my server receives the push request, it downloads the generated folder from the tinkerboard and begins the speech-to-text stage. Using CMUSphinx, I implemented my own speech-to-text, which produced a list of likely transcripts for any given audio file. This was good, but I found the results for varying accents and conditions to be poor, so to prevent corrupting the chatbot corpus with terrible dialog I turned to IBM and their Bluemix API for fast, accurate transcription. Once I received the transcription from IBM, I sent it to the chatbot, which was running locally on a port on the server (exposed via a Flask python server) and returned a response. This response was then sent to the TTS server on my laptop, which produced an audio file and sent a publicly exposed URL back to the server. The server packaged this up and returned it as a JSON payload to the tinkerboard, which would download the response and play it to the audience member.


(Above) a reasonably detailed explanation of how the system produces a response to a question; it omits the continual training portion.
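The server-side chain of transcription, chatbot and text-to-speech can be sketched as a single function that takes each stage as a callable. This is a simplified illustration, not the server code: the function name `handle_payload`, the JSON field names and the stub stages are all assumptions, and the real stages involved network calls to IBM, the local chatbot and the TTS machine.

```python
import json


def handle_payload(audio_bytes, transcribe, chatbot_reply, synthesize):
    """Chain speech-to-text, chatbot and text-to-speech, then return a JSON string."""
    transcript = transcribe(audio_bytes)    # e.g. a hosted speech-to-text API
    reply_text = chatbot_reply(transcript)  # e.g. a request to the local chatbot port
    audio_url = synthesize(reply_text)      # e.g. the TTS machine returning a public URL
    return json.dumps(
        {"transcript": transcript, "reply": reply_text, "audio_url": audio_url}
    )


# Stub stages so the sketch runs without any network access.
response = handle_payload(
    b"fake-audio",
    transcribe=lambda audio: "what is your name",
    chatbot_reply=lambda text: "my name is jo-jo",
    synthesize=lambda text: "https://example.invalid/reply.mp3",
)
decoded = json.loads(response)
```

Passing the stages in as callables makes it simple to swap one out, much as the CMUSphinx transcription was swapped for IBM's.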

Chatbot and constant learning

An essential part of the piece was the ability for jo-jo to respond to given questions, but also improve his knowledge and subsequent replies based on that conversation and the information he can find online.

To achieve this, I used machine learning to develop a chatbot that could be re-trained throughout the night to provide a primary form of 'learning.' The core of the system is implemented with chatterbot, a machine-learning conversational dialog engine.

Chatterbot loosely works as follows:

  1. Get input
  2. Process input
  3. Return output

Every interaction with the chatbot is also put back into the corpus, allowing the chatbot to continually learn new phrases and responses to questions. In addition, an intermediate step when feeding the audience input into the chatbot runs the input through NLTK to parse its likely subject. My system then performs multiple searches for that topic on opinion websites such as Reddit and Twitter, and the results are used as additional training material.

By doing this, jo-jo is effectively able to learn whatever the audience says to him. This was very clear during the exhibition, after he learned many unusual views.
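The loop of getting input, processing it, returning output and feeding the exchange back into the corpus can be illustrated with a toy retrieval bot. This is a stand-in written with Python's standard `difflib`, not chatterbot's API or its matching logic; the class name `TinyLearningBot` and the rule "store each input as a response to the bot's previous reply" are assumptions made for the sketch, and the topic research step is left out.

```python
import difflib


class TinyLearningBot:
    """Toy retrieval bot: the closest known statement wins, and inputs are stored."""

    def __init__(self, pairs):
        self.pairs = [(s.lower(), r) for s, r in pairs]  # (statement, response)
        self.previous = None  # the bot's last reply

    def respond(self, text):
        text = text.lower().strip()
        known = list(dict.fromkeys(s for s, _ in self.pairs))
        match = difflib.get_close_matches(text, known, n=1, cutoff=0.0)
        reply = "i don't know"
        if match:
            # Prefer the most recently learned response for the matched statement.
            reply = next(r for s, r in reversed(self.pairs) if s == match[0])
        if self.previous is not None:
            # Learning step: treat the input as a response to the bot's last reply.
            self.pairs.append((self.previous, text))
        self.previous = reply.lower()
        return reply


bot = TinyLearningBot([("what is your name", "my name is jo-jo")])
first = bot.respond("What's your name?")
bot.respond("you are stupid")            # stored against the bot's previous reply
echoed = bot.respond("my name is jo-jo")  # the bot now repeats what it was told
```

Even in this tiny version, one hostile remark ends up in the bot's own mouth two turns later, which is the behavior the piece relies on.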

What happened to jo-jo?

At the beginning of the exhibition, jo-jo was trained on a generic corpus, one that included responses for a few topics but nothing in particular. The sole purpose was to explore how people would react to this power and whether they would mistreat it and teach the child terrible topics of conversation.

I used some NLTK tools, including my own "stormkit", to analyse the corpus at varying intervals throughout the night. jo-jo started the night with a corpus containing a ~64% skew to negative topics. Two hours into the exhibition it had changed to a ~89% skew, reflecting an audience that was generally hostile towards the child. The largest topics within his corpus were "guns", "pro-gun", "soviet union", "Russia", "communism", "UFO's". He produced alarming responses, including:

"When will Putin invade?"

I believe this one originated from someone asking jo-jo a question about Crimea.

"Communism from people who want to keep their Volvos. Any of the various theories or systems of ownership and operation of the means of production and distribution by society or the community rather than by private individuals, with all members of society or the community sharing in the wo..."

This one is a mystery; it wasn't in the original corpus. I think it must have come from research into one of the key terms.

"Tell me about communism"

I think the considerable interest in communism is because jo-jo was situated across from a piece by Kotryna Sajeviciute which commented directly on the Soviet Union. People standing near jo-jo may have discussed that piece while jo-jo was listening, and he subsequently researched the topic.

"Should kill all lithuanian[s]"

This is just concerning. I believe it has a similar origin to the statement above.
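A skew figure like the ones reported at the start of this section can be approximated by counting how many corpus entries contain a word from a negative-word list. The sketch below is a crude stand-in, not how "stormkit" or the NLTK tools worked (their method is not described here); the lexicon, the function name `negative_share` and the sample lines are all invented for illustration.

```python
import re

# Hypothetical lexicon; a real analysis would use a proper sentiment resource.
NEGATIVE_WORDS = {"hate", "stupid", "invade", "war", "kill"}


def negative_share(corpus, lexicon=NEGATIVE_WORDS):
    """Return the fraction of corpus entries containing at least one lexicon word."""
    if not corpus:
        return 0.0
    flagged = sum(
        1 for line in corpus
        if set(re.findall(r"[a-z']+", line.lower())) & lexicon
    )
    return flagged / len(corpus)


sample = [
    "What is your name?",
    "You are stupid",
    "When will Putin invade?",
    "Tell me about communism",
]
share = negative_share(sample)
```

Running the same measure at intervals through the night would give a series comparable in spirit to the ~64% and ~89% figures, although a word list this small would miss most hostile phrasing.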


Thoughts

jo-jo performed exactly as I had hoped. His behavior exposed the highly sensitive relationship between trust and the freedom of information. The general interaction between jo-jo and the audience could be likened to a physical manifestation of 'trolling.' By removing the human entity but keeping the back-and-forth of dialog, the piece actively encouraged people to 'let down their guard', resulting in them insulting and mocking jo-jo and telling him to research an array of inappropriate topics. This could be likened to how a sector of the internet communicates online. Hiding a real persona behind a semi-anonymous Twitter account is similar to the way jo-jo has no human entity. The inhibition that comes from respect for age, demeanor or attitude, or from fear, is removed, leaving the user free to readily send abuse to other users. This highlights the subjective issue of accountability regarding what we say online and how it may impact others who see such material, especially those of a younger age.

I had predicted that the audience's dialog towards the child would be mostly negative, because I believed the temptation would be too great to ignore. I also went out of my way to encourage the audience to feel at ease saying anything whatsoever to the child, using a few visual cues and psychological tricks. For example, placing his hand over his ear drew the audience to his left side, away from his face. This meant anything they said would be directly out of his gaze, perhaps a less intimidating position from which to say something negative (literally behind his back), as you don't have to stare at his face. This also aligned the viewer to see the screen output directly, meaning they would look at any images displayed. Additionally, what the audience said to the child was not shown or repeated at any time, to allow a degree of privacy between the child and the audience member. When I had previously tested the piece, I made the user input visible on the TV screen, and this significantly reduced the negativity directed towards the child. This could be because it added accountability to what was said, further supporting my idea that the piece exposes "trolling"-like tendencies.

I was most surprised by the amount of time people within the audience would give jo-jo. Often people would stay for many minutes to ask jo-jo a set of comprehensive questions. People would even come back to ask new questions they had thought of later. This was nice to see and only adds to the evidence that the message of the piece came through.


Images of jo-jo


Video of jo-jo


References