
Robots with Bad Accents
Living with Synthetic Speech

Submitted December 2006, Revised March 2007
Vol. 41, No. 3, Leonardo, MIT Press, Cambridge, Massachusetts

Marc Böhlen, Artist, Engineer
MediaRobotics Lab, Department of Media Study, University at Buffalo
AILAB, University of Zurich
marcbohlen@acm.org

ABSTRACT

Synthetic speech technologies have a profound impact on how we think about and interact with computers. This text discusses parts I and II of the 'Make Language Project', a trilogy on the cultural fallout of machine-generated speech, as a conduit for reconsidering prejudices in synthetic speech production and humanoid robot design.

Normalized Android Culture

Born from the industrial revolution's promise of a life of plenty and leisure, robotics is firmly committed to the positive, utopian interpretation of technology as first formulated by early thinkers such as Moore, Spencer and Saint-Simon, and reinterpreted in terms of computer technology in the 20th century by the cybernetics community [1]. Not everyone shared this view. The sociologist Sorokin [2], for example, imagined that human advancement through technology would end in disaster. The odd contradictory mix of awe, angst and admiration with which high-end robots are perceived today is proof of the continued vigor of these polarized viewpoints; the intellectual landscape seems firmly settled, with engineers and scientists on the positivist side and humanities scholars and artists mostly on the pessimistic side, with some interesting scholars suggesting a compromise, as it were, by claiming that the future of technology will end in utter uselessness [3].

From Turing to Kurzweil and beyond into popular culture [4], the capacity to recall more and calculate faster has been directly associated with super-human intelligence. Because the elusive goal of superior intelligence is not practically achievable, research agendas have concentrated on matching human intelligence and behavior in select domains. Not surprisingly, even this less lofty goal is far from trivial. Computational vision, for example, is still struggling to achieve synthetic visual perception and processing on par with that of humans. Likewise, the field of humanoid robotics does not currently attempt to make machines superior to humans; rather, it has moved its focus to devices and processes that mimic humans. Interestingly, this notion of similarity or equality is defined in very specific ways and along strong disciplinary assumptions and rhetorical goals. For example, as Nourbakhsh and others have observed, most robots are designed as pets or servants [5], and they are all benevolent and polite [6]. Furthermore, humanoid and android robot designers tend to recreate physical perfection in their products. Ishiguro [7], for example, used an attractive young television moderator as a model for his most advanced and uncomfortably realistic android. Despite the immediate appeal, beauty, benevolence and politeness are problematic machine design guidelines. They normalize android culture and create a sympathetic base for robots that the machines do not necessarily deserve. By normalizing android culture, one loses opportunities for interaction forms that are uncomfortable and problematic but, potentially, rich and complex.
Normalized android culture leaves us with the promise of a friendly utopia that might well remain unfulfilled; it promises a future that is only superficially friendly, and leaves us unprepared to deal with the conflicts that will likely arise with sentient machines in the future.

Confrontational Interaction

Normalizing machines to behave as humans do in select social contexts limits the scope of research in robot design. It also creates a fragile and shallow basis for the kind of deep exchange between robots and people that the social robotics agenda claims to address. But if deep and long-term exchanges between synthetic systems and real people are to be achieved, a wider basis for possible forms of exchange and ways of sharing between machines and people is of the essence. Synthetic systems as complex, confused and contradictory as we humans are will make, over time, for better partners than polite drones. Ultimately, the goal is diversity in robot design, a diversity defined not only in technical terms but also by varied ideas about what machines could be and what we can share with them. In this context I offer some preliminary observations from parts I and II of the 'Make Language' trilogy [8], in which synthetic accented speech and machinic foul language infringe on comfort zones of interaction with computational devices.

Text To Speech Synthesis

Humans are uniquely specialized in the production of speech; only homo sapiens can use tongue, cheeks, lips and teeth to produce 14 phonemes per second. Even children show a remarkable aptitude for recognizing sounds as speech. Speech makes us unique creatures. Language is understood in the research community [9], as well as in folk knowledge, as central to being human. Because language is so central to being human, language processing has become synonymous with synthetic intelligence [10]. Understand how humans process verbal input, so the logic goes, and you will be able to build intelligent machines. For this reason a short overview of important concepts in synthetic speech is appropriate.

Synthetic speech research is often divided into two categories: Text to Speech and Automated Speech Recognition [11]. Text to Speech (TTS) entails the creation of a sound pattern (voice) from a textual input (words). Automatic Speech Recognition (ASR), the inverse of TTS, entails the mapping of arbitrary voice input to printable text. While the field of ASR and its well known and often despised dictation systems have had little real-world success, TTS has made leaps and bounds in research as well as in practice. TTS combines signal-processing-based acoustic representations of speech with linguistics-based analysis of text to create machinic utterances that sound like human voices.

TTS systems are typically comprised of multiple components. A text analysis component defines and disambiguates the raw input. It finds sentence and paragraph breaks. It is also responsible for text normalization (mapping abbreviations and acronyms to full words). The output from the text analysis module is passed on to the phonetic analysis module. This module performs, amongst other things, the all-important grapheme-to-phoneme (letter-to-sound) conversion. The output of this module, in turn, is passed on to the prosodic analysis module, which is charged with setting pitch, duration and amplitude targets for each phoneme. Finally, this output passes to the speech synthesis module, where the constructed string of symbols is rendered to an audible output reminiscent of a voice.
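To make the module chain concrete, here is a minimal sketch in Python. It follows the four stages just described; the abbreviation table, lexicon and prosody targets are toy placeholders of my own, not the internals of any particular engine.

```python
# Illustrative sketch of the modular TTS pipeline described above.
# All tables and target values are toy placeholders (assumptions).

ABBREVIATIONS = {"Dr.": "doctor", "St.": "street"}   # text normalization table
LEXICON = {"doctor": ["D", "AA1", "K", "T", "ER0"]}  # grapheme-to-phoneme lookup

def text_analysis(raw: str) -> list[str]:
    """Find word boundaries and expand abbreviations (text normalization)."""
    return [ABBREVIATIONS.get(tok, tok.lower().strip(".,")) for tok in raw.split()]

def phonetic_analysis(words: list[str]) -> list[str]:
    """Grapheme-to-phoneme conversion: lexicon lookup, crude letter fallback."""
    phonemes: list[str] = []
    for w in words:
        phonemes.extend(LEXICON.get(w, list(w.upper())))
    return phonemes

def prosodic_analysis(phonemes: list[str]) -> list[dict]:
    """Attach pitch, duration and amplitude targets to each phoneme."""
    return [{"phoneme": p, "pitch_hz": 120.0, "duration_ms": 80, "amplitude": 0.8}
            for p in phonemes]

def speech_synthesis(targets: list[dict]) -> bytes:
    """Render the symbol string to audio; a real synthesis backend goes here."""
    return b""  # placeholder waveform

audio = speech_synthesis(prosodic_analysis(phonetic_analysis(text_analysis("Dr. Smith arrived."))))
```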
TTS designers have experimented with various synthesis approaches for this last module. The most widely used approaches today are concatenative synthesis and formant synthesis [11]. Here the concatenative approach is of particular interest. As opposed to the rule-based formant method, concatenative synthesis is data-centric. To construct an utterance, a concatenative TTS system divides the input into segments, looks for corresponding entries in a large database of recordings from a real human speaker (the 'voice talent' in the speech synthesis industry), and then concatenates (adds serially) the individual parts to form the final output. This allows even sound sequences that have not been recorded per se to be rendered. The search, mapping and filtering steps included in concatenative systems are elaborate and deliver realistic machinic speech, particularly when perceived over low-bandwidth media such as the telephone. Advanced concatenative systems include techniques of unit selection synthesis that automate the laborious task of (manually) finding correspondences, loosely speaking, between graphemes and phonemes. Unit selection synthesis is, in turn, heavily dependent on automated classification, most commonly implemented in the form of specifically designed neural networks, the details of which are beyond the scope of this short overview.
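The lookup-and-join step at the heart of the concatenative approach can be sketched in a few lines. The diphone inventory and the crossfade join below are hypothetical stand-ins; real systems search large tagged databases with elaborate join-cost functions.

```python
# Toy sketch of concatenative synthesis: segment, look up, join serially.
import numpy as np

# Hypothetical unit database: phoneme pair (diphone) -> recorded waveform.
# In practice this holds thousands of tagged samples from a voice talent.
UNIT_DB: dict[tuple[str, str], np.ndarray] = {
    ("HH", "AH"): np.zeros(800), ("AH", "L"): np.zeros(700),
    ("L", "OW"): np.zeros(900),
}

def crossfade(a: np.ndarray, b: np.ndarray, overlap: int = 80) -> np.ndarray:
    """Join two units with a short linear crossfade to smooth the seam."""
    ramp = np.linspace(0.0, 1.0, overlap)
    seam = a[-overlap:] * (1 - ramp) + b[:overlap] * ramp
    return np.concatenate([a[:-overlap], seam, b[overlap:]])

def synthesize(phonemes: list[str]) -> np.ndarray:
    """Segment the input into diphones and concatenate the recorded parts."""
    out = np.zeros(0)
    for pair in zip(phonemes, phonemes[1:]):
        unit = UNIT_DB.get(pair, np.zeros(400))  # back off for unseen pairs
        out = unit if out.size == 0 else crossfade(out, unit)
    return out

wave = synthesize(["HH", "AH", "L", "OW"])  # roughly, "hello"
```

Note how unseen diphone pairs still yield output via the backoff branch; this is the sense in which sequences never recorded per se can nonetheless be rendered.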
Return of the Spoken Word

Join these technical advancements with the universally acknowledged significance of language and it becomes clear that TTS is of prime interest as a cultural phenomenon. Nothing less than a resurgence of oral traditions and a reassessment of the act of speaking can be expected in the wake of these new voice-centric systems. From the telegraph through punch cards to the keyboard and gaming console, computers have demanded that people meet them through clumsy haptic interfaces. TTS and ASR will spell, literally, the end of the era of manually entered text input for machines. Furthermore, TTS and ASR redefine the quest for 'naturalness' in machines in ways other computer technologies do not. The consequences of this are far-reaching, and this paper will only touch on some of them. But this much will be claimed: speech technologies will allow for and require new definitions of our comfort zones with machines, and with this they will create new hard problems in robot design (hard both in the computer science sense of intractable and in the cultural sense of multi-layered).

The issue is not only academic. Many people share a feeling of discomfort when a gentle machine voice repeatedly cautions them to watch their step as they exit the escalator, or feel anger when the menu of options offered by the kind robot voice of a service department does not in the least match their particular predicament. And even when robots do get it right, their tone of voice is often off. Statements have, in many languages, a simpler prosodic signature than questions, where prosody patterns vary widely as a function of the semantics of the question itself. This makes questions much more challenging to represent computationally than statements. Consequently, our intelligent speaking machines are better suited to issuing commands than to asking questions. But there is no turning back, and synthetic speech will require us to think again about our own ways of expressing ourselves and about how and when synthetic systems should mimic humans. The fact that machines can sound like humans does not mean machines should use language in the same way people do.

Beyond the flavor of utterances, synthetic speech begs the question of what machines could be saying to each other and what they should be saying to us. Cast as kind and patient, they have the capacity to say what we need to know, but also to insistently repeat what we have already heard or do not want to be confronted with. By default linked to databases and information systems, these übercorrect agents without a mother tongue have been delegated to the roles of clerks, instructors and supervisors. But they seem primed for more.

Accents and Immigrants

Speech acts not only reveal the intention of a speaker but also offer information on his or her origin. Many people who are born and raised in one culture and live in a different one retain audible remnants of their past in their pronunciation patterns. Accented speech is particular speech because of the way its flavoring complicates the transmission of a message [12]. Prejudices alter the seemingly neutral transmission of content. Depending on the language competencies (and sympathy) of the listener, an accent can render an utterance charming, unintelligible or even aggressive.

Digital signal processing can create arbitrary signals, most of which have no relationship to those human beings are capable of perceiving. With elaborate models of the human vocal tract's geometry, all sounds that humans can generate can be created artificially. Despite the universality of the technical infrastructure, TTS systems are usually designed for very specific applications along national fault lines, with localized voice fonts and linguistically identifiable entities. Commercial vendors of TTS systems usually name these voice fonts according to their national linguistic origin (but not after the individual voice talent who delivered the audio samples). For example, there are Sarahs for US English, Heathers for UK English and Günthers for German. The deliberate naming of these synthetic voices helps to enhance their believability; it conveys the comfortable feeling of a living person behind the digital audio utterance and assists in branding the product. The set of synthetic voices on the market represents a cleansed and controlled subset of popular human languages. It comes as no surprise that commercial TTS systems do not offer speech products with 'undesirable' features such as slurred speech or, say, a strong German accent.

Synthetic Heavy Accents: Make Language, Part I

The perfect tone of the machine belies the fact that no human being really speaks without an accent, slight as it may be. We all come from somewhere, and that somewhere flavors our lives and our voices. Only the machine can speak in a perfect tone that is location- and history-free. To date, TTS and ASR researchers [13] have been interested in accented speech mostly for the difficulties it presents to intelligibility, i.e. the point at which an accent in a given language becomes a measurable hindrance to conveying a message, recently an important issue for telecommunication companies and call center operators. In order to better understand the cultural fallout of synthetic speech, I am experimenting with synthetic heavy accents. In this context, I have crafted a German-accented US English and a Mexican Spanish-accented US English system with limited vocabulary [8], based on the SVOX speech engine [13, 14].

Several different methods allow one to craft accented speech. They range from combining a voice from one language with a linguistic model in a second language to recording an entire database from an accented speaker [15]. Even the simplest method of piping text from language A into a speech synthesizer constructed with a phoneme and grammar set of language B delivers useful results for some utterances. In order to generalize this language mixing, however, more elaborate methods are required. They include approaches gleaned from attempts to improve mis-transcriptions of ASR systems used to label grapheme-to-phoneme mappings [16], elaborate mixing of phoneme sets from two base languages, mixing of grammar requirements, modification of frequency, volume and word transition delays, and various rules for exceptions. Both the German-English and the Spanish-English accents I have constructed combine all of these procedures. The success of this process depends on the proximity of the languages in question. Indo-Germanic languages, for example, mix amongst each other with greater ease than with Slavic languages, due to the similarity of the base phones used for their articulation. The approach used here is brittle, however, and cannot handle general-purpose text or emotionally charged speech acts [17]. Also, the often awkward grammatical mappings that foreign speakers construct when they speak in a second language are not easily included in a formal grammar. The most general results would likely come from building, with no link to any base language, a completely new language model based on a particular accented speech, effectively treating it as a full-fledged language. This would then allow one to construct any kind of accented utterance, including those a human speaker would never consider making.
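The simplest of the mixing methods above, substituting phonemes of one language for another, can be sketched as follows. The rule table is a deliberate caricature of a German accent for illustration; it is not the rule set of the SVOX-based systems described here.

```python
# Hedged sketch of phoneme-level accent mixing: push English phoneme
# strings through substitution rules borrowed from a second language.
# These rules are illustrative assumptions, not a real accent model.

GERMAN_ACCENT_RULES = {
    "W": "V",    # English /w/ realized as [v], a hallmark of German-accented English
    "TH": "S",   # voiceless dental fricative collapses to [s]
    "DH": "Z",   # voiced dental fricative collapses to [z]
}
FINAL_DEVOICING = {"B": "P", "D": "T", "G": "K", "V": "F", "Z": "S"}

def accent(phonemes: list[str]) -> list[str]:
    """Apply segment substitutions, then German-style word-final devoicing."""
    out = [GERMAN_ACCENT_RULES.get(p, p) for p in phonemes]
    if out and out[-1] in FINAL_DEVOICING:
        out[-1] = FINAL_DEVOICING[out[-1]]
    return out

# "with": W IH DH -> V IH Z -> final devoicing -> V IH S
print(accent(["W", "IH", "DH"]))  # ['V', 'IH', 'S']
```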
Synthetic Hissy Fits: Make Language, Part II

Can we learn something from mixing languages for synthetically accented speech that will help us imagine how we might mix human and synthetic beings? Imperfection, in all its rich variation, might be a good learning ground for this. And from this we might better imagine what synthetic beings should actually have to say, once they are free from announcing flight schedules and processing concert ticket purchases. Consequently, we might consider investing more energy into telling machines about ourselves. Maybe our own language acquisition experiences can be ported to machines. When humans learn a new language they usually acquire an odd mix of bare essentials and examples of foul language. This is interesting, as foul language circumvents the unknown new language and connects the speaker directly back to known territories [18]. While culturally specific in the boundary conditions that control its use, foul language links us more directly to our bodies than other forms of speech.

Fig. 1 Amy and Klara, 2006

Constructing a Potential for Synthetic Hissy Fits

In order to experiment with foul language in synthetic systems, I have built a set of nasty robotic agents housed in cute pink boxes, named Amy and Klara. Their ontologies are formed by, and limited to, reading and analyzing the on-line trivia of life-style magazines such as Salon dot com. Creating ontologies that reflect common knowledge is an ongoing AI research agenda. Here, there is no claim to universality or completeness. This is a borrowed epistemology. The robots scan the website for news and cluster the results according to topics. This list becomes part of Amy and Klara's world. While the topics are meaningful to humans, they are but words to the robots.
However, these words do acquire, over time, significance by virtue of being repeated. Whichever topic is mentioned repeatedly receives more computational weight than topics found only once. Items that reach a critical threshold of numerically constructed significance become material for discussion. Amy and Klara share their statistically weighted text summaries with each other via TTS and ASR. Since both robots scan the same website, each robot expects the other to say what it already knows, and when this does not occur, dissent arises and they begin to call each other names.

Each robot's ASR has a vocabulary of foul language, divided into increasingly aggressive scales of curse words. The robots were trained specifically to respond to this kind of language, which is normally filtered out of ASR products. Additionally, the results from the speech recognizer, as well as the physical transmission of utterances from speaker to microphone, are error-prone; miscommunication is unavoidable, and with this, arguments are practically guaranteed. The fact that Klara has a thick German accent only increases the likelihood of misunderstanding. Making the recognizer less selective creates more instances of false positive results (falsely recognizing a valid possible result), in which case the robots believe, as it were, that they are offended by almost all the words they perceive. There is no direct call to begin using foul language in the robots' programs. The arguments themselves are an emergent property of the configuration described above. Once one of the robots does emit a curse word, the other responds in kind if it recognizes the word. Each robot is equipped with a noise-reducing microphone array and a small but dynamic equalizer-equipped speaker-amplifier system. The neural nets responsible for matching the perceived input to the foul-language-augmented linguistic corpus were trained with this hardware in place, under low ambient noise conditions. Repeated use of a curse word from scale n leads to the selection of a curse word from scale n+1, provided the following word is recognized as a curse word within a given time frame (otherwise the aggression levels recede). Since recognition and utterance occur in quick succession, both a low-level exchange (when recognition results are poor) and a heated, escalating fight (when recognition results are positive) are possible.

Furthermore, both robots are equipped with video cameras and are able to see each other. Like the speech system, the vision system is geared for conflict. Amy and Klara both have a programmatic predisposition to be annoyed by the color pink, not knowing that they are pink themselves, as each robot's own camera can see only the white inside of its box. An adaptive histogram based hue detection algorithm [19] allows each robot to detect the other box's pink even under varying lighting conditions. The often idolized property of emergence is not limited to noble causes alone. Of course, the boundary conditions can be set to prevent arguments, in which case the robots sit quietly next to each other. It is more interesting, however, to set the conditions such that most episodes end in an exchange of expletives that leaves people watching the two robots really wondering about machine intelligence. A video documenting such an exchange of foul language is available on the web [8].

Fig. 2 Constructing the potential for hissy fits in Amy and Klara
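The escalation mechanics just described can be approximated in a few lines. The scale contents, the response window and the recede-by-one rule are assumptions on my part; the actual thresholds and vocabulary of Amy and Klara are not published here.

```python
# Sketch of the scale n -> n+1 escalation described above, under assumptions:
# a recognized curse within the response window escalates; otherwise levels recede.
import random

CURSE_SCALES = [["drat"], ["blast", "darn it"], ["<stronger words here>"]]
RESPONSE_WINDOW_S = 6.0  # how long an insult 'counts' (assumed value)

class HissyFit:
    def __init__(self) -> None:
        self.level = 0           # current aggression scale
        self.last_insult = 0.0   # timestamp of the last recognized curse

    def hear(self, word: str, now: float) -> str | None:
        """On a recognized curse, escalate (or recede) and reply in kind."""
        if not any(word in scale for scale in CURSE_SCALES):
            return None  # not recognized as foul language: no reply
        if now - self.last_insult <= RESPONSE_WINDOW_S:
            self.level = min(self.level + 1, len(CURSE_SCALES) - 1)  # n -> n+1
        else:
            self.level = max(self.level - 1, 0)  # aggression recedes
        self.last_insult = now
        return random.choice(CURSE_SCALES[self.level])
```

Because ASR errors decide whether a heard word counts as a recognized curse at all, the same loop yields both the low-level exchanges and the heated, escalating fights mentioned above.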
Learning from Amy and Klara

Amy and Klara and their provocative synthetic hissy fits warn us not to expect too much from intelligent machines. They counter the rhetoric of the gentle intelligent machine with a critique of normative uses of synthetic speech and with linguistic imperfection. Can we learn something about mixing robots and people by mixing languages in machines? If foul language is out of bounds for machines, then what about other taboos? Will we map all our taboos onto robots once they look, sound or smell like we do? Many linguistic taboos are derived from taboos in religion, sex and mental and physical ailments directly related to the physical constraints of being human. Since machines lack our bodily functions, the corresponding taboos really need not hold. We should expect some of our own taboos to be invalidated by machines. We should not be surprised if machines invent new curse words particular to the experience of being machine and having speech. Languages that are human in origin will be altered and amended by their use in machines, in similar ways as popular culture alters and adds to the corpus of the English language. There will be new figures of speech. Languages no longer in use by humans might be kept artificially alive in machines. And people lacking or having lost the capacity of speech might regain the skill in exchanges with machines more patient than we are. Wolfgang von Kempelen, one of the first experimental researchers in synthetic speech, originally imagined his talking machine being used for therapeutic purposes [20].

Fallout from advanced information processing technologies will make us continuously question our preconceptions of intelligence and challenge us to re-evaluate the ways in which we engage with machines. There is good reason to believe that human intelligence as we know it is dependent on the human body, and that language requires just such a body. Some computational linguistics researchers have attempted to prove this link in simulations of language evolution in embodied robots [21], and have delivered astonishing results, provided one accepts initial conditions that allow for shared semantics [22] ab initio. However, if this is not given, the model fails. And so it might be helpful to reconsider the unconscious decisions we make based on an anthropocentric view of synthetic language.

Synthetic speech is a good example of the kind of dilemmas that perfect mimesis can generate. If nothing else, the experiments and observations described here might invite us to consider intelligence and cognition not particular to being human as worthy of attention. Computational devices have the capacity to 'be' in ways humans can not. The interior workings of machines and computational devices are profoundly different from our own in material, construction, time scales and biological constraints. Being machine is not being human; rather, it is a kind of foreign being that has no relationship to our own ways unless we force it to behave as such. What is lost in the mimetic approach to robot design is the opportunity to engage the otherness, as it were, of the machine. In this regard, engineers might be advised to look into the history of pictorial representation and its struggles over millennia with mimesis.
From Zeuxis to Giotto, Leonardo and the Paragone of the Arts [23], through Caravaggio to the demise of the traditional art academy at the end of the 19th century, realism and mimesis have been potent and problematic principles of representation in Western civilization. Only in the wake of war and cataclysmic social upheaval did the arts find, in abstraction and constructivism, new pathways of non-mimetic expression.

Synthetic speech research is a complex endeavor that demands rigorous attention to detail. Unfortunately, it is also another victim of the division of labor, as it were, that has established itself between the engineering sciences and the humanities and arts. This would be just another instance of a well known and often lamented disciplinary specialization if we did not have to repeatedly listen to the consequences on telephones and hear them in automobile navigation systems. How different might voice-enabled machines sound and behave if they were informed by Wittgenstein's insight that the meaning of words arises only from their use [24], or by Rose's elaborately choreographed word games, which begin with talk reminiscent of an academic presentation gone bad and end in a cacophony of utterances that sound like 'real' words but are nothing but babble [25]. Imagine if they knew about Blonk's powerful vocal tract, tongue and cheek skills, which create sounds so odd they seem un-human and at times machinic [26], or about the novelist Albahari, who surmised in recent work [27] the minimum number of words one actually needs to function. He counts five, provided one refrains from asking questions.

Acknowledgments

This work is supported in part by a grant from the Humanities Institute at the University at Buffalo. Thanks to SVOX for making their text to speech engine available for strange experiments. Thanks also to FONIX SPEECH for their assistance in making their automated speech recognition engine handle Amy and Klara's hissy fits with relative ease.

Bibliography

[1] Pickering, A., "Cybernetics and the Mangle: Ashby, Beer and Pask", Social Studies of Science, Vol. 32, No. 3 (2002), pp. 413-437.
[2] Sorokin, P., "Social and Cultural Dynamics: A Study of Change in Major Systems of Art, Truth, Ethics, Law and Social Relationships", 1957/1970, Boston: Extending Horizons Books, Porter Sargent Publishers.
[3] CAE, "The Technology of Uselessness", 1994, in: CTHEORY, http://www.ctheory.net/articles.aspx?id=59
[4] Kurzweil, R., "The Age of Intelligent Machines", MIT Press, 1992.
[5] Fong, T., Nourbakhsh, I., and Dautenhahn, K., "A Survey of Socially Interactive Robots", Tech. Rep. CMU-RI-TR-02-29, Robotics Institute, CMU, 2002.
[6] Simmons, R., Nakauchi, Y., "A Social Robot that Stands in Line", IROS, 2000.
[7] Ishiguro, H., "Android Science", Cognitive Science Society, 2005.
[8] Böhlen, M., The Make Language Project, www.realtechsupport.org/new_works/ml.html
[9] Nass, C., Gong, L., "Speech Interfaces from an Evolutionary Perspective", Communications of the ACM, Vol. 43, No. 9 (2000), pp. 36-43.
[10] Christiansen, M. and Kirby, S., "Language Evolution: The Hardest Problem in Science?", in: Language Evolution: The States of the Art, Oxford University Press, 2003.
[11] Schroeter, J., "Text to Speech Synthesis", in: Electrical Engineering Handbook, 3rd edition, chapter 16, pp. 1-13, AT&T Laboratories, 2005.
[12] Ikeno, A., Pellom, B., Cer, D., Thornton, A., Brenier, J., Jurafsky, D., Ward, W., Byrne, W., "Issues in Recognition of Spanish-Accented Spontaneous English", in: ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, Tokyo, Japan, April 2003.
[13] Pfister, B., Romsdorfer, H., "Mixed-Lingual Text Analysis for Polyglot TTS Synthesis", Eurospeech 2003.
[14] Pfister, B., "Skript zur Vorlesung Sprachverarbeitung I+II" [Lecture notes, Speech Processing I+II], Department of Electrical Engineering, ETH Zürich, 2005.
[15] Tomokiyo, L., Black, A., Lenzo, K., "Foreign Accents in Synthetic Speech: Development and Evaluation", Interspeech 2005.
[16] Kim, Y., Syrdal, A., Conkie, A., "Pronunciation Lexicon Adaptation for TTS Voice Building", AT&T Labs, in: Interspeech 2004 - ICSLP, Korea, 2004.
[17] Schröder, M., "Emotional Speech Synthesis: A Review", in: Proceedings of Eurospeech 2001, Vol. 1, pp. 561-564, Aalborg, Denmark, 2001.
[18] Hughes, G., "Swearing: A Social History of Foul Language, Oaths and Profanity in English", London: Penguin Books, 1998.
[19] Bradski, G.R., "Computer Vision Face Tracking for Use in a Perceptual User Interface", Intel Technology Journal 2(2), pp. 1-15, 1998.
[20] von Kempelen, W., "Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine" [The mechanism of human speech, with a description of a speaking machine], 1791, and http://www.ling.su.se/staff/hartmut/kemplne.htm
[21] Steels, L., Kaplan, F., "Bootstrapping Grounded Word Semantics", in: Briscoe, T. (ed.), Linguistic Evolution through Language Acquisition: Formal and Computational Models, Cambridge University Press, 1999.
[22] Bickerton, D., "Language Evolution: A Brief Guide for Linguists", Lingua 117 (2007), pp. 510-526.
[23] Steinitz, K., "Leonardo da Vinci's Trattato della Pittura: Treatise on Painting. A Bibliography of the Printed Editions 1651-1956", Library Research Monographs, Vol. 5, Copenhagen: Munksgaard, 1958.
[24] Wittgenstein, L., "Philosophische Untersuchungen" [Philosophical Investigations], (1953), Suhrkamp, 2003.
[25] Rose, P., "Pressures of the Text", video, 17 minutes, 1983.
[26] Blonk, J., "Vocalor", in: "Labior - Phonetic Etude #3", released on Staalplaat, 1998.
[27] Albahari, D., "Fünf Wörter" [Five Words], (2004), Eichborn Verlag, 2005.

Glossary

synthetic speech: Speech created or processed by a computer.

TTS: Text to Speech. The process of converting written words into audible speech.

ASR: Automated Speech Recognition. The process of mapping received audio input to text.

phoneme: The smallest contrastive unit in the sound system of a language, independent of position in word or phrase.

grapheme: The equivalent of the phoneme in writing systems.

prosody: Speech-related information (intonation, pitch, duration, gestures) that is not contained in the text itself. Prosody is often considered a parallel communication channel containing information that supplements or contrasts with that of the primary channel.

concatenative synthesis: Synthesis based on the concatenation (or stringing together) of segments of recorded speech. This method generally produces the most natural-sounding synthesized speech but requires an often extensive database of recorded and tagged speech samples.

formant synthesis: Formant synthesis does not use human speech samples at runtime. Instead, the speech output is created using an acoustic model in which parameters such as fundamental frequency (f0) and voicing are varied over time to create a waveform of artificial speech. Formant synthesizers usually produce more 'robotic' sounding utterances but typically have a smaller footprint than concatenative systems, as they lack a database of speech samples.

adaptive histogram hue detection: The process of dividing the complete color spectrum in a given image into a series of occurrence-defined bins, and using the results to set the boundary conditions for the histogram in a subsequent image, such that a desired bin/color can be tracked over time.
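Reference [19] above is Bradski's CAMShift, of which OpenCV ships an implementation. As a hedged illustration of the glossary entry (not of the robots' actual code), the following sketch tracks a hue histogram across frames; the camera index and initial window over the pink box are assumed values.

```python
# Minimal hue-tracking sketch via histogram back-projection and CAMShift.
# Camera index and initial region are assumptions for illustration only.
import cv2

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
if not ok:
    raise RuntimeError("no camera frame")

x, y, w, h = 200, 150, 80, 80  # assumed initial window over the pink target
roi = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
# Histogram over the hue channel only: the 'occurrence-defined bins'
hist = cv2.calcHist([roi], [0], None, [180], [0, 180])
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

window = (x, y, w, h)
crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
while ok:
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    back = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    # CAMShift adapts the search window each frame, tolerating lighting drift
    _, window = cv2.CamShift(back, window, crit)
    ok, frame = cap.read()
```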
Biographical Information

Marc Böhlen is Associate Professor in the Department of Media Study at the University at Buffalo and Visiting Professor at the AILAB of the University of Zürich. He has been contributing to diversity in machine culture since 1996.

Figure Captions

Figure 1: Amy and Klara's computers are synchronized (hence the large pink clock). This allows them to pop out of their respective boxes simultaneously. Each box is 10 x 10 x 10 inches. The robots are made of aluminum and plastics. Each robot is equipped with a microphone, a speaker and a camera.

Figure 2: Schematic of the software architecture that allows hissy fits between Amy and Klara to occur.