DOI: 10.1145/3491102.3517687 · CHI Conference Proceedings · Research Article

Aware: Intuitive Device Activation Using Prosody for Natural Voice Interactions

Published: 29 April 2022
    Abstract

    Voice-interactive devices often use keyword spotting for device activation. However, this approach suffers from misrecognition of keywords and can respond to keywords not intended to call the device (e.g., "You can ask Alexa about it."), causing accidental activations. We propose a method that leverages prosodic features to differentiate calling from not-calling voices (F1 score: 0.869), allowing a device to respond only when it is actually being called and thus avoid misactivation. As a proof of concept, we built a prototype smart speaker called Aware that lets users control device activation by speaking the keyword with specific prosody patterns. These patterns were chosen to represent people's natural calling/not-calling voices, which we uncovered in a study that collected such voices and investigated their prosodic differences. A user study comparing Aware with Amazon Echo shows that Aware activates more correctly (F1 score: 0.93 vs. 0.56) and is easy to learn and use.
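
    The classification approach sketched in the abstract can be illustrated with a small, hypothetical example. The sketch below is not the authors' implementation: it assumes librosa and scikit-learn, a handful of illustrative WAV file names, and a simple hand-picked prosodic feature set (pitch statistics, loudness, duration) fed to a logistic-regression classifier, just to show the general shape of prosody-based calling/not-calling detection.

    # Illustrative sketch only; feature choices, file names, and classifier are assumptions.
    import numpy as np
    import librosa
    from sklearn.linear_model import LogisticRegression

    def prosodic_features(wav_path, sr=16000):
        """Return a small prosodic feature vector: F0 statistics, loudness, duration."""
        y, sr = librosa.load(wav_path, sr=sr)
        # Fundamental frequency (pitch) track via probabilistic YIN.
        f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                     fmax=librosa.note_to_hz("C7"), sr=sr)
        f0v = f0[voiced] if np.any(voiced) else np.array([0.0])
        rms = librosa.feature.rms(y=y)[0]                   # frame-level loudness
        duration = librosa.get_duration(y=y, sr=sr)
        return np.array([
            np.nanmean(f0v), np.nanstd(f0v),                # pitch level and variability
            np.nanmax(f0v) - np.nanmin(f0v),                # pitch range
            rms.mean(), rms.std(),                          # energy level and variability
            duration,                                       # utterance length
        ])

    # Hypothetical labeled recordings of the wake word: 1 = calling, 0 = not-calling.
    calling = ["calling_01.wav", "calling_02.wav"]
    not_calling = ["not_calling_01.wav", "not_calling_02.wav"]
    X = np.stack([prosodic_features(p) for p in calling + not_calling])
    labels = np.array([1] * len(calling) + [0] * len(not_calling))

    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    new_utt = prosodic_features("new_keyword_utterance.wav").reshape(1, -1)
    print("calling" if clf.predict(new_utt)[0] == 1 else "not-calling")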

    Supplementary Material

    • Supplemental Materials (3491102.3517687-supplemental-materials.zip)
    • Video Figure (3491102.3517687-video-figure.mp4)
    • Video Preview (3491102.3517687-video-preview.mp4)
    • Talk Video (3491102.3517687-talk-video.mp4)


    Cited By

    • (2023) A Multimodal Activation Detection Model for Wake-Free Robots. Web and Big Data: APWeb-WAIM 2022 International Workshops. https://doi.org/10.1007/978-981-99-1354-1_10, pp. 97–109. Online publication date: 30 March 2023.
    • (2023) "Garbage In, Garbage Out": Mitigating Human Biases in Data Entry by Means of Artificial Intelligence. Human-Computer Interaction – INTERACT 2023. https://doi.org/10.1007/978-3-031-42286-7_2, pp. 27–48. Online publication date: 28 August 2023.

      Published In

      CHI '22: Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems
      April 2022
      10459 pages
      ISBN: 9781450391573
      DOI: 10.1145/3491102
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 29 April 2022

      Author Tags

      1. Conversational Interface
      2. Device Activation
      3. Intention
      4. Keyword Spotting
      5. Prosody
      6. Voice Interaction

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      CHI '22: CHI Conference on Human Factors in Computing Systems
      April 29 – May 5, 2022
      New Orleans, LA, USA

      Acceptance Rates

      Overall Acceptance Rate 6,199 of 26,314 submissions, 24%

      Article Metrics

      • Downloads (last 12 months): 104
      • Downloads (last 6 weeks): 10
      Reflects downloads up to 29 Jul 2024
