
Lipwatch: Enabling Silent Speech Recognition on Smartwatches using Acoustic Sensing

Published: 15 May 2024

Abstract

Silent Speech Interfaces (SSI) on mobile devices offer a privacy-friendly alternative to conventional voice input. Previous research has primarily targeted smartphones. In this paper, we introduce Lipwatch, a novel system that uses acoustic sensing to enable SSI on smartwatches. Lipwatch emits inaudible acoustic waves from the watch's speaker and analyzes their echoes to capture lip movements. In contrast to acoustic sensing-based SSI on smartphones, Lipwatch is designed around the specific scenarios and requirements of smartwatches. First, we design a wake-up-free mechanism that lets users interact without a wake-up phrase or button press: it combines the smartwatch's inertial sensors, which detect gestures, with acoustic signals that detect lip movements to determine whether SSI should be activated. Second, we design a flexible silent speech recognition mechanism that extends limited-vocabulary recognition to comprehend a broader range of user commands, even those not present in the training dataset, freeing users from strict adherence to predefined commands. We evaluate Lipwatch with 15 participants on a set of the 80 most common smartwatch interaction commands. The system achieves a Word Error Rate (WER) of 13.7% in a user-independent test. Even when users utter commands containing words absent from the training set, Lipwatch still achieves a remarkable 88.7% top-3 accuracy. We implement a real-time version of Lipwatch on a commercial smartwatch; a user study shows that Lipwatch is a practical and promising option for enabling SSI on smartwatches.
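For concreteness, the minimal Python sketch below illustrates the two measurable ideas in the abstract. The sensing half is an assumption-laden illustration of active acoustic sensing only: the 48 kHz sampling rate, the 18-21 kHz band, the linear chirp, and the correlation-based echo profile are placeholders chosen for this sketch, not the paper's actual signal design. The second half is the standard word-level Levenshtein computation of the Word Error Rate (WER) behind the reported 13.7%.

import numpy as np

# --- Active acoustic sensing sketch (illustrative; all parameters are
# --- assumptions, not taken from the paper) ---
FS = 48_000               # assumed speaker/microphone sampling rate (Hz)
F0, F1 = 18_000, 21_000   # assumed inaudible chirp band (Hz)
DUR = 0.01                # assumed chirp duration (s)

t = np.arange(int(FS * DUR)) / FS
# Linear chirp sweeping F0 -> F1, inaudible to most listeners.
tx_chirp = np.cos(2 * np.pi * (F0 * t + (F1 - F0) / (2 * DUR) * t ** 2))

def echo_profile(mic_frame: np.ndarray) -> np.ndarray:
    """Cross-correlate a received frame against the transmitted chirp.
    Peaks correspond to reflectors (e.g., the lips) at different delays;
    frame-to-frame changes in the profile capture lip movement."""
    return np.abs(np.correlate(mic_frame, tx_chirp, mode="valid"))

# --- Word Error Rate, the standard metric behind the reported 13.7% ---
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed as a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1] / max(len(ref), 1)

print(word_error_rate("open the weather app", "open whether app"))  # -> 0.5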


Cited By

  • (2024) Sensing to Hear through Memory. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 2, 1-31. https://doi.org/10.1145/3659598. Online publication date: 15 May 2024.

    Published In

    Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 8, Issue 2
    May 2024
    1330 pages
    EISSN: 2474-9567
    DOI: 10.1145/3665317

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 May 2024
    Published in IMWUT Volume 8, Issue 2

    Author Tags

    1. acoustic sensing
    2. silent speech interfaces
    3. smartwatch

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • National Natural Science Foundation of China

    Article Metrics

    • Downloads (Last 12 months): 287
    • Downloads (Last 6 weeks): 59
    Reflects downloads up to 30 Aug 2024

