DOI: 10.1145/3242587.3242599

Lip-Interact: Improving Mobile Device Interaction with Silent Speech Commands

Published: 11 October 2018

Abstract

We present Lip-Interact, an interaction technique that allows users to issue commands on their smartphone through silent speech. Lip-Interact repurposes the front camera to capture the user's mouth movements and recognizes the issued commands with an end-to-end deep learning model. Our system supports 44 commands covering both system-level functionality (launching apps, changing system settings, and handling pop-up windows) and application-level functionality (integrated operations for two apps). We verify the feasibility of Lip-Interact with three user experiments: evaluating recognition accuracy, comparing it with touch in terms of input efficiency, and comparing it with voiced commands with regard to personal privacy and social norms. We demonstrate that Lip-Interact helps users access functionality efficiently in one step, enables one-handed input when the other hand is occupied, and assists touch to make interactions more fluent.
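To make the pipeline in the abstract concrete, the following minimal sketch (not the authors' implementation; everything in it is an illustrative assumption) crops the mouth region from front-camera frames using dlib's standard 68-point face landmarks (points 48-67 outline the mouth) and classifies a short clip of crops with a toy 3D-convolutional network in PyTorch. The names MouthCommandNet, CLIP_LEN, and MOUTH_SIZE are hypothetical; only the 44-way output mirrors the paper's command set.

    # Illustrative sketch only; NOT the Lip-Interact implementation.
    # Assumes opencv-python, dlib, torch, and dlib's standard landmark
    # model file "shape_predictor_68_face_landmarks.dat" on disk.
    import cv2
    import dlib
    import torch
    import torch.nn as nn

    NUM_COMMANDS = 44        # mirrors the paper's 44-command vocabulary
    CLIP_LEN = 16            # hypothetical frames per command clip
    MOUTH_SIZE = (64, 64)    # hypothetical mouth-crop resolution

    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    def crop_mouth(frame):
        """Return a grayscale mouth crop (dlib landmarks 48-67), or None."""
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray)
        if not faces:
            return None
        pts = predictor(gray, faces[0])
        xs = [pts.part(i).x for i in range(48, 68)]
        ys = [pts.part(i).y for i in range(48, 68)]
        pad = 10  # small margin around the lips
        x0, y0 = max(min(xs) - pad, 0), max(min(ys) - pad, 0)
        crop = gray[y0:max(ys) + pad, x0:max(xs) + pad]
        return cv2.resize(crop, MOUTH_SIZE)

    class MouthCommandNet(nn.Module):
        """Toy end-to-end classifier: 3D convolutions over the frame
        stack, global average pooling, then a 44-way linear layer."""
        def __init__(self, num_commands=NUM_COMMANDS):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv3d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool3d(2),
                nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1),
            )
            self.classifier = nn.Linear(64, num_commands)

        def forward(self, clips):  # clips: (batch, 1, T, H, W)
            return self.classifier(self.features(clips).flatten(1))

    if __name__ == "__main__":
        # Gather one clip of mouth crops from the front camera, classify it.
        cap = cv2.VideoCapture(0)
        frames = []
        while len(frames) < CLIP_LEN:
            ok, frame = cap.read()
            if not ok:
                break
            mouth = crop_mouth(frame)
            if mouth is not None:
                frames.append(torch.from_numpy(mouth).float() / 255.0)
        cap.release()
        if len(frames) == CLIP_LEN:
            clip = torch.stack(frames)[None, None]  # (1, 1, T, H, W)
            model = MouthCommandNet()               # untrained toy model
            print("predicted command index:", model(clip).argmax(dim=1).item())

A deployable recognizer would be trained on labeled silent-command clips and would likely add temporal sequence modeling on top of the convolutional features; the sketch only illustrates the crop-then-classify structure the abstract describes.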

Supplementary Material

  • suppl.mov (ufp1075.mp4): supplemental video
  • suppl.mov (ufp1075p.mp4): supplemental video
  • MP4 File (p581-sun.mp4)




Published In

UIST '18: Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology
October 2018
1016 pages
ISBN:9781450359481
DOI:10.1145/3242587
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 October 2018


Author Tags

  1. lip interaction
  2. mobile interaction
  3. semantic gesture
  4. silent speech
  5. touch-free
  6. vision-based recognition

Qualifiers

  • Research-article

Funding Sources

  • National Key Research and Development Plan
  • Tsinghua University Research Funding
  • Natural Science Foundation of China

Conference

UIST '18

Acceptance Rates

UIST '18 Paper Acceptance Rate: 80 of 375 submissions, 21%
Overall Acceptance Rate: 842 of 3,967 submissions, 21%



Bibliometrics & Citations

Article Metrics

  • Downloads (Last 12 months)101
  • Downloads (Last 6 weeks)15
Reflects downloads up to 06 Oct 2024

Cited By

  • (2024) Design of Automatic Speech Recognition System Based on Throat Vibration. Modeling and Simulation 13(01), 365-376. DOI: 10.12677/MOS.2024.1310351. Online publication date: 2024.
  • (2024) Lipwatch: Enabling Silent Speech Recognition on Smartwatches Using Acoustic Sensing. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8(2), 1-29. DOI: 10.1145/3659614. Online publication date: 15-May-2024.
  • (2024) WhisperMask: A Noise Suppressive Mask-Type Microphone for Whisper Speech. Proceedings of the Augmented Humans International Conference 2024, 1-14. DOI: 10.1145/3652920.3652925. Online publication date: 4-Apr-2024.
  • (2024) MELDER: The Design and Evaluation of a Real-time Silent Speech Recognizer for Mobile Devices. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-23. DOI: 10.1145/3613904.3642348. Online publication date: 11-May-2024.
  • (2024) ReHEarSSE: Recognizing Hidden-in-the-Ear Silently Spelled Expressions. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-16. DOI: 10.1145/3613904.3642095. Online publication date: 11-May-2024.
  • (2024) Watch Your Mouth: Silent Speech Recognition with Depth Sensing. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-15. DOI: 10.1145/3613904.3642092. Online publication date: 11-May-2024.
  • (2024) KuchiNavi: Lip-Reading-Based Navigation App. Fifteenth International Conference on Graphics and Image Processing (ICGIP 2023), 47. DOI: 10.1117/12.3021118. Online publication date: 25-Mar-2024.
  • (2024) Deep Learning for Visual Speech Analysis: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(9), 6001-6022. DOI: 10.1109/TPAMI.2024.3376710. Online publication date: Sep-2024.
  • (2024) Collaborative Viseme Subword and End-to-End Modeling for Word-Level Lip Reading. IEEE Transactions on Multimedia 26, 9358-9371. DOI: 10.1109/TMM.2024.3390148. Online publication date: 17-Apr-2024.
  • (2024) EarSSR: Silent Speech Recognition via Earphones. IEEE Transactions on Mobile Computing 23(8), 8493-8507. DOI: 10.1109/TMC.2024.3356719. Online publication date: Aug-2024.
