DOI: 10.1145/3242587.3242599

Lip-Interact: Improving Mobile Device Interaction with Silent Speech Commands

Published: 11 October 2018

Abstract

We present Lip-Interact, an interaction technique that allows users to issue commands on their smartphone through silent speech. Lip-Interact repurposes the front camera to capture the user's mouth movements and recognizes the issued commands with an end-to-end deep learning model. Our system supports 44 commands covering both system-level functionality (launching apps, changing system settings, and handling pop-up windows) and application-level functionality (integrated operations for two apps). We verify the feasibility of Lip-Interact with three user experiments: evaluating recognition accuracy, comparing it with touch in terms of input efficiency, and comparing it with voiced commands with regard to personal privacy and social norms. We demonstrate that Lip-Interact helps users access functionality efficiently in one step, enables one-handed input when the other hand is occupied, and assists touch to make interactions more fluent.
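To make the pipeline in the abstract concrete, the following minimal sketch (not the authors' implementation; everything in it is an illustrative assumption) crops the mouth region from front-camera frames using dlib's standard 68-point face landmarks (points 48-67 outline the mouth) and classifies a short clip of crops with a toy 3D-convolutional network in PyTorch. The names MouthCommandNet, CLIP_LEN, and MOUTH_SIZE are hypothetical; only the 44-way output mirrors the paper's command set.

    # Illustrative sketch only; NOT the Lip-Interact implementation.
    # Assumes opencv-python, dlib, torch, and dlib's standard landmark
    # model file "shape_predictor_68_face_landmarks.dat" on disk.
    import cv2
    import dlib
    import torch
    import torch.nn as nn

    NUM_COMMANDS = 44        # mirrors the paper's 44-command vocabulary
    CLIP_LEN = 16            # hypothetical frames per command clip
    MOUTH_SIZE = (64, 64)    # hypothetical mouth-crop resolution

    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    def crop_mouth(frame):
        """Return a grayscale mouth crop (dlib landmarks 48-67), or None."""
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray)
        if not faces:
            return None
        pts = predictor(gray, faces[0])
        xs = [pts.part(i).x for i in range(48, 68)]
        ys = [pts.part(i).y for i in range(48, 68)]
        pad = 10  # small margin around the lips
        x0, y0 = max(min(xs) - pad, 0), max(min(ys) - pad, 0)
        crop = gray[y0:max(ys) + pad, x0:max(xs) + pad]
        return cv2.resize(crop, MOUTH_SIZE)

    class MouthCommandNet(nn.Module):
        """Toy end-to-end classifier: 3D convolutions over the frame
        stack, global average pooling, then a 44-way linear layer."""
        def __init__(self, num_commands=NUM_COMMANDS):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv3d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool3d(2),
                nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1),
            )
            self.classifier = nn.Linear(64, num_commands)

        def forward(self, clips):  # clips: (batch, 1, T, H, W)
            return self.classifier(self.features(clips).flatten(1))

    if __name__ == "__main__":
        # Gather one clip of mouth crops from the front camera, classify it.
        cap = cv2.VideoCapture(0)
        frames = []
        while len(frames) < CLIP_LEN:
            ok, frame = cap.read()
            if not ok:
                break
            mouth = crop_mouth(frame)
            if mouth is not None:
                frames.append(torch.from_numpy(mouth).float() / 255.0)
        cap.release()
        if len(frames) == CLIP_LEN:
            clip = torch.stack(frames)[None, None]  # (1, 1, T, H, W)
            model = MouthCommandNet()               # untrained toy model
            print("predicted command index:", model(clip).argmax(dim=1).item())

A deployable recognizer would be trained on labeled silent-command clips and would likely add temporal sequence modeling on top of the convolutional features; the sketch only illustrates the crop-then-classify structure the abstract describes.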

Supplementary Material

  • suppl.mov (ufp1075.mp4): supplemental video
  • suppl.mov (ufp1075p.mp4): supplemental video
  • MP4 File (p581-sun.mp4)




Published In

UIST '18: Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology
October 2018
1016 pages
ISBN:9781450359481
DOI:10.1145/3242587
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 October 2018


Author Tags

  1. lip interaction
  2. mobile interaction
  3. semantic gesture
  4. silent speech
  5. touch-free
  6. vision-based recognition

Qualifiers

  • Research-article

Funding Sources

  • National Key Research and Development Plan
  • Tsinghua University Research Funding
  • Natural Science Foundation of China

Conference

UIST '18

Acceptance Rates

UIST '18 Paper Acceptance Rate: 80 of 375 submissions, 21%
Overall Acceptance Rate: 842 of 3,967 submissions, 21%



Bibliometrics & Citations

Article Metrics

  • Downloads (Last 12 months)101
  • Downloads (Last 6 weeks)15
Reflects downloads up to 06 Oct 2024

Cited By

  • (2024) Design of Automatic Speech Recognition System Based on Throat Vibration. Modeling and Simulation 13(01), 365-376. DOI: 10.12677/MOS.2024.1310351. Online publication date: 2024.
  • (2024) Lipwatch: Enabling Silent Speech Recognition on Smartwatches Using Acoustic Sensing. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8(2), 1-29. DOI: 10.1145/3659614. Online publication date: 15-May-2024.
  • (2024) WhisperMask: A Noise Suppressive Mask-Type Microphone for Whisper Speech. Proceedings of the Augmented Humans International Conference 2024, 1-14. DOI: 10.1145/3652920.3652925. Online publication date: 4-Apr-2024.
  • (2024) MELDER: The Design and Evaluation of a Real-time Silent Speech Recognizer for Mobile Devices. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-23. DOI: 10.1145/3613904.3642348. Online publication date: 11-May-2024.
  • (2024) ReHEarSSE: Recognizing Hidden-in-the-Ear Silently Spelled Expressions. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-16. DOI: 10.1145/3613904.3642095. Online publication date: 11-May-2024.
  • (2024) Watch Your Mouth: Silent Speech Recognition with Depth Sensing. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-15. DOI: 10.1145/3613904.3642092. Online publication date: 11-May-2024.
  • (2024) KuchiNavi: Lip-Reading-Based Navigation App. Fifteenth International Conference on Graphics and Image Processing (ICGIP 2023), 47. DOI: 10.1117/12.3021118. Online publication date: 25-Mar-2024.
  • (2024) Deep Learning for Visual Speech Analysis: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(9), 6001-6022. DOI: 10.1109/TPAMI.2024.3376710. Online publication date: Sep-2024.
  • (2024) Collaborative Viseme Subword and End-to-End Modeling for Word-Level Lip Reading. IEEE Transactions on Multimedia 26, 9358-9371. DOI: 10.1109/TMM.2024.3390148. Online publication date: 17-Apr-2024.
  • (2024) EarSSR: Silent Speech Recognition via Earphones. IEEE Transactions on Mobile Computing 23(8), 8493-8507. DOI: 10.1109/TMC.2024.3356719. Online publication date: Aug-2024.
