
Low-resource Multi-task Audio Sensing for Mobile and Embedded Devices via Shared Deep Neural Network Representations

Published: 11 September 2017


Continuous audio analysis from embedded and mobile devices is an increasingly important application domain. Appliances like the Amazon Echo, along with smartphones, watches, and even research prototypes, increasingly seek to perform multiple discriminative tasks simultaneously from ambient audio: for example, monitoring background sound classes (e.g., music or conversation), recognizing certain keywords (‘Hey Siri’ or ‘Alexa’), or identifying the user and her emotion from speech. Deep learning algorithms typically provide state-of-the-art model performance for such general audio tasks. However, the large computational demands of deep learning models are at odds with the limited processing, energy, and memory resources of mobile, embedded, and IoT devices.
In this paper, we propose and evaluate a novel deep learning modeling and optimization framework that specifically targets this category of embedded audio sensing tasks. Although the supported tasks are simpler than speech recognition, the framework aims to maintain prediction accuracy while minimizing the overall processor resource footprint. The proposed model is grounded in multi-task learning principles: it trains deep layers that are shared across tasks and, to lower computation further, uses only statistical summaries of audio filter banks as its input layer.
We find that, for embedded audio sensing tasks, our framework maintains accuracy similar to that of comparable deep architectures that use single-task learning and typically more complex input layers. Most importantly, on average, this approach provides almost a 2.1× reduction in runtime, energy, and memory for four separate audio sensing tasks, assuming a variety of task combinations.
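As a rough illustration of the idea described above, the following sketch builds a small network whose hidden layers are shared by several tasks, fed by per-band summary statistics of a window of filter-bank frames. It is a minimal sketch under stated assumptions, not the authors' implementation: the choice of statistics (mean, standard deviation, minimum, maximum), the layer sizes, the task names, the random untrained weights, and all function names are hypothetical and do not come from the paper.

```python
import math
import random
import statistics


def summarize(frames):
    """Collapse a window of filter-bank frames into per-band statistics.

    frames is a list of frames, each a list of band energies. Using
    mean/std/min/max is an assumption made for this illustration.
    """
    feats = []
    for band in zip(*frames):  # iterate over bands, not frames
        feats += [statistics.fmean(band), statistics.pstdev(band),
                  min(band), max(band)]
    return feats


def init_layer(rng, n_in, n_out):
    """Return (weights, biases) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, weights, biases, relu=True):
    """Apply one fully connected layer, optionally followed by ReLU."""
    out = [sum(w * v for w, v in zip(row, x)) + b
           for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def build_model(rng, n_in, hidden_sizes, task_classes):
    """Create shared hidden layers plus one small output head per task."""
    shared, prev = [], n_in
    for h in hidden_sizes:
        shared.append(init_layer(rng, prev, h))
        prev = h
    heads = {task: init_layer(rng, prev, k) for task, k in task_classes.items()}
    return shared, heads


def predict(model, feats):
    """Run the shared layers once, then every task head on the result."""
    shared, heads = model
    h = feats
    for weights, biases in shared:
        h = dense(h, weights, biases)
    return {task: softmax(dense(h, w, b, relu=False))
            for task, (w, b) in heads.items()}


def mac_count(n_in, hidden_sizes, class_counts):
    """Multiply-accumulates for one forward pass (biases ignored)."""
    sizes = [n_in] + list(hidden_sizes)
    shared = sum(a * b for a, b in zip(sizes, sizes[1:]))
    return shared + sum(sizes[-1] * k for k in class_counts)


# Usage with synthetic data: 100 frames of 24 bands (hypothetical sizes).
rng = random.Random(0)
frames = [[rng.random() for _ in range(24)] for _ in range(100)]
feats = summarize(frames)
tasks = {"ambient_scene": 4, "emotion": 5}  # hypothetical tasks and class counts
model = build_model(rng, len(feats), [64, 64], tasks)
probs = predict(model, feats)

# Compare one shared network against two separate single-task networks.
multi_task_macs = mac_count(len(feats), [64, 64], tasks.values())
single_task_macs = sum(mac_count(len(feats), [64, 64], [k])
                       for k in tasks.values())
```

In this toy configuration the shared hidden layers are evaluated once per audio window regardless of how many tasks are attached, so `multi_task_macs` grows only by the cost of each extra output head, whereas `single_task_macs` repeats the full hidden-layer cost per task. The counts are illustrative only and are not meant to reproduce the runtime, energy, or memory figures reported in the abstract.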






      Information & Contributors


      Published In

      Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies  Volume 1, Issue 3
      September 2017
      2023 pages
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 11 September 2017
      Accepted: 01 June 2017
      Revised: 01 May 2017
      Received: 01 November 2016
      Published in IMWUT Volume 1, Issue 3



      Author Tags

      1. Audio sensing
      2. deep learning
      3. multi-task learning
      4. shared representation


      • Research-article
      • Research
      • Refereed

