Backdoor Attacks against Voice Recognition Systems: A Survey

Authors:

Zheng Yan

ACM Computing Surveys, Volume 57, Issue 3

Article No.: 78, Pages 1 - 35

https://doi.org/10.1145/3701985

Published: 22 November 2024

Abstract

Voice Recognition Systems (VRSs) employ deep learning for speech recognition and speaker recognition. They have been widely deployed in real-world applications ranging from intelligent voice assistance to telephony surveillance and biometric authentication. However, prior research has revealed that VRSs are vulnerable to backdoor attacks, which pose a significant threat to their security and privacy. Unfortunately, the existing literature lacks a thorough review of this topic. This paper fills that gap with a comprehensive survey of backdoor attacks against VRSs. We first present an overview of VRSs and backdoor attacks, covering the necessary background. We then propose a set of evaluation criteria for assessing the performance of backdoor attack methods. Next, we present a taxonomy of backdoor attacks against VRSs from different perspectives and analyze the characteristics of each category. After that, we review existing attack methods and analyze their pros and cons against the proposed criteria. Furthermore, we review classic backdoor defense methods and generic audio defense techniques, and discuss the feasibility of deploying them on VRSs. Finally, we identify several open issues and suggest future research directions to motivate research on VRS security.

Index Terms

Backdoor Attacks against Voice Recognition Systems: A Survey
1. Human-centered computing
  1. Human computer interaction (HCI)
2. Security and privacy
  1. Systems security

Published In

ACM Computing Surveys Volume 57, Issue 3

March 2025

984 pages

EISSN:1557-7341

DOI:10.1145/3697147

Editors:
David Atienza, Swiss Federal Institute of Technology Lausanne (EPFL), Switzerland
Michela Milano, University of Bologna, Italy

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 November 2024

Online AM: 26 October 2024

Accepted: 19 October 2024

Revised: 07 July 2024

Received: 19 July 2023

Qualifiers

Survey

Funding Sources

National Natural Science Foundation of China
Key Research Project of Shaanxi Natural Science Foundation
Concept Verification Funding of Hangzhou Institute of Technology of Xidian University
111 Project
Fundamental Research Funds for the Central Universities
