Backdoor Attacks against Voice Recognition Systems: A Survey

Published: 22 November 2024

Abstract

Voice Recognition Systems (VRSs) employ deep learning for speech recognition and speaker recognition. They have been widely deployed in real-world applications, ranging from intelligent voice assistance to telephony surveillance and biometric authentication. However, prior research has shown that VRSs are vulnerable to backdoor attacks, which pose a significant threat to their security and privacy. The existing literature lacks a thorough review of this topic. This paper fills that gap with a comprehensive survey of backdoor attacks against VRSs. We first present an overview of VRSs and backdoor attacks, covering the necessary background. We then propose a set of evaluation criteria for assessing the performance of backdoor attack methods. Next, we present a taxonomy of backdoor attacks against VRSs from different perspectives and analyze the characteristics of each category. After that, we review existing attack methods and analyze their pros and cons against the proposed criteria. Furthermore, we review classic backdoor defense methods and generic audio defense techniques, and discuss the feasibility of deploying them on VRSs. Finally, we identify several open issues and suggest future research directions to motivate further research on VRS security.

      Published In

      ACM Computing Surveys, Volume 57, Issue 3
      March 2025
      984 pages
      EISSN: 1557-7341
      DOI: 10.1145/3697147
      Editors: David Atienza, Michela Milano

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 22 November 2024
      Online AM: 26 October 2024
      Accepted: 19 October 2024
      Revised: 07 July 2024
      Received: 19 July 2023
      Published in CSUR Volume 57, Issue 3

      Author Tags

      1. Backdoor attacks
      2. voice recognition systems
      3. deep learning
      4. speech recognition
      5. speaker recognition

      Qualifiers

      • Survey

      Funding Sources

      • National Natural Science Foundation of China
      • Key Research Project of Shaanxi Natural Science Foundation
      • Concept Verification Funding of Hangzhou Institute of Technology of Xidian University
      • 111 Project
      • Fundamental Research Funds for the Central Universities
