DOI: 10.1145/3607199.3607240
PhantomSound: Black-Box, Query-Efficient Audio Adversarial Attack via Split-Second Phoneme Injection

Published: 16 October 2023

Abstract

In this paper, we propose PhantomSound, a query-efficient black-box attack against voice assistants. Existing black-box adversarial attacks on voice assistants either rely on substitution models or leverage intermediate model outputs to estimate the gradients for crafting adversarial audio samples; both approaches require a large number of queries and a lengthy training stage. PhantomSound instead leverages a decision-based attack to produce effective adversarial audio and reduces the number of queries by optimizing the gradient estimation. In our experiments, we attack 4 different speech-to-text APIs under 3 real-world scenarios to demonstrate the real-time impact of the attack. The results show that PhantomSound is practical and robust in attacking 5 popular commercial voice-controllable devices over the air, and it bypasses 3 liveness detection mechanisms with a high success rate. The benchmark results show that PhantomSound can generate adversarial examples and launch the attack within a few minutes. We significantly enhance query efficiency, reducing the cost of a successful untargeted and targeted adversarial attack by 93.1% and 65.5% compared with state-of-the-art black-box attacks, using merely ∼300 queries (∼5 minutes) and ∼1,500 queries (∼25 minutes), respectively.
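
To illustrate the decision-based (hard-label) attack paradigm the abstract refers to, where each query returns only the API's final transcription and no probabilities or gradients, the sketch below shows a generic Monte-Carlo gradient estimator and boundary-refinement loop in the style of decision-based attacks such as HopSkipJump or Sign-OPT. This is a minimal sketch, not PhantomSound's actual algorithm: the oracle query_transcription, the indicator phi, and all parameter values are hypothetical placeholders.

```python
import numpy as np

# Hypothetical oracle: returns the API's final transcription for a waveform in [-1, 1].
# In a real attack this would wrap a commercial speech-to-text API call.
def query_transcription(audio: np.ndarray) -> str:
    raise NotImplementedError("wrap a speech-to-text API here")

def phi(audio: np.ndarray, target: str) -> int:
    """Hard-label indicator: +1 if the audio is transcribed as the target phrase, else -1."""
    return 1 if query_transcription(audio) == target else -1

def estimate_gradient(x: np.ndarray, target: str,
                      n_samples: int = 50, delta: float = 1e-3) -> np.ndarray:
    """Monte-Carlo estimate of the boundary gradient using only hard-label feedback."""
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        u = np.random.randn(*x.shape).astype(x.dtype)
        u /= np.linalg.norm(u)
        grad += phi(x + delta * u, target) * u          # one query per sampled direction
    return grad / n_samples

def decision_based_attack(x_benign: np.ndarray, x_adv: np.ndarray, target: str,
                          steps: int = 20, lr: float = 0.01) -> np.ndarray:
    """Refine an initial adversarial audio x_adv (already transcribed as `target`)
    so it stays adversarial while the perturbation w.r.t. x_benign shrinks."""
    x = x_adv.copy()
    for _ in range(steps):
        g = estimate_gradient(x, target)
        x = x + lr * g / (np.linalg.norm(g) + 1e-12)    # step along the estimated gradient
        # Binary search back toward the benign audio to keep the perturbation small.
        lo, hi = 0.0, 1.0                               # fraction of x_benign mixed in
        for _ in range(10):
            mid = (lo + hi) / 2
            if phi((1 - mid) * x + mid * x_benign, target) == 1:
                lo = mid
            else:
                hi = mid
        x = (1 - lo) * x + lo * x_benign
    return np.clip(x, -1.0, 1.0)
```

With the defaults above, each refinement step spends n_samples + 10 queries, so the overall query budget is dominated by the gradient estimation. Reducing the number of sampling directions needed per step is the lever that the abstract's claim of optimized gradient estimation targets.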




      Published In

      RAID '23: Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses
      October 2023
      769 pages
      ISBN:9798400707650
      DOI:10.1145/3607199

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Adversarial attack
      2. black-box attack
      3. query efficiency
      4. voice assistant

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      RAID 2023

      Acceptance Rates

      Overall Acceptance Rate 43 of 173 submissions, 25%

      Cited By

      • (2024) Adversarial Attack and Defense for Commercial Black-box Chinese-English Speech Recognition Systems. ACM Transactions on Privacy and Security 28(1), 1–27. DOI: 10.1145/3701725. Online publication date: 7-Nov-2024.
      • (2024) PiezoBud: A Piezo-Aided Secure Earbud with Practical Speaker Authentication. Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems, 564–577. DOI: 10.1145/3666025.3699358. Online publication date: 4-Nov-2024.
      • (2024) WavePurifier: Purifying Audio Adversarial Examples via Hierarchical Diffusion Models. Proceedings of the 30th Annual International Conference on Mobile Computing and Networking, 1268–1282. DOI: 10.1145/3636534.3690692. Online publication date: 4-Dec-2024.
      • (2024) Multi-Turn Hidden Backdoor in Large Language Model-powered Chatbot Models. Proceedings of the 19th ACM Asia Conference on Computer and Communications Security, 1316–1330. DOI: 10.1145/3634737.3656289. Online publication date: 1-Jul-2024.
      • (2024) GWO-GEA: A Black-box Method for Generating Audio Adversarial Examples. 2024 9th International Conference on Image, Vision and Computing (ICIVC), 439–444. DOI: 10.1109/ICIVC61627.2024.10837319. Online publication date: 15-Jul-2024.
      • (2024) Adversarial Attacks on Automatic Speech Recognition (ASR): A Survey. IEEE Access 12, 88279–88302. DOI: 10.1109/ACCESS.2024.3416965. Online publication date: 2024.
