DOI: 10.1145/3503181.3503218
Research article

Voice spoofing detection with raw waveform based on Dual Path Res2net

Published: 01 March 2022

Abstract

The natural-sounding speech produced by recent text-to-speech and voice conversion techniques poses serious threats to automatic speaker verification systems. Most existing spoofing detection countermeasures perform well when the nature of the attacks is known during training, but their performance in realistic applications degrades on unseen attack types. To address this concern, we propose a novel spoof detection method, Dual Path Res2Net (DP-Res2Net), to improve robustness to unknown attacks. For feature engineering, we employ time-domain features rather than the commonly used frequency-domain ones, feeding 80,000 raw-waveform sampling points directly into the network. These inputs are then processed by a shallow feature learning module, an interactive feature learning module, a deep feature learning module, and a discriminator network. The dual-path residual-like blocks exploit the dependencies between successive audio segments through large receptive fields, and the proposed DP-Res2Net significantly improves the model’s generalizability to unseen spoofing attacks. We evaluate the proposed method on the publicly available ASVspoof 2019 logical access (LA) evaluation set, and the results show that it outperforms state-of-the-art audio spoof detection models.
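As a rough illustration of the raw-waveform front end described above, the sketch below pads or truncates an utterance to exactly 80,000 samples, the fixed input length the abstract states. The tile-padding convention and the function name are assumptions for illustration only; the paper does not specify its padding scheme.

```python
import numpy as np

TARGET_LEN = 80_000  # fixed number of raw-waveform samples fed to the network


def fix_length(waveform: np.ndarray, target_len: int = TARGET_LEN) -> np.ndarray:
    """Truncate or tile-pad a 1-D waveform to exactly target_len samples.

    Tiling (repeating the clip) is one common convention for short
    utterances in anti-spoofing pipelines; this choice is an assumption,
    not taken from the paper.
    """
    if len(waveform) >= target_len:
        return waveform[:target_len]
    repeats = int(np.ceil(target_len / len(waveform)))
    return np.tile(waveform, repeats)[:target_len]


short = np.random.randn(16_000)   # e.g. 1 s of audio at 16 kHz
long_ = np.random.randn(100_000)  # longer than the target length
assert fix_length(short).shape == (TARGET_LEN,)
assert fix_length(long_).shape == (TARGET_LEN,)
```

The fixed-length array can then be handed directly to a 1-D convolutional front end, which is the general shape of raw-waveform models such as RawNet-style networks.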


Cited By

  • (2022) Synthetic Voice Detection and Audio Splicing Detection using SE-Res2Net-Conformer Architecture. 2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP), 115–119. https://doi.org/10.1109/ISCSLP57327.2022.10037999. Online publication date: 11-Dec-2022.

Published In

ICCSE '21: 5th International Conference on Crowd Science and Engineering
October 2021
182 pages
ISBN:9781450395540
DOI:10.1145/3503181

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Audio anti-spoofing
  2. Residual network
  3. Synthetic speech detection

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Acceptance Rates

Overall Acceptance Rate 92 of 247 submissions, 37%
