research-article

Spatial reconstructed local attention Res2Net with F0 subband for fake speech detection

Authors:

Chenglong Wang,

Chengshi Zheng,

Zhao Lv

Volume 175, Issue C

https://doi.org/10.1016/j.neunet.2024.106320

Published: 17 July 2024

Abstract

The rhythm of bonafide speech is often difficult to replicate, so the fundamental frequency (F0) of synthetic speech differs significantly from that of real speech. The F0 feature is therefore expected to contain discriminative information for the fake speech detection (FSD) task. In this paper, we propose a novel F0 subband for FSD. To model the F0 subband effectively and improve FSD performance, we also propose the spatial reconstructed local attention Res2Net (SR-LA Res2Net). Specifically, Res2Net is used as the backbone network to obtain multiscale information and is enhanced with a spatial reconstruction mechanism to avoid losing important information when the channel groups are repeatedly superimposed. In addition, local attention is designed to make the model focus on the local information of the F0 subband. Experimental results on the ASVspoof 2019 LA dataset show that the proposed method obtains an equal error rate (EER) of 0.47% and a minimum tandem detection cost function (min t-DCF) of 0.0159, achieving state-of-the-art performance among single systems.
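
For readers unfamiliar with the EER metric reported above, the following is a minimal sketch of how an equal error rate can be computed from detection scores. It is not the paper's evaluation code or the official ASVspoof scoring tool. The function name `compute_eer`, the convention that higher scores mean "more likely bonafide", and the toy scores are all assumptions made for illustration.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for two lists of detection scores.

    Assumption: a higher score means the utterance looks more bonafide,
    and an utterance is accepted when its score is >= the threshold.
    """
    # Candidate thresholds: every distinct observed score.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection rate: bonafide utterances scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed utterances scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (hypothetical, not from the paper).
bonafide = [0.9, 0.8, 0.7, 0.4]
spoof = [0.1, 0.2, 0.3, 0.6]
eer, threshold = compute_eer(bonafide, spoof)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

On these toy scores one bonafide and one spoofed utterance out of four are misclassified at the chosen threshold, so the two error rates meet at 25%. Production scoring tools usually interpolate between thresholds instead of picking the nearest observed score, which matters when the error rates are as low as the 0.47% quoted above.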


Published In

Neural Networks Volume 175, Issue C

Jul 2024

571 pages

Elsevier Ltd.

Publisher

Elsevier Science Ltd.

United Kingdom
