Vocoder Detection of Spoofing Speech Based on GAN Fingerprints and Domain Generalization

Published: 08 March 2024

Abstract

As an important part of a text-to-speech (TTS) system, the vocoder converts acoustic features into speech waveforms. Differences among vocoders are key to producing different types of forged speech in TTS systems. With the rapid development of generative adversarial networks (GANs), an increasing number of GAN vocoders have been proposed. Detectors often encounter vocoders of unknown types, which degrades their generalization performance. However, existing studies have not addressed the generalization of detection across GAN vocoders. To solve this problem, this study proposes a framework for vocoder detection of spoofed speech based on GAN fingerprints and domain generalization. The framework widens the distance between real speech and forged speech in feature space, improving the detection model’s performance. Specifically, we use an autoencoder-based fingerprint extractor to extract GAN fingerprints from vocoders. These fingerprints are then used to weight the forged speech for subsequent classification, so that highly discriminative forged-speech features are learned. Subsequently, domain generalization is used to further improve the model’s ability to generalize to unseen forgery types. We achieve domain generalization using domain-adversarial learning and an asymmetric triplet loss, learning a more generalized feature space in which real speech is compact and forged speech synthesized by different vocoders is dispersed. Finally, to optimize training, curriculum learning dynamically adjusts the contributions of samples of different difficulty. Experimental results show that the proposed method achieves state-of-the-art detection results on four GAN vocoders. The code is available at https://github.com/multimedia-infomation-security/GAN-Vocoder-detection.
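The feature-space goal described in the abstract (real speech compact, forged speech from different vocoders dispersed) can be illustrated with a small sketch of an asymmetric triplet loss. This is not the authors' implementation: the function names, the batch-hard mining strategy, the Euclidean distance, and the margin value are all assumptions chosen for illustration. The only idea taken from the abstract is the asymmetric labeling, in which all real speech shares one class while each vocoder's forged speech forms its own class.

```python
import math


def asymmetric_labels(is_real, vocoder_ids):
    """Map every real sample to class 0 and each vocoder's fakes to their own class.

    This labeling is what makes the loss "asymmetric": real speech is pulled
    into a single cluster, forged speech is kept apart per vocoder.
    """
    return [0 if real else 1 + vid for real, vid in zip(is_real, vocoder_ids)]


def asymmetric_triplet_loss(embeddings, is_real, vocoder_ids, margin=0.5):
    """Batch-hard triplet loss over asymmetric labels (illustrative assumption)."""
    labels = asymmetric_labels(is_real, vocoder_ids)
    n = len(labels)
    losses = []
    for a in range(n):
        # Distances from the anchor to samples of the same class (positives).
        pos = [math.dist(embeddings[a], embeddings[j])
               for j in range(n) if j != a and labels[j] == labels[a]]
        # Distances from the anchor to samples of any other class (negatives).
        neg = [math.dist(embeddings[a], embeddings[j])
               for j in range(n) if labels[j] != labels[a]]
        if not pos or not neg:
            continue  # no valid triplet for this anchor
        # Hardest positive should be closer than hardest negative by `margin`.
        losses.append(max(0.0, max(pos) - min(neg) + margin))
    return sum(losses) / len(losses) if losses else 0.0


# Toy 2-D embeddings: two real samples and two fakes from each of two
# hypothetical vocoders (ids 0 and 1). Real samples carry no vocoder id.
is_real = [True, True, False, False, False, False]
vocoder_ids = [None, None, 0, 0, 1, 1]
separated = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0), (0.0, 5.0), (0.0, 5.1)]
collapsed = [(0.0, 0.0)] * 6

print(asymmetric_triplet_loss(separated, is_real, vocoder_ids))
print(asymmetric_triplet_loss(collapsed, is_real, vocoder_ids))
```

In this toy setup the loss vanishes when the three groups are already well separated and equals the margin when every embedding collapses to the same point, which is the behavior one would expect from a loss that rewards a compact real cluster and dispersed per-vocoder clusters.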

    Published In

    ACM Transactions on Multimedia Computing, Communications, and Applications  Volume 20, Issue 6
    June 2024
    715 pages
    EISSN:1551-6865
    DOI:10.1145/3613638
    Editor: Abdulmotaleb El Saddik

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 08 March 2024
    Online AM: 28 October 2023
    Accepted: 21 October 2023
    Revised: 25 August 2023
    Received: 24 May 2023
    Published in TOMM Volume 20, Issue 6

    Author Tags

    1. Speech forgery
    2. vocoder
    3. GAN fingerprint
    4. domain generalization
    5. curriculum learning

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
