Vocoder Detection of Spoofing Speech Based on GAN Fingerprints and Domain Generalization

Authors:

Fan Li,

Xin Liao

ACM Transactions on Multimedia Computing, Communications and Applications, Volume 20, Issue 6

Article No.: 157, Pages 1 - 20

https://doi.org/10.1145/3630751

Published: 08 March 2024

Abstract

As an important part of text-to-speech (TTS) systems, vocoders convert acoustic features into speech waveforms. Differences among vocoders are a key source of the different types of forged speech that TTS systems produce. With the rapid development of generative adversarial networks (GANs), an increasing number of GAN vocoders have been proposed. Detectors therefore often encounter vocoders of unknown types, which degrades their generalization performance. However, existing studies have paid little attention to detection generalization across GAN vocoders. To address this problem, this study proposes a vocoder detection method for spoofed speech based on GAN fingerprints and domain generalization. The framework widens the distance between real speech and forged speech in feature space, improving the detection model's performance. Specifically, we use an autoencoder-based fingerprint extractor to extract GAN fingerprints from vocoders. These fingerprints are then weighted onto the forged speech for subsequent classification, so that the model learns highly discriminative forged-speech features. Subsequently, domain generalization is used to further improve the model's generalization to unseen forgery types. We achieve domain generalization using domain-adversarial learning and an asymmetric triplet loss, which together learn a better-generalized feature space in which real speech is compact and forged speech synthesized by different vocoders is dispersed. Finally, to optimize training, curriculum learning dynamically adjusts the contributions of samples of different difficulty. Experimental results show that the proposed method achieves state-of-the-art detection results on four GAN vocoders. The code is available at https://github.com/multimedia-infomation-security/GAN-Vocoder-detection.
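
The feature-space goal stated in the abstract (real speech compact, forged speech from different vocoders dispersed) can be illustrated with a small sketch. This is not the authors' implementation; it is one common reading of an asymmetric triplet loss, assuming that all real samples share a single group while each vocoder's forged samples form their own group, and that a standard hinge on Euclidean distances is used. The function names, the margin value, and the toy 2-D embeddings are all hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def asymmetric_groups(is_real, vocoder_ids):
    """Assumed grouping: every real sample goes to group 0,
    forged samples get one group per vocoder (1 + vocoder id)."""
    return [0 if real else 1 + vid for real, vid in zip(is_real, vocoder_ids)]


def triplet_loss(embeddings, groups, margin=1.0):
    """Mean hinge loss over all (anchor, positive, negative) triplets.

    Positives share the anchor's group; negatives come from any other group.
    Because real speech is a single group, it is pulled together, while the
    per-vocoder forged groups are pushed away from real speech and from
    each other.
    """
    losses = []
    n = len(embeddings)
    for a in range(n):
        for p in range(n):
            if p == a or groups[p] != groups[a]:
                continue
            d_ap = euclidean(embeddings[a], embeddings[p])
            for k in range(n):
                if groups[k] == groups[a]:
                    continue
                d_an = euclidean(embeddings[a], embeddings[k])
                losses.append(max(0.0, d_ap - d_an + margin))
    return sum(losses) / len(losses) if losses else 0.0


# Toy batch: two real samples, two forged samples from each of two vocoders.
is_real = [True, True, False, False, False, False]
vocoder_ids = [-1, -1, 0, 0, 1, 1]  # value is ignored for real speech
groups = asymmetric_groups(is_real, vocoder_ids)

# Hypothetical 2-D embeddings where the three groups are well separated.
separated = [[0.0, 0.0], [0.1, 0.0], [3.0, 0.0], [3.0, 0.1], [0.0, 3.0], [0.1, 3.0]]
# Same batch, but both vocoders' forged speech collapses onto one cluster.
collapsed = [[0.0, 0.0], [0.1, 0.0], [3.0, 0.0], [3.0, 0.1], [3.0, 0.0], [3.0, 0.1]]

print(triplet_loss(separated, groups))
print(triplet_loss(collapsed, groups))
```

Under these assumptions the loss vanishes only when forged speech from different vocoders is separated as well as pushed away from real speech, which is the asymmetry the abstract describes. In practice the embeddings would come from the trained feature extractor and the loss would be combined with the classification and domain-adversarial objectives.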

