Target speaker filtration by mask estimation for source speaker traceability in voice conversion

Published: 01 October 2024

Abstract

Voice conversion (VC) manipulates the speaker identity of a speech signal so that it sounds like a specific target speaker, making it harder for a human listener or a speaker verification/identification system to trace the real identity of the source speaker. Extracting source-speaker features from converted audio is challenging because the target speaker's features dominate the converted signal. In this paper, a speaker filtration block is designed to extract source-speaker features from audio processed by VC methods: it uses mask estimation to filter out the target speaker's features from the converted audio and thereby identify the source speaker in manipulated speech. Extensive experiments evaluate the proposed model in tracing the source speakers of audio converted by ADAIN-VC, AGAIN-VC, VQMIVC, and FREEVC. The results demonstrate its effectiveness compared with competitive baselines in speaker verification/identification scenarios. Notably, the model performs well even when applied to unseen VC methods. Furthermore, the experiments show that training on audio generated by multiple VC methods improves source-speaker traceability.
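To make the mask-estimation idea concrete, below is a minimal PyTorch sketch of a speaker filtration block. This is an illustrative assumption, not the paper's exact architecture: the class name, dimensions, and the BiLSTM mask estimator (in the spirit of VoiceFilter [21]) are hypothetical. The block conditions frame-level features of converted speech on a target-speaker embedding, predicts a soft mask over the feature bins, and keeps the complement, so the residual features can be used to trace the source speaker.

```python
import torch
import torch.nn as nn

class SpeakerFiltrationBlock(nn.Module):
    """Illustrative mask-estimation block (a sketch, not the paper's design).

    Given frame-level features of converted speech and an embedding of the
    dominant target speaker, predict a soft mask that suppresses the target
    speaker's components; the residual is used for source-speaker tracing.
    """

    def __init__(self, feat_dim=256, spk_dim=192, hidden=256):
        super().__init__()
        # BiLSTM over frames, conditioned on the target-speaker embedding
        self.rnn = nn.LSTM(feat_dim + spk_dim, hidden,
                           batch_first=True, bidirectional=True)
        self.mask_head = nn.Sequential(
            nn.Linear(2 * hidden, feat_dim),
            nn.Sigmoid(),  # soft mask in [0, 1] per feature bin
        )

    def forward(self, feats, target_emb):
        # feats: (B, T, feat_dim); target_emb: (B, spk_dim)
        cond = target_emb.unsqueeze(1).expand(-1, feats.size(1), -1)
        h, _ = self.rnn(torch.cat([feats, cond], dim=-1))
        mask = self.mask_head(h)     # estimated target-speaker mask
        return feats * (1.0 - mask)  # keep the non-target residual


# Toy usage: filter a batch of 2 utterances, 100 frames each
block = SpeakerFiltrationBlock()
residual = block(torch.randn(2, 100, 256), torch.randn(2, 192))
print(residual.shape)  # torch.Size([2, 100, 256])
```

The key design point the sketch captures is that the mask is estimated rather than fixed: conditioning on the target-speaker embedding lets the block learn which components of the converted signal belong to the target speaker, so everything else is passed through for source-speaker identification.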

References

[1]
D. Cai, Z. Cai, M. Li, Identifying source speakers for voice conversion based spoofing attacks on speaker verification systems, in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023, pp. 1–5.
[2]
Y.-H. Chen, D.-Y. Wu, T.-H. Wu, H. Lee, AGAIN-VC: a one-shot voice conversion using activation guidance and adaptive instance normalization, in: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 2021, pp. 5954–5958.
[3]
S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Zhuo Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Yanmin Qian, Yao Qian, J. Wu, M. Zeng, X. Yu, F. Wei, WavLM: large-scale self-supervised pre-training for full stack speech processing, IEEE J. Sel. Top. Signal Process. 16 (2022) 1505–1518.
[4]
J. Chou, C. Yeh, H. Lee, One-shot voice conversion by separating speaker and content representations with instance normalization, in: Proc. 2019 20th Annual Conference of the International Speech Communication Association (INTERSPEECH), Graz, Austria, 2019, pp. 664–668.
[5]
J. Deng, J. Guo, J. Yang, N. Xue, I. Kotsia, S. Zafeiriou, ArcFace: additive angular margin loss for deep face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 44 (2022) 5962–5979.
[6]
J. Deng, Y. Chen, Y. Zhong, Q. Miao, X. Gong, W. Xu, Catch you and I can: revealing source voiceprint against voice conversion, in: Proc. 2023 32nd USENIX Security Symposium (USENIX), Anaheim, CA, USA, 2023, pp. 5163–5180. https://www.usenix.org/conference/usenixsecurity23/presentation/deng-jiangyi-voiceprint.
[7]
B. Desplanques, J. Thienpondt, K. Demuynck, ECAPA-TDNN: emphasized channel attention, propagation and aggregation in TDNN based speaker verification, in: Proc. 2020 21st Annual Conference of the International Speech Communication Association (INTERSPEECH), Shanghai, China, 2020, pp. 3830–3834.
[8]
K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 770–778.
[9]
Z. Huang, W. Xu, K. Yu, Bidirectional LSTM-CRF models for sequence tagging, arXiv preprint arXiv:1508.01991, 2015. http://arxiv.org/abs/1508.01991.
[10]
J. Yamagishi, C. Veaux, K. MacDonald, CSTR VCTK Corpus: English Multi-Speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92), University of Edinburgh, The Centre for Speech Technology Research (CSTR), 2019.
[11]
J. Kong, J. Kim, J. Bae, HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis, in: Proc. 2020 34th Conference on Neural Information Processing Systems (NeurIPS), Vancouver, Canada, 2020, pp. 17022–17033. https://dl.acm.org/doi/abs/10.5555/3495724.3497152.
[12]
K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W.Z. Teoh, J. Sotelo, A. de Brebisson, Y. Bengio, A. Courville, MelGAN: generative adversarial networks for conditional waveform synthesis, in: Proc. 2019 33rd Conference on Neural Information Processing Systems (NeurIPS), Vancouver, Canada, 2019, pp. 14910–14921. https://dl.acm.org/doi/abs/10.5555/3454287.3455622.
[13]
J. Li, W. Tu, L. Xiao, FreeVC: towards high-quality text-free one-shot voice conversion, in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023, pp. 1–5.
[14]
H. Ma, J. Yi, J. Tao, Y. Bai, Z. Tian, C. Wang, Continual learning for fake audio detection, in: Proc. 2021 22nd Annual Conference of the International Speech Communication Association (INTERSPEECH), Brno, Czech Republic, 2021, pp. 886–890.
[15]
N.M. Müller, K. Pizzi, J. Williams, Human perception of audio deepfakes, in: Proc. 2022 30th ACM International Conference on Multimedia, ACM, Lisboa, Portugal, 2022, pp. 85–91.
[16]
K. Okabe, T. Koshinaka, K. Shinoda, Attentive statistics pooling for deep speaker embedding, in: Proc. 2018 19th Annual Conference of the International Speech Communication Association (INTERSPEECH), Hyderabad, India, 2018, pp. 2252–2256.
[17]
K. Qian, Y. Zhang, H. Gao, J. Ni, C.-I.J. Lai, D. Cox, M. Hasegawa-Johnson, S. Chang, CONTENTVEC: an improved self-supervised speech representation by disentangling speakers, in: Proc. 2022 39th International Conference on Machine Learning (ICML), Baltimore, Maryland, USA, 2022, pp. 18003–18017.
[18]
Y. Ren, H. Zhu, L. Zhai, Z. Sun, R. Shen, L. Wang, Who is speaking actually? Robust and versatile speaker traceability for voice conversion, in: Proc. 2023 31st ACM International Conference on Multimedia, Ottawa, Canada, 2023, pp. 8674–8685.
[19]
B.M.L. Srivastava, N. Vauquier, M. Sahidullah, A. Bellet, M. Tommasi, E. Vincent, Evaluating voice conversion-based privacy protection against informed attackers, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 2802–2806.
[20]
H. Tak, J. Jung, J. Patino, M. Kamble, M. Todisco, N. Evans, End-to-end spectro-temporal graph attention networks for speaker verification anti-spoofing and speech deepfake detection, arXiv preprint arXiv:2107.12710, 2021. http://arxiv.org/abs/2107.12710.
[21]
Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. Hershey, R.A. Saurous, R.J. Weiss, Y. Jia, I.L. Moreno, VoiceFilter: targeted voice separation by speaker-conditioned spectrogram masking, in: Proc. 2019 20th Annual Conference of the International Speech Communication Association (INTERSPEECH), Graz, Austria, 2019, pp. 2728–2732.
[22]
D. Wang, L. Deng, Y.T. Yeung, X. Chen, X. Liu, H. Meng, VQMIVC: vector quantization and mutual information-based unsupervised speech representation disentanglement for one-shot voice conversion, in: Proc. 2021 22nd Annual Conference of the International Speech Communication Association (INTERSPEECH), Brno, Czechia, 2021, pp. 1344–1348.
[23]
X. Xiang, S. Wang, H. Huang, Y. Qian, K. Yu, Margin matters: towards more discriminative deep neural network embeddings for speaker recognition, in: Proc. 2019 11th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Lanzhou, China, 2019, pp. 1652–1656.
[24]
R. Yamamoto, E. Song, J.-M. Kim, Parallel WaveGAN: a fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 6199–6203.
[25]
M. Zakariah, M.K. Khan, H. Malik, Digital multimedia audio forensics: past, present and future, Multimed. Tool. Appl. 77 (2018) 1009–1040.
[26]
L. Zheng, J. Li, M. Sun, X. Zhang, T.F. Zheng, When automatic voice disguise meets automatic speaker verification, IEEE Trans. Inf. Forensics Secur. 16 (2021) 824–837.

Published In

Engineering Applications of Artificial Intelligence, Volume 136, Issue PB
Oct 2024
1562 pages

Publisher

Pergamon Press, Inc.

United States

Author Tags

  1. Speaker traceability
  2. Voice conversion
  3. Deep feature extraction
  4. Mask estimation
  5. Orthogonal decomposition
  6. Audio forensics
