DOI: 10.1145/3643491.3660286
Research article | Open access

Introduction to Audio Deepfake Generation: Academic Insights for Non-Experts

Published: 10 June 2024

Abstract

With the advancement of artificial intelligence, methods for generating audio deepfakes have improved, but the technology behind them has also become more complex. Despite this, non-expert users can generate audio deepfakes because the latest technologies have become increasingly accessible. These technologies can support content creators, singers, and businesses such as the advertising and entertainment industries. However, they can also be misused for disinformation, CEO fraud, and voice scams. With the growing demand for countermeasures against such misuse, continuous interdisciplinary exchange is therefore required. This work introduces recent techniques for generating audio deepfakes to non-experts, with a focus on Text-to-Speech Synthesis and Voice Conversion. It covers background knowledge, the latest trends and models, and open-source and closed-source software, exploring both the technological and practical aspects of audio deepfakes.
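To illustrate how accessible such tools have become to non-experts, the following is a minimal sketch of zero-shot voice cloning via Text-to-Speech Synthesis. It assumes the open-source Coqui TTS Python package and its pretrained YourTTS checkpoint; the model identifier and file names are illustrative assumptions and are not taken from this article.

    # Minimal sketch of zero-shot voice cloning with an open-source TTS toolkit.
    # Assumptions (not from the article): the Coqui TTS package (pip install TTS),
    # its pretrained YourTTS checkpoint, and the example file names below.
    from TTS.api import TTS

    # Download and load a pretrained multilingual, multi-speaker model.
    tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")

    # Synthesize speech in the voice of a short reference recording
    # (zero-shot speaker adaptation from a few seconds of target audio).
    tts.tts_to_file(
        text="This sentence was never spoken by the person you hear.",
        speaker_wav="reference_speaker.wav",  # a few seconds of the target voice
        language="en",
        file_path="cloned_output.wav",
    )

Voice Conversion, which transforms an existing recording into another speaker's voice rather than synthesizing speech from text, is exposed through similarly compact interfaces in comparable open-source toolkits.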


Cited By

  • (2024) MAD '24 Workshop: Multimedia AI against Disinformation. In Proceedings of the 2024 International Conference on Multimedia Retrieval, 1339–1341. https://doi.org/10.1145/3652583.3660000. Online publication date: 30-May-2024.


Published In

MAD '24: Proceedings of the 3rd ACM International Workshop on Multimedia AI against Disinformation
June 2024
107 pages
ISBN: 9798400705526
DOI: 10.1145/3643491
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 June 2024

Author Tags

  1. Attacks
  2. Audio Deepfakes
  3. Disinformation
  4. Text-to-Speech Synthesis
  5. Voice Conversion

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICMR '24

Article Metrics

  • Downloads (last 12 months): 815
  • Downloads (last 6 weeks): 160
Reflects downloads up to 11 Feb 2025.
