DOI: 10.1145/3640543.3645146

Conan's Bow Tie: A Streaming Voice Conversion for Real-Time VTuber Livestreaming

Published: 05 April 2024

Abstract

Recent years have witnessed dramatic growth in Virtual YouTubers (VTubers) as a new business on social media platforms such as YouTube, Twitch, and TikTok. However, a significant challenge arises when VTuber voice actors face health issues or retire, jeopardizing the continuity of their avatars' recognizable voices. Our work is inspired by a potential solution reminiscent of Conan's Bow Tie voice changer in the popular anime Case Closed (i.e., Detective Conan). To make this a reality, we introduce VTuberBowTie, a user-friendly streaming voice conversion system for real-time VTuber livestreaming. We propose a streaming voice conversion approach that tackles the limited context modeling and bidirectional context dependence inherent to conventional real-time voice conversion. Rather than processing the voice stream chunk by chunk in isolation, our approach adopts a fully sequential structure that leverages the contextual information preceding the input chunk, thereby expanding the perceptual range and enabling seamless concatenation. Moreover, we develop a ready-to-use interaction interface for VTuberBowTie and deploy it on various computing platforms. Experimental results show that VTuberBowTie achieves high-quality voice conversion in a streaming manner with a latency of 179.1 ms on CPU and 70.8 ms on GPU while providing users with a friendly interactive experience.
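The chunk-plus-preceding-context idea described in the abstract can be illustrated with a minimal sketch. The Python sketch below is not the paper's actual model: StreamingConverter, push, and convert_fn are hypothetical names, and the conversion model is replaced by an identity placeholder. It only shows the mechanism the abstract describes, converting each incoming chunk together with a rolling buffer of already-seen audio, so that no future look-ahead is required and consecutive outputs concatenate without boundary artifacts.

    import numpy as np

    class StreamingConverter:
        """Minimal sketch: chunk-wise conversion with a rolling left-context buffer.

        `convert_fn` stands in for any frame-level voice conversion model; here it
        defaults to an identity placeholder so the sketch runs without extra
        dependencies.
        """

        def __init__(self, chunk_size=1600, context_size=8000, convert_fn=None):
            self.chunk_size = chunk_size        # samples emitted per step
            self.context_size = context_size    # preceding samples kept as context
            self.convert_fn = convert_fn or (lambda x: x)
            self.context = np.zeros(0, dtype=np.float32)

        def push(self, chunk):
            """Convert one incoming chunk, conditioned on previously seen audio."""
            chunk = np.asarray(chunk, dtype=np.float32)
            # Prepend the retained left context so the model sees more than the
            # current chunk alone (causal context only, no look-ahead).
            window = np.concatenate([self.context, chunk])
            converted = self.convert_fn(window)
            # Emit only the samples belonging to the new chunk; the overlap with
            # the context lets consecutive outputs join without gaps.
            out = converted[-len(chunk):]
            # Slide the context window forward, bounded by context_size.
            self.context = window[-self.context_size:]
            return out

    if __name__ == "__main__":
        sr = 16000
        stream = np.random.randn(sr * 3).astype(np.float32)  # 3 s of dummy audio
        conv = StreamingConverter(chunk_size=sr // 10)        # 100 ms chunks
        outputs = [conv.push(stream[i:i + conv.chunk_size])
                   for i in range(0, len(stream), conv.chunk_size)]
        print(len(np.concatenate(outputs)) == len(stream))    # True

In a real deployment, convert_fn would be the streaming conversion network, and chunk_size and context_size would be chosen to trade off latency against how much preceding context the model can exploit.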



Published In

IUI '24: Proceedings of the 29th International Conference on Intelligent User Interfaces
March 2024
955 pages
ISBN: 9798400705083
DOI: 10.1145/3640543

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Generative AI
  2. Livestreaming
  3. Multi-modal interfaces
  4. Virtual reality
  5. Voice conversion

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

IUI '24

Acceptance Rates

Overall Acceptance Rate 746 of 2,811 submissions, 27%

