DOI: 10.1145/3640543.3645146

Conan's Bow Tie: A Streaming Voice Conversion for Real-Time VTuber Livestreaming

Published: 05 April 2024

Abstract

Recent years have witnessed dramatic growth in Virtual YouTubers (VTubers) as a new business on social media platforms such as YouTube, Twitch, and TikTok. However, a significant challenge arises when VTuber voice actors face health issues or retire, jeopardizing the continuity of their avatars' recognizable voices. Our work is inspired by a potential solution reminiscent of Conan's Bow Tie voice changer in the popular anime Case Closed (i.e., Detective Conan). To make this a reality, we introduce VTuberBowTie, a user-friendly streaming voice conversion system for real-time VTuber livestreaming. We propose a streaming voice conversion approach that tackles the limited context modeling and bidirectional context dependence inherent to conventional real-time voice conversion. Rather than processing the voice stream chunk by chunk in isolation, our approach adopts a fully sequential structure that leverages the contextual information preceding the input chunk, thereby expanding the perceptual range and enabling seamless concatenation. Moreover, we develop a ready-to-use interaction interface for VTuberBowTie and deploy it on various computing platforms. Experimental results show that VTuberBowTie achieves high-quality voice conversion in a streaming manner with a latency of 179.1 ms on CPU and 70.8 ms on GPU while providing users with a friendly interactive experience.
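The chunk-plus-preceding-context idea described in the abstract can be illustrated with a minimal sketch. The Python sketch below is not the paper's actual model: StreamingConverter, push, and convert_fn are hypothetical names, and the conversion model is replaced by an identity placeholder. It only shows the mechanism the abstract describes, converting each incoming chunk together with a rolling buffer of already-seen audio, so that no future look-ahead is required and consecutive outputs concatenate without boundary artifacts.

    import numpy as np

    class StreamingConverter:
        """Minimal sketch: chunk-wise conversion with a rolling left-context buffer.

        `convert_fn` stands in for any frame-level voice conversion model; here it
        defaults to an identity placeholder so the sketch runs without extra
        dependencies.
        """

        def __init__(self, chunk_size=1600, context_size=8000, convert_fn=None):
            self.chunk_size = chunk_size        # samples emitted per step
            self.context_size = context_size    # preceding samples kept as context
            self.convert_fn = convert_fn or (lambda x: x)
            self.context = np.zeros(0, dtype=np.float32)

        def push(self, chunk):
            """Convert one incoming chunk, conditioned on previously seen audio."""
            chunk = np.asarray(chunk, dtype=np.float32)
            # Prepend the retained left context so the model sees more than the
            # current chunk alone (causal context only, no look-ahead).
            window = np.concatenate([self.context, chunk])
            converted = self.convert_fn(window)
            # Emit only the samples belonging to the new chunk; the overlap with
            # the context lets consecutive outputs join without gaps.
            out = converted[-len(chunk):]
            # Slide the context window forward, bounded by context_size.
            self.context = window[-self.context_size:]
            return out

    if __name__ == "__main__":
        sr = 16000
        stream = np.random.randn(sr * 3).astype(np.float32)  # 3 s of dummy audio
        conv = StreamingConverter(chunk_size=sr // 10)        # 100 ms chunks
        outputs = [conv.push(stream[i:i + conv.chunk_size])
                   for i in range(0, len(stream), conv.chunk_size)]
        print(len(np.concatenate(outputs)) == len(stream))    # True

In a real deployment, convert_fn would be the streaming conversion network, and chunk_size and context_size would be chosen to trade off latency against how much preceding context the model can exploit.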



Published In

IUI '24: Proceedings of the 29th International Conference on Intelligent User Interfaces
March 2024
955 pages
ISBN: 9798400705083
DOI: 10.1145/3640543

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Generative AI
  2. Livestreaming
  3. Multi-modal interfaces
  4. Virtual reality
  5. Voice conversion

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

IUI '24

Acceptance Rates

Overall Acceptance Rate 746 of 2,811 submissions, 27%

