DOI: 10.1145/3461615.3491114

TeNC: Low Bit-Rate Speech Coding with VQ-VAE and GAN

Published: 17 December 2021
    Abstract

    Speech coding aims at compressing digital speech signals into fewer bits and reconstructing them back into raw signals while preserving speech quality as much as possible. However, conventional codecs usually require a high bit-rate to produce reconstructed speech of reasonably high quality. In this paper, we propose an end-to-end neural generative codec that combines a VQ-VAE based auto-encoder with a generative adversarial network (GAN) and achieves high-fidelity reconstructed speech at a low bit-rate of about 2 kb/s. Compression is carried out by a down-sampling module in the encoder together with a learnable discrete codebook, and the GAN is used to further improve reconstruction quality. Our experiments confirm the effectiveness of the proposed model in both objective and subjective tests: it significantly outperforms conventional codecs at low bit-rates in terms of speech quality and speaker similarity.
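
    The ~2 kb/s figure follows directly from the size of the discrete bottleneck: only the codebook indices need to be transmitted, so the rate is bits-per-code times frames-per-second. The sketch below is a minimal, generic VQ-VAE-style quantization layer in PyTorch together with the bit-rate it implies; the codebook size, latent dimension, and frame rate are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch (not the authors' implementation): a VQ-VAE-style
# quantization bottleneck and the bit-rate it implies. Codebook size,
# latent dimension, and frame rate below are assumed for illustration.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 256, dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)   # learnable discrete codebook
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta                               # commitment-loss weight

    def forward(self, z_e: torch.Tensor):
        # z_e: (batch, frames, dim) continuous output of a down-sampling encoder
        flat = z_e.reshape(-1, z_e.size(-1))
        dist = torch.cdist(flat, self.codebook.weight)       # distance to every code
        indices = dist.argmin(dim=-1).view(z_e.shape[:-1])   # one code index per frame
        z_q = self.codebook(indices)                         # nearest codebook vectors
        # Codebook and commitment losses; straight-through estimator for gradients.
        vq_loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices, vq_loss

# Usage: quantize one second of assumed 4 ms frames.
vq = VectorQuantizer()
z_q, idx, loss = vq(torch.randn(1, 250, 64))

# Bit-rate = (bits per code) x (frames per second).
num_codes, frames_per_second = 256, 250          # assumed: 8 bits/code, 4 ms hop
bitrate = frames_per_second * math.log2(num_codes)
print(f"{bitrate / 1000:.1f} kb/s")              # 2.0 kb/s under these assumptions
```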


    Cited By

    • (2023) MSMC-TTS: Multi-Stage Multi-Codebook VQ-VAE Based Neural TTS. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31, 1811–1824. DOI: 10.1109/TASLP.2023.3272470
    • (2022) NSVQ: Noise Substitution in Vector Quantization for Machine Learning. IEEE Access, 10, 13598–13610. DOI: 10.1109/ACCESS.2022.3147670

    Published In

    ICMI '21 Companion: Companion Publication of the 2021 International Conference on Multimodal Interaction
    October 2021
    418 pages
    ISBN: 9781450384711
    DOI: 10.1145/3461615

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Codec
    2. GAN
    3. VQ-VAE
    4. low bit-rate
    5. neural speech coding

    Qualifiers

    • Short-paper
    • Research
    • Refereed limited

    Conference

    ICMI '21: INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION
    October 18 - 22, 2021
    Montreal, QC, Canada

    Acceptance Rates

    Overall Acceptance Rate 453 of 1,080 submissions, 42%

