DOI: 10.1145/3461615.3491114

TeNC: Low Bit-Rate Speech Coding with VQ-VAE and GAN

Published: 17 December 2021
    Abstract

    Speech coding aims at compressing digital speech signals into fewer bits and reconstructing them back into raw signals while preserving speech quality as much as possible. However, conventional codecs usually require a high bit-rate to produce reconstructed speech of reasonably high quality. In this paper, we propose an end-to-end neural generative codec that combines a VQ-VAE based auto-encoder with a generative adversarial network (GAN) and achieves high-fidelity reconstructed speech at a low bit-rate of about 2 kb/s. Compression is carried out by a down-sampling module in the encoder together with a learnable discrete codebook, and the GAN is used to further improve reconstruction quality. Our experiments confirm the effectiveness of the proposed model in both objective and subjective tests: it significantly outperforms conventional codecs at low bit-rates in terms of speech quality and speaker similarity.
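
    The ~2 kb/s figure follows directly from the size of the discrete bottleneck: only the codebook indices need to be transmitted, so the rate is bits-per-code times frames-per-second. The sketch below is a minimal, generic VQ-VAE-style quantization layer in PyTorch together with the bit-rate it implies; the codebook size, latent dimension, and frame rate are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch (not the authors' implementation): a VQ-VAE-style
# quantization bottleneck and the bit-rate it implies. Codebook size,
# latent dimension, and frame rate below are assumed for illustration.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 256, dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)   # learnable discrete codebook
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta                               # commitment-loss weight

    def forward(self, z_e: torch.Tensor):
        # z_e: (batch, frames, dim) continuous output of a down-sampling encoder
        flat = z_e.reshape(-1, z_e.size(-1))
        dist = torch.cdist(flat, self.codebook.weight)       # distance to every code
        indices = dist.argmin(dim=-1).view(z_e.shape[:-1])   # one code index per frame
        z_q = self.codebook(indices)                         # nearest codebook vectors
        # Codebook and commitment losses; straight-through estimator for gradients.
        vq_loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices, vq_loss

# Usage: quantize one second of assumed 4 ms frames.
vq = VectorQuantizer()
z_q, idx, loss = vq(torch.randn(1, 250, 64))

# Bit-rate = (bits per code) x (frames per second).
num_codes, frames_per_second = 256, 250          # assumed: 8 bits/code, 4 ms hop
bitrate = frames_per_second * math.log2(num_codes)
print(f"{bitrate / 1000:.1f} kb/s")              # 2.0 kb/s under these assumptions
```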


    Cited By

    • (2023) MSMC-TTS: Multi-Stage Multi-Codebook VQ-VAE Based Neural TTS. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31, 1811–1824. DOI: 10.1109/TASLP.2023.3272470
    • (2022) NSVQ: Noise Substitution in Vector Quantization for Machine Learning. IEEE Access, 10, 13598–13610. DOI: 10.1109/ACCESS.2022.3147670

    Published In

    ICMI '21 Companion: Companion Publication of the 2021 International Conference on Multimodal Interaction
    October 2021
    418 pages
    ISBN: 9781450384711
    DOI: 10.1145/3461615

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. Codec
    2. GAN
    3. VQ-VAE
    4. low bit-rate
    5. neural speech coding

    Qualifiers

    • Short-paper
    • Research
    • Refereed limited

    Conference

    ICMI '21: INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION
    October 18 - 22, 2021
    Montreal, QC, Canada

    Acceptance Rates

    Overall Acceptance Rate 453 of 1,080 submissions, 42%

