MVNet: Memory Assistance and Vocal Reinforcement Network for Speech Enhancement

  • Conference paper
Neural Information Processing (ICONIP 2022)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13624))

Abstract

Speech enhancement improves speech quality and benefits the performance of various downstream tasks. However, most current speech enhancement work has been devoted mainly to improving the performance of downstream automatic speech recognition (ASR); only a relatively small amount of work has focused on the automatic speaker verification (ASV) task. In this work, we propose MVNet, which consists of a memory assistance module that improves the performance of downstream ASR and a vocal reinforcement module that boosts the performance of ASV. In addition, we design a new loss function to improve speaker vocal similarity. Experimental results on the Libri2mix dataset show that our method outperforms baseline methods on several metrics, including speech quality, intelligibility, and speaker vocal similarity.
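The abstract does not reproduce the definition of the proposed loss. As a rough illustration only, a speaker-vocal-similarity objective is commonly built from the cosine distance between speaker embeddings of the enhanced and clean signals; the function below is a hypothetical sketch of that idea, not the authors' formulation (the name `vocal_similarity_loss` and the embedding inputs are assumptions):

```python
import numpy as np

def vocal_similarity_loss(enhanced_emb, clean_emb, eps=1e-8):
    """Hypothetical speaker-similarity loss: 1 minus the cosine
    similarity between the speaker embedding of the enhanced speech
    and that of the clean reference. 0 when embeddings align."""
    num = float(np.dot(enhanced_emb, clean_emb))
    denom = np.linalg.norm(enhanced_emb) * np.linalg.norm(clean_emb) + eps
    return 1.0 - num / denom

# Identical embeddings give a loss near 0; orthogonal ones give 1.
same = vocal_similarity_loss(np.array([1.0, 0.0]), np.array([1.0, 0.0]))
orth = vocal_similarity_loss(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

In practice such a term would be combined with a spectral or time-domain reconstruction loss, with the embeddings produced by a pretrained speaker encoder.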


Author information

Corresponding author

Correspondence to Li Liu.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wang, J., Li, X., Li, X., Yu, M., Fang, Q., Liu, L. (2023). MVNet: Memory Assistance and Vocal Reinforcement Network for Speech Enhancement. In: Tanveer, M., Agarwal, S., Ozawa, S., Ekbal, A., Jatowt, A. (eds) Neural Information Processing. ICONIP 2022. Lecture Notes in Computer Science, vol 13624. Springer, Cham. https://doi.org/10.1007/978-3-031-30108-7_9

  • DOI: https://doi.org/10.1007/978-3-031-30108-7_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-30107-0

  • Online ISBN: 978-3-031-30108-7

  • eBook Packages: Computer Science (R0)
