MVNet: Memory Assistance and Vocal Reinforcement Network for Speech Enhancement

Abstract
Speech enhancement improves speech quality and benefits various downstream tasks. However, most existing speech enhancement work targets downstream automatic speech recognition (ASR); comparatively little attention has been paid to automatic speaker verification (ASV). In this work, we propose MVNet, which consists of a memory assistance module that improves downstream ASR performance and a vocal reinforcement module that boosts ASV performance. In addition, we design a new loss function to improve speaker vocal similarity. Experimental results on the Libri2mix dataset show that our method outperforms baseline methods on several metrics, including speech quality, intelligibility, and speaker vocal similarity.
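The abstract does not reproduce the speaker-similarity loss itself. As a rough illustration only, such a term is commonly implemented as a cosine distance between speaker embeddings of the enhanced and clean signals, computed by a frozen pretrained speaker encoder. The PyTorch sketch below follows that common pattern under stated assumptions; `speaker_encoder`, the `alpha` weighting, and the L1 reconstruction term are hypothetical choices for illustration, not the paper's exact formulation.

```python
# Hypothetical sketch of a speaker-similarity loss term (NOT the paper's
# exact loss): penalize the cosine distance between speaker embeddings of
# the enhanced and clean waveforms, using a frozen pretrained speaker
# encoder so that gradients shape the enhancer, not the encoder.
import torch
import torch.nn.functional as F


def speaker_similarity_loss(enhanced: torch.Tensor,
                            clean: torch.Tensor,
                            speaker_encoder: torch.nn.Module) -> torch.Tensor:
    """enhanced, clean: (batch, samples) waveforms.
    speaker_encoder: any module mapping waveforms to (batch, dim) embeddings."""
    with torch.no_grad():
        target_emb = speaker_encoder(clean)      # reference voice print, no grad
    est_emb = speaker_encoder(enhanced)          # gradients flow to the enhancer
    cos = F.cosine_similarity(est_emb, target_emb, dim=-1)
    return (1.0 - cos).mean()                    # 0 when the voices match


def total_loss(enhanced: torch.Tensor,
               clean: torch.Tensor,
               speaker_encoder: torch.nn.Module,
               alpha: float = 0.1) -> torch.Tensor:
    # Illustrative combination with a conventional reconstruction loss;
    # alpha is a hypothetical weighting, not taken from the paper.
    return F.l1_loss(enhanced, clean) + alpha * speaker_similarity_loss(
        enhanced, clean, speaker_encoder)
```

In this kind of setup, the reconstruction term drives overall quality and intelligibility, while the embedding term pulls the enhanced voice toward the target speaker, which is the property ASV systems score.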