Abstract
Speech enhancement is a fundamental way to improve speech perception quality in adverse environments where the received speech is seriously corrupted by noise. In this paper, we propose a cognitive computing based speech enhancement model, termed SETransformer, which can improve speech quality in unknown noisy environments. The proposed SETransformer takes advantage of LSTM and the multi-head attention mechanism, both of which are inspired by the auditory perception principle of human beings. Specifically, the SETransformer possesses the ability to characterize the local structure implicated in the speech spectrum and has lower computational complexity due to its distinctive parallelization performance. Experimental results show that, compared with the standard Transformer and the LSTM model, the proposed SETransformer model consistently achieves better denoising performance in terms of speech quality (PESQ) and speech intelligibility (STOI) under unseen noise conditions.
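The paper itself does not include code. As a rough, hedged illustration of the multi-head self-attention component the abstract refers to, the NumPy sketch below applies scaled dot-product attention across the time frames of a spectrogram-like feature matrix. All dimensions, weight initializations, and names here are hypothetical stand-ins for learned parameters, not the authors' actual SETransformer implementation (which additionally combines this with LSTM layers):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, num_heads, rng):
    """Scaled dot-product multi-head self-attention over time frames.

    X: (T, d_model) -- T spectrogram frames, each a d_model-dim feature vector.
    Returns the attended features (T, d_model) and the per-head attention
    weights (num_heads, T, T).
    """
    T, d_model = X.shape
    assert d_model % num_heads == 0
    d_k = d_model // num_heads
    # Random projections stand in for learned Q/K/V/output weight matrices.
    Wq, Wk, Wv, Wo = [rng.standard_normal((d_model, d_model)) * d_model ** -0.5
                      for _ in range(4)]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    def split(M):
        # (T, d_model) -> (num_heads, T, d_k)
        return M.reshape(T, num_heads, d_k).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)   # (heads, T, T)
    A = softmax(scores, axis=-1)                         # attention weights
    heads = A @ Vh                                       # (heads, T, d_k)
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)
    return concat @ Wo, A

rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 64))  # e.g. 100 frames of 64-dim features
out, attn = multi_head_self_attention(frames, num_heads=4, rng=rng)
```

Each head attends over all T frames at once, which is what gives the attention block its parallelization advantage over a purely recurrent model: the (T, T) score matrix is computed in one batched product rather than a sequential scan over frames.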
Funding
This study was funded in part by the Natural Science Foundation of China (No. 61301295), the Anhui Natural Science Foundation (No. 1708085MF151), and the Anhui University Natural Science Research Project (No. KJ2018A0018).
Ethics declarations
Conflicts of Interest
The authors declare that they have no conflict of interest.
Ethical Approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Cite this article
Yu, W., Zhou, J., Wang, H. et al. SETransformer: Speech Enhancement Transformer. Cogn Comput 14, 1152–1158 (2022). https://doi.org/10.1007/s12559-020-09817-2