
Perception-guided generative adversarial network for end-to-end speech enhancement

Published: 01 October 2022

Abstract

Single-channel speech enhancement has made great progress recently with the development of deep learning. However, achieving promising performance under unseen noisy conditions remains challenging. Generative adversarial networks (GANs) can help alleviate the mismatch between seen training conditions and unseen testing conditions: the generator acts as a speech denoiser that produces enhanced speech to fool the clean/noisy discriminator. In a GAN, the design of the discriminator plays an important role in guiding the generator towards an adequate goal. In this paper, we improve the well-known time-domain framework, Speech Enhancement GAN (SEGAN), by introducing a perception-guided discriminator that quantitatively evaluates speech quality in a way that correlates strongly with human listening. New adversarial structures and a training recipe are proposed, studied, and evaluated on the widely used dataset composed of the Voice Bank corpus and the DEMAND noise database. Experimental results show the superiority of our method over state-of-the-art baselines, obtaining higher or competitive scores on commonly used speech enhancement metrics (STOI = 0.944, SSNR = 12.20, CSIG = 3.99, CBAK = 3.59, PESQ = 2.81, and CVOL = 3.36).


Cited By

  • (2023) Semi-Supervised Learning with Coevolutionary Generative Adversarial Networks. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 568–576. https://doi.org/10.1145/3583131.3590426 (online publication date: 15 July 2023)


Published In

Applied Soft Computing, Volume 128, Issue C, October 2022, 902 pages

Publisher

Elsevier Science Publishers B.V., Netherlands

          Author Tags

          1. Single channel speech enhancement
          2. Generative adversarial networks
          3. Deep learning
          4. End-to-end processing
          5. Perception discriminator

          Qualifiers

          • Research-article
