Abstract
Speech enhancement using neural networks has recently received considerable attention in research and is being integrated into commercial devices and applications. In this work, we investigate data augmentation techniques for supervised deep learning-based speech enhancement. We show that augmenting the SNR values to a broader range with a continuous distribution helps to regularize training, and that augmenting spectral and dynamic level diversity does as well. However, to prevent level augmentation from degrading training, we propose to modify signal-based loss functions by applying sequence-level normalization. Experiments show that this normalization overcomes the degradation caused by training on sequences with imbalanced signal levels when a level-dependent loss function is used.
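The two ideas summarized in the abstract can be illustrated with a short sketch. The Python/NumPy snippet below is a minimal illustration under assumed details, not the authors' implementation: mix_at_random_snr draws the SNR from a continuous range and rescales the mixture to a random playback level (level augmentation), and level_normalized_mse normalizes each sequence by the target's RMS before computing a signal-based loss, so sequences with different absolute levels contribute comparably. Function names, the SNR/level ranges, and the RMS-based normalizer are illustrative assumptions.

import numpy as np

def mix_at_random_snr(speech, noise, snr_db_range=(-5.0, 25.0),
                      level_db_range=(-35.0, -15.0), rng=None):
    # Hypothetical helper: ranges are illustrative, not the paper's values.
    rng = np.random.default_rng() if rng is None else rng
    # Draw the mixing SNR from a continuous (uniform) distribution.
    snr_db = rng.uniform(*snr_db_range)
    speech_rms = np.sqrt(np.mean(speech ** 2)) + 1e-12
    noise_rms = np.sqrt(np.mean(noise ** 2)) + 1e-12
    # Scale the noise so the mixture reaches the drawn SNR relative to the speech.
    noise_scaled = noise * (speech_rms / noise_rms) * 10.0 ** (-snr_db / 20.0)
    mixture = speech + noise_scaled
    # Level augmentation: rescale mixture and target jointly to a random level.
    level_db = rng.uniform(*level_db_range)
    gain = 10.0 ** (level_db / 20.0) / (np.sqrt(np.mean(mixture ** 2)) + 1e-12)
    return mixture * gain, speech * gain

def level_normalized_mse(estimate, target, eps=1e-12):
    # Signal-based MSE after per-sequence level normalization: dividing by the
    # target RMS makes the loss invariant to the absolute level of the sequence.
    scale = np.sqrt(np.mean(target ** 2)) + eps
    return np.mean(((estimate - target) / scale) ** 2)

Without the normalization in level_normalized_mse, a plain MSE would weight loud training sequences far more than quiet ones, which is the degradation the proposed sequence-level normalization is meant to avoid.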
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Braun, S., Tashev, I. (2020). Data Augmentation and Loss Normalization for Deep Noise Suppression. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2020. Lecture Notes in Computer Science, vol 12335. Springer, Cham. https://doi.org/10.1007/978-3-030-60276-5_8
DOI: https://doi.org/10.1007/978-3-030-60276-5_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60275-8
Online ISBN: 978-3-030-60276-5