
Self-supervised speech denoising using only noisy audio signals

Published: 01 April 2023

Highlights

A self-supervised speech denoising training strategy that uses only noisy audio signals.
Training input and target are sub-sampled from the same noisy audio sample, as sketched below.
Both precise phase estimation and context information are considered during training.
Performs very favorably against traditional training strategies.
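The sub-sampling idea in the second highlight can be made concrete with a short sketch. This is a minimal illustration under stated assumptions, not the authors' released implementation: it assumes a Neighbor2Neighbor-style sampler adapted to 1-D audio, where each non-overlapping window of k samples contributes one randomly chosen sample to the training input and a different one to the training target. The function and parameter names (`subsample_pair`, `k`) are hypothetical.

```python
import torch

def subsample_pair(noisy: torch.Tensor, k: int = 2):
    """Draw an (input, target) training pair from one noisy 1-D waveform.

    Hypothetical sketch of a random audio sub-sampler: the waveform is
    split into non-overlapping windows of k samples; each window
    contributes one randomly chosen sample to the input and a different
    randomly chosen sample to the target, so both signals come from the
    same recording but carry independent noise realizations.
    """
    n = (noisy.numel() // k) * k                   # trim to a multiple of k
    windows = noisy[:n].view(-1, k)                # shape: (num_windows, k)
    m = windows.shape[0]
    idx1 = torch.randint(0, k, (m,))               # input position per window
    idx2 = (idx1 + torch.randint(1, k, (m,))) % k  # distinct target position
    rows = torch.arange(m)
    return windows[rows, idx1], windows[rows, idx2]  # each of length n // k
```

Because the two sub-sampled signals share the underlying speech but differ in their noise samples, a network trained to map one onto the other is pushed toward predicting the clean signal, in the spirit of Noise2Noise and Neighbor2Neighbor.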

Abstract

In traditional speech denoising tasks, clean audio signals are used as the training target, but truly clean signals can only be collected with expensive recording equipment or in strictly controlled studio environments. To overcome this drawback, we propose an end-to-end self-supervised speech denoising training scheme, named Only-Noisy Training (ONT), that uses only noisy audio signals and requires no extra training conditions. The proposed ONT strategy constructs training pairs from each single noisy audio sample and contains two modules: a training-pair generation module and a speech denoising module. The first module applies a random audio sub-sampler to each noisy audio signal to generate training pairs. The sub-sampled pairs are then fed into a novel complex-valued speech denoising module. Experimental results show that the proposed method not only eliminates the heavy dependence of traditional audio denoising on clean targets, but also achieves on-par or better performance than other training strategies. Availability: ONT is available at https://github.com/liqingchunnnn/Only-Noisy-Training
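The abstract's two modules can be connected in a single training step. The sketch below is an illustration under assumptions rather than the released ONT code: `complex_mask_denoise` stands in for the paper's complex-valued denoising module, assumed here to predict a complex ratio mask over the STFT (which estimates phase as well as magnitude, matching the abstract's emphasis on precise phase estimation), and the loss is a plain time-domain MSE between the denoised input sub-sample and the target sub-sample; the actual system may use a different architecture and loss.

```python
import torch
import torch.nn.functional as F

def complex_mask_denoise(net, wav, n_fft=512, hop=128):
    """Hypothetical wrapper: enhance `wav` with a complex-valued network.

    Assuming `net` maps a complex spectrogram to a complex ratio mask of
    the same shape, enhancement is mask * spectrogram followed by the
    inverse STFT, so phase is estimated jointly with magnitude.
    """
    window = torch.hann_window(n_fft)
    spec = torch.stft(wav, n_fft, hop, window=window, return_complex=True)
    mask = net(spec)                                  # complex mask (assumed)
    return torch.istft(mask * spec, n_fft, hop, window=window,
                       length=wav.shape[-1])

def ont_training_step(net, noisy, optimizer, k=2):
    """One Only-Noisy-Training step on a single noisy utterance (sketch)."""
    g1, g2 = subsample_pair(noisy, k)                 # pair from the sampler above
    optimizer.zero_grad()
    loss = F.mse_loss(complex_mask_denoise(net, g1.unsqueeze(0)),
                      g2.unsqueeze(0))                # target is also noisy
    loss.backward()
    optimizer.step()
    return loss.item()
```

A call like `ont_training_step(net, noisy_waveform, opt)` on each utterance of a noisy-only corpus is the entire supervision signal in this scheme; no clean recording enters the loop.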



Published In

Speech Communication, Volume 149, Issue C, April 2023, 108 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Publication History

Published: 01 April 2023

Author Tags

  1. Speech denoising
  2. Self-supervised
  3. Training target
  4. Audio sub-sampler

Qualifiers

  • Research-article


Cited By

  • (2023) SIDA. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 7(3), 1–24. https://doi.org/10.1145/3610919. Online publication date: 27-Sep-2023.
