
Self-supervised speech denoising using only noisy audio signals

Published: 01 April 2023

Highlights

A self-supervised speech denoising training strategy that uses only noisy audio signals.
Training input and target are sub-sampled from the same noisy audio sample, as sketched below.
Both precise phase estimation and context information are considered during training.
Performs very favorably against traditional training strategies.
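The sub-sampling idea in the second highlight can be made concrete with a short sketch. This is a minimal illustration under stated assumptions, not the authors' released implementation: it assumes a Neighbor2Neighbor-style sampler adapted to 1-D audio, where each non-overlapping window of k samples contributes one randomly chosen sample to the training input and a different one to the training target. The function and parameter names (`subsample_pair`, `k`) are hypothetical.

```python
import torch

def subsample_pair(noisy: torch.Tensor, k: int = 2):
    """Draw an (input, target) training pair from one noisy 1-D waveform.

    Hypothetical sketch of a random audio sub-sampler: the waveform is
    split into non-overlapping windows of k samples; each window
    contributes one randomly chosen sample to the input and a different
    randomly chosen sample to the target, so both signals come from the
    same recording but carry independent noise realizations.
    """
    n = (noisy.numel() // k) * k                   # trim to a multiple of k
    windows = noisy[:n].view(-1, k)                # shape: (num_windows, k)
    m = windows.shape[0]
    idx1 = torch.randint(0, k, (m,))               # input position per window
    idx2 = (idx1 + torch.randint(1, k, (m,))) % k  # distinct target position
    rows = torch.arange(m)
    return windows[rows, idx1], windows[rows, idx2]  # each of length n // k
```

Because the two sub-sampled signals share the underlying speech but differ in their noise samples, a network trained to map one onto the other is pushed toward predicting the clean signal, in the spirit of Noise2Noise and Neighbor2Neighbor.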

Abstract

In traditional speech denoising tasks, clean audio signals are used as the training target, but truly clean signals can only be collected with expensive recording equipment or in strictly controlled studio environments. To overcome this drawback, we propose an end-to-end self-supervised speech denoising training scheme, named Only-Noisy Training (ONT), that uses only noisy audio signals and requires no extra training conditions. The proposed ONT strategy constructs training pairs from each single noisy audio sample and contains two modules: a training-pair generation module and a speech denoising module. The first module applies a random audio sub-sampler to each noisy audio signal to generate training pairs. The sub-sampled pairs are then fed into a novel complex-valued speech denoising module. Experimental results show that the proposed method not only eliminates the heavy dependence of traditional audio denoising on clean targets, but also achieves on-par or better performance than other training strategies. Availability: ONT is available at https://github.com/liqingchunnnn/Only-Noisy-Training
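The abstract's two modules can be connected in a single training step. The sketch below is an illustration under assumptions rather than the released ONT code: `complex_mask_denoise` stands in for the paper's complex-valued denoising module, assumed here to predict a complex ratio mask over the STFT (which estimates phase as well as magnitude, matching the abstract's emphasis on precise phase estimation), and the loss is a plain time-domain MSE between the denoised input sub-sample and the target sub-sample; the actual system may use a different architecture and loss.

```python
import torch
import torch.nn.functional as F

def complex_mask_denoise(net, wav, n_fft=512, hop=128):
    """Hypothetical wrapper: enhance `wav` with a complex-valued network.

    Assuming `net` maps a complex spectrogram to a complex ratio mask of
    the same shape, enhancement is mask * spectrogram followed by the
    inverse STFT, so phase is estimated jointly with magnitude.
    """
    window = torch.hann_window(n_fft)
    spec = torch.stft(wav, n_fft, hop, window=window, return_complex=True)
    mask = net(spec)                                  # complex mask (assumed)
    return torch.istft(mask * spec, n_fft, hop, window=window,
                       length=wav.shape[-1])

def ont_training_step(net, noisy, optimizer, k=2):
    """One Only-Noisy-Training step on a single noisy utterance (sketch)."""
    g1, g2 = subsample_pair(noisy, k)                 # pair from the sampler above
    optimizer.zero_grad()
    loss = F.mse_loss(complex_mask_denoise(net, g1.unsqueeze(0)),
                      g2.unsqueeze(0))                # target is also noisy
    loss.backward()
    optimizer.step()
    return loss.item()
```

A call like `ont_training_step(net, noisy_waveform, opt)` on each utterance of a noisy-only corpus is the entire supervision signal in this scheme; no clean recording enters the loop.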



Published In

Speech Communication, Volume 149, Issue C, April 2023, 108 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Publication History

Published: 01 April 2023

Author Tags

  1. Speech denoising
  2. Self-supervised
  3. Training target
  4. Audio sub-sampler

Qualifiers

  • Research-article


Cited By

  • (2023) SIDA. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 7(3), 1–24. https://doi.org/10.1145/3610919. Online publication date: 27-Sep-2023.
