DOI: 10.1145/3467707.3467764
Research article
Open access

Online Audio-Visual Speech Separation with Generative Adversarial Training

Published: 24 September 2021

Abstract

Audio-visual speech separation has proven effective at solving the cocktail party problem. However, most models do not support online processing, which limits their application in video communication and human-robot interaction. In addition, SI-SNR, the most popular training loss in speech separation, introduces artifacts into the separated audio that can harm downstream applications such as automatic speech recognition (ASR). In this paper, we propose an online audio-visual speech separation model with generative adversarial training to address these two problems. We build our generator (i.e., the audio-visual speech separator) from causal temporal convolutional network blocks and propose a streaming inference strategy, which together allow the model to separate speech in an online manner. The discriminator participates in optimizing the generator, which reduces the negative effects of SI-SNR. Experiments on simulated 2-speaker mixtures based on the challenging audio-visual dataset LRS2 show that our model outperforms the state-of-the-art audio-only model Conv-TasNet and the audio-visual model advr-AVSS at the same model size. We measured the running time of our model on GPU and CPU, and the results show that it meets the requirements of online processing. The demo and code can be found at https://github.com/aispeech-lab/oavss.
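The SI-SNR objective the abstract refers to can be sketched as follows. This is a minimal NumPy version of the standard scale-invariant SNR definition, not taken from the paper's released code; the projection onto the target is what makes the measure invariant to rescaling of the estimate.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB (higher is better).

    Both signals are zero-meaned, then the estimate is projected onto
    the target; energy outside that projection counts as noise.
    """
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Scaled target component of the estimate.
    s_target = np.dot(estimate, target) / (np.dot(target, target) + eps) * target
    e_noise = estimate - s_target
    return 10 * np.log10((np.dot(s_target, s_target) + eps)
                         / (np.dot(e_noise, e_noise) + eps))
```

Because the loss only constrains the projection, a model trained purely on SI-SNR can score well while leaving scale and residual distortions that downstream ASR is sensitive to, which is the artifact problem the adversarial training targets.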


Cited By

  • (2023) A review on speech separation in cocktail party environment: challenges and approaches. Multimedia Tools and Applications 82(20), 31035–31067. DOI: 10.1007/s11042-023-14649-x. Online publication date: 23 Feb 2023.


Published In

ICCAI '21: Proceedings of the 2021 7th International Conference on Computing and Artificial Intelligence
April 2021
498 pages
ISBN:9781450389501
DOI:10.1145/3467707

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Audio-visual speech separation
  2. causal temporal convolutional network
  3. generative adversarial training
  4. online processing

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • the Strategic Priority Research Program of the Chinese Academy of Sciences
  • the National Key Research and Development Program of China

Conference

ICCAI '21

