DOI: 10.1145/3467707.3467764
Research article
Open access

Online Audio-Visual Speech Separation with Generative Adversarial Training

Published: 24 September 2021

Abstract

Audio-visual speech separation has proven effective at solving the cocktail party problem. However, most models do not support online processing, which limits their application in video communication and human-robot interaction. In addition, SI-SNR, the most popular training loss in speech separation, introduces artifacts into the separated audio that can harm downstream applications such as automatic speech recognition (ASR). In this paper, we propose an online audio-visual speech separation model with generative adversarial training to address these two problems. We build our generator (i.e., the audio-visual speech separator) from causal temporal convolutional network blocks and propose a streaming inference strategy, which together allow the model to separate speech in an online manner. The discriminator participates in optimizing the generator, which reduces the negative effects of SI-SNR. Experiments on simulated 2-speaker mixtures based on the challenging audio-visual dataset LRS2 show that our model outperforms the state-of-the-art audio-only model Conv-TasNet and the audio-visual model advr-AVSS at the same model size. We measured the running time of our model on GPU and CPU, and the results show that it meets the requirements of online processing. The demo and code can be found at https://github.com/aispeech-lab/oavss.
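The SI-SNR objective the abstract refers to can be sketched as follows. This is a minimal NumPy version of the standard scale-invariant SNR definition, not taken from the paper's released code; the projection onto the target is what makes the measure invariant to rescaling of the estimate.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB (higher is better).

    Both signals are zero-meaned, then the estimate is projected onto
    the target; energy outside that projection counts as noise.
    """
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Scaled target component of the estimate.
    s_target = np.dot(estimate, target) / (np.dot(target, target) + eps) * target
    e_noise = estimate - s_target
    return 10 * np.log10((np.dot(s_target, s_target) + eps)
                         / (np.dot(e_noise, e_noise) + eps))
```

Because the loss only constrains the projection, a model trained purely on SI-SNR can score well while leaving scale and residual distortions that downstream ASR is sensitive to, which is the artifact problem the adversarial training targets.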


Cited By

  • (2023) A review on speech separation in cocktail party environment: challenges and approaches. Multimedia Tools and Applications 82(20), 31035–31067. DOI: 10.1007/s11042-023-14649-x. Online publication date: 23 Feb 2023.


Published In

ICCAI '21: Proceedings of the 2021 7th International Conference on Computing and Artificial Intelligence
April 2021
498 pages
ISBN:9781450389501
DOI:10.1145/3467707

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Audio-visual speech separation
  2. causal temporal convolutional network
  3. generative adversarial training
  4. online processing

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • the Strategic Priority Research Program of the Chinese Academy of Sciences
  • the National Key Research and Development Program of China

Conference

ICCAI '21

