DOI: 10.1145/3503161.3548397
Research article

ConceptBeam: Concept Driven Target Speech Extraction

Published: 10 October 2022

Abstract

We propose a novel framework for target speech extraction based on semantic information, called ConceptBeam. Target speech extraction means extracting the speech of a target speaker from a mixture. Typical approaches exploit properties of audio signals, such as harmonic structure and direction of arrival. In contrast, ConceptBeam tackles the problem with semantic clues. Specifically, we extract the speech of speakers speaking about a concept, i.e., a topic of interest, using a concept specifier such as an image or speech. Solving this novel problem would open the door to innovative applications such as listening systems that focus on a particular topic discussed in a conversation. Unlike keywords, concepts are abstract notions, making it challenging to directly represent a target concept. In our scheme, a concept is encoded as a semantic embedding by mapping the concept specifier to a shared embedding space. This modality-independent space can be built by means of deep metric learning using paired data consisting of images and their spoken captions. We use it to bridge modality-dependent information, i.e., the speech segments in the mixture, and the specified, modality-independent concept. As a proof of concept, we performed experiments using a set of images associated with spoken captions: we generated speech mixtures from these spoken captions and used the images or speech signals as the concept specifiers. We then extracted the target speech using the acoustic characteristics of the identified segments. We compared ConceptBeam with two baseline methods: one based on keywords obtained from recognition systems and the other based on sound source separation. We show that ConceptBeam clearly outperforms the baseline methods and effectively extracts speech based on the semantic representation.
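As context for the shared embedding space mentioned in the abstract, the sketch below illustrates one plausible way such a modality-independent space could be trained with deep metric learning on paired images and spoken captions. It is a minimal, assumed PyTorch implementation: the encoder architectures, the 512-dimensional embedding size, the margin value, and the names ImageEncoder, SpeechEncoder, and cross_modal_margin_loss are illustrative placeholders, not the authors' actual model.

# Assumed sketch of a shared image/spoken-caption embedding space: two encoders
# are trained with a cross-modal margin (triplet-style) loss so that an image
# and its spoken caption map to nearby unit-norm embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 512  # assumed embedding dimensionality

class ImageEncoder(nn.Module):
    """Maps an RGB image tensor (B, 3, H, W) to a unit-norm semantic embedding."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, EMB_DIM),
        )
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

class SpeechEncoder(nn.Module):
    """Maps a log-mel spectrogram (B, T, n_mels) to a unit-norm semantic embedding."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(n_mels, EMB_DIM, batch_first=True)
    def forward(self, x):
        _, h = self.rnn(x)          # h: (1, B, EMB_DIM), final hidden state
        return F.normalize(h[-1], dim=-1)

def cross_modal_margin_loss(img_emb, spc_emb, margin=0.2):
    """Pull matched image/caption pairs together, push in-batch negatives apart."""
    sim = img_emb @ spc_emb.t()                       # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                     # similarity of matched pairs
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    loss_i2s = F.relu(margin - pos + sim)[off_diag].mean()      # image -> caption
    loss_s2i = F.relu(margin - pos.t() + sim)[off_diag].mean()  # caption -> image
    return loss_i2s + loss_s2i

# Toy usage: a batch of 4 paired images and spoken captions.
images = torch.randn(4, 3, 224, 224)
captions = torch.randn(4, 300, 80)    # 300 frames of 80-dim log-mel features
loss = cross_modal_margin_loss(ImageEncoder()(images), SpeechEncoder()(captions))

Under this assumption, extraction would then amount to embedding the concept specifier (an image or a speech signal) and the segments of the mixture into the same space, selecting the segments closest to the concept, and using their acoustic characteristics to extract the target speaker, as described in the abstract.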

Supplementary Material

MP4 File (MM22-fp3008.mp4)
Presentation video



    Published In

    MM '22: Proceedings of the 30th ACM International Conference on Multimedia
    October 2022
    7537 pages
    ISBN:9781450392037
    DOI:10.1145/3503161
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 10 October 2022


    Author Tags

    1. concept representation
    2. crossmodal semantic embeddings
    3. target speech extraction
    4. vision and spoken language

    Qualifiers

    • Research-article

    Conference

    MM '22

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Article Metrics

    • Downloads (Last 12 months): 82
    • Downloads (Last 6 weeks): 7
    Reflects downloads up to 13 Jan 2025

    Cited By

    • (2024) Attention and Sequence Modeling for Match-Mismatch Classification of Speech Stimulus and EEG Response. IEEE Open Journal of Signal Processing, vol. 5, 799-809. DOI: 10.1109/OJSP.2023.3340063. Online publication date: 2024.
    • (2024) SMMA-Net: An Audio Clue-Based Target Speaker Extraction Network with Spectrogram Matching and Mutual Attention. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1496-1500. DOI: 10.1109/ICASSP48485.2024.10447832. Online publication date: 14-Apr-2024.
    • (2024) Online Target Sound Extraction with Knowledge Distillation from Partially Non-Causal Teacher. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 561-565. DOI: 10.1109/ICASSP48485.2024.10446265. Online publication date: 14-Apr-2024.
    • (2023) W2N-AVSC: Audiovisual Extension For Whisper-To-Normal Speech Conversion. 2023 31st European Signal Processing Conference (EUSIPCO), 296-300. DOI: 10.23919/EUSIPCO58844.2023.10289823. Online publication date: 4-Sep-2023.
    • (2023) SCLAV: Supervised Cross-modal Contrastive Learning for Audio-Visual Coding. Proceedings of the 31st ACM International Conference on Multimedia, 261-270. DOI: 10.1145/3581783.3613805. Online publication date: 26-Oct-2023.
    • (2023) Neural Target Speech Extraction: An overview. IEEE Signal Processing Magazine, vol. 40, no. 3, 8-29. DOI: 10.1109/MSP.2023.3240008. Online publication date: May-2023.
    • (2023) Target Sound Extraction with Variable Cross-Modality Clues. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1-5. DOI: 10.1109/ICASSP49357.2023.10095266. Online publication date: 4-Jun-2023.
    • (2023) Auxiliary Cross-Modal Representation Learning With Triplet Loss Functions for Online Handwriting Recognition. IEEE Access, vol. 11, 94148-94172. DOI: 10.1109/ACCESS.2023.3310819. Online publication date: 2023.
    • (2023) Leveraging Sound Local and Global Features for Language-Queried Target Sound Extraction. Neural Information Processing, 367-379. DOI: 10.1007/978-981-99-8070-3_28. Online publication date: 15-Nov-2023.
