DOI: 10.1145/3503161.3548397
Research article

ConceptBeam: Concept Driven Target Speech Extraction

Published: 10 October 2022

Abstract

We propose a novel framework for target speech extraction based on semantic information, called ConceptBeam. Target speech extraction means extracting the speech of a target speaker from a mixture. Typical approaches exploit properties of audio signals, such as harmonic structure and direction of arrival. In contrast, ConceptBeam tackles the problem with semantic clues. Specifically, we extract the speech of speakers speaking about a concept, i.e., a topic of interest, using a concept specifier such as an image or speech. Solving this novel problem would open the door to innovative applications such as listening systems that focus on a particular topic discussed in a conversation. Unlike keywords, concepts are abstract notions, making it challenging to directly represent a target concept. In our scheme, a concept is encoded as a semantic embedding by mapping the concept specifier to a shared embedding space. This modality-independent space can be built by means of deep metric learning using paired data consisting of images and their spoken captions. We use it to bridge modality-dependent information, i.e., the speech segments in the mixture, and the specified, modality-independent concept. As a proof of concept, we performed experiments using a set of images associated with spoken captions: we generated speech mixtures from these spoken captions and used the images or speech signals as the concept specifiers. We then extracted the target speech using the acoustic characteristics of the identified segments. We compared ConceptBeam with two baseline methods: one based on keywords obtained from recognition systems and the other based on sound source separation. We show that ConceptBeam clearly outperforms the baseline methods and effectively extracts speech based on the semantic representation.
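As context for the shared embedding space mentioned in the abstract, the sketch below illustrates one plausible way such a modality-independent space could be trained with deep metric learning on paired images and spoken captions. It is a minimal, assumed PyTorch implementation: the encoder architectures, the 512-dimensional embedding size, the margin value, and the names ImageEncoder, SpeechEncoder, and cross_modal_margin_loss are illustrative placeholders, not the authors' actual model.

# Assumed sketch of a shared image/spoken-caption embedding space: two encoders
# are trained with a cross-modal margin (triplet-style) loss so that an image
# and its spoken caption map to nearby unit-norm embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 512  # assumed embedding dimensionality

class ImageEncoder(nn.Module):
    """Maps an RGB image tensor (B, 3, H, W) to a unit-norm semantic embedding."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, EMB_DIM),
        )
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

class SpeechEncoder(nn.Module):
    """Maps a log-mel spectrogram (B, T, n_mels) to a unit-norm semantic embedding."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(n_mels, EMB_DIM, batch_first=True)
    def forward(self, x):
        _, h = self.rnn(x)          # h: (1, B, EMB_DIM), final hidden state
        return F.normalize(h[-1], dim=-1)

def cross_modal_margin_loss(img_emb, spc_emb, margin=0.2):
    """Pull matched image/caption pairs together, push in-batch negatives apart."""
    sim = img_emb @ spc_emb.t()                       # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                     # similarity of matched pairs
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    loss_i2s = F.relu(margin - pos + sim)[off_diag].mean()      # image -> caption
    loss_s2i = F.relu(margin - pos.t() + sim)[off_diag].mean()  # caption -> image
    return loss_i2s + loss_s2i

# Toy usage: a batch of 4 paired images and spoken captions.
images = torch.randn(4, 3, 224, 224)
captions = torch.randn(4, 300, 80)    # 300 frames of 80-dim log-mel features
loss = cross_modal_margin_loss(ImageEncoder()(images), SpeechEncoder()(captions))

Under this assumption, extraction would then amount to embedding the concept specifier (an image or a speech signal) and the segments of the mixture into the same space, selecting the segments closest to the concept, and using their acoustic characteristics to extract the target speaker, as described in the abstract.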

Supplementary Material

MP4 File (MM22-fp3008.mp4)
Presentation video



    Published In

    MM '22: Proceedings of the 30th ACM International Conference on Multimedia
    October 2022
    7537 pages
    ISBN:9781450392037
    DOI:10.1145/3503161
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 10 October 2022


    Author Tags

    1. concept representation
    2. crossmodal semantic embeddings
    3. target speech extraction
    4. vision and spoken language

    Qualifiers

    • Research-article

    Conference

    MM '22

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Article Metrics

    • Downloads (Last 12 months): 82
    • Downloads (Last 6 weeks): 7
    Reflects downloads up to 13 Jan 2025

    Cited By

    • (2024) Attention and Sequence Modeling for Match-Mismatch Classification of Speech Stimulus and EEG Response. IEEE Open Journal of Signal Processing, vol. 5, 799-809. DOI: 10.1109/OJSP.2023.3340063. Online publication date: 2024.
    • (2024) SMMA-Net: An Audio Clue-Based Target Speaker Extraction Network with Spectrogram Matching and Mutual Attention. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1496-1500. DOI: 10.1109/ICASSP48485.2024.10447832. Online publication date: 14-Apr-2024.
    • (2024) Online Target Sound Extraction with Knowledge Distillation from Partially Non-Causal Teacher. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 561-565. DOI: 10.1109/ICASSP48485.2024.10446265. Online publication date: 14-Apr-2024.
    • (2023) W2N-AVSC: Audiovisual Extension For Whisper-To-Normal Speech Conversion. 2023 31st European Signal Processing Conference (EUSIPCO), 296-300. DOI: 10.23919/EUSIPCO58844.2023.10289823. Online publication date: 4-Sep-2023.
    • (2023) SCLAV: Supervised Cross-modal Contrastive Learning for Audio-Visual Coding. Proceedings of the 31st ACM International Conference on Multimedia, 261-270. DOI: 10.1145/3581783.3613805. Online publication date: 26-Oct-2023.
    • (2023) Neural Target Speech Extraction: An overview. IEEE Signal Processing Magazine, vol. 40, no. 3, 8-29. DOI: 10.1109/MSP.2023.3240008. Online publication date: May-2023.
    • (2023) Target Sound Extraction with Variable Cross-Modality Clues. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1-5. DOI: 10.1109/ICASSP49357.2023.10095266. Online publication date: 4-Jun-2023.
    • (2023) Auxiliary Cross-Modal Representation Learning With Triplet Loss Functions for Online Handwriting Recognition. IEEE Access, vol. 11, 94148-94172. DOI: 10.1109/ACCESS.2023.3310819. Online publication date: 2023.
    • (2023) Leveraging Sound Local and Global Features for Language-Queried Target Sound Extraction. Neural Information Processing, 367-379. DOI: 10.1007/978-981-99-8070-3_28. Online publication date: 15-Nov-2023.
