DOI: 10.1145/3423325.3423733
Research article

A Dynamic, Self Supervised, Large Scale AudioVisual Dataset for Stuttered Speech

Published: 15 October 2020

Abstract

Stuttering affects at least 1% of the world population. It is caused by irregular disruptions in speech production, which occur in various forms and frequencies. The most common are repetitions of words or parts of words, prolongations, and blocks in getting the words out.
Accurate detection and classification of stuttering would be important for assessing severity in speech therapy. Furthermore, real-time detection could open up many new possibilities for reconstructing disfluent utterances into fluent speech. Such an interface could help people use voice-based assistants such as Apple Siri and Google Assistant, or make (video) phone calls more fluent through delayed delivery.
In this paper we present the first expandable audio-visual database of stuttered speech. We explore an end-to-end, real-time, multi-modal model for the detection and classification of stuttered blocks in unbounded speech. We also make use of video signals, since during a stuttered block the acoustic signal may not be produced immediately. Combining the two modalities, the acoustic signal and the secondary characteristics exhibited in the visual signal, should permit higher detection accuracy.
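    The abstract does not describe the model architecture, so the sketch below is purely illustrative: a minimal late-fusion audio-visual classifier of the general kind such a multi-modal detector might use. The feature choices (log-Mel audio frames, per-frame visual embeddings), dimensions, and disfluency label set are assumptions, not taken from the paper.

```python
# Illustrative sketch only: a simple late-fusion audio-visual disfluency
# classifier. All dimensions and the class set are assumptions, not the
# architecture described in the paper.
import torch
import torch.nn as nn

class AudioVisualDisfluencyClassifier(nn.Module):
    def __init__(self, n_mels=40, visual_dim=128, hidden=128, n_classes=5):
        # n_classes=5 is a hypothetical label set, e.g. fluent, repetition,
        # prolongation, block, interjection.
        super().__init__()
        # Audio branch: bidirectional GRU over log-Mel spectrogram frames.
        self.audio_rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        # Visual branch: bidirectional GRU over per-frame lip/face embeddings.
        self.visual_rnn = nn.GRU(visual_dim, hidden, batch_first=True, bidirectional=True)
        # Late fusion: concatenate the summary vectors of both branches.
        self.classifier = nn.Sequential(
            nn.Linear(4 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, audio, visual):
        # audio:  (batch, T_audio, n_mels)    log-Mel frames
        # visual: (batch, T_video, visual_dim) per-frame embeddings
        _, a_state = self.audio_rnn(audio)    # final states: (2, batch, hidden)
        _, v_state = self.visual_rnn(visual)  # final states: (2, batch, hidden)
        a_vec = torch.cat([a_state[0], a_state[1]], dim=-1)  # (batch, 2*hidden)
        v_vec = torch.cat([v_state[0], v_state[1]], dim=-1)  # (batch, 2*hidden)
        fused = torch.cat([a_vec, v_vec], dim=-1)             # (batch, 4*hidden)
        return self.classifier(fused)                          # class logits


if __name__ == "__main__":
    model = AudioVisualDisfluencyClassifier()
    audio = torch.randn(2, 300, 40)    # ~3 s of 10 ms audio frames (assumed)
    visual = torch.randn(2, 75, 128)   # ~3 s of 25 fps video embeddings (assumed)
    print(model(audio, visual).shape)  # torch.Size([2, 5])
```

    A real-time variant would replace the whole-clip summary with per-frame (streaming) predictions, but the late-fusion idea, combining the acoustic stream with visual cues, is the same.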


Cited By

    • AI-based stuttering automatic classification method: Using a convolutional neural network. Phonetics and Speech Sciences, 15(4):71-80, December 2023. DOI: 10.13064/KSSS.2023.15.4.071
    • Machine learning for stuttering identification: Review, challenges and future directions. Neurocomputing, 514:385-402, December 2022. DOI: 10.1016/j.neucom.2022.10.015


      Published In

MuCAI '20: Proceedings of the 1st International Workshop on Multimodal Conversational AI
      October 2020
      44 pages
      ISBN:9781450381567
      DOI:10.1145/3423325


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. audiovisual dataset
      2. disfluent speech dataset
      3. multi modal stuttering detection
      4. speech disfluency
      5. stammering
      6. stuttered speech
      7. stuttered speech dataset

      Qualifiers

      • Research-article

      Conference

      MM '20


