DOI: 10.1145/3490099.3511164
Research article | Open access

BeParrot: Efficient Interface for Transcribing Unclear Speech via Respeaking

Published: 22 March 2022

Abstract

Transcribing speech from audio files to text is an important task, not only for exploring audio content in text form but also for using the transcribed data to train speech models such as automated speech recognition (ASR) models. A post-correction approach, in which users edit errors in the recognition results of an ASR model, has frequently been employed to reduce the time cost of transcription. However, this approach assumes clear speech and is not designed for unclear speech (such as speech with high levels of noise or reverberation), which severely degrades ASR accuracy and requires many manual corrections. To construct an alternative approach for transcribing unclear speech, we introduce the idea of respeaking, which has primarily been used to create real-time captions for television programs. In respeaking, a proficient human respeaker repeats the heard speech in a shadowing manner, and their utterances are recognized by an ASR model. While this approach can be effective for transcribing unclear speech, respeaking is a highly cognitively demanding task, and extensive training is often required to become a respeaker. We address this point with BeParrot, the first interface designed for respeaking that allows novice users to benefit from respeaking without extensive training, through two key features: parameter adjustment and pronunciation feedback. Our user study involving 60 crowd workers demonstrated that they could transcribe different types of unclear speech 32.2% faster with BeParrot than with a conventional approach, without losing transcription accuracy. In addition, comments from the workers supported the design of the adjustment and feedback features and expressed a willingness to continue using BeParrot for transcription tasks. Our work demonstrates how recent advances in machine learning can be leveraged, through a human-in-the-loop approach, to address tasks that remain challenging for computers alone.
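
To make the abstract's workflow concrete, below is a minimal Python sketch of the respeaking idea, under stated assumptions: `segments`, `record_respoken_audio`, and `asr_model.transcribe` are hypothetical placeholders, not BeParrot's actual implementation. A word error rate function is included because transcription accuracy is conventionally evaluated with edit-distance-based metrics.

```python
# Minimal sketch of respeaking-based transcription (all names below are
# hypothetical placeholders, not BeParrot's API). Instead of manually
# correcting ASR output on noisy audio, a human repeats each segment in a
# quiet environment, and ASR transcribes the clean repetition.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def respeak_transcribe(segments, asr_model, record_respoken_audio):
    """Transcribe unclear audio by respeaking each segment.

    `segments` is an iterable of unclear audio chunks;
    `record_respoken_audio` plays a chunk to the human respeaker and
    records their clean repetition; `asr_model.transcribe` stands in
    for any ASR system.
    """
    transcript = []
    for segment in segments:
        respoken = record_respoken_audio(segment)          # human shadows the audio
        transcript.append(asr_model.transcribe(respoken))  # ASR on clean speech
    return " ".join(transcript)
```

In this framing, a respoken transcript would count as successful when its word error rate against a careful manual transcription is no worse than that of the conventional post-correction workflow, which is in the spirit of the speed-versus-accuracy comparison described in the abstract.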


Cited By

  • (2023) PrISM-Tracker. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 6, 4, 1–27. https://doi.org/10.1145/3569504. Online publication date: 11 Jan 2023.
  • (2023) A Generative Framework for Designing Interactions to Overcome the Gaps between Humans and Imperfect AIs Instead of Improving the Accuracy of the AIs. Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, 1–5. https://doi.org/10.1145/3544549.3577036. Online publication date: 19 Apr 2023.

    Published In

IUI '22: Proceedings of the 27th International Conference on Intelligent User Interfaces
March 2022, 888 pages
ISBN: 9781450391443
DOI: 10.1145/3490099
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 22 March 2022

    Author Tags

    1. automated speech recognition
    2. respeak
    3. speech transcription

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    IUI '22

    Acceptance Rates

Overall Acceptance Rate: 746 of 2,811 submissions, 27%
