Research Article
DOI: 10.1145/3172944.3172984

Touch-Supported Voice Recording to Facilitate Forced Alignment of Text and Speech in an E-Reading Interface

Published: 05 March 2018

Abstract

Reading a book together with a family member who has impaired vision or other difficulties reading is an important social bonding activity. However, for the person being read to, there is little support for making these experiences repeatable. While audio can easily be recorded, synchronizing it with the text for later playback requires the use of forced alignment algorithms, which do not perform well on amateur read-aloud speech. We propose a human-in-the-loop approach to augmenting such algorithms, in the form of touch metaphors during collocated read-aloud sessions using tablet e-readers. The metaphor is implemented as a finger-follows-text tracker. We explore how this could better handle the variability of amateur reading, which poses accuracy challenges for existing forced alignment techniques. Data collected from users reading aloud while assisted by touch metaphors show increases in the accuracy of forced alignment algorithms and reveal opportunities for how to better support reading aloud.
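To make the approach concrete, below is a minimal sketch, not the authors' implementation, of how touch input from a finger-follows-text tracker could assist forced alignment: each word the finger passes over yields a (word index, time) anchor, and a conventional forced aligner then only has to resolve word timings inside the short windows between consecutive anchors rather than across the whole recording. All names here (TouchAnchor, segment_by_anchors) are hypothetical illustrations.

from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class TouchAnchor:
    word_index: int   # index of the word the reader's finger was over
    time_sec: float   # recording time at which the touch reached that word

def segment_by_anchors(words: List[str],
                       anchors: List[TouchAnchor],
                       total_duration: float) -> List[Tuple[slice, float, float]]:
    """Split the text and the audio timeline into windows bounded by touch anchors.

    Each window pairs a slice of the word list with the audio interval the
    finger spent over those words, so a forced aligner only has to resolve
    word timings inside a short, roughly correct region.
    """
    anchors = sorted(anchors, key=lambda a: a.time_sec)
    segments = []
    for i, cur in enumerate(anchors):
        nxt: Optional[TouchAnchor] = anchors[i + 1] if i + 1 < len(anchors) else None
        end_word = nxt.word_index if nxt else len(words)
        end_time = nxt.time_sec if nxt else total_duration
        segments.append((slice(cur.word_index, end_word), cur.time_sec, end_time))
    return segments

# Example: the finger reached "Once" at 0.0 s and "there" at 1.8 s in a 3.5 s clip.
words = "Once upon a time there was a reader".split()
anchors = [TouchAnchor(word_index=0, time_sec=0.0), TouchAnchor(word_index=4, time_sec=1.8)]
for word_slice, start, end in segment_by_anchors(words, anchors, total_duration=3.5):
    print(words[word_slice], f"{start:.1f}-{end:.1f} s")
    # Each (word span, audio interval) pair would then be handed to an
    # off-the-shelf forced aligner, which only needs to align a few words
    # within a narrow, touch-derived time window.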





      Published In

      IUI '18: Proceedings of the 23rd International Conference on Intelligent User Interfaces
      March 2018
      698 pages
      ISBN:9781450349451
      DOI:10.1145/3172944
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 05 March 2018


      Author Tags

      1. assistive technology
      2. forced alignment
      3. multi-modal interfaces
      4. natural language and speech processing

      Qualifiers

      • Research-article

      Conference

      IUI'18

      Acceptance Rates

IUI '18 Paper Acceptance Rate: 43 of 299 submissions, 14%
Overall Acceptance Rate: 746 of 2,811 submissions, 27%



      Article Metrics

      • Downloads (Last 12 months)3
      • Downloads (Last 6 weeks)0
      Reflects downloads up to 01 Feb 2025


Cited By
• (2023) "It's All About the Pictures:" Understanding How Parents/Guardians With Visual Impairments Co-Read With Their Child(ren). Proceedings of the 25th International ACM SIGACCESS Conference on Computers and Accessibility, pp. 1-4. DOI: 10.1145/3597638.3614488. Online publication date: 22-Oct-2023.
• (2022) A Survey Study on Automatic Subtitle Synchronization and Positioning System for Deaf and Hearing Impaired People. International Journal of Advanced Research in Science, Communication and Technology, pp. 423-428. DOI: 10.48175/IJARSCT-7393. Online publication date: 17-Nov-2022.
• (2021) Automatic Subtitle Synchronization and Positioning System Dedicated to Deaf and Hearing Impaired People. IEEE Access, vol. 9, pp. 139544-139555. DOI: 10.1109/ACCESS.2021.3119201. Online publication date: 2021.
• (2021) Emerging Applications. In Touch-Based Human-Machine Interaction, pp. 179-229. DOI: 10.1007/978-3-030-68948-3_7. Online publication date: 26-Mar-2021.
