Research Article
DOI: 10.1145/3172944.3172984

Touch-Supported Voice Recording to Facilitate Forced Alignment of Text and Speech in an E-Reading Interface

Published: 05 March 2018

Abstract

Reading a book together with a family member who has impaired vision or other difficulties reading is an important social bonding activity. However, for the person being read to, there is little support for making these experiences repeatable. While audio can easily be recorded, synchronizing it with the text for later playback requires the use of forced alignment algorithms, which do not perform well on amateur read-aloud speech. We propose a human-in-the-loop approach to augmenting such algorithms, in the form of touch metaphors during collocated read-aloud sessions using tablet e-readers. The metaphor is implemented as a finger-follows-text tracker. We explore how this could better handle the variability of amateur reading, which poses accuracy challenges for existing forced alignment techniques. Data collected from users reading aloud while assisted by touch metaphors show increases in the accuracy of forced alignment algorithms and reveal opportunities for how to better support reading aloud.
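To make the approach concrete, below is a minimal sketch, not the authors' implementation, of how touch input from a finger-follows-text tracker could assist forced alignment: each word the finger passes over yields a (word index, time) anchor, and a conventional forced aligner then only has to resolve word timings inside the short windows between consecutive anchors rather than across the whole recording. All names here (TouchAnchor, segment_by_anchors) are hypothetical illustrations.

from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class TouchAnchor:
    word_index: int   # index of the word the reader's finger was over
    time_sec: float   # recording time at which the touch reached that word

def segment_by_anchors(words: List[str],
                       anchors: List[TouchAnchor],
                       total_duration: float) -> List[Tuple[slice, float, float]]:
    """Split the text and the audio timeline into windows bounded by touch anchors.

    Each window pairs a slice of the word list with the audio interval the
    finger spent over those words, so a forced aligner only has to resolve
    word timings inside a short, roughly correct region.
    """
    anchors = sorted(anchors, key=lambda a: a.time_sec)
    segments = []
    for i, cur in enumerate(anchors):
        nxt: Optional[TouchAnchor] = anchors[i + 1] if i + 1 < len(anchors) else None
        end_word = nxt.word_index if nxt else len(words)
        end_time = nxt.time_sec if nxt else total_duration
        segments.append((slice(cur.word_index, end_word), cur.time_sec, end_time))
    return segments

# Example: the finger reached "Once" at 0.0 s and "there" at 1.8 s in a 3.5 s clip.
words = "Once upon a time there was a reader".split()
anchors = [TouchAnchor(word_index=0, time_sec=0.0), TouchAnchor(word_index=4, time_sec=1.8)]
for word_slice, start, end in segment_by_anchors(words, anchors, total_duration=3.5):
    print(words[word_slice], f"{start:.1f}-{end:.1f} s")
    # Each (word span, audio interval) pair would then be handed to an
    # off-the-shelf forced aligner, which only needs to align a few words
    # within a narrow, touch-derived time window.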





      Published In

      IUI '18: Proceedings of the 23rd International Conference on Intelligent User Interfaces
      March 2018
      698 pages
      ISBN:9781450349451
      DOI:10.1145/3172944
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 05 March 2018


      Author Tags

      1. assistive technology
      2. forced alignment
      3. multi-modal interfaces
      4. natural language and speech processing

      Qualifiers

      • Research-article

      Conference

      IUI'18

      Acceptance Rates

IUI '18 Paper Acceptance Rate: 43 of 299 submissions, 14%
Overall Acceptance Rate: 746 of 2,811 submissions, 27%



      Article Metrics

      • Downloads (Last 12 months)3
      • Downloads (Last 6 weeks)0
      Reflects downloads up to 01 Feb 2025


Cited By
• (2023) "It's All About the Pictures:" Understanding How Parents/Guardians With Visual Impairments Co-Read With Their Child(ren). Proceedings of the 25th International ACM SIGACCESS Conference on Computers and Accessibility, pp. 1-4. DOI: 10.1145/3597638.3614488. Online publication date: 22-Oct-2023.
• (2022) A Survey Study on Automatic Subtitle Synchronization and Positioning System for Deaf and Hearing Impaired People. International Journal of Advanced Research in Science, Communication and Technology, pp. 423-428. DOI: 10.48175/IJARSCT-7393. Online publication date: 17-Nov-2022.
• (2021) Automatic Subtitle Synchronization and Positioning System Dedicated to Deaf and Hearing Impaired People. IEEE Access, vol. 9, pp. 139544-139555. DOI: 10.1109/ACCESS.2021.3119201. Online publication date: 2021.
• (2021) Emerging Applications. In Touch-Based Human-Machine Interaction, pp. 179-229. DOI: 10.1007/978-3-030-68948-3_7. Online publication date: 26-Mar-2021.
