Semantic speech editing

Authors: Steve Whittaker, Brian Amento

CHI '04: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems

Pages 527 - 534

https://doi.org/10.1145/985692.985759

Published: 25 April 2004

Abstract

Editing speech data is currently time-consuming and error-prone. Speech editors rely on acoustic waveform representations, which force users to repeatedly sample the underlying speech to identify words and phrases to edit. Instead, we developed a semantic editor that reduces the need for extensive sampling by providing access to meaning. The editor shows a time-aligned, errorful transcript produced by applying automatic speech recognition (ASR) to the original speech. Users visually scan the words in the transcript to identify important phrases. They then edit the transcript directly using standard word processing 'cut and paste' operations, which extract the corresponding time-aligned speech. Because of ASR errors, users must supplement what they read in the transcript by accessing the original speech. Even when there are transcript errors, however, the semantic representation still provides users with enough information to target what they edit and play, reducing the need for extensive sampling. A laboratory evaluation showed that semantic editing is more efficient than acoustic editing even when ASR is highly inaccurate.
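
The link between transcript edits and audio edits can be pictured with a minimal sketch. It assumes each recognized word carries a start and end time in seconds and that the audio is a plain list of samples; the names `Word`, `span_for_words`, and `cut_words` are hypothetical and are not taken from the authors' system.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Word:
    """One recognized word (hypothetical type for illustration)."""
    text: str     # ASR hypothesis; may be wrong
    start: float  # start time in seconds
    end: float    # end time in seconds


def span_for_words(words: List[Word], first: int, last: int) -> Tuple[float, float]:
    """Return the audio interval covered by words[first..last], inclusive."""
    if not 0 <= first <= last < len(words):
        raise IndexError("word selection out of range")
    return words[first].start, words[last].end


def cut_words(words: List[Word], samples: List[int], sample_rate: int,
              first: int, last: int):
    """Cut words[first..last] and the matching audio.

    Returns (remaining_words, remaining_samples, clip_words, clip_samples).
    Times in both word lists are shifted so they stay aligned with their audio.
    """
    start, end = span_for_words(words, first, last)
    lo = round(start * sample_rate)  # first sample of the selection
    hi = round(end * sample_rate)    # one past the last sample
    removed = end - start

    clip_samples = samples[lo:hi]
    remaining_samples = samples[:lo] + samples[hi:]

    # The clip starts at time zero on its own timeline.
    clip_words = [Word(w.text, w.start - start, w.end - start)
                  for w in words[first:last + 1]]
    # Words after the cut move earlier by the removed duration.
    remaining_words = words[:first] + [
        Word(w.text, w.start - removed, w.end - removed)
        for w in words[last + 1:]
    ]
    return remaining_words, remaining_samples, clip_words, clip_samples
```

In this sketch a text selection is the only input the user supplies; the time alignment does the work of locating the audio. A recognition error changes the text shown for a word but not its timestamps, which may be one reason an errorful transcript can still serve for targeting edits.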

Published In

CHI '04: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems

April 2004

742 pages

ISBN:1581137028

DOI:10.1145/985692

Conference Chairs: Elizabeth Dykstra-Erickson (Kinoma), Manfred Tscheligi (CURE, Austria)

Copyright © 2004 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CHI04: CHI 2004 Conference on Human Factors in Computing Systems

April 24 - 29, 2004

Vienna, Austria

Acceptance Rates

Overall Acceptance Rate 6,199 of 26,314 submissions, 24%
