DOI: 10.1145/985692.985759

Semantic speech editing

Published: 25 April 2004

Abstract

Editing speech data is currently time-consuming and error-prone. Speech editors rely on acoustic waveform representations, which force users to repeatedly sample the underlying speech to identify words and phrases to edit. Instead, we developed a semantic editor that reduces the need for extensive sampling by providing access to meaning. The editor shows a time-aligned, errorful transcript produced by applying automatic speech recognition (ASR) to the original speech. Users visually scan the words in the transcript to identify important phrases. They then edit the transcript directly using standard word-processing 'cut and paste' operations, which extract the corresponding time-aligned speech. ASR errors mean that users must supplement what they read in the transcript by accessing the original speech. Even when there are transcript errors, however, the semantic representation still provides users with enough information to target what they edit and play, reducing the need for extensive sampling. A laboratory evaluation showed that semantic editing is more efficient than acoustic editing even when ASR is highly inaccurate.
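The idea of cutting text to extract the matching speech can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `AlignedWord` type, the `cut_words` function, and the sample words and timings are all hypothetical, and the sketch assumes only that each recognized word carries a start and end time in the original recording.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AlignedWord:
    """Hypothetical record for one word of a time-aligned ASR transcript."""
    text: str     # word as recognized by ASR (may be wrong)
    start: float  # start time in the original audio, in seconds
    end: float    # end time in the original audio, in seconds


def cut_words(
    transcript: List[AlignedWord], first: int, last: int
) -> Tuple[List[AlignedWord], str, Tuple[float, float]]:
    """Cut words first..last (inclusive) out of the transcript.

    Returns the remaining transcript, the text that was cut, and the
    (start, end) span of audio that corresponds to the cut words.
    """
    if not (0 <= first <= last < len(transcript)):
        raise IndexError("word range is outside the transcript")
    selected = transcript[first:last + 1]
    remaining = transcript[:first] + transcript[last + 1:]
    clip_text = " ".join(word.text for word in selected)
    clip_span = (selected[0].start, selected[-1].end)
    return remaining, clip_text, clip_span


# Made-up transcript and timings, for illustration only.
sample = [
    AlignedWord("please", 0.00, 0.40),
    AlignedWord("call", 0.40, 0.70),
    AlignedWord("me", 0.70, 0.85),
    AlignedWord("back", 0.85, 1.20),
    AlignedWord("tomorrow", 1.20, 1.90),
]

remaining, clip_text, clip_span = cut_words(sample, 1, 3)
print(clip_text, clip_span)
```

Because the selection is made on words rather than on a waveform, the audio span falls out of the alignment: the user never has to locate the boundaries by listening. Misrecognized words still carry usable timings, which is one plausible reading of why an errorful transcript can remain useful for targeting edits.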




    Published In

    CHI '04: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
    April 2004
    742 pages
    ISBN:1581137028
    DOI:10.1145/985692

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. acoustic representations
    2. speech browsing
    3. speech editing
    4. speech recognition
    5. speech retrieval
    6. transcripts

    Qualifiers

    • Article

    Conference

    CHI04

    Acceptance Rates

    Overall Acceptance Rate 6,199 of 26,314 submissions, 24%


    Cited By

    • (2022) CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing. IEEE/ACM Transactions on Audio, Speech and Language Processing 30, 2241-2254. DOI: 10.1109/TASLP.2022.3190717. Online publication date: 14-Jul-2022
    • (2022) Context-Aware Mask Prediction Network for End-to-End Text-Based Speech Editing. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6082-6086. DOI: 10.1109/ICASSP43922.2022.9746765. Online publication date: 23-May-2022
    • (2021) EditSpeech: A Text Based Speech Editing System Using Partial Inference and Bidirectional Fusion. 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 626-633. DOI: 10.1109/ASRU51503.2021.9688051. Online publication date: 13-Dec-2021
    • (2020) Deconstructing Human-assisted Video Transcription and Annotation for Legislative Proceedings. Digital Government: Research and Practice 1:3, 1-24. DOI: 10.1145/3395316. Online publication date: 18-Nov-2020
    • (2020) VoiceMessage++: Augmented Voice Recordings for Mobile Instant Messaging. 22nd International Conference on Human-Computer Interaction with Mobile Devices and Services, 1-10. DOI: 10.1145/3379503.3403560. Online publication date: 5-Oct-2020
    • (2020) Rescribe. Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, 747-759. DOI: 10.1145/3379337.3415864. Online publication date: 20-Oct-2020
    • (2019) Gabber. Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 1-12. DOI: 10.1145/3290605.3300607. Online publication date: 2-May-2019
    • (2017) Creating Object-Based Experiences in the Real World. SMPTE Motion Imaging Journal 126:6, 1-7. DOI: 10.5594/JMI.2017.2709859. Online publication date: Aug-2017
    • (2017) VoCo. ACM Transactions on Graphics 36:4, 1-13. DOI: 10.1145/3072959.3073702. Online publication date: 20-Jul-2017
    • (2017) Interactive reconstruction of Monte Carlo image sequences using a recurrent denoising autoencoder. ACM Transactions on Graphics 36:4, 1-12. DOI: 10.1145/3072959.3073601. Online publication date: 20-Jul-2017
