research-article

A Case Study in Comparative Speech-to-Text Libraries for Use in Transcript Generation for Online Education Recordings

Authors:

Pablo Ángel Álvarez Fernández,

Jeremy R. Hajek

SIGITE '20: Proceedings of the 21st Annual Conference on Information Technology Education

Pages 223 - 228

https://doi.org/10.1145/3368308.3415380

Published: 07 October 2020

Abstract

With the proliferation of Cloud-based Speech-to-Text services, it can be difficult to decide where to start and how to make use of these technologies. The options include the major Cloud providers as well as several Open Source Speech-to-Text projects. We set out to investigate a sample of the available libraries and their attributes as they relate to the recording artifacts that are a by-product of Online Education.

Because so many resources are available, the computing and technical barriers to applying speech recognition algorithms have decreased to the point of being a non-factor in the decision to use Speech-to-Text services. New barriers such as price, compute time, and access to the services' source code (software freedom) can now be factored into the decision of which platform to use.

This case study is a first step toward developing a test suite and guide for comparing Speech-to-Text libraries and their out-of-the-box accuracy. Our initial test suite employed two models: 1) a Cloud model employing AWS S3 and AWS Transcribe, and 2) an on-premises Open Source model that relies on Mozilla's DeepSpeech [1]. We present our findings and recommendations based on the criteria discovered.
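The abstract does not state how out-of-the-box accuracy was scored. As an illustration only, the sketch below assumes a word error rate (WER) computed from a word-level Levenshtein distance, a metric family that the reference list points toward. The function names (`normalize`, `edit_distance`, `word_error_rate`) and the sample transcripts are hypothetical and do not come from the paper or its repository.

```python
from typing import List, Sequence


def normalize(text: str) -> List[str]:
    """Lowercase, replace punctuation (except apostrophes) with spaces, split into words."""
    cleaned = "".join(
        ch.lower() if ch.isalnum() or ch.isspace() or ch == "'" else " "
        for ch in text
    )
    return cleaned.split()


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (each edit costs 1)."""
    # prev[j] is the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Invented sample data: a hand-corrected reference and two engine outputs.
    reference = "Welcome to the lecture on cloud computing."
    outputs = {
        "engine_a": "welcome to the lecture on cloud computing",
        "engine_b": "welcome to the lecture on loud computing",
    }
    for name, text in outputs.items():
        print(f"{name}: WER = {word_error_rate(reference, text):.3f}")
```

Normalizing case and punctuation before scoring keeps formatting differences between engines from counting as recognition errors. Whether that is appropriate depends on how the transcripts will be used.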

To deliver this test suite, we also researched the latest web development technologies, with an emphasis on security. This was done to produce a reliable and secure development process and to provide open access to this proof of concept for further testing and development.

References

[1] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, and Greg Diamos. 2014. DeepSpeech: Scaling up end-to-end Speech Recognition. arXiv:1412.5567 [cs.CL].

[2] Pablo Angel Alvarez Fernandez. 2020 (June 15). pabloaaf/Factor-TranscriptionCaseStudy: v1.0.0 (Version 1.0.0). Zenodo. http://doi.org/10.5281/zenodo.3893988

[3] Mozilla. 2020. Why Common Voice? https://voice.mozilla.org/en/about

[4] Shimaa Ahmed, Amrita Roy Chowdhury, Kassem Fawaz, and Parmesh Ramanathan. 2020. A System for Privacy-Preserving Speech Transcription. arXiv:1909.04198v3 [cs.CR], 18 Feb 2020. https://arxiv.org/pdf/1909.04198.pdf

[5] Sonal Shetty, Vidya Nagar, Harish Hebballi, Moula Husain, Meena S M, and Shiddu Nagaralli. 2015. Content Based Audiobooks Indexing using Apache Hadoop Framework. In WCI '15, August 10 - 13, 2015, Kochi, India.

[6] Larwan Berke. 2017. Displaying Confidence From Imperfect Automatic Speech Recognition for Captioning. SIGACCESS Newsletter, Issue 117, January 2017.

[7] AWS Transcribe. 2020. AWS Transcribe About. https://aws.amazon.com/transcribe/pricing/

[8] Thierry Lavoie and Ettore Merlo. 2012. An accurate estimation of the Levenshtein distance using metric trees and Manhattan distance. IWSC, Zurich, Switzerland.

[9] Matthew F. Dabkowski, Samuel H. Huddleston, and Ian Kloo. 2019. Improving record linkage for counter-threat finance intelligence with dynamic Jaro-Winkler thresholds. WSC, Maryland, United States.

Cited By

Pinyo N, Lokaphadhana P, Saengow P, Siangsanoh S, Wonnaparhown T, Chuangsuwanich E, Punyabukkana P, Suchato A. (2023). 0.01 Cent per Second: Developing a Cloud-based Cost-effective Audio Transcription System for an Online Video Learning Platform. 2023 20th International Joint Conference on Computer Science and Software Engineering (JCSSE), 432-437. Online publication date: 28-Jun-2023. https://doi.org/10.1109/JCSSE58229.2023.10201942

Packowski S, Kulkarni R, Richard S, Faircloth G, Onuţ I, Zulkernine F. (2021). Using IBM watson services to process video to streamline business processes and improve customer experience. Proceedings of the 31st Annual International Conference on Computer Science and Software Engineering, 262-267. Online publication date: 22-Nov-2021. https://dl.acm.org/doi/10.5555/3507788.3507830

Index Terms

A Case Study in Comparative Speech-to-Text Libraries for Use in Transcript Generation for Online Education Recordings
1. Human-centered computing
  1. Accessibility
    1. Accessibility systems and tools


Published In

SIGITE '20: Proceedings of the 21st Annual Conference on Information Technology Education

October 2020

446 pages

ISBN:9781450370455

DOI:10.1145/3368308

General Chairs: Deepak Khazanchi (University of Nebraska at Omaha, USA), Harvey Siy

Program Chairs: George Grispos, Tenace Kwaku Setor (University of Nebraska at Omaha, USA)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article

Conference

SIGITE '20

Sponsor:

SIGITE

SIGITE '20: The 21st Annual Conference on Information Technology Education

October 7 - 9, 2020

Virtual Event, USA

Acceptance Rates

Overall Acceptance Rate 176 of 429 submissions, 41%

Bibliometrics

Total Citations: 2
Total Downloads: 190
Downloads (Last 12 months): 29
Downloads (Last 6 weeks): 1

Reflects downloads up to 09 Jan 2025
