-
SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition
Authors:
Patrick K. O'Neill,
Vitaly Lavrukhin,
Somshubra Majumdar,
Vahid Noroozi,
Yuekai Zhang,
Oleksii Kuchaiev,
Jagadeesh Balam,
Yuliya Dovzhenko,
Keenan Freyberg,
Michael D. Shulman,
Boris Ginsburg,
Shinji Watanabe,
Georg Kucsko
Abstract:
In the English speech-to-text (STT) machine learning task, acoustic models are conventionally trained on uncased Latin characters, and any necessary orthography (such as capitalization, punctuation, and denormalization of non-standard words) is imputed by separate post-processing models. This adds complexity and limits performance, as many formatting tasks benefit from semantic information present in the acoustic signal but absent in transcription. Here we propose a new STT task: end-to-end neural transcription with fully formatted text for target labels. We present baseline Conformer-based models trained on a corpus of 5,000 hours of professionally transcribed earnings calls, achieving a CER of 1.7. As a contribution to the STT research community, we release the corpus free for non-commercial use at https://datasets.kensho.com/datasets/scribe.
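As a rough illustration of the metric quoted above, CER (character error rate) is commonly computed as the character-level edit distance between hypothesis and reference, divided by the reference length. The sketch below is not the authors' evaluation code. It assumes the reported figure is a percentage and that casing and punctuation count as ordinary characters, which is what a fully formatted target implies. The function names `edit_distance` and `cer` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions, substitutions all cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference symbols to match an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate in percent; formatting characters are not stripped."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


# A hypothesis that loses the capital letter and the final period
# is charged two character errors against a 16-character reference.
print(cer("Revenue rose 5%.", "revenue rose 5%"))
```

Under a conventional uncased, unpunctuated evaluation, the same hypothesis would score as error-free, which is why formatted targets make the task stricter.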
Submitted 6 April, 2021; v1 submitted 5 April, 2021;
originally announced April 2021.
-
Cross-Language Transfer Learning, Continuous Learning, and Domain Adaptation for End-to-End Automatic Speech Recognition
Authors:
Jocelyn Huang,
Oleksii Kuchaiev,
Patrick O'Neill,
Vitaly Lavrukhin,
Jason Li,
Adriana Flores,
Georg Kucsko,
Boris Ginsburg
Abstract:
In this paper, we demonstrate the efficacy of transfer learning and continuous learning for various automatic speech recognition (ASR) tasks. We start with a pre-trained English ASR model and show that transfer learning can be effectively and easily performed on: (1) different English accents, (2) different languages (German, Spanish and Russian) and (3) application-specific domains. Our experiments demonstrate that in all three cases, transfer learning from a good base model yields higher accuracy than training a model from scratch. Fine-tuning a large pre-trained model is preferable to fine-tuning a small one, even if the fine-tuning dataset is small. Moreover, transfer learning significantly speeds up convergence for both very small and very large target datasets.
Submitted 8 May, 2020;
originally announced May 2020.
-
QuartzNet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions
Authors:
Samuel Kriman,
Stanislav Beliaev,
Boris Ginsburg,
Jocelyn Huang,
Oleksii Kuchaiev,
Vitaly Lavrukhin,
Ryan Leary,
Jason Li,
Yang Zhang
Abstract:
We propose a new end-to-end neural acoustic model for automatic speech recognition. The model is composed of multiple blocks with residual connections between them. Each block consists of one or more modules with 1D time-channel separable convolutional layers, batch normalization, and ReLU layers. It is trained with CTC loss. The proposed network achieves near state-of-the-art accuracy on LibriSpeech and Wall Street Journal, while having fewer parameters than all competing models. We also demonstrate that this model can be effectively fine-tuned on new datasets.
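To make the "fewer parameters" claim concrete: a 1D time-channel separable convolution is generally understood as a depthwise convolution along time (one kernel per channel) followed by a pointwise 1x1 convolution that mixes channels. The sketch below only counts weights under that assumption and ignores biases and batch normalization. The channel count and kernel width are illustrative values chosen here, not figures taken from the abstract, and the function names are hypothetical.

```python
def regular_conv1d_weights(c_in, c_out, kernel):
    """Weights in a standard 1D convolution (bias ignored)."""
    return c_in * c_out * kernel


def separable_conv1d_weights(c_in, c_out, kernel):
    """Weights in a depthwise (time) plus pointwise (channel) pair (bias ignored)."""
    depthwise = c_in * kernel  # one length-`kernel` filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution mixing channels
    return depthwise + pointwise


# Illustrative layer shape (assumed): 256 channels in and out, kernel width 33.
c_in, c_out, kernel = 256, 256, 33
regular = regular_conv1d_weights(c_in, c_out, kernel)
separable = separable_conv1d_weights(c_in, c_out, kernel)
print(regular, separable, round(regular / separable, 1))
```

For wide kernels the depthwise term is small next to the pointwise term, so the saving approaches a factor equal to the kernel width.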
Submitted 22 October, 2019;
originally announced October 2019.
-
NeMo: a toolkit for building AI applications using Neural Modules
Authors:
Oleksii Kuchaiev,
Jason Li,
Huyen Nguyen,
Oleksii Hrinchuk,
Ryan Leary,
Boris Ginsburg,
Samuel Kriman,
Stanislav Beliaev,
Vitaly Lavrukhin,
Jack Cook,
Patrice Castonguay,
Mariya Popova,
Jocelyn Huang,
Jonathan M. Cohen
Abstract:
NeMo (Neural Modules) is a Python framework-agnostic toolkit for creating AI applications through re-usability, abstraction, and composition. NeMo is built around neural modules, conceptual blocks of neural networks that take typed inputs and produce typed outputs. Such modules typically represent data layers, encoders, decoders, language models, loss functions, or methods of combining activations. NeMo makes it easy to combine and re-use these building blocks while providing a level of semantic correctness checking via its neural type system. The toolkit comes with extendable collections of pre-built modules for automatic speech recognition and natural language processing. Furthermore, NeMo provides built-in support for distributed training and mixed precision on the latest NVIDIA GPUs. NeMo is open source: https://github.com/NVIDIA/NeMo
Submitted 13 September, 2019;
originally announced September 2019.
-
Jasper: An End-to-End Convolutional Neural Acoustic Model
Authors:
Jason Li,
Vitaly Lavrukhin,
Boris Ginsburg,
Ryan Leary,
Oleksii Kuchaiev,
Jonathan M. Cohen,
Huyen Nguyen,
Ravi Teja Gadde
Abstract:
In this paper, we report state-of-the-art results on LibriSpeech among end-to-end speech recognition models without any external training data. Our model, Jasper, uses only 1D convolutions, batch normalization, ReLU, dropout, and residual connections. To improve training, we further introduce a new layer-wise optimizer called NovoGrad. Through experiments, we demonstrate that the proposed deep architecture performs as well as or better than more complex choices. Our deepest Jasper variant uses 54 convolutional layers. With this architecture, we achieve 2.95% WER using a beam-search decoder with an external neural language model and 3.86% WER with a greedy decoder on LibriSpeech test-clean. We also report competitive results on the Wall Street Journal and the Hub5'00 conversational evaluation datasets.
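The "greedy decoder" mentioned above can be illustrated with a minimal sketch, assuming the acoustic model emits per-frame scores over an alphabet that includes a CTC blank symbol (the abstract itself does not name the loss). Greedy CTC decoding takes the best symbol in each frame, collapses consecutive repeats, and drops blanks. The function name, the toy alphabet, and the scores are all hypothetical.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank_id=0):
    """Best-path decoding: argmax per frame, merge repeats, remove blanks."""
    best_ids = [max(range(len(frame)), key=frame.__getitem__)
                for frame in frame_scores]
    decoded = []
    prev = None
    for idx in best_ids:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank.
        if idx != prev and idx != blank_id:
            decoded.append(alphabet[idx])
        prev = idx
    return "".join(decoded)


# Toy example (assumed values): "_" is the blank; the frames read a, a, _, a, b, b.
alphabet = ["_", "a", "b"]
frame_scores = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
    [0.0, 0.1, 0.9],
]
print(ctc_greedy_decode(frame_scores, alphabet))  # aab
```

The blank between the two runs of "a" is what lets a doubled letter survive the repeat-merging step. A beam-search decoder with an external language model replaces the per-frame argmax with a search over multiple candidate paths.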
Submitted 26 August, 2019; v1 submitted 5 April, 2019;
originally announced April 2019.