T4 - Towards End-To-End Speech Recognition
ISCSLP Tutorial 4
Nov. 26, 2018
Acknowledgement
The content of this tutorial is mostly based on the following tutorial with recent updates. Many
slides are borrowed with authors' consent.
Conventional ASR
[Diagram: conventional ASR training pipeline. A monophone GMM system is trained on speech utterances and force-aligned; the alignments drive decision tree-based CD-phone clustering and CD-phone AM training. LM training on text data produces first-pass and second-pass LMs, used together with a pronunciation lexicon.]
Conventional ASR
● Most ASR systems involve acoustic, pronunciation and language model components which are trained separately
● Discriminative sequence training of AMs does couple these components
● Curating a pronunciation lexicon and defining phoneme sets for a particular language require expert knowledge and are time-consuming
[Diagram: the separate acoustic model, pronunciation model, verbalizer, language model and second-pass rescoring components are replaced by a single end-to-end trained sequence-to-sequence recognizer.]
Key Takeaway A single end-to-end trained sequence-to-sequence model, which directly outputs words or
graphemes, could greatly simplify the speech recognition pipeline.
Agenda
Research developments on end-to-end models towards productionisation.
CTC
Connectionist Temporal Classification
Connectionist Temporal Classification (CTC)
Key Takeaway CTC allows for training an acoustic model without the need for frame-level alignments
between the acoustics and the transcripts.
Connectionist Temporal Classification (CTC)
Key Takeaway Encoder: Multiple layers of Uni- or Bi-directional RNNs (often LSTMs).
Connectionist Temporal Classification (CTC)
B B c B B a a B B t
B c c B a B B B B t
...
B c B B a B B t t B
Key Takeaway CTC introduces a special symbol - blank (denoted by B) - and maximizes the total probability
of the label sequence by marginalizing over all possible alignments
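To make the marginalization concrete, here is a small, self-contained Python sketch (not from the tutorial): a collapsing function that maps alignments to label sequences, and a brute-force sum over all alignments of a toy posterior. Symbol names and probabilities are made up for illustration; the forward-backward algorithm on the next slides computes the same quantity efficiently.

```python
# Toy illustration of the CTC alignment model (not a training implementation).
# An alignment is a per-frame sequence over the labels plus blank; CTC's mapping
# collapses repeated symbols and then removes blanks, e.g.
#   B B c B B a a B B t  ->  "cat"
from itertools import product

BLANK = "B"

def collapse(alignment):
    """Apply the CTC collapsing function: merge repeats, then drop blanks."""
    merged = [s for i, s in enumerate(alignment) if i == 0 or s != alignment[i - 1]]
    return "".join(s for s in merged if s != BLANK)

def ctc_prob(label_seq, frame_posteriors):
    """Brute-force P(label_seq | x): sum over all alignments that collapse to it.
    frame_posteriors[t][symbol] is the per-frame softmax output (assumed given)."""
    symbols = list(frame_posteriors[0].keys())
    total = 0.0
    for alignment in product(symbols, repeat=len(frame_posteriors)):
        if collapse(alignment) == label_seq:
            p = 1.0
            for t, s in enumerate(alignment):
                p *= frame_posteriors[t][s]
            total += p
    return total

# Example: 4 frames over symbols {B, c, a, t}; this brute-force sum is what the
# forward-backward algorithm computes efficiently.
post = [{"B": 0.7, "c": 0.1, "a": 0.1, "t": 0.1}] * 4
print(collapse("BBcBBaaBBt"))   # -> cat
print(ctc_prob("cat", post))    # marginal probability of "cat" over all alignments
```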
Connectionist Temporal Classification (CTC)
B B c B B a a B B t
B c c B a B B B B t
...
B c B B a B B t t B
[Diagram: blank and label states for the collapsed sequence c a t.]
In a conventional hybrid system, this would correspond to defining the HMMs corresponding
Key Takeaway to each unit to consist of a shared initial state (blank), followed by a separate state(s) for the
actual unit.
Connectionist Temporal Classification (CTC)
[Diagram: the CTC lattice of blank and label states (B, c, B, a, B, t, B) unrolled over frames t = 1..T, on which the forward-backward computation runs.]
Key Takeaway Computing the gradients of the loss requires the computation of the alpha-beta variables
using the forward-backward algorithm [Rabiner, 1989]
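Below is a minimal numpy sketch of the CTC forward (alpha) recursion over the blank-interleaved label sequence, assuming per-frame log-softmax outputs are given; it is illustrative only and omits batching and the numerical/edge-case handling a real implementation needs.

```python
import numpy as np

def ctc_forward_logprob(log_probs, labels, blank=0):
    """Compute log P(labels | x) with the CTC forward (alpha) recursion.
    log_probs: [T, V] per-frame log-softmax outputs (assumed given).
    labels:    target label ids, e.g. the ids of [c, a, t]."""
    T = log_probs.shape[0]
    # Extended sequence with blanks interleaved: B c B a B t B
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S = len(ext)

    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]   # start in the leading blank
    alpha[0, 1] = log_probs[0, ext[1]]   # or directly in the first label

    for t in range(1, T):
        for s in range(S):
            cands = [alpha[t - 1, s]]                # stay in the same state
            if s > 0:
                cands.append(alpha[t - 1, s - 1])    # come from the previous state
            # Skip the blank between two *different* labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])
            alpha[t, s] = np.logaddexp.reduce(cands) + log_probs[t, ext[s]]

    # Valid end states: the final label or the trailing blank.
    return np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])
```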
CTC-Based End-to-End ASR
● Obtaining good performance from CTC models requires the use of an external language
model - direct greedy decoding does not perform very well
Attention-based
Encoder-Decoder Models
Attention-based Encoder-Decoder Models
[Diagram: an encoder produces encoder outputs, over which an attention mechanism is computed before the decoder.]
Attention-based Models
P(a|c,<sos>,x) = 0.95
P(b|c,<sos>,x) = 0.01
P(c|c,<sos>,x) = 0.01
[Diagram: the encoder output is attended over; the previous label c and the attention context are fed to the decoder, whose softmax produces the distribution above.]
Attention-based Models
P(a|a,c,<sos>,x) = 0.01
P(b|a,c,<sos>,x) = 0.08
...
P(t|a,c,<sos>,x) = 0.89
[Diagram: encoder, attention, decoder and softmax; the label from the previous step is fed into the decoder at the next step to predict the next label. Output: cat]
Attention-based Models
P(a|t,a,c,<sos>,x) = 0.01
P(b|t,a,c,<sos>,x) = 0.01
...
P(<eos>|t,a,c,<sos>,x) = 0.96
[Diagram: encoder frames, attention, decoder and softmax over labels. Output: cat]
The process terminates when the model predicts <eos>, which denotes the end of the sentence.
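The decoding loop in these slides can be sketched as follows. This is a toy, runnable Python example with random stand-in weights (W_att, W_out, emb are not part of the tutorial); it only illustrates the control flow: feed the previous label back in, attend over the encoder outputs, take the softmax, and stop at <eos>. A beam search would keep multiple hypotheses instead of the argmax.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["<sos>", "<eos>", "a", "c", "t"]          # toy vocabulary
D = 8                                              # toy state/encoder dimension

# Stand-ins for trained components (random here, just to make the loop runnable).
W_att = rng.normal(size=(D, D))
W_out = rng.normal(size=(D + D, len(VOCAB)))
emb   = rng.normal(size=(len(VOCAB), D))

def attend(state, enc):
    """Dot-product attention: weights over encoder frames, then a context vector."""
    scores = enc @ (W_att @ state)
    weights = np.exp(scores - scores.max()); weights /= weights.sum()
    return weights @ enc

def decode_step(state, prev_label, enc):
    """One decoder step: previous label + attention context -> next-label distribution."""
    state = np.tanh(state + emb[prev_label])       # toy recurrence
    context = attend(state, enc)
    logits = np.concatenate([state, context]) @ W_out
    probs = np.exp(logits - logits.max()); probs /= probs.sum()
    return state, probs

def greedy_decode(enc, max_steps=20):
    state, prev = np.zeros(D), VOCAB.index("<sos>")
    out = []
    for _ in range(max_steps):
        state, probs = decode_step(state, prev, enc)
        prev = int(np.argmax(probs))               # feed the prediction back in
        if VOCAB[prev] == "<eos>":                 # stop at end-of-sentence
            break
        out.append(VOCAB[prev])
    return "".join(out)

print(greedy_decode(rng.normal(size=(12, D))))     # encoder outputs for 12 frames
```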
Online Models
RNN-T, NT, MoChA
To be discussed in the 2nd part of the tutorial
Model Comparisons
on a 12,500 hour Google Task
Comparing Various End-to-End Approaches
● Compare various sequence-to-sequence models head-to-head, trained on the same data, to understand how these approaches compare to each other
● Evaluated on a large-scale 12,500 hour Google Voice Search Task
● Baseline
○ State-of-the-art CD-Phoneme model: 5x700 BLSTM; ~8000 CD-Phonemes
○ CTC-training followed by sMBR discriminative sequence training
○ Decoded with large 5-gram LM in first pass
○ Second pass rescoring with much larger 5-gram LM
○ Lexicon of millions of words with expert-curated pronunciations
● Sequence-to-Sequence Models
○ Trained to output graphemes: [a-z], [0-9], <space>, and punctuation
○ Models are evaluated using beam search (keeping the top 15 hypotheses at each step)
○ Models are not decoded or rescored with an external language model or a pronunciation model
Experimental Setup
Data
● Training Set
○ ~15M (~12,500 hrs) anonymized utterances from Google Voice Search traffic
○ Multi-style training: artificially distorted using a room simulator, adding noise samples extracted from YouTube videos and environmental recordings of daily events
● Evaluation Sets
○ Dictation: ~13K utterances (~124K words) open-ended dictation
○ VoiceSearch: ~12.9K utterances (~63K words) of voice-search queries
Results
[Table: WER (%) of each model on the clean Dictation and VoiceSearch test sets.]
Key Takeaway Attention-based model performs the best, but cannot be used for streaming applications.
End-to-End Comparisons [Battenberg et al., 2017]
[Chart: Switchboard results comparing CTC, RNN-T and attention-based models.]
Key Takeaway Similar conclusions were reported by [Battenberg et al., 2017] on Switchboard. RNN-T
without an LM is consistently better than CTC with an LM.
Combining Approaches
● Structural improvements
○ Wordpiece models
○ Multi-headed attention
● Optimization improvements
○ Minimum word error rate (MWER) training
○ Scheduled sampling
○ Asynchronous and synchronous training
○ Label smoothing
● External language model integration
Structural improvements
Wordpiece Model
● Instead of the commonly used graphemes, we can use longer units such as wordpieces
● Motivations:
○ Typically, word-level LMs have a much lower perplexity compared to grapheme-level LMs [Kannan et al., 2018]
○ Modeling wordpieces allows for a much stronger decoder LM
○ Modeling longer units improves the effective memory of the decoder LSTMs
○ Allows the model to potentially memorize pronunciations for frequently occurring words
○ Longer units require fewer decoding steps; this speeds up inference in these models significantly
● Wordpieces give good performance for both LAS and RNN-T [Rao et al., 2017].
Wordpiece Model
● Wordpieces are sub-word units, ranging from graphemes all the way up to entire words.
● There are no out-of-vocabulary words with wordpiece models.
● The wordpiece models are trained to maximize the language model likelihood over the training set.
● The wordpieces are "position-dependent", in that a special word separator marker is used to denote word boundaries.
● Words are segmented deterministically and independently of context, using a greedy algorithm (sketched below).
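A minimal sketch of such a greedy, longest-match-first segmentation with position-dependent pieces (the "_" word-boundary marker and the tiny inventory below are made up; real inventories are learned from data and include all single characters so segmentation never fails):

```python
# Greedy longest-match segmentation into position-dependent wordpieces, where
# "_" marks a word boundary (the inventory below is made up for illustration).
VOCAB = {"_the", "_grey", "_chick", "en", "_jump", "s", "_o", "ver", "_a"}

def segment(word, vocab):
    """Split one word (with its leading boundary marker) greedily left-to-right."""
    token, pieces = "_" + word, []
    while token:
        for end in range(len(token), 0, -1):       # try the longest matching piece first
            if token[:end] in vocab:
                pieces.append(token[:end])
                token = token[end:]
                break
        else:
            raise ValueError("no piece matches; real inventories include all single characters")
    return pieces

print([segment(w, VOCAB) for w in "the grey chicken jumps over".split()])
# e.g. [['_the'], ['_grey'], ['_chick', 'en'], ['_jump', 's'], ['_o', 'ver']]
```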
Multi-headed Attention
● Multi-head attention (MHA) was first explored in [Vaswani et al., 2017] for machine
translation
● MHA extends the conventional attention mechanism to have multiple heads, where each
head can generate a different attention distribution.
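A minimal numpy sketch of dot-product multi-head attention, where each head projects into its own subspace and produces its own attention distribution; the projections are random placeholders rather than trained parameters:

```python
import numpy as np

def multi_head_attention(queries, keys, values, num_heads, rng):
    """Minimal dot-product multi-head attention: each head projects Q/K/V into its
    own subspace and computes its own attention distribution over the encoder frames.
    Projections are random here, purely for illustration."""
    d_model = queries.shape[-1]
    d_head = d_model // num_heads
    outputs = []
    for _ in range(num_heads):
        wq, wk, wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        q, k, v = queries @ wq, keys @ wk, values @ wv
        scores = q @ k.T / np.sqrt(d_head)                   # [T_q, T_k]
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)       # per-head attention distribution
        outputs.append(weights @ v)
    return np.concatenate(outputs, axis=-1)                  # concatenate the heads

rng = np.random.default_rng(0)
enc = rng.normal(size=(50, 256))    # 50 encoder frames, 256-dim
dec = rng.normal(size=(1, 256))     # one decoder query
print(multi_head_attention(dec, enc, enc, num_heads=4, rng=rng).shape)   # (1, 256)
```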
Multi-headed Attention
● Results: [Table: WER with and without multi-head attention.]
Optimization improvements
Minimum Word Error Rate (MWER)
● Training criterion does not match metric of interest: Word Error Rate
● MWER training instead minimizes the expected number of word errors, E[WordErrors(y, y*)], under the model's distribution P(y|x)
Minimum Word Error Rate (MWER)
● Related to discriminative sequence training of neural network acoustic models in conventional ASR systems
Key Takeaway Minimizing expected WER directly is intractable since it involves a summation over all
possible label sequences. Approximate expectation using samples.
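A minimal numpy sketch of the N-best approximation: renormalize the model's probability mass over the beam hypotheses and weight each hypothesis by its (mean-subtracted) number of word errors. The values below are toy numbers.

```python
import numpy as np

def approx_mwer_loss(nbest_log_probs, nbest_word_errors):
    """Approximate expected word errors over an N-best list, in the spirit of
    [Prabhavalkar et al., 2018].
    nbest_log_probs:   model log P(y_i | x) for each hypothesis in the beam.
    nbest_word_errors: number of word errors of each hypothesis vs. the reference.
    The expectation over all label sequences is replaced by a normalized
    expectation over the N-best hypotheses; subtracting the mean error is a
    common variance-reduction step."""
    log_probs = np.asarray(nbest_log_probs, dtype=float)
    errors = np.asarray(nbest_word_errors, dtype=float)
    # Renormalize the probability mass over the N-best list.
    p_hat = np.exp(log_probs - np.logaddexp.reduce(log_probs))
    relative_errors = errors - errors.mean()
    return float(np.sum(p_hat * relative_errors))

# Toy example: 3 hypotheses from beam search with their word-error counts.
print(approx_mwer_loss([-1.2, -2.0, -3.5], [0, 1, 3]))
```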
Minimum Word Error Rate (MWER)
[Table: WER with and without MWER training, for models with uni-directional and bi-directional encoders.]
● Since [Prabhavalkar et al., 2018] we have repeated the experiments with MWER training on
a number of models including RNN-T [Graves et al., 2013] and other streaming
attention-based models such as MoChA [Chiu and Raffel, 2017] and the Neural Transducer
[Jaitly et al., 2016]
● In all cases we have observed between 8% to 20% relative WER reduction
● Implementing MWER requires the ability to decode N-best hypotheses from the model
which can be somewhat computationally expensive
Scheduled Sampling
● Feeding the ground-truth label as the previous prediction (so-called teacher forcing) helps
the decoder to learn quickly at the beginning, but introduces a mismatch between training
and inference.
● The scheduled sampling process, on the other hand, samples from the probability
distribution of the previous prediction (i.e., from the softmax output) and then uses the
resulting token to feed as the previous token when predicting the next label
● This process helps reduce the gap between training and inference behavior. Our training process uses teacher forcing at the beginning of training, and as training proceeds, we linearly ramp up the probability of sampling from the model's prediction to 0.4 at a specified step, which we then keep constant until the end of training
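A minimal sketch of that sampling schedule (function and argument names are illustrative):

```python
import numpy as np

def next_decoder_input(true_label, model_probs, step, ramp_end_step,
                       max_rate=0.4, rng=None):
    """Choose the token fed to the decoder at the next step.
    Early in training this is the ground-truth label (teacher forcing); the
    probability of instead sampling from the model's own softmax output is
    ramped up linearly to `max_rate` by `ramp_end_step`, then held constant."""
    if rng is None:
        rng = np.random.default_rng()
    sampling_rate = max_rate * min(1.0, step / ramp_end_step)
    if rng.random() < sampling_rate:
        return int(rng.choice(len(model_probs), p=model_probs))   # sample from the model
    return true_label                                             # teacher forcing

# Toy usage: at step 50k of a 100k-step ramp, ~20% of tokens come from the model.
probs = np.array([0.7, 0.2, 0.1])
print(next_decoder_input(true_label=0, model_probs=probs, step=50_000, ramp_end_step=100_000))
```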
Asynchronous and Synchronous Training
● Synchronous training can potentially provide faster convergence rates and better model quality, but also requires more effort to stabilize network training.
● Both approaches have a high gradient variance at the beginning of the training when using
multiple replicas
○ In asynchronous training we use replica ramp up: that is, the system will not start all
training replicas at once, but instead start them gradually
○ In synchronous training we use two techniques: learning rate ramp up and a gradient
norm tracker
Label Smoothing
● Results: [Table: WER with and without label smoothing.]
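Label smoothing [Szegedy et al., 2016] replaces the one-hot target with a softened distribution. A minimal sketch, with an illustrative smoothing value of 0.1:

```python
import numpy as np

def label_smoothing_targets(label_ids, vocab_size, epsilon=0.1):
    """Soft targets for cross-entropy: put 1 - epsilon on the correct label and
    spread epsilon uniformly over the remaining classes. epsilon = 0.1 is an
    illustrative value, not the tutorial's setting."""
    targets = np.full((len(label_ids), vocab_size), epsilon / (vocab_size - 1))
    targets[np.arange(len(label_ids)), label_ids] = 1.0 - epsilon
    return targets

# Toy example: grapheme targets for a 3-label sequence over a 30-symbol vocabulary.
print(label_smoothing_targets([12, 10, 29], vocab_size=30)[0].round(4))
```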
Key Takeaway Some Voice Search errors appear to be fixable with a good language model trained on more
text-only data.
Motivation
● The LAS model requires audio-text pairs: we have only 15M of these
● Our production LM is trained on billions of words of text-only data
● How can we look at incorporating a larger LM into our LAS model?
● More details can be found in [Kannan et al., 2018]
Extending LAS with an LM
Deep/Cold Fusion
• Deep fusion [Gulcehre et al., 2015]: the LM is applied on the output; assumes the LM is fixed
• Cold fusion [Sriram et al., 2018]: a simple interface between a deep LM and the encoder; allows swapping in task-specific LMs
• In these experiments, fusion is used during the beam search rather than n-best rescoring.
Shallow Fusion
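A minimal sketch of shallow fusion inside beam search: at each step, hypotheses are ranked by a log-linear combination of the seq2seq score and an external LM score. The interpolation weight, vocabulary and the random stand-in models below are purely illustrative.

```python
import numpy as np

def shallow_fusion_score(seq2seq_log_probs, lm_log_probs, lam=0.3):
    """Shallow fusion: the score used to rank and prune hypotheses is a log-linear
    interpolation of the seq2seq model and an external LM. `lam` is a tuned
    interpolation weight (0.3 is illustrative)."""
    return seq2seq_log_probs + lam * lm_log_probs

def beam_step(hyps, step_scores_fn, beam_size=15):
    """Expand every hypothesis with every next token and keep the top `beam_size`."""
    expanded = []
    for prefix, score in hyps:
        for token, fused in enumerate(step_scores_fn(prefix)):
            expanded.append((prefix + [token], score + fused))
    return sorted(expanded, key=lambda h: h[1], reverse=True)[:beam_size]

# Toy usage with random "models" over a 5-token vocabulary.
rng = np.random.default_rng(0)
def step_scores(prefix):
    s2s = np.log(rng.dirichlet(np.ones(5)))    # stand-in for the seq2seq softmax
    lm  = np.log(rng.dirichlet(np.ones(5)))    # stand-in for the external LM
    return shallow_fusion_score(s2s, lm)

hyps = [([], 0.0)]
for _ in range(3):
    hyps = beam_step(hyps, step_scores)
print(hyps[0])   # best (token sequence, fused score) after three steps
```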
State-of-the-art Performance
Google's Voice Search Task
Yan et al., "Attention-based sequence-to-sequence model for speech recognition: development of state-of-the-art system on LibriSpeech and its application to non-native English"
Kazuki et al., "Model Unit Exploration for Sequence-to-Sequence Speech Recognition"
Online Models
RNN-T, NT, MoChA
Streaming Speech Recognition
Endpoint quickly
● Reliability
● Latency
● Privacy
RNN-T [Graves, 2012] augments the CTC encoder with a recurrent neural network LM
Recurrent Neural Network Transducer (RNN-T)
[Diagram: the Encoder and the Prediction Network feed a Joint Network, followed by a softmax over n + 1 labels that includes a blank, like CTC. Emitting <blank> advances the encoder to the next frame while retaining the prediction network state; emitting a label feeds it back into the prediction network as the previous label.]
Decoding example (recognizing "google"):
• t=1: previous label <SOS>, predict <blank> → advance to the next frame
• t=2: predict <blank> → advance to the next frame
• t=3: predict g → output "g"
• t=3: previous label g, predict o → output "go"
• t=3: previous label o, predict o → output "goo"
• ...
• t=T: previous label e, predict <blank>; inference terminates when all input frames have been consumed → output "google"
(A greedy decoding sketch of this control flow follows below.)
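The blank/label control flow above can be sketched as a greedy decoder. The toy components below (random embeddings and joint weights, a cap on labels per frame) are illustrative stand-ins, not the tutorial's model:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["<blank>", "g", "o", "l", "e"]            # toy vocabulary, index 0 = blank
D = 8

# Random stand-ins for the trained prediction and joint networks.
emb = rng.normal(size=(len(VOCAB), D))
W_joint = rng.normal(size=(2 * D, len(VOCAB)))

def prediction_network(state, prev_label):
    return np.tanh(state + emb[prev_label])        # toy recurrence over emitted labels

def joint_network(enc_frame, pred_state):
    logits = np.concatenate([enc_frame, pred_state]) @ W_joint
    return logits - np.logaddexp.reduce(logits)    # log-softmax over n + 1 labels

def rnnt_greedy_decode(enc_frames):
    output, pred_state = [], np.zeros(D)           # <SOS>-like initial state
    for frame in enc_frames:                       # advance through encoder frames
        for _ in range(5):                         # cap labels per frame (safety for this toy)
            k = int(np.argmax(joint_network(frame, pred_state)))
            if k == 0:                             # blank: advance to the next frame,
                break                              # retaining the prediction network state
            output.append(VOCAB[k])                # label: emit it and feed it back
            pred_state = prediction_network(pred_state, k)
    return "".join(output)                         # terminates after the last frame

print(rnnt_greedy_decode(rng.normal(size=(10, D))))
```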
Recurrent Neural Network Transducer (RNN-T)
During training, the true label sequence is fed to the prediction network (the LM component). Given a target sequence of length U and T acoustic frames, we generate a U x T grid of softmax outputs (t, u).
[Diagram: the RNN-T output lattice over frames t = 1..T and label positions u, with the encoder consuming frames and the prediction network consuming the previous true label starting from <SOS>. A minimal sketch of the corresponding forward recursion follows below.]
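A minimal numpy sketch of the forward (alpha) recursion over that U x T grid, assuming the joint-network log-softmax outputs are given; the RNN-T loss is the negative of the returned log-probability:

```python
import numpy as np

def rnnt_forward_logprob(log_probs, labels, blank=0):
    """Forward (alpha) recursion over the RNN-T training lattice.
    log_probs: [T, U+1, V] joint-network log-softmax on the training grid, with the
               true label sequence fed to the prediction network.
    labels:    the U true output labels."""
    T, U1, _ = log_probs.shape
    U = len(labels)
    assert U1 == U + 1
    alpha = np.full((T, U + 1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:   # arrive by emitting blank at (t-1, u): advance one frame
                alpha[t, u] = np.logaddexp(alpha[t, u],
                                           alpha[t - 1, u] + log_probs[t - 1, u, blank])
            if u > 0:   # arrive by emitting label u at (t, u-1): advance one label
                alpha[t, u] = np.logaddexp(alpha[t, u],
                                           alpha[t, u - 1] + log_probs[t, u - 1, labels[u - 1]])
    # Finish by emitting a final blank after the last label at the last frame.
    return alpha[T - 1, U] + log_probs[T - 1, U, blank]

# Toy usage: 5 frames, 3 labels, 6-symbol output vocabulary.
rng = np.random.default_rng(0)
lp = rng.normal(size=(5, 4, 6))
lp -= np.logaddexp.reduce(lp, axis=-1, keepdims=True)   # make it a valid log-softmax
print(rnnt_forward_logprob(lp, labels=[2, 4, 1]))        # log P(labels | x); loss = -this
```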
Recurrent Neural Network Transducer (RNN-T)
[Graves et al., 2013] showed promising results on TIMIT phoneme recognition, but the work did not seem to
get as much traction in the field as CTC.
• Both components can be initialized from a separately trained CTC AM and an RNN-LM (which can be trained on text-only data)
• Initialization provides some gains [Rao et al., 2017] but is not critical to get good performance
• Generally speaking, RNN-T always seems to perform better than CTC alone in our experiments (even
when decoded with a separate LM)
• More on this in a bit when we compare various approaches on a voice search task.
RNN-T: Case Study on ~18,000 hour Google Data
Reproduced from
[Rao et al., 2017] ASRU
The RNN-T model with ~96M parameters can match the performance of a
conventional sequence-trained CD-phone based CTC model with a large first pass LM
Further Improvements
Improved Architecture
[Diagram: RNN-T output lattice with labels (/a/, /b/, ...) on the vertical axis and time on the horizontal axis.]
Efficient Forward-Backward on TPU
• Native TF support for efficient batched forward-backward computation [Sim et al., 2017].
• Main idea: shift the matrix to do pipelining, so that each column only depends on the previous column but not on itself (a minimal sketch follows below).
This allows us to train faster on TPUs with much larger batch sizes than would be possible using GPUs, which improves accuracy.
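A rough numpy sketch of the shifting idea, under my reading of [Sim et al., 2017]: row u of the alpha grid is shifted right by u columns, after which both terms of the recursion refer only to the previous shifted column, so each column can be computed in one vectorized step (and batched across utterances). This is illustrative, not the TensorFlow implementation.

```python
import numpy as np

def rnnt_forward_shifted(blank_lp, label_lp):
    """Alpha recursion on a skewed grid: row u is shifted right by u columns, so
    shifted column k depends only on column k-1 (never on itself).
    blank_lp[t, u]: log-prob of blank at grid point (t, u), u = 0..U
    label_lp[t, u]: log-prob of the (u+1)-th true label at (t, u), u = 0..U-1"""
    T, U1 = blank_lp.shape
    U = U1 - 1
    A = np.full((U + 1, T + U), -np.inf)            # A[u, k] = alpha[k - u, u]
    A[0, 0] = 0.0                                   # alpha[t=0, u=0]
    for k in range(1, T + U):
        u = np.arange(U + 1)
        t = k - u                                   # original frame index per row
        valid = (t >= 0) & (t <= T - 1)
        stay = np.full(U + 1, -np.inf)              # arrive by blank from (t-1, u)
        ok = valid & (t >= 1)
        stay[ok] = A[u[ok], k - 1] + blank_lp[t[ok] - 1, u[ok]]
        emit = np.full(U + 1, -np.inf)              # arrive by label from (t, u-1)
        ok = valid & (u >= 1)
        emit[ok] = A[u[ok] - 1, k - 1] + label_lp[t[ok], u[ok] - 1]
        col = np.logaddexp(stay, emit)
        A[valid, k] = col[valid]                    # the whole column in one step
    # alpha[T-1, U] sits at shifted column (T-1)+U; add the final blank.
    return A[U, T - 1 + U] + blank_lp[T - 1, U]

# Toy usage, consistent with the unshifted recursion sketched earlier.
rng = np.random.default_rng(0)
T, U, labels = 6, 3, [1, 3, 2]
lp = rng.normal(size=(T, U + 1, 5))
lp -= np.logaddexp.reduce(lp, axis=-1, keepdims=True)
blank_lp = lp[:, :, 0]
label_lp = np.stack([lp[:, u, labels[u]] for u in range(U)], axis=1)
print(rnnt_forward_shifted(blank_lp, label_lp))
```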
Comparisons of Streaming Models for Mobile Devices
More details: [He et al., 2018]
● Inputs:
○ Each 10ms frame feature: 80-dimensional log-Mel.
○ Every 3 frames are stacked as the input to the networks, so the effective frame rate is 30ms.
● RNN-T:
○ 8-layer encoder, 2048 unidirectional LSTM cells + 640 projection units.
○ 2-layer prediction network, 2048 unidirectional LSTM cells + 640 projection units.
○ Model output units: grapheme or word-piece.
○ Total system size: ~120MB after quantization (more on this in a later slide).
With all the optimizations, a streaming RNN-T model improves WER by more than
20% over a conventional CTC embedded model + word LMs.
Real-Time Recognition
Time Reduction Layer
RNN-T decodes speech twice as fast as real time on a Google Pixel phone, which
improves WER by more than 20% relative to a conventional CTC embedded model.
Neural Transducer
Neural Transducer: “Online” Attention Models
[Diagram: a chunk-based encoder feeds an attention mechanism and a decoder; attention is computed only over the current chunk of encoder output.]
Decoding example: starting from <sos>, the decoder emits c, a, t, feeding each predicted label back in as the previous label; it then emits <epsilon> to signal that the current chunk is finished, moves on to the next chunk, and continues with i, n, and so on.
Training Data for Neural Transducer
[Diagram: word alignment of "hello how are you" against the audio.]
• Online methods like RNN-T and policy gradient learn the alignment jointly with the model.
• We train the neural transducer with a pre-specified alignment, so we don't need to re-compute alignments (e.g., with forward-backward) during training, which slows things down on GPU/TPU.
• The NT model examines previous frames without looking beyond the current chunk (a minimal masking sketch follows below).
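A minimal sketch of restricting attention to the current chunk (chunk size and dimensions are arbitrary):

```python
import numpy as np

def chunk_attention_mask(num_frames, chunk_size, current_chunk):
    """Frames the decoder may attend to while emitting labels for `current_chunk`:
    everything up to the end of that chunk, nothing beyond it."""
    visible_end = min((current_chunk + 1) * chunk_size, num_frames)
    mask = np.zeros(num_frames, dtype=bool)
    mask[:visible_end] = True
    return mask

def masked_attention(query, enc, mask):
    """Standard dot-product attention, restricted to the visible frames."""
    scores = enc @ query
    scores[~mask] = -np.inf                         # future chunks get zero weight
    weights = np.exp(scores - scores.max()); weights /= weights.sum()
    return weights @ enc

rng = np.random.default_rng(0)
enc = rng.normal(size=(30, 8))                                     # 30 encoder frames
mask = chunk_attention_mask(30, chunk_size=10, current_chunk=1)    # frames 0..19 visible
print(masked_attention(rng.normal(size=8), enc, mask).shape)
```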
Monotonic Chunkwise Attention
Monotonic Attention
MoChA
Monotonic Chunkwise Attention (MoChA)
[Diagrams: MoChA attention behavior during training vs. inference.]
Online Model Comparison
Clean WER (%)
Model   Dictation   VoiceSearch
NT      8.7         7.8
Summary
Personalization
Fast endpointing
Multi-dialect ASR
Personalization
“Bias” the priors to the speech models based
on personal information
• Contacts - “call Joe Doe, send a message to Jason Dean”
• Songs - “play Lady Gaga, play songs from Jason Mraz”
• Dialog - "yes, no, stop, cancel"
Why Important
[Diagram: CLAS decoder states attending over embedded bias phrases, e.g. B1 = "grey chicken", B2 = "blue dog".]
• Example reference: The grey chicken jumps over the lazy dog
• Sample a bias phrase b uniformly from the reference, e.g. grey chicken
• With drop-probability p (e.g. 0.5), drop the selected phrase
• Augment with N-1 additional bias phrases from other references in the batch:
• quick turtle
• grey chicken
• brave monkey
• If b was not dropped, insert a </bias> token into the reference:
• The grey chicken</bias> jumps over the lazy dog
● The Biaser embeds each phrase into a fixed length vector
○ → Last state of an LSTM
● Embedding happens once per bias phrase (possibly offline)
○ Cheap computation
● Attention is then computed over the set of embeddings
[Diagram: the biaser runs an LSTM over each bias phrase (e.g. D-O-G) and keeps the last state as the phrase embedding b_i; attention then produces a weight for each phrase, plus an N/A (no-bias) option.]
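A toy sketch of the biasing components described above: a stand-in "LSTM" that embeds each phrase into its last state, and an attention over those embeddings plus an N/A (no-bias) option. Weights are random placeholders, not the CLAS model.

```python
import numpy as np

def embed_bias_phrases(phrase_char_embeddings, rng):
    """Stand-in for the biaser: run a simple recurrence over each phrase and keep
    the last state as its embedding (computable once per phrase, possibly offline)."""
    W = rng.normal(size=(8, 8)) * 0.1
    out = []
    for chars in phrase_char_embeddings:            # chars: [phrase_len, 8]
        h = np.zeros(8)
        for c in chars:
            h = np.tanh(W @ h + c)
        out.append(h)
    return np.stack(out)                            # one vector per bias phrase

def bias_attention(decoder_state, phrase_embeddings):
    """Attention over the bias-phrase embeddings plus an N/A (no-bias) option,
    so the model can ignore biasing when no phrase is relevant."""
    no_bias = np.zeros_like(phrase_embeddings[:1])
    options = np.concatenate([no_bias, phrase_embeddings])
    scores = options @ decoder_state
    weights = np.exp(scores - scores.max()); weights /= weights.sum()
    return weights, weights @ options               # per-phrase weights + bias context

rng = np.random.default_rng(0)
phrases = [rng.normal(size=(n, 8)) for n in (3, 5, 4)]   # toy character embeddings
emb = embed_bias_phrases(phrases, rng)
weights, context = bias_attention(rng.normal(size=8), emb)
print(weights.round(3))                             # first entry = weight of the N/A option
```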
CLAS
• CLAS is better than biasing with an external LM
• The CLAS model + an external LM works best
Method Songs
CLAS 6.9
Endpoint Quickly
• Finalize recognition and fetch the search results as soon as the user stops speaking; endpointing quickly matters for latency.
• Trade-off between accuracy and latency (the slides quote 10% WER vs. 100 ms), measured via EP50 (ms) and median latency (ms); noisy conditions make endpointing harder.
[Plots: WER vs. EP50 (ms) and vs. median latency (ms).]
VAD vs. end-of-query detection
• SOS = start of speech, EOS = end of speech, EOU = end of utterance
[Diagram: timeline of PrefetchEvent and RecEvent relative to SOS, EOS and EOU.]
Training a Voice Activity Detector
● Take an utterance with ground-truth transcription
● Use forced alignment to find the timing of the utterance
● Based on the timing mark each frame as SPEECH (0) or NON-SPEECH (1)
[Diagram: per-frame SPEECH/NON-SPEECH targets derived from the forced alignment.]
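A minimal sketch of deriving per-frame targets from word-level alignment times (the 1 = speech convention and the 10 ms frame rate here are assumptions for illustration):

```python
def frame_labels(word_timings, num_frames, frame_ms=10):
    """Per-frame VAD targets from a forced alignment: 1 for frames inside any
    aligned word segment, 0 elsewhere (label convention chosen for this sketch)."""
    labels = [0] * num_frames
    for start_ms, end_ms in word_timings:
        for f in range(start_ms // frame_ms, min(num_frames, -(-end_ms // frame_ms))):
            labels[f] = 1
    return labels

# Toy example: two words aligned at 120-380 ms and 500-830 ms in a 1 s utterance.
print(frame_labels([(120, 380), (500, 830)], num_frames=100))
```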
[Diagram: per-dialect stacks of AM, PM and LM (en-us, en-gb, ...) vs. a single Seq2Seq model.]
In conventional systems, languages/dialects are handled with individual AMs, PMs and LMs, and scaling this up is becoming challenging. A single model serves all dialects.
Multi-Dialect LAS
[Diagram: a conventional system needs data, a phoneme lexicon, text normalization and an LM for each of N languages/dialects; the Seq2Seq model only needs data.]
Motivations
● We share the same interest:
○ S. Watanabe, T. Hori, J.R. Hershey; Language independent end-to-end architecture for
joint language identification and speech recognition; ASRU 2017. MERL, USA.
■ English, Japanese, Mandarin, German, Spanish, French, Italian, Dutch, Portuguese,
Russian.
○ S. Kim, M.L. Seltzer; Towards language-universal end-to-end speech recognition;
submitted to ICASSP 2018. Microsoft, USA.
■ English, German, Spanish.
Dialect as Output Targets
[Diagram: LAS model with the dialect included among the output targets.]
LAS With Dialect as Input Features
● Passing the dialect information as additional features
[Diagram: stacked encoder LSTMs, attention, and decoder LSTMs, with the input dialect vector fed into the model.]
● encoders → acoustic variation; decoders → lexicon and language variation
Experimental Evaluations
WER (%) per dialect:

Dialect        US    IN    GB    ZA    AU    NG    KE
dialect-ind.   10.6  18.3  12.9  12.7  12.8  33.4  19.2
dialect-dep.   9.7   16.2  12.7  11.0  12.1  33.4  19.0

Dialect as output targets:
Dialect                   US    IN    GB    ZA    AU    NG    KE
Baseline (dialect-dep.)   9.7   16.2  12.7  11.0  12.1  33.4  19.0
LID first                 9.9   16.6  12.3  11.6  12.2  33.6  18.7
ASR first                 9.4   16.5  11.6  11.0  11.9  32.0  17.9

Dialect as input features:
Dialect                   US    IN    GB    ZA    AU    NG    KE
Baseline (dialect-dep.)   9.7   16.2  12.7  11.0  12.1  33.4  19.0
encoder                   9.6   16.4  11.8  10.6  10.7  31.6  18.1
decoder                   9.4   16.2  11.3  10.8  10.9  32.8  18.0
both                      9.1   15.7  11.5  10.0  10.1  31.3  17.4
★ feeding dialect to both encoder and decoder gives the largest gains
LAS With Dialect as Input Features
Feeding different dialect vectors (rows) on different test sets (columns).
[Table: rows = dialect feature fed to the model, columns = test set.]
★ for low-resource dialects (NG, KE), the model learns to ignore the dialect information
LAS With Dialect as Input Features
dialect vector                   encoder  decoder  color (US)  colour (GB)
❌                                ❌        ❌        1           22
<en-gb>: [0, 1, 0, 0, 0, 0, 0]   ✓        ❌        19          4
<en-gb>: [0, 1, 0, 0, 0, 0, 0]   ❌        ✓        0           25
<en-us>: [1, 0, 0, 0, 0, 0, 0]   ❌        ✓        24          0
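A minimal sketch of feeding the dialect as a one-hot input feature by concatenating it to every frame (the dialect ordering below is chosen to match the vectors in the table above and is otherwise illustrative):

```python
import numpy as np

DIALECTS = ["en-us", "en-gb", "en-in", "en-za", "en-au", "en-ng", "en-ke"]

def append_dialect(features, dialect):
    """Concatenate a one-hot dialect vector (e.g. <en-gb> = [0,1,0,0,0,0,0]) to
    every input frame, so the encoder and decoder can condition on the dialect."""
    one_hot = np.zeros(len(DIALECTS))
    one_hot[DIALECTS.index(dialect)] = 1.0
    return np.concatenate([features, np.tile(one_hot, (features.shape[0], 1))], axis=-1)

frames = np.random.default_rng(0).normal(size=(200, 80))   # 200 log-Mel frames
print(append_dialect(frames, "en-gb").shape)                # (200, 87)
```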
[Diagram: LAS model (stacked encoder LSTMs, attention, decoder LSTMs) with the input dialect vector fed to both the encoder and the decoder.]
○ output targets: multi-task with ASR
○ input features: feeding dialect to both encoder and decoder
Final Multi-Dialect LAS
Dialect                      US    IN    GB    ZA    AU    NG    KE
Baseline (dialect-dep.)      9.7   16.2  12.7  11.0  12.1  33.4  19.0
output targets (ASR first)   9.4   16.5  11.6  11.0  11.9  32.0  17.9
input features (both)        9.1   15.7  11.5  10.0  10.1  31.3  17.4
Summary
Source: https://github.com/tensorflow/lingvo
Thank You
References
[Audhkhasi et al., 2017] K. Audhkhasi, B. Ramabhadran, G. Saon, M. Picheny, D. Nahamoo “Direct Acoustics-to-Word Models for
English Conversational Speech Recognition,” Proc. of Interspeech, 2017.
[Bahdanau et al., 2017] D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, Y. Bengio, “An Actor-Critic Algorithm
for Sequence Prediction,” Proc. of ICLR, 2017.
[Battenberg et al., 2017] E. Battenberg, J. Chen, R. Child, A. Coates, Y. Gaur, Y. Li, H. Liu, S. Satheesh, D. Seetapun, A. Sriram, Z. Zhu,
“Exploring Neural Transducers For End-to-End Speech Recognition,” Proc. of ASRU, 2017.
[Chan et al., 2015] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, Attend and Spell,” CoRR, vol. abs/1508.01211, 2015.
[Chang et al., 2017] S-Y. Chang, B. Li, T. N. Sainath, G. Simko, C. Parada, “Endpoint Detection using Grid Long Short-Term Memory
Networks for Streaming Speech Recognition,” Proc. of Interspeech, 2017.
[Chang et al., 2018] S-Y. Chang, B. Li, G. Simko, T. N. Sainath, A. Tripathi, A. van den Oord, O. Vinyals, “Temporal Modeling Using
Dilated Convolution and Gating for Voice-Activity-Detection,” Proc. of ICASSP, 2018.
[Chiu and Raffel, 2017] C.-C. Chiu, C. Raffel, “Monotonic Chunkwise Attention,” Proc. of ICLR, 2017.
[Chiu et al., 2018] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, N.
Jaitly, B. Li, J. Chorowski, M. Bacchiani, “State-of-the-art Speech Recognition With Sequence-to-Sequence Models,” Proc. of ICASSP,
2018.
[Chorowski et al., 2015] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-Based Models for Speech
Recognition,” in Proc. of NIPS, 2015.
[Graves et al., 2006] A. Graves, S. Fernandez, F. Gomez, J. Schmidhuber, “Connectionist Temporal Classification: Labelling
Unsegmented Sequence Data with Recurrent Neural Networks,” Proc. of ICML, 2006.
[Graves, 2012] A. Graves, “Sequence Transduction with Recurrent Neural Networks,” Proc. of ICML Representation Learning
Workshop, 2012.
References
[Graves et al., 2013] A. Graves, A. Mohamed, and G. Hinton, “Speech Recognition with Deep Recurrent Neural Networks,” in Proc. ICASSP, 2013.
[Gulcehre et al., 2015] C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H.-C. Lin, F. Bougares, H. Schwenk, Y. Bengio, “On Using
Monolingual Corpora in Neural Machine Translation”, CoRR, vol. abs/1503.03535, 2015.
[Hannun et al., 2014] A. Hannun, A. Maas, D. Jurafsky, A. Ng, “First-Pass Large Vocabulary Continuous Speech Recognition using
Bi-Directional Recurrent DNNs,” CoRR, vol. abs/1408.2873, 2014.
[He et al., 2017] Y. He, R. Prabhavalkar, K. Rao, W. Li, A. Bakhtin and I. McGraw, "Streaming small-footprint keyword spotting using
sequence-to-sequence models," Proc. of ASRU, 2017.
[He et al., 2018] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, Q. Liang, D.
Bhatia, Y. Shangguan, B. Li, G. Pundak, K. C. Sim, T. Bagby, S-Y. Chang, K. Rao, A. Gruenstein, “Streaming End-to-end Speech
Recognition For Mobile Devices,” CoRR, vol. abs/1811.06621, 2018.
[Jaitly et al., 2016] N. Jaitly, D. Sussillo, Q. V. Le, O. Vinyals, I. Sutskever, S. Bengio, “An Online Sequence-to-Sequence Model Using
Partial Conditioning,” Proc. of NIPS, 2016.
[Kannan et al., 2018] A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, R. Prabhavalkar, "An analysis of incorporating an external
language model into a sequence-to-sequence model," Proc. of ICASSP, 2018.
[Kim and Rush, 2016] Y. Kim and A. M. Rush, “Sequence-level Knowledge Distillation,” Proc. of EMNLP, 2016.
[Kim et al., 2017] S. Kim, T. Hori and S. Watanabe, “Joint CTC-attention based End-to-End Speech Recognition using Multi-Task
Learning,” Proc. of ICASSP, 2017.
[Kingsbury, 2009] B. Kingsbury, “Lattice-based optimization of sequence classification criteria for neural-network acoustic
modeling,” Proc. of ICASSP, 2009.
[Li et al., 2018] B. Li, T. N. Sainath, K. C. Sim, M. Bacchiani, E. Weinstein, P. Nguyen, Z. Chen, Y. Wu, K. Rao, “Multi-Dialect Speech
Recognition With A Single Sequence-To-Sequence Model,” Proc. of ICASSP, 2018.
References
[Maas et al., 2015] A. Maas, Z. Xie, D. Jurafsky, A. Ng, “Lexicon-Free Conversational Speech Recognition with Neural Networks," Proc.
of NAACL-HLT, 2015.
[McGraw et al., 2016] I. McGraw, R. Prabhavalkar, R. Alvarez, M. G. Arenas, K. Rao, D. Rybach, O. Alsharif, H. Sak, A. Gruenstein, F.
Beaufays, C. Parada, “Personalized speech recognition on mobile devices”, Proc. of ICASSP, 2016
[Rabiner, 1989] L. R. Rabiner, “A tutorial on hidden markov models and selected applications in speech recognition,” Proc. of IEEE,
1989.
[Prabhavalkar et al., 2017] R. Prabhavalkar, K. Rao, T. N. Sainath, B. Li, L. Johnson, N. Jaitly, “A Comparison of Sequence-to-Sequence
Models for Speech Recognition,” Proc. of Interspeech, 2017.
[Prabhavalkar et al., 2018] R. Prabhavalkar, T. N. Sainath, Y. Wu, P. Nguyen, Z. Chen, C.-C. Chiu, A. Kannan, “Minimum Word Error Rate
Training for Attention-based Sequence-to-Sequence Models,” Proc. of ICASSP, 2018.
[Povey, 2003] D. Povey, “Discriminative Training for Large Vocabulary Speech Recognition”, Ph.D. thesis, Cambridge University
Engineering Department, 2003.
[Pundak et al., 2018] G. Pundak, T. N. Sainath, R. Prabhavalkar, A. Kannan, D. Zhao, “Deep context: end-to-end contextual speech
recognition,” Proc. of SLT, 2018.
[Rao et al., 2017] K. Rao, H. Sak, R. Prabhavalkar, “Exploring Architectures, Data and Units For Streaming End-to-End Speech
Recognition with RNN-Transducer”, Proc. of ASRU, 2017.
[Ranzato et al., 2016] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba, “Sequence level training with recurrent neural networks,” Proc.
of ICLR, 2016.
[Sainath et al., 2018] T. N. Sainath, C.-C. Chiu, R. Prabhavalkar, A. Kannan, Y. Wu, P. Nguyen, Z. Chen, “Improving the Performance of
Online Neural Transducer Models,” Proc. of ICASSP, 2018.
[Sak et al., 2015] H. Sak, A. Senior, K. Rao, F. Beaufays, “Fast and Accurate Recurrent Neural Network
Acoustic Models for Speech Recognition,” Proc. of Interspeech, 2015.
References
[Sak et al., 2017] H. Sak, M. Shannon, K. Rao, and F. Beaufays, “Recurrent neural aligner: An encoder-decoder neural network model
for sequence to sequence mapping,” in Proc. of Interspeech, 2017.
[Schuster & Nakajima, 2012] M. Schuster and K. Nakajima, “Japanese and Korean Voice Search,” Proc. of ICASSP, 2012.
[Shannon, 2017] M. Shannon, “Optimizing expected word error rate via sampling for speech recognition,” in Proc. of Interspeech,
2017.
[Sim et al., 2017] K. Sim, A. Narayanan, T. Bagby, T. N. Sainath, and M. Bacchiani, “Improving the Efficiency of Forward-Backward
Algorithm using Batched Computation in TensorFlow,” Proc. of ASRU, 2017.
[Sriram et al., 2018] A. Sriram, H. Jun, S. Satheesh, A. Coates, “Cold Fusion: Training Seq2Seq Models Together with Language
Models,” Proc. of ICLR, 2018.
[Stolcke et al., 1997] A. Stolcke, Y. Konig, M. Weintraub, “Explicit word error minimization in N-best list rescoring,” Proc. of
Eurospeech, 1997.
[Su et al., 2013] H. Su, G. Li, D. Yu, and F. Seide, “Error back propagation for sequence training of context-dependent deep networks
for conversational speech transcription,” Proc. of ICASSP, 2013.
[Szegedy et al., 2016] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer
vision,” Proc. of CVPR, 2016.
[Toshniwal et al., 2018] S. Toshniwal, T. N. Sainath, R. J. Weiss, B. Li, P. Moreno, E. Weinstein, K. Rao, “Multilingual Speech
Recognition With A Single End-To-End Model,” Proc. of ICASSP, 2018.
[Vaswani et al., 2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention
Is All You Need,” Proc. of NIPS, 2017.
[Wiseman and Rush, 2016] S. Wiseman and A. M. Rush, “Sequence-to-Sequence Learning as Beam Search Optimization,” Proc. of
EMNLP, 2016.