[Figure: (a) Speaker information form for recording; (b) Recording interface]
configuration is selected to proceed further. At the end, the output lattice from the best-tuned configuration is selected for re-scoring using a recurrent neural network language model (RNNLM). The Kaldi toolkit [36] is used for Urdu ASR development.

3.3. Language Modeling

The SRI Language Modeling (SRILM) toolkit [7] is used to build a trigram language model. A very large corpus is collected by crawling a large number of Urdu websites covering categories such as news, magazines, books and blogs. The corpus is sentence tokenized and cleaned to retain only Urdu and code-switched sentences. This collected corpus, along with readily available Urdu corpora [28, 29], is used for language modeling. It contains around 154 million Urdu words forming 35 million trigrams. This corpus is also used to train the RNNLM that is combined with the best acoustic model.

4. Experimental Results

Various experiments with different configurations of deep neural networks for acoustic modeling are performed. Results are reported in terms of Word Error Rate (WER).

4.1. Acoustic Modeling

4.1.1. GMM-HMM

Using 300 hours of training data, a baseline GMM-HMM system is built on high resolution MFCCs. Furthermore, speaker independent Linear Discriminant Analysis (LDA) and Maximum Likelihood Linear Transform (MLLT) transforms are investigated. On top of LDA+MLLT, Speaker Adapted Training (SAT) is done and its alignments are used for DNN training. A summary of the WER of the baseline systems is shown in Table 2.

Table 2: %WER of GMM-HMM Urdu ASR

Model    %WER
GMM      46.87

4.1.2. Deep Neural Networks (DNNs)

On the alignments from SAT training, various deep neural networks are trained. DNN training is done with high resolution MFCCs and an i-vector of dimensionality 100 for each sample. The network consists of 7 hidden layers. The first layer is a fixed affine layer, whereas the rest of the hidden layers are TDNN layers with a cell dimensionality of 1024. Relu-renorm [16] is used as the activation function. An experiment with a lattice-free MMI (LF-MMI) based TDNN is also done, and the results show that it performs better than the cross-entropy based TDNN. So, LF-MMI based networks are used for the rest of the DNN experiments. LF-MMI based TDNN, LSTM and BLSTM networks are trained. Furthermore, a reduced frame rate is used in decoding to speed up the process. These networks are termed chain models in Kaldi. A comparison of all these experiments is shown in Table 3.

Table 3: Comparison of %WER of different deep neural networks for Urdu ASR. Number of epochs fixed to 2.

Model          No. of hidden layers   Cell dim.   %WER
TDNN           7                      1024        21.33
Chain TDNN     8                      625         19.92
Chain TDNN-    7                      512         19.18
Chain BLSTM    3                      1024        19.38
Chain TDNN-    5                      1024        18.64

It is evident from Table 3 that the chain TDNN-BLSTM outperforms the other networks for Urdu speech data as well. In further experiments, it is optimized by varying different parameters to find the best configuration.

4.1.3. Number of layers

Several experiments are done to optimize the number of hidden layers for the chain TDNN-BLSTM network. By default, the network consists of 5 hidden layers, of which the first two are time delay neural network (TDNN) layers while the remaining 3 are bidirectional long short-term memory (BLSTM) layers. The number of hidden layers is varied at this stage and the results are shown in Table 4.

4.1.4. Hidden layers’ size

For the experiments shown in Table 4, the number of neurons in the hidden layers is fixed to 1024 per layer, which means 1024 nodes per memory cell are used for each of the forward and backward directions. Table 5 compares the different layer sizes for the chain
TDNN-BLSTM with the best number of hidden layers.

Table 4: Comparison of number of hidden layers. Cell dimension fixed to 1024, recurrent and non-recurrent projection dimension as 256, delay=-3, decay-time=20, No. of epochs=2

No. of hidden layers     No. of param (M)   %WER
4 (2 TDNN + 2 BLSTM)     51.2               19.80
5 (2 TDNN + 3 BLSTM)     51.2               18.64
6 (2 TDNN + 4 BLSTM)     62.7               18.88
7 (2 TDNN + 5 BLSTM)     74.3               18.92

Table 5: Comparison of cell dimensionality (layer size). Number of hidden layers fixed to 5, recurrent and non-recurrent projection dimension as 256, delay=-3, decay-time=20, Number of

Layer size   No. of param (M)   %WER
512          26.8               28.6
1024         51.2               18.64

4.2. Language Modeling

All word error rates reported in the last section are decoded using the 3-gram language model. After choosing the best configuration of the acoustic model, the trigram LM is replaced with a recurrent neural network based language model. For RNNLM training, a TDNN-LSTM network is used with 3 TDNN and 2 LSTM hidden layers. The layer size is fixed to 1024 cells. The best lattice is re-scored using this model and the WER improves from 18.64% to 16.94%, as shown in Table 6.

5. Discussion

A post analysis is done on the decoded output by aligning hypothesis and reference texts. It is found that digits are being decoded into words and vice versa. Sometimes, the ASR decodes the digit 6 as چھ /tʃʰeː/ (six in Urdu), and چھ /tʃʰeː/ as 6 or ۶ (digit six in Arabic script). In case of a single digit, error computation penal-
correct acoustically and semantically in both conditions but may contradict the reference text. For example, the word حقدار /həqd̪ɑːr/ is sometimes decoded as حق دار /həq d̪ɑːr/, which is correct both acoustically and semantically but not correct for WER calculation.

To compensate for such errors, retraining is done after text normalization of the training and test sets’ transcriptions. In the text normalization process, all numbers are standardized to the same format (in words). All words with the same pronunciation but different orthographic representations are replaced with one of the representations. Words that are correct with and without spaces are replaced with the one having spaces. Furthermore, for all cases, redundant entries are removed from the lexicon. After retraining the acoustic model, the WER is further reduced from 16.94% to 13.50% (shown in Table 6).

Table 6: Comparison of WER after RNNLM and text normalization

Configuration                             %WER
3-gram LM                                 18.64
RNNLM                                     16.94
Text normalized acoustic model+RNNLM      13.50

6. Conclusion

This paper presents the collection of an Urdu speech corpus of 292.5 hours from 1586 speakers. A large vocabulary Urdu speech recognition system is developed using 300 hours of microphone speech data from 1671 speakers. A text corpus of 154 million words is developed for 3-gram and neural network based language modeling. For evaluation of the speech recognition system, a test data set of 9.5 hours is collected from 62 unseen speakers. Different state-of-the-art modeling techniques are investigated to develop the Urdu LVCSR system. After evaluation of various techniques for acoustic modeling, a TDNN-BLSTM network is chosen to develop the system. The decoded output lattice is re-scored using an RNNLM. To compensate for errors due to insertion or deletion of spaces and alternate orthographic representations of the same pronunciation, text normalization is done and the acoustic model is retrained. A minimum WER of 13.50% is achieved on the test data set. The speech corpora collected in this work can also be used for development of various other speech technologies such as Urdu speaker recognition, age estimation and gender identification systems.
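
As an illustration of the scoring issue described in the Discussion, the following minimal Python sketch computes WER as word-level edit distance divided by the number of reference words, and shows how a digit/word mismatch and a spacing variant are penalized before normalization but not after. The normalization table, the function names and the example sentence are hypothetical; they are not the authors' normalization rules and not Kaldi's scoring scripts.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical normalization table (an assumption for illustration only):
# digits are written as words, and the joined spelling is replaced by the
# spelling that contains a space.
SIX_WORD = "چھ"
JOINED = "حقدار"
SPACED = "حق دار"
NORMALIZATION_MAP = {"6": SIX_WORD, "۶": SIX_WORD, JOINED: SPACED}


def normalize(text: str) -> str:
    """Replace each token by its normalized form when one is listed."""
    return " ".join(NORMALIZATION_MAP.get(tok, tok) for tok in text.split())


# Illustrative sentence pair that differs only in digit form and spacing.
template = "وہ {num} کتابوں کا {word} ہے"
reference = template.format(num="6", word=JOINED)
hypothesis = template.format(num=SIX_WORD, word=SPACED)

wer_raw = word_error_rate(reference, hypothesis)
wer_norm = word_error_rate(normalize(reference), normalize(hypothesis))
print(f"WER before normalization: {wer_raw:.2%}")   # 50.00%
print(f"WER after normalization:  {wer_norm:.2%}")  # 0.00%
```

In this toy pair, three of six reference words are counted as errors before normalization (one substitution for the digit, and one substitution plus one insertion for the spacing variant), although the hypothesis is acoustically and semantically correct. Applying the same normalization to both sides removes the penalty.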
