[Figure: (a) Speaker information form for recording; (b) Recording interface]
configuration is opted to proceed further. Finally, the output lattice from the best-tuned configuration is selected for re-scoring with a recurrent neural network language model (RNNLM). The Kaldi toolkit [36] is used for Urdu ASR development.

3.3. Language Modeling

The SRI Language Modeling (SRILM) toolkit [7] is used to build a trigram language model. A very large corpus is collected by crawling a large number of Urdu websites covering categories such as news, magazines, books and blogs. The corpus is sentence-tokenized and cleaned so that only Urdu and code-switched sentences are retained. This collected corpus, together with readily available Urdu corpora [28, 29], is used for language modeling. It contains around 154 million Urdu words forming 35 million trigrams. The same corpus is also used to train an RNNLM, which is combined with the best acoustic model.
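The paper does not detail the cleaning step, so the following is only a minimal sketch of one plausible filter for keeping Urdu and code-switched sentences, assuming a simple script-ratio heuristic; the threshold and file names are illustrative, not taken from the paper:

```python
import re

# Urdu is written in Arabic script: accept a sentence if at least half of
# its alphabetic characters fall in the Arabic Unicode blocks, so purely
# Urdu and Urdu-English code-switched lines both pass the filter.
ARABIC_SCRIPT = re.compile(r'[\u0600-\u06FF\u0750-\u077F\uFB50-\uFDFF\uFE70-\uFEFF]')

def is_urdu_or_code_switched(sentence: str, min_ratio: float = 0.5) -> bool:
    letters = [ch for ch in sentence if ch.isalpha()]
    if not letters:
        return False
    urdu = sum(1 for ch in letters if ARABIC_SCRIPT.match(ch))
    return urdu / len(letters) >= min_ratio

# Illustrative usage: filter a crawled corpus line by line.
with open('crawled.txt', encoding='utf-8') as src, \
        open('clean.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        line = line.strip()
        if line and is_urdu_or_code_switched(line):
            dst.write(line + '\n')
```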
4. Experimental Results

Various experiments with different configurations of deep neural networks for acoustic modeling are performed. Results are reported in terms of Word Error Rate (WER).
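WER is the standard edit-distance measure: substitutions, deletions and insertions in the minimum-cost alignment of the hypothesis to the reference, divided by the number of reference words. As a reference point for the error analysis in Section 5, here is a minimal sketch of that computation (the paper itself uses Kaldi's scoring tools):

```python
def wer_counts(ref, hyp):
    """Return (substitutions, deletions, insertions, %WER) for the
    minimum-edit alignment of two word lists."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                      # i deletions
    for j in range(m + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,   # match / substitution
                           dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1)          # insertion
    # Backtrace to count each error type separately.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += ref[i - 1] != hyp[j - 1]          # adds 1 only on a mismatch
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    wer = 100.0 * (subs + dels + ins) / max(n, 1)
    return subs, dels, ins, wer
```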
4.1. Acoustic Modeling

4.1.1. GMM-HMM

Using 300 hours of training data, a baseline GMM-HMM system with high-resolution MFCCs is built. Furthermore, speaker-independent Linear Discriminant Analysis (LDA) and Maximum Likelihood Linear Transform (MLLT) transforms are investigated. On top of LDA+MLLT, Speaker Adapted Training (SAT) is performed, and its alignments are used for DNN training. The WERs of these baseline systems are summarized in Table 2.

Table 2: %WER of GMM-HMM Urdu ASR

Model              %WER
GMM                46.87
GMM+LDA+MLLT       37.26
GMM+LDA+MLLT+SAT   32.24
4.1.2. Deep Neural Networks (DNNs)

Various deep neural networks are trained on the SAT alignments. DNN training uses high-resolution MFCCs and a 100-dimensional i-vector for each sample. The network consists of 7 hidden layers: the first is a fixed affine layer, while the remaining hidden layers are TDNN layers with a cell dimensionality of 1024, using relu-renorm [16] as the activation function. An experiment with an LF-MMI-based TDNN is also carried out, and the results show that it outperforms the cross-entropy-based TDNN, so LF-MMI-based networks are used for the rest of the DNN experiments. Lattice-free MMI-based TDNN, LSTM and BLSTM networks are trained; these networks are termed chain models in Kaldi. Furthermore, a reduced frame rate is used in decoding to speed up the process. A comparison of all these experiments is shown in Table 3.

Table 3: Comparison of %WER of different deep neural networks for Urdu ASR. Number of epochs fixed to 2

Model              No. of hidden layers   Cell dim.   %WER
TDNN               7                      1024        21.33
Chain TDNN         8                      625         19.92
Chain TDNN-LSTM    7                      512         19.18
Chain BLSTM        3                      1024        19.38
Chain TDNN-BLSTM   5                      1024        18.64

It is evident from Table 3 that the chain TDNN-BLSTM also performs best on Urdu speech data. In further experiments, it is optimized by varying different parameters to find the best configuration.

4.1.3. Number of layers

Several experiments are carried out to optimize the number of hidden layers of the chain TDNN-BLSTM network. By default, the network consists of 5 hidden layers, of which the first two are time-delay layers and the remaining 3 are bidirectional long short-term memory layers. The number of hidden layers is varied at this stage, and the results are shown in Table 4.
Table 4: Comparison of the number of hidden layers. Cell dimension fixed to 1024, recurrent and non-recurrent projection dimensions 256, delay=-3, decay-time=20, No. of epochs=2

No. of hidden layers   No. of param (M)   %WER
4 (2 TDNN + 2 BLSTM)   51.2               19.80
5 (2 TDNN + 3 BLSTM)   51.2               18.64
6 (2 TDNN + 4 BLSTM)   62.7               18.88
7 (2 TDNN + 5 BLSTM)   74.3               18.92

4.1.4. Hidden layers' size

For the experiments shown in Table 4, the number of neurons in the hidden layers is fixed to 1024 per layer, meaning that 1024 memory cells are used for each of the forward and backward directions. Table 5 compares different layer sizes for the chain TDNN-BLSTM with the best number of hidden layers.

Table 5: Comparison of cell dimensionality (layer size). Number of hidden layers fixed to 5, recurrent and non-recurrent projection dimensions 256, delay=-3, decay-time=20, Number of epochs=2

Layer size   No. of param (M)   %WER
512          26.8               28.6
1024         51.2               18.64
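The parameter counts in Tables 4 and 5 are consistent with a back-of-the-envelope estimate for projected LSTM layers in the style of [20]. The sketch below assumes the standard LSTMP layout (gate weights over the input and the recurrent projection, diagonal peepholes, and a projection from the cell output); Kaldi's exact component counts may differ slightly:

```python
def lstmp_params(cell_dim: int, rec_proj: int, nonrec_proj: int, input_dim: int) -> int:
    """Rough parameter count of one projected LSTM direction, following
    the LSTMP architecture of Sak et al. [20]."""
    gates = 4 * cell_dim * (input_dim + rec_proj) + 4 * cell_dim  # weights + biases
    peepholes = 3 * cell_dim                                      # diagonal w_ic, w_fc, w_oc
    projection = (rec_proj + nonrec_proj) * cell_dim              # cell output -> projections
    return gates + peepholes + projection

# One BLSTM layer = forward + backward direction. Each layer's input is the
# previous layer's 2*(256+256)=1024-dimensional projected output. With the
# Table 4 settings (cell 1024, projections 256/256), each added BLSTM layer
# contributes about 11.5M parameters, matching the 51.2M -> 62.7M step
# between the 5- and 6-layer rows of Table 4.
per_direction = lstmp_params(cell_dim=1024, rec_proj=256, nonrec_proj=256, input_dim=1024)
print(f"{2 * per_direction / 1e6:.1f}M parameters per BLSTM layer")
```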
4.2. Language Modeling

All word error rates reported in the previous section are obtained by decoding with the 3-gram language model. After choosing the best acoustic model configuration, the trigram LM is replaced with a recurrent neural network based language model. For RNNLM training, a TDNN-LSTM network with 3 TDNN and 2 LSTM hidden layers is used, with the layer size fixed to 1024 cells. The best lattice is re-scored using this model, and the WER improves from 18.64% to 16.94%, as shown in Table 6.
5. Discussion

A post-analysis of the decoded output is done by aligning hypothesis and reference texts. It is found that digits are decoded into words and vice versa: sometimes the ASR decodes the digit 6 as the Urdu word /ʧʰe:/ (six) and decodes the word as 6 or ۶ (digit six in Arabic script). In the case of a single digit, error computation penalizes this as one substitution, but for larger numbers the penalty grows. For instance, when decoding the year 2021, the ASR produces /دو ہزار اکیس/ /d̪o: həza:r ɪkki:s/ (two thousand twenty-one in Urdu), which raises a penalty of three words (one substitution and two insertions); in the converse direction, the ASR is penalized one substitution and two deletions.
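These penalties can be verified with the wer_counts sketch from Section 4; the context words here are hypothetical placeholders, used only to make the alignment concrete:

```python
# Reference carries the digit string; the hypothesis spells it out.
# 'saal' and 'mein' are hypothetical context words, not from the paper.
ref = "saal 2021 mein".split()
hyp = "saal do hazaar ikkees mein".split()
subs, dels, ins, wer = wer_counts(ref, hyp)
print(subs, dels, ins)   # -> 1 0 2: one substitution, two insertions
subs, dels, ins, wer = wer_counts(hyp, ref)
print(subs, dels, ins)   # -> 1 2 0: one substitution, two deletions
```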
Similarly, some Urdu words can be written in two alternative ways, and the lexicon contains alternative orthographic representations of the same pronunciation. For instance, the Urdu proper noun /ابراہیم/ /Ibra:hi:m/ can also be written as /ابراھیم/. If the decoded form differs from the one in the reference text, the ASR is penalized one substitution.

Additionally, the ASR intermittently inserts a space in words that are correct with or without the space. Such words are correct acoustically and semantically in both conditions but may contradict the reference text. For example, the word /حقدار/ /həqd̪a:r/ is sometimes decoded as /حق دار/ /həq d̪a:r/, which is correct both acoustically and semantically but counts as an error in the WER calculation.

To compensate for such errors, the acoustic model is retrained after text normalization of the training and test set transcriptions. In the text normalization process, all numbers are standardized to the same format (in words); all words with the same pronunciation but different orthographic representations are replaced with one of the representations; and words that are correct both with and without spaces are replaced with the spaced form. Furthermore, redundant lexicon entries are removed in all these cases. After retraining the acoustic model, the WER is further reduced from 16.94% to 13.50% (shown in Table 6).

Table 6: Comparison of WER after RNNLM re-scoring and text normalization

Model                                    %WER
3-gram LM                                18.64
RNNLM                                    16.94
Text-normalized acoustic model + RNNLM   13.50
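The three normalization rules are described only at the category level; a minimal sketch of such a normalizer might look as follows, assuming hand-built mapping tables (all example entries are illustrative, not taken from the paper's lexicon):

```python
# Hypothetical mapping tables; a real system would populate these from the
# lexicon. All keys and values here are illustrative examples only.
NUMBER_WORDS = {'6': 'chhe', '2021': 'do hazaar ikkees'}   # digits -> words
ORTHO_VARIANTS = {'ابراھیم': 'ابراہیم'}                    # variant -> canonical spelling
SPACE_FORMS = {'حقدار': 'حق دار'}                          # joined -> spaced form

def normalize(sentence: str) -> str:
    out = []
    for token in sentence.split():
        token = ORTHO_VARIANTS.get(token, token)   # one canonical spelling
        token = SPACE_FORMS.get(token, token)      # prefer the spaced form
        token = NUMBER_WORDS.get(token, token)     # render numbers as words
        out.append(token)
    return ' '.join(out)
```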
6. Conclusion

This paper presents the collection of an Urdu speech corpus of 292.5 hours from 1586 speakers. A large vocabulary Urdu speech recognition system is developed using 300 hours of microphone speech data from 1671 speakers. A text corpus of 154 million words is developed for 3-gram and neural network based language modeling. For evaluation of the speech recognition system, a test set of 9.5 hours is collected from 62 unseen speakers. Different state-of-the-art modeling techniques are investigated to develop the Urdu LVCSR system. After evaluation of various acoustic modeling techniques, the TDNN-BLSTM network is chosen to develop the system, and the decoded output lattice is re-scored using an RNNLM. To compensate for errors due to insertion or deletion of spaces and alternative orthographic representations of the same pronunciation, text normalization is applied and the acoustic model is retrained. A minimum WER of 13.50% is achieved on the test set. The speech corpora collected in this work can also be used for the development of various other speech technologies such as Urdu speaker recognition, age estimation and gender identification systems.

7. References

[1] N. Patel, S. Agarwal, N. Rajput, A. Nanavati, P. Dave, and T. S. Parikh, "A comparative study of speech and dialed input voice interfaces in rural India," in SIGCHI Conference on Human Factors in Computing Systems. ACM, 2009.

[2] J. Sherwani, N. Ali, S. Mirza, A. Fatma, Y. Memon, M. Karim, R. Tongia, and R. Rosenfeld, "Healthline: Speech-based access to health information by low-literate users," in ICTD. IEEE, 2007.

[3] British Broadcasting Corporation (BBC), [Online] Available: http://www.bbc.co.uk/languages/other/urdu/guide/facts.shtml (Last accessed on March 12, 2019).

[4] J. Ashraf, N. Iqbal, N. S. Khattak, and A. M. Zaidi, "Speaker Independent Urdu Speech Recognition," in International Conference on Informatics and Systems (INFOS), Cairo, Egypt, 2010.

[5] V. Peddinti, D. Povey, and S. Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[6] L. Besacier, E. Barnard, A. Karpov, and T. Schultz, "Automatic speech recognition for under-resourced languages: A survey," Speech Communication, vol. 56, pp. 85–100, Jan. 2014.

[7] A. Stolcke, "SRILM – an extensible language modeling toolkit," in Proceedings of the International Conference on Spoken Language Processing, vol. 2, pp. 901–904.

[8] L. E. Baum and J. A. Eagon, "An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology," Bulletin of the American Mathematical Society, vol. 73, pp. 360–363, 1967.

[9] M. Gales and S. Young, "The application of hidden Markov models in speech recognition," Foundations and Trends in Signal Processing, vol. 1, no. 3, pp. 195–304, January 2007.

[10] V. Manohar, D. Povey, and S. Khudanpur, "Semi-supervised maximum mutual information training of deep neural network acoustic models," in Proc. Interspeech, Dresden, Germany, Sep. 2015, pp. 2630–2634.

[11] K. Vesely, A. Ghoshal, L. Burget, and D. Povey, "Sequence-discriminative training of deep neural networks," in Proceedings of Interspeech, 2013.

[12] P. Upadhyaya, S. K. Mittal, O. Farooq, Y. V. Varshney, and M. R. Abidi, "Continuous Hindi Speech Recognition Using Kaldi ASR Based on Deep Neural Network," in M. Tanveer and R. Pachori (eds.), Machine Intelligence and Signal Analysis, Advances in Intelligent Systems and Computing, vol. 748. Springer, Singapore, 2019.

[13] D. Dash, M. Kim, K. Teplansky, and J. Wang, "Automatic Speech Recognition with Articulatory Information and a Unified Dictionary for Hindi, Marathi, Bengali and Oriya," in Interspeech 2018.

[14] U. G. Patil, S. D. Shirbahadurkar, and A. N. Paithane, "Automatic Speech Recognition of isolated words in Hindi language using MFCC," in 2016 International Conference on Computing, Analytics and Security Trends (CAST), Dec. 2016, pp. 433–438.

[15] A. Mishra, M. Chandra, A. Biswas, and S. N. Sharan, "Robust features for connected Hindi digits recognition," International Journal of Signal Processing, Image Processing and Pattern Recognition, vol. 4, no. 2, Jun. 2011.

[16] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in ICML, 2010.

[17] R. K. Aggarwal and M. Dave, "Using Gaussian mixtures for Hindi speech recognition system," International Journal of Signal Processing, Image Processing and Pattern Recognition, 2011.

[18] B. Venkataramani, "SOPC-based speech-to-text conversion," in Nios II Embedded Processor Design Contest Outstanding Designs, 2006.

[19] K. V. S. Prasad and S. M. Virk, "Computational evidence that Hindi and Urdu share a grammar but not the lexicon," in 3rd Workshop on South and Southeast Asian NLP, COLING, 2012.

[20] H. Sak, A. W. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Proc. INTERSPEECH, 2014, pp. 338–342.

[21] A. Graves, S. Fernandez, and J. Schmidhuber, "Bidirectional LSTM networks for improved phoneme classification and recognition," in Proc. Int. Conf. Artif. Neural Netw.: Formal Models and Their Appl., 2005, pp. 799–804.

[22] I. McCowan et al., "The AMI meeting corpus," in Proc. 5th Int. Conf. Methods Tech. Behav. Res., 2005, vol. 88.

[23] V. Peddinti, Y. Wang, D. Povey, and S. Khudanpur, "Low latency acoustic modeling using temporal convolution and LSTMs," IEEE Signal Processing Letters, 2017.

[24] H. Xu, T. Chen, D. Gao, Y. Wang, K. Li, N. Goel, Y. Carmiel, D. Povey, and S. Khudanpur, "A pruned RNNLM lattice-rescoring algorithm for automatic speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.

[25] H. Sarfraz, S. Hussain, R. Bokhari, A. A. Raza, I. Ullah, Z. Sarfraz, S. Pervez, A. Mustafa, I. Javed, and R. Parveen, "Speech Corpus Development for a Speaker Independent Spontaneous Urdu Speech Recognition System," in O-COCOSDA, 2010.

[26] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI," in Proc. Interspeech, pp. 2751–2755, 2016.

[27] A. Raza, S. Hussain, H. Sarfraz, I. Ullah, and Z. Sarfraz, "Design and Development of Phonetically Rich Urdu Speech Corpus," in Proceedings of IEEE Oriental COCOSDA International Conference on Speech Database and Assessments, Urumqi, pp. 38–43, 2009.

[28] S. Urooj, S. Hussain, F. Adeeba, F. Jabeen, and R. Perveen, "CLE Urdu Digest Corpus," in Proc. Conference on Language and Technology 2012 (CLT12), Lahore, Pakistan, 2012.

[29] F. Adeeba, Q. Akram, H. Khalid, and S. Hussain, "CLE Urdu Books N-grams," in Proc. Conference on Language and Technology, Karachi, Pakistan, 2014, pp. 87–92.

[30] F. Adeeba, S. Hussain, T. Habib, E. Ul-Haq, and K. S. Shahid, "Comparison of Urdu text to speech synthesis using unit selection and HMM based techniques," presented at Oriental COCOSDA, Bali, Indonesia, 2016.

[31] B. Mumtaz, A. Hussain, S. Hussain, A. Mehmood, R. Bhatti, M. Farooq, and S. Rauf, "Multitier Annotation of Urdu Speech Corpus," in Conference on Language and Technology (CLT14), Karachi, Pakistan, 2014.

[32] H. Sarfraz, S. Hussain, R. Bokhari, A. A. Raza, I. Ullah, Z. Sarfraz, S. Pervez, A. Mustafa, I. Javed, and R. Parveen, "Large vocabulary continuous speech recognition for Urdu," in 8th International Conference on Frontiers of Information Technology. ACM, 2010.

[33] M. A. B. Shaik, Z. Tüske, M. A. Tahir, M. Nussbaum-Thom, R. Schlüter, and H. Ney, "Improvements in RWTH LVCSR evaluation systems for Polish, Portuguese, English, Urdu and Arabic," in Sixteenth Annual Conference of the ISCA, 2015.

[34] M. Qasim, S. Nawaz, S. Hussain, and T. Habib, "Urdu speech recognition system for district names of Pakistan: Development, challenges and solutions," in O-COCOSDA, 2016.

[35] A. A. Raza, A. Athar, S. Randhawa, Z. Tariq, M. B. Saleem, H. B. Zia, U. Saif, and R. Rosenfeld, "Rapid Collection of Spontaneous Speech Corpora Using Telephonic Community Forums," in Proc. Interspeech 2018, pp. 1021–1025.

[36] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi Speech Recognition Toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.