

INTERSPEECH 2019

September 15–19, 2019, Graz, Austria

Improving Large Vocabulary Urdu Speech Recognition System using Deep Neural Networks
Muhammad Umar Farooq, Farah Adeeba, Sahar Rauf, Sarmad Hussain

Center for Language Engineering,
Al-Khawarizmi Institute of Computer Science,
University of Engineering and Technology, Lahore
{umar.farooq, farah.adeeba, sahar.rauf, sarmad.hussain}@kics.edu.pk

Abstract

Development of a Large Vocabulary Continuous Speech Recognition (LVCSR) system is a cumbersome task, especially for low-resource languages. Urdu is the national language and lingua franca of Pakistan, with 100 million speakers worldwide. Due to resource scarcity, limited work has been done in the domain of Urdu speech recognition. In this paper, the collection of an Urdu speech corpus and the development of an Urdu speech recognition system are presented. The Urdu LVCSR is developed using 300 hours of read speech data with a vocabulary size of 199K words. Microphone speech is recorded from 1671 Urdu and Punjabi speakers in both indoor and outdoor environments. Different acoustic modeling techniques such as Gaussian Mixture Model based Hidden Markov Models (GMM-HMM), Time Delay Neural Networks (TDNN), Long Short-Term Memory (LSTM) and Bidirectional Long Short-Term Memory (BLSTM) networks are investigated. Cross-entropy and Lattice-Free Maximum Mutual Information (LF-MMI) objective functions are employed during acoustic modeling. In addition, a Recurrent Neural Network Language Model (RNNLM) is used for re-scoring. The developed speech recognition system has been evaluated on 9.5 hours of collected test data, and a minimum Word Error Rate (WER) of 13.50% is achieved.

Index Terms: Urdu, ASR, GMM-HMM, DNN-HMM, TDNN, BLSTM, RNNLM, LVCSR

1. Introduction

Automatic Speech Recognition (ASR) is one of the applications of speech and language technologies that converts speech into text. ASR has numerous applications in many fields, such as agriculture [1], health care [2], banking and hotel management, to name a few. Urdu is the national language and lingua franca of Pakistan, bridging people speaking regional languages such as Balochi, Pashto, Punjabi and Sindhi with various dialects. It is spoken by more than a hundred million speakers in Pakistan, India, Bangladesh and parts of Europe [3]. Urdu is a low-resource language, and very little transcribed speech data, text corpora and pronunciation lexicons are publicly available. A robust speech recognition system requires hundreds of hours of transcribed speech data, a very large text corpus and a lexicon.

Speech recognition for low-resource languages has received scant attention in recent years [6]. Hidden Markov Models (HMMs) [8] are the most widely used technique to build acoustic models for speech recognition systems [9]. However, with the resurgence of deep learning, the paradigm has shifted towards Deep Neural Network (DNN) based acoustic and language models. DNN-based acoustic models are trained using alignments produced by HMM models [10]. Additionally, the extensively used n-gram Language Models (LMs) are now being replaced by RNNLMs [24].

Over the years, limited efforts have been made to develop resources and speech technology applications for Urdu. Developments in the domain of Urdu speech recognition started with an isolated speech recognition system [4]. Its speech corpus was collected from 10 speakers, and the vocabulary size of the system was 52 words. A minimum Word Error Rate (WER) of 10.6% was achieved for unseen speakers.

An Urdu continuous speech recognition system [32] was developed using a spontaneous speech corpus collected from 82 speakers [25]. The speech corpus was recorded over telephone and microphone channels. The total duration of the corpus was 45 hours and the vocabulary size was 14K words. A minimum WER of 68.8% was attained. Qasim et al. [34] developed a speaker-independent Urdu speech recognition system for 139 district names of Pakistan. It covered accent variation across Pakistan and attained a minimum WER of 7.44% by building an adapted ASR on field data.

The first Urdu LVCSR was developed on 99 hours of Urdu broadcast data [33]. The vocabulary size of the system was 79K. To build a 5-gram Language Model (LM), a corpus of 266M words was collected from different newspapers. The system was evaluated on an evaluation data set of 0.5 hours, and a minimum WER of 32.6% was achieved by a GMM-HMM based speaker-adapted system.

A. Raza et al. [35] recorded about 1207 hours of speech data from 11017 speakers from all over Pakistan. However, only 9.5 hours of the data were annotated, out of which 8.5 hours were used for the development of a 5K-word ASR. The system was evaluated on 1 hour of speech data and a minimum WER of 24.14% was attained.

Among South Asian languages, most speech recognition work has been done on Hindi [13], which is very similar to spoken Urdu. Isolated word recognition [14], connected digit recognition [15], statistical pattern classification [17], an online speech-to-text engine [18] and large vocabulary speech recognition systems have been developed. Upadhyaya et al. [12] developed a Hindi speech recognition system using deep neural networks on the AMUAV Hindi speech database. This database consists of 1000 phonetically balanced sentences recorded by 100 speakers and covers 54 Hindi phones. The minimum WER achieved was 11.63% using Karel's DNN [11]. Though Hindi speech recognition systems have achieved low WERs, they cannot be used as an alternative to Urdu ASRs due to substantial lexicon differences [19].

This paper presents the collection of a large Urdu speech corpus and the development of a deep neural network based Urdu LVCSR system. Urdu speech data of 292.5 hours is recorded from 1671 speakers and annotated at sentence level. Along with readily available Urdu data, 300 hours of speech data covering a vocabulary size of 199K words are used for the development of the system. Different state-of-the-art techniques for acoustic modeling such as GMM-HMM, Time-Delay Neural Networks (TDNN) [5], Long Short-Term Memory (LSTM) [20], Bidirectional Long Short-Term Memory (BLSTM) [21] and a combination of TDNN and BLSTM networks (TDNN-BLSTM) with cross-entropy and LF-MMI [26] loss functions are investigated to obtain the minimum WER. The lattice-free MMI based TDNN-BLSTM network outperforms other architectures on the AMI [22] and Switchboard [23] corpora of English speech data; its impact on Urdu data is investigated in this work.

To make it accessible for the development of further speech interfaces, the Urdu speech recognition system is available to developers as a web service (available at: https://tech.cle.org.pk/services/speech/asr).
Figure 1: User interface of the recording utility for the first 50 hours

2. Urdu Speech Corpus Development

For Urdu speech collection, a text corpus is designed covering some available Urdu corpora [28, 29, 30], proper nouns, dates, months, news from different categories, 11-digit telephone numbers, 13-digit national identity card numbers, and addresses. A large corpus of news is extracted from different Urdu news channels' websites and tweets. English, being the official language of Pakistan, is frequently mixed with Urdu even in everyday conversations. To cover this mixing, code-switched sentences between Urdu and English are also included. Around one thousand Urdu news items containing the most frequently used English words are extracted. After verification and rephrasing, 779 code-switched sentences are included in the corpus.

Collection and annotation of speech data is a cumbersome task. Initially, a phonetically rich text corpus [30] is used to record around 50 hours of speech data from 182 speakers in supervised recording sessions. This data is recorded in a clean environment through an automated utility, whose user interface is shown in Figure 1. Speakers' information (unique ID, gender and channel) is provided before the start of a recording session. A speaker is asked to record the sentences appearing on screen one by one. On completion of a sentence, the speaker may proceed to the next sentence, or re-record the current sentence in case of mispronunciation or on a linguist's advice. All the recorded data is manually verified by linguists.
After collecting 50 hours of speech data, a baseline GMM-HMM based Urdu ASR system is developed. Another utility, whose interface is shown in Figure 2, is used for further data recording. Speakers' metadata such as unique ID, gender and channel is stored before the session starts. During a recording session, the speaker is directed to record the sentences appearing on screen. On completion of a speech utterance, it is decoded by the ASR. If a sentence is decoded perfectly, it is separated out; otherwise it is rejected. However, sentences with one or two word errors are reconsidered by linguists and are accepted or rejected after manual verification (a minimal sketch of this triage rule is given after Table 1).

During speech corpus collection, gender and channel balancing are taken into consideration. Data is recorded from more than 1650 male and female Punjabi and Urdu speakers in the age group of 18-50 years. All the audio is recorded in WAV format at a sampling rate of 16 kHz using USB microphones, USB headsets, hands-free sets and laptop microphones.

For evaluation, a test corpus of 9.5 hours of Urdu speech is collected from 62 speakers. It is ensured that both the text corpus and the speakers are unseen. The test text is designed in the same way as the training data to ensure a balanced text corpus. The speech corpus recorded for test data is also gender and channel balanced. A summary of corpus collection for training and evaluation is given in Table 1.

Table 1: Statistics of Urdu speech corpus

                             Training Data          Testing Data
Total Duration (in Hours)    292.5                  9.5
Number of speakers           1586                   62
Channels                     USB microphone,        USB microphone,
                             USB headset,           USB headset,
                             hands-free, laptop     hands-free, laptop
                             microphone             microphone
Age Group                    18-50                  18-50
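The triage rule used by the second recording utility can be summarized in a few lines. Below is a minimal sketch, assuming the baseline ASR transcript for each prompt is already available as a string; the word-error count is a plain Levenshtein distance over words.

    def word_errors(ref, hyp):
        """Word-level Levenshtein distance between reference and hypothesis."""
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(ref)][len(hyp)]

    def triage(prompt, transcript):
        """Accept perfect decodes; route 1-2 word errors to a linguist."""
        errors = word_errors(prompt.split(), transcript.split())
        if errors == 0:
            return "accept"          # perfectly decoded: separated out
        if errors <= 2:
            return "manual_review"   # reconsidered by linguists
        return "reject"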

Figure 2: User interface of the recording utility: (a) speaker information form for recording, (b) recording interface

3. Experimental Setup

3.1. Lexicon

The Urdu ASR is developed with a vocabulary size of 199K words. It includes 106K words from a readily available Urdu lexicon [30] and 93K words added during corpus development.

3.2. Acoustic Modeling

For training of the acoustic model, 300 hours of speech data are used, including readily available Urdu speech corpora [25, 27, 30, 31] (8.5 hours) and 292.5 hours of newly recorded data. Using this speech data, a baseline GMM-HMM acoustic model is built using Mel-Frequency Cepstral Coefficient (MFCC) features with 40 coefficients (high-resolution MFCCs). This model is used to get alignments for DNN training. TDNN, LF-MMI based TDNN, LSTM, BLSTM and hybrid TDNN-BLSTM deep networks are investigated. The best model among all is selected and fine-tuned by varying different parameters such as the number of hidden layers and cell dimensions. At each step, the best configuration is chosen to proceed further. At the end, the output lattice from the best-tuned configuration is selected for re-scoring using a recurrent neural network language model. The Kaldi toolkit [36] is used for Urdu ASR development.
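As a rough sketch of the front end, the snippet below extracts 40-dimensional high-resolution MFCCs and appends a 100-dimensional i-vector to every frame, using librosa as a stand-in for Kaldi's feature pipeline; the file name, the 25 ms/10 ms frame settings and the zero-valued i-vector placeholder are illustrative assumptions (Kaldi's online extractor supplies the real i-vectors).

    import librosa
    import numpy as np

    y, sr = librosa.load("utterance.wav", sr=16000)  # corpus is 16 kHz WAV
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40,
                                n_fft=400, hop_length=160)  # 25 ms / 10 ms
    ivector = np.zeros(100)  # placeholder for the speaker i-vector
    frames = np.concatenate(
        [mfcc, np.tile(ivector[:, None], (1, mfcc.shape[1]))])
    print(frames.shape)  # (140, n_frames): 40 MFCCs + 100 i-vector dims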
3.3. Language Modeling

The SRI Language Modeling (SRILM) toolkit [7] is used for building a trigram language model. A very large corpus is collected by crawling a large number of Urdu websites covering categories such as news, magazines, books and blogs. The corpus is sentence-tokenized and cleaned to retain only Urdu and code-switched sentences. This collected corpus, along with readily available Urdu corpora [28, 29], is used for language modeling. It contains around 154 million Urdu words forming 35 million trigrams. This corpus is also used for training the RNNLM that is combined with the best acoustic model.
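As a toy illustration of the statistics behind such a model (the actual LM is estimated with SRILM, and smoothing is omitted here), trigrams can be counted from tokenized sentences as follows; the two-line corpus is a stand-in.

    from collections import Counter

    def count_trigrams(sentences):
        """Count trigrams over sentences padded with boundary symbols."""
        counts = Counter()
        for sent in sentences:
            tokens = ["<s>", "<s>"] + sent.split() + ["</s>"]
            for i in range(len(tokens) - 2):
                counts[tuple(tokens[i:i + 3])] += 1
        return counts

    corpus = ["urdu speech recognition", "urdu speech corpus"]
    trigrams = count_trigrams(corpus)
    print(len(trigrams), trigrams.most_common(1))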
4. Experimental Results

Various experiments with different configurations of deep neural networks for acoustic modeling are performed. Results are reported in terms of Word Error Rate (WER).

4.1. Acoustic Modeling

4.1.1. GMM-HMM

Using 300 hours of training data, a baseline GMM-HMM system using high-resolution MFCCs is built. Furthermore, speaker-independent Linear Discriminant Analysis (LDA) and Maximum Likelihood Linear Transform (MLLT) transforms are investigated. On top of LDA+MLLT, Speaker Adapted Training (SAT) is done and its alignments are used for DNN training. A summary of the WERs of the baseline systems is given in Table 2.

Table 2: %WER of GMM-HMM Urdu ASR

Model                  %WER
GMM                    46.87
GMM+LDA+MLLT           37.26
GMM+LDA+MLLT+SAT       32.24

4.1.2. Deep Neural Networks (DNNs)

On the alignments from SAT training, various deep neural networks are trained. DNN training is done with high-resolution MFCCs and an i-vector of dimensionality 100 for each sample. The network consists of 7 hidden layers. The first layer is a fixed affine layer, whereas the rest of the hidden layers are TDNN layers with a cell dimensionality of 1024. ReLU-renorm [16] is used as the activation function. An experiment with LF-MMI based TDNN is also done, and results show that it performs better than the cross-entropy based TDNN. Therefore, LF-MMI based networks are used for the rest of the DNN experiments. Lattice-free MMI based TDNN, LSTM and BLSTM networks are trained. Furthermore, a reduced frame rate is used in decoding to speed up the process. These networks are termed chain models in Kaldi. A comparison of all these experiments is shown in Table 3.

Table 3: Comparison of %WER of different deep neural networks for Urdu ASR. Number of epochs fixed to 2

Model               No. of hidden layers    Cell dim.    %WER
TDNN                7                       1024         21.33
Chain TDNN          8                       625          19.92
Chain TDNN-LSTM     7                       512          19.18
Chain BLSTM         3                       1024         19.38
Chain TDNN-BLSTM    5                       1024         18.64

It is evident from Table 3 that the chain TDNN-BLSTM outperforms the other networks on Urdu speech data as well. In further experiments, it is optimized by varying different parameters to achieve the best configuration.
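For intuition, a single TDNN layer is a linear transform over a spliced window of neighboring frames (equivalently, a dilated 1-D convolution) followed by a non-linearity. The numpy sketch below shows this with symmetric offsets {-1, 0, 1}; the input dimensionality echoes the 140-dimensional frames (40 MFCCs plus a 100-dimensional i-vector) and the 1024-unit layer size from Table 3, but the offsets and random weights are illustrative assumptions, not the exact recipe.

    import numpy as np

    def tdnn_layer(x, weights, bias, offsets=(-1, 0, 1)):
        """x: (T, d_in); weights: (len(offsets)*d_in, d_out); ReLU output."""
        T, _ = x.shape
        lo, hi = -min(offsets), max(offsets)
        spliced = np.concatenate(
            [x[lo + o: T - hi + o] for o in offsets], axis=1)
        return np.maximum(spliced @ weights + bias, 0.0)

    rng = np.random.default_rng(0)
    x = rng.normal(size=(100, 140))              # 100 frames of MFCC+i-vector
    w = rng.normal(size=(3 * 140, 1024)) * 0.01  # 1024-dim hidden layer
    h = tdnn_layer(x, w, np.zeros(1024))
    print(h.shape)  # (98, 1024): the context trims one frame at each end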

4.1.3. Number of layers

Several experiments are done to optimize the number of hidden layers for the chain TDNN-BLSTM network. By default, the network consists of 5 hidden layers, among which the first two are time-delay layers while the remaining 3 are bidirectional long short-term memory layers. The number of hidden layers is varied at this stage, and results are shown in Table 4.

Table 4: Comparison of number of hidden layers. Cell dimensions fixed to 1024, recurrent and non-recurrent projection dimension 256, delay=-3, decay-time=20, number of epochs=2

No. of hidden layers     No. of param (M)    %WER
4 (2 TDNN + 2 BLSTM)     51.2                19.80
5 (2 TDNN + 3 BLSTM)     51.2                18.64
6 (2 TDNN + 4 BLSTM)     62.7                18.88
7 (2 TDNN + 5 BLSTM)     74.3                18.92
4.1.4. Hidden layers' size

For the experiments shown in Table 4, the number of neurons in the hidden layers is fixed to 1024 per layer, which means 1024 memory cells are used for each of the forward and backward directions. Table 5 compares different layer sizes for the chain TDNN-BLSTM with the best number of hidden layers.

Table 5: Comparison of cell dimensionality (layer size). Number of hidden layers fixed to 5, recurrent and non-recurrent projection dimension 256, delay=-3, decay-time=20, number of epochs=2

Layer size    No. of param (M)    %WER
512           26.8                28.6
1024          51.2                18.64
4.2. Language Modeling

All word error rates reported in the last section are decoded using the 3-gram language model. After choosing the best configuration of the acoustic model, the trigram LM is replaced with the recurrent neural network based language model. For RNNLM training, a TDNN-LSTM network is used with 3 TDNN and 2 LSTM hidden layers, and the layer size is fixed to 1024 cells. The best lattice is re-scored using this model, and the WER improves from 18.64% to 16.94%, as shown in Table 6.
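The actual system rescores Kaldi lattices with a pruned RNNLM algorithm [24]; as a simplified stand-in, the sketch below re-ranks an n-best list by interpolating a hypothetical RNNLM log-probability with the original decoder score. The 0.5 interpolation weight and the toy language model are assumptions for illustration only.

    def rescore_nbest(nbest, rnnlm_logprob, weight=0.5):
        """nbest: list of (hypothesis, base_score); higher is better."""
        rescored = [(hyp, (1 - weight) * base + weight * rnnlm_logprob(hyp))
                    for hyp, base in nbest]
        return max(rescored, key=lambda pair: pair[1])[0]

    def toy_lm(hyp):
        return -len(hyp.split())  # placeholder scoring function

    nbest = [("hyp one", -10.0), ("hypothesis two three", -9.5)]
    print(rescore_nbest(nbest, toy_lm))  # -> "hyp one"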
5. Discussion

A post analysis is done on the decoded output by aligning hypothesis and reference texts. It is found that digits are decoded into words and vice versa. Sometimes, the ASR decodes the digit 6 as the word /tʃʰeː/ (six in Urdu), and decodes /tʃʰeː/ as 6 or as the Arabic-script digit ٦. In the case of a single digit, error computation penalizes this as one substitution. However, for larger numbers, the penalty is higher. For instance, when decoding the year 2021, the ASR outputs /d̪oː həzaːr ɪkkiːs/ (two thousand twenty-one in Urdu), which incurs a penalty of three words (one substitution and two insertions). Conversely, the ASR is penalized one substitution and two deletions.
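This penalty arithmetic falls directly out of the standard word-level alignment used for WER scoring. The sketch below tracks substitution, deletion and insertion counts through the edit-distance table and reproduces the year-2021 case (words romanized for readability).

    def align_counts(ref, hyp):
        """Return (substitutions, deletions, insertions) for a word alignment."""
        n, m = len(ref), len(hyp)
        d = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]  # (cost, S, D, I)
        for i in range(1, n + 1):
            d[i][0] = (i, 0, i, 0)       # all deletions
        for j in range(1, m + 1):
            d[0][j] = (j, 0, 0, j)       # all insertions
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                c, s, dl, ins = d[i - 1][j - 1]
                sub = int(ref[i - 1] != hyp[j - 1])
                match = (c + sub, s + sub, dl, ins)
                delete = tuple(a + b for a, b in zip(d[i - 1][j], (1, 0, 1, 0)))
                insert = tuple(a + b for a, b in zip(d[i][j - 1], (1, 0, 0, 1)))
                d[i][j] = min(match, delete, insert)
        return d[n][m][1:]

    # Reference "2021" vs. hypothesis "do hazar ikkis":
    print(align_counts(["2021"], ["do", "hazar", "ikkis"]))  # (1, 0, 2)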
Similarly, some Urdu words can be written in two alternate ways, and the lexicon contains alternate orthographic representations of the same pronunciation. For instance, the proper noun /ɪbraːhiːm/ (Ibrahim) can be written in two different Urdu orthographic forms. If the decoded form is different from the one in the reference text, the ASR is penalized one substitution.

Additionally, the ASR intermittently inserts a space in some words that are correct with or without a space. Such words are correct acoustically and semantically in both conditions but may contradict the reference text. For example, the word /həqd̪aːr/ written as one word is sometimes decoded as /həq d̪aːr/ with a space, which is correct both acoustically and semantically but counts as an error in WER calculation.

To compensate for such errors, retraining is done after text normalization of the training and test set transcriptions. In the text normalization process, all numbers are standardized to the same format (in words). All words with the same pronunciation but different orthographic representations are replaced with one of the representations. Words that are correct both with and without spaces are replaced with the variant having spaces. Furthermore, for all these cases, redundant entries are removed from the lexicon. After retraining the acoustic model, the WER is further reduced from 16.94% to 13.50% (shown in Table 6). A minimal sketch of this normalization pass follows Table 6.

Table 6: Comparison of WER after RNNLM and text normalization

Model                                      %WER
3-gram LM                                  18.64
RNNLM                                      16.94
Text normalized acoustic model + RNNLM     13.50
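The following is a minimal sketch of this normalization pass, assuming romanized stand-in entries; the real mapping tables cover Urdu script and were compiled during manual verification.

    import re

    NUMBER_WORDS = {"6": "chhay", "2021": "do hazar ikkis"}  # digits -> words
    SPACE_VARIANTS = {"haqdar": "haq dar"}  # keep the variant with spaces
    # Entries above are illustrative assumptions, not the project's real lists.

    def normalize(text):
        # Standardize every number to its word form.
        tokens = [NUMBER_WORDS.get(tok, tok) for tok in text.split()]
        text = " ".join(tokens)
        # Map alternate orthographic forms to one canonical representation.
        for variant, canonical in SPACE_VARIANTS.items():
            text = text.replace(variant, canonical)
        return re.sub(r"\s+", " ", text).strip()

    print(normalize("6 haqdar 2021"))  # -> "chhay haq dar do hazar ikkis"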

6. Conclusion

This paper presents the collection of an Urdu speech corpus of 292.5 hours from 1586 speakers. A large vocabulary Urdu speech recognition system is developed using 300 hours of microphone speech data from 1671 speakers. A text corpus of 154 million words is developed for 3-gram and neural network based language modeling. For evaluation of the speech recognition system, a test data set of 9.5 hours is collected from 62 unseen speakers. Different state-of-the-art modeling techniques are investigated to develop the Urdu LVCSR system. After evaluation of various techniques for acoustic modeling, the TDNN-BLSTM network is chosen to develop the system. The decoded output lattice is re-scored using an RNNLM. To compensate for errors due to insertion or deletion of spaces and alternate orthographic representations of the same pronunciation, text normalization is done and the acoustic model is retrained. A minimum WER of 13.50% is achieved on the test data set. The speech corpora collected in this work can also be used for the development of various other speech technologies such as Urdu speaker recognition, age estimation and gender identification systems.

7. References

[1] N. Patel, S. Agarwal, N. Rajput, A. Nanavati, P. Dave, and T. S. Parikh, "A comparative study of speech and dialed input voice interfaces in rural India," in SIGCHI Conference on Human Factors in Computing Systems, ACM, 2009.

[2] J. Sherwani, N. Ali, S. Mirza, A. Fatma, Y. Memon, M. Karim, R. Tongia, and R. Rosenfeld, "Healthline: Speech-based access to health information by low-literate users," in ICTD, IEEE, 2007.

[3] British Broadcasting Corporation (BBC), [Online] Available: http://www.bbc.co.uk/languages/other/urdu/guide/facts.shtml (last accessed March 12, 2019).

[4] J. Ashraf, N. Iqbal, N. S. Khattak, A. M. Zaidi, "Speaker Independent Urdu Speech Recognition," in International Conference on Informatics and Systems (INFOS), Cairo, Egypt, 2010.
[5] V. Peddinti, D. Povey, S. Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[6] L. Besacier, E. Barnard, A. Karpov, T. Schultz, "Automatic speech recognition for under-resourced languages: A survey," Speech Communication, vol. 56, pp. 85–100, Jan. 2014.

[7] A. Stolcke, "SRILM – an extensible language modeling toolkit," in Proceedings of the International Conference on Spoken Language Processing, vol. 2, pp. 901–904, 2002.

[8] L. E. Baum, J. A. Eagon, "An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology," Bulletin of the American Mathematical Society, vol. 73, pp. 360–363, 1967.

[9] M. Gales and S. Young, "The application of hidden Markov models in speech recognition," Foundations and Trends in Signal Processing, vol. 1, no. 3, pp. 195–304, January 2007.

[10] V. Manohar, D. Povey, and S. Khudanpur, "Semi-supervised maximum mutual information training of deep neural network acoustic models," in Proc. Interspeech, Dresden, Germany, Sep. 2015, pp. 2630–2634.

[11] K. Vesely, A. Ghoshal, L. Burget, D. Povey, "Sequence-discriminative training of deep neural networks," in Proceedings of Interspeech, 2013.

[12] P. Upadhyaya, S. K. Mittal, O. Farooq, Y. V. Varshney, M. R. Abidi, "Continuous Hindi Speech Recognition Using Kaldi ASR Based on Deep Neural Network," in M. Tanveer and R. Pachori (eds.), Machine Intelligence and Signal Analysis, Advances in Intelligent Systems and Computing, vol. 748, Springer, Singapore, 2019.

[13] D. Dash, M. Kim, K. Teplansky, J. Wang, "Automatic Speech Recognition with Articulatory Information and a Unified Dictionary for Hindi, Marathi, Bengali and Oriya," in Interspeech, 2018.

[14] U. G. Patil, S. D. Shirbahadurkar, A. N. Paithane, "Automatic Speech Recognition of isolated words in Hindi language using MFCC," in 2016 International Conference on Computing, Analytics and Security Trends (CAST), Dec 2016, pp. 433–438.

[15] A. Mishra, M. Chandra, A. Biswas, S. N. Sharan, "Robust features for connected Hindi digits recognition," International Journal of Signal Processing, Image Processing and Pattern Recognition, vol. 4, no. 2, Jun 2011.

[16] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in ICML, 2010.

[17] R. K. Aggarwal, M. Dave, "Using Gaussian mixtures for Hindi speech recognition system," International Journal of Signal Processing, Image Processing and Pattern Recognition, 2011.

[18] B. Venkataramani, "SOPC-based speech-to-text conversion," in Nios II Embedded Processor Design Contest Outstanding Designs, 2006.

[19] K. V. S. Prasad and S. M. Virk, "Computational evidence that Hindi and Urdu share a grammar but not the lexicon," in 3rd Workshop on South and Southeast Asian NLP, COLING, 2012.

[20] H. Sak, A. W. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Proc. INTERSPEECH, 2014, pp. 338–342.

[21] A. Graves, S. Fernandez, and J. Schmidhuber, "Bidirectional LSTM networks for improved phoneme classification and recognition," in Proc. Int. Conf. Artificial Neural Networks: Formal Models and Their Applications, 2005, pp. 799–804.

[22] I. McCowan et al., "The AMI meeting corpus," in Proc. 5th Int. Conf. Methods and Techniques in Behavioral Research, 2005, vol. 88.

[23] V. Peddinti, Y. Wang, D. Povey, and S. Khudanpur, "Low latency acoustic modeling using temporal convolution and LSTMs," IEEE Signal Processing Letters, 2017.

[24] H. Xu, T. Chen, D. Gao, Y. Wang, K. Li, N. Goel, Y. Carmiel, D. Povey, S. Khudanpur, "A pruned RNNLM lattice-rescoring algorithm for automatic speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018.

[25] H. Sarfraz, S. Hussain, R. Bokhari, A. A. Raza, I. Ullah, Z. Sarfraz, S. Pervez, A. Mustafa, I. Javed, R. Parveen, "Speech Corpus Development for a Speaker Independent Spontaneous Urdu Speech Recognition System," in O-COCOSDA, 2010.

[26] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI," in Proc. Interspeech, pp. 2751–2755, 2016.

[27] A. Raza, S. Hussain, H. Sarfraz, I. Ullah, and Z. Sarfraz, "Design and Development of Phonetically Rich Urdu Speech Corpus," in Proceedings of IEEE Oriental COCOSDA International Conference on Speech Database and Assessments, Urumqi, pp. 38–43, 2009.

[28] S. Urooj, S. Hussain, F. Adeeba, F. Jabeen, R. Perveen, "CLE Urdu Digest Corpus," in Proceedings of the Conference on Language and Technology 2012 (CLT12), Lahore, Pakistan, 2012.

[29] F. Adeeba, Q. Akram, H. Khalid, and S. Hussain, "CLE Urdu Books N-grams," in Proc. Conference on Language and Technology, Karachi, Pakistan, 2014, pp. 87–92.

[30] F. Adeeba, S. Hussain, T. Habib, E. Ul-Haq, K. S. Shahid, "Comparison of Urdu text to speech synthesis using unit selection and HMM based techniques," presented at Oriental COCOSDA, Bali, Indonesia, 2016.

[31] B. Mumtaz, A. Hussain, S. Hussain, A. Mehmood, R. Bhatti, M. Farooq, S. Rauf, "Multitier Annotation of Urdu Speech Corpus," in Conference on Language and Technology (CLT14), Karachi, Pakistan, 2014.

[32] H. Sarfraz, S. Hussain, R. Bokhari, A. A. Raza, I. Ullah, Z. Sarfraz, S. Pervez, A. Mustafa, I. Javed, R. Parveen, "Large vocabulary continuous speech recognition for Urdu," in 8th International Conference on Frontiers of Information Technology, ACM, 2010.

[33] M. A. B. Shaik, Z. Tüske, M. A. Tahir, M. Nussbaum-Thom, R. Schlüter, H. Ney, "Improvements in RWTH LVCSR evaluation systems for Polish, Portuguese, English, Urdu and Arabic," in Sixteenth Annual Conference of the ISCA, 2015.

[34] M. Qasim, S. Nawaz, S. Hussain, T. Habib, "Urdu speech recognition system for district names of Pakistan: Development, challenges and solutions," in O-COCOSDA, 2016.

[35] A. A. Raza, A. Athar, S. Randhawa, Z. Tariq, M. B. Saleem, H. B. Zia, U. Saif, R. Rosenfeld, "Rapid Collection of Spontaneous Speech Corpora Using Telephonic Community Forums," in Proc. Interspeech 2018, pp. 1021–1025.

[36] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, K. Vesely, "The Kaldi Speech Recognition Toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, IEEE Signal Processing Society, 2011.
