Abstract
This work is an effort towards building a neural speech recognition (NSR) system for Quranic recitations that can be used effectively by anyone, regardless of gender or age. Although many recitations are available online, most are recorded by professional adult male reciters, so an ASR system trained on such datasets is unlikely to work well for female or child reciters. We address this gap by adopting a benchmark dataset of Quranic recitation audio recorded by reciters of both genders and of different ages. Using this dataset, we build several speaker-independent NSR systems based on the DeepSpeech model and evaluate them with the word error rate (WER). The goal is to show how an NSR system trained and tuned on recitations of one gender performs on a test set from the other gender. Because the number of female recitations in our dataset is much smaller than the number of male recitations, in the first set of experiments we avoid the imbalance issue by down-sampling the male part to match the female part. On this small subset of our dataset, the results are striking: the system achieves 0.968 WER when trained on male recitations and tested on female recitations, compared with 0.406 WER when tested on male recitations. Conversely, training the system on female recitations and testing it on male recitations gives 0.966 WER, while testing it on female recitations gives 0.608 WER.
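As a reference for the evaluation metric used throughout, WER is the word-level edit distance (substitutions, deletions, and insertions) between the recognizer's hypothesis and the reference transcript, divided by the number of reference words. The following is a minimal illustrative sketch of this computation, not the authors' actual evaluation code:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains more errors than the reference has words, which is why values such as 0.968 indicate a system that is effectively unusable for the mismatched gender.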
Al-Issa, S., Al-Ayyoub, M., Al-Khaleel, O. et al. Building a neural speech recognizer for quranic recitations. Int J Speech Technol 26, 1131–1151 (2023). https://doi.org/10.1007/s10772-022-09988-3