Abstract
The increased need for foreign language learning, along with advances in speech technology, has heightened interest in computer-assisted pronunciation teaching (CAPT) applications. In such applications, the automatic diagnosis of pronunciation errors is essential: it allows language learners to identify their mispronunciations and thus improve their oral skills. Meanwhile, the emergence of deep learning algorithms for speech processing has led to the use of deep neural networks at several stages of the mispronunciation detection and diagnosis (MDD) process. An overview of the state of the art in deep learning algorithms for mispronunciation diagnosis is therefore needed, and we performed a systematic literature review to provide one. This study aims to give an overview of the recent use of deep neural networks for MDD. The review includes a thorough statistical analysis, conducted by extracting specific information from 53 papers published between 2015 and 2023. It indicates that the diagnosis of pronunciation errors is a highly active area of research. Many deep learning models and approaches have been proposed in this area, but some important open issues and limitations remain to be addressed in future work.
Data availability
Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.
Ethics declarations
Competing interests
The authors have no competing interests to declare that are relevant to the article’s content.
Conflict of interests
Not applicable.
About this article
Cite this article
Lounis, M., Dendani, B. & Bahi, H. Mispronunciation detection and diagnosis using deep neural networks: a systematic review. Multimed Tools Appl 83, 62793–62827 (2024). https://doi.org/10.1007/s11042-023-17899-x
DOI: https://doi.org/10.1007/s11042-023-17899-x