Abstract
Automatic spelling correction is a central challenge in natural language processing (NLP). Traditional spelling correction techniques typically detect and correct only non-word errors, that is, misspellings that do not form a valid word. Context-sensitive errors, also known as real-word errors, are harder to detect because they are valid words used incorrectly in a given context. The Persian language, with its rich morphology and complex syntax, poses particular difficulties for automatic spelling correction systems, and the limited availability of Persian language resources makes it hard to train effective correction models. This paper introduces an approach for accurate and efficient real-word error correction in Persian text. The method follows a structured, multi-tiered design that combines semantic analysis, feature selection, and advanced classifiers to improve error detection and correction. The architecture discovers and stores semantic similarities between words and phrases in Persian text. The classifiers identify real-word errors, and a semantic ranking algorithm then determines the most probable correction for each error, taking into account context, semantic similarity, and edit-distance measures. In our evaluations, the proposed method surpasses previous Persian real-word error correction models, achieving an F-measure of 96.6% in the detection phase and an accuracy of 99.1% in the correction phase. These results indicate that the approach is a promising solution for automatic real-word error correction in Persian text.
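The ranking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`damerau_levenshtein`, `rank_candidates`, `toy_similarity`), the weights `w_sem` and `w_edit`, the linear scoring formula, and the toy English lexicon are all assumptions chosen for illustration. It only shows how candidates within a small edit distance of a suspect word might be ordered by combining semantic similarity to the context with an edit-distance measure.

```python
from typing import Callable, List, Tuple


def damerau_levenshtein(a: str, b: str) -> int:
    """Optimal string alignment distance: insertion, deletion,
    substitution, and transposition of two adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def rank_candidates(suspect: str,
                    context: List[str],
                    lexicon: List[str],
                    similarity: Callable[[str, str], float],
                    max_distance: int = 2,
                    w_sem: float = 0.7,    # hypothetical weight
                    w_edit: float = 0.3    # hypothetical weight
                    ) -> List[Tuple[str, float]]:
    """Score every lexicon word close to `suspect` and sort best-first."""
    scored = []
    for cand in lexicon:
        dist = damerau_levenshtein(suspect, cand)
        if cand == suspect or dist > max_distance:
            continue
        # Mean semantic similarity between the candidate and the context words.
        sem = (sum(similarity(cand, c) for c in context) / len(context)
               if context else 0.0)
        # Closer spellings receive a larger edit-distance score.
        edit = 1.0 / (1.0 + dist)
        scored.append((cand, w_sem * sem + w_edit * edit))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy similarity table standing in for a learned semantic model (assumption).
TOY_SIM = {frozenset(("piece", "cake")): 0.8,
           frozenset(("peach", "cake")): 0.4}


def toy_similarity(a: str, b: str) -> float:
    return TOY_SIM.get(frozenset((a, b)), 0.0)


# "a peace of cake": the suspect word is "peace", the context word is "cake".
ranked = rank_candidates("peace", ["cake"],
                         ["peace", "piece", "pace", "peach"],
                         toy_similarity)
best = ranked[0][0]
```

In this toy setting the candidate with the strongest semantic link to the context ranks first even though it is farther from the suspect word in edit distance, which is the kind of trade-off a combined ranking is meant to capture. A real system would replace `toy_similarity` with similarities learned from a Persian corpus.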






Data availability
The dataset supporting this article is from previously reported studies and datasets (Hamshahri corpus), which have been cited. The data are available at:
Notes
Available at: http://aspell.net.
Available at: https://sourceforge.net/projects/jazzy.
All pronunciations have been provided in International Phonetic Alphabet (IPA).
References
Wilcox-O’Hearn A, Hirst G, Budanitsky A (2008) Real-word spelling correction with trigrams: a reconsideration of the Mays, Damerau, and Mercer model. In: International conference on intelligent text processing and computational linguistics, Springer
Hirst G, Budanitsky A (2005) Correcting real-word spelling errors by restoring lexical cohesion. Nat Lang Eng 11(1):87–111
Deng L, Huang X (2004) Challenges in adopting speech recognition. Commun ACM 47(1):69–75
Jurafsky D, Martin JH (2008) Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition, 2nd edn. Prentice-Hall, Hoboken NJ
Bassil Y, Alwani M (2012) OCR context-sensitive error correction based on Google Web 1T 5-gram data set. arXiv preprint arXiv:1204.0188
Hartley RT, Crumpton K (1999) Quality of OCR for degraded text images. arXiv preprint cs/9902009
Huang Y, Murphey YL, Ge Y (2013) Automotive diagnosis typo correction using domain knowledge and machine learning. In: 2013 IEEE symposium on computational intelligence and data mining (CIDM), IEEE
Kukich K (1993) Techniques for automatically correcting words in text (abstract). In: ACM Annual computer science conference: proceedings of the 1993 ACM conference on computer science
Damerau FJ (1964) A technique for computer detection and correction of spelling errors. Commun ACM 7(3):171–176
Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl 10(8):707–710
Atkinson K (2006) GNU Aspell 0.60.4. Retrieved from http://aspell.net
Idzelis M, Galbraith B (2005) Jazzy: the Java open source spell checker. Retrieved 2019/10/10 from http://jazzy.sourceforge.net
Crowell J et al (2004) A frequency-based technique to improve the spelling suggestion rank in medical queries. J Am Med Inform Assoc 11(3):179–185
Mitton R (2009) Ordering the suggestions of a spellchecker without using context. Nat Lang Eng 15(2):173–192
Turchin A et al (2007) Identification of misspelled words without a comprehensive dictionary using prevalence analysis. In: AMIA annual symposium proceedings. American Medical Informatics Association
Church KW, Gale WA (1991) Probability scoring for spelling correction. Stat Comput 1(2):93–103
Flor M, Futagi Y (2012) On using context for automatic correction of non-word misspellings in student essays. In: Proceedings of the seventh workshop on building educational applications using NLP
Lai KH et al (2015) Automated misspelling detection and correction in clinical free-text records. J Biomed Inform 55:188–195
Norvig P (2009) Natural language corpus data. Beautiful data. O’Reilly Media Inc, Sebastopol, pp 219–242
Wilbur WJ, Kim W, Xie N (2006) Spelling correction in the PubMed search engine. Inf Retr 9(5):543–564
Dashti SMS et al (2014) Toward a thesis in automatic context-sensitive spelling correction
Cauteruccio F et al (2022) Extraction and analysis of text patterns from NSFW adult content in Reddit. Data Knowl Eng 138:101979
Bonifazi G et al (2022) A space-time framework for sentiment scope analysis in social media. Big Data Cognit Comput 6(4):130
Mays E, Damerau FJ, Mercer RL (1991) Context based spelling correction. Inf Process Manag 27(5):517–522
Samanta P, Chaudhuri BB (2013) A simple real-word error detection and correction using local word bigram and trigram. In: Proceedings of the 25th conference on computational linguistics and speech processing (ROCLING 2013)
Wilcox-O’Hearn LA (2014) Detection is the central problem in real-word spelling correction. arXiv preprint arXiv:1408.3153
Dashti SM, Khatibi Bardsiri A, Khatibi Bardsiri V (2018) Correcting real-word spelling errors: a new hybrid approach. Digit Scholarsh Humanit 33(3):488–499
Dashti SM (2018) Real-word error correction with trigrams: correcting multiple errors in a sentence. Lang Resour Eval 52(2):485–502
Kilicoglu H et al (2015) An ensemble method for spelling correction in consumer health questions. In: AMIA annual symposium proceedings, American Medical Informatics Association
Pande H (2017) Effective search space reduction for spell correction using character neural embeddings. In: Proceedings of the 15th conference of the European chapter of the association for computational linguistics, vol 2, Short Papers
Fivez P, Suster S, Daelemans W (2017) Unsupervised context-sensitive spelling correction of clinical free-text with word and character n-gram embeddings. In: BioNLP 2017
Lu CJ et al (2019) Spell checker for consumer language (CSpell). J Am Med Inform Assoc 26(3):211–218
Hu Y et al (2020) Misspelling correction with pre-trained contextual language model. In: 2020 IEEE 19th international conference on cognitive informatics & cognitive computing (ICCI*CC), IEEE
Lee J-H, Kim M, Kwon H-C (2020) Deep learning-based context-sensitive spelling typing error correction. IEEE Access 8:152565–152578
Sun R, Wu X, Wu Y (2023) An error-guided correction model for Chinese spelling error correction. arXiv preprint arXiv:2301.06323
Jayanthi SM, Pruthi D, Neubig G (2020) Neuspell: a neural spelling correction toolkit. arXiv preprint arXiv:2010.11085
Ji T, Yan H, Qiu X (2021) SpellBERT: a lightweight pretrained model for Chinese spelling check. In: Proceedings of the 2021 conference on empirical methods in natural language processing
Liu S et al (2021) PLOME: Pre-training with misspelled knowledge for Chinese spelling correction. In: Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing, vol 1, Long Papers
Zhang R et al (2021) Correcting Chinese spelling errors with phonetic pre-training. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Association for Computational Linguistics, Dublin Ireland
Tran K et al (2022) Vietnamese electronic medical record management with text preprocessing for spelling errors. In: 2022 9th NAFOSTED conference on information and computer science (NICS), IEEE
Wang X et al (2022) Towards contextual spelling correction for customization of end-to-end speech recognition systems. IEEE/ACM Trans Audio Speech Lang Process 30:3089–3097
Zhu C et al (2022) MDCSpell: a multi-task detector-corrector framework for Chinese spelling correction. Findings of the association for computational linguistics: ACL 2022. Association for Computational Linguistics, Dublin Ireland
Liu S et al (2022) CRASpell: a contextual typo robust approach to improve Chinese spelling correction. Findings of the association for computational linguistics: ACL 2022. Association for Computational Linguistics, Dublin Ireland
Salhab M, Abu-Khzam F (2023) AraSpell: a deep learning approach for Arabic spelling correction
Wu H et al (2023) Rethinking masked language modeling for Chinese spelling correction. arXiv preprint arXiv:2305.17721
Liang Z, Quan X, Wang Q (2023) Disentangled phonetic representation for Chinese spelling correction. arXiv preprint arXiv:2305.14783
Mosavi Miangah T (2014) FarsiSpell: a spell-checking system for Persian using a large monolingual corpus. Lit Linguist Comput 29(1):56–73
Kashefi O, Sharifi M, Minaie B (2013) A novel string distance metric for ranking Persian respelling suggestions. Nat Lang Eng 19(2):259–284
Shamsfard M, Jafari HS, Ilbeygi M (2010) STeP-1: a set of fundamental tools for Persian text processing. In: Proceedings of the 7th international conference on language resources and evaluation (LREC'10)
Shamsfard M (2011) Challenges and open problems in Persian text processing. Proceedings of LTC 11:65–69
Ghayoomi M, Assi SM (2005) Word prediction in a running text: a statistical language modeling for the Persian language. In: Proceedings of the Australasian language technology workshop 2005
Naseem T, Hussain S (2007) A novel approach for ranking spelling error corrections for Urdu. Lang Resour Eval 41(2):117–128
Faili H et al (2016) Vafa spell-checker for detecting spelling, grammatical, and real-word errors of Persian language. Digit Scholarsh Humanit 31(1):95–117
Dastgheib MB, Fakhrahmad SM, Jahromi MZ (2017) Perspell: a new Persian semantic-based spelling correction system. Digit Scholarsh Humanit 32(3):543–553
Yazdani A et al (2020) Automated misspelling detection and correction in Persian clinical text. J Digit Imaging 33:555–562
Ghayoomi M, Momtazi S, Bijankhan M (2010) A study of corpus development for Persian. International journal on ALP. Citeseer, Princeton
Dehkhoda AA (1998) Dehkhoda dictionary. Tehran University, Tehran, p 1377
Krovetz R (1993) Viewing morphology as an inference process. In: Proceedings of the 16th annual international ACM SIGIR conference on research and development in information retrieval
Taghi-Zadeh H et al (2017) A new hybrid stemming method for Persian language. Digit Scholarsh Humanit 32(1):209–221
Melucci M, Orio N (2003) A novel method for stemmer generation based on hidden Markov models. In: Proceedings of the twelfth international conference on information and knowledge management
Brown PF et al (1992) Class-based n-gram models of natural language. Comput Linguist 18(4):467–480
Zhang W (2015) Comparing the effect of smoothing and N-gram order: finding the best way to combine the smoothing and order of N-gram
Alonso I, Contreras D (2016) Evaluation of semantic similarity metrics applied to the automatic retrieval of medical documents: an UMLS approach. Expert Syst Appl 44:386–399
Anuar FM, Setchi R, Lai Y-K (2015) Semantic retrieval of trademarks based on conceptual similarity. IEEE Trans Syst Man Cybern Syst 46(2):220–233
Otegi A et al (2015) Using knowledge-based relatedness for information retrieval. Knowl Inf Syst 44(3):689–718
Dongsuk O et al (2018) Word sense disambiguation based on word similarity calculation using word vector representation from a knowledge-based graph. In: Proceedings of the 27th international conference on computational linguistics
Zhu G, Iglesias CA (2018) Exploiting semantic similarity for named entity disambiguation in knowledge graphs. Expert Syst Appl 101:8–24
Liu Q et al (2016) Improving opinion aspect extraction using semantic similarity and aspect associations. In: 30th AAAI conference on artificial intelligence
Ru C et al (2018) Using semantic similarity to reduce wrong labels in distant supervision for relation extraction. Inf Process Manag 54(4):593–608
Mikolov T et al (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
Qu R et al (2018) Computing semantic similarity based on novel models of semantic representation using Wikipedia. Inf Process Manag 54(6):1002–1021
Shamsfard M et al (2010) Semi automatic development of FarsNet; the Persian WordNet. In: Proceedings of 5th global WordNet conference, Mumbai, India
Finkelstein L et al (2001) Placing search in context: the concept revisited. In: Proceedings of the 10th international conference on World Wide Web
Sebti A, Barfroush AA (2008) A new word sense similarity measure in WordNet. In: 2008 International multiconference on computer science and information technology, IEEE
Zhou Z, Wang Y, Gu J (2008) New model of semantic similarity measuring in wordnet. In: 2008 3rd international conference on intelligent system and knowledge engineering, IEEE
Islam A, Inkpen D (2009) Real-word spelling correction using Google Web 1T n-gram data set. In: Proceedings of the 18th ACM conference on information and knowledge management
Damerau FJ (1964) A technique for computer detection and correction of spelling errors. Commun ACM 7(3):171–176
Peterson JL (1986) A note on undetected typing errors. Commun ACM 29(7):633–637
Pedler J (2007) Computer correction of real-word spelling errors in dyslexic text. University of London, London
John Lu Z (2010) The elements of statistical learning: data mining, inference, and prediction. Wiley, New York
Hall M et al (2009) The WEKA data mining software: an update. ACM SIGKDD Explor Newsl 11(1):10–18
Burges CJ (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Disc 2(2):121–167
Joachims T (2002) Learning to classify text using support vector machines, vol 668. Springer, Cham
Azmi AM, Almutery MN, Aboalsamh HA (2019) Real-word errors in Arabic texts: a better algorithm for detection and correction. IEEE/ACM Trans Audio Speech Lang Process 27(8):1308–1320
Liu Z et al (2010) Study on SVM compared with the other text classification methods. In: 2010 Second international workshop on education technology and computer science, IEEE
Hsu C-W, Chang C-C, Lin C-J (2003) A practical guide to support vector classification. Taipei, Taiwan
Refaeilzadeh P, Tang L, Liu H (2009) Cross-validation. Encycl Database Syst 5:532–538
Gupta S (2015) A correction model for real-word errors. Proc Comput Sci 70:99–106
Kaveh-Yazdy F, Zareh-Bidoki A-M (2014) Aleph or Aleph-Maddah, that is the question! Spelling correction for search engine autocomplete service. In: 2014 4th international conference on computer and knowledge engineering (ICCKE), IEEE
AleAhmad A et al (2009) Hamshahri: a standard Persian text collection. Knowl-Based Syst 22(5):382–387
Funding
The research did not receive any specific funding.
Ethics declarations
Conflict of interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
About this article
Cite this article
Dashti, S.M.S., Bardsiri, A.K. & Shahbazzadeh, M.J. Automatic real-word error correction in Persian text. Neural Comput & Applic 36, 18125–18149 (2024). https://doi.org/10.1007/s00521-024-10045-0
DOI: https://doi.org/10.1007/s00521-024-10045-0