Automatic real-word error correction in Persian text

Dashti, Seyed Mohammad Sadegh; Bardsiri, Amid Khatibi; Shahbazzadeh, Mehdi Jafari

doi:10.1007/s00521-024-10045-0

Original Article
Published: 19 July 2024

Volume 36, pages 18125–18149, (2024)
Neural Computing and Applications

Seyed Mohammad Sadegh Dashti¹,
Amid Khatibi Bardsiri¹ &
Mehdi Jafari Shahbazzadeh²

Abstract

Automatic spelling correction is a central challenge in natural language processing (NLP). Traditional spelling correction techniques typically detect and correct only non-word errors, such as typos and misspellings that do not form valid words. Context-sensitive errors, also known as real-word errors, are harder to detect because they are valid words used incorrectly in a given context. The Persian language, characterized by its rich morphology and complex syntax, presents formidable challenges to automatic spelling correction systems. Furthermore, the limited availability of Persian language resources makes it difficult to train effective spelling correction models. This paper introduces an approach for accurate and efficient real-word error correction in Persian text. Our method is structured in multiple tiers, employing semantic analysis, feature selection, and advanced classifiers to improve error detection and correction. The architecture discovers and stores semantic similarities between words and phrases in Persian text. The classifiers identify real-word errors, while a semantic ranking algorithm determines the most probable corrections, taking into account properties such as context, semantic similarity, and edit-distance measures. In our evaluations, the proposed method surpasses previous Persian real-word error correction models, achieving an F-measure of 96.6% in the detection phase and an accuracy of 99.1% in the correction phase. These results indicate that the approach is a promising solution for automatic real-word error correction in Persian text.
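To make the idea of ranking corrections by context and edit distance concrete, here is a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify the scoring formula, so the ratio of a bigram context count to one plus the edit distance is an assumption made for illustration. The names `LEXICON`, `BIGRAMS`, `context_score`, and `rank_candidates` are hypothetical, and the toy data is in English rather than Persian to keep the example easy to follow. Semantic similarity, which the paper also uses, is omitted.

```python
from collections import Counter


def damerau_levenshtein(a: str, b: str) -> int:
    """Edit distance allowing insertion, deletion, substitution and
    transposition of adjacent characters (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


# Hypothetical toy resources; a real system would derive these from a corpus.
LEXICON = {"a", "of", "cake", "peace", "piece", "pierce", "place"}
BIGRAMS = Counter({
    ("a", "piece"): 40, ("piece", "of"): 50,
    ("a", "peace"): 2, ("peace", "of"): 5,
    ("a", "place"): 10, ("place", "of"): 8,
})


def context_score(left: str, word: str, right: str) -> int:
    """Assumed context measure: how often the word follows its left
    neighbour plus how often it precedes its right neighbour."""
    return BIGRAMS[(left, word)] + BIGRAMS[(word, right)]


def rank_candidates(left: str, word: str, right: str, max_distance: int = 2) -> list[str]:
    """Rank lexicon words near `word` by context fit, penalised by edit distance."""
    scored = []
    for candidate in LEXICON:
        distance = damerau_levenshtein(word, candidate)
        if distance <= max_distance:
            score = context_score(left, candidate, right) / (1 + distance)
            scored.append((score, candidate))
    # Highest score first; ties are broken alphabetically for determinism.
    return [w for _, w in sorted(scored, key=lambda t: (-t[0], t[1]))]


print(rank_candidates("a", "peace", "of"))
```

The original word stays in the candidate set with distance 0, so a word that already fits its context is not replaced. In the toy data, "peace" is a valid word, but its neighbours favour "piece" strongly enough to outweigh the distance penalty.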

Data availability

The dataset supporting this article is from previously reported studies and datasets (Hamshahri corpus), which have been cited. The data are available at https://dbrg.ut.ac.ir/hamshahri/.

Notes

Available at: http://aspell.net.
Available at: https://sourceforge.net/projects/jazzy.
All pronunciations have been provided in International Phonetic Alphabet (IPA).
http://farsnet.nlp.sbu.ac.ir.
https://dbrg.ut.ac.ir/hamshahri/.

Funding

The research did not receive any specific funding.

Author information

Authors and Affiliations

Department of Computer Engineering, Kerman Branch, Islamic Azad University, Kerman, Iran
Seyed Mohammad Sadegh Dashti & Amid Khatibi Bardsiri
Department of Electrical Engineering, Kerman Branch, Islamic Azad University, Kerman, Iran
Mehdi Jafari Shahbazzadeh

Corresponding author

Correspondence to Amid Khatibi Bardsiri.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dashti, S.M.S., Bardsiri, A.K. & Shahbazzadeh, M.J. Automatic real-word error correction in persian text. Neural Comput & Applic 36, 18125–18149 (2024). https://doi.org/10.1007/s00521-024-10045-0

Received: 07 December 2022
Accepted: 19 June 2024
Published: 19 July 2024
Issue Date: October 2024
DOI: https://doi.org/10.1007/s00521-024-10045-0
