Automatic real-word error correction in Persian text

  • Original Article
  • Neural Computing and Applications

Abstract

Automatic spelling correction is a central problem in natural language processing (NLP). Traditional spelling correction techniques typically detect and correct only non-word errors, such as typos and misspellings. Context-sensitive errors, also known as real-word errors, are harder to detect because they are valid words used incorrectly in a given context. The Persian language, with its rich morphology and complex syntax, poses particular challenges for automatic spelling correction systems, and the limited availability of Persian language resources makes it difficult to train effective spelling correction models. This paper introduces an approach for accurate and efficient real-word error correction in Persian text. The methodology is structured and multi-tiered, employing semantic analysis, feature selection, and advanced classifiers to improve error detection and correction. The architecture discovers and stores semantic similarities between words and phrases in Persian text. The classifiers identify real-word errors, while a semantic ranking algorithm determines the most probable corrections, taking into account context, semantic similarity, and edit-distance measures. In our evaluations, the proposed method surpasses previous Persian real-word error correction models, achieving an F-measure of 96.6% in the detection phase and an accuracy of 99.1% in the correction phase. These results indicate that the approach is a promising solution for automatic real-word error correction in Persian text.
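
The ranking idea in the abstract (combining context with an edit-distance measure to choose among candidate corrections) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy English vocabulary, the bigram counts, and the scoring formula `context / (1 + distance)` are all assumptions made for illustration, and the semantic-similarity component is omitted entirely.

```python
from typing import Dict, List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def confusion_set(word: str, vocabulary: List[str], max_distance: int = 2) -> List[str]:
    """Valid words within max_distance edits of `word` (including itself)."""
    return [w for w in vocabulary if edit_distance(word, w) <= max_distance]


def context_score(left: str, candidate: str, right: str,
                  bigrams: Dict[Tuple[str, str], int]) -> int:
    """Hypothetical context measure: sum of the two adjacent bigram counts."""
    return bigrams.get((left, candidate), 0) + bigrams.get((candidate, right), 0)


def rank_candidates(tokens: List[str], index: int, vocabulary: List[str],
                    bigrams: Dict[Tuple[str, str], int]) -> List[Tuple[str, float]]:
    """Rank candidate replacements for tokens[index], best first."""
    word = tokens[index]
    left = tokens[index - 1] if index > 0 else "<s>"
    right = tokens[index + 1] if index + 1 < len(tokens) else "</s>"
    scored = []
    for cand in confusion_set(word, vocabulary):
        # Assumed scoring rule: favour strong context, penalise distant edits.
        score = context_score(left, cand, right, bigrams) / (1 + edit_distance(word, cand))
        scored.append((cand, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data, invented for this example only.
VOCAB = ["a", "of", "cake", "peace", "piece", "place"]
BIGRAMS = {
    ("a", "piece"): 5, ("piece", "of"): 8,
    ("a", "peace"): 1, ("peace", "of"): 2,
    ("a", "place"): 3, ("place", "of"): 1,
    ("of", "cake"): 6,
}

ranking = rank_candidates(["a", "peace", "of", "cake"], 1, VOCAB, BIGRAMS)
print(ranking[0][0])  # best-scoring candidate for the real-word error "peace"
```

In this toy setting "peace" is a valid word, so a dictionary lookup alone would not flag it; only the context counts make "piece" the preferred candidate. A real system would replace the toy bigram table with corpus statistics and add the semantic-similarity measure the abstract describes.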

Data availability

The dataset supporting this article, the Hamshahri corpus, comes from previously reported studies, which have been cited. The data are available at https://dbrg.ut.ac.ir/hamshahri/.

Notes

  1. Available at: http://aspell.net.

  2. Available at: https://sourceforge.net/projects/jazzy.

  3. All pronunciations have been provided in International Phonetic Alphabet (IPA).

  4. Available at: http://farsnet.nlp.sbu.ac.ir.

  5. Available at: https://dbrg.ut.ac.ir/hamshahri/.

Funding

The research did not receive any specific funding.

Author information

Corresponding author

Correspondence to Amid Khatibi Bardsiri.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dashti, S.M.S., Bardsiri, A.K. & Shahbazzadeh, M.J. Automatic real-word error correction in persian text. Neural Comput & Applic 36, 18125–18149 (2024). https://doi.org/10.1007/s00521-024-10045-0

  • DOI: https://doi.org/10.1007/s00521-024-10045-0