research-article
Open access

Survey of Post-OCR Processing Approaches

Published: 13 July 2021

Abstract

Optical character recognition (OCR) is one of the most popular techniques for converting printed documents into machine-readable ones. While OCR engines perform well on modern text, their performance drops significantly on historical materials. Additionally, many texts have already been processed by various out-of-date digitisation techniques. As a consequence, digitised texts are noisy and need to be post-corrected. This article clarifies the importance of enhancing the quality of OCR results by studying their effects on information retrieval and natural language processing applications. We then define the post-OCR processing problem, illustrate its typical pipeline, and review the state-of-the-art post-OCR processing approaches. Evaluation metrics, accessible datasets, language resources, and useful toolkits are also reported. Furthermore, the work identifies current trends and outlines some research directions in this field.

Supplementary Material

a124-nguyen-supp.pdf (nguyen.zip)
Supplemental movie, appendix, image, and software files for Survey of Post-OCR Processing Approaches

References

[1]
Haithem Afli, Loïc Barrault, and Holger Schwenk. 2016. OCR error correction using statistical machine translation.Int. J. Comput. Ling. Appl. 7, 1 (2016), 175–191.
[2]
Haithem Afli, Zhengwei Qiu, Andy Way, and Páraic Sheridan. 2016. Using SMT for OCR error correction of historical texts. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC’16). 962–966.
[3]
Mayce Al Azawi and Thomas M. Breuel. 2014. Context-dependent confusions rules for building error model using weighted finite state transducers for OCR post-processing. In Proceedings of the 2014 11th IAPR International Workshop on Document Analysis Systems. IEEE, 116–120.
[4]
Mayce Al Azawi, Marcus Liwicki, and Thomas M. Breuel. 2015. Combination of multiple aligned recognition outputs using WFST and LSTM. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR’15). IEEE, 31–35.
[5]
Lloyd Allison and Trevor I. Dix. 1986. A bit-string longest-common-subsequence algorithm. Inform. Process. Lett. 23, 5 (1986), 305–310.
[6]
Chantal Amrhein and Simon Clematide. 2018. Supervised OCR error detection and correction using statistical and neural machine translation methods. J. Lang. Technol. Comput. Ling. 33, 1 (2018), 49–76.
[7]
Richard C. Angell, George E. Freund, and Peter Willett. 1983. Automatic spelling correction using a trigram similarity measure. Inf. Process. Manage. 19, 4 (1983), 255–261.
[8]
Mayce Ibrahim Ali Al Azawi, Adnan Ul-Hasan, Marcus Liwicki, and Thomas M. Breuel. 2014. Character-level alignment using WFST and LSTM for post-processing in multi-script recognition systems—A comparative study. In Proceedings of the 11th International Conference on Image Analysis and Recognition ICIA(R’14),Lecture Notes in Computer Science, Vol. 8814. Springer, 379–386.
[9]
Youssef Bassil and Mohammad Alwani. 2012. OCR context-sensitive error correction based on google web 1T 5-gram data set. Am. J. Sci. Res. 50 (2012).
[10]
Youssef Bassil and Mohammad Alwani. 2012. OCR post-processing error correction algorithm using google’s online spelling suggestion. J. Emerg. Trends Comput. Inf. Sci. 3, 1 (2012).
[11]
Guilherme Torresan Bazzo, Gustavo Acauan Lorentz, Danny Suarez Vargas, and Viviane P Moreira. 2020. Assessing the impact of ocr errors in information retrieval. In Proceedings of the European Conference on Information Retrieval. Springer, 102–109.
[12]
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. J. Mach. Learn. Res. 3(Feb.2003), 1137–1155.
[13]
Anurag Bhardwaj, Faisal Farooq, Huaigu Cao, and Venu Govindaraju. 2008. Topic based language models for OCR correction. In Proceedings of the 2nd Workshop on Analytics for Noisy Unstructured Text Data (AND’08)ACM International Conference Proceeding Series, Vol. 303. ACM, 107–112.
[14]
Maximilian Bisani and Hermann Ney. 2008. Joint-sequence models for grapheme-to-phoneme conversion. Speech Commun. 50, 5 (2008), 434–451.
[15]
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Trans. Assoc. Comput. Ling. 5 (2017), 135–146.
[16]
Lars Borin, Gerlof Bouma, and Dana Dannélls. 2016. A free cloud service for OCR/En fri molntjänst för OCR. Technical Report. Göteborg.
[17]
Eugene Borovikov, Ilya Zavorin, and Mark Turner. 2004. A filter based post-OCR accuracy boost system. In Proceedings of the 1st ACM Workshop on Hardcopy Document Processing. 23–28.
[18]
Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram Version 1 LDC2006T13. In Philadelphia: Linguistic Data Consortium. Google Inc.
[19]
Eric Brill and Robert C. Moore. 2000. An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 286–293.
[20]
Jorge Ramón Fonseca Cacho and Kazem Taghva. 2020. OCR post processing using support vector machines. In Intelligent Computing: Proceedings of the 2020 Computing Conference, Volume 2 (AI’20), Advances in Intelligent Systems and Computing, Vol. 1229. Springer, 694–713.
[21]
Jorge Ramón Fonseca Cacho, Kazem Taghva, and Daniel Alvarez. 2019. Using the Google web 1T 5-gram corpus for OCR error correction. In Proceedings of the 16th International Conference on Information Technology-New Generations (ITNG’19). Springer, 505–511.
[22]
Ewerton Cappelatti, Regina De Oliveira Heidrich, Ricardo Oliveira, Cintia Monticelli, Ronaldo Rodrigues, Rodrigo Goulart, and Eduardo Velho. 2018. Post-correction of OCR errors using pyenchant spelling suggestions selected through a modified needleman–wunsch algorithm. In Proceedings of the International Conference on Human-Computer Interaction. Springer, 3–10.
[23]
Rafael C. Carrasco. 2014. An open-source OCR evaluation tool. In Proceedings of the 1st International Conference on Digital Access to Textual Cultural Heritage. 179–184.
[24]
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2014. One billion word benchmark for measuring progress in statistical language modeling. In Proceedings of the 15th Annual Conference of the International Speech Communication Association (INTERSPEECH’14). ISCA, 2635–2639.
[25]
Guillaume Chiron, Antoine Doucet, Mickaël Coustaty, and Jean-Philippe Moreux. 2017. ICDAR2017 competition on post-OCR text correction. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR’17), Vol. 1. IEEE, 1423–1428.
[26]
Guillaume Chiron, Antoine Doucet, Mickaël Coustaty, Muriel Visani, and Jean-Philippe Moreux. 2017. Impact of OCR errors on the use of digital libraries: Towards a better access to information. In Proceedings of the 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL’17). IEEE, 1–4.
[27]
Otto Chrons and Sami Sundell. 2011. Digitalkoot: Making old archives accessible using crowdsourcing. In Human Computation, Papers from the 2011 AAAI Workshop (AAAI Workshops), Vol. WS-11-11. AAAI.
[28]
Kenneth W. Church and William A. Gale. 1991. Probability scoring for spelling correction. Stat. Comput. 1, 2 (1991), 93–103.
[29]
Philip Clarkson and Ronald Rosenfeld. 1997. Statistical language modeling using the CMU-Cambridge toolkit. In Proceedings of the 5th European Conference on Speech Communication and Technology.
[30]
Simon Clematide, Lenz Furrer, and Martin Volk. 2016. Crowdsourcing an OCR gold standard for a german and french heritage corpus. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC’16). 975–982.
[31]
Ryan Cotterell, Nanyun Peng, and Jason Eisner. 2014. Stochastic contextual edit distance and probabilistic FSTs. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. 625–630.
[32]
W. B. Croft, S. M. Harding, K. Taghva, and J. Borsack. 1994. An evaluation of information retrieval accuracy with simulated OCR output. In Proceedings of the Symposium on Document Analysis and Information Retrieval. 115–126.
[33]
Fred J. Damerau. 1964. A technique for computer detection and correction of spelling errors. Commun. ACM 7, 3 (1964), 171–176.
[34]
Dana Dannélls and Simon Persson. 2020. Supervised OCR post-correction of historical swedish texts: what role does the OCR system play? In Proceedings of the Digital Humanities in the Nordic Countries 5th ConferenceCEUR Workshop Proceedings, Vol. 2612. CEUR-WS.org, 24–37.
[35]
Deepayan Das, Jerin Philip, Minesh Mathew, and C. V. Jawahar. 2019. A cost efficient approach to correct OCR errors in large document collections. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR’19). IEEE, 655–662.
[36]
DBNL. 2019. DBNL OCR Data set.
[37]
Andreas Dengel, Rainer Hoch, Frank Hönes, Thorsten Jäger, Michael Malburg, and Achim Weigel. 1997. Techniques for improving OCR results. In Handbook of Character Recognition and Document Image Analysis. World Scientific, 227–258.
[38]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’19). Association for Computational Linguistics, 4171–4186.
[39]
Eva D’hondt, Cyril Grouin, and Brigitte Grau. 2016. Low-resource OCR error detection and correction in french clinical texts. In Proceedings of the 7th International Workshop on Health Text Mining and Information Analysis (Louhi@EMNLP’16). Association for Computational Linguistics, 61–68.
[40]
Eva D’hondt, Cyril Grouin, and Brigitte Grau. 2017. Generating a training corpus for ocr post-correction using encoder-decoder model. In Proceedings of the 8th International Joint Conference on Natural Language Processing (IJCNLP’17). Asian Federation of Natural Language Processing, 1006–1014.
[41]
Rui Dong and David Smith. 2018. Multi-input attention for unsupervised OCR correction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL’18). Association for Computational Linguistics, 2363–2372.
[42]
Senka Drobac, Pekka Kauppinen, and Krister Lindén. 2017. OCR and post-correction of historical Finnish texts. In Proceedings of the 21st Nordic Conference on Computational Linguistics. 70–76.
[43]
Senka Drobac and Krister Lindén. 2020. Optical character recognition with neural networks and post-correction with finite state methods. Int. J. Doc. Anal. Recogn. 23, 4 (2020), 279–295.
[44]
Steffen Eger, Tim vor der Brück, and Alexander Mehler. 2016. A comparison of four character-level string-to-string translation models for (OCR) spelling error correction. Prague Bull. Math. Ling. 105 (2016), 77–100.
[45]
Tobias Englmeier, Florian Fink, and Klaus U. Schulz. 2019. AI-PoCoTo: Combining automated and interactive ocr postcorrection. In Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage. 19–24.
[46]
Paula Estrella and Pablo Paliza. 2014. OCR correction of documents generated during Argentina’s national reorganization process. In Proceedings of the 1st International Conference on Digital Access to Textual Cultural Heritage. 119–123.
[47]
John Evershed and Kent Fitch. 2014. Correcting noisy OCR: Context beats confusion. In Proceedings of the 1st International Conference on Digital Access to Textual Cultural Heritage. ACM, 45–51.
[48]
Shaolei Feng and R. Manmatha. 2006. A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL’06). ACM, 109–118.
[49]
Florian Fink, Klaus U. Schulz, and Uwe Springmann. 2017. Profiling of OCR’ed historical texts revisited. In Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage. 61–66.
[50]
Lenz Furrer and Martin Volk. 2011. Reducing OCR errors in gothic-script documents. In Proceedings of the Workshop on Language Technologies for Digital Humanities and Cultural Heritage. 97–103.
[51]
Simon Gabay. 2020. OCR17: GT for 17th French prints.
[52]
Michel Généreux, Egon W. Stemle, Verena Lyding, and Lionel Nicolas. 2014. Correcting OCR errors for german in fraktur font. In Proceedings of the First Italian Conference on Computational Linguistics CLiC-It. Pisa University Press, 186–190.
[53]
Stefan Gerdjikov, Stoyan Mihov, and Vladislav Nenchev. 2013. Extraction of spelling variations from language structure for noisy text correction. In Proceedings of the 12th International Conference on Document Analysis and Recognition. IEEE, 324–328.
[54]
Anne Göhring and Martin Volk. 2011. The Text+ Berg corpus: An alpine french-german parallel resource. In Proceedings of Traitement Automatique des Langues Naturelles (TALN’11).
[55]
Annette Gotscharek, Ulrich Reffle, Christoph Ringlstetter, and Klaus U. Schulz. 2009. On lexical resources for digitization of historical documents. In Proceedings of the 9th ACM Symposium on Document Engineering. ACM, 193–200.
[56]
Isabelle Guyon, Robert M. Haralick, Jonathan J. Hull, and Ihsin Tsaiyun Phillips. 1997. Data sets for OCR and document image understanding research. In Handbook of Character Recognition and Document Image Analysis. World Scientific, 779–799.
[57]
Kai Hakala, Aleksi Vesanto, Niko Miekka, Tapio Salakoski, and Filip Ginter. 2019. Leveraging text repetitions and denoising autoencoders in OCR post-correction. CoRR abs/1906.10907 (2019).
[58]
Mika Hämäläinen and Simon Hengchen. 2019. From the paft to the fiiture: A fully automatic NMT and word embeddings method for OCR post-correction. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP’19). INCOMA Ltd., 431–436.
[59]
Ahmed Hamdi, Axel Jean-Caurant, Nicolas Sidere, Mickaël Coustaty, and Antoine Doucet. 2019. An analysis of the performance of named entity recognition over ocred documents. In Proceedings of the 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL’19). IEEE, 333–334.
[60]
Harald Hammarström, Shafqat Mumtaz Virk, and Markus Forsberg. 2017. Poor Man’s OCR post-correction: unsupervised recognition of variant spelling applied to a multilingual document collection. In Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage. 71–75.
[61]
Andreas W. Hauser. 2007. OCR-Postcorrection of Historical Texts. Master’s thesis. Ludwig-Maximilians-Universität München.
[62]
Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the 6th Workshop on Statistical Machine Translation. 187–197.
[63]
Daniel Hládek, Ján Staš, Stanislav Ondáš, Jozef Juhár, and Lászlo Kovács. 2017. Learning string distance with smoothing for OCR spelling correction. Multimedia Tools Appl. 76, 22 (2017), 24549–24567.
[64]
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput. 9, 8 (1997), 1735–1780.
[65]
Rose Holley. 2009. Many Hands Make Light Work: Public Collaborative OCR Text Correction in Australian Historic Newspapers. National Library of Australia.
[66]
Aminul Islam and Diana Inkpen. 2009. Real-word spelling correction using Google Web IT 3-grams. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. 1241–1249.
[67]
Adam Jatowt, Ricardo Campos, Sourav S. Bhowmick, and Antoine Doucet. 2019. Document in context of its time (DICT): Providing temporal context to support analysis of past documents. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM’19). Association for Computing Machinery, New York, NY, 2869–2872.
[68]
Axel Jean-Caurant, Nouredine Tamani, Vincent Courboulay, and Jean-Christophe Burie. 2017. Lexicographical-based order for post-OCR correction of named entities. In Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR’17), Vol. 1. IEEE, 1192–1197.
[69]
Sittichai Jiampojamarn, Colin Cherry, and Grzegorz Kondrak. 2010. Integrating joint n-gram features into a discriminative training framework. In Human Language Technologies: Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 697–700.
[70]
Hongyan Jing, Daniel Lopresti, and Chilin Shih. 2003. Summarization of noisy documents: A pilot study. In Proceedings of the HLT-NAACL 03 on Text Summarization Workshop, Volume 5. Association for Computational Linguistics, 25–32.
[71]
D. R. Jordan. 1945. Daily Battle Communiques. Harold B. Lee Library.
[72]
Philip Kahle, Sebastian Colutto, Günter Hackl, and Günter Mühlberger. 2017. Transkribus-A service platform for transcription, recognition and retrieval of historical documents. In Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR’17), Vol. 4. IEEE, 19–24.
[73]
Paul B. Kantor and Ellen M. Voorhees. 2000. The TREC-5 confusion track: Comparing retrieval methods for scanned text. Inf. Retriev. 2, 2-3 (2000), 165–176.
[74]
Kimmo Kettunen. 2015. Keep, change or delete? Setting up a low resource ocr post-correction framework for a digitized old finnish newspaper collection. In Proceedings of the Italian Research Conference on Digital Libraries. Springer, 95–103.
[75]
Kimmo Kettunen, Timo Honkela, Krister Lindén, Pekka Kauppinen, Tuula Pääkkönen, Jukka Kervinen, et al. 2014. Analyzing and improving the quality of a historical news collection using language technology and statistical machine learning methods. In Proceedings of the IFLA World Library and Information Congress (IFLA’14).
[76]
Gitansh Khirbat. 2017. OCR post-processing text correction using simulated annealing (OPTeCA). In Proceedings of the Australasian Language Technology Association Workshop 2017. 119–123.
[77]
Scott Kirkpatrick, C. Daniel Gelatt, and Mario P. Vecchi. 1983. Optimization by simulated annealing. Science 220, 4598 (1983), 671–680.
[78]
Ido Kissos and Nachum Dershowitz. 2016. OCR error correction using character correction and feature-based word classification. In Proceedings of the 12th IAPR Workshop on Document Analysis Systems (DAS’16). IEEE, 198–203.
[79]
Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of the Association for Computational Linguistics on System Demonstrations (ACL’17). 67–72.
[80]
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics, 177–180.
[81]
Okan Kolak, William Byrne, and Philip Resnik. 2003. A generative probabilistic OCR model for NLP applications. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1. 55–62.
[82]
Okan Kolak and Philip Resnik. 2002. OCR error correction using a noisy channel model. In Proceedings of the 2nd International Conference on Human Language Technology Research. Morgan Kaufmann Publishers Inc., 257–262.
[83]
Okan Kolak and Philip Resnik. 2005. OCR post-processing for low density languages. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 867–874.
[84]
Henry Kucera, Henry Kučera, and Winthrop Nelson Francis. 1967. Computational Analysis of Present-Day American English. Brown University Press.
[85]
Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady.
[86]
Heng Li and Nils Homer. 2010. A survey of sequence alignment algorithms for next-generation sequencing. Brief. Bioinform. 11, 5 (2010), 473–483.
[87]
Xiaofan Lin. 2001. Reliable OCR solution for digital content re-mastering. In Document Recognition and Retrieval IX, Vol. 4670. International Society for Optics and Photonics, 223–231.
[88]
Xiaofan Lin. 2003. Impact of imperfect OCR on part-of-speech tagging. In Proceedings of the Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings. IEEE, 284–288.
[89]
Rafael Llobet, Jose-Ramon Cerdan-Navarro, Juan-Carlos Perez-Cortes, and Joaquim Arlandis. 2010. OCR post-processing using weighted finite-state transducers. In Proceedings of the 2010 International Conference on Pattern Recognition. IEEE, 2021–2024.
[90]
Daniel Lopresti. 2009. Optical character recognition errors and their effects on natural language processing. Int. J. Doc. Anal. Recogn. 12, 3 (2009), 141–151.
[91]
Daniel Lopresti and Jiangying Zhou. 1997. Using consensus sequence voting to correct OCR errors. Comput. Vis. Image Understand. 67, 1 (1997), 39–47.
[92]
William B. Lund, Douglas J. Kennard, and Eric K. Ringger. 2013. Combining multiple thresholding binarization values to improve OCR output. In Document Recognition and Retrieval XX (SPIE Proceedings), Vol. 8658. SPIE, 86580R.
[93]
William B. Lund and Eric K. Ringger. 2009. Improving optical character recognition through efficient multiple system alignment. In Proceedings of the 9th ACM/IEEE-CS joint conference on Digital libraries. ACM, 231–240.
[94]
William B. Lund and Eric K. Ringger. 2011. Error correction with in-domain training across multiple OCR system outputs. In Proceedings of the 2011 International Conference on Document Analysis and Recognition. IEEE, 658–662.
[95]
William B. Lund, Eric K. Ringger, and Daniel David Walker. 2014. How well does multiple OCR error correction generalize? In Document Recognition and Retrieval XXI, Vol. 9021. SPIE, 76–88.
[96]
William B. Lund, Daniel D. Walker, and Eric K. Ringger. 2011. Progressive alignment and discriminative error correction for multiple OCR engines. In Proceedings of the 2011 International Conference on Document Analysis and Recognition. IEEE, 764–768.
[97]
Thibault Magallon, Frédéric Béchet, and Benoît Favre. 2018. Combining character level and word level RNNs for post-OCR error detection. In Proceedings of the Actes de la Conférence TALN (CORIA-TALN-RJC’18), Volume 1. ATALA, 233–240.
[98]
Jie Mei, Aminul Islam, Abidalrahman Moh’d, Yajing Wu, and Evangelos E. Milios. 2017. Post-processing OCR text using web-scale corpora. In Proceedings of the 2017 ACM Symposium on Document Engineering (DocEng 2017). ACM, 117–120.
[99]
Jie Mei, Aminul Islam, Abidalrahman Moh’d, Yajing Wu, and Evangelos Milios. 2018. Statistical learning for OCR error correction. Inf. Process. Manage. 54, 6 (2018), 874–887.
[100]
Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K Gray, Joseph P Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, et al. 2011. Quantitative analysis of culture using millions of digitized books. Science 331, 6014 (2011), 176–182.
[101]
Margot Mieskes and Stefan Schmunk. 2019. OCR quality and NLP preprocessing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.
[102]
Stoyan Mihov, Svetla Koeva, Christoph Ringlstetter, Klaus U. Schulz, and Christian Strohmaier. 2004. Precise and efficient text correction using levenshtein automata, dynamic Web dictionaries and optimized correction models. In Proceedings of Workshop on International Proofing Tools and Language Technologies (2004).
[103]
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of the 1st International Conference on Learning Representations (ICLR’13).
[104]
David Miller, Sean Boisen, Richard Schwartz, Rebecca Stone, and Ralph Weischedel. 2000. Named entity extraction from noisy input: speech and OCR. In Proceedings of the 6th Conference on Applied Natural Language Processing. Association for Computational Linguistics, 316–324.
[105]
Elke Mittendorf and Peter Schäuble. 2000. Information retrieval can cope with many errors. Inf. Retriev. 3, 3 (2000), 189–216.
[106]
Kareem Mokhtar, Syed Saqib Bukhari, and Andreas Dengel. 2018. OCR error correction: State-of-the-Art vs an NMT-based approach. In Proceedings of the 13th IAPR International Workshop on Document Analysis Systems (DAS’18). IEEE, 429–434.
[107]
Diego Molla and Steve Cassidy. 2017. Overview of the 2017 ALTA shared task: Correcting ocr errors. In Proceedings of the Australasian Language Technology Association Workshop 2017. 115–118.
[108]
Stephen Mutuvi, Antoine Doucet, Moses Odeo, and Adam Jatowt. 2018. Evaluating the impact of OCR errors on topic modeling. In Proceedings of the International Conference on Asian Digital Libraries. Springer, 3–14.
[109]
Vivi Nastase and Julian Hitschler. 2018. Correction of OCR word segmentation errors in articles from the ACL collection through neural machine translation methods. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC’18). European Language Resources Association (ELRA).
[110]
J. Ramon Navarro-Cerdan, Joaquim Arlandis, Rafael Llobet, and Juan-Carlos Perez-Cortes. 2015. Batch-adaptive rejection threshold estimation with application to OCR post-processing. Expert Syst. Appl. 42, 21 (2015), 8111–8122.
[111]
J. Ramon Navarro-Cerdan, Joaquim Arlandis, Juan-Carlos Perez-Cortes, and Rafael Llobet. 2010. User-defined expected error rate in OCR postprocessing by means of automatic threshold estimation. In Proceedings of the 12th International Conference on Frontiers in Handwriting Recognition. IEEE, 405–409.
[112]
Quoc-Dung Nguyen, Duc-Anh Le, Nguyet-Minh Phan, and Ivan Zelinka. 2021. OCR error correction using correction patterns and self-organizing migrating algorithm. Pattern Anal. Appl. 24, 2 (2021), 701–721.
[113]
Thi-Tuyet-Hai Nguyen, Mickaël Coustaty, Antoine Doucet, Adam Jatowt, and Nhu-Van Nguyen. 2018. Adaptive edit-distance and regression approach for post-OCR text correction. In Maturity and Innovation in Digital Libraries: Proceedings of the 20th International Conference on Asia-Pacific Digital Libraries (ICADL’18)(Lecture Notes in Computer Science, Vol. 11279. Springer, 278–289.
[114]
Thi-Tuyet-Hai Nguyen, Adam Jatowt, Mickaël Coustaty, Nhu-Van Nguyen, and Antoine Doucet. 2019. Deep statistical analysis of OCR errors for effective post-OCR processing. In Proceedings of the 19th ACM/IEEE Joint Conference on Digital Libraries (JCDL’19). IEEE, 29–38.
[115]
Thi-Tuyet-Hai Nguyen, Adam Jatowt, Mickaël Coustaty, Nhu-Van Nguyen, and Antoine Doucet. 2019. Post-OCR error detection by generating plausible candidates. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR’19). IEEE, 876–881.
[116]
Thi-Tuyet-Hai Nguyen, Adam Jatowt, Nhu-Van Nguyen, Mickaël Coustaty, and Antoine Doucet. 2020. Neural machine translation with BERT for Post-OCR error detection and correction. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (JCDL’20). ACM, 333–336.
[117]
Kai Niklas. 2010. Unsupervised post-correction of OCR errors. Master’s thesis. Leibniz Universität Hannover (2010).
[118]
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311–318.
[119]
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12 (2011), 2825–2830.
[120]
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP’14). 1532–1543.
[121]
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP’14). ACL, 1532–1543.
[122]
Juan Carlos Perez-Cortes, Juan-Carlos Amengual, Joaquim Arlandis, and Rafael Llobet. 2000. Stochastic error-correcting parsing for OCR post-processing. In Proceedings of the 15th International Conference on Pattern Recognition (ICPR’00), Vol. 4. IEEE, 405–408.
[123]
Juan-Carlos Perez-Cortes, Rafael Llobet, J. Ramon Navarro-Cerdan, and Joaquim Arlandis. 2010. Using field interdependence to improve correction performance in a transducer-based ocr post-processing system. In Proceedings of the 12th International Conference on Frontiers in Handwriting Recognition. IEEE, 605–610.
[124]
Alberto Poncelas, Mohammad Aboomar, Jan Buts, James Hadley, and Andy Way. 2020. A tool for facilitating ocr postediting in historical documents. In Proceedings of the 1st Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA’20). 47–51.
[125]
Elvys Linhares Pontes, Ahmed Hamdi, Nicolas Sidere, and Antoine Doucet. 2019. Impact of OCR quality on named entity linking. In Proceedings of the International Conference on Asian Digital Libraries. Springer, 102–115.
[126]
Ulrich Reffle, Annette Gotscharek, Christoph Ringlstetter, and Klaus U. Schulz. 2009. Successfully detecting and correcting false friends using channel profiles. Int. J. Doc. Anal. Recogn. 12, 3 (2009), 165–174.
[127]
Ulrich Reffle and Christoph Ringlstetter. 2013. Unsupervised profiling of OCRed historical documents. Pattern Recogn. 46, 5 (2013), 1346–1357.
[128]
Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50.
[129]
Christian Reul, Uwe Springmann, Christoph Wick, and Frank Puppe. 2018. Improving OCR accuracy on early printed books by utilizing cross fold training and voting. In Proceedings of the 13th IAPR International Workshop on Document Analysis Systems (DAS’18). IEEE, 423–428.
[130]
Martin Reynaert. 2008. All, and only, the errors: More complete and consistent spelling and ocr-error correction evaluation. In Proceedings of the International Conference on Language Resources and Evaluation (LREC’08). European Language Resources Association.
[131]
Martin Reynaert. 2008. Non-interactive OCR post-correction for giga-scale digitization projects. In Proceedings of the International Conference on Intelligent Text Processing and Computational Linguistics. Springer, 617–630.
[132]
Martin W. C. Reynaert. 2011. Character confusion versus focus word-based correction of spelling and OCR variants in corpora. Int. J. Doc. Anal. Recogn. 14, 2 (2011), 173–187.
[133]
Stephen Vincent Rice. 1996. Measuring the Accuracy of Page-Reading Systems. Ph.D. Dissertation. USA.
[134]
Caitlin Richter, Matthew Wickes, Deniz Beser, and Mitch Marcus. 2018. Low-resource post processing of noisy OCR output for historical corpus digitisation. In Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC’18).
[135]
Christophe Rigaud, Antoine Doucet, Mickaël Coustaty, and Jean-Philippe Moreux. 2019. ICDAR 2019 competition on post-OCR text correction. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR’19). IEEE, 1588–1593.
[136]
Christoph Ringlstetter, Max Hadersbeck, Klaus U. Schulz, and Stoyan Mihov. 2007. Text correction using domain dependent bigram models from web crawls. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’07) Workshop on Analytics for Noisy Unstructured Text Data.
[137]
Christoph Ringlstetter, Klaus U. Schulz, and Stoyan Mihov. 2007. Adaptive text correction with Web-crawled domain-dependent dictionaries. ACM Trans. Speech Lang. Process. 4, 4 (2007), 9.
[138]
Rohit Saluja, Devaraj Adiga, Ganesh Ramakrishnan, Parag Chaudhuri, and Mark James Carman. 2017. A framework for document specific error detection and corrections in Indic OCR. In Proceedings of the 1st International Workshop on Open Services and Tools for Document Analysis and the 14th IAPR International Conference on Document Analysis and Recognition (OST@ICDAR’17). IEEE, 25–30.
[139]
Andrey Sariev, Vladislav Nenchev, Stefan Gerdjikov, Petar Mitankin, Hristo Ganchev, Stoyan Mihov, and Tinko Tinchev. 2014. Flexible noisy text correction. In Proceedings of the 2014 11th IAPR International Workshop on Document Analysis Systems. IEEE, 31–35.
[140]
Robin Schaefer and Clemens Neudecker. 2020. A two-step approach for automatic OCR post-correction. In Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature. 52–57.
[141]
Carsten Schnober, Steffen Eger, Erik-Lân Do Dinh, and Iryna Gurevych. 2016. Still not there? Comparing traditional sequence-to-sequence models to encoder-decoder neural networks on monotone string translation tasks. In Proceedings of the 26th International Conference on Computational Linguistics (COLING’16). ACL, 1703–1714.
[142]
Klaus U. Schulz, Stoyan Mihov, and Petar Mitankin. 2007. Fast selection of small and precise candidate sets from dictionaries for text correction tasks. In Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR’07), Vol. 1. IEEE, 471–475.
[143]
Sarah Schulz and Jonas Kuhn. 2017. Multi-modular domain-tailored OCR post-correction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2716–2726.
[144]
Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, et al. 2017. Nematus: A toolkit for neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL’17). 65–68.
[145]
Miikka Silfverberg, Pekka Kauppinen, Krister Linden, et al. 2016. Data-driven spelling correction using weighted finite-state methods. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics Proceedings of the SIGFSM Workshop on Statistical NLP and Weighted Automata. ACL.
[146]
David A. Smith and Ryan Cordell. 2018. A Research Agenda for Historical and Multilingual Optical Character Recognition. NUlab, Northeastern University (2018).
[147]
Sandeep Soni, Lauren Klein, and Jacob Eisenstein. 2019. Correcting whitespace errors in digitized historical texts. In Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature. 98–103.
[148]
Uwe Springmann, Christian Reul, Stefanie Dipper, and Johannes Baiter. 2018. Ground truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin. J. Lang. Technol. Comput. Ling. 33, 1 (2018), 97–114.
[149]
Christian Strohmaier, Christoph Ringlstetter, Klaus U. Schulz, and Stoyan Mihov. 2003. A visual and interactive tool for optimizing lexical postcorrection of OCR results. In Proceedings of the 2003 Conference on Computer Vision and Pattern Recognition Workshop, Vol. 3. IEEE, 32–32.
[150]
Christian M. Strohmaier, Christoph Ringlstetter, Klaus U. Schulz, and Stoyan Mihov. 2003. Lexical postcorrection of OCR-results: The web as a dynamic secondary dictionary? In Proceedings of the 7th International Conference on Document Analysis and Recognition. Citeseer, 1133–1137.
[151]
Kazem Taghva and Shivam Agarwal. 2014. Utilizing web data in identification and correction of OCR errors. In Document Recognition and Retrieval XXI, Vol. 9021. International Society for Optics and Photonics, 902109.
[152]
Kazem Taghva, Julie Borsack, Allen Condit, and Srinivas Erva. 1994. The effects of noisy data on text retrieval. J. Am. Soc. Inf. Sci. 45, 1 (1994), 50–58.
[153]
Kazem Taghva, Allen Condit, Julie Borsack, John Kilburg, Changshi Wu, and Jeff Gilbreth. 1998. Manicure document processing system. In Document Recognition V, Vol. 3305. International Society for Optics and Photonics, 179–184.
[154]
Kazem Taghva and Eric Stofsky. 2001. OCRSpell: An interactive spelling correction system for OCR errors in text. Int. J. Doc. Anal. Recogn. 3, 3 (2001), 125–137.
[155]
Dan Tasse and Noah A. Smith. 2008. SOUR CREAM: Toward Semantic Processing of Recipes. Technical Report CMU-LTI-08-005 (2008).
[156]
Konstantin Todorov and Giovanni Colavizza. 2020. Transfer learning for historical corpora: An assessment on post-OCR correction and named entity recognition. In Proceedings of the Workshop on Computational Humanities Research (CHR’20), Vol. 2723. CEUR-WS.org, 310–339.
[157]
Esko Ukkonen. 1995. On-line construction of suffix trees. Algorithmica 14, 3 (1995), 249–260.
[158]
Daniel van Strien, Kaspar Beelen, Mariona Coll Ardanuy, Kasra Hosseini, Barbara McGillivray, and Giovanni Colavizza. 2020. Assessing the impact of OCR quality on downstream NLP Tasks. In Proceedings of the 12th International Conference on Agents and Artificial Intelligence (ICAART’20). SCITEPRESS, 484–496.
[159]
Thorsten Vobl, Annette Gotscharek, Uli Reffle, Christoph Ringlstetter, and Klaus U. Schulz. 2014. PoCoTo - An open source system for efficient interactive postcorrection of OCRed historical texts. In Proceedings of the 1st International Conference on Digital Access to Textual Cultural Heritage. 57–61.
[160]
Martin Volk, Lenz Furrer, and Rico Sennrich. 2011. Strategies for reducing and correcting OCR errors. In Language Technology for Cultural Heritage. Springer, 3–22.
[161]
Martin Volk, Torsten Marek, and Rico Sennrich. 2010. Reducing OCR errors by combining two OCR systems. In Proceedings of the ECAI 2010 Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH’10). 61–65.
[162]
Luis Von Ahn, Benjamin Maurer, Colin McMillen, David Abraham, and Manuel Blum. 2008. reCAPTCHA: Human-based character recognition via web security measures. Science 321, 5895 (2008), 1465–1468.
[163]
David Wemhoener, Ismet Zeki Yalniz, and R. Manmatha. 2013. Creating an improved version using noisy OCR from multiple editions. In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition. IEEE, 160–164.
[164]
Christoph Wick, Christian Reul, and Frank Puppe. 2020. Calamari - A high-performance tensorflow-based deep learning package for optical character recognition. Digit. Humanit. Q. 14, 2 (2020).
[165]
Michael L. Wick, Michael G. Ross, and Erik G. Learned-Miller. 2007. Context-sensitive error correction: Using topic models to improve OCR. In Proceedings of the 9th International Conference on Document Analysis and Recognition (ICDAR’07). IEEE Computer Society, 1168–1172.
[166]
L. Wilms, R. Nijssen, and T. Koster. 2020. Historical newspaper OCR ground-truth data set. KB Lab: The Hague.
[167]
Shaobin Xu and David Smith. 2017. Retrieving and combining repeated passages to improve OCR. In Proceedings of the 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL’17). IEEE, 1–4.
[168]
Ismet Zeki Yalniz and Raghavan Manmatha. 2011. A fast alignment scheme for automatic OCR evaluation of books. In Proceedings of the 2011 International Conference on Document Analysis and Recognition. IEEE, 754–758.
[169]
Guowei Zu, Mayo Murata, Wataru Ohyama, Tetsushi Wakabayashi, and Fumitaka Kimura. 2004. The impact of OCR accuracy on automatic text classification. In Advanced Workshop on Content Computing. Springer, 403–409.

Published In

ACM Computing Surveys  Volume 54, Issue 6
Invited Tutorial
July 2022
799 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/3475936
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 July 2021
Accepted: 01 February 2021
Revised: 01 January 2021
Received: 01 August 2020
Published in CSUR Volume 54, Issue 6

Author Tags

  1. OCR merging
  2. Post-OCR processing
  3. error model
  4. language model
  5. machine learning
  6. statistical and neural machine translation

Qualifiers

  • Research-article
  • Research
  • Refereed
