Unsupervised profiling of OCRed historical documents

Authors:

Christoph Ringlstetter

Pattern Recognition, Volume 46, Issue 5

Pages 1346 - 1357

https://doi.org/10.1016/j.patcog.2012.10.002

Published: 01 May 2013

Abstract

More and more OCRed historical documents are becoming available in search engines and digital libraries. Still, access to these texts is often unsatisfactory because of two problems: first, the quality of optical character recognition (OCR) on historical texts is often surprisingly low; second, historical spelling variation is a barrier to search even when texts are properly reconstructed. As one step towards a solution, we introduce a method that automatically computes a two-channel profile from an OCRed historical text. The profile includes (1) "global" information on typical recognition errors found in the OCR output, typical patterns of historical spelling variation, and the vocabulary and word frequencies of the underlying text, and (2) "local" hypotheses on OCR errors and historical orthography for particular tokens of the OCR output. We argue that the availability of this kind of knowledge is a key step towards improving OCR and information retrieval (IR) on historical texts: profiles can be used, for example, to automatically fine-tune postcorrection systems, to adapt OCR engines to the given input document, and to define refined models for approximate search that are aware of the kind of language variation found in a specific document. Our evaluation results show a strong correlation between the true distribution of spelling variation patterns and recognition errors in the OCRed text and the estimated ranks and scores automatically computed in profiles. As a specific application, we show how to improve the output of a commercial OCR engine using profiles in a postcorrection system.
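
To make the two-channel idea concrete, here is a minimal toy sketch in Python. It is not the article's algorithm: it assumes the interpretation of each token (modern word, historical form, OCR output) is already given, whereas the method described above estimates such interpretations without supervision. The names `edit_patterns` and `build_profile` and the example triples are hypothetical; the sketch only shows how per-token ("local") hypotheses could be aggregated into document-level ("global") pattern counts.

```python
from collections import Counter
from difflib import SequenceMatcher


def edit_patterns(source, target):
    """List the (source_part, target_part) substring pairs where the strings differ."""
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    return [
        (source[i1:i2], target[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def build_profile(hypotheses):
    """Aggregate per-token hypotheses into global pattern counts.

    `hypotheses` holds (modern_word, historical_form, ocr_token) triples.
    In this toy they are supplied by hand (an assumption), not inferred.
    """
    spelling_patterns = Counter()  # channel 1: modern -> historical rewrites
    ocr_errors = Counter()         # channel 2: historical -> OCR output rewrites
    local = []                     # per-token interpretations
    for modern, historical, ocr in hypotheses:
        spelling = edit_patterns(modern, historical)
        errors = edit_patterns(historical, ocr)
        spelling_patterns.update(spelling)
        ocr_errors.update(errors)
        local.append(
            {"token": ocr, "modern": modern, "spelling": spelling, "ocr": errors}
        )
    return {
        "spelling_patterns": spelling_patterns,
        "ocr_errors": ocr_errors,
        "local": local,
    }


# Hypothetical German examples: "theil" misread as "tbeil", and so on.
profile = build_profile([
    ("teil", "theil", "tbeil"),
    ("und", "vnd", "vnd"),
    ("tun", "thun", "tbun"),
])
print(profile["spelling_patterns"].most_common())
print(profile["ocr_errors"].most_common())
```

In this sketch the global counts could then be ranked to guess which spelling patterns and recognition errors dominate a document, while the `local` list keeps the hypothesis attached to each token.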

Copyright © 2012 Elsevier Ltd.

Publisher

Elsevier Science Inc.

United States

Cited By

Neudecker C, Baierer K, Gerber M, Clausner C, Antonacopoulos A, Pletschacher S (2021). A survey of OCR evaluation tools and metrics. Proceedings of the 6th International Workshop on Historical Document Imaging and Processing, pp. 13-18. Online publication date: 5-Sep-2021
https://dl.acm.org/doi/10.1145/3476887.3476888
Nguyen T, Jatowt A, Coustaty M, Doucet A (2021). Survey of Post-OCR Processing Approaches. ACM Computing Surveys 54:6, pp. 1-37. Online publication date: 13-Jul-2021
https://dl.acm.org/doi/10.1145/3453476
Neudecker C, Baierer K, Federbusch M, Boenig M, Würzner K, Hartmann V, Herrmann E (2019). OCR-D. Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage, pp. 53-58. Online publication date: 8-May-2019
https://dl.acm.org/doi/10.1145/3322905.3322917
Englmeier T, Fink F, Schulz K (2019). A-I-PoCoTo. Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage, pp. 19-24. Online publication date: 8-May-2019
https://dl.acm.org/doi/10.1145/3322905.3322908
Brodić D, Amelio A (2019). Recognizing the orthography changes for identifying the temporal origin on the example of the Balkan historical documents. Neural Computing and Applications 31:8, pp. 3493-3513. Online publication date: 1-Aug-2019
https://dl.acm.org/doi/10.1007/s00521-017-3292-1
Fink F, Schulz K, Springmann U (2017). Profiling of OCR'ed Historical Texts Revisited. Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage, pp. 61-66. Online publication date: 1-Jun-2017
https://dl.acm.org/doi/10.1145/3078081.3078096
Christy M, Gupta A, Grumbach E, Mandell L, Furuta R, Gutierrez-Osuna R (2017). Mass Digitization of Early Modern Texts With Optical Character Recognition. Journal on Computing and Cultural Heritage 11:1, pp. 1-25. Online publication date: 7-Dec-2017
https://dl.acm.org/doi/10.1145/3075645
Hládek D, Staš J, Ondáš S, Juhár J, Kovács L (2017). Learning string distance with smoothing for OCR spelling correction. Multimedia Tools and Applications 76:22, pp. 24549-24567. Online publication date: 1-Nov-2017
https://dl.acm.org/doi/10.1007/s11042-016-4185-5
Chen Y, Yu P (2016). An evidence-based model of saliency feature extraction for scene text analysis. International Journal on Document Analysis and Recognition 19:3, pp. 269-287. Online publication date: 1-Sep-2016
https://dl.acm.org/doi/10.1007/s10032-016-0270-6
Springmann U, Najock D, Morgenroth H, Schmid H, Gotscharek A, Fink F, Antonacopoulos A, Schulz K (2014). OCR of historical printings of Latin texts. Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, pp. 71-75. Online publication date: 19-May-2014
https://dl.acm.org/doi/10.1145/2595188.2595205