Research article, DOI: 10.1145/3322905.3322917

OCR-D: An end-to-end open source OCR framework for historical printed documents

Published: 08 May 2019

    Abstract

    Various research projects have been concerned with the development and adaptation of OCR methods specifically for historical printed documents (cf. METAe [20], IMPACT [1], eMOP [9]). However, these initiatives ended before the wide adoption of deep neural networks and, despite the various projects' achievements, there remains a lack of OCR software that is a) comprehensive with regard to the challenges presented by the wide variety of historical documents and b) available as ready-to-use Free Software. The OCR-D project aims to rectify that.
    In this paper we introduce the background of OCR-D and the main challenges and shortcomings in the availability of open tools and resources for OCR of historical printed documents, and we discuss the various software modules and related components (repositories, workflows) that are being made available through OCR-D. Finally, we provide an outlook on a number of remaining challenges that are not addressed by OCR-D and point out several examples of the positive community aspects that have arisen through the creation and sharing of open resources for historical German OCR.

    References

    [1]
    Hildelies Balk and Aly Conteh. 2011. IMPACT: Centre of Competence in Text Digitisation. In Proceedings of the 2011 Workshop on Historical Document Imaging and Processing (HIP '11). ACM, New York, NY, USA, 155--160. https://doi.org/10.1145/2037342.2037369
    [2]
    Scott Bradner. 1997. Key words for use in RFCs to Indicate Requirement Levels. BCP 14. RFC Editor. http://www.rfc-editor.org/rfc/rfc2119.txt
    [3]
    Christian Clausner and Apostolos Antonacopoulos. 2018. Ontology and framework for semantic labelling of document data and software methods. In Proceedings of the 13th IAPR International Workshop on Document Analysis Systems (DAS2018). IEEE, New York, NY, USA, 73--78. https://doi.org/10.1109/DAS.2018.46
    [4]
    Ryan Cordell and David Smith. 2018. A Research Agenda for Historical and Multilingual Optical Character Recognition. http://hdl.handle.net/2047/D20297452. Accessed: 2019-01-18.
    [5]
    Maria Federbusch, Christian Polzin, and Thomas Stäcker. 2013. Volltext via OCR - Möglichkeiten und Grenzen. Beiträge aus der Staatsbibliothek zu Berlin - Preußischer Kulturbesitz 43 (2013), 1--138.
    [6]
    Emilio Granell, Verónica Romero, and Carlos D. Martinez-Hinarejos. 2018. Multimodality, interactivity, and crowdsourcing for document transcription. Computational Intelligence 34, 2 (2018), 398--419. https://doi.org/10.1111/coin.12169
    [7]
    Thomas Jejkal, Alexander Vondrous, Andreas Kopmann, Rainer Stotzka, and Volker Hartmann. 2014. KIT Data Manager: The Repository Architecture Enabling Cross-Disciplinary Research. KIT, Karlsruhe, 9--11.
    [8]
    John Kunze, Justin Littman, Elizabeth Madden, John Scancella, and Chris Adams. 2018. The BagIt File Packaging Format (V1.0). https://tools.ietf.org/html/draft-kunze-bagit-17. Accessed: 2019-01-18.
    [9]
    Laura C. Mandell, Clemens Neudecker, Apostolos Antonacopoulos, Elizabeth Grumbach, Loretta Auvil, Matthew J. Christy, Jacob A. Heil, and Todd Samuelson. 2017. Navigating the storm: IMPACT, eMOP, and agile steering standards. Digital Scholarship in the Humanities 32, 1 (2017), 189--194. https://doi.org/10.1093/llc/fqv062
    [10]
    Clemens Neudecker, Sven Schlarb, Zeki Mustafa Dogan, Paolo Missier, Shoaib Sufi, Alan Williams, and Katy Wolstencroft. 2011. An Experimental Workflow Development Platform for Historical Document Digitisation and Analysis. In Proceedings of the 2011 Workshop on Historical Document Imaging and Processing (HIP '11). ACM, New York, NY, USA, 161--168. https://doi.org/10.1145/2037342.2037370
    [11]
    Clemens Neudecker and Asaf Tzadok. 2010. User collaboration for improving access to historical texts. Liber Quarterly 20, 1 (2010), 119--128. https://doi.org/10.18352/lq.7981
    [12]
    Stefan Pletschacher and Apostolos Antonacopoulos. 2010. The PAGE (Page Analysis and Ground-Truth Elements) Format Framework. In 2010 20th International Conference on Pattern Recognition. IEEE, New York, NY, USA, 257--260. https://doi.org/10.1109/ICPR.2010.72
    [13]
    Ajinkya Prabhune, Rainer Stotzka, Vaibhav Sakharkar, Jürgen W. Hesser, and Michael Gertz. 2018. MetaStore: an adaptive metadata management framework for heterogeneous metadata models. Distributed and parallel databases 36, 1 (2018), 153--194. https://doi.org/10.1007/s10619-017-7210-4
    [14]
    Ulrich Reffle and Christoph Ringlstetter. 2013. Unsupervised profiling of OCRed historical documents. Pattern Recognition 46, 5 (2013), 1346--1357. https://doi.org/10.1016/j.patcog.2012.10.002
    [15]
    Christian Reul, Uwe Springmann, and Frank Puppe. 2017. LAREX: A Semi-automatic Open-source Tool for Layout Analysis and Region Extraction on Early Printed Books. In Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage (DATeCH2017). ACM, New York, NY, USA, 137--142. https://doi.org/10.1145/3078081.3078097
    [16]
    Ray Smith. 2007. An Overview of the Tesseract OCR Engine. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Vol. 2. IEEE, New York, NY, USA, 629--633. https://doi.org/10.1109/ICDAR.2007.4376991
    [17]
    Uwe Springmann. 2016. OCR für alte Drucke. Informatik Spektrum 39, 6 (2016), 459--462. https://doi.org/10.1007/s00287-016-1004-3
    [18]
    Uwe Springmann, Christian Reul, Stefanie Dipper, and Johannes Baiter. 2018. Ground Truth for training OCR engines on historical documents in German Fraktur and Early Modern Latin. CoRR abs/1809.05501 (2018). arXiv:1809.05501 http://arxiv.org/abs/1809.05501
    [19]
    Christoph Stollwerk. 2016. Machbarkeitsstudie zu Einsatzmöglichkeiten von OCR-Software im Bereich "Alter Drucke" zur Vorbereitung einer vollständigen Digitalisierung deutscher Druckerzeugnisse zwischen 1500 und 1930. DARIAH-DE working papers 16 (2016). http://nbn-resolving.de/urn:nbn:de:gbv:7-dariah-2016-2-8
    [20]
    Simon Tanner. 2001. Digitization of Printed Material: The Metadata Engine Project (METAe). Library Hi Tech News 18, 4 (2001). https://doi.org/10.1108/lhtn.2001.23918daf.002
    [21]
    Thorsten Vobl, Annette Gotscharek, Uli Reffle, Christoph Ringlstetter, and Klaus U. Schulz. 2014. PoCoTo - an Open Source System for Efficient Interactive Postcorrection of OCRed Historical Texts. In Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage (DATeCH '14). ACM, New York, NY, USA, 57--61. https://doi.org/10.1145/2595188.2595197

      Published In

      DATeCH2019: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage
      May 2019
      163 pages
      ISBN:9781450371940
      DOI:10.1145/3322905
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. digital libraries
      2. digitization
      3. historical prints
      4. open source
      5. optical character recognition

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      DATeCH2019

      Acceptance Rates

      Overall Acceptance Rate 60 of 86 submissions, 70%

      Article Metrics

      • Downloads (last 12 months): 73
      • Downloads (last 6 weeks): 7
      Reflects downloads up to 11 Aug 2024

      Cited By

      • (2024) Boundary Gaussian Distance Loss Function for Enhancing Character Extraction from High-Resolution Scans of Ancient Metal-Type Printed Books. Electronics 13:10 (1957). DOI: 10.3390/electronics13101957. Online publication date: 16-May-2024
      • (2023) Digitale Sammlungen als offene Daten für die Forschung. Bibliothek Forschung und Praxis 47:2 (200-212). DOI: 10.1515/bfp-2023-0021. Online publication date: 18-Jul-2023
      • (2023) Digital Curation and AI. AI in Museums (149-162). DOI: 10.14361/9783839467107-013. Online publication date: 4-Dec-2023
      • (2023) Document Layout Analysis with Deep Learning and Heuristics. Proceedings of the 7th International Workshop on Historical Document Imaging and Processing (73-78). DOI: 10.1145/3604951.3605513. Online publication date: 25-Aug-2023
      • (2023) Search in Archival Facsimile Documents for Digital History. 2023 IEEE 19th International Conference on e-Science (e-Science) (1-10). DOI: 10.1109/e-Science58273.2023.10254826. Online publication date: 9-Oct-2023
      • (2023) Classification of Brahmi script characters using HOG features and multiclass error-correcting output codes (ECOC) model containing SVM binary learners. 2023 International Conference on Intelligent and Innovative Technologies in Computing, Electrical and Electronics (IITCEE) (448-451). DOI: 10.1109/IITCEE57236.2023.10091084. Online publication date: 27-Jan-2023
      • (2023) Classification of incunable glyphs and out-of-distribution detection with joint energy-based models. International Journal on Document Analysis and Recognition (IJDAR) 26:3 (223-240). DOI: 10.1007/s10032-023-00442-x. Online publication date: 22-Jun-2023
      • (2023) The Adaptability of a Transformer-Based OCR Model for Historical Documents. Document Analysis and Recognition – ICDAR 2023 Workshops (34-48). DOI: 10.1007/978-3-031-41498-5_3. Online publication date: 15-Aug-2023
      • (2023) Efficient Annotation of Medieval Charters. Document Analysis and Recognition – ICDAR 2023 Workshops (284-295). DOI: 10.1007/978-3-031-41498-5_20. Online publication date: 15-Aug-2023
      • (2023) Self-paced Learning to Improve Text Row Detection in Historical Documents with Missing Labels. Computer Vision – ECCV 2022 Workshops (253-262). DOI: 10.1007/978-3-031-25069-9_17. Online publication date: 14-Feb-2023
