short-paper

Neural Machine Translation with BERT for Post-OCR Error Detection and Correction

Published: 01 August 2020

Abstract

The quality of OCR has a direct impact on information access, and an indirect impact on the performance of natural language processing applications, making fine-grained (e.g., semantic) information access even harder. This work proposes a novel post-OCR approach based on a contextual language model and neural machine translation, aiming to improve the quality of OCRed text by detecting and rectifying erroneous tokens. This new technique obtains results comparable to the best-performing approaches on English datasets of the competition on post-OCR text correction in ICDAR 2017/2019.
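
The detect-then-correct idea described above can be illustrated with a deliberately small sketch. This is not the authors' system: a lexicon lookup stands in for the contextual language model that flags erroneous tokens, and string-similarity matching from the Python standard library stands in for the neural machine translation model that rewrites them. The lexicon contents and all function names are hypothetical and chosen only for illustration.

```python
from difflib import get_close_matches

# Hypothetical toy lexicon; a real system would use trained models instead.
LEXICON = {"the", "quality", "of", "ocr", "has", "a", "direct",
           "impact", "on", "information", "access"}


def detect_errors(tokens, lexicon=LEXICON):
    """Stage 1 (detection): return positions of tokens missing from the lexicon."""
    return [i for i, tok in enumerate(tokens) if tok.lower() not in lexicon]


def correct_token(token, lexicon=LEXICON):
    """Stage 2 (correction): return the closest lexicon entry, or the token unchanged."""
    matches = get_close_matches(token.lower(), sorted(lexicon), n=1, cutoff=0.6)
    return matches[0] if matches else token


def post_ocr_correct(text):
    """Run detection, then correct only the tokens that were flagged."""
    tokens = text.split()
    for i in detect_errors(tokens):
        tokens[i] = correct_token(tokens[i])
    return " ".join(tokens)


if __name__ == "__main__":
    print(post_ocr_correct("the qnality of ocr has a dircct impact"))
```

Keeping detection and correction as separate stages means that tokens judged correct are never rewritten, which is the main reason to detect before correcting.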

Supplementary Material

MP4 File (3383583.3398605.mp4)
I would like to present our post-OCR correction approach, which applies neural machine translation and the contextual language model BERT.

      Published In

      JCDL '20: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020
      August 2020
      611 pages
      ISBN:9781450375856
      DOI:10.1145/3383583
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. BERT
      2. neural machine translation
      3. post-OCR processing

      Qualifiers

      • Short-paper

      Funding Sources

      • H2020

      Conference

      JCDL '20

      Acceptance Rates

      Overall Acceptance Rate 415 of 1,482 submissions, 28%

      Article Metrics

      • Downloads (last 12 months): 93
      • Downloads (last 6 weeks): 8
      Reflects downloads up to 16 Feb 2025

      Cited By

      • (2025) Quelle solution pour améliorer les performances de la reconnaissance d’entités nommées sur des données bruitées, corriger l’entrée ou filtrer la sortie ? Corpus 26. DOI: 10.4000/1364s. Online publication date: 2025
      • (2025) End-to-End Information Extraction from Courier Order Images Using a Neural Network Model with Feature Enhancement. Applied Sciences 15:2 (698). DOI: 10.3390/app15020698. Online publication date: 12-Jan-2025
      • (2025) Augmented dialectal speech recognition for AI-based neuropsychological scale assessment in Alzheimer’s disease. Biomedical Signal Processing and Control 99 (106821). DOI: 10.1016/j.bspc.2024.106821. Online publication date: Jan-2025
      • (2025) Multi-modal deep learning for credit rating prediction using text and numerical data streams. Applied Soft Computing 171 (112771). DOI: 10.1016/j.asoc.2025.112771. Online publication date: Mar-2025
      • (2024) End-to-end entity extraction from OCRed texts using summarization models. Neural Computing and Applications 36:35 (22347-22363). DOI: 10.1007/s00521-024-10422-9. Online publication date: 1-Dec-2024
      • (2024) Confidence-Aware Document OCR Error Detection. Document Analysis Systems (213-228). DOI: 10.1007/978-3-031-70442-0_13. Online publication date: 11-Sep-2024
      • (2023) Visual information extraction deep learning method: a critical review. Journal of Image and Graphics 28:8 (2276-2297). DOI: 10.11834/jig.220904. Online publication date: 2023
      • (2023) AutoDesc: Facilitating Convenient Perusal of Web Data Items for Blind Users. Proceedings of the 28th International Conference on Intelligent User Interfaces (32-45). DOI: 10.1145/3581641.3584049. Online publication date: 27-Mar-2023
      • (2023) Text error correction after text recognition based on MacBERT4CSC. Sixth International Conference on Advanced Electronic Materials, Computers, and Software Engineering (AEMCSE 2023) (97). DOI: 10.1117/12.3004939. Online publication date: 16-Aug-2023
      • (2023) Detecting speech recognition errors using topic information and BERT. 2023 14th IIAI International Congress on Advanced Applied Informatics (IIAI-AAI) (400-405). DOI: 10.1109/IIAI-AAI59060.2023.00084. Online publication date: 8-Jul-2023
