
Toward a Period-specific Optimized Neural Network for OCR Error Correction of Historical Hebrew Texts

Published: 07 April 2022

Abstract

Over the past few decades, large archives of paper-based historical documents, such as books and newspapers, have been digitized using Optical Character Recognition (OCR). Unfortunately, this widely used technology is error-prone, especially for documents written hundreds of years ago. Neural networks have shown great success in various text processing tasks, including OCR post-correction. Their main disadvantage for historical corpora is the lack of the sufficiently large training datasets they require, especially for morphologically rich languages like Hebrew. Moreover, due to Hebrew's unique features, the optimal network structure and hyperparameter (predefined parameter) values for OCR error correction in Hebrew are unclear. Furthermore, languages change across genres and periods, and these changes may affect the accuracy of OCR post-correction neural network models. To overcome these challenges, we developed a new multi-phase method for generating artificial training datasets with OCR errors, combined with hyperparameter optimization, for building an effective neural network for OCR post-correction in Hebrew. To evaluate the proposed approach, we conducted a series of experiments on several literary Hebrew corpora from various periods and genres. The results demonstrate that (1) training a network on texts from a similar period dramatically improves the network's ability to fix OCR errors, (2) the proposed error injection algorithm, based on character-level period-specific errors, minimizes the need for manually corrected data and improves network accuracy by 9%, (3) the optimized network design improves accuracy by 3% over the state-of-the-art network, and (4) the constructed optimized network outperforms neural machine translation models and industry-leading spellcheckers.
The proposed methodology may have practical implications for digital humanities projects that aim to search and analyze OCRed documents in Hebrew and, potentially, other morphologically rich languages.
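The character-level, period-specific error injection described above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual algorithm: the confusion table, error-type choices, and error rate are hypothetical placeholders standing in for statistics that would be learned from a small manually corrected sample of the target period.

```python
import random

# Hypothetical character-confusion table: maps a correct Hebrew character
# to OCR misreadings assumed common in a given period's scans (illustrative only).
CONFUSIONS = {
    "ה": ["ח", "ת"],
    "ו": ["י", "ן"],
    "ר": ["ד"],
}

def inject_ocr_errors(text, error_rate=0.05, rng=None):
    """Corrupt clean text with character-level substitutions, deletions,
    and insertions that mimic OCR noise, yielding an artificial (clean, noisy)
    training pair for a post-correction network."""
    rng = rng or random.Random(0)  # fixed seed for reproducible corpora
    out = []
    for ch in text:
        if rng.random() < error_rate:
            kind = rng.choice(["substitute", "delete", "insert"])
            if kind == "substitute" and ch in CONFUSIONS:
                out.append(rng.choice(CONFUSIONS[ch]))
            elif kind == "delete":
                continue  # drop the character entirely
            else:
                out.append(ch)
                out.append(ch)  # duplicated glyph, a common segmentation artifact
        else:
            out.append(ch)
    return "".join(out)

clean = "הספר נכתב לפני מאה שנה"
noisy = inject_ocr_errors(clean, error_rate=0.2)
# (noisy, clean) then serves as one input/target pair for training
```

In this sketch the period specificity lives entirely in `CONFUSIONS` and `error_rate`; swapping in statistics mined from scans of a different era retargets the generated dataset without touching the rest of the pipeline.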


Cited By

  • Machine Translation for Historical Research: A Case Study of Aramaic-Ancient Hebrew Translations. Journal on Computing and Cultural Heritage 17, 2 (2024), 1–23. DOI: 10.1145/3627168
  • Mapping the landscape and roadmap of geospatial artificial intelligence (GeoAI) in quantitative human geography: An extensive systematic review. International Journal of Applied Earth Observation and Geoinformation 128 (2024), 103734. DOI: 10.1016/j.jag.2024.103734
  • The status of the Jewish temple in modern Hebrew literature (1848–1948): A big-data analysis. Digital Scholarship in the Humanities 38, 3 (2023), 1101–1114. DOI: 10.1093/llc/fqad010


Published In

Journal on Computing and Cultural Heritage, Volume 15, Issue 2
June 2022, 403 pages
ISSN: 1556-4673
EISSN: 1556-4711
DOI: 10.1145/3514179

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 April 2022
Online AM: 01 April 2022
Accepted: 01 August 2021
Revised: 01 June 2021
Received: 01 August 2020
Published in JOCCH Volume 15, Issue 2

Author Tags

  1. DNN
  2. OCR post-correction
  3. dataset generation
  4. Hebrew
  5. neural machine translation
  6. historical newspapers
  7. digital humanities

Qualifiers

  • Research-article
  • Refereed


