DOI: 10.1007/978-3-031-19496-2_13

Evaluating the Impact of OCR Quality on Short Texts Classification Task

Published: 24 October 2022

Abstract

Most text classification algorithms have been developed and evaluated on texts written by humans and born digital. In today's world of ubiquitous smartphones and cameras, however, an ever-increasing amount of textual information comes from text captured in photographs: road and business signs, product labels and price tags, random phrases on t-shirts, and so on. One way to process such information is to pass an image containing text through an Optical Character Recognition (OCR) engine and then apply a natural language processing (NLP) system to the resulting text. However, OCRed text is not quite equivalent to 'natural', human-written text, because its spelling errors differ from those usually committed by humans. Starting from the premise that the distribution of human errors differs from the distribution of OCR errors, we compare how much, and in what way, this affects classifiers. We focus on deterministic classifiers such as fuzzy search as well as on popular neural-network-based classifiers, including CNN, BERT, and RoBERTa. We find that applying a spell corrector to OCRed text increases the F1 score by 4% for CNN and by 2% for BERT.
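The deterministic fuzzy-search classifier mentioned in the abstract can be sketched minimally as nearest-match lookup under an edit-distance-style similarity. The class labels and reference phrases below are hypothetical illustrations, not the paper's dataset, and Python's standard-library `difflib` stands in for whatever fuzzy-matching implementation the authors used:

```python
from difflib import SequenceMatcher

# Hypothetical classes and reference phrases (illustrative only).
CLASS_PHRASES = {
    "price_tag": ["special offer", "price per unit", "discount today"],
    "road_sign": ["speed limit", "no parking", "one way street"],
}

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] via difflib's gestalt pattern matching."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_classify(ocr_text: str) -> str:
    """Assign the class whose reference phrase best matches the noisy OCR text."""
    best_label, best_score = None, -1.0
    for label, phrases in CLASS_PHRASES.items():
        score = max(similarity(ocr_text, p) for p in phrases)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Typical OCR confusions ('l' vs '1', 'o' vs '0') still match fuzzily:
print(fuzzy_classify("speed 1imit"))    # road_sign
print(fuzzy_classify("specia1 0ffer"))  # price_tag
```

Such a classifier needs no training, which is why the paper can contrast it as a deterministic baseline against the trained CNN, BERT, and RoBERTa models; its robustness to OCR noise depends entirely on how far the corrupted string drifts from the reference phrases.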



Published In

Advances in Computational Intelligence: 21st Mexican International Conference on Artificial Intelligence, MICAI 2022, Monterrey, Mexico, October 24–29, 2022, Proceedings, Part II
Oct 2022
401 pages
ISBN: 978-3-031-19495-5
DOI: 10.1007/978-3-031-19496-2

Publisher

Springer-Verlag

Berlin, Heidelberg


Author Tags

  1. NLP
  2. OCR
  3. Text classification
  4. Multi-class classification
  5. CNN
  6. BERT
  7. RoBERTa
  8. Fuzzy search
  9. Short texts

