DOI: 10.1007/978-3-031-19496-2_13

Evaluating the Impact of OCR Quality on Short Texts Classification Task

Published: 24 October 2022

Abstract

Most text classification algorithms have been developed and evaluated on texts written by humans and born digital. In today's world of ubiquitous smartphones and cameras, however, an ever-increasing amount of textual information comes from text captured in photographs: road and business signs, product labels and price tags, random phrases on t-shirts, and so on. One way to process such information is to pass an image containing text through an Optical Character Recognition (OCR) engine and then apply a natural language processing (NLP) system to the resulting text. However, OCRed text is not quite equivalent to 'natural', human-written text, because its spelling errors differ from those usually committed by humans. Starting from the premise that the distribution of human errors differs from the distribution of OCR errors, we compare how much, and in what way, this affects classifiers. We focus on deterministic classifiers such as fuzzy search as well as on popular neural-network-based classifiers, including CNN, BERT, and RoBERTa. We find that applying a spell corrector to OCRed text increases the F1 score by 4% for CNN and by 2% for BERT.
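The deterministic fuzzy-search classifier mentioned in the abstract can be sketched minimally as nearest-match lookup under an edit-distance-style similarity. The class labels and reference phrases below are hypothetical illustrations, not the paper's dataset, and Python's standard-library `difflib` stands in for whatever fuzzy-matching implementation the authors used:

```python
from difflib import SequenceMatcher

# Hypothetical classes and reference phrases (illustrative only).
CLASS_PHRASES = {
    "price_tag": ["special offer", "price per unit", "discount today"],
    "road_sign": ["speed limit", "no parking", "one way street"],
}

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] via difflib's gestalt pattern matching."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_classify(ocr_text: str) -> str:
    """Assign the class whose reference phrase best matches the noisy OCR text."""
    best_label, best_score = None, -1.0
    for label, phrases in CLASS_PHRASES.items():
        score = max(similarity(ocr_text, p) for p in phrases)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Typical OCR confusions ('l' vs '1', 'o' vs '0') still match fuzzily:
print(fuzzy_classify("speed 1imit"))    # road_sign
print(fuzzy_classify("specia1 0ffer"))  # price_tag
```

Such a classifier needs no training, which is why the paper can contrast it as a deterministic baseline against the trained CNN, BERT, and RoBERTa models; its robustness to OCR noise depends entirely on how far the corrupted string drifts from the reference phrases.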



Published In

Advances in Computational Intelligence: 21st Mexican International Conference on Artificial Intelligence, MICAI 2022, Monterrey, Mexico, October 24–29, 2022, Proceedings, Part II
Oct 2022
401 pages
ISBN: 978-3-031-19495-5
DOI: 10.1007/978-3-031-19496-2

Publisher

Springer-Verlag

Berlin, Heidelberg


Author Tags

  1. NLP
  2. OCR
  3. Text classification
  4. Multi-class classification
  5. CNN
  6. BERT
  7. RoBERTa
  8. Fuzzy search
  9. Short texts

