The Impact of OCR Accuracy and Feature Transformation on Automatic Text Classification

Murata, Mayo; Busagala, Lazaro S. P.; Ohyama, Wataru; Wakabayashi, Tetsushi; Kimura, Fumitaka

doi:10.1007/11669487_45

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 3872)

Included in the following conference series:

International Workshop on Document Analysis Systems

Abstract

Digitizing printed documents involves generating text with an OCR system for applications such as full-text retrieval and document organization. However, text produced by present OCR technology contains errors, and previous studies have shown that classification performance decreases as OCR accuracy decreases. The reason is the use of absolute word frequency as the feature vector: this representation depends on both text length and word recognition rate, which increases within-class variance and therefore lowers classification performance. We describe feature transformation techniques that do not have these limitations and present experimental results showing improvement for all classifiers used.
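
The dependence of absolute word frequency on text length can be illustrated with a small sketch. The abstract does not name the transformations used, so the relative-frequency normalization and the power (square-root) transform below are assumptions chosen for illustration, as are the function names, the vocabulary, and the toy documents; this is not the authors' implementation.

```python
from collections import Counter


def absolute_frequency(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def relative_frequency(vector):
    """Divide each count by the total so the vector no longer scales with length."""
    total = sum(vector)
    if total == 0:
        return [0.0] * len(vector)
    return [value / total for value in vector]


def power_transform(vector, exponent=0.5):
    """Raise each component to a power below 1 to damp large frequencies (assumed variant)."""
    return [value ** exponent for value in vector]


# Hypothetical vocabulary and documents, for illustration only.
vocabulary = ["ocr", "text", "error"]
short_doc = "ocr text ocr error".split()
long_doc = short_doc * 3                   # same content, three times the length
noisy_doc = "ocr text 0cr error".split()   # one "ocr" misrecognized as "0cr"

abs_short = absolute_frequency(short_doc, vocabulary)
abs_long = absolute_frequency(long_doc, vocabulary)
abs_noisy = absolute_frequency(noisy_doc, vocabulary)

rel_short = relative_frequency(abs_short)
rel_long = relative_frequency(abs_long)

print("absolute:", abs_short, abs_long, abs_noisy)
print("relative:", rel_short, rel_long)
print("power   :", power_transform(rel_short))
```

In this toy case the short and long documents have different absolute-frequency vectors but identical relative-frequency vectors, which is the kind of length independence the abstract refers to. The misrecognized word in the noisy document lowers the absolute count of "ocr", showing how recognition rate also shifts the raw feature vector.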

Author information

Authors and Affiliations

Faculty of Engineering, Mie University, 1577 Kurimamachiya-cho, Tsu-shi, Mie, 5148507, Japan
Mayo Murata, Lazaro S. P. Busagala, Wataru Ohyama, Tetsushi Wakabayashi & Fumitaka Kimura

Editor information

Editors and Affiliations

Institute of Computer Science and Applied Mathematics, University of Bern, Neubrückstrasse 10, CH-3012, Bern, Switzerland
Horst Bunke
DocRec Ltd, 34 Strathaven Place, 7001, Atawhai, Nelson, New Zealand
A. Lawrence Spitz

About this paper

Cite this paper

Murata, M., Busagala, L.S.P., Ohyama, W., Wakabayashi, T., Kimura, F. (2006). The Impact of OCR Accuracy and Feature Transformation on Automatic Text Classification. In: Bunke, H., Spitz, A.L. (eds) Document Analysis Systems VII. DAS 2006. Lecture Notes in Computer Science, vol 3872. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11669487_45

DOI: https://doi.org/10.1007/11669487_45
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32140-8
Online ISBN: 978-3-540-32157-6
eBook Packages: Computer Science, Computer Science (R0)
