Abstract
Digitizing printed documents involves generating text with an OCR system for applications such as full-text retrieval and document organization. However, with present OCR technology the generated text contains errors. Moreover, previous studies have shown that classification performance decreases as OCR accuracy decreases. The reason is the use of absolute word frequency as the feature vector. Representing OCR text by absolute word frequency has limitations, such as dependence on text length and on word recognition rate, which lead to higher within-class variance and consequently lower classification performance. We describe feature transformation techniques that do not have these limitations and present improved experimental results for all classifiers used.
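The abstract does not name the transformations it uses, so the following is only an illustrative sketch under stated assumptions. It assumes two common choices, relative frequency (each count divided by the document's total count) and a power transformation (each component raised to an exponent below 1), and shows why absolute counts depend on text length and recognition rate while relative frequencies do not. The function names, the toy vocabulary, and the simulated OCR loss are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def absolute_frequency(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def relative_frequency(tokens, vocabulary):
    """Divide each count by the total count (assumed transformation)."""
    absolute = absolute_frequency(tokens, vocabulary)
    total = sum(absolute)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [count / total for count in absolute]


def power_transform(vector, exponent=0.5):
    """Raise each component to a power below 1 (assumed transformation)."""
    return [value ** exponent for value in vector]


# Hypothetical toy data: the "degraded" document simulates an OCR system
# that recognizes only half of the words, uniformly across the vocabulary.
vocabulary = ["ocr", "text", "error"]
base = ["ocr", "ocr", "text", "error"]
clean = base * 2
degraded = base

abs_clean = absolute_frequency(clean, vocabulary)
abs_degraded = absolute_frequency(degraded, vocabulary)
rel_clean = relative_frequency(clean, vocabulary)
rel_degraded = relative_frequency(degraded, vocabulary)

print("absolute:", abs_clean, abs_degraded)  # vectors differ
print("relative:", rel_clean, rel_degraded)  # vectors coincide
print("power:", power_transform(rel_clean))
```

In this toy case the two documents belong to the same class yet have different absolute-frequency vectors, which is the within-class variance the abstract describes. Real OCR errors are not uniform, so relative frequency would reduce rather than eliminate the difference.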
Keywords
- Feature Vector
- Classification Rate
- Optical Character Recognition
- Feature Transformation
- Linear Discriminant Function
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Murata, M., Busagala, L.S.P., Ohyama, W., Wakabayashi, T., Kimura, F. (2006). The Impact of OCR Accuracy and Feature Transformation on Automatic Text Classification. In: Bunke, H., Spitz, A.L. (eds) Document Analysis Systems VII. DAS 2006. Lecture Notes in Computer Science, vol 3872. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11669487_45
DOI: https://doi.org/10.1007/11669487_45
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32140-8
Online ISBN: 978-3-540-32157-6
eBook Packages: Computer Science (R0)