Abstract

A significant portion of currently available documents exists in the form of images, for instance, as scanned documents. Electronic documents produced by scanning and OCR software contain recognition errors. This paper presents an automatic approach to selecting likely erroneous terms for query expansion and examines the effectiveness of searching with them. The proposed method consists of two basic steps. In the first step, confused characters in erroneous words are located and editing operations are applied to create a collection of error-grams, the basic unit of the model. The second step uses query terms and error-grams to generate additional query terms, identify appropriate matching terms, and determine the degree of relevance of retrieved document images to the user's query, based on a vector space IR model. The proposed approach was trained on 979 document images to construct about 2,822 error-grams and tested on 100 scanned Web pages, 200 advertisements and manuals, and 700 degraded images. Retrieval effectiveness is evaluated experimentally in terms of recall and precision. The results indicate an improvement over standard methods such as vector space systems without query expansion and overlapping 3-gram matching.
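The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the error-gram table, the function names (`expand_term`, `cosine`, `rank`), and the raw term-frequency weighting are all assumptions made for illustration. The sketch expands each query term with variants produced by substituting a confusable character sequence, then ranks documents by cosine similarity in a simple vector space model.

```python
import math
from collections import Counter

# Hypothetical error-grams: a correct character sequence mapped to the
# sequences an OCR engine might output instead. The paper learns its
# error-grams from training images; these three are illustrative only.
ERROR_GRAMS = {
    "m": ["rn"],
    "l": ["1"],
    "o": ["0"],
}


def expand_term(term, error_grams=ERROR_GRAMS):
    """Return the term plus every variant with one error-gram substitution."""
    variants = {term}
    for correct, confusions in error_grams.items():
        start = term.find(correct)
        while start != -1:
            for wrong in confusions:
                variants.add(term[:start] + wrong + term[start + len(correct):])
            start = term.find(correct, start + 1)
    return variants


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query_terms, documents):
    """Rank documents (id -> OCR text) against the expanded query."""
    expanded = Counter()
    for term in query_terms:
        for variant in expand_term(term):
            expanded[variant] += 1
    scores = []
    for doc_id, text in documents.items():
        doc_vector = Counter(text.lower().split())
        scores.append((cosine(expanded, doc_vector), doc_id))
    return sorted(scores, reverse=True)
```

With this sketch, a query for `modern` also matches a document in which OCR rendered the word as `rnodern`, which an unexpanded query would miss. A fuller treatment would weight variants by how likely each confusion is, which the sketch omits.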

Author information

Corresponding author

Correspondence to Y. Fataicha.

Additional information

Youssef Fataicha received his B.Sc. degree from Université de Rennes 1, Rennes, France, in 1982. In 1984 he obtained his M.Sc. in computer science from Université de Rennes 1, France. Between 1984 and 1986 he was a lecturer at the Université de Rennes 1, France. He then served as an engineer, from 1987 to 2000, at the Office de l'eau potable et de l'électricité in Morocco. Since 2001 he has been a Ph.D. student at the École de Technologie Supérieure de l'Université du Québec in Montreal, Québec, Canada. His research interests include pattern recognition, information retrieval, and image analysis.

Mohamed Cheriet received his B.Eng. in computer science from Université des Sciences et de Technologie d'Alger (Bab Ezouar, Algiers) in 1984 and his M.Sc. and Ph.D., also in computer science, from the University of Pierre et Marie Curie (Paris VI) in 1985 and 1988, respectively. Dr. Cheriet was appointed assistant professor in 1992, associate professor in 1995, and full professor in 1998 in the Department of Automation Engineering, École de Technologie Supérieure of the University of Québec, Montreal. Currently he is the director of LIVIA, the Laboratory for Imagery, Vision and Artificial Intelligence at ETS, and an active member of CENPARMI, the Centre for Pattern Recognition and Machine Intelligence. Professor Cheriet's research focuses on mathematical modeling for signal and image processing (scale-space, PDEs, and variational methods), pattern recognition, character recognition, text processing, document analysis and recognition, and perception. He has published more than 100 technical papers in these fields. He was the co-chair of the 11th and the 13th Vision Interface Conferences, held in Vancouver in 1998 and in Montreal in 2000, respectively. He was also the general co-chair of the 8th International Workshop on Frontiers in Handwriting Recognition held in Niagara-on-the-Lake in 2002. He has served as associate editor of the International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI) since 2000. Dr. Cheriet is a senior member of IEEE.

Jian Yun Nie is a professor in the computer science department (DIRO), Université de Montréal, Québec, Canada. His research focuses on problems related to information retrieval, including multilingual and multimedia information retrieval, as well as natural language processing.

Ching Y. Suen received his M.Sc. (Eng.) from the University of Hong Kong and Ph.D. from the University of British Columbia, Canada. In 1972 he joined the Department of Computer Science of Concordia University, where he became professor in 1979 and served as chairman from 1980 to 1984 and as associate dean for research of the Faculty of Engineering and Computer Science from 1993 to 1997. He has guided/hosted 65 visiting scientists and professors and supervised 60 doctoral and master's graduates. Currently he holds the distinguished Concordia Research Chair in Artificial Intelligence and Pattern Recognition and is the Director of CENPARMI, the Centre for Pattern Recognition and Machine Intelligence.

Professor Suen is the author/editor of 11 books and more than 400 papers on subjects ranging from computer vision and handwriting recognition to expert systems and computational linguistics. A Google search on “Ching Y. Suen” will show some of his publications. He is the founder of the International Journal of Computer Processing of Oriental Languages and served as its first editor-in-chief for 10 years. Presently he is an associate editor of several journals related to pattern recognition.

A fellow of the IEEE, IAPR, and the Academy of Sciences of the Royal Society of Canada, he has served several professional societies as president, vice-president, or governor. He is also the founder and chair of several conference series including ICDAR, IWFHR, and VI. He has been the general chair of numerous international conferences, including the International Conference on Computer Processing of Chinese and Oriental Languages held in Toronto in August 1988, the International Conference on Document Analysis and Recognition held in Montreal in August 1995, and the International Conference on Pattern Recognition held in Québec City in August 2002.

Dr. Suen has given 150 seminars at major computer companies and various government and academic institutions around the world. He has been the principal investigator of 25 industrial/government research contracts and is a grant holder and recipient of prestigious awards, including the ITAC/NSERC award from the Information Technology Association of Canada and the Natural Sciences and Engineering Research Council of Canada in 1992 and the Concordia “Research Fellow” award in 1998.

About this article

Cite this article

Fataicha, Y., Cheriet, M., Nie, J.Y. et al. Retrieving poorly degraded OCR documents. IJDAR 8, 15–26 (2006). https://doi.org/10.1007/s10032-005-0147-6

  • DOI: https://doi.org/10.1007/s10032-005-0147-6