DOI: 10.1145/1568296.1568304
research-article

Text retrieval from early printed books

Published: 23 July 2009

Abstract

We describe a text indexing and retrieval technique that does not rely on word segmentation and is tolerant of errors in character segmentation. The method is designed to process early printed documents, and we evaluate it on the well-known Latin Gutenberg Bible.
The approach relies on two main components. First, character objects (in most cases corresponding to individual characters) are extracted from the document and clustered, so as to assign a symbolic class to each indexed object. Second, a query word is compared against the indexed character objects with a Dynamic Time Warping (DTW) based approach. The distinguishing feature of the matching technique described in this paper is the incorporation of sub-symbolic information into the string matching process. In particular, we take into account the estimated widths of potential subwords, which are computed by accumulating the lengths of partial matches in the DTW array.
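
As a rough illustration of the matching idea in the abstract, the Python sketch below aligns a query, given as a sequence of cluster labels, against the character objects of one text line using a DTW-style array, while a parallel array accumulates the widths of the objects consumed by each partial match. It is a simplification under stated assumptions, not the paper's algorithm: the `CharObject` type, the 0/1 substitution cost, `gap_cost`, `width_weight`, and the decision to apply the width penalty only once at the end are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CharObject:
    label: int    # cluster id assigned to the character object (assumed encoding)
    width: float  # bounding-box width in pixels (assumed unit)


def match_query(query: List[CharObject], line: List[CharObject],
                gap_cost: float = 1.0,
                width_weight: float = 1.0) -> Tuple[float, int, int]:
    """Find the best match of `query` in `line`.

    Returns (score, start, end) so that line[start:end] is the matched run.
    """
    n, m = len(query), len(line)
    inf = float("inf")
    # cost[i][j]: best cost of aligning query[:i] with a run ending at line[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    # width[i][j]: total width of the line objects consumed by that partial match
    width = [[0.0] * (m + 1) for _ in range(n + 1)]
    # start[i][j]: index in `line` where that partial match begins
    start = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        cost[0][j] = 0.0   # a match may start anywhere in the line
        start[0][j] = j

    for i in range(1, n + 1):
        for j in range(m + 1):
            # Query symbol with no object of its own (e.g. merged characters).
            best = (cost[i - 1][j] + gap_cost, width[i - 1][j], start[i - 1][j])
            if j > 0:
                obj = line[j - 1]
                sub = 0.0 if query[i - 1].label == obj.label else 1.0
                # Query symbol aligned with one character object.
                diag = (cost[i - 1][j - 1] + sub,
                        width[i - 1][j - 1] + obj.width,
                        start[i - 1][j - 1])
                # Extra object absorbed by the match (e.g. over-segmentation).
                extra = (cost[i][j - 1] + gap_cost,
                         width[i][j - 1] + obj.width,
                         start[i][j - 1])
                best = min(diag, best, extra, key=lambda c: c[0])
            cost[i][j], width[i][j], start[i][j] = best

    # Penalise matches whose accumulated width differs from the query's width.
    expected = sum(q.width for q in query)
    best_score, best_start, best_end = inf, 0, 0
    for j in range(m + 1):
        penalty = abs(width[n][j] - expected) / max(expected, 1e-9)
        score = cost[n][j] + width_weight * penalty
        if score < best_score:
            best_score, best_start, best_end = score, start[n][j], j
    return best_score, best_start, best_end
```

Because the first row of the array is zero everywhere, the query can be located inside a line without any prior word segmentation; the gap moves are what give the sketch some tolerance to merged or split character objects.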

Cited By

  • (2016) Improving Information Retrieval Performance on OCRed Text in the Absence of Clean Text Ground Truth. Information Processing and Management: an International Journal 52:5 (873-884). DOI: 10.1016/j.ipm.2016.03.006. Online publication date: 1-Sep-2016

Reviews

Sithu D. Sudarsan

Historians and archaeologists have long faced the challenge of reading ancient scriptures. Now, computer scientists are extending recent advances in text retrieval to such documents. Using the Latin Gutenberg Bible, Marinai's work focuses on the identification of character objects: connected components, or portions thereof, that correspond to single characters. The retrieval technique has a preprocessing phase in which columns, text lines, and character objects are extracted. Subsequently, character clustering is performed by creating initial vectors and using them to train self-organizing maps; this step is crucial to the success of the work. Finally, the matching algorithm uses the query-by-example approach. The various steps of the algorithm are nicely described in the paper. The paper explains the experiments carried out, outlines the issues encountered, and proposes further experiments to increase confidence in the algorithm.

In spite of its interesting aspects, the approach has limitations. The proposed preprocessing may not be feasible for documents in other formats, which makes the technique useful for one class of documents rather than generic. Also, the initial vector set must be detailed enough to capture as many character objects as possible. Addressing these limitations depends on subject matter experts, and the field has a long way to go.

Online Computing Reviews Service
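
The clustering step mentioned in the review can be illustrated with a toy self-organizing map. The sketch below is a minimal, hypothetical 1-D SOM in plain Python; the feature vectors, the map size, the linear decay schedule, the step-shaped neighbourhood, and the function names `train_som` and `best_unit` are assumptions made for illustration, not the configuration used in the paper.

```python
from typing import List, Sequence


def best_unit(units: List[List[float]], x: Sequence[float]) -> int:
    """Index of the map unit closest to x (squared Euclidean distance)."""
    return min(range(len(units)),
               key=lambda k: sum((u - v) ** 2 for u, v in zip(units[k], x)))


def train_som(samples: List[Sequence[float]], n_units: int,
              epochs: int = 20, lr0: float = 0.5,
              radius0: float = 1.0) -> List[List[float]]:
    """Train a 1-D self-organizing map and return its unit vectors."""
    dim = len(samples[0])
    # Deterministic initialisation from evenly spaced samples (an assumption;
    # random initialisation is also common).
    units = [list(samples[(k * len(samples)) // n_units]) for k in range(n_units)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs      # decays linearly from 1 towards 0
        lr = lr0 * frac                  # learning rate for this epoch
        radius = radius0 * frac          # neighbourhood radius for this epoch
        for x in samples:
            bmu = best_unit(units, x)
            for k in range(n_units):
                d = abs(k - bmu)         # distance on the 1-D map grid
                if d <= radius:
                    # The winner moves fully, neighbours at half strength.
                    h = 1.0 if d == 0 else 0.5
                    for t in range(dim):
                        units[k][t] += lr * h * (x[t] - units[k][t])
    return units
```

After training, `best_unit(units, x)` gives the symbolic class of a character object described by the feature vector `x`, which is the kind of label a string matcher can then consume.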

Published In

AND '09: Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data
July 2009
127 pages
ISBN:9781605584966
DOI:10.1145/1568296
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

AND '09

Acceptance Rates

AND '09 Paper Acceptance Rate 15 of 22 submissions, 68%;
Overall Acceptance Rate 15 of 22 submissions, 68%

Article Metrics

  • Downloads (last 12 months): 2
  • Downloads (last 6 weeks): 0
Reflects downloads up to 08 Feb 2025
