research-article

Text retrieval from early printed books

Author:

Simone Marinai

AND '09: Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data

Pages 33 - 40

https://doi.org/10.1145/1568296.1568304

Published: 23 July 2009

Abstract

We describe a text indexing and retrieval technique that does not rely on word segmentation and is tolerant of errors in character segmentation. The method is designed to process early printed documents, and we evaluate it on the well-known Latin Gutenberg Bible.

The approach relies on two main components. First, character objects (in most cases corresponding to individual characters) are extracted from the document and clustered, so that a symbolic class is assigned to each indexed object. Second, a query word is compared against the indexed character objects with an approach based on Dynamic Time Warping (DTW). The distinctive feature of the matching technique described in this paper is the incorporation of sub-symbolic information in the string matching process. In particular, we take into account the estimated widths of potential subwords, which are computed by accumulating the lengths of partial matches in the DTW array.
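The matching idea in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: the `(label, width)` representation of a character object, the 0/1 label cost, the `expected_width` argument, and the way the width penalty is added to the final score are all assumptions made for illustration. The sketch runs a subsequence DTW over cluster labels and, in a second array, accumulates the widths of the objects consumed along each partial match.

```python
from typing import List, Optional, Tuple

# Hypothetical representation of an indexed character object:
# (cluster label, width in pixels).
CharObject = Tuple[str, float]


def match_query(
    query: List[str],
    expected_width: float,
    objects: List[CharObject],
    width_weight: float = 1.0,
) -> Optional[Tuple[float, int, int]]:
    """Return (score, start, end) such that objects[start:end] best matches
    the query. Assumes a non-empty query and expected_width > 0."""
    n, m = len(query), len(objects)
    inf = float("inf")
    # cost[i][j]: best label cost of aligning query[:i] with a run of
    # objects that ends at objects[j - 1].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    # width[i][j]: summed widths of the objects used by that alignment.
    width = [[0.0] * (m + 1) for _ in range(n + 1)]
    # start[i][j]: index of the first object used by that alignment.
    start = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        cost[0][j] = 0.0  # a match may begin at any object
        start[0][j] = j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            label, w = objects[j - 1]
            local = 0.0 if label == query[i - 1] else 1.0
            moves = [
                # next query symbol, next object
                (cost[i - 1][j - 1], width[i - 1][j - 1] + w, start[i - 1][j - 1]),
                # same query symbol, one more object (over-segmentation)
                (cost[i][j - 1], width[i][j - 1] + w, start[i][j - 1]),
            ]
            if i > 1:
                # next query symbol, same object (merged characters);
                # the object's width was already counted
                moves.append((cost[i - 1][j], width[i - 1][j], start[i - 1][j]))
            prev_cost, acc_width, first = min(moves, key=lambda t: t[0])
            cost[i][j] = prev_cost + local
            width[i][j] = acc_width
            start[i][j] = first

    best = None
    for j in range(1, m + 1):
        if cost[n][j] == inf:
            continue
        # Penalise matches whose accumulated width is implausible.
        penalty = abs(width[n][j] - expected_width) / expected_width
        candidate = (cost[n][j] + width_weight * penalty, start[n][j], j)
        if best is None or candidate < best:
            best = candidate
    return best
```

For example, `match_query(list("abc"), 30.0, [("x", 12), ("a", 10), ("b", 11), ("c", 9), ("y", 12)])` selects the three middle objects. In this sketch the width is only used when ranking complete matches; how the paper uses it inside the DTW recurrence is not stated in the abstract.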

Cited By

Ghosh K, Chakraborty A, Parui S, Majumder P (2016). Improving Information Retrieval Performance on OCRed Text in the Absence of Clean Text Ground Truth. Information Processing and Management: an International Journal, 52:5 (873-884). DOI: 10.1016/j.ipm.2016.03.006. Online publication date: 1-Sep-2016.
https://dl.acm.org/doi/10.1016/j.ipm.2016.03.006

Index Terms

Text retrieval from early printed books
1. Applied computing
  1. Document management and text processing
2. Information systems
  1. Information retrieval
    1. Document representation

Recommendations

Text retrieval from early printed books
Special issue on noisy text analytics

Retrieving text from early printed books is particularly difficult because in these documents, the words are very close one to the other and, similarly to medieval manuscripts, there is a large use of ligatures and abbreviations. To address these ...
A Multi-fonts Kanji Character Recognition Method for Early-modern Japanese Printed Books with Ruby Characters
ICPRAM 2014: Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods

The web site of National Diet Library in Japan provides a lot of early-modern (AD1868-1945) Japanese printed books to the public, but full-text search is essentially impossible. In order to perform advanced search for historical literatures, the ...
Dataset of Pages from Early Printed Books with Multiple Font Groups
HIP '19: Proceedings of the 5th International Workshop on Historical Document Imaging and Processing

Based on contemporary scripts, early printers developed a large variety of different fonts. While fonts may slightly differ from one printer to another, they can be divided into font groups, such as Textura, Antiqua, or Fraktur. The recognition of font ...

Reviews

Reviewer: Sithu D. Sudarsan

Historians and archaeologists have long faced the challenge of reading ancient scriptures. Computer scientists are now extending recent advances in text retrieval techniques to ancient scriptures. Using the Latin Gutenberg Bible, Marinai's work focuses on identifying character objects: connected components, or portions thereof, that correspond to single characters. The retrieval technique has a preprocessing phase in which columns, text lines, and character objects are extracted. Character clustering is then performed by creating initial vectors and using them to train self-organizing maps; this step is crucial to the success of the work. Finally, the matching algorithm uses the query-by-example approach. The steps of the algorithm are clearly described in the paper. Marinai explains the experiments carried out and outlines the issues encountered. To increase confidence in the algorithm, the author proposes to continue the experiments.

In spite of its interesting aspects, the approach has limitations. The proposed preprocessing may not be feasible for documents that use a variety of formats, which makes the technique useful for one class of documents rather than generic. Also, the initial vector set must be selected with enough detail to capture as many character objects as possible. Addressing these limitations depends on subject matter experts, and the field has a long way to go.

Online Computing Reviews Service
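The clustering step mentioned in the review (training a self-organizing map and using the winning unit as the symbolic class of each character object) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's configuration: the one-dimensional map, the linear decay schedules, the initialisation from the first input vectors, and the toy feature vectors are all hypothetical.

```python
import math
from typing import List

Vector = List[float]


def best_unit(weights: List[Vector], x: Vector) -> int:
    """Index of the map unit whose weight vector is closest to x."""
    return min(
        range(len(weights)),
        key=lambda u: sum((a - b) ** 2 for a, b in zip(weights[u], x)),
    )


def train_som(
    vectors: List[Vector],
    n_units: int,
    epochs: int = 20,
    lr0: float = 0.5,
    sigma0: float = 1.0,
) -> List[Vector]:
    """Train a 1-D self-organizing map; assumes len(vectors) >= n_units."""
    # Deterministic initialisation from the first n_units input vectors.
    weights = [list(v) for v in vectors[:n_units]]
    total = epochs * len(vectors)
    step = 0
    for _ in range(epochs):
        for x in vectors:
            frac = step / total
            lr = lr0 * (1.0 - frac)                    # decaying learning rate
            sigma = max(sigma0 * (1.0 - frac), 0.01)   # shrinking neighbourhood
            bmu = best_unit(weights, x)
            for u, w in enumerate(weights):
                # Gaussian neighbourhood on the 1-D grid of units.
                h = math.exp(-((u - bmu) ** 2) / (2.0 * sigma ** 2))
                for k in range(len(w)):
                    w[k] += lr * h * (x[k] - w[k])
            step += 1
    return weights


def assign_labels(weights: List[Vector], vectors: List[Vector]) -> List[int]:
    """Symbolic class of each object: the index of its best matching unit."""
    return [best_unit(weights, v) for v in vectors]
```

With two well-separated groups of toy vectors, such as `[[0.0, 0.0], [10.0, 10.0], [0.0, 0.0], [10.0, 10.0]]` and `n_units=2`, each group ends up with its own label. Real character objects would need image-derived feature vectors and a larger map, neither of which is specified on this page.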

Information

Published In

AND '09: Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data

July 2009

127 pages

ISBN:9781605584966

DOI:10.1145/1568296

Program Chairs:
Daniel Lopresti (Lehigh University),
Shourya Roy (Xerox India Innovation Hub),
Klaus Schulz (University of Munich),
L. Venkata Subramaniam (IBM India Research Lab)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

AND '09

AND '09: Third Workshop on Analytics for Noisy Unstructured Text Data

July 23 - 24, 2009

Barcelona, Spain

Acceptance Rates

AND '09 Paper Acceptance Rate 15 of 22 submissions, 68%;

Overall Acceptance Rate 15 of 22 submissions, 68%

Bibliometrics

Total Citations: 1
Total Downloads: 289
Downloads (last 12 months): 2
Downloads (last 6 weeks): 0

Reflects downloads up to 08 Feb 2025
