DOI: 10.5555/3170967.3170982
Research article

Text line extraction and baseline detection

Published: 02 April 1991

Abstract

A hypothesis-driven analysis of text regions in document images is presented. We assume that document images are constructed according to the TeX imaging model (the boxes-and-glue model), and introduce a box representation of a printed symbol, called a character prototype. The algorithm estimates the location of an hbox (a line of text) within a vbox (a column of text). This hypothesis is accepted if most connected components in the hbox form character boxes; character prototypes are used to decide whether a given box in an hbox is a character box. The algorithm correctly extracted an average of 96 to 99 percent of text lines from digitized text columns written in English. The character prototype scheme also accurately represented the Japanese alphabet, Chinese characters, and Bengali words, and text lines were correctly extracted from documents written in these languages.
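The accept-or-reject step described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not define the character prototype test, so `is_character_box` is a hypothetical stand-in (a component must lie inside the hbox and reach an assumed minimum relative height), and the names `Box`, `accept_hbox`, `min_rel_height`, and `min_fraction` are assumptions introduced here. "Most" is read as "more than half" purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box; y grows downward, as in image coordinates."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def height(self) -> int:
        return self.bottom - self.top


def is_character_box(component: Box, hbox: Box, min_rel_height: float = 0.2) -> bool:
    """Hypothetical stand-in for the character prototype test.

    Assumption: a component counts as a character box if it lies entirely
    inside the hbox and is at least `min_rel_height` of the hbox height.
    """
    inside = (
        hbox.left <= component.left
        and component.right <= hbox.right
        and hbox.top <= component.top
        and component.bottom <= hbox.bottom
    )
    tall_enough = component.height >= min_rel_height * hbox.height
    return inside and tall_enough


def accept_hbox(hbox: Box, components: List[Box], min_fraction: float = 0.5) -> bool:
    """Accept the text-line hypothesis if most components are character boxes."""
    # Only components that vertically overlap the hbox count as evidence.
    candidates = [c for c in components if c.top < hbox.bottom and c.bottom > hbox.top]
    if not candidates:
        return False
    matches = sum(is_character_box(c, hbox) for c in candidates)
    return matches / len(candidates) > min_fraction
```

In practice the component boxes would come from a connected-component labeling pass over the binarized column image, and the hbox hypotheses from whatever line-location estimate the method uses; both steps are outside this sketch.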

Published In

RIAO '91: Intelligent Text and Image Handling
April 1991
522 pages

Publisher

LE CENTRE DE HAUTES ETUDES INTERNATIONALES D'INFORMATIQUE DOCUMENTAIRE

Paris, France
