Classification of document pages using structure-based features

  • SI: Document Analysis for Office Systems
International Journal on Document Analysis and Recognition

Abstract.

Searching for documents by their type or genre is a natural way to enhance the effectiveness of document retrieval. The layout of a document contains a significant amount of information that can be used to classify it by type in the absence of domain-specific models. Our approach to classification is based on “visual similarity” of layout structure and is implemented by building a supervised classifier, given examples of each class. We use image features such as percentages of text and non-text (graphics, images, tables, and rulings) content regions, column structures, relative point sizes of fonts, density of content area, and statistics of features of connected components which can be derived without class knowledge. In order to obtain class labels for training samples, we conducted a study where subjects ranked document pages with respect to their resemblance to representative page images. Class labels can also be assigned based on known document types, or can be defined by the user. We implemented our classification scheme using decision tree classifiers and self-organizing maps.
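
The pipeline described in the abstract (layout features computed without class knowledge, then a supervised classifier trained on labeled examples) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The region tuple format, the class names "letter" and "article", the reduction to three features, the sample page data, and all function names are assumptions made for the example. It also assumes regions do not overlap, and it uses a small Gini-based decision tree in place of whatever tree learner the paper used.

```python
from collections import Counter

# Hypothetical region record: (kind, x, y, width, height) in pixels.
# Anything whose kind is not "text" counts as non-text (graphics, tables, ...).

def layout_features(regions, page_w, page_h):
    """Return [text_fraction, nontext_fraction, content_density] for a page."""
    text = sum(w * h for kind, _, _, w, h in regions if kind == "text")
    other = sum(w * h for kind, _, _, w, h in regions if kind != "text")
    content = text + other
    if content == 0:
        return [0.0, 0.0, 0.0]
    return [text / content, other / content, content / (page_w * page_h)]

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Return (score, feature, threshold) with the lowest weighted Gini, or None."""
    best = None
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        # Thresholds sit on observed values (all but the largest),
        # so both sides of every candidate split are non-empty.
        for t in values[:-1]:
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(X, y, depth=0, max_depth=3):
    """Leaves are class labels; internal nodes are (feature, threshold, left, right)."""
    majority = Counter(y).most_common(1)[0][0]
    if depth == max_depth or len(set(y)) == 1:
        return majority
    split = best_split(X, y)
    if split is None:  # all feature vectors identical
        return majority
    _, f, t = split
    lo = [(r, lab) for r, lab in zip(X, y) if r[f] <= t]
    hi = [(r, lab) for r, lab in zip(X, y) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in lo], [lab for _, lab in lo], depth + 1, max_depth),
            build_tree([r for r, _ in hi], [lab for _, lab in hi], depth + 1, max_depth))

def predict(tree, row):
    """Walk the tree until a leaf (a class label string) is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Made-up training pages on an 850 x 1100 pixel page (illustrative only).
PAGE_W, PAGE_H = 850, 1100
training_pages = [
    ([("text", 100, 100, 600, 300)], "letter"),
    ([("text", 100, 100, 600, 400)], "letter"),
    ([("text", 50, 50, 350, 700), ("text", 450, 50, 350, 1000),
      ("image", 50, 800, 350, 250)], "article"),
    ([("text", 50, 50, 750, 600), ("table", 50, 700, 750, 300)], "article"),
]
X = [layout_features(regions, PAGE_W, PAGE_H) for regions, _ in training_pages]
y = [label for _, label in training_pages]
tree = build_tree(X, y)

unseen_page = [("text", 120, 150, 600, 350)]
print(predict(tree, layout_features(unseen_page, PAGE_W, PAGE_H)))
```

In practice the feature vector would also carry the other features the abstract lists (column structure, relative font point sizes, connected-component statistics), and the class labels would come from the user study or from known document types.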

Additional information

Received June 15, 2000 / Revised November 15, 2000

About this article

Cite this article

Shin, C., Doermann, D. & Rosenfeld, A. Classification of document pages using structure-based features. IJDAR 3, 232–247 (2001). https://doi.org/10.1007/PL00013566

  • DOI: https://doi.org/10.1007/PL00013566