Transforming paper documents into XML format with WISDOM++

Altamura, Oronzo; Esposito, Floriana; Malerba, Donato

doi:10.1007/PL00013569

SI: Document Analysis for Office Systems
Published: August 2001

Volume 4, pages 2–17, (2001)
International Journal on Document Analysis and Recognition

Oronzo Altamura¹,
Floriana Esposito¹ &
Donato Malerba¹

Abstract.

The transformation of scanned paper documents into a form suitable for an Internet browser is a complex process that requires solutions to several problems. Applying OCR to parts of the document image is only one of them. In fact, generating documents in HTML format is easier when the layout structure of a page has been extracted by a document analysis process. Adopting an XML format is even better, since it can facilitate the retrieval of documents on the Web. Nevertheless, an effective transformation of paper documents into this format requires further processing steps, namely document image classification and understanding. WISDOM++ is a document processing system that operates in five steps: document analysis, document classification, document understanding, text recognition with OCR, and transformation into HTML/XML format. The innovative aspects described in the paper are the preprocessing algorithm, the adaptive page segmentation, the acquisition of block classification rules using machine learning techniques, the layout analysis based on general layout principles, and a method that uses document layout information for conversion to HTML/XML formats. A benchmark of the system components implementing these innovative aspects is reported.
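
As a rough illustration of the final step, the minimal sketch below shows how layout information might drive XML generation. It is not the WISDOM++ implementation: the `Block` structure, the `blocks_to_xml` function, the element and attribute names, and the sample coordinates are all assumptions made for illustration, and the OCR output is stubbed in as plain strings.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class Block:
    """A layout component found by document analysis (hypothetical structure)."""
    x: int
    y: int
    width: int
    height: int
    kind: str        # block classification result, e.g. "text" or "image"
    label: str       # logical label from document understanding, e.g. "title"
    text: str = ""   # OCR result; only meaningful for text blocks


def blocks_to_xml(doc_class: str, blocks: list[Block]) -> str:
    """Serialize labeled blocks to XML in top-to-bottom, left-to-right order."""
    root = ET.Element("document", {"class": doc_class})
    for b in sorted(blocks, key=lambda b: (b.y, b.x)):
        elem = ET.SubElement(root, b.label, {
            "x": str(b.x), "y": str(b.y),
            "width": str(b.width), "height": str(b.height),
        })
        if b.kind == "text":
            elem.text = b.text  # non-text blocks keep only their geometry
    return ET.tostring(root, encoding="unicode")


# Assumed sample page: coordinates and labels are invented for the example.
page = [
    Block(40, 120, 500, 30, "text", "author",
          "O. Altamura, F. Esposito, D. Malerba"),
    Block(40, 60, 500, 40, "text", "title",
          "Transforming paper documents into XML format with WISDOM++"),
    Block(40, 200, 240, 180, "image", "figure"),
]
xml_output = blocks_to_xml("scientific_paper", page)
print(xml_output)
```

Sorting by `(y, x)` is a deliberately naive reading order; a multi-column page would need the layout hierarchy produced by the analysis step rather than raw coordinates.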

Author information

Authors and Affiliations

Dipartimento di Informatica, Università degli Studi di Bari, via Orabona 4, 70126 Bari, Italy; e-mail: {altamura,esposito,malerba}@di.uniba.it
Oronzo Altamura, Floriana Esposito & Donato Malerba

Additional information

Received June 15, 2000 / Revised November 7, 2000

About this article

Cite this article

Altamura, O., Esposito, F. & Malerba, D. Transforming paper documents into XML format with WISDOM++. IJDAR 4, 2–17 (2001). https://doi.org/10.1007/PL00013569

Issue Date: August 2001
DOI: https://doi.org/10.1007/PL00013569

Key words: Document image analysis – Layout analysis – Induction of decision trees – Transformation into HTML/XML format
