Abstract
WISDOM++ is an intelligent document processing system that transforms a paper document into HTML/XML format. The main design requirement is adaptivity, which is realized through the application of machine learning methods. This paper illustrates the application of symbolic learning algorithms to the first three steps of document processing, namely document analysis, document classification, and document understanding. The machine learning issues raised by this application are the efficient incremental induction of decision trees from numeric data, the handling of both numeric and symbolic data in first-order rule learning, and the learning of mutually dependent concepts. Experimental results obtained on a set of real-world documents are presented and discussed.
Acknowledgments
The authors would like to thank Francesco De Tommaso, Dario Gerbino, Ignazio Sardella, Giacomo Sidella, Rosa Maria Spadavecchia, and Silvana Spagnoletta for their contribution to the development of WISDOM++. Thanks also to the authors of the systems C4.5 and ITI.
References
O. Altamura, F. Esposito & D. Malerba (1999). WISDOM++: An Interactive and Adaptive Document Analysis System. To appear in Proc. of the 5th Int. Conf. on Document Analysis and Recognition, IEEE Computer Society Press: Los Alamitos.
H.S. Baird (1987). The Skew Angle of Printed Documents. Proc. Conf. of the Society of Photographic Scientists and Engineers, 14–21 (also in R.K.L. O’Gorman (ed.), Document Image Analysis, 204–208, IEEE Computer Society: Los Alamitos, CA, 1995).
T. Bayer, U. Bohnacher, & H. Mogg-Schneider (1994). InforPortLab: An Experimental Document Analysis System. Proc. of the IAPR Workshop on Document Analysis Systems, Kaiserslautern, Germany.
J.H. Connell & M. Brady (1987). Generating and Generalizing Models of Visual Objects. Artificial Intelligence, 31, 2, 159–183.
A. Dengel & G. Barth (1989). ANASTASIL: A Hybrid Knowledge-based System for Document Layout Analysis. Proc. of the 6th Int. Joint Conf. on Artificial Intelligence, 1249–1254.
M.A. Eshera, & K.S. Fu (1986). An Image Understanding System using Attributed Symbolic Representation and Inexact Graph-matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8, 604–618.
F. Esposito, D. Malerba, G. Semeraro, E. Annese, & G. Scafuro (1990). An Experimental Page Layout Recognition System for Office Document Automatic Classification: An Integrated Approach for Inductive Generalization. Proceedings of the 10th International Conference on Pattern Recognition, IEEE Computer Society Press: Los Alamitos, CA, 557–562.
F. Esposito, D. Malerba, & G. Semeraro (1994). Multistrategy Learning for Document Recognition. Applied Artificial Intelligence, 8, 1, 33–84.
F. Esposito, D. Malerba, & G. Semeraro (1995). A Knowledge-based Approach to the Layout Analysis. Proc. of the 3rd Int. Conf. on Document Analysis and Recognition, IEEE Computer Society: Los Alamitos, CA, 466–471.
F. Esposito, D. Malerba, & G. Semeraro (1997). A Comparative Analysis of Methods for Pruning Decision Trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 5, 476–491.
F. Esposito, D. Malerba, G. Semeraro, N. Fanizzi, & S. Ferilli (1998). Adding Machine Learning and Knowledge Intensive Techniques to a Digital Library Service. International Journal on Digital Libraries, 2, 1, 3–19.
J.L. Fisher, S.C. Hinds, & D. P. D’Amato (1990). A Rule-based System for Document Image Segmentation. Proc. of the 10th Int. Conf. on Pattern Recognition, IEEE Computer Society Press: Los Alamitos, CA, 567–572.
T. Hong, & S. N. Srihari (1997). Representing OCRed Documents in HTML. Proc. of the 4th Int. Conf. on Document Analysis and Recognition, IEEE Computer Society Press: Los Alamitos, CA, 831–834.
D. Malerba, F. Esposito, G. Semeraro, & L. De Filippis (1997). Processing Paper Documents with WISDOM. In M. Lenzerini (Ed.), AI*IA 97: Advances in Artificial Intelligence, Lecture Notes in Artificial Intelligence, Springer: Berlin, 1321, 439–442.
D. Malerba, F. Esposito, G. Semeraro, & S. Caggese (1997). Handling Continuous Data in Top-down Induction of First-order Rules. In M. Lenzerini (Ed.), AI*IA 97: Advances in Artificial Intelligence, Lecture Notes in Artificial Intelligence, Springer: Berlin, 1321, 24–35.
D. Malerba, F. Esposito, & F. A. Lisi (1998). Learning Recursive Theories with ATRE. In H. Prade (Ed.), Proc. of the 13th European Conf. on Artificial Intelligence, John Wiley & Sons: Chichester, UK, 435–439.
D. Malerba, G. Semeraro, & E. Bellisari (1995). LEX: A Knowledge-Based System for the Layout Analysis. Proc. of the 3rd Int. Conf. on the Practical Application of Prolog, 429–443.
D. Malerba, G. Semeraro, & F. Esposito (1997). A Multistrategy Approach to Learning Multiple Dependent Concepts. Chapter 4 in C. Taylor & R. Nakhaeizadeh (Eds.), Machine Learning and Statistics: The Interface, Wiley: London, United Kingdom, 87–106.
G. Nagy, S. Seth & M. Viswanathan (1992). A Prototype Document Image Analysis System for Technical Journals. IEEE Computer, 25, 7, 10–22.
L. O’Gorman (1992). Image and Document Processing Techniques for the RightPages Electronic Library System. Proc. of the 11th Int. Conf. on Pattern Recognition, 260–263.
M. Orkin & R. Drogin (1990). Vital Statistics, McGraw Hill: New York.
J. R. Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann: San Mateo, CA.
J. C. Schlimmer, & D. Fisher (1986). A Case Study of Incremental Concept Induction. Proc. of the 5th Nat. Conf. on Artificial Intelligence, Morgan Kaufmann: Philadelphia, 496–501.
F. Y. Shih, & S. S. Chen (1996). Adaptive Document Block Segmentation and Classification. IEEE Trans. on Systems, Man, and Cybernetics Part B, 26, 5, 797–802.
Y. Y. Tang, C. De Yan & C. Y. Suen (1994). Document Processing for Automatic Knowledge Acquisition. IEEE Trans. on Knowledge and Data Engineering, 6, 1, 3–21.
P. E. Utgoff (1989). Incremental Induction of Decision Trees. Machine Learning, 4, 2, 161–186.
P. E. Utgoff (1994). An Improved Algorithm for Incremental Induction of Decision Trees. Proc. of the 11th Int. Conf. on Machine Learning, Morgan Kaufmann: San Francisco, CA.
D. Wang & R.N. Srihari (1989). Classification of Newspaper Image Blocks Using Texture Analysis. Computer Vision, Graphics, and Image Processing, 47, 327–352.
K. Y. Wong, R.G. Casey, & F. M. Wahl (1982). Document Analysis System. IBM Journal of Research and Development, 26, 6, 647–656.
Copyright information
© 1999 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Altamura, O., Esposito, F., Lisi, F.A., Malerba, D. (1999). Symbolic Learning Techniques in Paper Document Processing. In: Perner, P., Petrou, M. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 1999. Lecture Notes in Computer Science, vol 1715. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48097-8_13
DOI: https://doi.org/10.1007/3-540-48097-8_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66599-1
Online ISBN: 978-3-540-48097-6
eBook Packages: Springer Book Archive