Abstract
In this paper, we present a probabilistic method that can improve the efficiency of document classification when applied to structured documents. The analysis of the structure of a document is the starting point of document classification. Our method is designed to augment other classification schemes and to complement pre-filtering information extraction procedures in order to reduce uncertainties. To this end, a probability distribution on the structure of XML documents is introduced. We show how to parameterise existing learning methods to describe the structure distribution efficiently. The learned distribution is then used to predict the classes of unseen documents. Novelty detection that makes use of the structure-based distribution function is also discussed. Demonstrations on model documents and on Internet XML documents are presented.
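To make the idea of a structure-based distribution concrete, the following is a minimal sketch and not the method of the paper. It assumes that the structure of an XML document can be summarised by the set of parent/child tag pairs it contains, and it substitutes a simple per-class Bernoulli model with additive smoothing for the learning methods the abstract refers to. The names `structure_features`, `StructureModel`, `alpha` and `novelty_threshold` are hypothetical and chosen only for illustration.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def structure_features(xml_text):
    """Return the set of (parent tag, child tag) pairs in one XML document."""
    root = ET.fromstring(xml_text)
    edges = {("ROOT", root.tag)}  # "ROOT" is an artificial parent of the top element
    for parent in root.iter():
        for child in parent:
            edges.add((parent.tag, child.tag))
    return edges


class StructureModel:
    """Per-class Bernoulli model over tag-pair presence (illustrative assumption)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # additive smoothing constant
        self.edge_counts = defaultdict(Counter)  # class -> tag pair -> document count
        self.doc_counts = Counter()              # class -> number of training documents
        self.vocab = set()                       # all tag pairs seen in training

    def fit(self, docs, labels):
        for xml_text, label in zip(docs, labels):
            edges = structure_features(xml_text)
            self.doc_counts[label] += 1
            self.edge_counts[label].update(edges)
            self.vocab |= edges
        return self

    def log_likelihood(self, xml_text, label):
        """Log-probability of the document's structure under one class."""
        edges = structure_features(xml_text)
        n = self.doc_counts[label]
        total = 0.0
        # Tag pairs never seen in training are scored with the smoothed prior.
        for edge in self.vocab | edges:
            p = (self.edge_counts[label][edge] + self.alpha) / (n + 2 * self.alpha)
            total += math.log(p if edge in edges else 1.0 - p)
        return total

    def predict(self, xml_text, novelty_threshold=None):
        """Return the most likely class, or None if the document looks novel."""
        scores = {c: self.log_likelihood(xml_text, c) for c in self.doc_counts}
        best = max(scores, key=scores.get)
        if novelty_threshold is not None and scores[best] < novelty_threshold:
            return None
        return best
```

In this sketch an unseen document is assigned to the class whose learned structure distribution gives it the highest log-likelihood, and it is flagged as novel when even the best class scores below a user-chosen threshold. The threshold value has to be tuned on held-out data; nothing in the abstract specifies how the authors set it.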
Acknowledgements
This work was supported by the Hungarian National Science Foundation (Grant OTKA 32487) and by EOARD (Grant F61775–00-WE065). Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the European Office of Aerospace Research and Development, Air Force Office of Scientific Research or the Air Force Research Laboratory.
Cite this article
Hévízi, G., Marcinkovics, T. & Lőrincz, A. Improving recognition accuracy on structured documents by learning structural patterns. Pattern Anal Applic 7, 66–76 (2004). https://doi.org/10.1007/s10044-004-0208-3