Abstract
Web data-extraction systems in use today mainly focus on the generation of extraction rules, i.e., wrapper induction. Thus, they appear ad hoc and are difficult to integrate when a holistic view is taken. Each phase in the data-extraction process is disconnected and does not share a common foundation to make the building of a complete system straightforward. In this paper, we demonstrate a holistic approach to Web data extraction. The principal component of our proposal is the notion of a document schema. Document schemata are patterns of structures embedded in documents. Once the document schemata are obtained, the various phases (e.g., training set preparation, wrapper induction and document classification) can be easily integrated. The implication of this is improved efficiency and better control over the extraction procedure. Our experimental results confirm this. More importantly, because a document can be represented as a vector of schemata, it can be easily incorporated into existing systems as the fabric for integration.
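As a rough illustration of what representing a document as a vector of schemata could look like, the following Python sketch may help. It is not the method of the paper: the abstract does not say how document schemata are detected, so the sketch assumes, purely for illustration, that each distinct root-to-element tag path counts as one "schema". The names `TagPathCollector`, `tag_paths`, `to_vector` and `cosine`, as well as the sample documents, are hypothetical.

```python
from html.parser import HTMLParser
from math import sqrt


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths seen in a document.

    Assumption: a tag path stands in for a "document schema" here; the
    paper's own schema definition is not given in the abstract.
    """

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths (stand-in schemata) found in html."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def to_vector(paths, vocabulary):
    """Binary vector: 1 if the document contains the schema, else 0."""
    return [1 if p in paths else 0 for p in vocabulary]


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical sample documents: two share a table layout, one does not.
docs = {
    "product_a": "<html><body><table><tr><td>Name</td></tr></table></body></html>",
    "product_b": "<html><body><table><tr><td>Other</td></tr></table></body></html>",
    "news": "<html><body><div><p>Story</p></div></body></html>",
}

path_sets = {name: tag_paths(html) for name, html in docs.items()}
vocabulary = sorted(set().union(*path_sets.values()))
vectors = {name: to_vector(paths, vocabulary) for name, paths in path_sets.items()}

same_layout = cosine(vectors["product_a"], vectors["product_b"])
different_layout = cosine(vectors["product_a"], vectors["news"])
```

Under these assumptions, documents with the same layout receive a higher similarity than documents with different layouts, which is the kind of signal that clustering or classification phases could share. A real system would replace the tag-path stand-in with the schema-detection procedure defined in the body of the paper.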
Cite this article
Li, Z., Ng, W. & Sun, A. Web data extraction based on structural similarity. Knowl Inf Syst 8, 438–461 (2005). https://doi.org/10.1007/s10115-004-0188-z
DOI: https://doi.org/10.1007/s10115-004-0188-z