Web data extraction based on structural similarity

Li, Zhao; Ng, Wee Keong; Sun, Aixin

doi:10.1007/s10115-004-0188-z

Regular Paper
Published: 02 February 2005

Volume 8, pages 438–461, (2005)
Knowledge and Information Systems

Zhao Li¹, Wee Keong Ng¹ & Aixin Sun¹

Abstract

Web data-extraction systems in use today mainly focus on the generation of extraction rules, i.e., wrapper induction. Thus, they appear ad hoc and are difficult to integrate when a holistic view is taken. Each phase in the data-extraction process is disconnected and does not share a common foundation to make the building of a complete system straightforward. In this paper, we demonstrate a holistic approach to Web data extraction. The principal component of our proposal is the notion of a document schema. Document schemata are patterns of structures embedded in documents. Once the document schemata are obtained, the various phases (e.g., training set preparation, wrapper induction and document classification) can be easily integrated. The implication of this is improved efficiency and better control over the extraction procedure. Our experimental results confirmed this. More importantly, because a document can be represented as a vector of schemata, it can be easily incorporated into existing systems as the fabric for integration.
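
As a rough illustration of the vector representation mentioned in the abstract, the Python sketch below counts root-to-node tag paths in HTML documents and maps each document onto a shared vector of path counts. Treating tag paths as stand-ins for document schemata is an assumption made purely for illustration; the abstract does not say how schemata are detected, and the names `TagPathCollector`, `tag_paths`, `to_vectors` and `cosine` are hypothetical, not taken from the paper.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Counts root-to-node tag paths (an assumed proxy for 'schemata')."""

    def __init__(self):
        super().__init__()
        self.stack = []         # currently open tags, outermost first
        self.paths = Counter()  # tag path -> number of occurrences

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Ignore stray closing tags; otherwise pop back to the matching open tag.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return a Counter of tag paths found in one HTML string."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def to_vectors(docs):
    """Map each document to a count vector over the shared path vocabulary."""
    counts = [tag_paths(d) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[p] for p in vocab] for c in counts]


def cosine(u, v):
    """Cosine similarity of two equal-length count vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    "<ul><li>x</li><li>y</li></ul>",
    "<ul><li>z</li></ul>",
    "<table><tr><td>1</td></tr></table>",
]
vocab, vectors = to_vectors(docs)
```

With vectors of this kind, structurally similar pages (the two lists above) score close to 1 under `cosine`, while pages with no shared paths score 0, which is the sort of input a standard clustering or classification step could consume.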

Author information

Authors and Affiliations

Centre for Advanced Information Systems, School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, Singapore, 639798
Zhao Li, Wee Keong Ng & Aixin Sun

Corresponding author

Correspondence to Zhao Li.

About this article

Cite this article

Li, Z., Ng, W. & Sun, A. Web data extraction based on structural similarity. Knowl Inf Syst 8, 438–461 (2005). https://doi.org/10.1007/s10115-004-0188-z

Received: 03 August 2004
Revised: 15 October 2004
Accepted: 24 October 2004
Published: 02 February 2005
Issue Date: November 2005
DOI: https://doi.org/10.1007/s10115-004-0188-z
