From one tree to a forest: a unified solution for structured web data extraction

Authors:

Lei Zhang

SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval

Pages 775 - 784

https://doi.org/10.1145/2009916.2010020

Published: 24 July 2011

Abstract

Structured data, in the form of entities and associated attributes, has been a rich web resource for search engines and knowledge databases. Much research effort has gone into efficiently extracting structured data from the enormous number of websites in various verticals (e.g., books, restaurants), but most existing approaches either require considerable human effort or rely on strong features that lack flexibility. We consider an ambitious scenario: can we build a system that (1) is general enough to handle any vertical without re-implementation and (2) requires only one labeled example site from each vertical for training, and then automatically handles other sites in the same vertical? In this paper, we propose a unified solution to demonstrate the feasibility of this scenario. Specifically, we design a set of weak but general features to characterize vertical knowledge (including attribute-specific semantics and inter-attribute layout relationships). Such features can be adopted in various verticals without redesign; meanwhile, they are weak enough to keep the learnt knowledge from overfitting to the seed sites. Given a new, unseen site, the learnt knowledge is first applied to identify page-level candidate attribute values, which inevitably include false positives. To remove this noise, site-level information about the new site is then exploited to boost the true values. The site-level information is derived in an unsupervised manner, so it does not reduce the applicability of the solution. Promising experimental performance on 80 websites in 8 distinct verticals demonstrates the feasibility and flexibility of the proposed solution.
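The two-stage idea in the abstract (noisy page-level candidates, then site-level evidence to suppress false positives) can be illustrated with a minimal sketch. This is not the paper's algorithm: the candidate tuples, the scores, the function names `pick_site_level_paths` and `extract`, and the choice to simply sum scores per DOM path across pages are all assumptions made for illustration.

```python
from collections import defaultdict

def pick_site_level_paths(pages):
    """Choose one DOM path per attribute by summing page-level scores site-wide.

    pages: list of pages; each page is a list of
    (dom_path, attribute, score, text) candidates (a hypothetical format).
    """
    totals = defaultdict(float)
    for candidates in pages:
        for dom_path, attribute, score, _text in candidates:
            totals[(attribute, dom_path)] += score
    best = {}
    for (attribute, dom_path), total in totals.items():
        if attribute not in best or total > best[attribute][1]:
            best[attribute] = (dom_path, total)
    return {attribute: path for attribute, (path, _total) in best.items()}

def extract(pages, chosen):
    """Keep only the candidates that sit at the site-level chosen path."""
    records = []
    for candidates in pages:
        record = {}
        for dom_path, attribute, _score, text in candidates:
            if chosen.get(attribute) == dom_path:
                record[attribute] = text
        records.append(record)
    return records

# Made-up candidates for two pages of one hypothetical book site.
pages = [
    [("/html/body/h1", "title", 0.9, "Dune"),
     ("/html/body/div[2]/span", "title", 0.4, "Related: Emma"),
     ("/html/body/div[1]/a", "author", 0.7, "Frank Herbert")],
    [("/html/body/h1", "title", 0.6, "Emma"),
     ("/html/body/div[2]/span", "title", 0.7, "Related: Dune"),
     ("/html/body/div[1]/a", "author", 0.8, "Jane Austen")],
]
chosen = pick_site_level_paths(pages)
records = extract(pages, chosen)
```

On the second page, the page-level scores alone would favor the "Related" span as the title. Summing over the whole site favors the `h1` path, so the false positive is dropped. The sketch relies on pages of one site sharing a template, which is the intuition behind using site-level information.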

Index Terms

1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction

Published In

SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval

July 2011

1374 pages

ISBN:9781450307574

DOI:10.1145/2009916

General Chairs:
Wei-Ying Ma (Microsoft Research Asia, China), Jian-Yun Nie (University of Montreal, Canada)

Program Chairs:
Ricardo Baeza-Yates (Yahoo! Research, Spain), Tat-Seng Chua (National University of Singapore), W. Bruce Croft (University of Massachusetts, Amherst, USA)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGIR '11

Sponsor:

SIGIR

SIGIR '11: The 34th International ACM SIGIR conference on research and development in Information Retrieval

July 24 - 28, 2011

Beijing, China

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Bibliometrics

Total Citations: 47
Total Downloads: 833
Downloads (Last 12 months): 77
Downloads (Last 6 weeks): 8

Reflects downloads up to 25 Jan 2025

Cited By

Niu R, Li J, Wang S, Fu Y, Hu X, Leng X, Kong H, Chang Y, Wang Q, Larson K (2024). ScreenAgent. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, pages 6433-6441. Online publication date: 3-Aug-2024. https://dl.acm.org/doi/10.24963/ijcai.2024/711

Xu H, Chen L, Zhao Z, Ma D, Cao R, Zhu Z, Yu K, Angélica L, Lattanzi S, Muñoz Medina A, Akoglu L, Gionis A, Vassilvitskii S (2024). Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding. Proceedings of the 17th ACM International Conference on Web Search and Data Mining, pages 864-872. Online publication date: 4-Mar-2024. https://dl.acm.org/doi/10.1145/3616855.3635753

Arora S, Yang B, Eyuboglu S, Narayan A, Hojel A, Trummer I, Ré C (2023). Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes. Proceedings of the VLDB Endowment, 17:2, pages 92-105. Online publication date: 1-Oct-2023. https://dl.acm.org/doi/10.14778/3626292.3626294

Sarkhel R, Huang B, Lockard C, Shiralkar P (2023). Self-Training for Label-Efficient Information Extraction from Semi-Structured Web-Pages. Proceedings of the VLDB Endowment, 16:11, pages 3098-3110. Online publication date: 24-Aug-2023. https://dl.acm.org/doi/10.14778/3611479.3611511

Nunes M, Dorneles C (2023). EDREW - Enhanced Data Representation for Extraction in Web. Proceedings of the 29th Brazilian Symposium on Multimedia and the Web, pages 230-237. Online publication date: 23-Oct-2023. https://dl.acm.org/doi/10.1145/3617023.3617055

Zhang Z, Yu B, Liu T, Liu T, Wang Y, Guo L (2023). Learning Structural Co-occurrences for Structured Web Data Extraction in Low-Resource Settings. Proceedings of the ACM Web Conference 2023, pages 1683-1692. Online publication date: 30-Apr-2023. https://dl.acm.org/doi/10.1145/3543507.3583387

Bevendorff J, Gupta S, Kiesel J, Stein B, Chen H, Duh W, Huang H, Kato M, Mothe J, Poblete B (2023). An Empirical Comparison of Web Content Extraction Algorithms. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2594-2603. Online publication date: 19-Jul-2023. https://dl.acm.org/doi/10.1145/3539618.3591920

Yang Y, Feng J, Li B, Yuan F, Cao C, Liu Y (2023). EDDVPL: A Web Attribute Extraction Method with Prompt Learning. Neural Information Processing, pages 474-484. Online publication date: 27-Nov-2023. https://doi.org/10.1007/978-981-99-8181-6_36

Burget R, Salem H (2023). Creating Searchable Web Page Snapshots Using Semantic Technologies. Web Engineering, pages 355-358. Online publication date: 16-Jun-2023. https://doi.org/10.1007/978-3-031-34444-2_26

Feng J, Cao C, Yuan F, Zhang X, Li Z, Liu Y, Tan J (2023). DOM2R-Graph: A Web Attribute Extraction Architecture with Relation-Aware Heterogeneous Graph Transformer. Neural Information Processing, pages 468-479. Online publication date: 13-Apr-2023. https://doi.org/10.1007/978-3-031-30105-6_39
