
From one tree to a forest: a unified solution for structured web data extraction

Published: 24 July 2011

Abstract

Structured data, in the form of entities and associated attributes, has been a rich web resource for search engines and knowledge databases. Much research effort has been invested in efficiently extracting structured data from enormous websites in various verticals (e.g., books, restaurants), but most existing approaches either require considerable human effort or rely on strong features that lack flexibility. We consider an ambitious scenario: can we build a system that (1) is general enough to handle any vertical without re-implementation and (2) requires only one labeled example site from each vertical for training in order to deal automatically with other sites in the same vertical? In this paper, we propose a unified solution to demonstrate the feasibility of this scenario. Specifically, we design a set of weak but general features to characterize vertical knowledge (including attribute-specific semantics and inter-attribute layout relationships). Such features can be adopted in various verticals without redesign; meanwhile, they are weak enough to avoid overfitting the learnt knowledge to seed sites. Given a new unseen site, the learnt knowledge is first applied to identify page-level candidate attribute values, which inevitably include false positives. To remove this noise, site-level information of the new site is then exploited to boost the true values. The site-level information is derived in an unsupervised manner, without harming the applicability of the solution. Promising experimental performance on 80 websites in 8 distinct verticals demonstrates the feasibility and flexibility of the proposed solution.
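
The two-stage idea in the abstract (noisy page-level candidates, then site-level evidence to suppress false positives) can be illustrated with a toy sketch. This is not the paper's algorithm: the candidate tuples, the scores, the use of DOM paths as the site-level signal, and the function names `pick_site_level_paths` and `extract` are all assumptions made for illustration. The sketch sums page-level scores per DOM path across all pages of one site, then keeps only the values found at the best path for each attribute.

```python
from collections import defaultdict

# Hypothetical page-level candidates for one unseen site.
# Each candidate is (attribute, dom_path, text, score); all values are made up.
example_pages = [
    [("title", "/html/body/h1", "Dune", 0.6),
     ("title", "/html/body/div[2]/a", "Related: Emma", 0.7)],  # false positive
    [("title", "/html/body/h1", "Emma", 0.8),
     ("title", "/html/body/div[2]/a", "Related: Dune", 0.2)],
]

def pick_site_level_paths(pages):
    """Return, per attribute, the DOM path with the highest summed score."""
    totals = defaultdict(float)
    for candidates in pages:
        for attribute, dom_path, _text, score in candidates:
            totals[(attribute, dom_path)] += score
    best = {}
    for (attribute, dom_path), total in totals.items():
        if attribute not in best or total > best[attribute][1]:
            best[attribute] = (dom_path, total)
    return {attribute: path for attribute, (path, _total) in best.items()}

def extract(pages):
    """Keep only the candidates located at the site-level best path."""
    paths = pick_site_level_paths(pages)
    records = []
    for candidates in pages:
        record = {}
        for attribute, dom_path, text, _score in candidates:
            if paths.get(attribute) == dom_path:
                record[attribute] = text
        records.append(record)
    return records

if __name__ == "__main__":
    print(extract(example_pages))
```

On the first page the false positive outscores the true title, but the heading path accumulates more evidence across the site, so the site-level step recovers the true value on both pages.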

      Published In

      SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
      July 2011
      1374 pages
      ISBN: 9781450307574
      DOI: 10.1145/2009916
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. information extraction
      2. site-level information
      3. structured data
      4. vertical knowledge

      Qualifiers

      • Research-article

      Acceptance Rates

      Overall Acceptance Rate 792 of 3,983 submissions, 20%
