research-article

Automatic wrappers for large scale web extraction

Published: 01 January 2011

Abstract

We present a generic framework to make wrapper induction algorithms tolerant to noise in the training data. This enables us to learn wrappers in a completely unsupervised manner from automatically and cheaply obtained noisy training data, e.g., using dictionaries and regular expressions. By removing the site-level supervision that wrapper-based techniques require, we are able to perform information extraction at web-scale, with accuracy unattained with existing unsupervised extraction techniques. Our system is used in production at Yahoo! and powers live applications.
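To make "automatically and cheaply obtained noisy training data" concrete, here is a minimal sketch of dictionary- and regex-based annotation. It is not the paper's framework: the page contents, the node paths, the names `PAGE`, `PHONE_RE`, `NAME_DICTIONARY`, and `annotate` are all hypothetical, and the page is assumed to be already flattened into (path, text) pairs.

```python
import re

# Hypothetical page, flattened to (node path, text) pairs.
# A real system would obtain these by parsing HTML.
PAGE = [
    ("/html/body/div[1]/h1", "Joe's Pizza"),
    ("/html/body/div[2]/span[1]", "Call us: 555-123-4567"),
    ("/html/body/div[2]/span[2]", "Fax: 555-987-6543"),
    ("/html/body/div[3]/p", "Reviewed by Joe's Pizza fans"),
]

# Cheap annotators (both are illustrative assumptions):
# a regular expression for phone numbers and a small name dictionary.
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
NAME_DICTIONARY = {"joe's pizza", "luigi's trattoria"}


def annotate(nodes):
    """Return (path, label) pairs produced by the cheap annotators."""
    labels = []
    for path, text in nodes:
        if PHONE_RE.search(text):
            labels.append((path, "phone"))
        lowered = text.lower()
        if any(name in lowered for name in NAME_DICTIONARY):
            labels.append((path, "name"))
    return labels


for path, label in annotate(PAGE):
    print(label, path)
```

In this toy page the annotators also label the fax number as a phone and the review sentence as a name. Labels of this kind are the noise that, according to the abstract, a wrapper induction algorithm must tolerate when no site-level supervision is available.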

Information & Contributors

Information

Published In

Proceedings of the VLDB Endowment, Volume 4, Issue 4
January 2011
59 pages
ISSN:2150-8097
Editors: José Blakeley, Joseph M. Hellerstein, Nick Koudas, Wolfgang Lehner, Sunita Sarawagi, Uwe Röhm

Publisher

VLDB Endowment
