research-article

Automatic wrappers for large scale web extraction

Editors: José Blakeley, Joseph M. Hellerstein, Nick Koudas, Wolfgang Lehner, Sunita Sarawagi, Uwe Röhm

Authors: Mohamed Soliman

Proceedings of the VLDB Endowment, Volume 4, Issue 4

Pages 219 - 230

https://doi.org/10.14778/1938545.1938547

Published: 01 January 2011

Abstract

We present a generic framework to make wrapper induction algorithms tolerant to noise in the training data. This enables us to learn wrappers in a completely unsupervised manner from automatically and cheaply obtained noisy training data, e.g., using dictionaries and regular expressions. By removing the site-level supervision that wrapper-based techniques require, we are able to perform information extraction at web-scale, with accuracy unattained with existing unsupervised extraction techniques. Our system is used in production at Yahoo! and powers live applications.
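
The idea in the abstract, learning a site wrapper from cheap and noisy annotations, can be illustrated with a minimal sketch. This is not the paper's algorithm, which this page does not describe. Everything in it is an assumption made for illustration: the toy pages (dicts mapping simplified XPath-like strings to node text), the phone-number regular expression standing in for a noisy annotator, and the F1 score used to pick the wrapper that agrees best with the noisy labels.

```python
import re

# Hypothetical noisy annotator: a regex that matches one phone format only.
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

# Hypothetical pages from one site: simplified XPath-like path -> node text.
PAGES = [
    {"/body/div[1]/span": "555-123-4567",    # phone, labeled correctly
     "/body/div[2]/p": "Open 9 to 5",
     "/body/div[3]/span": "555-000-1111"},   # fax, labeled by mistake
    {"/body/div[1]/span": "555-222-3333",    # phone, labeled correctly
     "/body/div[2]/p": "Open 10 to 6",
     "/body/div[3]/span": "n/a"},
    {"/body/div[1]/span": "(555) 444 5555",  # phone, missed by the regex
     "/body/div[2]/p": "Closed Sundays",
     "/body/div[3]/span": "n/a"},
]


def noisy_labels(pages):
    """Return the (page index, path) pairs the regex annotator marks."""
    return {(i, path)
            for i, page in enumerate(pages)
            for path, text in page.items()
            if PHONE.search(text)}


def apply_wrapper(path, pages):
    """A wrapper here is a single path; it extracts that node on every page."""
    return {(i, path) for i, page in enumerate(pages) if path in page}


def score(path, pages, labels):
    """F1 agreement between the wrapper's output and the noisy labels."""
    extracted = apply_wrapper(path, pages)
    hits = len(extracted & labels)
    return 2 * hits / (len(extracted) + len(labels))


def learn_wrapper(pages):
    """Pick the candidate path that agrees best with the noisy labels."""
    labels = noisy_labels(pages)
    candidates = sorted({path for page in pages for path in page})
    return max(candidates, key=lambda p: score(p, pages, labels))


best = learn_wrapper(PAGES)
extracted = [page[best] for page in PAGES if best in page]
print(best)
print(extracted)
```

In this toy setting the selected path tolerates both kinds of annotation noise: it ignores the mislabeled fax number and still extracts the phone number on the third page, which the regex never labeled.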

Published In

Proceedings of the VLDB Endowment Volume 4, Issue 4

January 2011

59 pages

ISSN: 2150-8097

Publisher

VLDB Endowment

Qualifiers

Research-article

Bibliometrics

Total Citations: 57
Total Downloads: 632
Downloads (Last 12 months): 2
Downloads (Last 6 weeks): 0

Reflects downloads up to 03 Oct 2024

Cited By

Zhang Z, Yu B, Liu T, Liu T, Wang Y, Guo L (2023). Learning Structural Co-occurrences for Structured Web Data Extraction in Low-Resource Settings. Proceedings of the ACM Web Conference 2023, pages 1683-1692. Online publication date: 30-Apr-2023.
https://dl.acm.org/doi/10.1145/3543507.3583387

Parthasarathy S, Pattanaik L, Khatry A, Iyer A, Radhakrishna A, Rajamani S, Raza M, Jhala R, Dillig I (2022). Landmarks and regions: a robust approach to data extraction. Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation, pages 993-1009. Online publication date: 9-Jun-2022.
https://dl.acm.org/doi/10.1145/3519939.3523705

Wang Q, Fang Y, Ravula A, Feng F, Quan X, Liu D (2022). WebFormer: The Web-page Transformer for Structure Information Extraction. Proceedings of the ACM Web Conference 2022, pages 3124-3133. Online publication date: 25-Apr-2022.
https://dl.acm.org/doi/10.1145/3485447.3512032

Xie C, Huang W, Liang J, Huang C, Xiao Y, Demartini G, Zuccon G, Culpepper J, Huang Z, Tong H (2021). WebKE. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 2211-2220. Online publication date: 26-Oct-2021.
https://dl.acm.org/doi/10.1145/3459637.3482491

Wang Q, Kanagal B, Garg V, Sivakumar D, Zhu W, Tao D, Cheng X, Cui P, Rundensteiner E, Carmel D, He Q, Xu Yu J (2019). Constructing a Comprehensive Events Database from the Web. Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 229-238. Online publication date: 3-Nov-2019.
https://dl.acm.org/doi/10.1145/3357384.3357986

Crescenzi V, Merialdo P, Qiu D (2019). Hybrid Crowd-Machine Wrapper Inference. ACM Transactions on Knowledge Discovery from Data, 13(5):1-43. Online publication date: 24-Sep-2019.
https://dl.acm.org/doi/10.1145/3344720

Iyer A, Jonnalagedda M, Parthasarathy S, Radhakrishna A, Rajamani S, McKinley K, Fisher K (2019). Synthesis and machine learning for heterogeneous extraction. Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 301-315. Online publication date: 8-Jun-2019.
https://dl.acm.org/doi/10.1145/3314221.3322485

Guo J, Crescenzi V, Furche T, Grasso G, Gottlob G (2019). RED: Redundancy-Driven Data Extraction from Result Pages? The World Wide Web Conference, pages 605-615. Online publication date: 13-May-2019.
https://dl.acm.org/doi/10.1145/3308558.3313529

Yuliang W, Qi Z, Fang L, Xixian H, Guodong X, Bailing W (2019). A novel approach for Web page modeling in personal information extraction. World Wide Web, 22(2):603-620. Online publication date: 1-Mar-2019.
https://dl.acm.org/doi/10.1007/s11280-018-0631-9

Lockard C, Dong X, Einolghozati A, Shiralkar P (2018). CERES. Proceedings of the VLDB Endowment, 11(10):1084-1096. Online publication date: 1-Jun-2018.
https://dl.acm.org/doi/10.14778/3231751.3231758
