Article

Adaptive record extraction from web pages

Authors:

Justin Park,

Denilson Barbosa

WWW '07: Proceedings of the 16th international conference on World Wide Web

Pages 1335 - 1336

https://doi.org/10.1145/1242572.1242838

Published: 08 May 2007

Abstract

We describe an adaptive method for extracting records from web pages. Our algorithm combines a weighted tree matching metric with clustering to obtain data extraction patterns. We compare our method experimentally to the state of the art, and show that our approach is very competitive for rigidly structured records (such as product descriptions) and far superior for loosely structured records (such as entries on blogs).
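The abstract does not spell out the metric or the clustering step, so the following is only a minimal sketch of how a weighted tree matching score and clustering might fit together, not the authors' algorithm. Everything in it is an assumption made for illustration: the `Node` type, the depth-decay weighting (`decay`), the ordered top-down matching, and the greedy threshold clustering (`threshold`) are all hypothetical choices.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element: a tag plus ordered children."""
    tag: str
    children: list["Node"] = field(default_factory=list)


def size(node: Node, decay: float, depth: int = 0) -> float:
    """Total weight of a subtree; a node at depth d weighs decay**d (assumed)."""
    return decay ** depth + sum(size(c, decay, depth + 1) for c in node.children)


def match(a: Node, b: Node, decay: float, depth: int = 0) -> float:
    """Weighted top-down ordered tree matching.

    Roots with different tags score zero. Otherwise the children are aligned
    in order with a dynamic program, and every matched node adds decay**depth.
    """
    if a.tag != b.tag:
        return 0.0
    m, n = len(a.children), len(b.children)
    table = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i - 1][j],
                table[i][j - 1],
                table[i - 1][j - 1]
                + match(a.children[i - 1], b.children[j - 1], decay, depth + 1),
            )
    return decay ** depth + table[m][n]


def similarity(a: Node, b: Node, decay: float = 0.5) -> float:
    """Normalised score in [0, 1]; identical structures score 1.0."""
    return 2 * match(a, b, decay) / (size(a, decay) + size(b, decay))


def cluster(subtrees: list[Node], threshold: float = 0.6) -> list[list[Node]]:
    """Greedy one-pass clustering: a subtree joins the first cluster whose
    first member is similar enough, otherwise it starts a new cluster."""
    clusters: list[list[Node]] = []
    for tree in subtrees:
        for group in clusters:
            if similarity(group[0], tree) >= threshold:
                group.append(tree)
                break
        else:
            clusters.append([tree])
    return clusters
```

Under these assumptions, mismatches near the root of a candidate record cost more than mismatches deep inside it, so two blog entries that differ only by an optional image would still land in the same cluster, while a navigation list would not. A cluster of similar sibling subtrees could then serve as the basis for an extraction pattern.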

Index Terms

Adaptive record extraction from web pages
1. Information systems

Published In

WWW '07: Proceedings of the 16th international conference on World Wide Web

May 2007

1382 pages

ISBN:9781595936547

DOI:10.1145/1242572

General Chairs: Carey Williamson (University of Calgary, Canada), Mary Ellen Zurko (IBM, USA)

Program Chairs: Peter Patel-Schneider (Bell Labs Research, USA), Prashant Shenoy (University of Massachusetts at Amherst, USA)

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

WWW'07

Sponsor:

WWW'07: 16th International World Wide Web Conference

May 8 - 12, 2007

Banff, Alberta, Canada

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
