U-REST: an unsupervised record extraction system

Authors: Yuan Kui Shen and David R. Karger

WWW '07: Proceedings of the 16th international conference on World Wide Web

May 2007

Pages 1347 - 1348

https://doi.org/10.1145/1242572.1242844

Published: 08 May 2007

Abstract

In this paper, we describe a system that can extract record structures from web pages with no direct human supervision. Records are commonly occurring HTML-embedded data tuples that describe people, offered courses, products, company profiles, etc. We present a simplified framework for studying the problem of unsupervised record extraction, one which separates the algorithms from the feature engineering. Our system, U-REST, formalizes an approach to the problem of unsupervised record extraction using a simple two-stage machine learning framework. The first stage involves clustering, where structurally similar regions are discovered, and the second stage involves classification, where discovered groupings (clusters of regions) are ranked by their likelihood of being records. In our work, we describe and summarize the results of an extensive survey of features for both stages. We conclude by comparing U-REST to related systems. The results of our empirical evaluation show encouraging improvements in extraction accuracy.
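
To make the two-stage shape concrete, here is a minimal Python sketch of a cluster-then-rank pipeline. It is an illustration only, not U-REST's implementation: the abstract does not name the features or learners used, so everything below is an assumption. Regions are represented as flat lists of tag names, `tag_similarity` is a hypothetical structural feature (multiset Jaccard), `cluster_regions` is a simple greedy threshold grouping, and `record_score` is a hand-set stand-in for the trained classifier of the second stage.

```python
from collections import Counter


def tag_similarity(a, b):
    """Multiset Jaccard similarity between two tag lists (assumed feature)."""
    ca, cb = Counter(a), Counter(b)
    inter = sum((ca & cb).values())
    union = sum((ca | cb).values())
    return inter / union if union else 1.0


def cluster_regions(regions, threshold=0.8):
    """Stage 1 (sketch): put each region into the first cluster that
    already holds a sufficiently similar region, else start a new one."""
    clusters = []
    for idx, region in enumerate(regions):
        for cluster in clusters:
            if any(tag_similarity(region, regions[j]) >= threshold
                   for j in cluster):
                cluster.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


def record_score(cluster, regions):
    """Stage 2 stand-in (hypothetical, not a trained classifier):
    favor clusters with many members and non-trivial structure."""
    size = len(cluster)
    avg_len = sum(len(regions[i]) for i in cluster) / size
    return (size - 1) * min(avg_len, 10)


def rank_clusters(regions, threshold=0.8):
    """Cluster regions, then order clusters from most to least record-like."""
    clusters = cluster_regions(regions, threshold)
    return sorted(clusters, key=lambda c: record_score(c, regions),
                  reverse=True)


# Toy input: three list-item-like regions, one header, one paragraph.
regions = [
    ["li", "a", "span"],
    ["li", "a", "span"],
    ["li", "a", "span", "img"],
    ["div", "h1"],
    ["p"],
]
ranked = rank_clusters(regions, threshold=0.7)
print(ranked[0])  # indices of the regions in the top-ranked cluster
```

Keeping the similarity function and the scoring function as separate, swappable pieces mirrors the separation of algorithms from feature engineering that the abstract describes; a real system would replace both with learned or surveyed features.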

Index Terms

U-REST: an unsupervised record extraction system
1. Information systems
  1. Information retrieval
  2. Information storage systems

Published In

WWW '07: Proceedings of the 16th international conference on World Wide Web

May 2007

1382 pages

ISBN:9781595936547

DOI:10.1145/1242572

General Chairs: Carey Williamson (University of Calgary, Canada), Mary Ellen Zurko (IBM, USA)

Program Chairs: Peter Patel-Schneider (Bell Labs Research, USA), Prashant Shenoy (University of Massachusetts at Amherst, USA)

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

WWW'07: 16th International World Wide Web Conference

May 8 - 12, 2007

Banff, Alberta, Canada

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
