Article (DOI: 10.1145/1242572.1242844)

U-REST: an unsupervised record extraction system

Published: 08 May 2007

    Abstract

    In this paper, we describe a system that can extract record structures from web pages with no direct human supervision. Records are commonly occurring HTML-embedded data tuples that describe people, offered courses, products, company profiles, etc. We present a simplified framework for studying the problem of unsupervised record extraction, one that separates the algorithms from the feature engineering. Our system, U-REST, formalizes an approach to the problem of unsupervised record extraction using a simple two-stage machine learning framework. The first stage involves clustering, where structurally similar regions are discovered, and the second stage involves classification, where discovered groupings (clusters of regions) are ranked by their likelihood of being records. In our work, we describe and summarize the results of an extensive survey of features for both stages. We conclude by comparing U-REST to related systems. The results of our empirical evaluation show encouraging improvements in extraction accuracy.
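    The two-stage framework described in the abstract can be illustrated with a minimal sketch. This is not the U-REST implementation: the abstract does not name the features, the clustering algorithm, or the classifier. The sketch assumes that each candidate region is reduced to its sequence of HTML tag names, that structural similarity is Jaccard overlap of adjacent tag pairs, and that a hand-set score stands in for the trained classifier of the second stage. All names (`Region`, `cluster_regions`, `record_score`, `rank_clusters`) and the 0.6 threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Region:
    """A candidate page region, reduced to its sequence of HTML tag names."""
    tags: Tuple[str, ...]


def tag_bigrams(region: Region) -> Set[Tuple[str, str]]:
    """Assumed structural feature: the set of adjacent tag pairs."""
    return set(zip(region.tags, region.tags[1:]))


def jaccard(a: Set, b: Set) -> float:
    """Set overlap in [0, 1]; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_regions(regions: List[Region], threshold: float = 0.6) -> List[List[Region]]:
    """Stage 1 (clustering): greedy grouping of structurally similar regions.

    A region joins the first cluster whose first member is similar enough;
    otherwise it starts a new cluster.
    """
    clusters: List[List[Region]] = []
    for region in regions:
        feats = tag_bigrams(region)
        for cluster in clusters:
            if jaccard(feats, tag_bigrams(cluster[0])) >= threshold:
                cluster.append(region)
                break
        else:
            clusters.append([region])
    return clusters


def record_score(cluster: List[Region]) -> float:
    """Stage 2 stand-in: a hand-set score, NOT a trained classifier.

    Rewards clusters with many members and non-trivial structure.
    """
    size = len(cluster)
    mean_len = sum(len(r.tags) for r in cluster) / size
    return size * min(mean_len, 10.0)


def rank_clusters(clusters: List[List[Region]]) -> List[List[Region]]:
    """Order clusters from most to least record-like under record_score."""
    return sorted(clusters, key=record_score, reverse=True)


# Toy input: three table-row-like regions and one unrelated region.
regions = [
    Region(("tr", "td", "a", "td", "span")),
    Region(("tr", "td", "a", "td", "span")),
    Region(("tr", "td", "a", "td", "span", "b")),
    Region(("div", "img")),
]
ranked = rank_clusters(cluster_regions(regions))
print([len(c) for c in ranked])  # prints [3, 1]
```

    In this toy input the three row-like regions fall into one cluster and rank first. Keeping the feature function (`tag_bigrams`) and the scoring function (`record_score`) separate from the clustering and ranking code mirrors the separation of algorithms from feature engineering that the abstract emphasizes.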



        Published In

        WWW '07: Proceedings of the 16th international conference on World Wide Web
        May 2007
        1382 pages
        ISBN:9781595936547
        DOI:10.1145/1242572

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. clustering
        2. record extraction

        Qualifiers

        • Article

        Conference

        WWW'07: 16th International World Wide Web Conference
        May 8 - 12, 2007
        Banff, Alberta, Canada

        Acceptance Rates

        Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
