research-article, DOI: 10.1145/1316902.1316920

Adaptive web-page content identification

Published: 09 November 2007

Abstract

Identifying which parts of a Web-page contain target content (e.g., the portion of an online news page that contains the actual article) is a significant problem that must be addressed for many Web-based applications. Most approaches to this problem involve crafting hand-tailored rules or scripts to extract the content, customized separately for particular Web sites. Besides requiring considerable time and effort to implement, hand-built extraction routines are brittle: they fail to properly extract content in some cases and break when the structure of a site's Web-pages changes. In this work we treat the problem of identifying content as a sequence labeling problem, a common problem structure in machine learning and natural language processing. Using a Conditional Random Field sequence labeling model, we correctly identify the content portion of Web-pages 80% to 97% of the time, depending on experimental factors such as whether duplicate documents are excluded and whether the model is applied to previously unseen sources.
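
To make the sequence labeling framing concrete, the sketch below labels each line of a page as "content" or "other" with Viterbi decoding over a linear chain, the decoding step used by linear-chain Conditional Random Fields. It is an illustration only, not the authors' model: the features (`long`, `short`, `ends_with_period`), the weights in `EMIT` and `TRANS`, and the sample page are all hypothetical and hand-set, whereas a real CRF would learn its weights from labeled pages.

```python
from typing import Dict, List

LABELS = ("content", "other")

def features(line: str) -> Dict[str, float]:
    """Toy per-line features (hypothetical, chosen for illustration)."""
    words = line.split()
    return {
        "long": 1.0 if len(words) >= 8 else 0.0,
        "short": 1.0 if len(words) < 8 else 0.0,
        "ends_with_period": 1.0 if line.rstrip().endswith(".") else 0.0,
    }

# Hand-set weights standing in for learned CRF parameters (assumption).
EMIT = {
    ("content", "long"): 1.5,
    ("content", "ends_with_period"): 1.0,
    ("other", "short"): 2.0,
}
# Staying in the same label is rewarded; switching labels is penalized.
TRANS = {
    ("content", "content"): 1.0,
    ("other", "other"): 1.0,
    ("content", "other"): -0.5,
    ("other", "content"): -0.5,
}

def emit_score(label: str, feats: Dict[str, float]) -> float:
    """Score of assigning `label` to a line with the given features."""
    return sum(EMIT.get((label, name), 0.0) * value for name, value in feats.items())

def viterbi(lines: List[str]) -> List[str]:
    """Return the highest-scoring label sequence for the lines of a page."""
    if not lines:
        return []
    feats = [features(line) for line in lines]
    best = [{lab: emit_score(lab, feats[0]) for lab in LABELS}]
    back = []  # back[t - 1][lab] = best previous label when line t has label lab
    for t in range(1, len(lines)):
        scores, pointers = {}, {}
        for lab in LABELS:
            prev = max(LABELS, key=lambda p: best[-1][p] + TRANS[(p, lab)])
            scores[lab] = best[-1][prev] + TRANS[(prev, lab)] + emit_score(lab, feats[t])
            pointers[lab] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(LABELS, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

PAGE = [
    "Home | News | Sports",
    "The city council voted on Tuesday to approve the new transit budget.",
    "Officials said.",
    "Construction is expected to begin early next year, according to the mayor.",
    "Copyright 2007 Example News",
]

if __name__ == "__main__":
    for label, line in zip(viterbi(PAGE), PAGE):
        print(f"{label:8s} {line}")
```

In this toy setup the short line "Officials said." would score higher as "other" if classified on its own, but the transition weights pull it into the surrounding run of content lines. That dependence on neighboring labels is what distinguishes sequence labeling from classifying each line independently.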

    Published In

    WIDM '07: Proceedings of the 9th annual ACM international workshop on Web information and data management
    November 2007
    168 pages
    ISBN:9781595938299
    DOI:10.1145/1316902

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. conditional random fields
    2. content identification
    3. maximum entropy markov models
    4. sequence labeling

    Qualifiers

    • Research-article

    Conference

    CIKM07

    Cited By

    • (2020) Boilerplate Removal using a Neural Sequence Labeling Model. Companion Proceedings of the Web Conference 2020, pp. 226-229. DOI: 10.1145/3366424.3383547. Online publication date: 20-Apr-2020
    • (2019) An effective and efficient Web content extractor for optimizing the crawling process. Software—Practice & Experience, 44:10, pp. 1181-1199. DOI: 10.1002/spe.2195. Online publication date: 4-Jan-2019
    • (2018) Content extraction from news web pages using tag tree. International Journal of Autonomic Computing, 3:1, pp. 34-51. DOI: 10.1504/IJAC.2018.092548. Online publication date: 11-Dec-2018
    • (2018) An Automatic e-news Article Content Extraction and Classification. 2018 18th International Conference on Advances in ICT for Emerging Regions (ICTer), pp. 196-202. DOI: 10.1109/ICTER.2018.8615480. Online publication date: Sep-2018
    • (2018) Using Machine Learning to Improve Regulatory Review of Flight Waivers and Exemptions. 2018 IEEE/AIAA 37th Digital Avionics Systems Conference (DASC), pp. 1-8. DOI: 10.1109/DASC.2018.8569775. Online publication date: Sep-2018
    • (2018) Web2Text: Deep Structured Boilerplate Removal. Advances in Information Retrieval, pp. 167-179. DOI: 10.1007/978-3-319-76941-7_13. Online publication date: 1-Mar-2018
    • (2017) A novel algorithm for extracting the user reviews from web pages. Journal of Information Science, 43:5, pp. 696-712. DOI: 10.1177/0165551516666446. Online publication date: 1-Oct-2017
    • (2017) SVM-based web content mining with leaf classification unit from DOM-tree. 2017 9th International Conference on Knowledge and Smart Technology (KST), pp. 359-364. DOI: 10.1109/KST.2017.7886134. Online publication date: Feb-2017
    • (2017) Web content information extraction based on DOM tree and statistical information. 2017 IEEE 17th International Conference on Communication Technology (ICCT), pp. 1308-1311. DOI: 10.1109/ICCT.2017.8359846. Online publication date: Oct-2017
    • (2016) To extract informative content from online web pages by using hybrid approach. 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), pp. 972-977. DOI: 10.1109/ICEEOT.2016.7754831. Online publication date: Mar-2016
