Adaptive web-page content identification

Authors:

Susan Lubar

WIDM '07: Proceedings of the 9th annual ACM international workshop on Web information and data management

Pages 105 - 112

https://doi.org/10.1145/1316902.1316920

Published: 09 November 2007

Abstract

Identifying which parts of a Web page contain target content (e.g., the portion of an online news page that contains the actual article) is a significant problem that many Web-based applications must address. Most approaches involve hand-crafting rules or scripts to extract the content, customized separately for each Web site. Besides requiring considerable time and effort to implement, hand-built extraction routines are brittle: they fail to extract content properly in some cases and break when the structure of a site's Web pages changes. In this work we treat content identification as a sequence labeling problem, a common problem structure in machine learning and natural language processing. Using a Conditional Random Field sequence labeling model, we correctly identify the content portion of Web pages between 80% and 97% of the time, depending on experimental factors such as whether duplicate documents are excluded and whether the model is applied to unseen sources.
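
To make the sequence-labeling framing concrete, the following is a minimal sketch, not the paper's implementation. It assumes a page has already been split into an ordered list of blocks, and it decodes the best label sequence for a linear-chain model with the Viterbi algorithm. The label names, the two features (text length and link count), the thresholds, and all weights are hypothetical and hand-set for illustration; a trained Conditional Random Field would learn its weights from labeled pages.

```python
from typing import Dict, List, Tuple

# Hypothetical label set: each page block is either article content or not.
LABELS = ("content", "other")

# Hypothetical, hand-set weights (assumption): a real CRF learns these.
EMISSION: Dict[str, Dict[str, float]] = {
    "content": {"long_text": 2.0, "few_links": 1.0,
                "short_text": -1.0, "many_links": -2.0},
    "other": {"long_text": -1.0, "few_links": -0.5,
              "short_text": 1.0, "many_links": 2.0},
}
TRANSITION: Dict[Tuple[str, str], float] = {
    ("content", "content"): 1.0,
    ("content", "other"): -0.5,
    ("other", "content"): -0.5,
    ("other", "other"): 1.0,
}


def block_features(text: str, link_count: int) -> List[str]:
    """Map one page block to illustrative binary features (thresholds assumed)."""
    words = len(text.split())
    return [
        "long_text" if words >= 20 else "short_text",
        "many_links" if link_count >= 3 else "few_links",
    ]


def emission_score(label: str, feats: List[str]) -> float:
    """Sum the weights of the active features for a label."""
    return sum(EMISSION[label].get(f, 0.0) for f in feats)


def label_independently(sequence: List[List[str]]) -> List[str]:
    """Baseline: label each block on its own, ignoring its neighbors."""
    return [max(LABELS, key=lambda lab: emission_score(lab, feats))
            for feats in sequence]


def viterbi(sequence: List[List[str]]) -> List[str]:
    """Return the highest-scoring label sequence under emission + transition scores."""
    if not sequence:
        return []
    scores = [{lab: emission_score(lab, sequence[0]) for lab in LABELS}]
    backpointers: List[Dict[str, str]] = []
    for feats in sequence[1:]:
        prev = scores[-1]
        cur: Dict[str, float] = {}
        ptr: Dict[str, str] = {}
        for lab in LABELS:
            best_prev = max(LABELS, key=lambda p: prev[p] + TRANSITION[(p, lab)])
            cur[lab] = (prev[best_prev] + TRANSITION[(best_prev, lab)]
                        + emission_score(lab, feats))
            ptr[lab] = best_prev
        scores.append(cur)
        backpointers.append(ptr)
    # Trace back from the best final label.
    path = [max(LABELS, key=lambda lab: scores[-1][lab])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# A made-up news page: (block text, number of links in the block).
PAGE = [
    ("Home | World | Sports | Weather", 4),
    (" ".join(["word"] * 40), 0),
    ("By Staff Reporter", 0),
    (" ".join(["word"] * 30), 1),
    ("Copyright | Terms | Privacy | Contact", 4),
]
SEQUENCE = [block_features(text, links) for text, links in PAGE]

print(label_independently(SEQUENCE))
print(viterbi(SEQUENCE))
```

With these hand-set weights, the short byline block scores slightly higher as "other" when judged alone, but the transition scores pull it into the surrounding content run. That is the kind of contextual effect a sequence model can capture and an independent per-block classifier cannot.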

Index Terms

Applied computing

Published In

WIDM '07: Proceedings of the 9th annual ACM international workshop on Web information and data management

November 2007

168 pages

ISBN: 9781595938299

DOI: 10.1145/1316902

Program Chairs:
Irini Fundulaki, University of Edinburgh, UK
Neoklis Polyzotis, University of California-Santa Cruz, USA

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CIKM07

Sponsor:

CIKM07: Conference on Information and Knowledge Management

November 9, 2007

Lisbon, Portugal
