
Automatic Identification of Informative Sections of Web Pages

Authors: Sandip Debnath, Prasenjit Mitra, Nirmal Pal, C. Lee Giles

IEEE Transactions on Knowledge and Data Engineering, Volume 17, Issue 9

Pages 1233 - 1246

https://doi.org/10.1109/TKDE.2005.138

Published: 01 September 2005


Abstract

Web pages, especially dynamically generated ones, contain several items that cannot be classified as the "primary content," e.g., navigation sidebars, advertisements, and copyright notices. Most clients and end-users search for the primary content and largely do not seek the noninformative content. A tool that helps an end-user or application search and process information from Web pages automatically must separate the "primary content" sections from the other content sections. We call these sections "Web page blocks" or just "blocks." First, a tool must segment the Web pages into Web page blocks and, second, the tool must separate the primary content blocks from the noninformative content blocks. In this paper, we formally define Web page blocks and devise a new algorithm to partition an HTML page into constituent Web page blocks. We then propose four new algorithms: ContentExtractor, FeatureExtractor, K-FeatureExtractor, and L-Extractor. These algorithms identify primary content blocks by 1) looking for blocks that do not occur a large number of times across Web pages, 2) looking for blocks with desired features, and 3) using classifiers trained with block features, respectively. When run on several thousand Web pages obtained from various Web sites, our algorithms outperform several existing algorithms with respect to runtime and/or accuracy. Furthermore, we show that a Web cache system that applies our algorithms to remove noninformative content blocks and to identify similar blocks across Web pages can achieve significant storage savings.
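
The first idea in the abstract, keeping blocks that do not recur across many pages, can be illustrated with a minimal sketch. This is not the paper's ContentExtractor: the abstract does not give the partitioning rules or the block-comparison measure, so the set of delimiter tags (`BLOCK_TAGS`), the exact-text matching, the `max_fraction` threshold, and the function names below are all assumptions made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: these tags delimit blocks. The paper's own partitioning
# rules are not stated in the abstract.
BLOCK_TAGS = {"table", "tr", "td", "div", "p", "ul", "hr"}


class BlockSplitter(HTMLParser):
    """Collects the text between consecutive delimiter tags as one block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []

    def _flush(self):
        # Normalize whitespace and store the buffered text, if any.
        text = " ".join(" ".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def split_blocks(html):
    """Partition one HTML page into a list of text blocks."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return parser.blocks


def primary_blocks(pages, max_fraction=0.5):
    """Keep blocks whose exact text appears on at most max_fraction of pages."""
    per_page = [split_blocks(p) for p in pages]
    # Count each block once per page (document frequency).
    doc_freq = Counter(b for blocks in per_page for b in set(blocks))
    limit = max_fraction * len(pages)
    return [[b for b in blocks if doc_freq[b] <= limit] for blocks in per_page]
```

Given three pages from one site that share a navigation bar and a copyright notice, `primary_blocks` would drop the shared blocks and keep the page-specific text. Exact string matching is the simplest possible comparison; blocks that differ only slightly between pages (for example, a footer with a changing date) would need a similarity measure instead.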




Published In

IEEE Transactions on Knowledge and Data Engineering, Volume 17, Issue 9

September 2005

143 pages

ISSN: 1041-4347

Publisher

IEEE Educational Activities Department

United States

Qualifiers

Research-article


Cited By

Manabe T, Tajima K (2015). Extracting logical hierarchical structure of HTML documents based on headings. Proceedings of the VLDB Endowment 8:12 (1606-1617). doi:10.14778/2824032.2824058. Online publication date: 1-Aug-2015
https://dl.acm.org/doi/10.14778/2824032.2824058
Lundgren E, Papapetrou P, Asker L, Makedon F, Mariottini G, Korn O, Maglogiannis I, Metsis V (2015). Extracting news text from web pages. Proceedings of the 8th ACM International Conference on PErvasive Technologies Related to Assistive Environments (1-4). doi:10.1145/2769493.2769573. Online publication date: 1-Jul-2015
https://dl.acm.org/doi/10.1145/2769493.2769573
Zhou N, Fan J (2015). Automatic image-text alignment for large-scale web image indexing and retrieval. Pattern Recognition 48:1 (205-219). doi:10.1016/j.patcog.2014.07.001. Online publication date: 1-Jan-2015
https://dl.acm.org/doi/10.1016/j.patcog.2014.07.001
Cuzzola J, Jovanović J, Bagheri E, Gašević D (2015). Automated classification and localization of daily deal content from the Web. Applied Soft Computing 31:C (241-256). doi:10.1016/j.asoc.2015.02.029. Online publication date: 1-Jun-2015
https://dl.acm.org/doi/10.1016/j.asoc.2015.02.029
Soska K, Christin N, Fu K (2014). Automatically detecting vulnerable websites before they turn malicious. Proceedings of the 23rd USENIX conference on Security Symposium (625-640). doi:10.5555/2671225.2671265. Online publication date: 20-Aug-2014
https://dl.acm.org/doi/10.5555/2671225.2671265
Bing L, Guo R, Lam W, Niu Z, Wang H, Geva S, Trotman A, Bruza P, Clarke C, Järvelin K (2014). Web page segmentation with structured prediction and its application in web page classification. Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval (767-776). doi:10.1145/2600428.2609630. Online publication date: 3-Jul-2014
https://dl.acm.org/doi/10.1145/2600428.2609630
Zhang X, Zhang Y, He J, Cobia F (2013). Vision-Based Web Page Block Segmentation and Informative Block Detection. Proceedings of the 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT) - Volume 03 (265-269). doi:10.1109/WI-IAT.2013.194. Online publication date: 17-Nov-2013
https://dl.acm.org/doi/10.1109/WI-IAT.2013.194
Uzun E, Agun H, Yerlikaya T (2013). A hybrid approach for extracting informative content from web pages. Information Processing and Management: an International Journal 49:4 (928-944). doi:10.1016/j.ipm.2013.02.005. Online publication date: 1-Jul-2013
https://dl.acm.org/doi/10.1016/j.ipm.2013.02.005
Pappas N, Katsimpras G, Stamatatos E, Lindstaedt S, Granitzer M (2012). Extracting informative textual parts from web pages containing user-generated content. Proceedings of the 12th International Conference on Knowledge Management and Knowledge Technologies (1-8). doi:10.1145/2362456.2362462. Online publication date: 5-Sep-2012
https://dl.acm.org/doi/10.1145/2362456.2362462
Ly P, Pedrinaci C, Domingue J (2012). Automated information extraction from web APIs documentation. Proceedings of the 13th international conference on Web Information Systems Engineering (497-511). doi:10.1007/978-3-642-35063-4_36. Online publication date: 28-Nov-2012
https://dl.acm.org/doi/10.1007/978-3-642-35063-4_36
