- research-article, May 2015
TeMex: The Web Template Extractor
WWW '15 Companion: Proceedings of the 24th International Conference on World Wide Web, Pages 155–158, https://doi.org/10.1145/2740908.2742835
This paper presents and describes TeMex, a site-level web template extractor. TeMex is fully automatic, and it can work with online webpages without any preprocessing stage (no information about the template or the associated webpages is needed) and, ...
- research-article, October 2013
Locality sensitive hashing for scalable structural classification and clustering of web documents
CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge Management, Pages 359–368, https://doi.org/10.1145/2505515.2505673
Web content management systems as well as web front ends to databases usually use mechanisms based on homogeneous templates for generating and populating HTML documents containing structured, semi-structured or plain text data. Wrapper based information ...
- research-article, June 2013
Cluster-based page segmentation - a fast and precise method for web page pre-processing
WIMS '13: Proceedings of the 3rd International Conference on Web Intelligence, Mining and Semantics, Article No.: 7, Pages 1–12, https://doi.org/10.1145/2479787.2479792
Segmenting a web page may be one of initial steps of information retrieval or content classification performed on that page. While there has been an extensive research in this area, the approaches usually focus either on performance or quality of the ...
- poster, May 2013
Content extraction using diverse feature sets
WWW '13 Companion: Proceedings of the 22nd International Conference on World Wide Web, Pages 89–90, https://doi.org/10.1145/2487788.2487828
The goal of content extraction or boilerplate detection is to separate the main content from navigation chrome, advertising blocks, copyright notices and the like in web pages. In this paper we explore a machine learning approach to content extraction ...
- research-article, February 2010
Boilerplate detection using shallow text features
WSDM '10: Proceedings of the third ACM international conference on Web search and data mining, Pages 441–450, https://doi.org/10.1145/1718487.1718542
In addition to the actual content Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision and thus needs to be detected properly. In ...
- Article, October 2009
A Fast Template-Based Approach to Automatically Identify Primary Text Content of a Web Page
KSE '09: Proceedings of the 2009 International Conference on Knowledge and Systems Engineering, Pages 232–236, https://doi.org/10.1109/KSE.2009.39
Search engines have become an indispensable tool for browsing information on the Internet. The user, however, is often annoyed by redundant results from irrelevant web pages. One reason is because search engines also look at non-informative blocks of ...
- poster, April 2009
A densitometric analysis of web template content
WWW '09: Proceedings of the 18th international conference on World wide web, Pages 1165–1166, https://doi.org/10.1145/1526709.1526909
What makes template content in the Web so special that we need to remove it? In this paper I present a large-scale aggregate analysis of textual Web content, corroborating statistical laws from the field of Quantitative Linguistics. I analyze the ...
- research-article, October 2008
A densitometric approach to web page segmentation
CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management, Pages 1173–1182, https://doi.org/10.1145/1458082.1458237
Web Page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication and full-text search. In this paper we describe a new approach to segment HTML pages, building on methods from ...
- Article, August 2007
Joint optimization of wrapper generation and template detection
KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, Pages 894–902, https://doi.org/10.1145/1281192.1281287
Many websites have large collections of pages generated dynamically from an underlying structured source like a database. The data of a category are typically encoded into similar pages by a common script or template. In recent years, some value-added ...
- Article, May 2007
Page-level template detection via isotonic smoothing
WWW '07: Proceedings of the 16th international conference on World Wide Web, Pages 61–70, https://doi.org/10.1145/1242572.1242582
We develop a novel framework for the page-level template detection problem. Our framework is built on two main ideas. The first is the automatic generation of training data for a classifier that, given a page, assigns a templateness score to every DOM ...
- Article, April 2006
Template detection for large scale search engines
SAC '06: Proceedings of the 2006 ACM symposium on Applied computing, Pages 1094–1098, https://doi.org/10.1145/1141277.1141534
Templates in web sites hurt search engine retrieval performance, especially in content relevance and link analysis. Current template removal methods suffer from processing speed and scalability when dealing with large volume web pages. In this paper, we ...