Keyword: template detection : Search

research-article

TeMex: The Web Template Extractor

WWW '15 Companion: Proceedings of the 24th International Conference on World Wide Web, Pages 155–158, https://doi.org/10.1145/2740908.2742835

This paper presents and describes TeMex, a site-level web template extractor. TeMex is fully automatic, and it can work with online webpages without any preprocessing stage (no information about the template or the associated webpages is needed) and, ...

research-article

Locality sensitive hashing for scalable structural classification and clustering of web documents

CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge Management, Pages 359–368, https://doi.org/10.1145/2505515.2505673

Web content management systems as well as web front ends to databases usually use mechanisms based on homogeneous templates for generating and populating HTML documents containing structured, semi-structured or plain text data. Wrapper based information ...

research-article

Cluster-based page segmentation-a fast and precise method for web page pre-processing

WIMS '13: Proceedings of the 3rd International Conference on Web Intelligence, Mining and Semantics, Article No.: 7, Pages 1–12, https://doi.org/10.1145/2479787.2479792

Segmenting a web page may be one of the initial steps of information retrieval or content classification performed on that page. While there has been extensive research in this area, the approaches usually focus either on performance or quality of the ...

poster

Content extraction using diverse feature sets

WWW '13 Companion: Proceedings of the 22nd International Conference on World Wide Web, Pages 89–90, https://doi.org/10.1145/2487788.2487828

The goal of content extraction or boilerplate detection is to separate the main content from navigation chrome, advertising blocks, copyright notices and the like in web pages. In this paper we explore a machine learning approach to content extraction ...

research-article

Boilerplate detection using shallow text features

WSDM '10: Proceedings of the third ACM international conference on Web search and data mining, Pages 441–450, https://doi.org/10.1145/1718487.1718542

In addition to the actual content Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision and thus needs to be detected properly. In ...

Article

A Fast Template-Based Approach to Automatically Identify Primary Text Content of a Web Page

KSE '09: Proceedings of the 2009 International Conference on Knowledge and Systems Engineering, Pages 232–236, https://doi.org/10.1109/KSE.2009.39

Search engines have become an indispensable tool for browsing information on the Internet. The user, however, is often annoyed by redundant results from irrelevant web pages. One reason is because search engines also look at non-informative blocks of ...

poster

A densitometric analysis of web template content

Christian Kohlschütter

WWW '09: Proceedings of the 18th international conference on World wide web, Pages 1165–1166, https://doi.org/10.1145/1526709.1526909

What makes template content in the Web so special that we need to remove it? In this paper I present a large-scale aggregate analysis of textual Web content, corroborating statistical laws from the field of Quantitative Linguistics. I analyze the ...

research-article

A densitometric approach to web page segmentation

CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management, Pages 1173–1182, https://doi.org/10.1145/1458082.1458237

Web Page segmentation is a crucial step for many applications in Information Retrieval, such as text classification, de-duplication and full-text search. In this paper we describe a new approach to segment HTML pages, building on methods from ...

Article

Joint optimization of wrapper generation and template detection

KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, Pages 894–902, https://doi.org/10.1145/1281192.1281287

Many websites have large collections of pages generated dynamically from an underlying structured source like a database. The data of a category are typically encoded into similar pages by a common script or template. In recent years, some value-added ...

Article

Page-level template detection via isotonic smoothing

WWW '07: Proceedings of the 16th international conference on World Wide Web, Pages 61–70, https://doi.org/10.1145/1242572.1242582

We develop a novel framework for the page-level template detection problem. Our framework is built on two main ideas. The first is the automatic generation of training data for a classifier that, given a page, assigns a templateness score to every DOM ...

Article

Template detection for large scale search engines

SAC '06: Proceedings of the 2006 ACM symposium on Applied computing, Pages 1094–1098, https://doi.org/10.1145/1141277.1141534

Templates in web sites hurt search engine retrieval performance, especially in content relevance and link analysis. Current template removal methods suffer from processing speed and scalability when dealing with large volume web pages. In this paper, we ...
