Article

Template detection via data mining and its applications

Authors: Ziv Bar-Yossef, Sridhar Rajagopalan

WWW '02: Proceedings of the 11th international conference on World Wide Web

Pages 580 - 591

https://doi.org/10.1145/511446.511522

Published: 07 May 2002

Abstract

We formulate and propose the template detection problem, and suggest a practical solution for it based on counting frequent item sets. We show that the use of templates is pervasive on the web. We describe three principles, which characterize the assumptions made by hypertext information retrieval (IR) and data mining (DM) systems, and show that templates are a major source of violation of these principles. As a consequence, basic "pure" implementations of simple search algorithms coupled with template detection and elimination show surprising increases in precision at all levels of recall.
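
To make the phrase "counting frequent item sets" concrete, here is a minimal sketch of generic level-wise (Apriori-style) frequent item set counting. It is not the authors' algorithm, which the abstract does not spell out. The function name `frequent_item_sets`, the `min_support` and `max_size` parameters, and the toy `pages` data (each page reduced to its set of outgoing link targets) are assumptions made purely for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_item_sets(transactions, min_support, max_size=2):
    """Return {item_set: count} for sets found in >= min_support transactions.

    Level-wise counting: a set of size k is counted only if every one of
    its (k-1)-subsets was frequent at the previous level.
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = Counter(frozenset([item]) for t in transactions for item in t)
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    for size in range(2, max_size + 1):
        prev = set(frequent)
        frequent_items = set().union(*prev) if prev else set()
        counts = Counter()
        for t in transactions:
            # Only items that are still frequent can form larger frequent sets.
            kept = sorted(t & frequent_items)
            for combo in combinations(kept, size):
                candidate = frozenset(combo)
                if all(candidate - {x} in prev for x in candidate):
                    counts[candidate] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        if not frequent:
            break
        result.update(frequent)
    return result


# Hypothetical toy data: each page is represented by its outgoing link targets.
pages = {
    "a.example/1": {"/home", "/about", "/contact", "/story-1"},
    "a.example/2": {"/home", "/about", "/contact", "/story-2"},
    "a.example/3": {"/home", "/about", "/contact", "/story-3"},
    "b.example/1": {"/home", "/faq"},
}

result = frequent_item_sets(pages.values(), min_support=3, max_size=3)
for item_set, count in sorted(result.items(), key=lambda kv: (-len(kv[0]), sorted(kv[0]))):
    print(sorted(item_set), count)
```

In this toy setting, link sets shared by many pages (navigation links, for instance) surface as frequent item sets, while page-specific links do not. How such sets are turned into an actual template decision is left to the paper itself.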

Index Terms

Template detection via data mining and its applications
1. Information systems
  1. Information retrieval

Published In

WWW '02: Proceedings of the 11th international conference on World Wide Web

May 2002

754 pages

ISBN:1581134495

DOI:10.1145/511446

Conference Chairs: David Lassner (University of Hawaii), Dave De Roure (University of Southampton), Arun Iyengar (IBM T.J. Watson Research Center)

Copyright © 2002 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

WWW02: Hypermedia Track of the Eleventh International World-Wide Web Conference

Sponsor: ACM

May 7 - 11, 2002

Honolulu, Hawaii, USA

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Bibliometrics

Article Metrics (reflects downloads up to 10 Oct 2024)

Total Citations: 143
Total Downloads: 1,548
Downloads (last 12 months): 6
Downloads (last 6 weeks): 0
