DOI: 10.1145/511446.511522
Article

Template detection via data mining and its applications

Published: 07 May 2002

Abstract

We formulate and propose the template detection problem, and suggest a practical solution for it based on counting frequent item sets. We show that the use of templates is pervasive on the web. We describe three principles, which characterize the assumptions made by hypertext information retrieval (IR) and data mining (DM) systems, and show that templates are a major source of violation of these principles. As a consequence, basic "pure" implementations of simple search algorithms coupled with template detection and elimination show surprising increases in precision at all levels of recall.
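
As a rough illustration of what "counting frequent item sets" could look like for template detection, the sketch below treats each page as a transaction whose items are its outgoing links, and flags links that belong to a link set shared by many pages. The abstract does not specify the item representation, the support threshold, or the counting strategy, so all of these are assumptions: the function names, the sample pages, and the threshold are hypothetical, and the counting is brute force over small sets rather than the authors' method.

```python
from collections import Counter
from itertools import combinations


def frequent_link_sets(pages, min_support, max_size=2):
    """Count link sets of size 1..max_size and keep those found on
    at least min_support pages.

    pages maps a page name to the list of links found on that page
    (an assumed representation, not taken from the paper).
    """
    counts = Counter()
    for links in pages.values():
        unique = sorted(set(links))  # each link counts once per page
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[frozenset(combo)] += 1
    return {items: n for items, n in counts.items() if n >= min_support}


def template_links(pages, min_support):
    """Flag links that co-occur with another link on many pages.

    Requiring a set of size >= 2 is an illustrative heuristic: a link
    that is merely popular on its own is not treated as template.
    """
    flagged = set()
    for items in frequent_link_sets(pages, min_support):
        if len(items) >= 2:
            flagged |= items
    return flagged


# Hypothetical site: three pages share a navigation bar.
pages = {
    "a.html": ["/home", "/about", "/contact", "/story-1"],
    "b.html": ["/home", "/about", "/contact", "/story-2"],
    "c.html": ["/home", "/about", "/contact", "/story-3"],
    "d.html": ["/story-1", "/external"],
}

print(sorted(template_links(pages, min_support=3)))
# prints ['/about', '/contact', '/home']
```

Enumerating every combination grows quickly with page size, which is why a practical system would prune candidates (for example, Apriori-style) instead of counting exhaustively as this sketch does.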




    Published In

    WWW '02: Proceedings of the 11th international conference on World Wide Web
    May 2002
    754 pages
    ISBN:1581134495
    DOI:10.1145/511446
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data mining
    2. hypertext
    3. information retrieval
    4. web searching

    Qualifiers

    • Article

    Conference

    WWW02
    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

    Cited By

    • (2023) Scraping Relevant Images from Web Pages without Download. ACM Transactions on the Web 18:1 (1-27). DOI: 10.1145/3616849. Online publication date: 11-Oct-2023.
    • (2023) Web Page Segmentation: A DOM-Structural Cohesion Analysis Approach. Web Information Systems Engineering – WISE 2023 (319-333). DOI: 10.1007/978-981-99-7254-8_25. Online publication date: 21-Oct-2023.
    • (2022) Clustering of Template-Generated Webpages Using DOM Tree Paths of URLs. International Journal of Software Innovation 10:1 (1-24). DOI: 10.4018/IJSI.297994. Online publication date: 6-May-2022.
    • (2022) Darwin's Theory of Censorship. Proceedings of the 21st Workshop on Privacy in the Electronic Society (103-108). DOI: 10.1145/3559613.3563206. Online publication date: 7-Nov-2022.
    • (2022) HybEx: A Hybrid Tool for Template Extraction. Companion Proceedings of the Web Conference 2022 (205-209). DOI: 10.1145/3487553.3524242. Online publication date: 25-Apr-2022.
    • (2022) Web Corpus Construction. Online publication date: 2-Apr-2022.
    • (2021) Page-Level Main Content Extraction From Heterogeneous Webpages. ACM Transactions on Knowledge Discovery from Data 15:6 (1-105). DOI: 10.1145/3451168. Online publication date: 28-Jun-2021.
    • (2020) Boilerplate Removal using a Neural Sequence Labeling Model. Companion Proceedings of the Web Conference 2020 (226-229). DOI: 10.1145/3366424.3383547. Online publication date: 20-Apr-2020.
    • (2020) A Novel Web Scraping Approach Using the Additional Information Obtained From Web Pages. IEEE Access 8 (61726-61740). DOI: 10.1109/ACCESS.2020.2984503. Online publication date: 2020.
    • (2019) QoS3. Security and Communication Networks 2019. DOI: 10.1155/2019/3107543. Online publication date: 1-Jan-2019.
