poster

A statistical approach to URL-based web page clustering

Published: 16 April 2012

    Abstract

    Most web page classifiers rely on features extracted from page content, so a page must be downloaded before it can be classified. We propose a technique that clusters web pages using only their URLs. In contrast to other proposals, we analyze features that lie outside the page, so we do not need to download a page to classify it. The technique is also unsupervised and requires little intervention from the user. Furthermore, building a classifier for a site does not require crawling it extensively; a small subset of its pages suffices. We evaluated our classifier in an experiment over 21 highly visited websites and obtained good precision and recall.
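The abstract does not describe the algorithm itself, so the following is only a rough illustration of what clustering by URL alone can look like, not the authors' statistical method. It assumes a deliberately naive rule: numeric path segments are masked with a placeholder and query values are dropped, and URLs that reduce to the same pattern land in the same cluster. The names `url_pattern` and `cluster_by_url` and the `<num>` placeholder are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def url_pattern(url: str) -> str:
    """Reduce a URL to a coarse pattern (assumed heuristic, not from the paper)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # Mask purely numeric segments, which usually identify a record.
    masked = ["<num>" if s.isdigit() else s for s in segments]
    # Keep only the names of query parameters, not their values.
    params = sorted(p.split("=", 1)[0] for p in parts.query.split("&") if p)
    pattern = parts.netloc + "/" + "/".join(masked)
    if params:
        pattern += "?" + "&".join(params)
    return pattern


def cluster_by_url(urls):
    """Group URLs by pattern without downloading any page."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_pattern(url)].append(url)
    return dict(clusters)
```

Under this heuristic, `http://example.com/product/123` and `http://example.com/product/456` fall into one cluster, while `http://example.com/author/7?tab=bio` forms another. A statistical approach, as the title suggests, would presumably learn which URL tokens are variable from a sample of a site's links rather than hard-coding a rule such as "digits are variable".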

    Published In

    WWW '12 Companion: Proceedings of the 21st International Conference on World Wide Web
    April 2012
    1250 pages
    ISBN: 9781450312301
    DOI: 10.1145/2187980

    Sponsors

    • Univ. de Lyon: Universite de Lyon

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. URL classification
    2. URL patterns
    3. web page clustering

    Qualifiers

    • Poster

    Conference

    WWW 2012
    Sponsor:
    • Univ. de Lyon
    WWW 2012: 21st World Wide Web Conference 2012
    April 16 - 20, 2012
    Lyon, France

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
