Page-level template detection via isotonic smoothing

DOI: 10.1145/1242572.1242582
Published: 08 May 2007

Abstract

    We develop a novel framework for the page-level template detection problem. Our framework is built on two main ideas. The first is the automatic generation of training data for a classifier that, given a page, assigns a templateness score to every DOM node of the page. The second is the global smoothing of these per-node classifier scores by solving a regularized isotonic regression problem; the latter follows from a simple yet powerful abstraction of templateness on a page. Our extensive experiments on human-labeled test data show that our approach detects templates effectively.
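The smoothing step described in the abstract can be illustrated with a small sketch. It is not the authors' algorithm: the abstract does not state the order constraint, the norm, or the regularizer, so the code below assumes that a child's smoothed score must be at least its parent's (if a node is a template, so is everything under it), uses absolute (L1) error, omits regularization entirely, and restricts smoothed values to the scores actually observed. The function name `smooth_templateness` and the dictionary-based tree representation are hypothetical.

```python
from typing import Dict, List


def smooth_templateness(
    children: Dict[int, List[int]],
    scores: Dict[int, float],
    root: int,
) -> Dict[int, float]:
    """Unregularized L1 isotonic regression on a tree (illustrative only).

    Assumed constraint: smoothed[child] >= smoothed[parent].
    Smoothed values are chosen from the set of observed scores.
    """
    levels = sorted(set(scores.values()))
    m = len(levels)

    # Pre-order traversal: every parent appears before its children.
    order: List[int] = []
    stack = [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))

    # best[v][k]: minimum cost of v's subtree when v takes a level index >= k.
    # arg[v][k]:  the level index that achieves best[v][k].
    best: Dict[int, List[float]] = {}
    arg: Dict[int, List[int]] = {}
    for v in reversed(order):  # children are processed before parents
        kids = children.get(v, [])
        cost = [
            abs(scores[v] - levels[k]) + sum(best[c][k] for c in kids)
            for k in range(m)
        ]
        b, a = [0.0] * m, [0] * m
        running, running_arg = float("inf"), m - 1
        for k in range(m - 1, -1, -1):  # suffix minimum over level indices
            if cost[k] < running:
                running, running_arg = cost[k], k
            b[k], a[k] = running, running_arg
        best[v], arg[v] = b, a

    # Top-down pass: each node picks its best level at or above its parent's.
    smoothed: Dict[int, float] = {}
    lower_bound = {root: 0}
    for v in order:
        k = arg[v][lower_bound[v]]
        smoothed[v] = levels[k]
        for c in children.get(v, []):
            lower_bound[c] = k
    return smoothed


# Hypothetical DOM tree: node 0 is the root, node 1 has three children.
example_children = {0: [1, 2], 1: [3, 4, 5]}
example_scores = {0: 0.0, 1: 1.0, 2: 0.5, 3: 0.0, 4: 0.0, 5: 1.0}
example_smoothed = smooth_templateness(example_children, example_scores, 0)
```

In this example node 1 has a raw score of 1.0, but two of its three children score 0.0, so under the assumed constraint the cheapest consistent labeling lowers node 1 to 0.0 and leaves every other node unchanged.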

      Published In

      WWW '07: Proceedings of the 16th international conference on World Wide Web
      May 2007
      1382 pages
      ISBN:9781595936547
      DOI:10.1145/1242572
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. isotonic regression
      2. template detection
      3. webpage sectioning
      4. webpage segmentation

      Qualifiers

      • Article

      Conference

      WWW'07
      WWW'07: 16th International World Wide Web Conference
      May 8 - 12, 2007
      Banff, Alberta, Canada

      Acceptance Rates

      Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
