DOI: 10.1145/1242572.1242582

Page-level template detection via isotonic smoothing

Published: 08 May 2007

Abstract

We develop a novel framework for the page-level template detection problem. Our framework is built on two main ideas. The first is the automatic generation of training data for a classifier that, given a page, assigns a templateness score to every DOM node of the page. The second is the global smoothing of these per-node classifier scores by solving a regularized isotonic regression problem; the latter follows from a simple yet powerful abstraction of templateness on a page. Our extensive experiments on human-labeled test data show that our approach detects templates effectively.
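
To make the smoothing idea concrete, here is a minimal sketch of plain (unregularized) least-squares isotonic regression on a single chain of scores, using the classic pool-adjacent-violators procedure. It is not the regularized, tree-structured formulation the abstract refers to. The function name `isotonic_chain`, the choice of a non-decreasing constraint, and the reading of the input as per-node scores along one root-to-leaf DOM path are all assumptions made for illustration.

```python
from typing import List


def isotonic_chain(scores: List[float]) -> List[float]:
    """Least-squares non-decreasing fit via pool-adjacent-violators.

    `scores` is assumed (hypothetically) to hold raw classifier scores
    for the nodes on one root-to-leaf path, ordered from root to leaf.
    """
    sums: List[float] = []   # sum of raw scores in each pooled block
    counts: List[int] = []   # number of nodes in each pooled block

    for s in scores:
        sums.append(s)
        counts.append(1)
        # Pool the last two blocks while their means violate the ordering.
        while len(sums) > 1 and sums[-2] / counts[-2] > sums[-1] / counts[-1]:
            last_sum = sums.pop()
            last_count = counts.pop()
            sums[-1] += last_sum
            counts[-1] += last_count

    # Every node in a block receives the block mean.
    smoothed: List[float] = []
    for total, count in zip(sums, counts):
        smoothed.extend([total / count] * count)
    return smoothed


if __name__ == "__main__":
    raw = [0.25, 0.75, 0.5, 1.0]
    print(isotonic_chain(raw))
```

In this sketch the out-of-order pair (0.75, 0.5) is pooled into its mean, while scores that already respect the ordering are left unchanged. A regularized variant on a full DOM tree would need a different objective and algorithm.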

    Published In

    WWW '07: Proceedings of the 16th international conference on World Wide Web
    May 2007
    1382 pages
    ISBN:9781595936547
    DOI:10.1145/1242572

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. isotonic regression
    2. template detection
    3. webpage sectioning
    4. webpage segmentation

    Qualifiers

    • Article

    Conference

    WWW'07
    WWW'07: 16th International World Wide Web Conference
    May 8 - 12, 2007
    Banff, Alberta, Canada

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

