
Effectiveness of template detection on noise reduction and websites summarization

Authors:

Reda Alhajj

Information Sciences—Informatics and Computer Science, Intelligent Systems, Applications: An International Journal, Volume 219

Pages 41 - 72

https://doi.org/10.1016/j.ins.2012.07.022

Published: 01 January 2013

Abstract

The World Wide Web is the most rapidly growing and accessible source of information. Its popularity owes much to the wide availability of the Internet in almost every modern home and, with the spread of handheld devices, on the go. Yet pages on the Web carry additional template information (which we call noise) that adds no value to the actual content of the page. Worse, this noise can harm the effectiveness of Web mining techniques, although it can be eliminated by preprocessing. Templates are one common type of noise on the Internet. In this paper, we introduce Noise Detector (ND) as an effective approach for detecting and removing templates from Web pages. ND segments Web pages into semantically coherent blocks. It then computes content and structure similarities between these blocks, and also applies a presentational noise measure. ND dynamically calculates a threshold for identifying noisy blocks. Provided that the investigated website has a single visible template, ND can detect the template with high accuracy using only two pages. ND can also be extended to detect multiple templates per website; the challenge in that case is to minimize the number of pages that must be checked. Further, ND leads to website summarization. The conducted experiments show that ND outperforms existing approaches in space complexity, time complexity (see Section 4.6 for more details on ND's processing time compared with that of other algorithms), the minimum requirements for producing acceptable results, and the accuracy of the results.
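The pipeline outlined in the abstract (compare blocks across two pages, then flag those above a dynamically computed threshold) can be illustrated with a minimal sketch. This is not ND itself: the abstract does not specify the segmentation, the similarity measures, or the threshold formula. The sketch assumes pages are already segmented into text blocks, uses Jaccard similarity over word tokens as a stand-in for content similarity, and uses the mean best-match score as a stand-in threshold. Structure similarity and the presentational noise measure are omitted, and all function names are hypothetical.

```python
import re
from statistics import mean


def tokens(block: str) -> set[str]:
    """Lowercased word tokens of a block's text."""
    return set(re.findall(r"[a-z0-9]+", block.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def detect_template_blocks(page_a: list[str], page_b: list[str]) -> list[str]:
    """Return the blocks of page_a that look like template (noise) blocks.

    Assumption: a block is noisy when its best content match in page_b
    scores above the mean best-match score over all blocks of page_a.
    The mean is only a placeholder for ND's dynamic threshold.
    """
    tokens_b = [tokens(block) for block in page_b]
    best = [
        max((jaccard(tokens(block), tb) for tb in tokens_b), default=0.0)
        for block in page_a
    ]
    if not best:
        return []
    threshold = mean(best)
    return [block for block, score in zip(page_a, best) if score > threshold]


# Two hypothetical pages from the same site, already segmented into blocks.
page_a = [
    "Home | Products | Contact",
    "Template detection improves web mining accuracy",
    "Copyright 2012 Example Corp",
]
page_b = [
    "Home | Products | Contact",
    "Fresh review of handheld devices this week",
    "Copyright 2012 Example Corp",
]

print(detect_template_blocks(page_a, page_b))
# → ['Home | Products | Contact', 'Copyright 2012 Example Corp']
```

With a mean-based threshold, a page whose blocks all score identically yields no flagged blocks, which is one reason a real detector would need a more careful threshold than this placeholder.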






Copyright © 2012 Elsevier Inc.

Publisher: Elsevier Science Inc., United States
