Effectiveness of template detection on noise reduction and websites summarization

Published: 01 January 2013

Abstract

The World Wide Web is the most rapidly growing and accessible source of information. Its popularity owes much to the wide availability of the Internet in almost every modern home, and on the go since handheld devices became widespread. Yet Web pages carry additional template information (which we call noise) that adds no value to the actual content of the page. Even worse, it can harm the effectiveness of Web mining techniques, although such templates can be eliminated by preprocessing. Templates are one common type of noise on the Internet. In this paper, we introduce Noise Detector (ND), an effective approach for detecting and removing templates from Web pages. ND segments Web pages into semantically coherent blocks, then computes content and structure similarities between these blocks; a presentational noise measure is used as well. ND dynamically calculates a threshold for distinguishing noisy blocks. Provided that the investigated website has a single visible template, ND can detect the template with high accuracy using only two pages. ND can also be extended to detect multiple templates per website, where the challenge is to minimize the number of pages to be checked. Further, ND leads to website summarization. The experiments conducted show that ND outperforms existing approaches in space complexity, time complexity (see Section 4.6 for details on ND's processing time compared with that of other algorithms), minimum requirements to produce acceptable results, and accuracy of results.
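
The steps named in the abstract (segment pages into blocks, compare blocks across two pages by content and structure, then choose a threshold dynamically) can be illustrated with a small Python sketch. This is not the authors' ND implementation, whose details the abstract does not give. The `Block` type, the Jaccard word overlap used as content similarity, the exact-path test used as structure similarity, and the midpoint threshold are all assumptions chosen for illustration, and the presentational noise measure is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A page segment (hypothetical type, not taken from the paper)."""
    path: str  # assumed structural signature, e.g. a DOM tag path
    text: str  # visible text of the block


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def block_similarity(x: Block, y: Block) -> float:
    """Average of content similarity (word overlap) and structure similarity."""
    content = jaccard(set(x.text.lower().split()), set(y.text.lower().split()))
    structure = 1.0 if x.path == y.path else 0.0  # assumed: exact path match
    return (content + structure) / 2


def detect_template(page_a: list, page_b: list) -> list:
    """Return the blocks of page_a that look like template, judged against page_b."""
    if not page_a or not page_b:
        return []
    # Score each block of page_a by its best match among the blocks of page_b.
    scores = [max(block_similarity(a, b) for b in page_b) for a in page_a]
    # Assumed dynamic threshold: midpoint between the lowest and highest score.
    threshold = (min(scores) + max(scores)) / 2
    return [a for a, s in zip(page_a, scores) if s >= threshold]


# Two made-up pages from the same site: shared navigation and footer,
# different main content.
page_a = [
    Block("html/body/nav", "Home News Sports Contact"),
    Block("html/body/main", "Local team wins the championship after dramatic final"),
    Block("html/body/footer", "Copyright Example Corp all rights reserved"),
]
page_b = [
    Block("html/body/nav", "Home News Sports Contact"),
    Block("html/body/main", "Stock markets fall as investors react to new rates"),
    Block("html/body/footer", "Copyright Example Corp all rights reserved"),
]

template = detect_template(page_a, page_b)
print([b.path for b in template])  # ['html/body/nav', 'html/body/footer']
```

In this toy example the blocks left over after removing the detected template (here, the `main` block) are the page's own content, which hints at why template removal can serve as a first step toward summarization. The midpoint threshold is only one simple choice: when every block scores the same, it marks all of them as template.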

Publisher: Elsevier Science Inc., United States

Author Tags

1. Noise detection
2. Template detection
3. Web mining
4. Website summarization
