Abstract
One challenge for web search engines is to assess the quality of web pages independently of any particular user query. High-quality web pages give readers good entry points to the information they need in a more concentrated form. This paper focuses on topic-independent selection of high-quality web pages, with the aim of reducing redundancy and removing noise from web information. Several non-content features are studied, along with their effects on high-quality page selection. K-means clustering over these features is then used to separate high-quality pages from ordinary ones. Experiments were carried out on the 19 GB (document size) TREC web data set (.GOV data). With the proposed approach, fewer than 50% of the web pages are selected as high-quality, yet they cover about 90% of the key information in the whole set. Information retrieval on this high-quality page set achieves an improvement of more than 40% compared with retrieval on the whole collection.
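The clustering step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the two features (in-link count and URL depth), the page names, the min-max scaling, the farthest-first seeding, and the rule that the cluster with the larger mean in-link count is the high-quality one are all assumptions made for illustration. The abstract does not list the actual features or the initialization used.

```python
import math

# Hypothetical pages with two assumed non-content features:
# (in-link count, URL depth). Neither the names nor the values come from the paper.
PAGES = {
    "page_a": (120, 1),
    "page_b": (95, 2),
    "page_c": (3, 5),
    "page_d": (1, 6),
    "page_e": (110, 1),
    "page_f": (2, 4),
}


def min_max_scale(rows):
    """Scale each feature column to [0, 1] so no single feature dominates."""
    scaled_columns = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = hi - lo
        scaled_columns.append([(v - lo) / span if span else 0.0 for v in col])
    return [tuple(row) for row in zip(*scaled_columns)]


def farthest_first_init(points, k):
    """Deterministic seeding (an assumption, not the paper's method): start at
    points[0], then keep adding the point farthest from the chosen centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iterations=100):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = farthest_first_init(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


def select_high_quality(pages):
    """Cluster pages into two groups and return the names in the assumed
    high-quality group, sorted alphabetically."""
    names = list(pages)
    points = min_max_scale([pages[n] for n in names])
    centroids, labels = kmeans(points, k=2)
    # Assumption: the cluster whose centroid has the larger scaled in-link
    # count is treated as the high-quality cluster.
    best = max(range(2), key=lambda c: centroids[c][0])
    return sorted(n for n, lab in zip(names, labels) if lab == best)
```

Calling `select_high_quality(PAGES)` on this toy data returns the three pages with many in-links and shallow URLs. Scaling the features first matters because raw in-link counts would otherwise swamp URL depth in the Euclidean distance.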
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, C., Liu, Y., Zhang, M., Ma, S. (2005). Topic-Independent Web High-Quality Page Selection Based on K-Means Clustering. In: Lee, G.G., Yamada, A., Meng, H., Myaeng, S.H. (eds) Information Retrieval Technology. AIRS 2005. Lecture Notes in Computer Science, vol 3689. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562382_43
DOI: https://doi.org/10.1007/11562382_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29186-2
Online ISBN: 978-3-540-32001-2
eBook Packages: Computer Science (R0)