Abstract
The Internet makes it possible to share and manipulate a vast quantity of information efficiently and effectively, but its rapid and chaotic growth has produced a poorly organized environment that hinders the sharing and mining of useful data. Meaningful web-page classification techniques are therefore urgently needed. This paper describes a novel approach to web-page classification based on a fuzzy representation of web pages. The approach uses a doublet representation that associates a weight with each of the most representative words of the web document so as to characterize that word's relevance in the document. This weight is derived by taking advantage of the characteristics of the HTML language. A fuzzy-rule-based classifier is then generated by a supervised learning process that uses a genetic algorithm to search for the minimum fuzzy-rule set that best covers the training examples. The proposed system has been demonstrated with two significantly different classes of web pages.
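To make the idea of a (word, weight) doublet concrete, the following is a minimal sketch and not the authors' method. It assumes a crude additive score in which each occurrence of a word is boosted by the strongest enclosing HTML tag. The boost values in TAG_BOOST, the class name WeightedWordParser, and the function doublets are all hypothetical choices made for illustration. The abstract does not state how the weights are actually computed. Stop-word removal and script or style filtering are omitted for brevity.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Hypothetical tag boosts chosen for illustration; not taken from the paper.
TAG_BOOST = {"title": 3.0, "h1": 2.0, "b": 1.5, "strong": 1.5, "em": 1.5}


class WeightedWordParser(HTMLParser):
    """Accumulates a score per word, boosted by the enclosing HTML tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []      # stack of currently open tags
        self.scores = Counter()  # word -> accumulated score

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, tolerating sloppy HTML.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        # Use the strongest boost among all enclosing tags (1.0 if none apply).
        boost = max((TAG_BOOST.get(t, 1.0) for t in self.open_tags), default=1.0)
        for word in re.findall(r"[a-z]+", data.lower()):
            self.scores[word] += boost


def doublets(html, top_k=5):
    """Return the top_k (word, weight) pairs, with weights scaled to (0, 1]."""
    parser = WeightedWordParser()
    parser.feed(html)
    parser.close()
    if not parser.scores:
        return []
    top = max(parser.scores.values())
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(parser.scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, score / top) for word, score in ranked[:top_k]]


if __name__ == "__main__":
    page = (
        "<html><head><title>Fuzzy</title></head>"
        "<body><p>fuzzy rules and <b>genetic</b> search</p></body></html>"
    )
    for word, weight in doublets(page, top_k=3):
        print(f"{word}: {weight:.3f}")
```

In this sketch a word that appears in the title and again in the body ends up with the highest weight, which mirrors the intuition that HTML markup signals relevance. A fuzzy-rule-based classifier of the kind the abstract describes would take such doublets as its input.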
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ribeiro, A., Fresno, V., Garcia-Alegre, M.C., Guinea, D. (2003). Web Page Classification: A Soft Computing Approach. In: Menasalvas, E., Segovia, J., Szczepaniak, P.S. (eds) Advances in Web Intelligence. AWIC 2003. Lecture Notes in Computer Science, vol 2663. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44831-4_12
DOI: https://doi.org/10.1007/3-540-44831-4_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40124-7
Online ISBN: 978-3-540-44831-0
eBook Packages: Springer Book Archive