Abstract
The exponential growth of the web has led to the necessity to put some order to its content. The automatic classification of web documents into predefined classes, that is hypertext categorization, came to elevate humans from that task. The extra information available in a hypertext document poses new challenges for automatic categorization. HTML tags and linked neighbourhood all provide rich information for hypertext categorization that is not available in traditional text classification. This paper looks at (i) which extra information hidden in HTML tags and linked neighbourhood pages to take into consideration to improve the classification task, and (ii) how to deal with the high level of noise in linked pages. A hypertext dataset and four well-known learning algorithms (Naive Bayes, K-Nearest Neighbour, Support Vector Machine and C4.5) were used to exploit the enriched text representation. The results showed that the clever use of the information in linked neighbourhood and HTML tags improved the accuracy of the classification algorithms.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
K. Bharat and A. Broader. A technique for measuring the relative size and overlap of public web search engines. In Proc. Of the 7th World Wide Web Conference (WWW7), 1998.
H. Chen and S. Dumais. Bringing order to the web: automatically categorizing search results. In proceedings of CHI-00, ACM International Conference on Human Factors in Computing Systems, p 145–152, Den Haag, NL, 2000. ACM Press, New York, US.
http://www.yahoo.com
S. Chakrabarti, B.Dom, R. Agrawal and P. Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases. In VLDB, Athens, Greece, Aug. 1997.
H. Benbrahim and M. Bramer. Impact on performance of hypertext classification by selective rich html capture. IFIP World Computer Congress, Toulouse, France, Aug 2004 (to appear).
H. Oh, S. Myaeng, and M. Lee. A practical hypertext categorization method using links and incrementally available class information. In Proceedings of the Twenty Third ACM SIGIR Conference, Athens, Greece, July 2000.
T. Joachims, N. Cristianini, and J. Shawe-Taylor. Composite kernels for hypertext categorization. In International Conference on Machine Learning (ICML’01), San Francisco, CA, 2001, Morgan Kaufmann.
Y. Yang, S. Slattery and R. Ghani. A study of approaches to hypertext categorization. Journal of Intelligent Information Systems (Special Issue on Automatic Text Categorization) 18(2–3) 2002, pp. 219–241.
C. Apte, F. Damereau, and S. Weiss. Automated learning of decision rules for text categorization. ACM trans.Information Systems, Vol.12, No.3, July 1994, pp. 233–251.
A. Bensaid and N. Tazi. Text categorization with semi-supervised agglomerative hierarchical clustering. International Journal of Intelligent Systems, 1999.
S. Chakrabati, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. Proceedings ACM SIGMOD International Conference on Management of Data, pages 307–318, Seattle, Washington, June 1998. ACM Press.
T. Joachims. Text categorization with Support Vector Machines: Learning with many relevant features. Proceedings of ECML-98, 10* European Conference on Machine Learning, pages 137–142.
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/proiect/theo-20/www/data/
http://www.pedal.rdg.ac.uk/banksearchdataset/
Y. Yang. An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, 1999.
D. Lewis. Feature selection and feature extraction for text categorization. Proceedings of Speech and Natural Language Workshop, 1992, pp. 212–217.
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2005 Springer-Verlag London Limited
About this paper
Cite this paper
Benbrahim, H., Bramer, M. (2005). Neighbourhood Exploitation in Hypertext Categorization. In: Bramer, M., Coenen, F., Allen, T. (eds) Research and Development in Intelligent Systems XXI. SGAI 2004. Springer, London. https://doi.org/10.1007/1-84628-102-4_19
Download citation
DOI: https://doi.org/10.1007/1-84628-102-4_19
Publisher Name: Springer, London
Print ISBN: 978-1-85233-907-4
Online ISBN: 978-1-84628-102-0
eBook Packages: Computer ScienceComputer Science (R0)