Abstract
This paper introduces a classification system that exploits the content information as well as citation structure for scientific paper classification. The system first applies a content-based statistical classification method which is similar to general text classification. We investigate several classification methods including K-nearest neighbours, nearest centroid, naive Bayes and decision trees. Among those methods, the K-nearest neighbours is found to outperform others while the rest perform comparably. Using phrases in addition to words and a good feature selection strategy such as information gain can improve system accuracy and reduce training time in comparison with using words only. To combine citation links for classification, the system proposes an iterative method to update the labellings of classified instances using citation links. Our results show that, combining contents and citations significantly improves the system performance.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Borko, H., Bernick, M.: Automatic document classification. J. ACM 10, 151–162 (1963)
Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34, 1–47 (2002)
Han, E.-H., Karypis, G.: Centroid-Based Document Classification: Analysis and Experimental Results. Principles of Data Mining and Knowledge Discovery, 424–431 (2000)
Witten, I.H., Frank, E.: Data Mining, Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers, San Francisco (2000)
Lewis, D.D.: Naive (Bayes) at forty: The independence assumption in information retrieval. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 4–15. Springer, Heidelberg (1998)
Nigam, K., Lafferty, J., McCallum, A.: Using maximum entropy for text classification. In: IJCAI-1999 Workshop on Machine Learning for Information Filtering, pp. 61–67 (1999)
Wiener, E., Pedersen, L.O., Weigend, A.S.: A neural network approach to topic spotting. In: Proc. of the Symposium on Document Analysis and Information Retrieval, pp. 317–332 (1995)
Joachims, T.: Text categorization with support vector machines: learning with many relevant features. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 137–142. Springer, Heidelberg (1998)
Mitchell, T.: Machine Learning. McGraw-Hill, New York (1997)
Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems 30, 107–117 (1998)
Getoor, L., Friedman, N., Koller, D., Taskar, B.: Learning probabilistic models of link structure. J. Mach. Learn. Res. 3, 679–707 (2003)
Taskar, B., Segal, E., Koller, D.: Probabilistic classification and clustering in relational data. In: Nebel, B. (ed.) Proceeding of IJCAI-2001, 17th International Joint Conference on Artificial Intelligence, Seattle, US, pp. 870–878 (2001)
Craven, M., Slattery, S.: Relational learning with statistical predicate invention: Better models for hypertext. Mach. Learn. 43, 97–119 (2001)
Quinlan, J.R.: Learning logical definitions from relations. Mach. Learn. 5, 239–266 (1990)
Cohen, W.: Learning to classify English text with ILP methods. In: Advances in Inductive Logic Programming, pp. 124–143. IOS Press, Amsterdam (1996)
Junker, M., Sintek, M., Rinck, M.: Learning for text categorization and information extraction with ILP. In: Cussens, J. (ed.) Proceedings of the 1st Workshop on Learning Language in Logic, Bled, Slovenia, pp. 84–93 (1999)
Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: Fisher, D.H. (ed.) Proceedings of ICML-1997, 14th International Conference on Machine Learning, Nashville, US, pp. 412–420. Morgan Kaufmann Publishers, San Francisco (1997)
Porter, M.F.: An algorithm for suffix stripping. Readings in Information Retrieval, 313–316 (1997)
McCallum, A.K., Nigam, K., Rennie, J., Seymore, K.: Automating the construction of internet portals with machine learning. Information Retrieval 3, 127–163 (2000)
McCallum, A., Nigam, K.: A comparison of event models for naive Bayes text classification. In: AAAI-1998 Workshop on Learning for Text Categorization (1998)
Lewis, D.D., Ringuette, M.: A comparison of two learning algorithms for text categorization. In: Proceedings of SDAIR-1994, 3rd Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, US, pp. 81–93 (1994)
Lewis, D.: An evaluation of prasal and clustered representation of text categorisation tasks. In: Proceedings of SIGIR-1992, 15th ACM International Conference on Reseach and Deveplopment in Information Retrieval, pp. 289–297 (1992)
Chakrabarti, S., Dom, B., Indyk, P.: Enhanced hypertext categorization using hyperlinks. In: SIGMOD 1998: Proceedings of the 1998 ACM SIGMOD international conference on Management of data, pp. 307–318. ACM Press, New York (1998)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cao, M.D., Gao, X. (2005). Combining Contents and Citations for Scientific Document Classification. In: Zhang, S., Jarvis, R. (eds) AI 2005: Advances in Artificial Intelligence. AI 2005. Lecture Notes in Computer Science(), vol 3809. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11589990_17
Download citation
DOI: https://doi.org/10.1007/11589990_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30462-3
Online ISBN: 978-3-540-31652-7
eBook Packages: Computer ScienceComputer Science (R0)