Abstract
Algorithms for numeric data classification have been applied to text classification. Text collections are usually represented with the vector space model. Characteristics of this representation, such as sparsity and high dimensionality, sometimes impair the quality of general-purpose classifiers. Networks can also be used to represent text collections; they avoid the high sparsity and make it possible to model relationships among the different objects that compose a text collection. Such network-based representations can improve the quality of the classification results. One of the simplest ways to represent a text collection as a network is a bipartite heterogeneous network, in which objects that represent the documents are connected to objects that represent the terms. Bipartite heterogeneous networks do not require the computation of similarities or relations among the objects and can be used to model any type of text collection. Given these advantages, in this article we present a text classifier that builds a classification model using the structure of a bipartite heterogeneous network. The algorithm, referred to as IMBHN (Inductive Model Based on Bipartite Heterogeneous Network), induces a classification model by assigning weights, for each class of the text collection, to the objects that represent the terms. An empirical evaluation using a large number of text collections from different domains shows that the proposed IMBHN algorithm produces significantly better results than the k-NN, C4.5, SVM, and Naive Bayes algorithms.
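The idea of inducing per-class term weights from a document-term network can be illustrated with a minimal sketch. This is not the authors' IMBHN implementation: the abstract does not state the update rule, so the sketch assumes a least-mean-squares style error correction, term frequency as the edge weight, and whitespace tokenization. All function names and the toy documents are hypothetical.

```python
def build_bipartite_network(docs):
    """Map each document object to its term objects; edge weight = term frequency."""
    network = {}
    for doc_id, text in enumerate(docs):
        edges = {}
        for term in text.lower().split():  # assumed tokenization: whitespace
            edges[term] = edges.get(term, 0) + 1
        network[doc_id] = edges
    return network


def score(edges, weights, cls):
    """Sum over a document's edges of edge weight times the term's weight for cls."""
    return sum(freq * weights.get((term, cls), 0.0) for term, freq in edges.items())


def train_term_weights(network, labels, classes, rate=0.1, epochs=50):
    """Learn one weight per (term, class) pair with an assumed error-correcting update."""
    weights = {}
    for _ in range(epochs):
        for doc_id, edges in network.items():
            for cls in classes:
                target = 1.0 if labels[doc_id] == cls else 0.0
                error = target - score(edges, weights, cls)
                # Spread the error over the terms connected to this document.
                for term, freq in edges.items():
                    key = (term, cls)
                    weights[key] = weights.get(key, 0.0) + rate * freq * error
    return weights


def classify(text, weights, classes):
    """Assign an unseen document to the class with the highest summed term weight."""
    edges = build_bipartite_network([text])[0]
    return max(classes, key=lambda cls: score(edges, weights, cls))


# Toy collection (hypothetical data, for illustration only).
docs = ["goal match team", "team goal score", "vote election party", "party vote law"]
labels = ["sport", "sport", "politics", "politics"]
classes = ["sport", "politics"]

network = build_bipartite_network(docs)
weights = train_term_weights(network, labels, classes)
print(classify("goal team", weights, classes))
print(classify("vote party", weights, classes))
```

Because the learned model is just a table of term weights per class, new documents can be classified without access to the training documents, which is what makes a model of this kind inductive.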
Additional information
This work is supported by the São Paulo Research Foundation (FAPESP) of Brazil under Grant Nos. 2011/12823-6, 2011/23689-9, and 2011/19850-9.
Electronic supplementary material
Below is the link to the electronic supplementary material.
ESM 1 (PDF 76 kb)
About this article
Cite this article
Rossi, R.G., de Andrade Lopes, A., de Paulo Faleiros, T. et al. Inductive Model Generation for Text Classification Using a Bipartite Heterogeneous Network. J. Comput. Sci. Technol. 29, 361–375 (2014). https://doi.org/10.1007/s11390-014-1436-7
DOI: https://doi.org/10.1007/s11390-014-1436-7