Inductive Model Generation for Text Classification Using a Bipartite Heterogeneous Network

Rossi, Rafael Geraldeli; de Andrade Lopes, Alneu; de Paulo Faleiros, Thiago; Rezende, Solange Oliveira

doi:10.1007/s11390-014-1436-7

Regular Paper
Published: 09 May 2014

Volume 29, pages 361–375, (2014)

Journal of Computer Science and Technology

Abstract

Algorithms for numeric data classification have been applied to text classification. Usually, the vector space model is used to represent text collections. Characteristics of this representation, such as sparsity and high dimensionality, sometimes impair the quality of general-purpose classifiers. Networks can also be used to represent text collections, avoiding the high sparsity and allowing relationships among the different objects that compose a text collection to be modeled. Such network-based representations can improve the quality of the classification results. One of the simplest ways to represent a text collection as a network is through a bipartite heterogeneous network, which is composed of objects that represent the documents connected to objects that represent the terms. Bipartite heterogeneous networks do not require the computation of similarities or relations among the objects and can be used to model any type of text collection. Given these advantages, in this article we present a text classifier that builds a classification model using the structure of a bipartite heterogeneous network. The algorithm, referred to as IMBHN (Inductive Model Based on Bipartite Heterogeneous Network), induces a classification model by assigning weights to the objects that represent the terms for each class of the text collection. An empirical evaluation using a large number of text collections from different domains shows that the proposed IMBHN algorithm produces significantly better results than the k-NN, C4.5, SVM, and Naive Bayes algorithms.
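
The idea described in the abstract, a document-term bipartite network in which each term object carries one weight per class, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the update rule, so the code below assumes a Widrow-Hoff (least-mean-squares) style error correction, and the function names `build_bipartite`, `train_term_weights`, and `classify`, as well as the `rate` and `epochs` parameters, are hypothetical.

```python
from collections import defaultdict


def build_bipartite(docs):
    """Build the document-term edges: one {term: frequency} dict per document.

    Each dict entry stands for an edge between a document object and a term
    object; no document-document or term-term similarity is computed.
    """
    network = []
    for text in docs:
        freqs = defaultdict(int)
        for term in text.lower().split():
            freqs[term] += 1
        network.append(dict(freqs))
    return network


def train_term_weights(network, labels, classes, rate=0.1, epochs=50):
    """Learn one weight per (term, class) pair.

    ASSUMPTION: a least-mean-squares update is used here for illustration;
    the abstract does not specify how the weights are induced.
    """
    weights = defaultdict(float)  # (term, class) -> weight
    for _ in range(epochs):
        for edges, label in zip(network, labels):
            for c in classes:
                # Class score of a document: weighted sum over its terms.
                score = sum(f * weights[(t, c)] for t, f in edges.items())
                target = 1.0 if c == label else 0.0
                error = target - score
                # Push each connected term weight toward reducing the error.
                for t, f in edges.items():
                    weights[(t, c)] += rate * f * error
    return weights


def classify(edges, weights, classes):
    """Assign the class whose term weights give the highest score."""
    def score(c):
        return sum(f * weights.get((t, c), 0.0) for t, f in edges.items())
    return max(classes, key=score)


docs = ["goal match team", "team goal score",
        "vote election party", "party vote law"]
labels = ["sports", "sports", "politics", "politics"]
classes = ["sports", "politics"]

network = build_bipartite(docs)
weights = train_term_weights(network, labels, classes)
unseen = build_bipartite(["goal team"])[0]
print(classify(unseen, weights, classes))
```

Because the learned model is just a table of term-class weights, an unseen document can be classified from its own terms alone, which is what makes the model inductive rather than tied to the documents in the training network.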

Author information

Authors and Affiliations

Institute of Mathematics and Computer Science, University of São Paulo, São Carlos, Brazil
Rafael Geraldeli Rossi, Alneu de Andrade Lopes, Thiago de Paulo Faleiros & Solange Oliveira Rezende

Corresponding author

Correspondence to Rafael Geraldeli Rossi.

Additional information

This work was supported by the São Paulo Research Foundation (FAPESP) of Brazil under Grant Nos. 2011/12823-6, 2011/23689-9, and 2011/19850-9.

Electronic supplementary material

Below is the link to the electronic supplementary material.

ESM 1 (PDF 76 kb)

About this article

Cite this article

Rossi, R.G., de Andrade Lopes, A., de Paulo Faleiros, T. et al. Inductive Model Generation for Text Classification Using a Bipartite Heterogeneous Network. J. Comput. Sci. Technol. 29, 361–375 (2014). https://doi.org/10.1007/s11390-014-1436-7

Received: 02 September 2013
Revised: 06 March 2014
Published: 09 May 2014
Issue Date: May 2014
DOI: https://doi.org/10.1007/s11390-014-1436-7
