Variable Global Feature Selection Scheme for automatic classification of text documents

Authors:

Deepak Agnihotri,

Priyanka Tripathi

Expert Systems with Applications: An International Journal, Volume 81, Issue C

Pages 268 - 281

https://doi.org/10.1016/j.eswa.2017.03.057

Published: 15 September 2017

Abstract

A novel Variable Global Feature Selection Scheme (VGFSS) is proposed. VGFSS selects a variable number of features from each class instead of an equal number. The selection of features in VGFSS is based on the distribution of terms in the classes. The methods are evaluated using the Macro_F1 and Micro_F1 measures, followed by a Z-test. The VGFSS algorithm performs best among seven competing methods on benchmark datasets.

Feature selection is important for speeding up Automatic Text Document Classification (ATDC). At present, the most common approach to selecting discriminating features is the Global Filter-based Feature Selection Scheme (GFSS). GFSS assigns a score to each feature based on its discriminating power and selects the top-N features from the feature set, where N is an empirically determined number. As a result, the features of some classes may be discarded, either partially or completely. The Improved Global Feature Selection Scheme (IGFSS) solves this issue by selecting an equal number of representative features from all the classes. However, it struggles with unbalanced datasets that have a large number of classes, because the distribution of features across these classes is highly variable. In this case, choosing an equal number of features from each class may exclude some important features from a class that contains a higher number of features. To overcome this problem, we propose a novel Variable Global Feature Selection Scheme (VGFSS), which selects a variable number of features from each class based on the distribution of terms in the classes. It also ensures that a minimum number of terms is selected from each class. The numerical results on benchmark datasets show the effectiveness of the proposed VGFSS algorithm over classical information science methods and IGFSS.
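The contrast the abstract draws between a global top-N cut, an equal per-class quota, and a variable per-class quota can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not state the quota formula, so the proportional split with a floor used here is an assumption, as are the function names, the toy scores, and the simplification that each term belongs to exactly one class.

```python
from collections import Counter


def top_n_global(scores, n):
    """Global top-N cut: keep the n highest-scoring terms, ignoring classes."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:n]


def select_with_quota(scores, term_class, quota):
    """Keep the best quota[c] terms of every class c (classes in sorted order)."""
    selected = []
    for c in sorted(quota):
        terms = [t for t in scores if term_class[t] == c]
        terms.sort(key=lambda t: (-scores[t], t))
        selected.extend(terms[:quota[c]])
    return selected


def equal_quota(term_class, n):
    """Equal split: every class gets the same share of n."""
    classes = set(term_class.values())
    return {c: n // len(classes) for c in classes}


def variable_quota(term_class, n, min_per_class=1):
    """Assumed rule: split n in proportion to each class's term count,
    but never go below min_per_class. The total may differ slightly from n."""
    counts = Counter(term_class.values())
    total = sum(counts.values())
    return {c: max(min_per_class, n * k // total) for c, k in counts.items()}


# Toy, made-up data: six terms tied to "sports", two tied to "law".
scores = {"goal": 0.9, "match": 0.8, "team": 0.7, "league": 0.6,
          "coach": 0.5, "score": 0.4, "court": 0.3, "judge": 0.2}
term_class = {t: "sports" for t in ("goal", "match", "team",
                                    "league", "coach", "score")}
term_class.update({"court": "law", "judge": "law"})

global_pick = top_n_global(scores, 4)
equal_pick = select_with_quota(scores, term_class, equal_quota(term_class, 4))
variable_pick = select_with_quota(scores, term_class,
                                  variable_quota(term_class, 4))

print(global_pick)    # no "law" term survives the global cut
print(equal_pick)     # two terms per class
print(variable_pick)  # three "sports" terms, one "law" term
```

On this toy data the global cut drops the smaller class entirely, the equal quota gives up a higher-scoring term from the larger class, and the variable quota keeps every class represented while following the term distribution.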

Published In

Expert Systems with Applications: An International Journal Volume 81, Issue C

September 2017

455 pages

ISSN:0957-4174

Copyright © Elsevier Ltd.

Publisher

Pergamon Press, Inc.

United States
