research-article

Variable Global Feature Selection Scheme for automatic classification of text documents

Published: 15 September 2017

Abstract

• A novel Variable Global Feature Selection Scheme (VGFSS) is proposed.
• VGFSS selects a variable number of features from each class instead of an equal number.
• Feature selection in VGFSS is based on the distribution of terms in the classes.
• The methods are evaluated using the Macro_F1 and Micro_F1 measures, followed by a Z-test.
• VGFSS performs best among seven competing methods on benchmark datasets.

Feature selection is important for speeding up Automatic Text Document Classification (ATDC). At present, the most common approach to selecting discriminating features is the Global Filter-based Feature Selection Scheme (GFSS). GFSS assigns a score to each feature based on its discriminating power and selects the top-N features from the feature set, where N is an empirically determined number. As a result, the features of a few classes may be discarded either partially or completely. The Improved Global Feature Selection Scheme (IGFSS) solves this issue by selecting an equal number of representative features from every class. However, it struggles with unbalanced datasets that have a large number of classes, because the distribution of features across these classes is highly variable. In that case, choosing an equal number of features from each class may exclude important features from the classes that contain more features.

To overcome this problem, we propose a novel Variable Global Feature Selection Scheme (VGFSS), which selects a variable number of features from each class based on the distribution of terms in the classes. It also ensures that a minimum number of terms is selected from each class. Numerical results on benchmark datasets show the effectiveness of the proposed VGFSS algorithm over classical information science methods and IGFSS.
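The contrast between the three selection schemes described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not give the scoring function or the exact allocation rule, so the sketch assumes that each term is already tagged with a single class and a precomputed score, and that the variable scheme allocates per-class quotas in proportion to the number of terms in each class, with a floor of `min_per_class`. The function names, the toy data, and the proportional rule are all hypothetical.

```python
from collections import defaultdict

# Toy input (hypothetical): term -> (class label, discriminating score).
# Assumption: each term has already been assigned to one class and scored.
features = {
    "goal": ("sports", 0.90), "match": ("sports", 0.85),
    "league": ("sports", 0.80), "coach": ("sports", 0.75),
    "striker": ("sports", 0.70), "referee": ("sports", 0.65),
    "court": ("law", 0.40), "verdict": ("law", 0.35),
}

def group_by_class(features):
    """Group terms by class, best-scoring first within each class."""
    groups = defaultdict(list)
    for term, (label, score) in features.items():
        groups[label].append((score, term))
    for label in groups:
        groups[label].sort(reverse=True)
    return groups

def select_global(features, n):
    """GFSS-style: keep the n best-scoring terms, ignoring class."""
    ranked = sorted(features, key=lambda t: features[t][1], reverse=True)
    return ranked[:n]

def select_equal(features, n):
    """IGFSS-style: keep the same number of terms from every class."""
    groups = group_by_class(features)
    per_class = n // len(groups)
    return [t for label in sorted(groups) for _, t in groups[label][:per_class]]

def select_variable(features, n, min_per_class=1):
    """Variable per-class quota (assumed rule, not the paper's):
    proportional to class size, never below min_per_class.
    Rounding means the total can differ slightly from n."""
    groups = group_by_class(features)
    total = sum(len(g) for g in groups.values())
    selected = []
    for label in sorted(groups):
        quota = max(min_per_class, round(n * len(groups[label]) / total))
        selected.extend(t for _, t in groups[label][:quota])
    return selected

print(select_global(features, 4))    # ['goal', 'match', 'league', 'coach']
print(select_equal(features, 4))     # ['court', 'verdict', 'goal', 'match']
print(select_variable(features, 4))  # ['court', 'goal', 'match', 'league']
```

On this toy data the global top-N selection drops the "law" class entirely, the equal split spends half the budget on the small class, and the proportional quota keeps at least one "law" term while leaving more room for the larger class.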


Published In

Expert Systems with Applications: An International Journal, Volume 81, Issue C
September 2017
455 pages

Publisher

Pergamon Press, Inc., United States

Author Tags

1. Feature selection
2. Text analysis
3. Text document classification
4. Text mining
Cited By

• (2024) Filter feature selection methods for text classification: a review. Multimedia Tools and Applications 83:1 (2053-2091). DOI: 10.1007/s11042-023-15675-5. Online publication date: 1-Jan-2024
• (2024) Feature reduction of unbalanced data classification based on density clustering. Computing 106:1 (29-55). DOI: 10.1007/s00607-023-01206-5. Online publication date: 1-Jan-2024
• (2023) A novel filter feature selection method for text classification. Journal of Information Science 49:1 (59-78). DOI: 10.1177/0165551521991037. Online publication date: 1-Feb-2023
• (2023) A novel feature and class-based globalization technique for text classification. Multimedia Tools and Applications 82:24 (37635-37660). DOI: 10.1007/s11042-023-15459-x. Online publication date: 25-Apr-2023
• (2023) Ensemble feature selection for single-label text classification: a comprehensive analytical study. Neural Computing and Applications 35:26 (19235-19251). DOI: 10.1007/s00521-023-08763-y. Online publication date: 22-Jun-2023
• (2022) Aspect2Labels. Expert Systems with Applications: An International Journal 209:C. DOI: 10.1016/j.eswa.2022.118119. Online publication date: 15-Dec-2022
• (2022) Detecting abusive Instagram comments in Turkish using convolutional Neural network and machine learning methods. Expert Systems with Applications: An International Journal 174:C. DOI: 10.1016/j.eswa.2021.114802. Online publication date: 6-May-2022
• (2022) Completed sample correlations and feature dependency-based unsupervised feature selection. Multimedia Tools and Applications 82:10 (15305-15326). DOI: 10.1007/s11042-022-13903-y. Online publication date: 3-Oct-2022
• (2021) The effects of globalisation techniques on feature selection for text classification. Journal of Information Science 47:6 (727-739). DOI: 10.1177/0165551520930897. Online publication date: 1-Dec-2021
• (2021) A Two-stage Text Feature Selection Algorithm for Improving Text Classification. ACM Transactions on Asian and Low-Resource Language Information Processing 20:3 (1-19). DOI: 10.1145/3425781. Online publication date: 5-May-2021
