DOI: 10.1145/3374135.3385285

Research Article

COMB: A Hybrid Method for Cross-validated Feature Selection

Published: 25 May 2020
    Abstract

    When seeking insights from massive amounts of data, supervised classification problems require preprocessing to optimize computation. Among the various preprocessing steps, feature selection (FS) ensures that machine learning methods receive only relevant data. We propose hybrid FS methods that combine unsupervised classification, statistical scoring, and a wrapper method. Across tests on twelve datasets, the performance gains of our novel method over existing FS methods represent an advancement in supervised classification.
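    The paper's COMB implementation is not reproduced here; the sketch below only illustrates the general shape of a hybrid FS pipeline, assuming a mutual-information filter stage followed by a recursive-feature-elimination wrapper with a random forest, evaluated with cross-validation. All concrete choices (scorer, estimator, feature counts) are illustrative assumptions, and the paper's unsupervised mini-batch k-means stage is omitted.

```python
# Illustrative sketch of a generic hybrid filter + wrapper feature-selection
# pipeline, NOT the paper's COMB method. Scorer, estimator, and feature
# counts are arbitrary assumptions for demonstration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for one of the paper's twelve dataset problems.
X, y = make_classification(n_samples=300, n_features=30, n_informative=8,
                           random_state=0)

pipe = Pipeline([
    # Stage 1 (filter): statistical scoring keeps the 15 highest-scoring
    # features by mutual information with the labels.
    ("filter", SelectKBest(mutual_info_classif, k=15)),
    # Stage 2 (wrapper): recursively eliminate features using a random
    # forest's feature importances until 8 remain.
    ("wrapper", RFE(RandomForestClassifier(n_estimators=50, random_state=0),
                    n_features_to_select=8)),
    # Final supervised classifier trained on the selected subset.
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Cross-validated accuracy of the whole select-then-classify pipeline.
scores = cross_val_score(pipe, X, y, cv=5)
print(round(scores.mean(), 3))
```

    Running the selection inside the cross-validation pipeline (rather than selecting features once on the full dataset) is what makes the evaluation honestly "cross-validated": each fold selects features from its own training split only.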


Cited By

    • (2020) A novel hybrid feature selection and modified KNN prediction model for coal and gas outbursts. Journal of Intelligent & Fuzzy Systems 39, 5 (2020), 7671--7691. https://doi.org/10.3233/JIFS-200937. Online publication date: 1-Jan-2020.


    Published In

    ACM SE '20: Proceedings of the 2020 ACM Southeast Conference
    April 2020
    337 pages
    ISBN:9781450371056
    DOI:10.1145/3374135

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Classification
    2. Cross-validation
    3. Feature selection
    4. Filter Method
    5. Hybrid Feature Selection
    6. Mini-batch K-means
    7. Random Forest
    8. Supervised Machine Learning
    9. Wrapper Method

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ACM SE '20: 2020 ACM Southeast Conference
    April 2--4, 2020
    Tampa, FL, USA

    Acceptance Rates

    Overall Acceptance Rate 178 of 377 submissions, 47%
