DOI: 10.1145/3374135.3385285

Research Article

COMB: A Hybrid Method for Cross-validated Feature Selection

Published: 25 May 2020
    Abstract

    When seeking insights from massive amounts of data, supervised classification problems require preprocessing to optimize computation. Among the various preprocessing steps, feature selection (FS) ensures that machine learning methods receive only relevant data. We propose hybrid FS methods that combine unsupervised classification, statistical scoring, and a wrapper method. Across tests on twelve datasets, the performance gains of our novel method over existing FS methods represent an advancement in supervised classification.
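    The paper's COMB implementation is not reproduced here; the sketch below only illustrates the general shape of a hybrid FS pipeline, assuming a mutual-information filter stage followed by a recursive-feature-elimination wrapper with a random forest, evaluated with cross-validation. All concrete choices (scorer, estimator, feature counts) are illustrative assumptions, and the paper's unsupervised mini-batch k-means stage is omitted.

```python
# Illustrative sketch of a generic hybrid filter + wrapper feature-selection
# pipeline, NOT the paper's COMB method. Scorer, estimator, and feature
# counts are arbitrary assumptions for demonstration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for one of the paper's twelve dataset problems.
X, y = make_classification(n_samples=300, n_features=30, n_informative=8,
                           random_state=0)

pipe = Pipeline([
    # Stage 1 (filter): statistical scoring keeps the 15 highest-scoring
    # features by mutual information with the labels.
    ("filter", SelectKBest(mutual_info_classif, k=15)),
    # Stage 2 (wrapper): recursively eliminate features using a random
    # forest's feature importances until 8 remain.
    ("wrapper", RFE(RandomForestClassifier(n_estimators=50, random_state=0),
                    n_features_to_select=8)),
    # Final supervised classifier trained on the selected subset.
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Cross-validated accuracy of the whole select-then-classify pipeline.
scores = cross_val_score(pipe, X, y, cv=5)
print(round(scores.mean(), 3))
```

    Running the selection inside the cross-validation pipeline (rather than selecting features once on the full dataset) is what makes the evaluation honestly "cross-validated": each fold selects features from its own training split only.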


Cited By

    • (2020) A novel hybrid feature selection and modified KNN prediction model for coal and gas outbursts. Journal of Intelligent & Fuzzy Systems 39, 5 (2020), 7671--7691. https://doi.org/10.3233/JIFS-200937. Online publication date: 1-Jan-2020.


    Published In

    ACM SE '20: Proceedings of the 2020 ACM Southeast Conference
    April 2020
    337 pages
    ISBN:9781450371056
    DOI:10.1145/3374135

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Classification
    2. Cross-validation
    3. Feature selection
    4. Filter Method
    5. Hybrid Feature Selection
    6. Mini-batch K-means
    7. Random Forest
    8. Supervised Machine Learning
    9. Wrapper Method

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ACM SE '20: 2020 ACM Southeast Conference
    April 2--4, 2020
    Tampa, FL, USA

    Acceptance Rates

    Overall Acceptance Rate 178 of 377 submissions, 47%
