Research article · DOI: 10.1145/1401890.1401986

Stable feature selection via dense feature groups

Published: 24 August 2008

Abstract

Many feature selection algorithms have been proposed in the past, focusing on improving classification accuracy. In this work, we point out the importance of stable feature selection for knowledge discovery from high-dimensional data, and identify two causes of instability in feature selection algorithms: selection of a minimum subset without redundant features, and small sample size. We propose a general framework for stable feature selection which emphasizes both good generalization and stability of feature selection results. The framework identifies dense feature groups based on kernel density estimation and treats the features in each dense group as a coherent entity for feature selection. An efficient algorithm, DRAGS (Dense Relevant Attribute Group Selector), is developed under this framework. We also introduce a general measure for assessing the stability of feature selection algorithms. Our empirical study based on microarray data verifies that dense feature groups remain stable under random sample hold-out, and that the DRAGS algorithm is effective in identifying a set of feature groups which exhibit both high classification accuracy and stability.
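The abstract only outlines the idea, so the following is a minimal sketch, not the paper's DRAGS algorithm: each feature (a column of the data matrix) is treated as a point in sample space, mean-shift moves each point to a mode of a Gaussian kernel density estimate, and features converging to the same mode form a "dense group". The function names, the bandwidth value, the mode-rounding trick, and the Jaccard-based stability score are all illustrative assumptions; the paper's actual grouping procedure and stability measure may differ.

```python
import itertools
import numpy as np

def mean_shift_modes(points, bandwidth, iters=50):
    """Shift each point toward a local mode of a Gaussian kernel
    density estimate (plain mean-shift iteration)."""
    modes = points.copy()
    for _ in range(iters):
        diff = modes[:, None, :] - points[None, :, :]
        w = np.exp(-(diff ** 2).sum(axis=2) / (2 * bandwidth ** 2))
        modes = (w[:, :, None] * points[None, :, :]).sum(axis=1) / w.sum(axis=1)[:, None]
    return modes

def dense_feature_groups(X, bandwidth, min_size=2):
    """Treat each feature (column of X) as a point in sample space and
    group features whose mean-shift trajectories reach the same mode."""
    feats = X.T  # one row per feature
    modes = mean_shift_modes(feats, bandwidth)
    groups = {}
    for i, m in enumerate(modes):
        key = tuple(np.round(m, 2))  # features sharing a mode share a key
        groups.setdefault(key, []).append(i)
    return [g for g in groups.values() if len(g) >= min_size]

def subset_stability(subsets):
    """Average pairwise Jaccard similarity between feature subsets selected
    on different resamples -- one simple stand-in for a stability measure."""
    sims = [len(set(a) & set(b)) / len(set(a) | set(b))
            for a, b in itertools.combinations(subsets, 2)]
    return sum(sims) / len(sims)

# Toy data: 6 features over 5 samples, forming two tight feature clusters.
rng = np.random.default_rng(0)
base_a = rng.normal(size=5)
base_b = base_a + 5.0
X = np.column_stack([base_a + 0.01 * rng.normal(size=5) for _ in range(3)] +
                    [base_b + 0.01 * rng.normal(size=5) for _ in range(3)])
groups = dense_feature_groups(X, bandwidth=0.5)
print(sorted(sorted(g) for g in groups))        # two groups of three features
print(subset_stability([{0, 1, 2}, {0, 1, 3}, {0, 1, 2}]))  # → 0.6666666666666666
```

Selecting one representative per group (rather than individual features) is what makes such a scheme robust to sample perturbation: under a random hold-out, the group as a whole tends to persist even when the single top-ranked feature inside it changes.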




Published In

KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2008, 1116 pages
ISBN: 9781605581934
DOI: 10.1145/1401890
General Chair: Ying Li; Program Chairs: Bing Liu, Sunita Sarawagi
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. classification
  2. feature selection
  3. high-dimensional data
  4. kernel density estimation
  5. stability

Qualifiers

  • Research-article

Conference

KDD08

Acceptance Rates

KDD '08 paper acceptance rate: 118 of 593 submissions (20%)
Overall acceptance rate: 1,133 of 8,635 submissions (13%)

Article Metrics

  • Downloads (last 12 months): 44
  • Downloads (last 6 weeks): 8
Reflects downloads up to 02 Sep 2024

Cited By

  • Improving the Feature Selection Stability of the Delta Test in Regression. IEEE Transactions on Artificial Intelligence, 5(5):1911-1917, May 2024. DOI: 10.1109/TAI.2023.3313129
  • A feature selection method for multimodal multispectral LiDAR sensing. ISPRS Journal of Photogrammetry and Remote Sensing, 212:42-57, June 2024. DOI: 10.1016/j.isprsjprs.2024.04.022
  • Review of feature selection approaches based on grouping of features. PeerJ, 11:e15666, July 2023. DOI: 10.7717/peerj.15666
  • A Comprehensive Review of Feature Selection and Feature Selection Stability in Machine Learning. Gazi University Journal of Science, 36(4):1506-1520, December 2023. DOI: 10.35378/gujs.993763
  • Ensemble feature selection with data-driven thresholding for Alzheimer's disease biomarker discovery. BMC Bioinformatics, 24(1), January 2023. DOI: 10.1186/s12859-022-05132-9
  • Feature Selection With Maximal Relevance and Minimal Supervised Redundancy. IEEE Transactions on Cybernetics, 53(2):707-717, February 2023. DOI: 10.1109/TCYB.2021.3139898
  • Evolutionary feature selection on high dimensional data using a search space reduction approach. Engineering Applications of Artificial Intelligence, 117:105556, January 2023. DOI: 10.1016/j.engappai.2022.105556
  • A new ranking-based stability measure for feature selection algorithms. Soft Computing, 27(9):5377-5396, January 2023. DOI: 10.1007/s00500-022-07767-5
  • Online Scalable Streaming Feature Selection via Dynamic Decision. ACM Transactions on Knowledge Discovery from Data, 16(5):1-20, March 2022. DOI: 10.1145/3502737
  • A Light Causal Feature Selection Approach to High-Dimensional Data. IEEE Transactions on Knowledge and Data Engineering, 1-13, 2022. DOI: 10.1109/TKDE.2022.3218786
