Efficient feature selection for logical analysis of large-scale multi-class datasets

Journal of Combinatorial Optimization

Abstract

Feature selection in logical analysis of data (LAD) can be cast as a set covering problem. In this paper, extending earlier results on feature selection for binary classification using LAD, we present a mathematical model that selects a minimum set of necessary features for multi-class datasets and develop a heuristic algorithm for this model that is efficient in both memory and time. The utility of the algorithm is illustrated on a small example, and its advantages are demonstrated through experiments on six real-life multi-class datasets from the UCI repository.
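To make the set covering view concrete, the sketch below treats every pair of observations from different classes as an element to be covered, and lets a binary feature cover a pair when the two observations disagree on it. It then applies a plain greedy set-cover rule (pick the feature that separates the most uncovered pairs). This is only an illustration under stated assumptions: it is not the heuristic developed in the paper, it materializes all pairs explicitly (so it is not memory efficient), and the function name `greedy_support_set` and the toy data are hypothetical.

```python
from itertools import combinations


def greedy_support_set(rows, labels):
    """Greedy set-cover heuristic for choosing distinguishing features.

    rows   : list of equal-length 0/1 tuples (binarized observations)
    labels : list of class labels, one per row
    Returns a sorted list of feature indices such that every pair of rows
    with different labels differs on at least one selected feature.
    """
    n_features = len(rows[0])
    # Each pair of observations from different classes is an element to cover.
    pairs = [
        (i, j)
        for i, j in combinations(range(len(rows)), 2)
        if labels[i] != labels[j]
    ]
    # Feature f "covers" a pair when the two rows disagree on f.
    covers = [
        {p for p in pairs if rows[p[0]][f] != rows[p[1]][f]}
        for f in range(n_features)
    ]
    uncovered = set(pairs)
    selected = []
    while uncovered:
        # Pick the feature that separates the most still-uncovered pairs.
        best = max(range(n_features), key=lambda f: len(covers[f] & uncovered))
        gained = covers[best] & uncovered
        if not gained:
            raise ValueError("two rows from different classes are identical")
        selected.append(best)
        uncovered -= gained
    return sorted(selected)


# Toy dataset: 4 binary features, 3 classes (hypothetical values).
rows = [
    (1, 0, 1, 0),
    (1, 1, 1, 0),
    (0, 0, 1, 1),
    (0, 1, 0, 1),
    (1, 0, 0, 1),
]
labels = ["a", "a", "b", "b", "c"]
support = greedy_support_set(rows, labels)
```

The returned `support` lists feature indices that together separate every pair of observations with different class labels. The greedy rule gives no guarantee of a minimum-size set, and the number of pairs grows quadratically with the number of observations.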

Data availability and material

The data used in this paper can be found in the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/index.php.

Funding

This research was supported by the National Natural Science Foundation of China (NSFC) under grants U19A2059, 61806095, 61802183 and 61972110; the Fundamental Research Funds for the Central Universities under grant 30920021130; the Jiangsu Planned Projects for Postdoctoral Research Funds under grant 1701146B; the Natural Science Foundation of Guangdong Province under grant 2017A030307026; the National Foundation Raising Project of Shantou University under grant NFC16002; and the Scientific Research Start-up Funding Project of Shantou University under grant STF15003.

Author information

Corresponding author

Correspondence to Chanying Huang.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Code availability

The experiments were conducted using custom code written by the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Yan, K., Miao, D., Guo, C. et al. Efficient feature selection for logical analysis of large-scale multi-class datasets. J Comb Optim 42, 1–23 (2021). https://doi.org/10.1007/s10878-021-00732-2

  • DOI: https://doi.org/10.1007/s10878-021-00732-2
