Abstract
Feature selection in logical analysis of data (LAD) can be cast as a set covering problem. In this paper, we extend earlier results on feature selection for binary classification with LAD: we present a mathematical model that selects a minimum set of necessary features for multi-class datasets and develop a heuristic algorithm for this model that is efficient in both memory and time. The utility of the algorithm is illustrated on a small example, and its advantages are demonstrated through experiments on six real-life multi-class datasets from the UCI repository.
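To make the set covering view concrete, the following is a minimal sketch, not the algorithm developed in the paper. It assumes binary features and uses the classic greedy set-covering heuristic: every pair of observations from different classes is an element to cover, and a feature covers a pair when the two observations differ on it. The function name `greedy_support_set` and the toy dataset are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def greedy_support_set(rows, labels):
    """Greedily pick features so that every pair of rows from different
    classes differs on at least one selected feature (a set covering instance).

    Illustrative only: this is the textbook greedy heuristic, not the
    memory- and time-efficient heuristic described in the paper.
    """
    n_features = len(rows[0])
    # Elements to cover: all pairs of observations with different labels.
    pairs = [
        (i, j)
        for i, j in combinations(range(len(rows)), 2)
        if labels[i] != labels[j]
    ]
    # Feature f covers a pair when the two rows disagree on f.
    covers = {
        f: {p for p in pairs if rows[p[0]][f] != rows[p[1]][f]}
        for f in range(n_features)
    }
    uncovered = set(pairs)
    selected = []
    while uncovered:
        # Take the feature that separates the most still-uncovered pairs.
        best = max(range(n_features), key=lambda f: len(covers[f] & uncovered))
        gained = covers[best] & uncovered
        if not gained:
            raise ValueError("identical rows appear in different classes")
        selected.append(best)
        uncovered -= gained
    return sorted(selected)


# Toy binary dataset (made up for illustration): 4 features, 3 classes.
rows = [
    (0, 0, 1, 0),
    (0, 1, 1, 0),
    (1, 0, 0, 0),
    (1, 1, 0, 1),
]
labels = ["a", "a", "b", "c"]
support = greedy_support_set(rows, labels)
```

On this toy data the sketch selects features 0 and 1 (zero-based), which together separate every pair of observations from different classes. Note that it materializes all cross-class pairs explicitly, so its memory use grows quadratically with the number of observations.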
Data availability and material
The data used in this paper are available from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/index.php
Funding
This research was supported by the National Natural Science Foundation of China (NSFC) under grants U19A2059, 61806095, 61802183 and 61972110, the Fundamental Research Funds for the Central Universities under grant 30920021130, the Jiangsu Planned Projects for Postdoctoral Research Funds under grant 1701146B, the Natural Science Foundation of Guangdong Province under grant 2017A030307026, the National Foundation Raising Project of Shantou University under grant NFC16002, and the Scientific Research Start-up Funding Project of Shantou University under grant STF15003.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Code availability
The experiments were conducted using custom code written by the authors.
About this article
Cite this article
Yan, K., Miao, D., Guo, C. et al. Efficient feature selection for logical analysis of large-scale multi-class datasets. J Comb Optim 42, 1–23 (2021). https://doi.org/10.1007/s10878-021-00732-2
DOI: https://doi.org/10.1007/s10878-021-00732-2