How Complex Is Your Classification Problem?: A Survey on Measuring Classification Complexity

Published: 13 September 2019
  Abstract

    Characteristics extracted from the training datasets of classification problems have proven to be effective predictors in a number of meta-analyses. Among them, measures of classification complexity can be used to estimate the difficulty of separating the data points into their expected classes. Descriptors of the spatial distribution of the data and estimates of the shape and size of the decision boundary are among the known measures used for this characterization. This information can support the formulation of new data-driven pre-processing and pattern-recognition techniques, which can in turn focus on the challenges highlighted by such characteristics of the problems. This article surveys and analyzes measures that can be extracted from training datasets to characterize the complexity of the respective classification problems. Their use in the recent literature is also reviewed and discussed, making it possible to identify opportunities for future work in the area. Finally, an R package named Extended Complexity Library (ECoL), which implements a set of the complexity measures and is made publicly available, is described.
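    To make the idea of a data-extracted complexity measure concrete, below is a minimal sketch (not the article's exact formulation, and independent of the ECoL package) of the maximum Fisher's discriminant ratio, commonly called F1, one of the classic feature-overlap measures in this literature. The function name and the zero-variance guard are the author's own illustrative choices.

    ```python
    import numpy as np

    def fisher_discriminant_ratio(X, y):
        """Maximum Fisher's discriminant ratio (F1) over all features.

        A high value means at least one feature separates the classes
        well, i.e., the problem is simpler along that dimension.
        """
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        classes = np.unique(y)
        overall_mean = X.mean(axis=0)
        between = np.zeros(X.shape[1])  # between-class scatter per feature
        within = np.zeros(X.shape[1])   # within-class scatter per feature
        for c in classes:
            Xc = X[y == c]
            mu_c = Xc.mean(axis=0)
            between += len(Xc) * (mu_c - overall_mean) ** 2
            within += ((Xc - mu_c) ** 2).sum(axis=0)
        ratios = between / np.maximum(within, 1e-12)  # guard zero variance
        return ratios.max()
    ```

    Note that some formulations report the inverse, 1/(1 + r), so that higher values indicate more complex (less separable) problems; only the direction of the scale changes.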



      Published In

      ACM Computing Surveys, Volume 52, Issue 5
      September 2020, 791 pages
      ISSN: 0360-0300
      EISSN: 1557-7341
      DOI: 10.1145/3362097
      Editor: Sartaj Sahni
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 13 September 2019
      Accepted: 01 July 2019
      Revised: 01 February 2019
      Received: 01 December 2017
      Published in CSUR Volume 52, Issue 5


      Author Tags

      1. Supervised machine learning
      2. classification
      3. complexity measures

      Qualifiers

      • Survey
      • Research
      • Refereed

      Funding Sources

      • FAPESP
      • Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001
      • CNPq
      • CAPES-COFECUB
      • CAPES

      Article Metrics

      • Downloads (Last 12 months)419
      • Downloads (Last 6 weeks)37
      Reflects downloads up to 13 Aug 2024

      Cited By

      • (2025) "A new data complexity measure for multi-class imbalanced classification tasks." Pattern Recognition 157, 110881. DOI: 10.1016/j.patcog.2024.110881. Online publication date: Jan-2025.
      • (2024) "A Novel Analytical Method Based on Classification Complexity in Representation Spaces for Continual Learning." Transactions of the Japanese Society for Artificial Intelligence 39, 2 (A-N41_1-11). DOI: 10.1527/tjsai.39-2_A-N41. Online publication date: 1-Mar-2024.
      • (2024) "KCO: Balancing class distribution in just-in-time software defect prediction using kernel crossover oversampling." PLOS ONE 19, 4 (e0299585). DOI: 10.1371/journal.pone.0299585. Online publication date: 11-Apr-2024.
      • (2024) "The role of classifiers and data complexity in learned Bloom filters: insights and recommendations." Journal of Big Data 11, 1. DOI: 10.1186/s40537-024-00906-9. Online publication date: 27-Mar-2024.
      • (2024) "Data Complexity: A New Perspective for Analyzing the Difficulty of Defect Prediction Tasks." ACM Transactions on Software Engineering and Methodology 33, 6 (1-45). DOI: 10.1145/3649596. Online publication date: 27-Jun-2024.
      • (2024) "It Is All about Data: A Survey on the Effects of Data on Adversarial Robustness." ACM Computing Surveys 56, 7 (1-41). DOI: 10.1145/3627817. Online publication date: 9-Apr-2024.
      • (2024) "Trusting My Predictions: On the Value of Instance-Level Analysis." ACM Computing Surveys 56, 7 (1-28). DOI: 10.1145/3615354. Online publication date: 9-Apr-2024.
      • (2024) "Recurrence Rate spectrograms for the classification of nonlinear and noisy signals." Physica Scripta 99, 3 (035223). DOI: 10.1088/1402-4896/ad1fbe. Online publication date: 9-Feb-2024.
      • (2024) "Hidden classification layers: Enhancing linear separability between classes in neural networks layers." Pattern Recognition Letters 177 (69-74). DOI: 10.1016/j.patrec.2023.11.016. Online publication date: Jan-2024.
      • (2024) "Portability rules detection by Epilepsy Tracking META-Set Analysis." Neuroscience Informatics 4, 3 (100168). DOI: 10.1016/j.neuri.2024.100168. Online publication date: Sep-2024.
      (List truncated.)
