How Complex Is Your Classification Problem?: A Survey on Measuring Classification Complexity

Published: 13 September 2019
  Abstract

    Characteristics extracted from the training datasets of classification problems have proven to be effective predictors in a number of meta-analyses. Among them, measures of classification complexity can be used to estimate the difficulty of separating the data points into their expected classes. Descriptors of the spatial distribution of the data and estimates of the shape and size of the decision boundary are among the known measures used for this characterization. This information can support the formulation of new data-driven pre-processing and pattern-recognition techniques, which can in turn focus on the challenges highlighted by such characteristics of the problems. This article surveys and analyzes measures that can be extracted from training datasets to characterize the complexity of the respective classification problems. Their use in the recent literature is also reviewed and discussed, making it possible to identify opportunities for future work in the area. Finally, an R package named Extended Complexity Library (ECoL), which implements a set of the complexity measures and is made publicly available, is described.
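    To make the idea of a data-extracted complexity measure concrete, below is a minimal sketch (not the article's exact formulation, and independent of the ECoL package) of the maximum Fisher's discriminant ratio, commonly called F1, one of the classic feature-overlap measures in this literature. The function name and the zero-variance guard are the author's own illustrative choices.

    ```python
    import numpy as np

    def fisher_discriminant_ratio(X, y):
        """Maximum Fisher's discriminant ratio (F1) over all features.

        A high value means at least one feature separates the classes
        well, i.e., the problem is simpler along that dimension.
        """
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        classes = np.unique(y)
        overall_mean = X.mean(axis=0)
        between = np.zeros(X.shape[1])  # between-class scatter per feature
        within = np.zeros(X.shape[1])   # within-class scatter per feature
        for c in classes:
            Xc = X[y == c]
            mu_c = Xc.mean(axis=0)
            between += len(Xc) * (mu_c - overall_mean) ** 2
            within += ((Xc - mu_c) ** 2).sum(axis=0)
        ratios = between / np.maximum(within, 1e-12)  # guard zero variance
        return ratios.max()
    ```

    Note that some formulations report the inverse, 1/(1 + r), so that higher values indicate more complex (less separable) problems; only the direction of the scale changes.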



      Published In

      ACM Computing Surveys, Volume 52, Issue 5
      September 2020, 791 pages
      ISSN: 0360-0300
      EISSN: 1557-7341
      DOI: 10.1145/3362097
      Editor: Sartaj Sahni
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 13 September 2019
      Accepted: 01 July 2019
      Revised: 01 February 2019
      Received: 01 December 2017
      Published in CSUR Volume 52, Issue 5


      Author Tags

      1. Supervised machine learning
      2. classification
      3. complexity measures

      Qualifiers

      • Survey
      • Research
      • Refereed

      Funding Sources

      • FAPESP
      • Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001
      • CNPq
      • CAPES-COFECUB
      • CAPES

      Article Metrics

      • Downloads (Last 12 months)419
      • Downloads (Last 6 weeks)37
      Reflects downloads up to 13 Aug 2024

      Cited By

      • (2025) "A new data complexity measure for multi-class imbalanced classification tasks." Pattern Recognition 157, 110881. DOI: 10.1016/j.patcog.2024.110881. Online publication date: Jan-2025.
      • (2024) "A Novel Analytical Method Based on Classification Complexity in Representation Spaces for Continual Learning." Transactions of the Japanese Society for Artificial Intelligence 39, 2 (A-N41_1-11). DOI: 10.1527/tjsai.39-2_A-N41. Online publication date: 1-Mar-2024.
      • (2024) "KCO: Balancing class distribution in just-in-time software defect prediction using kernel crossover oversampling." PLOS ONE 19, 4 (e0299585). DOI: 10.1371/journal.pone.0299585. Online publication date: 11-Apr-2024.
      • (2024) "The role of classifiers and data complexity in learned Bloom filters: insights and recommendations." Journal of Big Data 11, 1. DOI: 10.1186/s40537-024-00906-9. Online publication date: 27-Mar-2024.
      • (2024) "Data Complexity: A New Perspective for Analyzing the Difficulty of Defect Prediction Tasks." ACM Transactions on Software Engineering and Methodology 33, 6 (1-45). DOI: 10.1145/3649596. Online publication date: 27-Jun-2024.
      • (2024) "It Is All about Data: A Survey on the Effects of Data on Adversarial Robustness." ACM Computing Surveys 56, 7 (1-41). DOI: 10.1145/3627817. Online publication date: 9-Apr-2024.
      • (2024) "Trusting My Predictions: On the Value of Instance-Level Analysis." ACM Computing Surveys 56, 7 (1-28). DOI: 10.1145/3615354. Online publication date: 9-Apr-2024.
      • (2024) "Recurrence Rate spectrograms for the classification of nonlinear and noisy signals." Physica Scripta 99, 3 (035223). DOI: 10.1088/1402-4896/ad1fbe. Online publication date: 9-Feb-2024.
      • (2024) "Hidden classification layers: Enhancing linear separability between classes in neural networks layers." Pattern Recognition Letters 177 (69-74). DOI: 10.1016/j.patrec.2023.11.016. Online publication date: Jan-2024.
      • (2024) "Portability rules detection by Epilepsy Tracking META-Set Analysis." Neuroscience Informatics 4, 3 (100168). DOI: 10.1016/j.neuri.2024.100168. Online publication date: Sep-2024.
      (List truncated.)
