research-article

Hostility measure for multi-level study of data complexity

Authors:

Isaac Martín De Diego,

Víctor Aceña,

Javier M. Moguerza

Applied Intelligence, Volume 53, Issue 7

Pages 8073 - 8096

https://doi.org/10.1007/s10489-022-03793-w

Published: 26 July 2022

Abstract

Complexity measures aim to characterize the underlying complexity of supervised data. These measures address factors that hinder the performance of Machine Learning (ML) classifiers, such as overlap, density, and linearity. The state of the art has mainly focused on the dataset perspective of complexity, i.e., estimating the complexity of the whole dataset. Recently, the instance perspective has also been addressed. This paper proposes the hostility measure, a complexity measure that offers a multi-level (instance, class, and dataset) perspective of data complexity. The proposal is built on the novel notion of hostility: the difficulty of correctly classifying a point, a class, or a whole dataset given their corresponding neighborhoods. The measure is estimated at the instance level by applying the k-means algorithm recursively and hierarchically, which makes it possible to analyze how points from different classes are naturally grouped together across partitions. The instance information is then aggregated to provide complexity knowledge at the class and dataset levels. The validity of the proposal is evaluated through a variety of experiments covering the three perspectives, including a comparison with state-of-the-art measures. Throughout the experiments, the hostility measure showed promising results and proved to be competitive, stable, and robust.

Cited By

Lorena A, Paiva P, Prudêncio R (2024) Trusting My Predictions: On the Value of Instance-Level Analysis. ACM Computing Surveys 56:7 (1-28). Online publication date: 9-Apr-2024
https://dl.acm.org/doi/10.1145/3615354
Cuesta M, Lancho C, Fernández-Isabel A, Cano E, Martín De Diego I (2024) CSViz: Class Separability Visualization for high-dimensional datasets. Applied Intelligence 54:1 (924-946). Online publication date: 1-Jan-2024
https://dl.acm.org/doi/10.1007/s10489-023-05149-4


Published In

Applied Intelligence Volume 53, Issue 7

Apr 2023

1164 pages

ISSN:0924-669X

© The Author(s) 2022.

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 26 July 2022

Accepted: 21 May 2022

Funding Sources

Universidad Rey Juan Carlos
Comunidad de Madrid
Ministerio de Ciencia, Innovación y Universidades
