
WRND: A weighted oversampling framework with relative neighborhood density for imbalanced noisy classification

Published: 25 June 2024

Abstract

Imbalanced data and label noise are ubiquitous challenges in data mining and machine learning that severely impair classification performance. The synthetic minority oversampling technique (SMOTE) and its variants address class imbalance, but they are sensitive to hyperparameter choices such as the number of nearest neighbors k, their performance deteriorates in the presence of noise, they rarely exploit data distribution information, and they can incur high computational complexity. Furthermore, SMOTE-based methods generate new samples by random linear interpolation between each minority class sample and one of its randomly selected k-nearest neighbors, regardless of differences among samples and their distribution. To address these problems, an adaptive, robust, and general weighted oversampling framework based on relative neighborhood density (WRND) is proposed. It can be combined easily with most SMOTE-based sampling algorithms and improves their performance. First, it adaptively distinguishes and filters noisy and outlier samples by introducing natural neighbors, which inherently avoids the extra noise and class overlap that synthesizing from noisy samples would introduce. The relative neighborhood density of each sample is then computed, reflecting the intra-class and inter-class distribution within its natural neighborhood. To alleviate the blindness of SMOTE-based interpolation, the number and locations of synthetic samples are assigned according to this distribution information and a reasonable generalization of the natural neighborhoods of the original samples. Extensive experiments on 23 benchmark datasets with six classic classifiers, eight pairs of representative sampling algorithms, and two state-of-the-art frameworks demonstrate the effectiveness of the WRND framework. Code and framework are available at https://github.com/dream-lm/WRND_framework.
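The following is a minimal, illustrative Python sketch of the workflow the abstract describes: natural-neighbor search, noise filtering, relative neighborhood density, and density-weighted interpolation. The function names natural_neighbors and wrnd_oversample, the density formula, and the stopping rule for the neighborhood search are assumptions made for illustration; they are not taken from the paper, whose exact definitions may differ.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def natural_neighbors(X, max_k=20):
        """Grow the neighborhood size k until (nearly) every sample has at least
        one mutual (natural) neighbor, then return each sample's natural neighbors."""
        n = len(X)
        nn = NearestNeighbors(n_neighbors=min(max_k + 1, n)).fit(X)
        _, idx = nn.kneighbors(X)                      # column 0 is the sample itself
        neighbors = [set() for _ in range(n)]
        prev_orphans = n + 1
        for k in range(1, min(max_k, n - 1) + 1):
            for i in range(n):
                j = idx[i, k]
                if i in idx[j, 1:k + 1]:               # i and j choose each other: mutual
                    neighbors[i].add(j)
                    neighbors[j].add(i)
            orphans = sum(1 for s in neighbors if not s)
            if orphans == 0 or orphans == prev_orphans:  # assumed stopping rule
                break
            prev_orphans = orphans
        return neighbors

    def wrnd_oversample(X, y, minority_label, random_state=0):
        """Filter noisy minority seeds, weight them by relative neighborhood density,
        and interpolate new samples inside their natural neighborhoods."""
        rng = np.random.default_rng(random_state)
        neighbors = natural_neighbors(X)
        min_idx = np.flatnonzero(y == minority_label)
        n_new = max(int(np.sum(y != minority_label)) - len(min_idx), 0)

        seeds, weights = [], []
        for i in min_idx:
            nbrs = list(neighbors[i])
            same = [j for j in nbrs if y[j] == minority_label]
            if not nbrs or not same:                   # assumed noise/outlier: never a seed
                continue
            seeds.append(i)
            weights.append(len(same) / len(nbrs))      # relative neighborhood density (assumed form)

        if not seeds or n_new == 0:
            return X, y
        weights = np.asarray(weights) / np.sum(weights)
        counts = rng.multinomial(n_new, weights)       # more synthesis in denser, safer regions

        synthetic = []
        for i, c in zip(seeds, counts):
            same = [j for j in neighbors[i] if y[j] == minority_label]
            for _ in range(c):
                j = rng.choice(same)
                synthetic.append(X[i] + rng.random() * (X[j] - X[i]))

        X_new = np.vstack([X, np.asarray(synthetic)])
        y_new = np.concatenate([y, np.full(len(synthetic), minority_label)])
        return X_new, y_new

A typical call under these assumptions would be X_res, y_res = wrnd_oversample(X, y, minority_label=1), which balances the minority class up to the majority count while skipping seeds whose natural neighborhood contains no same-class samples.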


Published In

Expert Systems with Applications: An International Journal, Volume 241, Issue C
May 2024
1588 pages

Publisher

Pergamon Press, Inc.

United States

Publication History

Published: 25 June 2024

Author Tags

  1. Imbalanced classification
  2. Label noise
  3. Oversampling framework
  4. Relative neighborhood density

Qualifiers

  • Research-article
