Combining over-sampling and under-sampling techniques for imbalance dataset

N Junsomboon, T Phienthrakul - … of the 9th international conference on …, 2017 - dl.acm.org
Proceedings of the 9th international conference on machine learning and …, 2017 - dl.acm.org
Imbalanced datasets are an important problem in medical data analysis because they cause diagnostic mistakes, and diagnostic results affect patients' lives. If a doctor fails to diagnose a patient who has a disease, the patient cannot be treated in time. However, the problem can be readily addressed by adding or removing data so that the classes are closer to balanced, which improves diagnostic performance. This paper proposes a solution that adjusts an imbalanced dataset by combining the Neighbor Cleaning Rule (NCL) and the Synthetic Minority Over-Sampling Technique (SMOTE). NCL is used to remove outlier samples from the majority class, and SMOTE is used to add samples to the minority class so that the dataset is closer to balanced. The balanced medical dataset is then classified with the Naive Bayes, SMO, and KNN algorithms. The experimental results show that models built from the balanced dataset achieve a higher recall rate.
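The cleaning and over-sampling steps described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names (`clean_majority`, `smote`, `ncl_then_smote`), the choice of k = 3, the Euclidean distance, the binary-class setup, and the toy two-dimensional data are all assumptions made for illustration, and the cleaning step is a simplified form of NCL that omits the class-size condition of the original rule.

```python
import math
import random

def knn_indices(X, i, k):
    """Indices of the k nearest neighbours of X[i], excluding i itself."""
    dists = sorted((math.dist(X[i], X[j]), j) for j in range(len(X)) if j != i)
    return [j for _, j in dists[:k]]

def clean_majority(X, y, majority, k=3):
    """Simplified NCL-style cleaning (assumption: binary labels).
    A sample counts as misclassified when most of its k neighbours carry
    another label. Misclassified majority samples are dropped; for a
    misclassified minority sample, its majority neighbours are dropped."""
    drop = set()
    for i in range(len(X)):
        others = [j for j in knn_indices(X, i, k) if y[j] != y[i]]
        if 2 * len(others) > k:
            if y[i] == majority:
                drop.add(i)
            else:
                drop.update(j for j in others if y[j] == majority)
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]

def smote(X_min, n_new, k=3, rng=None):
    """SMOTE-style over-sampling: each synthetic point lies on the segment
    between a random minority sample and one of its k minority neighbours."""
    rng = rng or random.Random(0)
    if len(X_min) < 2:
        return []  # interpolation needs at least two minority samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(X_min))
        j = rng.choice(knn_indices(X_min, i, min(k, len(X_min) - 1)))
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(X_min[i], X_min[j])))
    return synthetic

def ncl_then_smote(X, y, minority, majority, k=3, seed=0):
    """Clean the majority class, then over-sample the minority class
    until both classes have the same number of samples."""
    Xc, yc = clean_majority(X, y, majority, k)
    X_min = [x for x, label in zip(Xc, yc) if label == minority]
    n_new = max(yc.count(majority) - len(X_min), 0)
    new = smote(X_min, n_new, k, random.Random(seed))
    return Xc + new, yc + [minority] * len(new)

# Hypothetical toy data: 40 majority points (label 0), 8 minority points (label 1).
rng = random.Random(42)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
X += [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(8)]
y = [0] * 40 + [1] * 8

X_bal, y_bal = ncl_then_smote(X, y, minority=1, majority=0)
print("before:", y.count(0), y.count(1))
print("after: ", y_bal.count(0), y_bal.count(1))
```

The resampled `X_bal` and `y_bal` would then be passed to a classifier; the abstract names Naive Bayes, SMO, and KNN, which are not reproduced in this sketch.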
ACM Digital Library