Abstract
Selective preprocessing, a data-level approach to the imbalanced data problem, is one of the most successful methods. This paper introduces a novel algorithm that combines this kind of technique with a filtering phase. Information granules are formed to distinguish specific types of positive examples that require dedicated treatment. Three modes of oversampling are available, each dedicated to minority class instances located in specific areas of the feature space. Rough set theory is applied to filter the generated positive samples and remove inconsistencies. The experimental study shows that, in most cases, the proposed method yields better or similar performance of standard classifiers, such as the C4.5 decision tree, compared with other techniques. Additionally, multiple values of the algorithm's parameters are evaluated. The experiments show that two of the examined parameter values are the most appropriate across various applications. However, automatic parameter tuning, based on the specific requirements of different data distributions, is recommended.
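The general idea of treating minority examples differently depending on where they lie in the feature space can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify the granule definitions, the three oversampling modes, or the rough set filter, so everything below is an assumption made for illustration. The tags "safe", "borderline", and "noise", the neighbourhood size `k`, the thresholds, and the function names are hypothetical, and the rough set filtering step is omitted entirely.

```python
import math
import random


def k_nearest(idx, points, k):
    """Indices of the k points closest to points[idx] (Euclidean distance)."""
    others = [j for j in range(len(points)) if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]


def label_minority(points, labels, k=5, minority=1):
    """Tag each minority example by the share of majority neighbours.

    The tags and thresholds are illustrative assumptions,
    not values taken from the paper.
    """
    tags = {}
    for i, y in enumerate(labels):
        if y != minority:
            continue
        neigh = k_nearest(i, points, k)
        majority = sum(1 for j in neigh if labels[j] != minority)
        if majority == len(neigh):
            tags[i] = "noise"        # surrounded only by majority examples
        elif majority >= len(neigh) / 2:
            tags[i] = "borderline"   # mixed neighbourhood
        else:
            tags[i] = "safe"         # mostly minority neighbours
    return tags


def interpolate(a, b, rng):
    """SMOTE-style synthetic point on the segment between a and b."""
    gap = rng.random()
    return tuple(x + gap * (y - x) for x, y in zip(a, b))


def oversample(points, tags, n_new, seed=0):
    """Create n_new synthetic minority points, skipping examples tagged as noise."""
    rng = random.Random(seed)
    seeds = [i for i, t in tags.items() if t != "noise"]
    synthetic = []
    while len(seeds) > 1 and len(synthetic) < n_new:
        i = rng.choice(seeds)
        j = rng.choice([s for s in seeds if s != i])
        synthetic.append(interpolate(points[i], points[j], rng))
    return synthetic
```

In this sketch a single interpolation rule is used for all non-noise examples. A method with several oversampling modes would instead choose a different generation rule per tag, and a filtering phase would then discard generated samples that are inconsistent with the original data.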
Acknowledgements
This research was supported by the grant S/WI/1/2018 of the Polish Ministry of Science and Higher Education.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Borowska, K., Stepaniuk, J. (2018). Granular Computing and Parameters Tuning in Imbalanced Data Preprocessing. In: Saeed, K., Homenda, W. (eds) Computer Information Systems and Industrial Management. CISIM 2018. Lecture Notes in Computer Science, vol 11127. Springer, Cham. https://doi.org/10.1007/978-3-319-99954-8_20
DOI: https://doi.org/10.1007/978-3-319-99954-8_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99953-1
Online ISBN: 978-3-319-99954-8
eBook Packages: Computer Science, Computer Science (R0)