Abstract
Extensions of under-sampling bagging ensemble classifiers for class-imbalanced data are considered. We propose a two-phase approach, called Actively Balanced Bagging, which aims to improve recognition of the minority and majority classes compared to previously proposed extensions of bagging. Its key idea is to further improve an under-sampling bagging classifier (learned in the first phase) by updating, in the second phase, the bootstrap samples with a limited number of examples selected according to an active learning strategy. The results of an experimental evaluation of Actively Balanced Bagging show that this approach improves the predictions of two different baseline variants of under-sampling bagging. Further experiments demonstrate the differentiated influence of four active selection strategies on the final results and the role of tuning the main parameters of the ensemble.
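The two-phase idea described above can be sketched in code. The following is a minimal illustrative sketch, not the chapter's actual implementation: the decision-tree base learner, the single margin-based selection rule, and all names and parameter values (n_estimators, rounds, batch, etc.) are assumptions introduced for this example, whereas the chapter itself evaluates two under-sampling bagging baselines and four active selection strategies.

```python
# Minimal illustrative sketch of a two-phase "under-sampling bagging + active
# update" ensemble, loosely following the abstract. The base learner, the
# margin-based selection rule, and all parameter names are assumptions for
# this example; they are not taken from the chapter.
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def undersampling_bagging(X, y, n_estimators=30, seed=0):
    """Phase 1: every bootstrap keeps all minority examples plus an equally
    sized sample (drawn with replacement) from the majority class."""
    rng = np.random.RandomState(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    min_idx = np.where(y == minority)[0]
    maj_idx = np.where(y != minority)[0]
    members, samples = [], []
    for _ in range(n_estimators):
        boot = np.concatenate(
            [min_idx, rng.choice(maj_idx, size=len(min_idx), replace=True)])
        members.append(DecisionTreeClassifier(random_state=seed).fit(X[boot], y[boot]))
        samples.append(boot)
    return members, samples


def ensemble_margin(members, X, positive):
    """Difference between the vote fractions of the two classes (binary case);
    a small margin means the ensemble members disagree strongly."""
    votes = np.array([clf.predict(X) for clf in members])
    p_pos = np.mean(votes == positive, axis=0)
    return np.abs(2.0 * p_pos - 1.0)


def actively_balanced_bagging(X, y, n_estimators=30, rounds=5, batch=10, seed=0):
    """Phase 2: in each round, add a limited batch of the most 'controversial'
    training examples (lowest ensemble margin) to the bootstrap samples and
    retrain the component classifiers."""
    members, samples = undersampling_bagging(X, y, n_estimators, seed)
    positive = np.unique(y)[0]
    for _ in range(rounds):
        picked = np.argsort(ensemble_margin(members, X, positive))[:batch]
        for i in range(n_estimators):
            samples[i] = np.concatenate([samples[i], picked])
            members[i] = DecisionTreeClassifier(random_state=seed).fit(
                X[samples[i]], y[samples[i]])
    return members


def predict(members, X):
    """Plain majority vote of the ensemble (assumes integer labels, e.g. 0/1)."""
    votes = np.array([clf.predict(X) for clf in members]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

In the actual approach the selection strategy and the size of the active update are tunable parameters of the ensemble; the sketch above adds the same batch to every bootstrap sample purely for brevity.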
Notes
- 1.
- 2.
- 3.
We are grateful to Prof. W. Michalowski and the MET Research Group from the University of Ottawa for providing us access to the scrotal-pain data set.
- 4.
Eibe Frank, Mark A. Hall, and Ian H. Witten: The WEKA Workbench. Online Appendix for "Data Mining: Practical Machine Learning Tools and Techniques", 4th edn. Morgan Kaufmann (2016).
Acknowledgements
The research was supported by the Poznań University of Technology Statutory Funds.
Copyright information
© 2018 Springer International Publishing AG
About this chapter
Cite this chapter
Błaszczyński, J., Stefanowski, J. (2018). Improving Bagging Ensembles for Class Imbalanced Data by Active Learning. In: Stańczyk, U., Zielosko, B., Jain, L. (eds) Advances in Feature Selection for Data and Pattern Recognition. Intelligent Systems Reference Library, vol 138. Springer, Cham. https://doi.org/10.1007/978-3-319-67588-6_3
DOI: https://doi.org/10.1007/978-3-319-67588-6_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67587-9
Online ISBN: 978-3-319-67588-6
eBook Packages: Engineering, Engineering (R0)