survey

A Systematic Review on Imbalanced Data Challenges in Machine Learning: Applications and Solutions

Authors:

Harsurinder Kaur,

Husanbir Singh Pannu,

Avleen Kaur Malhi

ACM Computing Surveys (CSUR), Volume 52, Issue 4

Article No.: 79, Pages 1 - 36

https://doi.org/10.1145/3343440

Published: 30 August 2019

Abstract

In machine learning, data imbalance poses challenges for data analytics in almost all areas of real-world research. Raw primary data often suffers from a skewed distribution in which one class dominates the other, as in computer vision, information security, marketing, and medical science. The goal of this article is to present a comparative analysis of contemporary techniques for imbalanced data analysis, organized into data pre-processing, algorithmic, and hybrid paradigms, and to compare them across different data distributions and application areas.

References

[1]

Ahmed Abbasi and Hsinchun Chen. 2009. A comparison of fraud cues and classification methods for fake escrow website detection. Info. Technol. Manage. 10, 2--3 (2009), 83--101.

Digital Library

[2]

Chirath Abeysinghe, Jianguo Li, and Jing He. 2016. A classifier hub for imbalanced financial data. In Proceedings of the Australasian Database Conference. Springer, 476--479.

[3]

Mohamed Abouelenien, Xiaohui Yuan, Balathasan Giritharan, Jianguo Liu, and Shoujiang Tang. 2013. Cluster-based sampling and ensemble for bleeding detection in capsule endoscopy videos. Amer. J. Sci. Eng. 2, 1 (2013), 24--32.

[4]

Hamzah Al Najada and Xingquan Zhu. 2014. iSRD: Spam review detection with imbalanced data distributions. In Proceedings of the IEEE 15th International Conference on Information Reuse and Integration (IRI’14). IEEE, 553--560.

[5]

Safdar Ali, Abdul Majid, Syed Gibran Javed, and Mohsin Sattar. 2016. Can-CSC-GBE: Developing cost-sensitive classifier with gentleboost ensemble for breast cancer classification using protein amino acids and imbalanced data. Comput. Biol. Med. 73 (2016), 38--46.

Digital Library

[6]

Nafees Anwar, Geoff Jones, and Siva Ganesh. 2014. Measurement of data complexity for classification problems with unbalanced data. Stat. Anal.ysis and Data Min.: ASA Data Sci. J. 7, 3 (2014), 194--211.

Digital Library

[7]

Ömer Faruk Arar and Kürşat Ayan. 2015. Software defect prediction using cost-sensitive neural network. Appl. Soft Comput. 33 (2015), 263--277.

Digital Library

[8]

Malgorzata Bach, Aleksandra Werner, J. Żywiec, and W. Pluskiewicz. 2017. The study of under-and over-sampling methods’ utility in analysis of highly imbalanced data on osteoporosis. Info. Sci. 384 (2017), 174--190.

[9]

Sukarna Barua, Md Monirul Islam, Xin Yao, and Kazuyuki Murase. 2014. MWMOTE--majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans. Knowl. Data Eng. 26, 2 (2014), 405--425.

Digital Library

[10]

Rukshan Batuwita and Vasile Palade. 2009. microPred: Effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics 25, 8 (2009), 989--995.

Digital Library

[11]

Oscar Beijbom, Mohammad Saberian, David Kriegman, and Nuno Vasconcelos. 2014. Guess-averse loss functions for cost-sensitive multiclass boosting. In Proceedings of the International Conference on Machine Learning. 586--594.

Digital Library

[12]

Mohamed Bekkar and Taklit Akrouf Alitouche. 2013. Imbalanced data learning approaches review. Int. J. Data Min. Knowl. Manage. Process 3, 4 (2013), 15.

[13]

Sanket M. Bhandari and Krunal Patel. 2015. A review on using clustering and classification techniques to predict student failure with high dimensional and imbalanced data.

[14]

Siddhartha Bhattacharyya, Sanjeev Jha, Kurian Tharakunnel, and J. Christopher Westland. 2011. Data mining for credit card fraud: A comparative study. Decis. Supp. Syst. 50, 3 (2011), 602--613.

Digital Library

[15]

Steffen Bickel, Michael Brückner, and Tobias Scheffer. 2009. Discriminative learning under covariate shift. J. Mach. Learn. Res. 10 (Sep.2009), 2137--2155.

Digital Library

[16]

Verónica Bolón-Canedo, Noelia Sánchez-Maroño, and Amparo Alonso-Betanzos. 2015. Distributed feature selection: An application to microarray data classification. Appl. Soft Comput. 30 (2015), 136--150.

Digital Library

[17]

Jonathan Burez and Dirk Van den Poel. 2009. Handling class imbalance in customer churn prediction. Expert Syst. Appl. 36, 3 (2009), 4626--4636.

Digital Library

[18]

Lu Cao and Yikui Zhai. 2015. Imbalanced data classification based on a hybrid resampling SVM method. In Proceedings of the Ubiquitous Intelligence and Computing and IEEE 12th International Conference on Autonomic and Trusted Computing and IEEE 15th International Conference on Scalable Computing and Communications and Its Associated Workshops (UIC-ATC-ScalCom’15). IEEE, 1533--1536.

[19]

Peng Cao, Dazhe Zhao, and Osmar Zaiane. 2013. An optimized cost-sensitive SVM for imbalanced data learning. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 280--292.

[20]

M. Emre Celebi, Hassan A. Kingravi, Bakhtiyar Uddin, Hitoshi Iyatomi, Y. Alp Aslandogan, William V. Stoecker, and Randy H. Moss. 2007. A methodological approach to the classification of dermoscopy images. Comput. Med. Imag. Graph. 31, 6 (2007), 362--373.

[21]

Xiaoyong Chai, Lin Deng, Qiang Yang, and Charles X. Ling. 2004. Test-cost sensitive naive bayes classification. In Proceedings of the 4th IEEE International Conference on Data Mining (ICDM’04). IEEE, 51--58.

Digital Library

[22]

Edward Y. Chang, Beitao Li, Gang Wu, and Kingshy Goh. 2003. Statistical learning for effective visual information retrieval. In Proceedings of the International Conference on Image Processing (ICIP’03), vol. 3. IEEE, III--609.

[23]

Francisco Charte, Antonio J. Rivera, María J. del Jesus, and Francisco Herrera. 2015. Addressing imbalance in multilabel classification: Measures and random resampling algorithms. Neurocomputing 163 (2015), 3--16.

Digital Library

[24]

Nitesh V. Chawla. 2009. Data mining for imbalanced datasets: An overview. In Data Mining and Knowledge Discovery Handbook. Springer, 875--886.

[25]

Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic minority over-sampling technique. J. Artific. Intell. Res. 16 (2002), 321--357.

Digital Library

[26]

David A. Cieslak and Nitesh V. Chawla. 2009. A framework for monitoring classifiers’ performance: When and why failure occurs? Knowl. Info. Syst. 18, 1 (2009), 83--108.

Digital Library

[27]

Michael Crawford, Taghi M. Khoshgoftaar, Joseph D. Prusa, Aaron N. Richter, and Hamzah Al Najada. 2015. Survey of review spam detection using machine-learning techniques. J. Big Data 2, 1 (2015), 23.

[28]

Dong Dai and Shaowen Hua. 2016. Random under-sampling ensemble methods for highly imbalanced rare disease classification. In Proceedings of the International Conference on Data Mining (DMIN’16). The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp), 54.

[29]

Jesse Davis and Mark Goadrich. 2006. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning. ACM, 233--240.

Digital Library

[30]

Sauptik Dhar and Vladimir Cherkassky. 2015. Development and evaluation of cost-sensitive universum-SVM. IEEE Trans. Cybernet. 45, 4 (2015), 806--818.

[31]

Jie Du and C. M. Vong. 2018. Online multi-label learning under dynamic changes in data distribution with labels. Accepted and in Press IEEE Trans. Cybernet. (2018).

[32]

Jie Du, Chi-Man Vong, Chi-Man Pun, Pak-Kin Wong, and Weng-Fai Ip. 2017. Post-boosting of classification boundary for imbalanced data using geometric mean. Neural Netw. 96 (2017), 101--114.

Digital Library

[33]

Shihong Du, Fangli Zhang, and Xiuyuan Zhang. 2015. Semantic classification of urban buildings combining VHR image and GIS data: An improved random forest approach. ISPRS J. Photogram. Remote Sens. 105 (2015), 107--119.

[34]

Sumeet Dua and Xian Du. 2016. Data Mining and Machine Learning in Cybersecurity. CRC Press.

Digital Library

[35]

Ekrem Duman, Yeliz Ekinci, and Aydın Tanrıverdi. 2012. Comparing alternative classifiers for database marketing: The case of imbalanced datasets. Expert Syst. Appl. 39, 1 (2012), 48--53.

Digital Library

[36]

Reda M. Elbasiony, Elsayed A. Sallam, Tarek E. Eltobely, and Mahmoud M. Fahmy. 2013. A hybrid network intrusion detection framework based on random forests and weighted k-means. Ain Shams Eng. J. 4, 4 (2013), 753--762.

[37]

T. Elhassan, M. Aljurf, F. Al-Mohanna, and M. Shoukri. 2016. Classification of imbalance data using tomek link (T-link) combined with random under-sampling (RUS) as a data reduction method. J. Info. Data Min. (2016).

[38]

Vegard Engen, Jonathan Vincent, and Keith Phalp. 2008. Enhancing network-based intrusion detection for imbalanced data. Int. J. Knowl.-Based Intell. Eng. Syst. 12, 5--6 (2008), 357--367.

Digital Library

[39]

Tom Fawcett. 2006. An introduction to ROC analysis. Pattern Recogn. Lett. 27, 8 (2006), 861--874.

Digital Library

[40]

Alberto Fernández, Sara del Río, Nitesh V. Chawla, and Francisco Herrera. 2017. An insight into imbalanced big data classification: Outcomes and challenges. Complex Intell. Syst. (2017), 1--16.

[41]

Gianluigi Folino, Francesco Sergio Pisani, and Pietro Sabatino. 2016. An incremental ensemble evolved by using genetic programming to efficiently detect drifts in cyber security datasets. In Proceedings of the Conference on Genetic and Evolutionary Computation Conference Companion. ACM, 1103--1110.

Digital Library

[42]

Kang Fu, Dawei Cheng, Yi Tu, and Liqing Zhang. 2016. Credit card fraud detection using convolutional neural networks. In Proceedings of the International Conference on Neural Information Processing. Springer, 483--490.

[43]

Song Fu, Jianguo Liu, and Husanbir Pannu. 2012. A hybrid anomaly detection framework in cloud computing using one-class and two-class support vector machines. In Proceedings of the International Conference on Advanced Data Mining and Applications. Springer, 726--738.

[44]

Mikel Galar, Alberto Fernandez, Edurne Barrenechea, Humberto Bustince, and Francisco Herrera. 2012. A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst., Man, Cybernet., Part C (Appl. Rev.) 42, 4 (2012), 463--484.

Digital Library

[45]

Mikel Galar, Alberto Fernández, Edurne Barrenechea, and Francisco Herrera. 2013. EUSBoost: Enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling. Pattern Recogn. 46, 12 (2013), 3460--3471.

Digital Library

[46]

Vaishali Ganganwar. 2012. An overview of classification algorithms for imbalanced datasets. Int. J. Emerg. Technol. Adv. Eng. 2, 4 (2012), 42--47.

[47]

Zan Gao, Longfei Zhang, Ming yu Chen, Alexander G. Hauptmann, Hua Zhang 0003, and An-Ni Cai. 2014. Enhanced and hierarchical structure algorithm for data imbalance problem in semantic extraction under massive video dataset. Multimedia Tools Appl. 68, 3 (2014), 641--657.

Digital Library

[48]

Nicolás García-Pedrajas, Juan A. Romero del Castillo, and Gonzalo Cerruela-García. 2017. A proposal for local k values for k-nearest neighbor rule. IEEE Trans. Neural Netw. Learn. Syst. 28, 2 (2017), 470--475.

[49]

Adel Ghazikhani, Reza Monsefi, and Hadi Sadoghi Yazdi. 2013. Ensemble of online neural networks for non-stationary and imbalanced data streams. Neurocomputing 122 (2013), 535--544.

Digital Library

[50]

Adel Ghazikhani, Reza Monsefi, and Hadi Sadoghi Yazdi. 2014. Online neural network model for non-stationary and imbalanced data stream classification. Int. J. Mach. Learn. Cybernet. 5, 1 (2014), 51--62.

[51]

Guo Haixiang, Li Yijing, Jennifer Shang, Gu Mingyun, Huang Yuanyue, and Gong Bing. 2017. Learning from class-imbalanced data: Review of methods and applications. Expert Syst. Appl. 73 (2017), 220--239.

Digital Library

[52]

Ming Hao, Yanli Wang, and Stephen H. Bryant. 2014. An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data. Analyt. Chim. Acta 806 (2014), 117--127.

[53]

Amira Kamil Ibrahim Hassan and Ajith Abraham. 2016. Modeling insurance fraud detection using imbalanced data classification. In Advances in Nature and Biologically Inspired Computing. Springer, 117--127.

[54]

Haibo He, Yang Bai, Edwardo A. Garcia, and Shutao Li. 2008. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN’08) (IEEE World Congress on Computational Intelligence). IEEE, 1322--1328.

[55]

Haibo He and Edwardo A. Garcia. 2009. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21, 9 (2009), 1263--1284.

Digital Library

[56]

Nic Herndon and Doina Caragea. 2016. A study of domain adaptation classifiers derived from logistic regression for the task of splice site prediction. IEEE Trans. Nanobiosci. 15, 2 (2016), 75--83.

[57]

Victoria Hodge and Jim Austin. 2004. A survey of outlier detection methodologies. Artificial Intell. Rev. 22, 2 (2004), 85--126.

Digital Library

[58]

Yi-Min Huang and Shu-Xin Du. 2005. Weighted support vector machine for classification with uneven training class sizes. In Proceedings of 2005 International Conference on Machine Learning and Cybernetics, Vol. 7. IEEE, 4365--4369.

[59]

Jae Pil Hwang, Seongkeun Park, and Euntai Kim. 2011. A new weighted approach to imbalanced data classification problem via support vector machine with quadratic cost function. Expert Syst. Appl. 38, 7 (2011), 8580--8585.

Digital Library

[60]

Ilnaz Jamali, Mohammad Bazmara, and Shahram Jafari. 2012. Feature selection in imbalance data sets. Int. J. Comput. Sci. Iss. 9, 3 (2012), 42--45.

[61]

Piyasak Jeatrakul, Kok Wai Wong, and Chun Che Fung. 2010. Classification of imbalanced data by combining the complementary neural network and SMOTE algorithm. In Proceedings of the International Conference on Neural Information Processing. Springer, 152--159.

Digital Library

[62]

Qi Kang, XiaoShuang Chen, SiSi Li, and MengChu Zhou. 2016. A noise-filtered under-sampling scheme for imbalanced classification. IEEE Trans. Cybernet. (2016).

[63]

Masanori Kasai and Yusuke Oike. 2010. Image pickup apparatus, image processing method, and computer program capable of obtaining high-quality image data by controlling imbalance among sensitivities of light-receiving devices. U.S. Patent 7,839,437.

[64]

Madian Khabsa, Ahmed Elmagarmid, Ihab Ilyas, Hossam Hammady, and Mourad Ouzzani. 2016. Learning to identify relevant studies for systematic reviews using random forest and external information. Mach. Learn. 102, 3 (2016), 465--482.

Digital Library

[65]

Salman H. Khan, Munawar Hayat, Mohammed Bennamoun, Ferdous A. Sohel, and Roberto Togneri. 2017. Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Trans. Neural Netw. Learn. Syst. (2017).

[66]

Taghi M. Khoshgoftaar, Kehan Gao, Amri Napolitano, and Randall Wald. 2014. A comparative study of iterative and non-iterative feature selection techniques for software defect prediction. Info. Syst. Front. 16, 5 (2014), 801--822.

Digital Library

[67]

Taghi M. Khoshgoftaar, Jason Van Hulse, and Amri Napolitano. 2010. Supervised neural network modeling: An empirical investigation into learning from imbalanced data with labeling errors. IEEE Trans. Neural Netw. 21, 5 (2010), 813--830.

Digital Library

[68]

Gitae Kim, Bongsug Kevin Chae, and David L. Olson. 2013. A support vector machine (SVM) approach to imbalanced datasets of customer responses: Comparison with other customer response models. Service Bus. 7, 1 (2013), 167--182.

[69]

Sotiris Kotsiantis, Dimitris Kanellopoulos, Panayiotis Pintelas, et al. 2006. Handling imbalanced datasets: A review. GESTS Int. Trans. Comput. Sci. Eng. 30, 1 (2006), 25--36.

[70]

Bartosz Krawczyk. 2016. Learning from imbalanced data: Open challenges and future directions. Progr. Artific. Intell. 5, 4 (2016), 221--232.

[71]

Bartosz Krawczyk, Mikel Galar, Łukasz Jeleń, and Francisco Herrera. 2016. Evolutionary undersampling boosting for imbalanced classification of breast cancer malignancy. Appl. Soft Comput. 38 (2016), 714--726.

Digital Library

[72]

Bartosz Krawczyk, Michał Woźniak, and Gerald Schaefer. 2014. Cost-sensitive decision tree ensembles for effective imbalanced classification. Appl. Soft Comput. 14 (2014), 554--562.

Digital Library

[73]

Miroslav Kubat, Stan Matwin, et al. 1997. Addressing the curse of imbalanced training sets: One-sided selection. In Proceedings of the International Conference on machine Learning (ICML’97), Vol. 97. 179--186.

[74]

Pallavi Kulkarni and Roshani Ade. 2016. Logistic regression learning model for handling concept drift with unbalanced data in credit card fraud detection system. In Proceedings of the 2nd International Conference on Computer and Communication Technologies. Springer, 681--689.

[75]

Taehyung Lee, Ki Bum Lee, and Chang Ouk Kim. 2016. Performance of machine-learning algorithms for class-imbalanced process fault detection problems. IEEE Trans. Semicond. Manufact. 29, 4 (2016), 436--445.

[76]

Boaz Lerner, Josepha Yeshaya, and Lev Koushnir. 2007. On the classification of a small imbalanced cytogenetic image database. IEEE/ACM Trans. Comput. Biol. Bioinform. 4, 2 (2007).

Digital Library

[77]

Miao Liu, Mingjun Wang, Jun Wang, and Duo Li. 2013. Comparison of random forest, support vector machine and back propagation neural network for electronic tongue data classification: Application to the recognition of orange beverage and Chinese vinegar. Sens. Actuat. B: Chem. 177 (2013), 970--980.

[78]

Ying Liu, Han Tong Loh, and Aixin Sun. 2009. Imbalanced text classification: A term weighting approach. Expert Syst. Appl. 36, 1 (2009), 690--701.

Digital Library

[79]

Zhen Liu, Ruoyu Wang, Ming Tao, and Xianfa Cai. 2015. A class-oriented feature selection approach for multi-class imbalanced network traffic datasets based on local and global metrics fusion. Neurocomputing 168 (2015), 365--381.

Digital Library

[80]

Rushi Longadge and Snehalata Dongre. 2013. Class imbalance problem in data mining review. arXiv preprint arXiv:1305.1707 (2013).

[81]

Victoria López, Sara del Río, José Manuel Benítez, and Francisco Herrera. 2015. Cost-sensitive linguistic fuzzy rule-based classification systems under the MapReduce framework for imbalanced big data. Fuzzy Sets Syst. 258 (2015), 5--38.

Digital Library

[82]

Yang Lu, Yiu-ming Cheung, and Yuan Yan Tang. 2016. Hybrid sampling with bagging for class imbalance learning. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 14--26.

[83]

Abdul Majid, Safdar Ali, Mubashar Iqbal, and Nabeela Kausar. 2014. Prediction of human breast and colon cancers from imbalanced data using nearest neighbor and support vector machines. Comput. Methods Programs Biomed. 113, 3 (2014), 792--808.

Digital Library

[84]

Sebastián Maldonado, Richard Weber, and Fazel Famili. 2014. Feature selection for high-dimensional class-imbalanced data sets using support vector machines. Info. Sci. 286 (2014), 228--246.

Digital Library

[85]

Shahla Mardani and Hamid Reza Shahriari. 2013. A new method for occupational fraud detection in process aware information systems. In Proceedings of the 10th International ISC Conference on Information Security and Cryptology (ISCISC’13). IEEE, 1--5.

[86]

Stephen O. Moepya, Sharat S. Akhoury, and Fulufhelo V. Nelwamondo. 2014. Applying cost-sensitive classification for financial fraud detection under high class-imbalance. In Proceedings of the IEEE International Conference on Data Mining Workshop (ICDMW’14). IEEE, 183--192.

[87]

Jose G. Moreno-Torres, Xavier Llorà, David E. Goldberg, and Rohit Bhargava. 2013. Repairing fractures between data using genetic programming-based feature extraction: A case study in cancer diagnosis. Info. Sci. 222 (2013), 805--823.

Digital Library

[88]

Alejandro Moreo, Andrea Esuli, and Fabrizio Sebastiani. 2016. Distributional random oversampling for imbalanced text classification. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 805--808.

Digital Library

[89]

Surya Nepal and Mukaddim Pathan. 2014. Security, Privacy and Trust in Cloud Systems. Springer.

Digital Library

[90]

Hien M. Nguyen, Eric W. Cooper, and Katsuari Kamei. 2012. A comparative study on sampling techniques for handling class imbalance in streaming data. In Proceedings of the Joint 6th International Conference on Soft Computing and Intelligent Systems (SCIS’12), 13th International Symposium on Advanced Intelligent Systems (ISIS’12). IEEE, 1762--1767.

[91]

Ana Palacios, Krzysztof Trawiński, Oscar Cordón, and Luciano Sánchez. 2014. Cost-sensitive learning of fuzzy rules for imbalanced classification problems using FURIA. Int. J. Uncertain. Fuzz. Knowl.-based Syst. 22, 05 (2014), 643--675.

[92]

Jiyan Pan, Quanfu Fan, Sharath Pankanti, Hoang Trinh, Prasad Gabbur, and Sachiko Miyazawa. 2011. Soft margin keyframe comparison: Enhancing precision of fraud detection in retail surveillance. In Proceedings of the IEEE Workshop on Applications of Computer Vision (WACV’11). IEEE, 549--556.

Digital Library

[93]

Husanbir Singh Pannu and Harsurinder Kaur. 2017. Anomaly detection survey for information security. In Proceedings of the 10th International Conference on Security of Information and Networks. ACM, 251--258.

Digital Library

[94]

Husanbir S. Pannu, Jianguo Liu, Qiang Guan, and Song Fu. 2012. AFD: Adaptive failure detection system for cloud computing infrastructures. In Proceedings of the IEEE 31st International Performance Computing and Communications Conference (IPCCC’12). IEEE, 71--80.

[95]

Yubin Park and Joydeep Ghosh. 2014. Ensembles of alpha-trees for imbalanced classification problems. IEEE Trans. Knowl. Data Eng. 26, 1 (2014), 131--143.

Digital Library

[96]

Harshita Patel and G. S. Thakur. 2016. A hybrid weighted nearest neighbor approach to mine imbalanced data. In Proceedings of the International Conference on Data Mining (DMIN’16). The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp), 106.

[97]

Naser Peiravian and Xingquan Zhu. 2013. Machine learning for Android malware detection using permission and API calls. In Proceedings of the IEEE 25th International Conference on Tools with Artificial Intelligence (ICTAI’13). IEEE, 300--305.

Digital Library

[98]

Lizhi Peng, Bo Yang, Yuehui Chen, and Xiaoqing Zhou. 2016. An under-sampling imbalanced learning of data gravitation-based classification. In Proceedings of the 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD’16). IEEE, 419--425.

[99]

Yun Qian, Yanchun Liang, Mu Li, Guoxiang Feng, and Xiaohu Shi. 2014. A resampling ensemble algorithm for classification of imbalance problems. Neurocomputing 143 (2014), 57--67.

Digital Library

[100]

Chen Qiu, Liangxiao Jiang, and Chaoqun Li. 2017. Randomly selected decision tree for test-cost sensitive learning. Appl. Soft Comput. 53 (2017), 27--33.

Digital Library

[101]

D. Ramyachitra and P. Manikandan. 2014. Imbalanced dataset classification and solutions: A review. Int. J. Comput.ing and Bus. Res. 5, 4 (2014).

[102]

K. Usha Rani, G. Naga Ramadevi, and D. Lavanya. 2016. Performance of synthetic minority oversampling technique on imbalanced breast cancer data. In Proceedings of the 3rd International Conference on Computing for Sustainable Global Development (INDIACom’16). IEEE, 1623--1627.

[103]

Bhavani Raskutti and Adam Kowalczyk. 2004. Extreme re-balancing for SVMs: A case study. ACM SIGKDD Explor. Newslett. 6, 1 (2004), 60--69.

Digital Library

[104]

Alice M. Richardson and Brett A. Lidbury. 2017. Enhancement of hepatitis virus immunoassay outcome predictions in imbalanced routine pathology data by data balancing and feature selection before the application of support vector machines. BMC Med. Info. Decis. Mak. 17, 1 (2017), 121.

[105]

Yvan Saeys, Iñaki Inza, and Pedro Larrañaga. 2007. A review of feature selection techniques in bioinformatics. Bioinformatics 23, 19 (2007), 2507--2517.

Digital Library

[106]

Mahendra Sahare and Hitesh Gupta. 2012. A review of multi-class classification for imbalanced data. Int. J. Adv. Comput. Res. 2, 3 (2012), 160--164.

[107]

Yusuf Sahin, Serol Bulkan, and Ekrem Duman. 2013. A cost-sensitive decision tree approach for fraud detection. Expert Syst. Appl. 40, 15 (2013), 5916--5923.

Digital Library

[108]

Claude Sammut and Geoffrey I. Webb. 2011. Encyclopedia of Machine Learning. Springer Science 8 Business Media.

Digital Library

[109]

José Antonio Sanz, Dario Bernardo, Francisco Herrera, Humberto Bustince, and Hani Hagras. 2015. A compact evolutionary interval-valued fuzzy rule-based classification system for the modeling and prediction of real-world financial applications with imbalanced data. IEEE Trans. Fuzzy Syst. 23, 4 (2015), 973--990.

Digital Library

[110]

Abeed Sarker and Graciela Gonzalez. 2015. Portable automatic text classification for adverse drug reaction detection via multi-corpus training. J. Biomed. Info. 53 (2015), 196--207.

Digital Library

[111]

Asaf Shabtai, Robert Moskovitch, Clint Feher, Shlomi Dolev, and Yuval Elovici. 2012. Detecting unknown malicious code by applying classification techniques on opcode patterns. Secur. Info. 1, 1 (2012), 1.

[112]

Yuan-Hai Shao, Wei-Jie Chen, Jing-Jing Zhang, Zhen Wang, and Nai-Yang Deng. 2014. An efficient weighted Lagrangian twin support vector machine for imbalanced data classification. Pattern Recogn. 47, 9 (2014), 3158--3167.

[113]

Mei-Ling Shyu, Zongxing Xie, Min Chen, and Shu-Ching Chen. 2008. Video semantic event/concept detection using a subspace-based multimedia data mining framework. IEEE Trans. Multimedia 10, 2 (2008), 252--259.

Digital Library

[114]

Arpit Singh and Anuradha Purohit. 2015. A survey on methods for solving data imbalance problem for classification. Work 127, 15 (2015).

[115]

Li Song, Dapeng Li, Xiangxiang Zeng, Yunfeng Wu, Li Guo, and Quan Zou. 2014. nDNA-prot: Identification of DNA-binding proteins based on unbalanced classification. BMC Bioinformat.ics 15, 1 (2014), 298.

[116]

Qun Song, Jun Zhang, and Qian Chi. 2010. Assistant detection of skewed data streams classification in cloud security. In Proceedings of the IEEE International Conference on Intelligent Computing and Intelligent Systems (ICIS’10), Vol. 1. IEEE, 60--64.

[117]

Robert A. Sowah, Moses A. Agebure, Godfrey A. Mills, Koudjo M. Koumadi, and Seth Y. Fiawoo. 2016. New cluster undersampling technique for class imbalance learning. Int. J. Mach. Learn. Comput. 6, 3 (2016), 205.

[118]

Yanmin Sun, Andrew K. C. Wong, and Mohamed S. Kamel. 2009. Classification of imbalanced data: A review. Int. J. Pattern Recogn. Artific. Intell. 23, 04 (2009), 687--719.

[119]

Zhongbin Sun, Qinbao Song, Xiaoyan Zhu, Heli Sun, Baowen Xu, and Yuming Zhou. 2015. A novel ensemble method for classifying imbalanced data. Pattern Recogn. 48, 5 (2015), 1623--1637.

Digital Library

[120]

Mayank Taneja, Kavyanshi Garg, Archana Purwar, and Samarth Sharma. 2015. Prediction of click frauds in mobile advertising. In Proceedings of the 8th International Conference on Contemporary Computing (IC3’15). IEEE, 162--166.

Digital Library

[121]

Bo Tang, Haibo He, Paul M. Baggenstoss, and Steven Kay. 2016. A Bayesian classification approach using class-specific features for text categorization. IEEE Trans. Knowl. Data Eng. 28, 6 (2016), 1602--1606.

Digital Library

[122]

Yuchun Tang, Yan-Qing Zhang, Nitesh V. Chawla, and Sven Krasser. 2009. SVMs modeling for highly imbalanced classification. IEEE Trans. Syst., Man, Cybernet., Part B (Cybernet.) 39, 1 (2009), 281--288.

Digital Library

[123]

Ciza Thomas. 2013. Improving intrusion detection for imbalanced network traffic. Secur. Commun. Netw. 6, 3 (2013), 309--324.

[124]

Jason Van Hulse, Taghi M. Khoshgoftaar, Amri Napolitano, and Randall Wald. 2009. Feature selection with high-dimensional imbalanced data. In Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW’09). IEEE, 507--514.

Digital Library

[125]

Nguyen Ha Vo and Yonggwan Won. 2007. Classification of unbalanced medical data with weighted regularized least squares. In Proceedings of the Conference on Frontiers in the Convergence of Bioscience and Information Technologies (FBIT’07). IEEE, 347--352.

Digital Library

[126]

Chi-Man Vong, Jie Du, Chi-Man Wong, and Jiu-Wen Cao. 2018. Postboosting using extended G-Mean for online sequential multiclass imbalance learning. IEEE Trans. Neural Netw. Learn. Syst. (2018).

[127]

Shixiang Wan, Yucong Duan, and Quan Zou. 2017. HPSLPred: An ensemble multi-label classifier for human protein subcellular location prediction with imbalanced source. Proteomics 17, 17--18 (2017), 1700262.

[128]

C. Wang, L. Hu, M. Guo, X. Liu, and Q. Zou. 2015. imDC: An ensemble learning method for imbalanced classification with miRNA data. Genet. Mol. Res. 14, 1 (2015), 123--133.

[129]

Qiang Wang. 2014. A hybrid sampling SVM approach to imbalanced data classification. In Abstract and Applied Analysis, Vol. 2014. Hindawi Publishing Corporation.

[130]

Suge Wang, Deyu Li, Lidong Zhao, and Jiahao Zhang. 2013. Sample cutting method for imbalanced text sentiment classification based on BRC. Knowl.-Based Syst. 37 (2013), 451--461.

Digital Library

[131]

Shuo Wang and Xin Yao. 2013. Using class imbalance learning for software defect prediction. IEEE Trans. Reliabil. 62, 2 (2013), 434--443.

[132]

Wei Wei, Jinjiu Li, Longbing Cao, Yuming Ou, and Jiahang Chen. 2013a. Effective detection of sophisticated online banking fraud on extremely imbalanced data. World Wide Web 16, 4 (2013), 449--475.

Digital Library

[133]

Wei Wei, Jinjiu Li, Longbing Cao, Yuming Ou, and Jiahang Chen. 2013b. Effective detection of sophisticated online banking fraud on extremely imbalanced data. World Wide Web 16, 4 (2013), 449--475.

Digital Library

[134]

Qingyao Wu, Yunming Ye, Haijun Zhang, Michael K. Ng, and Shen-Shyang Ho. 2014. ForesTexter: An efficient random forest algorithm for imbalanced text categorization. Knowl.-Based Syst. 67 (2014), 105--116.

Digital Library

[135]

Yufei Xia, Chuanzhe Liu, and Nana Liu. 2017. Cost-sensitive boosted tree for loan evaluation in peer-to-peer lending. Electron. Comm. Res. Appl. 24 (2017), 30--49.

Digital Library

[136]

Jieming Yang, Zhaoyang Qu, and Zhiying Liu. 2014. Improved feature-selection method considering the imbalance problem in text categorization. Sci. World J. 2014 (2014).

[137]

Junshan Yang, Jiarui Zhou, Zexuan Zhu, Xiaoliang Ma, and Zhen Ji. 2016. Iterative ensemble feature selection for multiclass classification of imbalanced microarray data. J. Biol. Res. Thessaloniki 23, 1 (2016), 13.

[138]

Bee Wah Yap, Khatijahhusna Abd Rani, Hezlin Aryani Abd Rahman, Simon Fong, Zuraida Khairudin, and Nik Nik Abdullah. 2014. An application of oversampling, undersampling, bagging and boosting in handling imbalanced datasets. In Proceedings of the 1st International Conference on Advanced Data and Information Engineering (DaEng’13). Springer, 13--22.

[139]

Hualong Yu and Jun Ni. 2014. An improved ensemble learning method for classifying high-dimensional and imbalanced biomedicine data. IEEE/ACM Trans. Comput. Biol. Bioinformat. 11, 4 (2014), 657--666.

Digital Library

[140]

Hualong Yu, Jun Ni, and Jing Zhao. 2013. ACOSampling: An ant colony optimization-based undersampling method for classifying imbalanced DNA microarray data. Neurocomputing 101 (2013), 309--318.

[141]

Ashkan Zakaryazad and Ekrem Duman. 2016. A profit-driven Artificial Neural Network (ANN) with applications to fraud detection and direct marketing. Neurocomputing 175 (2016), 121--131.

[142]

Jia Zeng, Shanfeng Zhu, and Hong Yan. 2009. Towards accurate human promoter recognition: A review of currently used sequence features and classification methods. Brief. Bioinformat. 10, 5 (2009), 498--508.

[143]

Bin Zhang, Yi Zhou, and Christos Faloutsos. 2008. Toward a comprehensive model in internet auction fraud detection. In Proceedings of the 41st Annual Hawaii International Conference on System Sciences. IEEE, 79--79.

[144]

Dongmei Zhang, Jun Ma, Jing Yi, Xiaofei Niu, and Xiaojing Xu. 2015. An ensemble method for unbalanced sentiment classification. In Proceedings of the 11th International Conference on Natural Computation (ICNC’15). IEEE, 440--445.

[145]

Huaxiang Zhang and Mingfang Li. 2014. RWO-Sampling: A random walk over-sampling approach to imbalanced data classification. Info. Fusion 20 (2014), 99--116.

[146]

Yan-Ping Zhang, Li-Na Zhang, and Yong-Cheng Wang. 2010. Cluster-based majority under-sampling approaches for class imbalance learning. In Proceedings of the 2nd IEEE International Conference on Information and Financial Engineering (ICIFE’10). IEEE, 400--404.

[147]

Zhancheng Zhang, Jun Dong, Xiaoqing Luo, Kup-Sze Choi, and Xiaojun Wu. 2014. Heartbeat classification using disease-specific feature selection. Comput. Biol. Med. 46 (2014), 79--89.

[148]

Xing-Ming Zhao, Xin Li, Luonan Chen, and Kazuyuki Aihara. 2008. Protein classification with imbalanced data. Proteins: Struct. Funct. Bioinformat. 70, 4 (2008), 1125--1132.

[149]

Zhuoyuan Zheng, Yunpeng Cai, and Ye Li. 2016. Oversampling method for imbalanced classification. Comput. Info. 34, 5 (2016), 1017--1037.

[150]

Weicai Zhong, Bijan Raahemi, and Jing Liu. 2013. Classifying peer-to-peer applications using imbalanced concept-adapting very fast decision tree on IP data stream. Peer-to-Peer Netw. Appl. 6, 3 (2013), 233--246.

[151]

Maciej Zięba, Jakub M. Tomczak, Marek Lubicz, and Jerzy Świątek. 2014. Boosted SVM for extracting rules from imbalanced data in application to prediction of the post-operative life expectancy in the lung cancer patients. Appl. Soft Comput. 14 (2014), 99--108.

[152]

Quan Zou, Sifa Xie, Ziyu Lin, Meihong Wu, and Ying Ju. 2016. Finding the best classification threshold in imbalanced classification. Big Data Res. 5 (2016), 2--8.

Index Terms

A Systematic Review on Imbalanced Data Challenges in Machine Learning: Applications and Solutions
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Redundancy
  2. Embedded and cyber-physical systems
    1. Embedded systems
    2. Robotics
2. Networks
  1. Network properties
    1. Network reliability

Published In

ACM Computing Surveys, Volume 52, Issue 4 (July 2020), 769 pages

ISSN: 0360-0300

EISSN: 1557-7341

DOI: 10.1145/3359984

Editor: Sartaj Sahni, Department of Computer and Information Science and Engineering

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 August 2019

Accepted: 01 May 2019

Revised: 01 October 2018

Received: 01 March 2018

Published in CSUR Volume 52, Issue 4

Qualifiers

Survey
Research
Refereed

Bibliometrics

Total Citations: 263
Total Downloads: 7,175
Downloads (last 12 months): 1,146
Downloads (last 6 weeks): 70

Reflects downloads up to 30 Aug 2024.

Cited By

Metin A, Bilgin T. 2024. Automated machine learning for fabric quality prediction: a comparative analysis. PeerJ Computer Science 10 (2024), e2188. Online publication date: 23-Jul-2024. https://doi.org/10.7717/peerj-cs.2188

Fatima G, Khan S, Aadil F, Kim D, Atteia G, Alabdulhafith M. 2024. An autonomous mixed data oversampling method for AIOT-based churn recognition and personalized recommendations using behavioral segmentation. PeerJ Computer Science 9 (2024), e1756. Online publication date: 2-Jan-2024. https://doi.org/10.7717/peerj-cs.1756

Alie M, Negesse Y. 2024. Machine learning prediction of adolescent HIV testing services in Ethiopia. Frontiers in Public Health 12 (2024). Online publication date: 15-Mar-2024. https://doi.org/10.3389/fpubh.2024.1341279

UESUGI T, OKURA N, NAKASHIMA T, OGAWA K, TSUTSUMI S, SAWADA K, NAKAMOTO K. 2024. Surface Inspection of Ductile Cast Iron Pipe for Both Regression and Defective Classification by Deep Learning [original title: 深層学習による回帰と不良品分類を両立するダクタイル鋳鉄管の鋳肌検査]. Journal of the Society of Materials Science, Japan 73, 2 (2024), 157--164. Online publication date: 15-Feb-2024. https://doi.org/10.2472/jsms.73.157

Huo J, Yu Y, Lin W, Hu A, Wu C. 2024. Application of AI in Multilevel Pain Assessment Using Facial Images: Systematic Review and Meta-Analysis. Journal of Medical Internet Research 26 (2024), e51250. Online publication date: 12-Apr-2024. https://doi.org/10.2196/51250

Ben Yehuda O, Itelman E, Vaisman A, Segal G, Lerner B. 2024. Early Detection of Pulmonary Embolism in a General Patient Population Immediately Upon Hospital Admission Using Machine Learning to Identify New, Unidentified Risk Factors: Model Development Study. Journal of Medical Internet Research 26 (2024), e48595. Online publication date: 30-Jul-2024. https://doi.org/10.2196/48595

GÖRMEZ Y, ARSLAN H, IŞIK Y, GÜNDÜZ V. 2024. Developing Novel Deep Learning Models to Detect Insider Threats and Comparing the Models from Different Perspectives [original title: İç Tehditlerin Tespit Edilmesi için Özgün Derin Öğrenme Modellerinin Geliştirilmesi ve Modellerin Farklı Perspektiflerde Karşılaştırılması]. Bilişim Teknolojileri Dergisi 17, 1 (2024), 31--43. Online publication date: 16-Jan-2024. https://doi.org/10.17671/gazibtd.1386734

Liu Y, Wang S, Sui H, Zhu L. 2024. An ensemble learning method with GAN-based sampling and consistency check for anomaly detection of imbalanced data streams with concept drift. PLOS ONE 19, 1 (2024), e0292140. Online publication date: 26-Jan-2024. https://doi.org/10.1371/journal.pone.0292140

Lo Y, Chen Y, Wang P, Chang C, Wei G, Hung W. 2024. Non-invasive glucose extraction by a single polarization rotator system in patients with diabetes. Biomedical Optics Express 15, 8 (2024), 4909. Online publication date: 29-Jul-2024. https://doi.org/10.1364/BOE.529032

Paproki A, Salvado O, Fookes C. 2024. Synthetic Data for Deep Learning in Computer Vision & Medical Imaging: A Means to Reduce Data Bias. ACM Computing Surveys (2024). Online publication date: 9-May-2024. https://doi.org/10.1145/3663759
