survey

A Systematic Review on Imbalanced Data Challenges in Machine Learning: Applications and Solutions

Published: 30 August 2019

Abstract

In machine learning, data imbalance poses challenges for data analytics in almost all areas of real-world research. Raw primary data often has a skewed class distribution, with one class heavily outnumbering the other, as in computer vision, information security, marketing, and medical science. The goal of this article is to present a comparative analysis of contemporary techniques for imbalanced data, organized into data pre-processing, algorithmic, and hybrid paradigms, and to compare them with respect to different data distributions and application areas.

Published In

ACM Computing Surveys  Volume 52, Issue 4
July 2020
769 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/3359984
  • Editor: Sartaj Sahni
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 August 2019
Accepted: 01 May 2019
Revised: 01 October 2018
Received: 01 March 2018
Published in CSUR Volume 52, Issue 4

Author Tags

  1. Data imbalance
  2. data analysis
  3. machine learning
  4. sampling

Qualifiers

  • Survey
  • Research
  • Refereed

Article Metrics

  • Downloads (Last 12 months): 1,146
  • Downloads (Last 6 weeks): 70
Reflects downloads up to 30 Aug 2024

Cited By

  • (2024) Automated machine learning for fabric quality prediction: a comparative analysis. PeerJ Computer Science 10 (e2188). DOI: 10.7717/peerj-cs.2188. Online publication date: 23-Jul-2024
  • (2024) An autonomous mixed data oversampling method for AIOT-based churn recognition and personalized recommendations using behavioral segmentation. PeerJ Computer Science 9 (e1756). DOI: 10.7717/peerj-cs.1756. Online publication date: 2-Jan-2024
  • (2024) Machine learning prediction of adolescent HIV testing services in Ethiopia. Frontiers in Public Health 12. DOI: 10.3389/fpubh.2024.1341279. Online publication date: 15-Mar-2024
  • (2024) Surface Inspection of Ductile Cast Iron Pipe for Both Regression and Defective Classification by Deep Learning [深層学習による回帰と不良品分類を両立するダクタイル鋳鉄管の鋳肌検査]. Journal of the Society of Materials Science, Japan 73:2 (157-164). DOI: 10.2472/jsms.73.157. Online publication date: 15-Feb-2024
  • (2024) Application of AI in in Multilevel Pain Assessment Using Facial Images: Systematic Review and Meta-Analysis. Journal of Medical Internet Research 26 (e51250). DOI: 10.2196/51250. Online publication date: 12-Apr-2024
  • (2024) Early Detection of Pulmonary Embolism in a General Patient Population Immediately Upon Hospital Admission Using Machine Learning to Identify New, Unidentified Risk Factors: Model Development Study. Journal of Medical Internet Research 26 (e48595). DOI: 10.2196/48595. Online publication date: 30-Jul-2024
  • (2024) Developing Novel Deep Learning Models to Detect Insider Threats and Comparing the Models from Different Perspectives [İç Tehditlerin Tespit Edilmesi için Özgün Derin Öğrenme Modellerinin Geliştirilmesi ve Modellerin Farklı Perspektiflerde Karşılaştırılması]. Bilişim Teknolojileri Dergisi 17:1 (31-43). DOI: 10.17671/gazibtd.1386734. Online publication date: 16-Jan-2024
  • (2024) An ensemble learning method with GAN-based sampling and consistency check for anomaly detection of imbalanced data streams with concept drift. PLOS ONE 19:1 (e0292140). DOI: 10.1371/journal.pone.0292140. Online publication date: 26-Jan-2024
  • (2024) Non-invasive glucose extraction by a single polarization rotator system in patients with diabetes. Biomedical Optics Express 15:8 (4909). DOI: 10.1364/BOE.529032. Online publication date: 29-Jul-2024
  • (2024) Synthetic Data for Deep Learning in Computer Vision & Medical Imaging: A Means to Reduce Data Bias. ACM Computing Surveys. DOI: 10.1145/3663759. Online publication date: 9-May-2024
