research-article

A Two-step Information Accumulation Strategy for Learning from Highly Imbalanced Data

Authors:

Shaoping Ma

CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management

Pages 1289 - 1298

https://doi.org/10.1145/3132847.3132940

Published: 06 November 2017

Abstract

Highly imbalanced data is common in the real world, and training an effective classifier on it is important but difficult. Our central claim in this paper is that imbalance is the observed phenomenon, not the cause of the problem: the real challenge is that useful information is overshadowed by the large volume of data in both the majority and minority classes. We propose a novel two-step strategy, Information Accumulation, which first selects the most discriminative data in the Zooming-in phase and then leverages unlabeled data through pseudo active learning and self-training in the Learning from Learned Results phase. Comparative experiments are conducted on large-scale, highly imbalanced real customer service data for a complaint detection task (where less than 2% of the data is positive). Results on eight state-of-the-art classification algorithms show significant performance improvements for all of them with Information Accumulation (for example, the F-Measure score of XGBoost increases by 197%, from 0.115 to 0.347), which demonstrates the effectiveness and general applicability of the proposed strategy. This work explores a new idea for dealing with highly imbalanced data: rather than balancing the training examples as is usual, we focus on finding the most discriminative information in labeled data and in the learning results on unlabeled data.
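
The abstract names self-training and the F-Measure but does not spell out either, so the following is only a minimal sketch of the generic self-training idea, not the authors' Information Accumulation procedure. It assumes a toy one-dimensional nearest-centroid classifier and a confidence threshold of 0.8; the function names (`f_measure`, `fit_centroids`, `predict_with_confidence`, `self_train`) and all data values are hypothetical and chosen purely for illustration.

```python
from statistics import mean


def f_measure(y_true, y_pred, positive=1):
    """F-Measure (F1) for the positive, typically minority, class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def fit_centroids(xs, ys):
    """Toy stand-in classifier (an assumption): one 1-D centroid per class."""
    return {label: mean(x for x, y in zip(xs, ys) if y == label)
            for label in set(ys)}


def predict_with_confidence(centroids, x):
    """Return (label, confidence); confidence is a distance margin in [0, 1]."""
    ranked = sorted((abs(x - c), label) for label, c in centroids.items())
    (d_near, label), (d_far, _) = ranked[0], ranked[1]
    total = d_near + d_far
    confidence = (d_far - d_near) / total if total > 0 else 0.0
    return label, confidence


def self_train(xs, ys, unlabeled, threshold=0.8, max_rounds=5):
    """Repeatedly add confidently pseudo-labeled points to the training set."""
    xs, ys, pool = list(xs), list(ys), list(unlabeled)
    for _ in range(max_rounds):
        centroids = fit_centroids(xs, ys)
        remaining = []
        added = 0
        for x in pool:
            label, confidence = predict_with_confidence(centroids, x)
            if confidence >= threshold:
                # Treat the confident prediction as if it were a true label.
                xs.append(x)
                ys.append(label)
                added += 1
            else:
                remaining.append(x)
        pool = remaining
        if added == 0:
            break  # nothing confident enough is left to learn from
    return fit_centroids(xs, ys), pool


# Hypothetical imbalanced toy data: four negatives, one positive.
labeled_x = [0.0, 0.2, 0.1, 0.3, 5.0]
labeled_y = [0, 0, 0, 0, 1]
unlabeled_x = [0.05, 4.8, 5.2, 2.6]

model, leftover = self_train(labeled_x, labeled_y, unlabeled_x)
```

In this toy run the ambiguous point 2.6 never reaches the confidence threshold and stays unlabeled, while the three clear-cut points are absorbed into the training set. A real pipeline would replace the centroid classifier with one of the evaluated algorithms and score it with `f_measure` on held-out labeled data.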

Cited By

Zhang T, Wu X, Lin M, Han J, Hu S (2019). Imbalanced Sentiment Classification Enhanced with Discourse Marker. Artificial Neural Networks and Machine Learning – ICANN 2019: Text and Time Series, 117-129. Online publication date: 9-Sep-2019.
https://doi.org/10.1007/978-3-030-30490-4_11

Index Terms

A Two-step Information Accumulation Strategy for Learning from Highly Imbalanced Data
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification

Published In

CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management

November 2017

2604 pages

ISBN:9781450349185

DOI:10.1145/3132847

General Chairs:
Ee-Peng Lim (Singapore Management University, Singapore)
Marianne Winslett (University of Illinois at Urbana-Champaign, USA, and Advanced Digital Sciences Center, Singapore)

Program Chairs:
Mark Sanderson (RMIT, Australia)
Ada Fu (Chinese University of Hong Kong, Hong Kong)
Jimeng Sun (Georgia Tech, USA)
Shane Culpepper (RMIT, Australia)
Eric Lo (Chinese University of Hong Kong, Hong Kong)
Joyce Ho (Emory University, USA)
Debora Donato (Mix Tech, Inc., USA)
Rakesh Agrawal (Data Insights Laboratories, USA)
Yu Zheng (Microsoft Research Asia, China)
Carlos Castillo (Qatar Computing Research Institute, Qatar)
Aixin Sun (Nanyang Technological University, Singapore)
Vincent S. Tseng (National Cheng Kung University, Taiwan)
Chenliang Li (Wuhan University, China)

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

National Key Basic Research Program
Natural Science Foundation of China

Conference

CIKM '17: ACM Conference on Information and Knowledge Management

November 6 - 10, 2017

Singapore, Singapore

Acceptance Rates

CIKM '17 Paper Acceptance Rate 171 of 855 submissions, 20%;

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
