DOI: 10.1145/3132847.3132940
research-article

A Two-step Information Accumulation Strategy for Learning from Highly Imbalanced Data

Published: 06 November 2017

Abstract

Highly imbalanced data is common in the real world, and training an effective classifier on it is important but difficult. Our main point in this paper is that imbalance is an observed phenomenon, not the cause of the problem. The real challenge is that useful information is overshadowed by the large scale of data in both the majority and minority classes. We propose a novel two-step strategy, Information Accumulation, which first selects the most discriminative data in the Zooming-in phase and then leverages unlabeled data through pseudo active learning and self-training in the Learning from Learned Results phase. Comparative experiments are conducted on large-scale, highly imbalanced real customer service data for a complaint detection task (where less than 2% of the data is positive). Results on eight state-of-the-art classification algorithms show significant performance improvements for all algorithms with Information Accumulation (for example, the F-Measure score of XGBoost increases by 197%, from 0.115 to 0.347), which demonstrates the effectiveness and general applicability of the proposed strategy. This work explores a new idea for dealing with highly imbalanced data: rather than balancing the training examples as usual, we focus on finding the most discriminative information from labeled data and from the learning results on unlabeled data.
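
The abstract names self-training as one ingredient of the second phase but does not give its details. As a rough illustration of generic self-training only (not the authors' procedure), the sketch below repeatedly fits a toy nearest-centroid classifier and promotes confidently predicted unlabeled points into the labeled set. The function names, the confidence score, and the `threshold` and `max_rounds` parameters are all assumptions made for this example.

```python
from statistics import mean


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(mean(coords) for coords in zip(*points))


def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_with_confidence(x, pos_c, neg_c):
    """Return (label, confidence) for x using a nearest-centroid rule.

    Confidence is a hypothetical score in [0, 1]: the normalized gap
    between the distances to the two class centroids.
    """
    dp, dn = dist2(x, pos_c), dist2(x, neg_c)
    label = 1 if dp < dn else 0
    total = dp + dn
    conf = abs(dp - dn) / total if total else 0.0
    return label, conf


def self_train(labeled, unlabeled, threshold=0.8, max_rounds=5):
    """Generic self-training loop (illustrative, not the paper's method).

    labeled:   list of (point, label) pairs containing both classes
    unlabeled: list of points
    Returns the enlarged labeled list and the points left unlabeled.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        # Refit the toy classifier on everything labeled so far.
        pos_c = centroid([x for x, y in labeled if y == 1])
        neg_c = centroid([x for x, y in labeled if y == 0])
        confident, rest = [], []
        for x in pool:
            label, conf = predict_with_confidence(x, pos_c, neg_c)
            if conf >= threshold:
                confident.append((x, label))  # accept the pseudo-label
            else:
                rest.append(x)                # keep for a later round
        if not confident:
            break  # nothing new was learned; stop early
        labeled.extend(confident)
        pool = rest
    return labeled, pool


# Tiny made-up data set: two negatives, one positive, three unlabeled points.
seed = [((0.0, 0.0), 0), ((0.2, 0.1), 0), ((5.0, 5.0), 1)]
extra = [(0.1, 0.2), (4.8, 5.1), (2.5, 2.6)]
grown, remaining = self_train(seed, extra)
```

In this toy run the two points close to a class centroid receive pseudo-labels, while the ambiguous midpoint stays unlabeled. A real system would replace the nearest-centroid rule with an actual classifier and would need to guard against confident but wrong pseudo-labels.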

Cited By

  • (2019) Imbalanced Sentiment Classification Enhanced with Discourse Marker. Artificial Neural Networks and Machine Learning – ICANN 2019: Text and Time Series, 117-129. DOI: 10.1007/978-3-030-30490-4_11. Online publication date: 9-Sep-2019

    Published In

    CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management
    November 2017
    2604 pages
    ISBN:9781450349185
    DOI:10.1145/3132847
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. complaint call detection
    2. imbalanced learning
    3. text classification

    Qualifiers

    • Research-article

    Funding Sources

    • National Key Basic Research Program
    • Natural Science Foundation of China

    Conference

    CIKM '17

    Acceptance Rates

    CIKM '17 Paper Acceptance Rate 171 of 855 submissions, 20%;
    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
