Abstract
Predicting customer behavior is of great importance to a project manager. Data-driven industries such as telecommunications take advantage of various data mining techniques to extract meaningful information about customers' future behavior. However, the prediction accuracy of these techniques degrades significantly when real-world data is highly imbalanced. In this study, we investigate and compare the predictive performance of two well-known oversampling techniques, the Synthetic Minority Oversampling Technique (SMOTE) and the Mega-Trend Diffusion Function (MTDF), and of four rule generation algorithms (Exhaustive, Genetic, Covering, and LEM2) based on rough set classification, using publicly available data sets. Useful feature selection can play a vital role not only in improving classification performance but also in reducing computational cost and complexity by eliminating unnecessary features from the dataset. The Minimum Redundancy Maximum Relevance (mRMR) technique is therefore used in this study for feature selection; it not only selects the best feature subset but also reduces the feature space. The results demonstrate the predictive performance of both oversampling techniques and of the rule generation algorithms, which can help decision makers and researchers select the most suitable one.
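To make the oversampling idea concrete, the following is a minimal sketch of the interpolation step at the core of SMOTE, written in plain Python. It is not the authors' implementation: the function name `smote_sketch`, its parameters, and the toy data are assumptions made for illustration only. Each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its k nearest minority-class neighbours.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Return n_synthetic points interpolated between minority samples.

    minority: list of equal-length numeric feature vectors (minority class only).
    k: number of nearest minority neighbours considered per sample.
    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    n = len(minority)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)

    # For every minority sample, find its k nearest minority neighbours
    # (Euclidean distance, brute force).
    neighbours = []
    for i, x in enumerate(minority):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        neighbours.append(others[:k])

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(n)             # pick a base sample
        j = rng.choice(neighbours[i])    # pick one of its neighbours
        gap = rng.random()               # interpolation factor in [0, 1)
        x, y = minority[i], minority[j]
        synthetic.append([a + gap * (b - a) for a, b in zip(x, y)])
    return synthetic


# Hypothetical toy data: four "churner" records with two numeric features.
churners = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_samples = smote_sketch(churners, n_synthetic=6, k=2)
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already occupied by the minority class. In practice a maintained library implementation would normally be preferred over a hand-written sketch like this one.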
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Amin, A., Rahim, F., Ali, I., Khan, C., Anwar, S. (2015). A Comparison of Two Oversampling Techniques (SMOTE vs MTDF) for Handling Class Imbalance Problem: A Case Study of Customer Churn Prediction. In: Rocha, A., Correia, A., Costanzo, S., Reis, L. (eds) New Contributions in Information Systems and Technologies. Advances in Intelligent Systems and Computing, vol 353. Springer, Cham. https://doi.org/10.1007/978-3-319-16486-1_22
DOI: https://doi.org/10.1007/978-3-319-16486-1_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-16485-4
Online ISBN: 978-3-319-16486-1
eBook Packages: Computer Science, Computer Science (R0)