Customer Churn Telecom

CUSTOMER CHURN PREDICTION MODEL IN
TELECOMMUNICATION SECTOR USING MACHINE

LEARNING TECHNIQUE
Submitted by
Nayema Taskin
A thesis submitted to the Department of Statistics

in partial fulfillment of the requirements
for a two-year Master of Science degree in Statistics
in the Faculty of Social Sciences
Supervisor
Yukai Yang
Spring, 2023
Abstract
Customer churn is a critical problem faced by telecom companies, leading to lost rev-
enue and increased marketing costs. In the highly competitive telecommunication sec-
tor, customer retention is essential for success. It costs five to seven times more to
acquire a new customer than it does to retain an existing one. Considering this, churn
prediction models are increasingly becoming an important tool for telecommunication
organizations looking to minimize their customer attrition rate. Churn, or customer at-
trition, is a major problem for businesses in the telecommunications sector. Every year,
millions of customers switch to new service providers, resulting in billions of dollars
in lost revenue. In the ever- evolving and highly competitive world of telecommuni-
cations, businesses are constantly looking for new ways to improve customer loyalty
and reduce customer churn. Machine learning techniques can be incredibly useful in
this endeavor. This study proposes a customer churn prediction model using machine
learning techniques to help telecom companies retain customers and reduce churn rates.
The proposed model analyzes big data using machine learning algorithms, including K
Nearest Neighbors (KNN), Support Vector Machine (SVM), Logistic Regression (LR),
Random Forest (RF), Adaboost, Light Gradient Boosting Machine (LGBM), Gradient
Boosting, and Extreme Gradient Boosting (XGBoost) to predict customer churn. The
proposed model achieves high accuracy score of 95.74% with the XGBoost and LGBM
classifier. The results demonstrate that machine learning algorithms have the potential
to predict customer churn effectively and provide insights into the primary drivers of
customer churn.
Keywords: Telecommunication, Customer Retention, Churn Prediction, Machine Learn-

ing.
Contents
Abstract I
List of Figures III
List of Tables IV
1 Introduction: 1
2 Related Works: 4
3 Proposed Methodology: 6
3.1 Dataset Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.3 Data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.4 Model Selection and setup . . . . . . . . . . . . . . . . . . . . . . . . 10
3.5 Model Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . 17
4 Experimental Results and Analysis: 22

4.1 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5 Conclusion: 28
II
List of Figures
3.3.1 Relation Between International Plan and Churn . . . . . . . . . . . . . 8
3.3.2 Relation Between Churn and Voicemail Plans . . . . . . . . . . . . . . 8
3.3.3 Relation Between Area Code and Churn . . . . . . . . . . . . . . . . . 9
3.3.4 Relation Between Customer Service Calls and Churn . . . . . . . . . . 9
3.3.5 Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.3.6 Top Ten Features Correlated with Churn. . . . . . . . . . . . . . . . . . 11
3.4.1 Methodology Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . 12
4.2.1 Accuracy Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2.2 Precision Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2.3 Recall Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2.4 Specificity Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2.5 G-mean Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2.6 Roc-Score Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2.7 MCC Score Comparison . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2.8 Feature Importance Analysis on XGBoost Model . . . . . . . . . . . . 26
III
List of Tables
4.1.1 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
IV
1 Introduction:
Customer churn refers to the phenomenon of customers ceasing to do business with a
company or brand. In the context of the telecommunication industry, churn occurs when
customers end their subscription with a particular telecom service provider and switch to
another provider or terminate the service altogether. Churn can be a significant problem
for companies as it can result in lost revenue, decreased market share, and increased
costs associated with acquiring new customers to replace those who have left.
Customer churn is a crucial problem in the telecommunication sector, and it has
been the focus of research in recent years. According to Coussement et al. (2017)Many
dynamic and competitive businesses within a marketplace view their consumers as one
of their most valuable assets. At an incredibly fast rate, telecom companies produce
a large amount of data in the modern world. Numerous telecom service providers are
vying for more customers by competing in the market. The competition in the telecom-
munication sector is intense, and telecom companies need to retain their customers to
remain profitable.
Churn can have a significant impact on telecom companies. When customers leave
a telecom service provider and switch to a competitor, the provider loses revenue and
market share. In addition, acquiring new customers to replace those who have left can
be expensive and time-consuming. The cost of acquiring a new customer can be five to
seven times higher than retaining an existing customer stated by Kumar (2022).
However, Ballings & Van den Poel (2012) believes churn management in customer
relationship management (CRM) has drawn a lot of interest because it has been estab-
lished that, keeping customers is more profitable than finding new ones. This has been
especially true in the digital age, as customers are more likely to spread their opinions
quickly and widely on social media and the internet due to the potential damage nega-
tive reviews can do to businesses. Customers have a variety of options, including more
1
affordable and superior services. Maximizing profits while maintaining viability in a
cutthroat market are telecom companies’ critical success factors according to Babu et
al. (2014).To accomplish this, businesses are relying on big data analytics to better un-
derstand consumer behavior, anticipate their needs and preferences, provide customized
services, and create fresh business plans.
Moreover, for Ali & Arıtürk (2014) churn detection is a challenging task for several
reasons. Firstly churn can be influenced by various factors such as pricing, network cov-
erage, customer service quality, product offerings, etc. Identifying the primary drivers
of churn and accurately measuring their impact might prove to be difficult. Secondly,
Churn is a rare event in most organizations. This means the proportion of customers
who may churn in a given period of time may be small compared to the total customer
base. This causes a class imbalance in the customer dataset, which can make build-
ing accurate predictive models challenging as they may be biased towards the majority
class. Thirdly, churn can be a nonlinear process. Meaning, its causes and effects may
not follow a linear relationship. Finally, the telecom industry generates a big volume of
data. Which can be both a blessing and a curse. Large volumes of data usually contain
lots of noise, and also not all factors contribute to churn. The abundance of data can
also make it difficult to extract meaningful insights from the noise.
There can be various reasons for churning. TechSee (2019) finds that one of the most
prominent reasons for churning is high customer effort. When a customer had to call
more than once, or companies took too long to resolve an issue, a high rate of customer
churn could be observed. And these disappointed customers spread their discontent
to friends, families, or co-workers which could, in turn, cause other people to churn.
To retain customers, customer behavior must be analyzed to determine what are the
factors that affect customers’ decision to migrate from their telecom service provider. In
recent times, machine learning models are very popular to accurately analyze data and to
create a predictive model which can predict customer behavior. For customer retention,
2
an accurate and high-performance machine learning model is necessary. To address
this problem, this research paper aims to develop a customer churn prediction model
using various machine-learning techniques. The various machine learning techniques
used here are RF, KNN, SVM, LR, Adaboost, LGBM, Gradient boosting, and XGB
classifiers to predict customer churn by analyzing big data.
Past research on this topic is mostly done by using a large volume of telecom data to
identify customer behavior patterns. However, most of this research has its limitations.
When a large volume dataset is created, most of the time, it has missing values to a cer-
tain degree. Thus while handling big telecom data, data preprocessing is very important.
Then while building the model, not all types of data have significance. It is important
to analyze data and find correlations between them and apply feature selection. Lastly,
various types of machine learning techniques should be experimented to compare and
find the best method to build a predictive model.
The scope of this thesis is to explore the effectiveness of different machine learning
techniques in predicting customer churn in the telecom sector. Our experimental results
denote that our built model is able to predict customer churn with high accuracy. The
results of this research will help telecom companies to identify customers who are likely
to churn and take appropriate measures to retain them.
The rest of the paper is organized as follows: section 2 discusses the relevant works
conducted in this field. The proposed methodology and dataset description are provided
in section 3. Experimental results and their analysis is conveyed in section 4. Finally,
section 5 draws the conclusion.
3
2 Related Works:
Several studies have been conducted on the prediction of customer churn in the telecom-
munication sector. This section provides an overview of some of the related works.
A study by Ullah et al. (2019) develops a churn prediction model using machine
learning techniques, such as Random Forest, Attribute Selected Classifier, J48, Random
Tree, Decision Stump, Naive Bayes, Logistic Regression, and some ensemble classi-
fiers. They used customer data, such as total calls, total minutes, charged calls, charged
minutes, revenue, etc information to predict customer churn. They use 2 Datasets. The
first one is from a South Asia GSM telecom provider with 64,107 instances with 29
features and all features are numerical. The second dataset is a publicly available churn-
bigml dataset with 3333 records and 16 features. Their results show that Random Forest
has the best accuracy for the first dataset and Attribute selected classifier and J48 has
the best accuracy for the second dataset.
Another study by Mishra & Reddy (2017) proposes a deep learning-based approach
for predicting customer churn in the telecommunication sector. They use deep learning
by a convolutional neural network (CNN) to predict customer churn. They achieved
86.85% accuracy and 92.06% F-score which outperforms other models.
Whereas Xu et al. (2021) uses an ensemble learning technique consisting of stack-
ing models and soft voting to predict customer churn in the telecommunication sector
in one study. They select Xgboost, Logistic regression, Decision tree, and Naïve Bayes
machine-learning algorithms to build a stacking model with two levels, and three out-
puts of the second level are used for soft voting. They use a publicly available churn
dataset of 3333 instances. They apply feature construction on the dataset and their ex-
perimental result accuracy on the original dataset is 96.12% and on the new dataset is
98.09%.
In a study by Amin et al. (2019), the authors proposed that a classifier shows dif-
4
ferent accuracy levels for different zones of the dataset. They estimate the certainty of
a classifier model using the distance factor. They group the datasets into two differ-
ent zones - data with high certainty and data with low certainty based on the distance
for predicting customers exhibiting churn and non-churn behavior. Their experimental
results conclude that the distance factor is strongly correlated with the certainty of the
classifier and the classifier obtains high accuracy in the zone with a greater distance
factor value.
Finally, Sana et al. (2022) investigates several machine learning techniques as well
as data transformation methods. The machine learning methods used are KNN, NB,
LR, RF, Decision Tree, GradientBoosting, FNN, and RNN. Their proposed technique
improved the prediction performance by up to 26.2% of the AUC score and 17% in
terms of F-measure.
Overall, these studies demonstrate the effectiveness of machine learning techniques
in predicting customer churn in the telecommunication sector. The proposed methodol-
ogy in this thesis will build upon these studies and evaluate the effectiveness of various
machine learning techniques in predicting customer churn in the telecommunication
sector.
5
3 Proposed Methodology:
This study proposes to develop a customer churn prediction model using machine learn-
ing techniques to predict customer churn in the telecommunication sector. The follow-
ing steps will be taken to achieve the research objectives:
3.1 Dataset Description
This subsection briefly discusses the dataset used. The dataset used is a publicly avail-
able churn-bigml dataset. The dataset contains 3333 instances and 16 features. All the
features are in numerical form except state, ’international plan’, ‘voice mail plan’, and
the target label ‘churn’. The target attribute ‘churn’ is a binary class where “True” de-
notes customers who have migrated from their telecom service provider. There are 483
instances of true labels which are 14.49% and 2850 false instances which is 85.51% of
the total data.
Each row of data denotes a single user’s information. And the dataset contains in-
formation of 3333 users. The column ‘state’ contains 51 unique instances which denote
51 states of the USA. ‘Area code’ denotes 3 areas to which the telecom service provider
divided their service region. The ‘account’ length column represents how many days
the user account has been active for. The ‘phone number’ column is a user’s uniquely
identifiable number. ‘international plan’ is categorical data with values “yes” or “no”
which represents if a user has subscribed to international roaming plans or not. Simi-
larly ‘voicemail plan’ is categorical data with binary values. ‘number vmail messages’
is the count of how many voicemails a user has received.
In this dataset a user’s talktime, number of calls, and amount of money charged are
separated into three-time shifts - day, evening, and night and these values are described
in nine columns.
International talktime, number of calls, and money charged are separately stored in
6
three other columns.
Customer service calls represent how many times a user contacted customer service.
3.2 Data Preprocessing
In this subsection, the pre-processing procedures applied to the dataset before training
models are explained.
• The dataset used has no missing values so there was no need to remove any rows.
• The ‘phone number’ column is a unique identifier for a user which has no corre-
lation with prediction. Therefore this column was dropped.
• Two categorical columns - ‘international plan’ and ‘voice mail plan’ were mapped
to numerical values. 1 denotes ‘yes’ and 0 denotes ‘no’.
• The target label ‘churn’ is in boolean true or false which was converted to 1 or 0
respectively.
• Lastly, the column ‘state’ was encoded to numerical values using label encoder.
3.3 Data analysis
There were lots of features available in the dataset that we used. To determine how each
feature affects churn, a data analysis was run. This section explains our findings of data
analysis on BigML dataset.
In figure 3.3.1, we can see that most customers do not choose to subscribe to inter-
national roaming plans. But those who do, are less likely to churn.
In figure 3.3.2, we can see that customers who have subscribed to voicemail plans
are less likely to churn.
7
Figure 3.3.1: Relation Between International Plan and Churn
Figure 3.3.2: Relation Between Churn and Voicemail Plans
From the figure 3.3.3, we can see that the second area has the highest number of
users. The number of churners is high in this region but it cannot be conclusively said
that area code affects churn behavior as the high number of churn could be simply
because of higher number of users compared to other areas.
The number of ’customer service calls’ is an important factor to discern customer
behavior. In the above figure, the number of customers who have contacted customer
care at least once is very high. And people who only contacted customer care once or
never at all had a high probability of staying loyal to their telecom service provider. And
the increase of customer service calls may also increase the rate of churning.
8
Figure 3.3.3: Relation Between Area Code and Churn
Figure 3.3.4: Relation Between Customer Service Calls and Churn
Figure 3.3.5, correlation analysis is shown. We can see that international plans, total
day minutes, total day charge, and customer service calls have a high correlation with
Churn. On the other hand voicemail plan, number of voicemail messages, and total
international calls have very low correlation with churn. Figure 3.3.6 shows the top 10
features according to correlation with target label churn.
9
Figure 3.3.5: Correlation Analysis
3.4 Model Selection and setup
In this subsection, the machine learning training process is described. Model selection
and model training are crucial steps in developing an accurate customer churn prediction
model. Various machine-learning techniques were selected for experimental evaluation.
Among them were RF, KNN, SVM, LR, Adaboost, LGBM, Gradientboosting, and XG-
Boost. These models were selected based on their popularity in the literature and their
ability to handle large datasets.
Since the dataset is relatively small, the dataset wasn’t split into train/test sets.
Rather stratified K-Fold cross-validation was used with K=10. The model is trained
on k-1 folds and validated on the remaining fold. To use 10-fold cross-validation means
that the data is split into 10 folds, and the model is trained on 9 folds and validated on
the remaining fold. This process is repeated k times, and the average performance is
10
Figure 3.3.6: Top Ten Features Correlated with Churn.
calculated to evaluate the model’s performance. This was implemented using the sci-kit
learn tool in Python. N-jobs variable was kept -1 which means all processors were used.
The hyperparameters for each model were tuned using a grid search technique,
where different combinations of hyperparameters were tested to identify the optimal
set of hyperparameters that maximizes the model’s performance.
KNN or K Nearest Neighbors is a non-parametric machine learning algorithm that
can be used for binary classification problems. The value of K represents the number of
nearest neighbors to be considered while classifying a data point. The distance metric
used to measure the distance between data points can be Euclidean distance, Manhattan
distance, etc.
For a given query point x0 , we identify the k training points x(r), with r ranging
from 1 to k, that is closest in distance to x0 and classify the result using a majority vote
amongst these k neighbors. Roughly speaking, ties are broken. Euclidean distance is
used in feature space, and we will assume the features are real-valued for the sake of
simplicity
di = ||xi − x0 ||. (1)
11
Figure 3.4.1: Methodology Diagram
Since it’s conceivable that the features are measured in various units, we usually first
normalize each one to have a mean of 0 and a variance of 1. Despite being straightfor-
ward, the k-nearest-neighbors algorithm has been effective in classifying a wide range
of data, including handwritten numbers, satellite picture sceneries, and EKG rhythms.
Where each class has numerous viable prototypes and the decision boundary is highly
erratic, it frequently succeeds. The bias of the 1-nearest-neighbor estimate is frequently
low but the variance is considerable since it only considers the training point that is
closest to the query point Hastie et al. (2009).
For this research, we implemented KNN using the sci-kit learn library in Python.
The value for the number of neighbors to be considered is k=10.
SVM or support vector machine Hastie et al. (2009) is another popular machine
learning algorithm. The SVM model is based on statistical approaches.The hyperplane
is defined by
x : f (x) = xT β + β0 = 0, (2)
12
where β is a unit vector meaning ||β|| = 1. Because the classes are distinct, we can
identify a function f (x) = xTi β + β0 with yi f (xi ) > 0 ∀i. Hence we are able to find the
hyperplane that creates the biggest margin between the training points for class 1 and
-1. We can define M = 1 / ||β|| and write

 yi (xTi β + β0 ≥ 1 − ξi ∀i,

min ||β|| subject to (3)
ξi ≥ 0, PN ξi ≤ constant.

i=1
SVM works by transforming the input data into a high-dimensional feature space
and then finding the hyperplane that maximizes the margin between the different classes.
The points closest to the hyperplane are called support vectors which are crucial for
building the model. The hyperplane is defined by the support vectors. The margin is
the distance between the hyperplane and the closest data points or support vectors from
each class. The larger the margin, the better the model performs. For this research, we
implemented SVM using the scikit- learn library, and 10 was used as a random seed.
Logistic Regression or simply called LR is another popular machine learning algo-
rithm that is commonly used for binary classification tasks. The goal of logistic regres-
sion is to find the best-fit line, also known as the decision boundary that can separate
the different classes of data points. Logistic Regression is a type of generalized linear
model that uses the sigmoid function, to estimate the probability of an event occurring.
In order to describe the posterior probabilities of K classes using linear functions in
x while also assuring that they add to one and remain in the range [0, 1], the logistic
regression model was developed. The form of the model is
13
Pr(G = 1 | X = x)
log = β10 + β1T x,
Pr(G = K | X = x)
Pr(G = 2 | X = x)
log = β20 + β2T x,
Pr(G = K | X = x) (4)
..
.
Pr(G = K − 1 | X = x) T
log = β(K−1)0 + β(K−1) x.
Pr(G = K | X = x)
The model is defined in terms of logit transformations or K - 1 log-odds which represents
the requirement that the probabilities add to 1. The estimates are equivariant under the
choice of the denominator, which makes it arbitrary even if the model utilizes the last
class as the denominator in the odds ratios. From the calculation, we can see

exp βk0 + βkT x
Pr(G = k | X = x) = , k = 1, . . . K − 1
1 + K−1 T
P
l=1 exp (βl0 + βl x) (5)
1
Pr(G = K | X = x) = PK−1 ,
1 + l=1 exp (βl0 + βlT x)
which sums up to one. We define the probabilities Pr(G = k | X = x) = pk(x; θ) to
n o
T
stress the dependency on the complete parameter set θ = β10 + β1,T ... , β(K−1) , β(K−1) .
The conditional likelihood of G given X is typically used to fit logistic regression models
using maximum likelihood. The multinomial distribution is suitable since Pr(G | X)
completely specifies the conditional distribution. For N number of observations the
log-likelihood is
N
X
ℓ(θ) = log pgi (xi; θ), (6)
i=1
where pk(xi; θ) = Pr(G = k | X = xi; θ) Hastie et al. (2009).

The logistic function maps any input value to a value between 0 and 1, which denotes
the probability of a binary event occurring. Logistic Regression works by first calcu-
lating the weighted sum of the input features and adding a bias term. This weighted
sum is then passed through the logistic function to obtain the probability of the event
occurring. If the probability is greater than a predefined threshold which is usually 0.5
14
or 50%, the model predicts the positive class. Otherwise, it predicts the negative class.
We also implemented this algorithm using the sci-kit learn library and kept the value of
random_state 10 for reproducibility.
Another machine learning model we used is RF or Random Forest. It is an en-
semble learning method that combines multiple decision trees to make more accurate
predictions. Ensemble simply means combining multiple models. There are two types
of ensemble methods - Bagging and Boosting. An approach for lowering the variance
of an estimated prediction function is called bagging which is also known as bootstrap
aggregation. For high-variance, low-bias techniques like trees, bagging appears to work
particularly well. To perform regression, the method involves fitting multiple regression
trees to several bootstrapped samples of the training data. The results are then averaged.
On the other hand, For classification, a group of trees forms a committee, and each
tree makes a prediction for the class, which is combined through voting to select the
predicted class. Random forests are a significant development of bagging which con-
structs many different trees that have little correlation with each other and then averages
them. In various scenarios, the performance of random forests is comparable to boost-
ing while being simpler to train and modify. Therefore, they are widely implemented in
many applications. Bagging can be beneficial for trees that are particularly noisy since
the average of a number of identically distributed (i.d.) trees has an expectation that is
equal to that of any single tree. In this way, bagging does not reduce bias in the trees but
only reduces variance when compared to individual trees. On the other hand, boosting
produces trees in an adaptive manner so as to remove bias, thus they are not i.d Hastie
et al. (2009).
Random Forest works by first randomly selecting a subset of the input features and
a random sample of the training data. A decision tree is then trained on the subset of
features and data. This process is repeated multiple times to create a group of decision
trees. During the prediction phase, each decision tree in the group makes a prediction
15
based on the input features and the majority vote of the trees is taken as the final predic-
tion. This model was implemented using the sci-kit learn library with random_state=10.
AdaBoost (Adaptive Boosting) Jung (2018) is another machine learning algorithm
that is commonly used for classification tasks. It is an ensemble learning method based
on the boosting method that combines multiple weak learners to create a strong classi-
fier. In AdaBoost, a group of weak learners such as decision trees or logistic regression
models is trained on the training data, with each weak learner focusing on the instances
that were misclassified by the previous weak learner. The weak learners are combined
using a weighted sum, with the weights assigned based on their performance on the
training data. During the prediction phase, each weak learner in the group makes a
prediction based on the input features, and the final prediction is made by taking a
weighted sum of the predictions. The key idea behind AdaBoost is to give more weight
to instances that are difficult to classify correctly so that the subsequent weak learners
focus more on these instances. This allows AdaBoost to focus on the most difficult in-
stances and build a strong classifier that performs well on the test data. This model was
implemented using the sci-kit learn library with random_state=10.
Light Gradient Boosting Machine or LightGBM or simply called LGBM Ke et
al. (2017) is a gradient boosting framework that uses tree-based learning algorithms.
LGBM uses a technique called Gradient-based One-Side Sampling (GOSS) to reduce
the number of samples used for training while preserving the accuracy of the model
which increases the convergence speed. LGBM builds decision trees using a split find-
ing algorithm that considers the histogram of the gradient values instead of the actual
values, which leads to faster training times. It also supports categorical features, which
can be represented as integers or one-hot encoded. This model was also implemented
using the sci-kit learn library and with random_state = 10.
Another similar model that we used is GradientBoosting Friedman (2002) . It is a
machine-learning algorithm that is based on the concept of boosting. GradientBoosting
16
builds an ensemble of weak decision trees in a sequential manner, where each new tree
is trained to correct the errors made by the previous one. This process continues until
the error is minimized, or a specified number of trees have been built. GradientBoosting
is similar to other boosting algorithms such as AdaBoost, but it differs in the way it
constructs the weak learners. Instead of assigning a weight to each training example,
GradientBoosting fits each new tree to the residual errors of the previous tree, thus fo-
cusing on the examples that were poorly predicted by the previous trees. One of the key
features of GradientBoosting include the ability to handle both sparse and dense data.
It also supports early stopping to prevent overfitting. This model was also implemented
using the sci-kit learn library with random_state = 10.
Finally, we used XGBoost or Extreme Gradient Boosting Chen & Guestrin (2016)
which is a powerful machine learning model that is based on the gradient boosting algo-
rithm. It is fast and efficient, and able to handle large datasets. XGBoost builds decision
trees using gradient descent optimization, where each new tree is fitted to the residuals
of the previous one. This process improves the model’s performance with each itera-
tion by minimizing the loss function. One of the key features of the XGBoost model
includes the ability to handle missing values, and the ability to handle both sparse and
dense data. XGBoost also supports early stopping to prevent overfitting. Implementa-
tion of this model was done by using the sci-kit learn library. The label_encoder wasn’t
used and the evaluation metric to run this model was logloss.
3.5 Model Evaluation Metrics
The next step is to evaluate the performance of the developed model using a hold-out
validation technique. For model evaluation we selected evaluation metrics accuracy,
precision, recall, specificity, geometric-mean, roc-auc score and mcc (Matthews Cor-
relation Coefficient). These evaluation metrics are explained below. Here TP denotes
True Positives, TN denotes True Negatives, FP denotes False Positives and FN denotes
17
False Negatives.
Accuracy: : Accuracy is one of the most used classification metrics. It is defined
as the ratio of the number of correct predictions made by the model to the total number
of predictions made. For example, if a model correctly predicts 90 out of 100 samples
in a test dataset, the accuracy of the model is 90/100 = 0.9 or 90%. Accuracy is a
simple and intuitive metric, but it may not always be the best measure of a model’s
performance, especially when the dataset is imbalanced. In such cases, other metrics
such as precision, recall, F1-score, or Roc-Auc score may be more appropriate
TP + TN
Accuracy = (7)
TP + TN + FP + FN
Precision: Precision is a commonly used evaluation metric in binary classification
problems, which measures the proportion of correctly predicted positive instances out
of all predicted positive instances. Precision is a useful metric in situations where false
positives are more costly than false negatives
TP
Precision = (8)
TP + FP
For example, if a model predicts that 100 samples are positive and 80 of them are actu-
ally positive, while the remaining 20 are negative, and among these 80 predicted positive
samples, 75 of them are actually positive, while the other 5 are negative, then the preci-
sion of the model would be: Precision = 75 / (75 + 5) = 0.9375 or 93.75%.
Recall: Recall is a metric that measures the ability of a model to identify all relevant
instances in a dataset. It is also called sensitivity or true positive rate. Mathematically, it
is defined as the ratio of the true positive (TP) instances to the sum of true positive and
false negative (FN) instances
TP
Recall = (9)
TP + FN
18
A high recall value indicates that the model is able to correctly identify a large pro-
portion of relevant instances in the dataset. However, it may also lead to a high number
of false positives. Therefore, recall is useful when false positives do not concern much.
Recall should be used in conjunction with other metrics such as precision and F1 score
to evaluate the overall performance of a model.
Specificity: Specificity is a metric that measures the ability of a model to correctly

identify negative instances in a dataset. It is also called the true negative rate. Mathe-
matically, it is defined as the ratio of the true negative (TN) instances to the sum of true
negative and false positive (FP) instances
TN
Specificity = (10)
TN + FP
A high specificity value indicates that the model is able to correctly identify a large
proportion of negative instances in the dataset. However, it may also lead to a high
number of false negatives. Therefore this metric should also be used in conjunction
with other metrics.
Geometric-mean: Geometric mean is a type of average that is calculated by taking

the nth root of the product of n numbers, where n is the total number of values. In the
context of binary classification, the geometric mean is often used as an evaluation metric
to measure. The overall performance of a classification model on imbalanced datasets.
The geometric mean is calculated as follows:
p
g - mean = sensitivity ∗ specificity (11)
The geometric mean takes into account both the true positive and true negative rates,
and thus it is a more reliable metric than accuracy on imbalanced datasets where the
number of instances of one class is much larger than the other class. A high geometric
19
mean value indicates that the model is able to correctly predict both positive and nega-
tive instances with high accuracy.
ROC Score: ROC (Receiver Operating Characteristic) is a graphical evaluation

metric used to evaluate the performance of binary classification models. It helps to
assess the trade-off between the true positive rate (sensitivity) and false positive rate (1-
specificity) for different thresholds of predicted probabilities.ROC score or AUC (Area
Under the Curve) is a numerical representation of the ROC curve, which provides a
single scalar value that summarizes the overall performance of a classification model.
The AUC ranges from 0 to 1, where a model with an AUC of 1 is perfect and a model
with an AUC of 0.5 performs no better than random guessing.
A model with a high ROC score indicates that the model has a good balance between
sensitivity and specificity and can accurately distinguish between the positive and nega-
tive classes. In contrast, a model with a low ROC score indicates poor performance and
may have difficulty distinguishing between the two classes.
MCC: MCC stands for Matthews correlation coefficient, which is a measure of the
quality of binary (two-class) classifications. It takes into account true positive, true
negative, false positive, and false negative predictions and is a balanced measure even
when the classes are of different sizes.
MCC ranges from -1 (total disagreement between predicted and actual labels) to
+1 (perfect prediction). A score of 0 indicates random predictions, and a score of 1
indicates perfect predictions.
MCC is calculated using the following formula:
TP * TN - FP * FN
MCC = √ (12)
(TP + FP)(TP + FN)(TN + FP)(TN + FN)
MCC is commonly used in the evaluation of machine learning models, especially in
20
cases where the classes are imbalanced or the cost of false positives and false negatives
is different.
Once the models were trained and evaluated, we selected the best-performing model
based on precision, recall, Roc-auc score, and MCC score. We did not focus on accu-
racy, as our dataset was imbalanced with the positive class being the minority class.
What we are trying is to predict the churn, which is the positive class. Thus precision
and recall were more appropriate than accuracy alone. Also for imbalanced datasets -
G-mean, Roc-auc, and MCC scores are best suited. Thus to decide our best model we
also looked for a higher Roc-auc and MCC score.
And for the best-performing model, feature importance scores were also analyzed to
identify the most critical features contributing to churn. These features provide insights
into the primary drivers of churn. Feature importance is a technique in machine learning
used to determine which features or attributes of the dataset are the most important in
predicting the target variable. A commonly used method for feature importance was
used which is the F-score or F-test. The F-score is a statistical measure that compares
the variances of two samples. If the variance increases significantly when a feature
is added, it indicates that the feature is important for the prediction. The F-score is
calculated for each feature, and the features with the highest F-scores are considered to
be the most important. These important features can be used for further analysis and
model building.
The proposed methodology is expected to provide a comprehensive analysis of the
effectiveness of different machine learning techniques in predicting customer churn in
the telecommunication sector. The results of this research will be useful for telecom
companies to develop effective customer retention strategies.
21
4 Experimental Results and Analysis:
4.1 Experimental Results
Models Accuracy Precision Recall Specificity G-Mean Roc-Score MCC

RF 95.38 94.86 72.05 99.33 84.51 85.7 80
KNN 87.96 86.38 20 99.44 44.65 59.86 38.08
SVM 91.93 88.88 50.73 98.91 70.68 74.82 63.34
LR 86.05 55.52 21.31 97.02 45.2 59.16 28.1
Adaboost 87.94 64.43 38.12 96.4 60.42 67.25 43.3
LGBM 95.74 93.68 75.8 99.12 86.62 87.45 81.93
Grad 94.93 89.76 73.52 98.56 85.08 86.04 78.42
XGboost 95.74 92.32 77.02 98.91 87.21 87.97 81.95
Table 4.1.1: Experimental Results
4.2 Comparison
Figure 4.2.1: Accuracy Comparison Figure 4.2.2: Precision Comparison
In figure 4.2.1 shows that different machine learning performs in terms of accuracy.
Among the 8 models we used, LGBM and XGBoost achieved the highest accuracy of
22
95.74%. We can also see that the Random Forest and Gradient Boosting algorithm also
performs well.
Figure 4.2.2 shows the comparison between models in terms of precision is shown.
Precision is the proportion of correctly predicted positive values among predicted pos-
itives. In our research, the customers who churn are labeled by 1 or positive. A high
precision value denotes that the model can discern the customer who may churn accu-
rately. Among the 8 models we tested, Random Forest performs the best, achieving
94.86% Precision score.
Figure 4.2.3: Recall Comparison Figure 4.2.4: Specificity Comparison
While comparing recall we can see that some models perform poorly in comparison
to other models. KNN and LR performs in the range of 20% and adaboost achieves
less than 40% score. SVM also performs poorly. Recall is the measure of correctly
predicting positive classes among actual positives. In this measure, XGBoost achieves
the highest score. Random Forest, LGBM, Grad also performs well.
23
Specificity is the measure of correctly predicted negative values against actual nega-
tive values. In our research, customers who do not churn and stay loyal to their telecom
service providers are in the negative class. As we can see, all models achieve very high
specificity scores, almost reaching 100%. The reason for that is the dataset we used was
imbalanced where the majority class is negative, and 85.5% of the total data belonged
in the negative class. As there were high amount of negative class data, all models are
able to easily predict negative class correctly.
Figure 4.2.5: G-mean Comparison Figure 4.2.6: Roc-Score Comparison
When the dataset is imbalanced, Geometric mean is an excellent choice to measure

the balance between majority and minority classes. In this regard, XGBoost achieves
the highest score of 87.21%. Gradient, LGBM and RF classifiers also perform well.
In terms of Roc-score, XGBoost again performs the highest with 87.97% Roc-score.
Grad, LGBM and RF also perform well here.
The MCC is a good metric to use when the classes in the dataset are imbalanced, as it
takes into account the relative proportions of true and false positives and negatives. The
MCC is a balanced and informative metric that provides a measure of the overall quality
of a binary classification model. The MCC ranges from -1 to 1, where 1 indicates perfect
predictions, 0 indicates random predictions, and -1 indicates complete disagreement
between predictions and actual values. Again, XGBoost algorithm performs the highest
score, with 81.95%. Also we can see that Grad, LGBM and RF classifiers perform well.
24
Figure 4.2.7: MCC Score Comparison
As it can be seen, ensemble classifiers - XGBoost, GradientBoosting, LGBM and

RF perform well on the dataset whereas LR, KNN and SVM perform poorly. The reason
for that is - the dataset contains non-linearity and also there are 16 features where not all
have significant effect on target label. LR, SVM, and KNN are all linear models, which
means they assume a linear relationship between the features and the target variable.
If the relationship is nonlinear, then these models are unable to capture it effectively.
KNN is particularly sensitive to the "curse of dimensionality" Bengio et al. (2005),
which refers to the fact that as the number of features increases, the amount of data re-
quired to generalize accurately increases exponentially. As the bigML dataset used here
has lots of features, this probably caused KNN to perform poorly in high-dimensional
spaces. On the other hand, XGBoost performed best. There could be several reasons for
XGBoost performing so well. XGBoost has built-in regularization techniques that help
prevent overfitting, which can be a problem when working with complex datasets. The
25
dataset we used had 16 features and some of them had low correlation with the target
label. For this complex dataset, XGBoost performed effectively. XGBoost is optimized
for speed and efficiency, which allows it to train and make predictions quickly. Also,
XGBoost provides a measure of feature importance, which is useful for feature selection
and understanding the relationship between features and the target variable.
After evaluating and comparing models with different types of robust evaluation
metrics, it can be said that the XGBoost classifier performs best for our purpose as it
has the highest score on precision, recall, Roc-auc and MCC score. As discussed before,
accuracy measure is not enough in this case as the dataset used was imbalanced. Be-
sides the XGBoost model, LGBM, GradientBoosting, and RF classifiers also performed
excellently. Then we ran feature importance analysis on the best performing model -
XGBoost and the results are shown below.
Figure 4.2.8: Feature Importance Analysis on XGBoost Model
This analysis shows which features are the most important for customer behavior.
Total daytime minutes and total evening minutes have a very high F-score. Total night
26
minutes and total international minutes also have moderate F-score. Which means, how
much time a customer spends talking on the phone is very important to predict customer
behavior to churn. We can also see that voice mail plan, number of voicemail messages
and area codes are the least important features in the dataset. It can be said that a
customer’s area code has little or no effect on the customer’s decision.
27
5 Conclusion:
In conclusion, this thesis aimed to develop a customer churn prediction model in the
telecommunication sector using machine learning techniques. The proposed method-
ology involved the analysis of BigML data using various machine learning algorithms,
including KNN, SVM, LR, RF, Adaboost, LGBM, Gradient Boosting, and XGBoost.
The results of the study showed that LGBM and XGBoost algorithms had the highest
accuracy of 95.74% and the RF algorithm had the second-highest accuracy of 95.38%.
XGBoost shows high performance on all the evaluation metrics we used like precision,
recall, specificity, g-mean, roc-score and MCC. It can be said that the XGBoost classi-
fier performs best for our purpose of predicting churn customers. Also, Grad, LGBM
and RF algorithms performed well in our experiments.
This study contributes to the existing body of research by evaluating the effec-
tiveness of different machine learning techniques in predicting customer churn in the
telecommunication sector. The findings of this study provide insights for telecom com-
panies on which algorithms are the most effective in predicting customer churn and can
guide their decision-making processes in customer retention strategies.
Overall, this study demonstrates the importance of machine learning techniques in
predicting customer churn in the telecommunication sector. By predicting customer
churn, telecom companies can take proactive measures to retain their customers and
reduce the cost of customer acquisition. The proposed methodology can be further im-
proved by incorporating additional variables and feature selection techniques to enhance
the accuracy of the prediction model. A more balanced dataset may also improve model
performance.
In summary, the results of this study have implications for both academics and prac-
titioners in the telecommunication industry, providing insights into the effectiveness of
machine learning algorithms in predicting customer churn.
28
References
Ali, Ö. G., & Arıtürk, U. (2014). Dynamic churn prediction framework with more
effective use of rare event data: The case of private banking. Expert Systems with
Applications, 41(17), 7889–7903.
Amin, A., Shah, B., Khattak, A. M., Moreira, F. J. L., Ali, G., Rocha, A., & An-
war, S. (2019). Cross-company customer churn prediction in telecommunication:
A comparison of data transformation methods. International Journal of Information
Management, 46, 304–319.
Babu, S., Ananthanarayanan, D. N., & Ramesh, V. (2014). A survey on factors impact-
ing churn in telecommunication using datamininig techniques. International Journal
of Engineering Research & Technology (IJERT), 3(3).
Ballings, M., & Van den Poel, D. (2012). Customer event history for churn prediction:
How long is long enough? Expert Systems with Applications, 39(18), 13517–13522.
Bengio, Y., Delalleau, O., & Le Roux, N. (2005). The curse of dimensionality for local
kernel machines (Vol. 1258).
Chen, T., & Guestrin, C. (2016, August). Xgboost: A scalable tree boosting system. ,
785-794.
Coussement, K., Lessmann, S., & Verstraeten, G. (2017). A comparative analysis

of data preparation algorithms for customer churn prediction: A case study in the
telecommunication industry. Decision Support Systems, 95, 27–36.
Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics Data

Analysis, 38(4), 367-378.
29
Hastie, T., Tibshirani, R., & Friedman, J. (2009, Feb). The elements of statistical
learning (2nd ed.). Springer New York. doi: 10.1007/978-0-387-84858-7
Jung, H. (2018). Adaboost for dummies: Breaking down the math (and its equations)
into simple terms. April.
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., . . . Liu, T.-Y. (2017, Septem-
ber). Advances in neural information processing systems. , 30, 3146–3154.
Kumar, S. (2022, 12 12). Council post: Customer retention versus customer acquisition.
Forbes.
Mishra, A., & Reddy, U. S. (2017). A novel approach for churn prediction using deep
learning. In 2017 ieee international conference on computational intelligence and
computing research (iccic) (pp. 1–4).
Sana, J. K., Abedin, M. Z., Rahman, M. S., & Rahman, M. S. (2022). A novel customer
churn prediction model for the telecommunication industry using data transformation
methods and feature selection. Plos one, 17(12), e0278095.
TechSee. (2019). Reasons for customer churn in telecoms [survey results].
Ullah, I., Raza, B., Malik, A. K., Imran, M., Islam, S. U., & Kim, S. W. (2019). A churn
prediction model using random forest: analysis of machine learning techniques for
churn prediction and factor identification in telecom sector. IEEE access, 7, 60134–
60149.
Xu, T., Ma, Y., & Kim, K. (2021). Telecom churn prediction system based on ensemble
learning using feature grouping. Applied Sciences, 11(11), 4742.
30

Customer Churn Telecom

Uploaded by

Copyright:

Available Formats

Customer Churn Telecom

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Customer Churn Telecom

Uploaded by

Copyright:

Available Formats

CUSTOMER CHURN PREDICTION MODEL IN

TELECOMMUNICATION SECTOR USING MACHINE

A thesis submitted to the Department of Statistics

Keywords: Telecommunication, Customer Retention, Churn Prediction, Machine Learn-

List of Figures III

4 Experimental Results and Analysis: 22

3.1 Dataset Description

3.2 Data Preprocessing

3.3 Data analysis

Figure 3.3.2: Relation Between Churn and Voicemail Plans

Figure 3.3.4: Relation Between Customer Service Calls and Churn

3.4 Model Selection and setup

where pk(xi; θ) = Pr(G = k | X = xi; θ) Hastie et al. (2009).

3.5 Model Evaluation Metrics

Specificity: Specificity is a metric that measures the ability of a model to correctly

Geometric-mean: Geometric mean is a type of average that is calculated by taking

ROC Score: ROC (Receiver Operating Characteristic) is a graphical evaluation

4.1 Experimental Results

Models Accuracy Precision Recall Specificity G-Mean Roc-Score MCC

Table 4.1.1: Experimental Results

Figure 4.2.1: Accuracy Comparison Figure 4.2.2: Precision Comparison

Figure 4.2.3: Recall Comparison Figure 4.2.4: Specificity Comparison

Figure 4.2.5: G-mean Comparison Figure 4.2.6: Roc-Score Comparison

When the dataset is imbalanced, Geometric mean is an excellent choice to measure

As it can be seen, ensemble classifiers - XGBoost, GradientBoosting, LGBM and

Figure 4.2.8: Feature Importance Analysis on XGBoost Model

Coussement, K., Lessmann, S., & Verstraeten, G. (2017). A comparative analysis

Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics Data

TechSee. (2019). Reasons for customer churn in telecoms [survey results].

You might also like