Churn Analysis in Telecommunication Using Logistic Regression
Since the beginning of data mining, knowledge discovery from databases has been used to solve
a variety of problems and has helped businesses arrive at practical solutions. Large companies are
working to improve revenue in the face of increasing customer loss. The process in which a customer
leaves one company and joins another is called churn. This paper discusses how to predict which
customers might churn, using R for the prediction. R also helps represent a large churn dataset
graphically, so that the outcome can be depicted through various data visualizations. Churn is a very
important area in the telecom domain, where companies can gain or lose customers, and the industry
therefore spends a lot of time on predictions that in turn support the necessary business conclusions.
Churn can be avoided by studying the past history of the customers. Logistic regression is used for
the analysis. Before applying logistic regression, the outliers present in the data must be eliminated;
this has been achieved by cleaning the data (for redundancy, false data, etc.), and the result has been
populated into a prediction Excel file on which the analysis has been performed.
... a new customer is approximately 5 higher than retaining the new customer.

... churn customers the industry is trying its best to retain the profitable customers, and this is
named churn management.
Literature review
This paper provides an overview of performing logistic regression with RStudio to analyse CRM
data and arrive at a churn prediction. This helps solve many business-related problems. The paper
summarizes the prediction by representing the result graphically in Power BI, where the actuals and
the predictions are shown, and on that basis the accuracy of the model is also estimated. Based on
the accuracy, the business can decide whether this approach helps improve the business or whether
a better approach must be followed.

The voluntary and involuntary techniques for customer relationship management are also
discussed in brief.

The objective of the paper is to classify the possible customers that might churn.

In many areas statistical analysis is used to predict the customers that might churn.

The outcomes of churn analysis are listed below:
• Improved retention
• Propensity modelling
• Prioritized marketing
• Increased customer value

The types of churn can be classified as
Data collection
For the analysis, the data available in the telecom dataset has been used, and the prediction
has been made on the same data.

Data preparation
Before the data can be analyzed, it has to be cleaned and made ready so that the desired
results can be derived from it.

The data has to be cleaned so that redundancy and errors are removed, because such data
would lead to incorrect results.

In this paper, churn analysis has been applied to telecom data. The aim is to identify the
customers that might churn from the service provider. R programming is used for this, as it provides
statistical computing for the available data, and backward logistic regression is used to achieve it.
The end result gives the probability of churn for each customer.

Logistic regression is used here to do the churn analysis. Logistic regression is a statistical
method in which the outcome variable is categorical rather than continuous. It limits the prediction
to the interval between zero and one.

In this paper we use backward stepwise regression. This involves starting with all the
variables, then testing the deletion of each variable against a chosen criterion, and continuing until
no further variable can be deleted without a statistically significant loss of fit.

The dataset used has 22 variables, related to Gender, customer_id, Phone Service, etc. It
contains information on over 2000 customers.

After applying backward regression, the cleaned data was written into a new file called the
“prediction” file, which has an extra column called “probability”. This new column gives the
probability that each customer might churn from the telecom provider. The data used is from 2016
and the customers that churned are known, so the backward regression model is applied here to
conclude whether the model is accurate and, based on the accuracy, to decide whether the model
should be considered for future predictions.

Prediction
The business is interested in the final product, and it is very important to present the result
as a “graphical representation” in such a way that it is understandable and helps the business make
the needed predictions, which in turn brings ...

There are many tools that help achieve this, for example Tableau, Power BI, QlikView, etc.

Data visualization tools
The best way to get a message across is to use visualization tools. By representing data
visually, it is possible to uncover surprising patterns that would go unnoticed if we looked at the
statistics alone.

Here “Power BI” is the tool used for data visualization. Power BI is a business analytics
tool provided by Microsoft with which reports can be created.

In this approach, the data is already cleaned and the result is populated in a file called
“Prediction”, which is used to show visually how the data appears and its impact.

The churn value is represented as given below.

The churn value in the graph below is 20.93%; these are the customers that are likely to
churn from the telecom service provider.

The graph shows that the remaining 79.07% would not churn from the service provider.
They pose no risk to the business.

There are many factors on the basis of which we can conclude whether a customer will
churn or not.
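The backward stepwise logistic regression described in this section can be sketched as follows. This is a minimal illustration, not the code used in the paper: the paper works in R, whereas this sketch uses Python with only the standard library. Several things are assumptions made for the example: the deletion criterion is AIC (the paper does not name its criterion), the model is fitted by plain gradient descent, the data is synthetic, and the feature names `tenure` and `noise` are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function; the result always lies in [0, 1]."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows, labels, features, steps=200, lr=0.5):
    """Fit an intercept and one weight per feature by batch gradient descent."""
    b = 0.0
    w = {f: 0.0 for f in features}
    n = len(rows)
    for _ in range(steps):
        grad_b = 0.0
        grad_w = {f: 0.0 for f in features}
        for row, y in zip(rows, labels):
            err = sigmoid(b + sum(w[f] * row[f] for f in features)) - y
            grad_b += err
            for f in features:
                grad_w[f] += err * row[f]
        b -= lr * grad_b / n
        for f in features:
            w[f] -= lr * grad_w[f] / n
    return b, w


def aic(rows, labels, features, b, w):
    """Akaike information criterion: 2k - 2 * log-likelihood (lower is better)."""
    eps = 1e-12
    log_lik = 0.0
    for row, y in zip(rows, labels):
        p = sigmoid(b + sum(w[f] * row[f] for f in features))
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        log_lik += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    k = len(features) + 1  # weights plus intercept
    return 2 * k - 2 * log_lik


def backward_eliminate(rows, labels, features):
    """Start from all features; drop one at a time while AIC keeps improving."""
    current = list(features)
    b, w = fit_logistic(rows, labels, current)
    best = aic(rows, labels, current, b, w)
    while current:
        candidates = []
        for f in current:
            reduced = [g for g in current if g != f]
            b, w = fit_logistic(rows, labels, reduced)
            candidates.append((aic(rows, labels, reduced, b, w), f))
        score, drop = min(candidates)
        if score >= best:  # every deletion worsens the fit criterion: stop
            break
        best = score
        current.remove(drop)
    return current


# Synthetic data (assumption): churn depends on tenure only; "noise" is irrelevant.
random.seed(0)
rows, labels = [], []
for _ in range(200):
    tenure = random.uniform(-1.0, 1.0)  # hypothetical standardized tenure
    noise = random.uniform(-1.0, 1.0)
    labels.append(1 if random.random() < sigmoid(-3.0 * tenure) else 0)
    rows.append({"tenure": tenure, "noise": noise})

kept = backward_eliminate(rows, labels, ["tenure", "noise"])
print("features kept:", kept)
```

The fitted model's output for each customer is what the “probability” column would hold. On real data, a tested statistical package would be preferable to the hand-written gradient descent shown here.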
The churn prediction can be made on the basis of various factors such as age, tenure, job,
payment details, gender, call time, tech support usage, etc.

Below, a few tables are used to identify the possible domain or kind of people that are likely
to churn.

Based on Tenure
The graph below shows that customers with a tenure of 0-30 months are the most likely to
churn, customers with 30-60 months are most likely not to churn, and customers with more than
60 months would ideally not churn.

Based on customers who use Tech Support
From the graph below we can conclude that customers who use tech support are the ones
that would not churn, whereas customers who do not use technical support are likely to churn. This
might be due to a lack of knowledge about the services provided by the telecom company, so it is
very important to highlight the customer services on offer so that they can be put to the right use
and thus prevent customers from churning.

To check for the accuracy of the model
This paper uses a confusion matrix table with the variables Actuals, Frequency and
Prediction; this table helps describe the performance of the model.

In this model, when “Actuals=1” the prediction must also be 1, but as the graph below
shows, at one point “Actuals=1” while “Predictions=0”, so the model is not totally reliable, since
the accuracy is not 100%.

The accuracy measure tells us how accurate the model is; here the model is 80.02%
accurate.

The accuracy is good enough for a churn prediction but it is not very accurate. Using an
SVM (support vector machine) with R could give more accurate probabilities and thus a more
reliable result. Another way of getting higher accuracy is to increase the number of variables used.

When more variables are available for the comparison, the result will be more precise, and
thus the business can do a near-real-time prediction with the given data.
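The accuracy check described above can be sketched as follows: compare actual and predicted labels, count the four possible outcomes, and compute accuracy as the share of correct predictions. This is an illustration in Python rather than the paper's R and Power BI workflow. The 0.5 probability threshold and the sample values are assumptions made for the example, not figures from the paper.

```python
def confusion_matrix(actuals, predictions):
    """Count true/false positives and negatives for binary labels (1 = churn)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(actuals, predictions):
        if actual == 1 and predicted == 1:
            counts["TP"] += 1
        elif actual == 0 and predicted == 0:
            counts["TN"] += 1
        elif actual == 0 and predicted == 1:
            counts["FP"] += 1
        else:  # actual == 1 and predicted == 0: a churner the model missed
            counts["FN"] += 1
    return counts


def accuracy(counts):
    """Share of all predictions that match the actual label."""
    total = sum(counts.values())
    return (counts["TP"] + counts["TN"]) / total


# Made-up values standing in for the "probability" column and the known outcomes.
probabilities = [0.91, 0.12, 0.40, 0.75, 0.08]
actuals = [1, 0, 1, 1, 0]

# Assumed cut-off: a probability of 0.5 or more is treated as a predicted churn.
predictions = [1 if p >= 0.5 else 0 for p in probabilities]

counts = confusion_matrix(actuals, predictions)
print(counts)
print("accuracy:", accuracy(counts))
```

The third customer in this sample is the case the text describes: “Actuals=1” but “Predictions=0”, which appears as a false negative and lowers the accuracy below 100%.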
Fig. 3: Churn prediction based on tenure
Fig. 4: Churn prediction based on tech support
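The tenure breakdown summarized in Fig. 3 can be reproduced in outline by grouping customers into the three tenure bands named in the text and computing the churn rate in each. This is a sketch in Python, not the paper's Power BI report. The band boundaries (30 falls in the first band, 60 in the second) and the sample records are assumptions made for the example.

```python
from collections import defaultdict


def tenure_band(months):
    """Map a tenure in months to one of the three bands discussed in the text."""
    if months <= 30:
        return "0-30"
    if months <= 60:
        return "30-60"
    return "60+"


def churn_rate_by_band(customers):
    """customers: iterable of (tenure_months, churned) pairs, churned being 0 or 1."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for months, did_churn in customers:
        band = tenure_band(months)
        totals[band] += 1
        churned[band] += did_churn
    return {band: churned[band] / totals[band] for band in totals}


# Hypothetical records: (tenure in months, 1 if the customer churned else 0).
sample = [(3, 1), (12, 1), (25, 0), (40, 0), (55, 1), (70, 0)]
rates = churn_rate_by_band(sample)
for band in ("0-30", "30-60", "60+"):
    print(band, round(rates[band], 2))
```

The same grouping could be applied to any other categorical factor, such as whether a customer uses tech support, by replacing `tenure_band` with a function that returns that category.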
