Project Report MBA Sem III
Project Report
On
As
“DESKTOP RESEARCH”
By
PRASHANT D KUMBHAR
Submitted to
M.B.A
Through
Annexure B
ANNEXURE D
DECLARATION OF STUDENT
(CERTIFICATE OF ORIGINALITY/DECLARATION)
This is to declare that I have carried out this project work myself in partial fulfillment of the
MBA Program of Savitribai Phule Pune University.
The work is original, has not been copied from anywhere else and not been submitted to any
other University/Institute for an award of any degree/diploma.
Date Signature
ANNEXURE E
DECLARATION OF GUIDE
This is to certify that the work incorporated in this Project Report A STUDY OF CUSTOMER
work and completed under my guidance. Material obtained from other sources has been duly
PLACE: PUNE
ANNEXURE F
ACKNOWLEDGEMENT
It gives me great pleasure to express my deepest gratitude to those without whom this project
would never have been completed. These people not only mentored me but also ensured that this
project became a well-rounded piece of study. It is only their creative ideas, mentoring,
constructive criticism and guidance that have made the project meaningful and well thought out.
It is a privilege to express my deepest sense of gratitude to Prof. Dr. Prajakta Warale, my
Faculty Guide from Rajgad Institute of Management Research and Development, my mentor and
undoubtedly the mainstay behind this project. It has been an honor to work under her.
Her versatile viewpoint and understanding of the subject matter, her guidance, her constructive
criticism and, above all, the motivation and faith she showed helped me stay focused and work
logically during the course of the study. I sincerely thank our honorable director,
Prof. Dr. D. B. Bharati, for their valuable support.
Thank you,
With Regards,
Prashant D Kumbhar
Table of Contents
Chapter Number    Name of the Chapter    Page Number
1 Executive Summary 8
a. Abstract
b. Objectives of the Study
c. Scope of Study
d. Need of the Study
e. Limitations of the Study
2 Company Profile / Organizational Profile 12
3 Research Methodology 17
4 Theoretical Concepts 21
6 Learning of Students 73
(Findings)
7 Contribution to Host Organization 74
(Suggestion / Recommendations)
8 Conclusion 75
References 76
CHAPTER 1: EXECUTIVE SUMMARY
1.1 Abstract
In this competitive world, business is becoming highly saturated. The field of
telecommunication in particular faces complex challenges because of the number of vibrant,
competitive service providers, which makes it very difficult for them to retain existing
customers. Since the cost of acquiring new customers is much higher than the cost of retaining
existing ones, it is time for telecom service providers to take the necessary steps to retain
customers and stabilize their market value. Several data mining techniques have been proposed
in the literature for predicting churners using heterogeneous customer records. This SIP,
carried out using the Desktop Research Method, reviews the different categories of customer
data available in open datasets, and the predictive models and performance metrics used for
churn prediction and control in a telecom company.
1.3 SCOPE OF THE STUDY
The focus of Telecom Service Providers (TSPs) shifted long ago from product-centric to
customer-centric. With digitization, customers are well aware of the services available in the
market, forcing telecom companies to invest in new technologies and advanced analytics to
understand the needs of their customers and improve customer experience. However, today’s
customer wants more than just understanding: a valuable relationship, which may come through
more timely, informed or relevant interactions.
The scope of the project covers the types of customers and their usage patterns in the
telecommunication domain. These can be analysed in the form of a model that helps to control
customer churn and to increase the revenue from these customers with the help of various
marketing techniques.
Research over the past decade has found data analytics techniques to be effective for churn
prediction, and predictive modelling techniques in particular are often found to be more
accurate. This study reviews existing work on churn prediction from three perspectives:
datasets, methods and metrics. First, it details the public datasets available and the kinds
of customer details each dataset offers for predicting customer churn. Second, it compares and
contrasts the predictive modelling methods used in the literature to predict churners from
different categories of customer records, and quantitatively compares their performance.
Finally, it summarizes the performance metrics used to evaluate existing churn prediction
methods. Analysing all three perspectives is crucial for developing a more efficient churn
prediction system for a Telecom Service Provider.
1.5 LIMITATIONS OF THE STUDY
2. The summer internship project was confined to a period of 60 days.
3. The data is confidential, so actual figures could not be used for the research study.
CHAPTER 2: COMPANY PROFILE/ ORGANISATION PROFILE
Vodafone Idea Limited is a partnership between the Aditya Birla Group and the Vodafone Group.
It is India’s leading telecom service provider. The Company provides pan-India voice and data
services across 2G, 3G and 4G platforms. With a large spectrum portfolio to support the growing
demand for data and voice, the Company is committed to delivering delightful customer
experiences and contributing towards creating a truly ‘Digital India’ by enabling millions of
citizens to connect and build a better tomorrow. The Company is developing infrastructure to
introduce newer and smarter technologies, making both retail and enterprise customers future
ready with innovative offerings, conveniently accessible through an ecosystem of digital
channels as well as an extensive on-ground presence.
Vision - Create world class digital experiences to connect and inspire every Indian to build a better
tomorrow
Mission –
Customers - Be the most loved brand by continuously raising the bar in delivering simple,
delightful experience and meaningful innovations, through new age technologies
Team - Be an inspirational, agile and exciting organisation that challenges the status quo,
and champions a diverse team that has a winning attitude and thrives on delivering
customer excellence
Community - Be the most respected company by leveraging technology and purposeful
innovation to catalyse social prosperity, digital literacy and inclusivity.
CORPORATE AWARDS:
MARKETING AWARDS:
BRAND AWARDS:
2.1.4 SWOT Analysis of the company
Strengths –
1> Increasing Revenue every quarter for the past 2 quarters
Weaknesses –
1> Companies with growing costs YoY for long term projects
2> MFs decreased their shareholding last quarter
3> Inefficient use of shareholder funds - ROE declining in the last 2 years
4> Inefficient use of assets to generate profits - ROA declining in the last 2 years
5> Red Flag: Downgrade by Credit Rating Agency
6> Poor cash generated from core business - Declining Cash Flow from Operations for last 2 years
7> Decline in Net Profit (QoQ)
8> Decline in Quarterly Net Profit (YoY)
9> Decline in Net Profit with falling Profit Margin (QoQ)
10> Decline in Quarterly Net Profit with falling Profit Margin (YoY)
11> Companies with High Debt
12> Degrowth in Quarterly Revenue and Profit in Recent Results
13> Low Piotroski Score: Companies with weak financials
14> Declining Net Cash Flow: Companies not able to generate net cash
15> Annual net profit declining for last 2 years
16> Recent Results: Fall in Quarterly Revenue and Net Profit (YoY)
17> Weak performer: Stock lost more than 20% in 1 month
Opportunities –
1> Brokers upgraded recommendation or target price in the past three months
2> Positive Breakout First Resistance (LTP > R1)
3> Highest Recovery from 52 Week Low
4> Stock with Low PE (PE <= 10)
Threats – Nil
2.2 Organizational Chart
CHAPTER 3: RESEARCH METHODOLOGY
1) Research Problem:
2) Hypothesis:
Through literature study, surveys and previous works, various discrepancies, challenges
and difficulties were identified. After data acquisition from the anonymous telecom
provider and an experimental survey by Mounika Reddy Chandiri, statistical analysis and
data analytics were carried out to draw conclusions about usage trends. The analysis is
expected to reveal correlations between varying voice and data traffic, churn risk and
the users’ quality of experience, together with the data analytics indicators of churn.
From the research work, a general relation between user satisfaction and user traffic
volume is expected to be derived, so that Customer Churn Control will uplift revenue.
3) Methods of data collection:
a) Primary data
Primary data is new data collected through personal observation and personal interviews
with team members. Due to the COVID-19 pandemic, however, it was difficult to collect
primary data by personally observing day-to-day operations, so preference was given to
working with secondary data.
b) Secondary data
The secondary data was collected through the internet, mainly from internet articles and
research articles published by other researchers. Some sample data patterns were also
referenced from the Company database.
This research aims to study and analyze customer churn based on usage volumes with respect
to Quality of Experience and user’s perspective using data analytics. Three different datasets were
analyzed statistically and with the help of decision trees. Statistical analysis includes calculation
and analysis of Mean, Standard deviation, Autocorrelations and Confidence intervals. Decision
tree analysis includes data acquisition, data preparation that includes normalization, data
preprocessing, data extraction and finally decision making.
5) Sample Design:
iii. Hypotheses Statements (Alternate and Null) - Customer Churn Control will uplift revenue
iv. Dependent and Independent Variables – Below is a list of sample dependent variables
vi. Sources of Data Collection – For secondary data, customer usage & recharge data dump
vii. Research Instrument – Customer usage & recharge data dump, collected as secondary data
from the telecom company's system using SQL.
viii. Data Analysis Software (R/ SPSS/ Ms-Excel etc) – SAS Miner, SAS EG, MS-Excel
ix. Nature & Size of the Universe – Data analysis is planned for 3 telecom circles, with
customer data sample sizes of 36 lakh, 9 lakh and 5 lakh for the MAG, MUM and UPW circles
respectively.
xi. Sampling Technique (Name of the technique with Reason) –
Stratified Random Sampling: Subsets of the data sets or population are created based on a
common factor, and samples are randomly collected from each subgroup.
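The sampling technique described above can be sketched in Python as follows. This is only an illustrative sketch: the record layout, the use of the telecom circle as the common factor, the customer counts and the 10% sampling fraction are all assumptions made for the example and are not taken from the study data.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction of records at random from each subgroup (stratum)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    strata = defaultdict(list)
    for record in records:
        strata[record[key]].append(record)
    sample = []
    for members in strata.values():
        # Take at least one record so that small strata are still represented.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical customer records for three circles (counts are made up).
customers = (
    [{"id": i, "circle": "MAG"} for i in range(360)]
    + [{"id": i, "circle": "MUM"} for i in range(360, 450)]
    + [{"id": i, "circle": "UPW"} for i in range(450, 500)]
)
sample = stratified_sample(customers, key="circle", fraction=0.10)
counts = {c: sum(1 for r in sample if r["circle"] == c) for c in ("MAG", "MUM", "UPW")}
print(counts)
```

Because the same fraction is drawn from every stratum, each circle appears in the sample in proportion to its share of the population.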
6) Statistical Technique:
Statistical Analysis is the science of collecting, exploring and presenting large amounts of
data to discover underlying patterns and trends. Telecom companies use statistics to optimize
network resources, improve service and reduce customer churn by gaining greater insight into
subscriber requirements.
Mean, Standard deviation, Standard error, Lag1 Autocorrelation, 95% Confidence
Intervals have been calculated for various combinations.
a) Mean: The mean is the arithmetic average of a dataset. It usually represents the central
value of a set of numbers.
c) Standard deviation: The standard deviation is a measure of how spread out the numbers
are. In simple words, it is the square root of the variance.
e) 95% Confidence Intervals (CI): A confidence interval is an interval estimate that gives
the most likely range of an unknown population parameter. Confidence intervals are
commonly stated at the 90%, 95% or 99% confidence level; in practice, the 95% level is
the most usual choice. As a rule of thumb, if two confidence intervals overlap
substantially, the difference is not regarded as significant, whereas if the intervals
do not overlap, there is a difference at the 95% confidence level.
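As an illustration of the measures listed above, the following Python sketch computes the mean, standard deviation, standard error, lag-1 autocorrelation and a 95% confidence interval for a small series. The usage figures are hypothetical, and the interval uses the normal approximation (z = 1.96), which is an assumption of this sketch; for very small samples a t-based interval would be wider.

```python
import math
import statistics

def lag1_autocorrelation(values):
    """Correlation of the series with itself shifted by one step."""
    mean = statistics.fmean(values)
    numerator = sum((a - mean) * (b - mean) for a, b in zip(values, values[1:]))
    denominator = sum((v - mean) ** 2 for v in values)
    return numerator / denominator

def summarize(values, z=1.96):
    """Return mean, sample SD, standard error, lag-1 autocorrelation and 95% CI."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)        # sample standard deviation
    se = sd / math.sqrt(n)               # standard error of the mean
    ci = (mean - z * se, mean + z * se)  # normal-approximation interval
    return {"mean": mean, "sd": sd, "se": se,
            "lag1": lag1_autocorrelation(values), "ci95": ci}

# Hypothetical daily data usage (MB) for one customer.
usage = [120, 135, 128, 140, 150, 145, 160, 155]
summary = summarize(usage)
print(summary)
```

Computing the same summary for two groups of customers allows their confidence intervals to be compared for overlap, as described in item e).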
CHAPTER 4: THEORETICAL CONCEPTS
Many approaches were applied to predict churn in telecom companies. Most of these
approaches have used data analytics. The majority of related work focused on applying only
one method of data analysis to extract knowledge, and the others focused on comparing several
strategies to predict churn.
Review 1:
Description: This paper, presented by Gavril (Methods for churn prediction in the prepaid
mobile telecommunications industry. In: International conference on communications), is based
on an advanced data analytics methodology for predicting churn among prepaid customers. It
uses a dataset of call details for 3333 customers with 21 features and a dependent churn
parameter with two values: Yes/No. Some features include information about the number of
incoming and outgoing messages and voicemail for each customer.
Observation: The author applied the principal component analysis (PCA) algorithm to reduce
data dimensions. Three machine learning algorithms were used to predict the churn factor:
Neural Networks, Support Vector Machine and Bayes Networks. The dataset used in this study is
small and contains no missing values.
Review 2:
Description: He Y, He Z, Zhang D (A study on prediction of customer churn in fixed
communication network based on data mining. In: Sixth international conference on fuzzy
systems and knowledge discovery) proposed a prediction model based on the Neural Network
algorithm to address the problem of customer churn in a large Chinese telecom company with
about 5.23 million customers.
Observation: The prediction accuracy standard was the overall accuracy rate, which reached
91.1%.
Review 3:
Description: This paper by Idris A, Khan (Genetic programming and adaboosting based churn
prediction for telecom. In: IEEE international conference on systems) proposed an approach
based on genetic programming with AdaBoost to model the churn problem in telecommunications.
Observation: The model was tested on two standard datasets, one from Orange Telecom and the
other from cell2cell, with 89% accuracy for the cell2cell dataset and 63% for the Orange
Telecom dataset.
Review 4:
Description: This paper by Huang (In ACM SIGMOD international conference on management of
data.) studied the problem of customer churn on a big data platform. The goal of the
researchers was to show that big data greatly enhances churn prediction, depending on the
volume, variety and velocity of the data.
Observation: Dealing with data from the Operation Support department and Business Support
department at China’s largest telecommunications company required a big data platform to
engineer the features. The Random Forest algorithm was used and evaluated using AUC.
Review 5:
Description: This paper by Makhtar M, Nafis S (Churn classification model for local
telecommunication company based on rough set theory. J Fundam Appl Sci. 2017;9(6):854–68.)
describes a model for churn prediction in telecom using rough set theory. As reported in the
paper, the Rough Set classification algorithm outperformed other algorithms such as Linear
Regression, Decision Tree and Voted Perceptron Neural Network. Various studies have examined
the problem of unbalanced datasets, where the churned customer class is smaller than the
active customer class, as this is a major issue in churn prediction.
Observation: Six different oversampling techniques were compared for the telecom churn
prediction problem. The results showed that the algorithms (MTDF and rules-generation based
on genetic algorithms) outperformed the other compared oversampling algorithms.
4.2 Theoretical Background of the study
Definition -
In a competitive telecom market, customers want competitive pricing, value for money and
high-quality service. Today’s customers won’t hesitate to switch telecom providers if they
don’t find what they are looking for. This phenomenon is called churning.
Customer churn is directly related to customer satisfaction. Since the cost of winning a
new customer is far greater than the cost of retaining an existing one, mobile service
providers have now shifted their focus from customer acquisition to customer retention.
Substantial research in the field has found data analytics to be an efficient way of
identifying churn. It helps to achieve results more efficiently and provides insights that set
alarm bells ringing before any damage occurs, giving telecom companies an opportunity to take
preventive measures. These techniques are usually applied to predict customer churn by
building models and learning from historical data. However, most of these techniques predict
only whether a customer might churn; few tell us why they churn.
Conducting experiments from the end users’ perspective, gathering their opinions on the
network, normalizing data, pre-processing datasets, eliminating class imbalance and missing
values, and replacing existing variables with derived variables all improve the accuracy of
churn prediction, which helps telecom companies retain their customers more efficiently.
Comparatively few studies have been done from the user’s perspective, taking into
consideration their quality of experience. In fact, no study has taken into consideration only
users’ data volumes. Estimating Quality of Experience (QoE) by finding relationships between
QoE and traffic characteristics could help service providers to continuously monitor user
satisfaction, react in a timely and appropriate manner to rectify performance problems, and
reduce churn.
Theories -
Before we dig into how to analyze churn, it’s critical to understand it, from a high level all
the way down to how to calculate it and its impact on the bottom line. Once you understand
churn, it will be much easier to analyze it and develop strategies to reduce it.
From a high level, churn is the measure of how many customers leave over a set time period.
It’s used to measure how much revenue a telecom operator loses through customer cancellations.
It’s also used to measure the number of users or accounts that cease using products or
services. In either case, churn represents the attrition rate of the customer base.
For a subscription-based business, churn is critical, as every customer a telecom operator
loses is lost recurring revenue. Example: a telecom company has 1,000 customers paying
Rs1,000/month, giving it a monthly recurring revenue (MRR) of Rs1,000,000 and annual revenue
of Rs12,000,000. If it has a churn rate of 10%, it loses 100 customers, or Rs100,000 of MRR,
which is a loss of Rs1,200,000 for the year.
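The arithmetic in this example can be checked with a short Python sketch. The function name is hypothetical, and the annual figure assumes, as the example does, that the churned customers would otherwise have paid for a full 12 months.

```python
def churn_revenue_loss(customers, monthly_fee, churn_rate):
    """Return (customers lost, monthly revenue lost, annual revenue lost)."""
    lost_customers = round(customers * churn_rate)
    lost_mrr = lost_customers * monthly_fee
    # Simplification: each lost customer is treated as a full year of lost fees.
    return lost_customers, lost_mrr, lost_mrr * 12

lost, lost_mrr, lost_annual = churn_revenue_loss(1000, 1000, 0.10)
print(lost, lost_mrr, lost_annual)  # 100 100000 1200000
```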
As the example illustrates, lost customers can have a huge impact on a telecom operator’s
bottom line. This is why many businesses have account managers and customer service managers
whose job is to do everything they can to reduce churn.
Let’s understand how to calculate churn. There are two common methods.
1. Customer Churn: Take all the customers the telecom company loses during a time
frame, such as a month, and divide by the total number of customers the company
had at the beginning of the month. Example: a telecom company had 500 customers
at the beginning of the month and 450 customers at the end of the month. Its
churn rate would be (500-450)/500 = 50/500 = 10%. If the telecom company prefers,
the same method can be used over a different time frame, such as quarterly or annually.
It’s important to understand that customer churn and revenue churn are not always
the same. The problem will only get worse if you have more product lines or the
price difference between product lines is greater. It’s important to note that you
may need to use both calculations. Revenue churn is a great way to report on
performance as well as to understand the financial health of your customer base.
Customer churn is important for staffing reasons, as an employee can only manage
so many accounts at one time.
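The customer churn calculation above can be written as a small Python function. The function and parameter names are hypothetical; the sketch assumes the end-of-period count excludes customers acquired during the period, as in the 500-to-450 example.

```python
def customer_churn_rate(start_count, end_count_excluding_new):
    """Churn rate = customers lost during the period / customers at the start."""
    if start_count <= 0:
        raise ValueError("start_count must be positive")
    lost = start_count - end_count_excluding_new
    return lost / start_count

rate = customer_churn_rate(500, 450)
print(f"{rate:.0%}")  # 10%
```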
Assuming the same model, if we calculate churn over a quarter we could run into a
problem. Some new sales from the first month of the quarter could churn in the second
or third month of the quarter. If those churns are accidentally included in the
calculation, we will overstate churn.
To get around this problem, you have to exclude all churns from new sales. If you
do that, you get the churn rate of Cohort A, which was the install base at the
beginning of the quarter. This method gives you the true churn rate, without
replacement, of your customer base over a quarter. The one issue with this solution
is that you may also want to include the churn rate of Cohorts B and C. If that
is the case, you will want to use the weighted average.
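One way to read the weighted average mentioned above is sketched below in Python: each cohort's churn rate is weighted by the cohort's starting size. The cohort sizes and churn counts are hypothetical, chosen only to show the calculation, and the sketch does not adjust for Cohorts B and C being exposed for less than the full quarter.

```python
def weighted_average_churn(cohorts):
    """Average the cohort churn rates, weighting each by its starting size."""
    total_size = sum(size for size, _ in cohorts.values())
    return sum((churned / size) * (size / total_size)
               for size, churned in cohorts.values())

# Hypothetical quarter: (starting size, customers churned) for each cohort.
cohorts = {
    "A": (1000, 50),  # install base at the beginning of the quarter
    "B": (100, 10),   # new sales in month 1
    "C": (100, 5),    # new sales in month 2
}
cohort_a_rate = cohorts["A"][1] / cohorts["A"][0]
overall = weighted_average_churn(cohorts)
print(f"Cohort A only: {cohort_a_rate:.2%}, weighted average: {overall:.2%}")
```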
Example,
One last fundamental aspect of churn is that it varies with the customer’s stage in the
lifecycle. It is very typical to find that customers churn at a higher rate at the
beginning of their subscription than a few months in. This can happen for a variety
of reasons, such as poor expectation setting during sales, a sudden shift in
priorities, poor onboarding programs, poor support etc. As customers mature, their
churn rate will stabilize. Because of this, it can be important to calculate churn
for newer customers separately from older customers, so that you don’t overestimate
the steady-state churn rate and underestimate early-stage churn rates.
While there are many ways to track and analyze churn, the three most common methods are
cohort reports, churn by customer age, and churn by customer behavior.
There are two reasons for analyzing churn:
1. Before trying to improve the churn rate, we need to identify problem areas so that we
can focus and prioritize accordingly.
2. Once we implement action points to improve churn, we need to know whether they are
working.
Cohort Reports:
A cohort report analyzes various cohorts of your customer base and their churn rate over time.
A cohort of customers is the segment of customers who purchased in a certain time frame. A
common cohort would be customers who purchased services in a specific month; for example,
your January 2020 cohort would be all the customers who purchased that month. There are two
major benefits of using cohort reports. The first is that a cohort report produces a clean
number, not influenced by new customer acquisition.
For example – Telecom company calculates their churn rate over 3 months.
Month 1
1000 existing customers; 50 churn; 100 new customers
50/1000 = 5% churn
Month 2
1050 customers (1000 from month 1 – 50 churn + 100 new); 40 churn; 100 new
40/1050 = 3.8% churn
Month 3
1110 customers (1050 - 40 churn + 100 new); 40 churn; 100 new customers
40/1110 = 3.6% churn
In this example, churn fell from 5% to 3.6%. With cohort reporting, it’s much easier to
understand where the change is coming from.
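The three-month example can be reproduced with the following Python sketch, which uses the churn and new-customer figures given above. The function name and data layout are assumptions made for illustration.

```python
def monthly_churn_rates(start_customers, months):
    """Churn rate per month, where each month's base includes earlier new customers."""
    rates = []
    customers = start_customers
    for month in months:
        rates.append(month["churned"] / customers)
        # Next month's base: survivors plus the customers acquired this month.
        customers = customers - month["churned"] + month["new"]
    return rates

months = [
    {"churned": 50, "new": 100},  # Month 1
    {"churned": 40, "new": 100},  # Month 2
    {"churned": 40, "new": 100},  # Month 3
]
rates = monthly_churn_rates(1000, months)
print([f"{r:.1%}" for r in rates])  # ['5.0%', '3.8%', '3.6%']
```

The denominator grows each month because of new sales, which is why this blended rate can fall even when the number of churned customers does not.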
The second benefit of cohort reporting is that it makes it possible to identify patterns in
the customer base.
The chart above is a sample cohort report. The way to read it is that customers who bought in
a cohort, like January 20, had a 100% renewal rate in month 0 (which is the month they bought,
so it’s impossible to churn), 81% of them were retained in month 1, 75% of the original number
from month 0 were retained in month 2, 71% of the original number from month 0 were retained
in month 3, and so on.
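A single row of such a cohort report can be computed as in the Python sketch below. The cohort of 100 customers and their churn months are hypothetical, chosen so that the output matches the percentages quoted above; the data layout is an assumption of the sketch.

```python
def retention_by_month(churn_ages, horizon):
    """Percentage of a cohort still active at each month of age.

    churn_ages holds, per customer, the month of age in which they churned,
    or None if they are still active.
    """
    size = len(churn_ages)
    row = []
    for month in range(horizon + 1):
        active = sum(1 for age in churn_ages if age is None or age > month)
        row.append(round(100 * active / size))
    return row

# Hypothetical cohort: 19 customers churn in month 1, 6 in month 2, 4 in month 3.
january_cohort = [1] * 19 + [2] * 6 + [3] * 4 + [None] * 71
row = retention_by_month(january_cohort, horizon=3)
print(row)  # [100, 81, 75, 71]
```

Repeating the calculation for each purchase month gives one row per cohort, which is the table that the heat map is then layered on.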
We then layered a heat map on top of this, which creates the color. The way that works is the
closer you are to 100%, the greener the box. The further away you get from 100%, the yellower
the box. As 63% is the lowest number in the chart, that box is completely yellow.
With just a glance, an obvious pattern emerges with the May cohort, as there is a clear
difference in the color. The way to interpret this is that customers who bought in May and
later have higher retention rates, or lower churn rates, compared to earlier customers. Now we
can dive deep into what happened in May to understand why. Did you create a new documentation
site, improve your onboarding process, or release a new version of the product? Once the
reason for the increase in retention is found, we can try to build on it. It’s important to
note that we can find other patterns this way. For example, maybe one or two cohorts are
performing well above or below the average:
This could indicate that the marketing team did a great job getting higher quality leads, or
maybe the sales team did a better job setting expectations, or the customer care team gave
those customers more attention. While these reports can’t tell you exactly why those cohorts
did better, investigating what is common across the two cohorts might reveal something that
will help reduce churn for the entire customer base.
Another possible pattern might show that after a certain time frame retention rates flatten
out for everyone. This is a clear indication that customers who make it past that time frame
are mature and tend not to churn:
As seen above, both from the numbers and from the heatmap, churn after month 4 is pretty low.
This type of cohort reporting is very effective. However, the same data can also be presented
as a graph if you prefer graphs to tables of numbers.
In the above chart we are comparing 4 cohorts between January and April. The y-axis is the
churn rate and the x-axis is age in months. In this graph, you can easily see that at month 4
churn falls to ~1% for each of the cohorts. This pattern is much easier to see in a graph than
in the charts presented above. Viewing the data this way also makes it easy to see the other
patterns.
Churn by Customer Age:
Another popular way to analyze customer churn is to group customers together by age. For
example, measure the churn rate of all customers during their first month of service, second
month and so on. It would look like –
The way to read this chart is that, across all customers, the churn rate is around 6%
during their first month of service, approximately 10% in their second, and by the 5th month
churn stabilizes at around 2% or lower.
The major reason for analyzing churn this way is to identify patterns for all customers by
age, regardless of cohort. Looking at churn this way helps you understand the churn rate at
each customer age. The same data can then be used to try to resolve the problem. For example,
in the chart above, churn is high for the first 120 days, so you can focus on improving your
onboarding process, cleaning up documentation, training your onboarding executives more,
and so on.
Another benefit of creating charts like this is that they help to measure the impact of churn
reduction strategies. All we need to do is compare all customers before a cut-off date with
all customers after it to see which group performs better. To illustrate, let’s continue with
the example chart above, where churn is high in the first four months. Let’s see how the data
could look if we improve the help and documentation site for new customers on 01/01/20:
Here we use a bar chart rather than a line graph so that it’s easier to compare numbers. The
red bar is the churn rate for all customers that purchased before 1/1/20 (which is the date
the new documentation site went live). The blue bar is the churn rate of all customers who
bought after 1/1/20 and used the new documentation site. As seen, months 1 and 2 were largely
unaffected; however, in months 3 and 4 churn dropped a fair amount. This shows that improved
documentation has helped prevent churn in some of the first few months, but not all of them.
In addition to analyzing churn by different cohorts or customer age, we need to analyze churn
by customer behavior. This means looking at customers who use a certain feature or complete a
certain action and determining its impact on churn. There are many benefits to performing
this kind of analysis:
The product team may decide to focus on developing and improving features that retain
customers.
Alternatively, the product team may dig into why certain features have a high churn rate to
determine if they can fix the issue.
The customer service team could work on promoting features that retain customers.
The reports that need to be created are simple. The time-consuming part depends on how
granular you get. For example, if you track just accessing an app and its impact on churn,
you would only need to build a single report for each app you have. On the flip side,
accessing an app doesn’t tell you much. So, you could choose to get more granular and track
various button clicks in an app, or combinations of button clicks, and so on. For this, you
can ask the engineering team whether they are able to get this data for you, or whether you
need to use software designed to track user behavior. If you are able to get that granular,
it is well worth it for understanding which behaviors lead to retention and which to churn.
Once a decision is taken about what to test, the next step is building reports to see what
impacts churn. For example:
The way to read this chart is that each line represents customers who accessed a feature (X
or Y in this case) or completed some action in the feature (for example clicking a button)
during the month in question, and what their churn rate was. For example, of the customers
who accessed Feature X in March 2020, 4% churned that month. To create a chart like this,
Excel is a good choice, as it makes it easy to split your customers into segments based on
their behavior; using the customer churn formula presented earlier, you can then calculate
the churn rate for those segments over time.
When you look at the chart above, 3 activities jump out as having consistently low churn:
clicking button 2 or 3 on feature Y and clicking button 1 on feature X. While this is good
evidence that customers who use those features won’t churn, we need to dig deeper. The reason
is that correlation doesn’t always imply causation. The first step in doing so is comparing
the churn rates above with the churn rates of customers who don’t use those features:
In the chart above, you can easily see the customers who used button 1 on feature X (B1X)
and button 2 on feature Y (B2Y) had very low churn rates, while those customers that didn’t had
very high churn rates. At the same time customers who used button 3 on feature Y had a low churn
rate and customers who didn’t use that button also had a low churn rate. This means button 3 on
feature Y doesn’t really impact churn, however the other two buttons have a much clearer impact
on churn.
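The comparison between users and non-users of a feature can be sketched in Python as follows. The customer records, field names and churn counts are hypothetical and are constructed only to mirror the pattern described above, where one button separates churn rates clearly and another does not.

```python
def churn_by_usage(customers, feature):
    """Churn rate for customers who used a feature vs those who did not."""
    def rate(group):
        return sum(c["churned"] for c in group) / len(group) if group else None
    users = [c for c in customers if feature in c["features_used"]]
    non_users = [c for c in customers if feature not in c["features_used"]]
    return rate(users), rate(non_users)

def make_group(n, features, churned):
    """Build n hypothetical customers, the first `churned` of whom churned."""
    return [{"features_used": set(features), "churned": i < churned} for i in range(n)]

# Hypothetical month of data: 80 customers in four usage segments.
customers = (
    make_group(20, {"B1X", "B3Y"}, 1)
    + make_group(20, {"B1X"}, 1)
    + make_group(20, {"B3Y"}, 5)
    + make_group(20, set(), 5)
)
b1x = churn_by_usage(customers, "B1X")
b3y = churn_by_usage(customers, "B3Y")
print("B1X users vs non-users:", b1x)
print("B3Y users vs non-users:", b3y)
```

A large gap between the two rates, as for B1X here, suggests the feature is related to churn, while equal rates, as for B3Y, suggest it is not. Neither result establishes causation on its own.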
Now we need to dig a little further into B1X and B2Y as the charts above do not answer the
question of whether or not using those buttons in a month will reduce churn over the next few
months, or just in the month the button was used. For example, maybe those buttons are set-up
related and customers who are actively getting set-up don’t churn, but once set up is complete their
churn rate goes up over the next few months. Or maybe these buttons are rarely used, so while a
customer might not churn in a month they use them, in subsequent months their churn risk goes
up. To address these concerns we need to move back to cohort reporting:
If you look at the B1X chart, you can see that customers who use button 1 in feature X have a
very low churn rate in the month they use the feature (month 0), and they continue to have a low
churn rate over the next 3 months. By contrast, customers who use button 2 in feature Y have a
low churn rate in month 0, but it shoots up after that. To wrap up the analysis, you should
complete a cohort report for customers that don't use B1X, to check whether the effect is likely
to be causal and not just a correlation.
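A cohort report of this kind can be sketched in a few lines of R. The vector `churn_month` below is hypothetical: it records, for ten customers who used the button in month 0, the month in which each one churned, with NA meaning the customer was still active after month 3.

```r
# Hypothetical cohort: month of churn after using the button in month 0
# (NA = still active after month 3)
churn_month <- c(NA, NA, 1, NA, 2, NA, NA, 3, NA, NA)

cohort_churn <- sapply(0:3, function(m) {
  at_risk <- sum(is.na(churn_month) | churn_month >= m)  # active at start of month m
  churned <- sum(churn_month == m, na.rm = TRUE)          # left during month m
  churned / at_risk
})
names(cohort_churn) <- paste0("month_", 0:3)
print(round(cohort_churn, 3))
```

Running the same calculation for a cohort that did not use the button gives the baseline against which these rates should be read.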
Using the reports and processes above helps to determine, with confidence, which user behavior
leads to a reduction in churn. In addition to looking at simple user behavior, like clicking a button
or accessing a feature, we should also consider analyzing:
Patterns – What patterns of usage can reduce churn? For example, do people who use
Feature 1 and Feature 2 have a low churn rate? What about customers who completed at
least 50 actions in a month or access 4+ features in a month? What about customers who
call support more than once in the 1st month? Or more than 10 times in total?
Employee Behavior – If you have a high-touch sales and/or onboarding process, you should
determine churn by employee. You should also discover the impact on churn by various
behaviors, such as time between closing the sale and first point of contact with the
customer, or length of time between touch points, or subject matter covered on calls etc.
3) Construct models using different classifiers and cross-validate the models.
Advantages/ Disadvantages –
1. This churn analysis provides advantages that will help the telecom operator to:
Access all the relevant data seamlessly and quickly
2. This churn analysis provides predictive insights to the telecom operator such as:
3. The challenge with traditional customer churn prediction models is that they do not
align with the business objective, as they only predict the gross outcome, i.e.,
whether a customer will churn. Churn analytics that estimate the net effect focus instead on
whether a customer who intends to churn will be retained when targeted with the
campaign. The true business objective of analytics is to reduce customer churn.
Customers who are about to churn but cannot be retained should be excluded from the
campaign, as targeting them will be a waste of resources. Moreover, retention efforts
may provoke customers to churn. For example, a retention offer may remind a
customer about the expiration of a contractual agreement and cause churn as a result.
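The difference between the gross outcome and the net effect can be made concrete with a small worked example. The counts below are hypothetical and are not taken from this study; the sketch assumes a campaign in which one group received the retention offer and a comparable control group did not.

```r
# Hypothetical campaign results (illustrative counts only)
targeted_n <- 1000; targeted_churned <- 120   # received the retention offer
control_n  <- 1000; control_churned  <- 200   # received no offer

gross_churn_targeted <- targeted_churned / targeted_n
gross_churn_control  <- control_churned / control_n

# Net effect: churn avoided because of the campaign
net_effect <- gross_churn_control - gross_churn_targeted
print(net_effect)
```

A negative net effect in such a comparison would correspond to the case described above, where the retention effort itself provokes churn.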
Features -
For data analysis, the system aggregated several kinds of telecom data, such as billing data,
calls/SMS/internet usage data, and complaints-related data. Data analytics techniques were applied
on top of this system data, but some of the models failed to give good results with it. In contrast,
the data sources that are huge in size were ignored because of the complexity of dealing with them.
Some systems were not able to acquire, store, and process that amount of data at the same time
because of their handling limitations. In addition, the data sources were of different types, and
gathering them in one data handling system is a very hard process, so adding new features for data
analytics algorithms requires a long time, high processing power, and more storage capacity. These
difficulties in system data handling can be overcome with an upgraded system that uses
distributed processing of data.
Many types of telecom data are used to build the churn model. These types are classified
as follows:
1. Customer data: It contains all data related to the customer's services and contract
information, in addition to all offers, packages, and services subscribed to by the customer.
It also contains information generated from the CRM system (all customer
GSMs, type of subscription, birthday, gender, address etc.).
4. Call detail records (CDRs): These contain all charging information about calls, SMS, MMS,
and internet transactions made by customers. This data source is generated as text files.
5. Mobile IMEI information: It contains the brand, model, and type of the mobile phone and whether
it is a dual- or mono-SIM device. This data is large and very detailed. We spent a lot of time
understanding it, its sources, and its storage format.
6. In addition to these records, the data must be linked to the detailed data stored in relational
databases that contain detailed information about the customer. The nine months of data sets
contained about ten million customers, and the total number of columns is about ten thousand.
The collected data was full of columns, since there is a column for each service,
product, and offer related to calls, SMS, MMS, and internet, in addition to columns related
to personal and demographic information. If we need to use all these data sources, the
number of columns for each customer before the data is processed will exceed ten
thousand.
Applications/ examples:
I needed labeled data for testing, so I contacted experts from the marketing section to
provide me with a labeled sample of GSM data. They provided me with some prepaid customers who
were in the idle phase after 2 months of the nine months of data, and these were considered
churners. The other, non-churned customers were labeled as active customers (customers acquired
in the last 4 months were excluded). The total count of the sample was 5 million customers,
containing 300,000 churned customers and 4,700,000 active customers. The figure above shows the
periods of historical data and the future period when the customer may leave the company.
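The class balance of this labeled sample follows directly from the counts above. The short calculation below only restates those counts; the variable names are illustrative.

```r
# Counts as stated in the text
total_customers   <- 5000000
churned_customers <- 300000
active_customers  <- 4700000

churn_share        <- churned_customers / total_customers   # share of churners
active_per_churner <- active_customers / churned_customers  # imbalance ratio

print(churn_share)
print(round(active_per_churner, 2))
```

Churners make up a small share of the sample, so accuracy alone can look high for a model that rarely predicts churn.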
Flow charts/ block diagrams -
CHAPTER 5: DATA ANALYSIS AND INTERPRETATION
Customer churn, also known as customer attrition, is the loss of clients or customers.
Churn is an important business metric for subscription-based services such as telecom
companies. This project demonstrates a churn analysis using data downloaded from IBM sample
data sets. We will use the R statistical programming language in order to identify variables
associated with customer churn.
To identify the potential drop in subs and DIU subs (35% drop in ARPU) in the pre-paid
segment for the identified circle/sector
To automate the generation of a weekly report which ranks subscribers on the basis of their
propensity to drop within the next 30 days
1. Loading in data and R libraries
We begin by loading the R libraries we need for the project.
library(plyr)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(rpart)
library(rpart.plot)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:randomForest':
##
## combine
Now let's read in the data. You will need to modify the file name within the quotation marks to
match the location of the data file on your computer. The first few rows are depicted below.
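The reading step itself is not shown in this report. A minimal sketch of it, assuming the data is a CSV file, is given below; to keep the example runnable, it first writes a two-row stand-in file to a temporary path, which is a hypothetical substitute for the downloaded IBM file.

```r
# Hypothetical stand-in for the downloaded file: write two rows to a temp CSV
csv_path <- tempfile(fileext = ".csv")
writeLines(c("customerID,tenure,TotalCharges,Churn",
             "7590-VHVEG,1,29.85,No",
             "3668-QPYBK,2,108.15,Yes"), csv_path)

# Replace csv_path with the location of the real data file on your computer
dat <- read.csv(csv_path, stringsAsFactors = TRUE)
str(dat)
```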
#Examine data
head(dat)
## customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 7590-VHVEG Female 0 Yes No 1 No
## 2 5575-GNVDE Male 0 No No 34 Yes
## 3 3668-QPYBK Male 0 No No 2 Yes
## 4 7795-CFOCW Male 0 No No 45 No
## 5 9237-HQITU Female 0 No No 2 Yes
## 6 9305-CDSKC Female 0 No No 8 Yes
## MultipleLines InternetService OnlineSecurity OnlineBackup
## 1 No phone service DSL No Yes
## 2 No DSL Yes No
## 3 No DSL Yes Yes
## 4 No phone service DSL Yes No
## 5 No Fiber optic No No
## 6 Yes Fiber optic No No
## DeviceProtection TechSupport StreamingTV StreamingMovies Contract
## 1 No No No No Month-to-month
## 2 Yes No No No One year
## 3 No No No No Month-to-month
## 4 Yes Yes No No One year
## 5 No No No No Month-to-month
## 6 Yes No Yes Yes Month-to-month
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 1 Yes Electronic check 29.85 29.85
## 2 No Mailed check 56.95 1889.50
## 3 Yes Mailed check 53.85 108.15
## 4 No Bank transfer (automatic) 42.30 1840.75
## 5 Yes Electronic check 70.70 151.65
## 6 Yes Electronic check 99.65 820.50
## Churn
## 1 No
## 2 No
## 3 Yes
## 4 No
## 5 Yes
## 6 Yes
The IBM sample data set website gives the following data dictionary, or description of the
variables:
customerID: Customer ID
gender: Customer gender (female, male)
SeniorCitizen: Whether the customer is a senior citizen or not (1, 0)
Partner: Whether the customer has a partner or not (Yes, No)
Dependents: Whether the customer has dependents or not (Yes, No)
tenure: Number of months the customer has stayed with the company
PhoneService: Whether the customer has a phone service or not (Yes, No)
MultipleLines: Whether the customer has multiple lines or not (Yes, No, No phone
service)
InternetService: Customer’s internet service provider (DSL, Fiber optic, No)
OnlineSecurity: Whether the customer has online security or not (Yes, No, No internet
service)
OnlineBackup: Whether the customer has online backup or not (Yes, No, No internet
service)
DeviceProtection: Whether the customer has device protection or not (Yes, No, No
internet service)
TechSupport: Whether the customer has tech support or not (Yes, No, No internet
service)
StreamingTV: Whether the customer has streaming TV or not (Yes, No, No internet
service)
StreamingMovies: Whether the customer has streaming movies or not (Yes, No, No
internet service)
Contract: The contract term of the customer (Month-to-month, One year, Two year)
PaperlessBilling: Whether the customer has paperless billing or not (Yes, No)
PaymentMethod: The customer’s payment method (Electronic check, Mailed check,
Bank transfer (automatic), Credit card (automatic))
MonthlyCharges: The amount charged to the customer monthly
TotalCharges: The total amount charged to the customer
Churn: Whether the customer churned or not (Yes or No)
2. Data pre-processing
Let’s begin by seeing if there is any missing data.
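One common way to check for missing data is to count the NA values in each column. The sketch below uses a small hypothetical stand-in for `dat`, since the original check is not shown in this report.

```r
# Toy stand-in for dat (values are illustrative only)
dat_toy <- data.frame(
  tenure         = c(1, 34, 0, 45),
  MonthlyCharges = c(29.85, 56.95, 52.55, 42.30),
  TotalCharges   = c(29.85, 1889.50, NA, 1840.75)
)

# Number of missing values in each column
na_counts <- sapply(dat_toy, function(x) sum(is.na(x)))
print(na_counts)
```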
There are 11 cases with missing values in the TotalCharges variable. Let's see these particular
cases.
dat[is.na(dat$TotalCharges),]
## customerID gender SeniorCitizen Partner Dependents tenure
## 489 4472-LVYGI Female 0 Yes Yes 0
## 754 3115-CZMZD Male 0 No Yes 0
## 937 5709-LVOEQ Female 0 Yes Yes 0
## 1083 4367-NUYAO Male 0 Yes Yes 0
## 1341 1371-DWPAZ Female 0 Yes Yes 0
## 3332 7644-OMVMY Male 0 Yes Yes 0
## 3827 3213-VVOLG Male 0 Yes Yes 0
## 4381 2520-SGTTA Female 0 Yes Yes 0
## 5219 2923-ARZLG Male 0 Yes Yes 0
## 6671 4075-WKNIU Female 0 Yes Yes 0
## 6755 2775-SEFEE Male 0 No Yes 0
## PhoneService MultipleLines InternetService OnlineSecurity
## 489 No No phone service DSL Yes
## 754 Yes No No No internet service
## 937 Yes No DSL Yes
## 1083 Yes Yes No No internet service
## 1341 No No phone service DSL Yes
## 3332 Yes No No No internet service
## 3827 Yes Yes No No internet service
## 4381 Yes No No No internet service
## 5219 Yes No No No internet service
## 6671 Yes Yes DSL No
## 6755 Yes Yes DSL Yes
## OnlineBackup DeviceProtection TechSupport
## 489 No Yes Yes
## 754 No internet service No internet service No internet service
## 937 Yes Yes No
## 1083 No internet service No internet service No internet service
## 1341 Yes Yes Yes
## 3332 No internet service No internet service No internet service
## 3827 No internet service No internet service No internet service
## 4381 No internet service No internet service No internet service
## 5219 No internet service No internet service No internet service
## 6671 Yes Yes Yes
## 6755 Yes No Yes
## StreamingTV StreamingMovies Contract PaperlessBilling
## 489 Yes No Two year Yes
## 754 No internet service No internet service Two year No
## 937 Yes Yes Two year No
## 1083 No internet service No internet service Two year No
## 1341 Yes No Two year No
## 3332 No internet service No internet service Two year No
## 3827 No internet service No internet service Two year No
## 4381 No internet service No internet service Two year No
## 5219 No internet service No internet service One year Yes
## 6671 Yes No Two year No
## 6755 No No Two year Yes
## PaymentMethod MonthlyCharges TotalCharges Churn
## 489 Bank transfer (automatic) 52.55 NA No
## 754 Mailed check 20.25 NA No
## 937 Mailed check 80.85 NA No
## 1083 Mailed check 25.75 NA No
## 1341 Credit card (automatic) 56.05 NA No
## 3332 Mailed check 19.85 NA No
## 3827 Mailed check 25.35 NA No
## 4381 Mailed check 20.00 NA No
## 5219 Mailed check 19.70 NA No
## 6671 Mailed check 73.35 NA No
## 6755 Bank transfer (automatic) 61.90 NA No
Inspection of the Churn variable shows that these are all still subscribing customers. What
proportion of our sample is this subset with missing values?
sum(is.na(dat$TotalCharges))/nrow(dat)
## [1] 0.001561834
This subset is 0.16% of our data and is quite small. We will remove these cases in order to
accommodate our further analyses. Let’s call this cleaned data datc.
The SeniorCitizen variable is coded '0/1' rather than yes/no. We can recode this to ease our
interpretation of later graphs and models.
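The two cleaning steps described above, removing the incomplete cases and recoding SeniorCitizen, can be sketched in base R as follows. The data frame is a hypothetical stand-in for `dat`, and the original code for these steps may have differed.

```r
# Toy stand-in for dat (values are illustrative only)
dat_toy <- data.frame(
  SeniorCitizen = c(0, 1, 0, 0),
  TotalCharges  = c(29.85, 1889.50, NA, 108.15)
)

# Keep only rows without missing values
datc_toy <- dat_toy[complete.cases(dat_toy), ]

# Recode 0/1 to No/Yes and store as a factor
datc_toy$SeniorCitizen <- factor(ifelse(datc_toy$SeniorCitizen == 1, "Yes", "No"))
print(datc_toy)
```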
The MultipleLines variable is dependent on the PhoneService variable, where a ‘no’ for the latter
variable automatically means a ‘no’ for the former variable. We can again further ease our
graphics and modeling by recoding the ‘No phone service’ response to ‘No’ for
the MultipleLines variable.
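# Columns 10 to 15 (OnlineSecurity through StreamingMovies) depend on
# InternetService in the same way, so "No internet service" is recoded to "No".
# mapvalues() comes from the plyr package loaded earlier.
for (i in 10:15) {
  datc[, i] <- as.factor(mapvalues(datc[, i],
                                   from = c("No internet service"),
                                   to = c("No")))
}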
We will not need the customerID variable for graphs or modeling, so it can be removed.
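The MultipleLines recode and the removal of customerID described above can be sketched in base R. The data frame below is a hypothetical stand-in for `datc`; the original code for these two steps is not shown in this report.

```r
# Toy stand-in for datc (values are illustrative only)
datc_toy <- data.frame(
  customerID    = c("7590-VHVEG", "5575-GNVDE", "3668-QPYBK"),
  PhoneService  = c("No", "Yes", "Yes"),
  MultipleLines = c("No phone service", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Collapse "No phone service" into "No", then store as a factor
datc_toy$MultipleLines[datc_toy$MultipleLines == "No phone service"] <- "No"
datc_toy$MultipleLines <- as.factor(datc_toy$MultipleLines)

# Drop the identifier column, which is not needed for graphs or modeling
datc_toy$customerID <- NULL
print(datc_toy)
```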
3. Data visualization of descriptive statistics
Before we begin modeling our data, let’s examine descriptive statistics of our data within some
sample plots.
#Gender plot
p1 <- ggplot(datc, aes(x = gender)) +
  geom_bar() +
  geom_text(aes(y = ..count.. - 200,
                label = paste0(round(prop.table(..count..), 4) * 100, '%')),
            stat = 'count',
            position = position_dodge(.1),
            size = 3)
#Partner plot
p3 <- ggplot(datc, aes(x = Partner)) +
  geom_bar() +
  geom_text(aes(y = ..count.. - 200,
                label = paste0(round(prop.table(..count..), 4) * 100, '%')),
            stat = 'count',
            position = position_dodge(.1),
            size = 3)
#Dependents plot
p4 <- ggplot(datc, aes(x = Dependents)) +
  geom_bar() +
  geom_text(aes(y = ..count.. - 200,
                label = paste0(round(prop.table(..count..), 4) * 100, '%')),
            stat = 'count',
            position = position_dodge(.1),
            size = 3)
From these demographic plots, we notice that the sample is evenly split across gender and
partner status. A minority of the sample are senior citizens, and a minority have dependents.
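The percentage labels in these plots come from applying prop.table() to the category counts. The same figures can be produced as a plain table, which is a quick way to check the plots. The sketch below uses a hypothetical five-value vector in place of a column of `datc`.

```r
# Hypothetical values standing in for a categorical column such as datc$gender
gender <- c("Female", "Male", "Male", "Female", "Male")

counts <- table(gender)
pct    <- round(prop.table(counts), 4) * 100   # same rounding as the plot labels
labels <- paste0(pct, "%")

print(counts)
print(labels)
```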
#Phone service plot
p5 <- ggplot(datc, aes(x = PhoneService)) +
  geom_bar() +
  geom_text(aes(y = ..count.. - 200,
                label = paste0(round(prop.table(..count..), 4) * 100, '%')),
            stat = 'count',
            position = position_dodge(.1),
            size = 3)
position = position_dodge(.1),
size = 3)
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)
Most of the sample have phone service with a single phone line. Fiber optic internet connection
is more popular than DSL internet service, and each online service has a minority of users.
The remaining categorical variables are related to contract and payment status.
Roughly half of the sample are on month-to-month contracts with the remaining split between
one and two year contracts. Most of the sample are on paperless billing, and pay by electronic
check.
#Tenure histogram
p17 <- ggplot(datc, aes(x = tenure)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Months",
       title = "Tenure Distribution")
title = "Monthly charges Distribution")
The tenure variable is stacked at the tails, so a large proportion of customers have had either
the shortest (1 month) or the longest (72 months) tenure.
Roughly a quarter of our sample (around 26.58%) are no longer customers. Below, we try to
predict which customers churn using classification modelling techniques.
4. Statistical modelling
In this section, models are fitted with classification techniques, and their fitted parameters
are used to make predictions on the test subset. We will examine accuracy in terms of both the
percentage of correct predictions and confusion matrices.
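Making predictions on a test subset presupposes that the data was split beforehand. The sketch below shows one simple way to do this in base R. The 70/30 ratio, the seed, and the toy data frame are all assumptions for illustration, since the split used in this study is not stated here.

```r
set.seed(56)   # arbitrary seed, only for reproducibility of the sketch

# Hypothetical data: 100 customers, 26 of them churners
n   <- 100
toy <- data.frame(id    = seq_len(n),
                  Churn = factor(rep(c("No", "Yes"), times = c(74, 26))))

# Assumed 70/30 split into training and test subsets
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

print(c(train = nrow(train), test = nrow(test)))
```

The caret package loaded earlier also provides createDataPartition(), which keeps the churn proportion similar in both subsets and may be preferable for imbalanced outcomes.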
From this decision tree, we can interpret the following:
The contract variable is the most important. Customers with month-to-month contracts
are more likely to churn.
Customers with DSL internet service are less likely to churn.
Customers who have stayed longer than 15 months are less likely to churn.
Now let’s assess the prediction accuracy of the decision tree model by investigating how well it
predicts churn in the test subset. We will begin with the confusion matrix, which is a useful
display of classification accuracy. It displays the following information:
true positives (TP): These are cases in which we predicted yes (they churned), and they
did churn.
true negatives (TN): We predicted no, and they didn’t churn.
false positives (FP): We predicted yes, but they didn’t actually churn. (Also known as a
“Type I error.”)
false negatives (FN): We predicted no, but they actually churned. (Also known as a
“Type II error.”)
Let’s examine the confusion matrix for our decision tree model.
The diagonal entries give our correct predictions, with the upper left being TN and the lower
right being TP. The upper right gives the FN while the lower left gives the FP. From this
confusion matrix, we can see that the model performs well at predicting non-churning customers
(1466 correct vs. 82 incorrect) but does not perform as well at predicting churning customers
(232 correct vs. 328 incorrect).
The decision tree model is fairly accurate, correctly predicting the churn status of customers in
the test subset 80.55% of the time.
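The accuracy figure can be reproduced from the four counts quoted above. The sketch below rebuilds the confusion matrix from those counts, assuming rows are predictions and columns are actual outcomes as described, and adds sensitivity and specificity, which are not reported in the text but follow from the same counts.

```r
# Counts quoted above for the decision tree on the test subset
TN <- 1466; FP <- 82; TP <- 232; FN <- 328

# matrix() fills column by column: first the "Actual No" column, then "Actual Yes"
conf_mat <- matrix(c(TN, FP, FN, TP), nrow = 2,
                   dimnames = list(Predicted = c("No", "Yes"),
                                   Actual    = c("No", "Yes")))

accuracy    <- sum(diag(conf_mat)) / sum(conf_mat)   # share of correct predictions
sensitivity <- TP / (TP + FN)                        # churners correctly identified
specificity <- TN / (TN + FP)                        # non-churners correctly identified

print(conf_mat)
print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 4))
```

The gap between specificity and sensitivity is the numerical form of the observation that the model predicts non-churning customers better than churning ones.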
5. Data visualization based on models
Our modelling efforts pointed to several important churn predictors: contract status,
internet status, tenure length, and total charges. Let’s analyse how these variables split by churn
status.
Churn status by contract status: - We will begin with the contract status variable.
p21
As would be expected, the churn rate of month-to-month contract customers is much higher than
the longer contract customers. Customers who are more willing to commit to longer contracts are
less likely to leave.
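The churn rate per contract type behind a plot like this can be tabulated directly. The sketch below uses ten hypothetical customers in place of `datc`; the resulting rates are illustrative and are not the rates observed in the data.

```r
# Hypothetical customers (illustrative values only)
toy <- data.frame(
  Contract = c(rep("Month-to-month", 4), rep("One year", 4), rep("Two year", 2)),
  Churn    = c("Yes", "Yes", "No", "No",
               "Yes", "No", "No", "No",
               "No", "No")
)

# Row-wise proportions: share of churners within each contract type
tab         <- table(toy$Contract, toy$Churn)
churn_rates <- prop.table(tab, margin = 1)[, "Yes"]
print(churn_rates)
```

The same two lines applied to InternetService instead of Contract give the split used in the next comparison.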
Churn status by internet service status: - Analysis for the internet service status of the
customer.
p22
It appears as if customers with internet service are more likely to churn than those without it.
This is more pronounced for customers with fiber optic internet service, who are the most likely
to churn.
Churn status by tenure length distribution status: - Analysis for the tenure length
distribution status of the customer.
There is a large spike at 1 month, indicating that a large portion of customers leave
after just one month of service.
Churn status by distribution for total charges split by churn: - Analysis for the distribution
for total charges split by churn.
Similar to the tenure trend, customers who have spent more with the company tend not to
leave. This could just be a reflection of the tenure effect, or it could be due to financial
characteristics of the customer: customers who are more financially well off are less likely to
leave.
6. Modelling Objectives -
To identify the potential DIU subs (35% drop in ARPU) in the pre-paid segment for the
identified circle/sector
To automate the generation of a weekly report which ranks subscribers on the basis of their
propensity to drop within the next 30 days
[Figure: observation window MN followed by performance window MN+1]
» DIU is defined as drop in Customer revenue of more than 35% over the performance
window period of 30 days (MN+1)
» Day of prediction is day 9 (considering DW at D-2)
Evaluation Parameters –
[Table: 2 x 2 grid of cells A, B, C, D with "Drop predicted" (No/Yes) on one axis; only the row "No: A, C" survives]
Hit rate = (B) / (B+D)
Lift = (Hit Rate) / (Prediction Accuracy achieved through Random Selection)
[Figure: lift chart comparing Random Target and Modelled Target across deciles D1 to D10]
Assume Base showing >35% Drop in M2 = 20%
Thus Prediction Accuracy achieved by Random Base Selection = 20%
If Model Hit Rate = 60%
Then, Lift = 3
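The lift arithmetic and the DIU definition above can be checked with a short calculation. The hit rate and base rate are the figures from the worked example in the text; the revenue vectors are hypothetical and only illustrate how the 35% threshold would flag a subscriber.

```r
# Figures from the worked example above
base_drop_rate <- 0.20   # accuracy of random selection
model_hit_rate <- 0.60

lift <- model_hit_rate / base_drop_rate
print(lift)

# DIU flag: revenue drop of more than 35% between the two windows
# (hypothetical revenues for three subscribers)
rev_observation <- c(100, 200, 50)
rev_performance <- c(60, 150, 50)

drop_pct <- (rev_observation - rev_performance) / rev_observation
is_diu   <- drop_pct > 0.35
print(data.frame(drop_pct, is_diu))
```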
Data and ABT Creation
[Embedded worksheet: Predictive Model - Churn_v1 2.xlsx]
Modeling Techniques
CHAID (Decision Tree) -
Easy to understand and implement
Does not consider a lot of interaction between variables
Subscribers are not scored individually
Models are compared on hit rate, penetration rate, misclassification, and lift to select the best model
Case Presentation: Maharashtra Circle Churn
Circle id = 13 (Maharashtra)
Super, Super +
[Figure: observation window (MN) followed by performance window (MN+1)]
Featured variables in the model
Feature | Description | Impact
LINEAR_MOD_7D_TO_30D | 07 days- in linear Model | -
CRREVM2M1_MOD | Change in Cust_rev_M1 over M2; floored at 0 | +
SD_CUST_REV_36D_4D_AVG | Standard deviation of customer revenue of 9 groups with mean of 4 days each for last 36 days | +
CNT_DIST_CALLS_OG_WK4 | Count of distinct outgoing call numbers in the last seven days | -
SLOPE_DEC_LAST_37DAYS | Rate of change of decrement in last 37 days | +
MAX_MainBal_4DAY | Maximum of the MAIN_BAL_INR in the last 4 days | -
SLOPE_DEC_MOU_LST7D | Rate of change of MOU in last 7 days | -
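Two of the feature types listed above can be sketched from a daily series. The example below is an interpretation of the descriptions, not the study's actual derivation: the daily revenue vector is hypothetical, and the "rate of change" is computed here as a least-squares slope, which is an assumption.

```r
# Hypothetical daily customer revenue for the last 36 days, oldest first
daily_rev <- as.numeric(1:36)

# Standard deviation across 9 consecutive groups of 4-day means
group_id    <- rep(1:9, each = 4)
group_means <- tapply(daily_rev, group_id, mean)
sd_cust_rev_36d_4d_avg <- sd(group_means)

# Rate of change, taken here as the slope of a linear fit against the day index
day_index <- seq_along(daily_rev)
slope     <- unname(coef(lm(daily_rev ~ day_index))[2])

print(c(sd_feature = sd_cust_rev_36d_4d_avg, slope = slope))
```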
Case Presentation: Mumbai Circle Churn
Filter criteria for the subscriber base
Circle id = 15 (Mumbai)
Prepaid Retail subscribers
AON > 90 days
Subscribers who are active as on the day of churn prediction
Super, Super +
MOU > 150 in last 30 days
VLR Days >15 in last 30 days
Total Population - ~9L
70% – 30% split considered for development and validation
DIU Rate of the Population – 21.0 %
[Figure: observation window (MN) followed by performance window (MN+1)]
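Filter criteria of this kind translate directly into a row filter on the subscriber table. The sketch below applies the numeric thresholds listed for the Mumbai circle to a hypothetical four-row table; all column names are assumptions, and the "Super, Super +" criterion is left out because its definition is not given here.

```r
# Hypothetical subscriber table (column names and values are illustrative)
subs <- data.frame(
  circle_id    = c(15, 15, 15, 13),
  segment      = c("Prepaid Retail", "Prepaid Retail", "Postpaid", "Prepaid Retail"),
  aon_days     = c(120, 60, 400, 200),
  active       = c(TRUE, TRUE, TRUE, TRUE),
  mou_30d      = c(300, 500, 200, 900),
  vlr_days_30d = c(20, 25, 28, 30)
)

# Circle 15, prepaid retail, AON > 90 days, active, MOU > 150, VLR days > 15
model_base <- subset(subs,
                     circle_id == 15 & segment == "Prepaid Retail" &
                       aon_days > 90 & active &
                       mou_30d > 150 & vlr_days_30d > 15)
print(model_base)
```

The filtered base would then be divided 70% to 30% into development and validation sets, as stated above.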
Featured variables in the model
Case Presentation: UPW Churn
Filter criteria for the subscriber base
Circle id = 22 (UPW)
Prepaid Retail subscribers
AON > 90 days
Subscribers who are active as on the day of churn prediction
Super +
MOU > 400 in last 30 days
VLR Days >15 in last 30 days
Total Population - ~5L
70% – 30% split considered for development and validation
DIU Rate of the Population – 27.5 %
[Figure: observation window (MN) followed by performance window (MN+1)]
Featured variables in the model
CHAPTER 6: FINDINGS
Customers with month-to-month contracts are more likely to churn than customers with
longer-term contracts.
Customers with internet service, in particular fiber optic service, are more likely to churn.
Customers who have been with the company longer or have paid more in total are less
likely to churn.
The decision tree model is compared on hit rate, penetration rate, misclassification, and
lift to select the best model. It had a better false positive rate and was more accurate
overall.
With the help of the above modelling technique customer churn is controlled which has
resulted in revenue uplift and customer satisfaction.
With the help of the above modelling techniques it has become easier for the product
department to design and develop products for the customers who are likely to churn.
CHAPTER 7: SUGGESTION & RECOMMENDATIONS
1. To make the most of customer churn analysis, a proper analytical method
is necessary.
2. It often happens that the wrong model is chosen, and as a result customer churn is not
brought under control in time, resulting in a drop in usage and a reduction in revenue.
3. To avoid these consequences, a proper analytics model should be used from the beginning.
4. Before starting customer churn analytics, the following points should be considered.
CHAPTER 8: CONCLUSION
After going through various preparatory steps, including data/library loading and pre-processing,
we applied three statistical classification methods common in churn analysis. We
identified several important churn predictor variables from these models and evaluated the
models on accuracy measures.
REFERENCES
[1] Liao, Shu-Hsien, Pei-Hui Chu, and Pei-Yuan Hsiao. "Data mining techniques and
applications–A decade review from 2000 to 2011." Expert Systems with Applications 39, no. 12
(2012): 11303-11311.
[2] Kamalraj, N., and A. Malathi. "A survey on churn prediction techniques in communication
sector." International Journal of Computer Applications 64, no. 5 (2013).
[3] N. Hashmi, N. Butt and M. Iqbal. Customer Churn Prediction in Telecommunication A Decade
Review and Classification. International Journal of Computer Science Vol. 10(5), 2013
[4] V. Umayaparvathi, K. Iyakutti, “ Applications of Data Mining Techniques in Telecom Churn
Prediction”, International Journal of Computer Applications, Vol. 42, No.20, 2012
[5] V. Umayaparvathi, K. Iyakutti, “Attribute Selection and Customer Churn Prediction in
Telecom Industry”, Proceedings of the IEEE International Conference On Data Mining and
Advanced Computing, 2016 (to appear).
[6] Huang, Bingquan, Mohand Tahar Kechadi, and Brian Buckley. "Customer churn prediction in
telecommunications." Expert Systems with Applications 39, no. 1 (2012): 1414-1425
[7] Shaaban, Essam, Yehia Helmy, Ayman Khedr, and Mona Nasr. "A proposed churn prediction
model." IJERA 2 (2012): 693-697.
[8] Jain, Dipak, and Siddhartha S. Singh. "Customer lifetime value research in marketing: A
review and future directions." Journal of interactive marketing 16, no. 2 (2002): 34-46.
[9] Gupta, Sunil, and Valarie Zeithaml. "Customer metrics and their impact on financial
performance." Marketing Science 25, no. 6 (2006): 718-739.
[10] Yihui, Qiu, and Mi Hong. "Application of Feature Extraction method in customer churn
prediction based on Random Forest and Transduction." Journal of Convergence Information
Technology 5, no. 3 (2010): 73-78.
[11] https://en.wikipedia.org/