
Project Report MBA Sem III


A

Project Report

On

“A STUDY OF CUSTOMER ANALYTICS AND CHURN CONTROL BY
BUILDING AN ANALYTICAL MODEL FOR TELECOM COMPANY”

As

“DESKTOP RESEARCH”

By

PRASHANT D KUMBHAR

Under the guidance of

PROF. DR. PRAJAKTA WARALE

Submitted to

Savitribai Phule Pune University

In partial fulfillment of the requirement for the award of the degree of

M.B.A

Batch – 2019 - 2021

Through

Rajgad Institute of Management Research and Development, Pune - 411043

Page | 1
Annexure B

Institute’s Certificate will be issued by your guide after completion of the SIP.

Page | 2
Page | 3
ANNEXURE D

DECLARATION OF STUDENT

(CERTIFICATE OF ORIGINALITY/DECLARATION)

This is to declare that I have carried out this project work myself in partial fulfillment of the
MBA Program of Savitribai Phule Pune University.

The work is original, has not been copied from anywhere else and not been submitted to any
other University/Institute for an award of any degree/diploma.

Date Signature

Place: Pune Name: Prashant D Kumbhar

Page | 4
ANNEXURE E

DECLARATION OF GUIDE

This is to certify that the work incorporated in this Project Report A STUDY OF CUSTOMER
ANALYTICS AND CHURN CONTROL BY BUILDING AN ANALYTICAL MODEL FOR
TELECOM COMPANY submitted by PRASHANT DATTATRAY KUMBHAR is his original
work and completed under my guidance. Material obtained from other sources has been duly
acknowledged in the Project Report.

DATE SIGNATURE OF GUIDE

PLACE: PUNE

Page | 5
ANNEXURE F

ACKNOWLEDGEMENT

It gives me great pleasure to express my deepest gratitude to the people without whom
this project would never have been completed. They not only mentored me but also
made sure that this project became a quality piece of study; it is only their creative ideas,
mentoring, constructive criticism and guidance that have made the project
meaningful and a well-thought-out piece of work.

It is a privilege for me to express my deepest gratitude to Prof. Dr. Prajakta Warale, my
Faculty Guide from Rajgad Institute of Management Research and Development, my mentor and
undoubtedly the mainstay behind this project. It has been an honor to work under her.
Her versatile viewpoint and understanding of the subject matter, her guidance, her constructive
criticism and, above all, the motivation and faith she showed helped me to stay
focused and work logically during the course of the study. I sincerely thank our honorable director,
Prof. Dr. D. B. Bharati, for their valuable support.

Thank you,
With Regards,
Prashant D Kumbhar

Page | 6
Table of Contents

Chapter   Name of the Chapter                           Page
Number                                                  Number

1         Executive Summary                             8
          a. Abstract
          b. Objectives of the Study
          c. Scope of Study
          d. Need of the Study
          e. Limitations of the Study
2         Company Profile / Organizational Profile      12
3         Research Methodology                          17
4         Theoretical Concepts                          21
5         Data Analysis & Interpretation                41
6         Learning of Students (Findings)               73
7         Contribution to Host Organization             74
          (Suggestion / Recommendations)
8         Conclusion                                    75
          References                                    76

Page | 7
1 CHAPTER 1: EXECUTIVE SUMMARY

1.1 Abstract

We live in a highly competitive world of communication technologies. Customer churn is a
major issue that almost all telecommunication companies in the world now face. In
telecommunication, churn is defined as customers leaving the company and discarding its
services, either because they are dissatisfied with those services or because another network
provider has a better offering within the customer’s budget. This leads to a potential loss of
revenue for the company, and retaining customers has become a challenging task. Companies are
therefore introducing new state-of-the-art applications and technologies to offer their customers
the best possible service and so retain them. Before doing so, it is necessary to identify in
advance those customers who are likely to leave the company in the near future, because losing
them would result in a significant loss of profit for the company. This process is called churn
prediction.
The construction of an effective churn prediction model is a significant task that involves
a great deal of research, from identifying the optimal predictor variables (features) in the
large volume of available customer data to selecting an effective predictive data mining
technique suited to that feature set. Telecom Service Providers (TSPs) collect a
voluminous amount of customer data, such as customer profiles, calling patterns and
demographic data, in addition to the network data that customers generate. Based on the history
of customers’ calling patterns and behaviour, it is possible to identify whether they intend to
leave or stay.
Research carried out over the past decade has found data analytics techniques to be
effective for churn prediction, and predictive modelling techniques in particular are
often found to be accurate. This study reviews the existing work on churn prediction from
three perspectives: datasets, methods, and metrics. Firstly, it details
the availability of public datasets and the kinds of customer details available in each
dataset for predicting customer churn. Secondly, it compares and contrasts the various predictive
modelling methods that have been used in the literature for predicting churners using
different categories of customer records, and then quantitatively compares their performance.
Finally, it summarizes the kinds of performance metrics that have been used to evaluate the
existing churn prediction methods. Analysing all three perspectives is crucial for
developing a more efficient churn prediction system for TSPs.

1.2 OBJECTIVES OF THE STUDY

In this competitive world, markets are becoming highly saturated. The field of
telecommunication in particular faces complex challenges because of the number of vibrant,
competing service providers, which makes it very difficult for them to retain existing customers.
Since the cost of acquiring new customers is much higher than the cost of retaining existing ones,
it is time for telecom service providers to take the necessary steps to retain customers and
stabilize their market value. In the past, several data mining techniques have been proposed in
the literature for predicting churners using heterogeneous customer records. This SIP, carried out
under the Desktop Research method, reviews the different categories of customer data available in
open datasets, and the predictive models and performance metrics used for churn prediction and
control in a telecom company.

• To analyze customers who are showing a drop in usage
• To target these subscribers with various promotional activities
• To analyze and help enhance the revenue from these customers
• To study and help control customer churn

Page | 9
1.3 SCOPE OF THE STUDY

The focus of TSPs has long since shifted from product-centric to customer-centric. With
digitization, customers are well aware of the services available in the market, forcing telecom
companies to invest in new technologies and advanced analytics to understand the needs of
their customers and improve customer experience. However, today’s customer wants more
than just understanding: a valuable relationship, which may come through more timely,
informed or relevant interactions.
The scope of the project covers the types of customers and their
usage patterns in the telecommunication domain, which can be analysed in the form of a model
that will help control customer churn and increase the revenue from this set
of customers through various marketing techniques.

1.4 NEED OF THE STUDY

Research carried out over the past decade has found data analytics techniques to be
effective for churn prediction, and predictive modelling techniques in particular are
often found to be accurate. This study reviews the existing work on churn prediction from
three perspectives: datasets, methods, and metrics. Firstly, it details
the availability of public datasets and the kinds of customer details available in each
dataset for predicting customer churn. Secondly, it compares and contrasts the various predictive
modelling methods that have been used in the literature for predicting churners using
different categories of customer records, and then quantitatively compares their performance.
Finally, it summarizes the kinds of performance metrics that have been used to evaluate the
existing churn prediction methods. Analysing all three perspectives is crucial for developing
a more efficient churn prediction system for Telecom Service Providers.

Page | 10
1.5 LIMITATIONS OF THE STUDY

1. Due to security reasons, there were some restrictions on sharing information.

2. The summer internship project was confined to a period of 60 days.

3. Because the data is confidential, actual figures could not be used for a proper research study.

4. Time constraint was one of the major limitations.

Page | 11
CHAPTER 2: COMPANY PROFILE/ ORGANISATION PROFILE

2.1 Introduction to Organization

Vodafone Idea Limited is an Aditya Birla Group and Vodafone Group partnership. It is
India’s leading telecom service provider. The Company provides pan-India Voice and Data
services across 2G, 3G and 4G platforms. With a large spectrum portfolio to support the growing
demand for data and voice, the company is committed to delivering delightful customer experiences
and contributing towards creating a truly ‘Digital India’ by enabling millions of citizens to connect
and build a better tomorrow. The Company is developing infrastructure to introduce newer and
smarter technologies, making both retail and enterprise customers future ready with innovative
offerings, conveniently accessible through an ecosystem of digital channels as well as extensive
on-ground presence.

2.1.1 Vision Mission Statement

Vision - Create world class digital experiences to connect and inspire every Indian to build a better
tomorrow

Mission –

• Customers - Be the most loved brand by continuously raising the bar in delivering simple,
delightful experience and meaningful innovations, through new age technologies

• Team - Be an inspirational, agile and exciting organisation that challenges the status quo,
and champions a diverse team that has a winning attitude and thrives on delivering
customer excellence

• Shareholders - Be the most valued company through smart leadership committed to
delivering sustainable growth, while adhering to the highest standards of governance and
compliance

• Community - Be the most respected company by leveraging technology and purposeful
innovation to catalyse social prosperity, digital literacy and inclusivity.

2.1.2 Key Achievement

CORPORATE AWARDS:

• Bagged the Amity Leadership Awards 2013 for leveraging IT in the Telecommunications
Sector
• Winner of the Citizen Journalist Awards 2013
• Idea CTO won the 'CTO of the Year Award' at Voice & Data Awards 2012
• 'Emerging Company of the Year Award': Economic Times Corporate Excellence Award
2009
• 'Mobile Operator of the Year - India': Asian Mobile News Awards in 2007 & '08

MARKETING AWARDS:

• Winner of ET Telecom Awards 2012 in the categories 'Excellence in Marketing' and
'Innovative Products'
• 'Golden Peacock Award 2008' for Most Innovative Product & Services

BRAND AWARDS:

• 3rd Best Client of the Year at EFFIES – January 2014
• 2 Gold Awards in the 'Integrated Advertising Campaign' and 'Services - Telecom &
Related products' categories for the Honey Bunny and Telephone Exchange campaigns; 1
Silver Award in the 'Services - Telecom & related products' category for Honey Bunny;
and 1 Bronze Award in the 'Best On-going campaign' category for the 'What an Idea!' series.
• Bagged awards for our various brand initiatives from two media houses, Exchange for
Media and Pitch, for excellence in media
• At the CNBC TV18 India Business Leader Awards 2013, Idea was awarded the 'Storyboard
Brand Campaign of the Year Award' for the Honey Bunny campaign.
• World Communication Awards 2012: 'Best Brand Campaign of the Year' for 3G
Population
• Idea won Gold for the 3G Population brand campaign at the 2012 APPIES, the Asia
Pacific Marketing Congress
• Best use of Social Media - India Social Case Campaign in 2012
• Bronze Awards for innovation in Ambient and TV Media in 2012
• Best Video Creative made for Internet/Mobile Media at the Exchange for Media Digital
Awards 2011
• Best use of Online Banner Advertising at the Digital Media Awards in 2011
• Yahoo Big Idea Chair Best Online Advertising in 2011
• The latest is the GOLD EFFIE for the MNP Campaign in 2011
• The same campaign won under the category 'Relationship building' at the Annual MMA
Global Awards in 2011
• Olive Crown Gold Award for the Green Brand of the Year at Goafest in 2011
• Silver at the Indian Digital Media Awards in 2011
• 'Most outstanding use of Radio in an Ad campaign' at the India Radio Forum in 2011
• 'Digital Brand of the Year': Campaign India Digital Media Awards, presented by
BBC.com in 2010
• 'Best Ad Campaign': Tele.Net Telecom Awards 2010
• Won 2 awards for 'Use Mobile, Save Paper' at the Effies in 2010
• '4th Buzziest Brand in India' in 2009, '10: agencyfaqs
• Silver at the Yahoo Big Chair in the 'Best use of Technology' category for the 'Language
translator Mobile App' in 2010
• Social Media Campaign of the Year at the WAT Awards in 2010
• Digital Media Campaign of the Year at the WAT Awards in 2010
• 'Best Celebrity Endorsement of the Year Award' with Brand Ambassador Abhishek
Bachchan in 2009: NDTV
• Featured amongst the country's prominent 'Powerbrands'
• Ranked 28th amongst all products and services brands, climbing 117 ranks over last year
• 'Break the Language Barrier' won many accolades across digital, radio, TV and media
innovation

2.1.3 Key Products or services

Vodafone Idea Cellular is a PAN-India integrated wireless broadband operator offering
2G, 3G and 4G services, and has its own National Long Distance (NLD) and International Long
Distance (ILD) operations, and Internet service provider (ISP) license.

Page | 14
2.1.4 SWOT Analysis of the company

Strength – 1> Increasing Revenue every quarter for the past 2 quarters

2> Company with Zero Promoter Pledge

Weakness – 1> Companies with growing costs YoY for long term projects
2> MFs decreased their shareholding last quarter
3> Inefficient use of shareholder funds - ROE declining in the last 2 years
4> Inefficient use of assets to generate profits - ROA declining in the last 2 years
5> Red Flag: Downgrade by Credit Rating Agency
6> Poor cash generated from core business - Declining Cash Flow from Operations
for last 2 years
7> Decline in Net Profit (QoQ)
8> Decline in Quarterly Net Profit (YoY)
9> Decline in Net Profit with falling Profit Margin (QoQ)
10> Decline in Quarterly Net Profit with falling Profit Margin (YoY)
11> Companies with High Debt
12> Degrowth in Quarterly Revenue and Profit in Recent Results
13> Low Piotroski Score: Companies with weak financials
14> Declining Net Cash Flow: Companies not able to generate net cash
15> Annual net profit declining for last 2 years
16> Recent Results: Fall in Quarterly Revenue and Net Profit (YoY)
17> Weak performer: Stock lost more than 20% in 1 month

Opportunities – 1> Brokers upgraded recommendation or target price in the past three months
2> Positive Breakout First Resistance (LTP > R1)
3> Highest Recovery from 52 Week Low
4> Stock with Low PE (PE < = 10)
Threats – Nil

Page | 15
2.2 Organizational Chart

Page | 16
CHAPTER 3: RESEARCH METHODOLOGY

1) Research Problem:

In the competitive telecom industry, public policies and the standardization of
mobile communication allow customers to switch easily from one carrier to another,
resulting in a strained, fluid market. With millions of transactions, the telecom industry
deals with an enormous amount of data every second. Churn prediction, the task of
identifying customers who are likely to discontinue use of a service, is an important and
lucrative concern for the telecom industry, so data analytics is becoming a
crucial part of the business. In the Indian telecom sector, with rising
competition and a low cost of migration, customers change their service
providers the moment they experience dissatisfaction with the service. According to a
survey, about 66% of TSPs (Telecommunications Service Providers) identified
customer-centric objectives for adopting data analytics as their organization’s top priority.
The aim of this research is to study and analyse customer churn prediction based on mobile data
usage volumes with respect to Quality of Experience and the user’s perspective with the help
of data analytics.

2) Hypothesis:

Through literature study, surveys and previous work, various discrepancies, challenges
and difficulties were identified. After data acquisition from an anonymous telecom
provider and an experimental survey by Mounika Reddy Chandiri, statistical analysis and data
analytics were carried out to draw conclusions about usage trends. The analysis is
expected to reveal correlations between the varying voice and data traffic,
churn risk and the quality of experience of users, together with the data analytics
indicators of churn. From the research work, a general relation between users’
satisfaction and users’ traffic volume is expected to be derived, so that Customer Churn
Control will uplift revenue.

Page | 17
3) Methods of data collection:

a) Primary data

Primary data is new data collected through personal observation and
personal interviews with team members. Because of the COVID-19 pandemic, however, it was
difficult to collect primary data by personally observing day-to-day
operations, so preference was given to working with secondary data.

b) Secondary data

The secondary data was collected through the internet, mainly by going
through internet articles and research articles published by other researchers. Some sample
data patterns were also referenced from the company database.

4) Measurement and scaling technique:

This research aims to study and analyze customer churn based on usage volumes with respect
to Quality of Experience and user’s perspective using data analytics. Three different datasets were
analyzed statistically and with the help of decision trees. Statistical analysis includes calculation
and analysis of Mean, Standard deviation, Autocorrelations and Confidence intervals. Decision
tree analysis includes data acquisition, data preparation that includes normalization, data
preprocessing, data extraction and finally decision making.

5) Sample Design:

i. Type of the Study - Semi-Experimental

ii. Sub-Type of Study - Semi-Experimental – Data Analysis

iii. Hypotheses Statements (Alternate and Null) - Customer Churn Control will uplift revenue

Page | 18
iv. Dependent and Independent Variables – Below is list of Sample Dependent Variable

v. Data Collection Methods - Secondary

vi. Sources of Data Collection - For Secondary Data, Customer usage & Recharge Data Dump

vii. Research Instrument - Customer usage & Recharge Data Dump collected as secondary data
from Telecom company using SQL from their system.

viii. Data Analysis Software (R/ SPSS/ Ms-Excel etc) – SAS Miner, SAS EG, MS-Excel

ix. Nature & Size of the Universe – Data Analysis is planned for 3 Telecom Circles whose
Customer Data Sample Size is 36 Lacs, 9 Lacs, 5 Lacs for MAG, MUM and UPW Circle
respectively.

x. Selected Sample Size (Using Scientific Formula) -

• Prepaid Retail subscribers
• AON (Age on Network) > 90 days
• Subscribers who are active as on the day of churn prediction
• Super +
• MOU (Minutes of Usage) > 400 in last 30 days
• VLR (Visitor Location Register) Days > 15 in last 30 days

Page | 19
xi. Sampling Technique (Name of the technique with Reason) –
Stratified Random Sampling: Subsets of the data sets or population are created based on a
common factor, and samples are randomly collected from each subgroup.

xii. Scale of the questionnaire - NA
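
As an illustration of items x and xi above, the following minimal Python sketch first applies the selection criteria and then draws a stratified random sample. It is a sketch under stated assumptions, not the procedure actually run in SAS: all field names and the synthetic records are hypothetical, the telecom circle is assumed to be the stratifying factor, the 10% sampling fraction is arbitrary, and the 'Super +' criterion is left out because it is not defined in this report.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Subscriber:
    # Hypothetical schema; the report does not list the actual column names.
    subscriber_id: int
    circle: str          # telecom circle, e.g. "MAG", "MUM", "UPW"
    prepaid_retail: bool
    active: bool         # active on the day of churn prediction
    aon_days: int        # age on network, in days
    mou_30d: float       # minutes of usage in the last 30 days
    vlr_days_30d: int    # days seen on the network (VLR) in the last 30 days


def is_eligible(s: Subscriber) -> bool:
    """Apply the criteria of item x (the undefined 'Super +' rule is omitted)."""
    return (s.prepaid_retail and s.active and s.aon_days > 90
            and s.mou_30d > 400 and s.vlr_days_30d > 15)


def stratified_sample(subscribers, fraction, seed=0):
    """Draw the same fraction at random from every circle (stratum)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    strata = defaultdict(list)
    for s in subscribers:
        strata[s.circle].append(s)
    sample = []
    for circle in sorted(strata):
        members = strata[circle]
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Synthetic population scaled down from the 36 : 9 : 5 lakh circle sizes,
# plus one subscriber with an AON of 30 days that the filter should drop.
population = [
    Subscriber(i, circle, True, True, 200, 500.0, 20)
    for i, circle in enumerate(["MAG"] * 360 + ["MUM"] * 90 + ["UPW"] * 50)
]
population.append(Subscriber(9999, "MAG", True, True, 30, 500.0, 20))

eligible = [s for s in population if is_eligible(s)]
sample = stratified_sample(eligible, fraction=0.10)
per_circle = {c: sum(1 for s in sample if s.circle == c)
              for c in ("MAG", "MUM", "UPW")}
print(len(eligible), per_circle)
```

Sampling the same fraction from each circle keeps the circle proportions of the sample equal to those of the eligible base, which is the property stratified sampling is chosen for.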

6) Statistical Technique:

Statistical Analysis is the science of collecting, exploring and presenting large amounts of
data to discover underlying patterns and trends. Telecom companies use statistics to optimize
network resources, improve service and reduce customer churn by gaining greater insight into
subscriber requirements.
Mean, Standard deviation, Standard error, Lag1 Autocorrelation, 95% Confidence
Intervals have been calculated for various combinations.

a) Mean: The mean is the statistical average of a dataset. It usually depicts the central value
of a set of numbers.

b) Variance: Variance is the average of the squared differences from the mean.

c) Standard deviation: The standard deviation is a measure of how spread out the numbers
are. In simple words, it is the square root of the variance.

d) Standard error: The standard error is defined as the standard deviation of the sampling
distribution of the mean. Mathematically, dividing the standard deviation by the square root of
the number of sampled observations gives the standard error.

e) 95% Confidence Intervals (CI): A confidence interval is a type of interval estimate
that gives the most likely range of an unknown population parameter. Confidence intervals can
be stated at different confidence levels, commonly 90%, 95% and 99%. In practice, they are
usually stated at the 95% confidence level, 95% being not too far from 100%. As a rule of thumb,
if two confidence intervals overlap substantially, the difference is not significant, whereas if
the intervals do not overlap, there is a difference at the 95% confidence level.

f) Lag-1 Autocorrelation: Autocorrelation is the correlation of data with itself at different
points in time. It often refers to the correlation of a time series with its own past and
future values; lag-1 autocorrelation is the correlation between each value and the value
immediately before it. Autocorrelation is also sometimes called “lagged correlation” or “serial
correlation”, which refers to the correlation between members of a series of numbers
arranged in time.
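
The measures defined in a) to f) can be sketched in a few lines of Python. This is an illustrative sketch only: the usage series is made up, the variance follows the definition in b) (division by n, where many tools divide by n - 1 for samples), and the 95% interval uses the normal-approximation multiplier 1.96, which is an assumption rather than the setting used in the SAS analysis.

```python
import math
import statistics


def summarize(values):
    """Return the descriptive statistics defined in a) to f) for a series."""
    n = len(values)
    mean = statistics.fmean(values)
    # Variance as defined in b): average squared difference from the mean.
    variance = statistics.pvariance(values, mu=mean)
    std_dev = math.sqrt(variance)
    std_err = std_dev / math.sqrt(n)
    # 95% confidence interval for the mean (normal approximation, assumed).
    ci_95 = (mean - 1.96 * std_err, mean + 1.96 * std_err)
    # Lag-1 autocorrelation: covariance of each value with its predecessor,
    # divided by the total squared deviation of the series.
    numerator = sum((values[i] - mean) * (values[i + 1] - mean)
                    for i in range(n - 1))
    denominator = sum((v - mean) ** 2 for v in values)
    lag1 = numerator / denominator
    return {"mean": mean, "variance": variance, "std_dev": std_dev,
            "std_err": std_err, "ci_95": ci_95, "lag1_autocorr": lag1}


# Hypothetical daily usage values for one subscriber, in time order.
daily_usage = [2, 4, 4, 4, 5, 5, 7, 9]
stats = summarize(daily_usage)
for name, value in stats.items():
    print(name, value)
```

A lag-1 autocorrelation close to 1 would indicate that a day's usage closely follows the previous day's, while a value near 0 would indicate little day-to-day dependence.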

Page | 20
CHAPTER 4: THEORETICAL CONCEPTS

4.1 Review of Related Literature

Many approaches have been applied to predict churn in telecom companies, and most of
them use data analytics. The majority of related work focuses on applying only
one method of data analysis to extract knowledge, while other work compares several
strategies for predicting churn.

• Review 1:
Description: This paper by Gavril (Methods for churn prediction in the prepaid
mobile telecommunications industry. In: International conference on communications) presents
an advanced data analytics methodology to predict churn for prepaid customers, using
a dataset of call details for 3333 customers with 21 features and a dependent churn parameter
with two values: Yes/No. Some features include information about the number of incoming
and outgoing messages and voicemail for each customer.
Observation: The author applied the principal component analysis (PCA) algorithm to reduce
data dimensions. Three machine learning algorithms were used to predict the churn factor:
Neural Networks, Support Vector Machines, and Bayes Networks. The dataset used in this study is
small and has no missing values.

• Review 2:
Description: He Y, He Z, Zhang D (A study on prediction of customer churn in fixed
communication network based on data mining. In: Sixth international conference on fuzzy
systems and knowledge discovery) proposed a prediction model based on the Neural
Network algorithm to solve the problem of customer churn in a large Chinese telecom
company with about 5.23 million customers.
Observation: The measure of prediction accuracy was the overall accuracy rate, which reached
91.1%.

• Review 3:
Description: This paper by Idris A, Khan proposes an approach based on genetic
programming with AdaBoost to model the churn problem in telecommunications. (Genetic
programming and adaboosting based churn prediction for telecom. In: IEEE international
conference on systems)
Observation: The model was tested on two standard data sets, one from Orange Telecom and
the other from cell2cell, with 89% accuracy for the cell2cell dataset and 63% for the other one.

• Review 4:
Description: This paper by Huang (In ACM SIGMOD international conference on
management of data.) studies the problem of customer churn on a big data platform. The goal
of the researchers was to prove that big data greatly enhances the process of predicting churn,
depending on the volume, variety, and velocity of the data.
Observation: Dealing with data from the Operation Support department and Business Support
department at China’s largest telecommunications company needed a big data platform to
engineer the features. The Random Forest algorithm was used and evaluated using AUC.

• Review 5:
Description: This paper by Makhtar M, Nafis S (Churn classification model for
local telecommunication company based on rough set theory. J Fundam Appl Sci.
2017;9(6):854–68.) describes a model for churn prediction in telecom using rough set theory.
As mentioned in this paper, the Rough Set classification algorithm outperformed the other
algorithms, such as Linear Regression, Decision Tree, and Voted Perceptron Neural Network.
Various researchers have studied the problem of unbalanced data sets, where the churned customer
classes are smaller than the active customer classes, as it is a major issue in the churn
prediction problem.
Observation: Six different oversampling techniques were compared for the telecom
churn prediction problem. The results showed that the algorithms (MTDF and rules-generation
based on genetic algorithms) outperformed the other compared oversampling algorithms.

Page | 22
4.2 Theoretical Background of the study

Definition -

In a competitive telecom market, customers want competitive pricing, value for money
and high-quality service. Today’s customers won’t hesitate to switch telecom providers if they
don’t find what they are looking for. This phenomenon is called churning.
Customer churning is directly related to customer satisfaction. Since the cost of winning a
new customer is far greater than the cost of retaining an existing one, mobile service providers
have now shifted their focus from customer acquisition to customer retention.
Substantial research in the field has found data analytics to be an efficient way of
identifying churn. It helps to achieve results more efficiently and
provides insights that set alarm bells ringing before any damage is done, giving telecom
companies an opportunity to take preventive measures. These techniques are usually applied to
predict customer churn by building models and learning from historical data. However, most of
these techniques only indicate whether customers might churn, and few tell us why they
churn.
Conducting experiments from the end users’ perspective, gathering their opinions on the
network, normalizing data, pre-processing data sets, eliminating class imbalance and missing
values, and replacing existing variables with derived variables all improve the accuracy of churn
prediction, which helps telecom companies to retain their customers more efficiently.
Comparatively few studies have been done from the user’s perspective, taking into consideration
their quality of experience. In fact, no study was done taking into consideration only users’ data
volumes. Estimating Quality of Experience (QoE) by finding relationships between QoE and traffic
characteristics could help service providers to continuously monitor the user satisfaction level,
react in a timely and appropriate way to rectify performance problems, and reduce churn.

Page | 23
Theories -

Before we dig into how to analyze churn, it’s critical to understand it from a high level all the
way down to how to calculate it and its impact on the bottom line. Once you have an understanding
of churn, it will be a lot easier to analyze it and develop strategies to reduce it.

From a high level, churn is the measure of how many customers leave over a set time period.
It’s used to measure how much revenue a telecom operator loses through customer cancellations.
It’s also used to measure the number of users or accounts that cease using products or services. In
either case, churn represents the attrition rate of the customer base.

For a subscription-based business, churn is critical, as every customer a telecom operator loses
is lost recurring revenue. Example: a telecom company has 1,000 customers paying Rs1,000/month,
giving it a monthly recurring revenue (MRR) of Rs1,000,000 and annual revenues of
Rs12,000,000. If it has a churn rate of 10%, that means it loses 100 customers, or Rs100,000
MRR, which is a loss of Rs1,200,000 for the year.

As the example illustrates, lost customers can have a huge impact on a telecom operator’s
bottom line. This is why many businesses have account managers and customer service managers
whose job is to do everything they can to reduce churn.
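
The arithmetic of the example above can be reproduced with a short Python sketch. The function name and its parameters are hypothetical and chosen only for illustration; the sketch assumes every customer pays the same monthly fee, as in the example.

```python
def churn_revenue_impact(customers, monthly_fee, churn_rate):
    """Revenue lost to churn when all customers pay the same monthly fee."""
    mrr = customers * monthly_fee                   # monthly recurring revenue
    lost_customers = round(customers * churn_rate)  # customers who leave
    lost_mrr = lost_customers * monthly_fee
    return {
        "mrr": mrr,
        "annual_revenue": mrr * 12,
        "lost_customers": lost_customers,
        "lost_mrr": lost_mrr,
        "lost_annual_revenue": lost_mrr * 12,
    }


# Figures from the example: 1,000 customers, Rs1,000/month, 10% churn.
impact = churn_revenue_impact(customers=1000, monthly_fee=1000, churn_rate=0.10)
print(impact)
```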

Let’s understand how to calculate churn. There are two common methods.

1. Customer Churn: Take the number of customers the telecom company loses during a time
frame, such as a month, and divide it by the total number of customers the company
had at the beginning of the month. Example: a telecom company had 500 customers
at the beginning of the month and 450 customers at the end of the month. Its
churn rate would be (500-450)/500 = 50/500 = 10%. If the telecom company prefers,
the same method can be used over a different time frame, such as quarterly or annually.

2. Revenue Churn: Take the monthly recurring revenue (MRR) the company lost
that month, minus any upgrades or additional revenue from existing customers, and
divide it by the MRR at the beginning of the month. The
reason you subtract additional revenue from existing customers is that you want
to know how much total revenue you lost, and additional revenue is actually revenue
you gained. Example: a telecom company had Rs500,000 MRR at the beginning
of the month and only Rs450,000 MRR at the end of the month. It also had
Rs65,000 MRR in upgrades that month from existing customers. Its churn rate
would be:
(Rs500,000 – Rs450,000 – Rs65,000)/Rs500,000 =
(-Rs15,000)/Rs500,000 = -3%
A negative revenue churn means that the upgrade revenue exceeded the revenue lost.
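
Both calculations can be written as small Python functions. This is a minimal sketch that follows the two worked examples literally; the function and parameter names are hypothetical, and the customer churn function assumes, as the example does, that no new customers joined during the period.

```python
def customer_churn_rate(customers_at_start, customers_at_end):
    """Customers lost in the period divided by customers at the start.

    Assumes no new customers were added during the period.
    """
    lost = customers_at_start - customers_at_end
    return lost / customers_at_start


def revenue_churn_rate(mrr_at_start, mrr_at_end, upgrade_mrr):
    """MRR lost, minus upgrades from existing customers, over MRR at the start."""
    net_lost = (mrr_at_start - mrr_at_end) - upgrade_mrr
    return net_lost / mrr_at_start


customer_churn = customer_churn_rate(500, 450)
revenue_churn = revenue_churn_rate(500_000, 450_000, 65_000)
print(f"customer churn: {customer_churn:.2%}")
print(f"revenue churn: {revenue_churn:.2%}")
```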

It’s important to understand that customer churn and revenue churn are not always
the same. The gap between them widens if you have more product lines or the
price difference between product lines is greater, so
you may need to use both calculations. Revenue churn is a great way to report on
performance as well as understand the financial health of your customer base.
Customer churn is important for staffing reasons, as an employee can only manage
so many accounts at one time.

In the monthly calculation, there is an underlying assumption that no customer can
churn in the first month, because customers pay for the month upfront. So when
you take a snapshot at the beginning of the month and then divide the total
number of churned customers from that month by that snapshot, you don’t have to worry
about any new sales churning in that time period.

Assuming the same model, if we calculate churn over a quarter we could run into a
problem. The reason is that some new sales from the first month of the
quarter could churn in the second or third month of the quarter. If those churns
are accidentally included in the calculation, we will overstate churn.

To get around this problem, you have to exclude all churns from new sales. If you
do that, you get the churn rate of Cohort A, the install base at the
beginning of the quarter. This method gives you the true churn rate, without
replacement, of your customer base over a quarter. The one issue with this solution
is that you may also want to include the churn rates of Cohorts B and C, the customers
acquired during the quarter. In that case, you will want to use the weighted average.

Example:

Cohort A – 1000 customers; churn rate of 15%
Cohort B – 100 customers; churn rate of 10%
Cohort C – 100 customers; churn rate of 5%
[(1000*0.15) + (100*0.1) + (100*0.05)] / (1000+100+100) =
[150 + 10 + 5]/1200 = 165/1200 = 13.75%
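
The weighted average above can be checked in R with the base function weighted.mean(). The data frame and its column names below are assumptions made for illustration; the numbers are those of the example.

```r
# Cohort sizes and churn rates from the example above.
cohorts <- data.frame(
  cohort     = c("A", "B", "C"),
  customers  = c(1000, 100, 100),
  churn_rate = c(0.15, 0.10, 0.05)
)
# Weight each cohort's churn rate by its number of customers.
overall_churn <- weighted.mean(cohorts$churn_rate, w = cohorts$customers)
cat(sprintf("Weighted churn rate: %.2f%%\n", 100 * overall_churn))
```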

One last fundamental aspect of churn is that it will vary depending on the customer’s
stage in the lifecycle. It’s very typical to find that customers churn at a higher rate
during the beginning of their subscription compared to a few months in. This can
happen for a variety of reasons, such as poor expectation setting during sales, a sudden
shift in priorities, poor onboarding programs, poor support, etc. As customers
mature, their churn rate will stabilize. Because of this, it can be important to
calculate churn for newer customers separately from older customers, so you don’t
overestimate the steady churn rate and underestimate early-stage churn rates.

While there are many ways to track and analyze churn, the three most common methods used to
analyze churn are cohort reports, churn by customer age, and churn by customer behavior.
There are two reasons to analyze churn:

1. Before trying to improve the churn rate, we need to identify problem areas so we can focus and
prioritize accordingly.

2. Once we implement action points to improve churn, we need to know whether those action points are
working.

Cohort Reports:

A cohort report analyzes various cohorts of your customer base and their churn rate over time.
A cohort of customers is the segment of customers who purchased in a certain time frame. A
common cohort is customers who purchased services in a specific month; for example,
your January 2020 cohort would be all the customers who purchased that month. There are two major
benefits of using cohort reports. The first is that it produces a clean number, not influenced by new
customer acquisition.

For example, a telecom company calculates its churn rate over 3 months.

Month 1
1000 existing customers; 50 churn; 100 new customers
50/1000 = 5% churn
Month 2
1050 customers (1000 from month 1 – 50 churn + 100 new); 40 churn; 100 new
40/1050 = 3.8% churn
Month 3
1110 customers (1050 - 40 churn + 100 new); 40 churn; 100 new customers
40/1110 = 3.6% churn

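In this example, churn fell from 5% to 3.6%, but between months 2 and 3 the number of churned
customers stayed at 40 while the customer base grew, so part of the drop comes from new customers
diluting the rate. With cohort reporting, it’s much easier to understand
where the change is coming from.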
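
The three-month example can be reproduced with a short R sketch. The variable names are assumptions; the figures are the ones given in the example.

```r
# Month-by-month figures from the example above.
churned   <- c(50, 40, 40)
new_sales <- c(100, 100, 100)
base <- numeric(3)
base[1] <- 1000
for (m in 2:3) {
  # Next month's starting base = previous base - churned + new customers.
  base[m] <- base[m - 1] - churned[m - 1] + new_sales[m - 1]
}
churn_rate <- churned / base
print(data.frame(month = 1:3, base, churned,
                 churn_pct = round(100 * churn_rate, 1)))
```

The churned count is the same in months 2 and 3; only the denominator changes.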

The second benefit of cohort reporting is that it helps identify patterns in the customer base.

The chart above is a sample cohort report. The way to read it is that customers who bought in
a cohort, such as January 2020, had a 100% renewal rate in month 0 (which is the month they bought,
so it’s impossible to churn), 81% of them were retained in month 1, 75% of the original number
from month 0 were retained in month 2, 71% of the original number from month 0 were retained
in month 3, and so on.

We then layered a heat map on top of this, which creates the color. The way that works is the
closer you are to 100%, the greener the box. The further away you get from 100%, the yellower
the box. As 63% is the lowest number in the chart, that box is completely yellow.
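
A cohort retention table of this kind can be built from per-customer records. The sketch below uses a small, made-up data set; the data frame, its columns and the helper retained_share() are all hypothetical and only illustrate the idea.

```r
# Hypothetical subscription records: the month each customer signed up
# (cohort) and how many months they stayed before churning (NA = still active).
subs <- data.frame(
  cohort        = c("Jan", "Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Feb"),
  months_stayed = c(1, 3, NA, NA, 2, NA, NA, NA)
)

# Share of a cohort still subscribed at a given age (in months).
# A customer who stayed k months counts as retained through month k - 1.
retained_share <- function(months_stayed, age) {
  mean(is.na(months_stayed) | months_stayed > age)
}

ages <- 0:3
# One row per cohort, one column per age.
retention <- t(sapply(split(subs$months_stayed, subs$cohort), function(x) {
  sapply(ages, function(a) retained_share(x, a))
}))
colnames(retention) <- paste0("month_", ages)
print(round(100 * retention))
```

Month 0 is always 100% because a customer cannot churn in the month of purchase.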

With just a glance, an obvious pattern emerges with the May cohort, as there is a clear
difference in the color. The way to interpret that is that customers who bought in May and later
have higher retention rates, or lower churn rates, compared to earlier customers. Now we can dive
deep into what happened in May to understand why. Did you create a new documentation site,
improve your onboarding process, or release a new version of the product? Once we find the reason
for the increase in retention, we can try to build on it. It’s important to note that we can find other
patterns this way. For example, maybe one or two cohorts are performing well above or below
the average:

This could indicate that the marketing team did a great job getting higher quality leads, or maybe
the sales team did a better job setting expectations, or the customer care team gave those customers
more attention. While those reports can’t tell you exactly why those cohorts did better,
investigating what the two cohorts have in common might reveal something that will help reduce churn
for the entire customer base.

Another possible pattern might show that after a certain time frame retention rates flatten out
for everyone. This is a clear indication that customers who make it past that time frame are mature
and tend not to churn:

As seen above, both in the numbers and in the heat map, churn after month 4 is
pretty low. This type of cohort reporting is very effective. However, you may prefer to look at graphs
as opposed to tables of numbers.

In the above chart we are comparing 4 cohorts between January and April. The y-axis is the
churn rate and x-axis is age in months. In this graph, you can easily see that at month 4, churn falls
to ~1% for each of the cohorts. This pattern is much easier to see in graph as opposed to the charts
presented above. Viewing the data this way also makes it easy to see the other patterns.

Churn by Customer Age:

Another popular way to analyze customer churn is to group customers together by
age. For example, measure the churn rate of all customers during their first month of service,
second month, and so on. It would look like this:

The way to read this chart is that, across all customers, the churn rate is around 6%
during their first month of service, approximately 10% in their second, and by the 5th month churn
stabilizes at around 2% or lower.

The major reason for analyzing churn this way is to identify patterns for all customers, by age,
regardless of cohort. Looking at churn this way helps you understand how the churn rate
varies with customer age. The same data can then be used to try to resolve problems. For example, in the
chart above, churn is high for the first 120 days, so you can try to focus on improving your
onboarding process, cleaning up documentation, training your onboarding executives more,
and so on.

Another benefit of creating charts like this is that it helps measure the impact of churn reduction
strategies. All we need to do is compare all customers before a cut-off date to all customers after the
cut-off date to see which group performs better. To illustrate this, let’s continue with the
example chart above, where churn is high in the first four months. Let’s see what the data could look
like if we improve the help and documentation site for new customers on 01/01/20:

We use a bar chart rather than a line graph so it’s easier to compare numbers. The red bar is
the churn rate for all customers who bought before 1/1/20 (which is the date the new documentation
site went live). The blue bar is the churn rate of all customers who bought after 1/1/20 and used
the new documentation site. As seen, months 1 and 2 were largely unaffected; however, in
months 3 and 4, churn dropped a fair amount. This shows that improved documentation has helped
prevent churn in some of the first few months, but not all.
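
A before/after comparison of this kind could be computed as follows. This is a minimal sketch on made-up records: the data frame, its columns and the helper churn_at_age() are hypothetical, and only the cut-off date is taken from the example.

```r
# Hypothetical per-customer records: purchase date and the age (in months)
# at which the customer churned (NA = still active after 4 months).
customers <- data.frame(
  purchased = as.Date(c("2019-10-05", "2019-11-12", "2019-12-01", "2019-12-20",
                        "2020-01-15", "2020-02-03", "2020-02-21", "2020-03-09")),
  churn_age = c(1, 3, 3, NA, 1, NA, NA, 4)
)
cutoff <- as.Date("2020-01-01")  # date the new documentation site went live
customers$group <- ifelse(customers$purchased < cutoff, "before", "after")

# Churn rate at a given age: customers churning at that age divided by
# customers who were still subscribed when they reached that age.
churn_at_age <- function(churn_age, age) {
  at_risk <- is.na(churn_age) | churn_age >= age
  sum(churn_age == age, na.rm = TRUE) / sum(at_risk)
}

ages <- 1:4
# One column per group ("after", "before"), one row per age.
by_group <- sapply(split(customers$churn_age, customers$group), function(x) {
  sapply(ages, function(a) churn_at_age(x, a))
})
rownames(by_group) <- paste0("month_", ages)
print(round(100 * by_group, 1))
```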

Churn by Customer Behavior:

In addition to analyzing churn by cohort or customer age, we need to analyze churn by
customer behavior. This means looking at customers who use a certain feature or complete a
certain action and determining its impact on churn. There are many benefits to performing this kind
of analysis:

• The product team may decide to focus on developing and improving features that retain customers.

• Alternatively, the product team may dig into why certain features have a high churn rate to
determine if they can fix the issue.

• The customer service team could work on promoting features that retain customers.

• All relevant data can be used to prepare a customer engagement score.

The reports we need to create are simple. The time-consuming part depends on how granular
you get. For example, if you track just accessing an app and its impact on churn, you would
only need to build a single report for each app you have. On the flip side, accessing an app
doesn’t tell you much. So, you could choose to get more granular and track various button
clicks in an app, or combinations of button clicks, and so on. For this, you can ask the
engineering team whether they are able to get this data for you or whether you need to use software
designed to track user behavior. If you are able to get that granular, it is well worth it for understanding
what behaviors lead to retention and churn.

Once the decision is taken about what to test, the next step is building reports to see what
impacts churn. For example:

The way to read this chart is that each line represents customers who accessed a feature (X or Y
in this case) or completed some action in the feature (for example, clicking a button) during the
month in question, together with their churn rate. For example, of the customers who accessed
Feature X in March 2020, 4% churned that month. To create a chart like this, I recommend using Excel, as it’s easy to
segment your customers based on their behavior and then, using the customer churn
formula presented earlier, calculate the churn rate for those segments over time.
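
The text recommends Excel for this step; as an alternative, the same segmentation can be sketched in R with aggregate(). The usage log below is made up, and its column names are assumptions.

```r
# Hypothetical usage log: one row per customer per month, flagging whether
# the customer used Feature X and whether they churned that month.
usage <- data.frame(
  month   = rep(c("2020-02", "2020-03"), each = 6),
  used_x  = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE,
              TRUE, TRUE, TRUE, TRUE, FALSE, FALSE),
  churned = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE,
              FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)
)
# Mean of a logical vector = share of TRUE values = churn rate per segment.
segment_churn <- aggregate(churned ~ month + used_x, data = usage, FUN = mean)
names(segment_churn)[3] <- "churn_rate"
print(segment_churn)
```

Each row of the result is one point on a chart like the one described above: a month, a behavior segment and that segment's churn rate.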

When you look at the chart above, three activities jump out as having consistently low churn: if
you click button 2 or 3 on feature Y and if you click button 1 on feature X. While this is good
evidence that customers who use those features won’t churn, we need to dig deeper. The reason is
that correlation doesn’t imply causation. The first step in doing so is comparing the
churn rates above with the churn rates of customers who don’t use those features:

In the chart above, you can easily see the customers who used button 1 on feature X (B1X)
and button 2 on feature Y (B2Y) had very low churn rates, while those customers that didn’t had
very high churn rates. At the same time customers who used button 3 on feature Y had a low churn
rate and customers who didn’t use that button also had a low churn rate. This means button 3 on
feature Y doesn’t really impact churn, however the other two buttons have a much clearer impact
on churn.

Now we need to dig a little further into B1X and B2Y as the charts above do not answer the
question of whether or not using those buttons in a month will reduce churn over the next few
months, or just in the month the button was used. For example, maybe those buttons are set-up
related and customers who are actively getting set-up don’t churn, but once set up is complete their
churn rate goes up over the next few months. Or maybe these buttons are rarely used, so while a
customer might not churn in a month they use them, in subsequent months their churn risk goes
up. To address these concerns we need to move back to cohort reporting:

If you look at the B1X chart, you can see that customers who use Button 1 in Feature X
have a very low churn rate in the month they use the feature (month 0), and they continue to have
a low churn rate over the next 3 months. In contrast, customers who use Button 2 in Feature Y have a
low churn rate in month 0, but it shoots up after that. To wrap up the analysis, you should complete
a cohort report for customers who don’t use B1X, just to make sure there is true causation, and not
just correlation.

Using the reports and processes above helps determine, with confidence, what user behavior
leads to a reduction in churn. In addition to looking at simple user behavior, like clicking a button
or accessing a feature, we have to consider analyzing:

• KPIs – What kind of key performance indicators can you use?

• Patterns – What patterns of usage can reduce churn? For example, do people who use
Feature 1 and Feature 2 have a low churn rate? What about customers who completed at
least 50 actions in a month or accessed 4+ features in a month? What about customers who
call support more than once in the 1st month? Or more than 10 times in total?

• Employee Behavior – If you have a high-touch sales and/or onboarding process, you should
determine churn by employee. You should also discover the impact on churn of various
behaviors, such as time between closing the sale and first point of contact with the
customer, length of time between touch points, or subject matter covered on calls.

The different phases of a model churn prediction system proposed in research work are as follows.

A model churn prediction system consists of five phases:

1) Preprocessing the input customer records,

2) Extracting the required features for developing churn models,

3) Constructing models using different classifiers and cross-validating the models,

4) Calculating prediction accuracy and producing a variable importance report,

5) Providing customer retention policies to CRM executives.
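
The five phases can be sketched end to end in base R. This is only an illustration under stated assumptions: the data are simulated, all column names are hypothetical, and a single classifier (logistic regression via glm()) stands in for the "different classifiers" of phase 3. The importance ranking from absolute z-statistics is a rough stand-in for a proper variable importance report.

```r
set.seed(42)  # reproducible simulated data

# Phase 1: preprocess hypothetical customer records (drop incomplete rows).
n <- 400
records <- data.frame(
  tenure          = sample(1:72, n, replace = TRUE),
  monthly_charges = round(runif(n, 20, 120), 2)
)
# Simulated churn: more likely for short tenure and high charges.
p_churn <- plogis(0.5 - 0.06 * records$tenure + 0.02 * records$monthly_charges)
records$churn <- rbinom(n, 1, p_churn)
records$monthly_charges[sample(n, 5)] <- NA
clean <- records[complete.cases(records), ]

# Phase 2: extract features (here, a simple new-customer indicator).
clean$is_new <- as.integer(clean$tenure <= 6)

# Phase 3: fit a classifier with 5-fold cross-validation.
k <- 5
fold <- sample(rep(1:k, length.out = nrow(clean)))
accuracy <- numeric(k)
for (i in 1:k) {
  train <- clean[fold != i, ]
  test  <- clean[fold == i, ]
  fit   <- glm(churn ~ tenure + monthly_charges + is_new,
               data = train, family = binomial)
  pred  <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
  # Phase 4a: prediction accuracy on the held-out fold.
  accuracy[i] <- mean(pred == test$churn)
}
cat(sprintf("Mean CV accuracy: %.3f\n", mean(accuracy)))

# Phase 4b: a rough importance ranking from absolute z-statistics.
final_fit <- glm(churn ~ tenure + monthly_charges + is_new,
                 data = clean, family = binomial)
z <- abs(coef(summary(final_fit))[-1, "z value"])
print(sort(z, decreasing = TRUE))

# Phase 5: rank customers by predicted risk for the retention team.
clean$risk <- predict(final_fit, type = "response")
print(head(clean[order(-clean$risk), ], 10))
```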

Advantages/ Disadvantages –
1. This churn analysis provides advantages that will help the telecom operator to:

• Access all the relevant data seamlessly and quickly

• Segment the customers based on behavior and demographics to improve retention

• Deliver tailored promotions and offers to positively influence their behavior

• Minimize acquisition costs and increase marketing efficiency

• Keep customers engaged and loyal

2. This churn analysis provides predictive insights to the telecom operator, such as:

• Predicting customers’ overall satisfaction as well as their experience with service quality

• Identifying potential network issues, competitive threats, and at-risk customers

• Identifying negative customer experience trends and reducing attrition levels

• Building a robust predictive model and gathering data

• Creating new opportunities for cross-selling and upselling

3. The challenge with traditional customer churn prediction models is that they do not
align with the business objective, as they only predict the gross outcome, i.e.,
whether a customer will churn. Churn analytics that estimates the net effect focuses on
whether a customer who intends to churn will be retained when targeted with the
campaign. The true business objective of analytics is to reduce customer churn.
Customers who are about to churn but cannot be retained should be excluded from the
campaign, as targeting them will be a waste of resources. Moreover, retention efforts
may provoke customers to churn. For example, a retention offer may remind a
customer about the expiration of a contractual agreement and cause churn as a result.

Features -
For analysing data, the system aggregated several kinds of telecom data, such as billing data,
calls/SMS/internet usage data, and complaints data. Data analytics techniques were applied
on top of this system data, but a few models failed to give good results using it. In contrast,
the data sources that are huge in size were ignored due to the complexity of dealing with them.
A few systems were not able to acquire, store, and process that huge amount of data at the same time
due to system limitations. In addition, the data sources were of different types, and
gathering them in one data handling system is a very hard process, so adding new features for data
analytics algorithms requires a long time, high processing power, and more storage capacity. All
these difficulties in system data handling are overcome by an upgraded system
that uses distributed processing of data.

Many types of telecom data are used to build the churn model. These types are classified
as follows:

1. Customer data: It contains all data related to the customer’s services and contract information,
in addition to all offers, packages, and services subscribed to by the customer.
Furthermore, it contains information generated from the CRM system (all customer
GSMs, type of subscription, birthday, gender, address, etc.).

2. Towers and complaints database: Information about tower location is represented as digits.
Mapping these digits with the towers database provides the location of the customer transaction,
giving the longitude and latitude, sub-area, area, city, and state. The complaints database
provides all complaints submitted and statistics inquiries related to coverage, problems in
offers and packages, and any other problem related to the telecom business.

3. Network logs data: Contains the internal sessions related to internet, calls, and SMS for
each transaction in Telecom operator, like the time needed to open a session for the
internet and call ending status. It could indicate if the session dropped due to an error in
the internal network.

4. Call details records: “CDRs” Contain all charging information about calls, SMS, MMS,
and internet transaction made by customers. This data source is generated as text files.

5. Mobile IMEI information: It contains the brand, model, and type of the mobile phone and whether
it is a dual or mono SIM device. This data is large and highly detailed.
We spent a lot of time understanding it and learning its sources and
storage format.

6. In addition to these records, the data must be linked to the detailed data stored in relational
databases that contain detailed information about the customer. Nine months of data sets
contained about ten million customers. The total number of columns is about ten thousand.
The collected data has so many columns because there is a column for each service,
product, and offer related to calls, SMS, MMS, and internet, in addition to columns related
to personal and demographic information. If we need to use all these data sources, the
number of columns for each customer before the data is processed will exceed ten
thousand.

Applications/ examples:

I needed labeled data for testing, so I contacted experts from the marketing section to
provide me with a labeled sample of GSM data. They provided me with some prepaid customers who were in
the idle phase after 2 months of the nine months of data, considering them as churners. The other,
non-churned customers were labeled as active customers (customers acquired in the last 4 months are
excluded). The total count of the sample was 5 million customers, containing 300,000 churned
customers and 4,700,000 active customers. The figure above shows the periods of historical data and
the future period when the customer may leave the company.
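
The class balance of this labeled sample can be worked out directly. The short sketch below uses only the counts quoted above; the variable names are hypothetical.

```r
# Sample sizes quoted above.
churned <- 300000
active  <- 4700000
total   <- churned + active
churn_share <- churned / total
cat(sprintf("Churners: %.0f%% of %s customers\n", 100 * churn_share,
            format(total, big.mark = ",", scientific = FALSE)))
# A model that always predicts "active" would already be right this often,
# so accuracy alone is a weak yardstick on such imbalanced data.
baseline_accuracy <- active / total
```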

Flow charts/ block diagrams -

CHAPTER 5: DATA ANALYSIS AND INTERPRETATION

Customer churn, also known as customer attrition, is the loss of clients or customers.
Churn is an important business metric for subscription-based services such as telecom
companies. This project demonstrates a churn analysis using data downloaded from IBM sample
data sets. We will use the R statistical programming language in order to identify variables
associated with customer churn.

In this project, we will carry out the following tasks:

1. Load the data and the relevant R libraries.
2. Preprocess the data with various cleaning and recoding techniques.
3. Provide data visualizations of descriptive statistics of the data.
4. Fit models using statistical classification methods commonly used in churn analysis.
o Decision tree analysis
o Random forest analysis
o Logistic regression
5. Examine additional data visualization of selected variables based on our modeling
techniques.
6. Modelling Objective –

• To identify the potential Drop in subs and DIU subs (35% drop in ARPU) in the pre-paid
segment for the identified circle/sector

• To automate the generation of a weekly report which ranks subscribers on the basis of their
propensity to drop within the next 30 days

1. Loading in data and R libraries
We begin by loading the R libraries we need for the project.

• plyr: various data manipulation methods
• randomForest: fitting random forest models
• rpart: fitting decision tree models
• rpart.plot: displaying decision tree visualizations
• caret: some helpful calculation methods
• ggplot2: data visualization
• gridExtra: additional data visualization (organizing plots into grids)

library(plyr)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(rpart)
library(rpart.plot)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:randomForest':
##
## combine

Now let’s read in the data. You will need to modify the file path within the quotation marks to match the
location of the data file on your computer. The first few rows are shown below.

#Read data file
dat <- read.csv("C:\\Users\\Prashant\\Documents\\Portfolio1\\TelcoChurn\\WA_Fn-UseC_-Telco-Customer-Churn.csv")

#Examine data
head(dat)
## customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## 1 7590-VHVEG Female 0 Yes No 1 No
## 2 5575-GNVDE Male 0 No No 34 Yes
## 3 3668-QPYBK Male 0 No No 2 Yes
## 4 7795-CFOCW Male 0 No No 45 No
## 5 9237-HQITU Female 0 No No 2 Yes
## 6 9305-CDSKC Female 0 No No 8 Yes
## MultipleLines InternetService OnlineSecurity OnlineBackup
## 1 No phone service DSL No Yes
## 2 No DSL Yes No
## 3 No DSL Yes Yes
## 4 No phone service DSL Yes No
## 5 No Fiber optic No No
## 6 Yes Fiber optic No No
## DeviceProtection TechSupport StreamingTV StreamingMovies Contract
## 1 No No No No Month-to-month
## 2 Yes No No No One year
## 3 No No No No Month-to-month
## 4 Yes Yes No No One year
## 5 No No No No Month-to-month
## 6 Yes No Yes Yes Month-to-month
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 1 Yes Electronic check 29.85 29.85
## 2 No Mailed check 56.95 1889.50
## 3 Yes Mailed check 53.85 108.15
## 4 No Bank transfer (automatic) 42.30 1840.75
## 5 Yes Electronic check 70.70 151.65
## 6 Yes Electronic check 99.65 820.50
## Churn
## 1 No
## 2 No
## 3 Yes
## 4 No
## 5 Yes
## 6 Yes

The IBM sample data set website gives the following data dictionary, or description of the
variables:

• customerID: Customer ID
• gender: Customer gender (female, male)
• SeniorCitizen: Whether the customer is a senior citizen or not (1, 0)
• Partner: Whether the customer has a partner or not (Yes, No)
• Dependents: Whether the customer has dependents or not (Yes, No)
• tenure: Number of months the customer has stayed with the company
• PhoneService: Whether the customer has a phone service or not (Yes, No)
• MultipleLines: Whether the customer has multiple lines or not (Yes, No, No phone
service)
• InternetService: Customer’s internet service provider (DSL, Fiber optic, No)
• OnlineSecurity: Whether the customer has online security or not (Yes, No, No internet
service)
• OnlineBackup: Whether the customer has online backup or not (Yes, No, No internet
service)
• DeviceProtection: Whether the customer has device protection or not (Yes, No, No
internet service)
• TechSupport: Whether the customer has tech support or not (Yes, No, No internet
service)
• StreamingTV: Whether the customer has streaming TV or not (Yes, No, No internet
service)
• StreamingMovies: Whether the customer has streaming movies or not (Yes, No, No
internet service)
• Contract: The contract term of the customer (Month-to-month, One year, Two year)
• PaperlessBilling: Whether the customer has paperless billing or not (Yes, No)
• PaymentMethod: The customer’s payment method (Electronic check, Mailed check,
Bank transfer (automatic), Credit card (automatic))
• MonthlyCharges: The amount charged to the customer monthly
• TotalCharges: The total amount charged to the customer
• Churn: Whether the customer churned or not (Yes or No)

2. Data pre-processing
Let’s begin by seeing if there is any missing data.

sapply(dat, function(x) sum(is.na(x)))
## customerID gender SeniorCitizen Partner
## 0 0 0 0
## Dependents tenure PhoneService MultipleLines
## 0 0 0 0
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0 0
## TechSupport StreamingTV StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 11
## Churn
## 0

There are 11 cases with missing values in the TotalCharges variable. Let’s see these particular
cases.

dat[is.na(dat$TotalCharges),]
## customerID gender SeniorCitizen Partner Dependents tenure
## 489 4472-LVYGI Female 0 Yes Yes 0
## 754 3115-CZMZD Male 0 No Yes 0
## 937 5709-LVOEQ Female 0 Yes Yes 0
## 1083 4367-NUYAO Male 0 Yes Yes 0
## 1341 1371-DWPAZ Female 0 Yes Yes 0
## 3332 7644-OMVMY Male 0 Yes Yes 0
## 3827 3213-VVOLG Male 0 Yes Yes 0
## 4381 2520-SGTTA Female 0 Yes Yes 0
## 5219 2923-ARZLG Male 0 Yes Yes 0
## 6671 4075-WKNIU Female 0 Yes Yes 0
## 6755 2775-SEFEE Male 0 No Yes 0
## PhoneService MultipleLines InternetService OnlineSecurity
## 489 No No phone service DSL Yes
## 754 Yes No No No internet service
## 937 Yes No DSL Yes
## 1083 Yes Yes No No internet service
## 1341 No No phone service DSL Yes
## 3332 Yes No No No internet service
## 3827 Yes Yes No No internet service
## 4381 Yes No No No internet service
## 5219 Yes No No No internet service
## 6671 Yes Yes DSL No
## 6755 Yes Yes DSL Yes
## OnlineBackup DeviceProtection TechSupport
## 489 No Yes Yes
## 754 No internet service No internet service No internet service
## 937 Yes Yes No
## 1083 No internet service No internet service No internet service
## 1341 Yes Yes Yes
## 3332 No internet service No internet service No internet service
## 3827 No internet service No internet service No internet service
## 4381 No internet service No internet service No internet service
## 5219 No internet service No internet service No internet service
## 6671 Yes Yes Yes
## 6755 Yes No Yes
## StreamingTV StreamingMovies Contract PaperlessBilling
## 489 Yes No Two year Yes
## 754 No internet service No internet service Two year No
## 937 Yes Yes Two year No
## 1083 No internet service No internet service Two year No
## 1341 Yes No Two year No
## 3332 No internet service No internet service Two year No
## 3827 No internet service No internet service Two year No
## 4381 No internet service No internet service Two year No
## 5219 No internet service No internet service One year Yes
## 6671 Yes No Two year No
## 6755 No No Two year Yes
## PaymentMethod MonthlyCharges TotalCharges Churn
## 489 Bank transfer (automatic) 52.55 NA No
## 754 Mailed check 20.25 NA No
## 937 Mailed check 80.85 NA No
## 1083 Mailed check 25.75 NA No
## 1341 Credit card (automatic) 56.05 NA No
## 3332 Mailed check 19.85 NA No
## 3827 Mailed check 25.35 NA No
## 4381 Mailed check 20.00 NA No
## 5219 Mailed check 19.70 NA No
## 6671 Mailed check 73.35 NA No
## 6755 Bank transfer (automatic) 61.90 NA No

Inspection of the Churn variable shows that these are all still subscribing customers. What
proportion of our sample is this subset with missing values?

sum(is.na(dat$TotalCharges))/nrow(dat)
## [1] 0.001561834

This subset is 0.16% of our data and is quite small. We will remove these cases in order to
accommodate our further analyses. Let’s call this cleaned data datc.

datc <- dat[complete.cases(dat), ]
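
The missing-value steps above can be tried on a toy data frame without the real file. The values below are made up; only the column names follow the Telco data. The last line checks the proportion reported above, which corresponds to 11 missing rows out of the 7043 rows in the data set.

```r
# Toy stand-in for the Telco data (hypothetical values), with NA in TotalCharges.
toy <- data.frame(
  tenure       = c(1, 34, 0, 45),
  TotalCharges = c(29.85, 1889.50, NA, 1840.75),
  Churn        = c("No", "No", "No", "Yes")
)
na_counts <- colSums(is.na(toy))         # same counts as the sapply() call
na_share  <- mean(!complete.cases(toy))  # share of rows with any missing value
toy_clean <- toy[complete.cases(toy), ]  # keep only fully observed rows

# The proportion reported above corresponds to 11 of 7043 rows.
print(round(11 / 7043, 9))
```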

The SeniorCitizen variable is coded ‘0/1’ rather than yes/no. To ease our interpretation of later
graphs and models, we recode it to ‘No’/‘Yes’.

datc$SeniorCitizen <- as.factor(mapvalues(datc$SeniorCitizen,
                                          from=c("0","1"),
                                          to=c("No", "Yes")))
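
As an aside, the same recoding can be done in base R without plyr. The sketch below uses a made-up 0/1 vector rather than the real column, and is an alternative to the mapvalues() call, not the method used in this project.

```r
# Hypothetical 0/1 indicator, as in the SeniorCitizen column.
senior <- c(0, 1, 0, 0, 1)
# factor() maps each level to a label without needing plyr.
senior_f <- factor(senior, levels = c(0, 1), labels = c("No", "Yes"))
print(table(senior_f))
```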

The MultipleLines variable is dependent on the PhoneService variable, where a ‘no’ for the latter
variable automatically means a ‘no’ for the former variable. We can again further ease our
graphics and modeling by recoding the ‘No phone service’ response to ‘No’ for
the MultipleLines variable.

datc$MultipleLines <- as.factor(mapvalues(datc$MultipleLines,
                                          from=c("No phone service"),
                                          to=c("No")))

Similarly, the OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV,
and StreamingMovies variables are all dependent on the InternetService variable. We recode
the responses from ‘No internet service’ to ‘No’ for these variables.

#Columns 10 to 15 are OnlineSecurity through StreamingMovies
for(i in 10:15){
  datc[,i] <- as.factor(mapvalues(datc[,i],
                                  from= c("No internet service"), to= c("No")))
}
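
Positional indices such as 10:15 break silently if columns are added or reordered. A name-based variant is sketched below on a toy data frame; the toy values are made up, and only two of the six service columns are shown.

```r
# Toy frame with two of the six internet-dependent service columns.
toy <- data.frame(
  OnlineSecurity = c("Yes", "No", "No internet service"),
  StreamingTV    = c("No internet service", "Yes", "No"),
  stringsAsFactors = FALSE
)
service_cols <- c("OnlineSecurity", "StreamingTV")
# Recode by column name instead of by position.
toy[service_cols] <- lapply(toy[service_cols], function(x) {
  x[x == "No internet service"] <- "No"
  factor(x)
})
print(toy)
```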

We will not need the customerID variable for graphs or modeling, so it can be removed.

datc$customerID <- NULL

3. Data visualization of descriptive statistics
Before we begin modeling our data, let’s examine descriptive statistics of our data within some
sample plots.

Here are barplots of demographic data of our sample.

#Gender plot
p1 <- ggplot(datc, aes(x = gender)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)

#Senior citizen plot
p2 <- ggplot(datc, aes(x = SeniorCitizen)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)

#Partner plot
p3 <- ggplot(datc, aes(x = Partner)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)

#Dependents plot
p4 <- ggplot(datc, aes(x = Dependents)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)

#Plot demographic data within a grid
grid.arrange(p1, p2, p3, p4, ncol=2)
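
Since every bar plot in this section repeats the same code, the pattern could be wrapped in a helper. The sketch below is an assumption-laden illustration, not the code used for the plots in this report: pct_labels() and bar_with_pct() are hypothetical names, the labels are computed outside ggplot2 rather than with ..count.., and the plotting part only runs if ggplot2 is installed.

```r
# Percentage label for each level of a categorical vector, computed the same
# way as the plot labels: round(prop.table(counts), 4) * 100.
pct_labels <- function(x) {
  counts <- table(x)
  labels <- paste0(round(prop.table(counts), 4) * 100, "%")
  setNames(labels, names(counts))
}

# Hypothetical helper that draws one labelled bar chart for a column.
bar_with_pct <- function(data, var) {
  counts <- as.data.frame(table(data[[var]]), stringsAsFactors = FALSE)
  names(counts) <- c("level", "n")
  counts$label <- pct_labels(data[[var]])
  ggplot2::ggplot(counts, ggplot2::aes(x = level, y = n)) +
    ggplot2::geom_col() +
    ggplot2::geom_text(ggplot2::aes(label = label), vjust = 1.5, size = 3) +
    ggplot2::labs(x = var, y = "count")
}

toy <- data.frame(gender = c("Female", "Male", "Male", "Female"))
print(pct_labels(toy$gender))
if (requireNamespace("ggplot2", quietly = TRUE)) {
  print(bar_with_pct(toy, "gender"))
}
```

With such a helper, each plot above would reduce to a single call per column name.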

From these demographic plots, we notice that the sample is evenly split across gender and
partner status. A minority of the sample are senior citizens, and a minority have dependents.

The various offered services are plotted below.

#Phone service plot
p5 <- ggplot(datc, aes(x = PhoneService)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)

#Multiple phone lines plot
p6 <- ggplot(datc, aes(x = MultipleLines)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)

#Internet service plot
p7 <- ggplot(datc, aes(x = InternetService)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)

#Online security service plot
p8 <- ggplot(datc, aes(x = OnlineSecurity)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)

#Online backup service plot
p9 <- ggplot(datc, aes(x = OnlineBackup)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)

#Device Protection service plot
p10 <- ggplot(datc, aes(x = DeviceProtection)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)

#Tech Support service plot
p11 <- ggplot(datc, aes(x = TechSupport)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)

#Streaming TV service plot
p12 <- ggplot(datc, aes(x = StreamingTV)) +
geom_bar() +
geom_text(aes(y = ..count.. -200,
label = paste0(round(prop.table(..count..),4) * 100, '%')),
stat = 'count',
position = position_dodge(.1),
size = 3)

#Streaming Movies service plot
p13 <- ggplot(datc, aes(x = StreamingMovies)) +
  geom_bar() +
  geom_text(aes(y = ..count.. - 200,
                label = paste0(round(prop.table(..count..), 4) * 100, '%')),
            stat = 'count',
            position = position_dodge(.1),
            size = 3)

#Plot service data within a grid
grid.arrange(p5, p6, p7,
             p8, p9, p10,
             p11, p12, p13,
             ncol=3)

Most of the sample have phone service with a single phone line. Fiber optic internet connection
is more popular than DSL internet service, and each online service has a minority of users.

The remaining categorical variables are related to contract and payment status.

#Contract status plot
p14 <- ggplot(datc, aes(x = Contract)) +
  geom_bar() +
  geom_text(aes(y = ..count.. - 200,
                label = paste0(round(prop.table(..count..), 4) * 100, '%')),
            stat = 'count',
            position = position_dodge(.1),
            size = 3)

#Paperless billing plot
p15 <- ggplot(datc, aes(x = PaperlessBilling)) +
  geom_bar() +
  geom_text(aes(y = ..count.. - 200,
                label = paste0(round(prop.table(..count..), 4) * 100, '%')),
            stat = 'count',
            position = position_dodge(.1),
            size = 3)

#Payment method plot
p16 <- ggplot(datc, aes(x = PaymentMethod)) +
  geom_bar() +
  geom_text(aes(y = ..count.. - 200,
                label = paste0(round(prop.table(..count..), 4) * 100, '%')),
            stat = 'count',
            position = position_dodge(.1),
            size = 3)

#Plot contract data within a grid
grid.arrange(p14, p15, p16, ncol=1)

Roughly half of the sample are on month-to-month contracts, with the remainder split between
one-year and two-year contracts. Most of the sample use paperless billing, and electronic
check is the most common payment method.

Let’s look at distributions of the quantitative variables.

#Tenure histogram
p17 <- ggplot(datc, aes(x = tenure)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Months",
       title = "Tenure Distribution")

#Monthly charges histogram
p18 <- ggplot(datc, aes(x = MonthlyCharges)) +
  geom_histogram(binwidth = 5) +
  labs(x = "Dollars (binwidth = 5)",
       title = "Monthly charges Distribution")

#Total charges histogram
p19 <- ggplot(datc, aes(x = TotalCharges)) +
  geom_histogram(binwidth = 100) +
  labs(x = "Dollars (binwidth = 100)",
       title = "Total charges Distribution")

#Plot quantitative data within a grid
grid.arrange(p17, p18, p19, ncol=1)

The tenure variable is stacked at the tails, so a large proportion of customers have had either
the shortest (1 month) or the longest (72 months) tenure.

Lastly, let’s analyze our main outcome, churn.

#Churn plot
p20 <- ggplot(datc, aes(x = Churn)) +
  geom_bar() +
  geom_text(aes(y = ..count.. - 200,
                label = paste0(round(prop.table(..count..), 4) * 100, '%')),
            stat = 'count',
            position = position_dodge(.1),
            size = 3)
p20

Roughly a quarter of our sample (around 26.58%) are no longer customers. Below, we try to
predict which customers churn using classification modelling techniques.

4. Statistical modelling
The parameters estimated by each modelling technique on the training subset are used to make
predictions on the test subset. We will examine accuracy in terms of both the percentage of
correct predictions and confusion matrices.

Decision tree analysis


Decision tree analysis is a classification method that uses tree-like models of decisions
and their possible outcomes. This method is one of the most commonly used tools in machine
learning analysis. We will use the rpart library in order to use recursive partitioning methods for
decision trees. This exploratory method will identify the most important variables related to
churn in a hierarchical format.
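
As a rough illustration of what recursive partitioning does at each node, the sketch below scores one candidate split by its reduction in Gini impurity, which is the default splitting criterion in rpart for classification trees. The toy data and the helper names gini and gini_gain are made up for illustration and are not taken from the project data.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2)
gini <- function(labels) {
  p <- prop.table(table(labels))
  1 - sum(p^2)
}

# Reduction in impurity when a node is split in two by a logical mask
gini_gain <- function(labels, goes_left) {
  n <- length(labels)
  left <- labels[goes_left]
  right <- labels[!goes_left]
  gini(labels) -
    (length(left) / n) * gini(left) -
    (length(right) / n) * gini(right)
}

# Hypothetical toy data: 8 customers (not the project data)
toy <- data.frame(
  Contract = c("Month-to-month", "Month-to-month", "Month-to-month",
               "Month-to-month", "One year", "One year", "Two year", "Two year"),
  Churn    = c("Yes", "Yes", "Yes", "No", "No", "No", "No", "No")
)

# Impurity before the split, and the gain from splitting on contract type
parent_impurity <- gini(toy$Churn)
split_gain <- gini_gain(toy$Churn, toy$Contract == "Month-to-month")
```

A tree-growing routine evaluates many such candidate splits and keeps the one with the largest gain, then repeats the process within each resulting group.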

#Fit a classification tree on the training subset and plot it
tr_fit <- rpart(Churn ~ ., data = dtrain, method = "class")
rpart.plot(tr_fit)

From this decision tree, we can interpret the following:

• The contract variable is the most important. Customers with month-to-month contracts
are more likely to churn.
• Customers with DSL internet service are less likely to churn.
• Customers who have stayed longer than 15 months are less likely to churn.

Now let’s assess the prediction accuracy of the decision tree model by investigating how well it
predicts churn in the test subset. We will begin with the confusion matrix, which is a useful
display of classification accuracy. It displays the following information:

• true positives (TP): These are cases in which we predicted yes (they churned), and they
did churn.
• true negatives (TN): We predicted no, and they didn’t churn.
• false positives (FP): We predicted yes, but they didn’t actually churn. (Also known as a
“Type I error.”)
• false negatives (FN): We predicted no, but they actually churned. (Also known as a
“Type II error.”)

Let’s examine the confusion matrix for our decision tree model.

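#Predicted class probabilities for the test subset
tr_prob1 <- predict(tr_fit, dtest)
#Classify as churn when the predicted probability of "Yes" exceeds 0.5
tr_pred1 <- ifelse(tr_prob1[,2] > 0.5,"Yes","No")
table(Predicted = tr_pred1, Actual = dtest$Churn)
##          Actual
## Predicted   No  Yes
##       No  1466  328
##       Yes   82  232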

The diagonal entries give our correct predictions, with the upper left being TN and the lower
right being TP. The upper right gives the FN while the lower left gives the FP. From this
confusion matrix, we can see that the model performs well at predicting non-churning customers
(1466 correct vs. 82 incorrect) but does not perform as well at predicting churning customers
(232 correct vs. 328 incorrect).
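
To put numbers on this imbalance, the short sketch below recomputes sensitivity, specificity and precision from the four counts in the confusion matrix reported above. The matrix is typed in by hand, and the object name cm is an illustrative assumption.

```r
# Confusion matrix counts copied from the decision tree output
# (matrix() fills column by column: Actual = No first, then Actual = Yes)
cm <- matrix(c(1466, 82, 328, 232), nrow = 2,
             dimnames = list(Predicted = c("No", "Yes"),
                             Actual = c("No", "Yes")))

TN <- cm["No", "No"]
FN <- cm["No", "Yes"]
FP <- cm["Yes", "No"]
TP <- cm["Yes", "Yes"]

sensitivity <- TP / (TP + FN)      # share of actual churners that were caught
specificity <- TN / (TN + FP)      # share of non-churners correctly cleared
precision   <- TP / (TP + FP)      # share of predicted churners who did churn
accuracy    <- (TP + TN) / sum(cm) # overall share of correct predictions

round(c(sensitivity = sensitivity, specificity = specificity,
        precision = precision, accuracy = accuracy), 4)
```

With these counts the tree catches about 41% of actual churners (sensitivity) while correctly clearing about 95% of non-churners (specificity).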

How about the overall accuracy of the decision tree model?

#Predictions on the training subset (tr_tab1 is not used in the accuracy below)
tr_prob2 <- predict(tr_fit, dtrain)
tr_pred2 <- ifelse(tr_prob2[,2] > 0.5,"Yes","No")
tr_tab1 <- table(Predicted = tr_pred2, Actual = dtrain$Churn)
#Test-subset confusion matrix and overall accuracy
tr_tab2 <- table(Predicted = tr_pred1, Actual = dtest$Churn)
tr_acc <- sum(diag(tr_tab2))/sum(tr_tab2)
tr_acc
## [1] 0.8055028

The decision tree model is fairly accurate, correctly predicting the churn status of customers in
the test subset 80.55% of the time.

5. Data visualization based on models
Our modelling efforts pointed to several important churn predictors: contract status,
internet status, tenure length, and total charges. Let’s analyse how these variables split by churn
status.

• Churn status by contract status: We will begin with the contract status variable.

p21 <- ggplot(datc, aes(x = Contract, fill = Churn)) +
  geom_bar() +
  geom_text(aes(y = ..count.. - 200,
                label = paste0(round(prop.table(..count..), 4) * 100, '%')),
            stat = 'count',
            position = position_dodge(.1),
            size = 3) +
  labs(title="Churn rate by contract status")

p21

As would be expected, the churn rate of month-to-month contract customers is much higher than
the longer contract customers. Customers who are more willing to commit to longer contracts are
less likely to leave.

• Churn status by internet service status: Next, we analyse the internet service status of the
customer.

p22 <- ggplot(datc, aes(x = InternetService, fill = Churn)) +
  geom_bar() +
  geom_text(aes(y = ..count.. - 200,
                label = paste0(round(prop.table(..count..), 4) * 100, '%')),
            stat = 'count',
            position = position_dodge(.1),
            size = 3) +
  labs(title="Churn rate by internet service status")

p22

It appears that customers with internet service are more likely to churn than those without it.
This is most pronounced for customers with fiber optic internet service, who are the most likely
to churn.

• Churn status by tenure: Next, we analyse the distribution of tenure length split by
churn.

p23 <- ggplot(datc, aes(x = tenure, fill = Churn)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Months",
       title = "Churn rate by tenure")
p23

There is a large spike at 1 month, indicating that a large portion of customers leave
after just one month of service.

• Churn status by total charges: Finally, we analyse the distribution of total charges split
by churn.

p24 <- ggplot(datc, aes(x = TotalCharges, fill = Churn)) +
  geom_histogram(binwidth = 100) +
  labs(x = "Dollars (binwidth=100)",
       title = "Churn rate by total charges")
p24

Similar to the tenure trend, customers who have spent more with the company tend not to
leave. This could just be a reflection of the tenure effect, or it could be due to financial
characteristics of the customer: customers who are more financially well off are less likely to
leave.
6. Modelling Objectives -

• To identify potential DIU subscribers (a 35% drop in ARPU) in the pre-paid segment for the
identified circle/sector

• To automate the generation of a weekly report which ranks subscribers by their
propensity to drop within the next 30 days

• 7 Days In-Month DIU Prediction: Illustration

[Figure: timeline marked at Day -60, Day -30, Day 0, Day 7 and Day 30, showing the observation window, the performance window, months MN and MN+1, and the model run date at Day 9.]

» DIU is defined as drop in Customer revenue of more than 35% over the performance
window period of 30 days (MN+1)
» Day of prediction is day 9 (considering DW at D-2)
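
The DIU definition above can be expressed as a simple flag. The sketch below assumes a data frame with one row per subscriber and two hypothetical columns, rev_obs (revenue in the comparison month) and rev_perf (revenue in the 30-day performance window). The column names, the values and the choice of comparison month are assumptions for illustration, not taken from the project data.

```r
# Hypothetical subscriber revenue (INR); values invented for illustration
subs <- data.frame(
  sub_id   = c("S1", "S2", "S3", "S4"),
  rev_obs  = c(200, 150, 100, 80),
  rev_perf = c(120, 140, 50, 90)
)

diu_threshold <- 0.35  # a drop of more than 35% counts as DIU

# Relative drop in revenue; negative values mean revenue went up
subs$rev_drop <- (subs$rev_obs - subs$rev_perf) / subs$rev_obs
subs$DIU <- subs$rev_drop > diu_threshold

# Share of subscribers flagged as DIU in this toy sample
diu_rate <- mean(subs$DIU)
```

A production version would also need a rule for subscribers with zero revenue in the comparison month, since the ratio is undefined there.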

• Evaluation Parameters

                      Actual drop occurred
Drop predicted        No        Yes
No                    A         C
Yes                   D         B

Key parameters:
Hit rate = (B) / (B+D)
Penetration = (B) / (B+C)

[Figure: lift chart comparing Random Target and Modelled Target across deciles D1 to D10.]

Lift = (Hit Rate) / (Prediction Accuracy achieved through Random Selection)

Assume Base showing >35% Drop in M2 = 20%
Thus Prediction Accuracy achieved by Random Base Selection = 20%
If Model Hit Rate = 60%
Then, Lift = 3
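
The evaluation parameters can be computed directly from the four cells of the prediction table. The sketch below uses the cell labels A to D as defined above, with invented counts; evaluate_model is a hypothetical helper name. The counts are chosen so that the result reproduces the worked lift example (a 60% hit rate against a 20% base rate).

```r
# A: predicted no, actual no     C: predicted no, actual yes
# D: predicted yes, actual no    B: predicted yes, actual yes
evaluate_model <- function(A, B, C, D) {
  hit_rate    <- B / (B + D)                # flagged subscribers who did drop
  penetration <- B / (B + C)                # actual droppers that were flagged
  base_rate   <- (B + C) / (A + B + C + D)  # hit rate of a random selection
  c(hit_rate = hit_rate, penetration = penetration,
    base_rate = base_rate, lift = hit_rate / base_rate)
}

# Invented counts giving a 20% base rate and a 60% hit rate
res <- evaluate_model(A = 720, B = 120, C = 80, D = 80)
round(res, 2)
```

A lift of 3 means that targeting the subscribers flagged by the model finds three times as many droppers as contacting the same number of subscribers at random.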

• Data and ABT Creation

(Embedded worksheet: Predictive Model - Churn_v1 2.xlsx)

• Modeling Approach (SEMMA)

• Modeling Techniques
  • CHAID (Decision Tree) -
    • Easy to understand and implement
    • Does not consider a lot of interaction between variables
    • Subscribers are not scored individually

Models are compared on hit rate, penetration rate, misclassification and lift to select the best model.

• Case Presentation: Maharashtra Circle Churn

Filter criteria for the subscriber base
• Circle id = 13 (Maharashtra)
• Prepaid Retail subscribers
• AON > 90 days
• Subscribers who are active as on the day of churn prediction
• Super, Super +
• MOU > 150 in last 30 days
• VLR Days > 15 in last 30 days
• Total Population - ~36L
• 70% – 30% split considered for development and validation
• DIU Rate of the Population – 19.7%
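
A minimal sketch of how such a filter and 70/30 split might be applied is given below. The data frame, its column names (circle_id, aon_days, mou_30d, vlr_days_30d) and the simulated values are all assumptions for illustration; only the thresholds are taken from the criteria listed above.

```r
set.seed(42)  # for a reproducible split

# Simulated subscriber base (hypothetical columns and values)
n <- 1000
base <- data.frame(
  circle_id    = sample(c(13, 15, 22), n, replace = TRUE),
  aon_days     = sample(1:1000, n, replace = TRUE),
  mou_30d      = round(runif(n, 0, 600)),
  vlr_days_30d = sample(0:30, n, replace = TRUE)
)

# Apply the numeric filter criteria for the Maharashtra circle
eligible <- subset(base, circle_id == 13 & aon_days > 90 &
                         mou_30d > 150 & vlr_days_30d > 15)

# 70% of eligible subscribers for development, the remaining 30% for validation
dev_idx <- sample(seq_len(nrow(eligible)), size = floor(0.7 * nrow(eligible)))
development <- eligible[dev_idx, ]
validation  <- eligible[-dev_idx, ]
```

Sampling row indices without replacement keeps the two subsets disjoint, so no subscriber used to develop the model is also used to validate it.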

[Figure: timeline marked at Day -60, Day -30, Day 0, Day 7 and Day 30, showing the observation window, the performance window, months MN and MN+1, and the model run date at Day 9.]

• Featured variables in the model

Feature                  Description                                                                                   Impact
LINEAR_MOD_7D_TO_30D     7 days-in linear model                                                                        -
CRREVM2M1_MOD            Change in Cust_rev_M1 over M2; floored at 0                                                   +
SD_CUST_REV_36D_4D_AVG   Standard deviation of customer revenue of 9 groups with mean of 4 days each for last 36 days  +
CNT_DIST_CALLS_OG_WK4    Count of distinct outgoing call numbers in the last seven days                                -
SLOPE_DEC_LAST_37DAYS    Rate of change of decrement in last 37 days                                                   +
MAX_MainBal_4DAY         Maximum of the MAIN_BAL_INR in the last 4 days                                                -
SLOPE_DEC_MOU_LST7D      Rate of change of MOU in last 7 days                                                          -
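
To make two of these derived features concrete, the sketch below computes a 7-day usage slope and the standard deviation of nine 4-day revenue averages over 36 days. The daily series are invented, the function names are hypothetical, and the exact definitions used in the project may differ from this reading of the feature descriptions.

```r
# Least-squares slope of a daily series against the day index
daily_slope <- function(x) {
  day <- seq_along(x)
  unname(coef(lm(x ~ day))["day"])
}

# SD of 4-day average revenue over the last 36 days (9 groups of 4 days)
sd_of_4day_means <- function(rev_36d) {
  stopifnot(length(rev_36d) == 36)
  group <- rep(1:9, each = 4)
  sd(tapply(rev_36d, group, mean))
}

# Invented daily values for one subscriber
mou_last7  <- c(70, 60, 50, 40, 30, 20, 10)  # minutes of use, steadily falling
rev_last36 <- rep(1:9, each = 4)             # revenue rising in 4-day steps

slope_mou <- daily_slope(mou_last7)       # negative slope: usage is declining
sd_rev    <- sd_of_4day_means(rev_last36) # spread of the 4-day averages
```

Averaging over 4-day groups before taking the standard deviation smooths out single-day spikes, so the feature reflects sustained swings in revenue rather than daily noise.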

• Case Presentation: Mumbai Circle Churn

Filter criteria for the subscriber base
• Circle id = 15 (Mumbai)
• Prepaid Retail subscribers
• AON > 90 days
• Subscribers who are active as on the day of churn prediction
• Super, Super +
• MOU > 150 in last 30 days
• VLR Days > 15 in last 30 days
• Total Population - ~9L
• 70% – 30% split considered for development and validation
• DIU Rate of the Population – 21.0%

[Figure: timeline marked at Day -60, Day -30, Day 0, Day 7 and Day 30, showing the observation window, the performance window, months MN and MN+1, and the model run date at Day 9.]

• Featured variables in the model

• Case Presentation: UPW Churn

Filter criteria for the subscriber base
• Circle id = 22 (UPW)
• Prepaid Retail subscribers
• AON > 90 days
• Subscribers who are active as on the day of churn prediction
• Super +
• MOU > 400 in last 30 days
• VLR Days > 15 in last 30 days
• Total Population - ~5L
• 70% – 30% split considered for development and validation
• DIU Rate of the Population – 27.5%

[Figure: timeline marked at Day -60, Day -30, Day 0, Day 7 and Day 30, showing the observation window, the performance window, months MN and MN+1, and the model run date at Day 9.]

• Featured variables in the model


CHAPTER 6: FINDINGS

Here is a summary of our findings:

• Customers with month-to-month contracts are more likely to churn than customers with
longer-term contracts.
• Customers with internet service, in particular fiber optic service, are more likely to churn.
• Customers who have been with the company longer or have paid more in total are less
likely to churn.
• The decision tree model was assessed on hit rate, penetration rate, misclassification and
lift in order to select the best model. It had a better false positive rate and was more accurate
overall.
• With the help of the above modelling technique, customer churn is controlled, which has
resulted in revenue uplift and customer satisfaction.
• The above modelling techniques have also made it easier for the product
department to design and develop products for customers who are likely to churn.

CHAPTER 7: SUGGESTION & RECOMMENDATIONS

1. To make the most of customer churn analysis, a proper analytical method
is necessary.

2. It often happens that the wrong model is chosen, so that customer churn
is not brought under control in time, resulting in a drop in usage and a reduction in revenue.

3. To avoid these consequences, a proper analytics model should be used from the beginning.

4. Before starting customer churn analytics, the following points should be considered:

• How stable are the requirements?

• Who are the end users of the system?

• Is the timeline for the analysis aggressive or conservative?

• What is the size of the database of customers targeted for churn control?

CHAPTER 7: CONCLUSION

After going through various preparatory steps, including data/library loading and pre-processing,
we carried out three statistical classification methods common in churn analysis. We
identified several important churn predictor variables from these models and selected the
model based on accuracy measures.

• Customers with month-to-month contracts are more likely to churn.
• Customers with internet service, in particular fiber optic service, are more likely to churn.
• Customers who have been with the company longer or have paid more in total are less
likely to churn.
• The decision tree model was assessed on hit rate, penetration rate, misclassification and
lift in order to select the best model. It had a better false positive rate and was more accurate
overall.
• Controlling customer churn has uplifted revenue.

