CH 03

Uploaded by

howrayan

Chapter 3

Should you call a customer because they are at risk of churning?
CPIS 483
Spring 2023 – Third Term
Dr. Ghada Amoudi
Outline

• Identifying customers who are about to churn

• How to handle imbalanced data in your analysis
• How the XGBoost algorithm works
• Additional practice in using S3 and SageMaker

2
Introduction
• Carlos has a bakery and many competitors.
• Carlos calls customers who have stopped buying.
• Churn: a word that means losing a customer.
• Carlos comes to you for help:
• Find customers who are in the process of trying another bakery.
• Once the customers are identified, he can call them to see if there’s something he can do to keep them.

3
Introduction
In Carlos’s conversations with his lost customers, he
notices a common pattern:
• Customers place orders in a regular pattern, daily.
• When a customer tries another bakery, they reduce the number of orders from Carlos’s bakery.
• The customer negotiates an agreement with the other bakery, which may result in temporarily stopping the orders from Carlos’s bakery.
• Eventually, the customer stops ordering from his bakery.
4
The process flow
In this chapter, you make decisions about customers: should Carlos call a customer?

5
Preparing the dataset
• Carlos has 3,000 customers who, on average, place 3 orders per week.
• Over the past 3 months, Carlos received 117,000 orders:
• 3,000 customers × 3 orders per week × 13 weeks = 117,000 orders
• We need to turn order data into customer data:
• Turn 117,000 rows into a 3,000-row table

6
Preparing the dataset

• To turn 117,000 rows into a 3,000-row table (one row per customer), you need
to group the non-numerical data and summarize the numerical data.
• In the dataset shown in the table, the non-numerical fields are
customer_code, customer_name, and date. The only numerical field
is amount.
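
The grouping step can be sketched in pandas as follows. This is only an illustration under assumptions: the sample rows, the customer "Acme Cafe", and the variable names `orders` and `customers` are made up, and the real notebook may prepare the data differently.

```python
import pandas as pd

# Hypothetical order rows; the real dataset has about 117,000 of them.
orders = pd.DataFrame({
    "customer_code": ["C001", "C001", "C002", "C002", "C002"],
    "customer_name": ["Gibson Group", "Gibson Group",
                      "Acme Cafe", "Acme Cafe", "Acme Cafe"],
    "date": pd.to_datetime(["2023-01-02", "2023-01-09",
                            "2023-01-02", "2023-01-03", "2023-01-10"]),
    "amount": [120.0, 80.0, 50.0, 25.0, 75.0],
})

# Group by the non-numerical customer fields and summarize the numerical one,
# which leaves one row per customer.
customers = (
    orders.groupby(["customer_code", "customer_name"], as_index=False)
          .agg(total_sales=("amount", "sum"), order_count=("amount", "count"))
)
print(customers)
```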
7
Preparing the dataset

We’ll apply two transformations to the data:

1. Normalize the data:
• Calculate the spend as a percentage of the average week. Instead of dollars, you are looking at a weekly change relative to the average sales.

2. Calculate the change from week to week:

• This lets the ML algorithm see the patterns in the weekly changes as well as the relative figures for the same time period.
8
Transformation 1: Normalizing the data
1. Find the sum of the total spent over the year for each customer (total_sales).
2. Find the average per week (total_sales/52).
3. For each week, calculate the total spent per week divided by the average spent per week to get a weekly spend as a percentage of an average spend.
4. Create a column for each week.
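
A minimal sketch of these four steps, assuming the weekly totals are already in a pandas DataFrame with one row per customer and one column per week. The dollar figures, the four-week window, and the names `weekly` and `normalized` are hypothetical; the slide divides by 52 because it works with a full year.

```python
import pandas as pd

# Hypothetical weekly spend in dollars for two customers over four weeks.
weekly = pd.DataFrame(
    {"week_1": [100.0, 40.0], "week_2": [120.0, 0.0],
     "week_3": [80.0, 60.0], "week_4": [100.0, 60.0]},
    index=["Gibson Group", "Acme Cafe"],
)
weeks_in_period = weekly.shape[1]  # the slide uses 52 for a full year

# Steps 1 and 2: total spent per customer, then the average per week.
total_sales = weekly.sum(axis=1)
average_week = total_sales / weeks_in_period

# Steps 3 and 4: each week's spend as a multiple of that customer's average week,
# kept as one column per week.
normalized = weekly.div(average_week, axis=0)
print(normalized)
```

A value of 1.2 means the customer spent 20% more than in their average week, and 0.0 means they placed no orders that week.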

9
Transformation 2: Calculating the
change from week to week
• For each week from the column named week_minus_3 to last_week, subtract the preceding week’s value and call the result the delta between the weeks.
• For example, in week_minus_3, the Gibson Group has sales that are 1.18 times their average week. In week_minus_4, their sales are 1.13 times their average sales.
• This means that weekly sales rose by 0.05 of their normal sales (1.18 − 1.13).
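
The delta calculation can be sketched with the pandas `diff` method. The Gibson Group figures for week_minus_4 and week_minus_3 come from the example above; every other number, the column week_minus_2, the second customer, and the `_delta` naming are assumptions made for illustration.

```python
import pandas as pd

# Normalized weekly spend (multiples of each customer's average week).
normalized = pd.DataFrame(
    {"week_minus_4": [1.13, 0.90], "week_minus_3": [1.18, 0.70],
     "week_minus_2": [1.10, 0.40], "last_week": [1.05, 0.00]},
    index=["Gibson Group", "Acme Cafe"],
)

# diff(axis=1) subtracts the column to the left, i.e. the preceding week.
# The first column has no preceding week, so it is dropped.
deltas = normalized.diff(axis=1).drop(columns="week_minus_4")
deltas.columns = [f"{col}_delta" for col in deltas.columns]
print(deltas.round(2))
```

In this made-up data the second customer's deltas are negative every week, which is the kind of steady decline the churn pattern describes.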

What do you think about this delta?
10
XGBoost primer
• How XGBoost works
11
How XGBoost works
• XGBoost is an ensemble machine learning model.
• Ensemble means: it uses a number of different
approaches to improve the effectiveness of its
learning.
• XGBoost stands for Extreme Gradient Boosting:
• The name has two parts:
• Gradient boosting
• Extreme

12
How XGBoost works
• Gradient boosting is a technique where different learners are added one after another, each correcting the errors of those before it, to improve a function.
• The Extreme part of the name: XGBoost has a number of other characteristics that make the model accurate.
• It can handle sparse data.
• It handles the overfitting problem through parameters you can set.
• Overfitting is an undesirable ML behavior that occurs when the machine learning model gives accurate predictions for training data but not for new data.

13
ML model evaluation
• There are many evaluation metrics to measure the performance of an ML model.
• Precision / recall / F1 score are among the most popular
• Accuracy
• Area under the curve
• And many more
Let’s take a look…

14
Confusion Matrix

                     Actual Positive        Actual Negative
Predicted Positive   True Positive (TP)     False Positive (FP)
Predicted Negative   False Negative (FN)    True Negative (TN)

• True Positives: predicted YES → actual YES
• True Negatives: predicted NO → actual NO
• False Positives: predicted YES → actual NO
• False Negatives: predicted NO → actual YES
15
• Precision: the proportion of predicted 1’s that are actually 1’s.
Precision = TP / (TP + FP)
• Recall = Sensitivity = True Positive Rate (TPR): the proportion of 1’s correctly classified.
Recall = TP / (TP + FN)
• F measure: the harmonic mean of precision and recall. Ranges over [0, 1]. It tells how precise your classifier is (how many instances it classifies correctly), as well as how robust it is (it does not miss a significant number of instances).
F = 2(P × R) / (P + R)
• Specificity = True Negative Rate (TNR): the proportion of 0’s correctly classified.
Specificity = TN / (TN + FP)
• Accuracy: the proportion of cases correctly classified.
Accuracy = (TP + TN) / (TP + FP + TN + FN)
• False Positive Rate (FPR): the proportion of 0’s mistakenly classified as 1’s.
FPR = FP / (FP + TN)
16
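
The formulas on the metrics slide can be checked with a small helper. This is a sketch only: the function name `classification_metrics` and the four counts are hypothetical, and the helper assumes no denominator is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the slide's metrics from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # sensitivity, true positive rate
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * (precision * recall) / (precision + recall),
        "specificity": tn / (tn + fp),  # true negative rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "false_positive_rate": fp / (fp + tn),
    }

# Hypothetical counts for 1,000 customers, 60 of whom churned.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=930)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is high mostly because the negatives dominate, while recall shows that a third of the churners were missed. That gap is why accuracy alone can mislead on imbalanced data.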
Area Under the Curve (AUC)
• Q: How does the machine learning model determine whether the function is getting better or getting worse?
• A: We use the AUC – ROC curve.
• The ROC (Receiver Operating Characteristic) curve shows how capable the model is of distinguishing between classes.
• The AUC – ROC curve is a performance measurement for classification problems at various cutoff settings.
• The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
• Example: the higher the AUC, the better the model is at distinguishing between patients with disease and no disease.
17
Area Under the Curve (AUC)
• In XGBoost, the objective is binary:logistic.
• This means the output is not a prediction of a positive or negative label.
• It is the probability of a positive label.
• Result: a continuous value between 0 and 1.
• It is then up to us to decide what probability will produce a positive prediction.

18
Area Under the Curve (AUC)
• The normal choice is 0.5 (50%) as the cutoff point.
• But we may change this, because in some cases the cost of missing a positive is high enough to justify choosing a cutoff much lower than 0.5.
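
To illustrate how the cutoff turns probabilities into decisions, here is a small sketch. The function name `apply_cutoff`, the probabilities, and the lower cutoff of 0.25 are hypothetical values chosen for illustration.

```python
def apply_cutoff(probabilities, cutoff=0.5):
    """Turn predicted probabilities into 1/0 labels using the given cutoff."""
    return [1 if p >= cutoff else 0 for p in probabilities]

# Hypothetical churn probabilities returned by a model.
probabilities = [0.05, 0.30, 0.45, 0.62, 0.91]

default_labels = apply_cutoff(probabilities)                # cutoff of 0.5
cautious_labels = apply_cutoff(probabilities, cutoff=0.25)  # flags more customers

print(default_labels)
print(cautious_labels)
```

Lowering the cutoff means more customers get a call: fewer churners are missed, at the price of calling some customers who were never going to leave.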

19
ROC curve
• Vertical axis (true positive rate): the proportion of all positives that are actually identified as positive by our model.
• Horizontal axis (false positive rate): the incorrect positive predictions as a proportion of all negatives.
20
AUC – ROC

Each probability cutoff is one point on the curve. At a cutoff low enough that the model captures all of the true positives, you will also accidentally predict more negatives as positives.

21
AUC – ROC

22
AUC – ROC

When we use AUC as our evaluation metric, we are telling XGBoost to score our model by the area under the ROC curve and to favor the model that maximizes it, to give us the best possible results.
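
To make the area under the ROC curve concrete, the sketch below sweeps the cutoff over a handful of scored examples, records the false positive rate and true positive rate at each cutoff, and adds up the area with the trapezoid rule. The function names `roc_points` and `auc` and the labels and scores are hypothetical. This is a teaching sketch, not how XGBoost computes the metric internally.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per cutoff."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # a cutoff above every score predicts no positives
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve through the given points (trapezoid rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical actual labels and predicted probabilities.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.30, 0.35, 0.60, 0.80, 0.90]

curve = roc_points(labels, scores)
print(curve)
print(round(auc(curve), 4))
```

An AUC of 1.0 would mean every positive is scored above every negative; in this example one negative (0.60) outranks one positive (0.35), so the area is a little below 1.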

23
Build the model

We will use the same steps as we did in chapter 2.

1. Upload a dataset to S3.

2. Set up a notebook on SageMaker.

3. Upload the starting notebook.

4. Run it against the data.

24
Build the model

As in chapter 2, you will go through the code in six parts:

1. Load and examine the data.

2. Get the data into the right shape.

3. Create training, validation, and test datasets.

4. Train the machine learning model.

5. Host the machine learning model.

6. Test the model and use it to make decisions.

25
1. Load and examine the data

26
1. Load and examine the data

27
1. Load and examine the data

28
1. Load and examine the data

29
2. Get the data into the right
shape
• ML algorithms can only work with numbers, so we need to either remove our categorical data or encode it.
• The categorical data here are customer_name, customer_code, and id.
• We will remove them, as they do not have an influence on the model.
• To remove the data, use the pandas drop function.
• axis=1 indicates that you want to remove columns rather than rows in the pandas DataFrame.
• To display the first five rows of the dataset, use the head function.
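
A short sketch of the drop step, assuming the three categorical columns named above. The sample rows, the target column name `churned`, and the variable names `df` and `encoded_data` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical customer rows with the three categorical columns.
df = pd.DataFrame({
    "id": [1, 2],
    "customer_code": ["C001", "C002"],
    "customer_name": ["Gibson Group", "Acme Cafe"],
    "churned": [0, 1],
    "last_week": [1.05, 0.00],
})

# axis=1 removes columns; drop returns a new DataFrame and leaves df unchanged.
encoded_data = df.drop(["id", "customer_code", "customer_name"], axis=1)
print(encoded_data.head())  # first five rows (only two exist here)
```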
30
2. Get the data into the right
shape

31
3. Create training, validation, and
test datasets
• Split the data into test, validation, and training datasets as you did in chapter 2.
• We will use the stratify parameter during the split.
• Why?
• In our dataset, the target variable we are predicting is relatively rare.
• The parameter works by making sure that the train, validate, and test datasets contain similar ratios of the target variable’s values.
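
A sketch of a stratified split using scikit-learn's `train_test_split`. The data, the column name `churned`, the 70/15/15 proportions, and the random seed are assumptions for illustration; the notebook's actual split may use different values.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 170 customers, 10 of whom churned (about 1 in 17).
df = pd.DataFrame({
    "churned": [1] * 10 + [0] * 160,
    "last_week": [i / 170 for i in range(170)],
})

# Hold out 30% of the rows, keeping the churn ratio the same in both parts.
train_df, holdout_df = train_test_split(
    df, test_size=0.3, stratify=df["churned"], random_state=0)

# Split the held-out rows into validation and test sets, again stratified.
val_df, test_df = train_test_split(
    holdout_df, test_size=0.5, stratify=holdout_df["churned"], random_state=0)

for name, part in [("train", train_df), ("validate", val_df), ("test", test_df)]:
    print(name, len(part), round(part["churned"].mean(), 3))
```

Without `stratify`, a random split of such a rare class could leave the validation or test set with no churners at all.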
32
33
34
Converting the datasets to CSV
and saving to S3

35
4. Train the machine learning
model
• We will explain XGBoost a bit more.
• The interesting parts of the following are the
estimator hyperparameters.
• The hyperparameters of interest to us are:

36
Estimator hyperparameters of interest to us in this chapter.

37
4. Train the machine learning
model
• objective: We set this hyperparameter to binary:logistic.
• Use this setting when the target variable is 1 or 0.
• eval_metric: The evaluation metric you are optimizing for. We use the metric argument auc (area under the curve).

38
4. Train the machine learning
model
• num_round: How many times you want to let the machine learning model run through the training data.
• With each loop through the data, the function gets better at separating the dark circles from the light circles.
• After a while, though, the model gets too good: it begins to find patterns in the training data that are not reflected in the real world (overfitting).
• The larger the number of rounds, the more likely you are to be overfitting.
• To avoid this, you set early stopping rounds.

39
4. Train the machine learning
model
• early_stopping_rounds: The number of consecutive rounds the algorithm is allowed to go without improving before training stops.
• scale_pos_weight: The scale positive weight is used with imbalanced datasets to make sure the model puts enough emphasis on correctly predicting rare classes during training.
• In the current dataset, about 1 in 17 customers will churn.
• So we set scale_pos_weight to 17 to compensate for this imbalance.
• This tells XGBoost to focus more on customers who actually churn rather than on happy customers who are still happy.
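
The hyperparameters discussed in this chapter can be collected in a plain dictionary, as sketched below. The label counts and the num_round value are placeholders, since the slides give no figures for them. The early_stopping_rounds value of 10 is the one used in the training run described in this chapter. In the SageMaker Python SDK such values would typically be handed to the estimator (for example through its `set_hyperparameters` method); that call is left out here because it needs an AWS session.

```python
# Hypothetical label counts matching the "about 1 in 17" churn rate.
total = 1700
churned = 100
not_churned = total - churned

# The slide rounds the imbalance to 17 (total / churned). XGBoost's parameter
# notes suggest negatives / positives as a typical starting value, which is
# 16 here; the two are close when the positive class is this rare.
one_in_n = total / churned
neg_to_pos = not_churned / churned

hyperparameters = {
    "objective": "binary:logistic",   # output is a probability between 0 and 1
    "eval_metric": "auc",             # area under the ROC curve
    "num_round": 100,                 # placeholder; the slides give no value
    "early_stopping_rounds": 10,      # value used in this chapter's training run
    "scale_pos_weight": round(one_in_n),
}
print(hyperparameters)
print(neg_to_pos)
```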
40
4. Train the machine learning
model
• The output shown is taken from the output of the Train the Model cell in the notebook (slide 36).
• You can see that round 24 scores an AUC of 0.977821.
• This is not better than the previous best of 0.978943 from round 14.
• Because we set early_stopping_rounds=10, the training stops at round 24, which is 10 rounds past the best result in round 14.
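
The stopping rule can be sketched in a few lines. The function `stopping_round` is a hypothetical illustration, not XGBoost's own code. Only the scores for rounds 14 and 24 come from the output described above; the other scores are invented to fill in the sequence, and rounds are assumed to be numbered from 0.

```python
def stopping_round(scores, early_stopping_rounds):
    """Return (stop_round, best_round, best_score); higher scores are better."""
    best_round, best_score = 0, float("-inf")
    for round_number, score in enumerate(scores):
        if score > best_score:
            best_round, best_score = round_number, score
        elif round_number - best_round >= early_stopping_rounds:
            # No improvement for the allowed number of rounds: stop here.
            return round_number, best_round, best_score
    return len(scores) - 1, best_round, best_score

# Rounds 0-13: invented, steadily improving validation AUC.
scores = [0.90 + 0.005 * i for i in range(14)]
scores.append(0.978943)        # round 14: best score (from the slide)
scores.extend([0.9775] * 9)    # rounds 15-23: invented, no improvement
scores.append(0.977821)        # round 24: still below the best (from the slide)

print(stopping_round(scores, early_stopping_rounds=10))
```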

41
4. Train the machine learning
model

42
5. Host the machine learning
model

• Host the model on SageMaker so it is ready to make decisions.
• This sets up a server that receives data and returns decisions.
43
6. Test the model and use it to
make decisions
• Now that the endpoint is set up and hosted, you
can start making decisions.
• Start by running your test data through the system
to see how the model works on data it hasn’t seen
before.

44
• Create a function that returns 1 if the customer is more likely to churn and 0 if they are less likely to churn.
• Open the test CSV file you created in slide 34, then apply the get_prediction function to every row in the test dataset to display the data.
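
The shape of such a function can be sketched without a live endpoint. Everything here is an assumption for illustration: `fake_endpoint` is a made-up stand-in for the hosted model, the label is assumed to sit in the first column, and the data values are invented. In the real notebook the probability would come from the SageMaker endpoint instead.

```python
import pandas as pd

def fake_endpoint(features):
    """Hypothetical stand-in for the hosted model; returns a probability as bytes."""
    # Pretend that a low spend last week means a high churn probability.
    probability = max(0.0, min(1.0, 1.0 - features[-1]))
    return str(probability).encode("utf-8")

def get_prediction(row, cutoff=0.5):
    """Return 1 if the customer is more likely to churn, 0 otherwise."""
    # Assumes the first column holds the actual label, so it is not sent.
    features = list(row.iloc[1:])
    probability = float(fake_endpoint(features).decode("utf-8"))
    return 1 if probability >= cutoff else 0

# Hypothetical test rows; in the notebook these would be read from the test CSV.
test_data = pd.DataFrame({
    "churned": [0, 1, 0],
    "week_minus_3": [1.1, 0.6, 0.9],
    "last_week": [1.0, 0.1, 0.8],
})
test_data["prediction"] = test_data.apply(get_prediction, axis=1)
print(test_data)
```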

45
6. Test the model and use it to
make decisions

To see how the model performs overall, you can look at how many customers churned in the test dataset compared to how many Carlos would have called. To do this, you use the value_counts function as shown in the next listing.

Create a confusion matrix.
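
A sketch of both steps with pandas, using `value_counts` for the totals and `crosstab` for the confusion matrix. The ten rows and the column names `churned` and `prediction` are hypothetical.

```python
import pandas as pd

# Hypothetical actual labels and model predictions for ten customers.
test_data = pd.DataFrame({
    "churned":    [1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    "prediction": [1, 0, 1, 1, 0, 0, 0, 0, 0, 0],
})

# How many customers churned versus how many Carlos would have called.
actual_counts = test_data["churned"].value_counts()
called_counts = test_data["prediction"].value_counts()
print(actual_counts)
print(called_counts)

# Rows are predictions and columns are actual outcomes, as in the slide's layout.
confusion = pd.crosstab(
    index=test_data["prediction"], columns=test_data["churned"],
    rownames=["predicted"], colnames=["actual"])
print(confusion)
```

Equal totals do not mean the model was right every time: here three customers churned and three would have been called, but only two of the calls reach actual churners.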

46
Try it yourself
                     Actual Positive        Actual Negative
Predicted Positive   True Positive (TP)     False Positive (FP)
Predicted Negative   False Negative (FN)    True Negative (TN)

Accuracy = (TP + TN) / (TP + FP + TN + FN)

47
Summary
• We created a machine learning model to determine which customers to call because they are at risk of taking their business to a competitor.
• XGBoost is a gradient-boosting machine learning model that uses an ensemble of different approaches to improve the effectiveness of its learning.
• Stratify is one technique to help you handle imbalanced datasets. It makes sure that the train, validate, and test datasets contain similar ratios of the target variable’s values.
• A confusion matrix is one of the most helpful tools in understanding the performance of a model.
48
