
PROJECT MODULE

PREDICTIVE MODELING
CONTENTS

Problem 1: LINEAR REGRESSION
PROBLEM STATEMENT

Problem 2: LOGISTIC REGRESSION AND LDA
PROBLEM STATEMENT

1. PROBLEM STATEMENT:

You are hired by a company, Gem Stones Co. Ltd., which is a cubic zirconia manufacturer. You
are provided with a dataset containing the prices and other attributes of almost 27,000 cubic zirconia
(an inexpensive diamond alternative with many of the same qualities as a diamond). The
company earns different profits on different price slots. You have to help the company predict
the price of a stone on the basis of the details given in the dataset, so it can distinguish
between more profitable and less profitable stones and thereby improve its profit share. Also,
provide them with the 5 attributes that are most important.

1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, Data types, shape, EDA). Perform Univariate and Bivariate Analysis.
Sample of dataset:

carat cut color clarity depth table x y z price


1 0.3 Ideal E SI1 62.1 58 4.27 4.29 2.66 499
2 0.33 Premium G IF 60.8 58 4.42 4.46 2.7 984
3 0.9 Very Good E VVS2 62.2 60 6.04 6.12 3.78 6289
4 0.42 Ideal F VS1 61.6 56 4.82 4.8 2.96 1082
5 0.31 Ideal F VVS1 60.4 59 4.35 4.43 2.65 779
6 1.02 Ideal D VS2 61.5 56 6.46 6.49 3.99 9502
7 1.01 Good H SI1 63.7 60 6.35 6.3 4.03 4836
8 0.5 Premium E SI1 61.5 62 5.09 5.06 3.12 1415
9 1.21 Good H SI1 63.8 64 6.72 6.63 4.26 5407
Table 1.1 Sample dataset.

Dataset has 9 independent variables which are described as follows:

Variable Name   Description
Carat           Carat weight of the cubic zirconia.
Cut             Cut quality of the cubic zirconia, in increasing order of quality: Fair, Good, Very Good, Premium, Ideal.
Color           Colour of the cubic zirconia, with D being the best and J the worst.
Clarity         Absence of inclusions and blemishes in the cubic zirconia. In order from best to worst (FL = flawless, I3 = level 3 inclusions): FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3.
Depth           Height of the cubic zirconia, measured from the culet to the table, divided by its average girdle diameter.
Table           Width of the cubic zirconia's table expressed as a percentage of its average diameter.
X               Length of the cubic zirconia in mm.
Y               Width of the cubic zirconia in mm.
Z               Height of the cubic zirconia in mm.

In the dataset, price is the dependent/target variable, indicating the price of the zirconia.
The size of the data set is 26967 x 10 (26967 rows & 10 columns).

Data-Set information:

Variable Name   Data Type   Number of Missing Values
Carat           float64     0
Cut             object      0
Color           object      0
Clarity         object      0
Depth           float64     697
Table           float64     0
X               float64     0
Y               float64     0
Z               float64     0
Price           int64       0
Table 1.2 dataset information.

From the above table, it is evident that there are null values present in the depth variable.
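
The checks above can be reproduced with a short pandas sketch. The file name cubic_zirconia.csv is an assumption; the report does not name the source file.

import pandas as pd

df = pd.read_csv("cubic_zirconia.csv")  # hypothetical file name

print(df.shape)           # expected: (26967, 10)
print(df.dtypes)          # carat/depth/table/x/y/z float64, price int64, rest object
print(df.isnull().sum())  # depth should show 697 missing values
print(df.describe(include="all").T)  # descriptive statistics as in Table 1.3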

Descriptive Statistics for the dataset:

Variable-Name   count unique top freq mean std min 25% 50% 75% max
carat 26967 NaN NaN NaN 0.8 0.5 0.2 0.4 0.7 1.05 4.5
cut 26967 5 Ideal 10816 NaN NaN NaN NaN NaN NaN NaN
color 26967 7 G 5661 NaN NaN NaN NaN NaN NaN NaN
clarity 26967 8 SI1 6571 NaN NaN NaN NaN NaN NaN NaN
depth 26270 NaN NaN NaN 61.7 1.4 50.8 61 61.8 62.5 73.6
table 26967 NaN NaN NaN 57.5 2.2 49 56 57 59 79
x 26967 NaN NaN NaN 5.7 1.1 0 4.71 5.69 6.55 10.23
y 26967 NaN NaN NaN 5.7 1.2 0 4.71 5.71 6.54 58.9
z 26967 NaN NaN NaN 3.5 0.7 0 2.9 3.52 4.04 31.8
price 26967 NaN NaN NaN 3939.5 4024.9 326 945 2375 5360 18818
Table 1.3 Descriptive Statistics table
Here, the descriptive statistics table shows the five-point summary of the data set.
Inferences from the descriptive statistics table:
• Carat is a continuous variable with a mean of 0.8 and a median of 0.7, which indicates that the distribution is lightly skewed.
• Cut is a categorical variable with 5 levels. The levels are ordinal, and Ideal is the most frequent.
• Color is a categorical variable with 7 levels. The levels are ordinal, and G is the most frequent.
• Clarity is a categorical variable with 8 levels. The levels are ordinal, and SI1 is the most frequent.
• The variables depth, table, x, y and z are continuous, and all of them are lightly skewed.
• Price is also a continuous variable, and it is positively skewed.

A total of 34 duplicate rows were identified in the dataset, and all of them were removed.
Hence the size of the dataset becomes 26933 x 10 (26933 rows & 10 columns).

Univariate and Bivariate Analysis:

Univariate Analysis: The main purpose of univariate analysis is to summarize and find patterns in the
data. The key point is that there is only one variable involved in the analysis.

From the figure below, we can see how the numerical variables (carat, depth, table, x, y, z and
price) are distributed.

Fig 1.1 Distribution plot of continuous variables

Inference from Fig 1.1: the price and carat variables are not normally distributed; both are
positively skewed. The remaining variables also do not appear normally distributed, and outliers
might be the reason.

Bivariate Analysis: In bivariate analysis we try to determine if there is any relationship between two
variables.

Fig 1.2 Pair plot.

1. For linear regression we should focus on the off-diagonal plots.
2. The presence of multiple peaks in the diagonal plots could indicate a number of clusters.
3. The variables carat and price are positively correlated.
4. The variables x and price also show a positive correlation.

Some additional plots below give an interpretation of how the carat variable influences the
price, with different hues (color, clarity & cut).

Fig 1.3 Scatter plot between Price and Carat with different hues

Correlation Map:

Fig 1.4 Correlation Map


The variables depth and table are not correlated. The variables carat, x, y, z and price are highly
correlated.
Finding out the outliers in the dataset:

Fig 1.5 Box plot before treating outliers

From Fig 1.5 it is clear that all the variables in the dataset have outliers. Outliers in the dataset
affect the best-fit line, so they should be removed or treated before building the model.

Here, the outliers are treated by imputing the upper-range and lower-range values in
place of the outliers. By this, the number of rows does not change. The boxplot in Fig 1.6 shows
the distributions after treatment.

Fig 1.6 Box plot after treating outliers
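
The report does not spell out the exact upper-range/lower-range rule, so the sketch below assumes the usual 1.5×IQR whisker caps, continuing with the dataframe df from the earlier sketch:

def cap_outliers(series):
    # Cap values outside the 1.5*IQR whiskers at the whisker values (assumed rule).
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

for col in ["carat", "depth", "table", "x", "y", "z", "price"]:
    df[col] = cap_outliers(df[col])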

1.2 Impute null values if present, also check for the values which are equal to zero. Do they
have any meaning, or do we need to change them or drop them? Do you think scaling is
necessary in this case?
In the dataset, the depth variable has 697 null values, as shown in Table 1.2. The
missing values in the depth variable are replaced with the median of the variable (61.8),
because outliers are present in depth.

It is also observed that the variables x, y and z contain zeros.

carat cut color clarity depth table x y z price


5821 0.71 Good F SI2 64.1 60.0 0.0 0.0 0.0 2130
6215 0.71 Good F SI2 64.1 60.0 0.0 0.0 0.0 2130
17506 1.14 Fair G VS1 57.5 67.0 0.0 0.0 0.0 6381
Table 1.4

Generally, a dimension of a zirconia cannot be zero. So these values are treated as missing
values and replaced with the median of their respective variables, since outliers are present in
all three variables.
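
A sketch of both imputations described above, continuing with the dataframe df from the earlier sketch:

import numpy as np

# Median-impute the 697 missing depth values (median = 61.8, see above).
df["depth"] = df["depth"].fillna(df["depth"].median())

# Treat zeros in the dimensions as missing and median-impute them.
for col in ["x", "y", "z"]:
    df[col] = df[col].replace(0, np.nan)
    df[col] = df[col].fillna(df[col].median())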

Here, the model is solved with OLS regression. Scaling is not necessary, and it will not have
any impact on the accuracies or R-square.

1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into test
and train (70:30). Apply Linear regression. Performance Metrics: Check the performance of
Predictions on Train and Test sets using R square, RMSE
The variables cut, color and clarity are categorical variables of ordinal type. So label
encoding is done here, as presented below.

feature: cut
[Ideal, Premium, Very Good, Good, Fair]
Categories (5, object): [Fair, Good, Ideal, Premium, Very Good]
[2 3 4 1 0]

feature: color
[E, G, F, D, H, J, I]
Categories (7, object): [D, E, F, G, H, I, J]
[1 3 2 0 4 6 5]

feature: clarity
[SI1, IF, VVS2, VS1, VVS1, VS2, SI2, I1]
Categories (8, object): [I1, IF, SI1, SI2, VS1, VS2, VVS1, VVS2]
[2 1 7 4 6 5 3 0]
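
A sketch of this label encoding with pandas; Categorical's default lexicographic category order reproduces the code listings above (note this is alphabetical order, not the stated quality order):

import pandas as pd

for col in ["cut", "color", "clarity"]:
    # e.g. cut: Fair=0, Good=1, Ideal=2, Premium=3, Very Good=4
    df[col] = pd.Categorical(df[col]).codes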

The dataset after encoding will be as follows:

carat cut color clarity depth table x y z price

0 0.30 2 1 2 62.1 58.0 4.27 4.29 2.66 499.0


1 0.33 3 3 1 60.8 58.0 4.42 4.46 2.70 984.0
2 0.90 4 1 7 62.2 60.0 6.04 6.12 3.78 6289.0
3 0.42 2 2 4 61.6 56.0 4.82 4.80 2.96 1082.0
4 0.31 2 2 6 60.4 59.0 4.35 4.43 2.65 779.0
Table 1.5 Sample dataset after encoding the categorical variables

Before proceeding further, the data is split into x and y, where x represents the independent
variables and y represents the dependent variable.

The train/test split must be done before modelling: we train the model on the training set and
validate it on the test set. The test set holds 30% of the data, and the remaining 70% forms the
train set.

Hence, we have 18853 samples of train data and 8080 samples of test data.

Shape of train dataset – (18853 x 10)


Shape of test dataset – (8080 x 10)

After separating the training features and the target variable, the shapes are as follows:
o x_train: training set (18853, 9)
o x_test: test set (8080, 9)
o y_train: target variable of train set (18853,)
o y_test: target variable of test set (8080,)
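
A minimal sketch of this split with scikit-learn; the random_state is an assumption, since the report does not state the seed:

from sklearn.model_selection import train_test_split

x = df.drop("price", axis=1)   # 9 independent variables
y = df["price"]                # target variable

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.30, random_state=1)   # random_state=1 is an assumption

print(x_train.shape, x_test.shape)   # (18853, 9) (8080, 9)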

Linear Regression:

The linear regression model fits the data with a linear combination of the explanatory variables.
The model is carried out with the following parameters and hyperparameters:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Train RMSE   Test RMSE   Training score (R-square)   Testing score (R-square)
1047.15      1031.7      0.9087                      0.9117


Table 1.6 Performance Metric of Linear Regression model.
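
A sketch of how the figures in Table 1.6 could be produced with scikit-learn:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lr = LinearRegression()
lr.fit(x_train, y_train)

# RMSE on train and test (reported above as 1047.15 and 1031.7)
train_rmse = np.sqrt(mean_squared_error(y_train, lr.predict(x_train)))
test_rmse = np.sqrt(mean_squared_error(y_test, lr.predict(x_test)))

# R-square on train and test (reported above as 0.9087 and 0.9117)
r2_train = lr.score(x_train, y_train)
r2_test = lr.score(x_test, y_test)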

The magnitude of the RMSE depends on the scale/normalization of the y-variable. The model
was also run in statsmodels; the intercept and the coefficient of each variable are tabulated here.

Coeff.
Intercept 9442.85
carat 9198.16
cut 50.00
color -233.06
clarity 253.81
depth -88.61
table -73.02
x -2028.80
y 1553.24
z -312.91

Table 1.7 Intercept and coefficient of explanatory variables

From the above, we can derive the linear regression equation as:

Price = (9442.85) * Intercept + (9198.16) * carat + (50.0) * cut + (-233.06) * color + (253.81) * clarity
+ (-88.61) * depth + (-73.02) * table + (-2028.8) * x + (1553.24) * y + (-312.91) * z

The coefficient of the ‘x’ variable is -2028.8, which suggests a negative relation with price,
whereas the pair plot shows a positive correlation. This is due to multicollinearity.

Scatter Plot between Actual y-test and predicted y-test:

Fig 1.7 Actual vs Prediction

OLS Regression Results:
==============================================================================
Dep. Variable: price R-squared: 0.909
Model: OLS Adj. R-squared: 0.909
Method: Least Squares F-statistic: 2.084e+04
Date: Fri, 03 Jul 2020 Prob (F-statistic): 0.00
Time: 22:29:04 Log-Likelihood: -1.5785e+05
No. Observations: 18853 AIC: 3.157e+05
Df Residuals: 18843 BIC: 3.158e+05
Df Model: 9
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 9442.8480 730.010 12.935 0.000 8011.963 1.09e+04
carat 9198.1563 93.404 98.477 0.000 9015.076 9381.236
cut 50.0018 7.716 6.480 0.000 34.878 65.126
color -233.0625 4.697 -49.620 0.000 -242.269 -223.856
clarity 253.8146 4.567 55.576 0.000 244.863 262.766
depth -88.6089 10.022 -8.842 0.000 -108.253 -68.965
table -73.0175 3.850 -18.966 0.000 -80.564 -65.471
x -2028.7985 137.469 -14.758 0.000 -2298.251 -1759.346
y 1553.2354 136.054 11.416 0.000 1286.557 1819.914
z -312.9073 110.389 -2.835 0.005 -529.279 -96.536
==============================================================================
Omnibus: 5117.050 Durbin-Watson: 1.989
Prob(Omnibus): 0.000 Jarque-Bera (JB): 25447.622
Skew: 1.228 Prob (JB): 0.00
Kurtosis: 8.135 Cond. No. 8.19e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.19e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

When we normalize the y variable with a log transform, the model results are tabulated below.

Train RMSE   Test RMSE   Training score (R-square)   Testing score (R-square)
0.224        0.225       0.949                       0.9487

Table 1.8 Performance Metric of Linear Regression model (log transform(y-variable))

The intercept and the coefficient of each variable are tabulated here:


Coeff.
Intercept -1.21
carat -1.14
cut 0.002
color -0.06
clarity 0.06
depth 0.04
table -0.003
x 0.63
y 0.59
z 0.24
Table 1.9 Intercept and coefficient of explanatory variables (log transform(y-variable))

From the above, the linear regression equation can also be written as:

Price = antilog( (-1.21) * Intercept + (-1.14) * carat + (0.002) * cut + (-0.06) * color + (0.06) * clarity +
(0.04) * depth + (-0.003) * table + (0.63) * x + (0.59) * y + (0.24) * z )
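
A sketch of the log-transformed model: the regression is fitted on log(price), and predictions are brought back to the price scale with the exponential (the antilog above):

import numpy as np
from sklearn.linear_model import LinearRegression

lr_log = LinearRegression()
lr_log.fit(x_train, np.log(y_train))

# Back-transform predictions to the original price scale.
pred_price = np.exp(lr_log.predict(x_test))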

Scatter Plot between Actual y-test and predicted y-test with y variable log transform:

Fig 1.8 Actual vs Prediction (log transform(y-variable))

OLS Regression Results (y-variable log transform):

==============================================================================
Dep. Variable: price R-squared: 0.949
Model: OLS Adj. R-squared: 0.949
Method: Least Squares F-statistic: 3.900e+04
Date: Fri, 03 Jul 2020 Prob (F-statistic): 0.00
Time: 22:29:11 Log-Likelihood: 1446.7
No. Observations: 18853 AIC: -2873.
Df Residuals: 18843 BIC: -2795.
Df Model: 9
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -1.2142 0.156 -7.772 0.000 -1.520 -0.908
carat -1.1414 0.020 -57.100 0.000 -1.181 -1.102
cut 0.0022 0.002 1.336 0.182 -0.001 0.005
color -0.0645 0.001 -64.135 0.000 -0.066 -0.062
clarity 0.0628 0.001 64.295 0.000 0.061 0.065
depth 0.0355 0.002 16.571 0.000 0.031 0.040
table -0.0034 0.001 -4.090 0.000 -0.005 -0.002
x 0.6268 0.029 21.306 0.000 0.569 0.684
y 0.5901 0.029 20.266 0.000 0.533 0.647
z 0.2360 0.024 9.990 0.000 0.190 0.282
==============================================================================
Omnibus: 9554.691 Durbin-Watson: 1.985
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1576892.061
Skew: 1.372 Prob(JB): 0.00
Kurtosis: 47.720 Cond. No. 8.19e+03
==============================================================================

Checking for multicollinearity:
Multicollinearity is collinearity between the independent variables. Its presence does not affect
accuracy, but it makes the coefficients unreliable, so the model is not explainable.

    Features   VIF Factor (with y variable log transformed)

0 carat 31.739524
1 cut 1.064228
2 color 1.100387
3 clarity 1.066402
4 depth 2.528086
5 table 1.181361
6 x 408.915952
7 y 395.015655
8 z 100.942140
Table 1.10 VIF (Multi collinearity) Table
From the table, the variables carat, x, y and z show severe multicollinearity. The dimensions
x, y and z could be replaced with a single derived feature such as size or weight.
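
A sketch of the VIF computation behind Table 1.10, using statsmodels (whether a constant term is included in the design matrix is an assumption):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.DataFrame({
    "Features": x_train.columns,
    "VIF Factor": [variance_inflation_factor(x_train.values, i)
                   for i in range(x_train.shape[1])],
})
print(vif)   # carat, x, y and z come out far above the usual threshold of 5-10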
1.4 Inference: Based on these predictions, what are the business insights and
recommendations?
From the Equation,

Price = (9442.85) * Intercept + (9198.16) * carat + (50.0) * cut + (-233.06) * color + (253.81) *
clarity + (-88.61) * depth + (-73.02) * table + (-2028.8) * x + (1553.24) * y + (-312.91) * z

• Carat, the width of the cubic zirconia (y) and clarity are the main features for predicting the
price of the cubic zirconia.
• With a one-unit increase in carat, the price of zirconia increases by 9198.16, keeping all other
predictors constant.
• With a one-unit increase in clarity, the price of zirconia increases by 253.81, keeping all other
predictors constant.
• There are also some negative coefficients. For instance, table has a coefficient of -73.02.
This implies that a one-unit increase in table decreases the price of zirconia by 73.02, keeping
all other predictors constant.
• The cut quality of a cubic zirconia is not a very important factor in its price.

Recommendations:
1. Carat, the dimensions of the cubic zirconia and clarity are the most important features.
2. A cubic zirconia with high carat and FL/IF/VVS1/VVS2/VS1/VS2 clarity will be very
expensive. Offers/gifts can be added for these products.
3. As cut, depth and table do not influence the price much, these factors can be treated as
unimportant when targeting profits. For example, a cubic zirconia with high carat and
clarity but only a Fair cut will still give good profits.
4. The dimensions of a cubic zirconia are also proportional to its price: the higher the
dimensions, the higher the price.
2. PROBLEM STATEMENT:

You are hired by a tour and travel agency which deals in selling holiday packages. You are provided
details of 872 employees of a company. Among these employees, some opted for the package and
some did not. You have to help the company in predicting whether an employee will opt for the
package or not on the basis of the information given in the data set. Also, find out the important factors
on the basis of which the company will focus on particular employees to sell their packages.

2.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, Data types, shape, EDA). Perform Univariate and Bivariate Analysis.
Sample of dataset:
Holliday_Package Salary age educ no_young_children no_older_children foreign
1 no 48412 30 8 1 1 no
2 yes 37207 45 8 0 1 no
3 no 58022 46 9 0 0 no
4 no 66503 31 11 2 0 no
5 no 66734 44 12 0 2 no
6 yes 61590 42 12 0 1 no
7 no 94344 51 8 0 0 no
8 yes 35987 32 8 0 2 no
9 no 41140 39 12 0 0 no
Table 2.1 Sample dataset.

Dataset has 6 independent variables which are described as follows:

Variable Name       Description
Salary              Employee salary
age                 Age in years
educ                Years of formal education
no_young_children   Number of young children (younger than 7 years)
no_older_children   Number of older children
foreign             Foreigner (yes/no)

In the dataset, Holliday_Package is the dependent/target variable, which indicates whether the
customer opted for the holiday package. The size of the data set is 872 x 7 (872 rows & 7 columns).

Data-Set information:

Variable Name       Data Type   Number of Missing Values
Holliday_Package Object 0
Salary Int 64 0
age Int 64 0
educ Int 64 0
no_young_children Int 64 0
no_older_children Int 64 0
foreign Object 0
Table 2.2 dataset information.

From the above table, it is evident that there are no null values in the dataset. The variables
no_young_children, educ and no_older_children could arguably be treated as categorical (object) type.

Descriptive Statistics for the dataset:

Variable-Name       count  unique  top  freq  mean     std      min   25%    50%      75%      max
Holliday_Package    872    2       no   471   NaN      NaN      NaN   NaN    NaN      NaN      NaN
Salary              872    NaN     NaN  NaN   47729.2  23418.7  1322  35324  41903.5  53469.5  236961
age                 872    NaN     NaN  NaN   39.96    10.55    20    32     39       48       62
educ                872    NaN     NaN  NaN   9.31     3.04     1     8      9        12       21
no_young_children   872    4       0    665   NaN      NaN      NaN   NaN    NaN      NaN      NaN
no_older_children   872    7       0    393   NaN      NaN      NaN   NaN    NaN      NaN      NaN
foreign             872    2       no   656   NaN      NaN      NaN   NaN    NaN      NaN      NaN

Table 2.3 Descriptive Statistics table

Here, the descriptive statistics table shows the five-point summary of the data set.
Some inferences from the descriptive statistics table:
• Holliday_Package is a categorical variable; most of the customers in the dataset did not opt
for the holiday package.
• Salary is a continuous variable with a mean of 47729.2 and a median of 41903.5, which
indicates that the distribution is lightly skewed.
• Most of the foreigners did not opt for the holiday package.

No duplicate rows were identified in the dataset.

Univariate and Bivariate Analysis:

Univariate Analysis: The main purpose of univariate analysis is to summarize and find patterns in the
data. The key point is that there is only one variable involved in the analysis.

Fig 2.1 Distribution of ‘Salary’ Variable

Inference from Fig 2.1: the Salary variable is not normally distributed; outliers might be the reason.

Fig 2.2 Count plot – Holliday_Package (hue = foreign)

Bivariate analysis: In bivariate analysis we try to determine if there is any relationship between two
variables.

Fig 2.3 Pair plot.

• For logistic regression we should focus on the diagonal plots.
• The presence of multiple peaks in the diagonals could indicate a number of clusters.
• The variables are not correlated with each other.

Correlation Map:

Fig 2.4 Correlation Map

The variables are not correlated with each other.

Finding out the outliers in the dataset:

Fig 2.5 Box plot before treating outliers


From Fig 2.5 it is clear that the variable ‘salary’ has outliers. Outliers in the dataset impact the
model, so they should be removed or treated before modelling.

Here, the outliers are treated by imputing the upper-range and lower-range values in
place of the outliers. By this, the number of rows does not change. The boxplot in Fig 2.6 shows
the distribution after treatment.

Fig 2.6 Box plot after treating outliers

2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split:
Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear
discriminant analysis).
The variables Holliday_Package, no_young_children, no_older_children and foreign are categorical
variables. So, before proceeding to the model, these categorical variables are converted into codes,
as presented here:

feature: Holliday_Package
[no, yes]
Categories (2, object): [no, yes]
[0 1]

feature: no_young_children
[1, 0, 2, 3]
Categories (4, int64): [0, 1, 2, 3]
[1 0 2 3]

feature: no_older_children
[1, 0, 2, 4, 3, 5, 6]
Categories (7, int64): [0, 1, 2, 3, 4, 5, 6]
[1 0 2 4 3 5 6]

feature: foreign
[no, yes]
Categories (2, object): [no, yes]
[0 1]

The train/test split needs to be done before any predictive modelling such as logistic
regression and linear discriminant analysis (LDA): we train the model on the training set and
validate it on the test set.

The test set holds 30% of the data, and the remaining 70% forms the train set.

Hence, we have 610 samples of train data and 262 samples of test data.

Shape of train dataset – (610 x 6)


Shape of test dataset – (262 x 6)

After separating the training features and the target variable, the shapes are as follows:
o X_train: training set (610, 6)
o X_test: test set (262, 6)
o train_labels: target variable of train set (610,)
o test_labels: target variable of test set (262,)

The target column “Holliday_Package” indicates whether the person opted for the holiday
package or not. The details of the Holliday_Package attribute are tabulated here.

Opted No. of Customers Probability


0 471 0.54
1 401 0.46
Table 2.4 Holliday_Package Data

0: Customer not opted for holiday package


1: Customer opted for holiday package

The target variable/dependent variable is of categorical type. To predict the dependent variable, two
models are presented here.

Logistic Regression model:

It is a supervised learning method for classification. A logistic regression model captures the
relationship between the dependent class variable and the independent variables using regression.

The Logistic regression is carried out with the following parameters and hyperparameters.

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=10000,
                   multi_class='warn', n_jobs=2, penalty='none', random_state=None,
                   solver='newton-cg', tol=0.0001, verbose=True, warm_start=False)

Here, the model is run with up to 10,000 iterations and the newton-cg solver. The model is fitted
with X_train and train_labels. Now the model is ready for prediction.
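
A sketch of this fit with scikit-learn, using the parameters listed above (penalty='none' matches older scikit-learn releases; newer ones spell it penalty=None):

from sklearn.linear_model import LogisticRegression

logit = LogisticRegression(solver="newton-cg", max_iter=10000,
                           penalty="none", n_jobs=2, verbose=True)
logit.fit(X_train, train_labels)

ytrain_predict = logit.predict(X_train)
ytest_predict = logit.predict(X_test)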

Linear discriminant analysis model:

Discriminant analysis builds a model to predict future observations when the classes are known.
LDA uses a linear combination of the independent variables to predict the class in the response
variable of a given observation.

The LDA is carried out with the following parameters and hyperparameters.

LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
                           solver='svd', store_covariance=False, tol=0.0001)

Here, the model is executed with the singular value decomposition solver. The model is fitted with
X_train and train_labels. Now the model is ready for prediction.
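
And the corresponding LDA sketch:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(solver="svd")   # singular value decomposition
lda.fit(X_train, train_labels)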

2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model Final
Model: Compare Both the models and write inference which model is best/optimized.

A. Performance Metrics of Logistic Regression Model:

Confusion Matrix – train labels:

Fig 2.7 Confusion matrix – train dataset (Log. Regression model)

• Accuracy is 68%

Classification report of train dataset:


precision recall f1-score support

0 0.68 0.80 0.74 342


1 0.67 0.52 0.59 268

accuracy 0.68 610


macro avg 0.68 0.66 0.66 610
weighted avg 0.68 0.68 0.67 610
ROC Curve of train dataset:

Fig 2.8 ROC Curve – train dataset (Log. Regression model)

AUC (Area under ROC Curve) is 74.1%.


The ROC curve is a technique for visualizing classifier performance. The graph plots the TP
rate against the FP rate.

TP rate = True positives/Total positives.


FP rate = False positives/Total negatives.
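
A sketch of how the metrics in this section could be computed, shown for the logistic regression model on the train set (the same calls apply to the test set and to the LDA model):

from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score, roc_curve)

pred = logit.predict(X_train)
prob = logit.predict_proba(X_train)[:, 1]   # probability of class 1 (opted)

print(accuracy_score(train_labels, pred))         # reported above as ~0.68
print(confusion_matrix(train_labels, pred))
print(classification_report(train_labels, pred))
print(roc_auc_score(train_labels, prob))          # reported above as ~0.741
fpr, tpr, _ = roc_curve(train_labels, prob)       # points for the ROC plot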

Confusion Matrix – test labels:

Fig 2.9 Confusion matrix – test dataset (Log. Regression model)

• Accuracy is 64%

Classification report of test dataset:


precision recall f1-score support

0 0.61 0.76 0.68 129


1 0.69 0.53 0.60 133

accuracy 0.64 262


macro avg 0.65 0.64 0.64 262
weighted avg 0.65 0.64 0.64 262
ROC Curve of test dataset:

Fig 2.10 ROC Curve – test dataset (Log. Regression model)

AUC (Area under ROC Curve) is 74.1%.

The ROC curve is a technique for visualizing classifier performance. The graph plots the TP
rate against the FP rate.

Logistic regression model conclusion:

            Train Data   Test Data
AUC         74.1%        74.1%
Accuracy    68%          64%
Recall      52%          53%
Precision   67%          69%
F1 Score    59%          60%
Table 2.5 Log. Regression Model Performance Metrics Summary

• The training and test set results are quite similar, and with the overall measures reasonably
high, the model is a good model.
• Salary is the most important variable for predicting the target variable.

B. Performance Metrics of LDA Model:

Confusion Matrix – train labels:

Fig 2.11 Confusion matrix – train dataset (LDA model)

• Accuracy is 67%

Classification report of train dataset:

precision recall f1-score support

0 0.67 0.80 0.73 342


1 0.67 0.50 0.57 268

accuracy 0.67 610


macro avg 0.67 0.65 0.65 610
weighted avg 0.67 0.67 0.66 610

ROC Curve of train dataset:

Fig 2.12 ROC Curve – train dataset (LDA model)

AUC (Area under ROC Curve) is 73.9%.


The ROC curve is a technique for visualizing classifier performance. The graph plots the TP
rate against the FP rate.
TP rate = True positives/Total positives.
FP rate = False positives/Total negatives.

Confusion Matrix – test labels:

Fig 2.13 Confusion matrix – test dataset (LDA Model)

• Accuracy is 63%

Classification report of test dataset:


precision recall f1-score support

0 0.60 0.75 0.66 129


1 0.68 0.50 0.58 133

accuracy 0.63 262


macro avg 0.64 0.63 0.62 262
weighted avg 0.64 0.63 0.62 262

ROC Curve of test dataset:

Fig 2.14 ROC Curve – test dataset (LDA model)


AUC (Area under ROC Curve) is 73.9%.

The ROC curve is a technique for visualizing classifier performance. The graph plots the TP
rate against the FP rate.
LDA model conclusion:

            Train Data   Test Data
AUC         73.9%        73.9%
Accuracy    67%          63%
Recall      50%          50%
Precision   67%          68%
F1 Score    57%          58%
Table 2.6 LDA Model Performance Metrics Summary

• The training and test set results are quite similar, and with the overall measures reasonably
high, the model is a good model.
• Salary is the most important variable for predicting the target variable.

Performance metrics of all the models are summarized here:

            Train Data   Test Data   Train Data   Test Data
            (Log. R)     (Log. R)    (LDA)        (LDA)
AUC         74.1%        74.1%       73.9%        73.9%
Accuracy    68%          64%         67%          63%
Recall      52%          53%         50%          50%
Precision   67%          69%         67%          68%
F1 Score    59%          60%         57%          58%
Table 2.7 Performance metrics summary of both models.

ROC Curve for the 2 models on the training data:

Fig 2.15 ROC Curves for the 2 models – training data

ROC Curve for the 2 models on the testing data:

Fig 2.16 ROC Curves for the 2 models – test data

• The logistic regression model performs slightly better than linear discriminant analysis in
all aspects.
• A random forest model was also executed, and it also indicated that the Salary and age
variables are the most important for predicting whether a customer opts for the holiday package.
• The ROC curves for the two models are nearly identical.

2.4 Inference: Based on these predictions, what are the insights and recommendations?

The “Salary” and “age” features are the most important in determining whether the customer will
opt for the holiday package or not.

Log (odds) = (2.8199) * Intercept + (-2.0788e-05) * Salary + (-0.0572) * age + (0.041) * educ
+ (-1.426) * no_young_children + (-0.1226) * no_older_children + (1.408) * foreign
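
A worked example of this equation for a hypothetical employee (the feature values below are illustrative, not taken from the dataset); the log odds are converted to a probability with the sigmoid:

import numpy as np

coef = {"Intercept": 2.8199, "Salary": -2.0788e-05, "age": -0.0572,
        "educ": 0.041, "no_young_children": -1.426,
        "no_older_children": -0.1226, "foreign": 1.408}

# Hypothetical employee: salary 40000, age 35, 10 years of education,
# no young children, one older child, foreigner.
emp = {"Intercept": 1, "Salary": 40000, "age": 35, "educ": 10,
       "no_young_children": 0, "no_older_children": 1, "foreign": 1}

log_odds = sum(coef[k] * emp[k] for k in coef)   # ~1.68 for this example
prob = 1 / (1 + np.exp(-log_odds))               # ~0.84: likely to opt in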

Some of the business insights noticed here:

• With one additional year of educ, the log odds increase by 0.041; accordingly, the
probability of the customer opting for the holiday package also increases, keeping all other
predictors constant.
• If the customer is a foreigner, the log odds increase by 1.408; accordingly, the
probability of the customer opting for the holiday package also increases, keeping all other
predictors constant.
• There are also some negative coefficients. For instance, Salary has a coefficient of
-2.0788e-05. This implies that a one-unit increase in salary decreases the log odds by
2.0788e-05; accordingly, the probability of the customer opting for the holiday package also
decreases, keeping all other predictors constant.
Recommendations:
• Offer tour packages depending on the employee details.
• There is around a 74% probability that younger customers will opt for the holiday
package, even when the salary is near 25K.
• Foreigners are also more likely to opt for the holiday package.