PREDICTIVE MODELLING
1. PROBLEM STATEMENT:
You are hired by a company, Gem Stones co ltd, which is a cubic zirconia manufacturer. You
are provided with a dataset containing the prices and other attributes of almost 27,000 cubic zirconia
stones (an inexpensive diamond alternative with many of the same qualities as a diamond). The
company earns different profits on different price slots. You have to help the company predict the
price of a stone on the basis of the details given in the dataset, so that it can distinguish between
higher-profit and lower-profit stones and improve its profit share. Also, provide them with the
5 attributes that are most important.
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, Data types, shape, EDA). Perform Univariate and Bivariate Analysis.
Sample of dataset:
In the dataset, price is the dependent/target variable, which indicates the price of the zirconia stone.
The size of the dataset is 26967 X 10 (26967 rows & 10 columns).
Data-Set information:
From the data-set information table (Table 1.2), it is evident that there are null values present in the depth variable.
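The EDA steps described above can be reproduced with a short pandas sketch (the file name cubic_zirconia.csv is an assumption):

import pandas as pd

# Load the dataset (file name assumed)
df = pd.read_csv('cubic_zirconia.csv')

print(df.shape)                       # (26967, 10) rows x columns
df.info()                             # data types and non-null counts per column
print(df.isnull().sum())              # null counts; only 'depth' has missing values
print(df.describe(include='all').T)   # descriptive statistics (Table 1.3)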
Variable  count   unique  top    freq   mean    std     min    25%    50%   75%    max
carat     26967   NaN     NaN    NaN    0.8     0.5     0.2    0.4    0.7   1.05   4.5
cut       26967   5       Ideal  10816  NaN     NaN     NaN    NaN    NaN   NaN    NaN
color     26967   7       G      5661   NaN     NaN     NaN    NaN    NaN   NaN    NaN
clarity   26967   8       SI1    6571   NaN     NaN     NaN    NaN    NaN   NaN    NaN
depth     26270   NaN     NaN    NaN    61.7    1.4     50.8   61     61.8  62.5   73.6
table     26967   NaN     NaN    NaN    57.5    2.2     49     56     57    59     79
x         26967   NaN     NaN    NaN    5.7     1.1     0      4.71   5.69  6.55   10.23
y         26967   NaN     NaN    NaN    5.7     1.2     0      4.71   5.71  6.54   58.9
z         26967   NaN     NaN    NaN    3.5     0.7     0      2.9    3.52  4.04   31.8
price     26967   NaN     NaN    NaN    3939.5  4024.9  326    945    2375  5360   18818
Table 1.3 Descriptive Statistics table
Here, the descriptive statistics table shows the five-number summary of each variable, along with its count, mean and standard deviation.
Inference from the descriptive statistics table:
o Carat is a continuous variable with a mean of 0.8 and a median of 0.7, which indicates that the distribution is slightly (positively) skewed.
o Cut is a categorical variable with 5 levels. The levels are ordinal, and Ideal is the most frequent.
o Color is a categorical variable with 7 levels. The levels are ordinal, and G is the most frequent.
o Clarity is a categorical variable with 8 levels. The levels are ordinal, and SI1 is the most frequent.
o The variables depth, table, x, y and z are continuous, and all of them are slightly skewed.
o Price is also a continuous variable and is positively skewed.
A total of 34 duplicate rows were identified in the dataset, and all of them were removed.
Hence the size of the dataset becomes 26933 X 10 (26933 rows & 10 columns).
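A sketch of the duplicate check and removal, continuing from the same dataframe:

# Count and drop exact duplicate rows (34 were found here)
print(df.duplicated().sum())                      # 34
df = df.drop_duplicates().reset_index(drop=True)
print(df.shape)                                   # (26933, 10)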
Univariate Analysis: The main purpose of univariate analysis is to summarize and find patterns in the
data. The key point is that there is only one variable involved in the analysis.
From the figure below, we can see how the numerical variables (carat, depth, table, x, y, z and
price) are distributed.
Inference from Fig 1.1: the price and carat variables are not normally distributed; both are
positively skewed. The remaining variables also do not appear to be normally distributed, and
outliers might be the reason.
Bivariate Analysis: In bivariate analysis we try to determine if there is any relationship between two
variables.
Some other plots, shown below, give an interpretation of how the carat variable influences the
price under different hues (color, clarity & cut).
Fig 1.3 Scatter plot between Price and Carat with different hues
Correlation Map:
From Fig 1.5 it is clear that all the variables in the dataset have outliers. Outliers affect the
best-fit line, so they need to be removed or treated before building the model.
Here, the outliers in the dataset are treated by imputing the upper-range and lower-range values in
place of the outliers; this way, the number of rows does not change. The boxplot in Fig 1.6 shows
the distributions after the outlier treatment.
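One common way to implement this capping is with the 1.5 x IQR whisker rule of the boxplot; a minimal sketch, assuming that convention:

num_cols = ['carat', 'depth', 'table', 'x', 'y', 'z', 'price']

for col in num_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    # Cap values beyond the whiskers instead of dropping rows
    df[col] = df[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)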
1.2 Impute null values if present, also check for the values which are equal to zero. Do they
have any meaning, or do we need to change them or drop them? Do you think scaling is
necessary in this case?
In the dataset, the depth variable has 697 null values, as already shown in Table 1.2. Here, the
missing values in the depth variable are replaced with the median of the depth variable (61.8),
because outliers are present in the depth variable.
Generally, a dimension of a zirconia stone should not be zero. So, the zero values in x, y and z
should be treated as missing values and replaced with the median of their respective variable, as
outliers are present in all three variables.
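A sketch of the two imputations described above (median fill for the nulls in depth, and zero-to-median replacement for x, y and z):

import numpy as np

# Median fill for the nulls in 'depth' (median = 61.8, robust to outliers)
df['depth'] = df['depth'].fillna(df['depth'].median())

# Zero dimensions are physically impossible, so treat them as missing
for col in ['x', 'y', 'z']:
    df[col] = df[col].replace(0, np.nan)
    df[col] = df[col].fillna(df[col].median())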
Here, the model is solved with OLS regression. Scaling is not necessary, and it will not have any
impact on the accuracy or the R-squared.
1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into test
and train (70:30). Apply Linear regression. Performance Metrics: Check the performance of
Predictions on Train and Test sets using R square, RMSE
The variables cut, color and clarity are categorical variables of ordinal type. So, label
encoding is done here, and the resulting mappings are presented below.
feature: cut
[Ideal, Premium, Very Good, Good, Fair]
Categories (5, object): [Fair, Good, Ideal, Premium, Very Good]
[2 3 4 1 0]
feature: color
[E, G, F, D, H, J, I]
Categories (7, object): [D, E, F, G, H, I, J]
[1 3 2 0 4 6 5]
feature: clarity
[SI1, IF, VVS2, VS1, VVS1, VS2, SI2, I1]
Categories (8, object): [I1, IF, SI1, SI2, VS1, VS2, VVS1, VVS2]
[2 1 7 4 6 5 3 0]
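The listing above is consistent with encoding through pandas categoricals; a minimal sketch of that approach:

import pandas as pd

for feature in ['cut', 'color', 'clarity']:
    print('feature:', feature)
    df[feature] = pd.Categorical(df[feature])   # categories sorted alphabetically
    print(df[feature].unique())                 # original labels and category list
    df[feature] = df[feature].cat.codes         # replace labels with integer codes
    print(df[feature].unique())                 # e.g. [2 3 4 1 0] for cut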
The dataset after encoding will be as follows:
Before proceeding further, the data is split into two sets, X and y, where X holds the
independent variables and y holds the dependent variable.
The train/test split needs to be done before building the model: we train the model on the
training set and validate it on the test set.
The test set holds 30% of the data and the remaining 70% of the dataset forms the train set.
Hence, we have 18853 samples of train data and the remaining 8080 samples as test data.
After separating the predictors and the target, the resulting shapes are:
o x_train: training set (18853, 9)
o x_test: test set (8080, 9)
o y_train: target variable of train set (18853,)
o y_test: target variable of test set (8080,)
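A sketch of the 70:30 split with scikit-learn (the random_state is an assumption, fixed only for reproducibility):

from sklearn.model_selection import train_test_split

X = df.drop('price', axis=1)   # 9 independent variables
y = df['price']                # target variable

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

print(x_train.shape, x_test.shape)   # (18853, 9) (8080, 9)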
Linear Regression:
The linear regression model fits the data as a linear combination of the explanatory variables,
and it is carried out with the following parameters and hyperparameters.
The magnitude of the RMSE depends on the scale/normalization of the y variable. Here the model
has been run in statsmodels (a sketch follows), and the intercept and the coefficient of each variable are tabulated below.
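A sketch of the statsmodels fit, using the formula interface (which adds the intercept automatically):

import statsmodels.formula.api as smf

train = x_train.copy()
train['price'] = y_train

ols_model = smf.ols('price ~ carat + cut + color + clarity + depth + table + x + y + z',
                    data=train).fit()

print(ols_model.params)     # intercept and coefficients, tabulated below
print(ols_model.summary())  # the OLS Regression Results block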
Variable    Coeff.
Intercept 9442.85
carat 9198.16
cut 50.00
color -233.06
clarity 253.81
depth -88.61
table -73.02
x -2028.80
y 1553.24
z -312.91
From the above, we can derive the linear regression equation as:
Price = (9442.85) * Intercept + (9198.16) * carat + (50.0) * cut + (-233.06) * color + (253.81) * clarity
+ (-88.61) * depth + (-73.02) * table + (-2028.8) * x + (1553.24) * y + (-312.91) * z
The coefficient of the ‘x’ variable is -2028.8, which suggests a negative relationship with price,
whereas the pair plot shows a positive correlation. This is due to multicollinearity.
OLS Regression Results:
==============================================================================
Dep. Variable: price R-squared: 0.909
Model: OLS Adj. R-squared: 0.909
Method: Least Squares F-statistic: 2.084e+04
Date: Fri, 03 Jul 2020 Prob (F-statistic): 0.00
Time: 22:29:04 Log-Likelihood: -1.5785e+05
No. Observations: 18853 AIC: 3.157e+05
Df Residuals: 18843 BIC: 3.158e+05
Df Model: 9
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 9442.8480 730.010 12.935 0.000 8011.963 1.09e+04
carat 9198.1563 93.404 98.477 0.000 9015.076 9381.236
cut 50.0018 7.716 6.480 0.000 34.878 65.126
color -233.0625 4.697 -49.620 0.000 -242.269 -223.856
clarity 253.8146 4.567 55.576 0.000 244.863 262.766
depth -88.6089 10.022 -8.842 0.000 -108.253 -68.965
table -73.0175 3.850 -18.966 0.000 -80.564 -65.471
x -2028.7985 137.469 -14.758 0.000 -2298.251 -1759.346
y 1553.2354 136.054 11.416 0.000 1286.557 1819.914
z -312.9073 110.389 -2.835 0.005 -529.279 -96.536
==============================================================================
Omnibus: 5117.050 Durbin-Watson: 1.989
Prob(Omnibus): 0.000 Jarque-Bera (JB): 25447.622
Skew: 1.228 Prob (JB): 0.00
Kurtosis: 8.135 Cond. No. 8.19e+03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.19e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
When we normalize the y variable with a log transform, the model results are as tabulated hereunder.
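A sketch of the refit on the log-transformed target; note that predictions have to be mapped back with the antilog:

import numpy as np
import statsmodels.formula.api as smf

# Refit the same model on the log-transformed target
train['log_price'] = np.log(train['price'])
log_model = smf.ols('log_price ~ carat + cut + color + clarity + depth + table + x + y + z',
                    data=train).fit()

# Map predictions back to the price scale with the antilog
pred_price = np.exp(log_model.predict(x_test))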
From the above, the linear regression equation can also be written as:
Price = antilog( (-1.21) * Intercept + (-1.14) * carat + (0.0) * cut + (-0.06) * color + (0.06) * clarity +
(0.04) * depth + (-0.0) * table + (0.63) * x + (0.59) * y + (0.24) * z)
Scatter Plot between Actual y-test and predicted y-test with y variable log transform:
==============================================================================
Dep. Variable: price R-squared: 0.949
Model: OLS Adj. R-squared: 0.949
Method: Least Squares F-statistic: 3.900e+04
Date: Fri, 03 Jul 2020 Prob (F-statistic): 0.00
Time: 22:29:11 Log-Likelihood: 1446.7
No. Observations: 18853 AIC: -2873.
Df Residuals: 18843 BIC: -2795.
Df Model: 9
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -1.2142 0.156 -7.772 0.000 -1.520 -0.908
carat -1.1414 0.020 -57.100 0.000 -1.181 -1.102
cut 0.0022 0.002 1.336 0.182 -0.001 0.005
color -0.0645 0.001 -64.135 0.000 -0.066 -0.062
clarity 0.0628 0.001 64.295 0.000 0.061 0.065
depth 0.0355 0.002 16.571 0.000 0.031 0.040
table -0.0034 0.001 -4.090 0.000 -0.005 -0.002
x 0.6268 0.029 21.306 0.000 0.569 0.684
y 0.5901 0.029 20.266 0.000 0.533 0.647
z 0.2360 0.024 9.990 0.000 0.190 0.282
==============================================================================
Omnibus: 9554.691 Durbin-Watson: 1.985
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1576892.061
Skew: 1.372 Prob(JB): 0.00
Kurtosis: 47.720 Cond. No. 8.19e+03
==============================================================================
Checking for multicollinearity:
Multicollinearity is collinearity between the independent variables. Its presence does not hurt
predictive accuracy, but it makes the model coefficients unreliable to interpret.
    Features   VIF Factor (with y variable log transformed)
0   carat      31.739524
1   cut        1.064228
2   color      1.100387
3   clarity    1.066402
4   depth      2.528086
5   table      1.181361
6   x          408.915952
7   y          395.015655
8   z          100.942140
Table 1.10 VIF (Multi collinearity) Table
From the table, the variables carat, x, y and z show severe multicollinearity. The dimensions of
the zirconia (x, y and z) could be replaced with a single feature such as size or weight.
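The VIF values in Table 1.10 can be computed with statsmodels; a sketch (whether the constant was included in the original computation is an assumption):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_vif = sm.add_constant(x_train)   # each VIF is computed against all other predictors
vif = pd.DataFrame({
    'Features': X_vif.columns,
    'VIF Factor': [variance_inflation_factor(X_vif.values, i)
                   for i in range(X_vif.shape[1])],
})
print(vif[vif['Features'] != 'const'])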
1.4 Inference: Basis on these predictions, what are the business insights and
recommendations.
From the Equation,
Price = (9442.85) * Intercept + (9198.16) * carat + (50.0) * cut + (-233.06) * color + (253.81) *
clarity + (-88.61) * depth + (-73.02) * table + (-2028.8) * x + (1553.24) * y + (-312.91) * z
o Carat, the width of the cubic zirconia (y) and clarity are the main features for predicting the
price of the cubic zirconia.
o With a one-unit increase in carat, the price of the zirconia increases by 9198.16, keeping all
other predictors constant.
o With a one-unit increase in clarity, the price of the zirconia increases by 253.81, keeping all
other predictors constant.
o There are also some negative coefficients. For instance, table has a coefficient of -73.02.
This implies that with a one-unit increase in table, the price of the zirconia decreases by
73.02, keeping all other predictors constant.
o The cut quality of a cubic zirconia is not a very important factor in deciding its price.
Recommendations:
1. Carat, the dimensions of the cubic zirconia and clarity are the most important features.
2. A cubic zirconia with a high carat and with FL/IF/VVS1/VVS2/VS1/VS2 clarity will be very
expensive. Offers/gifts can be added for these products.
3. As the cut, depth and table factors do not influence the price much, these factors can be
treated as unimportant for gaining more profit. For example, a cubic zirconia with high carat,
high clarity and a fair cut will still yield good profits.
4. The dimensions of the cubic zirconia are also proportional to the price: the higher the
dimensions, the higher the price.
2. PROBLEM STATEMENT:
You are hired by a tour and travel agency which deals in selling holiday packages. You are provided
details of 872 employees of a company. Among these employees, some opted for the package and
some did not. You have to help the company in predicting whether an employee will opt for the
package or not on the basis of the information given in the data set. Also, find out the important factors
on the basis of which the company will focus on particular employees to sell their packages.
2.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, Data types, shape, EDA). Perform Univariate and Bivariate Analysis.
Sample of dataset:
Holliday_Package Salary age educ no_young_children no_older_children foreign
1 no 48412 30 8 1 1 no
2 yes 37207 45 8 0 1 no
3 no 58022 46 9 0 0 no
4 no 66503 31 11 2 0 no
5 no 66734 44 12 0 2 no
6 yes 61590 42 12 0 1 no
7 no 94344 51 8 0 0 no
8 yes 35987 32 8 0 2 no
9 no 41140 39 12 0 0 no
Table 2.1 Sample dataset.
In the dataset, Holliday_Package is the dependent/target variable, which indicates whether the
employee opted for the holiday package or not. The size of the dataset is 872 X 7 (872 rows & 7 columns).
Data-Set information:
From the above table, it is evident that there are no null values present in the dataset. The variables
no_young_children, educ and no_older_children should be treated as categorical (object) type.
Here, the descriptive statistics table shows the five-number summary of each variable, along with its count, mean and standard deviation.
Some inferences from the descriptive statistics table:
o Holliday_Package is a categorical variable, and most of the employees in the dataset did not
opt for the holiday package.
o Salary is a continuous variable with a mean of 47729.2 and a median of 41903.5, which
indicates that the distribution is slightly skewed.
o Most of the foreigners did not opt for the holiday package.
Univariate and Bivariate Analysis:
Univariate Analysis: The main purpose of univariate analysis is to summarize and find patterns in the
data. The key point is that there is only one variable involved in the analysis.
Inference from the distribution plots: the Salary variable is not normally distributed; outliers might be the reason.
Bivariate analysis: In bivariate analysis we try to determine if there is any relationship between two
variables.
Correlation Map:
Here, the outliers in the dataset are treated by imputing the upper-range and lower-range values in
place of the outliers; this way, the number of rows does not change. The boxplot in Fig 2.6 shows
the distributions after the outlier treatment.
2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split:
Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear
discriminant analysis).
The variables Holliday_Package, no_young_children, no_older_children and foreign are categorical
variables. So, before proceeding into the model, these categorical variables are converted into
codes, as presented below.
feature: Holliday_Package
[no, yes]
Categories (2, object): [no, yes]
[0 1]
feature: no_young_children
[1, 0, 2, 3]
Categories (4, int64): [0, 1, 2, 3]
[1 0 2 3]
feature: no_older_children
[1, 0, 2, 4, 3, 5, 6]
Categories (7, int64): [0, 1, 2, 3, 4, 5, 6]
[1 0 2 4 3 5 6]
feature: foreign
[no, yes]
Categories (2, object): [no, yes]
[0 1]
The train/test split needs to be done before doing any predictive modelling like Logistic
Regression and Linear Discriminant Analysis (LDA): we train the model on the training set and
validate it on the test set.
The test set holds 30% of the data and the remaining 70% of the dataset forms the train set.
Hence, we have 610 samples of train data and the remaining 262 samples as test data.
After separating the predictors and the target, the resulting shapes are:
o X_train: training set (610, 6)
o X_test: test set (262, 6)
o train_labels: target variable of train set (610,)
o test_labels: target variable of test set (262,)
The target column, Holliday_Package, says whether the person opted for the holiday package
or not. The details of the Holliday_Package attribute are tabulated here.
Since the target/dependent variable is of categorical type, two classification models are
presented here.
Logistic Regression:
It is a supervised learning method for classification. A logistic regression model expresses the
relationship between the categorical dependent variable and the independent variables through a
regression on the log-odds of the outcome.
The logistic regression is carried out with the following parameters and hyperparameters:
the model is executed with 10,000 iterations and the newton-cg solver. The model is fitted with
X_train and train_labels. Now, the model is ready for prediction.
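A sketch of the logistic regression described above, using scikit-learn:

from sklearn.linear_model import LogisticRegression

# newton-cg solver with 10,000 iterations, as described above
logit_model = LogisticRegression(solver='newton-cg', max_iter=10000)
logit_model.fit(X_train, train_labels)

train_pred = logit_model.predict(X_train)   # used for the train-set metrics
test_pred = logit_model.predict(X_test)     # used for the test-set metrics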
Linear Discriminant Analysis (LDA):
Discriminant analysis builds a model to predict future observations whose classes are known. LDA
uses a linear combination of the independent variables to predict the class in the response
variable of a given observation.
The LDA is carried out with the following parameters and hyperparameters: the model is executed
with the 'svd' (singular value decomposition) solver. The model is fitted with X_train and
train_labels. Now, the model is ready for prediction.
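A sketch of the LDA model, using scikit-learn:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# 'svd' is the singular value decomposition solver mentioned above
lda_model = LinearDiscriminantAnalysis(solver='svd')
lda_model.fit(X_train, train_labels)

test_pred_lda = lda_model.predict(X_test)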
2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model Final
Model: Compare Both the models and write inference which model is best/optimized.
A. Performance Metrics of Logistic Regression Model:
Accuracy on the train set is 68%.
Accuracy on the test set is 64%.
The ROC curve is a technique for visualizing classifier performance: the graph plots the
true-positive rate against the false-positive rate.
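A sketch of how such a ROC curve and its AUC score can be produced for the logistic regression model (shown here for the test set):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Probability of the positive class (package opted)
probs = logit_model.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(test_labels, probs)
print('AUC:', roc_auc_score(test_labels, probs))

plt.plot(fpr, tpr)               # TP rate vs FP rate
plt.plot([0, 1], [0, 1], '--')   # chance diagonal
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()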
Logistic Regression model conclusion:
The training and test set results are very similar, and with the overall measures reasonably
high, the model is a good model.
Salary is the most important variable for predicting the target variable.
B. Performance Metrics of LDA Model:
Accuracy on the train set is 67%.
Accuracy on the test set is 63%.
The ROC curve is a technique for visualizing classifier performance: the graph plots the
true-positive rate against the false-positive rate.
LDA model conclusion:
The training and test set results are very similar, and with the overall measures reasonably
high, the model is a good model.
Salary is the most important variable for predicting the target variable.
ROC Curve for the 2 models on the testing data:
o The Logistic Regression model performs slightly better than Linear Discriminant Analysis on
all the measures, although the ROC curves of the two models are almost the same.
o A random forest model was also executed, and it showed that the Salary and age variables are
the most important ones for predicting whether a customer opts for the holiday package or not.
2.4 Inference: Basis on these predictions, what are the insights and recommendations.
The “Salary” and “age” features are the most important in determining whether the customer will
opt for the holiday package or not.
Log (odds) = (2.8199) * Intercept + (-2.0788e-05) * Salary + (-0.0572) * age + (0.041) * educ
+ (-1.426) * no_young_children + (-0.1226) * no_older_children + (1.408) * foreign
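As a worked illustration of how this equation yields a probability, the sketch below plugs in a hypothetical employee (the feature values are made up; the coefficients are from the equation above) and applies the sigmoid:

import numpy as np

# Coefficients from the fitted logistic regression equation above
coef = {'Intercept': 2.8199, 'Salary': -2.0788e-05, 'age': -0.0572, 'educ': 0.041,
        'no_young_children': -1.426, 'no_older_children': -0.1226, 'foreign': 1.408}

# Hypothetical employee (made-up values): salary 40000, age 35, 9 years of
# education, no young children, one older child, foreigner
x = {'Salary': 40000, 'age': 35, 'educ': 9,
     'no_young_children': 0, 'no_older_children': 1, 'foreign': 1}

log_odds = coef['Intercept'] + sum(coef[k] * v for k, v in x.items())
prob = 1 / (1 + np.exp(-log_odds))   # sigmoid turns log-odds into a probability
print(round(prob, 3))                # chance this employee opts for the package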