PREDICTIVE MODELLING
1. PROBLEM STATEMENT:
You are hired by a company, Gem Stones co ltd, which is a cubic zirconia manufacturer. You
are provided with a dataset containing the prices and other attributes of almost 27,000 cubic zirconia
stones (an inexpensive diamond alternative with many of the same qualities as a diamond). The
company earns different profits on different price slots. You have to help the company predict the
price of a stone on the basis of the details given in the dataset, so that it can distinguish between
higher-profit and lower-profit stones and improve its profit share. Also, provide them with the
5 attributes that are most important.
1.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, Data types, shape, EDA). Perform Univariate and Bivariate Analysis.
Sample of dataset:
In the dataset, price is the dependent/target variable, which indicates the price of the zirconia stone.
The size of the dataset is 26967 X 10 (26967 rows & 10 columns).
Data-Set information:
From the data-set information table (Table 1.2), it is evident that there are null values present in the depth variable.
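The EDA steps described above can be reproduced with a short pandas sketch (the file name cubic_zirconia.csv is an assumption):

import pandas as pd

# Load the dataset (file name assumed)
df = pd.read_csv('cubic_zirconia.csv')

print(df.shape)                       # (26967, 10) rows x columns
df.info()                             # data types and non-null counts per column
print(df.isnull().sum())              # null counts; only 'depth' has missing values
print(df.describe(include='all').T)   # descriptive statistics (Table 1.3)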
Variable  count   unique  top    freq   mean    std     min    25%    50%   75%    max
carat     26967   NaN     NaN    NaN    0.8     0.5     0.2    0.4    0.7   1.05   4.5
cut       26967   5       Ideal  10816  NaN     NaN     NaN    NaN    NaN   NaN    NaN
color     26967   7       G      5661   NaN     NaN     NaN    NaN    NaN   NaN    NaN
clarity   26967   8       SI1    6571   NaN     NaN     NaN    NaN    NaN   NaN    NaN
depth     26270   NaN     NaN    NaN    61.7    1.4     50.8   61     61.8  62.5   73.6
table     26967   NaN     NaN    NaN    57.5    2.2     49     56     57    59     79
x         26967   NaN     NaN    NaN    5.7     1.1     0      4.71   5.69  6.55   10.23
y         26967   NaN     NaN    NaN    5.7     1.2     0      4.71   5.71  6.54   58.9
z         26967   NaN     NaN    NaN    3.5     0.7     0      2.9    3.52  4.04   31.8
price     26967   NaN     NaN    NaN    3939.5  4024.9  326    945    2375  5360   18818
Table 1.3 Descriptive Statistics table
Here, the descriptive statistics table shows the five-number summary of each variable, along with its count, mean and standard deviation.
Inference from the descriptive statistics table:
o Carat is a continuous variable with a mean of 0.8 and a median of 0.7, which indicates that the distribution is slightly (positively) skewed.
o Cut is a categorical variable with 5 levels. The levels are ordinal, and Ideal is the most frequent.
o Color is a categorical variable with 7 levels. The levels are ordinal, and G is the most frequent.
o Clarity is a categorical variable with 8 levels. The levels are ordinal, and SI1 is the most frequent.
o The variables depth, table, x, y and z are continuous, and all of them are slightly skewed.
o Price is also a continuous variable and is positively skewed.
A total of 34 duplicate rows were identified in the dataset, and all of them were removed.
Hence the size of the dataset becomes 26933 X 10 (26933 rows & 10 columns).
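A sketch of the duplicate check and removal, continuing from the same dataframe:

# Count and drop exact duplicate rows (34 were found here)
print(df.duplicated().sum())                      # 34
df = df.drop_duplicates().reset_index(drop=True)
print(df.shape)                                   # (26933, 10)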
Univariate Analysis: The main purpose of univariate analysis is to summarize and find patterns in the
data. The key point is that there is only one variable involved in the analysis.
From the figure below, we can see how the numerical variables (carat, depth, table, x, y, z and
price) are distributed.
Inference from Fig 1.1: the price and carat variables are not normally distributed; both are
positively skewed. The remaining variables also do not appear to be normally distributed, and
outliers might be the reason.
Bivariate Analysis: In bivariate analysis we try to determine if there is any relationship between two
variables.
Some other plots, shown below, give an interpretation of how the carat variable influences the
price under different hues (color, clarity & cut).
Fig 1.3 Scatter plot between Price and Carat with different hues
Correlation Map:
From Fig 1.5 it is clear that all the variables in the dataset have outliers. Outliers affect the
best-fit line, so they need to be removed or treated before building the model.
Here, the outliers in the dataset are treated by imputing the upper-range and lower-range values in
place of the outliers; this way, the number of rows does not change. The boxplot in Fig 1.6 shows
the distributions after the outlier treatment.
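One common way to implement this capping is with the 1.5 x IQR whisker rule of the boxplot; a minimal sketch, assuming that convention:

num_cols = ['carat', 'depth', 'table', 'x', 'y', 'z', 'price']

for col in num_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    # Cap values beyond the whiskers instead of dropping rows
    df[col] = df[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)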
1.2 Impute null values if present, also check for the values which are equal to zero. Do they
have any meaning, or do we need to change them or drop them? Do you think scaling is
necessary in this case?
In the dataset, the depth variable has 697 null values, as already shown in Table 1.2. Here, the
missing values in the depth variable are replaced with the median of the depth variable (61.8),
because outliers are present in the depth variable.
Generally, a dimension of a zirconia stone should not be zero. So, the zero values in x, y and z
should be treated as missing values and replaced with the median of their respective variable, as
outliers are present in all three variables.
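A sketch of the two imputations described above (median fill for the nulls in depth, and zero-to-median replacement for x, y and z):

import numpy as np

# Median fill for the nulls in 'depth' (median = 61.8, robust to outliers)
df['depth'] = df['depth'].fillna(df['depth'].median())

# Zero dimensions are physically impossible, so treat them as missing
for col in ['x', 'y', 'z']:
    df[col] = df[col].replace(0, np.nan)
    df[col] = df[col].fillna(df[col].median())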
Here, the model is solved with OLS regression. Scaling is not necessary, and it will not have any
impact on the accuracy or the R-squared.
1.3 Encode the data (having string values) for Modelling. Data Split: Split the data into test
and train (70:30). Apply Linear regression. Performance Metrics: Check the performance of
Predictions on Train and Test sets using R square, RMSE
The variables cut, color and clarity are categorical variables of ordinal type. So, label
encoding is done here, and the resulting mappings are presented below.
feature: cut
[Ideal, Premium, Very Good, Good, Fair]
Categories (5, object): [Fair, Good, Ideal, Premium, Very Good]
[2 3 4 1 0]
feature: color
[E, G, F, D, H, J, I]
Categories (7, object): [D, E, F, G, H, I, J]
[1 3 2 0 4 6 5]
feature: clarity
[SI1, IF, VVS2, VS1, VVS1, VS2, SI2, I1]
Categories (8, object): [I1, IF, SI1, SI2, VS1, VS2, VVS1, VVS2]
[2 1 7 4 6 5 3 0]
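The listing above is consistent with encoding through pandas categoricals; a minimal sketch of that approach:

import pandas as pd

for feature in ['cut', 'color', 'clarity']:
    print('feature:', feature)
    df[feature] = pd.Categorical(df[feature])   # categories sorted alphabetically
    print(df[feature].unique())                 # original labels and category list
    df[feature] = df[feature].cat.codes         # replace labels with integer codes
    print(df[feature].unique())                 # e.g. [2 3 4 1 0] for cut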
The dataset after encoding will be as follows:
Before proceeding further, the data is split into two sets, X and y, where X holds the
independent variables and y holds the dependent variable.
The train/test split needs to be done before building the model: we train the model on the
training set and validate it on the test set.
The test set holds 30% of the data and the remaining 70% of the dataset forms the train set.
Hence, we have 18853 samples of train data and the remaining 8080 samples as test data.
After separating the predictors and the target, the resulting shapes are:
o x_train: training set (18853, 9)
o x_test: test set (8080, 9)
o y_train: target variable of train set (18853,)
o y_test: target variable of test set (8080,)
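A sketch of the 70:30 split with scikit-learn (the random_state is an assumption, fixed only for reproducibility):

from sklearn.model_selection import train_test_split

X = df.drop('price', axis=1)   # 9 independent variables
y = df['price']                # target variable

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

print(x_train.shape, x_test.shape)   # (18853, 9) (8080, 9)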
Linear Regression:
The linear regression model fits the data as a linear combination of the explanatory variables,
and it is carried out with the following parameters and hyperparameters.
The magnitude of the RMSE depends on the scale/normalization of the y variable. Here the model
has been run in statsmodels (a sketch follows), and the intercept and the coefficient of each variable are tabulated below.
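A sketch of the statsmodels fit, using the formula interface (which adds the intercept automatically):

import statsmodels.formula.api as smf

train = x_train.copy()
train['price'] = y_train

ols_model = smf.ols('price ~ carat + cut + color + clarity + depth + table + x + y + z',
                    data=train).fit()

print(ols_model.params)     # intercept and coefficients, tabulated below
print(ols_model.summary())  # the OLS Regression Results block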
Variable    Coeff.
Intercept 9442.85
carat 9198.16
cut 50.00
color -233.06
clarity 253.81
depth -88.61
table -73.02
x -2028.80
y 1553.24
z -312.91
From the above, we can derive the linear regression equation as:
Price = (9442.85) * Intercept + (9198.16) * carat + (50.0) * cut + (-233.06) * color + (253.81) * clarity
+ (-88.61) * depth + (-73.02) * table + (-2028.8) * x + (1553.24) * y + (-312.91) * z
The coefficient of the ‘x’ variable is -2028.8, which suggests a negative relationship with price,
whereas the pair plot shows a positive correlation. This is due to multicollinearity.
OLS Regression Results:
==============================================================================
Dep. Variable: price R-squared: 0.909
Model: OLS Adj. R-squared: 0.909
Method: Least Squares F-statistic: 2.084e+04
Date: Fri, 03 Jul 2020 Prob (F-statistic): 0.00
Time: 22:29:04 Log-Likelihood: -1.5785e+05
No. Observations: 18853 AIC: 3.157e+05
Df Residuals: 18843 BIC: 3.158e+05
Df Model: 9
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 9442.8480 730.010 12.935 0.000 8011.963 1.09e+04
carat 9198.1563 93.404 98.477 0.000 9015.076 9381.236
cut 50.0018 7.716 6.480 0.000 34.878 65.126
color -233.0625 4.697 -49.620 0.000 -242.269 -223.856
clarity 253.8146 4.567 55.576 0.000 244.863 262.766
depth -88.6089 10.022 -8.842 0.000 -108.253 -68.965
table -73.0175 3.850 -18.966 0.000 -80.564 -65.471
x -2028.7985 137.469 -14.758 0.000 -2298.251 -1759.346
y 1553.2354 136.054 11.416 0.000 1286.557 1819.914
z -312.9073 110.389 -2.835 0.005 -529.279 -96.536
==============================================================================
Omnibus: 5117.050 Durbin-Watson: 1.989
Prob(Omnibus): 0.000 Jarque-Bera (JB): 25447.622
Skew: 1.228 Prob (JB): 0.00
Kurtosis: 8.135 Cond. No. 8.19e+03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.19e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
When we normalize the y variable with a log transform, the model results are as tabulated hereunder.
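A sketch of the refit on the log-transformed target; note that predictions have to be mapped back with the antilog:

import numpy as np
import statsmodels.formula.api as smf

# Refit the same model on the log-transformed target
train['log_price'] = np.log(train['price'])
log_model = smf.ols('log_price ~ carat + cut + color + clarity + depth + table + x + y + z',
                    data=train).fit()

# Map predictions back to the price scale with the antilog
pred_price = np.exp(log_model.predict(x_test))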
From the above, the linear regression equation can also be written as:
Price = antilog( (-1.21) * Intercept + (-1.14) * carat + (0.0) * cut + (-0.06) * color + (0.06) * clarity +
(0.04) * depth + (-0.0) * table + (0.63) * x + (0.59) * y + (0.24) * z)
Scatter Plot between Actual y-test and predicted y-test with y variable log transform:
==============================================================================
Dep. Variable: price R-squared: 0.949
Model: OLS Adj. R-squared: 0.949
Method: Least Squares F-statistic: 3.900e+04
Date: Fri, 03 Jul 2020 Prob (F-statistic): 0.00
Time: 22:29:11 Log-Likelihood: 1446.7
No. Observations: 18853 AIC: -2873.
Df Residuals: 18843 BIC: -2795.
Df Model: 9
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -1.2142 0.156 -7.772 0.000 -1.520 -0.908
carat -1.1414 0.020 -57.100 0.000 -1.181 -1.102
cut 0.0022 0.002 1.336 0.182 -0.001 0.005
color -0.0645 0.001 -64.135 0.000 -0.066 -0.062
clarity 0.0628 0.001 64.295 0.000 0.061 0.065
depth 0.0355 0.002 16.571 0.000 0.031 0.040
table -0.0034 0.001 -4.090 0.000 -0.005 -0.002
x 0.6268 0.029 21.306 0.000 0.569 0.684
y 0.5901 0.029 20.266 0.000 0.533 0.647
z 0.2360 0.024 9.990 0.000 0.190 0.282
==============================================================================
Omnibus: 9554.691 Durbin-Watson: 1.985
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1576892.061
Skew: 1.372 Prob(JB): 0.00
Kurtosis: 47.720 Cond. No. 8.19e+03
==============================================================================
Checking for multicollinearity:
Multicollinearity is collinearity between the independent variables. Its presence does not hurt
predictive accuracy, but it makes the model coefficients unreliable to interpret.
    Features   VIF Factor (with y variable log transformed)
0   carat      31.739524
1   cut        1.064228
2   color      1.100387
3   clarity    1.066402
4   depth      2.528086
5   table      1.181361
6   x          408.915952
7   y          395.015655
8   z          100.942140
Table 1.10 VIF (Multi collinearity) Table
From the table, the variables carat, x, y and z show severe multicollinearity. The dimensions of
the zirconia (x, y and z) could be replaced with a single feature such as size or weight.
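The VIF values in Table 1.10 can be computed with statsmodels; a sketch (whether the constant was included in the original computation is an assumption):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_vif = sm.add_constant(x_train)   # each VIF is computed against all other predictors
vif = pd.DataFrame({
    'Features': X_vif.columns,
    'VIF Factor': [variance_inflation_factor(X_vif.values, i)
                   for i in range(X_vif.shape[1])],
})
print(vif[vif['Features'] != 'const'])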
1.4 Inference: Basis on these predictions, what are the business insights and
recommendations.
From the Equation,
Price = (9442.85) * Intercept + (9198.16) * carat + (50.0) * cut + (-233.06) * color + (253.81) *
clarity + (-88.61) * depth + (-73.02) * table + (-2028.8) * x + (1553.24) * y + (-312.91) * z
o Carat, the width of the cubic zirconia (y) and clarity are the main features for predicting the
price of the cubic zirconia.
o With a one-unit increase in carat, the price of the zirconia increases by 9198.16, keeping all
other predictors constant.
o With a one-unit increase in clarity, the price of the zirconia increases by 253.81, keeping all
other predictors constant.
o There are also some negative coefficients. For instance, table has a coefficient of -73.02.
This implies that with a one-unit increase in table, the price of the zirconia decreases by
73.02, keeping all other predictors constant.
o The cut quality of a cubic zirconia is not a very important factor in deciding its price.
Recommendations:
1. Carat, the dimensions of the cubic zirconia and clarity are the most important features.
2. A cubic zirconia with a high carat and with FL/IF/VVS1/VVS2/VS1/VS2 clarity will be very
expensive. Offers/gifts can be added for these products.
3. As the cut, depth and table factors do not influence the price much, these factors can be
treated as unimportant for gaining more profit. For example, a cubic zirconia with high carat,
high clarity and a fair cut will still yield good profits.
4. The dimensions of the cubic zirconia are also proportional to the price: the higher the
dimensions, the higher the price.
2. PROBLEM STATEMENT:
You are hired by a tour and travel agency which deals in selling holiday packages. You are provided
details of 872 employees of a company. Among these employees, some opted for the package and
some did not. You have to help the company in predicting whether an employee will opt for the
package or not on the basis of the information given in the data set. Also, find out the important factors
on the basis of which the company will focus on particular employees to sell their packages.
2.1 Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, Data types, shape, EDA). Perform Univariate and Bivariate Analysis.
Sample of dataset:
Holliday_Package Salary age educ no_young_children no_older_children foreign
1 no 48412 30 8 1 1 no
2 yes 37207 45 8 0 1 no
3 no 58022 46 9 0 0 no
4 no 66503 31 11 2 0 no
5 no 66734 44 12 0 2 no
6 yes 61590 42 12 0 1 no
7 no 94344 51 8 0 0 no
8 yes 35987 32 8 0 2 no
9 no 41140 39 12 0 0 no
Table 2.1 Sample dataset.
In the dataset, Holliday_Package is the dependent/target variable, which indicates whether the
employee opted for the holiday package or not. The size of the dataset is 872 X 7 (872 rows & 7 columns).
Data-Set information:
From the above table, it is evident that there are no null values present in the dataset. The variables
no_young_children, educ and no_older_children should be treated as categorical (object) type.
Here, the descriptive statistics table shows the five-number summary of each variable, along with its count, mean and standard deviation.
Some inferences from the descriptive statistics table:
o Holliday_Package is a categorical variable, and most of the employees in the dataset did not
opt for the holiday package.
o Salary is a continuous variable with a mean of 47729.2 and a median of 41903.5, which
indicates that the distribution is slightly skewed.
o Most of the foreigners did not opt for the holiday package.
Univariate and Bivariate Analysis:
Univariate Analysis: The main purpose of univariate analysis is to summarize and find patterns in the
data. The key point is that there is only one variable involved in the analysis.
Inference from the distribution plots: the Salary variable is not normally distributed; outliers might be the reason.
Bivariate analysis: In bivariate analysis we try to determine if there is any relationship between two
variables.
Correlation Map:
Here, the outliers in the dataset are treated by imputing the upper-range and lower-range values in
place of the outliers; this way, the number of rows does not change. The boxplot in Fig 2.6 shows
the distributions after the outlier treatment.
2.2 Do not scale the data. Encode the data (having string values) for Modelling. Data Split:
Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear
discriminant analysis).
The variables Holliday_Package, no_young_children, no_older_children and foreign are categorical
variables. So, before proceeding into the model, these categorical variables are converted into
codes, as presented below.
feature: Holliday_Package
[no, yes]
Categories (2, object): [no, yes]
[0 1]
feature: no_young_children
[1, 0, 2, 3]
Categories (4, int64): [0, 1, 2, 3]
[1 0 2 3]
feature: no_older_children
[1, 0, 2, 4, 3, 5, 6]
Categories (7, int64): [0, 1, 2, 3, 4, 5, 6]
[1 0 2 4 3 5 6]
feature: foreign
[no, yes]
Categories (2, object): [no, yes]
[0 1]
The train/test split needs to be done before doing any predictive modelling like Logistic
Regression and Linear Discriminant Analysis (LDA): we train the model on the training set and
validate it on the test set.
The test set holds 30% of the data and the remaining 70% of the dataset forms the train set.
Hence, we have 610 samples of train data and the remaining 262 samples as test data.
After separating the predictors and the target, the resulting shapes are:
o X_train: training set (610, 6)
o X_test: test set (262, 6)
o train_labels: target variable of train set (610,)
o test_labels: target variable of test set (262,)
The target column, Holliday_Package, says whether the person opted for the holiday package
or not. The details of the Holliday_Package attribute are tabulated here.
Since the target/dependent variable is of categorical type, two classification models are
presented here.
Logistic Regression:
It is a supervised learning method for classification. A logistic regression model expresses the
relationship between the categorical dependent variable and the independent variables through a
regression on the log-odds of the outcome.
The logistic regression is carried out with the following parameters and hyperparameters:
the model is executed with 10,000 iterations and the newton-cg solver. The model is fitted with
X_train and train_labels. Now, the model is ready for prediction.
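A sketch of the logistic regression described above, using scikit-learn:

from sklearn.linear_model import LogisticRegression

# newton-cg solver with 10,000 iterations, as described above
logit_model = LogisticRegression(solver='newton-cg', max_iter=10000)
logit_model.fit(X_train, train_labels)

train_pred = logit_model.predict(X_train)   # used for the train-set metrics
test_pred = logit_model.predict(X_test)     # used for the test-set metrics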
Linear Discriminant Analysis (LDA):
Discriminant analysis builds a model to predict future observations whose classes are known. LDA
uses a linear combination of the independent variables to predict the class in the response
variable of a given observation.
The LDA is carried out with the following parameters and hyperparameters: the model is executed
with the 'svd' (singular value decomposition) solver. The model is fitted with X_train and
train_labels. Now, the model is ready for prediction.
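A sketch of the LDA model, using scikit-learn:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# 'svd' is the singular value decomposition solver mentioned above
lda_model = LinearDiscriminantAnalysis(solver='svd')
lda_model.fit(X_train, train_labels)

test_pred_lda = lda_model.predict(X_test)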
2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model Final
Model: Compare Both the models and write inference which model is best/optimized.
A. Performance Metrics of Logistic Regression Model:
Accuracy on the train set is 68%.
Accuracy on the test set is 64%.
The ROC curve is a technique for visualizing classifier performance: the graph plots the
true-positive rate against the false-positive rate.
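A sketch of how such a ROC curve and its AUC score can be produced for the logistic regression model (shown here for the test set):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Probability of the positive class (package opted)
probs = logit_model.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(test_labels, probs)
print('AUC:', roc_auc_score(test_labels, probs))

plt.plot(fpr, tpr)               # TP rate vs FP rate
plt.plot([0, 1], [0, 1], '--')   # chance diagonal
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()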
Logistic Regression model conclusion:
The training and test set results are very similar, and with the overall measures reasonably
high, the model is a good model.
Salary is the most important variable for predicting the target variable.
B. Performance Metrics of LDA Model:
Accuracy on the train set is 67%.
Accuracy on the test set is 63%.
The ROC curve is a technique for visualizing classifier performance: the graph plots the
true-positive rate against the false-positive rate.
LDA model conclusion:
The training and test set results are very similar, and with the overall measures reasonably
high, the model is a good model.
Salary is the most important variable for predicting the target variable.
ROC Curve for the 2 models on the testing data:
o The Logistic Regression model performs slightly better than Linear Discriminant Analysis on
all the measures, although the ROC curves of the two models are almost the same.
o A random forest model was also executed, and it showed that the Salary and age variables are
the most important ones for predicting whether a customer opts for the holiday package or not.
2.4 Inference: Basis on these predictions, what are the insights and recommendations.
The “Salary” and “age” features are the most important in determining whether the customer will
opt for the holiday package or not.
Log (odds) = (2.8199) * Intercept + (-2.0788e-05) * Salary + (-0.0572) * age + (0.041) * educ
+ (-1.426) * no_young_children + (-0.1226) * no_older_children + (1.408) * foreign
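As a worked illustration of how this equation yields a probability, the sketch below plugs in a hypothetical employee (the feature values are made up; the coefficients are from the equation above) and applies the sigmoid:

import numpy as np

# Coefficients from the fitted logistic regression equation above
coef = {'Intercept': 2.8199, 'Salary': -2.0788e-05, 'age': -0.0572, 'educ': 0.041,
        'no_young_children': -1.426, 'no_older_children': -0.1226, 'foreign': 1.408}

# Hypothetical employee (made-up values): salary 40000, age 35, 9 years of
# education, no young children, one older child, foreigner
x = {'Salary': 40000, 'age': 35, 'educ': 9,
     'no_young_children': 0, 'no_older_children': 1, 'foreign': 1}

log_odds = coef['Intercept'] + sum(coef[k] * v for k, v in x.items())
prob = 1 / (1 + np.exp(-log_odds))   # sigmoid turns log-odds into a probability
print(round(prob, 3))                # chance this employee opts for the package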