0% found this document useful (0 votes)

178 views

Assignment 5

Vijaya Kumar should adopt the following strategy for developing a forecasting model for 20,000 spare parts at L&T: - Categorize spare parts based on demand characteristics like being fast, medium, or slow moving. Target fast moving, high value items that contribute most to sales first. - For items with fluctuating demand that cannot be made stationary, do not include them initially as accurate forecasting models cannot be developed. - For items included, test for stationarity and develop ARIMA or Holt-Winters exponential smoothing models based on ACF/PACF plots and accuracy metrics like MAPE less than 10%. - For residual analysis, use Ljung-Box test to

Uploaded by

Priyanka Sindhwani

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

178 views

Assignment 5

Uploaded by

Priyanka Sindhwani

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

You are on page 1/ 43

Assignment 8

BIA 7 – IIMB

Priyanka Sindhwani

4/25/2017
Question 1: What strategy should Vijaya Kumar adopt for developing forecasting model for
demand estimation of 20,000 spare parts?

The spare parts are categorized based on Fast, medium and slow. Also they have been further
classified as per their revenue generation

The items which contribute to more in sales, higher in value and are fast moving should be
targeted first therefore in forecasting, those item forecasting will be taken up first, also if the item
demand is fluctuating and cannot be stationarize they should not be picked up initially as we wont
be able to have a good forecast model for it

Question 2: Develop forecasting models for data provided in the Excel sheet titled ‘‘L&T
Spare Parts Forecasting’’ and discuss the choice for using a particular forecasting model.

Choosing Data set 1 : “ L&T spare parts forecasting

Following steps are followed

 Take the data set, check for stationarity

 As data is non-stationary, we differenced the data with difference = 2 to achieve stationarity

 Acf and Pacf plots along with adf.test to confirm stationarity

Augmented Dickey-Fuller Test

data: difference
Dickey-Fuller = -10.305, Lag order = 0, p-value = 0.01
alternative hypothesis: stationary
Now the data can be stationaries with difference 1,We determine the Arima model

In the model we take the difference of 2 and run the following model, converting data to log
Call:
arima(x = log(ITEM1), order = c(2, 1, 2), seasonal = list(order = c(1, 0, 1)))

Coefficients:
ar1 ar2 ma1 ma2 sar1 sma1
-0.1756 -0.7045 -0.3018 0.9442 0.3896 -0.0741
s.e. 0.1377 0.1402 0.1148 0.1763 0.5946 0.6401

sigma^2 estimated as 0.02693: log likelihood = 16.13, aic = -18.26

Checked for accuracy

MAPE: 6.8

Residuals plot:

Prediction
Point focusLo 95 Hi 95
May-13 387.9071 280.861 535.7529
Jun-13 360.8679 250.6561 519.5391
Jul-13 348.5916 221.3342 549.0157
Aug-13 315.7873 176.7107 564.3209
Sep-13 324.2867 170.6734 616.159
Observed vs fitted plot and forecast plot

H0: The data are independently distributed

Ha: The data are not independently distributed; they exhibit serial correlation.

For Ljung or box pierce

Box-Ljung test

data: resid(Model1)
X-squared = 6.2612, df = 10, p-value = 0.7929

As p-value > 0.05, we retain the null , thus residual is independently distributed
Choosing Data set 2 : “ L&T spare parts forecasting
 Take the data set, check for stationarity

 As data is non-stationary, we differenced the data with difference = 1 to achieve stationarity

 To check for Stationarity

Augmented Dickey-Fuller Test

data: difference
Dickey-Fuller = -14.185, Lag order = 0, p-value = 0.01
alternative hypothesis: stationary

As p-value <0.05, alternative hypothesis is retained

Model:
Series: log(ITEM1)
ARIMA(2,1,3)(1,0,1)[12]

Coefficients:
ar1 ar2 ma1 ma2 ma3 sar1 sma1
-1.3387 -0.9938 0.5009 -0.2139 -0.8754 0.8697 -0.5095
s.e. 0.0531 0.0182 0.2918 0.2230 0.3509 0.2575 0.5084

sigma^2 estimated as 0.02462: log likelihood=18.94

AIC=-21.88 AICc=-18.19 BIC=-6.91

Accuracy
Mape : 8.83
Month Lower Point Forecast Upper
May 2013 220.9346 302.2483 413.4890
Jun 2013 184.0721 252.4793 346.3088
Jul 2013 175.8134 241.3509 331.3185
Aug 2013 164.6951 227.1128 313.1862
Sep 2013 154.5706 213.4543 294.7696

Residual plot and test

H0: The data are independently distributed

Ha: The data are not independently distributed; they exhibit serial correlation.

For Ljung or box pierce

Box-Ljung test

data: resid(Model1)
X-squared = 7.0564, df = 10, p-value = 0.7201

As p-value > 0.05, we retain the null , thus residual is independently distributed

Observed vs fitted plot and forecast plot

Choosing Data set 3 : “ L&T spare parts forecasting
Following steps are followed

 Take the data set, check for stationarity

For this item we would be using HW due to its good fit

model <- hw(ITEM1, initial ="optimal", h=12, aplha=NULL, beta = NULL, gamma = NULL, seasonal
= c("additive"),level = c(.95))
Model Information:
Holt-Winters' additive method
Call:
hw(x = ITEM1, h = 12, seasonal = c("additive"), level = c(0.95),
Call
initial = "optimal", beta = NULL, gamma = NULL, aplha = NULL)
Smoothing parameters
alpha = 1e-04
beta = 1e-04
gamma = 0.0072
Initial states
l = 370.8672
b = 3.7651
s=113.2595 -43.1134 3.3791 41.3779 2.2251 -72.807
11.0039 -39.5464 -26.1081 62.2355 6.7286 -58.6345
sigma: 57.4596
AIC AICc BIC
619.7052 636.7052 649.9743
Error measure
ME RMSE MAE MPE MAPE MASE ACF1
Training set -0.2049926 57.45958 42.62411 -1.284119 9.578407 0.5958036 -0.01930456
Forecasts:
Point Forecast Lo 95 Hi 95
May 2013 565.8691 453.2504 678.4878
Jun 2013 624.6372 512.0185 737.2559
Jul 2013 540.6159 427.9972 653.2346
Aug 2013 531.3781 418.7594 643.9969
Sep 2013 584.6043 471.9856 697.2231
Oct 2013 505.2144 392.5956 617.8332
Nov 2013 583.4437 470.8249 696.0625
Dec 2013 626.7479 514.1290 739.3667
Jan 2014 592.6612 480.0423 705.2801
Feb 2014 551.0512 438.4322 663.6702
Mar 2014 709.0731 596.4540 821.6922
Apr 2014 543.0636 430.4404 655.6867

MAPE: 9.5

Observed vs fitted and residual plots

H0: The data are independently distributed
Ha: The data are not independently distributed; they exhibit serial correlation.

For Ljung or box pierce

Box-Ljung test

data: resid(model)
X-squared = 6.1255, df = 10, p-value = 0.8046

As p-value > 0.05, we retain the null , thus residual is independently distributed
Choosing Data set 4 : “ L&T spare parts forecasting
Following steps are followed

 Take the data set, check for stationarity

ACF AND PACF ( also data can be stationeries with difference of 1)

Augmented Dickey-Fuller Test

data: ITEM1
Dickey-Fuller = -7.7298, Lag order = 0, p-value = 0.01
alternative hypothesis: stationary
MODEL:

Series: ITEM1
ARIMA(0,1,4)(1,1,1)[12]

Coefficients:
ma1 ma2 ma3 ma4 sar1 sma1
-0.9629 -0.0397 -0.0776 0.3997 -0.0628 -0.9997
s.e. 0.1660 0.2086 0.3390 0.2748 0.2569 0.9762

sigma^2 estimated as 6333: log likelihood=-215.59

AIC=445.18 AICc=449.18 BIC=456.26

MAPE: 8.5
Point Forecast Lo 95 Hi 95
May 2013 850.1776 674.8568 1025.4984
Jun 2013 598.4989 423.0542 773.9437
Jul 2013 803.1488 627.8436 978.4541
Aug 2013 807.7244 631.9686 983.4803
Sep 2013 616.4064 431.9856 800.8272

Observed vs fitted, and residual plots

Box-Ljung test

data: resid(Model1)
X-squared = 10.55, df = 10, p-value = 0.3936
As p-value > 0.05, we retain the null , thus residual is independently distributed
Choosing Data set 5 : “ L&T spare parts forecasting

Following steps are followed

Take the data set, check for stationarity

Stationarizing the data for seasonality component,

Acf and pacf plots

Augmented Dickey-Fuller Test

data: ITEM1
Dickey-Fuller = -5.5116, Lag order = 0, p-value = 0.01
alternative hypothesis: stationary

Rejecting null
Model
ARIMA(0,1,2)(1,1,1)[12]

Coefficients:
ma1 ma2 sar1 sma1
-0.9401 0.1894 -0.328 -0.7643
s.e. 0.2113 0.2172 0.392 1.4232

sigma^2 estimated as 21712: log likelihood=-236.86

AIC=483.71 AICc=485.71 BIC=491.63

MAPE:14.69

Forecast:
Point Forecast Lo 95 Hi 95
May 2013 675.0444 372.4832 977.6056
Jun 2013 563.3441 261.3652 865.3230
Jul 2013 566.6540 255.5933 877.7148
Aug 2013 560.3258 240.4409 880.2107
Sep 2013 480.4308 151.9588 808.9028

Forecast plot, residual plot and observed vs fit

Forecast plot
Box-Ljung test

data: resid(Model1)
X-squared = 6.8745, df = 10, p-value = 0.7372

As p-value > 0.05, we retain the null , thus residual is independently distributed

Choosing Data set 6 : “ L&T spare parts forecasting

Following steps are followed

Take the data set, check for stationarity

Model:

Model1
Series: ITEM1
ARIMA(2,2,1)(1,1,1)[12]

Coefficients:
ar1 ar2 ma1 sar1 sma1
-0.5983 -0.5790 -0.9942 -0.0728 -0.5472
s.e. 0.1403 0.1471 0.1084 0.5797 0.7806

sigma^2 estimated as 386: log likelihood=-159.67

AIC=331.34 AICc=334.34 BIC=340.67
> acc<-accuracy(Model1)
> acc
ME RMSE MAE MPE MAPE MASE ACF1
Training set -1.68527 15.37225 9.934065 -13.32776 29.17808 0.6831978 -0.1452887

Point Forecast Lo 95 Hi 95
May 2013 48.82538 9.200454 88.45031
Jun 2013 30.28432 -13.234616 73.80325
Jul 2013 56.92338 12.241527 101.60523
Aug 2013 73.82233 20.312473 127.33218
Sep 2013 46.75125 -12.056058 105.55855

Residual plot, forecast and observed vs fitted

Box-Ljung test

data: resid(Model1)
X-squared = 13.157, df = 10, p-value = 0.215

Question 3. Which forecasting techniques should L&T use to forecast different spare items?

Answer 3. For most of the spare part, sesonal arima is able to predict well , with less than 10%
error, therefore that methodology can be used.
PARTA: CLUSTERING

Question 2: List and derive the metrics that can be used in ‘‘hierarchical clustering’’ and
‘‘partition around medoids’’ clustering algorithms.

Answer 2:

Took only numeric variables and the derived variables for clustering algorithms was as follows

 Total sales in INR

 Total cost in INR
 Total discount
 Markdown sensitivity
 Profit by area

Q2. Do you find outliers in the derived data from Q1? If yes, how can the same be treated for
use in cluster modeling?

Answer 2. Yes there were outliers in the Q1,

Firstly we tried if we could combine the outliers into an cluster, but as that was not possible after
seeing the dendogram, we decided to remove the outliers in this exercise

Q3. Develop a hierarchical clustering model with the modified data from Q2. How many
clusters seem appropriate? Justify

Answer 3.

Hierchachal Clustering:

Reading the dataset:

Selecting the numeric variables mentioned above

Removing outlier:
Removing the top 3 outliers

Elbow plot: to determine the optimium cluster

Optimum clusters based on the dendogram and elbow plot is 3 clusters, therefore it is ideal to have
3 clusters
4. Develop a partition around medoids clustering model with the modified data from
Q2What are the advantages of using partitioning around medoids (PAM) over K-means?
How do you decide on the appropriate number of clusters in this scenario?
Total.sales.in.INR Total.discount Total.cost Markdown.sensitivty profitbyarea

87 -0.2507675 -0.4011870 -0.2730098 -0.1195077 -0.08538677

161 -1.0164227 -0.9291651 -1.0128436 -0.1195077 -0.86135776

123 1.6491207 1.6081255 1.7261156 -0.1195077 1.52198171

141 0.4894637 0.9023874 0.6192578 -0.1195077 -0.14081327

Advantage:

PAM or k-medoid is based on centroids (or medoids) calculating by minimizing the absolute
distance between the points and the selected centroid, rather than minimizing the square distance.
As a result, it's more robust to noise and outliers than k-means.

K-medoid is more flexible

First of all, you can use k-medoids with any similarity measure. K-means however, may fail to
converge - it really must only be used with distances that are consistent with the mean. So e.g.
Absolute Pearson Correlation must not be used with k-means, but it works well with k-medoids

Robustness of medoid

Secondly, the medoid as used by k-medoids is roughly comparable to the median (in fact, there also
is k-medians, which is like K-means but for Manhattan distance). If you look up literature on the
median, you will see plenty of explanations and examples why the median is more robust to
outliers than the arithmetic mean. Essentially, these explanations and examples will also hold
for the medoid. It is a more robust estimate of a representative point than the mean as used in k-
means.

The number of appropriate clusters is decided on the distance from centroid and we can
determine this by elbow /scree plot
5. Validate the goodness of resulting clusters from hierarchical and PAM models obtained in
Q3 and Q4. Which is a better model as per validation measures?

Validating results
Clustering Methods:
hierarchical kmeans pam

Cluster sizes:
3 4 5

Validation Measures:
3 4 5

hierarchical Connectivity 8.0829 8.9758 12.8647

Dunn 0.2461 0.2461 0.2089
Silhouette 0.5567 0.5459 0.5350
kmeans Connectivity 7.6516 22.2611 27.8167
Dunn 0.1391 0.1269 0.1051
Silhouette 0.5558 0.5313 0.5294
pam Connectivity 0.3778 4.2667 31.9274
Dunn 0.1470 0.1620 0.0181
Silhouette 0.4960 0.5101 0.4073

Optimal Scores:

Score Method Clusters

Connectivity 0.3778 pam 3
Dunn 0.2461 hierarchical 3
Silhouette 0.5567 hierarchical 3

Connectivity should be minimized, while both the Dunn index and the silhouette width should be
maximized.

Thus, it appears that hierarchical clustering outperforms the other clustering algorithms under
each validation measure, for nearly every number of clusters evaluated.

PART B – TIME SERIES FORECASTING

7. Conduct exploratory data analysis on “Cluster=1, Brand=CRESCENT SET, Brick=CKD”

combination to identify the following relationships. Give a short description about the
relationships observed.

Answer 7

a. Relationship between Sales units(sales_units) & Discount % (discount_per)

Sales Unit vs discount per
3500 0.5
0.45
3000
0.4
2500 0.35
sales_units
2000 0.3
discount_per
0.25
1500 0.2
1000 0.15
0.1
500
0.05
0 0
9.2013 21.2013 33.2013 45.2013 5.2014 17.2014 29.2014 41.2014 1.2015 13.2015

As expected higher discount increase the number of sales , although in the beginning of every year
and end of year, the sales drop even when there is a discount.

Shows this is a seasonal selling product, and thus not high in demand in winters

b. Relationship between Sales units & Net Price (per_unit_netprice)

Sales Unit vs per_unit_netprice

3500 1800
3000 1600
1400
2500 per_unit_netprice
1200
2000 1000 sales_units
1500 800
600
1000
400
500 200
0 0
9.2013 22.2013 35.2013 48.2013 9.2014 22.2014 35.2014 48.2014 9.2015

As it can been seen, net price is directly proportional to sales unit, if net price of a product is
increase, sales go down, which can be clearly seen above

c. Relationship between Sales units & Age (age)

Correlation: -0.5842883, age and sales are negatively correlated. Thus, if the
The product is older, the sale of that product is likely to reduce
8. What is over-fitting and under-fitting in the context of regression models? What are the
consequences of over-fitting?

Underfitting occurs when a statistical model or machine learning algorithm cannot capture the

underlying trend of the data. Intuitively, underfitting occurs when the model or the algorithm
does not fit the data well enough.
Overfitting occurs when a statistical model algorithm captures the noise of the data. overfitting
occurs when the model or the algorithm fits the data too well. Specifically, overfitting occurs if the
model or algorithm shows low bias but high variance.

Consequence of over fitting

 The problem is that these concepts do not apply to new data and negatively impact the models
ability to generalize.
 An overfit model is one that is too complicated for your data set. When this happens, the
regression model becomes tailored to fit the quirks and random noise in your specific sample
rather than reflecting the overall population

9. Explain why we have to partition the time series data before building the forecasting
model. Use data for “Cluster=2, Brand=BLINK, Brick=HAREMS” and partition this time
series data as explained below.

i. Consider all weeks until 51st week (including) of 2014 as training data.
ii. Consider weeks from 52nd week of 2014 to 3rd week of 2015 as test data.

Answer 9.
 Data partitioning is a necessary step, as the basic idea is to separate out the available data into
a training set and a testing (or validation) set.
 Primarily because we want to ensure that our model does a good job of predicting the "seen"
data, so that when we are presented with unseen data, we have some level of confidence about
the predictive power of the model.
 For cross-sectional data analysis, we usually take great care to make sure that the training and
testing samples are randomly chosen or in case of unbalanced data, carefully chosen.

10. Develop a time series forecast model using regression on the training data to forecast
sales units for “Cluster=2, Brand=BLINK, Brick=HAREMS” combination using the below
variables as predictors.
a. Lag 1 (i.e. immediate previous week) of sales units
b. Discount %
c. Lag 1 (i.e. immediate previous week) of discount %
d. Promotion week flag
e. Age
Apply appropriate transformations and evaluate the model fit.

Answer 10.
As the data was not stationary, we did a log transformation on sales_unit .
The model developed was as follows
Call:
lm(formula = log(TS_train) ~ as.factor(promo_week_flg) + age +
log(salenew) + disnew + discount_per, data = train_dataset)

Residuals:
Min 1Q Median 3Q Max
-1.24444 -0.15463 0.01581 0.18425 1.14370

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.31963 0.41381 5.606 2.35e-07 ***
as.factor(promo_week_flg)1 0.34650 0.12500 2.772 0.006800 **
age -0.01392 0.00346 -4.022 0.000121 ***
log(salenew) 0.53851 0.08338 6.458 5.64e-09 ***
disnew -1.90192 0.43924 -4.330 3.94e-05 ***
discount_per 2.25105 0.44093 5.105 1.89e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3747 on 88 degrees of freedom

(1 observation deleted due to missingness)
Multiple R-squared: 0.8581, Adjusted R-squared: 0.8501
F-statistic: 106.5 on 5 and 88 DF, p-value: < 2.2e-16
Got R2 of .85, with p-value less than 0.05, thus the model is valid

11. Perform checks to ensure that the model is valid and assumptions of regression are
met. Conduct appropriate statistical test and back the findings by visual examination
12. Based on the model result, explain the following:
a. How is this forecasting model able to account for trend and seasonality?
b. What is price elasticity and determine the price elasticity value (a proxy representing
price elasticity is enough) from the model output?
c. How do you interpret the coefficient of promotion week flag variable?

Answer 12.

a. Seasonality is difficult to account for in training data, as the data doesn’t overlap complete one
year .

b. Price elasticity a measure of the effect of a price change or a change in the quantity supplied on
the demand for a product or service,

Change in sales unit/change in price

week sales_units per_unit_discountprice

discount_per
promo_week_flg
dis_lag sale_lag
47 117 148.5085 0.64 1 0.62 119
48 56 109.8427 0.65 1 0.63 117
49 80 121.0075 0.66 1 0.64 124
50 34 101.0529 0.67 1 0.61 96
Let base price by 450, all the discount to be availed on base price
Week 49,discount 0.66, week 48, discount 0.66

Change in sales unit = (Sales in week 49 – Sales in 48) = (80-56) = 24

Change in price = (base price*discount of week 49 – base pricec*discount of week 48)
(450*80 – 450*56) = 297 – 292.5 = 4.5 , 24/4.5 = 5.3
c. Promo flag is a factor variable, therefore if the promotion is done the value is 1 else 0, looking at
the model , one unit change in intercept increase (i.e if no promotion is done), the week in which
promotion is done it goes up by (0.3465)

13.How do you check if the forecasting model is able to explain most of the important
features of the time series? Explain white noise in the context of time series.

To check the data following should be covered

 Is there a trend, meaning that, on average, the measurements tend to increase (or
decrease) over time?
 Is there seasonality, meaning that there is a regularly repeating pattern of highs and lows
related to calendar time such as seasons, quarters, months, days of the week, and so on?
 Are their outliers? In regression, outliers are far away from your line. With time series data,
your outliers are far away from your other data.
 Is there a long-run cycle or period unrelated to seasonality factors?
 Is there constant varianceover time, or is the variance non-constant?
 Are there any abrupt changes to either the level of the series or the variance?

White Noise :

A white noise process is one with a mean zero and no correlation between its values at different
times

Consider a time series {wt:t=1,...n}{wt:t=1,...n}. If the elements of the series, wiwi, are
independent and identically distributed (i.i.d.), with a mean of zero, variance σ2σ2 and no serial
correlation (i.e. Cor(wi,wj)≠0,∀i≠jCor(wi,wj)≠0,∀i≠j) then we say that the time series is discrete
white noise (DWN).

In particular, if the values wiwi are drawn from a standard normal distribution (i.e.
wt∼N(0,σ2)wt∼N(0,σ2)), then the series is known as Gaussian White Noise.
14. Using the forecast model built, generate sales units forecast for test period (52nd week
of 2014 to 3rd week of 2015).
a. Assess the forecast model accuracy on the test time period which is not used for modeling
by calculating MAPE for the test period.

Answer: Sales will be as follows

fit lwr upr
1 36.36765 28.19258 46.91326
2 29.80780 22.56843 39.36937
3 33.35176 25.24706 44.05821
4 25.76207 19.22271 34.52605

Mape for this would be :

week Lag1_Sales_Units Discount_per Lag1_Discount_per Promotion_week_flag Age Actual sales MAPE

52.2014 48 0.569002728 0.579304707 1 96 45 -0.06667
1.2015 36 0.493558044 0.569002728 1 97 44 0.173463
2.2015 30 0.491279893 0.493558044 1 98 36 0.172006
3.2015 33 0.428838745 0.491279893 1 99 30 -0.11173
4%
This is with lag
Without lag MAPE will be

week Actual_Sales_Units Predicted Mape (Actual-predicted/actual

52.2014 45 36.3677 0.19183
1.2015 44 29.8078 0.32255
2.2015 36 33.3518 0.073562222
3.2015 30 25.7621 0.141264333
Mape 18%
PART C – OPTIMIZATION
15. Formulate an optimization model and solve it to determine the optimal discount % to
be given for “Cluster=2, Brand=BLINK, Brick=HAREMS” combination for each of the 4 weeks
of EOSS

Objective Function:
Max((606-(D1*606)*S1)+((606-(D2*606))*S2)+(606-(D3*606)*S3)+((606-(D3*606)*S3)+((606-
D4*606)*S4)+(2476-S1-S2-S3-S4)*(0.4*606))

Decision variable
Sales 4 week
S1,S2,S3,S4
Discount for 4 weeks
D1,D2,D3,D4
Constraints
Sales contraints
S1<= 2476
S2+S1 <=2476
S2+S2+S1 <= 2476
S1+S2+S3+S4<= 2476

D1 >= 57.9%
D2-D1 >= 0
D3-D2 >=0
D4-D2 >= 0

2476 – S1 >= 0
2476 – (S1+S2)>= 0
2476-(S1+S2+S3)>= 0
2476-(S1+S2+S3+S4 >= 0
D1,D2,D3,D4 <= 0.6
Excel output
Objective function : 602000.4
Sales
S1 : 63 , S2 = 33, S3 = 29 , S4 = 28
Discount
D1: 0.579 , D2 = 0.6 ,D3 = 0.6, D4 = 0.6
16. What are the weekly forecasted sales units if the optimal discounts identified are
implemented for the EOSS?

Answer 16 S1 : 63, S2 : 33, S3:29, S4 :28

We know that actual revenue realized by the retailer for “Cluster=2, Brand=BLINK,
Brick=HAREMS” combination during the 4 weeks of EOSS is INR1 41,320. Then, what is the
incremental lift in revenue the retailer would have achieved in these 4 weeks if he/she
implemented our analytics solution instead?

Current revenue 602000.4

previous revenue 41,320
lift in revenue 560,680

Question 3: Read the case titled, “Machine Learning Algorithms to Drive CRM in the Online
E-Commerce Site at VMWare”, and answer the following questions:

Problem definition

1. Outline the business problem and how it can be converted to an analytics problem.

Answer 1.

Business problem

Large revenue is generated in upgrading to a newer version of workstation everyyear. As this year
VMWare is not launching any new version of workstation, therefore they want to tap into the
untapped customers , target new customers, upsell to existing customers and cross –sell to the
customers that don’t have workstation as yet

Data analytics problem

To classify the user into people likely to make a purchase and people who will not buy by studying
their consumer data collected offline and online and connecting with email address, Rank them in
order of preference

Sampling

2. What is the right cross-validation strategy for this problem? What would happen if we
choose random sampling in this scenario or stratified sampling scenario?
Answer 2

In the current case we should ideally be doing a time-based cross validation. In this method we
simulate the real world by aggregating data to a period and then predicting for the next period

 Simple random samples involve the random selection of data from the entire population so
that each possible sample is equally likely to occur.
 stratified random sampling divides the population into smaller groups, or strata, based on
shared characteristics

In the current problem, we should be going ahead with stratified sampling, as we need to target
audience who meet only certain behavioral characteristics

3. What could be the training data and validation datasets for the model? How should we
go about choosing that and what should be the reasons for the same?

Answer 3.
For training , we would aggregate data upto September 2015 and predict the workstation buyers
during oct- dec 2015.
For validation, we could aggregate data up to December 2015 and compare the predictions against
actual workstation buyers from Jan-March 2016
For scoring , we can aggregate the data upto March 2016

Choice of Evaluation Metric

4. What are the pros and cons of using accuracy as a metric vs. precision vs. recall vs. F-
score vs. Area-under-curve? Which is better and why? What is area-under-curve?

Accuracy: Accuracy simply measures how often the classifier makes the correct prediction. It’s
the ratio between the number of correct predictions and the total number of predictions (the
number of test data points)

Precision and recall are actually two metrics. But they are often used together. Precision answers
the question: Out of the items that the classifier predicted to be true, how many are actually true?
Whereas, recall answers the question: Out of all the items that are true, how many are found to be
true by the classifier?

The precision score quantifies the ability of a classifier to not label a negative example as positive.
The precision score can be interpreted as the probability that a positive prediction made by the
classifier is positive. The score is in the range [0,1] with 0 being the worst, and 1 being perfect
F-score:
The F1 score, commonly used in information retrieval, measures accuracy using the statistics
precision p and recall r. Precision is the ratio of true positives (tp) to all predicted positives (tp +
fp). Recall is the ratio of true positives to all actual positives (tp + fn). The F1 score is given by
The F1 metric weights recall and precision equally, and a good retrieval algorithm will maximize
both precision and recall simultaneously. Thus moderately good performance on both will be
favored over extremely good performance on one and poor performance on the other.

The better among this will be accuracy as in this case we need accuracy to be as precise as
possible as the marketing campaign will be targeted accordingly

AUC:
The AUC score can also be defined when the target classes are of type string. For binary
classification, when the target label is of type string, then the labels are sorted alphanumerically
and the largest label is considered the "positive" label.

6.How do you define lift? Plot the lift curve for the different techniques?

Lift = confidence/expected confidence

Lift is nothing but the ratio of Confidence to Expected Confidence.In the area of association rules
- "A lift ratio larger than 1.0 implies that the relationship between the antecedent and the
consequent is more significant than would be expected if the two sets were independent.

Feature Selection

7. What feature selection techniques could be used to reduce the number of features?
What other feature selection techniques could be used in this scenario?

Selection techniques used in this scenario was the odds ratio of the target variable against each
of the features. If odds ratio is greater than 1indicates that the feature is favorable towards
purchase and odds ratio less than 1 indicates opposite. Higher degree of odd would mean high
favorability

OTHER TECHNIQUES
Filter Methods:
Filter feature selection methods apply a statistical measure to assign a scoring to each feature.
The features are ranked by the score and either selected to be kept or removed from the dataset.
The methods are often univariate and consider the feature independently, or with regard to the
dependent variable.
Other feature methods are as follows

Wrapper Methods

Wrapper methods consider the selection of a set of features as a search problem, where different
combinations are prepared, evaluated and compared to other combinations. A predictive model
us used to evaluate a combination of features and assign a score based on model accuracy.

Embedded Methods

Embedded methods learn which features best contribute to the accuracy of the model while the
model is being created. The most common type of embedded feature selection methods are
regularization methods.

Regularization methods are also called penalization methods that introduce additional
constraints into the optimization of a predictive algorithm (such as a regression algorithm) that
bias the model toward lower complexity (fewer coefficients).

Modeling Techniques

8. How is random forest different from gradient boosting?

The biggest difference between RFs and GBTs is how they optimize the bias–variance tradeoff
Boosting is based on weak learners (high bias, low variance). In terms of decision trees, weak
learners are shallow trees, sometimes even as small as decision stumps (trees with two leaves).
Boosting reduces error mainly by reducing bias (and also to some extent variance, by aggregating
the output from many models).

On the other hand, Random Forest uses as you said fully grown decision trees (low bias, high
variance). It tackles the error reduction task in the opposite way: by reducing variance. The trees
are made uncorrelated to maximize the decrease in variance, but the algorithm cannot reduce
bias (which is slightly higher than the bias of an individual tree in the forest). Hence the need for
large, unpruned trees, so that the bias is initially as low as possible.

9. How could clustering be combined with classification for this problem?

Cluster before Classification will help us understand the audience better with their unique
identification

Thus it will be ideal to run cluster understand the audience and then do classification to further
have more precision in marketing activity and target only the most potential buyers

13. List a few uses of the propensity model in an online store.

Propensity modeling correlates customer characteristics with anticipated behaviors or
propensities. It tracks buying habits as well as other actions such as a customer’s propensity to
open a marketing email, sign up to a loyalty program, or participate in feedback surveys.

 You have three customer segments defined by their shopping frequency, the frequent
shoppers, the slow-and-steady customers, and the at-risk customers. Applying a propensity
modeling predictive tool to each of these customer segments will allow you to develop a far more
successful, long-term sales strategy
 What is the retention probability of your frequent shoppers? Is the frequency between
their shopping trips, or the amount of money they spend on each shop, increasing or declining—
and if so, why? Why do your frequent shoppers prefer to shop with you, and how can you
leverage this knowledge to influence your slow-and-steady and at-risk customers?

14. How can the model be white-boxed to explain the importance of various features in the
model?

Whitebox models are models whose workings can be explained to the sales teams. For example:
Customer X is more likely to upgrade if the support for the older version is coming to an end OR if
a compelling newer version is being launched

Q 4 :Use Naïve Bayes’s algorithm and calculate the probability of positive sentiment for the
following comment: Good location but the staff was unfriendly

The probability has been calculated in python as per following steps

Following steps were followed

Text pre processing

 Feature _ words : Saving the words in vocabulary into feature words

 TP = Documents that are classified as positive sentiment in the training dataset
 TN = Documents that are classified as negative sentiment in the training dataset.

 Loading the document and classifying the same into positive or negative sentiments
 Tokenization the data set : Tokenization is the process of breaking a stream of text up into
words, phrases, symbols, or other meaningful elements called tokens.
Feature: Extraction :Feature extractor is used to convert each comment to a feature set.

Stemming: Different forms of the same word. Stemming is a process of transforming a word into
its stem.

Tokenization :Process of extracting documents into meaningful elements.

N Gram Analysis: is a sequence of n consecutive words(e.g. “machine learning” is 2-gram)

Post text processing following code was used,

Using the vocab set

V = {Beautiful, Good Service, Good Location, Superb, Cleanliness, Mosquitoes, Unfriendly, bad
experience}

Trained the model on the following sentence

Positive Comments:

1. Service was very good. Excellent breakfast in beautiful restaurant included in price. I was
happy there and extended my stay for extra two days.

2. Really helpful staff, the room was clean, beds really comfortable. Great roof top restaurant
with yummy food and very friendly staff.

3. Good location. The Cleanliness part was superb.

4. I stayed for two days in deluxe A/C room (Room no. 404). I think it is renovated recently.
Staff behaviour, room cleanliness all are fine.

Negative Comments

1. The room and public spaces were infested with mosquitoes. I killed a dozen or so in my room
prior to sleeping but still woke up covered in bites.

2. Unfriendly staff with no care for guests.

3. Very worst and bad experience, Service I got from the hotel reception is too worst and
typical.

Then we test run them on following sentence

Good location but the staff was unfriendly

Python code : Prob of positive was 0.36

…………………………………………End of Assignment…………………………………………....

Submitted To:: Prof. Vinay Singh Chawan
No ratings yet
Submitted To:: Prof. Vinay Singh Chawan
12 pages
Repositioning Royco
100% (1)
Repositioning Royco
6 pages
Pathfinders' Guide To Eberron
100% (2)
Pathfinders' Guide To Eberron
316 pages
Unofficial Cheat Sheet For Forecasting
No ratings yet
Unofficial Cheat Sheet For Forecasting
2 pages
DAD Hospital-Case Data
No ratings yet
DAD Hospital-Case Data
8 pages
SIT718 Assessment-Task 4-T3 2019-Amended PDF
No ratings yet
SIT718 Assessment-Task 4-T3 2019-Amended PDF
7 pages
Application of Logistic Regression To People-Analytics
No ratings yet
Application of Logistic Regression To People-Analytics
30 pages
Cheat Sheet
No ratings yet
Cheat Sheet
4 pages
Ch07 Red Tomato Tools
No ratings yet
Ch07 Red Tomato Tools
8 pages
Correlation - Regression Cases
No ratings yet
Correlation - Regression Cases
2 pages
Korean Hangul Lessons
67% (6)
Korean Hangul Lessons
29 pages
2-Doctrine of Arbitrariness and Legislative Action-A Misconceived Application
No ratings yet
2-Doctrine of Arbitrariness and Legislative Action-A Misconceived Application
14 pages
Mahesh Kaushik RSI Based Nifty Ki Dukan Sheet
No ratings yet
Mahesh Kaushik RSI Based Nifty Ki Dukan Sheet
268 pages
2015 10 10
No ratings yet
2015 10 10
19 pages
Statistical Forecasting Models
100% (1)
Statistical Forecasting Models
37 pages
Quiz Forecasting
100% (1)
Quiz Forecasting
13 pages
Real Statistics Examples Part 1A
No ratings yet
Real Statistics Examples Part 1A
853 pages
Statistical Forcasting - Excel, ARIMA
No ratings yet
Statistical Forcasting - Excel, ARIMA
14 pages
Cart Project
75% (4)
Cart Project
17 pages
Recommended Reference Books PDF
No ratings yet
Recommended Reference Books PDF
1 page
Kohli Batting Analysis
No ratings yet
Kohli Batting Analysis
19 pages
Graded Quiz 1 - Working With Python Great Lakes
No ratings yet
Graded Quiz 1 - Working With Python Great Lakes
6 pages
Ensemble Techniques Project
100% (2)
Ensemble Techniques Project
28 pages
Statistics Assignment
No ratings yet
Statistics Assignment
4 pages
Chapter Ten: Forecasting
No ratings yet
Chapter Ten: Forecasting
49 pages
Credit Balance Analysis: Saee Chaudhari
No ratings yet
Credit Balance Analysis: Saee Chaudhari
26 pages
Rajesh and Goel
50% (2)
Rajesh and Goel
8 pages
Data Preparation
No ratings yet
Data Preparation
12 pages
XII STD - Statistics English Medium
No ratings yet
XII STD - Statistics English Medium
280 pages
Enterpreneurship and Innovations in The Digital Transformation Age
No ratings yet
Enterpreneurship and Innovations in The Digital Transformation Age
10 pages
A Predictive Analytics Approach For Demand Forecasting
100% (1)
A Predictive Analytics Approach For Demand Forecasting
22 pages
Business Statistics Assignment
No ratings yet
Business Statistics Assignment
7 pages
(Exp 4) Classification Via Decision Trees in WEKA
No ratings yet
(Exp 4) Classification Via Decision Trees in WEKA
10 pages
DADM Q7,8 and 9 Report
No ratings yet
DADM Q7,8 and 9 Report
4 pages
X Education - Lead Scoring Case Study
No ratings yet
X Education - Lead Scoring Case Study
24 pages
Lead Scoring Subjective Questions
No ratings yet
Lead Scoring Subjective Questions
3 pages
Project Linear Regression
No ratings yet
Project Linear Regression
7 pages
10.1 Time Series Analysis Sales Forecast
No ratings yet
10.1 Time Series Analysis Sales Forecast
7 pages
Business Report Advance Statistics
No ratings yet
Business Report Advance Statistics
39 pages
Operation Management
0% (1)
Operation Management
12 pages
PG Program Dsba Classroom
No ratings yet
PG Program Dsba Classroom
16 pages
Sukanya Linear LogisticRegression Report
100% (1)
Sukanya Linear LogisticRegression Report
23 pages
Faculty PPT-Customer Life Time Value Analytics PDF
No ratings yet
Faculty PPT-Customer Life Time Value Analytics PDF
40 pages
KPMG GRP 1
No ratings yet
KPMG GRP 1
47 pages
Credit EDA Case Study
100% (3)
Credit EDA Case Study
16 pages
Time Series Project
No ratings yet
Time Series Project
19 pages
Simulating Trading Strategies
No ratings yet
Simulating Trading Strategies
47 pages
Scalene Works-HR Analytics
0% (1)
Scalene Works-HR Analytics
10 pages
Inf Sta3
No ratings yet
Inf Sta3
15 pages
Time Series Analysis: 1.case Study
No ratings yet
Time Series Analysis: 1.case Study
15 pages
Jury of Executive Opinion Method
No ratings yet
Jury of Executive Opinion Method
6 pages
Ba ZG521 Ec-3r First Sem 2023-2024
No ratings yet
Ba ZG521 Ec-3r First Sem 2023-2024
5 pages
Class Assignment 1 For Business Analytics
No ratings yet
Class Assignment 1 For Business Analytics
5 pages
House Price Prediction Using Machine Learning: Bachelor of Technology
No ratings yet
House Price Prediction Using Machine Learning: Bachelor of Technology
20 pages
Bca PDF
No ratings yet
Bca PDF
114 pages
CRM Unit 5 - Customer Analytics Part I
No ratings yet
CRM Unit 5 - Customer Analytics Part I
23 pages
in Your Line of Work, Cite A Situation Using One of The Quantitative Techniques As Basis of Decision/s You Made
No ratings yet
in Your Line of Work, Cite A Situation Using One of The Quantitative Techniques As Basis of Decision/s You Made
57 pages
Retail Analysis With Walmart Data
No ratings yet
Retail Analysis With Walmart Data
2 pages
Sales Prediction For Big Mart 3.0.pptx MM
No ratings yet
Sales Prediction For Big Mart 3.0.pptx MM
25 pages
Linear Algebra For Business Analytics
No ratings yet
Linear Algebra For Business Analytics
27 pages
Gyaan Kosh Term 2: Competitive Strategy
No ratings yet
Gyaan Kosh Term 2: Competitive Strategy
23 pages
Minor Project
No ratings yet
Minor Project
21 pages
66 India Post Case Study
No ratings yet
66 India Post Case Study
3 pages
Forecasting: JY Le Boudec
No ratings yet
Forecasting: JY Le Boudec
93 pages
Pharmeasy
No ratings yet
Pharmeasy
16 pages
Month 205-70-N1190 PC - 198 - 27 - 42263 PC - 203 - 32 - 51461 PC - 600 - 863 - 4210 PC - 6735 - 61 - 3410
No ratings yet
Month 205-70-N1190 PC - 198 - 27 - 42263 PC - 203 - 32 - 51461 PC - 600 - 863 - 4210 PC - 6735 - 61 - 3410
4 pages
Assignment 2 Module 3
No ratings yet
Assignment 2 Module 3
26 pages
Classification Trees - CART and CHAID
No ratings yet
Classification Trees - CART and CHAID
50 pages
Gradient Descent
No ratings yet
Gradient Descent
18 pages
Ambition
No ratings yet
Ambition
1 page
Level 1 - Media Design Facilator Guide - V2.0
No ratings yet
Level 1 - Media Design Facilator Guide - V2.0
416 pages
Rethinking Marketing and Insights: Behavioral Economics Immersion
No ratings yet
Rethinking Marketing and Insights: Behavioral Economics Immersion
2 pages
University of Gondar: Multimedia Systems Chapter-Two Multimedia Authoring Tools
No ratings yet
University of Gondar: Multimedia Systems Chapter-Two Multimedia Authoring Tools
32 pages
How To Work Smart
No ratings yet
How To Work Smart
49 pages
Internal vs. External CEOs - Research Spotlight
No ratings yet
Internal vs. External CEOs - Research Spotlight
12 pages
Kural 2011 J. Phys.: Conf. Ser. 305 012088
No ratings yet
Kural 2011 J. Phys.: Conf. Ser. 305 012088
11 pages
Monika CV
No ratings yet
Monika CV
3 pages
OpMiniStatement09 12 2019
No ratings yet
OpMiniStatement09 12 2019
1 page
Virginia SOL - Art
No ratings yet
Virginia SOL - Art
36 pages
Kylee Williams - Personification Lesson Plan Draft 2
No ratings yet
Kylee Williams - Personification Lesson Plan Draft 2
5 pages
Michael Pollan-The Omnivores Dilemma
50% (2)
Michael Pollan-The Omnivores Dilemma
202 pages
Whither The Common Law Derivative Action? A Malaysian Case Study
No ratings yet
Whither The Common Law Derivative Action? A Malaysian Case Study
5 pages
Lesson Plan 2
No ratings yet
Lesson Plan 2
2 pages
LInux LVM
No ratings yet
LInux LVM
10 pages
IBA International Principles On Conduct For The Legal Profession May 2011
No ratings yet
IBA International Principles On Conduct For The Legal Profession May 2011
36 pages
1 s2.0 S0346251X23000970 Main
No ratings yet
1 s2.0 S0346251X23000970 Main
11 pages
Analysis of The Spider and The Bee Episo
No ratings yet
Analysis of The Spider and The Bee Episo
5 pages
DJAVIT An, AHMET, The Development of Turkish Cypriot Secularism
No ratings yet
DJAVIT An, AHMET, The Development of Turkish Cypriot Secularism
11 pages
Narrative Report For Final Demonstration Paolo F. Esguerra
No ratings yet
Narrative Report For Final Demonstration Paolo F. Esguerra
2 pages
Serial Killer Statistics
No ratings yet
Serial Killer Statistics
20 pages
Leshy PDF
100% (2)
Leshy PDF
2 pages
Determination Aluminium, Calcium, Manganese and Titanium in Ferrosilicon Alloys by Atomic-Absorption Spectrophotometry
No ratings yet
Determination Aluminium, Calcium, Manganese and Titanium in Ferrosilicon Alloys by Atomic-Absorption Spectrophotometry
5 pages
2.1 What Is A Contract of Sale?
100% (2)
2.1 What Is A Contract of Sale?
20 pages
Target Market and Marketing Mix
No ratings yet
Target Market and Marketing Mix
2 pages
Architecture Culture & History 2 (Arc 60203)
No ratings yet
Architecture Culture & History 2 (Arc 60203)
64 pages