Assignment 5
BIA 7 – IIMB
Priyanka Sindhwani
4/25/2017
Question 1: What strategy should Vijaya Kumar adopt for developing forecasting model for
demand estimation of 20,000 spare parts?
The spare parts are categorized as fast-, medium-, and slow-moving, and are further classified by their revenue contribution.
The items that contribute most to sales, are higher in value, and are fast moving should be targeted first, so their forecasts are taken up first. Items whose demand fluctuates and cannot be made stationary should not be picked up initially, as we will not be able to build a good forecast model for them.
Question 2: Develop forecasting models for data provided in the Excel sheet titled ‘‘L&T
Spare Parts Forecasting’’ and discuss the choice for using a particular forecasting model.
Choosing Data set 1: “L&T Spare Parts Forecasting”
We take the data set and check the differenced series for stationarity:
data: difference
Dickey-Fuller = -10.305, Lag order = 0, p-value = 0.01
alternative hypothesis: stationary
The series becomes stationary after first differencing (d = 1), so we fit a seasonal ARIMA model on the log-transformed data:
Call:
arima(x = log(ITEM1), order = c(2, 1, 2), seasonal = list(order = c(1, 0, 1)))
Coefficients:
ar1 ar2 ma1 ma2 sar1 sma1
-0.1756 -0.7045 -0.3018 0.9442 0.3896 -0.0741
s.e. 0.1377 0.1402 0.1148 0.1763 0.5946 0.6401
Residuals plot:
Prediction:
Month     Point Forecast   Lo 95      Hi 95
May-13    387.9071         280.8610   535.7529
Jun-13    360.8679         250.6561   519.5391
Jul-13    348.5916         221.3342   549.0157
Aug-13    315.7873         176.7107   564.3209
Sep-13    324.2867         170.6734   616.1590
Observed vs fitted plot and forecast plot
Box-Ljung test
data: resid(Model1)
X-squared = 6.2612, df = 10, p-value = 0.7929
As the p-value > 0.05, we retain the null hypothesis; the residuals are independently distributed (white noise).
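A sketch of the R pipeline that would produce the outputs above (the ts construction, the k = 0 lag choice, and the tseries/forecast packages are assumptions; the same steps were repeated for the other items):

library(tseries)   # adf.test
library(forecast)  # forecast()

# ITEM1 is assumed to be a monthly ts, e.g. ts(sales, start = c(2010, 1), frequency = 12)
difference <- diff(log(ITEM1))               # first difference of the log series
adf.test(difference, k = 0)                  # Dickey-Fuller test, lag order 0

Model1 <- arima(log(ITEM1), order = c(2, 1, 2),
                seasonal = list(order = c(1, 0, 1)))
fc <- forecast(Model1, h = 5, level = 95)    # May-13 to Sep-13
exp(fc$mean)                                 # back-transform the point forecasts

Box.test(resid(Model1), lag = 10, type = "Ljung-Box")  # residual independence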
Choosing Data set 2: “L&T Spare Parts Forecasting”
We take the data set and check for stationarity:
data: difference
Dickey-Fuller = -14.185, Lag order = 0, p-value = 0.01
alternative hypothesis: stationary
Model:
Series: log(ITEM1)
ARIMA(2,1,3)(1,0,1)[12]
Coefficients:
ar1 ar2 ma1 ma2 ma3 sar1 sma1
-1.3387 -0.9938 0.5009 -0.2139 -0.8754 0.8697 -0.5095
s.e. 0.0531 0.0182 0.2918 0.2230 0.3509 0.2575 0.5084
Accuracy: MAPE = 8.83
Month Lower Point Forecast Upper
May 2013 220.9346 302.2483 413.4890
Jun 2013 184.0721 252.4793 346.3088
Jul 2013 175.8134 241.3509 331.3185
Aug 2013 164.6951 227.1128 313.1862
Sep 2013 154.5706 213.4543 294.7696
Box-Ljung test
data: resid(Model1)
X-squared = 7.0564, df = 10, p-value = 0.7201
As the p-value > 0.05, we retain the null hypothesis; the residuals are independently distributed.
Choosing Data set 3: “L&T Spare Parts Forecasting”
MAPE: 9.5
Box-Ljung test
data: resid(model)
X-squared = 6.1255, df = 10, p-value = 0.8046
As the p-value > 0.05, we retain the null hypothesis; the residuals are independently distributed.
Choosing Data set 4: “L&T Spare Parts Forecasting”
The same steps are followed:
data: ITEM1
Dickey-Fuller = -7.7298, Lag order = 0, p-value = 0.01
alternative hypothesis: stationary
Model:
Series: ITEM1
ARIMA(0,1,4)(1,1,1)[12]
Coefficients:
ma1 ma2 ma3 ma4 sar1 sma1
-0.9629 -0.0397 -0.0776 0.3997 -0.0628 -0.9997
s.e. 0.1660 0.2086 0.3390 0.2748 0.2569 0.9762
MAPE: 8.5
Point Forecast Lo 95 Hi 95
May 2013 850.1776 674.8568 1025.4984
Jun 2013 598.4989 423.0542 773.9437
Jul 2013 803.1488 627.8436 978.4541
Aug 2013 807.7244 631.9686 983.4803
Sep 2013 616.4064 431.9856 800.8272
Box-Ljung test
data: resid(Model1)
X-squared = 10.55, df = 10, p-value = 0.3936
As the p-value > 0.05, we retain the null hypothesis; the residuals are independently distributed.
Choosing Data set 5: “L&T Spare Parts Forecasting”
data: ITEM1
Dickey-Fuller = -5.5116, Lag order = 0, p-value = 0.01
alternative hypothesis: stationary
We reject the null hypothesis; the series is stationary.
Model:
ARIMA(0,1,2)(1,1,1)[12]
Coefficients:
ma1 ma2 sar1 sma1
-0.9401 0.1894 -0.328 -0.7643
s.e. 0.2113 0.2172 0.392 1.4232
MAPE: 14.69
Forecast:
Point Forecast Lo 95 Hi 95
May 2013 675.0444 372.4832 977.6056
Jun 2013 563.3441 261.3652 865.3230
Jul 2013 566.6540 255.5933 877.7148
Aug 2013 560.3258 240.4409 880.2107
Sep 2013 480.4308 151.9588 808.9028
Box-Ljung test
data: resid(Model1)
X-squared = 6.8745, df = 10, p-value = 0.7372
As the p-value > 0.05, we retain the null hypothesis; the residuals are independently distributed.
Choosing Data set 6: “L&T Spare Parts Forecasting”
Model1:
Series: ITEM1
ARIMA(2,2,1)(1,1,1)[12]
Coefficients:
ar1 ar2 ma1 sar1 sma1
-0.5983 -0.5790 -0.9942 -0.0728 -0.5472
s.e. 0.1403 0.1471 0.1084 0.5797 0.7806
Point Forecast Lo 95 Hi 95
May 2013 48.82538 9.200454 88.45031
Jun 2013 30.28432 -13.234616 73.80325
Jul 2013 56.92338 12.241527 101.60523
Aug 2013 73.82233 20.312473 127.33218
Sep 2013 46.75125 -12.056058 105.55855
Box-Ljung test
data: resid(Model1)
X-squared = 13.157, df = 10, p-value = 0.215
As the p-value > 0.05, we retain the null hypothesis; the residuals are independently distributed.
Question 3. Which forecasting techniques should L&T use to forecast different spare items?
Answer 3. For most of the spare parts, seasonal ARIMA predicts well, with less than 10% error (MAPE), so that methodology can be used.
PART A: CLUSTERING
Question 2: List and derive the metrics that can be used in ‘‘hierarchical clustering’’ and
‘‘partition around medoids’’ clustering algorithms.
Answer 2:
We took only the numeric variables; the derived variables for the clustering algorithms were as follows:
Q2. Do you find outliers in the derived data from Q1? If yes, how can the same be treated for
use in cluster modeling?
First we tried to combine the outliers into a cluster of their own, but as the dendrogram showed this was not possible, we decided to remove the outliers for this exercise.
Q3. Develop a hierarchical clustering model with the modified data from Q2. How many
clusters seem appropriate? Justify
Answer 3.
Hierarchical Clustering:
Removing outliers: we removed the top 3 outliers, then clustered as sketched below.
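A minimal sketch of the hierarchical step, assuming the cleaned numeric data (outliers removed) sits in a data frame spare_num:

d  <- dist(scale(spare_num))          # Euclidean distances on scaled variables
hc <- hclust(d, method = "ward.D2")   # agglomerative clustering
plot(hc)                              # dendrogram: inspect for outlier singletons
clusters <- cutree(hc, k = 3)         # cut at the chosen number of clusters
table(clusters)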
Advantage:
PAM (k-medoids) is based on medoids, calculated by minimizing the absolute distance between the points and the selected medoid rather than the squared distance. As a result, it is more robust to noise and outliers than k-means.
First, k-medoids can be used with any similarity measure. K-means, however, may fail to converge; it must only be used with distances that are consistent with the mean. For example, absolute Pearson correlation must not be used with k-means, but it works well with k-medoids.
Robustness of the medoid
Second, the medoid as used by k-medoids is roughly comparable to the median (in fact, there is also k-medians, which is like k-means but for Manhattan distance). The literature on the median gives plenty of explanations and examples of why the median is more robust to outliers than the arithmetic mean, and these essentially hold for the medoid as well. It is a more robust estimate of a representative point than the mean used in k-means. A PAM sketch follows.
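A sketch of the corresponding PAM fit (the cluster package provides pam and silhouette; spare_num is the assumed data name):

library(cluster)
pm <- pam(scale(spare_num), k = 3)    # medoids minimize absolute distances
pm$medoids                            # one representative observation per cluster
plot(silhouette(pm))                  # silhouette widths per cluster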
The appropriate number of clusters is decided from the within-cluster distances to the centroids, which we can determine with an elbow/scree plot, as sketched below.
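A quick elbow-plot sketch, using the k-means total within-cluster sum of squares as a proxy (spare_num assumed as before):

wss <- sapply(1:10, function(k)
  kmeans(scale(spare_num), centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")  # look for the elbow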
5. Validate the goodness of resulting clusters from hierarchical and PAM models obtained in
Q3 and Q4. Which is a better model as per validation measures?
Validating results
Clustering methods compared: hierarchical, k-means, and PAM; cluster sizes evaluated: 3, 4, and 5. (Validation-measure tables and optimal scores not reproduced.)
Connectivity should be minimized, while both the Dunn index and the silhouette width should be
maximized.
Thus, it appears that hierarchical clustering outperforms the other clustering algorithms under
each validation measure, for nearly every number of clusters evaluated.
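A sketch of the validation run such a comparison comes from, assuming the clValid package and the same cleaned numeric matrix:

library(clValid)
cv <- clValid(as.matrix(scale(spare_num)), nClust = 3:5,
              clMethods = c("hierarchical", "kmeans", "pam"),
              validation = "internal")   # Connectivity, Dunn, Silhouette
summary(cv)                              # optimal score per measure and method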
Answer 7
As expected, a higher discount increases sales units, although at the beginning and end of every year sales drop even when there is a discount. This shows it is a seasonal product, not high in demand in winter.
As the plots show, net price is inversely related to sales units: if the net price of a product increases, sales go down.
Correlation: -0.5842883; age and sales are negatively correlated. Thus, the older the product, the lower its sales are likely to be.
8. What is over-fitting and under-fitting in the context of regression models? What are the
consequences of over-fitting?
9. Explain why we have to partition the time series data before building the forecasting
model. Use data for “Cluster=2, Brand=BLINK, Brick=HAREMS” and partition this time
series data as explained below.
i. Consider all weeks until 51st week (including) of 2014 as training data.
ii. Consider weeks from 52nd week of 2014 to 3rd week of 2015 as test data.
Answer 9.
Data partitioning is a necessary step: the basic idea is to separate the available data into a training set and a testing (or validation) set.
We fit the model on the training ("seen") data and evaluate it on the test ("unseen") data, so that we have some level of confidence about the predictive power of the model on genuinely new data.
For cross-sectional data we usually take great care that the training and testing samples are randomly chosen (or, for unbalanced data, carefully chosen). For time series, however, the split must preserve chronological order, as sketched below.
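A sketch of the chronological split, assuming sales_unit is a weekly ts object with frequency = 52 starting in week 1 of 2014:

TS_train <- window(sales_unit, end   = c(2014, 51))  # up to the 51st week of 2014
TS_test  <- window(sales_unit, start = c(2014, 52))  # 52nd week 2014 to 3rd week 2015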
10. Develop a time series forecast model using regression on the training data to forecast
sales units for “Cluster=2, Brand=BLINK, Brick=HAREMS” combination using the below
variables as predictors.
a. Lag 1 (i.e. immediate previous week) of sales units
b. Discount %
c. Lag 1 (i.e. immediate previous week) of discount %
d. Promotion week flag
e. Age
Apply appropriate transformations and evaluate the model fit.
Answer 10.
As the data was not stationary, we applied a log transformation to sales_unit. The model developed was as follows:
Call:
lm(formula = log(TS_train) ~ as.factor(promo_week_flg) + age +
log(salenew) + disnew + discount_per, data = train_dataset)
Residuals:
Min 1Q Median 3Q Max
-1.24444 -0.15463 0.01581 0.18425 1.14370
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.31963 0.41381 5.606 2.35e-07 ***
as.factor(promo_week_flg)1 0.34650 0.12500 2.772 0.006800 **
age -0.01392 0.00346 -4.022 0.000121 ***
log(salenew) 0.53851 0.08338 6.458 5.64e-09 ***
disnew -1.90192 0.43924 -4.330 3.94e-05 ***
discount_per 2.25105 0.44093 5.105 1.89e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
11. Perform checks to ensure that the model is valid and assumptions of regression are
met. Conduct appropriate statistical test and back the findings by visual examination
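The usual checks are residual plots plus formal tests for autocorrelation, heteroscedasticity, and normality; a minimal sketch (model stands for the lm fit from Answer 10, and the lmtest package is an assumed dependency):

library(lmtest)
par(mfrow = c(2, 2)); plot(model)    # residuals vs fitted, Q-Q, scale-location, leverage
dwtest(model)                        # Durbin-Watson: autocorrelation of residuals
bptest(model)                        # Breusch-Pagan: heteroscedasticity
shapiro.test(resid(model))           # Shapiro-Wilk: normality of residuals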
12. Based on the model result, explain the following:
a. How is this forecasting model able to account for trend and seasonality?
b. What is price elasticity and determine the price elasticity value (a proxy representing
price elasticity is enough) from the model output?
c. How do you interpret the coefficient of promotion week flag variable?
Answer 12.
a. Seasonality is difficult to account for in the training data, as it does not span one complete year.
b. Price elasticity is a measure of the effect of a price change, or a change in the quantity supplied, on the demand for a product or service.
13.How do you check if the forecasting model is able to explain most of the important
features of the time series? Explain white noise in the context of time series.
We check whether the model has captured the main features of the series by asking, of both the series and the residuals:
Is there a trend, meaning that, on average, the measurements tend to increase (or decrease) over time?
Is there seasonality, meaning a regularly repeating pattern of highs and lows related to calendar time such as seasons, quarters, months, days of the week, and so on?
Are there outliers? In regression, outliers are far away from your line; with time series data, outliers are far away from your other data.
Is there a long-run cycle or period unrelated to seasonality factors?
Is there constant variance over time, or is the variance non-constant?
Are there any abrupt changes to either the level of the series or the variance?
White Noise :
A white noise process is one with mean zero and no correlation between its values at different times.
Consider a time series {w_t : t = 1, ..., n}. If the elements of the series, w_t, are independent and identically distributed (i.i.d.) with a mean of zero, variance σ², and no serial correlation (i.e. Cor(w_i, w_j) = 0 for all i ≠ j), then we say that the time series is discrete white noise (DWN).
In particular, if the values w_t are drawn from a normal distribution (i.e. w_t ~ N(0, σ²)), then the series is known as Gaussian white noise.
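A short simulation sketch: Gaussian white noise should show no trend, no seasonality, and an ACF inside the confidence band at all non-zero lags:

set.seed(42)
w <- rnorm(100, mean = 0, sd = 1)  # w_t ~ N(0, 1)
plot.ts(w)                         # no trend or seasonality
acf(w)                             # spikes within the band for lag > 0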
14. Using the forecast model built, generate sales units forecast for test period (52nd week
of 2014 to 3rd week of 2015).
a. Assess the forecast model accuracy on the test time period which is not used for modeling
by calculating MAPE for the test period.
15. Optimization model for the EOSS discounts
Objective function (revenue from the four discounted EOSS weeks plus 40%-of-MRP salvage on unsold stock; MRP = 606, stock = 2476):
Max (606 − 606·D1)·S1 + (606 − 606·D2)·S2 + (606 − 606·D3)·S3 + (606 − 606·D4)·S4 + (2476 − S1 − S2 − S3 − S4)·(0.4·606)
Decision variables
Sales for the 4 weeks: S1, S2, S3, S4
Discounts for the 4 weeks: D1, D2, D3, D4
Constraints
Cumulative sales cannot exceed the available stock:
S1 ≤ 2476
S1 + S2 ≤ 2476
S1 + S2 + S3 ≤ 2476
S1 + S2 + S3 + S4 ≤ 2476
Discounts are non-decreasing across the weeks and bounded:
D1 ≥ 0.579
D2 − D1 ≥ 0
D3 − D2 ≥ 0
D4 − D3 ≥ 0
D1, D2, D3, D4 ≤ 0.6
Excel output
Objective function: 602000.4
Sales: S1 = 63, S2 = 33, S3 = 29, S4 = 28
Discounts: D1 = 0.579, D2 = 0.6, D3 = 0.6, D4 = 0.6
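A small R helper (a sketch, not the Excel model itself) to evaluate the objective for candidate discount and sales vectors; plugging in the Excel solution reproduces the objective value up to rounding of the decision variables:

# Discounted revenue over the 4 EOSS weeks plus 40% salvage on unsold stock
eoss_revenue <- function(D, S, mrp = 606, stock = 2476, salvage = 0.4) {
  sum((mrp - D * mrp) * S) + (stock - sum(S)) * salvage * mrp
}
eoss_revenue(D = c(0.579, 0.6, 0.6, 0.6), S = c(63, 33, 29, 28))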
16. What are the weekly forecasted sales units if the optimal discounts identified are
implemented for the EOSS?
We know that actual revenue realized by the retailer for “Cluster=2, Brand=BLINK,
Brick=HAREMS” combination during the 4 weeks of EOSS is INR 1,41,320. Then, what is the
incremental lift in revenue the retailer would have achieved in these 4 weeks if he/she
implemented our analytics solution instead?
Question 3: Read the case titled, “Machine Learning Algorithms to Drive CRM in the Online
E-Commerce Site at VMWare”, and answer the following questions:
Problem definition
1. Outline the business problem and how it can be converted to an analytics problem.
Answer 1.
Business problem
Large revenue is generated from customers upgrading to a newer version of Workstation every year. As VMware is not launching a new version of Workstation this year, they want to tap untapped customers, target new customers, upsell to existing customers, and cross-sell to customers who do not yet have Workstation.
Analytics problem
Classify users into people likely to make a purchase and people who will not buy, by studying consumer data collected offline and online and connected through email addresses, and rank them in order of purchase propensity.
Sampling
2. What is the right cross-validation strategy for this problem? What would happen if we
choose random sampling in this scenario or stratified sampling scenario?
Answer 2
In the current case we should ideally do time-based cross-validation: we simulate the real world by aggregating data up to a period and then predicting for the next period.
Simple random sampling involves the random selection of data from the entire population, so that each possible sample is equally likely to occur. Stratified random sampling divides the population into smaller groups, or strata, based on shared characteristics.
In the current problem we should go ahead with stratified sampling, as we need to target an audience that meets only certain behavioral characteristics.
3. What could be the training data and validation datasets for the model? How should we
go about choosing that and what should be the reasons for the same?
Answer 3.
For training, we would aggregate data up to September 2015 and predict the Workstation buyers during Oct–Dec 2015.
For validation, we could aggregate data up to December 2015 and compare the predictions against actual Workstation buyers from Jan–March 2016.
For scoring, we can aggregate the data up to March 2016.
4. What are the pros and cons of using accuracy as a metric vs. precision vs. recall vs. F-
score vs. Area-under-curve? Which is better and why? What is area-under-curve?
Accuracy: Accuracy simply measures how often the classifier makes the correct prediction. It’s
the ratio between the number of correct predictions and the total number of predictions (the
number of test data points)
Precision and recall are actually two metrics. But they are often used together. Precision answers
the question: Out of the items that the classifier predicted to be true, how many are actually true?
Whereas, recall answers the question: Out of all the items that are true, how many are found to be
true by the classifier?
The precision score quantifies the ability of a classifier to not label a negative example as positive.
The precision score can be interpreted as the probability that a positive prediction made by the
classifier is positive. The score is in the range [0,1] with 0 being the worst, and 1 being perfect
F-score:
The F1 score, commonly used in information retrieval, measures accuracy using the statistics precision p and recall r. Precision is the ratio of true positives (tp) to all predicted positives (tp + fp). Recall is the ratio of true positives to all actual positives (tp + fn). The F1 score is given by
F1 = 2pr / (p + r)
The F1 metric weights recall and precision equally, and a good retrieval algorithm will maximize both precision and recall simultaneously. Thus moderately good performance on both will be favored over extremely good performance on one and poor performance on the other.
Among these, accuracy is the better choice here, as we need the predictions to be as accurate as possible: the marketing campaign will be targeted according to them.
AUC:
The AUC is the area under the ROC curve, which plots the true positive rate against the false positive rate across classification thresholds; it measures how well the classifier ranks positive cases above negative ones. The AUC score can also be defined when the target classes are of type string: for binary classification, the labels are then sorted alphanumerically and the largest label is considered the "positive" label.
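A minimal sketch of computing AUC, assuming the pROC package and vectors y (0/1 actuals) and p (predicted probabilities):

library(pROC)
roc_obj <- roc(response = y, predictor = p)  # ROC: TPR vs FPR across thresholds
auc(roc_obj)                                 # area under the ROC curve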
6. How do you define lift? Plot the lift curve for the different techniques.
Lift is the ratio of the response rate achieved when targeting the customers with the highest predicted probabilities to the response rate achieved when targeting at random; the lift curve plots this ratio against the fraction of customers targeted.
Feature Selection
7. What feature selection techniques could be used to reduce the number of features?
What other feature selection techniques could be used in this scenario?
The selection technique used in this scenario was the odds ratio of the target variable against each of the features. An odds ratio greater than 1 indicates that the feature is favorable towards purchase, and an odds ratio less than 1 indicates the opposite; the higher the odds ratio, the higher the favorability. A sketch of the computation follows.
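A minimal sketch of the odds-ratio computation for one binary feature (the column names and 0/1 coding are assumptions):

tab <- table(df$visited_pricing_page, df$purchased)  # rows: feature 0/1, cols: target 0/1
odds_ratio <- (tab[2, 2] * tab[1, 1]) / (tab[2, 1] * tab[1, 2])
odds_ratio   # > 1 favors purchase, < 1 the opposite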
OTHER TECHNIQUES
Filter Methods:
Filter feature selection methods apply a statistical measure to assign a scoring to each feature.
The features are ranked by the score and either selected to be kept or removed from the dataset.
The methods are often univariate and consider the feature independently, or with regard to the
dependent variable.
Other feature selection methods are as follows:
Wrapper Methods
Wrapper methods consider the selection of a set of features as a search problem, where different
combinations are prepared, evaluated and compared to other combinations. A predictive model
is used to evaluate a combination of features and assign a score based on model accuracy.
Embedded Methods
Embedded methods learn which features best contribute to the accuracy of the model while the
model is being created. The most common type of embedded feature selection methods are
regularization methods.
Regularization methods are also called penalization methods that introduce additional
constraints into the optimization of a predictive algorithm (such as a regression algorithm) that
bias the model toward lower complexity (fewer coefficients).
Modeling Techniques
The biggest difference between RFs and GBTs is how they optimize the bias–variance tradeoff
Boosting is based on weak learners (high bias, low variance). In terms of decision trees, weak
learners are shallow trees, sometimes even as small as decision stumps (trees with two leaves).
Boosting reduces error mainly by reducing bias (and also to some extent variance, by aggregating
the output from many models).
On the other hand, Random Forest uses fully grown decision trees (low bias, high variance). It tackles error reduction in the opposite way: by reducing variance. The trees are made uncorrelated to maximize the decrease in variance, but the algorithm cannot reduce bias (which is slightly higher than the bias of an individual tree in the forest). Hence the need for large, unpruned trees, so that the bias is initially as low as possible. A comparative sketch follows.
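A comparative sketch under these assumptions (randomForest and gbm as the package choices; df a data frame with a 0/1 target buy):

library(randomForest)
library(gbm)

rf <- randomForest(factor(buy) ~ ., data = df, ntree = 500)  # deep trees; averaging reduces variance
gb <- gbm(buy ~ ., data = df, distribution = "bernoulli",
          n.trees = 500, interaction.depth = 2, shrinkage = 0.05)  # shallow trees; boosting reduces bias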
Clustering before classification will help us understand the audience better through their distinguishing characteristics.
Thus it is ideal to first run clustering to understand the audience and then perform classification, to gain more precision in the marketing activity and target only the most likely buyers.
You have three customer segments defined by their shopping frequency: the frequent shoppers, the slow-and-steady customers, and the at-risk customers. Applying a propensity-modeling predictive tool to each of these customer segments will allow you to develop a far more successful, long-term sales strategy.
What is the retention probability of your frequent shoppers? Is the frequency of their shopping trips, or the amount of money they spend on each trip, increasing or declining, and if so, why? Why do your frequent shoppers prefer to shop with you, and how can you leverage this knowledge to influence your slow-and-steady and at-risk customers?
14. How can the model be white-boxed to explain the importance of various features in the
model?
White-box models are models whose workings can be explained to the sales teams. For example: customer X is more likely to upgrade if support for the older version is coming to an end, or if a compelling newer version is being launched.
Q4: Use the Naïve Bayes algorithm and calculate the probability of positive sentiment for the following comment: “Good location but the staff was unfriendly”
We load the document and classify the comments into positive or negative sentiments.
Tokenization of the data set: tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens.
Feature extraction: a feature extractor is used to convert each comment to a feature set.
Stemming: different forms of the same word are reduced to a common root; stemming is the process of transforming a word into its stem. The computation with these steps is sketched after the comments below.
V = {Beautiful, Good Service, Good Location, Superb, Cleanliness, Mosquitoes, Unfriendly, bad
experience}
Positive Comments:
1. Service was very good. Excellent breakfast in beautiful restaurant included in price. I was
happy there and extended my stay for extra two days.
2. Really helpful staff, the room was clean, beds really comfortable. Great roof top restaurant
with yummy food and very friendly staff.
4. I stayed for two days in deluxe A/C room (Room no. 404). I think it is renovated recently.
Staff behaviour, room cleanliness all are fine.
Negative Comments
1. The room and public spaces were infested with mosquitoes. I killed a dozen or so in my room
prior to sleeping but still woke up covered in bites.
3. Very worst and bad experience, Service I got from the hotel reception is too worst and
typical.
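A minimal Naïve Bayes sketch with add-one (Laplace) smoothing, using hand-stemmed token lists for the comments above (the exact token lists are a simplifying assumption):

pos <- c("service good excellent breakfast beautiful restaurant happy",
         "helpful staff clean comfortable great restaurant yummy friendly staff",
         "deluxe room renovated staff behaviour cleanliness fine")
neg <- c("room infested mosquitoes killed bites",
         "worst bad experience service reception worst typical")

tokens <- function(docs) unlist(strsplit(tolower(docs), "\\s+"))
vocab  <- unique(c(tokens(pos), tokens(neg)))

# P(word | class) with add-one smoothing over the vocabulary
word_prob <- function(w, class_tokens)
  (sum(class_tokens == w) + 1) / (length(class_tokens) + length(vocab))

query <- c("good", "location", "staff", "unfriendly")
prior_pos <- length(pos) / (length(pos) + length(neg))
p_pos <- prior_pos       * prod(sapply(query, word_prob, tokens(pos)))
p_neg <- (1 - prior_pos) * prod(sapply(query, word_prob, tokens(neg)))
p_pos / (p_pos + p_neg)   # posterior probability of positive sentiment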