Assignment 2 Module 3
Priyanka Sindhwani
IIMB – BIA 7
Question 1:
Which of the following statements are correct (more than one may be correct)?
Tick (✓) all right answers or highlight the correct statements with color.
1. The model explains 42.25% of variation in box office collection.
2. There are outliers in the model.
3. The residuals do not follow a normal distribution.
4. The model cannot be used since R-square is low.
5. Box office collection increases as the budget increases.
Answer 1.1
1. The model explains 42.25% of variation in box office collection. (R2 is the square of R, which explains the variation: R = 0.650, R2 = 0.4225 = 42.25%)
2. There are outliers in the model.
3. The residuals do not follow a normal distribution.
4. The model cannot be used since R-square is low.
5. Box office collection increases as the budget increases. (Putting it in the equation Y = Beta0 + Beta1*x1: Y = -8.354 + 2.175*x1, the slope is positive)
Question 1.2
Mr Chellappa, CEO of Oho Productions (OP) claims that the regression model in Table 3 is incorrect since it has
negative constant value. Comment whether Mr Chellappa is correct in his assessment about the model.
Answer 1.2
The y-intercept is only meaningful if it is logically meaningful for all predictor variables to be zero.
The constant in a regression model is interpreted as the expected mean value of Y when all predictors are zero. In the
current case the budget can never be zero or negative, therefore the negative constant has no practical interpretation
and does not by itself make the model incorrect.
Question 1.3
What is the average difference in the box office collection when a movie is released during a holiday
season (Releasing_Time_holiday_season) versus movies released during normal season
(Releasing_Time_Normal_Season)? Use a significance value of 5%.
Answer 1.3
For movies released during the holiday season (the base category), the coefficient is equal to the constant, which in this case is 2.685.
For Releasing_Time_Normal_Season the equation is Y = 2.685 + 0.147*x.
At a significance level of 0.05, release during the normal season is not significant, therefore it merges with the base
category. Thus there is no average difference between Releasing_Time_Normal_Season
and Releasing_Time_Holiday_Season.
Question 1.4
Mr Chellappa of Oho Productions claims that the movies released during a long weekend
(Releasing_Time_Long_Weekend) earn at least 5 crores more than the movies released during normal season
(Releasing_Time_Normal_Season). Check whether this claim is true (use α = 0.05).
Answer 1.4
As seen from the model output, the coefficient for release in the normal season (RNS) is not significant, so its value is
taken to be the same as the base category. Therefore:
Estimated value of β for a movie released in the normal season (RNS) = 2.685
Estimated value of β for a movie released on a long weekend (RTL) = 2.685 + 1.247 = 3.932
t-critical = -1.97
As t-value < t-critical, we fail to reject the null. Thus the claim of earning at least 5 crore more is not supported.
Answer 1.5:
The variation in the response variable, ln(Box office collection), explained by the model is R2 = 0.81 squared = 0.656 = 65.6%
Answer 1.6
Budget has the maximum impact on the box office collection, based on the standardized beta coefficient, which
is 0.443 for budget.
The budget should therefore be kept above 35 crore for a movie: a one standard deviation increase in the budget variable
is associated with a 0.443 standard deviation increase in box office collection.
Question 1.7
Compare the regressions in Model 2 (Table 4) and Model 3 (Tables 5 and 6). None of the variables in Model 2
are statistically significant in Model 3. Can we conclude that the variables in Model 2 have no association
relationship with Box Office Collection? Explain clearly.
Answer 1.7
Model 3 (Tables 5 and 6) uses stepwise regression, which at each step adds only the most significant variable,
meaning the variable with the lowest p-value.
The tables below list the p-values calculated for all variables.
Model 2
Variable                          p-value
Constant                          0.000
Releasing_Time_Festival_Season    0.203
Releasing_Time_Long_Weekend       0.036
Releasing_Time_Normal_Season      0.734
Model 3
Variable                          p-value
Constant                          2.06206E-29
Budget_35_cr                      1.49579E-11
Music_Dir_Cat c                   0.001464786
As the p-values of all the variables in Model 3 are lower than the p-values of the variables in Model 2, the Model 2
variables were dropped in Model 3 (stepwise regression).
Question 1.8 Among the variables in Table 6, which variable is not useful for practical application of the
model? Clearly state your reasons.
Answer 1.8
YouTube views, as that is a variable we have no control over. Keeping it in the model therefore does not
add any information that will help us take a decision.
…………………………………………………………………………………………………………………….............................................................
Question 2.1 :
a) What is the predictor variable used in Model 1? Explain clearly.
Price Index.
It has the largest correlation with Sales in the correlation matrix, so it gives the highest R2 and is the first
variable to be entered into the model.
b) What proportion of variation in Sales does this predictor variable explain in model 1? Explain clearly.
c) What is the Std. Error of the Estimate for Model 1? Explain clearly
n = number of observations
k = number of explanatory variables
R2 = 1 - SSE/SST
Therefore SSE = SST * (1 - R2)
SST = 3.13 * 10^13 (given)
R2 = 0.37 (calculated in the previous question)
(1 - R2) = 0.62924079
Std. Error of the Estimate = sqrt(SSE / (n - k - 1))
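The arithmetic can be checked numerically. This is a minimal sketch, assuming n = 30 observations (the sample size that appears in the F-test of Question 2.5) and k = 1 predictor in Model 1. Neither value is stated in this answer, so the result is illustrative only.

```python
import math

sst = 3.13e13               # total sum of squares (given)
one_minus_r2 = 0.62924079   # 1 - R2 for Model 1, as computed in the answer
n = 30                      # assumed number of observations
k = 1                       # assumed number of predictors in Model 1

sse = sst * one_minus_r2                           # SSE = SST * (1 - R2)
std_error_estimate = math.sqrt(sse / (n - k - 1))  # sqrt(SSE / (n - k - 1))

print(f"SSE = {sse:.4e}")
print(f"Std. Error of the Estimate = {std_error_estimate:,.0f}")
```

If the true sample size differs from 30, only the denominator n - k - 1 changes.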
Question 2.2
a) What is the magnitude of the semipartial (or part) correlation for the variable ‘Interest’ in Model 2?
Explain.
b) Carry out an appropriate test, at 95% confidence level, to determine if Model 2 as a whole is valid
(significant). State the null and alternate hypotheses and show all work.
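H0: all slope coefficients in Model 2 are zero (the model is not valid)
H1: at least one slope coefficient is not zero
Rejecting the null hypothesis implies that the overall model is valid.
As F value > F critical, we reject the null hypothesis and conclude that the model is valid.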
c) Given no change in the other significant explanatory variables, can it be concluded from Model 2 that
‘Interest’ has a higher impact on ‘Sales’ than the other variable used in the model. Explain clearly.
No. To compare the two coefficients we look at the standardized beta coefficients.
Standardized beta for Interest = -0.362 and for Price Index = -0.595, implying that a one SD change in
Interest will have a lesser impact on Sales than a one SD change in Price Index.
Question 2.3
Can it be concluded, at 95% confidence level, that an increase in ‘Interest’ rate by 5% decreases yearly Sales
by at least 250000 units or more? Show all work.
H0 : β2 ≥-50000
H1 : β2 < - 50000
T critical = -1.703
As t-value > t-critical, we fail to reject the null. Thus we cannot conclude that a 5% increase in the interest rate will
decrease yearly sales by at least 250,000 units.
Question 2.4: What can you say about the relationship between ‘Interest’ and the other predictor variable
used in Models 1 and 2? Explain clearly.
Interest is negatively correlated with the response variable (Sales), which means that Sales tend to decrease
as Interest increases.
In the presence of Interest, the Price Index coefficient becomes less negative; thus Price Index must be positively
correlated with Interest.
Question 2.5: The partial correlation of the excluded variables; after Model 2 was fitted; are 0.184 and 0.246.
Conduct an appropriate test, at 95% confidence level, to determine if one of these excluded variables should
be added to the regression model. State the null and alternate hypotheses and show all work.
Given:
r2 is obtained from the correlation matrix, to decide which value belongs to which variable.
R2 = square of part correlation = 0.081045778
F value = ((0.583 - 0.502)/(4 - 3)) / ((1 - 0.583)/(30 - 4 - 1)) = 4.856115108
F critical = F(0.05; 1, 25) = 4.24
Therefore, as F value > F critical, we will add one variable from the above, as it comes out to be significant.
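The partial F-test above can be reproduced as a short calculation. This is a sketch that reuses the R2 values and degrees of freedom written in the answer. The critical value is hard-coded from a standard F table for 1 and 25 degrees of freedom, matching the (4 - 3) and (30 - 4 - 1) terms of the formula; that choice of degrees of freedom is an assumption of this sketch.

```python
r2_reduced = 0.502   # R2 before adding the candidate variable
r2_full = 0.583      # R2 after adding the candidate variable
n = 30               # observations, as used in the answer
k_full = 4           # parameter count used in the answer for the larger model
k_reduced = 3        # parameter count used in the answer for the smaller model

numerator = (r2_full - r2_reduced) / (k_full - k_reduced)
denominator = (1 - r2_full) / (n - k_full - 1)
f_value = numerator / denominator

f_critical = 4.24    # tabulated F(0.05; 1, 25), hard-coded assumption
add_variable = f_value > f_critical

print(f"F value = {f_value:.4f}, add variable: {add_variable}")
```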
…………………………………………………………………………………………………………………......
Question 3
a. Rank the income groups based on average revenue obtained per transaction in the sample data from
largest to smallest. Provide precise reasons as to how you obtained this ranking. Is this ranking valid for the
population? What is the average revenue per transaction obtained for the income group ($10K-$30K)?
To check the rank we can add the coefficient of each income group to the base category and rank the groups by the
resulting values. However, the p-values show that none of these coefficients are significant.
Therefore the ranking cannot be extended to the population, as all groups are statistically the same as the base category.
Also, as the base category is the income group ($10K-$30K), its average revenue per transaction is $12.6841.
b) The grocery store wishes to estimate the average amount spent per transaction on non-consumables.
Provide the most accurate estimate possible. Provide details on how you obtained this estimate.
Regression Output 3 is to be used: all the coefficients in the model are significant, therefore the model is valid.
c) If, in Regression Output 3, the base chosen in product family is drinks (Prod_Fam2), then what will be the
corresponding prediction equation?
d) Is there a significant difference in the average amount spent per transaction between that on drinks and
non-consumables? Why or why not? Provide precise reasons.
The difference between non-consumables and drinks is measured by the Prod_fam3 coefficient in Regression Output 3.
H0: Beta(Prod_fam3) = 0
H1: Beta(Prod_fam3) != 0
As the p-value < 0.05, we reject the null. Thus there is a significant difference in the amount spent.
e) The grocery store wishes to target those customers, as well as items on which the amount spent is
maximum. Assuming that no customer has more than five children, identify the appropriate customer
segment as well as the appropriate product family. Provide precise reasons behind your answer.
With Children = 0, the amount spent is:
Food: Y = 12.214
Drinks: Y = 12.214 + 0*0.393 – 1.010 = 11.204
Non-Consumable: Y = 12.214 – 0*0.322 = 12.214
With this we can see that the money spent on non-consumables and food is greater than on drinks.
With Children = 5, the same equations are evaluated, including the equation for money spent on Prod_Fam2 (Drinks).
It is clear from the equations that the amount spent with Children = 5 is maximum in the food product category, meaning
that with more children, more is spent on food items.
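The comparison for 0 and 5 children can be sketched as a small calculation. The three equations below are read off the worked lines in this answer (intercept 12.214, children slope 0.393 for food and drinks, a drinks offset of -1.010, and a -0.322 slope for non-consumables). They are an assumed reading of Regression Output 4, not a copy of it.

```python
def food(children):
    # Base product family: intercept plus the children effect
    return 12.214 + 0.393 * children

def drinks(children):
    # Same children effect as food, shifted by the drinks coefficient
    return 12.214 + 0.393 * children - 1.010

def non_consumable(children):
    # As written in the answer: the children effect enters with -0.322
    return 12.214 - 0.322 * children

for children in (0, 5):
    spend = {
        "Food": food(children),
        "Drinks": drinks(children),
        "Non-Consumable": non_consumable(children),
    }
    best = max(spend, key=spend.get)
    print(children, {name: round(value, 3) for name, value in spend.items()}, "max:", best)
```

Under these assumed equations, food gives the highest predicted spend at five children, which agrees with the conclusion stated above.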
f) What is the chance that a customer with 3 children will spend more than $10.00 on food items per
transaction? Provide details on your calculations.
Therefore the chance that the customer will spend more than $10 = 1 - 0.338 = 0.662
g) Does the number of children affect food purchases more than non-consumables? Why or why not? State
your reasons precisely.
Yes. As we saw in part (e), when Children = 0 the expenditure on food and non-consumables is the same.
But as the number of children increases, more is spent on food:
Y = 12.214 + 0.393*Children
h) If the grocery store has reason to believe that in addition to the independent variables considered in
Regression Output 4, homeowners spend significantly more on non-consumables than non-home owners on
any product category. If so, how will you modify the model provided in Regression Output 4? Provide the
model in terms. If you are adding new variables to the model, provide details on what you expect the
value to be. Positive? Negative?
To compare home owners with non-home owners, we would introduce an interaction variable:
OwnH_PF3 = Own_Home * Prod_Fam3
Its coefficient is expected to be positive.
………………………………………………………………………………………………………………………........................................................
Question 4: Go through the case, “Oakland A” and the spreadsheet supplement (Ref: Moodle/Cases and
Materials/Module 3). Does Mark Nobel increase attendance? If so, how much is the increase worth for
Oakland? Support your decision through an appropriate regression model.
Model 1: This was the first model built, without any interaction variable. The observations were as follows:
Residuals:
Min 1Q Median 3Q Max
-10989.1 -2686.2 -947.4 2218.7 15406.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14520 1617 8.980 7.01e-13 ***
as.factor(NOBEL)1 -1289 1336 -0.965 0.33832
as.factor(YANKS)1 29425 2180 13.499 < 2e-16 ***
as.factor(OD)1 17343 5106 3.397 0.00119 **
as.factor(DOW)2 -8673 2067 -4.196 8.68e-05 ***
as.factor(DOW)3 -9323 2078 -4.486 3.15e-05 ***
as.factor(DOW)4 -7448 2814 -2.647 0.01025 *
as.factor(DOW)5 -5354 2008 -2.667 0.00972 **
as.factor(DOW)6 -7710 1986 -3.882 0.00025 ***
as.factor(DOW)7 -4056 2031 -1.997 0.05018 .
as.factor(DH)1 4835 2155 2.244 0.02836 *
as.factor(PROMO)1 3019 1484 2.035 0.04610 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Interaction variable 1: Nobel_Yanks
Nobel_Yanks
0 If Nobel is not playing and Yanks is not playing
1 If Nobel is not playing and Yanks is playing
2 If Nobel is playing and Yanks is not playing
3 If Nobel is playing and Yanks is playing
Residuals:
Min 1Q Median 3Q Max
-11222.6 -2493.7 -896.8 2129.3 15750.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14730 1668 8.829 1.46e-12 ***
as.factor(Nobel_Yanks)1 30570 2997 10.201 6.89e-15 ***
as.factor(Nobel_Yanks)2 -973 1458 -0.668 0.506916
as.factor(Nobel_Yanks)3 26810 3419 7.842 7.43e-11 ***
as.factor(OD)1 17342 5134 3.378 0.001267 **
as.factor(DOW)2 -8934 2130 -4.194 8.86e-05 ***
as.factor(DOW)3 -9564 2134 -4.483 3.24e-05 ***
as.factor(DOW)4 -7657 2854 -2.683 0.009346 **
as.factor(DOW)5 -5702 2112 -2.700 0.008932 **
as.factor(DOW)6 -7821 2007 -3.898 0.000241 ***
as.factor(DOW)7 -4263 2076 -2.054 0.044215 *
as.factor(DH)1 4516 2240 2.016 0.048130 *
as.factor(PROMO)1 2726 1581 1.725 0.089564 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Observations:
Nobel becomes significant if Yanks is playing, but the coefficient of Yanks is pulled down. Thus it does not prove
that Nobel makes any valuable addition to attendance.
We ran an outlier test at 1.5 IQR and plotted the outliers. Yanks came out as an outlier, as all 5 matches
played against Yanks are high-selling. Thus we decided to remove Yanks from the final model, as it might be masking
other significant variables like “Nobel”.
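The 1.5 IQR rule mentioned above can be sketched as follows. The attendance figures are hypothetical placeholders, not the case data, and the "inclusive" quartile method is one of several conventions, so results on real data may differ slightly from other tools.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical attendance figures in thousands (illustration only)
attendance_thousands = [10, 12, 13, 14, 15, 16, 18, 45]
outliers = iqr_outliers(attendance_thousands)
print(outliers)
```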
Outlier test graph
OPP1
0 If match is against any other team
1 If match is against team 9
2 If match is against team 13
Nobel_TOG
0 If Nobel is not playing
1 If Nobel is playing and TOG is 1 (first half)
2 If Nobel is playing and TOG is 2
Model 3:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12788 1375 9.302 5.88e-13 ***
as.factor(OD)1 19322 4034 4.790 1.26e-05 ***
as.factor(PROMO)1 5374 1310 4.103 0.000133 ***
as.factor(DH)1 7151 2042 3.502 0.000913 ***
as.factor(Nobel_TOG)1 1731 1641 1.055 0.296169
as.factor(Nobel_TOG)2 -4342 1654 -2.625 0.011144 *
as.factor(OPP1)1 8651 1673 5.171 3.23e-06 ***
as.factor(OPP1)2 4551 1575 2.889 0.005483 **
as.factor(DOW)2 -7885 1735 -4.544 2.99e-05 ***
as.factor(DOW)3 -9002 1702 -5.289 2.11e-06 ***
as.factor(DOW)4 -7695 2303 -3.341 0.001492 **
as.factor(DOW)5 -4263 1717 -2.483 0.016042 *
as.factor(DOW)6 -7338 1682 -4.364 5.56e-05 ***
as.factor(DOW)7 -5630 1793 -3.139 0.002702 **
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Thus, with all the interactions and also as a stand-alone variable, we could not show that Nobel adds anything significant
to ticket sales.
H0: There is no correlation among residuals, i.e., they are independent (rho = 0).
H1: The residuals are autocorrelated (rho != 0).
Durbin Watson test :
lag Autocorrelation D-W Statistic p-value
1 0.4428619 1.103515 0
Alternative hypothesis: rho != 0
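For reference, the Durbin-Watson statistic is the sum of squared successive differences of the residuals divided by the sum of squared residuals. Values near 2 indicate no first-order autocorrelation, and the statistic is roughly 2*(1 - r), where r is the lag-1 autocorrelation. The sketch below uses a hypothetical residual series; only the lag-1 autocorrelation 0.4428619 comes from the output above.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals that alternate in sign (negative autocorrelation)
example_residuals = [1.0, -1.0, 1.0, -1.0]
dw_example = durbin_watson(example_residuals)

# Rough check of the reported statistic from the reported lag-1 autocorrelation
approx_from_rho = 2 * (1 - 0.4428619)

print(dw_example, round(approx_from_rho, 4))
```

The approximation lands close to the reported 1.103515, which is well below 2 and consistent with positive autocorrelation in the residuals.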
Breusch-Pagan Test
data: NOYANKSMODELF
BP = 16.189, df = 13, p-value = 0.2391
The plots we are interested in are at the top-left and bottom-left. The top-left is the chart of residuals
vs fitted values, while the bottom-left one has standardised residuals on the Y axis. If there is
absolutely no heteroscedasticity, you should see a completely random, equal distribution of points
throughout the range of the X axis and a flat red line.
We also conducted the Breusch-Pagan test. As the p-value > 0.05, we can conclude from both the graphical
representation and the test that there is no heteroscedasticity.
Q-Q plot: normality of residuals, and residual histogram plot
……………………………………………………………………………………………………………………………………………………………………………
Question 5
Part 5.1. Calculate the budget for which the box office success and failure are equally likely.
Budget = -β0/β1 = -1.621/-0.016 = 101.3125 crore
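The break-even budget is the point where the logit is zero, so the predicted success probability is 0.5. A minimal sketch, assuming the model is ln(odds of success) = β0 + β1*Budget with β0 = 1.621 and β1 = -0.016 as read from the ratio above, and with success coded as 1:

```python
import math

beta0 = 1.621    # intercept, as read from the ratio in the answer
beta1 = -0.016   # budget coefficient, as read from the ratio in the answer

def success_probability(budget_crore):
    """Logistic model: P(success) = 1 / (1 + exp(-(beta0 + beta1 * budget)))."""
    z = beta0 + beta1 * budget_crore
    return 1.0 / (1.0 + math.exp(-z))

break_even_budget = -beta0 / beta1   # logit = 0, so P(success) = 0.5
p_100 = success_probability(100)     # the 100 crore movie of Question 5.3

print(round(break_even_budget, 4), round(p_100, 4))
```

Under these assumed coefficients, a 100 crore budget sits just below the break-even point, so its predicted success probability is only slightly above 0.5.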
Part 5.2
Is there a sufficient evidence to conclude that the higher budget movies are more likely to fail at
the box-office?
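As the p-value < 0.05, we reject the null hypothesis that budget has no effect. Since the budget coefficient is negative
(β1 = -0.016), there is evidence that higher budget movies are more likely to fail at the box office.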
Question 5.3
A production house is making a movie with 100 crore budget; what is the success probability for this movie?
As per the question above, it is more desirable to misclassify five 1's as 0's than to misclassify one 0 as a 1.
Min[P00*C00 + P01*C01 + P10*C10 + P11*C11]
Cut-off   P10   P01   C01   C10   Total cost
0.5       3     17    5     1     88
0.6       6     14    5     1     76
0.7       11    13    5     1     76
0.8       23    0     5     1     23
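The totals in the cost table can be reproduced as P10*C10 + P01*C01 for each cut-off. This is a sketch that assumes the first count after each cut-off is P10 and the second is P01, which is the reading that reproduces the totals shown.

```python
C01 = 5   # cost per P01 misclassification
C10 = 1   # cost per P10 misclassification

# (cut-off, P10, P01) rows taken from the table
rows = [(0.5, 3, 17), (0.6, 6, 14), (0.7, 11, 13), (0.8, 23, 0)]

# Total misclassification cost per cut-off
costs = {cutoff: p10 * C10 + p01 * C01 for cutoff, p10, p01 in rows}

# Cut-off with the lowest total cost
best_cutoff = min(costs, key=costs.get)

print(costs, best_cutoff)
```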
Question 5.5 Calculate the difference in success probabilities for movies with item song and movies without
item song.
Which is a better model (budget as an independent variable vs item song as an independent variable).
Clearly state your reasons.
Question 5.7
Consider all the information in tables 1 to 7, which model you would recommend to predict the movie
success at the box office? Clearly state your reasons.
The model in Tables 6 and 7 is recommended for predicting movie success, as its accuracy in classifying success and
failure is the highest (74.6%). The Wald statistic and significance levels shown in Table 7 also support
this choice.
……………………………………………………………………………………………………………………………………………………………………………
Question 6
Read the case,“Breaking Barriers – Micro-mortgage analytics”. Using the data provided, develop a credit
rating model that Shubham can use
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.917e+00 7.677e-01 10.313 < 2e-16 ***
LTV -7.001e-02 6.663e-03 -10.507 < 2e-16 ***
dwnp_prop_p -7.486e-02 7.329e-03 -10.214 < 2e-16 ***
IAR 6.028e-02 6.321e-03 9.536 < 2e-16 ***
IIR -9.964e-02 1.177e-02 -8.464 < 2e-16 ***
BankSave 9.880e-06 3.373e-06 2.929 0.003402 **
Tier2 4.314e-01 1.888e-01 2.286 0.022283 *
Tier3 -5.935e-01 1.724e-01 -3.441 0.000579 ***
Employment_TypeSelf_Employed -6.406e-01 1.536e-01 -4.169 3.05e-05 ***
Accommodation_ClassRented 4.292e-01 1.520e-01 2.823 0.004754 **
GenderMale 7.103e-01 2.827e-01 2.513 0.011987 *
Age_11 -3.336e-01 1.606e-01 -2.078 0.037745 *
Age_12 9.539e-02 2.595e-01 0.368 0.713148
Age_13 -2.470e+00 1.154e+00 -2.140 0.032379 *
Loan_TypeHome_Loan 6.156e-01 2.838e-01 2.169 0.030087 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We built a logistic regression model using stepwise selection in R. Age was changed to a categorical variable.
Code sheet
AGE_1
0 Age group 20-39
1 Age group 40-49
2 Age group 50-59
3 Age group 60+
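The coefficients of this logistic model are on the log-odds scale, so exponentiating them gives odds ratios, which are easier to read. The sketch below does this for a few coefficients copied from the output. How the outcome is coded (which class is 1) is not stated here, so the direction is described only relative to that class.

```python
import math

# Selected estimates copied from the coefficient table
coefficients = {
    "LTV": -7.001e-02,
    "Tier3": -5.935e-01,
    "Employment_TypeSelf_Employed": -6.406e-01,
    "GenderMale": 7.103e-01,
}

# Odds ratio = exp(coefficient): the multiplicative change in the odds of class 1
# for a one-unit increase (or for the dummy switching from 0 to 1)
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.3f}")
```

An odds ratio below 1 means the variable lowers the odds of the class coded as 1, and a ratio above 1 raises them.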
2.5 % 97.5 %
(Intercept) 6.461194e+00 9.473085e+00
LTV -8.361814e-02 -5.748642e-02
dwnp_prop_p -8.972646e-02 -6.097005e-02
IAR 4.817255e-02 7.296643e-02
IIR -1.231476e-01 -7.696668e-02
BankSave 4.686608e-06 1.790836e-05
Tier2 6.264465e-02 8.034187e-01
Tier3 -9.334156e-01 -2.568660e-01
Employment_TypeSelf_Employed -9.442884e-01 -3.415130e-01
Accommodation_ClassRented 1.318359e-01 7.282810e-01
GenderMale 1.764476e-01 1.289141e+00
Age_11 -6.475143e-01 -1.742790e-02
Age_12 -4.026493e-01 6.168986e-01
Age_13 -4.677746e+00 -4.202113e-02
Loan_TypeHome_Loan 5.372949e-02 1.168862e+00
FALSE TRUE
0 136 209
1 46 1388
This is at a 0.5 cut-off. To find the optimal cut-off, we use the Youden index.
Going ahead with the highest Youden index, we choose a cut-off of 0.8, which gives the following confusion matrix on
the train data:
FALSE TRUE
0 271 74
1 356 1078
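The Youden index is sensitivity + specificity - 1. The sketch below computes it for the two train confusion matrices, taking the rows as the actual class (0/1) and the columns as the predicted class (FALSE/TRUE); that layout is an assumption about the tables.

```python
def youden_index(tn, fp, fn, tp):
    """Youden's J = sensitivity + specificity - 1."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1

# Train data at the 0.5 cut-off
j_050 = youden_index(tn=136, fp=209, fn=46, tp=1388)
# Train data at the 0.8 cut-off
j_080 = youden_index(tn=271, fp=74, fn=356, tp=1078)

print(round(j_050, 4), round(j_080, 4))
```

Under this reading, the 0.8 cut-off gives the higher index, trading some sensitivity for much better specificity.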
Applying the same to Test data, we get the following confusion matrix and ROC plot with AUC
FALSE TRUE
0 59 27
1 99 259
Deployment Strategy
Medium Risk: We should charge a higher processing fee and also increase the down payment.
High Risk: The model is not able to classify these too well; this covers anyone with probability less than 0.50.
We can apply the following rules of thumb before processing the loan:
Give LTV a higher weightage, checking the applicant's requirement vis-a-vis the market value of the
property.
Processing fee and down payment should be increased.
IIR and IAR should also be assigned weightage.
………………………………………………………………………………………………………………………...................................................
Question 7
Comment whether the marital status has any statistical significance on the probability of loan denial. Clearly
state your reasons.
Marital status is not a statistically significant variable, because its Wald statistic is low.
Question 7.2
What percentage of the applicants with a DI=20, LTV = 0.5, IIR = 0.8, MS = 0 and Old EMI = 0 will be given a
loan at 18% interest? Use only statistically significant variables and assume that the changes in the
coefficient values are negligible due to dropping of insignificant variables.
……………………………………………………………………………………………………………………………………………………………………………
Question 8
Read the case, “Fraud analytics at MCA technology solutions – Predicting Earnings Manipulation by Indian
Firms”. Develop a model using logistic regression and discriminant analysis to predict fraudulent
transactions. (Ref: Moodle/Cases and Materials/Module 3)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9708 -0.2052 -0.1632 -0.1334 3.0468
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.51768 0.71995 -9.053 < 2e-16 ***
DSRI 0.79368 0.15092 5.259 1.45e-07 ***
SGI 0.84955 0.25477 3.335 0.000854 ***
ACCR 5.89684 1.35669 4.346 1.38e-05 ***
GMI 0.45026 0.27713 1.625 0.104222
AQI 0.24252 0.09298 2.608 0.009102 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We need to reduce the number of manipulators (Yes) that are being predicted as non-manipulators.
As we are not sure whether the sample is representative, we use bagging over 100 samples to arrive at a
more reliable confusion matrix.
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 960 19
Yes 0 13
Accuracy : 0.9808
95% CI : (0.9703, 0.9884)
No Information Rate : 0.9677
P-Value [Acc > NIR] : 0.008494
Kappa : 0.5698
Mcnemar's Test P-Value : 3.636e-05
Sensitivity : 1.0000
Specificity : 0.4062
Pos Pred Value : 0.9806
Neg Pred Value : 1.0000
Prevalence : 0.9677
Detection Rate : 0.9677
Detection Prevalence : 0.9869
Balanced Accuracy : 0.7031
'Positive' Class : No
We applied the model to the test data to check its accuracy. The confusion matrix for the test data is:
Reference
Prediction No Yes
No 240 5
Yes 0 2
Accuracy : 0.9798
95% CI : (0.9534, 0.9934)
No Information Rate : 0.9717
P-Value [Acc > NIR] : 0.29703
Kappa : 0.4374
Mcnemar's Test P-Value : 0.07364
Sensitivity : 1.0000
Specificity : 0.2857
Pos Pred Value : 0.9796
Neg Pred Value : 1.0000
Prevalence : 0.9717
Detection Rate : 0.9717
Detection Prevalence : 0.9919
Balanced Accuracy : 0.6429
'Positive' Class : No
When run on the test data, the model reached an accuracy of 97.98%, but it identified only 2 of the 7 manipulators (specificity 0.2857, with 'No' as the positive class).
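Because 'No' is the positive class in both outputs, the "Specificity" line is the share of actual manipulators that the model caught. A short sketch recomputing it from the two confusion matrices:

```python
def manipulator_recall(caught, missed):
    """Share of actual manipulators (Yes) that were predicted as Yes."""
    return caught / (caught + missed)

# Train data: 13 manipulators predicted Yes, 19 predicted No
train_recall = manipulator_recall(caught=13, missed=19)
# Test data: 2 manipulators predicted Yes, 5 predicted No
test_recall = manipulator_recall(caught=2, missed=5)

# Overall test accuracy: correct predictions over all test cases
test_accuracy = (240 + 2) / (240 + 5 + 0 + 2)

print(round(train_recall, 4), round(test_recall, 4), round(test_accuracy, 4))
```

The high accuracy mostly reflects the class imbalance: predicting 'No' for every firm would already score close to the No Information Rate.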
We also plotted the ROC curve and computed the AUC.
This supports the significance of the variables used in the model. They can be used in the deployment strategy by
giving weightage and scores to these variables to identify manipulators.
……………………………………………………………………………………………………………………………………………………………………………
End of assignment