
Module 3: Assignment 2

Priyanka Sindhwani
IIMB – BIA 7
Question 1:

Which of the following statements are correct (more than one may be correct)?
Tick (✓) all right answers or highlight the correct statements with color.
1. The model explains 42.25% of variation in box office collection.
2. There are outliers in the model.
3. The residuals do not follow a normal distribution.
4. The model cannot be used since R-square is low.
5. Box office collection increases as the budget increases.

Answer 1.1

Statements 1, 2, and 5 are correct:

1. The model explains 42.25% of variation in box office collection. (R = 0.650, so R² = 0.650² = 0.4225, i.e., 42.25% of the variation is explained.)
2. There are outliers in the model.
5. Box office collection increases as the budget increases. (Substituting into the equation Y = β0 + β1·x1 gives Y = −8.354 + 2.175·x1; the positive slope on budget means collection rises as budget rises.)

Statements 3 and 4 are not correct.

Question 1.2
Mr Chellappa, CEO of Oho Productions (OP) claims that the regression model in Table 3 is incorrect since it has
negative constant value. Comment whether Mr Chellappa is correct in his assessment about the model.

Answer 1.2
No, Mr Chellappa is not correct. The constant (y-intercept) is only meaningful when it is logically possible for all predictor variables to be zero, since it is interpreted as the expected mean value of Y at that point. In the current case the budget can never be zero, so the negative constant has no practical interpretation and does not make the model incorrect.

Question 1.3
What is the average difference in the box office collection when a movie is released during a holiday
season (Releasing_Time_holiday_season) versus movies released during normal season
(Releasing_Time_Normal_Season)? Use a significance value of 5%.

Answer 1.3

For a movie released during the holiday season (the base category), the expected value equals the constant, 2.685. For Releasing_Time_Normal_Season the equation would have been Y = 2.685 + 0.147·x. However, at a significance level of 0.05 the coefficient of Releasing_Time_Normal_Season is not significant, so normal season merges with the base category. Thus there is no average difference in box office collection between movies released in the normal season and those released in the holiday season.

Question 1.4 (4 Points)

Mr Chellappa of Oho productions claims that the movies released during long weekend
(Releasing_Time_Long_Weekend) earn at least 5 crores more than the movies released during normal season
(Releasing_Time_Normal_Season). Check whether this claim is true (use α = 0.05).

Answer 1.4

Notation:
Releasing_Time_Long_Weekend: RTL
Releasing_Time_Normal_Season: RNS

As seen from the model output, RNS is not significant, so its value is the same as that of the base category. Therefore:

Estimated β for movies released in the normal season (RNS) = 2.685
Estimated β for RTL = 2.685 + 1.247 = 3.932

Estimated difference RTL − RNS = 3.932 − 2.685 = 1.247

As all the values are in natural logs (ln), we convert 5 crores into logs as well:

ln(5 crore) = 17.727

H0: difference in β for RTL and RNS ≤ 17.727
H1: difference in β for RTL and RNS > 17.727

Decision rule: if t-value > t-critical, reject the null.

t = (estimated value − hypothesised value) / standard error = (1.247 − 17.727)/0.588 = −28.03

t-critical = −1.97

As t-value < t-critical, we fail to reject the null. Thus the claim that movies released on long weekends earn at least 5 crores more is not supported at the 5% significance level.
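A quick numerical check of this test (a minimal R sketch; the output does not report the degrees of freedom, so df = 200 is an assumption):

t_stat <- (1.247 - 17.727) / 0.588   # -28.03
t_crit <- qt(0.025, df = 200)        # about -1.97, the critical value used above
t_stat > t_crit                      # FALSE -> fail to reject H0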

Question 1.5 (2 Points)


What is the variation in response variable, ln(Box office collection), explained by the model after adding all 6
variables?

Answer 1.5:

With the first 5 variables, R = 0.81, so R² = 0.81² = 0.6561, i.e., the model explains 65.61% of the variation in ln(Box office collection).

After adding the 6th variable:

R² of Model 6 = R² of Model 5 + (part correlation of variable 6)²

R² of Model 6 = 0.6561 + (−0.104)² = 0.6669 = 66.7%
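The same arithmetic in R (values taken from the tables above):

R5 <- 0.81^2            # R-squared with 5 variables = 0.6561
R6 <- R5 + (-0.104)^2   # add the squared part correlation of the 6th variable
R6                      # 0.6669 -> about 66.7%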

Question 1.6 (2 Points)


Which factor has the maximum impact on the box office collection of a movie? What will be your
recommendation to a production house based on the variable that has maximum impact on the box office
collection?

Answer 1.6

Budget has the maximum impact on box office collection: its standardized beta coefficient, 0.443, is the largest in the model.

Based on this, our recommendation is:

Keep the budget above 35 crores, since a one-standard-deviation increase in budget is associated with a 0.443 standard-deviation increase in ln(box office collection).

Question 1.7
Compare the regressions in Model 2 (Table 4) and Model 3 (Tables 5 and 6). None of the variables in Model 2
are statistically significant in Model 3. Can we conclude that the variables in Model 2 have no association
with Box Office Collection? Explain clearly.

Answer 1.7

Model 3 (Tables 5 and 6) uses stepwise regression, which at each step adds only the most significant variable, i.e., the variable with the lowest p-value.

The p-values for the two models are compared below.

Model 2 (Table 4):
Constant: 0.000
Releasing_Time_Festival_Season: 0.203
Releasing_Time_Long_Weekend: 0.036
Releasing_Time_Normal_Season: 0.734

Model 3 (Tables 5 and 6):
Constant: 2.06E-29
Budget_35_cr: 1.50E-11
You_tube view: 1.90E-05
Prod_house_cat A: 0.0029
Music_Dir_Cat c: 0.0015
Genre Comedy: 0.0222
Director_Cat C: 0.0338

The p-values of all the variables retained in Model 3 are lower than the p-values of the Model 2 variables, so the Model 2 variables were dropped by the stepwise procedure. This does not mean they have no association with box office collection; it only means that, in the presence of the stepwise-selected variables, they add no significant explanatory power.

Question 1.8 Among the variables in Table 6, which variable is not useful for practical application of the
model? Clearly state your reasons.

Answer 1.8

YouTube views, since it is a variable the production house has no control over. Keeping it in the model adds no information that would help in decision making.

…………………………………………………………………………………………………………………….............................................................

Question 2.1 :
a) What is the predictor variable used in Model 1? Explain clearly.

Price Index.

It has the largest absolute correlation with Sales in the correlation matrix, so it gives the highest R² and is the first variable entered into the model.

b) What proportion of variation in Sales does this predictor variable explain in model 1? Explain clearly.

R² = (correlation coefficient)² = (−0.6089)² = 0.3708 ≈ 37%

c) What is the Std. Error of the Estimate for Model 1? Explain clearly.

n = number of observations = 30
k = number of explanatory variables = 1

R² = 1 − SSE/SST, therefore SSE = SST × (1 − R²)

SST = 3.13 × 10¹³ (given)
R² = 0.3708 (calculated in the previous question), so 1 − R² = 0.6292

SSE = 3.13 × 10¹³ × 0.6292 = 1.97 × 10¹³

MSE = SSE / (n − (k + 1)) = 1.97 × 10¹³ / (30 − 1 − 1)

Standard error of the estimate = sqrt(MSE) = sqrt(1.97 × 10¹³ / 28) ≈ 838,792
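The same calculation in R (SST and the correlation are taken from the case):

SST <- 3.13e13; R2 <- (-0.6089)^2; n <- 30; k <- 1
SSE <- SST * (1 - R2)        # ~1.97e13
sqrt(SSE / (n - k - 1))      # ~838,700, the standard error of the estimate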

Question 2.2
a) What is the magnitude of the semipartial (or part) correlation for the variable ‘Interest’ in Model 2?
Explain.

The square of the semipartial (part) correlation equals the increase in R² when the variable is added.

R² = 0.37 (R² when only Price Index is in the model)
R² = 0.502 (R² when both Price Index and Interest are in the model)

Therefore the squared part correlation for Interest is 0.502 − 0.37 = 0.132, and its magnitude is sqrt(0.132) ≈ 0.363.

Q 2.b Carry out an appropriate test, at 95% confidence level, to determine if Model 2 as a whole is valid
(significant). State the null and alternate hypotheses and show all work.

To check the validity of the complete model we conduct an F-test.

H0: β1 = β2 = … = βk = 0
HA: not all β values are zero

Decision rule: if F-value > F-critical, reject the null. Rejecting the null hypothesis means the overall model is valid.

F = MSR/MSE = (R²/k) / ((1 − R²)/(n − k − 1))

Using the values given for Model 2:

F-value = (0.502/2) / ((1 − 0.502)/(30 − 2 − 1)) = 13.60

F-critical = F(0.05, 2, 27) = 3.354

As F-value > F-critical, we reject the null hypothesis: the model is valid.
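The same test in R:

R2 <- 0.502; n <- 30; k <- 2
Fval  <- (R2 / k) / ((1 - R2) / (n - k - 1))  # ~13.6
Fcrit <- qf(0.95, df1 = k, df2 = n - k - 1)   # ~3.35
Fval > Fcrit                                  # TRUE -> reject H0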

c. Given no change in the other significant explanatory variables, can it be concluded from Model 2 that
'Interest' has a higher impact on 'Sales' than the other variable used in the model? Explain clearly.

No. To compare the two coefficients we look at the standardized beta coefficients.

The standardized beta for Interest is −0.362 and for Price Index it is −0.595, implying that a one-SD change in Interest has a smaller impact on Sales than a one-SD change in Price Index.

Question 2.3
Can it be concluded, at 95% confidence level, that an increase in ‘Interest’ rate by 5% decreases yearly Sales
by at least 250000 units or more? Show all work.

For a 1% increase in Interest, the claimed decrease in Sales is 250,000/5 = 50,000 units.

H0: β2 ≥ −50,000
H1: β2 < −50,000

Decision rule (left-tailed test): reject the null if t-value < t-critical.

t = (estimated value of parameter − hypothesised value of parameter) / standard error
  = (−124,592 − (−50,000)) / 46,820.081 = −1.593

t-critical = −1.703

As t-value > t-critical, we fail to reject the null. Thus we cannot conclude that a 5% increase in Interest decreases yearly Sales by at least 250,000 units.
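Numerically, in R (df = 30 − 2 − 1 = 27, matching the critical value above):

b <- -124592; se <- 46820.081; h0 <- -50000
t_stat <- (b - h0) / se      # -1.593
t_crit <- qt(0.05, df = 27)  # -1.703 (left-tailed critical value)
t_stat < t_crit              # FALSE -> fail to reject H0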

Question 2.4: What can you say about the relationship between ‘Interest’ and the other predictor variable
used in Models 1 and 2? Explain clearly.

• Interest is negatively correlated with Sales: as Interest increases, Sales tend to decrease.
• In the presence of Interest, the Price Index coefficient becomes less negative, so Price Index must have a positive relation with Interest.

Question 2.5: The partial correlations of the excluded variables, after Model 2 was fitted, are 0.184 and 0.246.
Conduct an appropriate test, at 95% confidence level, to determine if one of these excluded variables should
be added to the regression model. State the null and alternate hypotheses and show all work.

Given partial correlations of the excluded variables: 0.184 and 0.246.

H0: βr+1 = … = βk = 0
H1: at least one of these β ≠ 0

Test: partial F-test

F = ((R²_full − R²_reduced) / (k − r)) / ((1 − R²_full) / (n − k − 1))

Decision rule: if F-value < F-critical, we fail to reject the null, meaning the added variable does not significantly improve the model when all other variables are included.

The variables not in Model 2 are:

Year: correlation with Sales = −0.5453
Income: correlation with Sales = −0.5033

Based on the correlation matrix we assign the partial correlation 0.246 to Income and 0.184 to Year.

Part correlation for Income = 0.246 / sqrt(1 − (−0.5033)²) = 0.2847, so its R² contribution = 0.2847² = 0.0810
Part correlation for Year = 0.184 / sqrt(1 − (−0.5453)²) = 0.2195, so its R² contribution = 0.2195² = 0.0482

Full model R² with Year added = 0.502 + 0.048 = 0.550
Full model R² with Income added = 0.502 + 0.081 = 0.583
Reduced model R² = 0.502

Adding Year:
F-value = ((0.550 − 0.502)/(4 − 3)) / ((1 − 0.550)/(30 − 4 − 1)) = 2.67
F-critical = F(0.05, 4, 25) = 2.759
As F-value < F-critical, we fail to reject the null: Year does not significantly improve the model.

Adding Income:
F-value = ((0.583 − 0.502)/(4 − 3)) / ((1 − 0.583)/(30 − 4 − 1)) = 4.86
F-critical = F(0.05, 4, 25) = 2.759
As F-value > F-critical, we reject the null: Income significantly improves the model.

Therefore Income should be added to the regression model, as it comes out significant.
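A compact version of the two tests in R (R² values as computed above):

partial_F <- function(r2_full, r2_red, n, k, r) {
  ((r2_full - r2_red) / (k - r)) / ((1 - r2_full) / (n - k - 1))
}
partial_F(0.550, 0.502, n = 30, k = 4, r = 3)  # ~2.67 (Year)
partial_F(0.583, 0.502, n = 30, k = 4, r = 3)  # ~4.86 (Income)
qf(0.95, df1 = 4, df2 = 25)                    # 2.759, the critical value used above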

…………………………………………………………………………………………………………………......

Question 3
a. Rank the income groups based on average revenue obtained per transaction in the sample data from
largest to smallest. Provide precise reasons as to how you obtained this ranking. Is this ranking valid for the
population? What is the average revenue per transaction obtained for the income group ($10K-$30K)?

Looking at Regression Output 1:

Ann_Inc ($10K–$30K) is taken as the base category.

To rank the income groups in the sample, we can add each coefficient to the base category and compare the resulting values. However, the p-values of the coefficients show that none of them is significant, so the ranking is not valid for the population: all groups are statistically the same as the base category.

Since the base category is the income group ($10K–$30K), its average revenue per transaction is $12.6841.

b) The grocery store wishes to estimate the average amount spent per transaction on non-consumables.
Provide the most accurate estimate possible. Provide details on how you obtained this estimate.

Regression Output 3 is used, since all the coefficients in that model are significant and the model is valid.

As per the model:

Y = 13.192 − 0.975·Prod_Fam3 − 0.743·Prod_Fam2

For non-consumables: Y = 13.192 − 0.743 = 12.449

c) If in Regression Output 3 the base chosen for product family is drinks (Prod_Fam2), what will be the corresponding prediction equation?

If the base is drinks:

New base (intercept) = 13.192 − 0.975 = 12.217

The equation becomes:

Y = (13.192 − 0.975) + (−0.743 − (−0.975))·Prod_fam2 − (−0.975)·Prod_fam1
  = 12.217 + 0.232·Prod_fam2 + 0.975·Prod_fam1

d) Is there a significant difference in the average amount spent per transaction between that on drinks and non-consumables? Why or why not? Provide precise reasons.

The difference between non-consumables and drinks is measured by the Prod_fam3 coefficient in Regression Output 3.

H0: β(Prod_fam3) = 0
H1: β(Prod_fam3) ≠ 0

Decision rule: retain the null if p-value > 0.05.

The t-statistic for Prod_fam3 is −2.24, with an associated p-value of 0.025.

Since the p-value < 0.05, we reject the null: there is a significant difference in the average amount spent per transaction between drinks and non-consumables.

e) The grocery store wishes to target those customers, as well as items on which the amount spent is
maximum. Assuming that no customer has more than five children, identify the appropriate customer
segment as well as the appropriate product family. Provide precise reasons behind your answer.

With children = 0, the amounts spent are:

Drinks: Y = 12.214 + 0×0.393 − 1.010 = 11.204
Non-consumables: Y = 12.214 − 0×0.322 = 12.214
Food: Y = 12.214

So with no children, spending on non-consumables and food is equal and higher than on drinks. With 5 children:

Drinks (Prod_Fam2): Y = 12.214 + 5×0.393 − 1.010 = 13.169
Non-consumables (Prod_Fam3): Y = 12.214 + 5×0.322 = 13.824
Food: Y = 12.214 + 5×0.393 = 14.179

The amount spent is maximum in the food category when children = 5: the more children, the more is spent on food items. The appropriate segment is therefore customers with more children, and the appropriate product family is food.

f) What is the chance that a customer with 3 children will spend more than $10.00 on food items per
transaction? Provide details on your calculations.

We need the probability that a customer with 3 children spends more than $10 on food items per transaction.

Estimated mean = 12.214 + 3×0.393 = 13.393
Standard error = 8.127
x = 10

First find the probability of spending at most $10:
NORMDIST(x = 10, mean = 13.393, sd = 8.127, cumulative) = 0.338

Therefore the chance that the customer spends more than $10 = 1 − 0.338 = 0.662.
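The same calculation in R:

mu <- 12.214 + 3 * 0.393              # 13.393
1 - pnorm(10, mean = mu, sd = 8.127)  # ~0.662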

g) Does the number of children affect food purchases more than non-consumables? Why or why not? State your reasons precisely.

Yes. As seen in part (e), when children = 0 the expenditure on food and non-consumables is the same, but as the number of children increases, more is spent on food.

This can also be seen from the prediction equations:

Food: Y = 12.214 + 0.393·Children
Non-consumables: Y = 12.214 + 0.322·Children

The Children coefficient is larger for food (0.393) than for non-consumables (0.322).

h) If the grocery store has reason to believe that, in addition to the independent variables considered in Regression Output 4, homeowners spend significantly more on non-consumables than non-homeowners, how will you modify the model provided in Regression Output 4? Provide the model in β terms. If you are adding new variables to the model, provide details on what you expect the β value to be. Positive? Negative?

To capture homeowners versus non-homeowners on non-consumables, we introduce an interaction variable:

OwnH_PF3 = Own_Home × Prod_Fam3, where Own_Home = 1 if the customer is a homeowner, 0 otherwise.

Ŷ = β0 + β1·Children + β2·Prod_Fam2 + β3·Children·Prod_Fam3 + β4·OwnH_PF3 + ε

We expect β4 to be positive, since homeowners are believed to spend more on non-consumables.

………………………………………………………………………………………………………………………........................................................

Question 4: Go through the case, "Oakland A" and the spreadsheet supplement (Ref: Moodle/Cases and Materials/Module 3). Does Mark Nobel increase attendance? If so, how much is the increase worth for Oakland? Support your decision through an appropriate regression model.

Model 1: This was the first model built, without any interaction variables. Observations were as follows:

• Nobel is not significant
• Yanks and OD are positively significant
Call:
lm(formula = TIX ~ as.factor(NOBEL) + as.factor(YANKS) + as.factor(OD) +
as.factor(DOW) + as.factor(DH) + as.factor(PROMO), data = Oakland)

Residuals:
Min 1Q Median 3Q Max
-10989.1 -2686.2 -947.4 2218.7 15406.9

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14520 1617 8.980 7.01e-13 ***
as.factor(NOBEL)1 -1289 1336 -0.965 0.33832
as.factor(YANKS)1 29425 2180 13.499 < 2e-16 ***
as.factor(OD)1 17343 5106 3.397 0.00119 **
as.factor(DOW)2 -8673 2067 -4.196 8.68e-05 ***
as.factor(DOW)3 -9323 2078 -4.486 3.15e-05 ***
as.factor(DOW)4 -7448 2814 -2.647 0.01025 *
as.factor(DOW)5 -5354 2008 -2.667 0.00972 **
as.factor(DOW)6 -7710 1986 -3.882 0.00025 ***
as.factor(DOW)7 -4056 2031 -1.997 0.05018 .
as.factor(DH)1 4835 2155 2.244 0.02836 *
as.factor(PROMO)1 3019 1484 2.035 0.04610 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4538 on 63 degrees of freedom


Multiple R-squared: 0.8148, Adjusted R-squared: 0.7824
F-statistic: 25.19 on 11 and 63 DF, p-value: < 2.2e-16

Interaction variable 1: Nobel_Yanks

Nobel_Yanks coding:
0 = Nobel not playing and Yanks not playing
1 = Nobel not playing and Yanks playing
2 = Nobel playing and Yanks not playing
3 = Nobel playing and Yanks playing
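One way to construct this factor in R (a sketch, assuming NOBEL and YANKS are 0/1 columns in the Oakland data frame used above):

# Sketch: encode the four Nobel/Yanks combinations as a single factor
Oakland$Nobel_Yanks <- factor(Oakland$NOBEL * 2 + Oakland$YANKS)
# 0 = neither, 1 = Yanks only, 2 = Nobel only, 3 = both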

Model 2: with interaction variable


Call:
lm(formula = TIX ~ as.factor(Nobel_Yanks) + as.factor(OD) + +as.factor(DOW) +
as.factor(DH) + as.factor(PROMO), data = Oakland)

Residuals:
Min 1Q Median 3Q Max
-11222.6 -2493.7 -896.8 2129.3 15750.1

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14730 1668 8.829 1.46e-12 ***
as.factor(Nobel_Yanks)1 30570 2997 10.201 6.89e-15 ***
as.factor(Nobel_Yanks)2 -973 1458 -0.668 0.506916
as.factor(Nobel_Yanks)3 26810 3419 7.842 7.43e-11 ***
as.factor(OD)1 17342 5134 3.378 0.001267 **
as.factor(DOW)2 -8934 2130 -4.194 8.86e-05 ***
as.factor(DOW)3 -9564 2134 -4.483 3.24e-05 ***
as.factor(DOW)4 -7657 2854 -2.683 0.009346 **
as.factor(DOW)5 -5702 2112 -2.700 0.008932 **
as.factor(DOW)6 -7821 2007 -3.898 0.000241 ***
as.factor(DOW)7 -4263 2076 -2.054 0.044215 *
as.factor(DH)1 4516 2240 2.016 0.048130 *
as.factor(PROMO)1 2726 1581 1.725 0.089564 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4563 on 62 degrees of freedom


Multiple R-squared: 0.8157, Adjusted R-squared: 0.78
F-statistic: 22.87 on 12 and 62 DF, p-value: < 2.2e-16

Observations:
Nobel becomes significant when the Yanks are playing, but the Yanks coefficient is pulled down. This does not prove that Nobel makes any valuable addition to attendance.

We ran an outlier test at 1.5 IQR and plotted the outliers. Yanks came out as an outlier, since all five games against the Yankees were high-selling. We therefore decided to remove Yanks for the final model, as it might be masking other significant variables such as Nobel.

Outlier test graph

The next model was therefore created without Yanks, and with two additional interaction variables: OPP1 and Nobel_TOG.

OPP1 was created after running a model in which OPP was significant for a few teams; we recoded the significant opposition teams into a categorical variable.

OPP1 coding:
0 = match against any other team
1 = match against team 9
2 = match against team 13

Nobel_TOG coding:
0 = Nobel not playing
1 = Nobel playing and TOG is 1 (first half)
2 = Nobel playing and TOG is 2
Model 3:

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12788 1375 9.302 5.88e-13 ***
as.factor(OD)1 19322 4034 4.790 1.26e-05 ***
as.factor(PROMO)1 5374 1310 4.103 0.000133 ***
as.factor(DH)1 7151 2042 3.502 0.000913 ***
as.factor(Nobel_TOG)1 1731 1641 1.055 0.296169
as.factor(Nobel_TOG)2 -4342 1654 -2.625 0.011144 *
as.factor(OPP1)1 8651 1673 5.171 3.23e-06 ***
as.factor(OPP1)2 4551 1575 2.889 0.005483 **
as.factor(DOW)2 -7885 1735 -4.544 2.99e-05 ***
as.factor(DOW)3 -9002 1702 -5.289 2.11e-06 ***
as.factor(DOW)4 -7695 2303 -3.341 0.001492 **
as.factor(DOW)5 -4263 1717 -2.483 0.016042 *
as.factor(DOW)6 -7338 1682 -4.364 5.56e-05 ***
as.factor(DOW)7 -5630 1793 -3.139 0.002702 **
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3559 on 56 degrees of freedom


Multiple R-squared: 0.6564, Adjusted R-squared: 0.5766
F-statistic: 8.228 on 13 and 56 DF, p-value: 6.433e-09

Thus, both stand-alone and through all the interactions, we could not establish that Nobel adds anything significant to ticket sales.

Model validation and diagnostics:

Independence of errors - no autocorrelation (Durbin–Watson test)

H0: there is no correlation among the residuals, i.e., they are independent (ρ = 0)
H1: the residuals are autocorrelated (ρ ≠ 0)

Durbin–Watson test:
lag Autocorrelation D-W Statistic p-value
1 0.4428619 1.103515 0

As the p-value ≈ 0, we reject the null and conclude that the residuals are autocorrelated (D-W = 1.10, positive autocorrelation).

Test for heteroscedasticity

Breusch–Pagan test:

studentized Breusch-Pagan test

data: NOYANKSMODELF
BP = 16.189, df = 13, p-value = 0.2391

The plots of interest are the residuals vs fitted values chart and the standardised-residuals chart. If there is no heteroscedasticity, the points should be randomly and evenly distributed across the range of the x-axis, with a flat red line.

Both the graphical check and the Breusch–Pagan test (p-value > 0.05) indicate there is no heteroscedasticity.
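These diagnostics can be reproduced with standard R packages (a sketch, assuming the final fitted lm object is named fit):

library(car)     # durbinWatsonTest()
library(lmtest)  # bptest()
durbinWatsonTest(fit)  # D-W = 1.10, p ~ 0 -> residuals autocorrelated
bptest(fit)            # BP = 16.19, p = 0.239 -> no heteroscedasticity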

QQ plot (normality of residuals) and residual histogram plot

……………………………………………………………………………………………………………………………………………………………………………

Question 5

Part 1. Calculate the budget for which box office success and failure are equally likely.

The logit function: ln(π/(1 − π)) = β0 + β1·Budget

ln(π/(1 − π)) is the log-odds. When success and failure are equally likely, π = 1 − π, so ln(π/(1 − π)) = ln(1) = 0. Therefore:

β0 + β1·Budget = 0

Budget = −β0/β1 = −1.621/−0.016 = 101.3125 crores
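A quick check in R (coefficients from Table 1):

b0 <- 1.621; b1 <- -0.016
-b0 / b1                      # 101.3125 crore break-even budget
plogis(b0 + b1 * (-b0 / b1))  # 0.5, confirming success and failure are equally likely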

Part 5.2
Is there sufficient evidence to conclude that higher-budget movies are more likely to fail at the box office?

H0: β(Budget) = 0 (budget has no effect on box office success)
H1: β(Budget) ≠ 0

Decision rule: if p-value < 0.05, reject the null.

From Table 1, the p-value = 0.046 < 0.05, so we reject the null. Since the budget coefficient (−0.016) is negative and significant, there is evidence that higher-budget movies are more likely to fail at the box office.

Question 5.3
A production house is making a movie with a 100-crore budget; what is the success probability for this movie?

z = β0 + β1·Budget = 1.621 − 0.016×100 = 0.021

P(success) = e^z/(1 + e^z) = e^0.021/(1 + e^0.021) ≈ 0.505
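In R:

plogis(1.621 - 0.016 * 100)  # ~0.505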

Question 5.4 (4 Points)


Calculate the optimal cut-off probability when the cost of classifying failure at box office (0) as success at the
box office (1) is five times costlier than the cost of classifying success (1) as failure (0). Show all calculations.

Classifying a 0 as 1 costs five times as much as classifying a 1 as 0:

P01 = number of 0s classified as 1 (false positives)
P10 = number of 1s classified as 0 (false negatives)
C01 = cost of classifying a 0 as 1 (5 units)
C10 = cost of classifying a 1 as 0 (1 unit)

Given these costs, it is more acceptable to misclassify a 1 as a 0 than a 0 as a 1.

Sensitivity: proportion of 1s that are correctly identified
Specificity: proportion of 0s that are correctly identified

Optimal cut-off: minimise the total misclassification cost, Min[P00C00 + P01C01 + P10C10 + P11C11], which with zero cost for correct classifications reduces to Min[P01·C01 + P10·C10].

Cut-off  P10  P01  C01  C10  Total cost
0.5       3   17    5    1          88
0.6       6   14    5    1          76
0.7      11   13    5    1          76
0.8      23    0    5    1          23

As per this, 0.8 is the optimal cut-off.
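The cost column can be reproduced in R (counts from the table above):

cost <- data.frame(cutoff = c(0.5, 0.6, 0.7, 0.8),
                   P10 = c(3, 6, 11, 23), P01 = c(17, 14, 13, 0))
cost$total <- cost$P01 * 5 + cost$P10 * 1   # C01 = 5, C10 = 1
cost$cutoff[which.min(cost$total)]          # 0.8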

Question 5.5 Calculate the difference in success probabilities for movies with item song and movies without
item song.

Probability of success for movies with an item song = e^(1.099 − 0.501)/(1 + e^(1.099 − 0.501)) = 0.645

Probability of success for movies without an item song = e^1.099/(1 + e^1.099) = 0.750

Difference = 0.750 − 0.645 = 0.105 (movies without an item song have the higher success probability)

Question 5.6 (2 points)

Which is a better model (budget as an independent variable vs item song as an independent variable).
Clearly state your reasons.

The item-song model is better, because:

1. It classifies more 0s correctly, which is desirable given the cost structure.

2. The p-value for Budget is 0.046, whereas for Item Song it is 0.013; the lower p-value makes the item-song coefficient more reliable.

Question 5.7

Consider all the information in tables 1 to 7, which model you would recommend to predict the movie
success at the box office? Clearly state your reasons.

The model in Tables 6 and 7 is recommended for predicting movie success at the box office, as its classification accuracy for success and failure is the highest (74.6%), and the Wald statistics and significance levels shown in Table 7 are more relevant and accurate.

……………………………………………………………………………………………………………………………………………………………………………

Question 6

Read the case,“Breaking Barriers – Micro-mortgage analytics”. Using the data provided, develop a credit
rating model that Shubham can use

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.917e+00 7.677e-01 10.313 < 2e-16 ***
LTV -7.001e-02 6.663e-03 -10.507 < 2e-16 ***
dwnp_prop_p -7.486e-02 7.329e-03 -10.214 < 2e-16 ***
IAR 6.028e-02 6.321e-03 9.536 < 2e-16 ***
IIR -9.964e-02 1.177e-02 -8.464 < 2e-16 ***
BankSave 9.880e-06 3.373e-06 2.929 0.003402 **
Tier2 4.314e-01 1.888e-01 2.286 0.022283 *
Tier3 -5.935e-01 1.724e-01 -3.441 0.000579 ***
Employment_TypeSelf_Employed -6.406e-01 1.536e-01 -4.169 3.05e-05 ***
Accommodation_ClassRented 4.292e-01 1.520e-01 2.823 0.004754 **
GenderMale 7.103e-01 2.827e-01 2.513 0.011987 *
Age_11 -3.336e-01 1.606e-01 -2.078 0.037745 *
Age_12 9.539e-02 2.595e-01 0.368 0.713148
Age_13 -2.470e+00 1.154e+00 -2.140 0.032379 *
Loan_TypeHome_Loan 6.156e-01 2.838e-01 2.169 0.030087 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1750.1 on 1778 degrees of freedom


Residual deviance: 1246.1 on 1764 degrees of freedom
AIC: 1276.1

Number of Fisher Scoring iterations: 8

We built a logistic regression model using stepwise selection in R. Age was converted to a categorical variable.

Code sheet for AGE_1:
0 = age group 20-39
1 = age group 40-49
2 = age group 50-59
3 = age group 60+

Confidence intervals (confint) of the coefficients on the training data set:

2.5 % 97.5 %
(Intercept) 6.461194e+00 9.473085e+00
LTV -8.361814e-02 -5.748642e-02
dwnp_prop_p -8.972646e-02 -6.097005e-02
IAR 4.817255e-02 7.296643e-02
IIR -1.231476e-01 -7.696668e-02
BankSave 4.686608e-06 1.790836e-05
Tier2 6.264465e-02 8.034187e-01
Tier3 -9.334156e-01 -2.568660e-01
Employment_TypeSelf_Employed -9.442884e-01 -3.415130e-01
Accommodation_ClassRented 1.318359e-01 7.282810e-01
GenderMale 1.764476e-01 1.289141e+00
Age_11 -6.475143e-01 -1.742790e-02
Age_12 -4.026493e-01 6.168986e-01
Age_13 -4.677746e+00 -4.202113e-02
Loan_TypeHome_Loan 5.372949e-02 1.168862e+00

We ran a confusion matrix on the train data, which was as follows:

FALSE TRUE
0 136 209
1 46 1388

This is at a 0.5 cut-off; to find the optimal cut-off, we use the Youden index.

cutoff P10 P01 msclaf.cost P11 P00 youden.index


[1,] 0.05 0.0000000000 0.93913043 1.8782609 1.0000000 0.06086957 0.06086957
[2,] 0.10 0.0000000000 0.91304348 1.8260870 1.0000000 0.08695652 0.08695652
[3,] 0.15 0.0006973501 0.89565217 1.7920017 0.9993026 0.10434783 0.10365048
[4,] 0.20 0.0020920502 0.88695652 1.7760051 0.9979079 0.11304348 0.11095143
[5,] 0.25 0.0034867503 0.86956522 1.7426172 0.9965132 0.13043478 0.12694803
[6,] 0.30 0.0048814505 0.85217391 1.7092293 0.9951185 0.14782609 0.14294464
[7,] 0.35 0.0069735007 0.80579710 1.6185677 0.9930265 0.19420290 0.18722940
[8,] 0.40 0.0160390516 0.74202899 1.5000970 0.9839609 0.25797101 0.24193196
[9,] 0.45 0.0237099024 0.67246377 1.3686374 0.9762901 0.32753623 0.30382633
[10,] 0.50 0.0334728033 0.58260870 1.1986902 0.9665272 0.41739130 0.38391850
[11,] 0.55 0.0502092050 0.53043478 1.1110788 0.9497908 0.46956522 0.41935601
[12,] 0.60 0.0725244073 0.46956522 1.0116548 0.9274756 0.53043478 0.45791038
[13,] 0.65 0.0885634589 0.42608696 0.9407374 0.9114365 0.57391304 0.48534958
[14,] 0.70 0.1317991632 0.35072464 0.8332484 0.8682008 0.64927536 0.51747620
[15,] 0.75 0.1834030683 0.28405797 0.7515190 0.8165969 0.71594203 0.53253896
[16,] 0.80 0.2482566248 0.21449275 0.6772421 0.7517434 0.78550725 0.53725062
[17,] 0.85 0.3277545328 0.14782609 0.6234067 0.6722455 0.85217391 0.52441938
[18,] 0.90 0.4497907950 0.09275362 0.6352980 0.5502092 0.90724638 0.45745558
[19,] 0.95 0.6171548117 0.05507246 0.7272997 0.3828452 0.94492754 0.32777272
[20,] 1.00 NA NA NA NA NA NA
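The youden.index column above is J = sensitivity + specificity − 1 at each cut-off. A sketch of the computation (assuming prob holds the predicted probabilities and y the 0/1 outcomes on the train data):

youden <- sapply(seq(0.05, 0.95, by = 0.05), function(cut) {
  pred <- as.integer(prob >= cut)
  sens <- sum(pred == 1 & y == 1) / sum(y == 1)  # proportion of 1s caught
  spec <- sum(pred == 0 & y == 0) / sum(y == 0)  # proportion of 0s caught
  sens + spec - 1
})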

Going ahead with the highest Youden index, we choose a cut-off of 0.8, which gives the following confusion matrix on the train data:

FALSE TRUE
0 271 74
1 356 1078

Applying the same cut-off to the test data, we get the following confusion matrix and ROC plot with AUC:

FALSE TRUE
0 59 27
1 99 259

This gives 72% accuracy ((59 + 259)/444 ≈ 0.72).

ROC plot: AUC = 0.77

Deployment Strategy

Based on the predicted probability we will divide the customers into 3 categories:

Low Risk: probability > 0.75
Medium Risk: probability 0.50–0.75
High Risk: probability < 0.50

Low Risk: the loan can be approved.

Medium Risk: charge a higher processing fee and also increase the down payment.

High Risk: the model is not able to classify these cases well; this covers anyone with probability below 0.50.

We can apply the following thumb rules before processing the loan:

• Give LTV a higher weightage, checking the loan requirement against the market value of the property.
• Increase the processing fee and down payment.
• Also assign weightage to IIR and IAR.

………………………………………………………………………………………………………………………...................................................

Question 7

Comment whether the marital status has any statistical significance on the probability of loan denial. Clearly
state your reasons.

For loan denial, we use Model 1.

Marital status is not a significant variable: its Wald statistic is low and its significance level is 0.17 > 0.05, so it has no statistically significant effect on the probability of loan denial.

Q 7.2

What percentage of the applicants with a DI=20, LTV = 0.5, IIR = 0.8, MS = 0 and Old EMI = 0 will be given a
loan at 18% interest? Use only statistically significant variables and assume that the changes in the
coefficient values are negligible due to dropping of insignificant variables.

To calculate the percentage of applicants given a loan at 18% interest:

P(loan at 18%) = 1 − [P(denial) + P(loan at 14%)]

z(loan denial) = 1.720 − 0.120×20 − 0.521×0.5 − 0.220×0.8 − 1.120×0 = −1.1165
P(denial) = e^z/(1 + e^z) = 0.246

z(loan at 14%) = 0.650 − 0.580×20 = −10.95
P(loan at 14%) = e^(−10.95)/(1 + e^(−10.95)) = 1.76 × 10⁻⁵

Therefore P(loan at 18%) = 1 − [0.246 + 0.0000176] ≈ 0.75

75% of such applicants will be given a loan at 18% interest.
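In R:

p_deny <- plogis(1.720 - 0.120*20 - 0.521*0.5 - 0.220*0.8)  # ~0.246
p_14   <- plogis(0.650 - 0.580*20)                          # ~1.76e-05
1 - (p_deny + p_14)                                         # ~0.754 -> ~75%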

……………………………………………………………………………………………………………………………………………………………………………

Question 8

Read the case, “Fraud analytics at MCA technology solutions – Predicting Earnings Manipulation by Indian
Firms”. Develop a model using logistic regression and discriminant analysis to predict fraudulent
transactions. (Ref: Moodle/Cases and Materials/Module 3)

Stepwise logistic model :

Model without bagging


Call:
glm(formula = Manipulater ~ DSRI + SGI + ACCR + GMI + AQI, family = binomial,
data = data.train)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.9708 -0.2052 -0.1632 -0.1334 3.0468

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.51768 0.71995 -9.053 < 2e-16 ***
DSRI 0.79368 0.15092 5.259 1.45e-07 ***
SGI 0.84955 0.25477 3.335 0.000854 ***
ACCR 5.89684 1.35669 4.346 1.38e-05 ***
GMI 0.45026 0.27713 1.625 0.104222
AQI 0.24252 0.09298 2.608 0.009102 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 282.73 on 991 degrees of freedom


Residual deviance: 190.10 on 986 degrees of freedom
AIC: 202.1

Confusion matrix for this model on the train data:

FALSE TRUE
No 957 3
Yes 23 9

A large number of manipulators (Yes) are being predicted as non-manipulators; we need to reduce these false negatives.

Since a single sample may not be reliable, we applied bagging over 100 bootstrap samples to arrive at a better confusion matrix.
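A sketch of the bagging step (assuming data.train as in the glm call above; the exact resampling scheme used may differ):

set.seed(1)
preds <- replicate(100, {
  idx <- sample(nrow(data.train), replace = TRUE)          # bootstrap sample
  fit <- glm(Manipulater ~ DSRI + SGI + ACCR + GMI + AQI,
             family = binomial, data = data.train[idx, ])
  predict(fit, newdata = data.train, type = "response")
})
bagged_prob <- rowMeans(preds)  # average predicted probability across the 100 bags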

Post bagging, the confusion matrix on the train data is:
Confusion Matrix and Statistics

Reference
Prediction No Yes
No 960 19
Yes 0 13

Accuracy : 0.9808
95% CI : (0.9703, 0.9884)
No Information Rate : 0.9677
P-Value [Acc > NIR] : 0.008494

Kappa : 0.5698
Mcnemar's Test P-Value : 3.636e-05

Sensitivity : 1.0000
Specificity : 0.4062
Pos Pred Value : 0.9806
Neg Pred Value : 1.0000
Prevalence : 0.9677
Detection Rate : 0.9677
Detection Prevalence : 0.9869
Balanced Accuracy : 0.7031

'Positive' Class : No

We applied the model to the test data to check its accuracy. The confusion matrix for the test data:

Confusion Matrix and Statistics

Reference
Prediction No Yes
No 240 5
Yes 0 2

Accuracy : 0.9798
95% CI : (0.9534, 0.9934)
No Information Rate : 0.9717
P-Value [Acc > NIR] : 0.29703

Kappa : 0.4374
Mcnemar's Test P-Value : 0.07364

Sensitivity : 1.0000
Specificity : 0.2857
Pos Pred Value : 0.9796
Neg Pred Value : 1.0000
Prevalence : 0.9717
Detection Rate : 0.9717
Detection Prevalence : 0.9919
Balanced Accuracy : 0.6429

'Positive' Class : No

On the test data the model achieved about 98% overall accuracy, although it identified only 2 of the 7 manipulators (specificity = 0.286, with 'No' as the positive class).

We also plotted the ROC curve and computed the AUC.

The variables used in the model are significant, and the model can be used in a deployment strategy, assigning weightage and scores to these variables to flag potential manipulators.

……………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………

End of assignment
