Assignment 2 Module 3

The document discusses a regression model analyzing factors that impact box office collections for movies. 1) Several statements about the regression model are evaluated, and it is found that the model explains 42.25% of variation in box office collection, there are outliers in the model, and box office collection increases as budget increases. 2) It is determined that the regression model's negative constant value is not problematic. 3) There is no significant average difference found between movies released during holiday seasons versus normal seasons. 4) A claim that movies released on long weekends earn 5 crores more is not supported by the model at a 5% significance level.

Module 3: Assignment 2

Priyanka Sindhwani
IIMB – BIA 7
Question 1:

Which of the following statements are correct (more than one may be correct)?
Tick (✓) all right answers or highlight the correct statements with color.
1. The model explains 42.25% of variation in box office collection.
2. There are outliers in the model.
3. The residuals do not follow a normal distribution.
4. The model cannot be used since R-square is low.
5. Box office collection increases as the budget increases.

Answer 1.1

Correct statements: 1, 2 and 5.

1. The model explains 42.25% of variation in box office collection. (Correct: R = 0.650, so R^2 = 0.650^2 = 0.4225 = 42.25%.)
2. There are outliers in the model. (Correct)
3. The residuals do not follow a normal distribution.
4. The model cannot be used since R-square is low.
5. Box office collection increases as the budget increases. (Correct: putting the estimates into Y = β0 + β1*x1 gives
Y = -8.354 + 2.175*x1, and the slope is positive.)

Question 1.2
Mr Chellappa, CEO of Oho Productions (OP) claims that the regression model in Table 3 is incorrect since it has
negative constant value. Comment whether Mr Chellappa is correct in his assessment about the model.

Answer 1.2
The y-intercept is only meaningful if it is logically meaningful for all predictor variables to be zero.
The constant in a regression model is interpreted as the expected mean value of Y when all predictors are zero.
In the current case the budget can never be zero or less, so the negative constant has no practical
interpretation. It does not make the model incorrect, so Mr Chellappa's assessment is not justified.

Question 1.3
What is the average difference in the box office collection when a movie is released during a holiday
season (Releasing_Time_holiday_season) versus movies released during normal season
(Releasing_Time_Normal_Season)? Use a significance value of 5%.

Answer 1.3

For a movie released during the holiday season (the base category) the estimate equals the constant, 2.685,
whereas for Releasing_Time_Normal_Season the equation would be Y = 2.685 + 0.147*x.
At a significance level of 0.05, the normal-season release time is not significant, so it merges with the base
category. Thus there is no average difference between Releasing_Time_Normal_Season
and Releasing_Time_Holiday_Season.

Question 1.4 (4 Points)

Mr Chellappa of Oho productions claims that the movies released during long weekend
(Releasing_Time_Long_Weekend) earn at least 5 crores more than the movies released during normal season
(Releasing_Time_Normal_Season). Check whether this claim is true (use α = 0.05).

Answer 1.4

From the model output, denote

(Releasing_Time_Long_Weekend): RTL
(Releasing_Time_Normal_Season): RNS

As seen from the model output, the RNS coefficient is not significant, so its value is the same
as the base category. Therefore
Estimated value of β for a movie released in normal season (RNS) = 2.685
whereas the estimated value of β for RTL = 2.685 + 1.247 = 3.932

Estimated difference RTL - RNS = 3.932 - 2.685 = 1.247

As all the values are in natural log (ln), we also convert 5 crore into ln(5 crore).

ln(5 crore) = ln(5 * 10^7) = 17.727

H0: Difference in β for RTL and RNS <= 17.727

H1: Difference in β for RTL and RNS > 17.727

Decision rule: if t-value > t-critical, reject the null

t = (Estimated value - Hypothesized value) / Std. error

t = (1.247 - 17.727) / 0.588 = -28.028

t-critical = -1.97

As t-value < t-critical, we fail to reject the null. Thus the claim of earning at least 5 crore more is not supported.
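The arithmetic in this answer can be checked with a short script. This is a minimal sketch, assuming 1 crore = 10^7 and reusing the estimate and standard error quoted above; the function name `t_statistic` is illustrative and not part of the assignment.

```python
import math


def t_statistic(estimate, hypothesized, std_error):
    """t = (estimate - hypothesized value) / standard error."""
    return (estimate - hypothesized) / std_error


# Assumption: 1 crore = 10**7, so 5 crore = 5 * 10**7.
ln_five_crore = math.log(5 * 10**7)

# Estimated difference (1.247) and its standard error (0.588) from the answer.
t_value = t_statistic(1.247, ln_five_crore, 0.588)

print(f"ln(5 crore) = {ln_five_crore:.4f}")
print(f"t = {t_value:.3f}")
```

A strongly negative t-value cannot exceed any critical value of a right-tailed test, which is why the null is retained.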

Question 1.5 (2 Points)

What is the variation in response variable, ln(Box office collection), explained by the model after adding all 6
variables?

Answer 1.5:

The variation in the response variable, ln(Box office collection), explained before the 6th variable is added is
R^2 = 0.81^2 = 0.6561 = 65.6%

Therefore, after adding the 6th variable

R square of model 6 = R^2 of previous model + (Part correlation of variable 6)^2

R square of model 6 = 0.6561 + (-0.104)^2 = 0.6669 = 66.7%
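The update rule used above (R^2 grows by the squared part correlation of the newly added variable) can be sketched as follows. The helper name is hypothetical; the inputs 0.81 and -0.104 are the values quoted in the answer.

```python
def r_squared_after_adding(r_squared_before, part_correlation):
    """New R^2 = old R^2 + squared part (semipartial) correlation."""
    return r_squared_before + part_correlation ** 2


r_squared_before = 0.81 ** 2  # R = 0.81 before the 6th variable enters
r_squared_six = r_squared_after_adding(r_squared_before, -0.104)

print(f"R^2 before = {r_squared_before:.4f}")
print(f"R^2 with 6 variables = {r_squared_six:.4f}")
```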

Question 1.6 (2 Points)

Which factor has the maximum impact on the box office collection of a movie? What will be your
recommendation to a production house based on the variable that has maximum impact on the box office
collection?

Answer 1.6

Budget has the maximum impact on box office collection: its standardized beta coefficient, 0.443, is the
largest.

Based on this our recommendation is:

The budget for a movie should be kept above 35 crore, as the standardized beta of 0.443 for the budget
variable indicates the largest positive effect on box office collection.

Question 1.7
Compare the regressions in Model 2 (Table 4) and Model 3 (Tables 5 and 6). None of the variables in Model 2
are statistically significant in Model 3. Can we conclude that the variables in Model 2 have no association
relationship with Box Office Collection? Explain clearly.

Answer 1.7

Model 3 (Tables 5 and 6) uses stepwise regression, which at each step adds only the most significant variable,
that is, the variable with the lowest p-value.

The tables below show the p-values we calculated for the variables in each model.

Model 2

Variable                          p-value
Constant                          0.000
Releasing_Time_Festival_Season    0.203
Releasing_Time_Long_Weekend       0.036
Releasing_Time_Normal_Season      0.734

Model 3

Variable                          p-value
Constant                          2.06206E-29
Budget_35_cr                      1.49579E-11
You_tube view                     1.89986E-05
Prod_house_cat A                  0.002879598
Music_Dir_Cat c                   0.001464786
Genre Comedy                      0.02221579
Director_Cat C                    0.033817534

As the p-values of the variables in Model 3 are lower than those of the variables in Model 2, the Model 2
variables were not selected in Model 3 (stepwise regression). Being dropped by the stepwise procedure does not
show that the Model 2 variables have no association with box office collection: Releasing_Time_Long_Weekend is
significant on its own in Model 2 (p = 0.036).

Question 1.8 Among the variables in Table 6, which variable is not useful for practical application of the
model? Clearly state your reasons.

Answer 1.8

YouTube views, as that is a variable we have no control over. Keeping it in the model therefore adds no
information that helps us take a decision.

…………………………………………………………………………………………………………………….............................................................

Question 2.1 :
a) What is the predictor variable used in Model 1? Explain clearly.

Price Index

Price Index has the largest correlation with Sales in the correlation matrix, so it gives the highest R^2 and is
the first variable to enter the model.

b) What proportion of variation in Sales does this predictor variable explain in model 1? Explain clearly.

R^2 = (Correlation coefficient)^2 = (-0.6089)^2 = 0.37075921 = 37%

c) What is the Std. Error of the Estimate for Model 1? Explain clearly

n = No. of observations
k = No. of explanatory variables
R^2 = 1 - SSE/SST
Therefore SSE = SST * (1 - R^2)
SST = 3.13 * 10^13 (given)
R^2 = 0.37 (calculated in the previous question)
(1 - R^2) = 0.62924079

SSE = 1.97 * 10^13

MSE = SSE / (n - (k + 1))

SE = sqrt(MSE)

Standard Error (se) = sqrt(SSE / (n - k - 1)) = sqrt(1.97 * 10^13 / (30 - 1 - 1)) = 838791.6479
(n = 30, k = 1)
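The calculation above can be reproduced with a few lines of Python. This is a sketch under the stated inputs (SST = 3.13 * 10^13, r = -0.6089, n = 30, k = 1); the function name is hypothetical. It computes the standard error both from the rounded SSE used in the answer and from the unrounded SSE, which gives a slightly smaller figure.

```python
import math


def standard_error_of_estimate(sse, n, k):
    """Standard error of the estimate: sqrt(SSE / (n - k - 1))."""
    return math.sqrt(sse / (n - k - 1))


sst = 3.13e13                     # total sum of squares (given)
r_squared = (-0.6089) ** 2        # R^2 of Model 1
sse_exact = sst * (1 - r_squared)  # SSE without intermediate rounding

se_from_rounded_sse = standard_error_of_estimate(1.97e13, n=30, k=1)
se_from_exact_sse = standard_error_of_estimate(sse_exact, n=30, k=1)

print(f"SSE (unrounded) = {sse_exact:.4e}")
print(f"SE with SSE rounded to 1.97e13 = {se_from_rounded_sse:.2f}")
print(f"SE with unrounded SSE = {se_from_exact_sse:.2f}")
```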

Question 2.2
a) What is the magnitude of the semipartial (or part) correlation for the variable ‘Interest’ in Model 2?
Explain.

The square of the semipartial correlation gives the increase in R^2 when the variable is added.

R^2 = 0.37 (R square when Price Index is added)

R^2 = 0.502 (R square when both Price Index and Interest are added)

Therefore, the squared semipartial (part) correlation of Interest is 0.502 - 0.37 = 0.132.

Thus the magnitude is sqrt(0.132) = 0.363318042

Q 2.b Carry out an appropriate test, at 95% confidence level, to determine if Model 2 as a whole is valid
(significant). State the null and alternate hypotheses and show all work.

To check the validity of the complete model we conduct an F test.

H0: β1 = β2 = ... = βk = 0

HA: Not all β values are zero

Decision rule: if F-value > F-critical, reject the null. Rejecting the null hypothesis means that the overall
model is valid.

F = MSR/MSE = (R^2/k) / ((1 - R^2)/(n - k - 1))

For Model 2 (R^2 = 0.502, k = 2, n = 30):

F-value = (0.502/2) / ((1 - 0.502)/(30 - 2 - 1)) = 13.61

F-critical = F(0.05, 2, 27) = 3.354131

As F-value > F-critical, we reject the null hypothesis and conclude that the model is valid.
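The overall F statistic can be computed directly from R^2. The sketch below assumes the values stated in the answer (R^2 = 0.502, k = 2, n = 30) and takes the critical value 3.354131 as given rather than deriving it; `overall_f_statistic` is an illustrative name.

```python
def overall_f_statistic(r_squared, n, k):
    """F = (R^2 / k) / ((1 - R^2) / (n - k - 1)) for the whole model."""
    return (r_squared / k) / ((1 - r_squared) / (n - k - 1))


f_value = overall_f_statistic(0.502, n=30, k=2)
f_critical = 3.354131  # F(0.05, 2, 27), taken from the answer above

reject_null = f_value > f_critical
print(f"F = {f_value:.3f}, reject H0: {reject_null}")
```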

c. Given no change in the other significant explanatory variables, can it be concluded from Model 2 that
‘Interest’ has a higher impact on ‘Sales’ than the other variable used in the model. Explain clearly.

No. To compare the two coefficients we look at the standardized beta coefficients.

The standardized beta for Interest is -0.362 and for Price Index it is -0.595, implying that a one-SD change in
Interest has a smaller impact on Sales than a one-SD change in Price Index.

Question 2.3
Can it be concluded, at 95% confidence level, that an increase in ‘Interest’ rate by 5% decreases yearly Sales
by at least 250000 units or more? Show all work.

For a 1% increase in Interest, the claimed decrease in Sales would be 250000/5 = 50,000 units.

H0: β2 >= -50000

H1: β2 < -50000

Decision rule: reject the null if t-value < t-critical (left-tailed test)

t = (Estimated value of parameter - Hypothesized value of parameter) / Std. error

t = (-124592 - (-50000)) / 46820.081 = -1.593

t-critical = -1.703

As t-value > t-critical, we fail to reject the null. Thus we cannot conclude that a 5% increase in Interest
decreases Sales by at least 250,000 units.

Question 2.4: What can you say about the relationship between ‘Interest’ and the other predictor variable
used in Models 1 and 2? Explain clearly.

• Interest is negatively related to Sales: a higher Interest rate is associated with lower Sales.
• In the presence of Interest, the Price Index coefficient becomes less negative, so Price Index must have a
positive relation with Interest.

Question 2.5: The partial correlation of the excluded variables; after Model 2 was fitted; are 0.184 and 0.246.
Conduct an appropriate test, at 95% confidence level, to determine if one of these excluded variables should
be added to the regression model. State the null and alternate hypotheses and show all work.

Given

Partial correlation: 0.184 and 0.246

H0: β(r+1) = ... = βk = 0

H1: At least one of β(r+1), ..., βk != 0

Test: Partial F test

F = [(R^2 (full model) - R^2 (reduced model)) / (k - r)] / [(1 - R^2 (full model)) / (n - k - 1)]

Decision: if F-critical > F-value, we fail to reject the null, meaning the variables in the set do not
significantly improve the model when all other variables are included.

The correlations with Sales from the correlation matrix are used to decide which partial correlation belongs to
which variable.

Variables not part of Model 2 are

Year: Correlation with Sales = -0.5453
Income: Correlation with Sales = -0.5033
Higher correlation = Higher partial correlation, therefore
Income partial correlation = 0.246 and Year = 0.184

Part Correlation for Income = 0.246 / sqrt(1 - (-0.5033)^2) = 0.284685402

R^2 contribution = Square of part correlation = 0.081045778

Part Correlation for Year = 0.184 / sqrt(1 - (-0.5453)^2) = 0.219507288

R^2 contribution = 0.048183449

Full Model 1 (adding Year): R^2 = 0.502 + 0.048 = 0.55

Full Model 2 (adding Income): R^2 = 0.502 + 0.081 = 0.583
Reduced Model R^2 = 0.502

F = [(R^2 (full model) - R^2 (reduced model)) / (k - r)] / [(1 - R^2 (full model)) / (n - k - 1)]

Model 1

F-value = ((0.55 - 0.502)/(4 - 3)) / ((1 - 0.55)/(30 - 4 - 1)) = 2.67

F-critical = F(0.05, 1, 25) = 4.24 (numerator degrees of freedom = k - r = 1)

As F-value < F-critical, we fail to reject the null.

Model 2
F-value = ((0.583 - 0.502)/(4 - 3)) / ((1 - 0.583)/(30 - 4 - 1)) = 4.856115108
F-critical = F(0.05, 1, 25) = 4.24

As F-value > F-critical, we reject the null.

Therefore, we will add one variable from the above (Income), as it comes out to be significant.
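The two partial F statistics can be reproduced with the sketch below. It reuses the R^2 values and the n, k and r quoted in the answer; the function name is hypothetical, and no critical value is computed because the standard library has no F distribution.

```python
def partial_f_statistic(r2_full, r2_reduced, n, k, r):
    """Partial F: ((R2_full - R2_reduced) / (k - r)) / ((1 - R2_full) / (n - k - 1))."""
    numerator = (r2_full - r2_reduced) / (k - r)
    denominator = (1 - r2_full) / (n - k - 1)
    return numerator / denominator


R2_REDUCED = 0.502  # Model 2 (Price Index and Interest)

# n, k and r as used in the answer above.
f_year = partial_f_statistic(0.55, R2_REDUCED, n=30, k=4, r=3)
f_income = partial_f_statistic(0.583, R2_REDUCED, n=30, k=4, r=3)

print(f"F when adding Year   = {f_year:.3f}")
print(f"F when adding Income = {f_income:.3f}")
```

Only the candidate whose F statistic exceeds the critical value is worth adding to the model.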

…………………………………………………………………………………………………………………......

Question 3
a. Rank the income groups based on average revenue obtained per transaction in the sample data from
largest to smallest. Provide precise reasons as to how you obtained this ranking. Is this ranking valid for the
population? What is the average revenue per transaction obtained for the income group ($10K-$30K)?

Looking at Regression Output 1, we can see that Ann_Inc ($10K-$30K) is taken as the base category.

To rank the groups we can add the coefficient of each group to the base category value and compare the results.
However, the p-values show that none of the coefficients is significant.
Therefore, for the population we cannot rank the groups, as all of them are statistically the same as the base category.
As the base category is the income group ($10K-$30K), its average revenue per transaction is $12.6841.

b) The grocery store wishes to estimate the average amount spent per transaction on non-consumables.
Provide the most accurate estimate possible. Provide details on how you obtained this estimate.

Regression Output 3 is used, as all the coefficients in that model are significant and the model is therefore valid.

As per the model

Y = 13.192 - 0.975*Prod_Fam3 - 0.743*Prod_Fam2

Y = 13.192 - 0.743 = 12.449

c) If, in Regression Output 3, the base chosen in product family is drinks (Prod_Fam2), then what will be the
corresponding prediction equation?

If the base is Drinks (Prod_Fam2)

Base category coefficient = 13.192 - 0.975 = 12.217
The equation will be

Y = (13.192 - 0.975) + (-0.743 - (-0.975))*Prod_fam2 - (-0.975)*Prod_fam1

d.)Is there a significant difference in the average amount spent per transaction between that on drinks and
non-consumables? Why or Why not? Provide precise reasons.

The difference between non-consumable and drinks will be measured by Prod_fam3 regression model output 3

H0 : Beta (Prod_fam3)=0
H1:Beta (Prod_fam3) !=0

Decision rule : Retain the null if p-value >0.05.

T-statistic for Prod_fam3 = -2.24, associated p-value = 0.025

As the p-value < 0.05, we reject the null. Thus there is a significant difference in the amount spent.

e) The grocery store wishes to target those customers, as well as items on which the amount spent is
maximum. Assuming that no customer has more than five children, identify the appropriate customer
segment as well as the appropriate product family. Provide precise reasons behind your answer.

With Children = 0, amount spent:

Drinks (Prod_Fam2): Y = 12.214 + 0*0.393 - 1.010 = 11.204
Non-consumables (Prod_Fam3): Y = 12.214 + 0*0.393 - 0*0.322 = 12.214
Food: Y = 12.214 + 0*0.393 = 12.214

With no children, the money spent on non-consumables and food is greater than on drinks.

With Children = 5, amount spent:

Drinks (Prod_Fam2): Y = 12.214 + 5*0.393 - 1.010 = 13.169
Non-consumables (Prod_Fam3): Y = 12.214 + 5*0.393 - 5*0.322 = 12.569
Food: Y = 12.214 + 5*0.393 = 14.179

It is clear from the equations that the amount spent with Children = 5 is maximum in the food product category:
the more children, the more is spent on food items.

f) What is the chance that a customer with 3 children will spend more than $10.00 on food items per
transaction? Provide details on your calculations.

We need the probability that the customer spends more than $10.

Estimated mean = 12.214 + 3*0.393 = 13.393
Std. error = 8.127
X = 10
First we find the probability that the customer spends at most $10:
NormDist(x = 10, mean = 13.393, stddev = 8.127, cumulative) = 0.338

Therefore the chance that the customer will spend more than $10 = 1 - 0.338 = 0.662
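The same probability can be obtained in Python with `statistics.NormalDist` in place of the spreadsheet NormDist function. This is a sketch that, like the answer, assumes spend is normally distributed around the predicted mean with the regression standard error (8.127) as its standard deviation.

```python
from statistics import NormalDist

# Predicted mean spend on food for a customer with 3 children.
mean_spend = 12.214 + 3 * 0.393

# Assumption (as in the answer): normal errors with standard deviation 8.127.
spend = NormalDist(mu=mean_spend, sigma=8.127)

p_at_most_10 = spend.cdf(10)
p_more_than_10 = 1 - p_at_most_10

print(f"P(spend <= 10) = {p_at_most_10:.3f}")
print(f"P(spend > 10)  = {p_more_than_10:.3f}")
```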

g) Do the number of children effect food purchases more than non-consumables? Why or why not? State
your reasons precisely.

Yes. As we saw in part (e), when Children = 0 the expenditure on food and on non-consumables is the same,
but as the number of children increases, more is spent on food.

We can also see this from the prediction equations.

For food:

Y = 12.214 + 0.393*Children

For non-consumables:

Y = 12.214 + 0.393*Children - 0.322*Children = 12.214 + 0.071*Children

h) If the grocery store has reason to believe that in addition to the independent variables considered in
Regression Output 4, homeowners spend significantly more on non-consumables than non-home owners on
any product category. If so, how will you modify the model provided in Regression Output 4? Provide the
model in β terms. If you are adding new variables to the model, provide details on what you expect the β
value to be. Positive? Negative?

To compare homeowners with non-homeowners, we introduce an interaction variable:

OwnH_PF3 = Own_Home * Prod_Fam3

where Own_Home = 1 if the customer is a homeowner, 0 otherwise.

Y = β0 + β1*Children + β2*Prod_Fam2 + β3*Children*Prod_Fam3 + β4*OwnH_PF3 + e

We expect β4 to be positive.

………………………………………………………………………………………………………………………........................................................

Question 4 : Go through the case, “Oakland A” and the spreadsheet supplement (Ref: Moodle/Cases and
Materials/Module 3). Does Mark Nobel increase attendance? If so, how much is the increase worth for
Oakland? Support your decision through an appropriate regression model.

Model 1: This was the first model built, without any interaction variable. Observations were as follows:

• Nobel is not significant

• Yanks and OD are positively significant
Call:
lm(formula = TIX ~ as.factor(NOBEL) + as.factor(YANKS) + as.factor(OD) +
as.factor(DOW) + as.factor(DH) + as.factor(PROMO), data = Oakland)

Residuals:
Min 1Q Median 3Q Max
-10989.1 -2686.2 -947.4 2218.7 15406.9

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14520 1617 8.980 7.01e-13 ***
as.factor(NOBEL)1 -1289 1336 -0.965 0.33832
as.factor(YANKS)1 29425 2180 13.499 < 2e-16 ***
as.factor(OD)1 17343 5106 3.397 0.00119 **
as.factor(DOW)2 -8673 2067 -4.196 8.68e-05 ***
as.factor(DOW)3 -9323 2078 -4.486 3.15e-05 ***
as.factor(DOW)4 -7448 2814 -2.647 0.01025 *
as.factor(DOW)5 -5354 2008 -2.667 0.00972 **
as.factor(DOW)6 -7710 1986 -3.882 0.00025 ***
as.factor(DOW)7 -4056 2031 -1.997 0.05018 .
as.factor(DH)1 4835 2155 2.244 0.02836 *
as.factor(PROMO)1 3019 1484 2.035 0.04610 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4538 on 63 degrees of freedom

Multiple R-squared: 0.8148, Adjusted R-squared: 0.7824
F-statistic: 25.19 on 11 and 63 DF, p-value: < 2.2e-16

Interaction variable 1: Nobel_Yanks

Nobel_Yanks
0 If Nobel is not playing and Yanks is not playing
1 If Nobel is not playing and Yanks is playing
2 If Nobel is playing and Yanks is not playing
3 If Nobel is playing and Yanks is playing

Model 2: with interaction variable

Call:
lm(formula = TIX ~ as.factor(Nobel_Yanks) + as.factor(OD) + +as.factor(DOW) +
as.factor(DH) + as.factor(PROMO), data = Oakland)

Residuals:
Min 1Q Median 3Q Max
-11222.6 -2493.7 -896.8 2129.3 15750.1

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 14730 1668 8.829 1.46e-12 ***
as.factor(Nobel_Yanks)1 30570 2997 10.201 6.89e-15 ***
as.factor(Nobel_Yanks)2 -973 1458 -0.668 0.506916
as.factor(Nobel_Yanks)3 26810 3419 7.842 7.43e-11 ***
as.factor(OD)1 17342 5134 3.378 0.001267 **
as.factor(DOW)2 -8934 2130 -4.194 8.86e-05 ***
as.factor(DOW)3 -9564 2134 -4.483 3.24e-05 ***
as.factor(DOW)4 -7657 2854 -2.683 0.009346 **
as.factor(DOW)5 -5702 2112 -2.700 0.008932 **
as.factor(DOW)6 -7821 2007 -3.898 0.000241 ***
as.factor(DOW)7 -4263 2076 -2.054 0.044215 *
as.factor(DH)1 4516 2240 2.016 0.048130 *
as.factor(PROMO)1 2726 1581 1.725 0.089564 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4563 on 62 degrees of freedom

Multiple R-squared: 0.8157, Adjusted R-squared: 0.78
F-statistic: 22.87 on 12 and 62 DF, p-value: < 2.2e-16

Observations:
The combination of Nobel and Yanks both playing (Nobel_Yanks = 3) is significant, but its coefficient (26810) is
lower than that of Yanks playing without Nobel (30570). Thus it does not show that Nobel makes any valuable
addition to attendance.

We ran an outlier test at 1.5 IQR and plotted the outliers. The Yanks games came out as outliers, as all 5 matches
played against Yanks are high selling. We therefore decided to remove Yanks from the final model, as it might be
masking other significant variables such as “Nobel”.

Outlier test graph

Therefore the next model was created without using Yanks,

and with 2 additional interaction variables: OPP1 and Nobel_TOG.
OPP1: This variable was created after running a model in which OPP came out significant for a few
teams. We therefore created another categorical variable marking the significant opposition teams.

OPP1
0 If match is against any other team
1 If match is against team 9
2 If match is against team 13

Nobel_TOG
0 If Nobel is not playing
1 If Nobel is playing and TOG is 1 (first half)
2 If Nobel is playing and TOG is 2

Model 3:

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12788 1375 9.302 5.88e-13 ***
as.factor(OD)1 19322 4034 4.790 1.26e-05 ***
as.factor(PROMO)1 5374 1310 4.103 0.000133 ***
as.factor(DH)1 7151 2042 3.502 0.000913 ***
as.factor(Nobel_TOG)1 1731 1641 1.055 0.296169
as.factor(Nobel_TOG)2 -4342 1654 -2.625 0.011144 *
as.factor(OPP1)1 8651 1673 5.171 3.23e-06 ***
as.factor(OPP1)2 4551 1575 2.889 0.005483 **
as.factor(DOW)2 -7885 1735 -4.544 2.99e-05 ***
as.factor(DOW)3 -9002 1702 -5.289 2.11e-06 ***
as.factor(DOW)4 -7695 2303 -3.341 0.001492 **
as.factor(DOW)5 -4263 1717 -2.483 0.016042 *
as.factor(DOW)6 -7338 1682 -4.364 5.56e-05 ***
as.factor(DOW)7 -5630 1793 -3.139 0.002702 **
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3559 on 56 degrees of freedom

Multiple R-squared: 0.6564, Adjusted R-squared: 0.5766
F-statistic: 8.228 on 13 and 56 DF, p-value: 6.433e-09

Thus, neither on its own nor through any of the interactions could we show that Nobel adds anything significant
to ticket sales.

Model validation and Diagnostics:

Independence of Errors - No Autocorrelation (Durbin-Watson test)

H0: There is no correlation among the residuals, i.e., they are independent (rho = 0).
H1: The residuals are autocorrelated (rho != 0).
Durbin Watson test :
lag Autocorrelation D-W Statistic p-value
1 0.4428619 1.103515 0
Alternative hypothesis: rho != 0

As the p-value = 0 is less than 0.05, we reject the null and conclude that the residuals are autocorrelated.

Test for Heteroscedasticity

Breusch-Pagan Test

studentized Breusch-Pagan test

data: NOYANKSMODELF
BP = 16.189, df = 13, p-value = 0.2391
The plots we are interested in are at the top-left and bottom-left. The top-left is the chart of residuals
vs fitted values, while the bottom-left one has standardised residuals on the Y axis. If there is
absolutely no heteroscedasticity, you should see a completely random, equal distribution of points
throughout the range of the X axis and a flat red line.
We also conducted the Breusch-Pagan test. As the p-value > 0.05, we can conclude from both the graphical
representation and the test that there is no heteroscedasticity.

Qqplot : Normality of residuals and Residual Histogram plot

……………………………………………………………………………………………………………………………………………………………………………

Question 5

Part 1 . Calculate the budget for which the box office success and failure are equally likely.

Logit function: ln(π/(1-π)) = β0 + β1*Budget

The left-hand side, ln(π/(1-π)), is the log of the odds.

When success and failure are equally likely, π = 1-π, so ln(π/(1-π)) = ln(1) = 0

β0 + β1*Budget = 0

Budget = -β0/β1 = -1.621/-0.016 = 101.3125 crore
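The break-even budget can be verified numerically. The sketch below assumes β0 = 1.621 and β1 = -0.016, as implied by the ratio above, with Budget entering the logit directly in crore; the helper names are illustrative.

```python
import math


def logistic(z):
    """Inverse of the logit: probability = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


BETA_0 = 1.621    # intercept, as implied by -β0/β1 above
BETA_1 = -0.016   # budget coefficient

# Success and failure are equally likely when the logit is zero.
break_even_budget = -BETA_0 / BETA_1
p_at_break_even = logistic(BETA_0 + BETA_1 * break_even_budget)

print(f"Break-even budget = {break_even_budget:.4f} crore")
print(f"Success probability at that budget = {p_at_break_even:.3f}")
```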

Part 5.2
Is there a sufficient evidence to conclude that the higher budget movies are more likely to fail at
the box-office?

H0: β1 = 0 (budget has no effect on box office success)

H1: β1 != 0
Decision rule: if p-value < 0.05, reject the null

Looking at Table 1, p-value = 0.046

As the p-value < 0.05, we reject the null. Since the budget coefficient is negative (β1 = -0.016), there is
evidence that higher budget movies are more likely to fail at the box office.

Question 5.3
A production house is making a movie with 100 crore budget; what is the success probability for this movie?

Probability = e to the power 1.621-0.016ln(100cr)/1+e1.621-0.0160.257

2.718 1.621-0.0160.257/1+2.718 1.621-0.0160.257 = 0.796265

Question 5.4 (4 Points)

Calculate the optimal cut-off probability when the cost of classifying failure at box office (0) as success at the
box office (1) is five times costlier than the cost of classifying success (1) as failure (0). Show all calculations.

Classifying 0 as 1 costs 5 times as much as classifying 1 as 0.

P01 = number of 0s classified as 1 (false positives)

P10 = number of 1s classified as 0 (false negatives)

C01 = Cost of classifying 0 as 1 (5 times)

C10 = Cost of classifying 1 as 0

As per the question above, misclassifying one 0 as 1 costs as much as misclassifying five 1's as 0's.

Sensitivity: Proportion of 1s that are correctly identified

Specificity: Proportion of 0s that are correctly identified

Optimal cut-off

Min[P00*C00 + P01*C01 + P10*C10 + P11*C11]

Correct classifications carry no cost, so Total cost = P01*C01 + P10*C10.

Cut-off   P10   P01   C01   C10   Total cost
0.5       3     17    5     1     88
0.6       6     14    5     1     76
0.7       11    13    5     1     76
0.8       23    0     5     1     23

As per this, 0.8 is the optimal cut-off.
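The cost comparison in the table can be scripted as below. This is a minimal sketch: the rows are transcribed from the table (first column read as the cut-off), correct classifications are assumed to cost nothing, and the names are illustrative.

```python
COST_01 = 5  # cost of classifying a failure (0) as a success (1)
COST_10 = 1  # cost of classifying a success (1) as a failure (0)

# (cut-off, P10, P01) transcribed from the table above.
rows = [(0.5, 3, 17), (0.6, 6, 14), (0.7, 11, 13), (0.8, 23, 0)]


def total_cost(p10, p01):
    """Misclassification cost, assuming correct classifications cost nothing."""
    return p01 * COST_01 + p10 * COST_10


costs = {cutoff: total_cost(p10, p01) for cutoff, p10, p01 in rows}
best_cutoff = min(costs, key=costs.get)

for cutoff, cost in costs.items():
    print(f"cut-off {cutoff}: total cost {cost}")
print(f"Optimal cut-off: {best_cutoff}")
```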

Question 5.5 Calculate the difference in success probabilities for movies with item song and movies without
item song.

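Probability of success for movies with an item song = e^(1.099 - 0.501*1) / (1 + e^(1.099 - 0.501*1)) = 0.645

Probability of success for movies without an item song = e^1.099 / (1 + e^1.099) = 0.750

Difference = 0.645 - 0.750 = -0.105, i.e. movies with an item song have a success probability lower by about 0.105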
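The two probabilities follow from the logistic function applied to the fitted logit. The sketch below assumes the coefficients used in the answer (intercept 1.099, item song coefficient -0.501); the function and variable names are hypothetical.

```python
import math


def logistic(z):
    """Probability = e^z / (1 + e^z)."""
    return math.exp(z) / (1.0 + math.exp(z))


INTERCEPT = 1.099
ITEM_SONG_COEF = -0.501

p_with_song = logistic(INTERCEPT + ITEM_SONG_COEF * 1)
p_without_song = logistic(INTERCEPT + ITEM_SONG_COEF * 0)
difference = p_with_song - p_without_song

print(f"P(success | item song)    = {p_with_song:.3f}")
print(f"P(success | no item song) = {p_without_song:.3f}")
print(f"Difference                = {difference:.3f}")
```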

Question 5.6 (2 points)

Which is a better model (budget as an independent variable vs item song as an independent variable).
Clearly state your reasons.

The model with item song is better, because

1. It classifies more 0's correctly, which is desirable

2. The significance of Budget is 0.05, whereas for item song it is 0.013. Item song has the lower p-value and is
therefore the stronger predictor

Question 5.7

Consider all the information in tables 1 to 7, which model you would recommend to predict the movie
success at the box office? Clearly state your reasons.

The model in Tables 6 and 7 is recommended for predicting movie success, as its accuracy in classifying success
and failure is the highest (74.6%). The Wald statistics and significance levels shown in Table 7 also support
this model.

……………………………………………………………………………………………………………………………………………………………………………

Question 6

Read the case,“Breaking Barriers – Micro-mortgage analytics”. Using the data provided, develop a credit
rating model that Shubham can use

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.917e+00 7.677e-01 10.313 < 2e-16 ***
LTV -7.001e-02 6.663e-03 -10.507 < 2e-16 ***
dwnp_prop_p -7.486e-02 7.329e-03 -10.214 < 2e-16 ***
IAR 6.028e-02 6.321e-03 9.536 < 2e-16 ***
IIR -9.964e-02 1.177e-02 -8.464 < 2e-16 ***
BankSave 9.880e-06 3.373e-06 2.929 0.003402 **
Tier2 4.314e-01 1.888e-01 2.286 0.022283 *
Tier3 -5.935e-01 1.724e-01 -3.441 0.000579 ***
Employment_TypeSelf_Employed -6.406e-01 1.536e-01 -4.169 3.05e-05 ***
Accommodation_ClassRented 4.292e-01 1.520e-01 2.823 0.004754 **
GenderMale 7.103e-01 2.827e-01 2.513 0.011987 *
Age_11 -3.336e-01 1.606e-01 -2.078 0.037745 *
Age_12 9.539e-02 2.595e-01 0.368 0.713148
Age_13 -2.470e+00 1.154e+00 -2.140 0.032379 *
Loan_TypeHome_Loan 6.156e-01 2.838e-01 2.169 0.030087 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 1750.1 on 1778 degrees of freedom

Residual deviance: 1246.1 on 1764 degrees of freedom
AIC: 1276.1

Number of Fisher Scoring iterations: 8

We built a logistic regression model using stepwise selection in R. Age was changed to a categorical variable.

Code sheet

AGE_1
0 Age group 20-39
1 Age group 40-49
2 Age group 50-59
3 Age group 60+

Confidence intervals (confint) of the model on the training data set:

2.5 % 97.5 %
(Intercept) 6.461194e+00 9.473085e+00
LTV -8.361814e-02 -5.748642e-02
dwnp_prop_p -8.972646e-02 -6.097005e-02
IAR 4.817255e-02 7.296643e-02
IIR -1.231476e-01 -7.696668e-02
BankSave 4.686608e-06 1.790836e-05
Tier2 6.264465e-02 8.034187e-01
Tier3 -9.334156e-01 -2.568660e-01
Employment_TypeSelf_Employed -9.442884e-01 -3.415130e-01
Accommodation_ClassRented 1.318359e-01 7.282810e-01
GenderMale 1.764476e-01 1.289141e+00
Age_11 -6.475143e-01 -1.742790e-02
Age_12 -4.026493e-01 6.168986e-01
Age_13 -4.677746e+00 -4.202113e-02
Loan_TypeHome_Loan 5.372949e-02 1.168862e+00

We ran a confusion matrix on train data which was as follows :

FALSE TRUE
0 136 209
1 46 1388

This is at a cut-off of 0.5. To find the optimal cut-off, we use the Youden index.

cutoff P10 P01 msclaf.cost P11 P00 youden.index

[1,] 0.05 0.0000000000 0.93913043 1.8782609 1.0000000 0.06086957 0.06086957
[2,] 0.10 0.0000000000 0.91304348 1.8260870 1.0000000 0.08695652 0.08695652
[3,] 0.15 0.0006973501 0.89565217 1.7920017 0.9993026 0.10434783 0.10365048
[4,] 0.20 0.0020920502 0.88695652 1.7760051 0.9979079 0.11304348 0.11095143
[5,] 0.25 0.0034867503 0.86956522 1.7426172 0.9965132 0.13043478 0.12694803
[6,] 0.30 0.0048814505 0.85217391 1.7092293 0.9951185 0.14782609 0.14294464
[7,] 0.35 0.0069735007 0.80579710 1.6185677 0.9930265 0.19420290 0.18722940
[8,] 0.40 0.0160390516 0.74202899 1.5000970 0.9839609 0.25797101 0.24193196
[9,] 0.45 0.0237099024 0.67246377 1.3686374 0.9762901 0.32753623 0.30382633
[10,] 0.50 0.0334728033 0.58260870 1.1986902 0.9665272 0.41739130 0.38391850
[11,] 0.55 0.0502092050 0.53043478 1.1110788 0.9497908 0.46956522 0.41935601
[12,] 0.60 0.0725244073 0.46956522 1.0116548 0.9274756 0.53043478 0.45791038
[13,] 0.65 0.0885634589 0.42608696 0.9407374 0.9114365 0.57391304 0.48534958
[14,] 0.70 0.1317991632 0.35072464 0.8332484 0.8682008 0.64927536 0.51747620
[15,] 0.75 0.1834030683 0.28405797 0.7515190 0.8165969 0.71594203 0.53253896
[16,] 0.80 0.2482566248 0.21449275 0.6772421 0.7517434 0.78550725 0.53725062
[17,] 0.85 0.3277545328 0.14782609 0.6234067 0.6722455 0.85217391 0.52441938
[18,] 0.90 0.4497907950 0.09275362 0.6352980 0.5502092 0.90724638 0.45745558
[19,] 0.95 0.6171548117 0.05507246 0.7272997 0.3828452 0.94492754 0.32777272
[20,] 1.00 NA NA NA NA NA NA

Going ahead with highest youden index, we go with cut off of 0.8 which gives the following confusion matrix on
train data

FALSE TRUE
0 271 74
1 356 1078

Applying the same to Test data, we get the following confusion matrix and ROC plot with AUC

FALSE TRUE
0 59 27
1 99 259

This gives 72% accuracy
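The accuracy figure, together with sensitivity, specificity and the Youden index, can be recomputed from the test confusion matrix. This sketch assumes rows are the actual classes (0, 1) and columns are the predictions (FALSE, TRUE), which is how the matrix above appears to be laid out.

```python
# Test-data confusion matrix at cut-off 0.8.
# Assumption: rows = actual class (0, 1), columns = predicted (FALSE, TRUE).
true_neg, false_pos = 59, 27
false_neg, true_pos = 99, 259

total = true_neg + false_pos + false_neg + true_pos
accuracy = (true_neg + true_pos) / total
sensitivity = true_pos / (true_pos + false_neg)   # share of 1s identified
specificity = true_neg / (true_neg + false_pos)   # share of 0s identified
youden_index = sensitivity + specificity - 1

print(f"Accuracy     = {accuracy:.3f}")
print(f"Sensitivity  = {sensitivity:.3f}")
print(f"Specificity  = {specificity:.3f}")
print(f"Youden index = {youden_index:.3f}")
```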

ROC PLOT : AUC IS 0.77

Deployment Strategy

Based on the probability we will divide the customers into 3 categories

            Low Risk      Medium Risk      High Risk

Prob.       > 0.75        0.50 - 0.70      < 0.50

Low Risk: Loan can be approved.

Medium Risk: We should charge a higher processing fee and also increase the down payment.

High Risk: Anyone with a probability of less than 0.50. The model is not able to classify these too well.

We can apply the following thumb rules before processing the loan:

• Give LTV higher weightage, checking the applicant's requirement vis-a-vis the market value of the
property
• Processing fee and down payment should be increased
• IIR and IAR should also be assigned weightage

………………………………………………………………………………………………………………………...................................................

Question 7

Comment whether the marital status has any statistical significance on the probability of loan denial. Clearly
state your reasons.

For loan denial, we take model 1

Marital status is not a significant variable, as shown by its low Wald statistic.

Significance = 0.17 > 0.05, therefore not significant

Q 7.2

What percentage of the applicants with a DI=20, LTV = 0.5, IIR = 0.8, MS = 0 and Old EMI = 0 will be given a
loan at 18% interest? Use only statistically significant variables and assume that the changes in the
coefficient values are negligible due to dropping of insignificant variables.

To calculate the percentage of applicants given a loan at 18%:

P(loan at 18%) = 1 - [P(denial) + P(loan at 14%)]

Z (loan denial) = 1.720 - 0.120*20 - 0.521*0.5 - 0.220*0.8 - 1.120*0 = -1.1165

P(denial) = e^z / (1 + e^z) = 0.246

Z (loan at 14%) = 0.650 - 0.580*20 = -10.95, so P(loan at 14%) = e^z / (1 + e^z) ≈ 1.76E-05

Therefore P(loan at 18%) = 1 - [0.246 + 1.76E-05] = 0.75

About 75% of applicants will be given a loan at 18%.
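The arithmetic in this answer can be reproduced in a few lines. The sketch below uses the coefficients exactly as they appear in the calculation above and follows the same approach (each probability from its own logit, then the complement); the helper name `logistic` is illustrative.

```python
import math


def logistic(z):
    """Convert a logit z into a probability e^z / (1 + e^z)."""
    return math.exp(z) / (1 + math.exp(z))


# Logit for loan denial with DI=20, LTV=0.5, IIR=0.8 and the last term set to 0.
z_denial = 1.720 - 0.120 * 20 - 0.521 * 0.5 - 0.220 * 0.8 - 1.120 * 0
p_denial = logistic(z_denial)

# Logit for a loan at 14% interest with DI=20.
z_14 = 0.650 - 0.580 * 20
p_14 = logistic(z_14)

# Remaining probability mass is assigned to a loan at 18%.
p_18 = 1 - (p_denial + p_14)
print(f"z_denial={z_denial:.4f} p_denial={p_denial:.4f} p_14={p_14:.3e} p_18={p_18:.4f}")
```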

……………………………………………………………………………………………………………………………………………………………………………

Question 8

Read the case, “Fraud analytics at MCA technology solutions – Predicting Earnings Manipulation by Indian
Firms”. Develop a model using logistic regression and discriminant analysis to predict fraudulent
transactions. (Ref: Moodle/Cases and Materials/Module 3)

Stepwise logistic model :

Model without bagging

Call:
glm(formula = Manipulater ~ DSRI + SGI + ACCR + GMI + AQI, family = binomial,
data = data.train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.9708  -0.2052  -0.1632  -0.1334   3.0468

Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  -6.51768     0.71995   -9.053   < 2e-16  ***
DSRI          0.79368     0.15092    5.259  1.45e-07  ***
SGI           0.84955     0.25477    3.335  0.000854  ***
ACCR          5.89684     1.35669    4.346  1.38e-05  ***
GMI           0.45026     0.27713    1.625  0.104222
AQI           0.24252     0.09298    2.608  0.009102  **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 282.73 on 991 degrees of freedom

Residual deviance: 190.10 on 986 degrees of freedom
AIC: 202.1

Confusion matrix for this model on train data


       FALSE   TRUE
No       957      3
Yes       23      9

We need to reduce the number of actual manipulators (Yes) that are predicted as non-manipulators (23 of 32 here).

Since a single sample may not be representative, we apply bagging over 100 samples to obtain a more reliable
confusion matrix.

Post bagging, the confusion matrix on train data is:

Confusion Matrix and Statistics

Reference
Prediction No Yes
No 960 19
Yes 0 13

Accuracy : 0.9808
95% CI : (0.9703, 0.9884)
No Information Rate : 0.9677
P-Value [Acc > NIR] : 0.008494

Kappa : 0.5698
Mcnemar's Test P-Value : 3.636e-05

Sensitivity : 1.0000
Specificity : 0.4062
Pos Pred Value : 0.9806
Neg Pred Value : 1.0000
Prevalence : 0.9677
Detection Rate : 0.9677
Detection Prevalence : 0.9869
Balanced Accuracy : 0.7031

'Positive' Class : No

We applied the model to the test data to check its accuracy. The confusion matrix for the test data is:

Confusion Matrix and Statistics

Reference
Prediction No Yes
No 240 5
Yes 0 2

Accuracy : 0.9798
95% CI : (0.9534, 0.9934)
No Information Rate : 0.9717
P-Value [Acc > NIR] : 0.29703

Kappa : 0.4374
Mcnemar's Test P-Value : 0.07364

Sensitivity : 1.0000
Specificity : 0.2857
Pos Pred Value : 0.9796
Neg Pred Value : 1.0000
Prevalence : 0.9717
Detection Rate : 0.9717
Detection Prevalence : 0.9919
Balanced Accuracy : 0.6429

'Positive' Class : No
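Because the output treats 'No' as the positive class, the reported specificity is in effect the share of actual manipulators that the model catches. A minimal sketch recomputing the headline statistics from the four test counts above (the function name `binary_metrics` is illustrative, not the routine that produced the output):

```python
def binary_metrics(tp, fn, fp, tn):
    """Basic metrics for a 2x2 confusion matrix with a chosen positive class."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Positive class = 'No' (non-manipulator), counts from the test matrix:
# predicted No / actual No = 240, predicted Yes / actual No = 0,
# predicted No / actual Yes = 5,  predicted Yes / actual Yes = 2.
metrics = binary_metrics(tp=240, fn=0, fp=5, tn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Read this way, a specificity of 0.2857 means 2 of the 7 actual manipulators in the test data were flagged.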

When run on the test data, the model achieved 97.98% accuracy, but it identified only 2 of the 7 manipulators
(specificity 0.2857, with 'No' as the positive class).

We also plotted the ROC curve and computed the AUC.

This indicates that the variables used in the model are significant and can be used in the deployment strategy,
with weightage and scores assigned to these variables to identify manipulators.

……………………………………………………………………………………………………………………………………………………………………………
……………………………………………………………………………………………………………………………………………………………………………

End of assignment
