SM-2 Homework 4
Q1) The lathe1 data set from the alr4 package contains the results of an experiment on
characterizing the life of a drill bit in cutting steel on a lathe. Two factors were varied in
the experiment, Speed and Feed rate. The response is Life, the total time until the drill bit
fails, in minutes.
For all the questions below, consider log(Life) as the response and the following second-order model:

E(log(Life) | Speed, Feed) = β0 + β1 Speed + β2 Feed + β11 Speed² + β22 Feed² + β12 Speed × Feed
(a) State the null and alternative hypotheses for the overall F-test for this model. Perform
the test and summarize results.
Answer
For the overall F-test, the hypotheses are:
H0: β1 = β2 = β11 = β22 = β12 = 0
H1: At least one of the five β’s is not equal to 0
library(alr4)
# Part a: overall F-test comparing the intercept-only model to the full model
y  <- log(lathe1$Life)
x1 <- lathe1$Speed
x2 <- lathe1$Feed
x3 <- x1^2; x4 <- x2^2; x5 <- x1 * x2   # quadratic and interaction terms
mod.reduced <- lm(y ~ 1)
mod.full    <- lm(y ~ x1 + x2 + x3 + x4 + x5)
anova(mod.reduced, mod.full)
From the ANOVA table, we observe that the p-value is very small (close to 0). Hence, we reject
the null hypothesis and conclude that at least one of the five predictors is significant in the
model.
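Equivalently, the same overall F-statistic is reported in the last line of summary() for the full model; as a quick sketch (assuming mod.full is fitted as above):

```r
# Overall F-statistic, numerator df, and denominator df
summary(mod.full)$fstatistic
```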
(b) Perform a suitable test to find if both the square terms and the interactions are
significant or not. Write down the hypotheses and the conclusion clearly.
Answer
For the new test, the hypotheses are:
H0: β11= β22= β12=0
H1: At least one of the above 3 β’s is not equal to 0
# Part b: partial F-test for the quadratic and interaction terms
mod.new  <- lm(y ~ x1 + x2)
mod.full <- lm(y ~ x1 + x2 + x3 + x4 + x5)
anova(mod.new, mod.full)
From the ANOVA table, we observe that the p-value is very small (less than 5%). Hence, we reject
the null hypothesis and conclude that at least one of Speed², Feed² and Speed × Feed is
significant.
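The same partial F-statistic can be computed by hand from the two residual sums of squares, which makes the structure of the test explicit (a sketch assuming mod.new and mod.full are fitted as above; the numerator df is 3 because three coefficients are tested):

```r
# Partial F-test computed from the residual sums of squares
rss.red  <- sum(resid(mod.new)^2)
rss.full <- sum(resid(mod.full)^2)
df.full  <- df.residual(mod.full)
F.stat   <- ((rss.red - rss.full) / 3) / (rss.full / df.full)
p.value  <- pf(F.stat, 3, df.full, lower.tail = FALSE)
c(F = F.stat, p = p.value)
```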
Q2) The dataset copier.csv gives the relationship between the number of minutes spent on
the service call (Y ), the number of copiers serviced (X1) and type of copier (X2), which can
either be small (S) or large (L). Answer the following questions.
a) Fit a linear regression and write down the fitted model.
Answer.
## Y X1 X2
## 1 20 2 1
## 2 60 4 0
## 3 46 3 0
## 4 41 2 0
## 5 12 1 0
## 6 137 10 0
# a) Fit the model; X2 is an indicator for copier type (1 = small, 0 = large)
data <- read.csv("copier.csv")
fit  <- lm(data[, 1] ~ data$X1 + data$X2)
summary(fit)
##
## Call:
## lm(formula = data[, 1] ~ data$X1 + data$X2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.5390 -4.2515 0.5995 6.5995 14.9330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.9225 3.0997 -0.298 0.767
## data$X1 15.0461 0.4900 30.706 <2e-16 ***
## data$X2 0.7587 2.7799 0.273 0.786
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.011 on 42 degrees of freedom
## Multiple R-squared: 0.9576, Adjusted R-squared: 0.9556
## F-statistic: 473.9 on 2 and 42 DF, p-value: < 2.2e-16
b) Write down separate estimated regression equations for “small” copiers and “large”
copiers.
Answer
The estimated regression equations are:
Small copiers (X2 = 1): 𝑌̂ = -0.1638 + 15.0461*X1
Large copiers (X2 = 0): 𝑌̂ = -0.9225 + 15.0461*X1
c) How will you interpret the coefficient attached to X2?
Answer
The coefficient attached to X2 is the estimated mean change in the number of minutes spent on
the service call when the copier type changes from large (X2 = 0) to small (X2 = 1), holding the
number of copiers serviced constant. Note that this coefficient is not statistically significant
here (p = 0.786), so there is little evidence that copier type affects service time.
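An equivalent, arguably more idiomatic fit treats copier type as a labelled factor; this is a sketch assuming the data frame data read above and that X2 = 1 codes "small" (consistent with the fitted intercepts):

```r
# Hypothetical refit with copier type as a factor; "large" (X2 = 0) is the
# reference level, so the "small" slope equals the X2 coefficient above
data$type <- factor(data$X2, levels = c(0, 1), labels = c("large", "small"))
fit2 <- lm(data[, 1] ~ data$X1 + data$type)
coef(fit2)
```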
Q3) A marketing research trainee in the national office of a chain of shoe stores used the
following response function to study seasonal (winter, spring, summer, fall) effects on sales
of a certain line of shoes: E(Y) = β0 + β1X1 + β2X2 + β3X3. The X’s are indicator
variables defined as follows:
Q4) The Highway dataset in the alr4 package gives a relationship between the accident rate
(rate) and several other predictors. Specifically, pick the following five predictors for this
problem: ‘trks’, ‘lane’, ‘acpt’, ‘sigs’ and ‘itg’. Implement the following variable selection
methods to determine the “best” model under each case.
(a) Stepwise regression with AIC
Answer
According to AIC, the final model includes acpt, trks and sigs as predictors:
𝑌̂ = 3.9663 + 0.1248*acpt - 0.1892*trks + 0.5373*sigs
library(alr4)
library(leaps)
data <- Highway
y    <- Highway$rate
trks <- data$trks
lane <- data$lane
acpt <- data$acpt
sigs <- data$sigs
itg  <- data$itg
# Part a: stepwise regression with AIC
mod0      <- lm(y ~ 1)
mod.upper <- lm(y ~ trks + lane + acpt + sigs + itg)
step(mod0, scope = list(lower = mod0, upper = mod.upper))
## Start: AIC=54.51
## y ~ 1
##
## Df Sum of Sq RSS AIC
## + acpt 1 84.767 65.119 23.994
## + sigs 1 47.759 102.127 41.543
## + trks 1 39.372 110.514 44.622
## <none> 149.886 54.506
## + lane 1 0.163 149.723 56.464
## + itg 1 0.092 149.794 56.482
##
## Step: AIC=23.99
## y ~ acpt
##
## Df Sum of Sq RSS AIC
## + trks 1 10.053 55.066 19.454
## + sigs 1 7.160 57.959 21.451
## <none> 65.119 23.994
## + itg 1 2.466 62.653 24.488
## + lane 1 2.411 62.708 24.522
## - acpt 1 84.767 149.886 54.506
##
## Step: AIC=19.45
## y ~ acpt + trks
##
## Df Sum of Sq RSS AIC
## + sigs 1 2.936 52.130 19.317
## <none> 55.066 19.454
## + itg 1 1.210 53.856 20.587
## + lane 1 0.614 54.452 21.017
## - trks 1 10.053 65.119 23.994
## - acpt 1 55.448 110.514 44.622
##
## Step: AIC=19.32
## y ~ acpt + trks + sigs
##
## Df Sum of Sq RSS AIC
## <none> 52.130 19.317
## - sigs 1 2.936 55.066 19.454
## + itg 1 0.702 51.428 20.788
## + lane 1 0.028 52.101 21.295
## - trks 1 5.829 57.959 21.451
## - acpt 1 37.448 89.578 38.430
##
## Call:
## lm(formula = y ~ acpt + trks + sigs)
##
## Coefficients:
## (Intercept) acpt trks sigs
## 3.9663 0.1248 -0.1892 0.5373
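As a sanity check, the AIC values in the trace follow R's extractAIC convention for lm fits, AIC = n·log(RSS/n) + 2p, where p is the number of fitted coefficients. A sketch for the final model, assuming the variables defined above:

```r
# Reproduce the step() AIC for the final model y ~ acpt + trks + sigs
mod.final <- lm(y ~ acpt + trks + sigs)
n   <- length(y)
rss <- sum(resid(mod.final)^2)
p   <- length(coef(mod.final))      # number of fitted coefficients
n * log(rss / n) + 2 * p            # ~19.32, matching the trace above
```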
(b) Stepwise regression with t-test p-values
Answer
According to the t-test p-values, the final model includes acpt and trks as predictors.
#Part b
add1(mod0,~.+trks+lane+acpt+sigs+itg,test='F')
mod1 = lm(y~acpt)
add1(mod1,~.+trks+lane+sigs+itg ,test ='F')
mod2=lm(y~acpt+trks)
summary(mod2)
##
## Call:
## lm(formula = y ~ acpt + trks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0610 -0.9655 0.1222 0.6568 3.1717
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.42932 1.00856 4.392 9.47e-05 ***
## acpt 0.13896 0.02308 6.021 6.52e-07 ***
## trks -0.23418 0.09134 -2.564 0.0147 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 36 degrees of freedom
## Multiple R-squared: 0.6326, Adjusted R-squared: 0.6122
## F-statistic: 30.99 on 2 and 36 DF, p-value: 1.487e-08
add1(mod2,~.+lane+sigs+itg,test = 'F')
# We do not add any more predictors, as the p-values of the remaining
# predictors are all above the 0.15 (15%) entry threshold
# Part c: best subset regression, ranked by adjusted R^2
mod.subsets <- regsubsets(y ~ trks + lane + acpt + sigs + itg, data = data)
summary.mod <- summary(mod.subsets)
summary.mod$adjr2
## [1] 0.5538002 0.6122046 0.6223945 0.6165208 0.6070799
The largest adjusted R² is 0.6223945, attained by the third model. Hence, under best subset
regression with adjusted R², the best model is the one in the third row of the subsets summary,
i.e., the best three-predictor model.
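Assuming summary.mod is the regsubsets summary referenced above, the predictors making up that third-row model can be listed directly:

```r
# Logical row: TRUE marks the terms included in the best three-predictor model
summary.mod$which[3, ]
```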
Q5) (a) In stepwise regression, what advantage is there in using a relatively large α value to
add variables? Comment briefly.
Answer
A relatively large α makes it easier for predictors to enter the model at each step, since the
partial F-test (or t-test) threshold for entry is less stringent. This reduces the risk of
underfitting, i.e., of omitting predictors that genuinely belong in the model.
(b) Do the stepwise procedures using AIC and t-test p-values yield the same final model
every time? If not, how will you decide on which model to pick?
Answer
No, the stepwise procedures using AIC and t-test p-values need not yield the same final model;
the two final models may differ.
If such a case arises, we rely on prior subject-matter knowledge to choose between them. If we
know beforehand that a certain predictor must be included in the final model, we pick the model
containing that predictor. Otherwise, by parsimony, we pick the model with fewer predictors.