
Anubhav(2021IPM019)

SM-2 Homework-4

Q1) The lathe1 data set from the alr4 package contains the results of an experiment on
characterizing the life of a drill bit in cutting steel on a lathe. Two factors were varied in
the experiment, Speed and Feed rate. The response is Life, the total time until the drill bit
fails, in minutes.
For all of the questions below, consider log(Life) as the response and the following second
order model:
E(log(Life) | Speed, Feed) = β0 + β1 Speed + β2 Feed + β11 Speed^2 + β22 Feed^2 + β12 Speed × Feed
(a) State the null and alternative hypotheses for the overall F-test for this model. Perform
the test and summarize the results.
Answer
For the overall F-test, the hypotheses are:
H0: β1 = β2 = β11 = β22 = β12 = 0
H1: At least one of the five β's is not equal to 0

library(alr4)

## Warning: package 'alr4' was built under R version 4.1.3

## Loading required package: car

## Warning: package 'car' was built under R version 4.1.3

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.1.3

## Loading required package: effects

## Warning: package 'effects' was built under R version 4.1.3

## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
data = lathe1
y = log(data$Life)            # response: log of drill-bit life
x1 = data$Speed
x2 = data$Feed
x3 = (data$Speed)^2
x4 = (data$Feed)^2
x5 = data$Speed * data$Feed   # interaction term

#Part a
mod.reduced = lm(y ~ 1)
mod.full = lm(y ~ x1 + x2 + x3 + x4 + x5)

anova(mod.reduced, mod.full)

## Analysis of Variance Table
##
## Model 1: y ~ 1
## Model 2: y ~ x1 + x2 + x3 + x4 + x5
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 19 41.533
## 2 14 1.237 5 40.296 91.236 3.551e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table, we observe that the p-value (3.551e-10) is practically zero. Hence, we
reject the null hypothesis and conclude that at least one of the five coefficients is
significantly different from zero.
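As a sanity check, the F statistic in the table can be reproduced directly from the two residual sums of squares. A quick numeric check (sketched in Python, with the RSS values copied from the ANOVA output above):

```python
# Reproduce the overall F-statistic from the ANOVA table above.
rss_reduced = 41.533   # RSS of the intercept-only model (df = 19)
rss_full = 1.237       # RSS of the full second-order model (df = 14)
df_num = 19 - 14       # five coefficients are tested
df_den = 14

F = ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)
print(round(F, 1))     # ~91.2, agreeing with the reported 91.236 up to rounding
```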

(b) Perform a suitable test to find if both the square terms and the interactions are
significant or not. Write down the hypotheses and the conclusion clearly.
Answer
For the new test, the hypotheses are:
H0: β11= β22= β12=0
H1: At least one of the above 3 β’s is not equal to 0
#Part b
mod.new = lm(y ~ x1 + x2)
mod.full = lm(y ~ x1 + x2 + x3 + x4 + x5)
anova(mod.new, mod.full)

## Analysis of Variance Table
##
## Model 1: y ~ x1 + x2
## Model 2: y ~ x1 + x2 + x3 + x4 + x5
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 17 3.7431
## 2 14 1.2367 3 2.5065 9.4583 0.001138 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA table, we observe that the p-value (0.001138) is well below 5%. Hence, we
reject the null hypothesis and conclude that at least one of Speed^2, Feed^2 and Speed×Feed
is significant.
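The same hand computation works for this partial F-test, now with three coefficients tested (RSS values copied from the table above; again sketched in Python):

```python
# Reproduce the partial F-statistic for H0: b11 = b22 = b12 = 0.
rss_reduced = 3.7431   # model with Speed and Feed only (df = 17)
rss_full = 1.2367      # full second-order model (df = 14)
q = 17 - 14            # three coefficients are tested

F = ((rss_reduced - rss_full) / q) / (rss_full / 14)
print(round(F, 2))     # ~9.46, agreeing with the reported 9.4583
```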

Q2) The dataset copier.csv gives the relationship between the number of minutes spent on
the service call (Y ), the number of copiers serviced (X1) and type of copier (X2), which can
either be small (S) or large (L). Answer the following questions.
a) Fit a linear regression and write down the fitted model.
Answer

𝑌̂ = -0.9225 + 15.0461*x1 + 0.7587*x2

Here, x2 is an indicator (dummy) variable: x2 = 1 for large copiers and x2 = 0 for small copiers.


data = read.csv("copier.csv" , header = TRUE)
head(data)

## Y X1 X2
## 1 20 2 1
## 2 60 4 0
## 3 46 3 0
## 4 41 2 0
## 5 12 1 0
## 6 137 10 0
#a
fit = lm(data[,1] ~ data$X1 + data$X2)
summary(fit)

##
## Call:
## lm(formula = data[, 1] ~ data$X1 + data$X2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.5390 -4.2515 0.5995 6.5995 14.9330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.9225 3.0997 -0.298 0.767
## data$X1 15.0461 0.4900 30.706 <2e-16 ***
## data$X2 0.7587 2.7799 0.273 0.786
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.011 on 42 degrees of freedom
## Multiple R-squared: 0.9576, Adjusted R-squared: 0.9556
## F-statistic: 473.9 on 2 and 42 DF, p-value: < 2.2e-16

b) Write down separate estimated regression equations for “small” copiers and “large”
copiers.
Answer
The estimated regression equation for “small” copiers is:

𝑌̂=-0.9225 + 15.0461*x1 (as x2=0 for ‘small’ copiers)

The estimated regression equation for “large” copiers is:

𝑌̂=-0.9225 + 15.0461*x1 + 0.7587*1 (as x2=1 for ‘large’ copiers)

𝑌̂=-0.1638 + 15.0461*x1
c) How will you interpret the coefficient attached to X2?
Answer
The coefficient of x2 (0.7587) is the estimated mean difference in the number of minutes spent
on the service call between large and small copiers, holding the number of copiers serviced
(x1) constant.
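Numerically, this coefficient is exactly the vertical gap between the fitted lines for large and small copiers at any fixed x1. A small check (in Python, using the estimates above; the value x1 = 5 is an arbitrary illustration):

```python
# Fitted model: Y-hat = -0.9225 + 15.0461*x1 + 0.7587*x2, with x2 = 1 for large copiers.
b0, b1, b2 = -0.9225, 15.0461, 0.7587

def yhat(x1, large):
    return b0 + b1 * x1 + b2 * (1 if large else 0)

x1 = 5  # arbitrary number of copiers serviced
gap = yhat(x1, large=True) - yhat(x1, large=False)
print(round(gap, 4))  # 0.7587, the x2 coefficient, whatever x1 is
```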

Q3) A marketing research trainee in the national office of a chain of shoe stores used the
following response function to study seasonal (winter, spring, summer, fall) effects on sales
of a certain line of shoes: E(Y) = β0 + β1X1 + β2X2 + β3X3. The X's are indicator
variables defined as follows:

a) State the response functions for the four types of seasons.
Answer
The response functions for the four types of seasons are as follows:
Winter: E(Y) = β0 + β1
Spring: E(Y) = β0 + β2
Fall: E(Y) = β0 + β3
Summer: E(Y) = β0
b) Interpret each of the following quantities:
(i) β0 - β0 represents the mean sales of this line of shoes during the summer season (the baseline).
(ii) β1 - β1 represents the difference in mean sales between the winter and summer seasons.
(iii) β2 - β2 represents the difference in mean sales between the spring and summer seasons.
(iv) β3 - β3 represents the difference in mean sales between the fall and summer seasons.
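These interpretations follow from subtracting the summer (baseline) response function from each of the other three. A small sketch (in Python, with hypothetical seasonal means, not data from the problem):

```python
# Hypothetical seasonal mean sales, for illustration only.
means = {"winter": 120.0, "spring": 100.0, "fall": 110.0, "summer": 90.0}

b0 = means["summer"]        # baseline: summer mean
b1 = means["winter"] - b0   # winter minus summer
b2 = means["spring"] - b0   # spring minus summer
b3 = means["fall"] - b0     # fall minus summer

# The dummy-coded model recovers each season's mean exactly:
assert b0 + b1 == means["winter"]
assert b0 + b2 == means["spring"]
assert b0 + b3 == means["fall"]
print(b0, b1, b2, b3)  # 90.0 30.0 10.0 20.0
```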

Q4) The Highway dataset in the alr4 package gives a relationship between the accident rate
(rate) and several other predictors. Specifically, pick the following five predictors for this
problem: ‘trks’, ‘lane’, ‘acpt’, ‘sigs’ and ‘itg’. Implement the following variable selection
methods to determine the “best” model under each case.
(a) Stepwise regression with AIC
Answer
According to AIC, the final model includes acpt, trks, sigs as the predictors.

𝑌̂ = 3.9663 + 0.1248*acpt - 0.1892*trks + 0.5373*sigs

library(alr4)


library(leaps)

## Warning: package 'leaps' was built under R version 4.1.3

data = Highway
y = Highway$rate
trks= data$trks
lane=data$lane
acpt=data$acpt
sigs=data$sigs
itg =data$itg

#Part a

mod0<-lm(y~1)
mod.upper<-lm(y~trks+lane+acpt+sigs+itg)
step(mod0,scope=list(lower=mod0,upper=mod.upper))

## Start: AIC=54.51
## y ~ 1
##
## Df Sum of Sq RSS AIC
## + acpt 1 84.767 65.119 23.994
## + sigs 1 47.759 102.127 41.543
## + trks 1 39.372 110.514 44.622
## <none> 149.886 54.506
## + lane 1 0.163 149.723 56.464
## + itg 1 0.092 149.794 56.482
##
## Step: AIC=23.99
## y ~ acpt
##
## Df Sum of Sq RSS AIC
## + trks 1 10.053 55.066 19.454
## + sigs 1 7.160 57.959 21.451
## <none> 65.119 23.994
## + itg 1 2.466 62.653 24.488
## + lane 1 2.411 62.708 24.522
## - acpt 1 84.767 149.886 54.506
##
## Step: AIC=19.45
## y ~ acpt + trks
##
## Df Sum of Sq RSS AIC
## + sigs 1 2.936 52.130 19.317
## <none> 55.066 19.454
## + itg 1 1.210 53.856 20.587
## + lane 1 0.614 54.452 21.017
## - trks 1 10.053 65.119 23.994
## - acpt 1 55.448 110.514 44.622
##
## Step: AIC=19.32
## y ~ acpt + trks + sigs
##
## Df Sum of Sq RSS AIC
## <none> 52.130 19.317
## - sigs 1 2.936 55.066 19.454
## + itg 1 0.702 51.428 20.788
## + lane 1 0.028 52.101 21.295
## - trks 1 5.829 57.959 21.451
## - acpt 1 37.448 89.578 38.430

##
## Call:
## lm(formula = y ~ acpt + trks + sigs)
##
## Coefficients:
## (Intercept) acpt trks sigs
## 3.9663 0.1248 -0.1892 0.5373
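The AIC values printed in the trace can be reproduced by hand: for a linear model, R's step() ranks candidates by AIC = n*log(RSS/n) + 2p (up to an additive constant), where p is the number of estimated mean parameters. A quick check (in Python, with RSS values from the trace above; the Highway data has n = 39 rows):

```python
import math

n = 39  # rows in the Highway data

def aic(rss, p):
    # AIC as used by step() for linear models (up to an additive constant)
    return n * math.log(rss / n) + 2 * p

print(round(aic(149.886, 1), 2))  # ~54.51  (intercept-only start)
print(round(aic(65.119, 2), 2))   # ~23.99  (y ~ acpt)
print(round(aic(52.130, 4), 2))   # ~19.32  (y ~ acpt + trks + sigs)
```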
(b) Stepwise regression with t-test p-values
Answer
According to the t-test p-values, the final model includes acpt and trks as the predictors.

𝑌̂ = 4.42932 + 0.13896*acpt - 0.23418*trks

#Part b
add1(mod0,~.+trks+lane+acpt+sigs+itg,test='F')

## Single term additions
##
## Model:
## y ~ 1
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 149.886 54.506
## trks 1 39.372 110.514 44.622 13.1817 0.0008505 ***
## lane 1 0.163 149.723 56.464 0.0403 0.8420245
## acpt 1 84.767 65.119 23.994 48.1636 3.408e-08 ***
## sigs 1 47.759 102.127 41.543 17.3029 0.0001817 ***
## itg 1 0.092 149.794 56.482 0.0228 0.8806802
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

mod1 = lm(y~acpt)
add1(mod1,~.+trks+lane+sigs+itg ,test ='F')

## Single term additions
##
## Model:
## y ~ acpt
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 65.119 23.994
## trks 1 10.0532 55.066 19.454 6.5724 0.01468 *
## lane 1 2.4108 62.708 24.522 1.3840 0.24714
## sigs 1 7.1603 57.959 21.451 4.4475 0.04198 *
## itg 1 2.4664 62.653 24.488 1.4172 0.24165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

mod2=lm(y~acpt+trks)

summary(mod2)

##
## Call:
## lm(formula = y ~ acpt + trks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0610 -0.9655 0.1222 0.6568 3.1717
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.42932 1.00856 4.392 9.47e-05 ***
## acpt 0.13896 0.02308 6.021 6.52e-07 ***
## trks -0.23418 0.09134 -2.564 0.0147 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 36 degrees of freedom
## Multiple R-squared: 0.6326, Adjusted R-squared: 0.6122
## F-statistic: 30.99 on 2 and 36 DF, p-value: 1.487e-08

add1(mod2,~.+lane+sigs+itg,test = 'F')

## Single term additions
##
## Model:
## y ~ acpt + trks
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 55.066 19.454
## lane 1 0.61364 54.452 21.017 0.3944 0.5341
## sigs 1 2.93636 52.130 19.317 1.9715 0.1691
## itg 1 1.20990 53.856 20.587 0.7863 0.3813

#We do not add any more predictors, as the p-values of the remaining predictors
#are all greater than 0.15 (15%)
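Note that for a single added term, the F value reported by add1() is the square of that term's t-statistic in the enlarged model, so the two stepwise criteria here look at the same quantity. A quick check with the trks values from the outputs above (in Python):

```python
# F = t^2 when a single term (1 df) is tested.
t_trks = -2.564   # t-value of trks in summary(mod2)
F_trks = 6.5724   # F-value for adding trks in add1(mod1, ...)

print(round(t_trks ** 2, 2))  # ~6.57, matching F up to rounding
```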

(c) Best subsets regression with adjusted R2


Answer
According to adjusted R2, the final model includes acpt, trks and sigs as the predictors.
#c
mod=regsubsets(cbind(trks,lane,acpt,sigs,itg),y)
summary.mod=summary(mod)
summary.mod$which

## (Intercept) trks lane acpt sigs itg
## 1 TRUE FALSE FALSE TRUE FALSE FALSE
## 2 TRUE TRUE FALSE TRUE FALSE FALSE
## 3 TRUE TRUE FALSE TRUE TRUE FALSE
## 4 TRUE TRUE FALSE TRUE TRUE TRUE
## 5 TRUE TRUE TRUE TRUE TRUE TRUE

summary.mod$adjr2
## [1] 0.5538002 0.6122046 0.6223945 0.6165208 0.6070799

The largest value of adjusted R2 is 0.6223945, attained in the third row. Hence, the best model
under best subsets regression with adjusted R2 is the one with trks, acpt and sigs as predictors.
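The adjusted R2 values can be recovered from the residual sums of squares already printed in part (a)'s step() trace, using adjR2 = 1 - (RSS/(n-p-1)) / (TSS/(n-1)). A check for the winning three-predictor model (in Python; TSS is the intercept-only RSS, 149.886):

```python
n = 39          # rows in the Highway data
tss = 149.886   # RSS of the intercept-only model

def adj_r2(rss, p):
    # adjusted R^2 from RSS for a model with p predictors
    return 1 - (rss / (n - p - 1)) / (tss / (n - 1))

print(round(adj_r2(52.130, 3), 4))  # ~0.6224 for y ~ acpt + trks + sigs
```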

Q5) (a) In stepwise regression, what advantage is there in using a relatively large α value to
add variables? Comment briefly.
Answer
A relatively large α makes it easier (less stringent) for predictors to enter the model at each
step. It also makes the partial F-test less selective, which reduces the risk of underfitting,
i.e., of leaving out important predictors.

(b) Do the stepwise procedures using AIC and t-test p-values yield the same final model
every time? If not, how will you decide on which model to pick?
Answer
No, the stepwise procedures using AIC and t-test p-values do not always yield the same final
model; the two procedures can stop at different sets of predictors.
If such a case arises, we use intuition and prior information to choose between the models. If
we know beforehand that a certain predictor has to be included in the final model, we go with
the larger model containing that predictor. Otherwise, we pick the model with fewer predictors.
