AMS 427 SUP NOTES 1g
STAT 462 Applied Regression Analysis
X1 = age
X4 = weight (kg)
X5 = height (mm)
X6 = chin skinfold
X7 = forearm skinfold
The results from the best subsets procedure are presented below.
https://online.stat.psu.edu/stat462/node/203/ 1/5
7/19/2021 11.6 - Further Automated Variable Selection Examples | STAT 462
To interpret the results, we start by noting that the lowest Cp value (= 5.5) occurs for the five-variable model that includes the
variables Age, Years, fraclife, Weight, and Chin. The "X"s on the right side of the display tell us which variables are in the
model (look up to the column heading to see the variable name). The value of R² for this model is 63.9% and the value of adjusted R² is
58.4%. If we look at the best six-variable model, we see only minimal changes in these values, and the value of S = √MSE
increases. A five-variable model will most likely be sufficient. We should then use multiple regression to explore the five-variable
model just identified. Note that two of these x-variables relate to how long the person has lived at the urban lower altitude.
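The Cp ranking described above can be sketched in a few lines. This is a minimal illustration, not the software used to produce the display: it assumes the usual definition Cp = SSE_p / MSE_full - (n - 2p), where p counts the intercept, and it runs on synthetic data rather than the course dataset. The helper names `sse` and `best_subsets_cp` are hypothetical.

```python
from itertools import combinations

import numpy as np


def sse(X, y, cols):
    """Residual sum of squares for an OLS fit with an intercept plus X[:, cols]."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def best_subsets_cp(X, y):
    """Return (Cp, columns) for every non-empty subset of predictors, sorted by Cp."""
    n, k = X.shape
    # MSE of the model containing all k candidate predictors.
    mse_full = sse(X, y, range(k)) / (n - (k + 1))
    results = []
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            p = size + 1  # number of parameters, counting the intercept
            cp = sse(X, y, cols) / mse_full - (n - 2 * p)
            results.append((cp, cols))
    return sorted(results)


# Synthetic data (NOT the course dataset): only columns 0 and 2 affect y.
rng = np.random.default_rng(0)
X = rng.normal(size=(39, 4))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=39)

ranked = best_subsets_cp(X, y)
for cp, cols in ranked[:3]:
    print(f"Cp = {cp:6.2f}  predictors = {cols}")
```

By construction, the model with all predictors always has Cp equal to its number of parameters, so Cp is only informative for comparing the smaller subsets.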
Next, we turn our attention to calculating AIC and BIC. Here are the multiple regression results for the best five-variable model
(which has Cp = 5.5) and the best four-variable model (which has Cp = 7.2).
AIC Comparison: The five-variable model still has a slight edge (a lower AIC is better).
BIC Comparison: The values are nearly the same; the five-variable model has a slightly lower value (a lower BIC is better).
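As a rough sketch of how such a comparison can be computed, the code below uses one common least-squares form of the criteria: AIC = n ln(SSE) - n ln(n) + 2p and BIC = n ln(SSE) - n ln(n) + p ln(n), where p counts the intercept. The sample size n = 39 comes from the text, but the SSE values are made-up placeholders, not the values from the regression output, and the function names are hypothetical.

```python
import math


def aic(sse, n, p):
    """AIC for a least-squares fit with p parameters (intercept included)."""
    return n * math.log(sse) - n * math.log(n) + 2 * p


def bic(sse, n, p):
    """BIC for a least-squares fit; the penalty per parameter is ln(n) instead of 2."""
    return n * math.log(sse) - n * math.log(n) + p * math.log(n)


n = 39  # sample size stated in the text

# HYPOTHETICAL error sums of squares, chosen only to illustrate the arithmetic.
sse_five = 2400.0  # five predictors -> p = 6 parameters
sse_four = 2700.0  # four predictors -> p = 5 parameters

print("AIC five-variable:", aic(sse_five, n, 6))
print("AIC four-variable:", aic(sse_four, n, 5))
print("BIC five-variable:", bic(sse_five, n, 6))
print("BIC four-variable:", bic(sse_four, n, 5))
```

Because ln(39) is larger than 2, BIC charges more than AIC for the extra parameter, which is why a gap between two models tends to be smaller under BIC than under AIC.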
Our decision is that the five-variable model has better values than the four-variable model, so it seems to be the winner.
Interestingly, the Chin variable is not quite significant at the 0.05 level in the five-variable model, so we could consider
dropping it as a predictor. However, the cost would be an increase in MSE and a 4.2% drop in R². Given the closeness of the Chin p-value
(0.061) to the 0.05 significance level and the relatively small sample size (39), we should probably keep the Chin variable in the
model for prediction purposes. When a p-value is only slightly higher than our significance level (by slightly higher,
we mean usually no more than 0.05 above the significance level we are using), we usually say the variable is marginally
significant. It is usually a good idea to keep such variables in the model but, either way, you should state why you
decided to keep or drop the variable.
Next we will illustrate stepwise procedures. Recall from Lesson 5 that this dataset consists of n = 55 college students with
measurements for the following seven variables (Physical.txt):
Y = height (in)
All six x-variables were candidates for the final model. The procedure took two forward steps and then stopped. The variables in
the model at that point are left foot length and left forearm length. The left foot length variable was selected first (in Step 1), and
then left forearm length was added to the model. The procedure stopped because no other variables could enter at a significant
level. Notice that the significance level used for entering variables was 0.15. Thus, after Step 2 there were no more x-variables for
which the p-value would be less than 0.15.
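The logic of a stepwise run like this one can be sketched as follows. This is an illustrative implementation under stated assumptions, not the routine that produced the output above: it uses two-sided t-test p-values, an entry threshold of 0.15 as in the text, an assumed removal threshold of 0.15, and synthetic data in place of Physical.txt. The function names `coef_pvalues` and `stepwise` are hypothetical.

```python
import numpy as np
from scipy import stats


def coef_pvalues(X, y, cols):
    """Two-sided t-test p-values for the slopes of an OLS fit on X[:, cols]."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    df = len(y) - A.shape[1]
    mse = resid @ resid / df
    se = np.sqrt(mse * np.diag(np.linalg.inv(A.T @ A)))
    pvals = 2 * stats.t.sf(np.abs(beta / se), df)
    return dict(zip(cols, pvals[1:]))  # skip the intercept


def stepwise(X, y, alpha_enter=0.15, alpha_remove=0.15):
    """Add the most significant candidate, then re-check the variables already in."""
    selected = []
    for _ in range(10 * X.shape[1]):  # hard cap so the loop always terminates
        # Forward step: p-value each remaining predictor would have if added.
        candidates = {}
        for j in range(X.shape[1]):
            if j not in selected:
                candidates[j] = coef_pvalues(X, y, selected + [j])[j]
        if not candidates:
            break
        best = min(candidates, key=candidates.get)
        if candidates[best] >= alpha_enter:
            break  # nothing left can enter at the chosen level
        selected.append(best)
        # Backward check: drop the weakest variable if it is no longer significant.
        pvals = coef_pvalues(X, y, selected)
        worst = max(selected, key=pvals.get)
        if pvals[worst] > alpha_remove:
            selected.remove(worst)
    return selected


# Synthetic stand-in for the data (NOT Physical.txt): columns 1 and 4 drive y.
rng = np.random.default_rng(1)
X = rng.normal(size=(55, 6))
y = 60.0 + 2.0 * X[:, 1] + 1.5 * X[:, 4] + rng.normal(size=55)

selected = stepwise(X, y)
print("Selected columns, in order of entry:", selected)
```

With a threshold as generous as 0.15, a pure-noise column can occasionally enter as well, which is one reason the selected model should still be examined with an ordinary regression fit.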
It is also possible to work backwards from a model with all the predictors included and only consider steps in which the least
significant predictor is removed. Output for this backward elimination procedure is given below.
The procedure took five steps (counting Step 1 as the estimation of a model with all variables included). At each subsequent step,
the weakest variable is eliminated until all variables in the model are significant (at the default 0.10 level). At a particular step,
you can see which variable was eliminated by the new blank spot in the display (compared to the previous step). For instance,
from Step 1 to Step 2, the nose length variable was dropped (it had the highest p-value). Then, from Step 2 to Step 3, the gender
variable was dropped, and so on.
The stopping point for the backward elimination procedure gave the same model as the stepwise procedure did, with left forearm
length and left foot length as the only two x-variables in the model. The two methods used here will not always arrive at the
same model.
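Backward elimination is simple enough to sketch directly. The version below is a minimal illustration under assumptions: it starts from the model with every predictor, repeatedly drops the predictor with the largest t-test p-value, and stops once all remaining p-values are at or below 0.10 (the level mentioned in the text). It runs on synthetic data rather than Physical.txt, and the function names are hypothetical.

```python
import numpy as np
from scipy import stats


def coef_pvalues(X, y, cols):
    """Two-sided t-test p-values for the slopes of an OLS fit on X[:, cols]."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    df = len(y) - A.shape[1]
    mse = resid @ resid / df
    se = np.sqrt(mse * np.diag(np.linalg.inv(A.T @ A)))
    pvals = 2 * stats.t.sf(np.abs(beta / se), df)
    return dict(zip(cols, pvals[1:]))  # skip the intercept


def backward_elimination(X, y, alpha_remove=0.10):
    """Start with all predictors and remove the least significant one at each step."""
    selected = list(range(X.shape[1]))
    while selected:
        pvals = coef_pvalues(X, y, selected)
        worst = max(selected, key=pvals.get)
        if pvals[worst] <= alpha_remove:
            break  # every remaining predictor is significant at the chosen level
        selected.remove(worst)
    return selected


# Synthetic stand-in for the data (NOT Physical.txt): columns 1 and 4 drive y.
rng = np.random.default_rng(1)
X = rng.normal(size=(55, 6))
y = 60.0 + 2.0 * X[:, 1] + 1.5 * X[:, 4] + rng.normal(size=55)

kept = backward_elimination(X, y)
print("Columns remaining after backward elimination:", kept)
```

Each pass through the loop corresponds to one step in the display: the variable that disappears is the one with the highest p-value in the current fit.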
Finally, it is also possible to work forwards from a base model with no predictors included and only consider steps in which the
most significant predictor is added. We leave it as an exercise to see how this forward selection procedure works for this dataset (you
can probably guess given the results of the stepwise procedure above).