Chapter 311
Stepwise Regression
Introduction
Often, theory and experience give only general direction as to which of a pool of candidate variables (including
transformed variables) should be included in the regression model. The actual set of predictor variables used in
the final regression model must be determined by analysis of the data. Determining this subset is called the
variable selection problem.
Finding this subset of regressor (independent) variables involves two opposing objectives. First, we want the
regression model to be as complete and realistic as possible. We want every regressor that is even remotely
related to the dependent variable to be included. Second, we want to include as few variables as possible because
each irrelevant regressor decreases the precision of the estimated coefficients and predicted values. Also, the
presence of extra variables increases the complexity of data collection and model maintenance. The goal of
variable selection becomes one of parsimony: achieve a balance between simplicity (as few regressors as
possible) and fit (as many regressors as needed).
There are many different strategies for selecting variables for a regression model. If there are no more than fifteen
candidate variables, the All Possible Regressions procedure (discussed in the next chapter) should be used, since it will
always give models that are as good as or better than those found by the stepping methods available in this procedure.
On the other hand, when there are more than fifteen candidate variables, the four search methods contained in this
procedure may be of use.
These search procedures will often find very different models. Outliers and collinearity can cause this. If there is very
little correlation among the candidate variables and no outlier problems, the four procedures should find the same
model.
We will now briefly discuss each of these procedures.
NCSS, LLC. All Rights Reserved.
NCSS Statistical Software NCSS.com
Stepwise Selection
Stepwise regression is a combination of the forward and backward selection techniques. It was very popular at one
time, but the Multivariate Variable Selection procedure described in a later chapter will always do at least as well and
usually better.
Stepwise regression modifies forward selection so that, after each step in which a variable is added, all variables
currently in the model are checked to see whether their significance has fallen below the specified removal
level. If a nonsignificant variable is found, it is removed from the model.
Stepwise regression requires two significance levels: one for adding variables and one for removing variables. The
cutoff probability for adding variables should be less than the cutoff probability for removing variables so that the
procedure does not get into an infinite loop.
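The add-then-check loop described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the NCSS implementation: the function names (`rss`, `two_tail_p`, `p_for`, `stepwise`), the default cutoffs of 0.05 and 0.10, and the small data set are all hypothetical. The p-values come from the identity that the squared t-value of a variable equals its partial F statistic, and the t distribution tail is approximated by Simpson's rule.

```python
import math

def rss(cols, y):
    """Residual sum of squares of y after projecting out an intercept and cols."""
    n = len(y)
    resid = [float(v) for v in y]
    basis = []
    for col in [[1.0] * n] + [list(c) for c in cols]:
        v = [float(a) for a in col]
        for q in basis:  # Gram-Schmidt: remove directions already in the model
            d = sum(a * b for a, b in zip(v, q))
            v = [a - d * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-10:  # skip columns that are (numerically) collinear
            q = [a / norm for a in v]
            basis.append(q)
            d = sum(a * b for a, b in zip(resid, q))
            resid = [a - d * b for a, b in zip(resid, q)]
    return sum(a * a for a in resid)

def two_tail_p(t, df, steps=2000):
    """Two-tail p-value of a t statistic, via Simpson's rule on the t density."""
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)
    dens = lambda u: const * (1 + u * u / df) ** (-(df + 1) / 2)
    h = abs(t) / steps
    area = dens(0) + dens(abs(t)) + sum((4 if i % 2 else 2) * dens(i * h) for i in range(1, steps))
    return max(0.0, 1.0 - 2.0 * area * h / 3.0)

def p_for(j, others, X, y):
    """p-value of variable j, adjusted for the variables named in others."""
    full = rss([X[k] for k in others] + [X[j]], y)
    reduced = rss([X[k] for k in others], y)
    df = len(y) - len(others) - 2  # n minus the coefficients in the larger model
    # Assumes full > 0 (no perfect fit); t squared equals the partial F.
    t = math.sqrt(max(reduced - full, 0.0) / (full / df))
    return two_tail_p(t, df)

def stepwise(X, y, pin=0.05, pout=0.10, max_iter=50):
    """Return the names of the selected variables (pin and pout are illustrative)."""
    if not pin < pout:
        raise ValueError("PIN must be less than POUT")
    model = []
    for _ in range(max_iter):
        changed = False
        outside = [j for j in X if j not in model]
        if outside:  # forward step: try the most significant outside variable
            p, j = min((p_for(j, model, X, y), j) for j in outside)
            if p <= pin:
                model.append(j)
                changed = True
        if model:  # backward look: drop the least significant variable if needed
            p, j = max((p_for(j, [k for k in model if k != j], X, y), j) for j in model)
            if p > pout:
                model.remove(j)
                changed = True
        if not changed:
            break
    return model

# Made-up data: y depends on x1 only; x2 is an unrelated alternating pattern.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1, -1, 1, -1, 1, -1, 1, -1]
y = [5.5, 7.0, 8.5, 11.0, 12.5, 15.0, 17.5, 19.0]
print(stepwise({"x1": x1, "x2": x2}, y))
```

In this sketch only the single least significant variable is considered for removal in each pass, which is one reasonable reading of the description; the iteration cap plays the role of a maximum-iterations safeguard.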
Min MSE
This procedure is similar to the Stepwise Selection search procedure. However, instead of using probabilities to add
and remove variables, you specify a minimum change in the root mean square error. At each step, the variable whose status
change (in or out of the model) will decrease the mean square error the most is selected and its status is reversed. If it
is currently in the model, it is removed. If it is not in the model, it is added. This process continues until no variable
can be found that will cause a change larger than the user-specified minimum change amount.
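The status-toggling search described above might look like the following sketch. It is a hedged illustration rather than the NCSS code: the names `rmse` and `min_mse`, the default `min_change` of 0.01, and the data are hypothetical, and the change is measured as a relative decrease in the square root of the mean square error.

```python
import math

def rmse(cols, y):
    """Square root of the mean square error for y on an intercept plus cols."""
    n = len(y)
    resid = [float(v) for v in y]
    basis = []
    for col in [[1.0] * n] + [list(c) for c in cols]:
        v = [float(a) for a in col]
        for q in basis:  # Gram-Schmidt orthogonalization
            d = sum(a * b for a, b in zip(v, q))
            v = [a - d * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-10:
            q = [a / norm for a in v]
            basis.append(q)
            d = sum(a * b for a, b in zip(resid, q))
            resid = [a - d * b for a, b in zip(resid, q)]
    # Error degrees of freedom: rows minus fitted coefficients
    return math.sqrt(sum(a * a for a in resid) / (n - len(basis)))

def min_mse(X, y, min_change=0.01, max_iter=50):
    """Toggle the variable that lowers RMSE most until the gain is too small."""
    model = []
    current = rmse([], y)
    for _ in range(max_iter):
        best = None
        for j in X:
            # Reverse the status of j: remove it if present, add it if absent
            trial = [k for k in model if k != j] if j in model else model + [j]
            new = rmse([X[k] for k in trial], y)
            gain = (current - new) / current  # relative decrease in RMSE
            if best is None or gain > best[0]:
                best = (gain, trial, new)
        if best is None or best[0] < min_change:
            break
        _, model, current = best
    return model

# Made-up data: y depends on x1 only.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1, -1, 1, -1, 1, -1, 1, -1]
y = [5.5, 7.0, 8.5, 11.0, 12.5, 15.0, 17.5, 19.0]
print(min_mse({"x1": x1, "x2": x2}, y))
```

Because adding a variable costs one error degree of freedom, the RMSE can rise even when the residual sum of squares does not, which is why an unhelpful variable is left out here.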
Data Structure
An example of data appropriate for this procedure is shown in the table below. These data are from a study of the
relationship between several variables and a person's IQ. Fifteen people were studied. Each person's IQ was recorded
along with scores on five different personality tests. The data are contained in the IQ dataset. We suggest that you
open this dataset now so that you can follow along with the example.
IQ dataset
Test1  Test2  Test3  Test4  Test5  IQ
83     34     65     63     64     106
73     19     73     48     82     92
54     81     82     65     73     102
96     72     91     88     94     121
84     53     72     68     82     102
86     72     63     79     57     105
76     62     64     69     64     97
54     49     43     52     84     92
37     43     92     39     72     94
42     54     96     48     83     112
71     63     52     69     42     130
63     74     74     71     91     115
69     81     82     75     54     98
81     89     64     85     62     96
50     75     72     64     45     103
Missing Values
Rows with missing values in the active variables are ignored.
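As a rough illustration of this rule, the sketch below drops any row that has a missing value in one of the active variables and keeps rows whose only gaps are in unused columns. The function name `complete_rows`, the use of `None` or NaN to mark a missing value, and the sample rows are assumptions made for the example.

```python
import math

def complete_rows(rows, active):
    """Keep only rows with no missing value in the active variables."""
    def missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))
    return [r for r in rows if not any(missing(r[name]) for name in active)]

# Hypothetical rows; the missing entries are invented for illustration.
rows = [
    {"IQ": 106, "Test1": 83, "Test2": None},   # gap is in an inactive variable
    {"IQ": 92, "Test1": None, "Test2": 19},    # gap is in an active variable
    {"IQ": 102, "Test1": 54, "Test2": 81},
]
kept = complete_rows(rows, active=["IQ", "Test1"])
print(len(kept))
```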
Procedure Options
This section describes the options available in this procedure.
Variables Tab
Specify the variables on which to run the analysis.
Dependent Variable
Y: Dependent Variable
Specifies a dependent (Y) variable. If more than one variable is specified, a separate analysis is run for each.
Weight Variable
Weight Variable
Specifies a variable containing observation (row) weights for generating weighted regression analysis. These
weights might be those saved during a robust regression analysis.
Independent Variables
Xs: Independent Variables
Specify the independent (X or candidate) variables.
Model Selection
Selection Method
This option specifies which of the four search procedures should be used: Forward, Backward, Stepwise, or Min
MSE.
Prob to Enter
Sometimes called PIN, this is the probability required to enter the equation. This value is used by the Forward and
the Stepwise procedures. A variable, not currently in the model, must have a t-test probability value less than or
equal to this in order to be considered for entry into the regression equation. You must set PIN < POUT.
Prob to Remove
Sometimes called POUT, this is the probability required to be removed from the equation. This value is used by the
Backward and the Stepwise procedures. A variable, currently in the model, must have a t-test probability value
greater than this in order to be considered for removal from the regression equation. You must set PIN < POUT.
Min RMSE Change
This value is used by the Minimum MSE procedure to determine when to stop. The procedure stops when the
maximum relative decrease in the square root of the mean square error brought about by changing the status of a
variable is less than this amount.
Maximum Iterations
This is the maximum number of iterations that will be allowed. This option is useful to prevent the unlimited
looping that may occur. You should set this to a high value, say 50 or 100.
Remove Intercept
Unchecked indicates that the intercept term is to be included in the regression. Checked indicates that the intercept
should be omitted from the regression model. Note that deleting the intercept distorts most of the diagnostic
statistics (R-Squared, etc.).
Reports Tab
These options control the reports that are displayed.
Select Reports
Descriptive Statistics and Selected Variables Reports
This option specifies whether the indicated report is displayed.
Report Options
Report Format
Two output formats are available: brief and verbose. The Brief output format consists of a single line for each step
(scan through the variables). The Verbose output format gives a complete table of each variable's statistics at each
step. If you have many variables, the Verbose option can produce a lot of output.
Precision
Specifies the precision of numbers in the report. Single precision will display seven-place accuracy, while double
precision will display thirteen-place accuracy.
Variable Names
This option lets you select whether to display only variable names, variable labels, or both.
For each variable, the Count, Mean, and Standard Deviation are calculated. This report is especially useful for
making certain that you have selected the right variables and that the appropriate number of rows was used.
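The three quantities in this report can be reproduced for a single column with the standard library, as in the sketch below. It uses the IQ column from the example dataset; the function name `describe` is hypothetical, and the use of the sample (n - 1) standard deviation is an assumption.

```python
import statistics

# IQ column of the example dataset (15 rows)
iq = [106, 92, 102, 121, 102, 105, 97, 92, 94, 112, 130, 115, 98, 96, 103]

def describe(values):
    """Count, mean, and sample standard deviation of one variable."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # assumes the n - 1 divisor
    }

summary = describe(iq)
print(summary)
```

A count other than 15 here would signal that the wrong column was selected or that rows were dropped for missing values.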
[Report output not reproduced here: iteration tables titled "Iteration 0: Unchanged" and "Iteration 3: Unchanged"]
This report presents information about each step of the search procedures. You can scan this report to see if you
would have made the same choice. Each report shows the statistics after the specified action (entry or removal)
was taken.
$$b_{j,\mathrm{std}} = b_j \, \frac{s_{x_j}}{s_y}$$
where $s_y$ and $s_{x_j}$ are the standard deviations for the dependent variable and the corresponding jth independent
variable.
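As a small worked case of the scaling defined above, the sketch below multiplies a coefficient by the ratio of the two standard deviations. The function name and the data are hypothetical, and the sample standard deviation is assumed. With one independent variable and an exact linear relationship, the standardized coefficient equals the correlation, which is 1 here.

```python
import statistics

def standardized_coefficient(b_j, x_j, y):
    """Scale b_j by s_xj / s_y, using sample standard deviations."""
    return b_j * statistics.stdev(x_j) / statistics.stdev(y)

# Made-up data in which y = 2 * x exactly, so the slope b_j is 2.
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
print(standardized_coefficient(2.0, x, y))
```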
R-Squared Increment
This is the amount that R-Squared would be changed if the status of this variable were changed. If the variable is
currently in the model, this is the amount the R-Squared value would be decreased if it were removed. If the
variable is currently out of the model, this is the amount the overall R-Squared would be increased if it were
added. Large values here indicate important independent variables.
You want to add variables that make a large contribution to R-Squared and to delete variables that make a small
contribution to R-Squared.
R-Squared Other Xs
This is a collinearity measure, which should be as small as possible. This is the R-Squared value that would result if
this independent variable were regressed on all of the other independent variables currently in the model.
T-Value
This is the t-value for testing the hypothesis that this variable should be added to, or deleted from, the model. The test
is adjusted for the rest of the variables in the model. The larger the absolute value of this t-value, the more important the variable.
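For a least squares fit with an intercept, the t-value and the R-Squared Increment are tied together by a standard identity, shown here as a sketch. The symbols are assumptions introduced for illustration: $n$ is the number of rows used, $p$ is the number of independent variables in the larger of the two models being compared (the one containing variable $j$), $R^2$ is the R-Squared of that larger model, and $\Delta R^2_j$ is the R-Squared Increment for variable $j$.

```latex
t_j^2 \;=\; F_j \;=\; \frac{\Delta R^2_j \,(n - p - 1)}{1 - R^2}
```

This is why, at a given step, ranking candidate variables by R-Squared Increment and ranking them by the absolute t-value lead to similar conclusions.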
Prob Level
This is the two-tail p-value for the above t-value. The smaller this p-value, the more important the independent
variable is. This is the significance value that is compared to the values of PIN and POUT (see Stepwise Method
above).
This is an abbreviated report summarizing the statistics at each iteration. Individual definitions of the items on the
report are as follows:
Iter. No.
The number of this iteration.
Action
For each iteration, there are three possible actions:
1. Unchanged. The scan in this step resulted in no action. Because of the backward look in the
stepwise search method, this will appear often when that method is used. Otherwise, it will appear at
the first and last steps.
2. Removed. A variable was removed from the model.
3. Added. A variable was added to the model.
Variable
This is the name of the variable whose status is being changed.
R-Squared
The value of R-Squared for the current model.
Sqrt(MSE)
This is the square root of the mean square error for the current model.
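Written out under the usual least squares conventions, and assuming the intercept is included, this quantity is as follows. The symbols are introduced here for illustration: $n$ is the number of rows used, $p$ is the number of independent variables currently in the model, and $\hat{y}_i$ is the predicted value for row $i$.

```latex
\sqrt{\mathrm{MSE}} \;=\; \sqrt{\frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{n - p - 1}}
```

If the intercept is removed, the denominator would be $n - p$ instead.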