Multiple Linear Regression and Checking For Collinearity Using SAS
For this example we use data from the Werner birth control study. Data for this study were
collected from 188 women, 94 of whom were taking birth control pills, and 94 controls, matched
on age, who were not taking birth control pills. The raw data are in the WERNER2.DAT file. For
this analysis, we ignore the matching between cases and controls. The codebook for this study
is shown below.
SAS commands to read in the raw data and create a permanent SAS dataset are shown below:
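/* Assumes the libref b510 has already been assigned with a LIBNAME statement */
DATA b510.WERNER;
  INFILE "E:\LABDATA\WERNER2.DAT";
  INPUT ID 1-4 AGE 5-8 HT 9-12 WT 13-16
        PILL 17-20 CHOL 21-24 ALB 25-28
        CALC 29-32 URIC 33-36 PAIR 37-39;
  /* Recode the numeric missing-value codes to SAS missing values */
  if ht=999 then ht=.;
  if wt=999 then wt=.;
  if alb=99 then alb=.;
  if calc=99 then calc=.;
  if uric=99 then uric=.;
  /* WTALB is an exact linear combination of WT and ALB; it is used later
     to illustrate perfect collinearity */
  wtalb = wt + alb;
run;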
We examine frequency tables for AGE and PILL using Proc Freq, and descriptive statistics for all
numeric variables (all variables are numeric in this case) using Proc Means.
title "Werner Data";
proc freq data=b510.werner;
tables age pill;
run;
proc means data=b510.werner;
run;
Werner Data
The FREQ Procedure
Cumulative Cumulative
AGE Frequency Percent Frequency Percent
--------------------------------------------------------
19 2 1.06 2 1.06
20 2 1.06 4 2.13
21 14 7.45 18 9.57
22 16 8.51 34 18.09
23 4 2.13 38 20.21
24 6 3.19 44 23.40
25 8 4.26 52 27.66
26 4 2.13 56 29.79
27 8 4.26 64 34.04
28 6 3.19 70 37.23
29 4 2.13 74 39.36
30 10 5.32 84 44.68
31 6 3.19 90 47.87
32 10 5.32 100 53.19
33 6 3.19 106 56.38
34 2 1.06 108 57.45
35 4 2.13 112 59.57
36 4 2.13 116 61.70
37 4 2.13 120 63.83
38 2 1.06 122 64.89
39 6 3.19 128 68.09
40 8 4.26 136 72.34
41 4 2.13 140 74.47
42 2 1.06 142 75.53
43 8 4.26 150 79.79
44 2 1.06 152 80.85
45 2 1.06 154 81.91
46 6 3.19 160 85.11
47 4 2.13 164 87.23
48 8 4.26 172 91.49
49 2 1.06 174 92.55
50 2 1.06 176 93.62
52 2 1.06 178 94.68
53 2 1.06 180 95.74
54 6 3.19 186 98.94
55 2 1.06 188 100.00
Cumulative Cumulative
PILL Frequency Percent Frequency Percent
---------------------------------------------------------
1 94 50.00 94 50.00
2 94 50.00 188 100.00
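Variable N Mean Std Dev Minimum Maximum
-------------------------------------------------------------------------------
URIC 187 4.7705882 1.1572312 2.2000000 9.9000000
PAIR 188 47.5000000 27.2063810 1.0000000 94.0000000
wtalb 184 135.7978261 20.6557047 98.1000000 219.3000000
-------------------------------------------------------------------------------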
Before we fit a multiple regression model, we examine the correlations among the predictor
variables and dependent variable using Proc Corr. We first use the default settings from Proc
Corr, which gives us a correlation matrix with pairwise deletion of missing values. In the
correlation matrix below the sample size for each pair of variables is based on all available
cases for those two variables.
TITLE "PEARSON CORRELATION MATRIX PAIRWISE DELETION";
proc corr data=b510.werner;
var chol age calc uric alb wt wtalb;
run;
PEARSON CORRELATION MATRIX PAIRWISE DELETION
The CORR Procedure
7 Variables: CHOL AGE CALC URIC ALB WT WTALB
Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations
Next, we examine the correlations using the nomiss option, which gives us a correlation matrix
with listwise deletion of cases. That is, only those cases that have complete data for all
variables will be included in the correlation matrix. This corresponds to the method used in Proc
Reg, which requires complete data on all variables for a case to be included in the analysis.
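The commands for this step are not shown above. A minimal sketch, assuming the same variable list as the pairwise run and a title chosen here for illustration, would add the nomiss option to the Proc Corr statement:

```sas
/* Sketch only: the title text is an assumption, not taken from the handout */
title "PEARSON CORRELATION MATRIX LISTWISE DELETION";
proc corr data=b510.werner nomiss;   /* nomiss drops any case with a missing value on any VAR variable */
   var chol age calc uric alb wt wtalb;
run;
```

With listwise deletion every variable is summarized on the same set of cases, which is why the Simple Statistics table reports N = 181 for all seven variables.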
The CORR Procedure
7 Variables: CHOL AGE CALC URIC ALB WT WTALB
Simple Statistics
Variable N Mean Std Dev Sum Minimum Maximum
CHOL 181 234.81215 44.80722 42501 50.00000 390.00000
AGE 181 33.49171 9.89086 6062 19.00000 55.00000
CALC 181 9.96575 0.47227 1804 8.80000 11.10000
URIC 181 4.74586 1.12558 859.00000 2.20000 9.90000
ALB 181 4.11878 0.35852 745.50000 3.20000 5.00000
WT 181 131.19890 20.49103 23747 94.00000 215.00000
WTALB 181 135.31768 20.40863 24493 98.10000 219.30000
Next, we examine a scatterplot matrix for the variables Chol, Calc, Uric, Age, Wt, and Wtalb.
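The handout does not show how the scatterplot matrix was produced. One way to get a comparable plot, assuming a SAS release that includes the ODS Graphics procedure Proc Sgscatter, is sketched here; the original plot may have been created with a different tool:

```sas
/* Sketch only: the original scatterplot matrix may have been made another way */
title "Scatterplot Matrix for Werner Data";
proc sgscatter data=b510.werner;
   matrix chol calc uric age wt wtalb;   /* one panel for every pair of variables */
run;
```

In such a plot the Wt by Wtalb panel should fall almost exactly on a straight line, because Wtalb is Wt plus a comparatively small Alb value.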
Now we fit a multiple regression model with cholesterol (CHOL) as the dependent variable, and
age, calcium, uric acid, albumin and weight as predictors. We request the standardized
coefficients with the stb option, and examine collinearity with the three options: tol, vif, and
collin. We also request the influence diagnostic, Cook’s Distance (a summary measure of the
influence of each observation on the parameter estimates).
TITLE1 "MULTIPLE REGRESSION ANALYSIS";
TITLE2 "WITH COLLINEARITY DIAGNOSTICS";
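proc reg data=b510.werner;
  /* stb = standardized estimates; tol, vif, collin = collinearity diagnostics */
  model chol = age calc uric alb wt / stb tol vif collin;
  /* Studentized residuals vs. predicted values */
  plot rstudent. * predicted.;
  /* Save predicted values, residuals, studentized residuals, and Cook's D */
  output out=outreg1 p=predict1 r=resid1 rstudent=rstud1
         cookd=cookd1;
run; quit;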
MULTIPLE REGRESSION ANALYSIS
WITH COLLINEARITY DIAGNOSTICS
The REG Procedure
Model: MODEL1
Dependent Variable: CHOL
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 5 85352 17070 10.82 <.0001
Error 175 276032 1577.32432
Corrected Total 180 361384
Parameter Estimates
Parameter Standard Standardized
Variable DF Estimate Error t Value Pr > |t| Estimate Tolerance
Intercept 1 -55.66568 65.84364 -0.85 0.3990 0 .
AGE 1 1.51731 0.31310 4.85 <.0001 0.33494 0.91375
CALC 1 23.13256 7.23435 3.20 0.0016 0.24382 0.75069
URIC 1 7.77395 2.82519 2.75 0.0066 0.19529 0.86656
ALB 1 -3.61701 9.72612 -0.37 0.7104 -0.02894 0.72069
WT 1 -0.09809 0.16119 -0.61 0.5436 -0.04486 0.80319
Parameter Estimates
Variance
Variable DF Inflation
Intercept 1 0
AGE 1 1.09439
CALC 1 1.33210
URIC 1 1.15399
ALB 1 1.38755
WT 1 1.24504
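The variance inflation factor is the reciprocal of the tolerance, so the two columns carry the same information. As a quick check, the following sketch re-enters the printed tolerances by hand (the dataset name vifcheck is made up for this illustration) and recomputes the VIF:

```sas
/* Sketch: tolerance values are copied from the Parameter Estimates table above */
data vifcheck;
   input variable $ tolerance;
   vif = 1 / tolerance;   /* VIF is the reciprocal of tolerance */
   datalines;
AGE  0.91375
CALC 0.75069
URIC 0.86656
ALB  0.72069
WT   0.80319
;
run;

proc print data=vifcheck noobs;
   format vif 8.5;
run;
```

The recomputed values should agree with the Variance Inflation column up to rounding in the last digit, since the printed tolerances are themselves rounded.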
Collinearity Diagnostics
Condition
Number Eigenvalue Index
1 5.87226 1.00000
2 0.06389 9.58714
3 0.03985 12.13989
4 0.01979 17.22558
5 0.00317 43.03127
6 0.00104 75.16465
Collinearity Diagnostics
-----------------------------Proportion of Variation-----------------------------
Number Intercept AGE CALC URIC ALB WT
1 0.00005791 0.00200 0.00004819 0.00129 0.00015438 0.00054274
2 0.00127 0.93760 0.00128 0.01076 0.00570 0.00210
3 0.00310 0.01098 0.00228 0.91410 0.01133 0.00055001
4 0.00084135 0.04094 0.00162 0.06181 0.04116 0.75326
5 0.19460 0.00311 0.07113 0.00089968 0.87927 0.24347
6 0.80013 0.00537 0.92366 0.01113 0.06239 0.00008896
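Each condition index in the table is the square root of the largest eigenvalue divided by the eigenvalue for that row. The sketch below re-enters the printed eigenvalues by hand (the dataset name condcheck is made up for this illustration) and recomputes the index:

```sas
/* Sketch: eigenvalues are copied from the Collinearity Diagnostics table above */
data condcheck;
   retain maxeig;
   input number eigenvalue;
   if _n_ = 1 then maxeig = eigenvalue;           /* first row holds the largest eigenvalue */
   condition_index = sqrt(maxeig / eigenvalue);   /* sqrt(largest / current) */
   drop maxeig;
   datalines;
1 5.87226
2 0.06389
3 0.03985
4 0.01979
5 0.00317
6 0.00104
;
run;

proc print data=condcheck noobs;
run;
```

The results should match the printed Condition Index column approximately; small differences in the last rows come from the eigenvalues being rounded to five decimals in the output.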
We examine the distribution of the residuals to check for normality. We see that there is no
apparent skewness, and that the residuals appear to be reasonably normally distributed.
title "Residuals from Multiple Regression";
proc univariate data=outreg1;
var rstud1;
histogram / normal;
qqplot / normal(mu=est sigma=est);
run;
The tests for normality do not reject H0, so we have no evidence that the residuals depart from
a normal distribution.
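The test results themselves are not reproduced here. As a sketch, one way to request a table of formal normality tests for the studentized residuals is the normal option on the Proc Univariate statement; whether the handout used this option or relied on the goodness-of-fit tests from the histogram statement is an assumption:

```sas
/* Sketch: the normal option requests the Tests for Normality table */
title "Normality Tests for Studentized Residuals";
proc univariate data=outreg1 normal;
   var rstud1;
run;
```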
------Quantile------
Percent Observed Estimated
1.0 -2.38479 -2.35418
5.0 -1.51725 -1.66431
10.0 -1.21990 -1.29655
25.0 -0.69353 -0.68203
50.0 0.02256 0.00074
75.0 0.59352 0.68352
90.0 1.23187 1.29804
95.0 1.67376 1.66580
99.0 2.89051 2.35567
DESCRIPTIVES ON OUTREG1 DATA SET
The MEANS Procedure
Variable Label N Mean Std Dev
-------------------------------------------------------------------------------------------
AGE 188 33.8191489 10.1126942
HT 186 64.5107527 2.4850673
WT 186 131.6720430 20.6605767
PILL 188 1.5000000 0.5013351
CHOL 187 235.1550802 44.5706219
ALB 186 4.1112903 0.3579694
CALC 185 9.9621622 0.4795556
URIC 187 4.7705882 1.1572312
WTALB 184 135.7978261 20.6557047
AGEGRP 188 2.5425532 1.1106186
AGEDUM1 188 0.2340426 0.4245295
AGEDUM2 188 0.2446809 0.4310457
AGEDUM3 188 0.2659574 0.4430215
AGEDUM4 188 0.2553191 0.4372048
predict1 Predicted Value of CHOL 181 234.8121547 21.7756052
resid1 Residual 181 9.048624E-15 39.1600531
cookd1 Cook's D Influence Statistic 181 0.0058677 0.0127657
rstud1 Studentized Residual without Current Obs 181 0.000744985 1.0122833
-------------------------------------------------------------------------------------------
We examine a plot of Cook’s Distance vs. the observation number, so we can identify influential
observations. To do this, we first modify the outreg1 dataset to add a new variable called
OBSNUM, which is set equal to the automatic variable _n_. In this data step, _n_ is simply the
observation number.
data outreg1;
set outreg1;
obsnum = _n_;
run;
title "Cook's Distance vs. Observation Number";
proc gplot data=outreg1;
plot cookd1 * obsnum;
run;
We see that there are a number of observations with very high values of Cook’s Distance,
relative to the other observations.
We can examine these observations by using Proc Print, and selecting those cases with high
values of Cook’s Distance, by using a where statement.
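The Proc Print step is not shown in the handout. A minimal sketch follows; the cutoff of 0.05 is an assumption chosen because it is consistent with the four cases listed in the output, and the title is made up for this illustration:

```sas
/* Sketch only: the 0.05 cutoff is an assumption, not taken from the handout */
title "Cases with High Cook's Distance";
proc print data=outreg1;
   where cookd1 > 0.05;   /* keep only the cases with large Cook's Distance */
   var id chol age calc uric alb wt predict1 resid1 rstud1 cookd1;
run;
```

Because the where statement only filters the rows that are printed, the Obs column still shows each case's original position in outreg1.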
Notice that these cases are influential, and that observation 4 and observation 182 are not well
fit by the regression model (they both have studentized residuals that are greater than 3 in
absolute value). We might want to check the values of the variables for these cases to verify
that there are no problems with them, and then possibly correct any erroneous values and rerun
the analysis.
Obs ID CHOL AGE CALC URIC ALB WT predict1 resid1 rstud1 cookd1
4 1797 50 25 9.6 3.0 3.8 150 199.204 -149.204 -3.98681 0.092426
60 152 317 27 9.8 8.4 3.7 180 246.263 70.737 1.89177 0.065697
70 2830 305 28 9.3 2.4 4.1 113 194.696 110.304 2.89051 0.052719
182 3134 390 50 9.7 5.5 4.6 140 256.972 133.028 3.54133 0.095397
Example with perfect collinearity:
Now we fit a multiple regression model in which we deliberately include a variable, WTALB,
that is perfectly collinear with weight and albumin (it is the sum of weight and albumin). SAS
detects this collinearity and prints a note in the output. SAS also sets the parameter
estimate for WTALB to zero and gives it zero degrees of freedom, so the Model DF in the
Analysis of Variance table remains 5. The collinearity diagnostics show a smallest eigenvalue
of essentially zero and a condition index in the millions.
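The commands for this model are not shown in the handout. A minimal sketch, assuming the same options as the earlier model and titles made up for this illustration, simply adds WTALB to the predictor list:

```sas
/* Sketch only: titles and option list are assumptions based on the earlier model */
title1 "MULTIPLE REGRESSION ANALYSIS";
title2 "WITH A PERFECTLY COLLINEAR PREDICTOR";
proc reg data=b510.werner;
   model chol = age calc uric alb wt wtalb / stb tol vif collin;   /* wtalb = wt + alb */
run; quit;
```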
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 5 85352 17070 10.82 <.0001
Error 175 276032 1577.32432
Corrected Total 180 361384
NOTE: Model is not full rank. Least-squares solutions for the parameters are not unique. Some
statistics will be misleading. A reported DF of 0 or B means that the estimate is biased.
NOTE: The following parameters have been set to 0, since the variables are a linear
combination of other variables as shown.
WTALB = ALB + WT
Parameter Estimates
Parameter Estimates
Variance
Variable DF Inflation
Intercept 1 0
AGE 1 1.09439
CALC 1 1.33210
URIC 1 1.15399
ALB B 1.38755
WT B 1.24504
WTALB 0 .
Collinearity Diagnostics
Condition ---------------Proportion of Variation---------------
Number Eigenvalue Index Intercept AGE CALC URIC
1 6.85949 1.00000 0.00004242 0.00146 0.00003529 0.00094208
2 0.06449 10.31327 0.00099816 0.95866 0.00101 0.00592
3 0.03999 13.09718 0.00244 0.00993 0.00170 0.95676
4 0.03175 14.69846 0.00311 0.02017 0.00346 0.02491
5 0.00324 45.97840 0.19319 0.00441 0.07020 0.00033026
6 0.00104 81.23749 0.80022 0.00536 0.92359 0.01114
7 1E-12 2619062 7.73798E-17 0 2.86561E-17 0
Collinearity Diagnostics
---------Proportion of Variation---------
Number ALB WT WTALB
Collinearity Diagnostics
-----------------------------Proportion of Variation-----------------------------
Number Intercept no_atbat no_hits no_home no_runs no_rbi
1 0.00269 0.00023159 0.00023721 0.00178 0.00070700 0.00067808
2 0.09506 0.00165 0.00144 0.17194 0.00056374 0.00580
3 0.59402 0.00312 0.01109 0.05859 0.03961 0.00024424
4 0.03327 0.00418 0.00579 0.18226 0.33544 0.41146
5 0.09469 0.17467 0.09367 0.53523 0.58122 0.55859
6 0.18027 0.81615 0.88778 0.05021 0.04246 0.02322
We now look at the Baseball dataset, which gives statistics for major league baseball players in
1986. This dataset, which contains information on players’ statistics for 1986 and for their
entire careers, is provided as one of the SAS example datasets. We fit a multiple linear
regression model to predict 1987 salary from five of the players' 1986 batting statistics,
and examine collinearity.
/***********************************************************
Look at another data set
************************************************************/
title "Descriptive statistics for Baseball data set";
proc means data=b510.baseball;
run;
Descriptive statistics for Baseball data set
The MEANS Procedure
Variable Label Maximum
------------------------------------------------------
no_atbat Times at Bat in 1986 687.0000000
no_hits Hits in 1986 238.0000000
no_home Home Runs in 1986 40.0000000
no_runs Runs in 1986 130.0000000
no_rbi RBIs in 1986 121.0000000
no_bb Walks in 1986 105.0000000
yr_major Years in the Major Leagues 24.0000000
cr_atbat Career times at bat 14053.00
cr_hits Career Hits 4256.00
cr_home Career Home Runs 548.0000000
cr_runs Career Runs 2165.00
cr_rbi Career RBIs 1659.00
cr_bb Career Walks 1566.00
no_outs Put Outs in 1986 1378.00
no_assts Assists in 1986 492.0000000
no_error Errors in 1986 32.0000000
salary 1987 Salary in $ Thousands 2460.00
------------------------------------------------------
Simple Statistics
Variable N Mean Std Dev Sum Minimum Maximum
salary 263 535.92588 451.11868 140949 67.50000 2460
no_atbat 322 390.07453 143.59584 125604 127.00000 687.00000
no_hits 322 103.39752 44.17951 33294 31.00000 238.00000
no_home 322 11.10248 8.69877 3575 0 40.00000
no_runs 322 52.21739 25.05737 16814 12.00000 130.00000
no_rbi 322 49.37267 25.50116 15898 8.00000 121.00000
title "Linear Regression with Collinear Predictors";
proc reg data=b510.baseball;
model salary = no_atbat no_hits no_home no_runs no_rbi/ tol vif collin;
run; quit;
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 5 16448569 3289714 22.93 <.0001
Error 257 36870544 143465
Corrected Total 262 53319113
Parameter Estimates
Parameter Standard
Variable Label DF Estimate Error t Value Pr > |t|
Intercept Intercept 1 60.48056 75.44086 0.80 0.4235
no_atbat Times at Bat in 1986 1 -1.50940 0.62381 -2.42 0.0162
no_hits Hits in 1986 1 5.94661 2.25811 2.63 0.0090
no_home Home Runs in 1986 1 -2.94006 6.14149 -0.48 0.6325
no_runs Runs in 1986 1 1.74261 2.52465 0.69 0.4907
no_rbi RBIs in 1986 1 7.27974 2.69290 2.70 0.0073
Parameter Estimates
Variance
Variable Label DF Tolerance Inflation
Intercept Intercept 1 . 0
no_atbat Times at Bat in 1986 1 0.06917 14.45751
no_hits Hits in 1986 1 0.05549 18.02256
no_home Home Runs in 1986 1 0.18636 5.36590
no_runs Runs in 1986 1 0.13665 7.31804
no_rbi RBIs in 1986 1 0.11421 8.75605
Collinearity Diagnostics
Condition
Number Eigenvalue Index
1 5.63031 1.00000
2 0.23425 4.90265
3 0.09385 7.74565
4 0.02668 14.52574
5 0.01055 23.10450
6 0.00436 35.92760
Collinearity Diagnostics
-----------------------------Proportion of Variation-----------------------------
Number Intercept no_atbat no_hits no_home no_runs no_rbi
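The output that follows has four predictors and no longer includes no_atbat, which had one of the lowest tolerances in the five-predictor model. The commands are not shown; a sketch, assuming the same options were requested and using a title made up for this illustration, would be:

```sas
/* Sketch only: inferred from the four-predictor output; the title is an assumption */
title "Linear Regression without Times at Bat";
proc reg data=b510.baseball;
   model salary = no_hits no_home no_runs no_rbi / tol vif collin;
run; quit;
```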
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 4 15608610 3902153 26.70 <.0001
Error 258 37710503 146165
Corrected Total 262 53319113
Parameter Estimates
Parameter Standard
Variable Label DF Estimate Error t Value Pr > |t|
Intercept Intercept 1 -40.58946 63.41007 -0.64 0.5227
no_hits Hits in 1986 1 2.06327 1.60337 1.29 0.1993
no_home Home Runs in 1986 1 -2.25645 6.19244 -0.36 0.7159
no_runs Runs in 1986 1 0.78095 2.51651 0.31 0.7566
no_rbi RBIs in 1986 1 6.39975 2.69322 2.38 0.0182
Parameter Estimates
Variance
Variable Label DF Tolerance Inflation
Intercept Intercept 1 . 0
no_hits Hits in 1986 1 0.11212 8.91870
no_home Home Runs in 1986 1 0.18676 5.35454
no_runs Runs in 1986 1 0.14012 7.13669
no_rbi RBIs in 1986 1 0.11633 8.59635
Collinearity Diagnostics
Condition
Number Eigenvalue Index
1 4.65566 1.00000
2 0.22036 4.59644
3 0.08929 7.22101
4 0.02628 13.31095
5 0.00842 23.51501
Collinearity Diagnostics
------------------------Proportion of Variation------------------------
Number Intercept no_hits no_home no_runs no_rbi
1 0.00564 0.00069471 0.00266 0.00106 0.00101
2 0.20603 0.00436 0.16743 0.00145 0.00441
3 0.73320 0.03336 0.07880 0.06028 0.00070962
4 0.03421 0.01939 0.22431 0.28979 0.47351
5 0.02091 0.94220 0.52680 0.64743 0.52036
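The final block of output keeps only no_hits and no_rbi. The commands are not shown; a sketch, again assuming the same options and using a title made up for this illustration, would be:

```sas
/* Sketch only: inferred from the two-predictor output; the title is an assumption */
title "Linear Regression with Hits and RBIs Only";
proc reg data=b510.baseball;
   model salary = no_hits no_rbi / tol vif collin;
run; quit;
```

With only two predictors the tolerance is the same for both, because each is computed from the single correlation between no_hits and no_rbi.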
Model: MODEL1
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 2 15584682 7792341 53.69 <.0001
Error 260 37734430 145132
Corrected Total 262 53319113
Parameter Estimates
Parameter Standard
Variable Label DF Estimate Error t Value Pr > |t|
Intercept Intercept 1 -40.15685 63.06377 -0.64 0.5248
no_hits Hits in 1986 1 2.56978 0.85202 3.02 0.0028
no_rbi RBIs in 1986 1 5.65255 1.45775 3.88 0.0001
Parameter Estimates
Variance
Variable Label DF Tolerance Inflation
Intercept Intercept 1 . 0
no_hits Hits in 1986 1 0.39426 2.53637
no_rbi RBIs in 1986 1 0.39426 2.53637
Collinearity Diagnostics
Condition ---------Proportion of Variation---------
Number Eigenvalue Index Intercept no_hits no_rbi
1 2.85776 1.00000 0.01581 0.00655 0.00894
2 0.10757 5.15435 0.81555 0.02401 0.22821
3 0.03468 9.07800 0.16864 0.96944 0.76285