Simple Linear Regression Models Using SAS
/*****************************************************************
This example illustrates:
How to get a scatter plot with a regression line
How to get a Pearson correlation
How to create a log-transformed variable
How to carry out a simple linear regression
How to check residuals for normality
How to get ODS output that looks really nice
Procs used:
Proc SGplot
Proc Corr
Proc Reg
Proc Univariate
Proc Transreg
Filename: regression_lecture1.sas
*******************************************************************/
We first set up the libname statement so we can use the permanent data set: b510.cars.
OPTIONS FORMCHAR="|----|+|---+=|-/\<>*";
libname b510 "e:\510\";
We are examining the relationship between the dependent variable, HorsePower, and a continuous predictor,
Weight. We first look at a scatterplot, with a regression line included to see the relationship between Y and X
and decide if it appears to be linear (degree = 1 is used for the regression line). We also look for any outliers.
title "Scatter Plot with Regression Line";
proc sgplot data=b510.cars;
reg y=horse x=weight / degree=1;
run;
Next we check the correlation between horsepower and weight.
title "Correlation";
proc corr data=b510.cars;
var horse weight;
run;
Correlation
The CORR Procedure

Simple Statistics
Variable     N        Mean     Std Dev       Sum     Minimum     Maximum
HORSE      400   104.83250    38.52206     41933    46.00000   230.00000
WEIGHT     406        2970   849.82717   1205642   732.00000        5140

             HORSE      WEIGHT
HORSE      1.00000     0.85942
                        <.0001
               400         400
Next, we fit a simple linear regression model, with HorsePower as the dependent variable, and Weight as the
predictor. We plot studentized residuals vs. the predicted values as part of Proc Reg, to check for
homoskedasticity (equality of variances). We later use Proc Univariate to check the distribution of the residuals
for normality.
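The code for this step does not appear with its output. A minimal sketch of what it might look like, patterned on the log-horsepower run later in this handout: the title text and the output data set name regdat are assumptions, not taken from the original program.

```sas
/* Sketch only: the title and the data set name regdat are assumed */
title "HorsePower is Y";
proc reg data=b510.cars;
  model horse = weight;
  /* studentized residuals vs. predicted values, to check for equal variance */
  plot rstudent.*predicted.;
  /* save predicted values and residuals for the normality check */
  output out=regdat p=predict r=resid rstudent=rstudent;
run; quit;

/* Check the saved studentized residuals for normality */
title "Check Residuals for Normality";
proc univariate data=regdat normal;
  var rstudent;
  histogram rstudent / normal;
  qqplot rstudent / normal(mu=est sigma=est);
run;
```

The NORMAL option on the Proc Univariate statement requests formal tests of normality; the histogram and Q-Q plot give the visual check discussed below.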
Analysis of Variance
                             Sum of        Mean
Source             DF       Squares      Square    F Value    Pr > F
Model               1        437321      437321    1124.57    <.0001
Error             398        154774   388.88017
Corrected Total   399        592096

Parameter Estimates
                   Parameter    Standard
Variable     DF     Estimate       Error    t Value    Pr > |t|
Intercept     1    -10.77763     3.58572      -3.01      0.0028
WEIGHT        1      0.03884     0.00116      33.53      <.0001
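As a check on how the pieces of this output fit together, the fitted line and R-square can be worked out by hand from the tables above. This is a worked illustration only; the rounding is ours.

```latex
\widehat{\text{HorsePower}} = -10.77763 + 0.03884 \times \text{Weight}

R^2 = \frac{SS_{\text{Model}}}{SS_{\text{Corrected Total}}}
    = \frac{437321}{592096} \approx 0.7386

r^2 = (0.85942)^2 \approx 0.7386

\text{Root MSE} = \sqrt{388.88017} \approx 19.72
```

In simple linear regression, R-square equals the square of the Pearson correlation between Y and X, so the value here agrees with the Proc Corr output.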
The residuals appear to have very unequal variances. We will try to correct this problem.
These residuals don’t look too bad for normality, although they have a long right tail (are somewhat skewed to
the right).
We use the Box-Cox method in Proc Transreg to choose a transformation of Y that addresses the unequal variances
and the right skew of the residuals.
/*Decide on a transformation*/
title "Look for an appropriate Transformation of Y";
proc transreg data=b510.cars;
model boxcox(horse/geo) = identity(weight);
run;
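Based on the output below, we choose Log(Y), or Log(HorsePower), as the transformation we will use: the log
likelihood is largest at Lambda = 0.00 (the row marked with <), and Lambda = 0 corresponds to the log transformation.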
Look for an appropriate Transformation of Y
The TRANSREG Procedure
Transformation Information
for BoxCox(HORSE)
Lambda R-Square Log Like
-3.00 0.44 -1469.66
-2.75 0.48 -1425.53
-2.50 0.51 -1382.98
-2.25 0.55 -1342.22
-2.00 0.58 -1303.49
-1.75 0.61 -1267.14
-1.50 0.64 -1233.60
-1.25 0.67 -1203.42
-1.00 0.70 -1177.27
-0.75 0.72 -1155.93
-0.50 0.74 -1140.21
-0.25 0.75 -1130.91
0.00 + 0.76 -1128.69 <
0.25 0.76 -1133.90
0.50 0.76 -1146.58
0.75 0.75 -1166.37
1.00 0.74 -1192.65
1.25 0.72 -1224.64
1.50 0.70 -1261.49
1.75 0.68 -1302.44
2.00 0.66 -1346.78
2.25 0.63 -1393.96
2.50 0.61 -1443.53
2.75 0.58 -1495.12
3.00 0.55 -1548.47
We create the new variables, LogHorse, LogWeight, and LogMPG in a data step. We will only be using
LogHorse in this example.
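The data step itself is not shown in this handout. A minimal sketch, assuming the new data set is the b510.cars2 used in the regression below and that the miles-per-gallon variable is named mpg (an assumed name), followed by a scatter plot of the transformed variable against Weight:

```sas
/* Sketch only: the variable name mpg is an assumption */
data b510.cars2;
  set b510.cars;
  /* LOG() is the natural log; the result is missing if the argument is missing */
  loghorse  = log(horse);
  logweight = log(weight);
  logmpg    = log(mpg);
run;

/* Scatter plot of the transformed Y against X, with a fitted line */
title "Scatter Plot of Log HorsePower vs. Weight";
proc sgplot data=b510.cars2;
  reg y=loghorse x=weight / degree=1;
run;
```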
We can see more nearly equal variance, with a mainly linear relationship, but with some unusual cases at the lower end
of the plot.
We now rerun the regression analysis, but with LogHorse as Y and Weight as X.
title "Log HorsePower is Y";
proc reg data=b510.cars2;
model loghorse = weight;
plot rstudent.*predicted.;
output out=regdat2 p=predict r=resid rstudent=rstudent;
run; quit;
Log Horsepower is Y
The REG Procedure
Model: MODEL1
Dependent Variable: loghorse
Analysis of Variance
                             Sum of        Mean
Source             DF       Squares      Square    F Value    Pr > F
Model               1      35.89161    35.89161    1235.69    <.0001
Error             398      11.56023     0.02905
Corrected Total   399      47.45184

Parameter Estimates
                    Parameter      Standard
Variable     DF      Estimate         Error    t Value    Pr > |t|
Intercept     1       3.54381       0.03099     114.36      <.0001
WEIGHT        1    0.00035187    0.00001001      35.15      <.0001
The scatterplot of residuals vs. predicted values now shows much more homogeneous variance at all the
predicted values. We still see a couple of outliers.
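Because Y is now on the log scale, the slope is easiest to read after back-transforming. The following worked illustration assumes LogHorse is the natural log of HorsePower; the choice of a 1000-unit change in Weight is ours, made only to give a readable number.

```latex
\widehat{\log(\text{HorsePower})} = 3.54381 + 0.00035187 \times \text{Weight}

\frac{\widehat{\text{HorsePower}}(\text{Weight} + 1000)}{\widehat{\text{HorsePower}}(\text{Weight})}
  = e^{0.00035187 \times 1000} = e^{0.35187} \approx 1.42

R^2 = \frac{35.89161}{47.45184} \approx 0.756
```

Under this model, each additional 1000 units of Weight multiplies the predicted HorsePower by about 1.42, that is, an increase of roughly 42 percent. The R-square of about 0.756 matches the 0.76 reported for Lambda = 0 in the Proc Transreg output.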
We check the studentized residuals from this regression for normality, using Proc Univariate.
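The Proc Univariate step is not shown in this handout. A minimal sketch, using the regdat2 data set and the rstudent variable created by the OUTPUT statement in the Proc Reg step above; the title and the choice of plots are assumptions.

```sas
/* Sketch only: the title and plot options are assumed */
title "Check Studentized Residuals for Normality (Log HorsePower Model)";
proc univariate data=regdat2 normal;
  var rstudent;
  /* histogram with a fitted normal curve */
  histogram rstudent / normal;
  /* normal Q-Q plot with a reference line from the estimated mean and SD */
  qqplot rstudent / normal(mu=est sigma=est);
run;
```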