10 Regression Analysis in SAS
10 Regression Analysis in SAS
10 Regression Analysis in SAS
Picka
10
The following SAS code produces statistical descriptions and a simple regression analysis of a small data set. The data are the introductory example from Draper and Smith (1998). The DATA step causes SAS to read data values directly from the input stream. In this example, the goal is to predict monthly steam usage (STEAMUSE) from average monthly temperature (TEMP), using a straight line regression. Statistical characteristics of the data are examined using PROC UNIVARIATE. The PLOT option in the PROC UNIVARIATE statement cause SAS to produce crude histograms and boxplots. The NORMAL option causes SAS to test the hypothesis that the variable has a normal distribution. These options are given for illustrative purposes only, since we are not interested in the distribution of TEMP and do not believe that the STEAMUSE observations are identically distributed. PROC UNIVARIATE produces lots of output and is mainly used for exploratory purposes. It is always useful to obtain scatter plots of the raw data in regression problems. The PROC PLOT step generates crude printer plots, but in this case it is sucient to show that there is a rough linear trend with a negative slope and no wild outliers. The regression analysis is performed using PROC REG. In this example we only specify the model to be estimated (in the MODEL statement). It is possible to get much more: plots, diagnostics and tests of model assumptions. The PROC GPLOT step produces a high-resolution graph of the raw data with the regression line superimposed. The form of the graph is specied in the SYMBOL statement, which species that a least squares regression line should be used to interpolate between data points and that raw data points should be indicated by plus signs. options ls=70 ; title1 STEAM DATA FROM CHAPTER 1 OF DRAPER & SMITH ; data steam ; input steamuse temp @@ ; /* Note that SAS requires ONLY 2 obs per line (for 2 variables) without the @@ at end of input line. */ datalines ; 10.98 35.3 11.13 29.7 12.51 30.8 8.40 58.8 9.27 61.4
86
proc univariate data=steam plot normal ; var steamuse temp ; title2 Univariate Descriptive Statistics ; proc plot data=steam ; title2 Scatterplot of Raw Data ; plot steamuse*temp ; proc reg data=steam ; title2 Least Squares Analysis ; model steamuse = temp ; /* NOTE: dont need run between PROCs */ proc gplot data=steam ; symbol i=rl value=PLUS ; plot steamuse*temp ; title2 Observed Values and Estimated Regression Line ; run ; The 25 data values produce about 3.5 pages of SAS output. Shorter output could have been generated using PROC MEANS or PROC CORR. However, these procedures can not produce histograms or test for normality. The regression output from PROC REG appears on page 10 of the SAS output. The Analysis of Variance table shows how the total variation (Yi Y )2 is decomposed into a component explained by the regression model and an unexplained component described as Error. The F -test has a very small p value so that we conclude that the straight line model Yi = 0 + 1 xi + ei ts the data much better than the trivial model Yi = 0 + ei . Because R2 = .7144, we conclude that 71% of the variation in Y is explained by the linear regression relationship. The estimated slope and intercept appear in the table headed Parameter Estimates. For each parameter, the least squares estimate and estimated standard error are given. Also, SAS provides a Student t test of the null hypothesis that the true parameter is zero. Note that the square of the t statistic for the slope is equal to the F statistic for the model, in accordance with the theory.
87
STEAM DATA FROM CHAPTER 1 OF DRAPER & SMITH Univariate Descriptive Statistics The UNIVARIATE Procedure Variable: steamuse Moments N Mean Std Deviation Skewness Uncorrected SS Coeff Variation ... The UNIVARIATE Procedure Variable: steamuse Tests for Normality Test Shapiro-Wilk Kolmogorov-Smirnov Cramer-von Mises Anderson-Darling ... The UNIVARIATE Procedure Variable: steamuse ... Stem 12 11 10 9 8 7 6 Leaf 25 0119 149 1366 12455799 78 48 ----+----+----+----+ # 2 4 3 4 8 2 2 Boxplot | | +-----+ *--+--* +-----+ | | --Statistic--W D W-Sq A-Sq 0.971215 0.110663 0.055016 0.310084 -----p Value-----Pr Pr Pr Pr < > > > W D W-Sq A-Sq 0.6760 >0.1500 >0.2500 >0.2500 25 9.424 1.63064149 0.21135583 2284.1102 17.3030718 Sum Weights Sum Observations Variance Kurtosis Corrected SS Std Error Mean 25 235.6 2.65899167 -0.6061937 63.8158 0.3261283
88
Variable:
steamuse
Normal Probability Plot 12.5+ | | 9.5+ | | 6.5+ *++++*+ *+*+*++ ***+*+ ++**** **+****** +*+*++ *++++* +----+----+----+----+----+----+----+----+----+----+ -2 -1 0 +1 +2
The UNIVARIATE Procedure Variable: temp Moments N Mean Std Deviation Skewness Uncorrected SS Coeff Variation ... Variable: temp 25 52.6 17.2655968 -0.1157664 76323.42 32.8243285 Sum Weights Sum Observations Variance Kurtosis Corrected SS Std Error Mean 25 1315 298.100833 -1.5262541 7154.42 3.45311936
Tests for Normality Test Shapiro-Wilk Kolmogorov-Smirnov Cramer-von Mises Anderson-Darling ... --Statistic--W D W-Sq A-Sq 0.897492 0.163221 0.115956 0.826067 -----p Value-----Pr Pr Pr Pr < > > > W D W-Sq A-Sq 0.0162 0.0847 0.0676 0.0290
89
STEAM DATA FROM CHAPTER 1 OF DRAPER & SMITH Scatterplot of Raw Data Plot of steamuse*temp. Legend: A = 1 obs, B = 2 obs, etc.
90
Least Squares Analysis The REG Procedure Model: MODEL1 Dependent Variable: steamuse Analysis of Variance Sum of Squares 45.59240 18.22340 63.81580 0.89012 9.42400 9.44529 Mean Square 45.59240 0.79232
Source Model Error Corrected Total Root MSE Dependent Mean Coeff Var
DF 1 23 24
F Value 57.54
Pr > F <.0001
0.7144 0.7020
Parameter Estimates Parameter Estimate 13.62299 -0.07983 Standard Error 0.58146 0.01052
DF 1 1
We will see in the next segment of this handout that one can calculate by hand how far out individual residuals are from a regression-model t in Splus. To do this more automatically in SAS, try the INFLUENCE option within the MODEL statement of PROC REG. Another option you can use to highlight special outlying features of individual observations is the COOKD output keyword under PROC REG.
91
Splus Example, 3/31/03 ---------------------> steamdat <- matrix(c(10.98, 35.3, 11.13, 29.7, 12.51, 30.8, 8.40, 58.8, + 9.27, 61.4, 8.73, 71.3, 6.36, 74.4, 8.50, 76.7, 7.82, 70.7, 9.14, 57.5, + 8.24, 46.4, 12.19, 28.9, 11.88, 28.1, 9.57, 39.1, 10.94, 46.8, 9.58, + 48.5, 10.09, 59.3, 8.11, 70.0, 6.83, 70.0, 8.88, 74.5, 7.68, 72.1, + 8.47, 58.1, 8.86, 44.6, 10.36, 33.4, 11.08, 28.6), ncol=2, byrow=T, + dimnames=list(NULL,c("steamuse","temp"))) ## This is the dataset used in the little SAS illustration for ## PROC REG above. > motif() > plot(steamdat[,2],steamdat[,1], xlab="Temp", ylab="Use", main="Draper-Smith Steam Data Example") ## This produces a scatterplot, motivating line-fitting. > lmtmp <- lm(steamuse ~ . , data=data.frame(steamdat)) > lmtmp Call: lm(formula = steamuse ~ ., data = data.frame(steamdat)) Coefficients: (Intercept) temp 13.62299 -0.07982869 Degrees of freedom: 25 total; 23 residual Residual standard error: 0.8901245 Value Std. Error 13.62298927 0.58146349 -0.07982869 0.01052358 t value Pr(>|t|) 23.428795 0.00000e+00 -7.585697 1.05495e-07
(Intercept) temp
> names(lmtmp) [1] "coefficients" "residuals" "fitted.values" "effects" [5] "R" "rank" "assign" "df.residual" [9] "contrasts" "terms" "call" > lines(steamdat[,2], lmtmp$fitted, lty=3)
92
> names(summary(lmtmp)) [1] "call" "terms" [6] "df" "r.squared" > dim(model.matrix(lmtmp)) [1] 25 2
"residuals" "fstatistic"
### Next we do two things to show how to interact with the plot. ### The first is purely graphical: we highlight a few of the points by ### pointing and clicking at them, using "identify": > identify(steamdat[,2],steamdat[,1]) ### Now click successively with left mouse-button over the three ### uppermost points in the plot and then the three lowermost, ### and then click middle mouse-button [1] 3 15 17 11 19 7 > printgraph(file="Steamplot.ps") ### These are the indices of the points clicked on: a great way ### to identify outliers "visually" More conventionally, we can try to identify outliers according to the "hat matrix": if X denotes the design-matrix (in this case the 25x2 matrix model.matrix(lmtmp)) for a simple linear regression, for which the fitted variance is > summary(lmtmp)$sigma^2 [1] 0.7923217 > sum(lmtmp$residuals^2)/23 [1] 0.7923217 then the theoretical vector of variances for the residuals from the linear-regression fit is > rvar <- { mtmp <- model.matrix(lmtmp) 0.7923217*diag(diag(25)-mtmp %*% solve(t(mtmp) %*% mtmp, t(mtmp))) } So the standardized residuals are: > rstd <- lmtmp$resid/sqrt(rvar) ### standardized residuals ## of which only those with indices 3, 7, 11 seem significant:
93
11
15
10
17
Use 9
11
19 7 30 40 50 Temp 60 70
Figure 1: Scatterplot of Draper-Smith Steam Data including Splus tted line and points highlighted with the identify function. Of the selected points, we found the standardized residuals of numbers 3, 7, 11 to be respectively 1.60, -1.57, -1.53.
If we had wanted to identify points by plotting the standardized residuals instead of the row-indices, the Splus command would be: > identify(steamdat[,2],steamdat[,1], labels=round(rstd,2)))
94
10.1
Various methods exist to quantify either how badly an individual point is reproduced by a regression model, or how important the individual point is in aecting the values of tted model parameters. Points which are very special by either of these criteria can be called outliers and can be considered for removal from the dataset before reporting coecients and assessing quality of t, although removing points is often a very bad idea because the reality often is that observed populations must be viewed as superpositions of distinct or latent subpopulations which are not easy to recognize in advance. We have already described under the Splus segment above the calculation of standardized residuals, rstd in the example. This measures whether the discrepancy between an observation and its predictor are larger than might occur by chance. (Compare its square to percentage points for a chi-square random variable with one degree of freedom. Another approach to spotting residuals might be to plot lower and upper condence limits for each of the observations. In Splus, in the plot already displayed in the Figure, we would do this with the statements > lines(steamdat[,2],lmtmp$fitted + 1.645*sqrt(rvar), lty=6) > lines(steamdat[,2],lmtmp$fitted -1.645*sqrt(rvar), lty=6) Here we are calling points extreme if their standardized residuals are signicant (two-sided) at the 10% level. The value of Cooks distance for each observation represents a measure of the degree to which the predicted values change if the observation is left out of the regression. If an observation has an unusually large value for the Cooks distance, it might be worth deleting it from the regression and seeing if the t is improved. If no signicant change in the coecient estimates or the root MSE occurs, it is best not to delete the point from further analyses. (It might not be a good idea to delete it even if it looks inuential in this sense.) There may be no inuential cases in a particular regression problem. In SAS, the way to get a SAS output le with a calculation of predictors, residuals, standardized (or studentized residuals, Cooks distance, and lower and upper condence limits for all of the observation values, the code would be written:
95
PROC REG data=SASstf.steam; MODEL steamuse = temp / alpha=.05 INFLUENCE ; OUTPUT out=SASstf.steamOUT predicted=steamhat residual=steamrsd student=stdresid cookd=steamcook lcl = lowstm ucl = histm; run; In the resulting output matrix, the steamcook column shows largest values (respectively .152, .147, .122) for observations 3, 7, and 20. There is no direct correspondence between absolute studentized residual and Cooks distance, although it is true that the very largest Cooks distances almost always correspond to observations with large studentized residuals. As a single indication of syntax and results for a Proc Gplot in SAS, consider the following: data steam; set SASstf.steamout; proc sort; by temp; symbol1 symbol2 symbol3 symbol4 value value value value = = = = NONE color=black i=join line = 3 ; NONE color=black i=j l=10 ; circle color = black ; square color = black i=j line 6;
proc gplot data=steam ; title "Simultaneous plotting example" ; plot steamuse * temp = 3 lowstm * temp = 1 histm * temp = 2 steamhat * temp = 4 / overlay legend ; run; The output from this SAS code is given in the following gure:
96
Figure 2: SAS plot of Draper-Smith Steam Data including tted line and lower and upper prediction-interval points calculated for individual observations.
97