Proc Summary
KEY WORDS
OUTPUT, MEANS, SUMMARY, AUTONAME, _TYPE_, WAYS, LEVELS, MAXID, IDGROUP, preloaded formats
INTRODUCTION
PROC MEANS is one of SAS's original procedures, and its initial mandate was to create printed tables of summary statistics. Later PROC SUMMARY was introduced to create summary data sets. Although these two procedures grew up on opposite sides of the tracks, over time both have evolved so that under the current version of SAS they actually use the same software behind the scenes. These two procedures completely share capabilities. In fact neither can do anything that the other cannot do. Only some of the defaults are different (as they reflect the procedures' original roots). For the analyst faced with creating statistical summaries, the MEANS/SUMMARY procedure is indispensable. While it is fairly simple to generate a straightforward statistical summary, these procedures allow a complex list of options and statements that give the analyst a great deal of control. Because of the similarity of these two procedures, examples will tend to show one or the other but not both. When I use MEANS or SUMMARY, I tend to select the procedure based on the primary objective of the step (SUMMARY for a summary data set and MEANS for a printed table). Even that rule, however, is rather lax, as MEANS has the further advantage of only having 5 letters in the procedure name.
BASIC STATEMENTS
The MEANS/SUMMARY procedure is so powerful that just a few simple statements and options can produce fairly complex and useful summary tables.
Selecting Statistics
Generally we want more control over which statistics are to be selected. When you want to specifically select statistics, they are listed as options on the PROC statement.
   title1 'The First Two Statistical Moments';
   proc means data=sashelp.class n mean var std stderr;
      var weight;
   run;
The First Two Statistical Moments
The MEANS Procedure
Analysis Variable : Weight
    N          Mean      Variance      Std Dev    Std Error
   19   100.0263158   518.6520468   22.7739335    5.2246987
The list of available statistics is fairly comprehensive. A subset includes:
   n          number of observations used to calculate the statistics
   nmiss      number of observations with missing values
   min        minimum value taken on by the data
   max        maximum value taken on by the data
   range      difference between the min and the max
   sum        total of the data
   mean       arithmetic mean
   std        standard deviation
   stderr     standard error
   var        variance
   skewness   symmetry of the data's distribution
   kurtosis   peakedness of the data's distribution
A number of statistics having to do with percentiles and quantiles are also available, including:
   median       50th percentile
   p50          50th percentile (or second quartile)
   p25 | q1     25th percentile (or first quartile)
   p75 | q3     75th percentile (or third quartile)
   p1 p5 p10    other percentiles
   p90 p95 p99  other percentiles
Starting in SAS 9.2 the MODE statistic is also available. Statistics listed on the PROC statement are only applied to the printed table and have NOTHING to do with any summary data sets that are also created.
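The separation between printed-table statistics and data-set statistics can be seen in a small sketch. It assumes only the SASHELP.CLASS sample data; the data set name PRINT_VS_OUT and the variable names MIN_WT and MAX_WT are made up for this illustration.

```sas
/* Sketch only: statistics on the PROC statement drive the printed table. */
proc means data=sashelp.class n median p25 p75;
   var weight;
   /* Statistics on the OUTPUT statement drive the summary data set.      */
   /* PRINT_VS_OUT, MIN_WT and MAX_WT are illustrative names.             */
   output out=print_vs_out min=min_wt max=max_wt;
run;

/* The data set holds MIN and MAX, not the N/MEDIAN/P25/P75 printed above. */
proc print data=print_vs_out;
run;
```

The printed table should show N, the median, and the two quartiles, while the data set should carry only the minimum and maximum along with _TYPE_ and _FREQ_.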
   title1 'A Simple Summary Data Set';
   proc means data=sashelp.class noprint;
      var weight;
      output out=summrydat;
   run;
The NOPRINT option is used with MEANS, because a printed table is not wanted. A PROC PRINT of the summary data set (WORK.SUMMRYDAT) shows the following:
A Simple Summary Data Set
   Obs   _TYPE_   _FREQ_   _STAT_    Weight
    1      0        19      N         19.000
    2      0        19      MIN       50.500
    3      0        19      MAX      150.000
    4      0        19      MEAN     100.026
    5      0        19      STD       22.774
Again, since statistics were not specified, the same default list of statistics that was used in the MEANS printed table appears here.
Selecting the Statistics and Naming the Variables in the Summary Data Set
Usually when you create a summary data set, you will want to specifically select the statistics. These are specified on the OUTPUT statement. Remember statistics listed on the PROC statement only apply to printed tables and have nothing to do with the statistics that you want in the summary data set. The techniques shown below can be combined - experiment.
Selecting Statistics
Statistics are selected by using their names as options in the OUTPUT statement. The name of each statistic is followed by an equal sign. The following OUTPUT statement requests that the mean weight be calculated and saved in the data set SUMMRYDAT.
   title1 'Selected Statistics';
   proc summary data=sashelp.class;
      var weight;
      output out=summrydat mean=;
   run;
The mean weight will be stored in a variable named WEIGHT. Used alone, this technique lets you pick only a single statistic (a second statistic would try to reuse the same variable name), so it is limited; however, when combined with the techniques shown below, it can be very flexible.
Explicit Naming
By following the equal sign with a name, you can provide names for the new variables. This allows you to name more than one statistic on the OUTPUT statement.
   title1 'Selecting Multiple Statistics';
   proc summary data=sashelp.class;
      var weight;
      output out=summrydat n=number mean=average std=std_deviation;
   run;
Selecting Multiple Statistics
   Obs   _TYPE_   _FREQ_   number   average   std_deviation
    1      0        19       19     100.026      22.7739
You can also name multiple analysis variables. Here both HEIGHT and WEIGHT are specified.
   title1 'Multiple Analysis Variables';
   proc summary data=sashelp.class;
      var height weight;
      output out=summrydat n    = ht_n wt_n
                           mean = mean_ht mean_wt
                           std  = sd_ht sd_wt;
   run;
Be careful here, as the order of the variables in the VAR statement determines which variable is for height and which is for weight. You should also be smart about naming conventions. In the previous example the statistics for N are not consistently named relative to those for the MEAN and STD. This technique does not allow you to skip statistics. If you did not want the mean for HEIGHT, but only the mean for WEIGHT, this would not be possible, because HEIGHT is first on the VAR statement. To get around this you can use the techniques on naming the statistics shown in the next section.
Selected Naming
When there is more than one variable in the VAR statement, but you do not want every statistic calculated for every analysis variable, you can selectively associate statistics with analysis variables.
   title1 'Selective Associations';
   proc summary data=sashelp.class;
      var height weight;
      output out=summrydat n            = ht_n wt_n
                           mean(weight) = wt_mean
                           std(height)  = ht_std;
   run;
Selective Associations
   Obs   _TYPE_   _FREQ_   ht_n   wt_n   wt_mean    ht_std
    1      0        19      19     19    100.026   5.12708
Alternate forms of the statistic selections (in this case for the MEAN) could have included the following:
   mean(weight height)=wt_mean ht_mean
   mean(weight)=wt_mean mean(height)=ht_mean
Automatic Naming of Summary Variables
When you do not NEED to control the naming of the new summary variables, the AUTONAME and AUTOLABEL options can be used on the OUTPUT statement. The AUTONAME option allows you to select statistics without picking a name for the resulting variable in the OUTPUT table. This eliminates naming conflicts. The AUTOLABEL option creates a label for variables added to the OUT= data set.
   title1 'Using AUTONAME';
   proc summary data=sashelp.class;
      var height weight;
      output out=summrydat n= mean= std= / autoname;
   run;
Using AUTONAME
   Obs  _TYPE_  _FREQ_  Height_N  Weight_N  Height_Mean  Weight_Mean  Height_StdDev  Weight_StdDev
    1     0       19       19        19       62.3368      100.026       5.12708        22.7739
Notice that the names are in the form of variable_statistic. This is a nicely consistent, dependable, and usable naming convention.
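The AUTOLABEL option mentioned above is not shown in the example. A minimal sketch, assuming only SASHELP.CLASS and using the made-up data set name AUTOLAB, combines both options and then lists the resulting names and labels:

```sas
/* Sketch only: AUTONAME builds the variable names, AUTOLABEL the labels. */
/* AUTOLAB is an illustrative data set name.                              */
proc summary data=sashelp.class;
   var height weight;
   output out=autolab mean= std= / autoname autolabel;
run;

/* Inspect the generated variable names and labels. */
proc contents data=autolab;
run;
```

Because the names are generated, no explicit name list has to be kept in step with the VAR statement.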
   title1 'CLASS and a Printed Table';
   proc means data=sashelp.class(where=(age in(12,13,14))) n mean std;
      class age sex;
      var height;
   run;
CLASS and a Printed Table
The MEANS Procedure
Analysis Variable : Height
   Age   Sex   N Obs    N         Mean      Std Dev
    12    F      2      2   58.0500000    2.4748737
          M      3      3   60.3666667    3.9323445
    13    F      2      2   60.9000000    6.2225397
          M      1      1   62.5000000            .
In a Summary Data Set
When creating a summary data set, one can get not only the classification variable interaction statistics, but the main factor statistics as well. This can be very helpful to the statistician.
   title1 'CLASS and a Summary Data Set';
   proc summary data=sashelp.class(where=(age in(12,13,14)));
      class age sex;
      var height;
      output out=clsummry n=ht_n mean=ht_mean std=ht_sd;
   run;
A PROC PRINT of the data set CLSUMMRY shows:
CLASS and a Summary Data Set
   Obs   Age   Sex   _TYPE_   _FREQ_   ht_n   ht_mean     ht_sd
     1     .            0       12      12    61.7583   3.97868
     2     .    F       1        6       6    60.8333   3.90470
     3     .    M       1        6       6    62.6833   4.18637
     4    12            2        5       5    59.4400   3.29742
     5    13            2        3       3    61.4333   4.49592
     6    14            2        4       4    64.9000   2.80119
     7    12    F       3        2       2    58.0500   2.47487
     8    12    M       3        3       3    60.3667   3.93234
     9    13    F       3        2       2    60.9000   6.22254
    10    13    M       3        1       1    62.5000    .
    11    14    F       3        2       2    63.5500   1.06066
    12    14    M       3        2       2    66.2500   3.88909
Two additional variables have been added to the summary data set: _TYPE_ (which is described below in more detail), and _FREQ_ (which counts observations). Although not apparent in this example, _FREQ_ counts all observations, while the N statistic only counts observations with non-missing values. If you only want the statistics for the highest order interaction, you can use the NWAY option on the PROC statement.
   proc summary data=sashelp.class(where=(age in(12,13,14))) nway;
Understanding _TYPE_
The _TYPE_ variable in the output data set helps us track the level of summarization, and can be used to distinguish the sets of statistics. Notice in the previous example that _TYPE_ changes for each level of summarization.
   _TYPE_ = 0   Summarize across all classification variables
   _TYPE_ = 1   Summarize as if the right most classification variable (SEX) was the only one
   _TYPE_ = 2   Summarize as if the next to the right most classification variable (AGE) was the only one
   _TYPE_ = 3   Interaction of the two classification variables
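For the two-variable example above, the NWAY option and the _TYPE_ variable can be tied together in a short sketch. It assumes only SASHELP.CLASS; the data set names NWAYONLY and TYPE3ONLY are made up for this illustration.

```sas
/* Sketch only: NWAY keeps just the highest order interaction rows. */
proc summary data=sashelp.class(where=(age in(12,13,14))) nway;
   class age sex;
   var height;
   output out=nwayonly n=ht_n mean=ht_mean std=ht_sd;
run;

/* Without NWAY every level of summarization is written out ... */
proc summary data=sashelp.class(where=(age in(12,13,14)));
   class age sex;
   var height;
   output out=clsummry n=ht_n mean=ht_mean std=ht_sd;
run;

/* ... and the same interaction rows can be selected afterwards by _TYPE_. */
data type3only;
   set clsummry;
   where _type_ = 3;
run;

/* The two data sets are expected to match. */
proc compare base=nwayonly compare=type3only;
run;
```

Selecting on _TYPE_ after the fact is convenient when the other summary levels are also needed; NWAY avoids writing them at all.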
In the following example there are three CLASS variables and _TYPE_ ranges from 0 to 7.
   title1 'Understanding _TYPE_';
   proc summary data=advrpt.demog(where=(race in('1','4')
                                       & 12 le edu le 15
                                       & symp in('01','02','03')));
      class race edu symp;
      var ht;
      output out=stats mean= meanHT;
   run;
Understanding _TYPE_
   Obs   RACE   EDU   SYMP   _TYPE_   _FREQ_   meanHT
     1            .            0        8       66.25
     2            .    01      1        2       64.00
     3            .    02      1        4       66.50
     4            .    03      1        2       68.00
     5           12            2        4       67.50
     6           14            2        2       64.00
     7           15            2        2       66.00
     8           12    02      3        2       67.00
     9           12    03      3        2       68.00
    10           14    01      3        2       64.00
    11           15    02      3        2       66.00
    12    1       .            4        6       67.00
    13    4       .            4        2       64.00
    14    1       .    02      5        4       66.50
    15    1       .    03      5        2       68.00
    16    4       .    01      5        2       64.00
    17    1      12            6        4       67.50
    18    1      15            6        2       66.00
    19    4      14            6        2       64.00
    20    1      12    02      7        2       67.00
    21    1      12    03      7        2       68.00
    22    1      15    02      7        2       66.00
    23    4      14    01      7        2       64.00
When calculating the value of _TYPE_, assign a zero (0) when summarizing over a CLASS variable and assign a one (1) when summarizing for the CLASS variable. In the table below the zeros and ones associated with the class variables form a binary value. This binary value can be converted to decimal to obtain _TYPE_.
                       CLASS VARIABLES
                  RACE      EDU       SYMP      Binary
   Observations   (2^2=4)   (2^1=2)   (2^0=1)   Value    _TYPE_
    1               0         0         0         0        0
    2 - 4           0         0         1         1        1
    5 - 7           0         1         0        10        2
    8 - 11          0         1         1        11        3
   12 - 13          1         0         0       100        4
   14 - 16          1         0         1       101        5
   17 - 19          1         1         0       110        6
   20 - 23          1         1         1       111        7
A binary value of 110 = 1*2^2 + 1*2^1 + 0*2^0 = 1*4 + 1*2 + 0*1 = 6 = _TYPE_
Some SAS programmers find converting binary values to decimal values a bit tedious. Fortunately the developers at SAS Institute have provided us with alternatives.
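The conversion itself can also be checked in a DATA step. This is a small sketch rather than part of the original examples; it relies on the BINARYw. informat and format, and the variable names are made up for the illustration.

```sas
/* Sketch only: convert between the binary pattern and _TYPE_. */
data _null_;
   /* Arithmetic form of the worked example: RACE=1, EDU=1, SYMP=0 */
   type_by_hand     = 1*2**2 + 1*2**1 + 0*2**0;
   /* The BINARYw. informat reads a string of 0s and 1s as an integer */
   type_from_binary = input('110', binary3.);
   /* The BINARYw. format writes an integer back out as 0s and 1s */
   binary_from_type = put(6, binary3.);
   put type_by_hand= type_from_binary= binary_from_type=;
run;
```

Both numeric values should come out as 6, and the formatted value as 110, which matches the row for observations 17 - 19 in the table above.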
Using CHARTYPE
The CHARTYPE option causes _TYPE_ to be displayed as a character variable in binary form rather than as a decimal value.
   title1 'Understanding _TYPE_ Using CHARTYPE';
   proc summary data=advrpt.demog(where=(race in('1','4')
                                       & 12 le edu le 15
                                       & symp in('01','02','03')))
                chartype;
      class race edu symp;
      var ht;
      output out=stats mean= meanHT;
   run;
Understanding _TYPE_ Using CHARTYPE
   Obs   RACE   EDU   SYMP   _TYPE_   _FREQ_   meanHT
     1            .           000       8       66.25
     2            .    01     001       2       64.00
     3            .    02     001       4       66.50
     4            .    03     001       2       68.00
     5           12           010       4       67.50
     6           14           010       2       64.00
     7           15           010       2       66.00
     8           12    02     011       2       67.00
     9           12    03     011       2       68.00
LEVELS option
Adds the variable _LEVEL_ to the OUT= data table. This numeric variable counts the observations within _TYPE_. This means that when FIRST._TYPE_ is true _LEVEL_ will equal 1.
WAYS option
Adds the variable _WAY_ to the OUT= data table. This numeric variable equals the number of classification variables that were used to calculate each observation, e.g. for a three way interaction _WAY_ will equal 3.
   Obs   RACE   _LEVEL_   _FREQ_    meanHT
     1             1        75     67.5200
     2             1        11     71.3636
     3             2        18     66.8889
     4             3         4     70.0000
     5             4        11     64.1818
     6             5         7     65.2857
     7             6        10     70.4000
     8             7        10     65.2000
     9             8         4     69.0000
    10    1        1        41     68.4390
    11    2        2        17     67.6471
    12    3        3         9     64.8889
    13    4        4         4     64.5000
    14    5        5         4     66.5000
    15    1        1        11     71.3636
    16    1        2        15     67.0667
    17    1        3         4     70.0000
    18    1        4         5     64.2000
    19    1        5         2     71.0000
    20    1        6         2     63.0000
    21    1        7         2     73.0000
    22    2        8         3     66.0000
    23    2        9         6     71.0000
    24    2       10         8     65.7500
    25    3       11         7     64.0000
    26    3       12         2     68.0000
    27    4       13         4     64.5000
    28    5       14         2     68.0000
    29    5       15         2     65.0000
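A minimal sketch of the LEVELS and WAYS options on the OUTPUT statement follows. It uses SASHELP.CLASS rather than the paper's data, and the data set name LVLWAYS is made up for this illustration.

```sas
/* Sketch only: add _LEVEL_ and _WAY_ to the summary data set. */
proc summary data=sashelp.class;
   class age sex;
   var height;
   /* LEVELS adds _LEVEL_, WAYS adds _WAY_; both follow the slash */
   output out=lvlways mean=meanHT / levels ways;
run;

proc print data=lvlways;
   var age sex _way_ _type_ _level_ _freq_ meanHT;
run;
```

In the listing _LEVEL_ should restart at 1 each time _TYPE_ changes, and _WAY_ should be 0, 1, or 2 depending on how many classification variables define the row.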
Controlling Summary Subsets Using WAYS
The WAYS statement can be used to specify a list of combinations of class variables, which are to be displayed. Combinations of the WAYS statement for three classification variables include the following summarizations:
   ways 0;     across all class variables
   ways 1;     each classification variable (no cross products)
   ways 2;     each two way combination of the classification variables
   ways 3;     three way combination; for three classification variables this is the same as using the NWAY option
   ways 0,3;   lists of numbers are acceptable
When the number of classification variables becomes large the WAYS statement can utilize an incremental list.
   ways 0 to 9 by 3;
In the following example, the main effect summaries (_TYPE_ = 1, 2, and 4) are not even calculated.
   title1 'Using the WAYS Statement';
   proc summary data=advrpt.demog;
      class race edu symp;
      var ht;
      ways 0,2;
      output out=stats mean= meanHT;
   run;
Using the WAYS Statement
   Obs   RACE   EDU   SYMP   _TYPE_   _FREQ_
    1             .            0        64
    2            10    04      3         6
    3            10    10      3         3
    4            12    02      3         2
Controlling Summary Subsets Using TYPES
The TYPES statement can be used to select and limit the data roll up summaries. The TYPES statement eliminates much of your need to understand the automatic variable _TYPE_. The TYPES statement is used to list those combinations of the classification variables that are desired. Like the WAYS statement, it can also be used to limit the number of calculations that need to be performed.
   title1 'Using the TYPES Statement';
   proc summary data=advrpt.demog;
      class race edu symp;
      var ht;
      types edu race*symp;
      output out=stats mean= meanHT;
   run;
Using the TYPES Statement
   Obs    meanHT
     1   72.3333
     2   66.2667
     3   68.0000
     4   64.0000
     5   65.2857
     6   70.4000
     7   65.2000
     8   65.0000
     9   71.0000
    10   66.5000
    11   68.0000
For the following CLASS statement
   class race edu symp;
variations of the TYPES statement could include:
   types ();
   types race*edu edu*symp;
   types race*(edu symp);
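The variations listed above can be tried on a sample data set. The sketch below assumes SASHELP.CARS, with ORIGIN, TYPE, and DRIVETRAIN standing in for the three classification variables; the output data set names are made up for this illustration.

```sas
/* Sketch only: overall summary plus one two-way combination. */
proc summary data=sashelp.cars;
   class origin type drivetrain;
   var msrp;
   types () origin*type;
   output out=types_a mean=meanMSRP;
run;

/* The parenthesized form is shorthand for origin*type origin*drivetrain. */
proc summary data=sashelp.cars;
   class origin type drivetrain;
   var msrp;
   types origin*(type drivetrain);
   output out=types_b mean=meanMSRP;
run;

/* Check which values of _TYPE_ were actually generated. */
proc freq data=types_b;
   tables _type_;
run;
```

With ORIGIN as the left most of three CLASS variables, the second step is expected to produce only _TYPE_ values 5 and 6, following the binary scheme described earlier.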
   title1 'Using the CLASSDATA and EXCLUSIVE Options';
   data selectlevels(keep=race edu symp);
      set advrpt.demog(where=(race in('1','4')
                            & 12 le edu le 15
                            & symp in('01','02','03')));
      output;
      * For fun add some nonexistent levels;
      if _n_=1 then do;
         edu=0;
         race='0';
         symp='00';
         output;
      end;
   run;
   proc summary data=advrpt.demog
                classdata=selectlevels exclusive;
      class race edu symp;
      var ht;
      output out=stats mean= meanHT;
   run;
Using the CLASSDATA and EXCLUSIVE Options
   Obs   RACE   EDU   SYMP   _TYPE_   _FREQ_
    1             .            0        8
    2             .    00      1        0
    3             .    01      1        2
    4             .    02      1        4
    5             .    03      1        2
    6             0            2        0
    7            12            2        4
    8            14            2        2
The summary lines for observations 2 and 6 represent levels of the classification variables that do not appear in the data. They were generated through a combination of the CLASSDATA= data set and the EXCLUSIVE option.
                completetypes;
      class race edu symp;
      var ht;
      output out=stats mean= meanHT;
   run;
In the data there are no observations with both EDU=12 and SYMP='01'; however, since both levels exist somewhere in the data, the COMPLETETYPES option causes the combination to appear in the summary data set (obs=8).
Using the COMPLETETYPES Option
   Obs   RACE   EDU   SYMP   _TYPE_   _FREQ_   meanHT
    1             .            0        8       66.25
    2             .    01      1        2       64.00
    3             .    02      1        4       66.50
    4             .    03      1        2       68.00
    5            12            2        4       67.50
    6            14            2        2       64.00
    7            15            2        2       66.00
    8            12    01      3        0         .
    9            12    02      3        2       67.00
Using MAXID
   Obs   RACE   EDU   _TYPE_   _FREQ_   maxHt   maxWT   maxHtSubject   MaxWtSubject
    1             .     0        75       74     240        110            137
    2            10     1        11       74     215        110            109
    3            12     1        18       70     240        106            137
    4            13     1         4       72     215        148            117
The MAX statistic is superfluous in this example, and is included only for your reference.
We are asking for the maximum of WT. IDGROUP is also available for MIN, therefore in this example we could have also specified:
   idgroup(min(ht) out[3] (ht subject race)=minht minsub minrace)
The top 2 values are to be shown.
This is a list of variables that will be shown as observation identifiers. The analysis variable is usually included.
The MAX statistic has also been requested for comparison purposes; however, it will only provide one value and not the next highest.
You can choose the prefix of the ID variable or you can let the procedure do it for you. In either case, a number is appended to the variable name.
In this example we can see that the second heaviest subject in the study was subject 137 with a weight of 215 pounds and a RACE of 1.
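The notes above can be put together in a hedged sketch of the IDGROUP option. It is not the paper's original step; it assumes SASHELP.CLASS, and the data set name TOP2WT and the prefixes TOPWT and TOPNAME are made up for this illustration.

```sas
/* Sketch only: keep the two heaviest students within each level of SEX. */
proc summary data=sashelp.class nway;
   class sex;
   var weight;
   output out=top2wt
          max=maxwt                     /* single maximum, for comparison   */
          idgroup(max(weight)           /* rank on the maximum of WEIGHT    */
                  out[2]                /* keep the top 2 values            */
                  (weight name)         /* identifying variables to carry   */
                  =topwt topname);      /* prefixes; _1 and _2 are appended */
run;

proc print data=top2wt;
run;
```

The data set should contain TOPWT_1, TOPWT_2, TOPNAME_1, and TOPNAME_2, with TOPWT_1 equal to MAXWT on every row.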
   title1 'Using the DESCENDING CLASS Option';
   proc summary data=advrpt.demog;
      class race/descending;
      var ht wt;
      output out=stats mean= MeanHT MeanWT;
   run;
Using the DESCENDING CLASS Option
   Obs   RACE   _TYPE_   _FREQ_    MeanHT    MeanWT
    1             0        76     67.5526   160.461
    2     5       1         4     66.5000   147.000
    3     4       1         4     64.5000   113.500
    4     3       1         9     64.8889   111.222
    5     2       1        17     67.6471   162.000
    6     1       1        42     68.4762   176.143
GROUPINTERNAL
When a classification variable is associated with a format, the format is used when forming groups.
   proc format;
      value edulevel
         0-12    = 'High School'
         13-16   = 'College'
         17-high = 'Post Graduate';
   run;
   title1 'Without Using the GROUPINTERNAL CLASS Option';
   proc summary data=advrpt.demog;
      class edu;
      var ht wt;
      output out=stats mean= MeanHT MeanWT;
      format edu edulevel.;
   run;
The resulting table will show at most three levels for EDU. To use the original data values (internal values), the GROUPINTERNAL option is added to the CLASS statement.
   class edu/groupinternal;
MISSING
When a classification variable takes on a missing value, that observation is eliminated from the analysis. If a missing value is OK or if the analyst needs to have it included in the summary, the MISSING option can be used. Most procedures that have either an implicit or explicit CLASS statement also have a MISSING option. However, when the MISSING option is used on the PROC statement it is applied to all the classification variables and this may not be acceptable. By using the MISSING option on the CLASS statement you can control which classification variables are to be handled differently. In the following example there are three classification variables. However, the MISSING option has only been applied to two of them.
   title1 'Using the MISSING CLASS Option';
   proc means data=advrpt.demog n mean std;
      class race;
      class edu symp/ missing;
      var ht wt;
   run;
ORDER
When classification variables are displayed or written to a table the values are ordered according to one of several possible schemes. These include:
   data          order is based on the order of the incoming data
   formatted     values are formatted and then ordered (default when the variable is formatted)
   freq          the order is based on the frequency of the class level
   unformatted   same as INTERNAL or GROUPINTERNAL
Using the order=freq option on the CLASS statement causes the table to be ordered according to the most common levels of education.
   class edu/order=freq;
Using the ORDER CLASS Option
The MEANS Procedure
   years of
   education   N Obs   Variable   Label                N          Mean       Std Dev
      12         19    HT         height in inches    19    66.9473684     2.7582942
                       WT         weight in pounds    19   171.5263158    32.2703311
      14         11    HT         height in inches    11    64.1818182     0.4045199
                       WT         weight in pounds    11   108.0909091     4.3921417
      10         11    HT         height in inches    11    71.3636364     3.2022719
                       WT         weight in pounds    11   194.0909091    19.0811663
      17         10    HT         height in inches    10    65.2000000     2.3475756
                       WT         weight in pounds    10   145.2000000    25.0900600
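The effect of GROUPINTERNAL, described earlier among the CLASS statement options, can be seen by running the same step with and without it. The sketch below assumes SASHELP.CLASS; the format AGEGRP and the data set names BYFORMAT and BYINTERNAL are made up for this illustration.

```sas
/* Sketch only: a hypothetical format that collapses AGE into two groups. */
proc format;
   value agegrp
      low-12  = 'Preteen'
      13-high = 'Teen';
run;

/* Groups are formed from the formatted values (two levels of AGE). */
proc summary data=sashelp.class nway;
   class age;
   var height;
   output out=byformat mean=meanHT;
   format age agegrp.;
run;

/* GROUPINTERNAL forms groups from the stored values (one per age). */
proc summary data=sashelp.class nway;
   class age / groupinternal;
   var height;
   output out=byinternal mean=meanHT;
   format age agegrp.;
run;
```

Since the ages in SASHELP.CLASS run from 11 through 16, BYFORMAT should have two rows and BYINTERNAL six, even though both steps carry the same FORMAT statement.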
PRELOADED FORMATS
Several options and techniques are available to control which levels of classification variables are to appear in the summary. Those that were discussed earlier in this paper include the CLASSDATA and COMPLETETYPES options. Also discussed were the WAYS and TYPES statements, as well as the WAYS and LEVELS options on the OUTPUT statement. A related set of options comes under the general topic of Preloaded Formats. Variations of these options are available for most of the procedures that utilize classification variables. Like the others listed above, these techniques/options are used to control the relationship of levels of classification variables that may not appear in the data and how those levels are to appear (or not appear) in the summary. Generally speaking, when a level of a classification variable is not included in the data, the associated row will not appear in the table. This behavior relative to the missing levels can be controlled through the use of preloaded formats. For the MEANS/SUMMARY procedures, options used to preload formats include:
   PRELOADFMT      Loads the format levels prior to execution. This option will always be present when you want to use a preloaded format.
   EXCLUSIVE       Only data levels that are included in the format definition are to appear in the summary table
   COMPLETETYPES   All levels representing format levels are to appear in the summary
It is the interaction of these three options that gives us a wide range of possible outcomes. In each case the option PRELOADFMT will be present. As the name of the technique implies, the control is maintained through the use of user defined formats. For the examples that follow, the format $SYMPX has been created, and it contains one level, '00', that is not in the data. In the data the values of SYMP range from '01' to '10'.
   proc format;
      value $sympx
         '01' = 'Sleepiness'
         '02' = 'Coughing'
         '00' = 'Bad Code';
   run;
   Obs   SYMP         _TYPE_   _FREQ_
    1                   0        14
    2    Sleepiness     1         4
    3    Coughing       1        10
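How the three options interact can be explored with a small sketch. It is an illustration under stated assumptions, not the paper's own step: it uses SASHELP.CLASS, and the format $SEXFMT (with a level 'U' that is not in the data) and the output data set names are made up.

```sas
/* Sketch only: a hypothetical format with one level not found in the data. */
proc format;
   value $sexfmt
      'F' = 'Female'
      'M' = 'Male'
      'U' = 'Unknown';
run;

/* PRELOADFMT with COMPLETETYPES: levels defined in the format should  */
/* appear in the summary even when they have no observations.          */
proc summary data=sashelp.class completetypes;
   class sex / preloadfmt;
   var height;
   output out=preload_all mean=meanHT;
   format sex $sexfmt.;
run;

/* PRELOADFMT with EXCLUSIVE: only data levels that are also in the    */
/* format definition should appear.                                    */
proc summary data=sashelp.class;
   class sex / preloadfmt exclusive;
   var height;
   output out=preload_excl mean=meanHT;
   format sex $sexfmt.;
run;

proc print data=preload_all;
run;
proc print data=preload_excl;
run;
```

Comparing the two listings is a quick way to confirm how an unused format level such as 'Unknown' is handled before applying the options to production data.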
SUMMARY
The MEANS/SUMMARY procedure produces a wide variety of summary reports and summary data tables. It is very flexible and, while it can be quite complex, a few basic statements allow the user to create useful summaries. As you develop a deeper knowledge of the MEANS/SUMMARY procedure, you will find that the generation of highly sophisticated summarizations is possible from within a single step.
AUTHOR CONTACT
Arthur L. Carpenter California Occidental Consultants 10606 Ketch Circle Anchorage, AK 99515 (907) 865-9167 art@caloxy.com www.caloxy.com
TRADEMARK INFORMATION
SAS, SAS Certified Professional, SAS Certified Advanced Programmer, and all other SAS Institute Inc. product or service names are registered trademarks of SAS Institute, Inc. in the USA and other countries. ® indicates USA registration.