SAS Routines Guide
Contents
Label variable
Report
Define
Column input
formatted input
Proc report
Computed variable
Summary report
Print by
COPYING DATASET
SUBSTRING
DATE FUNCTION
SORT
REMOVING DUPLICATE
PLOT
PROC SQL
SAMPLING
MEAN CALCULATION
QUANTILES
Weight statement
order
Correlation
Regression
logistic regression
test stationarity
SAS QUESTIONS
1 Label variable
Label varname = 'label name';
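As a minimal sketch of the LABEL statement, the step below attaches labels to two variables of SASHELP.CLASS (a sample data set shipped with SAS). The output data set name class_labeled and the label text are illustrative assumptions, not part of the original guide.

```sas
/* Assumption: SASHELP.CLASS is available (it ships with SAS). */
data class_labeled;
   set sashelp.class;
   label height = 'Height (inches)'
         weight = 'Weight (pounds)';
run;

/* The LABEL option makes PROC PRINT show labels as column headings. */
proc print data=class_labeled label;
   var name height weight;
run;
```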
2 Report
Proc report data = dataset name;
Column age weight; (prints age and weight in columns)
3 Define
Assign formats to variables
Specify column headings and width
Proc report data = dataset <options>;
Define variable/ <usage><attributes><options><justification><Columnheading>;
Run;
The LENGTH statement tells SAS that the variable Gender is character (the
dollar sign indicates this) and that you want to store Gender in 1 byte (the 1
indicates this). The INPUT statement lists the variable names in the same order
as the values in the text file. Because you already told SAS that Gender is a
character variable, the dollar sign following the name Gender on the INPUT
statement is not necessary. If you had not included a LENGTH statement, the
dollar sign following Gender on the INPUT statement would have been necessary.
SAS assumes variables are numeric unless you tell it otherwise.
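The program this paragraph describes is not included in the guide. A minimal sketch of what it might look like follows; the data set name demographics and the in-stream data lines are hypothetical, and DATALINES stands in for the external text file.

```sas
/* Hypothetical in-stream data: ID, age and gender separated by blanks. */
data demographics;
   length Gender $ 1;       /* declares Gender as character, stored in 1 byte */
   input ID $ Age Gender;   /* no $ needed after Gender; Age defaults to numeric */
   datalines;
001 34 M
002 29 F
003 41 M
;
run;
```

Because the LENGTH statement comes first, Gender becomes the first variable in the data set.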
Let's say we have a data set of student scores and want to conduct a paired t-test on writing score
and math score for each program type. For some reason, we want to save the t-values and p-values
to a data set for later use. Without ODS, this would not be easy, since proc ttest does not have
an output statement. With ODS it takes only one more line of code.
We will sort the data set first by variable prog and use the statement ods output Ttests=ttest_output
to create a temporary data set called ttest_output containing the t-values and p-values
together with the degrees of freedom for each t-test conducted.
proc sort data=hsb25;
by prog;
proc ttest data=hsb25;
by prog;
paired write*math;
ods output Ttests=ttest_output;
run;
proc print data=ttest_output;
run;
The SAS System

Obs  prog  Variable1  Variable2  Difference    tValue  DF   Probt
  1        write      math       write - math   -1.57  14  0.1389
  2        write      math       write - math    0.66      0.5475
  3        write      math       write - math   -2.37      0.0766
For each SAS procedure, SAS produces a group of ODS output objects. In the example above,
Ttests is the name of one such object associated with proc ttest. To find out which objects are
associated with a particular proc, we put an ods trace on statement right before the proc and
turn the trace off right after it. Let's look at another example using proc reg. The option
listing with ods trace on displays the information about each object along with the corresponding
output. Below we see three objects (data sets in this case) associated with proc reg when no
extra options are used. The ANOVA part of the output is stored in a data set called ANOVA. The
parameter estimates are stored in ParameterEstimates. Each object has a name, a label and a
path along with its template. Once we know the name or the label of the object, we can use the
ods output statement to write it to a dataset, as shown in the example above.
ods trace on /listing;
proc reg data=hsb25;
model write = female math;
run;
quit;
ods trace off;
The REG Procedure
Model: MODEL1
Dependent Variable: write

Output Added:
-------------
Name:       ANOVA
Label:      Analysis of Variance
Template:   Stat.REG.ANOVA
Path:       Reg.MODEL1.Fit.write.ANOVA
-------------

Analysis of Variance

                            Sum of         Mean
Source             DF      Squares       Square   F Value   Pr > F
Model               2   2154.11191   1077.05596     19.39   <.0001
Error              22   1222.04809     55.54764
Corrected Total    24   3376.16000

Output Added:
-------------
Name:       FitStatistics
Label:      Fit Statistics
Template:   Stat.REG.FitStatistics
Path:       Reg.MODEL1.Fit.write.FitStatistics
-------------

Root MSE           7.45303   R-Square   0.6380
Dependent Mean    50.44000   Adj R-Sq   0.6051
Coeff Var         14.77603

Output Added:
-------------
Name:       ParameterEstimates
Label:      Parameter Estimates
Template:   Stat.REG.ParameterEstimates
Path:       Reg.MODEL1.Fit.write.ParameterEstimates
-------------

Parameter Estimates

                  Parameter    Standard
Variable    DF     Estimate       Error   t Value   Pr > |t|
Intercept    1      7.07533     7.56161      0.94     0.3596
female       1      5.95697     3.07209      1.94     0.0654
math         1      0.76991     0.14323      5.38     <.0001
Along with the name of an object, we also see the label for the object. We can use the label to
create a data set, just as we would with the name.
ods output "Parameter Estimates"=parest;
proc reg data=hsb25;
model write = female math;
run;
quit;
ods output close;
proc print data=parest;
run;
Obs  Model   Dependent  Variable   DF  Estimate   StdErr  tValue   Probt
  1  MODEL1  write      Intercept   1   7.07533  7.56161    0.94  0.3596
  2  MODEL1  write      female      1   5.95697  3.07209    1.94  0.0654
  3  MODEL1  write      math        1   0.76991  0.14323    5.38  <.0001
Since we can save our output from a proc to a dataset using ODS, we sometimes want to turn the
listing output off. We can NOT use the noprint option, since ODS requires an output object.
Instead we use an ODS statement, as shown in the example below. This makes sense because
listing output is just a form of ODS output. The statement ods listing close stops output from
appearing in the output window. After the proc reg, we turn the listing output back on so that
output will appear in the output window again.
ods listing close;
ods output "Parameter Estimates"=parest;
proc reg data=hsb25;
model write = female math;
run;
quit;
ods output close;
ods listing;
Let's say that we want to write the output of our proc reg to an HTML file. This can be done
very easily using ODS. First we specify the file name we are going to use. Then we point the ods
html output to it. At the end we close the ods html output to finish writing to the HTML file. You
can view procreg.html created by the following code.
filename myhtml "c:\examples\procreg.html";
ods html body=myhtml;
proc reg data=hsb25;
model write= female math;
run;
quit;
ods html close;
8 Column input
If you have ID data in columns 1-3, Age in columns 4-6, and Gender in
column 7 of your raw data file, your input statement might look like this:
input ID $ 1-3 Age 4-6 Gender $ 7;
9 formatted input
input @1 ID $3.
@4 Age 3.
@7 Gender $1.;
The informat $3. tells SAS to read three columns of character data; the 3.
informat says to read three columns of numeric data; the $1. informat says to
read one column of character data. The two informats n. and $n. are used to
read n columns of numeric and character data, respectively.
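To see the formatted INPUT statement in context, here is a minimal sketch of a complete DATA step. The data set name people and the two data lines are hypothetical; they are laid out so that ID occupies columns 1-3, Age columns 4-6, and Gender column 7.

```sas
data people;
   input @1 ID     $3.    /* columns 1-3, character */
         @4 Age     3.    /* columns 4-6, numeric   */
         @7 Gender $1.;   /* column 7, character    */
   datalines;
001 34M
002 29F
;
run;
```

The same lines could be read with the column input statement shown earlier, since both describe the same column positions.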
11 Proc report
Usage options in the DEFINE statement: display, order, group, across, analysis, or computed
13 Computed variable
A computed variable is not part of the input data set; it is created within the report
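A hedged sketch of a computed variable, using SASHELP.CLASS as a stand-in data set. The variable name bmi and its formula are illustrative assumptions. Height and weight are defined as DISPLAY so that the COMPUTE block can refer to them by name; with the default ANALYSIS usage they would have to be referenced as height.sum and weight.sum.

```sas
proc report data=sashelp.class nowd;
   column name height weight bmi;
   define name   / display;
   define height / display;
   define weight / display;
   define bmi    / computed format=6.1 'BMI';
   compute bmi;
      /* 703 converts pounds per square inch to kg per square metre */
      bmi = 703 * weight / (height * height);
   endcomp;
run;
```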
14 Summary report
Define flight / group 'Flight/Number' width = 6 center;
run;
17 Print by
Proc print data = order_finance;
Var payment_gateway payment_mode;
By payment_mode;
Run;
The data set is not mandatory in proc print. If no data set is given, proc print
prints the most recently created data set. The data must already be sorted by
the BY variable.
Example =
4500 2000
5000 2300
7890 2810
8900 5400
2300 2000
;
run;
GETNAMES=YES;
MIXED=NO;
SCANTEXT=YES;
USEDATE=YES;
SCANTIME=YES;
RUN;
23 COPYING DATASET
data MYDATA.sat_exam_copy;
set MYDATA.sat_exam;
run;
run;
data new_data;
set old_data(Drop=Var5 Var6 Var7);
<rest of the statements>
run;
27 SUBSTRING
New_variable = SUBSTR(variable, start position, number of characters);
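A short illustration of SUBSTR; the variable names and the phone value are hypothetical.

```sas
data _null_;
   phone = '415-555-0123';            /* hypothetical value */
   area_code = substr(phone, 1, 3);   /* start at character 1, take 3 characters */
   put area_code=;                    /* writes area_code=415 to the log */
run;
```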
29 DATE FUNCTION
Duration_days=INTCK('day',start_date,end_date); /* Finds the duration in days */
Duration_months=INTCK('month',start_date,end_date); /* Finds the duration in months */
Duration_weeks=INTCK('week',start_date,end_date); /* Finds the duration in weeks */
MONTH(date) returns the month (1-12) of a SAS date value
YEAR(date) returns the year of a SAS date value
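One point worth seeing in a small example: by default INTCK counts interval boundaries crossed, not elapsed time, so two dates one day apart can be one month apart. The dates below are hypothetical.

```sas
data _null_;
   start_date = '31JAN2020'd;
   end_date   = '01FEB2020'd;
   days   = intck('day',   start_date, end_date);   /* 1 */
   months = intck('month', start_date, end_date);   /* 1: one month boundary crossed */
   m = month(end_date);                             /* 2 */
   y = year(end_date);                              /* 2020 */
   put days= months= m= y=;
run;
```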
31 SORT
proc sort data=<dataset>;
by <variable>;
run;
32 REMOVING DUPLICATE
proc sort data=MYDATA.bill out=mydata.bill_wod nodup;
by cust_id ;
run;
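NODUP (an alias of NODUPRECS) drops a record only when every variable matches the record before it in the sorted output. If the goal is one record per cust_id, NODUPKEY compares the BY variables only. A sketch, where the output name bill_unique is hypothetical:

```sas
/* Assumption: one row per cust_id is wanted, whichever record sorts first. */
proc sort data=MYDATA.bill out=mydata.bill_unique nodupkey;
   by cust_id;
run;
```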
35 PLOT
proc gplot data= <data set>;
plot y*x;
run;
symbol i=none;
proc gplot data= market_asset;
plot reach*budget;
run;
proc gchart data= market_asset;
vbar category;
Run;
/* 3D bar chart */
proc gchart data= market_asset;
vbar3d category ;
Run;
/* 3D pie chart */
proc gchart data= market_asset;
pie3d category ;
Run;
We get:
Median
  of 1
------
   1.1
     6
   7.7
You were probably expecting to see just a 6, the median of the values in column A.
Instead we have, for each row, the trivial median of a single value of A. In other words,
the processing was horizontal rather than vertical.
The explanation: vertical calculation of medians is not supported in PROC SQL (though
it is in PROC SUMMARY). Thus, there is no ambiguity. In SQL, the only valid
interpretation of MEDIAN with a single argument is that it is a SAS function call, to be
computed horizontally.
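For the vertical median the text expects, PROC MEANS (or PROC SUMMARY) can be used instead. A minimal sketch, assuming a data set t with one numeric column a holding the three values shown above; both names are hypothetical.

```sas
data t;
   input a;
   datalines;
1.1
6
7.7
;
run;

/* MEDIAN here is a statistic keyword, computed down the column. */
proc means data=t median;
   var a;
run;
```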
37 PROC SQL
proc sql;
create table buss_fin /* This is the new dataset */
as select *
from market_asset
where Category= 'Business/Finance';
Quit;
proc sql;
select Category, sum(budget) as total_budget
from market_asset
group by Category;
Quit;
data final;
Merge data1(in=a) data2(in=b);
by var;
if a;
run;
data final1;
Merge data1(in=a) data2(in=b);
by var;
if b;
run;
data final2;
Merge data1(in=a) data2(in=b);
by var;
if a and b;
run;
40 SAMPLING
proc surveyselect data = sashelp.prdsale
method = SRS
rep = 1
sampsize = 30 seed = 12345 out = prod_sample_30;
id _all_;
run;
42 MEAN CALCULATION
Proc means data=online_sales mean;
var listPrice;
class brand;
run;
What if you want to see the grand mean, as well as the means broken down by
Drug, all in one listing? The PROC MEANS option PRINTALLTYPES does this for you
when you include a CLASS statement. Here is the modified program.
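The program with the Drug variable is not included in this guide. As a sketch of the same idea, PRINTALLTYPES can be added to the online_sales example above; treating brand as the CLASS variable in place of Drug is an assumption.

```sas
/* Prints the overall mean of listPrice, then the mean for each brand. */
proc means data=online_sales mean printalltypes;
   class brand;
   var listPrice;
run;
```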
Option Description
N Number of nonmissing observations
NMISS Number of observations with missing values
MEAN Arithmetic mean
45 QUANTILES
Proc univariate data= online_sales ;
var listPrice ;
run;
id Subj; The ID statement is not necessary, but it is particularly useful with PROC UNIVARIATE
a mean and standard deviation. Specify these by using the keyword MU= to specify
the mean and the keyword SIGMA= to specify a standard deviation. The keyword EST
tells the procedure to use the data values to estimate the mean and standard
deviation, instead of some theoretical value.
Notice the slash between the word PROBPLOT and NORMAL. Using a slash here
follows standard SAS syntax: if you want to specify options for any statement in
a PROC step, follow the statement keyword with a slash.
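The PROBPLOT statement being described is not shown in this guide. A sketch of the syntax, applied to the online_sales example (the choice of data set and variable is an assumption):

```sas
proc univariate data=online_sales;
   var listPrice;
   /* Options follow the slash; MU=EST and SIGMA=EST estimate the
      normal parameters from the data. */
   probplot listPrice / normal(mu=est sigma=est);
run;
```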
/* BOX PLOT*/
Proc univariate data= health_claim plot;
var Claim_amount ;
run;
table SeriousDlqin2yrs;
run;
test f 40
test u 20
;
proc freq;
weight count;
tables treat*outcome;
run;
49 Weight statement
The WEIGHT statement is necessary to tell the procedure that the data are count
data, or frequency data; the variable listed in the WEIGHT statement contains the
values of the count variable.
If the data are stored in record form (one record per observation), the WEIGHT
statement is not required. For example:
data respire;
input treat $ outcome $ @@;
datalines;
placebo f placebo f placebo f
placebo f placebo f
placebo u placebo u placebo u
placebo u placebo u placebo u
placebo u placebo u placebo u
placebo u
test f test f test f
test f test f test f
test f test f
test u test u test u
;
proc freq;
tables treat*outcome;
run;
50 order
ORDER=DATA specifies that the sort order is the same order in which the values
are encountered in the data set. Thus, since marked comes first, it is first in
the sort order. Since some is the second value for IMPROVE encountered in the
data set, it is second in the sort order. And none would be third in the sort
order. This is the desired sort order. The following PROC FREQ statements
produce a table displaying the sort order resulting from the ORDER=DATA option:
proc freq order=data;
weight count;
tables treatment*improve;
run;
The NOCOL and NOPCT options suppress the printing of column percentages and cell
percentages, respectively.
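These options go after a slash on the TABLES statement. A sketch based on the PROC FREQ step above, which reads the most recently created data set:

```sas
proc freq order=data;
   weight count;
   /* Keep frequencies and row percentages; drop column and cell percentages. */
   tables treatment*improve / nocol nopct;
run;
```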
52 Correlation
proc corr data=add_budget ;
var Online_Budget Responses_online ;
run;
53 Regression
/* Predicting SAT score using the other four variables: General_knowledge,
Aptitude, Mathematics, and Science */
proc reg data=sat_score;
model SAT=General_knowledge Aptitude Mathematics Science;
run;
54 logistic regression
Proc logistic data=ice_cream_sales;
model buy_ind=age;
run;
55 test stationarity
proc arima data= ms;
identify var= stock_price stationarity=(DICKEY);
run;
Software was used to create the code and output, but everything presented in this paper is available in Release 8.0
and higher of the SAS System.
The first data set, ELEC_ANNUAL, contains about 16,300 customer-level observations (rows) with information about
how much electricity they consumed in a year, the rate schedule on which they were billed for the electricity, the total
revenue billed for that energy and the geographic region in which they live. The variables in the data set are:
PREMISE Premise Number [Unique identifier for customer meter]
TOTKWH Total Kilowatt Hours [KwH is the basic unit of electricity consumption]
TOTREV Total Revenue [Amount billed for the KwH consumed]
TOTHRS Total Hours [Total Hours Service in Calendar Year]
RATE_SCHEDULE Rate Schedule [Table of Rates for Electric Consumption Usage]
REGION Geographic Region [Area in which customer lives]
The second data set, CARD_TRANS2, contains about 1.35 million observations (rows), each representing one
(simulated) credit card transaction. The variables in the data set are:
CARDNUMBER Credit Card Number
CARDTYPE Credit Card Type [Visa, MasterCard, etc.]
CHARGE_AMOUNT Transaction Amount (in dollars/cents)
TRANS_DATE Transaction Date [SAS Date Variable]
TRANS_TYPE Transaction Type [1=Electronic 2=Manual]
Since TOTKWH, TOTREV and TOTHRS are all numeric variables, PROC MEANS calculated the five default
statistical measures on them and placed the results in the Output Window.
need all of the five statistical analyses that PROC MEANS will perform automatically. And, we may want to round the
values to a more useful number of decimal places than what PROC MEANS will do for us automatically.
Again using the ELEC_ANNUAL data set, here is how we can take more control over what PROC MEANS will do for
us. Suppose we just want the SUM and MEAN of TOTREV, rounded to two decimal places. The following PROC
MEANS task gets us just what we want.
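The Step 2 listing itself is not included in this copy. A minimal sketch consistent with the description, assuming the same SUGI libref that Step 3 uses:

```sas
/* SUM and MEAN only, rounded to two decimal places, for TOTREV. */
PROC MEANS DATA=SUGI.ELEC_ANNUAL SUM MEAN MAXDEC=2;
   VAR TOTREV;
run;
```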
A box has been drawn around the important features presented in Step 2. First, the SUM and MEAN statistics
keywords were specified, which instructs PROC MEANS to perform just those analyses. Second, the MAXDEC
option was used to round the results in the Output Window to just two decimal places. (If we had wanted the
analyses rounded to the nearest whole number, then MAXDEC = 0 would have been specified.) Finally, the VAR
Statement was added, giving the name of the variable for which the analyses were desired. You can put as many
(numeric) variables as you need/want into one VAR Statement in your PROC MEANS task.
The Output Window displays:
Suppose the observations in ELEC_ANNUAL are a random sample from a larger population of utility customers. We
might therefore want to obtain, say, a 95 percent confidence interval around the mean total KwH consumption and
around the mean billed revenue, along with the mean and median. From the above table, you can see that the
MEAN, MEDIAN and CLM statistics keywords will generate the desired analyses. The PROC MEANS task below
generates the desired analyses. The task also includes a LABEL Statement, which adds additional information about
the variables in the Output Window.
* Step 3: Selecting Statistics;
PROC MEANS DATA=SUGI.ELEC_ANNUAL
MEDIAN MEAN CLM MAXDEC=0;
Label TOTREV = 'Total Billed Revenue'
TOTKWH = 'Total KwH Consumption';
VAR TOTREV TOTKWH;
title3 'Step 3: Selecting Statistics';
run;
The output generated is:
By specifying REGION in the CLASS Statement, we now have the MEAN and SUM of TOTREV and TOTKWH for
each unique value of region. We also have a column called N Obs, which is worthy of further discussion. By
default, PROC MEANS shows the number of observations for each value of the classification variable. So, we can
see that there are, for example, 5,061 observations in the data set from the WESTERN Region.
How does PROC MEANS handle missing values of classification variables? Suppose there were some observations
in ELEC_ANNUAL with missing values for REGION. By default, those observations would not be included in the
analyses generated by PROC MEANS. But we have an option in PROC MEANS that we can use to include
observations with missing values of the classification variables in our analysis. This option is shown in Step 5.
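The Step 5 listing is missing from this copy. A sketch, assuming the option meant is MISSING, which makes PROC MEANS treat a missing CLASS value as a valid level:

```sas
/* Observations with a missing REGION form their own group in the output. */
PROC MEANS DATA=SUGI.ELEC_ANNUAL SUM MEAN MAXDEC=2 MISSING;
   CLASS REGION;
   VAR TOTREV TOTKWH;
run;
```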
*-------------------------------------------------------------------------;
* Use the PROC SURVEYMEANS procedure in SAS to compute a properly weighted;
* estimated ratio of means for all persons ages 20+ and by gender.
;
*-------------------------------------------------------------------------;
* Run analysis for overall subpopulation of interest;
proc surveymeans data=DTTOT;
where usedat=1 ;
strata SDMVSTRA;
cluster SDMVPSU;
weight WTDRD1;
var D1MCALC DR1TCALC;
ratio D1MCALC / DR1TCALC;
title " Ratio of Means -- All Persons ages 20+" ;
run ;
*-------------------------------------------------------------------------;
* Use the PROC SORT procedure to sort the data by gender.
*-------------------------------------------------------------------------;
Output of Program

Ratio Analysis

Numerator   Denominator      N      Ratio    Std Err   95% Confidence Interval
------------------------------------------------------------------------------
d1mcalc     DR1TCALC      4448   0.114940   0.006826     0.100390     0.129490
------------------------------------------------------------------------------

Ratios of Means -- by Gender

Gender - Adjudicated=male

Data Summary

Number of Strata                 15
Number of Clusters               30
Number of Observations         2135
Sum of Weights           98664010.2

Statistics

                                                  Std Error     Lower 95%     Upper 95%
Variable   Label             N          Mean        of Mean   CL for Mean   CL for Mean
---------------------------------------------------------------------------------------
d1mcalc    Calcium (mg)   2135    122.142347       8.719800    103.556533    140.728162
DR1TCALC   Calcium (mg)   2135    998.359501      21.809584    951.873474   1044.845528
---------------------------------------------------------------------------------------

Ratio Analysis

Numerator   Denominator      N      Ratio    Std Err   95% Confidence Interval
------------------------------------------------------------------------------
d1mcalc     DR1TCALC      2135   0.122343   0.007148     0.107107     0.137579
------------------------------------------------------------------------------

Gender - Adjudicated=female

Data Summary

Number of Strata                 15
Number of Clusters               30
Number of Observations         2313
Sum of Weights            106620659

Statistics

                                                  Std Error     Lower 95%     Upper 95%
Variable   Label             N          Mean        of Mean   CL for Mean   CL for Mean
---------------------------------------------------------------------------------------
d1mcalc    Calcium (mg)   2313     81.747649       9.880726     60.687380    102.807918
DR1TCALC   Calcium (mg)   2313    770.725113      15.292108    738.130756    803.319469
---------------------------------------------------------------------------------------

Ratio Analysis

Numerator   Denominator      N      Ratio    Std Err   95% Confidence Interval
------------------------------------------------------------------------------
d1mcalc     DR1TCALC      2313   0.106066   0.011329     0.081919     0.130213
------------------------------------------------------------------------------
The ratio of mean calcium from milk to total calcium, for all persons ages 20
and older, is 0.11 (with a standard error of 0.01). The corresponding values
for males and females, respectively, are 0.12 (0.01) and 0.11 (0.01).
Note that, even though this analysis did not incorporate a domain statement,
the results are exactly equal to those obtained using SUDAAN and its
SUBPOPN statement because the subgroup of interest was one for which the
weighted NHANES sample is representative.
Then I get the output mean values for variables x1 and x2 in the two categories of X and Y,
but it is x1/x2 for each category that I am interested in, and doing it by hand is not really a
solution
You need to precompute x1/x2 or postcompute x1/x2 (depending on whether you want
mean(x1/x2) or mean(x1)/mean(x2), which can give different answers if x1 and x2 have
different numbers of responses).
So either (... means fill in what you have already)
data premean;
set have;
x1x2 = x1/x2;
run;
or
proc means ...;
class ... ;
var x1 x2;
output out=postmeans mean=;
run;
data want;
set postmeans;
x1x2=x1/x2;
run;
62 SAS QUESTIONS
1. To create a raw data file:
Use the SET statement
Format:
DATA _null_;
SET dataset;
_NULL_ allows the DATA step to be used without
creating a SAS data set
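The format above reads the data set but does not yet write anything out; in practice a FILE statement names the raw data file and a PUT statement writes each record. A minimal sketch, with SASHELP.CLASS as a stand-in data set and class.txt as a hypothetical output path:

```sas
data _null_;
   set sashelp.class;
   file 'class.txt';        /* hypothetical output path */
   put name age height;     /* one blank-separated line per observation */
run;
```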
2. Result
a.
b.
c.
3. The appearance of the output can be controlled, specifically (for SAS listings):
a. line size (the maximum width of the log and output)
b. page size (the number of lines per printed page)
c. page numbers displayed
d. date and time displayed
4. Variable length identifies the number of bytes used to store the variable.
Length is dependent on type:
a. Character variables can be up to 32,767 bytes long
b. Numeric variables have a constant default length of 8 bytes, however many
digits the value has
c. Numeric variables have a constant length because they are stored as
floating-point numbers
5. Data set has two parts: a descriptive portion and a data portion that the data
set can locate
a. The $w. informat allows character data to be read. The dollar sign
indicates character only data. The w represents the field width, or
number of columns, of the data. The period ends the informat.
b. The w.d informat allows standard numeric data to be read. The w
represents the field width, or number of columns, of the data. A
decimal point in the raw data counts as one column. The
period acts as a delimiter. The optional d specifies the number of
implied decimal places (not necessary if the value already has decimal
places).
c. The COMMAw.d will read nonstandard numeric data, removing any
embedded:
i. blanks
ii. commas
iii. dashes
iv. dollar signs
v. percent signs
vi. right parentheses
vii. left parentheses (a leading left parenthesis is converted to a minus sign).
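A small sketch of the COMMAw.d informat using the INPUT function; the literal values and variable names are hypothetical.

```sas
data _null_;
   amount = input('$1,234.50', comma10.);   /* dollar sign and comma removed */
   loss   = input('(500)', comma8.);        /* parentheses denote a negative value */
   put amount= loss=;                       /* writes amount=1234.5 loss=-500 */
run;
```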
14. Record formats of an external file define how data is read by column input and
formatted input processes. The default value of the maximum record length
is determined by the operating environment. The maximum record length can
be changed using the LRECL= option in the INFILE statement.
15. List input can read standard and nonstandard data in a free-format record
16. By default, list input does not have specified column locations, so:
a. all fields must be separated by at least one delimiter
b. fields cannot be skipped or re-read
c. the order for reading fields is from left to right
17. List input - By default, several limitations exist on the type of data that can be
read using list input:
a. Character values that are longer than eight characters will be
truncated
b. Data must be in standard numeric or character format
c. Character values cannot contain embedded delimiters
d. Missing numeric and character values must be represented by a
period or some other character
18. List input - The default length of character values is 8. Values that are
longer than 8 characters are truncated when written to the program data vector.
Using a LENGTH statement before the INPUT statement will define the length and
type of the variable.
19. List input missing values:
a. If missing values occur at the end of the record, the MISSOVER option
in the INFILE statement can be used to read them. The MISSOVER
option will prevent SAS from going to another record if values
cannot be found for every specified variable in the current line.
b. MISSOVER only works with missing values at the end of a record.
c. To begin to read missing values in the beginning or middle of the
record, the DSD option in the INFILE statement can be used.
d. DSD changes how delimiters are treated when using a list input: