Transforming SAS Data Sets
Reading Assignment:
Selected SAS Documentation for Bios111 Part 3: Transforming SAS Data Sets
Syntax: OUTPUT; or
OUTPUT SASdataset(s) ;
- An Example DATA Step:
    DATA WORK.MYCLASS;
      SET CLASSLIB.CLASS;
      OUTPUT;
      RETURN;
    RUN;
Flowchart of Execution:
    DATA WORK.MYCLASS;  ->  End of input?
        Yes: the step ends.
        No:  SET CLASSLIB.CLASS;  ->  OUTPUT;  ->  back to the end-of-input test.
Summary--Creating New SAS Data Sets

The four statements just described (DATA, SET, OUTPUT, RETURN) are used whenever we want to create a new SAS data set from an existing one. Other statements are added to the step in order to make the output data set a modified version of the input data set, rather than an exact copy.

In this chapter, we only discuss creating SAS data sets from other, already existing SAS data sets. Creating a SAS data set from a non-SAS data set (e.g., an ASCII or dBase file) is a more complex task, which will be covered in detail later in the course.

Creating a new data set does not delete or modify the input data set; it is still available for use in subsequent steps.
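As a minimal sketch of the four statements working together, the steps below first build a small stand-in input data set (WORK.CLASS and its values are hypothetical, used only so the example runs without CLASSLIB) and then copy it unchanged.

```sas
/* Hypothetical stand-in for an existing SAS data set */
DATA WORK.CLASS;
   INPUT NAME $ AGE HT WT;
   DATALINES;
ALICE 37 71 195
BOB 31 70 160
;
RUN;

/* Exact copy: DATA names the output data set, SET reads one
   observation, OUTPUT writes it, RETURN goes back to the top */
DATA WORK.MYCLASS;
   SET WORK.CLASS;
   OUTPUT;
   RETURN;
RUN;

/* The input data set WORK.CLASS is unchanged and still available */
PROC PRINT DATA=WORK.MYCLASS;
RUN;
```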
- Notes:
  - The assignment statement is one (of two) exceptions to the rule that every SAS statement begins with a keyword.
  - If "variable" is the name of an already existing variable, the value of "expression" replaces the previous value; if "variable" is a new name, the assignment statement creates a new variable, which is added to the output data set.
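The two cases in the note above can be sketched as follows. The data set, variable names, and values are hypothetical; HT is assumed to be recorded in inches.

```sas
DATA WORK.ASSIGN;
   INPUT NAME $ HT WT;
   HT = HT * 2.54;    /* HT already exists: its value is replaced */
   RATIO = WT / HT;   /* RATIO is a new name: a new variable is created */
   DATALINES;
ALICE 71 195
BOB 70 160
;
RUN;

/* The output data set contains NAME, HT (now in cm), WT, and RATIO */
PROC PRINT DATA=WORK.ASSIGN;
RUN;
```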
Expressions
- An expression consists of one or more constants, variables, and functions, combined by operators.
- A constant is a number (e.g., 1, -23.6, .001) or a character string (e.g., 'JOHN', 'MALE', 'X#!'); character constants must be enclosed in quotes, and these notes use single quotes (apostrophes). (SAS also allows other, specialized types of constants; we will discuss some of them later in the course.)
- A function is a program "built in" to SAS that performs some computation on character or numeric values.
- An operator is a mathematical, logical, or character operation or manipulation that combines, compares, or transforms numeric or character values.

Arithmetic operators:
    Symbol   Action
    +        addition
    -        subtraction
    *        multiplication
    /        division
    **       exponentiation
Comparison operators look at the relationship between two quantities.
    Symbol   Mnemonic Equivalent   Action
    =        EQ                    equal to
    ^=       NE                    not equal to
    >        GT                    greater than
    <        LT                    less than
    >=       GE                    greater than or equal to
    <=       LE                    less than or equal to
             IN                    equal to one of a list

Logical or Boolean operators are used to link sequences of comparisons.
    Symbol   Mnemonic Equivalent
    &        AND
    |        OR
    ^        NOT
- Examples:
  Assigning constants:
    N=4;                numeric constant
    SEX='FEMALE';       character constant
  Basic arithmetic operators:
    X2=X;               copy the value
    SUM=X+Y;            addition
    DIF=X-Y;            subtraction
    TWICE=X*2;          multiplication
    HALF=X/2;           division
    CUBIC=X**3;         exponentiation
    Y=-X;               change sign
  Comparison operators:
    IF X<Y THEN C=5 ;
    IF (X<Y) ;
    IF NAME='PAT' ;
    IF 5<=AGE<=20 ;
    IF AGE IN(10,20,30) THEN X=5 ;
    IF SEX IN('M','F') THEN S=1 ;
  Logical operators:
    IF A<B AND C>0 ;
    IF X=2 OR X=4 ;
    IF NOT(X=2) ;
    IF NOT(NAME='SMITH') ;
  Other operators:
    X= A><B ;
    NAME= FIRST || LAST ;
Complex expressions
Priority of evaluation (highest first):
    1.  ( )     parenthetical expressions
    2.  **      exponentiation
    3.  * /     evaluated left to right
    4.  + -     evaluated left to right
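A short sketch of these priorities, written as a DATA _NULL_ step that only prints to the log. The variable names A through D are hypothetical.

```sas
DATA _NULL_;
   A = 2 + 3 * 4;            /* multiplication first: 2 + 12 = 14 */
   B = (2 + 3) * 4;          /* parentheses first: 5 * 4 = 20 */
   C = 2 + 3 * 4 ** 2 / 8;   /* 4**2 = 16, 3*16 = 48, 48/8 = 6, 2+6 = 8 */
   D = 12 / 4 * 3;           /* equal priority, left to right: 3 * 3 = 9 */
   PUT A= B= C= D=;          /* writes A=14 B=20 C=8 D=9 to the log */
RUN;
```

When the intended order is not obvious, adding parentheses is usually clearer than relying on the priority rules.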
Functions:
- General form of a SAS function:
    variable=function-name(argument1, argument2, . . .);
- Each argument is separated from the others by a comma. Most functions accept arguments that are constants, variables, expressions, or functions.
- Examples:
    S=SQRT(X);
    A=ABS(X);
    B=MAX(2,7);
    C=SUBSTR('INSIDE',3,4);
    D=MIN(X,7,A+B);
- Types of functions:
  - Arithmetic (absolute value, square root, mean, variance, ...)
  - Trigonometric (cosine, sine, arc cosine, ...)
  - Other mathematical and statistical (natural logarithm, exponential, ...)
  - Pseudo-random number generators
  - Character string functions
- Selected functions that compute simple statistics:
    Sum      sum
    Mean     mean
    Var      variance
    Min      minimum
    Max      maximum
    Std      standard deviation
    N        number non-missing
    Nmiss    number missing
- Simple statistics functions compute statistics for each observation (row) in the SAS data set (functions operate across the variables in a row).
- Procedures produce statistics for variables (columns) in the SAS data set (procedures operate down columns).
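The row-versus-column distinction can be sketched with a small hypothetical data set (WORK.SCORES, ID, and X1-X3 are illustrative names). The functions compute one result per observation, while PROC MEANS computes one result per variable.

```sas
DATA WORK.SCORES;
   INPUT ID X1 X2 X3;
   TOTAL    = SUM(X1, X2, X3);     /* sum within this observation */
   AVG      = MEAN(X1, X2, X3);    /* mean within this observation */
   NVALID   = N(X1, X2, X3);       /* number of non-missing arguments */
   NMISSING = NMISS(X1, X2, X3);   /* number of missing arguments */
   DATALINES;
1 10 20 30
2 5 . 15
;
RUN;

/* For ID 2 the missing X2 is ignored:
   TOTAL=20, AVG=10, NVALID=2, NMISSING=1 */
PROC PRINT DATA=WORK.SCORES;
RUN;

/* Statistics down each column instead */
PROC MEANS DATA=WORK.SCORES N MEAN SUM;
   VAR X1 X2 X3;
RUN;
```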
Subsetting Observations
A common type of transformation is subsetting observations: creating a new SAS data set with the same variables as the input data set, but only those observations that satisfy some selection criterion. The subsetting IF statement can be used to accomplish this.

Syntax: IF logical expression;

where logical expression is given by one of the following:
    1.  expression  {GT | LT | GE | LE | EQ | NE}  expression
    2.  logical expression1  {AND | OR}  logical expression2
where "expression" can be any of the forms discussed for assignment statements.

- If the expression is true, execution of the step continues for this observation.
- If the expression is false:
  - SAS stops executing statements for this observation immediately, and
  - returns to the top of the data step and begins processing the next observation.
- Examples:
    IF AGE GT 35;
    IF DEPT EQ 'FURS';
- Complex logical expressions can be constructed by combining simple logical expressions with the operators AND and/or OR.
- Examples:
    IF (HT GT 70) AND (WT GT 180);
    IF (DEPT EQ 'FURS') OR (CLERK EQ 'ABLE');
NAME | SEX | AGE | HT | WT
    DATA WORK.FURS;
      SET CLASSLIB.SALES;
      IF DEPT EQ 'FURS';
      OUTPUT;
      RETURN;
    RUN;

The data set WORK.FURS has 6 observations and 6 variables.
The DATA statement used 2.00 seconds.

    PROC PRINT DATA=WORK.FURS;
      TITLE1 'SELECTING OBSERVATIONS USING SUBSETTING IF';
    RUN;

The PROCEDURE PRINT used 1.00 seconds.

SELECTING OBSERVATIONS USING SUBSETTING IF

    OBS   DEPT   CLERK     PRICE    COST    WEEKDAY   DAY
     1    FURS   BURLEY   599.95   180.01   THR         5
     2    FURS   BURLEY   800.00   240.00   MON         9
     3    FURS   AGILE    590.00   182.00   SAT        14
     4    FURS   BURLEY   499.95   200.01   SAT        14
     5    FURS   BURLEY   700.00   210.00   THR        19
     6    FURS   AGILE    700.00   210.00   WED        25
Comparison Operators
    Symbol:     >    <    >=   <=   =    ^=
    Mnemonic:   GT   LT   GE   LE   EQ   NE   IN

EXAMPLES:
    if age > 35 ;
    if age gt 35 ;
    if age < 35 ;
    if age lt 35 ;
    if age >= 35 ;
    if age ge 35 ;
    if sex > name ;
    if age <= 35 ;
    if age le 35 ;
    if age=35 ;
    if age= 35 ;
    if age eq 35 ;
    if sex='female' ;
    if sex='FEMALE' ;
    if sex='' ;
    if sex=' ' ;
    if age ne 35 ;
    if age ^= 35 ;
    if ht < wt ;
    if ht <= .z ;
    if sex=' ' ;
    IF sex in('MALE','FEMALE') ;
    IF age in(30,34) ;
    Symbol:     &     |    ^
    Mnemonic:   AND   OR   NOT

EXAMPLES:
    IF AGE=35 AND HT=40 ;
    IF (AGE=35) & (HT=40) ;
    IF SEX EQ 'FEMALES' AND AGE IN(30,35) ;
    IF AGE>=16 AND AGE<=65 ;
    IF 16<= AGE <=65 ;
    IF HT>WT OR AGE=40 ;
    IF (HT>WT) | (AGE=40) ;
    IF AGE=20 OR AGE=30 OR AGE=40 ;
    IF AGE IN(20,30,40) ;
    IF NOT(SEX='MALE') ;
    IF SEX NE 'MALES' ;
    IF AGE ;
    IF (HT > WT) ;
    IF (AGE) & (HT > WT) ;
    NEWVAR=(HT>WT) ;
    NEWVAR=(AGE=40) ;
Example
    PROC PRINT DATA=CLASSLIB.CLASS ;
      TITLE 'PRINT OUT CLASS DATA SET' ;
    RUN;
NOTE: The PROCEDURE PRINT used 1.00 seconds.

    DATA CLASS2 ;
      SET CLASSLIB.CLASS ;
      IF AGE ;
      OUTPUT ;
      RETURN ;
    RUN ;
NOTE: The data set WORK.CLASS2 has 5 observations and 5 variables.
NOTE: The DATA statement used 2.00 seconds.

    PROC PRINT DATA=CLASS2 ;
      TITLE 'PRINT OUT CLASS2 DATA SET' ;
    RUN;
NOTE: The PROCEDURE PRINT used 1.00 seconds.

PRINT OUT CLASS DATA SET
    OBS:   1  2  3  4  5  6
    NAME:  CHRISTIANSEN, HOSKING J, HELMS R, PIGGY M, FROG K, GONZO
    SEX:   M  M  M  F  M   (one value is blank)
    AGE:   37  31  41  .  3  14
    HT:    71  70  74  48  12  25
    WT:    195  160  195  .  1  45

PRINT OUT CLASS2 DATA SET
    OBS:   1  2  3  4  5
    NAME:  CHRISTIANSEN, HOSKING J, HELMS R, FROG K, GONZO
    SEX:   M  M  M  M   (one value is blank)
    AGE:   37  31  41  3  14
    HT:    71  70  74  12  25
    WT:    195  160  195  1  45
CONCATENATION EXAMPLE
    data one ;
      set classlib.class ;

      c1 = 'dept' ;
      c2 = 'bios' ;
      c3 = c1 || c2 ;

      length c4 $ 8 ;
      c4 = 'dept' ;
      c5 = c4 || c2 ;

      c6 = c1 || ' of ' || c2 ;

      keep c1-c6 ;
    run;
NOTE: The data set WORK.ONE has 6 observations and 6 variables.
NOTE: The DATA statement used 2.00 seconds.

    title 'concatenation example' ;
    proc print ;
    run;
NOTE: The PROCEDURE PRINT used 1.00 seconds.

    proc contents ;
    run;
NOTE: The PROCEDURE CONTENTS used 1.00 seconds.

CONCATENATION EXAMPLE

    OBS   C1     C2     C3         C4     C5             C6
     1    dept   bios   deptbios   dept   dept    bios   dept of bios
     2    dept   bios   deptbios   dept   dept    bios   dept of bios
     3    dept   bios   deptbios   dept   dept    bios   dept of bios
     4    dept   bios   deptbios   dept   dept    bios   dept of bios
     5    dept   bios   deptbios   dept   dept    bios   dept of bios
     6    dept   bios   deptbios   dept   dept    bios   dept of bios

CONTENTS PROCEDURE

    Data Set Name: WORK.ONE     Type:
    Observations:  6            Record Len: 52
    Variables:     6

-----Alphabetic List of Variables and Attributes-----

    #   Variable   Type   Len   Pos   Label
    1   C1         Char     4     4
    2   C2         Char     4     8
    3   C3         Char     8    12
    4   C4         Char     8    20
    5   C5         Char    12    28
    6   C6         Char    12    40
WHERE STATEMENT
- The WHERE statement allows you to select observations from an existing SAS data set that meet a particular condition before the SAS system brings observations into the data step.
- WHERE selection is the first operation the SAS system performs in each execution of a SET, MERGE, or UPDATE operation.
- The WHERE statement is not executable; that is, it can't be used as part of an IF/THEN statement.
- The WHERE statement is not a replacement for the IF statement; the two work differently and can produce different output data sets. A data step can use either statement, both, or neither.
- SYNTAX:
    WHERE where_expression ;
  in which where_expression is an arithmetic or logical expression.
- EXAMPLES:
    where age>50 ;
    where sex='FEMALE' and ht=. ;

WHERE EXPRESSIONS

- A WHERE expression is a sequence of operands and operators. You cannot use variables created within the data step or variables created in assignment statements.
- A WHERE expression can use the following operators:
    Arithmetic operators:   *  /  +  -
    Comparison operators:   =  ^=  >  <  >=  <=  IN
WHERE vs IF
- The WHERE statement works before observations are brought into the data step (that is, into the PROGRAM DATA VECTOR).
- The IF statement works on observations that are already in the data step.
- The WHERE statement is not executable, but the IF statement is.
- The WHERE statement operates only on observations in SAS data sets, whereas the IF statement can operate either on observations from existing SAS data sets or on observations created with an INPUT statement.
- If a BY statement does not accompany a SET or MERGE statement, the WHERE and IF statements usually produce the same result.
- In almost all cases a WHERE statement is more efficient than an IF statement (observations do not have to be moved into the PDV).
- The WHERE statement, but not the IF statement, can be used in SAS PROCs.
- EXAMPLES:
    DATA ONE ;
      SET TWO ;
      WHERE AGE>35 ;
    RUN;

    DATA ONE ;
      SET TWO ;
      WHERE AGE ;
    RUN;

    PROC PRINT DATA=CLASSLIB.CLASS ;
      WHERE SEX='FEMALE' ;
    RUN;

    PROC MEANS DATA=CLASSLIB.CLASS ;
      WHERE 25 < AGE <= 35 AND SEX='MALE' ;
    RUN;
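The sketch below contrasts the two statements on a small hypothetical data set (WORK.CLASS, RATIO, and the cutoff 2.5 are illustrative assumptions). The first two steps select the same observations; the third tests a variable created inside the step, which only the IF statement can do.

```sas
/* Hypothetical input data set */
DATA WORK.CLASS;
   INPUT NAME $ AGE HT WT;
   DATALINES;
ALICE 37 71 195
BOB 31 70 160
CAROL 41 74 195
;
RUN;

/* WHERE: selection happens before observations enter the PDV */
DATA WORK.OLDER_W;
   SET WORK.CLASS;
   WHERE AGE > 35;
RUN;

/* Subsetting IF: every observation is read, then tested */
DATA WORK.OLDER_I;
   SET WORK.CLASS;
   IF AGE > 35;
RUN;

/* RATIO does not exist in WORK.CLASS, so it cannot appear in a
   WHERE expression; the subsetting IF can test it */
DATA WORK.HEAVY;
   SET WORK.CLASS;
   RATIO = WT / HT;
   IF RATIO > 2.5;
RUN;

/* WHERE also works in procedures */
PROC PRINT DATA=WORK.CLASS;
   WHERE AGE > 35;
RUN;
```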
print classlib.sales no WHERE statement
    OBS:   1  2  3  4  5  6
    NAME:  CHRISTIANSEN, HOSKING J, HELMS R, PIGGY M, FROG K, GONZO
    SEX:   M  M  M  F  M   (one value is blank)
    AGE:   37  31  41  .  3  14
    HT:    71  70  74  48  12  25
    WT:    195  160  195  .  1  45

print classlib.sales using a WHERE statement
    OBS   NAME      SEX   AGE   HT   WT
     4    PIGGY M   F       .   48    .

print classlib.sales using a WHERE statement
    OBS:   4  5  6
    NAME:  PIGGY M, FROG K, GONZO
    SEX:   F  M   (one value is blank)
    AGE:   .  3  14
    HT:    48  12  25
    WT:    .  1  45
Subsetting Variables
Another type of transformation that can be performed with a data step is to create a data set containing a subset of the variables from the input data set. The DROP or KEEP statement can be used to accomplish this.
- Syntax:
    DROP variable list;
  or
    KEEP variable list;
- Notes:
  - Only one of the two statements should be used in a step:
    - if the DROP statement is used, the variables listed are not included in the output data set;
    - if the KEEP statement is used, the variables listed are the only ones included in the output data set.
  - The KEEP or DROP statement only defines which values are written from the program data vector to the output data set; all values are available during the execution of the step.
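A minimal sketch of the last note, using hypothetical names (WORK.CLASS, WORK.RATIOS, RATIO): the dropped variables can still be used in calculations during the step, and are only left out when the observation is written.

```sas
/* Hypothetical input data set */
DATA WORK.CLASS;
   INPUT NAME $ HT WT;
   DATALINES;
ALICE 71 195
BOB 70 160
;
RUN;

DATA WORK.RATIOS;
   SET WORK.CLASS;
   RATIO = WT / HT;   /* HT and WT are still available in the PDV */
   DROP HT WT;        /* they are only excluded from the output data set */
RUN;

/* WORK.RATIOS contains NAME and RATIO only */
PROC PRINT DATA=WORK.RATIOS;
RUN;
```

Because DROP is not executable, its position within the step does not change the result.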
SUBSETTING VARIABLES WITH THE DROP STATEMENT
    OBS:   1  2  3  4  5  6
    SEX:   M  M  M  F  M   (one value is blank)
    AGE:   37  31  41  .  3  14
    HT:    71  70  74  48  12  25
    WT:    195  160  195  .  1  45
    DATA WORK.NAMEONLY;
      SET CLASSLIB.CLASS;
      OUTPUT;
      KEEP NAME;
      RETURN;
    RUN;

The data set WORK.NAMEONLY has 6 observations and 1 variable.
The DATA statement used 2.00 seconds.

    PROC PRINT DATA=WORK.NAMEONLY;
      TITLE1 'SUBSETTING VARIABLES WITH THE KEEP STATEMENT';
    RUN;

The PROCEDURE PRINT used 0.00 seconds.

SUBSETTING VARIABLES WITH THE KEEP STATEMENT

    OBS   NAME
     1    CHRISTIANSEN
     2    HOSKING J
     3    HELMS R
     4    PIGGY M
     5    FROG K
     6    GONZO
- Notes for character variables:
  - The length of a character variable is determined by the first statement in which the compiler sees the variable. When used, the LENGTH statement should precede any assignment or SET statement involving the variable in question.
  - When character variables of different lengths are compared, the shorter value is padded with blanks on the right to match the length of the longer variable (in memory only).
- Notes for numeric variables:
  - The valid length of a numeric variable is 2-8 bytes on the mainframe and 3-8 bytes on the PC.
  - The default length for numeric variables is 8 bytes; you should specify shorter lengths ONLY FOR INTEGERS, being sure to take into account the maximum integer that can be stored in a given number of bytes as specified in the length tables on the next page. Nonintegers stored in less than 8 bytes will lose precision because they will be truncated.
  - In the PDV, all numbers are stored in 8 bytes.
Largest Integer by Length for SAS Numeric Variables under MVS and PC
(table: Length in Bytes, 2 through 8, with one column each for MVS and PC)
LENGTH STATEMENT
USAGE NOTES:
- The placement of the LENGTH statement in the data step determines its effectiveness. If it is placed before the first reference to a variable, SAS will store the variable in the indicated number of bytes. If it is placed after the step's first reference to a variable, it will have no effect, nor will SAS produce an error message.
- There is no correspondence between the number of columns used for a numeric variable and the number of bytes specified in the LENGTH statement.
- For numeric variables, lengths of less than eight should only be used for integers.
- It is usually a good idea to specify lengths for all calculated or assigned character variables.

    data one ;
      set two ;
      length size $ 6 ;
      if ht<10 then size = 'small' ;
      if ht>=10 then size = 'medium' ;
    run;
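To illustrate why a LENGTH statement matters for assigned character variables, the sketch below assigns the same values to two variables, one declared with LENGTH before its first use and one not. The data set and variable names (WORK.SIZES, SIZE1, SIZE2) are hypothetical.

```sas
DATA WORK.SIZES;
   LENGTH SIZE2 $ 6;                   /* declared before the first reference */
   INPUT HT;
   /* SIZE1 has no LENGTH statement: its length (5) is taken from
      'small', the first value the compiler sees */
   IF HT < 10 THEN SIZE1 = 'small';
   ELSE SIZE1 = 'medium';              /* stored as 'mediu' */
   IF HT < 10 THEN SIZE2 = 'small';
   ELSE SIZE2 = 'medium';              /* stored in full */
   DATALINES;
5
12
;
RUN;

PROC PRINT DATA=WORK.SIZES;
RUN;
```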
ASSIGNING CONSTANTS
    OBS:   1  2  3  4  5  6
    NAME:  CHRISTIANSEN, HOSKING J, HELMS R, PIGGY M, FROG K, GONZO
    SEX:   M  M  M  F  M   (one value is blank)
    AGE:   37  31  41  .  3  14
    HT:    71  70  74  48  12  25
    WT:    195  160  195  .  1  45
    C1:    CSCC  CSCC  CSCC  CSCC  CSCC  CSCC
    C2:    cscc  cscc  cscc  cscc  cscc  cscc
    C3:    csc  csc  csc  csc  csc  csc
    N1:    100  100  100  100  100  100
    N2:    100  100  100  100  100  100
    N3:    100  100  100  100  100  100
    N4:    100  100  100  100  100  100

CONTENTS PROCEDURE

    Data Set Name: WORK.ONE     Type:
    Observations:  6            Record Len: 85
    Variables:     12
    Label:

-----Alphabetic List of Variables and Attributes-----

     #   Variable   Type   Len   Pos   Label
     3   AGE        Num      8    17
     6   C1         Char     4    41
     7   C2         Char     4    45
     8   C3         Char     4    49
     4   HT         Num      8    25
     9   N1         Num      8    53
    10   N2         Num      8    61
    11   N3         Num      8    69
    12   N4         Num      8    77
     1   NAME       Char    12     4
     2   SEX        Char     1    16
     5   WT         Num      8    33
    Date:        Jan 1, 1953     Jan 1, 1960     Nov 24, 1983
    SAS date:       -2556             0              8728

Notes:
- The baseline of January 1, 1960 is arbitrary.
- Any dates from 1582 to 20,000 AD are valid.
- SAS accounts for leap years, century and fourth-century adjustments.
- Although date and date-time values have implied baseline times, differences in these values are directly interpretable. For example, the number of days from January 1, 1953 to November 24, 1983 is:
    8728 - (-2556) = 11284 days
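The arithmetic in the last note can be checked with a short DATA _NULL_ step that writes the stored values to the log. It uses SAS date constants (a quoted date followed by the letter d); the variable names are hypothetical.

```sas
DATA _NULL_;
   D1   = '01JAN1953'D;
   BASE = '01JAN1960'D;
   D2   = '24NOV1983'D;
   DAYS = D2 - D1;             /* difference of two SAS dates is in days */
   PUT D1= BASE= D2= DAYS=;    /* D1=-2556 BASE=0 D2=8728 DAYS=11284 */
RUN;
```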
The d at the end of the constant ensures that SAS does not confuse the string with a character constant.
- Examples:
    Date1 = '07OCT1999'd;
    If evdate <= '21JUL1987'd;
    If '01JUL1990'd <= bdate <= '30JUL1990'd ;
- There are several useful functions available for handling SAS dates:
    YEAR(SAS-date)        extracts the year from a SAS date and returns a 4-digit year value.
    MONTH(SAS-date)       extracts the month from a SAS date and returns a number between 1 and 12.
    DAY(SAS-date)         extracts the day from a SAS date and returns a number between 1 and 31.
    TODAY()               extracts the date from the computer system's clock and stores the value as a SAS date. This function does not require any arguments.
    MDY(month,day,year)   creates a SAS date from separate month, day, and year variables. Arguments can be SAS numeric variables or constants. A missing or out of range value creates a missing value.
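A brief sketch of these functions in a DATA _NULL_ step; the variable names are hypothetical. The value of T depends on the day the step is run, so it is not shown.

```sas
DATA _NULL_;
   BDATE = MDY(11, 24, 1983);   /* build a SAS date from its parts */
   Y = YEAR(BDATE);             /* 1983 */
   M = MONTH(BDATE);            /* 11 */
   D = DAY(BDATE);              /* 24 */
   BAD = MDY(13, 1, 1990);      /* month out of range: missing value */
   T = TODAY();                 /* current date as a SAS date */
   PUT BDATE= Y= M= D= BAD= T=;
RUN;
```

SAS also writes a note to the log when MDY receives an invalid argument, as in the BAD assignment above.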
Simple Calculations
Using SAS date variables you can easily find the time elapsed between two dates. Simply subtract the dates to find the number of elapsed days, then, if necessary, divide the number to scale it to months, years, weeks, or any other unit of interest.
    Days = date2 - date1 ;
    Months = (date2 - date1) / 30.4 ;
    Years = (date2 - date1) / 365.25 ;
CONTENTS PROCEDURE

    Data Set Name: WORK.ONE     Type:
    Observations:  6            Record Len: 40
    Variables:     4
    Label:

-----Alphabetic List of Variables and Attributes-----

    #   Variable   Type   Len   Pos   Label
    2   DATE1      Num      8    16
    3   DATE2      Num      8    24
    4   DATE3      Num      8    32
    1   NAME       Char    12     4
ARITHMETIC OPERATIONS
    OBS:   1  2  3  4  5  6
    NAME:  CHRISTIANSEN, HOSKING J, HELMS R, PIGGY M, FROG K, GONZO
    SEX:   M  M  M  F  M   (one value is blank)
    AGE:   37  31  41  .  3  14
    HT:    71  70  74  48  12  25
    WT:    195  160  195  .  1  45
    N1:    71  70  74  48  1  25
    N2:    35  31  35  35  3  14
    N3:    71  70  74  48  12  25
    N4:    .  .  .  .  .  .
CONTENTS PROCEDURE
    Data Set Name: WORK.ONE     Type:
    Observations:  6            Record Len: 60
    Variables:     8
    Label:

-----Alphabetic List of Variables and Attributes-----

    #   Variable   Type   Len   Pos   Label
    3   AGE        Num      8    17
    6   C1         Num      8    41
    8   C2         Char     3    57
    4   HT         Num      8    25
    7   N1         Num      8    49
    1   NAME       Char    12     4
    2   SEX        Char     1    16
    5   WT         Num      8    33
OBS 1 2 3 4 5 6
SEX M M M F M
AGE 37 31 41 . 3 14
HT 71 70 74 48 12 25
C1 . . . . . .
N1 . . . . . .
C2 37 31 41 . 3 14