Programming III
Course Notes
SAS® Programming III: Advanced Techniques Course Notes was developed by Linda Jolley and
Jane Stroupe. Additional contributions were made by Bill Brideson, George Berg, Ted Meleky,
Rich Papel, Dr. Sue Rakes, Kent Reeve, Christine Riddiough, and Roger Staum. Editing and production
support was provided by the Curriculum Development and Support Department.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
Table of Contents
Prerequisites
2.1 Introduction
3.5 Combining Summary and Detail Data Using Two SET Statements (Self-Study)
6.1 Introduction
7.1 Introduction
8.1 Introduction
Course Description
This course builds on the concepts presented in the SAS Programming II: Manipulating Data with the
DATA Step course. This course focuses on reading data with direct access; combining data; sorting; using
multidimensional arrays, hash tables, and formats for table lookups; efficiently storing data; utilizing best
practices; and creating tables with the SAS Scalable Performance Data Engine.
This course is a combination of the previously offered SAS Programming III: Advanced Techniques and
Optimizing SAS Programs courses.
To learn more…
For a list of other SAS books that relate to the topics covered in this
Course Notes, USA customers can contact our SAS Publishing Department at
1-800-727-3228 or send e-mail to sasbook@sas.com. Customers outside the
USA, please contact your local SAS office.
Also, see the Publications Catalog on the Web at support.sas.com/pubs for a
complete list of books and a convenient order form.
Prerequisites
This course is not appropriate for beginning SAS software users. Before attending this course, you should
have at least nine months of SAS programming experience and should have completed the SAS
Programming II: Manipulating Data with the DATA Step course. Specifically, you should be able to do
the following:
understand your operating system file structures and perform basic operating system tasks
use different kinds of input to create SAS data sets from external files
02DEC1999 58907 RDU LHR
03DEC1999 108543 RDU LHR
04DEC1999 21963 RDU LHR
05DEC1999 31517 RDU LHR
− a summary data set
with a detail data set
06DEC1999 105682 RDU LHR
07DEC1999 66992 RDU LHR
08DEC1999 92873 RDU LHR
09DEC1999 59560 RDU LHR
10DEC1999 41096 RDU LHR
11DEC1999 10272 RDU LHR
1-4 Chapter 1 Introduction
1.2 Measuring Efficiencies 1-5
Objectives
Identify the resources used by a SAS program.
Use SAS system options to measure computer
resources.
Interpret resource usage statistics in your operating
environment.
Benchmark resource usage.
CPU measures the amount of time that the Central Processing Unit uses to perform
requested tasks such as calculations, reading and writing data, conditional and
iterative logic, and so on.
I/O provides a measurement of the read-and-write operations performed as data and
programs are moved from a storage device to memory (input) or from memory to a
storage or display device (output).
Memory is the size of the work area required to hold executable program modules, data, and
buffers.
Data storage space is the amount of space on a disk or tape required to store data.
Programmer time is the amount of time required for the programmer to write and maintain the
program. This can be decreased through well-documented, logical programming
practices.
Networking is the amount of time required to transfer data across your computer network. This
can be decreased by performing as much of the subsetting and summarizing as
possible on the remote computer before transferring the data to the local computer.
The networking time is dependent on the bandwidth of your I/O controller.
[Figure: resource trade-offs. Data space versus CPU time; I/O versus memory usage.]
You must decide which factors are the most important for improving resource usage at your site. To make
this decision, you must know the following:
• which resources are scarce or costly at your site
• how and when your programs will be used
• the type and volume of data your programs will process
Environmental factors that affect the efficiency of SAS programs include the following:
Hardware: the amount of available memory, the number of peripheral devices attached to
the CPU, and the communications hardware in use
Operating environment: resource allocation, scheduling algorithms, and I/O methods
System load: the number of users or jobs sharing system resources, including network
bandwidth and traffic
SAS environment: which SAS software products are installed, how they were
installed, and which methods are available to run SAS programs at your site
In most cases, one or two resources are the most limited or most expensive for your programs. You can
usually decrease the amount of critical resources that are used if you are willing to sacrifice some
efficiency of the resources that are less critical at your site.
• Developing an efficient program requires time and thought. The first question to address is whether the
additional amount of resources saved is worth the time and effort spent to achieve the savings.
• Consider the size of the program or the files that are processed. As the programs or files increase in
size, the potential for savings increases. Therefore, devote your effort to improving the efficiency of
large programs.
• Also consider the number of times the program will run. The difference in the resources used by an
inefficient program and an efficient program that run one time or a few times is relatively small,
whereas the cumulative difference for a program that is run frequently is large.
The effectiveness of any efficiency technique depends greatly on the data with which you use it. When
you know the characteristics of your data, you can select the techniques that take advantage of those
characteristics.
Considering Trade-Offs
In this class, each task will be performed using one or
more techniques.
You should benchmark with your own data to determine
which technique is the most efficient.
There are four SAS system options that you can use to track and report on resource utilization:
STIMER tracks the CPU time used to perform a task (DATA or PROC step). CPU time can be
divided into System CPU time and User CPU time.
MEMRPT tracks memory used while performing a task.
FULLSTIMER tracks usage of additional resources. This option is ignored unless STIMER or
MEMRPT is in effect. It can also be specified by the alias FULLSTATS.
STATS writes information tracked by the above options to the SAS log.
The availability and usage of these options are specific to the operating environment.
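The following is a minimal sketch of how these options might be used to benchmark a single step. It assumes that the libref ia is already assigned and that FULLSTIMER is available in your operating environment. The statistics written to the log differ by environment.

```sas
/* Sketch only: ia.sales is the course data set; substitute your own. */
options fullstimer;      /* request the extended resource statistics  */

data _null_;             /* read every observation, write no output   */
   set ia.sales;
run;

options nofullstimer;    /* turn the extended statistics off again    */
```

Run the step several times and compare the log statistics across runs, because a single run can be distorted by system load and caching.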
STIMER I BD BD
FULLSTIMER B B B
Use the OPTIONS procedure with the HOST option to determine the default settings of these
options at your site.
proc options host;
run;
You can find more information on operating environment dependencies in the SAS documentation for
your operating environment.
OPTIONS SASTRACE=',,,d' | ',,t,' | ',,t,s';
• In order to turn SAS tracing off, you can specify the following option:
options sastrace=off;
Objectives
Investigate the concept of a data set page and
how it relates to the structure of SAS data sets.
Review how SAS reads and writes data.
Partial Output
Engine/Host Dependent Information
30 c01s3d1
The total number of bytes occupied by ia.sales can be calculated as shown below:
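The original calculation is not reproduced in these notes. As a hedged sketch, and assuming that the intended calculation multiplies the page size by the number of pages, both values can be read from the Engine/Host Dependent Information section of PROC CONTENTS output:

```sas
/* Sketch only: assumes the libref ia is assigned.                     */
/* In the Engine/Host Dependent Information section, look for          */
/* "Data Set Page Size" and "Number of Data Set Pages".                */
proc contents data=ia.sales;
run;

/* Assumed calculation (not taken from the original notes):            */
/*    total bytes = Data Set Page Size * Number of Data Set Pages      */
```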
The data set ia.sales used for demonstrations and exercises contains fewer observations than
the data set ia.sales used for the course notes.
1.3 SAS Processing 1-21
Data might be cached in storage devices. On UNIX and Windows, data might also be cached by the
file system.
1.4 Controlling Memory and I/O Resources 1-23
Objectives
Change the page size of a SAS data set.
Use system and data set options to control memory
usage.
Use the SASFILE statement when you read small
SAS data sets.
Use the Scatter/Gather I/O feature in the Windows
operating environment.
BUFSIZE=n | nK | nM | nG | nT | hexX | MIN | MAX
BUFNO=n
Increasing the BUFSIZE= option is useful for SAS data sets that are read sequentially (top to bottom).
Using small BUFSIZE= and larger BUFNO= options is useful for SAS data sets that are read randomly.
Random access to SAS data is discussed in Chapter 2.
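As an illustration of the two options, here is a minimal sketch that uses them as data set options. The values 16k and 4 and the data set name work.sales_seq are arbitrary assumptions chosen for the example, not recommendations. Benchmark with your own data.

```sas
/* BUFSIZE= is applied when the data set is created and becomes a      */
/* permanent attribute of the output data set.                         */
data work.sales_seq (bufsize=16k);
   set ia.sales;
run;

/* BUFNO= applies only to the step in which it is specified.           */
proc means data=work.sales_seq (bufno=4);
run;
```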
Reference Information
BUFSIZE=n | nK | nM | nG | nT | hexX | MIN | MAX
n | nK | nM | nG | nT
specifies the page size in multiples of 1 (bytes); 1,024 (kilobytes); 1,048,576 (megabytes);
1,073,741,824 (gigabytes); or 1,099,511,627,776 (terabytes). For example, a value of 8 specifies 8
bytes, and a value of 3m specifies 3,145,728 bytes.
The default is 0, which causes SAS to use the minimum optimal page size for the operating
environment.
hexX
specifies the page size as a hexadecimal value. You must specify the value beginning with a number
(0-9), followed by an X. For example, the value 2dx sets the page size to 45 bytes.
MIN
sets the page size to the smallest possible number in your operating environment, down to the
smallest four-byte, signed integer, which is -2^31 (-2,147,483,648), or approximately -2 billion bytes.
CAUTION: This setting might cause unexpected results and should be avoided.
Use BUFSIZE=0 in order to reset the buffer page size to the default value in your operating environment.
MAX
sets the page size to the maximum possible number in your operating environment, up to the largest
four-byte, signed integer, which is 2^31-1 (2,147,483,647), or approximately 2 billion bytes.
Windows:
n | nK | nM | nG
specifies the buffer page size in multiples of 1; 1,024 (kilobytes); 1,048,576 (megabytes), and
1,073,741,824 (gigabytes), respectively. You can specify decimal values for the number of
kilobytes, megabytes, or gigabytes. For example, a value of 8 specifies 8 bytes, a value of .782k
specifies 801 bytes, and a value of 3m specifies 3,145,728 bytes.
hexX
specifies the buffer page size as a hexadecimal value. You must specify the value beginning with a
number (0-9), followed by an X. For example, the value 2dx sets the buffer page size to 45 bytes.
MIN
sets the buffer page size to -2,147,483,648 and requires SAS to use a default value. Under
Windows, the default value is 0. The minimum number is -2,147,483,648.
MAX
sets the buffer page size to 2,147,483,647 bytes.
UNIX:
n | nK | nM | nG
specifies the buffer page size in multiples of 1 (bytes); 1,024 (kilobytes); 1,048,576 (megabytes); or
1,073,741,824 (gigabytes). You can specify decimal values for the number of kilobytes, megabytes,
or gigabytes. For example, a value of 8 specifies 8 bytes, a value of .782k specifies 801 bytes, and a
value of 3m specifies 3,145,728 bytes.
hexX
specifies the buffer page size as a hexadecimal value. You must specify the value beginning with a
number (0-9), followed by hex digits (0-9, A-F), and then followed by an X. For example, 2dx sets
the buffer page size to 45 bytes.
MIN
sets the buffer page size to 0. When the buffer size is 0, the BASE engine calculates a buffer size to
optimize CPU and I/O use. This size is the smallest multiple of 8K that can hold 80 observations but
is not larger than 64K.
MAX
sets the buffer page size to 2,147,483,647.
Reference Information
z/OS:
BUFSIZE=0 | n | nK
0
specifies that SAS choose the optimal page size of the data set based on the characteristics of the
library and the type of data set.
n | nK
specifies the permanent buffer size (page size) in bytes or kilobytes, respectively. For libraries other
than HFS, the value specified will be rounded up to the block size (BLKSIZE) of the library data
set, because a block is the smallest unit of a data set that may be transferred in a single I/O
operation.
BUFNO=n | nK | nM | nG | hexX | MIN | MAX
Windows:
n | nK | nM | nG
specifies the number of buffers in multiples of 1; 1,024 (K); 1,048,576 (M); or 1,073,741,824 (G).
You can specify decimal values for the K, M, or G multipliers. For example, a value of 8 specifies
8 buffers, a value of .782k specifies 801 buffers, and a value of 3m specifies 3,145,728 buffers.
For values greater than 1G, use the nM option or specify MAX.
hexX
specifies the number of buffers as a hexadecimal value. You must specify the value beginning with
a number (0-9), followed by an X. For example, the value 2dx specifies 45 buffers.
MIN
sets the number of buffers to 0, and requires SAS to use the default value of 1.
MAX
sets the number of buffers to 2,147,483,647.
UNIX:
n | nK | nM | nG
specifies the number of buffers in multiples of 1; 1,024 (K); 1,048,576 (M); or 1,073,741,824 (G).
You can specify decimal values for the K, M, or G multipliers. For example, a value of 8 specifies
8 buffers, a value of .782k specifies 801 buffers, and a value of 3m specifies 3,145,728 buffers.
hexX
specifies the number of buffers as a hexadecimal value. You must specify the value beginning with
a number (0-9), followed by hex digits (0-9, A-F), and then followed by an X. For example, 2dx
specifies 45 buffers.
MIN
sets the number of buffers to 0, and requires SAS to use the default value of 1.
MAX
sets the number of buffers to 2,147,483,647.
For more information, consult SAS OnlineDoc 9.1.3. Expand Base SAS, and select SAS
Language Reference: Dictionary and Operating Environment Specific Information.
[Figure: a page of data is transferred to a buffer in one operation; with bufno = 3, pages 1, 2, and 3 are held in buffers.]
The SASFILE statement can reduce execution time by taking advantage of large amounts of memory. The
SASFILE statement became available in SAS Release 8.1.
SASFILE <libref.>member-name <(password-data-set-option(s))> OPEN | LOAD | CLOSE;
OPEN opens the file and allocates the buffers, but defers reading the data into memory until a
procedure or a statement that references the file is executed.
LOAD opens the file, allocates the buffers, and reads the data into memory.
CLOSE frees the buffers and closes the file.
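A minimal sketch of the LOAD and CLOSE forms follows. It assumes that the libref ia is assigned and that enough memory is available to hold the entire data set. The two procedure steps are arbitrary examples of steps that read the file.

```sas
/* Open the file, allocate buffers, and read the data into memory.     */
sasfile ia.sales load;

/* Both steps read the copy of the data that is held in memory.        */
proc print data=ia.sales (obs=5);
run;

proc means data=ia.sales;
run;

/* Free the buffers and close the file.                                */
sasfile ia.sales close;
```

The benefit comes from reading the file more than once while it is held in memory, so a single pass over the data gains little from this technique.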
Buffer Allocation
When the SASFILE statement executes, SAS allocates
the number of buffers based on the number of pages of
the SAS data set and index file.
If the file in memory increases in size during processing
by editing or appending data, the number of buffers also
increases.
NOSGIO | SGIO;
The Scatter-Read/Gather-Write feature is active only for SAS I/O files that have the following attributes:
• a page size that is a multiple of 4K (for example, 4096 or 8192) on 32-bit systems
• a page size that is a multiple of 8K (for example, 8192 or 16384) on 64-bit systems
If an I/O file does not meet these criteria, SGIO is inactive for that file even though the SGIO option is
specified.
To learn more, visit this page: http://support.sas.com/techsup/technote/ts710.html.
Exercises
1) CPU
2) I/O
3) Memory
options fullstimer;
options nofullstimer;
b. Turn off the option after you record the statistics.
2.1 Introduction
Objectives
Review sequential processing.
Investigate methods for direct access.
2-4 Chapter 2 Accessing Observations
[Figure: sequential processing. Observations move from the input SAS data set through buffers in memory into the PDV (ID, Flight, Route, Dest), and from the PDV through buffers to the output SAS data set.]
Sequential processing continues until the pointer reaches the end of file.
2.2 Creating a Sample Data Set 2-7
Objectives
Create a systematic sample that contains five
observations.
Create a systematic sample that contains an unknown
number of observations.
Create a random sample with replacement.
Create a random sample without replacement.
Selecting Observations
International Airlines (IA) is concerned with the accuracy
of the data in ia.sales that contains revenue figures
for 2004 and 2005. The size of the data set makes
auditing all of the data difficult. IA first wants to audit a
small sample to determine if a full audit is necessary.
Partial Output
FlightID RouteID Origin Dest DestType FltDate Cap1st CapBus CapEcon CapPassTotal CapCargo Num1st NumBus NumEcon NumPassTotal
IA10700 0000107 WLG AKL International 01JAN2005 12 . 138 150 36900 11 . 126 137
IA10701 0000107 WLG AKL International 01JAN2005 12 . 138 150 36900 12 . 136 148
IA10702 0000107 WLG AKL International 01JAN2005 12 . 138 150 36900 10 . 112 122
IA10703 0000107 WLG AKL International 01JAN2005 12 . 138 150 36900 12 . 113 125
IA10704 0000107 WLG AKL International 01JAN2005 12 . 138 150 36900 10 . 118 128
IA10705 0000107 WLG AKL International 01JAN2005 12 . 138 150 36900 11 . 117 128
IA10700 0000107 WLG AKL International 02JAN2005 12 . 138 150 36900 10 . 131 141
IA10701 0000107 WLG AKL International 02JAN2005 12 . 138 150 36900 11 . 113 124
IA10702 0000107 WLG AKL International 02JAN2005 12 . 138 150 36900 10 . 134 144
IA10703 0000107 WLG AKL International 02JAN2005 12 . 138 150 36900 11 . 114 125
IA10704 0000107 WLG AKL International 02JAN2005 12 . 138 150 36900 11 . 128 139
IA10705 0000107 WLG AKL International 02JAN2005 12 . 138 150 36900 12 . 131 143
IA10700 0000107 WLG AKL International 03JAN2005 12 . 138 150 36900 10 . 124 134
IA10701 0000107 WLG AKL International 03JAN2005 12 . 138 150 36900 12 . 135 147
IA10702 0000107 WLG AKL International 03JAN2005 12 . 138 150 36900 12 . 127 139
2. PickIt is used by the POINT= option to select an observation from the SAS data set.
3. The OUTPUT statement writes the PDV values to the SAS data set.
4. The STOP statement stops the DATA step from continuing to execute after the five observations are
selected. Without a STOP statement, the DATA step continues in an infinite loop.
SET data-set-name POINT=point-variable;
The POINT= option value should be an integer greater than zero and less than or equal to the number of
observations in the SAS data set. If the value is not integral, the SET statement effectively applies the
FLOOR function to the value.
STOP;
c02s2d1
data work.subset;
do PickIt = 100 to 500 by 100;
set ia.sales
point = PickIt;
output;
end;
stop;
run;
The PROC PRINT output of work.subset is shown below.
Creating a Systematic Sample of 5 Observations
[Output: column headings Obs, FlightID, RouteID, Origin, Dest, DestType, FltDate, Cap1st, CapBus, CapEcon, CapPassTotal, CapCargo, Num1st, NumBus, NumEcon, NumPassTotal, Rev1st, RevBus, RevEcon, CargoRev, RevTotal, CargoWeight]
c02s2d2
1. The NOBS= option creates a temporary variable that contains the total number of observations in the
input data files. During compilation, SAS reads the descriptor portion of the data file and assigns the
value of the NOBS= variable.
The total includes deleted observations. Rebuild the data set to remove deleted observations.
2. You can refer to the NOBS= variable in executable statements that appear before the SET statement.
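The NOBS= behavior described above can be sketched as follows. This is an illustration and not the original demonstration program. The variable name TotObs and the interval of 100 are assumptions chosen for the example.

```sas
/* Systematic sample of every 100th observation when the number of     */
/* observations is not known in advance.                               */
data work.subset;
   /* TotObs is assigned during compilation, so it can be used here    */
   /* even though the SET statement appears later in the step.         */
   do PickIt = 100 to TotObs by 100;
      set ia.sales point=PickIt nobs=TotObs;
      output;
   end;
   stop;   /* required: POINT= never reaches an end-of-file condition  */
run;
```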
SET SAS-data-set NOBS=variable;
[Output: column headings Obs, FlightID, RouteID, Origin, Dest, DestType, FltDate, Cap1st, CapBus, CapEcon, CapPassTotal, CapCargo, Num1st, NumBus, NumEcon, NumPassTotal, Rev1st, RevBus, RevEcon, CargoRev, RevTotal, CargoWeight]
RANUNI(seed)
A 0 argument for the RANUNI function uses the system clock time, resulting in a different stream
of random numbers each time that the program is run.
CEIL(ranuni(seed) * 5)
Examples:
Random number    Random number * 5    CEIL( )
.01253689        0.06268445           1
.95196500        4.75982500           5
The CEIL function returns the smallest integer that is greater than or equal to the argument.
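Combining CEIL and RANUNI with the POINT= and NOBS= options gives a random sample with replacement. The step below is a sketch under stated assumptions: the sample size of 10, the loop counter i, and the variable name TotObs are chosen for the example, and a seed of 0 produces a different sample on each run.

```sas
data work.subset;
   drop i;
   do i = 1 to 10;
      /* Scale the random number to the range 1 through TotObs.        */
      PickIt = ceil(ranuni(0) * TotObs);
      set ia.sales point=PickIt nobs=TotObs;
      output;
   end;
   stop;
run;
```

Because each value of PickIt is generated independently, the same observation number can occur more than once, so the sample can contain duplicate observations.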
[Output: column headings Obs, FlightID, RouteID, Origin, Dest, DestType, FltDate, Cap1st, CapBus, CapEcon, CapPassTotal, CapCargo, Num1st, NumBus, NumEcon, NumPassTotal, Rev1st, RevBus, RevEcon, CargoRev, RevTotal, CargoWeight]
c02s2d4 (Self-Study)
Create a random sample without replacement. A sample without replacement cannot contain duplicate
observations because after an observation is output to work.subset, programmatically it cannot be
selected again.
2. ObsLeft is the number of observations that remain to be considered for selection. Its starting value
is TotObs, the total number of observations in the data set being sampled.
3. PickIt is the number of the observation to be read from the data set being sampled. Because it is
used in a SUM statement, its starting value is 0.
2) SampSize is decreased by 1.
3. ObsLeft is decreased by 1.
This is an adaptation of a sampling routine that has been used by statisticians for many years.
• The sample size is fixed.
• An observation can be selected only once.
• Each observation has an equal probability of being selected.
• The selection probability for an observation is independent of the selection of another
observation.
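A sketch of a DATA step that is consistent with the notes above follows. It is offered as an illustration under assumptions and not as the original program: the sample size of 10 is arbitrary, and the step assumes that ia.sales contains at least that many observations.

```sas
data work.subset (drop=ObsLeft SampSize);
   SampSize = 10;        /* observations still needed for the sample   */
   ObsLeft  = TotObs;    /* observations not yet considered            */
   do while (SampSize > 0);
      PickIt + 1;        /* SUM statement: starts at 0, then 1, 2, ... */
      /* Select the current observation with probability               */
      /* SampSize / ObsLeft.                                           */
      if ranuni(0) < SampSize / ObsLeft then do;
         set ia.sales point=PickIt nobs=TotObs;
         output;
         SampSize = SampSize - 1;
      end;
      ObsLeft = ObsLeft - 1;
   end;
   stop;
run;
```

When SampSize equals ObsLeft, the ratio is 1 and every remaining observation is selected, so the loop always ends with a full sample.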
Output
A Random Sample without Replacement
[Output: column headings Obs, FlightID, RouteID, Origin, Dest, DestType, FltDate, Cap1st, CapBus, CapEcon, CapPassTotal, CapCargo, Num1st, NumBus, NumEcon, NumPassTotal, Rev1st, RevBus, RevEcon, CargoRev, RevTotal, CargoWeight]
With a seed value of 0, you get different results each time that the program is executed, but it is
possible that some of the same observations will be selected as were selected in previous
executions.
PROC SURVEYSELECT options;
   STRATA variables;
   CONTROL variables;
   SIZE variable;
   ID variables;
RUN;
The STRATA statement partitions the input data set into non-overlapping groups defined by the
STRATA variables. PROC SURVEYSELECT then selects independent
samples from these strata, according to the selection method and design
parameters specified in the PROC SURVEYSELECT statement. PROC
SURVEYSELECT expects the input data set to be sorted in the order of the
STRATA variables.
The CONTROL statement names variables for sorting the input data set. The CONTROL variables can
be character or numeric. PROC SURVEYSELECT sorts the input data set by
the CONTROL variables before selecting the sample. If you also specify a
STRATA statement, PROC SURVEYSELECT sorts by the CONTROL
variables within the strata.
The SIZE statement names one and only one size measure variable, which contains the size
measures to be used when sampling with probability proportional to size.
The SIZE variable must be numeric. When the value of an observation's
SIZE variable is missing or non-positive, that observation has no chance of
being selected for the sample.
The ID statement names variables from the DATA= input data set to be included in the OUT=
data set of selected units. If there is no ID statement, PROC
SURVEYSELECT includes all variables from the DATA= data set in the
OUT= data set. The ID variables can be character or numeric.
METHOD=
SYS The method of systematic random sampling selects
units at a fixed interval throughout the sampling frame
or stratum after a random start.
URS The method of unrestricted random sampling selects
units with equal probability and with replacement.
Because units are selected with replacement, a unit
can be selected for the sample more than once.
SRS The method of simple random sampling selects units
with equal probability and without replacement. The
selection probability for each individual unit equals
n/N.
These methods correspond to the DATA step examples at the beginning of this section.
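As a hedged sketch, the three methods might be requested as shown below. The sample sizes and the output data set names are assumptions chosen for the example.

```sas
/* Systematic random sampling: fixed interval after a random start.    */
proc surveyselect data=ia.sales method=sys n=5 out=work.sample_sys;
run;

/* Unrestricted random sampling: equal probability, with replacement.  */
/* By default, a unit that is selected more than once appears once in  */
/* the output data set, with its number of hits recorded.              */
proc surveyselect data=ia.sales method=urs n=10 out=work.sample_urs;
run;

/* Simple random sampling: equal probability, without replacement.     */
proc surveyselect data=ia.sales method=srs n=10 out=work.sample_srs;
run;
```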
The SURVEYSELECT procedure step produces output similar to that of the c02s2d3 example earlier in this
chapter, except that it selects a larger sample (100 observations versus 10).
To specify a seed so that you can replicate a sample, use the SEED= option on the PROC
SURVEYSELECT statement.
proc surveyselect data = ia.sales
method = srs n = 100
out = sample
seed = 12345;
run;
[Output: column headings Obs, FlightID, RouteID, Origin, Dest, DestType, FltDate, Cap1st, CapBus, CapEcon, CapPassTotal, CapCargo, Num1st, NumBus, NumEcon, NumPassTotal, Rev1st, RevBus, RevEcon, CargoRev, RevTotal, CargoWt]
Exercises
If you obtain zero observations in one of the data sets, run the program again. It is possible
that the selected observations might all be over $30,000 or all $30,000 or less.
2. Generating a Random Sample without Replacement (Optional)
Generate a random sample without replacement of ten flights from ia.cap2000.
Objectives
Define indexes.
List the uses of indexes.
Use the DATA step to create indexes.
Use PROC DATASETS to create and maintain
indexes.
Use PROC SQL to create and maintain indexes.
Using Indexes
To decrease the time used to query a heavily used
SAS data set, create an index on ia.sales.
2.3 Creating and Using an Index 2-37
Indexed SAS Data Set
Obs FlightID RouteID Origin Dest DestType FltDate . . .
329259 IA10800 0000108 AKL WLG International 30DEC2005 . . .
329260 IA10801 0000108 AKL WLG International 30DEC2005 . . .
329261 IA10802 0000108 AKL WLG International 30DEC2005 . . .
329262 IA10803 0000108 AKL WLG International 30DEC2005 . . .
Using Indexes
An index is an optional file that you can create for a
SAS data file that does the following:
points to observations based on the values of one or
more key variables
provides direct access to specific observations
This section discusses indexes for Base SAS data files. A discussion of indexes for Scalable
Performance Data Engine (SPDE) data files is presented in a later chapter.
Without an Index
[Figure: all pages are loaded from disk into the input buffers.]
The WHERE statement selects observations by reading data sequentially.
With an Index
[Figure: only necessary pages are loaded from disk into the input buffers.]
The WHERE statement selects observations by using direct access.
When SAS uses an index to process data, SAS accomplishes the following:
• performs a binary search on the index file
• positions the index to the first entry containing a qualified value
• transfers a page of data containing the first record identifier for the qualified value to a buffer
• directly accesses the value specified by the record identifier
• positions the index to the next entry containing a qualified value
• transfers the page of data, if it is not already in the buffer
• directly accesses the value specified by the record identifier
• continues to process the data until there is no more data that satisfies the WHERE expression
If the data values are sorted in ascending order by the indexed variables, fewer I/O operations are
required. In addition, if observations with the same key values are near each other in the file, for
whatever reason, I/O will be minimized.
The index file consists of entries that are organized in
a tree structure, and connected by pointers.
When an index is used to process a request, such as
for WHERE processing, SAS searches the index file in
order to locate the requested record(s) rapidly.
FlightID
Origin
FltDate
DteFlt
Origin
Index Terminology
There are two types of indexes.
Type Based On Name Example
Index options include the following:
UNIQUE Values of the key variable(s) must be
unique. The option prevents an observation
with a duplicate value for the key variable(s)
from being added to the data set.
In an existing data set, if the variable(s) on which you attempt to create a unique index has duplicate
values, the index is not created and an error message is written to the SAS log.
Creating Indexes
To create indexes at the same time that you create a
data set, use the INDEX= data set option on the
output data set.
To create or delete indexes in existing data sets,
use one of the following:
− DATASETS procedure
− SQL procedure
Indexes can also be created using the SAS Management Console that is part of SAS Business Intelligence
Architecture.
Creating Indexes
When creating the index, you can do the following:
designate the key variable(s)
For increased efficiency, use the INDEX= option to create indexes when you initially create a
SAS data set.
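As a minimal sketch of the INDEX= data set option, the step below builds a simple index and a unique composite index while the data set is created. The data set work.sales and its values are invented for illustration; the index names Origin and DteFlt follow the names used elsewhere in these notes.

```sas
/* Hypothetical sample data; the values are invented for illustration. */
data work.sales(index = (Origin
                         DteFlt = (FltDate FlightID) / unique));
   input FlightID $ Origin $ FltDate :date9.;
   format FltDate date9.;
   datalines;
IA10800 RDU 01DEC2005
IA10801 RDU 01DEC2005
IA10800 RDU 02DEC2005
IA02900 SFO 01DEC2005
;
run;

/* List the indexes that were created with the data set. */
proc contents data = work.sales;
run;
```

Because the index is built as the observations are written, no second pass through the data is needed to create it.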
The external file sales used for demonstrations and exercises contains fewer observations than
the external file sales used for the course notes.
DATA SAS-data-file-name(INDEX=
     (index-specification-1</option>
      …<index-specification-n</option>>));

OPTIONS MSGLEVEL=N | I;
N only prints notes, warnings, and error messages. This is the default.
I also prints informational or INFO notes that pertain to index creation and usage, merge
processing, and host sort utilities.
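The following sketch shows one way to check index usage in the log with MSGLEVEL=I. The data set work.sales and its values are hypothetical. On a data set this small SAS may decide that a sequential read is cheaper, so the INFO note about index selection is not guaranteed to appear.

```sas
options msglevel = i;

/* Hypothetical sample data with a simple index on Origin. */
data work.sales(index = (Origin));
   input FlightID $ Origin $ FltDate :date9.;
   format FltDate date9.;
   datalines;
IA10800 RDU 01DEC2005
IA10801 RDU 01DEC2005
IA02900 SFO 01DEC2005
;
run;

/* With MSGLEVEL=I, the log reports whether an index was chosen. */
proc print data = work.sales;
   where Origin = 'SFO';
run;

/* Restore the default message level. */
options msglevel = n;
```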
The NOLIST option prevents a list of library members from being printed in the log.
Log
703 options msglevel = i;
704
705 proc datasets library = ia nolist;
706 modify Sales;
707 index delete Origin;
NOTE: Index Origin deleted.
708 index delete DteFlt;
NOTE: All indexes defined on IA.SALESDATA.DATA have been deleted.
709
710 index create Origin;
NOTE: Simple index Origin has been defined.
711 index create DteFlt = (FltDate FlightID) / unique;
NOTE: Composite index DteFlt has been defined.
712 quit;
PROC DATASETS LIBRARY=libref;
   MODIFY SAS-data-set-name;
      INDEX DELETE index-name;
      INDEX CREATE index-specification
            </options>;
QUIT;
The INDEX CREATE statement in PROC DATASETS cannot be used if the index to be created
already exists.
proc sql;
drop index Origin
from ia.Sales;
drop index DteFlt
from ia.Sales;
PROC SQL;
   DROP INDEX index-name
      FROM table-name;
   CREATE <option> INDEX index-name
      ON table-name(column-name-1,...
                    column-name-n);
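A short sketch of the PROC SQL syntax in use follows. The table work.sales and its values are hypothetical. Note that a simple index created with PROC SQL must have the same name as its column.

```sas
/* Hypothetical sample data without indexes. */
data work.sales;
   input FlightID $ Origin $ FltDate :date9.;
   format FltDate date9.;
   datalines;
IA10800 RDU 01DEC2005
IA10801 RDU 01DEC2005
IA02900 SFO 01DEC2005
;
run;

proc sql;
   /* Simple index: the index name matches the column name. */
   create index Origin
      on work.sales(Origin);
   /* Composite index with the UNIQUE option. */
   create unique index DteFlt
      on work.sales(FltDate, FlightID);
   /* Remove the simple index again. */
   drop index Origin
      from work.sales;
quit;
```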
Index Documentation
PROC CONTENTS
PROC DATASETS
SAS Explorer
SAS Management Console
Documenting Indexes
c02s3d4
#   Index   Unique Option   # of Unique Values   Variables
The data set ia.sales used for demonstrations and exercises contains fewer observations than
the data set ia.sales used for the course notes.
Exercises
There are simple indexes on the variables FlightID, EconomyRev, and Origin.
The colon modifier indicates a starts with condition. It cannot be used in the SQL procedure.
SUBSTR (variable,position,<length>)
The conditions listed here apply to indexed Base SAS data files only. A discussion of when an
index is used with Scalable Performance Data Engine data files is contained in a later chapter.
No Index Usage
SAS does not use an index when a WHERE expression
references an indexed variable if the following conditions
exist:
No single index could supply all required observations.
where weekday(FlightDate) = 6;
No Index Usage
The SUBSTR function does not search a string
beginning at the first position.
Compound Optimization
When you write a WHERE expression using all the key
variables in a composite index, you can take advantage
of compound optimization.
Compound optimization means that SAS can use a
composite index to optimize some WHERE expressions
that involve multiple variables.
Compound Optimization
For compound optimization to occur, all of the following
must be true:
At least the first two key variables in the composite
index must be used in the WHERE conditions.
The conditions are connected using the AND operator.
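To make the conditions concrete, here is a hedged sketch with a hypothetical data set work.sales that carries the composite index DteFlt on FltDate and FlightID. The first WHERE expression references both key variables joined by AND, so it is a candidate for compound optimization. The second references only the second key variable, so it is not expected to qualify. Whether SAS actually chooses the index still depends on its cost estimate.

```sas
options msglevel = i;

/* Hypothetical sample data with a composite index. */
data work.sales(index = (DteFlt = (FltDate FlightID)));
   input FlightID $ Origin $ FltDate :date9.;
   format FltDate date9.;
   datalines;
IA10800 RDU 01DEC2005
IA10801 RDU 01DEC2005
IA10800 RDU 02DEC2005
;
run;

/* Both key variables, connected with AND. */
proc print data = work.sales;
   where FltDate = '01DEC2005'd and FlightID = 'IA10800';
run;

/* Only the second key variable is referenced. */
proc print data = work.sales;
   where FlightID = 'IA10800';
run;
```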
[Figure: if the WHERE expression qualifies only a small share of the data set (the slide marks the range from 0% to 3%), SAS will use an index.]
To determine whether it is more efficient to satisfy the WHERE expression by using the index or reading
the data sequentially, SAS uses these guidelines:
• If only a few observations are qualified, it is more efficient to use the index than to do a sequential
search of the entire data file.
• If most or all of the observations qualify, then it is more efficient to read the data file sequentially.
For information on updating and viewing the centile information, see the centiles information in the SAS
documentation for the CONTENTS and DATASETS procedures.
Data Order
Sort order can affect the number of I/O operations
required for indexed access.
If the data set is sorted on the indexed variable(s), the qualified observations are adjacent to each other.
Fewer pages must be read into the input buffers.
IDXWHERE = YES | NO
YES SAS uses the best available index to process the WHERE expression, even if SAS estimates that
processing sequentially is faster.
NO SAS processes the data sequentially, even if SAS estimates that processing with an index is
faster.
You cannot use IDXWHERE= to override the use of an index to process a BY statement.
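The sketch below applies IDXWHERE= as a data set option on the input data set. The data set work.sales is hypothetical. With MSGLEVEL=I in effect, the log shows which access method was used for each step.

```sas
options msglevel = i;

/* Hypothetical sample data with a simple index on Origin. */
data work.sales(index = (Origin));
   input FlightID $ Origin $ FltDate :date9.;
   format FltDate date9.;
   datalines;
IA10800 RDU 01DEC2005
IA10801 RDU 01DEC2005
IA02900 SFO 01DEC2005
;
run;

/* Force a sequential read even though an index exists. */
proc print data = work.sales(idxwhere = no);
   where Origin = 'RDU';
run;

/* Force use of the best available index. */
proc print data = work.sales(idxwhere = yes);
   where Origin = 'RDU';
run;
```

Forcing either method is mainly useful for benchmarking the two access paths against each other.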
NOTE: There were 65935 observations read from the data set
IA.FREQFLYERS.
WHERE Country='USA';
NOTE: PROCEDURE PRINT used (Total process time):
real time 4.86 seconds
cpu time 0.89 seconds
A variable such as Gender is not discriminating. A discriminating variable is one that enables you to
break the data into many small groups or subsets.
Index Trade-offs
BENEFITS
• Fast access to a small subset of observations
• Values returned in sorted order
• Can enforce uniqueness
COSTS
• Extra CPU cycles and I/O operations to create and maintain an index
• Increased CPU to read the data
• Extra disk space to store the index file
• Extra memory to load index pages and SAS C code to use the index
Maintaining Indexes
Data Management Task                         Index Action Taken
Copy the data set with the COPY              Index file constructed for new data file
procedure or the DATASETS procedure.
Move the data set with the MOVE              Index file deleted from IN= library;
option in the COPY procedure.                rebuilt in OUT= library
Copy the data set with drag-and-drop         Index file constructed for new file
in SAS Explorer.
Rename data set.                             Index file renamed
Indexes are maintained by updates in place, such as using the Viewtable window to update, add, or delete
observations, and the APPEND or SQL procedures to append data. Using the Explorer window or the
DATASETS procedure maintains indexes when data sets or variables are renamed. However, recreating a
data set with the SET, MERGE, or UPDATE statements does not automatically maintain indexes.
Maintaining Indexes
Data Management Task                         Index Action Taken
Delete a data set.                           Index file deleted
   proc datasets lib = work;
      delete a;
   run;
Sort the data set in place with the          Index file deleted
FORCE option in the SORT procedure.
   proc sort data = a force;
      by var;
   run;
If you use the UPLOAD procedure or the DOWNLOAD procedure, the index is re-created by default
when you upload or download a single data set and omit the OUT= option, or when you upload or
download a SAS data library. Use the INDEX=NO data set option to upload or download without
re-creating the index.
Index re-created:
proc upload data = schedule;
run;
Index not re-created:
proc download data = Sales(index = no);
run;
Exercises
5. Using an Index
Open the program, c02ex7Start, and submit it. Consult the log and answer the questions following the
program code listed here.
c02ex7Start
options msglevel=I obs = 500;
*** Example 1;
data rdu;
set ia.Sales;
if Origin = 'RDU';
run;
*** Example 2;
*** Example 3;
*** Example 4;
**** Example 5;
*****Example 6;
data SalesCopy;
set ia.Sales;
run;
Questions:
a. Does Example 1 use an index? Why or why not?
proc sql;
drop index Depart
from ia.schedule;
quit;
5. Creating Indexes with the DATASETS Procedure
Use PROC DATASETS to create a simple index Date based on the Date variable for the
ia.schedule data set.
proc datasets library = ia nolist;
modify schedule;
index create Date;
quit;
6. Viewing Index Information
Use PROC CONTENTS to look at the index information.
No, the data set ia.sales maintains its index, but SalesCopy does not retain the index from
ia.sales.
Chapter 3 Combining Data
Horizontally
3.5 Combining Summary and Detail Data Using Two SET Statements (Self-Study).....3-93
Objectives
Use the DATA step with a MERGE statement to join
more than two SAS data sets.
Use the SQL procedure to join SAS data sets without
a common variable.
Investigate the differences between the DATA step
MERGE and PROC SQL.
Combine data conditionally.
Business Task
Merge multiple SAS data sets with no common BY variable.
[Figure: three input data sets are combined into ia.alldata.
 ia.expenses: Date, FlightID, Expenses
 ia.revenue: Dest, FlightID, Date, Origin, RevBusiness, RevEcon, Rev1st
 ia.airports: City, Code (the airport code), Country, Name
 ia.alldata: Date, FlightID, and Expenses from ia.expenses; Dest, Origin, RevBusiness, RevEcon, and Rev1st from ia.revenue; DestCity, DestApt, OriginCity, and OriginApt from ia.airports; Profit (calculated)]
3.1 Joining Data Sets by Value 3-5
/* Each MERGE with a BY statement requires its inputs to be sorted */
/* (or indexed) by the BY variables.                               */
data exprev;
   merge expenses(in = e) revenue(in = r);
   by FlightID Date;
   if e and r;                 /* keep only matching observations */
   Profit = sum(Rev1st, RevBusiness, RevEcon, -Expenses);
run;
data destinfo;                 /* callout c */
   merge exprev(in = exp)
         airports(keep = City Name Code
                  rename = (Code = Dest City = DestCity
                            Name = DestApt));
   by Dest;
   if exp;
run;
data alldata;                  /* callout d */
   merge destinfo(in = des)
         airports(keep = City Name Code
                  rename = (Code = Origin City = OriginCity
                            Name = OriginApt));
   by Origin;
   if des;
run;
d This DATA step creates the city variable for the origin.
Partial Output
Result of Merging Three Data Sets
1 IA03400 02DEC2005 89155 ANC RDU 15829 28420 68688 23782 Raleigh-Durham, NC
2 IA03400 03DEC2005 22008 ANC RDU 15829 26460 68688 88969 Raleigh-Durham, NC
3 IA03400 04DEC2005 71609 ANC RDU 18707 23520 77751 48369 Raleigh-Durham, NC
4 IA03400 05DEC2005 82454 ANC RDU 15829 27440 64872 25687 Raleigh-Durham, NC
5 IA03400 06DEC2005 85174 ANC RDU 17268 27440 67257 26791 Raleigh-Durham, NC
Example:
Data set ONE
X Y Z
1 2 3
Data set TWO
X Y W
1 8 9
data three;
merge one two;
by x;
run;
Data set THREE
X Y Z W
1 8 3 9
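In data set THREE, Y has the value 8 from data set TWO. When data sets share a variable that is
not a BY variable, the value from the data set listed last in the MERGE statement overwrites the
earlier value. To avoid this behavior, include all common variables in the BY statement or use the
RENAME= data set option on an input data set.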
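A minimal sketch of the RENAME= approach, using the values from the example above. The new name Y2 is a hypothetical choice for illustration.

```sas
data one;
   input X Y Z;
   datalines;
1 2 3
;
data two;
   input X Y W;
   datalines;
1 8 9
;

/* Rename Y as it is read from TWO so both values survive the merge. */
data three;
   merge one two(rename = (Y = Y2));
   by X;
run;

proc print data = three;
run;
```

The result has one observation with X=1, Y=2, Z=3, Y2=8, and W=9.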
PROC SQL;
   CREATE TABLE SAS-data-set AS
      SELECT column-1, column-2,…,column-n
         FROM table-1, table-2,…,table-n
         WHERE joining criteria
         ORDER BY sorting criteria;
Partial Output
Result of Joining Three Data Sets
1 IA00100 02DEC2005 58907 RDU LHR 19200 31610 79650 71553 London, England
2 IA00100 03DEC2005 108543 RDU LHR 17600 25070 80181 14308 London, England
3 IA00100 04DEC2005 21963 RDU LHR 17600 28340 84960 108937 London, England
4 IA00100 05DEC2005 31517 RDU LHR 17600 32700 72216 90999 London, England
5 IA00100 06DEC2005 105682 RDU LHR 22400 29430 74871 21019 London, England
Comparison Programs
The following programs are used to generate the results
for the next four result sets.
data three;
merge one two;
by x;
run;
proc sql;
select one.x, one.y, two.z
from one, two
where one.x = two.x;
quit;
The DATA step and SQL procedure code remain constant. The data values change in the
following examples.
X Y Z
1 a f
2 b g
The X values are unique in both data sets one and two.
X Y Z
1 a f
1 a r
2 b g
The X values in data sets one and two are not unique.
Many-to-many joins are problematic. The question is not efficiency of the technique; rather, the
question is which output do you want? Do you want two or four observations for a 2-to-2 match?
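The difference can be seen with a small 2-to-2 match. The data values below are invented for illustration, and the output names three and four are arbitrary.

```sas
/* Two observations with X = 1 in each data set. */
data one;
   input X Y $;
   datalines;
1 a
1 b
;
data two;
   input X Z $;
   datalines;
1 f
1 r
;

/* DATA step merge: observations are paired in sequence. */
data three;
   merge one two;
   by X;
run;

/* SQL join: every matching combination is returned. */
proc sql;
   create table four as
   select one.X, one.Y, two.Z
      from one, two
      where one.X = two.X;
quit;
```

Here three has two observations (1 a f and 1 b r), while four has four rows, one for each combination of Y and Z.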
Reference Information
The following SQL step produces results that are identical to those of the DATA step when there is
non-matching data.
proc sql;
select coalesce(one.x, two.x) as x, y, z
from one full join two
on one.x = two.x;
quit;
The DATA step MERGE statement processes sequentially, top to bottom, by default.
Conceptually, PROC SQL creates the result set pictured above. There are optimization routines that make
the process more efficient.
c The DO WHILE statement executes statements in a DO loop while a condition is true. The expression
is evaluated at the top of the loop. The statements in the loop never execute if the expression is
initially false.
[Figure sequence: step-by-step execution of the DATA step, showing ia.rates and ia.madrid (FlightID, FltDate, RevTotal).]
The secret to using multiple SET statements in this fashion is to have both data sets in order
(ascending or descending) by the variables tested in the DO WHILE statement.
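The technique can be sketched as follows. This is not the course demonstration program. The data sets rates and flights, the variables EndDate, Rate, and RevConverted, and all values are hypothetical. Both data sets are in ascending date order, which is what allows the inner SET statement to move forward only.

```sas
/* Hypothetical rate periods, in ascending order by EndDate. */
data rates;
   input EndDate :date9. Rate;
   format EndDate date9.;
   datalines;
31JAN2005 1.10
28FEB2005 1.20
31MAR2005 1.30
;

/* Hypothetical flights, in ascending order by FltDate. */
data flights;
   input FlightID $ FltDate :date9. RevTotal;
   format FltDate date9.;
   datalines;
IA05900 15JAN2005 1000
IA05900 20JAN2005 2000
IA05900 10FEB2005 1500
IA05900 05MAR2005 3000
;

data converted;
   set flights;
   /* Advance through rates until the period covers the flight date.  */
   /* EndDate and Rate are retained, so a rate is reused until it     */
   /* expires. EndDate is missing on the first pass, so the loop runs. */
   do while (FltDate > EndDate);
      set rates;
   end;
   RevConverted = RevTotal * Rate;
run;
```

With this data, the two January flights use the rate 1.10, the February flight uses 1.20, and the March flight uses 1.30. If a flight date fell after the last EndDate, the inner SET statement would reach end of file and the DATA step would stop.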
Exercises
1 15JAN1982 CHRISTIAN JOHN G. LONDON 1369 E01146 FLTAT1 28000 Flight Attendant
2 23FEB1981 ELLIS GREGORY FRANKFURT 1595 E00364 FLTAT1 25000 Flight Attendant
3 15APR1994 EUNICE ROBERT N. CARY 1157 E03022 FLTAT1 23000 Flight Attendant
4 23DEC1990 FITZGERALD JAMES V. CARY 1168 E03511 FLTAT1 21000 Flight Attendant
5 11JUN1983 GOODWIN CYNTHIA Q. CARY 1752 E03510 FLTAT1 29000 Flight Attendant
Objectives
Create an output SAS data set that contains
summary statistics from PROC MEANS.
Combine PROC MEANS summary statistics in a
SAS data set with a detail SAS data set.
1 BAGCLK 140
2 BAGSUP 18
3 CHKCLK 125
4 CHKSUP 18
5 FACCLK 124
6 FACMGR 17
7 FACMNT 60
8 FINACT 36
9 FINCLK 53
10 FINMGR 20
TotalEmps
2070
3.2 Combining Summary and Detail Data 3-39
• number of nonmissing values
• mean
• minimum
• maximum
• standard deviation
The default statistics generated by PROC MEANS are listed. For a complete list
of statistics, please refer to the SAS documentation.
1 0 42 2070
c03s2d1
PROC MEANS DATA=SAS-data-set NOPRINT;
   OUTPUT OUT=SAS-data-set
          output-statistic-specification(s);
The NOPRINT option suppresses the printing of the PROC MEANS report.
For a complete listing of PROC MEANS statements and options, see the SAS documentation.
The output data set contains variables that contain the requested statistics plus the following:
• _TYPE_ contains information about the class variables.
• _FREQ_ contains the number of observations that an output level represents.
PROC SUMMARY can also be used to generate a data set that contains summary statistics.
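A minimal sketch of creating a summary data set follows. The input data set empcount is a hypothetical three-row stand-in for ia.empcount, and the names summary and TotalEmps follow the names used in these notes.

```sas
/* Hypothetical three-row stand-in for ia.empcount. */
data empcount;
   input JobCode $ NumEmps;
   datalines;
BAGCLK 140
BAGSUP 18
CHKCLK 125
;

/* Write the sum of NumEmps to a one-observation data set. */
proc means data = empcount noprint;
   var NumEmps;
   output out = summary sum = TotalEmps;
run;

proc print data = summary;
run;
```

The data set summary has one observation with _TYPE_=0, _FREQ_=3, and TotalEmps=283.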
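[Figure: a DATA step that reads a one-observation summary data set (SUM = 100) and a detail data set with two unconditional SET statements. The first observation is output (SUM = 100, VALUE = 30). On the second iteration, the SET statement for the summary data set reaches end of file and the DATA step stops, so observation 2 (and higher) of the detail data set are never read.]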
Using _N_
During the execution of a DATA step, the automatic
variable _N_ has the following features:
• is set to 1 initially
• is incremented by 1 as the DATA step loops past the DATA statement
• is dropped automatically from the data set that is created
• can be used in the DATA step to control when statements are executed
c The _n_ = 1 condition causes the summary data set to be read only during the first iteration of the
DATA step. Without it, the DATA step reaches the end of file of summary on the second iteration of
the DATA step, and the DATA step terminates with one observation in the data set percent1.
d The data set ia.empcount is read for each iteration of the DATA step.
Compilation and Execution

summary
TotalEmps
2070

ia.empcount (partial)
JobCode   NumEmps
BAGCLK    140
BAGSUP    18
CHKCLK    125
CHKSUP    18
FACCLK    124
FACMGR    17

data percent;
   if _n_ = 1 then set summary
                       (keep = TotalEmps);
   set ia.empcount;
   PctEmps = NumEmps / TotalEmps;
run;

PDV (_N_ is flagged D because it is dropped from the output data set)
Step                                        _N_   TotalEmps   JobCode   NumEmps   PctEmps
Compilation                                 1     .                     .         .
_n_ = 1 is true; SET summary                1     2070                  .         .
SET ia.empcount                             1     2070        BAGCLK    140       .
PctEmps is calculated                       1     2070        BAGCLK    140       0.067632
Implicit output; return to the top
_n_ = 1 is false; SET values are retained   2     2070        BAGCLK    140       .
SET ia.empcount                             2     2070        BAGSUP    18        .
PctEmps is calculated                       2     2070        BAGSUP    18        0.008695
Implicit output; return to the top
3-50 Chapter 3 Combining Data Horizontally
Partial Output
The previous program creates the following data:
Job Num
Obs Code Emps PctEmps
When SQL remerges summary data, it puts a note in the SAS log:
7 proc sql;
8 title 'Remerging Summary Data with Detail Data';
9 create table percent as
10 select JobCode, NumEmps,
11 NumEmps / sum(NumEmps) as PctEmps
12 from ia.empcount;
NOTE: The query requires remerging summary statistics back with the original data.
13 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.33 seconds
cpu time 0.05 seconds
Exercises
Qtr
Obs EmpID Num Amount
1 E00224 qtr1 12
2 E00224 qtr2 33
3 E00224 qtr3 22
4 E00224 qtr4 .
5 E00367 qtr1 35
6 E00367 qtr2 48
7 E00367 qtr3 40
8 E00367 qtr4 30
9 E00441 qtr1 .
10 E00441 qtr2 63
11 E00441 qtr3 89
12 E00441 qtr4 90
13 E00587 qtr1 16
14 E00587 qtr2 19
15 E00587 qtr3 30
16 E00587 qtr4 29
17 E00598 qtr1 4
18 E00598 qtr2 8
19 E00598 qtr3 6
Output
ia.mean
Obs AvgAmt
1 28.9667
Objectives
Use the SET statement with the KEY= option to
combine two SAS data sets.
Use _IORC_ to determine whether the index search
was successful.
The data set ia.dnunder used for demonstrations and exercises contains fewer observations
than the data set ia.dnunder used for the course notes.
3.3 Using an Index to Combine Data 3-57
Business Task
Build a data set with the following variables:
[Figure: Date, FlightID, and Expenses come from ia.dnunder; Rev1st, RevBus, RevEcon, and CargoRev come from ia.sales, which also contains Date and FlightID; Profit is calculated.]
The data sets ia.sales and ia.dnunder used for demonstrations and exercises contain
fewer observations than the data sets ia.sales and ia.dnunder used for the course notes.
Indexes on ia.sales
Partial PROC CONTENTS Output for ia.sales
# of
Unique Unique
# Index Option Values Variables
SET SAS-data-file-name KEY=index-name;
• Assign a value to the index key variable(s) before the SET statement is executed.
• The index is then used to retrieve an observation with the key value.
• WHERE processing is not allowed for a data set read with the KEY= option.
c03s3d1
data profit;
   set ia.dnunder;
   set ia.sales(keep = FlightID FltDate Rev1st
                       RevBus RevEcon CargoRev)
       key = DteFlt;
   Profit = sum(Rev1st, RevBus, RevEcon, CargoRev,
                - Expenses);
run;
The KEY= option causes the second SET statement to use the current PDV values for FlightID and
FltDate to access an observation through the DteFlt index.
Partial Output
Partial PROC PRINT Output from profit
Profit for the Flights
to Australia and New Zealand
Flight
Obs ID FltDate Expenses Rev1st RevBus
Observation 899 is correct, but because the data values are retained when SAS reads observation 900
from ia.dnunder, observation 900 is incorrect.
The observation number and the data are different in the data set created during the demonstration
than the one created in the course notes.
Log
211 data profit;
212 set ia.dnunder;
213 set ia.sales(keep = FlightID FltDate Rev1st
214 RevBus RevEcon CargoRev)
215 key = DteFlt;
216 Profit = sum(Rev1st, RevBus, RevEcon, CargoRev,
217 - Expenses);
218 run;
The observation that appears in the log is the result of having an observation in ia.dnunder that does
not match an observation in ia.sales.
The last observation in profit is incorrect because there is no flight on December 30, 2005 in the SAS
data set ia.sales.
At the next iteration of the DATA step, only Profit is reinitialized to missing.
The observation number is different in the data set created during the demonstration than the one
created in the course notes.
No match is found in ia.sales for the FlightID and FltDate values read from ia.dnunder.
Profit is recalculated using the new value of Expenses and the retained values of Rev1st,
RevBus, RevEcon, and CargoRev.
For values of the _IORC_ automatic variable, see the %SYSRC autocall macro in the Macro Language
Dictionary in the Base SAS Documentation.
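As a hedged sketch, the %SYSRC autocall macro can be used to test _IORC_ by mnemonic instead of by number. The data sets sales and expenses and their values are hypothetical. The mnemonics _SOK (successful read) and _DSENOM (no matching observation) are the two most commonly tested values. Setting Rev1st to missing in the nonmatch branch is an addition made here so that a retained value is not written to errors.

```sas
/* Hypothetical indexed lookup data. */
data sales(index = (DteFlt = (FltDate FlightID) / unique));
   input FlightID $ FltDate :date9. Rev1st;
   format FltDate date9.;
   datalines;
IA10800 01DEC2005 4872
IA10800 02DEC2005 5100
;

/* Hypothetical driver data; the last observation has no match. */
data expenses;
   input FlightID $ FltDate :date9. Expenses;
   format FltDate date9.;
   datalines;
IA10800 01DEC2005 3000
IA10800 02DEC2005 3500
IA10800 30DEC2005 2800
;

data profit errors;
   set expenses;
   set sales key = DteFlt;
   select (_IORC_);
      when (%sysrc(_sok)) do;          /* match found */
         Profit = Rev1st - Expenses;
         output profit;
      end;
      when (%sysrc(_dsenom)) do;       /* no match found */
         _error_ = 0;                  /* suppress the PDV dump in the log */
         Rev1st = .;                   /* discard the retained value */
         output errors;
      end;
      otherwise do;                    /* any other I/O condition */
         put 'Unexpected value: ' _IORC_=;
         stop;
      end;
   end;
run;
```

With this data, profit receives two observations and errors receives one.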
The automatic variable _error_ controls the writing of the PDV contents to the SAS log if there is a
data error. Setting _error_ = 0 prevents writing to the log, even if a data error is encountered.
Using _IORC_
data profit errors;
   set ia.dnunder;
   set ia.sales(keep = FlightID FltDate Rev1st
                       RevBus RevEcon CargoRev)
       key = DteFlt;                /* look up by the DteFlt index */
   if _IORC_ = 0 then do;           /* finds a match */
      Profit = sum(Rev1st, RevBus, RevEcon,
                   CargoRev, - Expenses);
      output profit;                /* outputs to profit */
   end;
   else do;
      _error_ = 0;                  /* prevents writing the PDV to the log */
      output errors;                /* outputs to errors */
   end;
run;
c03s3d2
The data set ia.sales used for demonstrations and exercises contains fewer observations than
the data set ia.sales used for the course notes.
Partial Output
Partial PROC PRINT Output from profit
The PROFIT Data
Flight
Obs ID FltDate Expenses Rev1st RevBus
The observation number and the data are different in the data set created during the demonstration
than the one created in the course notes.
Output
PROC PRINT Output from errors
The ERRORS data
Flight
Obs ID FltDate Expenses Rev1st RevBus
1 $4,872.00 $2,300.00 .
Log
249 data profit errors;
250 set ia.dnunder;
251 set ia.sales(keep = FlightID FltDate Rev1st
252 RevBus RevEcon CargoRev)
253 key = DteFlt;
254 if _IORC_ = 0 then do;
255 Profit = sum(Rev1st, RevBus, RevEcon,
256 CargoRev, - Expenses);
257 output profit;
258 end;
259 else do;
260 _error_ = 0;
261 output errors;
262 end;
263 run;
NOTE: There were 900 observations read from the data set IA.DNUNDER.
NOTE: The data set WORK.PROFIT has 899 observations and 8 variables.
NOTE: The data set WORK.ERRORS has 1 observations and 8 variables.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.03 seconds
Exercises
The flight times are stored as SAS time (the number of seconds since midnight).
Create the variable NewDepart that is the new departure time for the flights. Apply the TIME5.
format to NewDepart. (Hint: Use the expression sum(TimeDiff*60,depart).)
time new
Obs flight date diff depart depart
NewSched Output
work.newsched
Time New
flight date Diff depart Depart
Errors Output
Errors data
Time New
Obs flight date Diff depart Depart
Objectives
Update a master data set with a transaction data set.
Use special missing values when updating.
Compare the MERGE statement with the UPDATE
statement.
Although the technique is not discussed in this course, the UPDATE statement can also delete
observations from the master data set. See the documentation for the UPDATE statement for details.
3.4 Updating Data 3-73
Compilation

ia.hremps (master, partial)
EmpID    Location   Jobcode   Phone   Salary
E00003   CARY       VICEPR    1428    $120,000
E00004   CARY       FACMNT    2061    $42,000
E00013   BOSTON     RECEPT    1002    $22,000
E00017   CARY       RESCLK    2821    $36,000
E00018   CARY       FACMNT    1459    $33,000
E00020   CARY       FACCLK    1256    $21,000
E00022   CARY       FACCLK    1255    $27,000
E00038   CARY       FACCLK    2853    $20,000

ia.hrempsu (transaction)
Empid    Location   Jobcode   Phone   Salary
E00003   BOSTON                       .
E00003                        3422    .
E00010   CARY       RESCLK    5153    $20,000
E00068                        3253    .
E00133              HRCLK             $42,000
E00173              RESCLK            $23,000
E00208                                $42,000

data ia.hremps;
   update ia.hremps
          ia.hrempsu;
   by EmpID;
run;

The PDV is created with all variables in both data sets and any variables created by the DATA step.

PDV (D marks variables that are dropped)
EmpID   Location   JobCode   Phone   Salary   First.EmpID (D)   Last.EmpID (D)
Execution
Are these BY values equal? Yes: the first observation in each data set has EmpID E00003.

PDV
EmpID   Location   JobCode   Phone   Salary   First.EmpID (D)   Last.EmpID (D)
                                     .        0                 0
• SAS looks at the first observation in each data set that is named in the UPDATE statement to determine
which BY group should appear first.
• If the transaction BY value precedes the master BY value, SAS reads from the transaction data set only
and sets the variables from the master data set to missing.
• If the master BY value precedes the transaction BY value, SAS reads from the master data set only and
sets the unique variables from the transaction data set to missing.
• If the BY values in the master and transaction data sets are equal, SAS reads from the master data set
first and then applies the first transaction by copying the non-missing values into the program data
vector.
• If the transaction data set contains multiple observations with the same BY value, non-missing values
on all of those observations are applied to the data that was read from the master data set.
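The rules above can be tried on a small scale. The following is a minimal sketch, assuming hypothetical work.master and work.trans data sets that copy a few rows from the slides; the data set names and the comma-delimited layout are illustrative choices, not part of the course data.

```sas
/* Hypothetical miniature master data set (rows taken from the slides). */
data work.master;
   infile datalines dsd truncover;
   length EmpID $ 6 Location $ 8 Jobcode $ 6;
   input EmpID Location Jobcode Phone Salary;
   datalines;
E00003,CARY,VICEPR,1428,120000
E00004,CARY,FACMNT,2061,42000
E00013,BOSTON,RECEPT,1002,22000
;
run;

/* Hypothetical miniature transaction data set.                 */
/* Consecutive commas are read as missing values (DSD option).  */
data work.trans;
   infile datalines dsd truncover;
   length EmpID $ 6 Location $ 8 Jobcode $ 6;
   input EmpID Location Jobcode Phone Salary;
   datalines;
E00003,BOSTON,,,
E00003,,,3422,
E00010,CARY,RESCLK,5153,20000
;
run;

/* Both data sets must be in order by the BY variable. */
data work.updated;
   update work.master work.trans;
   by EmpID;
run;

proc print data=work.updated;
run;
```

With these inputs, work.updated should contain one observation per EmpID: E00003 carries both transactions (Location BOSTON and Phone 3422, with the other master values kept), E00004 and E00013 are unchanged, and E00010 is added from the transaction data set only.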
PDV columns: EmpID, Location, JobCode, Phone, Salary, First.EmpID (D), Last.EmpID (D)

Execution: Read ia.hremps.
PDV: E00003  CARY  VICEPR  1428  120000  1  0

Execution: Apply the transactions from ia.hrempsu.
PDV: E00003  BOSTON  VICEPR  1428  120000  1  0

Execution: Is there another observation for E00003 in the transaction data set? Yes.
PDV: E00003  BOSTON  VICEPR  1428  120000  1  0
• After completing the first transaction, SAS looks at the next observation in the transaction data set. If
SAS finds one with the same BY value, it applies that transaction, too.
• The first observation then contains the new values from both transactions.
PDV columns: EmpID, Location, JobCode, Phone, Salary, First.EmpID (D), Last.EmpID (D)

Execution: Apply the transactions from ia.hrempsu.
PDV: E00003  BOSTON  VICEPR  3422  120000  0  1

Execution: Is there another observation for E00003 in the transaction data set? No.
PDV: E00003  BOSTON  VICEPR  3422  120000  0  1

Execution: Implied Output
PDV: E00003  BOSTON  VICEPR  3422  120000  0  1
• If no other transactions exist for that observation, SAS writes the observation to the new data set and
sets the values in the program data vector to missing.
• SAS repeats these steps until it reads all observations from all BY groups in both data sets.
PDV columns: EmpID, Location, JobCode, Phone, Salary, First.EmpID (D), Last.EmpID (D)

Execution: Which comes first?
PDV: E00003  BOSTON  VICEPR  3422  120000  0  1

Execution: Reinitialize any variables unique to the transaction data set to missing.
PDV: E00003  BOSTON  VICEPR  3422  120000  0  1

Execution: Read ia.hremps.
PDV: E00004  CARY  FACMNT  2061  42000  1  1

Execution: Is there an observation for E00004 in the transaction data set? No.
PDV: E00004  CARY  FACMNT  2061  42000  1  1

Execution: Implied Output
PDV: E00004  CARY  FACMNT  2061  42000  1  1

Execution: Which comes first?
PDV: E00004  CARY  FACMNT  2061  42000  1  1
PDV columns: EmpID, Location, JobCode, Phone, Salary, First.EmpID (D), Last.EmpID (D)

Execution: Reinitialize any variables unique to the master data set to missing.
PDV: E00004  CARY  FACMNT  2061  42000  1  1

Execution: Read ia.hrempsu.
PDV: E00010  CARY  RESCLK  5153  20000  1  1

Execution: Implied Output
PDV: E00010  CARY  RESCLK  5153  20000  1  1

Execution: The DATA step continues until the end of file in both data sets.
PDV: E00531  SINGAPORE  RECEPT  1002  39000  1  1
General form of a DATA step with an UPDATE statement:

DATA master-data-set;
   UPDATE master-data-set transaction-data-set
          <END=variable>
          <UPDATEMODE=MISSINGCHECK | NOMISSINGCHECK>;
   BY by-variables;
RUN;
END=variable creates and names a temporary variable that contains an end-of-file indicator. This
variable is initialized to 0 and is set to 1 when the UPDATE statement processes the
last observation in both data sets. This variable is not added to any data set.
UPDATEMODE=MISSINGCHECK
   (default) Missing values in the transaction data set do not replace existing values in the master data set.
UPDATEMODE=NOMISSINGCHECK
   Missing values in the transaction data set replace existing values in the master data set.

Partial Output
Obs  EmpID   Location   Jobcode  Phone  Salary
  1  E00003                      3422   .
  2  E00004  CARY       FACMNT   2061   $42,000
  3  E00010  CARY       RESCLK   5153   $20,000
  4  E00013  BOSTON     RECEPT   1002   $22,000
  5  E00017  CARY       RESCLK   2821   $36,000
  6  E00018  CARY       FACMNT   1459   $33,000
  7  E00020  CARY       FACCLK   1256   $21,000
  8  E00022  CARY       FACCLK   1255   $27,000
  9  E00038  CARY       FACCLK   2853   $20,000
 10  E00039  TORONTO    FACCLK   1053   $38,000
 11  E00066  NASHVILLE  TELOP    1010   $39,000
 12  E00068                      3253   .
 13  E00070  OSLO       RESCLK   1029   $24,000
 14  E00076  CARY       RESMGR   1030   $36,000
 15  E00087  FRANKFURT  HRCLK    1019   $45,000
c03s4d2
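The difference between the two UPDATEMODE= settings can be seen by running the same update twice. This is a minimal sketch, assuming hypothetical work.master and work.trans data sets; the master values shown for E00068 are invented for illustration.

```sas
/* Hypothetical master data set. */
data work.master;
   input EmpID $ Phone Salary;
   datalines;
E00003 1428 120000
E00068 1111 30000
;
run;

/* Hypothetical transactions: new phone numbers, Salary left missing. */
data work.trans;
   input EmpID $ Phone Salary;
   datalines;
E00003 3422 .
E00068 3253 .
;
run;

/* MISSINGCHECK (the default): missing transaction values are ignored, */
/* so Salary keeps its master value.                                   */
data work.checked;
   update work.master work.trans updatemode=missingcheck;
   by EmpID;
run;

/* NOMISSINGCHECK: missing transaction values overwrite master values, */
/* so Salary becomes missing.                                          */
data work.unchecked;
   update work.master work.trans updatemode=nomissingcheck;
   by EmpID;
run;
```

Comparing work.checked with work.unchecked in PROC PRINT should show the same phone numbers in both, with Salary preserved only in work.checked.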
General form of the MISSING statement:

MISSING special-value special-value ...;

ia.empupdates
Obs EmpID Add1 Telephone DOB
1 1352 _ .
2 212 12 Main St. _ .
3 2512 _ _ _
The program, c03s4d3, created the transaction data set ia.empupdates, which contains special
missing values:
data ia.empupdates;
missing _;
infile cards missover;
input EmpID $4. Add1 $12. Telephone $ DOB ;
cards;
1352 _
212 12 Main St. _
2512 _ _ _
;
run;
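To see how special missing values behave in an update, the following is a minimal sketch that uses only numeric variables. The data set names, variable names, and values are hypothetical. It assumes the documented behavior of the UPDATE statement: an underscore in the transaction data set sets the master value to a regular missing value, and any other special missing value replaces the master value as is.

```sas
/* Hypothetical master data set. */
data work.master;
   input EmpID $ Salary Bonus;
   datalines;
E1 50000 1000
E2 60000 2000
;
run;

/* Hypothetical transactions. The MISSING statement lets the INPUT */
/* statement read _ and A as special numeric missing values.       */
data work.trans;
   missing _ A;
   input EmpID $ Salary Bonus;
   datalines;
E1 _ 1500
E2 . A
;
run;

data work.updated;
   update work.master work.trans;
   by EmpID;
run;

proc print data=work.updated;
run;
```

Under those assumptions, E1 should end with a missing Salary and a Bonus of 1500, while E2 should keep its Salary of 60000 (an ordinary missing value is ignored) and show the special missing value A for Bonus.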
Output
Obs EmpID Add1 Telephone DOB
c03s4d4
The output at the end of a BY group used by the UPDATE statement is called conditional output, where
the condition is that the step reached the last observation in the BY group.
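The effect of conditional output shows up when the same inputs are combined with MERGE and with UPDATE. The sketch below is illustrative only and assumes hypothetical work.master and work.trans data sets; a single period in the data lines stands for a missing character or numeric value.

```sas
/* Hypothetical master data set. */
data work.master;
   input EmpID $ Location $ Phone;
   datalines;
E00003 CARY 1428
E00004 CARY 2061
;
run;

/* Hypothetical transactions: two observations for the same BY value. */
data work.trans;
   input EmpID $ Location $ Phone;
   datalines;
E00003 BOSTON .
E00003 . 3422
;
run;

/* MERGE writes an observation on every iteration and lets missing */
/* transaction values overwrite the master values.                 */
data work.merged;
   merge work.master work.trans;
   by EmpID;
run;

/* UPDATE applies both transactions and writes one observation */
/* only when the last observation in the BY group is reached.  */
data work.updated;
   update work.master work.trans;
   by EmpID;
run;
```

With these inputs, work.merged should hold two observations for E00003, each carrying missing values from its transaction, whereas work.updated should hold a single E00003 observation with Location BOSTON and Phone 3422.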
Partial Output
Obs  EmpID   Location   Jobcode  Phone  Salary
  1  E00003  BOSTON                     .
  2  E00003                      3422   .
  3  E00004  CARY       FACMNT   2061   $42,000
  4  E00010  CARY       RESCLK   5153   $20,000
  5  E00013  BOSTON     RECEPT   1002   $22,000
  6  E00017  CARY       RESCLK   2821   $36,000
  7  E00018  CARY       FACMNT   1459   $33,000
  8  E00020  CARY       FACCLK   1256   $21,000
  9  E00022  CARY       FACCLK   1255   $27,000
 10  E00038  CARY       FACCLK   2853   $20,000
 11  E00039  TORONTO    FACCLK   1053   $38,000
 12  E00066  NASHVILLE  TELOP    1010   $39,000
 13  E00068                      3253   .
 14  E00070  OSLO       RESCLK   1029   $24,000
c03s4d5
c03s5d1
1. The DO UNTIL loop is used to read through the entire data set ia.empcount once, in order to
calculate the summary statistics.
2. The SUM statement calculates the summary variable TotalEmps.
3. When the DO loop completes execution, the second SET statement reads the ia.empcount data
set a second time.
4. PctEmps is calculated using the TotalEmps summary variable.
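The four steps above can be run end to end on a small input. This is a minimal sketch, assuming a hypothetical work.empcount data set that holds the first three rows shown on the slides; the PERCENT8.1 format is an illustrative addition.

```sas
/* Hypothetical three-row stand-in for ia.empcount. */
data work.empcount;
   input JobCode $ NumEmps;
   datalines;
BAGCLK 140
BAGSUP 18
CHKCLK 125
;
run;

data work.percent;
   /* First pass: on the first iteration only, read every      */
   /* observation and accumulate the grand total.               */
   if _n_ = 1 then do until(LastObs);
      set work.empcount(keep = NumEmps) end = LastObs;
      TotalEmps + NumEmps;       /* retained, starts at 0 */
   end;
   /* Second pass: read one detail observation per iteration. */
   set work.empcount;
   PctEmps = NumEmps / TotalEmps;
   format PctEmps percent8.1;
run;

proc print data=work.percent;
run;
```

For these three rows TotalEmps is 283 (140 + 18 + 125) on every output observation, because the SUM statement retains the total after the loop finishes.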
3.5 Combining Summary and Detail Data Using Two SET Statements (Self-Study) 3-95
ia.empcount (partial listing)
JobCode  NumEmps
BAGCLK   140
BAGSUP   18
CHKCLK   125
CHKSUP   18
FACCLK   124
FACMGR   17

data percent;
   if _n_ = 1 then do until(LastObs);
      set ia.empcount(keep = NumEmps) end = LastObs;   /* 1 */
      TotalEmps + NumEmps;                             /* 2 */
   end;
   set ia.empcount;
   PctEmps = NumEmps / TotalEmps;
run;

PDV columns: _N_ (D), LastObs (D), NumEmps, TotalEmps, JobCode, PctEmps

Compilation: The PDV is created.
Execution: The DATA step begins to execute.

1. The value for the END= variable is 0 when reading all observations from a data set except for the
last one, when the value changes to 1.
2. The SUM statement creates a variable that is initialized to 0 prior to the execution of the DATA step
and retained across iterations of the DATA step.
Execution (the program and ia.empcount are the same on every slide; only the PDV changes)

Step                                                        _N_  LastObs  NumEmps  TotalEmps  JobCode  PctEmps
_n_ = 1 is true; UNTIL is evaluated at bottom of DO loop     1    0        .        0                   .
The first SET statement reads the first observation          1    0        140      0                   .
The SUM statement adds NumEmps to TotalEmps                  1    0        140      140                 .
LastObs ne 1, so the DO loop continues                       1    0        140      140                 .
The first SET statement reads the second observation         1    0        18       140                 .
The SUM statement adds NumEmps to TotalEmps                  1    0        18       158                 .
LastObs ne 1, so the DO loop continues                       1    0        18       158                 .
Continuing until the last observation ...                    1    1        6        2070                .
LastObs = 1, so the DO loop ends                             1    1        6        2070                .
Execution (continued)
After the DO loop ends, the second SET statement reads the first observation of ia.empcount, PctEmps is calculated, and the observation is written at the implied output.

Second iteration: _n_ = 1 is false, so the DO loop is skipped and the second SET statement reads the next observation.
PDV: _N_ = 2, LastObs = 1, NumEmps = 18, TotalEmps = 2070, JobCode = BAGSUP, PctEmps = .
PctEmps is calculated and the observation is written at the implied output.

Partial PROC PRINT Output from percent
Reading through the data twice
c03s5d1
/* CONTENTS solution */
proc contents data=ia.newsals;
run;

data temp1;
   merge employees(in = e) jcodedat(in = j);
   by JobCode;
   if e and j;
run;

data final;
   merge newsals(in = n) temp1(in = t);
   by EmpID;
   if n and t;
run;

proc sql;
   create table crewshrs as
      select LastName, FirstName, HireDate, NumShares
         from ia.crew, ia.options
         where crew.HireDate between BeginDte and EndDte
         order by HireDate;
   select *
      from crewshrs;
quit;

data crewshrs;
   keep LastName FirstName HireDate NumShares;
   set crew;
   do while (not (BeginDte le HireDate le EndDte));
      set ia.options;
   end;
run;
The flight times are stored as a SAS time (the number of seconds since midnight).
Create the variable NewDepart that is the new departure time for the flights. Apply the time5.
format to NewDepart. (Hint: Use the expression sum(TimeDiff*60,depart).)
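The hint can be illustrated with a small example. This is a sketch only, assuming a hypothetical work.sched data set and assuming that TimeDiff is recorded in minutes, so that multiplying by 60 converts it to seconds before it is added to the SAS time value.

```sas
/* Hypothetical schedule: departure time and a change in minutes. */
data work.sched;
   input flight $ depart :time5. TimeDiff;
   /* SUM ignores missing arguments, so a flight with a missing */
   /* TimeDiff keeps its original departure time.               */
   NewDepart = sum(TimeDiff*60, depart);
   format depart NewDepart time5.;
   datalines;
IA100 08:30 45
IA200 14:05 .
;
run;

proc print data=work.sched;
run;
```

Using the SUM function rather than the + operator is the design point: with +, a missing TimeDiff would make NewDepart missing as well.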
Objectives
• Define table lookup.
• Investigate table lookup techniques.

Table Lookups
Lookup values for a table lookup can be stored in the following:
• array
• hash object
• format
• data set
The data values for a table lookup are supplied through the following:
• array index value
• FORMAT statement, PUT function
• merge, join
Overview of Arrays
An array is similar to a row of buckets, numbered 1, 2, 3, 4, and so on.
• SAS puts a value in a bucket based on the bucket number.
• Values are retrieved from a bucket based on the bucket number.
4.1 Introduction to Lookup Techniques 4-5
Overview of a Format
A format is similar to stacks of buckets that are referred to by the value of a variable. One stack
holds the data values and the other holds the labels.
• SAS puts data values and label values in the buckets when the format is used in a FORMAT
statement, PUT function, or PUT statement.
• SAS uses a binary search on the data value bucket in order to return the value in the label bucket.
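A format-based lookup can be sketched as follows. The format name $JOBFMT, the labels, and the work.staff data set are hypothetical and exist only for illustration; the job codes themselves appear elsewhere in these notes.

```sas
/* Hypothetical lookup table stored as a character format. */
proc format;
   value $jobfmt
      'FACCLK' = 'Facilities Clerk'
      'RESCLK' = 'Reservations Clerk'
      'RECEPT' = 'Receptionist'
      other    = 'Unknown';
run;

/* The PUT function looks up each JobCode and returns its label. */
data work.staff;
   input EmpID $ JobCode $;
   JobTitle = put(JobCode, $jobfmt.);
   datalines;
E00013 RECEPT
E00017 RESCLK
E00999 PILOT
;
run;

proc print data=work.staff;
run;
```

The OTHER= range is a deliberate choice here: without it, an unmatched code such as PILOT would simply be returned unchanged by the PUT function.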
4-6 Chapter 4 Using Lookup Tables to Match Data
Objectives
• Review one-dimensional arrays.
• Write an ARRAY statement for a multidimensional array.
• Process a multidimensional array.
• Load a multidimensional array from a SAS data set.
• Use a multidimensional array to compare values.
4.2 Using Arrays as Lookup Tables 4-7
Reviewing Arrays
An array
• is a temporary grouping of SAS variables that are arranged in a particular order and identified by
an array name
• exists only for the duration of the current DATA step.
An array can
• perform repetitive calculations on a group of variables
• restructure data.
General form of the ARRAY statement:

ARRAY array-name {number-of-elements} <$> <length>
      <list-of-variables> <(initial-values)>;

Example (the array name is char):
   array char{4} $ 6;

Example (with a list of numeric variables):
   array numarray{3} num1 - num3;

ia.rdudelay (first six observations)
Obs  FlightID  FltDate    Delay
  1  IA00201   01JAN2004  11
  2  IA00200   01JAN2004  22
  3  IA00400   01JAN2004  25
  4  IA00401   01JAN2004  8
  5  IA00600   01JAN2004  6
  6  IA00601   01JAN2004  22

Desired Results
Columns: Obs, FlightID, FltDate, Delay, Average, DelayDif

c04s2d1
1. During the first iteration of the DATA step, the data set ia.delaystats is read into the PDV.
2. The array JAN is associated with the variables Jan01, Jan02, Jan03, and so forth. The ARRAY
statement that defines the array JAN appears after the SET statement for the data set that contains the
variables Jan01 - Jan31. The ARRAY statement does not have to be inside the DO group because it is
a non-executable statement.
3. The element of the JAN array at the position given by the value of the variable day is assigned to the
variable Average.
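The three notes above describe a one-dimensional array lookup. The following is a minimal, self-contained sketch of the same idea, assuming hypothetical three-day versions of the two data sets (work.delaystats and work.rdudelay); the MaxDelay row and the second and third flight dates are invented for illustration.

```sas
/* Hypothetical summary data: one row per statistic, one column per day. */
data work.delaystats;
   input Statistic $ Jan01-Jan03;
   datalines;
AvgDelay 4.708 4.760 5.842
MaxDelay 30 25 41
;
run;

/* Hypothetical detail data: one row per flight. */
data work.rdudelay;
   input FlightID $ FltDate :date9. Delay;
   format FltDate date9.;
   datalines;
IA00201 01JAN2004 11
IA00200 02JAN2004 22
IA00400 03JAN2004 25
;
run;

data work.compare;
   keep FlightID FltDate Delay Average DelayDif;
   /* Read the lookup row once; its values are retained afterward. */
   if _n_ = 1 then set work.delaystats(where = (Statistic = 'AvgDelay'));
   array jan{3} Jan01-Jan03;
   set work.rdudelay;
   /* The day of the month is the array index. */
   Average  = jan{day(FltDate)};
   DelayDif = Delay - Average;
run;
```

Because variables read with a SET statement are retained, the averages loaded on the first iteration remain available for every flight that follows.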
4-14 Chapter 4 Using Lookup Tables to Match Data
Statistic JAN01 JAN02 JAN03 JAN04 JAN05 JAN06 JAN07 JAN08 JAN09 . . .
AvgDelay 4.708 4.760 5.842 6.571 4.645 6.0714 5.500 5.080 4.692 . . .
ia.delaystats (where =
data compare; (Statistic = 'AvgDelay'));
keep FlightID FltDate Delay Average
DelayDif; Execution
if _n_ = 1 then do;
set ia.delaystats(where =
(Statistic = 'AvgDelay')); ia.rdudelay
array jan{31} Jan01 - Jan31; Flight
end; ID FltDate Delay
set ia.rdudelay;
day = day(FltDate); IA00201 01JAN2004 11
Average = Jan{day}; IA00200 01JAN2004 22
IA00400 01JAN2004 25
DelayDif = Delay - Average;
run;
1} }
N{ 2} 3} 4} 5} 31
A N{ N{ N{ N{ N{
J JA JA JA JA J A
D JAN01 D JAN02 D JAN03 D JAN04 D JAN05 D JAN31
. . . . . ... .
FlightID FltDate Delay D day Average DelayDif D _N_
. . . . . 1
22 ...
Statistic JAN01 JAN02 JAN03 JAN04 JAN05 JAN06 JAN07 JAN08 JAN09 . . .
AvgDelay 4.708 4.760 5.842 6.571 4.645 6.0714 5.500 5.080 4.692 . . .
ia.delaystats (where =
data compare; (Statistic = 'AvgDelay'));
keep FlightID FltDate Delay Average
DelayDif; Execution
if _n_ = 1 then do;
set ia.delaystats(where =
(Statistic = 'AvgDelay')); ia.rdudelay
array jan{31} Jan01 - Jan31; Flight
end; ID FltDate Delay
set ia.rdudelay;
day = day(FltDate); IA00201 01JAN2004 11
Average = Jan{day}; IA00200 01JAN2004 22
IA00400 01JAN2004 25
DelayDif = Delay - Average;
run;
1} }
N{ 2} 3} 4} 5} {31
N{ N{ N{ N{ N
JA JA JA JA JA JA
D JAN01 D JAN02 D JAN03 D JAN04 D JAN05 D JAN31
Statistic JAN01 JAN02 JAN03 JAN04 JAN05 JAN06 JAN07 JAN08 JAN09 . . .
AvgDelay 4.708 4.760 5.842 6.571 4.645 6.0714 5.500 5.080 4.692 . . .
ia.delaystats (where =
data compare; (Statistic = 'AvgDelay'));
keep FlightID FltDate Delay Average
DelayDif; Execution
if _n_ = 1 then do;
set ia.delaystats(where =
(Statistic = 'AvgDelay')); ia.rdudelay
array jan{31} Jan01 - Jan31; Flight
end; ID FltDate Delay
Implied Output
set ia.rdudelay;
day = day(FltDate); IA00201 01JAN2004 11
Average = Jan{day}; IA00200 01JAN2004 22
IA00400 01JAN2004 25
DelayDif = Delay - Average;
run;
1} }
N{ 2} 3} 4} 5} 31
A N{ N{ N{ N{ N{
J JA JA JA JA J A
D JAN01 D JAN02 D JAN03 D JAN04 D JAN05 D JAN31
Statistic JAN01 JAN02 JAN03 JAN04 JAN05 JAN06 JAN07 JAN08 JAN09 . . .
AvgDelay 4.708 4.760 5.842 6.571 4.645 6.0714 5.500 5.080 4.692 . . .
4.2 Using Arrays as Lookup Tables 4-17
4-18 Chapter 4 Using Lookup Tables to Match Data
Overview of Arrays
A two-dimensional array is similar to a stack of buckets.
The keyword _TEMPORARY_ can be used instead of elements to avoid creating new data set variables.
                        Temperature
Wind Speed  -10   -5    0    5   10   15   20   25   30
         5  -22  -16  -11   -5    1    7   13   19   25
        10  -28  -22  -16  -10   -4    3    9   15   21
        15  -32  -26  -19  -13   -7    0    6   13   19
        20  -35  -29  -22  -15   -9   -2    4   11   17
        25  -37  -31  -24  -17  -11   -4    3    9   16
        30  -39  -33  -26  -19  -12   -5    1    8   15
        35  -41  -34  -27  -21  -14   -7    0    7   14
        40  -43  -36  -29  -22  -15   -8   -1    6   13
For this example, only the first two columns and four rows are included in the array.
The initial values fill all the columns in a row before moving on to the next row.
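The fill order can be checked with a short DATA step. The following is a minimal sketch, not the course program: the step name and the variables Row, Col, and Chill are chosen here for illustration, and the eight initial values are the first two columns and first four rows of the table above.

```sas
data _null_;
   /* 4 rows (wind speeds 5, 10, 15, 20) by 2 columns (temperatures -10, -5). */
   /* _TEMPORARY_ keeps the eight elements out of any output data set.        */
   array W{4,2} _temporary_ (-22, -16, -28, -22, -32, -26, -35, -29);
   do Row = 1 to 4;
      do Col = 1 to 2;
         /* The initial values fill row 1 first, then row 2, and so on */
         Chill = W{Row, Col};
         put Row= Col= Chill=;
      end;
   end;
run;
```

The log lists the elements in the order in which the initial values were typed, which shows that both columns of a row are filled before the next row starts.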
Array elements in the order in which the initial values are assigned:
W{1,1}  W{1,2}  W{2,1}  W{2,2}  W{3,1}  W{3,2}  W{4,1}  W{4,2}
W1      W2      W3      W4      W5      W6      W7      W8
Flights Data
Find the windchill for the flights based on the temperature
and wind speed.
First Two Observations of ia.flights
Flight  Temp  WSpeed
IA2736    -8       9
IA6352    -4      16
Desired Results
wndchill
Flight  Temp  WSpeed  Chill
IA2736    -8       9    -28
IA6352    -4      16    -26
Diagram: the windchill constants are loaded into an array and looked up for each observation in ia.flights.
Row = round(wspeed,5)/5;
Example: Row = round(wspeed,5)/5 = 10/5 = 2
Column = (round(temp,5)/5) + 3;
Initial values of the W array, in the order in which they are assigned:
W{1,1}  W{1,2}  W{2,1}  W{2,2}  W{3,1}  W{3,2}  W{4,1}  W{4,2}
   -22     -16     -28     -22     -32     -26     -35     -29
c04s2d2
In this example, WSpeed must be at least 2.5 and less than 22.5, and Temp must be at least –12.5
and less than –2.5.
1. Eight values are typed into the array as initial values. The _TEMPORARY_ keyword creates a list of
temporary data elements. They behave in the same way as DATA step variables except that they do
not have names and they do not appear in the output data set.
2. WSpeed is rounded to the nearest multiple of 5 because the lookup table contains wind speeds
only in steps of 5 units. The rounded value is divided by 5 to derive the row position in the windchill
lookup table.
3. The offset of 3 is added because the third column in the windchill lookup table represents zero
degrees.
4. The W array is used to look up the windchill values using the row and column variables.
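The steps described in these notes can be put together as follows. This is a sketch under stated assumptions rather than the c04s2d2 program itself: work.flights is a hypothetical stand-in for ia.flights that holds only the two observations shown earlier, and the variable names Flight, Temp, WSpeed, Row, Column, and Chill follow the PDV diagrams in this section.

```sas
/* Hypothetical stand-in for ia.flights (first two observations only) */
data work.flights;
   input Flight $ Temp WSpeed;
   datalines;
IA2736 -8 9
IA6352 -4 16
;
run;

data work.wndchill;
   /* Windchill values for wind speeds 5-20 and temperatures -10 and -5 */
   array W{4,2} _temporary_ (-22, -16, -28, -22, -32, -26, -35, -29);
   set work.flights;
   /* Wind speed in steps of 5: 5 -> row 1, 10 -> row 2, and so on */
   Row    = round(WSpeed, 5) / 5;
   /* Temperature in steps of 5, shifted by 3: -10 -> column 1, -5 -> column 2 */
   Column = (round(Temp, 5) / 5) + 3;
   Chill  = W{Row, Column};
   drop Row Column;
run;

proc print data = work.wndchill;
run;
```

Because the array holds only four rows and two columns, the sketch is valid only for the WSpeed and Temp ranges stated above; values outside those ranges produce an array subscript error.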
Output
PROC PRINT Output from work.wndchill
Flight  Temp  WSpeed  Chill
IA2736    -8       9    -28
IA6352    -4      16    -26
c04s2d2
Exercises
Event  Finish 1  Finish 2  Finish 3  Finish 4
1            65        55        45        35
2            80        70        60        50
3            70        60        50        40
Output
work.results
LastName FrstName Event Finish Score
Tuttle Thomas 1 1 65
Gomez Alan 1 2 55
Chapman Neil 1 3 45
Welch Darius 1 4 35
Vandeusen Richard 2 1 80
Tuttle Thomas 2 2 70
Venter Vince 2 3 60
Morgan Mel 2 4 50
Chapman Neil 3 1 70
Gomez Alan 3 2 60
Morgan Mel 3 3 50
Tuttle Thomas 3 4 40
1. The index variable, I, is used so that the SET statement is executed once for each observation in
ia.wchill.
2. The array, Tmp, is associated with the variables Neg10 through Tmp30.
3. The two-dimensional array W is loaded with the values of the Tmp array.
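A sketch of the loading technique these notes describe is shown below. It is a reconstruction under assumptions, not the course program: work.wchill and work.flights are hypothetical stand-ins for ia.wchill and ia.flights, built from the windchill table and the two flight observations shown earlier in this section.

```sas
/* Hypothetical stand-in for ia.wchill: one observation per wind speed */
data work.wchill;
   input Neg10 Neg5 Tmp0 Tmp5 Tmp10 Tmp15 Tmp20 Tmp25 Tmp30;
   datalines;
-22 -16 -11  -5   1   7  13  19  25
-28 -22 -16 -10  -4   3   9  15  21
-32 -26 -19 -13  -7   0   6  13  19
-35 -29 -22 -15  -9  -2   4  11  17
-37 -31 -24 -17 -11  -4   3   9  16
-39 -33 -26 -19 -12  -5   1   8  15
-41 -34 -27 -21 -14  -7   0   7  14
-43 -36 -29 -22 -15  -8  -1   6  13
;
run;

/* Hypothetical stand-in for ia.flights */
data work.flights;
   input Flight $ Temp WSpeed;
   datalines;
IA2736 -8 9
IA6352 -4 16
;
run;

data work.wndchill;
   array W{8,9} _temporary_;
   /* On the first iteration only, read all eight rows of the lookup table */
   if _N_ = 1 then do I = 1 to 8;
      set work.wchill;
      array Tmp{9} Neg10 -- Tmp30;
      do J = 1 to 9;
         W{I,J} = Tmp{J};
      end;
   end;
   set work.flights;
   Row    = round(WSpeed, 5) / 5;
   Column = (round(Temp, 5) / 5) + 3;
   Chill  = W{Row, Column};
   keep Flight Temp WSpeed Chill;
run;
```

Temporary array elements are retained automatically, so the values loaded during the first iteration remain available for every observation read from the flights data.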
Figure: the program data vector (PDV) during execution. The temporary array W{8,9} holds elements W{1,1} through W{8,9}. Tmp{1} through Tmp{9} reference the variables Neg10, Neg5, Tmp0, Tmp5, Tmp10, Tmp15, Tmp20, Tmp25, and Tmp30. The remaining PDV variables are Flight, Temp, WSpeed, Row, Column, Chill, _N_, I, and J.
Advantages of an Array
Advantages of using an array include the following:
use of positional order
Disadvantages of an Array
Disadvantages of using an array include the following:
memory requirements to load the entire array
Exercises
Output
work.results
LastName FrstName Event Finish Score
Tuttle Thomas 1 1 65
Gomez Alan 1 2 55
Chapman Neil 1 3 45
Welch Darius 1 4 35
Vandeusen Richard 2 1 80
Tuttle Thomas 2 2 70
Venter Vince 2 3 60
Morgan Mel 2 4 50
Chapman Neil 3 1 70
Gomez Alan 3 2 60
Morgan Mel 3 3 50
Tuttle Thomas 3 4 40
a. Produce a SAS data set named meals that contains the meal service code for each flight.
d. Look up the meal for each flight using the WEEKDAY function on Date and the HOUR function
on Depart.
The HOUR function returns values between 0 and 23. The Hour variable in
ia.mealplan contains the values 1 to 24.
Output
meals
Objectives
Define the DATA step hash object.
Use the hash object as a lookup table.
Use the hash object to match records.
is sized dynamically.
4.3 Using Hash Objects as Lookup Tables 4-45
must be unique
can be composite.
Contribution data
Empid   QtrNum  Amount
E00224  qtr1        12
E00224  qtr2        33
E00224  qtr3        22
E00224  qtr4         .
E00367  qtr1        35
E00367  qtr2        48
E00367  qtr3        40
E00367  qtr4        30
E00441  qtr1         .
E00441  qtr2        63
Goal data
Quarter  GoalAmount
1                10
2                15
3                 5
4                15
A set of lookup values can be stored in a hash object. Whereas an array uses a series of consecutive
integers to address array elements, a hash object can use any combination of numeric and character values
as addresses.
hash object
Key (QtrNum)  Data (GoalAmount)
qtr1          10
qtr2          15
qtr3           5
qtr4          15
PDV during execution (output data set: Difference)
Step                               QtrNum  GoalAmount  Empid   Amount  Diff  _N_
Start of iteration 1                                .               .     .    1
Observation read                   qtr1             .  E00224      12     .    1
GoalAmount retrieved from hash     qtr1            10  E00224      12     .    1
Diff calculated                    qtr1            10  E00224      12     2    1
Start of iteration 2               qtr1             .  E00224      12     .    2
Observation read                   qtr2             .  E00224      33     .    2
GoalAmount retrieved, Diff set     qtr2            15  E00224      33    18    2
Empid QtrNum Amount Diff
E00224 qtr1 12 2
E00224 qtr2 33 18
E00224 qtr3 22 17
E00224 qtr4 . .
E00367 qtr1 35 25
E00367 qtr2 48 33
E00367 qtr3 40 35
E00367 qtr4 30 15
E00441 qtr1 . .
E00441 qtr2 63 48
DECLARE object variable(<arg_tag-1: value-1 <,…arg_tag-n: value-n>>);
The table in a hash object is an array of buckets. The default hash table size (the default number of
buckets) is 256 (2^8) and the maximum size is 65,536 (2^16). When multiple key values hash to the same
index (same bucket), the key values are stored in a binary tree in the bucket for rapid retrieval. The size of
the tree is limited only by the available memory.
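The number of buckets is requested as a power of two through the HASHEXP argument tag in the DECLARE statement. The sketch below is illustrative only: the hash object name, the variables Code and City, and the two airports loaded are assumptions chosen to match the airport data used later in this section.

```sas
data _null_;
   /* Host variables for the key and data items must exist in the PDV */
   length Code $ 3 City $ 20;
   /* hashexp: 10 requests 2**10 = 1,024 buckets instead of the default 2**8 = 256 */
   declare hash airports(hashexp: 10);
   airports.definekey("Code");
   airports.definedata("City");
   airports.definedone();
   /* Avoid "uninitialized variable" notes for the host variables */
   call missing(Code, City);
   airports.add(key: "AKL", data: "Auckland");
   airports.add(key: "AMS", data: "Amsterdam");
   rc = airports.find(key: "AMS");
   put rc= City=;
run;
```

A larger table spreads the keys over more buckets, so each binary tree stays shorter; for a handful of keys, as here, the default size is more than sufficient.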
OBJECT.METHOD(<arg_tag-1: value-1 <,…arg_tag-n: value-n>>);
Goal.add(key:'qtr1', data:10 );
Goal.add(key:'qtr2', data:15 );
Goal.add(key:'qtr3', data: 5 );
Goal.add(key:'qtr4', data:15 );
Goal.find();
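The ADD and FIND calls above fit into a complete DATA step along the following lines. This is a sketch, not the course program: the input data set name work.contrib is hypothetical, only the four observations for employee E00224 are included, and the variable names follow the PDV diagrams in this section.

```sas
/* Hypothetical input data; the data set name is an assumption */
data work.contrib;
   input Empid $ QtrNum $ Amount;
   datalines;
E00224 qtr1 12
E00224 qtr2 33
E00224 qtr3 22
E00224 qtr4 .
;
run;

data work.difference (drop = rc);
   /* Host variables for the hash key and data */
   length QtrNum $ 8 GoalAmount 8;
   if _N_ = 1 then do;
      declare hash Goal();
      Goal.definekey("QtrNum");
      Goal.definedata("GoalAmount");
      Goal.definedone();
      call missing(QtrNum, GoalAmount);
      Goal.add(key:'qtr1', data:10);
      Goal.add(key:'qtr2', data:15);
      Goal.add(key:'qtr3', data: 5);
      Goal.add(key:'qtr4', data:15);
   end;
   set work.contrib;
   /* FIND uses the current value of QtrNum and fills GoalAmount on success */
   rc = Goal.find();
   if rc = 0 then Diff = Amount - GoalAmount;
run;
```

When Amount is missing, Diff is also missing and SAS writes a note about missing values to the log.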
Business Task
Combine three data sets to create a report showing
revenues, expenses, profits, and airport information.
ia.expenses: Date, FlightID, Expenses
ia.revenue: Date, FlightID, Origin, Dest, Rev1st, RevBusiness, RevEcon
ia.airports: City, Code, Country, Name
ia.alldata:
   from ia.expenses: Date, FlightID, Expenses
   from ia.revenue: Origin, Dest, Rev1st, RevBusiness, RevEcon
   from ia.airports: DestCity, DestApt, OriginCity, OriginApt
   calculated: Profit
Diagram: ia.Revenue and ia.Expenses are merged; ia.Airports is loaded into a hash object for the lookup.
hash object
Key (Code)  Data (City)  Data (Name)
AKL         Auckland     International
AMS         Amsterdam    Schiphol
ANC         Anchorage    Anchorage International Airport
ARN         Stockholm    Arlanda
ATH         Athens       Hellinikon International Airport
BHM         Birmingham   Birmingham International Airport
Preview of Program
data Alldata_hash;
   if _N_ = 1 then do;
      if 0 then
         set ia.Airports(keep=Code City Name);
      declare hash airports(dataset: "ia.Airports");
      airports.definekey ("Code");
      airports.definedata("City", "Name");
      airports.definedone();
   end;
   merge Expenses(in = e) Revenue(in = r);
   by FlightID Date;
   if e and r;
   Profit = sum(Rev1st, RevBusiness, RevEcon, -Expenses);
   rc = airports.find(key:origin);
   OriginCity = city;
   OriginAirport = name;
   rc = airports.find(key:dest);
   DestCity = city;
   DestAirport = name;
run;
c04s3d2
1. To initialize the attributes of hash variables that originate from an existing SAS data set, you can use a
non-executing SET statement. When you use this technique, the CALL MISSING routine is not required.
Return code  Meaning
zero         success
non-zero     failure
/*****************************/
/* Alternate Solution */
/* Checking the Return Code */
/*****************************/
proc sort data = ia.Expenses out = Expenses;
by FlightID Date;
run;
proc sort data = ia.Revenue out = Revenue;
by FlightID Date;
run;
data Alldata_hash;
if _N_ = 1 then do;
if 0 then
set ia.Airports(keep=Code City Name);
declare hash airports(dataset: "ia.Airports");
airports.definekey ("Code");
airports.definedata("City", "Name");
airports.definedone();
end;
merge Expenses(in = e) Revenue(in = r);
by FlightID Date;
if e and r;
Profit = sum(Rev1st, RevBusiness, RevEcon, -Expenses);
rc = airports.find(key:origin);
if rc = 0 then do;
OriginCity = city;
OriginAirport = name;
end;
else do;
OriginCity = ' ';
OriginAirport = ' ';
end;
rc = airports.find(key:dest);
if rc = 0 then do;
DestCity = city;
DestAirport = name;
end;
else do;
DestCity = ' ';
DestAirport = ' ';
end;
run;
proc print data = Alldata_hash(obs = 5);
title 'Result of Merge plus Hash Object Lookup';
var FlightID Date OriginCity OriginAirport DestCity DestAirport Profit;
format Date date9.;
run;
To define all data set variables as data variables for the hash object, use the ALL: "YES" option.
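A minimal sketch of the ALL: "YES" option follows. The data set work.airports is a hypothetical two-row stand-in for ia.Airports, and the hash object name ap is chosen here for illustration.

```sas
/* Hypothetical stand-in for ia.Airports */
data work.airports;
   length Code $ 3 City $ 20 Name $ 40;
   infile datalines dlm = ',' dsd;
   input Code City Name;
   datalines;
AKL,Auckland,International
AMS,Amsterdam,Schiphol
;
run;

data _null_;
   /* Non-executing SET statement: defines Code, City, and Name in the PDV */
   if 0 then set work.airports;
   declare hash ap(dataset: "work.airports");
   ap.definekey("Code");
   /* Every variable in work.airports becomes a data item */
   ap.definedata(all: "YES");
   ap.definedone();
   rc = ap.find(key: "AMS");
   put rc= Code= City= Name=;
   /* The SET statement never executes, so end the step explicitly */
   stop;
run;
```

With ALL: "YES" the key variable is stored as a data item as well, so a successful FIND also assigns Code.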
Exercises
b. Load the relevant data from ia.Sales in a hash object and use it as a lookup table for the
flights in ia.Dnunder. Include the variables FlightID, RouteID, FltDate, RevTotal,
Expenses, and Profit in the report.
Partial Listing
ia.dnunder
Obs FlightID FltDate Expenses
Partial Listing
ia.sales
Obs FlightID RouteID Origin Dest DestType FltDate Cap1st CapBus
Obs CapEcon CapPassTotal CapCargo Num1st NumBus NumEcon NumPassTotal Rev1st RevBus
Obs RevEcon CargoRev RevTotal CargoWeight
Partial Output
Profit for Flights to Australia and New Zealand
Obs FlightID RouteID FltDate RevTotal Expenses Profit
Objectives
Create permanent formats.
Access permanent formats.
Create formats from SAS data sets.
Maintain formats.
Use formats as lookup tables.
Overview of a Format
A format is similar to stacks of buckets that are referred to
by the value of a variable. One stack holds the data values and the other holds the labels.
SAS puts data values and label values in the buckets
when the format is used in a FORMAT statement, PUT
function, or PUT statement.
SAS uses a binary search on the data value bucket in
order to return the value in the label bucket.
Example 2
proc catalog cat = ia.FORMATS;
contents;
run;
quit;
FORMAT NAME: REVFMT   LENGTH: 18   NUMBER OF VALUES: 7
MIN LENGTH: 1   MAX LENGTH: 40   DEFAULT LENGTH: 18   FUZZ: STD
START     | END       | LABEL (VER. V7|V8 13MAY2005:15:36:19)
----------+-----------+---------------------------------------
.         | .         | Missing
LOW       | 10000     | Up to $10,000
10000<    | 20000     | $10,000+ to $20000
20000<    | 30000     | $20,000+ to $30000
30000<    | 40000     | $30,000+ to $40000
40000<    | 50000     | $40,000+ to $50000
50000<    | HIGH      | More than $50,000
FORMAT NAME: $AIRPORT   LENGTH: 28   NUMBER OF VALUES: 52
MIN LENGTH: 1   MAX LENGTH: 40   DEFAULT LENGTH: 28   FUZZ: 0
START     | END       | LABEL (VER. V7|V8 20APR2005:13:41:43)
----------+-----------+---------------------------------------
AKL       | AKL       | Auckland
AMS       | AMS       | Amsterdam
ANC       | ANC       | Anchorage, AK
ARN       | ARN       | Stockholm
ATH       | ATH       | Athens (Athinai)
BHM       | BHM       | Birmingham, AL
BKK       | BKK       | Bangkok
BNA       | BNA       | Nashville, TN
BOS       | BOS       | Boston, MA
BRU       | BRU       | Brussels (Bruxelles)
CBR       | CBR       | Canberra, Australian Capitol
CCU       | CCU       | Calcutta
CDG       | CDG       | Paris
CPH       | CPH       | Kobenhavn (Copenhagen)
CPT       | CPT       | Cape Town
DEL       | DEL       | Delhi
DFW       | DFW       | Dallas/Fort Worth, TX
DXB       | DXB       | Dubai
FBU       | FBU       | Oslo
FCO       | FCO       | Roma (Rome)
FRA       | FRA       | Frankfurt
GLA       | GLA       | Glasgow, Scotland
GVA       | GVA       | Geneva
HEL       | HEL       | Helsinki
HKG       | HKG       | Hong Kong
HND       | HND       | Tokyo
HNL       | HNL       | Honolulu, HI
IAD       | IAD       | Washington, DC
IND       | IND       | Indianapolis, IN
JED       | JED       | Jeddah
JFK       | JFK       | New York, NY
JNB       | JNB       | Johannesburg
JRS       | JRS       | Jerusalem
LAX       | LAX       | Los Angeles, CA
LHR       | LHR       | London, England
LIS       | LIS       | Lisboa (Lisbon)
MAD       | MAD       | Madrid
MCI       | MCI       | Kansas City, MO
MIA       | MIA       | Miami, FL
MSY       | MSY       | New Orleans, LA
NBO       | NBO       | Nairobi
ORD       | ORD       | Chicago, IL
PEK       | PEK       | Beijing (Peking)
PRG       | PRG       | Praha (Prague)
PWM       | PWM       | Portland, ME
RDU       | RDU       | Raleigh-Durham, NC
SEA       | SEA       | Seattle, WA
SFO       | SFO       | San Francisco, CA
SIN       | SIN       | Singapore
SYD       | SYD       | Sydney, New South Wales
VIE       | VIE       | Wien (Vienna)
WLG       | WLG       | Wellington
FORMAT NAME: $DEST   LENGTH: 13   NUMBER OF VALUES: 52
MIN LENGTH: 1   MAX LENGTH: 40   DEFAULT LENGTH: 13   FUZZ: 0
START     | END       | LABEL (VER. V7|V8 13MAY2005:15:36:19)
----------+-----------+---------------------------------------
AKL       | AKL       | International
AMS       | AMS       | International
ANC       | ANC       | Domestic
ARN       | ARN       | International
ATH       | ATH       | International
BHM       | BHM       | Domestic
BKK       | BKK       | International
BNA       | BNA       | Domestic
BOS       | BOS       | Domestic
BRU       | BRU       | International
CBR       | CBR       | International
CCU       | CCU       | International
CDG       | CDG       | International
CPH       | CPH       | International
CPT       | CPT       | International
DEL       | DEL       | International
DFW       | DFW       | Domestic
DXB       | DXB       | International
FBU       | FBU       | International
FCO       | FCO       | International
FRA       | FRA       | International
GLA       | GLA       | International
GVA       | GVA       | International
HEL       | HEL       | International
HKG       | HKG       | International
HND       | HND       | International
HNL       | HNL       | Domestic
IAD       | IAD       | Domestic
IND       | IND       | Domestic
JED       | JED       | International
JFK       | JFK       | Domestic
JNB       | JNB       | International
JRS       | JRS       | International
LAX       | LAX       | Domestic
LHR       | LHR       | International
LIS       | LIS       | International
MAD       | MAD       | International
MCI       | MCI       | Domestic
MIA       | MIA       | Domestic
MSY       | MSY       | Domestic
NBO       | NBO       | International
ORD       | ORD       | Domestic
PEK       | PEK       | International
PRG       | PRG       | International
PWM       | PWM       | Domestic
RDU       | RDU       | Domestic
SEA       | SEA       | Domestic
SFO       | SFO       | Domestic
SIN       | SIN       | International
SYD       | SYD       | International
VIE       | VIE       | International
WLG       | WLG       | International
FORMAT NAME: $ROUTES   LENGTH: 10   NUMBER OF VALUES: 5
MIN LENGTH: 1   MAX LENGTH: 40   DEFAULT LENGTH: 10   FUZZ: 0
START     | END       | LABEL (VER. V7|V8 13MAY2005:15:36:19)
----------+-----------+---------------------------------------
          |           | Missing
Route1    | Route1    | Zone One
Route2    | Route4    | Zone Two
Route5    | Route7    | Zone Three
**OTHER** | **OTHER** | Unknown
Format names are limited to eight characters in versions of SAS prior to SAS®9.
4.4 Using Formats as Lookup Tables 4-87
PROC FORMAT LIBRARY = libref.catalog;
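As an illustration of the LIBRARY= option, the following sketch stores two formats permanently. The VALUE statements are reconstructed from the REVFMT and $ROUTES listings shown earlier and may differ from the statements that originally created those formats; the ia libref is assumed to be assigned already, as elsewhere in these notes.

```sas
/* Assumes the ia libref is already assigned. Because no catalog name */
/* is given, the formats are stored in the catalog ia.formats.        */
proc format library = ia;
   value revfmt
      .              = 'Missing'
      low   - 10000  = 'Up to $10,000'
      10000 <- 20000 = '$10,000+ to $20000'
      20000 <- 30000 = '$20,000+ to $30000'
      30000 <- 40000 = '$30,000+ to $40000'
      40000 <- 50000 = '$40,000+ to $50000'
      50000 <- high  = 'More than $50,000';
   value $routes
      ' '                 = 'Missing'
      'Route1'            = 'Zone One'
      'Route2' - 'Route4' = 'Zone Two'
      'Route5' - 'Route7' = 'Zone Three'
      other               = 'Unknown';
run;
```

The <- operator excludes the starting value from a range, which corresponds to the < that follows a START value in the format listing.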
For a complete listing of the CATALOG procedure statements and functionality, see the procedures
section of the Base SAS Procedures Guide in the Base SAS documentation.
Documenting Formats
You can use the FMTLIB option in the PROC FORMAT
statement to document the format.
General form of the FMTLIB option:
PROC FORMAT LIBRARY = libref.catalog
            FMTLIB;
   <other statements>;
RUN;
You can use either the SELECT or EXCLUDE statement to process specific formats rather than an entire
catalog.
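For example, the sketch below documents two of the formats listed earlier. It assumes that the ia libref is assigned and that ia.formats contains the $AIRPORT and REVFMT formats.

```sas
/* Document only the $AIRPORT and REVFMT formats in ia.formats */
proc format library = ia fmtlib;
   select $airport revfmt;
run;

/* Document every format in ia.formats except $DEST */
proc format library = ia fmtlib;
   exclude $dest;
run;
```

Character format names keep their leading dollar sign in the SELECT and EXCLUDE statements, and neither statement takes the trailing period.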
PUT statements
You can use the WHERE statement when the OBS= option is in effect.
The MMDDYYB10. format displays the Date variable value using a blank as a separator.
General form:
MMDDYYxw.
Value of x Separator
B blank
C colon
D dash
N no separator
P period
S slash
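The separators can be compared side by side with PUT statements. In this sketch the date 15MAR2004 is an arbitrary value chosen for illustration.

```sas
data _null_;
   Date = '15MAR2004'd;   /* arbitrary example date */
   put 'B (blank)  ' Date mmddyyb10.;
   put 'C (colon)  ' Date mmddyyc10.;
   put 'D (dash)   ' Date mmddyyd10.;
   /* With no separator, eight characters are enough for a four-digit year */
   put 'N (none)   ' Date mmddyyn8.;
   put 'P (period) ' Date mmddyyp10.;
   put 'S (slash)  ' Date mmddyys10.;
run;
```

Each line in the log shows the same date with a different separator between the month, day, and year.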
OPTIONS FMTSEARCH = (item-1 item-2…item-n);
By specifying multiple items in the FMTSEARCH= option, you can concatenate format catalogs. This
enables you to do the following:
• define personal format catalogs to be used in addition to corporate catalogs
• use test and production format catalogs without duplicating the production catalog
• control the order in which catalogs are searched
Search order:
1. work.formats
2. library.formats
3. ia.formats
4. ia.formats3
Because ia is a libref without a catalog name, formats is assumed to be the catalog name.
SAS-supplied formats are always searched first. The work.formats catalog is always searched second,
unless it appears in the FMTSEARCH list. If the library libref is assigned, the library.formats catalog is
searched after work.formats and before anything else in the FMTSEARCH list, unless it appears in the
list. To assign the library libref, use the code shown below:
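A minimal sketch of such an assignment, together with a FMTSEARCH= setting that matches the search order described above, might look like this. The location is a placeholder, not the course's actual path, and the FMTSEARCH= list assumes that the ia libref is assigned.

```sas
/* Placeholder location: the WORK directory is used only so that the */
/* statement runs as written. Substitute the folder that holds your  */
/* own format catalog.                                                */
libname library "%sysfunc(pathname(work))";

/* After WORK.FORMATS and LIBRARY.FORMATS, search ia.formats and then ia.formats3 */
options fmtsearch = (ia ia.formats3);
```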
OPTIONS FMTERR | NOFMTERR;
FMTERR specifies that when SAS cannot find a specified variable format, it generates an error
message and does not allow default substitution to occur.
NOFMTERR replaces missing formats with the w. or $w. default format, issues a note, and continues
processing.
Partial Output
$airport format
FORMAT NAME: $AIRPORT   LENGTH: 28   NUMBER OF VALUES: 52
MIN LENGTH: 1   MAX LENGTH: 40   DEFAULT LENGTH: 28   FUZZ: 0
START     | END       | LABEL (VER. V7|V8 20APR2005:13:41:43)
----------+-----------+---------------------------------------
AKL       | AKL       | Auckland
AMS       | AMS       | Amsterdam
ANC       | ANC       | Anchorage, AK
ARN       | ARN       | Stockholm
ATH       | ATH       | Athens (Athinai)
BHM       | BHM       | Birmingham, AL
BKK       | BKK       | Bangkok
BNA       | BNA       | Nashville, TN
BOS       | BOS       | Boston, MA
BRU       | BRU       | Brussels (Bruxelles)
CBR       | CBR       | Canberra, Australian Capitol
CCU       | CCU       | Calcutta
CDG       | CDG       | Paris
data international;
set ia.international;
DestCity = put(dest,$airport.);
OriginCity = put(Origin,$airport.);
run;
International Cities
Obs FlightID Origin Dest FltDate Num1st NumBus NumEcon NumPassTotal DestCity OriginCity
PROC FORMAT LIBRARY = libref.catalog
            CNTLIN = SAS-data-set;
RUN;
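The following sketch builds a small character format from a data set. It is illustrative only: the data set work.aports is hypothetical, it holds just three of the airport codes listed earlier, and the format is stored in work.formats so that the example does not overwrite the course's permanent $AIRPORT format.

```sas
/* Hypothetical control data set: one observation per value to be formatted */
data work.aports;
   /* A leading $ in FmtName marks a character format */
   retain FmtName '$airport';
   length Start $ 3 Label $ 28;
   infile datalines dlm = ',' dsd;
   input Start Label;
   datalines;
AKL,Auckland
AMS,Amsterdam
ANC,"Anchorage, AK"
;
run;

/* Build the format from the control data set */
proc format library = work cntlin = work.aports;
run;

/* Use the new format as a lookup table */
data _null_;
   length City $ 28;
   City = put('ANC', $airport.);
   put City=;
run;
```

When the END variable is absent, each START value defines a single-value range.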
Review
Maintaining Formats
To maintain formats, perform one of the following tasks:
Edit the PROC FORMAT code that created the original
format.
or
Create a SAS data set from the format, edit the data
set, and use the CNTLIN= option to re-create the
format.
When the data set created by the CNTLOUT= option will be used as a CNTLIN= data set in a
subsequent FORMAT procedure step, the minimum variables that must be edited are START,
END, FMTNAME, and LABEL.
Add the new observations, re-create the format, and document the format:
proc fsedit data = work.fmtdata;
run;
proc sql;
insert into FmtData
set FmtName = '$airport',
Start = 'YQB',
End = 'YQB',
Label = 'Quebec, QC'
set FmtName = '$airport',
Start = 'YUL',
End = 'YUL',
Label = 'Montreal, QC';
quit;
Log
proc sql;
insert into fmtdata
set FmtName = '$airport',
Start = 'YQB',
End = 'YQB',
Label = 'Quebec, QC'
set FmtName = '$airport',
Start = 'YUL',
End = 'YUL' ,
Label = 'Montreal, QC';
NOTE: 2 rows were inserted into WORK.FMTDATA.
data work.fmtdata;
set work.fmtdata end=last;
output;
if last then do;
FmtName = '$airport';
Start = 'YYC';
End = 'YYC';
Label = 'Calgary, AB';
output;
Start = 'YYZ';
End = 'YYZ';
Label = 'Toronto, ON';
output;
end;
run;
Partial Output
New values in the $airport Format
FORMAT NAME: $AIRPORT   LENGTH: 28   NUMBER OF VALUES: 56
MIN LENGTH: 1   MAX LENGTH: 40   DEFAULT LENGTH: 28   FUZZ: 0
START     | END       | LABEL (CONT'D)
----------+-----------+---------------------------------------
SFO       | SFO       | San Francisco, CA
SIN       | SIN       | Singapore
SYD       | SYD       | Sydney, New South Wales
VIE       | VIE       | Wien (Vienna)
WLG       | WLG       | Wellington
YQB       | YQB       | Quebec, QC
YUL       | YUL       | Montreal, QC
YYC       | YYC       | Calgary, AB
YYZ       | YYZ       | Toronto, ON
PROC FORMAT LIBRARY = libref.catalog
            CNTLOUT = SAS-data-set;
   <other statements>;
RUN;
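A complete round trip through CNTLOUT= and CNTLIN= can be sketched as follows. The example is self-contained and illustrative: it first creates a two-value $AIRPORT format in work.formats (an assumption, so that the permanent format in ia.formats is left untouched) and then adds one value.

```sas
/* Illustrative starting point: a small format in work.formats */
proc format library = work;
   value $airport
      'AKL' = 'Auckland'
      'RDU' = 'Raleigh-Durham, NC';
run;

/* Write the format definition to a data set */
proc format library = work cntlout = work.fmtdata;
   select $airport;
run;

/* Append one value. FmtName and the other control variables keep */
/* the values read from the last existing observation.            */
data work.fmtdata;
   set work.fmtdata end = last;
   output;
   if last then do;
      Start = 'YYZ';
      End   = 'YYZ';
      Label = 'Toronto, ON';
      output;
   end;
run;

/* Re-create the format from the edited data set */
proc format library = work cntlin = work.fmtdata;
run;

/* Document the result */
proc format library = work fmtlib;
   select $airport;
run;
```

The new label must fit within the length of the LABEL variable in the CNTLOUT= data set; a longer label would be truncated unless the variable is lengthened first.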
Advantages of Formats
Advantages of using formats include the following:
familiarity
centralized maintenance
Disadvantages of Formats
Disadvantages of using formats include the following:
memory requirements to load the entire format
for the binary search
use of only one variable for the table lookup
Exercises
b. Use the CNTLOUT= and CNTLIN= options in PROC FORMAT. Add new data for ticket agents
using the INSERT statement in PROC SQL or a DATA step program.
c. View the new format using the FMTLIB option in PROC FORMAT. The output is on the next
page.
Exercise Output
New values in the $JCODES Format
FORMAT NAME: $JCODES   LENGTH: 32   NUMBER OF VALUES: 45
MIN LENGTH: 1   MAX LENGTH: 40   DEFAULT LENGTH: 32   FUZZ: 0
START     | END       | LABEL (VER. V7|V8 22JAN2004:11:50:24)
----------+-----------+---------------------------------------
BAGCLK    | BAGCLK    | BAGGAGE CLERK
BAGSUP    | BAGSUP    | BAGGAGE SUPERVISOR
CHKCLK    | CHKCLK    | CHECK IN CLERK
CHKSUP    | CHKSUP    | CHECK IN SUPERVISOR
FACCLK    | FACCLK    | FACILITIES CLERK
FACMGR    | FACMGR    | FACILITES MANAGER
FACMNT    | FACMNT    | FACILITIES MAINTENANCE OPERATIVE
FINACT    | FINACT    | FINANCIAL ACCOUNTANT
FINCLK    | FINCLK    | FINANCE CLERK
FINMGR    | FINMGR    | FINANCE MANAGER
FLSCHD    | FLSCHD    | FLIGHT SCHEDULER
FLSMGR    | FLSMGR    | FLIGHT SCHEDULING MANAGER
FLTAT1    | FLTAT1    | FLIGHT ATTENDANT GRADE 1
FLTAT2    | FLTAT2    | FLIGHT ATTENDANT GRADE 2
FLTAT3    | FLTAT3    | FLIGHT ATTENDANT GRADE 3
FSVCLK    | FSVCLK    | FLIGHT SERVICES CLERK
FSVMGR    | FSVMGR    | FLIGHT SERVICES MANAGER
GRCREW    | GRCREW    | GROUND CREW
GRCSUP    | GRCSUP    | GROUND CREW SUPERVISOR
HRCLK     | HRCLK     | HUMAN RESOURCES CLERK
HRMGR     | HRMGR     | HUMAN RESOURCES MANAGER
ITCLK     | ITCLK     | IT CLERK
ITMGR     | ITMGR     | IT MANAGER
ITPROG    | ITPROG    | COMPUTER PROGRAMMER
ITSUPT    | ITSUPT    | IT SUPPORT SPECIALIST
MECH01    | MECH01    | MECHANIC GRADE 1
MECH02    | MECH02    | MECHANIC GRADE 2
MECH03    | MECH03    | MECHANIC GRADE 3
MKTCLK    | MKTCLK    | MARKETING CLERK
MKTMGR    | MKTMGR    | MARKETING MANAGER
OFFMGR    | OFFMGR    | OFFICE MANAGER
PILOT1    | PILOT1    | PILOT GRADE 1
PILOT2    | PILOT2    | PILOT GRADE 2
PILOT3    | PILOT3    | PILOT GRADE 3
PRES      | PRES      | COMPANY PRESIDENT
RECEPT    | RECEPT    | RECEPTIONIST
RESCLK    | RESCLK    | RESERVATIONS CLERK
RESMGR    | RESMGR    | RESERVATIONS MANAGER
SALCLK    | SALCLK    | SALES CLERK
SALMGR    | SALMGR    | SALES MANAGER
TELOP     | TELOP     | TELEPHONE SWITCHBOARD OPERATOR
TKTAG1    | TKTAG1    | Ticket Agent Grade 1
TKTAG2    | TKTAG2    | Ticket Agent Grade 2
TKTAG3    | TKTAG3    | Ticket Agent Grade 3
VICEPR    | VICEPR    | VICE PRESIDENT
Objectives
Use the TRANSPOSE procedure to transpose a
SAS data set and prepare it for a table lookup.
Another reason for transposing a data set is to restructure a data set to match the requirements of a
particular procedure.
4.5 Transposing Data to Create a Lookup Table 4-109
c04s5d1
The OUT= option provides the name of the new data set.
The default variable names for transposed variables are _NAME_, COL1, COL2, COL3, and COL4.
The data set is not structured correctly for the merge. More options and statements are needed.
The variable, Statistic, does not appear in the PROC TRANSPOSE data set because PROC
TRANSPOSE does not automatically transpose character variables.
NAME= Option
proc transpose data = ia.delaystats
out = stats
name = Day;
run;
Partial Output
Using the NAME =
The NAME= option specifies the name for the new variable in the output data set that contains the names
of the existing variables being transposed.
BY Statement
proc sort data = ia.delaystats
out = delaystats;
by Statistic;
run;
proc transpose data = delaystats
out = stats
name = Day;
by Statistic;
run;
Using a BY statement
Partial Output Obs Statistic Day COL1
For each BY group, PROC TRANSPOSE creates one observation for each variable that it transposes.
The BY variable is not transposed.
The original SAS data set must be sorted or indexed by the BY variable(s) before the PROC
TRANSPOSE step runs.
The COL1 variable needs a more descriptive variable name. You can use SAS data set options to rename
this variable.
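The renaming step can be sketched as follows. This is only an illustration: the small delaystats data set is a hypothetical stand-in for ia.delaystats with invented values, and the new name AvgDelay is an assumption taken from the later solution program, which subtracts AvgDelay.

```sas
/* Hypothetical stand-in for ia.delaystats (values invented for illustration) */
data delaystats;
   input Statistic $ Day1 Day2 Day3;
   datalines;
MaxDelay 20 31 12
AvgDelay 5 7 3
;
run;

/* PROC TRANSPOSE with a BY statement needs sorted (or indexed) input */
proc sort data=delaystats;
   by Statistic;
run;

/* RENAME= on the OUT= data set gives COL1 a descriptive name (assumed: AvgDelay) */
proc transpose data=delaystats
               out=stats(rename=(col1=AvgDelay))
               name=Day;
   by Statistic;
run;

proc print data=stats;
run;
```

The RENAME= option is applied as the output data set is written, so no extra DATA step is needed.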
Partial Output
Using the RENAME= option
Partial Output
Using the ID Statement
The ID statement specifies a variable in the input data set whose formatted values name the transposed
variables in the output data set.
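A minimal sketch of the ID statement, again using a hypothetical stand-in for ia.delaystats with invented values. Each value of Statistic becomes the name of a transposed variable, so the output has one observation per day.

```sas
/* Hypothetical stand-in for ia.delaystats (values invented for illustration) */
data delaystats;
   input Statistic $ Day1 Day2 Day3;
   datalines;
MaxDelay 20 31 12
AvgDelay 5 7 3
;
run;

/* The values of Statistic (MaxDelay, AvgDelay) name the new variables */
proc transpose data=delaystats
               out=stats
               name=Day;
   id Statistic;
run;

proc print data=stats;
run;
```

With this layout no BY statement or sort is required, and the variable AvgDelay can be kept directly with a KEEP= data set option.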
PROC TRANSPOSE <DATA=input-data-set>
               <OUT=output-data-set>
               <NAME=variable-name>;
   <BY <DESCENDING> variable-1
       <...<DESCENDING> variable-n>;>
   <VAR variable(s);>
   <ID variable;>
RUN;
/*****************************
Program assumes that the data set STATS was created
by the TRANSPOSE procedure using the BY statement
and the RENAME= data set option.
*****************************/
data delays;
set stats;
FltDate = mdy(1,input(substr(day,4),2.),2004);
drop day;
where Statistic = 'AvgDelay';
run;
data combine;
merge rdudelay delays;
by FltDate;
DelayDif = delay - AvgDelay;
run;
Flight Delay
Obs ID FltDate Delay Dif
/*********************************************************
Alternate Solution if the data set STATS was created with the
TRANSPOSE procedure and the ID statement;
*********************************************************/
data delays;
set stats (keep = Day AvgDelay);
FltDate = mdy(1,input(substr(day,4),2.),2004);
drop day;
run;
data combine;
merge rdudelay delays;
by FltDate;
DelayDif = delay - AvgDelay;
run;
Exercises
Partial Output
ia.tcontrib
Qtr
Obs EmpID Num Amount
1 65 55 45 35
2 80 70 60 50
3 70 60 50 40
Output
work.results
Frst
LastName Name Event Finish Score
Tuttle Thomas 1 1 65
Gomez Alan 1 2 55
Chapman Neil 1 3 45
Welch Darius 1 4 35
Vandeusen Richard 2 1 80
Tuttle Thomas 2 2 70
Venter Vince 2 3 60
Morgan Mel 2 4 50
Chapman Neil 3 1 70
Gomez Alan 3 2 60
Morgan Mel 3 3 50
Tuttle Thomas 3 4 40
data results;
array Awards{3,4} _Temporary_ (65,55,45,35,
80,70,60,50,
70,60,50,40);
set ia.compete;
Score = Awards{Event,Finish};
run;
Output
work.results
Frst
LastName Name Event Finish Score
Tuttle Thomas 1 1 65
Gomez Alan 1 2 55
Chapman Neil 1 3 45
Welch Darius 1 4 35
Vandeusen Richard 2 1 80
Tuttle Thomas 2 2 70
Venter Vince 2 3 60
Morgan Mel 2 4 50
Chapman Neil 3 1 70
Gomez Alan 3 2 60
Morgan Mel 3 3 50
Tuttle Thomas 3 4 40
a. Produce a SAS data set named meals that contains the meal service code for each flight.
d. Look up the meal for each flight using the WEEKDAY function on Date and the HOUR function
on Depart.
The HOUR function returns values between 0 and 23. The Hour variable in
ia.mealplan contains the values 1 to 24.
data meals;
array food{7,24} $ 10 _Temporary_;
if _n_ = 1 then do i = 1 to 7*24;
set ia.mealplan;
food{dow,hour} = Meal;
end;
set ia.schedule;
Service = food{weekday(Date),hour(Depart)+1};
keep Flight Date Depart Service;
run;
b. Load the relevant data from ia.Sales in a hash object and use it as a lookup table for the
flights in ia.Dnunder. Include the variables FlightID, RouteID, FltDate, RevTotal,
Expenses, and Profit in the report. The variable RevTotal is the sum of Rev1st,
RevBus, RevEcon, and CargoRev.
Partial Listing
ia.Dnunder
Flight
Obs ID FltDate Expenses
Partial Listing
ia.sales
Flight
Obs ID RouteID Origin Dest DestType FltDate Cap1st CapBus
Cap Num
Pass Num Num Pass
Obs CapEcon Total CapCargo Num1st Bus Econ Total Rev1st RevBus
Cargo
Obs RevEcon CargoRev RevTotal Weight
Partial Output
Profit for Flights to Australia and New Zealand
Flight Total
Obs ID RouteID Date Revenue Expenses Profit
data Profit;
if _n_ = 1 then do;
if 0 then set ia.Sales
(keep = FlightID RouteID FltDate RevTotal);
declare hash ht(dataset: 'ia.Sales');
ht.definekey ('FlightID', 'FltDate');
ht.definedata('RouteID', 'RevTotal');
ht.definedone();
end;
set ia.Dnunder;
if ht.find() = 0 then do;
Profit = RevTotal - Expenses;
output;
end;
else putlog 'WARNING: _N_=' _N_ 'No match found. '
FlightID= FltDate=;
run;
PUTLOG 'text';
Beginning the text with WARNING:, ERROR:, or NOTE: (uppercase, followed by a colon) displays the
message in the color that SAS uses for its own warnings, errors, or notes in the log.
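A short sketch of PUTLOG with message prefixes. The variable x and the message text are made up for illustration.

```sas
data _null_;
   x = -1;
   /* The uppercase keyword and colon at the start of the text trigger log coloring */
   if x < 0 then putlog 'WARNING: Negative value found. ' x=;
   putlog 'NOTE: This note was written by PUTLOG.';
run;
```

Unlike PUT, PUTLOG always writes to the SAS log, even when a FILE statement has redirected PUT output elsewhere.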
options ls = 80;
proc format library = ia fmtlib;
select $jcodes;
title '$jcodes Format';
run;
6. Updating a Format (Optional)
Update an existing format by following these steps:
a. Add to the permanent $jcodes format.
b. Use the CNTLOUT= and CNTLIN= options in PROC FORMAT. Add new data for ticket agents
using the INSERT statement in PROC SQL or a DATA step program.
c. View the new format using the FMTLIB option in PROC FORMAT.
proc format lib = ia cntlout = FmtData;
select $jcodes;
run;
/* SQL solution */
proc sql;
insert into fmtdata
set FmtName = '$JCODES',
Start = 'TKTAG1',
End = 'TKTAG1',
Label = 'Ticket Agent Grade 1'
set FmtName = '$JCODES',
Start = 'TKTAG2',
End = 'TKTAG2',
Label = 'Ticket Agent Grade 2'
(Continued on the next page.)
4.6 Solutions to Exercises 4-125
Partial Output
ia.tcontrib
Qtr
Obs EmpID Num Amount
Objectives
Append two SAS data sets using the APPEND
procedure.
Update a SAS data set using an INSERT INTO
statement in the SQL procedure.
4 ...
This chapter discusses the APPEND procedure and the SQL procedure INSERT INTO statement.
5-4 Chapter 5 Combining Data Vertically
5 c05s1d1
Log
113
114 proc append base = emps
115 data = newemps;
116 run;
117
118 proc print data = ia.emps;
119 title 'All Employees Created';
120 title2 'by Appending ia.newemps to ia.emps';
121 run;
NOTE: There were 2070 observations read from the data set IA.EMPS.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.01 seconds
cpu time 0.02 seconds
5.1 Appending SAS Data Sets 5-5
Partial Output
All Employees Created
by Appending ia.newemps to ia.emps
PROC APPEND BASE=SAS-data-set
            DATA=SAS-data-set
            <FORCE>;
PROC APPEND only reads the data in the DATA= SAS data set, not in the BASE= SAS data set.
The FORCE option forces PROC APPEND to concatenate data sets when the DATA= data set contains
variables that have any of the following characteristics:
• are not in the BASE= data set.
• do not have the same type as the variables in the BASE= data set. (For variables with a type mismatch,
missing values are assigned in the appended observations when the FORCE option is used.)
• are longer than the variables in the BASE= data set.
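The effect of the FORCE option can be sketched with two tiny data sets. The data set names base and extra and their contents are hypothetical, chosen so that the DATA= data set has both an extra variable and a longer variable.

```sas
data base;
   length Name $ 5 Age 8;
   Name = 'Alice';
   Age = 30;
run;

data extra;
   length Name $ 10 City $ 8;
   Name = 'Christophe';
   City = 'Raleigh';
run;

/* Without FORCE, this step fails: City is not in BASE= and Name is longer in DATA= */
proc append base=base data=extra force;
run;

/* The appended row has a truncated Name, a missing Age, and no City variable */
proc print data=base;
run;
```

The BASE= data set keeps its own descriptor, which is why the extra variable is dropped and the longer value is truncated.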
allsales
missing
partsales
The data set ia.sales used for demonstrations and exercises contains fewer observations than
the data set ia.sales used for the course notes.
Partial Log
8
9 proc append base=allsales data=partsales;
10 run;
8 c05ref1
The work.allsales data set has 21 variables. The work.partsales data set has 16 variables.
Partial Output
proc print data=allsales(firstobs=23 obs=29);
var Origin Dest DestType CargoRev CargoWeight;
title 'Partial ALLSALES Data Set';
run;
FORCE Option
The FORCE option enables PROC APPEND to
concatenate the data sets even though there might be
variables in the DATA= data set that do not exist in the
BASE= data set.
partsales
truncate
allsales
10 ...
The FORCE option can cause loss of data due to truncation or dropping variables.
To create allsales and partsales, execute the following program (c05ref2):
data allsales;
set ia.sales(obs = 25);
run;
The data set ia.sales used for demonstrations and exercises contains fewer observations than
the data set ia.sales used for the course notes.
Partial Log
51
52 proc append base=partsales data=allsales force;
53 run;
11 c05ref2
The work.allsales data set has 21 variables. The work.partsales data set has 16 variables.
The variable RouteID is character in the work.allsales data set. The variable RouteID is numeric
in the work.partsales data set.
The type mismatch for RouteID and the additional variables present in work.allsales require the
use of the FORCE option.
Partial Output
Partial PARTSALES Data Set
Cap
Flight Pass
Obs ID FltDate Cap1st CapBus CapEcon Total CapCargo
Num
Num Num Pass Route
Obs Num1st Bus Econ Total Rev1st RevBus RevEcon RevTotal ID
numeric
character
Origin, Dest, DestType, CargoRev, and CargoWeight
are in allsales but not in partsales.
12 c05ref2
Log
proc append data=airports base=acities force;
run;
Partial Output
The CONTENTS Procedure
5 Division Char 30
1 EmpID Char 6
2 LastName Char 15
4 Location Char 13
3 Phone Char 4
data pilots;
keep phone Division LastName Location EmpID;
set pilots(rename = (phone = ophone));
phone = input(ophone,4.);
run;
547 1003
548 1028
549 1070
550 1016
551
552
553
554
5 Division Char 30
1 EmpID Char 6
2 LastName Char 15
4 Location Char 13
3 Phone Char 4
When you use the INSERT INTO statement with a view, the view must reference one and only
one table. The INSERT INTO statement cannot add rows to a view of joined tables.
The columns are matched positionally when you use the VALUES clause or a query expression to insert
the results in a table. If the data types do not match, if there are more values than columns, or if there are
fewer values than columns, the row is not inserted. Whether other rows are inserted depends on the
current value of the UNDO_POLICY= option in the PROC SQL statement.
PROC SQL;
   INSERT INTO table-name <(column<, ...column>)>
      SET column=sql-expression                      1
          <, ...column=sql-expression>
      <SET column=sql-expression
          <, ...column=sql-expression>>;
QUIT;
18 c05s1d3
1 Each SET clause contains column names and their values separated by commas. The value for a
column can be the result of a SELECT clause.
Log
76 proc sql;
77 insert into acities
78 set City = 'Toronto',Code = 'YYZ',
79 Name = 'Pearson International',
80 Country = 'Canada'
81 set City = 'Montreal', Code = 'YUL',
82 Name = 'Montreal Trudeau',
83 Country = 'Canada';
PROC SQL;
   INSERT INTO table-name <(column<, ...column>)>
      VALUES (value <, ...value>)                    2
      <...VALUES (value <, ...value>)>;
QUIT;
19 c05s1d4
2 The VALUES clause is positional unless the columns are specified in the INSERT INTO clause.
Log
86 proc sql;
87 insert into acities(City, Code, Name, Country)
88 values
89 ('Toronto','YYZ','Pearson International','Canada')
90 values
91 ('Montreal','YUL','Montreal Trudeau','Canada');
PROC SQL;
   INSERT INTO table-name <(column<, ...column>)>    3
      query-expression;
QUIT;
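A minimal sketch of the query-expression form of INSERT INTO. The tables acities and newcities below are small hypothetical stand-ins built in the WORK library, not the course data.

```sas
data acities;
   length City $ 20 Code $ 3 Country $ 10;
   City = 'Toronto'; Code = 'YYZ'; Country = 'Canada';
run;

data newcities;
   length City $ 20 Code $ 3 Country $ 10;
   City = 'Montreal'; Code = 'YUL'; Country = 'Canada'; output;
   City = 'Raleigh';  Code = 'RDU'; Country = 'USA';    output;
run;

proc sql;
   /* Rows returned by the query are matched to the listed columns by position */
   insert into acities (City, Code, Country)
      select City, Code, Country
         from newcities
         where Country = 'Canada';
quit;

proc print data=acities;
run;
```

Listing the target columns explicitly makes the positional matching visible and guards against a different column order in the source table.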
20 c05s1d5
maintains indexes
21
22
Reference Information
Other techniques to concatenate SAS data sets:
Pros:
• This technique enables the full power of the DATA step to manipulate the data.
• Creation of a new data set occurs.
• An unlimited number of SAS data sets can be read.
Cons:
• All of the SAS data sets must be read.
Pros:
• Data manipulation occurs in both data sets.
• There is a combination of joins and OUTER UNION CORRESPONDING.
• A new data set is created.
• ANSI standard syntax is used.
Cons:
• All data sets are read.
Only the APPEND procedure and the INSERT INTO statement in the SQL procedure were
discussed in this section.
Concatenation
Exercises
2. Updating a Data Set Using the INSERT INTO Statement in the SQL Procedure (Optional)
Create the work.quarter4 and work.y2005 data sets by submitting the code in the ProcCopy
program file:
proc copy in = ia out = work;
select Quarter4 Y2005;
run;
Append work.quarter4 to work.y2005 using the INSERT INTO statement in the SQL
procedure. First, determine if the data sets have the same variables. The resulting data set should be
work.y2005 data with the additional observations from work.quarter4.
Objectives
Create a SAS data set from multiple raw data files
using the FILENAME statement.
Create a SAS data set from multiple raw data files
using the FILEVAR= option.
25
26 ...
Only the FILENAME statement and the FILEVAR= option are discussed in this section.
5.2 Appending Raw Data Files 5-27
Obs 1
Obs 1
Obs 2
Obs 2
27 ...
Use multiple INFILE statements to read a record from one raw data file, a record from the second raw
data file, a record from the third raw data file, and so on (similar to an interleave).
Multiple INFILE statements can be used to concatenate raw data files that have different file layouts.
28 ...
Use the FILENAME statement to concatenate multiple raw data files whose names can be hard-coded.
29 c05s2d1
Windows/UNIX Log
filename Q1 ('month1.dat' 'month2.dat' 'month3.dat');
data firstq;
infile Q1;
input Flight $ Origin $ Dest $ Date : date9. RevCargo : comma15.;
run;
File List=('c:\workshop\winsas\prog3\month1.dat'
'c:\workshop\winsas\prog3\month2.dat'
'c:\workshop\winsas\prog3\month3.dat'),
RECFM=V,LRECL=256
File List=('c:\workshop\winsas\prog3\month1.dat'
'c:\workshop\winsas\prog3\month2.dat'
'c:\workshop\winsas\prog3\month3.dat'),
RECFM=V,LRECL=256
File List=('c:\workshop\winsas\prog3\month1.dat'
'c:\workshop\winsas\prog3\month2.dat'
'c:\workshop\winsas\prog3\month3.dat'),
RECFM=V,LRECL=256
FILENAME fileref ('external-file1'
                  'external-file2' … 'external-filen');
fileref
is any SAS name that is eight characters or fewer.
'external-file'
is the physical name of an external file. The physical
name is the name that is recognized by the operating
environment.
30
A FILENAME statement can associate a fileref with multiple physical external files.
Raw data files: month8, month9, month10, month11, month12
month + 9 + .dat
month + 10 + .dat
month + 11 + .dat
The value of the FILEVAR= variable is a character string that contains the physical filename of the
raw data file to be read. When the next INPUT statement executes, it reads from the new file that the
FILEVAR= variable specifies. Like automatic variables, the FILEVAR= variable is not
written to the data set.
Because the FILEVAR= variable is an ordinary DATA step variable, you can read raw data files
conditionally and construct the names of the raw data files programmatically.
INFILE file-specification FILEVAR=variable;
FILEVAR = variable
names a variable whose change in value causes the
INFILE statement to close the current input file and
open a new one.
36
zzz
is an arbitrarily named placeholder, not an actual
filename or a fileref that was assigned to a file
previously. SAS uses this placeholder for reporting
processing information to the SAS log.
NextFile
contains the name of the raw data file to be read
(month9.dat, month10.dat, month11.dat,
and so on).
37
The placeholder must be eight characters or fewer, and must begin with an alpha character or underscore,
followed by alphanumeric characters or underscores.
COMPRESS Function
To eliminate the space in filenames such as
month 9.dat, use the COMPRESS function.
General form of the COMPRESS function:
COMPRESS(source<,characters-to-remove>)
38
If the characters-to-remove option is omitted, the COMPRESS function removes blanks from the source.
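A short sketch of why COMPRESS is needed when the filename is built with the PUT function. The 2. format right-aligns a single-digit month, which leaves a blank inside the name.

```sas
data _null_;
   /* put(9,2.) returns ' 9', so the raw result is 'month 9.dat' */
   NextFile = "month" || put(9, 2.) || ".dat";
   putlog 'Before COMPRESS: ' NextFile=;

   /* With no second argument, COMPRESS removes all blanks */
   NextFile = compress(NextFile);
   putlog 'After COMPRESS:  ' NextFile=;
run;
```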
Log
data movingq;
length Dest Origin $ 3 Flight $ 7;
do i = 11,10,9;
NextFile = "month"||put(I,2.)||".dat";
NextFile = compress(NextFile,' ');
infile zzz filevar=NextFile;
input Flight $ Origin $ Dest $ Date : date9. RevCargo : comma15.;
output;
end;
stop;
run;
40 c05s2d3
1 The DO UNTIL loop executes the INFILE and INPUT statements for every record of the raw data
file until the value of LastObs is 1. A DO UNTIL loop checks its condition at the bottom
of the loop.
2 The END= option creates the variable LastObs, which can be used to detect the end of the raw data
file. The END= option names a variable whose value is one of the following:
   0 when the current input record is not the last in the current input file
   1 when the current input record is the last in the current input file
Partial Log
42 data movingq;
43 length Dest Origin $ 3 Flight $ 7;
44 do I = 11,10,9;
45 NextFile = "month"||put(I,2.)||".dat";
46 NextFile = compress(NextFile,' ');
47 do until (LastObs);
48 infile zzz filevar = NextFile end = LastObs;
49 input Flight $ Origin $ Dest $ Date : date9.
50 RevCargo : comma15.2;
51 output;
52 end;
53 end;
54 stop;
55 run;
1 Obtains the month number of today’s date to begin the rolling month range.
2 Calculates the month numbers of the two months prior to today’s month number.
Calendar Logic
What if the current month is January or February?
42
INTNX Function
The INTNX function increments a date value by a given
interval or intervals, and returns a date value.
EDate = intnx('interval',BDate, increment)
The INTNX function can increment dates, time, or datetime values by a given interval or
intervals, and returns a date, time, or datetime value.
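The January and February case can be checked with a fixed date instead of TODAY(). This is a small sketch; the date 15JAN2005 and the variable names are chosen only for illustration.

```sas
data _null_;
   d = '15JAN2005'd;
   /* Simple subtraction (month(d) - 1) would give 0 and -1 here */
   MonNum  = month(d);
   MidMon  = month(intnx('month', d, -1));   /* December of the prior year */
   LastMon = month(intnx('month', d, -2));   /* November of the prior year */
   putlog MonNum= MidMon= LastMon=;          /* MonNum=1 MidMon=12 LastMon=11 */
run;
```

Because INTNX works on the date value itself, the year boundary is handled without any special-case logic.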
INTNX Function
General form of the INTNX function:
INTNX('interval',start-from,increment<,alignment>)
'interval'
specifies a character constant or variable of date,
datetime, or time intervals.
start-from
specifies a SAS expression that represents a SAS
date, datetime, or time value identifying a starting point.
increment
specifies a negative or positive integer that represents
the specific number of time intervals.
44
Optional arguments:
interval specifies a character constant, a variable, or an expression that contains a time interval such
as WEEK, SEMIYEAR, QTR, or HOUR. The type of interval (date, datetime, or time) must
match the type of value in start-from and increment.
multiple specifies a multiple of the interval. It sets the interval equal to a multiple of the interval type.
For example, YEAR2 consists of two-year, or biennial, periods.
shift-index specifies the starting point of the interval. By default, the starting point is 1. A value that is
greater than 1 shifts the start to a later point within the interval. The unit for shifting depends
on the interval. For example, YEAR.3 specifies yearly periods that are shifted to start on the
first of March of each calendar year and to end in February of the following year. The shift
index cannot be greater than the number of periods in the entire interval. For example,
YEAR2.24 has a valid shift index, but YEAR2.25 is invalid because there is no twenty-fifth
month in a two-year interval. If the default shift period is the same as the interval type, then
you can shift only multi-period intervals with the shift index. For example, because MONTH
type intervals shift by MONTH sub-periods by default, you cannot shift monthly intervals
with the shift index. However, you can shift bimonthly intervals with the shift index, because
two MONTH intervals exist in each MONTH2 interval. The interval name MONTH2.2, for
example, specifies bimonthly periods starting on the first day of even-numbered months.
start-from specifies a SAS expression that represents a SAS date, time, or datetime value that identifies
a starting point.
increment specifies a negative, positive, or zero integer that represents the number of date, time, or
datetime intervals. Increment is the number of intervals to shift the value of start-from.
alignment controls the position of SAS dates within the interval. Alignment can be one of these values:
BEGINNING | B specifies that the returned date is aligned to the beginning
of the interval. (DEFAULT)
MIDDLE | M specifies that the returned date is aligned to the midpoint of
the interval.
END | E specifies that the returned date is aligned to the end of the
interval.
SAMEDAY | S | SAME specifies that the date that is returned is aligned to the same
calendar date with the corresponding interval increment.
Alignment is new in SAS®9.
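A brief sketch of the alignment argument, using a fixed start date chosen for illustration. The variable names are arbitrary.

```sas
data _null_;
   start = '15MAR2005'd;
   b = intnx('month', start, 1, 'beginning');  /* first day of the next month   */
   e = intnx('month', start, 1, 'end');        /* last day of the next month    */
   s = intnx('month', start, 1, 'sameday');    /* same day of the next month    */
   format b e s date9.;
   putlog b= e= s=;   /* b=01APR2005 e=30APR2005 s=15APR2005 */
run;
```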
Log
data movingq;
drop MonNum MidMon LastMon I;
MonNum=month(today());
MidMon=month(intnx('month',today(),-1));
LastMon=month(intnx('month',today(),-2));
do i=MonNum, MidMon, LastMon;
NextFile="month"||put(i,2.)||".dat";
NextFile=compress(NextFile,' ');
do until (LastObs);
infile zzz filevar=NextFile end=LastObs;
input Flight $ Origin $ Dest $ Date : date9.
RevCargo : comma15.;
output;
end;
end;
stop;
run;
Considering Efficiency
To make the program more efficient, call the TODAY
function only once.
today = today();
MonNum = month(today);
MidMon = month(intnx('month',today,-1));
LastMon = month(intnx('month',today,-2));
46 c05s2d5a
c05s2d5a
data movingq;
drop MonNum MidMon LastMon I today;
today = today();
MonNum = month(today);
MidMon = month(intnx('month',today,-1));
LastMon = month(intnx('month',today,-2));
do i=MonNum, MidMon, LastMon;
NextFile = "month"||put(i,2.)||".dat"; * PC and Unix;
*Nextfile = ".prog3.rawdata(month"||put(i,2.)||")"; * mainframe ;
NextFile=compress(NextFile,' ');
do until (LastObs);
infile xxx filevar=NextFile end=LastObs;
input Flight $ Origin $ Dest $ Date : date9.
RevCargo : comma15.;
output;
end;
end;
stop;
run;
Instead of using the concatenation operator (|| or !!), you can use the concatenation functions (CAT,
CATS, CATT, and CATX).
Caution: If you do not declare the length of the new variable with a LENGTH statement, the value
returned by any of the CAT functions can have a length of up to the following:
• 200 characters in WHERE clauses and in PROC SQL
• 32,767 characters in the DATA step, except in WHERE clauses
• 65,534 characters when the function is called from the macro processor
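A minimal sketch of building the filenames with CATS instead of the concatenation operator and COMPRESS. The LENGTH value of 20 is an arbitrary choice for this illustration.

```sas
data _null_;
   /* Declare the length so the variable does not default to 200 bytes */
   length NextFile $ 20;
   do i = 9 to 11;
      /* CATS converts the numeric i and strips leading and trailing blanks */
      NextFile = cats('month', i, '.dat');
      putlog NextFile=;
   end;
run;
```

CATS removes the blanks itself, so no separate COMPRESS call is needed.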
Reference Information
1 route1.dat
2 route2.dat
3 route3.dat
4 route4.dat
5 route5.dat
2 The letter grouping zzz is a placeholder, not an actual filename or a fileref that was previously
assigned to a file. SAS uses this placeholder for reporting processing information to the SAS log.
The placeholder is an arbitrary word; however, it must be eight characters or fewer and begin with an
alpha character or underscore, followed by alphanumeric characters or underscores.
3 The FILEVAR= option specifies the value for the FILEVAR= variable. The INFILE statement
closes the current file and opens a new one if the value of Readit has changed when the INFILE
statement executes.
4 LastFile is the arbitrary variable name created by the END= option. LastFile is a
temporary variable and is set to 1 after each file is finished being read.
5 The DO WHILE loop checks the value of the variable LastFile at the top of the loop.
Therefore, the INPUT statement reads from the current open INPUT file. Use a DO WHILE loop
here, not a DO UNTIL loop. The DO UNTIL stops the DATA step if any file is empty.
6 The OUTPUT statement writes the contents of the Program Data Vector to create an observation
of the SAS data set. The OUTPUT statement is required in this DATA step. Without the OUTPUT
statement, the data set route1_5 contains only six observations, that is, one per external file.
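The pattern described in these notes can be sketched end to end as follows. This is an illustration under stated assumptions, not the course program: the macro variable dir, the two small files written to the WORK directory, their contents, and the output data set routes are all hypothetical, and the forward slash in the path is assumed to be accepted by the operating environment. The names zzz, Readit, and LastFile follow the notes.

```sas
/* Assumption: raw files are written to the WORK directory for this demo */
%let dir = %sysfunc(pathname(work));

/* Create two small raw data files, route1.dat and route2.dat */
data _null_;
   length fname $ 200 RouteID $ 8;
   do n = 1 to 2;
      fname = cats("&dir", '/route', n, '.dat');
      file tmp filevar=fname;
      do k = 1 to 3;
         RouteID = cats('R', n, k);
         put RouteID;
      end;
   end;
run;

/* Read every record of each file */
data routes;
   length Readit $ 200;
   do n = 1 to 2;
      Readit = cats("&dir", '/route', n, '.dat');
      /* zzz is only a placeholder; Readit names the file to open */
      infile zzz filevar=Readit end=LastFile;
      /* Condition is checked at the top, so an empty file is skipped safely */
      do while (not LastFile);
         input RouteID $;
         output;
      end;
   end;
   stop;
run;

proc print data=routes;
run;
```

The STOP statement is needed because the DATA step never reaches an end-of-file condition on its own when all reading happens inside the loops.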
5-50 Chapter 5 Combining Data Vertically
Exercises
The raw data files use the naming convention Yyyyy. For example:
For directory based: y2005.dat
Open the program c05ex3Start, which contains the following INPUT statement:
input Flight $ Date : date9. Depart : time5.;
Partial Output
Three Years of Data
Open the program c05ex4Start, which contains the following INPUT statement:
The raw data files use the following naming convention: Yyyyy. For example:
For directory based: y2005.dat
Open the program c05ex3Start, which contains the following INPUT statement:
input Flight $ Date : date9. Depart : time5.;
Save your SAS program.
For directory based: ch3ex1.sas
For z/OS (OS/390): '.prog3.sascode(ch3ex1)'
data last3(drop=year thisyear);
thisyear=year(today());
do year=thisyear to thisyear-2 by -1;
NextFile="y"||put(year,4.)||".dat";
do until(Last);
infile zzz filevar=NextFile end=Last;
input Flight $ Date : date9. Depart : time5.;
output;
end;
end;
stop;
run;
Open the program c05ex4Start, which contains the following INPUT statement:
input @1 RouteID $7.
@8 Origin $3.
@11 Destination $3.
@14 cargo 5.
@19 totalpass 4.
@23 boarded 4.
@27 transfered 4.;
data EuropeFlights;
infile europe;
input @1 RouteID $7.
@8 Origin $3.
@11 Destination $3.
@14 cargo 5.
@19 totalpass 4.
@23 boarded 4.
@27 transfered 4.;
run;
6.1 Introduction
Objectives
Investigate the reasons for sorting data.
Define BY-group processing.
List alternatives to the SORT procedure.
4 ...
6-4 Chapter 6 BY-Group Processing and Sorting
BY-Group Processing
BY-group processing has these characteristics:
is a method of processing observations that are
grouped or ordered by the values of common variables
can be used in both DATA and PROC steps
Alternatives to Sorting
There are several alternatives to sorting data:
indexing
6
6.2 Eliminating Duplicates 6-5
Objectives
Use the NODUPKEY option.
Use FIRST. and LAST. processing.
Create a data set using the DUPOUT= option.
PROC SORT DATA=data-set-name NODUPKEY;
9
Reference Information
The NODUPRECS option checks for and eliminates duplicate consecutive observations.
The example below demonstrates the fact that duplicates might remain in the data set.
TABLE_ONE
A B C D
1 3 5 8
1 3 5 8
2 4 6 8
1 2 8 6
1 3 5 8
2 5 7 3
SORTDUP=PHYSICAL | LOGICAL
is a system option that controls how NODUPRECS processing works.
PHYSICAL removes duplicates based on all variables in the data set. This is the default.
LOGICAL removes duplicates based only on variables remaining after DROP= and KEEP= data set
options are processed.
An example of using the SORTDUP= system option is shown below.
TABLE_ONE
A B C D
1 3 5 8
1 3 8 6
1 3 8 6
Eliminate Duplicates
The data set ia.allemps contains data for both
retired and current employees. Because the data was
drawn from different sources, multiple observations were
accidentally inserted for some employee ID numbers.
Create a new SAS data set that contains only one
observation for each employee ID number.
ia.allemps (First Six Observations)
Obs EmpID LastName Phone Location Division
10
c06s2d1
11
12 c06s2d2
1 The NODUPKEY option identifies duplicate observations based on the key value EmpID and keeps
only the first observation for each value.
2 The DUPOUT= option creates a data set named dups that contains the duplicate observations.
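A small sketch of NODUPKEY together with DUPOUT=. The data set allemps below is a hypothetical five-row stand-in for ia.allemps, and uniqueemps is an invented output name.

```sas
/* Hypothetical stand-in for ia.allemps (values invented for illustration) */
data allemps;
   input EmpID $ LastName $;
   datalines;
E001 Smith
E002 Jones
E001 Smith
E003 Lee
E002 Jonas
;
run;

/* Keep the first observation per EmpID; write the rest to dups */
proc sort data=allemps out=uniqueemps nodupkey dupout=dups;
   by EmpID;
run;

proc print data=uniqueemps;   /* three observations: E001, E002, E003 */
run;

proc print data=dups;         /* two observations: the removed E001 and E002 rows */
run;
```

Reviewing the DUPOUT= data set is a quick way to confirm that the removed rows really are unwanted duplicates, as with the Jones and Jonas rows here.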
13
Additionally, there is a new SAS global option, SORTEQUALS | NOSORTEQUALS, that enables you to
globally disengage the stable sorting logic (EQUALS) that is on by default in the SORT procedure.
SORTEQUALS is the shipped default to maintain backward compatibility, but NOSORTEQUALS is
recommended.
14 c06s2d3
1 EQUALS maintains the relative order of the observations within the input data set in the output data
set.
2 NOEQUALS does not necessarily preserve this order in the output data set.
15
Exercises
1 RHONDA D. USA
2 IRIS GERMANY
3 CHARLES H. USA
4 RAYMA M. USA
5 HARALD GERMANY
6 ROGER USA
7 STEVEN UNITED KINGDOM
8 LEWIS USA
9 SANDRA USA
10 KECIA H. USA
11 SELBY USA
12 JULIE R. USA
13 MARK J. USA
14 ISABELLE FRANCE
Job
Obs EmpLocation Phone EmpID Code
15 GUENTER GERMANY
16 THOMAS GERMANY
Job
Obs EmpLocation Phone EmpID Code
1 RHONDA D. USA
2 KECIA H. USA
3 SELBY USA
Job
Obs EmpLocation Phone EmpID Code
Objectives
Define threading.
Understand the workspace and library space
required to sort a SAS data file.
Estimate sort workspace.
Allocate sort workspace.
18
Threading
In SAS®9, the SORT procedure is multi-threaded.
A thread is defined as the following:
a single path of execution
19
6.3 Sorting Resources 6-17
Multi-Threaded Processing
Multi-threaded processing is a type of parallel
processing introduced in SAS®9.
Parallel processing means that multiple units of work
are available to be scheduled for concurrent execution
by the operating system.
This technology takes advantage of hardware called
symmetric multiprocessing machines (SMPs) that has
multiple central processing units (CPUs).
20
Multi-Threaded Processing
A symmetric multiprocessing environment possesses the
following features:
has multiple CPUs that share the same memory and
a thread-enabled operating system
can spawn and process multiple threads
simultaneously using multiple CPUs
enables the application to coordinate threads from the
same process to share data very efficiently
21
• In an SMP computer environment, one instance of an operating system runs on several CPUs.
Applications that run under this operating system can also run on several or all existing CPUs. All
processes (operating system and applications) share the same memory and the same I/O resources.
• SMP systems are referred to as shared everything systems.
• One advantage of the SMP architecture is the ability to distribute the computational load dynamically
over the existing CPUs and thus achieve equal loading of the CPUs.
• SMP systems can be arranged in multiple clusters to achieve even more scalability that often extends
into 10 terabytes or more of data capacity and processing support.
Collate process
22
Multi-Threaded Processing
Processes suitable for threading are the following:
sorting
grouping
summarizing
REPORT
TABULATE
23
When you benchmark using the threaded procedures, use the Real Time statistic rather than the
CPU time statistic. The back-end collating process to re-create the single data set might result in
an increase in total CPU time, while reducing wall-clock time (time from submission of code for
execution to return of results).
OPTIONS THREADS | NOTHREADS;
PROC SORT DATA=SAS-data-set THREADS | NOTHREADS;
25
OPTIONS CPUCOUNT=1-1024 | ACTUAL;
1-1024
is the number of CPUs that SAS will assume
are available for use by thread-enabled
applications.
ACTUAL
is the number of CPUs that SAS detects are
available for a specific session.
The default is ACTUAL.
26
The SAS Administrator might have limited the number of CPUs that are available for SAS
processing, so the value ACTUAL might be less than the total number of CPUs in the machine that
SAS is using.
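The benchmarking advice above can be sketched as follows. This is a minimal illustration, not the course demonstration: the output data set names and the BY variable Dest are assumptions. FULLSTIMER is set so that the log reports real time separately from CPU time.

```sas
/* Report real time and CPU time separately in the log. */
options fullstimer threads cpucount=actual;

/* Threaded sort (output name and BY variable are assumptions). */
proc sort data=ia.sales out=work.sales_thr threads;
   by Dest;
run;

/* The same sort with threading turned off, for comparison. */
proc sort data=ia.sales out=work.sales_nothr nothreads;
   by Dest;
run;
```

When comparing the two log notes, look at the real time values rather than the CPU time values.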
[Figure: Disk Space (ia.sales)]
The space calculation for the SAS Release 8.2 sort is as follows:
c06s3d1
31
32
The data set ia.sales used for demonstrations and exercises contains fewer observations than
the data set ia.sales used for the course notes.
In multi-threaded environments, if you use the OVERWRITE option in the PROC SORT statement, you
need space equal to the data set size. The OVERWRITE option enables the input data set to be deleted
before the replacement output data set is populated with observations. The OVERWRITE option is
supported by the SAS sort and SAS multi-threaded sort only. The option has no effect if you use a host
sort or the TAGSORT option.
Use the OVERWRITE option only with a data set that is backed up or with a data set that you can
reconstruct. Because the input data set is deleted, data will be lost if a failure occurs while the output data
set is being written.
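A cautious way to try the OVERWRITE option is to sort a copy, so that the original data set serves as the backup. The sketch below assumes a work copy named sales_copy and a BY variable of Dest; both names are illustrative.

```sas
/* Work on a copy so that ia.sales remains available as a backup. */
data work.sales_copy;
   set ia.sales;
run;

/* Sort in place. The input data set is deleted before the       */
/* replacement output data set is populated with observations.   */
proc sort data=work.sales_copy overwrite;
   by Dest;
run;
```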
SORTSIZE=n | nK | nM | nG | MIN | MAX | hexX | SIZE;
c06s3d2
39
6.4 Choosing the Right Sort Routine (Self-Study) 6-31
Objectives
Understand the processing differences between host
and portable sort utilities.
Learn how to specify a particular sort utility.
Operating Environment   Host Sort Utilities
z/OS                    Dfsort *, Syncsort
UNIX                    Syncsort *
Windows                 Syncsort *
* Default
SORTCUTP=
SORTNAME=
OPTIONS SORTPGM=utility | BEST | HOST | SAS;
OPTIONS SORTCUTP=n | nK | nM | nG | MAX | MIN | hexX;
47
z/OS 4M *
UNIX 0 **
Windows 0 **
OPTIONS SORTNAME=host-sort-utility-name;
49
The SORTNAME= option is only required if you have more than one host sort installed at your
site on your platform.
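The three system options can be combined as in the sketch below. The values are illustrative assumptions, and which options and values are accepted depends on the operating environment and on the host sort utilities installed at your site.

```sas
/* Let SAS choose between the host sort and the SAS sort. */
options sortpgm=best;

/* With SORTPGM=BEST, data sets larger than this many bytes are */
/* passed to the host sort (the 4M threshold is an assumption).  */
options sortcutp=4M;

/* Name the host sort utility. This is needed only if more than */
/* one host sort is installed (the utility name is an assumption). */
options sortname=syncsort;
```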
6.5 Alternatives to Sorting 6-37
Objectives
Use indexes to return the data in sorted order.
Use indexes to combine data horizontally.
Use a format to group data for BY-group processing.
Use a CLASS statement.
51
6-38 Chapter 6 BY-Group Processing and Sorting
c06s5d1
52
Using an index for BY-group processing with Scalable Performance Data Engine data is discussed in a
later chapter.
The data set ia.sales used for demonstrations and exercises contains fewer observations than
the data set ia.sales used for the course notes.
NOTE: There were 25 observations read from the data set IA.SALES.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds
53 c06s5d1
6.5 Alternatives to Sorting 6-39
54 c06s5d1
55
6-40 Chapter 6 BY-Group Processing and Sorting
Using Indexes
You can use the SET/SET statements with the KEY=
option to avoid sorting a large data set when you merge
a large SAS data set with a smaller data set that can be
indexed.
1. The first SET statement names the data set that has
the key values that will be used to retrieve
observations from the second data set.
2. Specify the KEY= option in the second SET statement
to use an index to retrieve observations.
General form of the KEY= option:
SET SAS-data-file-name KEY=index-name;
56
Use of the SET/SET statements with the KEY= option is also a good technique for merging a small driver
data set with a larger indexed data set when only the matches are required to be returned.
Using Indexes
The SAS data set ia.distances contains the
distance for each airline route.
Partial Data Set
RouteID Distance
0000108 298
0000070 231
0000034 3480
0000032 2018
0000066 762
0000074 1130
0000024 480
0000096 893
0000036 442
. .
. .
0000103 147
0000102 4581
0000072 388
0000107 298
0000106 1446
57
6.5 Alternatives to Sorting 6-41
Using Indexes
The data set ia.sales is not sorted by RouteID.
There are two indexes on the data set, Origin and
DteFlt. Neither of them can be used in the merge, and
you do not want to sort the large data set.
Partial Data Set
Flight
ID RouteID Origin Dest DestType FltDate . . .
The data set ia.sales used for demonstrations and exercises contains fewer observations than
the data set ia.sales used for the course notes.
Using Indexes
proc datasets lib = ia;
modify distances;
index create RouteID;
run;
quit;
data routes;
set ia.sales;
set ia.distances key = RouteID/unique;
run;
c06s5d2
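When a KEY= lookup finds no match, SAS sets the automatic variable _IORC_ to a nonzero value, sets _ERROR_ to 1, and leaves the lookup variables holding the values from the previous successful read. The sketch below is one common way to handle that case. It is an illustration rather than the course program, and setting Distance to missing for unmatched routes is an assumed design choice.

```sas
data routes;
   set ia.sales;
   set ia.distances key = RouteID / unique;
   if _iorc_ ne 0 then do;   /* no matching RouteID in ia.distances */
      _error_ = 0;           /* suppress the error condition         */
      Distance = .;          /* do not carry over the previous match */
   end;
run;
```

If only matching observations are wanted, the DO group could instead reset _ERROR_ and then use a DELETE statement.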
59
61
62
6.5 Alternatives to Sorting 6-43
Flight
Obs ID RouteID FltDate Origin Dest Distance
64
6-44 Chapter 6 BY-Group Processing and Sorting
Destination City
BRU Brussels
CDG Paris
GLA Glasgow
GVA Geneva
Sorted by Grouped by
Destination City
65
6.5 Alternatives to Sorting 6-45
c06s5d3
66
c06s3d3a
title 'Printing ia.lhr by FlightID';
proc print data = ia.lhr;
by FlightID notsorted;
run;
6-46 Chapter 6 BY-Group Processing and Sorting
Partial Output
Printing ia.lhr by FlightID
Num
Num Num Pass
Obs Dest FltDate Num1st Bus Econ Total City
continued...
67
Num
Flight Num Num Pass
Obs ID Dest FltDate Num1st Bus Econ Total
--------------------------City=Frankfurt--------------------------
Num
Flight Num Num Pass
Obs ID Dest FltDate Num1st Bus Econ Total
68
6-48 Chapter 6 BY-Group Processing and Sorting
BY variable-name NOTSORTED;
69
6.5 Alternatives to Sorting 6-49
70
OPTIONS BYSORTED;
If observations with the same BY value are grouped together but are not necessarily sorted in alphabetic
or numeric order, use the NOBYSORTED option.
OPTIONS NOBYSORTED;
When the NOBYSORTED option is specified, you do not have to specify NOTSORTED in every
BY statement to access the data set(s).
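A short sketch of the system option in use follows. It assumes, as in the earlier PROC PRINT example, that ia.lhr is grouped but not sorted by FlightID. The option is reset afterward so that later steps get the default behavior.

```sas
/* Treat BY groups as grouped, not necessarily sorted. */
options nobysorted;

proc print data = ia.lhr;
   by FlightID;            /* NOTSORTED is not needed here */
run;

/* Restore the default. */
options bysorted;
```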
6-50 Chapter 6 BY-Group Processing and Sorting
71
The GROUPFORMAT option enables the BY statement to use the $QTRFMT format to create
FIRST.SALEMON and LAST.SALEMON. The NOTSORTED option is used because the data is
grouped by SaleMon but not sorted by SaleMon.
6.5 Alternatives to Sorting 6-51
qtr TotalCargo
1 $770,915,528.00
2 $778,976,417.00
3 $788,588,795.00
4 $779,322,475.00
73
BY GROUPFORMAT variable-name <NOTSORTED>;
74
First.variable and last.variable are temporary automatic variables in the PDV that identify
the first and last observations in each BY-group.
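The following sketch shows how GROUPFORMAT and the FIRST. and LAST. variables can work together to summarize by quarter. The input data set work.monthly, the month values in the format, and the variable CargoRev are assumptions made for illustration. The input is assumed to be in calendar order, so that each quarter's months are contiguous.

```sas
/* Hypothetical format that maps month abbreviations to quarters. */
proc format;
   value $qtrfmt 'JAN','FEB','MAR' = '1'
                 'APR','MAY','JUN' = '2'
                 'JUL','AUG','SEP' = '3'
                 'OCT','NOV','DEC' = '4';
run;

data quarterly(keep = qtr TotalCargo);
   set work.monthly;                    /* hypothetical input data set */
   format SaleMon $qtrfmt.;             /* BY groups use formatted values */
   by groupformat SaleMon notsorted;
   if first.SaleMon then TotalCargo = 0;
   TotalCargo + CargoRev;               /* accumulate within the quarter */
   if last.SaleMon then do;
      qtr = put(SaleMon, $qtrfmt.);
      output;                           /* one observation per quarter */
   end;
run;
```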
80 c06s5d5 ...
The data set ia.sales used for demonstrations and exercises contains fewer observations than
the data set ia.sales used for the course notes.
6.5 Alternatives to Sorting 6-55
81
82 c06s5d6 ...
6-56 Chapter 6 BY-Group Processing and Sorting
Flight
Number N Obs Variable Label Sum
----------------------------------------------------------------------------------
IA00100 728 Rev1st Revenue from First Class Passengers 14428800.00
RevBus Revenue from Business Passengers 21006480.00
RevEcon Revenue from Economy Passengers 55384362.00
CargoRev Revenue from Cargo 81998560.00
83
CLASS variable(s) </ options>;
TABULATE
SUMMARY
UNIVARIATE
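A CLASS statement groups the analysis without requiring sorted input. The sketch below assumes that ia.sales is the input and uses the revenue variable names from the sales file layout; the demonstration program itself is not reproduced in these notes.

```sas
proc means data = ia.sales sum;
   class FlightID;                       /* no prior sort is required */
   var Rev1st RevBus RevEcon RevCargo;
run;
```

With a BY statement in place of CLASS, the data would have to be sorted, indexed, or at least grouped by FlightID.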
84
6.5 Alternatives to Sorting 6-57
Reference Information
data-set-name(SORTEDBY=by-clause | _NULL_)
85
by-clause indicates the data order. You can specify variables and options as you can in a BY statement.
_NULL_ removes any existing sort information.
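One way to assert a sort order on an existing data set is the MODIFY statement in PROC DATASETS, which accepts SORTEDBY= as a data set option. The sketch assumes that work.invoices exists and really is in InvoiceID order; SAS does not verify the claim.

```sas
/* Record that work.invoices is already in InvoiceID order. */
proc datasets lib = work nolist;
   modify invoices(sortedby = InvoiceID);
quit;

/* The Sort Information section shows the asserted order. */
proc contents data = work.invoices;
run;
```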
c06s5d7
86 ...
6.5 Alternatives to Sorting 6-59
Sort Information
Sortedby InvoiceID
Validated NO
Character Set ANSI
87
6-60 Chapter 6 BY-Group Processing and Sorting
88 c06s5d7
If a CONTENTS procedure is run after the PROC SORT, the Validated flag is still set to NO.
Partial Log
Sort Information
Sortedby InvoiceID
Validated NO
Character Set ANSI
To set the Validated flag to YES, use the FORCE option in the PROC SORT statement.
proc sort data = invoices force;
by InvoiceID;
run;
Sortedby InvoiceID
Validated YES
Character Set ANSI
6.5 Alternatives to Sorting 6-61
Exercises
N
JobCat Obs Sum
-----------------------------------------------
Flight Attendant 32 991000.00
Navigator 8 556000.00
Pilots 17 1520000.00
Sum
------------
991000.00
------------
Sum
------------
531000.00
------------
Sum
------------
556000.00
------------
------------------------ JobCat=Pilots -------------------------
Sum
------------
1520000.00
------------
Partial Output from the PRINT Procedure (page 3 of output if the OPTIONS PS=60 LS=120;
statement is submitted)
------------------------------------- JobCode=BAGCLK ------------------------------------
(continued)
Hire Emp
Obs Date LastName FirstName Country EmpLocation EmpID
Hire Emp
Obs Date LastName FirstName EmpCountry Location EmpID
/* alternative solution */
proc sort data = ia.retirees out = retirees;
by EmpID;
run;
7.1 Introduction
Objectives
Investigate how SAS data sets are stored.
Review the concept of a data set page.
This chapter addresses Base SAS data sets only. Scalable Performance Data Engine data is addressed in a
later chapter.
7-4 Chapter 7 Controlling Data Storage Space
Data Portion
Index File
Index 1
Index 2
The total amount of storage required for a SAS data file is the sum of the space required for the following:
• the descriptor portion
• the observation length multiplied by the number of observations
• any associated indexes
• any operating-system-specific storage overhead
7.1 Introduction 7-5
The total number of bytes occupied by a data set equals the data page size times the number of data
pages, plus the index page size times the number of index pages.
Partial Output
Engine/Host Dependent Information
c07s1d1
6
The data set ia.sales used for demonstrations and exercises contains fewer observations than
the data set ia.sales used for the course notes.
7-6 Chapter 7 Controlling Data Storage Space
Objectives
Describe how SAS stores numeric values.
Determine how to safely reduce the space required
to store numeric values in SAS data sets.
9
7.2 Reducing the Length of Numeric Variables 7-7
+0.35298*(10**5)
Sign Mantissa Base Exponent
10
SAS stores numeric values in native floating point representation. On UNIX, Linux, Windows, and Open
VMS/Alpha platforms, this form is "IEEE format" as defined in ISO standard IEC 60559. On z/OS, SAS
stores numeric values in IBM mainframe floating-point representation.
Summary of Floating-Point Numbers Stored in Eight Bytes
Representation Base Exponent Bits Maximum Mantissa Bits
IBM mainframe 16 7 56
IEEE 2 11 52
7-8 Chapter 7 Controlling Data Storage Space
11 c07s2d1
To decrease the length of all numeric variables, you can use the DEFAULT= option in the LENGTH
statement:
data reducedsales;
length default = 4;
... more SAS code ...
run;
7.2 Reducing the Length of Numeric Variables 7-9
12
NOTE: No unequal values were found. All values compared are exactly equal.
13 c07s2d2
7-10 Chapter 7 Controlling Data Storage Space
14
15
Exceeding the number of integer digits recommended above or reducing the stored size of non-integer
data can result in a loss of precision due to the truncation of nonzero bytes. It is not recommended.
7.2 Reducing the Length of Numeric Variables 7-11
16
17
7-12 Chapter 7 Controlling Data Storage Space
data _null_;
set test;
put x=;
put y=;
run;
18 c07s2d3
86
87 data _null_;
88 set test;
89 put x=;
90 put y=;
91 run;
x=0.0999999642
y=0.1
NOTE: There were 1 observations read from the data set WORK.TEST.
19
Just as a decimal number system cannot store the fraction 1/3 exactly in a finite number of digits,
a binary number system (or multiple thereof, such as octal or hexadecimal) cannot store the
fraction 1/10 exactly in any finite number of digits.
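The program that created the test data set is not reproduced in these notes. A step along the following lines is an assumption, but it is consistent with the log shown earlier: truncating the 8-byte IEEE representation of 0.1 to 4 bytes leaves a value of about 0.0999999642.

```sas
data test;
   length x 4;      /* x is stored in 4 bytes              */
   x = 1/10;        /* low-order bytes are lost on output  */
   y = 1/10;        /* y keeps the default length of 8     */
run;
```

The truncation happens when the observation is written to the data set. Inside the DATA step that creates it, x is still held in 8 bytes.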
7.2 Reducing the Length of Numeric Variables 7-13
data test;
length x 3;
x = 8193;
run;
data _null_;
set test;
put x=;
run;
20 c07s2d4
x=8192
NOTE: There were 1 observations read from the
data set WORK.TEST.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
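On IEEE platforms (Windows and UNIX), a numeric variable of length L keeps 8*L - 11 significant bits once the sign bit, the 11 exponent bits, and the implied leading bit are accounted for. The largest integer that can be stored without gaps is therefore 2**(8*L - 11). The sketch below prints that limit for each length; it does not apply to z/OS, which uses a different representation.

```sas
data _null_;
   do len = 3 to 8;
      MaxInt = 2**(8*len - 11);   /* largest exactly representable integer */
      put len= MaxInt= comma25.;
   end;
run;
```

For a length of 3 the formula gives 8,192, which agrees with the demonstration above, where 8193 was stored as 8192.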
21
7-14 Chapter 7 Controlling Data Storage Space
Objectives
Define the structure of a compressed SAS data file.
Create a compressed SAS data file.
Examine the advantages and disadvantages of
compression.
23
SAS data files, but not views, can be stored in compressed form.
continued...
24
7.3 Compressing Data Files 7-15
25
continued...
26
7-16 Chapter 7 Controlling Data Storage Space
27
Compressing a file reduces the number of bytes required to represent each observation. In a compressed
file, each observation is a variable-length record.
28
7.3 Compressing Data Files 7-17
29 c07s3d1
30 c07s3d2
The external file sales used for demonstrations and exercises contains fewer records than the
external file sales used for the course notes.
7-18 Chapter 7 Controlling Data Storage Space
Partial Log
NOTE: The data set WORK.SALESCHAR has 329264 observations and 21
variables.
NOTE: Compressing data set WORK.SALESCHAR decreased size by 28.14
percent.
Compressed is 4930 pages; un-compressed would require 6861 pages.
NOTE: DATA statement used (Total process time):
real time 17.36 seconds
cpu time 3.25 seconds
31
32 c07s3d3
7.3 Compressing Data Files 7-19
Partial Log
NOTE: The data set WORK.SALESBIN has 329264 observations and 21
variables.
NOTE: Compressing data set WORK.SALESBIN decreased size by 31.51
percent.
Compressed is 4699 pages; un-compressed would require 6861 pages.
NOTE: DATA statement used (Total process time):
real time 7.04 seconds
cpu time 3.62 seconds
33
34
7-20 Chapter 7 Controlling Data Storage Space
SAS-data-set(COMPRESS=NO | YES | CHAR | BINARY)
OPTIONS COMPRESS=NO | YES | CHAR | BINARY;
35
CHAR | YES uses the Run Length Encoding (RLE) compression algorithm, which
compresses repeating consecutive bytes, such as trailing blanks or repeated
zeros.
BINARY uses Ross Data Compression (RDC), which combines run length encoding
and sliding window compression.
The COMPRESS= data set option overrides the COMPRESS= system option.
The COMPRESS= options interact with two other system or data set options, POINTOBS= and
REUSE=. See "COMPRESS= Data Set Option" in the dictionary of SAS language elements in SAS
Language Reference: Dictionary in the Base SAS documentation for additional information on these
interactions.
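The precedence rule can be seen in a short sketch. The output data set names are assumptions; ia.capacity is the data set used elsewhere in this section.

```sas
options compress = char;                     /* session-wide default */

data work.capacity_bin(compress = binary);   /* overrides the system option */
   set ia.capacity;
run;

data work.capacity_raw(compress = no);       /* stored uncompressed */
   set ia.capacity;
run;

options compress = no;                       /* restore the SAS default */
```

The log note for each compressed data set reports whether compression decreased or increased its size.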
7.3 Compressing Data Files 7-21
36
LastName FirstName
0
1
… 2
0
…
A D AMS B I L L
37
7-22 Chapter 7 Controlling Data Storage Space
@A D A M S #@B I L L #
38
COMPRESS = BINARY
Ross Data Compression uses both run-length encoding
and sliding window compression.
A data set has these variables:
Name Type Length
Answer1 Numeric 8
...
Answer200 Numeric 8
In uncompressed form, the data file resembles this:
Obs answer1 answer2 answer3 answer4 answer5 answer200
1 1 2 1 2 1 ... 2
2 1 1 1 1 1 ... 1
3 2 2 2 2 2 ... 2
39 ...
7.3 Compressing Data Files 7-23
COMPRESS = BINARY
In Ross data compressed form, the first observation in the
data file resembles the form below:
0 0
1 9
+ +
@ 1 1 # @ 1 2 # %
40
+
Indicates the sign and exponent.
1
Compression Guidelines
41
7-24 Chapter 7 Controlling Data Storage Space
Compression Dependencies
Because there is higher overhead for each observation, a
data file can occupy more space in compressed form than
in uncompressed form if the file has the following:
few repeated characters
42
Compression Guidelines
data capacity(compress = yes);
set ia.capacity;
run;
Partial Log
1175 data capacity(compress = yes);
1176 set ia.capacity;
1177 run;
NOTE: There were 108 observations read from the data set IA.CAPACITY.
NOTE: The data set WORK.CAPACITY has 108 observations and 7 variables.
NOTE: Compressing data set WORK.CAPACITY increased size by 50.00 percent.
Compressed is 3 pages; un-compressed would require 2 pages.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds
43 c07s3d4
7.3 Compressing Data Files 7-25
Compression Dependencies
When you use the COMPRESS= data set option or the
COMPRESS= system option, SAS knows the following:
size of the overhead introduced by compression
44
Compression Dependencies
1 data test(compress = yes);
2 x = 1;
3 run;
45 c07s3d5
7-26 Chapter 7 Controlling Data Storage Space
Compression Trade-Offs
Uncompressed                           Compressed
Usually requires more disk storage.    Usually requires less disk storage.
Requires less CPU time to prepare      Requires more CPU time to prepare
observation for I/O.                   observation for I/O.
Uses more I/O operations.              Uses fewer I/O operations.
46
Compression Trade-Offs
Uncompressed                           Compressed
An updated observation fits in its     An updated observation might be moved
original location.                     from its original location.
47
7.3 Compressing Data Files 7-27
Exercises
b. Edit the program to decrease the length of the numeric variables Cap1st, CapBus, and
CapEcon to 3; CapCargo, Num1st, NumBus, NumEcon, NumPassTotal,
CapPassTotal, CargoWeight and FltDate to 4; and Rev1st, RevBus, RevEcon,
RevCargo and RevTotal to 5.
Change the name of the output data set to salesnum. Resubmit it, and record the number of
pages and the page size for the data set salesnum.
c. Edit the original c07ex1start program to create a compressed data set using COMPRESS=CHAR.
Change the name of the output data set to saleschar. Be sure not to use the reduced length
numeric program to create saleschar. Submit the program, and record the number of pages
and the page size for the data set saleschar.
d. Edit the program to create a compressed data set using COMPRESS=BINARY. Change the name
of the output data set to salesbin. Resubmit it, and record the number of pages and the page
size for the data set salesbin.
The external file sales used for demos and exercises contains fewer records than the external
file sales used for the course notes.
7-28 Chapter 7 Controlling Data Storage Space
Objectives
Investigate types of SAS data sets.
Create and use DATA step views.
Determine the advantages of DATA step views.
Examine guidelines for using DATA step views.
50
51
The FILENAME statement and the FILEVAR option for the INFILE statement were discussed in an
earlier chapter.
7.4 Creating a DATA Step View 7-29
A data file is data stored on disk; it is a SAS file with a member type of DATA.
A view is a set of instructions stored on disk; it is a SAS file with a member type of VIEW.
A DATA File
data ia.newdata;
infile fileref; /* external file */
DATA step statements;
run;
53 ...
Compile Execute
54 ...
The name of a DATA view must be different from the name of an existing DATA file in the same SAS
library.
7.4 Creating a DATA Step View 7-31
Rev
Obs Flight Origin Dest Date Cargo
Log
filename Q1 ('month1.dat' 'month2.dat' 'month3.dat');
File List=('c:\workshop\winsas\prog3\month1.dat'
'c:\workshop\winsas\prog3\month2.dat'
'c:\workshop\winsas\prog3\month3.dat'),
RECFM=V,LRECL=256
File List=('c:\workshop\winsas\prog3\month1.dat'
'c:\workshop\winsas\prog3\month2.dat'
'c:\workshop\winsas\prog3\month3.dat'),
RECFM=V,LRECL=256
File List=('c:\workshop\winsas\prog3\month1.dat'
'c:\workshop\winsas\prog3\month2.dat'
'c:\workshop\winsas\prog3\month3.dat'),
RECFM=V,LRECL=256
NOTE: There were 6686 observations read from the data set IA.FIRSTQ.
NOTE: PROCEDURE PRINT used:
real time 0.15 seconds
cpu time 0.16 seconds
7.4 Creating a DATA Step View 7-33
c07s4d2
/* The following program appends data from 3 months.
The data selected is dependent on today's date. */
options date;
Rev
Obs Flight Origin Date Dest Cargo
Log
options date;
NOTE: There were 6579 observations read from the data set IA.MOVINGQ.
NOTE: PROCEDURE PRINT used:
real time 0.83 seconds
cpu time 0.23 seconds
7-36 Chapter 7 Controlling Data Storage Space
DATA data-set-name(s) / VIEW=view-name;
INFILE fileref;
INPUT variable(s);
RUN;
VIEW = view-name
view-name specifies a name that the DATA step uses
to store the partially compiled DATA step.
The view-name must match one of the data
set names.
56
You can also create SAS data files in the DATA step that creates the view, but you can create only one
view per DATA step.
DATA VIEW=view-name;
DESCRIBE;
RUN;
57
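The two general forms can be put together in a short sketch. It reuses the Q1 fileref and the ia.firstq view name that appear in the log earlier in this section. The INPUT statement is borrowed from the ia.movingq reference program and is an assumption about the layout of the monthly files.

```sas
filename Q1 ('month1.dat' 'month2.dat' 'month3.dat');

/* Store the partially compiled DATA step as a view. */
/* No raw data is read at this point.                */
data ia.firstq / view = ia.firstq;
   infile Q1;
   input Flight $ Origin $ Dest $ Date : date9.
         RevCargo : comma15.2;
run;

/* Write the source statements of the view to the log. */
data view = ia.firstq;
   describe;
run;
```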
7.4 Creating a DATA Step View 7-37
58
59 ...
7-38 Chapter 7 Controlling Data Storage Space
64 ...
The PRINT procedure with the UNIFORM option, the CLASS statement in the MEANS/SUMMARY,
TABULATE, and UNIVARIATE procedures, and many SAS/STAT procedures require multiple passes
through the data.
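When a step needs more than one pass, or when the same view is read several times in a program, it can be cheaper to execute the view once and store the result as a data file. The sketch below assumes the ia.movingq view from this section; the summary step is only an example of a multi-pass consumer.

```sas
/* Execute the view once and keep the result as a data file. */
data work.movingq;
   set ia.movingq;
run;

/* Later steps read the stored data file, not the view. */
proc means data = work.movingq sum;
   class Dest;
   var RevCargo;
run;
```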
7.4 Creating a DATA Step View 7-39
65 ...
7-40 Chapter 7 Controlling Data Storage Space
Reference Information
NOTE: The data set WORK.MOVINGQ has 6684 observations and 5 variables.
NOTE: There were 6684 observations read from the data set IA.MOVINGQ.
NOTE: PROCEDURE PRINT used:
real time 0.30 seconds
cpu time 0.25 seconds
7.4 Creating a DATA Step View 7-41
Because SAS macro variables are resolved during compilation, any macro variables used in a DATA step
view are resolved when the view is created.
You can use the SYMGET function to postpone macro resolution until the view is executed.
c07ref2
data ia.movingq / view = ia.movingq;
drop MonNum MidMon LastMon I today;
today = today();
MonNum = month(today);
MidMon = month(intnx('month',today,-1));
LastMon = month(intnx('month',today,-2));
do I = MonNum, MidMon, LastMon;
NextFile = "month"!!put(i,2.)!!".dat";* Windows/UNIX;
*Nextfile = ".prog3.rawdata(month"!!put(i,2.)!!")"; /* z/OS */
NextFile = compress(NextFile,' ');
do until (LastObs);
infile in filevar = NextFile end = LastObs;
input Flight $ Origin $ Dest $ Date : date9.
RevCargo : comma15.2;
if Dest = symget('ThisDest') then output;
end;
end;
stop;
run;
Use the %LET statement to provide a value for the macro variable ThisDest.
%let ThisDest = MCI;
proc print data = ia.movingq;
title "Flights to &ThisDest";
var Flight Origin Date Dest RevCargo;
format Date date9.;
run;
Partial Output
Flights to MCI
Rev
Obs Flight Origin Date Dest Cargo
Exercises
b. Name the data file saircraft. The file should contain the aircraft where the CapTotal value
is less than or equal to 200.
4. Printing the DATA Step File Unsuccessfully
Attempt to print the saircraft data.
Change the name of the output data set to salesnum. Resubmit it, and record the number of
pages and the page size for the data set salesnum.
7-44 Chapter 7 Controlling Data Storage Space
data salesnum;
length Cap1st CapBus CapEcon 3
CapCargo Num1st NumBus NumEcon NumPassTotal
CapPassTotal CargoWeight FltDate 4
Rev1st RevBus RevEcon RevCargo RevTotal 5;
d. Edit the program to create a compressed data set using COMPRESS=BINARY. Change the name
of the output data set to salesbin. Resubmit it, and record the number of pages and page size
for the data set salesbin.
data salesbin (compress = binary);
infile 'sales.dat' missover; /* Windows and UNIX */
* infile '.prog3.rawdata(sales)'; /* Mainframe */
input @1 FlightID $7. @8 RouteID $7.
@15 Origin $3. @18 Dest $3.
@21 DestType $13. @34 FltDate date9.
@43 Cap1st 3. @46 CapBus 3.
@49 CapEcon 3. @52 CapPassTotal 3.
@55 CapCargo 6. @62 Num1st 3.
@64 NumBus 3. @67 NumEcon 3.
@70 NumPassTotal 3. @73 Rev1st 7.
@80 RevBus 7. @87 RevEcon 7.
@94 RevCargo 7. @102 RevTotal 10.
@112 CargoWeight 5.;
run;
SAS Log
318 options fullstimer;
319
320 data _null_;
321 set sales;
322 run;
NOTE: There were 329264 observations read from the data set WORK.SALES.
NOTE: DATA statement used (Total process time):
real time 0.11 seconds
user cpu time 0.07 seconds
system cpu time 0.04 seconds
Memory 153k
323
324 data _null_;
325 set salesnum;
326 run;
NOTE: There were 329264 observations read from the data set WORK.SALESNUM.
NOTE: DATA statement used (Total process time):
real time 0.09 seconds
user cpu time 0.06 seconds
system cpu time 0.04 seconds
Memory 147k
327
328 data _null_;
329 set saleschar;
330 run;
NOTE: There were 329264 observations read from the data set WORK.SALESCHAR.
NOTE: DATA statement used (Total process time):
real time 0.50 seconds
user cpu time 0.40 seconds
system cpu time 0.04 seconds
Memory 153k
331
332 data _null_;
333 set salesbin;
334 run;
NOTE: There were 329264 observations read from the data set WORK.SALESBIN.
NOTE: DATA statement used (Total process time):
real time 0.64 seconds
user cpu time 0.60 seconds
system cpu time 0.02 seconds
Memory 153k
b. Name the data file saircraft. The file should contain the aircraft where the CapTotal value
is less than or equal to 200.
data laircraft saircraft / view = laircraft;
infile air;
input ModelType $15. Model $8. AircraftID $6.
CapFirst 4. CapBusiness 4. CapEconomy 4.
CapTotal 5. CapCargo 6. Range 6.
InServiceDate Date9. LastMaintDate Date9.
CruiseSpeed 6.;
if CapTotal > 200 then output laircraft;
else output saircraft;
run;
4. Printing the DATA Step File Unsuccessfully
Attempt to print the saircraft data.
filename air 'aircraft.dat'; *Windows/UNIX;
* filename air '.prog3.rawdata(aircraft)'; *z/OS;
Printing laircraft automatically executed the compiled code for laircraft. Therefore, the
saircraft file was created.
7-48 Chapter 7 Controlling Data Storage Space
Chapter 8 Utilizing Best Practices to
Improve Efficiency
8.1 Introduction
Objectives
Review best practice techniques.
I/O
disk space
memory
network traffic
4
8-4 Chapter 8 Utilizing Best Practices to Improve Efficiency
Because the CPU performs all the processing that is needed to perform an I/O operation, an option or
technique that reduces the number of I/O operations can also reduce CPU usage.
8.1 Introduction 8-5
Objectives
Use the most efficient technique to perform the following
tasks:
Subset your data by using the subsetting IF statement.
12
13
8-8 Chapter 8 Utilizing Best Practices to Improve Efficiency
data totals;
set ia.sales;
PercentCap =
sum(Num1st,NumEcon,NumBus)/CapPassTotal;
NumNonEconomy = sum(Num1st,NumBus);
CargoKG = CargoWeight*0.454;
Month = month(FltDate);
if PercentCap < 0.8;
run;
14 c08s2d1a
15 c08s2d1b
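Moving the subsetting IF up to the first point where its condition can be evaluated lets rejected observations skip the remaining assignments. The following is a sketch of that rearrangement of the program above, not necessarily the exact course program.

```sas
data totals;
   set ia.sales;
   PercentCap =
      sum(Num1st,NumEcon,NumBus)/CapPassTotal;
   if PercentCap < 0.8;          /* stop here for rejected observations */
   NumNonEconomy = sum(Num1st,NumBus);
   CargoKG = CargoWeight*0.454;
   Month = month(FltDate);
run;
```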
8.2 Executing Only Necessary Statements 8-9
Comparing Techniques
Technique CPU I/O Memory
I. Subsetting IF at Bottom 2.3 1226.0 265.0
II. Subsetting IF near Top 1.3 1226.0 265.0
Percent Difference 42.8 0.0 0.0
16
All of the benchmarks were run on HP-UX 11 (64-bit) in SAS 9.1.3 SP2.
17
8-10 Chapter 8 Utilizing Best Practices to Improve Efficiency
19 c08s2d2b
8.2 Executing Only Necessary Statements 8-11
20 c08s2d2c
21 c08s2d2d
8-12 Chapter 8 Utilizing Best Practices to Improve Efficiency
Comparing Techniques
Technique CPU I/O Memory
I. ALL IF Statements 15.9 6797.0 280.0
II. ELSE-IF Statements 9.7 6797.0 288.0
III. Using a Function Once 3.0 6797.0 272.0
IV. SELECT/WHEN Block 3.0 6795.0 263.0
SELECT statements perform slightly better for a large selection of uniformly distributed numeric values.
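The demonstration programs for these four techniques are not reproduced in these notes. As an illustration only, the sketch below combines the last two ideas: the function is called once and its result is tested in a SELECT group. The output data set name and the Qtr variable are assumptions.

```sas
data quarters;
   set ia.sales;
   Mon = month(FltDate);            /* evaluate the function once */
   select (Mon);
      when (1, 2, 3)    Qtr = 1;
      when (4, 5, 6)    Qtr = 2;
      when (7, 8, 9)    Qtr = 3;
      when (10, 11, 12) Qtr = 4;
      otherwise         Qtr = .;    /* missing or invalid date */
   end;
   drop Mon;
run;
```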
8-14 Chapter 8 Utilizing Best Practices to Improve Efficiency
Objectives
Use the most efficient technique to accomplish the
following tasks:
Create multiple subsets.
26
27
8.3 Eliminating Unnecessary Passes through the Data 8-15
continued...
28 c08s3d1a
29 c08s3d1a
8-16 Chapter 8 Utilizing Best Practices to Improve Efficiency
30 c08s3d1b
Comparing Techniques
Technique CPU I/O Memory
I. Multiple DATA Steps 5.2 1781.0 262.0
II. Single DATA Step 1.3 1774.0 483.0
Percent Difference 74.8 0.4 -84.4
31
The memory increases for the single DATA step because multiple data sets are open in memory for
output.
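A single DATA step that creates several subsets in one pass might look like the sketch below. The destination lists are taken from other examples in this chapter, and the exact subsets used in the benchmark are an assumption.

```sas
data east west;
   set ia.sales;                 /* one pass through the input */
   if Dest in ('RDU','BOS','IAD','JFK','MIA','PWM') then output east;
   else if Dest in ('LAX','SEA','SFO') then output west;
run;
```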
8.3 Eliminating Unnecessary Passes through the Data 8-17
data east;
set ia.sales;
where Dest in
('RDU','BOS','IAD','JFK','MIA','PWM');
run;
proc sort data = east;
by Dest;
run;
32 c08s3d2a
33 c08s3d2b
8-18 Chapter 8 Utilizing Best Practices to Improve Efficiency
Comparing Techniques
Technique CPU I/O Memory
I. DATA/SORT 1.8 3490.0 18199
II. SORT with WHERE 1.4 1745.0 18355
Percent Difference 23.4 50.0 -0.9
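The second technique replaces the DATA step and the sort with one PROC SORT step that subsets as it reads. A minimal sketch, using the same destination list as the DATA step above:

```sas
proc sort data = ia.sales out = east;
   where Dest in
         ('RDU','BOS','IAD','JFK','MIA','PWM');
   by Dest;
run;
```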
34
Business Task
Change the variable attributes in ia.salesc to be
consistent with those in ia.sales.
35
8.3 Eliminating Unnecessary Passes through the Data 8-19
36 c08s3d3b
Comparing Techniques
Technique CPU I/O Memory
I. DATA Step 1.8 9.0 264.0
II. PROC DATASETS 0.1 10.0 173.0
Percent Difference 97.1 -11.1 34.5
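PROC DATASETS changes attributes in the descriptor portion without reading or rewriting the observations, which is why its CPU time is so low. The sketch below shows the general shape. The specific rename, format, and label are hypothetical, because the attribute differences between ia.salesc and ia.sales are not listed here.

```sas
proc datasets lib = ia nolist;
   modify salesc;
   rename FlightNum = FlightID;       /* hypothetical old variable name */
   format FltDate date9.;             /* hypothetical format change     */
   label  Dest = 'Destination';       /* hypothetical label             */
quit;
```

Variable names, formats, informats, and labels can be changed this way. Changing a variable's type or length still requires a DATA step.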
37
8-20 Chapter 8 Utilizing Best Practices to Improve Efficiency
Objectives
Use the most efficient technique to select the following:
observations
variables
39
40
8.4 Reading and Writing Only Essential Data 8-21
Selecting Observations
WHERE Dest = "BWI"
41 ...
Selecting Observations
IF Dest = "BWI"
42 ...
8-22 Chapter 8 Utilizing Best Practices to Improve Efficiency
data west;
set ia.sales;
if Dest in ('LAX','SEA','SFO');
run;
c08s4d1a
data west;
set ia.sales;
where Dest in ('LAX','SEA','SFO');
run;
43 c08s4d1b
Comparing Techniques
Technique CPU I/O Memory
I. Subsetting IF 1.0 429.0 263.0
II. WHERE Statement 0.9 427.0 272.0
Percent Difference 5.1 0.5 -3.4
44
8.4 Reading and Writing Only Essential Data 8-23
Input operations are not affected by the subsetting IF statement, the WHERE statement, or the WHERE=
data set option.
8-24 Chapter 8 Utilizing Best Practices to Improve Efficiency
Reference Information
The WHERE statement and the subsetting IF statement are not equivalent. Although both statements test a
condition to determine whether SAS should process an observation, they differ in the following ways:
• The WHERE statement selects observations before they are brought into the PDV. The subsetting IF
statement works on observations after they are read into the PDV.
• The WHERE statement can produce a different data set than the subsetting IF when a BY statement
accompanies a SET, MERGE, or UPDATE statement.
• When you use the subsetting IF statement with the MERGE statement, SAS selects observations after
the current observations are combined. When you use the WHERE statement, SAS applies the selection
criteria to each input data set before it combines observations.
• The WHERE statement can select observations only from SAS data sets. The subsetting IF statement
can select observations from SAS data sets or from observations created with an INPUT statement, and
its selection criteria can be based on computed variables.
• The WHERE statement cannot be executed conditionally as part of an IF statement, but the subsetting
IF statement can.
If you use the WHERE= data set option and the WHERE statement in the same DATA step, SAS ignores
the WHERE statement for data sets with the WHERE= data set option. There is no significant efficiency
difference between a WHERE statement and a WHERE= data set option on an input data set.
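The differences listed above can be seen in a small sketch. The data set work.flights and its values are hypothetical and exist only for this illustration; the variable names follow ia.sales.

```sas
/* Hypothetical sample data, for illustration only */
data work.flights;
   input Dest $ Num1st NumBus;
   datalines;
LAX 10 20
BWI 5 8
SEA 12 30
;
run;

/* Subsetting IF: the test can use a variable computed in the same step, */
/* because the observation is already in the PDV                         */
data work.busy;
   set work.flights;
   NonEconPass = sum(Num1st, NumBus);
   if NonEconPass > 20;
run;

/* WHERE: the test can use only variables that already exist in the */
/* input data set; it is applied before the observation is read     */
data work.west;
   set work.flights;
   where Dest in ('LAX','SEA');
run;
```

Moving the NonEconPass test into a WHERE statement would fail, because NonEconPass does not exist in work.flights.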
c08s4d2a
Comparing Techniques
Technique                  CPU     I/O      Memory
I. Subsetting at bottom    4.3     433.0    227.0
II. Subsetting higher up   1.4     425.0    243.0
Percent Difference         67.2    1.8      -7.0
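The program behind this comparison is not reproduced in these notes. As a minimal sketch of the idea, assuming the ia libref is assigned and ia.sales contains Dest, Num1st, and NumBus, the subsetting IF is placed directly after the SET statement so that later statements execute only for the observations that are kept:

```sas
data work.totals;
   set ia.sales;
   /* Subset as early as possible: observations that fail the test */
   /* skip every statement below this line                         */
   if Dest in ('LAX','SEA','SFO');
   NonEconPass = sum(Num1st, NumBus);
run;
```

Placing the same IF statement after the assignment would give the same output data set, but the assignment would run for every observation read.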
Subsetting Variables
To subset variables, you can use the following:
DROP and KEEP statements
DROP= and KEEP= data set options
c08s4d3b
data totals;
   set ia.sales(keep = Dest Num1st NumBus);
   NonEconPass = sum(Num1st, NumBus);
run;
c08s4d3c
c08s4d3d
c08s4d3e
Comparing Techniques
Technique                              CPU    I/O     Memory
I. KEEP not used                       2.9    7177    8140
II. KEEP on DATA statement             2.3    656     8138
III. KEEP on SET statement             2.4    1625    8138
IV. KEEP on SET and DATA statements    2.2    662     8138
V. KEEP on SET and PROC statements     2.4    1625    8139
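Technique IV in the table can be sketched as follows. This is an illustration rather than the course program; it assumes the ia libref is assigned, and the choice of output variables is hypothetical.

```sas
/* KEEP= on the SET statement limits the variables read into the PDV. */
/* KEEP= on the DATA statement limits the variables written out.      */
data work.totals(keep = Dest NonEconPass);
   set ia.sales(keep = Dest Num1st NumBus);
   NonEconPass = sum(Num1st, NumBus);
run;
```

Num1st and NumBus are read because the calculation needs them, but they are not written to work.totals.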
c09s4d4b
Comparing Techniques
Technique                  CPU     I/O       Memory
I. Read all fields         4.4     1627.0    219.0
II. Read required fields   1.7     1625.0    215.0
Percent Difference         60.7    0.1       1.8
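The program for this comparison is not shown here. One common way to read only the required fields from raw data is sketched below; the record layout and the in-stream data are hypothetical.

```sas
data work.west;
   /* Read only the field needed for the test and hold the record */
   input @1 Dest $3. @;
   /* Discard unwanted records before reading anything else */
   if Dest in ('LAX','SEA','SFO');
   /* Read the remaining required fields; other fields are never read */
   input @5 Num1st 3. @9 NumBus 3.;
   datalines;
LAX 010 020 other fields that are not needed
BWI 005 008 other fields that are not needed
SEA 012 030 other fields that are not needed
;
run;
```

The trailing @ holds the current record so that the second INPUT statement reads from the same line.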
Conclusions
If the variable is already in a SAS data set, you can use
the following to minimize the volume of data processed:
WHERE statements in DATA and PROC steps
Objectives
Examine available efficiency techniques to do the
following tasks:
access database data
8.5 Networking Efficiency Considerations (Self-Study)
The SAS/ACCESS LIBNAME engine writes native DBMS SQL statements from your SAS statements
and sends them to the DBMS for processing.
The SQL Pass-Through Facility enables you to write native DBMS SQL statements from within the SQL
procedure and pass them directly to the DBMS for processing.
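The two access methods can be contrasted in a short sketch. The engine name (oracle), the connection values, and the DBMS table name customers are assumptions for illustration; substitute the values for your own SAS/ACCESS interface.

```sas
/* SAS/ACCESS LIBNAME engine: SAS generates the DBMS SQL for you */
libname mylib oracle user=myuser password=mypass path=mypath;

proc sql;
   select name, age, state
      from mylib.customers
      where state = 'NC';
quit;

/* SQL Pass-Through Facility: you write the DBMS SQL yourself */
proc sql;
   connect to oracle (user=myuser password=mypass path=mypath);
   select *
      from connection to oracle
         (select name, age, state
             from customers
             where state = 'NC');
   disconnect from oracle;
quit;
```

In the second query, the text inside the inner parentheses is sent to the DBMS as written, so it must be valid SQL for that DBMS.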
The list of aggregate functions that are passed varies by database. See the documentation for the
SAS/ACCESS Interface to your database for a list of aggregate functions that are passed to your database
for processing.
SASTRACE=',,,d'
General form of the SASTRACELOC= option:
SASTRACELOC=stdout | SASLOG
Example:
options sastrace=',,,d' sastraceloc=saslog;
',,,d' specifies that all SQL statements sent to the DBMS are sent to the log. These statements include
the following:
• SELECT
• DELETE
• CREATE
• SYSTEM CATALOG
• DROP
• COMMIT
• INSERT
• ROLLBACK
• UPDATE
There are four possible positional arguments to SASTRACE. The commas in the value for the
SASTRACE option are placeholders for other debugging options. For other values, please see the
SAS documentation.
Threaded Reads
A threaded read retrieves the result set from the database
on multiple connections between SAS and the DBMS.
Threaded reads are accomplished by doing the following:
using the LIBNAME engine
establishing a read connection between the DBMS
and each SAS thread
partitioning the result set across the connections
passing the rows to SAS simultaneously (in parallel)
across the connections
Most, but not all, SAS/ACCESS interfaces support threaded reads in SAS 9.1.
− DMINE, DMREG
Reading Columns
Techniques for limiting the number of columns returned
from the DBMS include the following:
DROP= SAS data set option
Examples:
data temp;
set mylib.table(keep = name age state);
run;
proc sql;
select name, age, state
from mylib.table;
quit;
Reading Columns (diagram): the SAS System turns the DROP= and KEEP= data set options, the VAR statement, and the SAS SELECT clause into a DBMS SELECT clause; the DBMS returns the results.
network traffic
memory requirements
Examples:
data temp;
set mylib.table;
where state in ('NC', 'SC');
run;
proc sql;
select *
from mylib.table
where state in ('NC', 'SC');
quit;
WHERE (diagram): the SAS System sends the WHERE criteria to the DBMS; the DBMS returns the results.
SAS enhancements include functions or operators that are not a part of the native database SQL. The
SASTRACE= system option can help you determine what is passed to the database to process.
proc sql;
select * from mylib.table
order by state;
quit;
Be aware that SAS sorts null values low; most DBMSs sort null values high.
If you specify a BY statement in a DATA or PROC step that references a DBMS data source, it is
recommended for performance reasons that you associate the BY variable (in a DATA or PROC step) with
an indexed DBMS column. If you reference DBMS data in a SAS program and the program includes a
BY statement for a variable that corresponds to a column in the DBMS table, the SAS/ACCESS
LIBNAME engine automatically generates an ORDER BY clause for that variable. The ORDER BY
clause causes the DBMS to sort the data before the DATA or PROC step uses the data in a SAS program.
If the DBMS table is very large, this sorting can adversely affect your performance. Use a BY variable
that is based on an indexed DBMS column in order to reduce this negative impact.
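To see the ORDER BY clause that the engine generates, the BY statement can be combined with the SASTRACE= option described earlier. This sketch assumes that mylib was assigned with a SAS/ACCESS LIBNAME engine and that state is a column in the DBMS table.

```sas
/* Write the SQL that is sent to the DBMS to the SAS log */
options sastrace=',,,d' sastraceloc=saslog;

data temp;
   set mylib.table;
   /* The BY statement causes the engine to request sorted rows */
   by state;
run;
```

The log then shows the SELECT statement passed to the DBMS, which makes it possible to confirm whether the sort is performed on an indexed column.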
SAS/ACCESS Summary
The SAS/ACCESS LIBNAME engine enables transparent
access to your DBMS tables. As much code as possible is
passed behind the scenes by SAS to the DBMS for
processing in order to optimize performance.
The SQL Pass-Through Facility enables the programmer
to control the native DBMS SQL queries that are passed
to the database to execute.
Distributed Processing
Distributed processing can be defined as any one of the
following:
one process (a client or local host) requesting services
or data from another process (a server or remote host)
executing on a different machine
the distribution of computing resources to enable
utilization of data files, hardware resources, and
software resources between different computers
the division of applications into tasks to be performed
on the most appropriate machine, thereby maximizing
all computing resources
Parallel Processing
Parallel processing is the dividing of an application into
subunits of work that can be executed simultaneously.
This parallel processing can occur on the same machine
or different machines.
The purposes of parallel processing (also known as
multiprocessing or asynchronous processing) are to do
the following:
execute independent tasks in parallel (SAS Version 8)
take advantage of each processor on a network of
machines
complete a job in less total elapsed time than it would
take to execute the same job serially
increase usage of underutilized CPUs
– exploit current investment
– prevent further monetary outlay for hardware
Grid Computing
A computing grid is a collection of multiple computers
that solve one application problem.
The concept of grid computing is to tap into the unused
processor cycles of computers hooked up to a network
to solve problems that require a massive amount of
processing power and deal with vast amounts of data.
The idea of grid computing is that any device or computer
could hook into a network and make use of the collective
unused power of every device on the network or grid.
The goal is to use the processing cycles of all computers
in a network for solving problems too intensive for any
stand-alone machine.
Grid computing is not a new concept, but one that has
gained renewed interest recently for at least two reasons:
IT budgets were cut, and grid computing offers
a less expensive alternative to purchasing new, larger
server platforms.
Computing problems in several industries involve
processing large volumes of data and/or performing
repetitive computations to the extent that the workload
requirements exceed existing server platform
capabilities.
Distributed processing using SAS software requires a license for SAS/CONNECT, SAS/SHARE,
or SAS Integration Technologies.
Compute Services
Compute services enable you to move any or all
segments of an application to other processors to take
advantage of hardware, software, and data resources.
Compute Services (diagram): Client (Local) and Server (Remote); labels: Request, SAS Program, Data, Result, Report.
OPTIONS AUTOSIGNON=NO|YES;
The default is NO.
Example:
options autosignon = yes;
RSUBMIT <remote-machine-name>;
   code to be processed on the remote machine
ENDRSUBMIT;
Example:
local SAS session
rsubmit bcom1;
   SAS code to run on remote machine
endrsubmit;
You can transfer SAS files, flat files, and extracts of DBMS tables.
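A minimal sketch that combines compute services with a data transfer is shown below. It assumes that a SAS/CONNECT connection to the remote session bcom1 has already been defined (communications access method and host information); the summary step itself is hypothetical.

```sas
/* Sign on automatically when the first RSUBMIT is processed */
options autosignon = yes;

rsubmit bcom1;
   /* This step runs in the remote SAS session */
   proc means data = sashelp.class noprint;
      var Height;
      output out = work.summary mean = AvgHeight;
   run;

   /* Transfer the small result table to the local Work library */
   proc download data = work.summary out = work.summary;
   run;
endrsubmit;

signoff bcom1;
```

Only the summarized result crosses the network, which is usually cheaper than moving the detail data to the local machine.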
Diagram: the Client (Local) sends a request for records to the Server (Remote).
Benefits of RLS
A single copy of the data can be maintained while
processing is performed on the local machine.
The data appears to be local.
RLS enables updates to remote data as a result of
local processing.
RLS permits a user interface to reside on the local
system while the data is on a remote system.
SERVER= Option
General form of the SERVER= option in the LIBNAME
statement:
LIBNAME libref 'SAS-data-library' | SLIBREF=server-libref SERVER=remote-host;
Examples:
Access a library stored on your user ID on UNIX:
libname rmtunx '/orion/sasdata' server = sdcunx;
Access the Work library on z/OS:
libname rmtwork slibref = work server = sdcmvs;
libref is a libref defined to your local session referencing a remote SAS library.
SAS-data-library is the physical location of the remote SAS library.
server-libref is an existing libref in the server’s session, for example, Work.
remote-host is the same name previously specified with OPTIONS REMOTE=id or the value of
server-ID on the SIGNON statement.
output needs
– printers
– tape drives
– GUI display
Chapter 9 Using the Scalable
Performance Data Engine
(Self-Study)
Objectives
Define the Scalable Performance Data Engine
(SPDE).
Discuss symmetric multiprocessing (SMP) machines.
Compare SPDE tables with Base SAS tables.
− SMP machines
− multiple I/O channels
The SPD Engine is part of Base SAS software and runs on UNIX, Windows, z/OS (zFS file
system only), and OpenVMS Alpha (on ODS-5 file systems only).
An SMP machine is a symmetric multiprocessor machine, which has more than one CPU and a
thread-enabled operating system.
parallel loads
9.1 Introduction to the Scalable Performance Data Engine
The SPD Engine running on an SMP machine provides the capability to read and deliver much more data
to an application in a given elapsed time. When the SPD Engine reads a data file, it launches one or more
threads for each CPU. These threads read data in parallel from multiple disk drives, driven by one or more
controllers.
The exact number of CPUs on an SMP machine varies by manufacturer and model. The operating system
of the machine is also specialized; it must be capable of scheduling code segments so that they execute in
parallel. If the operating system kernel is threaded, performance is further enhanced because it prevents
contention between the executing threads. While threads run on the SMP machine, managed by a threaded
operating system, the available CPUs work together. The synergy between the CPUs and threads enables
the software to scale processing performance.
Although it is not necessary to utilize an SMP machine for SPD Engine data files, it is highly
recommended to achieve maximum performance.
data descriptor
Each of these components can comprise one or more physical files so that the components can span
volumes, but are referenced as one logical file.
Base SAS data file: *.sas7bdat (data descriptor and data)
SPD Engine data set: metadata *.MDF; data partitions *.1.DPF, *.2.DPF, *.3.DPF, *.4.DPF
• When a SAS data file is copied from a base engine library to SPD Engine data storage, the file is split
into a metadata file (*.mdf) and at least one data file (*.dpf). Because of the particular way data is
stored with SPD Engine, several data files (*.1.dpf, *.2.dpf) might also be generated, which splits the
data file into several file segments.
• On UNIX file systems, you can use standard commands, such as ls, to see these files. On Windows
platforms, you can use Windows Explorer to see these files.
It is not recommended that you move SPD Engine data files using operating system commands
because of disk file segmentation.
Base SAS index file: *.sas7bndx
SPD Engine index files: *.HBX (navigational component) and *.IDX
The SPD Engine creates separate index files for each index. For example, if five indexes are defined, the
SAS base engine stores them all in one index file, whereas SPD Engine data storage would contain at least
ten files (an .hbx file and an .idx file for each index), each containing the values of the appropriate index
variable(s).
The navigational component file (.HBX) has each unique value for an index and the data partitions in
which that value can be found. The record identifier component file (.IDX) has pointers to each row in the
table containing the value of the index variable(s).
data area
index area
work area
Objectives
Discuss the LIBNAME statement and the LIBNAME
options.
Create SPDE tables.
Create SPDE indexes.
Windows
libname mylib spde 'c:\workshop\winsas\prog3\meta';
In this example, the index and data components are stored in the same location.
9.2 Creating SPD Engine Tables
LIBNAME libref SPDE 'full-primary-path' <options>;
full-primary-path
is the fully qualified pathname of the primary path for
the SPD Engine library
must be recognized by the operating environment
The metadata for the library must start in the primary path. It can continue in secondary paths
using the METAPATH= option.
DATAPATH=('path1' 'path2'... 'pathn')
UNIX
libname mylib spde '/disk/meta'
datapath = ('/disk1/data'
'/disk2/data'
'/disk3/data');
Windows
libname mylib spde 'c:\workshop\winsas\prog3\meta'
datapath = ('c:\workshop\winsas\prog3\data1'
'c:\workshop\winsas\prog3\data2'
'c:\workshop\winsas\prog3\data3');
For UNIX:
• The metadata is stored in '/disk/meta'.
• The data is stored in '/disk1/data', '/disk2/data', and '/disk3/data'.
• The index is stored in '/disk4/index' and '/disk5/index'.
For Windows:
• The metadata is stored in 'c:\workshop\winsas\prog3\meta'.
• The data is stored in 'c:\workshop\winsas\prog3\data1',
'c:\workshop\winsas\prog3\data2', and 'c:\workshop\winsas\prog3\data3'.
• The index is stored in 'c:\workshop\winsas\prog3\index1' and
'c:\workshop\winsas\prog3\index2'.
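A LIBNAME statement that matches the UNIX locations listed above can be sketched as follows. The paths are the ones named in the list; the directories are assumed to exist already.

```sas
libname mylib spde '/disk/meta'                   /* metadata (primary path) */
        datapath  = ('/disk1/data'                /* data partitions         */
                     '/disk2/data'
                     '/disk3/data')
        indexpath = ('/disk4/index'               /* index component files   */
                     '/disk5/index');
```

The Windows form is the same statement with the c:\workshop\winsas\prog3 paths substituted.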
INDEXPATH=('path1' 'path2'... 'pathn')
APPEND procedure
c09s2d1_unix
The data sets ia.sales, ia.international, and ia.revenue are used as examples.
They are too small to partition well. The data set ia.sales used for demonstrations and
exercises contains fewer observations than the data set ia.sales used for the course notes.
c09s2d1_win
All the data and index files are tied back to the location of the
metadata files by the 3rd segment of the component file name.
Global index for variable Origin:
sales.hbxorigin.c_workshop_winsas_prog3_meta.0.1.spds9
Segmented index for variable Origin:
sales.idxorigin.c_workshop_winsas_prog3_meta.0.1.spds9
When you create an SPD Engine data set, many component files can result. SPD Engine component files
are stored with the following naming conventions:
Metadata files   filename.mdf.0.p#.v#.spds9
Data files       filename.dpf.fuid.p#.v#.spds9
Index files      filename.idxsuffix.fuid.p#.v#.spds9
                 filename.hbxsuffix.fuid.p#.v#.spds9
where
filename     is a valid SAS file name.
mdf          identifies the metadata component file.
dpf          identifies the partitioned data component files.
p#           is the partition number.
v#           is the version number.
fuid         is the unique file ID, which is set to the primary (metadata) path.
idxsuffix    identifies the segmented view of an index, where suffix is the name of the index.
hbxsuffix    identifies the global view of an index, where suffix is the name of the index.
spds9        denotes a SAS®9 SPD Engine component file.
Only the filename portion of the data component names and the suffix portion of the index component
names are user-controllable. SPDE uses these names and the metadata path, partition number, and version
number to build the individual file names.
PARTSIZE=n
Example:
libname mylib spde '/disk/meta'
datapath = ('/disk1/data'
'/disk2/data'
'/disk3/data');
c09s2d2
n is the size of the partition in megabytes. The default is 128. The maximum value is 2047.
See the SPD Engine documentation for additional information on setting an adequate value for the
PARTSIZE= data set option.
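The DATA step that uses the option is not reproduced above. A minimal sketch, assuming mylib is the SPD Engine library from the preceding LIBNAME statement and the ia libref is assigned, might look like this; the value 256 is an arbitrary illustration, not a recommendation.

```sas
/* Create an SPD Engine table whose data partitions are 256 MB each */
data mylib.sales(partsize = 256);
   set ia.sales;
run;
```

The partitions are distributed across the directories named in DATAPATH=.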
ASYNCINDEX=NO|YES
c09s2d3
The SPD Engine spawns a single thread for each index created, and then processes the threads
simultaneously. Although creating indexes in parallel is much faster than creating one index at a time, the
default for this option is NO because parallel creation requires additional utility work space and additional
memory, which might not be available. If the index creation fails due to insufficient resources, set the
MEMSIZE= system option to 0 or increase the size of the utility file space using the SPDEUTILLOC=
system option.
See the SPDE documentation in the SAS OnlineDoc for information about the SPDEUTILLOC= system
option.
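As a sketch of parallel index creation, assuming mylib is an SPD Engine library and ia.sales contains the variables Origin and Dest, the ASYNCINDEX= data set option can be combined with the INDEX= data set option:

```sas
/* Create two simple indexes in parallel while the table is loaded */
data mylib.sales(index = (Origin Dest) asyncindex = yes);
   set ia.sales;
run;
```

With ASYNCINDEX=NO (the default), the same step would build the two indexes one after the other.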
c09s2d4
9.3 Using the SPD Engine Efficiently
Objectives
Investigate the efficiencies of the SPD Engine.
c09s3d1
You can suppress the use of indexes for BY-group processing by using the SPDSNIDX=YES
macro variable or the NOINDEX = YES data set option.
All SPD Engine macro variable values of NO|YES must be typed in uppercase.
These efficiencies apply to both WHERE statements and WHERE= data set options.
The WHERE evaluation planner included in the SPD Engine chooses the best method to use to evaluate
WHERE expressions that use indexes.
The SPD Engine can return some query results without reading the data. An example of such a query is
shown below:
proc sql;
select origin, count(*)
from mylib.sales
group by origin;
quit;
The SPD Engine checks the HBX index component to locate the distinct values of origin. It then goes to
the IDX index component to count the rows for each value of origin. The actual mylib.sales data set
never has to be opened; only the index files for the mylib.sales data set are opened.
The Base SAS Engine would need to read the entire mylib.sales data set to find the count for each
value of origin.
Reference Information
BYSORT=      specifies for the SPD Engine to perform an automatic implicit sort when it encounters
             a BY statement.
DATAPATH=    specifies a list of paths in which to store data partitions (.dpf) for an SPD Engine data
             set.
ENDOBS=      specifies the end observation number in a user-defined range of observations to be
             processed.
INDEXPATH=   specifies a path or list of paths in which to store the two index component files (.hbx
             and .idx) associated with an SPD Engine data set.
METAPATH=    specifies a list of overflow paths to store metadata (.mdf) component files for an SPD
             Engine data set.
PARTSIZE=    specifies, when an SPD Engine data set is created, the size (in megabytes) that the data
             component partitions must be. This is a fixed-length size. This specification applies
             only to partitions in the data component files.
STARTOBS=    specifies the starting observation number in a user-defined range of observations to be
             processed.
TEMP=        specifies to store the library in a temporary subdirectory of the primary directory.
Chapter 10 Additional Topics
(Self-Study)
Objectives
Use the MODIFY statement in a DATA step
to update a data set in place.
Use a transaction data set to make modifications
to a SAS data set.
Use the KEY= option with the MODIFY statement
to make modifications to a SAS data set.
Business Task
International Airlines decided to give passengers more
leg room, so they want to decrease the number of seats
for business and economy classes.
First Capacity Capacity
Class Business Economy
14 27
30 154
163
data ia.capacity;
set ia.capacity;
*or modify ia.capacity;
CapEcon = int(CapEcon * .95);
CapBusiness = int(CapBusiness * .90);
run;
c10s1d1
Implied Output
DATA SAS-data-set;
   MODIFY SAS-data-set;
   existing-variable = expression;
RUN;
The name of the data set on the DATA and MODIFY statements must match.
10.1 Modifying SAS Data Sets in Place
Implied Replace
data ia.capacity;
modify ia.capacity;
CapEcon = int(CapEcon * .95);
CapBusiness = int(CapBusiness * .90);
run;
c10s1d1
If the system terminates abnormally while a DATA step that is using the MODIFY statement is
processing, you can lose data and possibly damage your master data set. You can recover from the failure
by doing one of the following:
• restoring the master file from a backup and restarting the step
• keeping an audit file and using this file to determine which master observations were updated
• creating generations of SAS data sets
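As one illustration of the audit file option in the list above, an audit trail can be started with the DATASETS procedure before the data set is modified. This is a sketch that assumes the ia libref is assigned and that you have update access to ia.capacity.

```sas
/* Start an audit trail for ia.capacity before modifying it in place */
proc datasets library = ia nolist;
   audit capacity;
   initiate;
run;
quit;
```

After the audit trail is initiated, later updates to ia.capacity are logged in the audit file, which can be used to determine which master observations were updated.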
Execution of the DATA step with the MODIFY statement (slide sequence):
1. Reads an observation.
Implied Replace
PDV variables: FlightID RouteID Origin Dest Cap1st CapBusiness CapEcon
DATA SAS-data-set;
   MODIFY SAS-data-set transaction-data-set;
   BY key-variable;
RUN;
c10s1d2
When you use the MODIFY statement to update a data set, the following conditions might occur:
• If a variable has a missing value in the transaction data set, the corresponding master value is not
changed by default.
• If duplicate values of the BY variable exist in the master data set, only the first observation of the group
is updated.
• If multiple transactions exist for one master observation, all transactions are applied in order.
The MODIFY statement locates the matching observation in the master data set by using dynamic
WHERE processing.
data ia.capacity;
   modify ia.capacity ia.newrtnum;
   by FlightID;
run;

Master (ia.capacity)
FlightID  RouteID  Origin  Dest  Cap1st  CapBusiness  CapEcon
IA00100   0000001  RDU     LHR   14      27           154
IA00201   0000002  LHR     RDU   14      27           154
IA00300   0000003  RDU     FRA   14      27           154
IA00400   0000004  FRA     RDU   14      27           154
IA00500   0000005  RDU     JFK   16      .            238

Transaction (ia.newrtnum) variables: FlightID RouteID Origin Dest

Steps shown on the slides:
3. Applies a dynamic WHERE statement to the master data set. Reads an observation from the master
data set into the PDV.
5. Rewrites the observation back to the master data set in the same location.

PDV after the transaction values are applied:
FlightID  RouteID  Origin  Dest  Cap1st  CapBusiness  CapEcon
IA00500   0000035  RDU     JFK   16      .            238

In the master data set, RouteID for FlightID IA00500 changes from 0000005 to 0000035.
Partial Output
proc print data = ia.capacity(obs = 5);
title 'Using a Transaction Data Set for Modifications';
run;
c10s1d2
Business Task
The cargo figures for 1999 are stored in ia.cargo99,
which has a composite index named FlghtDte
consisting of FlightID and Date.
ia.cargo99
FlightID RouteID Origin Dest CapCargo Date CargoWgt CargoRev
Business Task
An accountant discovered that some of the figures are
incorrect. You must modify the cargo data to correct the
figures. The correct cargo numbers are stored in
ia.newcgnum.
ia.newcgnum
FlightID RouteID Origin Dest CapCargo Date CargoWgt CargoRev
c10s1d3
When you use an index with the MODIFY statement, these situations occur:
• The index named in the KEY= option can be a simple or composite index.
• You must explicitly specify the update you want to occur. No automatic overlay of nonmissing
transaction values occurs as it does with the MODIFY/BY method.
• The data set you are updating must have an index on the key variable. (Data views or sequential
libraries, for example, cannot be processed.)
• Each transaction must have a matching observation in the master data set. If you have multiple
transactions for one master observation, only the first transaction is applied. The others generate
runtime errors and terminate the DATA step (unless you use the UNIQUE option, which is discussed in
this section).
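The points above can be sketched for the cargo example. This is an illustration under stated assumptions, not the course program: the New* variable names are hypothetical, and the step assumes that ia.newcgnum contains FlightID and Date so that the composite index FlghtDte can be used for the lookup.

```sas
data ia.cargo99;
   /* Read a transaction; rename its values so that they are not */
   /* overwritten when the master observation is read            */
   set ia.newcgnum(rename = (CapCargo = NewCapCargo
                             CargoWgt = NewCargoWgt
                             CargoRev = NewCargoRev));
   /* Locate the matching master observation through the index */
   modify ia.cargo99 key = FlghtDte;
   if _iorc_ = 0 then do;
      /* Match found: state the update explicitly */
      CapCargo = NewCapCargo;
      CargoWgt = NewCargoWgt;
      CargoRev = NewCargoRev;
      replace;
   end;
   else do;
      /* No match: report it and reset the error flag */
      put 'No matching observation for ' FlightID= Date= date9.;
      _error_ = 0;
   end;
run;
```

Checking _IORC_ is one way to keep a non-matching transaction from ending the step with an error. The renamed variables are used only in the PDV, because MODIFY does not add variables to the master data set.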
ia.cargo99
FlightID RouteID Origin Dest CapCargo Date CargoWgt CargoRev

1. The SET statement reads an observation from the transaction data set into the PDV.
2. The KEY= option uses the FlghtDte index to locate an observation in the master data set.
3. The MODIFY statement reads the observation in the master data set using the index and writes values
to the PDV.

Because CargoWgt was assigned a missing value using an assignment statement, the missing
value replaces the original data in the master data set.

Values shown on the slides for FlightID IA00101, RouteID 0000001, Date 01JAN1999:
Transaction: CapCargo 82400, CargoWgt . (missing), CargoRev 121879.9
Master values shown before the update: 48000 and 117600
Exercises
This is a backup copy of the data in case your program must be submitted multiple times as
you test and debug.
2. Modifying All Observations in a SAS Data Set
Give all the employees in the empdata SAS data set a 5% salary increase using the MODIFY
statement. Print the data before and after the increase.
Partial Output
Original Data
Obs Division HireDate LastName FirstName
Obs Country Location Phone EmpID JobCode Salary
Partial Output
Modified Data
Obs Division HireDate LastName FirstName
Obs Country Location Phone EmpID JobCode Salary
Partial Output
Modified Data
Obs EmpID Phone JobCode Division Salary
4. Modifying a SAS Data Set Using a Transaction Data Set and an Index
Use the transaction data set ia.empdatu2 to modify the empdata SAS data set by the employee
ID number. Use the index on the empdata SAS data set. Modify the variables LastName,
Location, and Salary. Print the data set before and after the changes.
Partial Output
Modified Data
Reference Information
Missing Values
The MODIFY statement with a BY statement enables you to specify how missing values in the
transaction data set are handled by using the UPDATEMODE= option in the MODIFY statement.
The default is MISSINGCHECK. When MISSINGCHECK is in effect, SAS checks for missing data in
the transaction data set and does not replace the data in the master data set with missing values unless
they are special missing values.
NOMISSINGCHECK does not check for missing values in the transaction data set and enables missing
values in the transaction data set to replace the values in the master data set. Special missing values in the
transaction data set still replace values in the master data set.
Example:
modify sasdata.payroll sasdata.update1
updatemode = nomissingcheck;
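The fragment above can be placed in a complete DATA step. The following is a minimal sketch, not the course's own program: the tables work.payroll and work.update1 and the variables EmpID and Salary are hypothetical stand-ins chosen only to make the step runnable.

```sas
/* Hypothetical master table */
data work.payroll;
   input EmpID $ Salary;
   datalines;
E001 50000
E002 62000
;
run;

/* Hypothetical transaction table; Salary is missing for E001 */
data work.update1;
   input EmpID $ Salary;
   datalines;
E001 .
E002 65000
;
run;

/* NOMISSINGCHECK allows the missing Salary for E001 */
/* to overwrite the value stored in the master table */
data work.payroll;
   modify work.payroll work.update1
          updatemode = nomissingcheck;
   by EmpID;
run;

proc print data = work.payroll;
run;
```

With the default MISSINGCHECK, the same step would be expected to leave the Salary for E001 unchanged.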
Duplicate Values
If there are duplicates in either MASTER or TRANSACTION:
data master;
set transaction;
modify master key = id;
x = y;
run;
EXAMPLE 1: If there are contiguous duplications in transaction, each of which has a match in
master, then SAS performs a one-to-one update.
EXAMPLE 2: If there are contiguous duplications in transaction, some of which do not have a
match in master, then SAS performs a one-to-one update until it finds a non-match. At
that time, SAS encounters a run-time error.
You can specify the UNIQUE argument with the KEY= option in the MODIFY statement to perform the
following tasks:
• apply multiple transactions to one master observation
• identify that each observation in the master data set contains a unique value of the index variable(s)
For example:
data master;
set transaction;
modify master key = id/unique;
x = y;
run;
EXAMPLE 3: If there are noncontiguous duplications in transaction, then SAS updates the first
observation in master. This is the same action as if the UNIQUE option were used.
EXAMPLE 4: If there are contiguous duplications in transaction and the UNIQUE option is used,
then SAS updates the first observation in master.
MNEMONIC MEANING
_DSENMR The observation in the transaction data set does not exist in the
master data set. Used with the MODIFY statement with a BY
statement.
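As an illustration of how a mnemonic such as _DSENMR is typically tested, the sketch below checks the automatic variable _IORC_ with the %SYSRC autocall macro in a MODIFY step that uses a BY statement. The tables work.master and work.transaction and the variables ID and X are hypothetical; appending unmatched transactions is one possible policy, not the only one.

```sas
/* Hypothetical master and transaction tables */
data work.master;
   input ID $ X;
   datalines;
A 1
B 2
;
run;

data work.transaction;
   input ID $ X;
   datalines;
B 20
C 30
;
run;

data work.master;
   modify work.master work.transaction;
   by ID;
   select (_iorc_);
      when (%sysrc(_sok)) replace;      /* match found: update in place      */
      when (%sysrc(_dsenmr)) do;        /* transaction has no master match   */
         _error_ = 0;                   /* clear the error flag              */
         output;                        /* append the transaction to master  */
      end;
      otherwise do;                     /* any other I/O condition           */
         put 'Unexpected I/O condition: ' _iorc_=;
         stop;
      end;
   end;
run;
```

Resetting _ERROR_ keeps SAS from dumping the PDV to the log for a condition that the program has already handled.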
Objectives
Introduce the terminology for generation data sets.
Create generations of a SAS data set.
Process generations of a SAS data set.
ia.year2005#001 (Quarter 1)
ia.year2005#002 (Quarter 1 and Quarter 2)
ia.year2005#003 (Quarter 1, Quarter 2, and Quarter 3)
ia.year2005 (Quarter 1, Quarter 2, Quarter 3, and Quarter 4)
The SAS Scalable Performance Data Engine and OpenVMS do not support generation data sets.
10.2 Creating Generation Data Sets 10-31
No Generations (Default)
data a;
set a;
run;
By default, as the SAS data set a is replaced, there are two copies of a in the SAS data library.
When the DATA step completes execution, SAS removes the original copy of the data set a from the data
library.
data a;
set a;
run;
a#001   Historical version
a       Current version (base version)
When the DATA step completes execution, SAS keeps the original copy of the SAS data set a in the data
library and renames it.
New versions are created only when a data set is replaced, not when it is modified in place.
Terms to Know
Generation group
the group of files that represents a series of
replacement data sets. The generation group consists
of the base version and a set of historical versions of
a file.
Version
any one of the files in a generation group
Base version
the most recently created version of a file
Historical versions
all the versions of a file in the generation group except
the base version
Youngest version
the version that is chronologically closest to the base
version
Oldest version
the oldest version in a generation group
When the number of created generations exceeds the value of the GENMAX= option, the oldest
versions age off. When this happens, the oldest version is not the first version that was created.
The dictionary.tables file does not include information about generation data sets.
Example
Create a SAS data set with a maximum of four versions.
proc datasets lib = ia nolist;
modify year2005 (genmax = 4);
run;
quit;
The GENMAX= option can be specified in the same way as a regular data set option.
data ia.year2005(genmax = 4);
data-step-syntax
run;
Data Set Name      Absolute Generation   Relative Generation
ia.year2005        1                     0

After the first replacement:
Data Set Name      Absolute Generation   Relative Generation
ia.year2005        2                     0
ia.year2005#001    1                     -1
The original data set is renamed as ia.year2005#001. The relative generation number is reassigned
as –1.
The absolute generation number is a permanent attribute of the data set, stored in the descriptor portion.
After the second replacement:
Data Set Name      Absolute Generation   Relative Generation
ia.year2005        3                     0
ia.year2005#002    2                     -1
ia.year2005#001    1                     -2
After the third replacement:
Data Set Name      Absolute Generation   Relative Generation
ia.year2005        4                     0
ia.year2005#003    3                     -1
ia.year2005#002    2                     -2
ia.year2005#001    1                     -3
The third copy of ia.year2005 [ia.year2005#003] is assigned a relative generation number of –1.
After the fourth replacement:
Data Set Name      Absolute Generation   Relative Generation
ia.year2005        5                     0
ia.year2005#004    4                     -1
ia.year2005#003    3                     -2
ia.year2005#002    2                     -3
ia.year2005#001    (deleted)
The fourth copy of ia.year2005 [ia.year2005#004] is assigned a relative generation number of –1.
The third copy of ia.year2005 [ia.year2005#003] is assigned a relative generation number of –2.
The second copy of ia.year2005 [ia.year2005#002] is assigned a relative generation number of –3.
The first version of ia.year2005 [ia.year2005#001] is deleted.
The NODS option suppresses printing the contents of individual files when you specify _ALL_ in
the DATA= option. The CONTENTS statement prints only the SAS data library directory.
Partial Output
Contents of the Current Version of ia.year2005
Sort Information
Sortedby Date
Validated YES
Character Set ANSI
GENNUM= Option
For example,
GENNUM = -1 refers to the youngest version.
GENNUM = 0 refers to the current version.
GENNUM = 1 refers to the first version created.
As new generations are created, the absolute generation
number increases sequentially.
As older generations are deleted, the absolute generation
numbers are retired.
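The GENNUM= values listed above can be tried on a small table. This is a minimal sketch under stated assumptions: work.sales and its variables are hypothetical, and the table is replaced only once, so GENNUM=-1 and GENNUM=1 both resolve to work.sales#001.

```sas
/* Hypothetical table that keeps up to three versions */
data work.sales(genmax = 3);
   Quarter = 1;
   Revenue = 100;
run;

/* Replacing the table turns the first copy into work.sales#001 */
data work.sales;
   set work.sales;
   Revenue = Revenue + 20;
run;

proc print data = work.sales(gennum = 0);    /* current (base) version        */
run;

proc print data = work.sales(gennum = -1);   /* youngest historical version   */
run;

proc print data = work.sales(gennum = 1);    /* absolute generation number 1  */
run;
```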
1 $223,134 . 01JAN2005
2 $214,236 $969,241 02JAN2005
3 $213,864 $942,459 03JAN2005
4 $226,276 $958,295 04JAN2005
5 $227,258 $982,329 05JAN2005
Reference Information
HIST is a keyword for the GENNUM= option in the PROC DATASETS DELETE statement that
refers to all generations (excludes the base name).
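A minimal sketch of the HIST keyword follows. The table work.sales2005 is a hypothetical stand-in created here only so that a historical version exists to delete; the base version is expected to remain after the step.

```sas
/* Hypothetical table with generations enabled */
data work.sales2005(genmax = 3);
   Revenue = 100;
run;

/* One replacement creates the historical version work.sales2005#001 */
data work.sales2005;
   set work.sales2005;
   Revenue = Revenue + 20;
run;

/* Delete the historical versions only; the base version is kept */
proc datasets library = work nolist;
   delete sales2005(gennum = HIST);
run;
quit;
```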
To delete all of the SAS data sets in a generation group:
proc datasets library = ia;
delete sales2005(gennum = ALL);
run;
ALL is a keyword for the GENNUM= option in the PROC DATASETS DELETE statement that
refers to the base name and all generations.
Exercises
a. Use the ia.y200061 and ia.y200062 data sets to concatenate to ia.jobhstry and test
your program.
b. Use PROC DATASETS to look at the generation information for ia.jobhstry.
Partial Output
Directory
Libref IA
Engine V9
Physical Name c:\workshop\winsas\prog3
File Name c:\workshop\winsas\prog3
Output
The DATASETS Procedure
2 Job1 Char 6
3 Job2 Char 6
4 Job3 Char 8
1 LastName Char 25
Objectives
Define integrity constraints.
Determine the available types of integrity constraints.
Describe the benefits of integrity constraints.
Create integrity constraints.
69
Business Task
The data set ia.capinfo is updated frequently and
data errors are prevalent.
70
10.3 Creating Integrity Constraints 10-51
Integrity Constraints
You can create integrity constraints on the data to
accomplish the following:
preserve the consistency and correctness of data
71
Integrity constraints are rules that SAS data set modifications must follow to guarantee the validity of
data. Integrity constraints apply only when data values are modified in place, not when the table is
replaced.
Techniques for modifying data in place include the following:
• Viewtable window
• FSVIEW window
• FSEDIT window
• DATA step with the MODIFY statement
• PROC SQL with the INSERT INTO, DELETE FROM, or UPDATE statements or the SET statement
• PROC APPEND
You can create integrity constraints for tables containing no rows, one row, or many rows.
NOT NULL
   guarantees that corresponding columns have non-missing values in each row.
CHECK
   ensures that the values in a column fall within a specific set or range. It can also check the
   validity of a value in one column based on a value in another column within the same row.
UNIQUE
   enforces uniqueness for the value of a column. DISTINCT is an alias for UNIQUE.
PRIMARY KEY
   uniquely defines a row within a table. There can be at most one primary key, based on one column
   or a set of columns. The primary key includes the NOT NULL and UNIQUE attributes.
FOREIGN KEY
   links one or more rows in a table to a specific row in another table by matching a column or set
   of columns in one table with the primary key in another table. This parent/child relationship
   limits modifications made to both primary and foreign keys. The only acceptable values for a
   foreign key are values of the primary key or missing values.
If the table contains data, all data values are checked to determine whether they satisfy the
constraint before the constraint is added.
Business Task
You must put integrity constraints on the data so that the
following conditions are met:
The route ID number is both unique and required. (PRIMARY KEY)
Capacity for first class passengers is less than
capacity for business passengers. (CHECK)
For the UNIQUE constraint and the PRIMARY KEY constraint, SAS builds unique indexes on the
column(s) involved if an appropriate index does not already exist. Any index created by an integrity
constraint can be used for other purposes, such as WHERE processing or the KEY= option in a SET
statement.
Such an index cannot be removed through ordinary index deletion methods, because it is owned by the
constraint.
CHECK Constraint
First Class Capacity must be less than Business Capacity.
Constraint: Cap1st < CapBusiness or CapBusiness = .;
Edit Cap1st for these selected rows.
[Figure: selected rows of ia.capinfo]
PROC SQL can assign constraints in the CREATE TABLE and ALTER TABLE statements.
PROC DATASETS can only assign constraints to an existing table.
PROC DATASETS uses a WHERE= data set option for the CHECK constraint.
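The PROC DATASETS form can be sketched with the IC CREATE statement. This is an illustration, not the course's demonstration program: it reuses the constraint names PKIDInfo and Class1 from these notes but applies them to a small hypothetical copy of the table, work.capinfo, so that the step can run on its own.

```sas
/* Hypothetical stand-in for ia.capinfo */
data work.capinfo;
   length RouteID $ 7;
   input RouteID $ Cap1st CapBusiness;
   datalines;
0000001 14 30
0000002 16 .
;
run;

proc datasets lib = work nolist;
   modify capinfo;
   /* PRIMARY KEY: RouteID must be unique and non-missing */
   ic create PKIDInfo = primary key (RouteID)
      message = 'You must supply a Route ID Number';
   /* CHECK: the rule is written as a WHERE= expression */
   ic create Class1 = check
      (where = (Cap1st < CapBusiness or CapBusiness = .))
      message = 'First Class Capacity must be less than Business Capacity';
run;
   /* List the constraints that are now attached to the table */
   contents data = capinfo;
run;
quit;
```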
Output
The DATASETS Procedure
Integrity Where
# Constraint Type Variables Clause
User
# Message
# of
Unique Owned Unique
# Index Option by IC Values
proc sql;
   alter table ia.capinfo
      add constraint PKIDInfo Primary Key (RouteID)
          message = 'You must supply a Route ID Number'
      add constraint Class1 check
          (Cap1st < CapBusiness or
           CapBusiness = .)
          message = 'First Class Capacity must be less than Business Capacity';
   describe table constraints ia.capinfo;
quit;
Log
53 proc sql;
54 alter table capinfo
55 add constraint PKIDInfo Primary Key (RouteID)
66 message = 'You must supply a Route ID Number'
57 add constraint Class1 check
58 (Cap1st < CapBusiness or
59 CapBusiness = .)
70 message = 'First Class Capacity must be less than Business
70 ! Capacity';
NOTE: Table WORK.CAPINFO has been modified, with 7 columns.
60 describe table constraints capinfo;
NOTE: SQL table WORK.CAPINFO ( bufsize=4096 ) has the following
integrity constraint(s):
Integrity Where
# Constraint Type Variables Clause
----------------------------------------------------------------
1 Class1 Check (Cap1st<CapBusiness)
or (CapBusiness=.)
2 PKIDInfo Primary Key RouteID
User
# Message
-----------------------------------------------------------
1 First Class Capacity must be less than Business Capacity
2 You must supply a Route ID Number
61 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.55 seconds
cpu time 0.07 seconds
See the SAS documentation for additional information about maintaining integrity constraints using
PROC SQL.
PROC SQL;
   DESCRIBE TABLE CONSTRAINTS table-name;
PROC CONTENTS DATA=libref.dataname;
RUN;
The DESCRIBE statement in PROC SQL prints the report in the Log window.
Business Task
The data set ia.cap2000 contains information about
every flight in 2000.
You need to ensure that an added route ID number
is valid and that it is one of the route ID numbers
in the data set ia.capinfo.
RouteIDNumber is Foreign Key.
ON UPDATE RESTRICT and ON DELETE RESTRICT are the defaults for foreign keys.
Referential constraints are defined in the child tables.
The requirements for establishing a referential relationship are as follows:
• The primary key and foreign key must reference the same number of variables, and the variables must
be in the same order.
• The variables must be of the same type (character or numeric) and length.
• If the foreign key is added to a data file that already contains data, the data values in the foreign key
data file must match existing values in the primary key data file or be null.
The foreign key data file can exist in the same SAS library as the referenced primary key data file (intra-
libref) or in different SAS libraries (inter-libref). However, if the library that contains the foreign key data
file is temporary, then the library containing the primary key data file must be temporary as well. In
addition, referential integrity constraints cannot be assigned to data files in concatenated libraries.
There is no limit to the number of foreign keys that can reference a primary key. However, additional
foreign keys can adversely impact the performance of update and delete operations.
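The requirements above can be illustrated with a small parent/child pair. This is a sketch under stated assumptions: work.capinfo and work.cap2000 are hypothetical stand-ins for the ia tables, both placed in WORK so that the temporary-library rule is satisfied, and the final INSERT is expected to be rejected with an error in the log.

```sas
/* Hypothetical parent table */
data work.capinfo;
   length RouteID $ 7;
   input RouteID $;
   datalines;
0000001
0000045
;
run;

/* Hypothetical child table; RouteID has the same type and length */
data work.cap2000;
   length FlightID $ 7 RouteID $ 7;
   input FlightID $ RouteID $;
   datalines;
IA00100 0000001
IA00201 0000045
;
run;

proc sql;
   /* The parent needs a primary key before a foreign key can reference it */
   alter table work.capinfo
      add constraint PKIDInfo primary key (RouteID);
   /* The referential constraint is defined in the child table */
   alter table work.cap2000
      add constraint FKRoute foreign key (RouteID)
          references work.capinfo
          on delete restrict on update restrict;
   /* Expected to fail: 0000145 is not a primary key value in work.capinfo */
   insert into work.cap2000
      set FlightID = 'IA00300', RouteID = '0000145';
quit;
```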
You want to add the route number 0000145 to the child table, ia.cap2000. The parent table,
ia.capinfo, is checked to see if route number 0000145 exists.
If route number 0000145 does not exist in ia.capinfo, 0000145 is not added to the data set
ia.cap2000.
In order to add 0000145 to the data set ia.cap2000, the value 0000145 must first be added to
ia.capinfo.
Reference Information
To drop a constraint, use the DROP CONSTRAINT clause of the ALTER TABLE statement in PROC
SQL or the IC DELETE statement in PROC DATASETS.
proc sql;
alter table ia.cap2000
drop constraint FKRoute;
alter table ia.capinfo
drop constraint PKIDInfo
drop constraint Class1;
quit;
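The PROC DATASETS alternative uses the IC DELETE statement. The sketch below is an assumption-laden illustration: it builds a small hypothetical table, work.capinfo, adds two constraints, and then removes them. When a foreign key references the primary key, the foreign key would need to be dropped first.

```sas
/* Hypothetical table with two constraints to remove */
data work.capinfo;
   length RouteID $ 7;
   input RouteID $ Cap1st CapBusiness;
   datalines;
0000001 14 30
;
run;

proc datasets lib = work nolist;
   modify capinfo;
   ic create PKIDInfo = primary key (RouteID);
   ic create Class1 = check (where = (Cap1st < CapBusiness or CapBusiness = .));
run;
   /* Remove both constraints by name */
   modify capinfo;
   ic delete PKIDInfo Class1;
run;
quit;
```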
Exercises
Objectives
Determine what an audit trail file is.
Examine the columns in an audit trail file.
Initiate an audit trail file.
Add values to the audit trail file.
Report on an audit trail file.
Manage an audit trail file.
94
Business Task
You must monitor the updates for the data set
ia.capinfo.
Creating an audit trail file enables you to document the
following:
Who?
What?
When?
Audit Trail
The audit trail is an optional SAS file that logs
modifications to a SAS table.
For each addition, deletion, and update to the data,
the audit file stores information about the following:
who made the modification
96
The MODIFY statement is one method with which to modify a SAS table. When a MODIFY statement is
used, integrity constraints are checked and edits are recorded in an audit trail.
10.4 Creating and Using Audit Trails 10-71
read-only
97
• The audit trail file must reside in the same SAS data library as the data file associated with it.
• A SAS table can have, at most, one audit file.
• Procedures such as PRINT, TABULATE, and FREQ can read audit trail files using the TYPE= data set
option.
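Reading an audit trail with the TYPE= data set option can be sketched as follows. The table work.cap2000 and its variables are hypothetical stand-ins; the step initiates an audit trail, makes one in-place update, and then prints the audit file.

```sas
/* Hypothetical table to audit */
data work.cap2000;
   length FlightID $ 7;
   input FlightID $ Cap1st;
   datalines;
IA00100 12
;
run;

/* Start the audit trail */
proc datasets lib = work nolist;
   audit cap2000;
   initiate;
run;
quit;

/* An in-place update, which the audit trail records */
data work.cap2000;
   modify work.cap2000;
   Cap1st = Cap1st + 2;
run;

/* TYPE=AUDIT reads the audit file instead of the data file */
proc print data = work.cap2000(type = audit);
   var FlightID Cap1st _atopcode_ _atuserid_ _atdatetime_;
run;
```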
For the _AT*_ variables, the asterisk is replaced by a specific string, such as DATETIME.
USER_VAR variables are optional. They supplement the information automatically recorded in the
_AT*_ variables.
_AT*_ Variables
_AT*_ Variable Description
99
By default, SAS logs all _ATOPCODE_ codes. You can change this behavior when you initiate an audit
trail.
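As a sketch of changing the logging behavior at initiation, the LOG statement can switch individual record images off, and the USER_VAR statement can define user variables in the same step. The table work.cap2000 is a hypothetical stand-in; the user variables who and why follow the names used elsewhere in these notes, with lengths and labels chosen for illustration.

```sas
/* Hypothetical table to audit */
data work.cap2000;
   length FlightID $ 7;
   input FlightID $ Cap1st;
   datalines;
IA00100 12
;
run;

proc datasets lib = work nolist;
   audit cap2000;
   initiate;
      /* Do not log before-update (DR) record images */
      log before_image = no;
      /* Optional user variables stored in the audit file */
      user_var who $ 20 label = 'Who made the change'
               why $ 20 label = 'Why the change was made';
run;
quit;
```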
_ATOPCODE_ Values
Code Event
DA Added data record image
DD Deleted data record image
DR Before-update record image
DW After-update record image
EA Observation add failed
ED Observation delete failed
EU Observation update failed
100
User Variables
User variables have the following characteristics:
defined as part of the audit trail specification
proc sql;
insert into ia.cap2000
set FlightID = 'IA00040',
RouteID = '0000100',
Origin = 'CDG',
Dest = 'LHR',
Cap1st = 12,
CapBusiness = 20,
CapEcon = 120,
Date = '03JUN2000'd,
who = 'Administrator',
why = 'New Flight';
quit;
• The TERMINATE statement deletes the audit file. Do not delete the audit file using operating system
methods because this can damage the SAS data file.
• To stop auditing without deleting the audit file, use the SUSPEND statement.
• To resume auditing after a suspension, use the RESUME statement.
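The SUSPEND and TERMINATE statements described above can be sketched in one PROC DATASETS step. The table work.cap2000 is a hypothetical stand-in created here so that an audit trail exists to suspend and then terminate.

```sas
/* Hypothetical table to audit */
data work.cap2000;
   length FlightID $ 7;
   input FlightID $ Cap1st;
   datalines;
IA00100 12
;
run;

proc datasets lib = work nolist;
   audit cap2000;
   initiate;      /* create the audit file and start logging      */
run;
   audit cap2000;
   suspend;       /* stop logging but keep the audit file          */
run;
   audit cap2000;
   terminate;     /* stop logging and delete the audit file        */
run;
quit;
```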
Output
Audit Trail for ia.cap2000
Flight Cap
Obs ID RouteID Origin Dest Cap1st Business CapEcon
1 2001 . saswjr DA
USER_VAR variables are unique in SAS in that they are stored in one file (for example, the audit file)
and opened for update in another (for example, the data file).
When the data file is opened for update, the USER_VAR variables appear, and you can edit them as
though they were part of the data file.
To resume an audit:
proc datasets lib = ia;
audit cap2000;
resume;
run;
quit;
Exercises
Objectives
Describe Perl regular expressions and
metacharacters.
Use pattern matching to validate data.
Use pattern matching to replace text.
111
is documented at www.perldoc.com
special characters
number of matches
capture buffers
10.5 Working with Perl Regular Expressions 10-85
PRXPARSE(Perl-regular-expression)
Examples:
re=prxparse('m/boat/');
re=prxparse('s/boat/ship/');
If the Perl regular expression is a constant or if it uses the /o option, the Perl regular expression is compiled
only once. Successive calls to PRXPARSE do not cause a recompile, but return the identifier that was
already compiled. This behavior simplifies the code because you do not need to use an initialization block
(IF _N_ = 1) to initialize Perl regular expressions.
PRXMATCH(Perl-regular-expression, source)
Perl-regular-expression
   specifies the character pattern for which to search.
source
   specifies the string to be searched.
data Invalidssn;
retain re;
set ia.Staff;
if _n_ = 1 then
re = prxparse('/\d{3}-\d{2}-\d{4}/');
if prxmatch(re, ssn) = 0;
run;
proc print data=Invalidssn;
title 'Invalid Social Security Numbers';
var Name SSN;
run;
Equivalent code:
The LIKE operator would select 364-9A-7412 as a valid SSN because it cannot distinguish letters
from digits. The VERIFY function validates that the characters were digits.
The roles of the items in the regular expression:
/ Start regular expression.
\d{3} Match three digits
- followed by a dash
\d{2} followed by two digits
- followed by a dash
\d{4} followed by four digits.
/ End the regular expression.
What happened
to Angela and David?
data Invalidssn;
set ia.Staff;
re = prxparse('/^\d{3}-\d{2}-\d{4}$/');
if prxmatch(re, trim(ssn)) = 0;
run;
proc print data=Invalidssn;
title 'Invalid Social Security Numbers';
var Name SSN;
run;
Be sure to trim the blanks from the end of the SSN variable. In Perl expressions, blanks have
significance.
If the Perl regular expression is a constant or if it uses the /o option, then the Perl regular
expression is compiled once and each use of PRXMATCH reuses the compiled expression.
If the Perl regular expression is not a constant and if it does not use the /o option, then the Perl
regular expression is recompiled for each call to PRXMATCH.
The compile-once behavior occurs when you use PRXMATCH in a DATA step, in a
WHERE clause, or in PROC SQL. For all other uses, the Perl regular expression is
recompiled for each call to PRXMATCH.
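As a sketch of PRXMATCH in a WHERE clause, the pattern can be passed as a constant directly to the function. The table work.staff and its two rows are hypothetical; the trailing \s* in the pattern is an alternative to TRIM that tolerates trailing blanks in the variable. The report should list only the row whose SSN contains a letter.

```sas
/* Hypothetical staff table with one valid and one invalid SSN */
data work.staff;
   length Name $ 20 SSN $ 11;
   input Name $ SSN $;
   datalines;
Mary 364-92-7412
Jane 364-9A-7412
;
run;

/* Constant pattern: no separate PRXPARSE call is needed */
proc print data = work.staff;
   where prxmatch('/^\d{3}-\d{2}-\d{4}\s*$/', SSN) = 0;
   title 'Invalid Social Security Numbers';
   var Name SSN;
run;
```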
PRXCHANGE(Perl-regular-expression, times, source)
Perl-regular-expression
   specifies a pattern to search for and a string to replace it with.
times
   specifies the number of times to perform the replacement.
source
   specifies the string to be searched.
Use the value -1 for the times argument to replace all occurrences.
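A minimal sketch of the -1 value: the step below removes every dash from a hypothetical SSN value and writes the result to the log.

```sas
data _null_;
   SSN = '364-92-7412';                       /* hypothetical value            */
   NoDashes = prxchange('s/-//', -1, SSN);    /* -1 replaces every occurrence  */
   put NoDashes=;                             /* log shows NoDashes=364927412  */
run;
```

With a times value of 1, only the first dash would be removed.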
O'REILY, MARY
PYLES, JANE
HOFFMAN, VALERIE
DAWN, JENNIFER
VAN HUSEN, JEFF
SIM-SMITH, ANGELA
TIMMONS, DAVID
BENJAMIN, CATHERINE
Match a space.
( Start capture buffer #2 to store the first name.
\w+ Match a word character one or more times.
Insert a space.
$1 Insert capture buffer #1, which contains the last name.
/ End replacement text.
Equivalent code:
data Namechange;
set ia.Staff;
First=scan(name, 2, ' ,');
Middle=scan(name, 3, ' ,');
Last = scan(name,1, ' ,');
if middle ne ' '
then NewName=trim(first) || ' ' ||
trim(middle) || ' ' || last;
else NewName=trim(first) || ' ' || last;
run;
data Namechange;
set ia.Staff;
NewName = prxchange('s/([^,]+), (\w+(\s+\w+)?)/$2 $1/',
1,Name);
run;
proc print data = Namechange;
title 'Rearranged Names';
var Name NewName;
run;
Exercises
Use the PRINT procedure with a WHERE statement to create the report.
Output
Employees with Invalid Phone Numbers
Phone
Obs Name Number
This is a backup copy of the data in case your program must be submitted multiple times as
you test and debug.
2. Modifying All Observations in a SAS Data Set
Give all the employees in the empdata SAS data set a 5% salary increase using the MODIFY
statement. Print the data set before and after the increase.
proc print data = empdata (obs = 5);
title 'Original Data';
run;
data empdata;
modify empdata;
salary = salary * 1.05;
run;
data empdata;
modify empdata ia.empdatu;
by EmpID;
run;
4. Modifying a SAS Data Set Using a Transaction Data Set and an Index
Use the transaction data set ia.empdatu2 to modify the empdata SAS data set by the employee
ID number. Use the index on the empdata SAS data set. Modify the variables LastName,
Location, and Salary. Print the data set before and after the changes.
proc print data = empdata;
var EmpID LastName Location Salary;
title 'Original Data';
run;
data empdata;
set ia.empdatu2 (rename = (LastName = NewLastName
Location = NewLocation
Salary = NewSalary));
modify empdata key = EmpID;
LastName = NewLastName;
Location = NewLocation;
Salary = NewSalary;
run;
a. Use the ia.y200061 and ia.y200062 data sets to concatenate to ia.jobhstry and test
your program.
b. Use PROC DATASETS to look at the generation information for ia.jobhstry.
proc datasets lib = ia nolist;
modify jobhstry (genmax = 3);
run;
quit;
data ia.jobhstry;
set ia.jobhstry ia.y200061;
run;
data ia.jobhstry;
set ia.jobhstry ia.y200062;
run;
proc sql;
insert into ia.pilots
set EmpID = 'E01724';
quit;
Log
434 proc sql;
435 insert into IA.Pilots
436 set EmpID = 'E01724';
ERROR: Observation was not added/updated because a matching primary key value
was not found for foreign key FKEmpID.
NOTE: Deleting the successful inserts before error noted above to restore table
to a consistent state.
437 quit;
NOTE: The SAS System stopped processing this step because of errors.