SAS® Programming III: Advanced Techniques

Course Notes
For Your Information ii

SAS® Programming III: Advanced Techniques Course Notes was developed by Linda Jolley and
Jane Stroupe. Additional contributions were made by Bill Brideson, George Berg, Ted Meleky,
Rich Papel, Dr. Sue Rakes, Kent Reeve, Christine Riddiough, and Roger Staum. Editing and production
support was provided by the Curriculum Development and Support Department.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.

SAS® Programming III: Advanced Techniques Course Notes

Copyright © 2005 by SAS Institute Inc., Cary, NC 27513, USA. All rights reserved. Printed in the
United States of America. No part of this publication may be reproduced, stored in a retrieval system,
or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without
the prior written permission of the publisher, SAS Institute Inc.

Book code E70041, course code PROG3, prepared date 14Oct05.


Table of Contents

Course Description

Prerequisites

Chapter 1 Introduction
1.1 Introduction of Course Topics
1.2 Measuring Efficiencies
1.3 SAS Processing
1.4 Controlling Memory and I/O Resources
1.5 Solutions to Exercises

Chapter 2 Accessing Observations
2.1 Introduction
2.2 Creating a Sample Data Set
2.3 Creating and Using an Index
2.4 Solutions to Exercises

Chapter 3 Combining Data Horizontally
3.1 Joining Data Sets by Value
3.2 Combining Summary and Detail Data
3.3 Using an Index to Combine Data
3.4 Updating Data
3.5 Combining Summary and Detail Data Using Two SET Statements (Self-Study)
3.6 Solutions to Exercises

Chapter 4 Using Lookup Tables to Match Data
4.1 Introduction to Lookup Techniques
4.2 Using Arrays as Lookup Tables
4.3 Using Hash Objects as Lookup Tables
4.4 Using Formats as Lookup Tables
4.5 Transposing Data to Create a Lookup Table
4.6 Solutions to Exercises

Chapter 5 Combining Data Vertically
5.1 Appending SAS Data Sets
5.2 Appending Raw Data Files
5.3 Solutions to Exercises

Chapter 6 BY-Group Processing and Sorting
6.1 Introduction
6.2 Eliminating Duplicates
6.3 Sorting Resources
6.4 Choosing the Right Sort Routine (Self-Study)
6.5 Alternatives to Sorting
6.6 Solutions to Exercises

Chapter 7 Controlling Data Storage Space
7.1 Introduction
7.2 Reducing the Length of Numeric Variables
7.3 Compressing Data Files
7.4 Creating a DATA Step View
7.5 Solutions to Exercises

Chapter 8 Utilizing Best Practices to Improve Efficiency
8.1 Introduction
8.2 Executing Only Necessary Statements
8.3 Eliminating Unnecessary Passes through the Data
8.4 Reading and Writing Only Essential Data
8.5 Networking Efficiency Considerations (Self-Study)

Chapter 9 Using the Scalable Performance Data Engine (Self-Study)
9.1 Introduction to the Scalable Performance Data Engine
9.2 Creating SPD Engine Tables
9.3 Using the SPD Engine Efficiently
9.4 SPD Engine LIBNAME Statement Options List

Chapter 10 Additional Topics (Self-Study)
10.1 Modifying SAS Data Sets in Place
10.2 Creating Generation Data Sets
10.3 Creating Integrity Constraints
10.4 Creating and Using Audit Trails
10.5 Working with Perl Regular Expressions
10.6 Solutions to Exercises

Appendix A Index


Course Description
This course builds on the concepts presented in the SAS Programming II: Manipulating Data with the
DATA Step course. This course focuses on reading data with direct access; combining data; sorting; using
multidimensional arrays, hash tables, and formats for table lookups; efficiently storing data; utilizing best
practices; and creating tables with the SAS Scalable Performance Data Engine.

This course is a combination of the previously offered SAS Programming III: Advanced Techniques and
Optimizing SAS Programs courses.

To learn more…

A full curriculum of general and statistical instructor-based training is available
at any of the Institute’s training facilities. Institute instructors can also provide
on-site training.
For information on other courses in the curriculum, contact the SAS Education
Division at 1-919-531-7321, or send e-mail to training@sas.com. You can also
find this information on the Web at support.sas.com/training/ as well as in the
Training Course Catalog.

For a list of other SAS books that relate to the topics covered in this
Course Notes, USA customers can contact our SAS Publishing Department at
1-800-727-3228 or send e-mail to sasbook@sas.com. Customers outside the
USA, please contact your local SAS office.
Also, see the Publications Catalog on the Web at support.sas.com/pubs for a
complete list of books and a convenient order form.

Prerequisites
This course is not appropriate for beginning SAS software users. Before attending this course, you should
have at least nine months of SAS programming experience and should have completed the SAS
Programming II: Manipulating Data with the DATA Step course. Specifically, you should be able to do
the following:
• understand your operating system file structures and perform basic operating system tasks
• understand programming logic concepts
• understand the compilation and execution process of the DATA step
• use different kinds of input to create SAS data sets from external files
• use SAS software to access SAS data libraries
• create and use SAS date values
• read, concatenate, merge, match-merge, and interleave SAS data sets
• use the DROP=, KEEP=, and RENAME= data set options
• create multiple output data sets
• use array processing and DO loops to process data iteratively
• use SAS functions to perform data manipulation and transformations.

Chapter 1 Introduction

1.1 Introduction of Course Topics
1.2 Measuring Efficiencies
1.3 SAS Processing
1.4 Controlling Memory and I/O Resources
1.5 Solutions to Exercises

1.1 Introduction of Course Topics

General Business Scenario

International Airlines has several data files that must be
manipulated before they can be used for report production.

The to-do list includes the items on the following slides:

appending
− raw data files
− SAS data sets

combining
− three SAS data sets without common BY variables
− a summary data set with a detail data set
− a small data set with a large data set

[Slide illustration: sample daily flight data for the RDU to LHR route, 02DEC1999 through 21DEC1999, shown with the column headings Date, Expenses, Origin, Destination, Class, Business, Economy, Profit, AirportCity, and AirportName.]

• creating random samples to use for various analyses
• creating indexes for quick retrieval of subsets
• updating a master table with a transaction table
• performing table lookups
• sorting data sets
• accessing current data in frequently changing files

Perform these tasks as efficiently as possible, and optimize the following:
• I/O
• CPU
• memory
• data storage space


1.2 Measuring Efficiencies

Objectives
• Identify the resources used by a SAS program.
• Use SAS system options to measure computer resources.
• Interpret resource usage statistics in your operating environment.
• Benchmark resource usage.

Running a SAS Program

What resources are required to run a SAS program?

The programmer must perform the following tasks:
• write the program
• execute the program
• maintain the program

The computer must perform the following actions:
• load the required SAS software components and the program into memory
• compile the program
• locate data required by the program
• execute the program
• store output data files
• store printed reports


What Resources Are Used?

[Slide diagram: resources used: CPU time, I/O, memory, data storage space, programmer time, and networking.]

CPU measures the amount of time that the Central Processing Unit uses to perform
requested tasks such as calculations, reading and writing data, conditional and
iterative logic, and so on.
I/O provides a measurement of the read-and-write operations performed as data and
programs are moved from a storage device to memory (input) or from memory to a
storage or display device (output).
Memory is the size of the work area required to hold executable program modules, data, and
buffers.
Data storage space is the amount of space on a disk or tape required to store data.
Programmer time is the amount of time required for the programmer to write and maintain the
program. This can be decreased through well-documented, logical programming
practices.
Networking is the amount of time required to transfer data across your computer network. This
can be decreased by performing as much of the subsetting and summarizing as
possible on the remote computer before transferring the data to the local computer.
The networking time is dependent on the bandwidth of your I/O controller.

Understanding Efficiency Trade-offs

When you decrease the use of one resource, the use of another resource frequently increases.

[Slide figures: Data Space "often implies" CPU Time; I/O "often implies" Memory Usage.]

Deciding What Is Important for Efficiency

You must decide which factors are the most important for improving resource usage at your site. To make
this decision, you must know the following:
• which resources are scarce or costly at your site
• how and when your programs will be used
• the type and volume of data your programs will process

Understanding Efficiency at Your Site

[Slide diagram: hardware, operating environment, system load, and SAS environment.]

Environmental factors that affect the efficiency of SAS programs include the following:
Hardware the amount of available memory, the number of peripheral devices attached to
the CPU, and the communications hardware in use
Operating environment resource allocation, scheduling algorithms, and I/O methods
System load the number of users or jobs sharing system resources, including network
bandwidth, and the volume of traffic
SAS environment determined by which SAS software products are installed, how they were
installed, and which methods are available to run SAS programs at your site
In most cases, one or two resources are the most limited or most expensive for your programs. You can
usually decrease the amount of critical resources that are used if you are willing to sacrifice some
efficiency of the resources that are less critical at your site.

Knowing How Your Program Will Be Used

The importance of efficiency increases with the following:
the size of the program or the files being processed

the number of times the program will be executed

• Developing an efficient program requires time and thought. The first question to address is whether the
additional amount of resources saved is worth the time and effort spent to achieve the savings.
• Consider the size of the program or the files that are processed. As the programs or files increase in
size, the potential for savings increases. Therefore, devote your effort to improve the efficiency of large
programs.
• Also consider the number of times the program will run. The difference in the resources used by an
inefficient program and an efficient program that run one time or a few times is relatively small,
whereas the cumulative difference for a program that is run frequently is large.

Knowing Your Data

The effectiveness of any efficiency technique depends greatly on the data with which you use it. When
you know the characteristics of your data, you can select the techniques that take advantage of those
characteristics.

Considering Trade-Offs
In this class, each task will be performed using one or
more techniques.
You should benchmark with your own data to determine
which technique is the most efficient.


Deciding Which Technique Is Most Efficient

To decide which technique is
most efficient for a given task,
benchmark, or measure and
compare, the resource usage
of each technique.

Running Benchmarks: Guidelines

To benchmark your programming techniques, do the following:
• Turn on the appropriate options to report resource usage.
• Test each technique in a separate SAS session.
• Test only one technique or change at a time, with as little additional code present as possible.
• Run your tests under the conditions that your final program will use (for example, batch
  execution, large data sets, and so on).
• Turn off the options that report resource usage after testing is finished, because they
  consume resources.
• Run each program several times and base your conclusions on averages, not on an individual
  execution, if you are benchmarking elapsed time.
• Average resource usage data only if the results are in the same ballpark. Do not average
  very diverse resource usages because that data might lead you to tune your program to run
  less efficiently.
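The guidelines above can be sketched as a minimal benchmark wrapper. This is only an illustration: it assumes that the ia libref used elsewhere in these notes is already assigned, and the DATA _NULL_ step is a placeholder for whichever single technique you are testing.

```sas
/* Turn on detailed resource reporting before the test. */
options fullstimer;

/* Placeholder for the one technique being measured:     */
/* read ia.sales sequentially without creating output.   */
data _null_;
   set ia.sales;
run;

/* Turn reporting off again, because it consumes resources. */
options nofullstimer;
```

Submit the block in a fresh SAS session for each technique, repeat it several times, and compare the statistics that are written to the log.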


Tracking Resource Usage

[Slide diagram: SAS options STIMER, MEMRPT, FULLSTIMER, and STATS.]

There are four SAS system options that you can use to track and report on resource utilization:
STIMER tracks the CPU time used to perform a task (DATA or PROC step). CPU time can be
divided into System CPU time and User CPU time.
MEMRPT tracks memory used while performing a task.
FULLSTIMER tracks usage of additional resources. This option is ignored unless STIMER or
MEMRPT is in effect. It can also be specified by the alias FULLSTATS.
STATS writes information tracked by the above options to the SAS log.

The availability and usage of these options are specific to the operating environment.

Syntax (default listed first):

OPTIONS NOFULLSTIMER | FULLSTIMER;

OPTIONS STIMER | NOSTIMER;

OPTIONS STATS | NOSTATS;

OPTIONS MEMRPT | NOMEMRPT;


Tracking Resources with SAS Options

             z/OS   Windows   UNIX
STIMER       I      BD        BD
MEMRPT       BD     N/A       N/A
FULLSTIMER   B      B         B
STATS        BD     N/A       N/A

I    Invocation option only
B    Can be set at invocation or by using an OPTIONS statement
D    Default
N/A  Not available (The functionality is part of the STIMER option under UNIX and Windows.)

Use the OPTIONS procedure with the HOST option to determine the default settings of these
options at your site.
proc options host;
run;
You can find more information on operating environment dependencies in the SAS documentation for
your operating environment.
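If you want to check one option rather than the full host listing, the OPTION= argument of PROC OPTIONS can be used. The sketch below checks FULLSTIMER as an example; which options are available still depends on your operating environment.

```sas
/* Write the current setting of a single system option to the log. */
proc options option=fullstimer;
run;
```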

Tracking SAS/ACCESS Resources (Self-Study)

In addition to the traditional four SAS system options for
tracking resource usage, the SASTRACE= system option
is a powerful tool to use when you want to see the
commands that are sent to your database management
system (DBMS) by the SAS/ACCESS engine.
SASTRACE= output is DBMS-specific.
General form of the SASTRACE= system option:

OPTIONS SASTRACE = ',,,d' | ',,t,' | ',,t,s';

Notice the use of the commas as placeholders.

Selected values for SASTRACE= are shown below:

',,,d' specifies that all SQL statements sent to the DBMS are sent to the log.
',,t,' specifies that all threading information is sent to the log.
',,t,s' specifies that all threading information and a summary of timing information for calls made to the
DBMS are sent to the log.
The following details can help you manage SASTRACE= output in your DBMS:
• When using SASTRACE= on PC platforms, you must also specify the following option:
sastraceloc = stdout | saslog

• In order to turn SAS tracing off, you can specify the following option:
options sastrace=off;

• Log output is much easier to read if you specify nostsuffix.


Tracking SAS/ACCESS Resources (Self-Study)

7 options ls = 64 sastrace = ',,,d' sastraceloc = saslog
nostsuffix;
9 proc print data = oralib.flightdelays;
10 where destination = 'CPH';
11 title 'Flights to Copenhagen';
12 run;
ORACLE_2: Prepared:
SELECT "DESTINATION", "FLIGHTNUMBER", "FLIGHTDATE", "ORIGIN",
"DELAYCATEGORY", "DESTINATIONTYPE", "DAYOFWEEK", "DELAY" FROM
educ.FLIGHTDELAYS WHERE ("DESTINATION" = 'CPH' )
ORACLE_3: Executed:
SELECT statement ORACLE_2
NOTE: There were 27 observations read from the data set
ORALIB.FLIGHTDELAYS.
WHERE destination='CPH';
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.58 seconds
cpu time 0.07 seconds

c01s2d1

The following code was used to generate this output:

/* Using a WHERE statement to subset an Oracle table. */
libname oralib oracle user = edu001 pw = xxxxxx
        path = dbmssrv schema = educ;

/* Use SASTRACE= and SASTRACELOC= to write the */
/* generated Oracle SQL statements to the log. */
options ls = 64 sastrace = ',,,d' sastraceloc = saslog
        nostsuffix;

/* Subset for Copenhagen destination */
proc print data = oralib.flightdelays;
   where destination = 'CPH';
   title 'Flights to Copenhagen';
run;

1.3 SAS Processing

Objectives
• Investigate the concept of a data set page and how it relates to the structure of SAS data sets.
• Review how SAS reads and writes data.

SAS Data Set Pages

A SAS data set page
• is the unit of data transfer between the operating system buffers and SAS buffers in memory
• includes the number of bytes used by the descriptor portion, the data values, and the overhead
• is fixed in size when the data set is created, either to a default value or to a value
  specified by the programmer.


Using PROC CONTENTS to Report Page Size

proc contents data = ia.sales;
run;

Partial Output
Engine/Host Dependent Information

Data Set Page Size 16384

Number of Data Set Pages 3396
First Data Page 1
Max Obs per Page 97
Obs in First Data Page 76
Index File Page Size 4096
Number of Index File Pages 2552
Number of Data Set Repairs 0
File Name sales.sas7bdat
Release Created 9.0101M3
Host Created XP_PRO

c01s3d1

The total number of bytes occupied by ia.sales can be calculated as shown below:

(16,384 * 3,396) + (4,096 * 2,552) = 66,093,056 bytes

The data set ia.sales used for demonstrations and exercises contains fewer observations than
the data set ia.sales used for the course notes.
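The same arithmetic can be sketched in a short DATA step. The four constants are copied from the partial PROC CONTENTS output above, and the variable names are illustrative only.

```sas
/* Estimate the bytes occupied by a data set and its index file */
/* from the page sizes and page counts that PROC CONTENTS shows. */
data _null_;
   data_bytes  = 16384 * 3396;   /* Data Set Page Size * Number of Data Set Pages     */
   index_bytes = 4096 * 2552;    /* Index File Page Size * Number of Index File Pages */
   total_bytes = data_bytes + index_bytes;
   put total_bytes= comma15.;    /* writes the total to the SAS log */
run;
```

Substitute the values that PROC CONTENTS reports for your own copy of the data set, because the demonstration data has fewer observations.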

Reading External Files

[Slide diagram: raw input data passes through caches into input buffers in memory (I/O measured here). Each record moves to the input buffer, where data is converted from external format to SAS format, and then to the PDV (ID, Flight, Route, Dest). Observations move from the PDV to the output buffers, which are written to the output SAS data set (I/O measured here).]

Data might be cached in storage devices. On UNIX and Windows, data might also be cached by the
file system.

• The Input Buffer contains one record of raw data.

• The PDV contains one observation of SAS data.

Reading SAS Data Sets

[Slide diagram: an input SAS data set passes through caches into input buffers in memory (I/O measured here). No data conversion is necessary as each observation moves from the buffers to the PDV (ID, Flight, Route, Dest) and then to the output buffers, which are written to the output SAS data set (I/O measured here).]

Data might be cached in storage devices. On UNIX and Windows, data might also be cached by the
file system.

1.4 Controlling Memory and I/O Resources

Objectives
• Change the page size of a SAS data set.
• Use system and data set options to control memory usage.
• Use the SASFILE statement when you read small SAS data sets.
• Use the Scatter/Gather I/O feature in the Windows operating environment.


Controlling Page Size and Memory Usage

• You can use the BUFSIZE= system option or data set option to control the page size of an
  output SAS data set.
• You can use the BUFNO= system option or data set option to control the number of SAS
  buffers open simultaneously in memory.

BUFSIZE= n | nK | nM | nG | nT | hexX | MIN | MAX

BUFNO= n

Increasing the BUFSIZE= option is useful for SAS data sets that are read sequentially (top to bottom).
Using small BUFSIZE= and larger BUFNO= options is useful for SAS data sets that are read randomly.
Random access to SAS data is discussed in Chapter 2.
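As a minimal sketch of the data set option forms, the step below copies ia.sales with a larger page size while reading through more buffers. The output name work.sales_big and the values 32k and 10 are illustrative assumptions, not recommendations; benchmark with your own data.

```sas
/* BUFSIZE= sets the page size of the new output data set.        */
/* BUFNO= sets how many buffers are used while reading the input. */
data work.sales_big (bufsize=32k);   /* hypothetical output data set */
   set ia.sales (bufno=10);
run;

/* Confirm the resulting Data Set Page Size. */
proc contents data=work.sales_big;
run;
```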

Reference Information
BUFSIZE= n | nK | nM | nG | nT | hexX | MIN | MAX

n | nK | nM | nG | nT
specifies the page size in multiples of 1 (bytes); 1,024 (kilobytes); 1,048,576 (megabytes);
1,073,741,824 (gigabytes); or 1,099,511,627,776 (terabytes). For example, a value of 8 specifies 8
bytes, and a value of 3m specifies 3,145,728 bytes.
The default is 0, which causes SAS to use the minimum optimal page size for the operating
environment.
hexX
specifies the page size as a hexadecimal value. You must specify the value beginning with a number
(0-9), followed by an X. For example, the value 2dx sets the page size to 45 bytes.
MIN
sets the page size to the smallest possible number in your operating environment, down to the
smallest four-byte, signed integer, which is -2^31 (-2,147,483,648), or approximately -2 billion bytes.
CAUTION: This setting might cause unexpected results and should be avoided.
Use BUFSIZE=0 in order to reset the buffer page size to the default value in your operating environment.
MAX
sets the page size to the maximum possible number in your operating environment, up to the largest
four-byte, signed integer, which is 2^31-1 (2,147,483,647), or approximately 2 billion bytes.
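The system option form can be sketched as follows. The value 16k is an arbitrary illustration; the final statement applies the BUFSIZE=0 reset described above.

```sas
/* Request a 16K page size for data sets created from here on. */
options bufsize=16k;

/* Check the setting that is currently in effect. */
proc options option=bufsize;
run;

/* Return to the default page size for the operating environment. */
options bufsize=0;
```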

Windows:
n | nK | nM | nG
specifies the buffer page size in multiples of 1; 1,024 (kilobytes); 1,048,576 (megabytes), and
1,073,741,824 (gigabytes), respectively. You can specify decimal values for the number of
kilobytes, megabytes, or gigabytes. For example, a value of 8 specifies 8 bytes, a value of .782k
specifies 801 bytes, and a value of 3m specifies 3,145,728 bytes.
hexX
specifies the buffer page size as a hexadecimal value. You must specify the value beginning with a
number (0-9), followed by an X. For example, the value 2dx sets the buffer page size to 45 bytes.
MIN
sets the buffer page size to -2,147,483,648 and requires SAS to use a default value. Under
Windows, the default value is 0. The minimum number is -2,147,483,648.
MAX
sets the buffer page size to 2,147,483,647 bytes.
UNIX:

n | nK | nM | nG
specifies the buffer page size in multiples of 1 (bytes); 1,024 (kilobytes); 1,048,576 (megabytes); or
1,073,741,824 (gigabytes). You can specify decimal values for the number of kilobytes, megabytes,
or gigabytes. For example, a value of 8 specifies 8 bytes, a value of .782k specifies 801 bytes, and a
value of 3m specifies 3,145,728 bytes.
hexX
specifies the buffer page size as a hexadecimal value. You must specify the value beginning with a
number (0-9), followed by hex digits (0-9, A-F), and then followed by an X. For example, 2dx sets
the buffer page size to 45 bytes.
MIN
sets the buffer page size to 0. When the buffer size is 0, the BASE engine calculates a buffer size to
optimize CPU and I/O use. This size is the smallest multiple of 8K that can hold 80 observations but
is not larger than 64K.
MAX
sets the buffer page size to 2,147,483,647.

Reference Information

z/OS:

BUFSIZE=0 | n | nK

0
specifies that SAS choose the optimal page size of the data set based on the characteristics of the
library and the type of data set.
n | nK
specifies the permanent buffer size (page size) in bytes or kilobytes, respectively. For libraries other
than HFS, the value specified will be rounded up to the block size (BLKSIZE) of the library data
set, because a block is the smallest unit of a data set that may be transferred in a single I/O
operation.

Windows and UNIX:

BUFNO=MIN | MAX | n | nK | nM | nG | nT | hexX

Windows:
n | nK | nM | nG
specifies the number of buffers in multiples of 1; 1,024 (K); 1,048,576 (M); or 1,073,741,824 (G).
Here the K, M, and G suffixes are multipliers for a buffer count, not byte units. You can specify
decimal values with a suffix. For example, a value of 8 specifies 8 buffers, a value of .782k specifies
801 buffers, and a value of 3m specifies 3,145,728 buffers.
For values greater than 1G, use the nM option or specify MAX.
hexX
specifies the number of buffers as a hexadecimal value. You must specify the value beginning with
a number (0-9), followed by an X. For example, the value 2dx specifies 45 buffers.
MIN
sets the number of buffers to 0, and requires SAS to use the default value of 1.
MAX
sets the number of buffers to 2,147,483,647.

UNIX:
n | nK | nM | nG
specifies the number of buffers in multiples of 1; 1,024 (K); 1,048,576 (M); or 1,073,741,824 (G).
Here the K, M, and G suffixes are multipliers for a buffer count, not byte units. You can specify
decimal values with a suffix. For example, a value of 8 specifies 8 buffers, a value of .782k specifies
801 buffers, and a value of 3m specifies 3,145,728 buffers.
hexX
specifies the number of buffers as a hexadecimal value. You must specify the value beginning with
a number (0-9), followed by hex digits (0-9, A-F), and then followed by an X. For example, 2dx
specifies 45 buffers.
MIN
sets the number of buffers to 0, and requires SAS to use the default value of 1.
MAX
sets the number of buffers to 2,147,483,647.

For more information, consult SAS OnlineDoc 9.1.3. Expand Base SAS, and select SAS
Language Reference: Dictionary and Operating Environment Specific Information.

Controlling Page Size and Memory Usage

The product of BUFNO= and BUFSIZE= determines how
much data can be transferred in a read operation.
BUFSIZE    BUFNO    Bytes transferred in one I/O
6144       2        12,288

Increasing either BUFSIZE= or BUFNO= increases the amount of data that can be
transferred in a read operation.
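
The arithmetic in the table can be tried out with a short sketch like the one below. It assumes that ia.times already exists; work.times6k is a hypothetical data set name used only for illustration. SAS might adjust the requested page size to a value that is valid for the operating environment, so the actual page size should be confirmed before relying on the byte count.

```sas
/* Sketch only: work.times6k is a hypothetical data set name. */
data work.times6k(bufsize = 6144);   /* request a 6144-byte page size for the new data set */
   set ia.times;
run;

data _null_;
   set work.times6k(bufno = 2);      /* 2 buffers x 6144 bytes = 12,288 bytes per transfer */
run;
```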


Controlling Page Size

In order to select a default page size, SAS software uses
an algorithm based on observation length, engine, and
operating environment.
You can use the BUFSIZE= system or data set option
to override the default page size.
BUFSIZE= specifies not only the page size (in bytes),
but also the size of each buffer used to read or write the
SAS data set.
data ia.times(bufsize = 30720);
   infile rtetimes;
   input @1  RouteID  $7.
         @8  Origin   $3.
         @11 Dest     $3.
         @14 Distance 8.
         @24 Depart   time5.
         @32 Arrival  time5.;
run;

c01s4d1

[Diagram: in one operation, a 6144-byte page of data in the operating system buffers is copied into a 6144-byte SAS buffer.]

Controlling Page Size

After it is specified, page size is a permanent attribute of
the data set, and is used whenever the data set is
processed.
Choosing a page size that is larger than the default can
reduce execution time by reducing the number of times
that SAS must read from or write to the operating system
buffers.
The reduction in I/O comes at the cost of increased
memory consumption.
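
Because the page size is stored with the data set, it can be checked after the data set is created. The sketch below assumes that ia.times was created as shown earlier. For a Base SAS data set, the PROC CONTENTS listing typically includes the data set page size and the number of data set pages, although the exact layout of the listing depends on the engine and the operating environment.

```sas
/* Display the descriptor portion of ia.times, including its page size. */
proc contents data = ia.times;
run;
```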

Controlling Memory Usage

[Diagram: with bufno = 3, three buffers hold Page 1, Page 2, and Page 3 of the data in memory for the current SAS session.]

The buffer number is not a permanent attribute of the data
set and is valid only for the current step or SAS session.
When more buffers are available, more pages can be
transferred in a single move operation.
The reduction in the number of moves comes at the cost of
increased memory consumption.
data _null_;
   set ia.times(bufno = 2);
run;

c01s4d2

SASFILE Global Statement

The SASFILE statement requests that a SAS data set
be opened and loaded into SAS memory in its entirety
instead of a few pages at a time.
After it is read, data is held in memory for subsequent
DATA and PROC steps to process.
A second SASFILE statement closes the file and frees
the SAS buffers.

The SASFILE statement can reduce execution time by taking advantage of large amounts of memory. The
SASFILE statement became available in SAS Release 8.1.

SASFILE Global Statement

General form of the SASFILE statement:

SASFILE <libref.>member-name <(password-data-set-option(s))>
        OPEN | LOAD | CLOSE;

OPEN opens the file and allocates the buffers, but defers reading the data into memory until a
procedure or a statement that references the file is executed.
LOAD opens the file, allocates the buffers, and reads the data into memory.
CLOSE frees the buffers and closes the file.
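
The difference between OPEN and LOAD can be sketched as follows. This example assumes the ia.fltaten data set and its Salary and JobCode variables, which are used elsewhere in this section. With OPEN, the first step that references the file triggers the read into memory.

```sas
sasfile ia.fltaten open;          /* allocate buffers; the data is not read yet */

proc means data = ia.fltaten;     /* first reference reads the data into memory */
   var Salary;
run;

proc freq data = ia.fltaten;      /* later steps use the copy that is held in memory */
   tables JobCode;
run;

sasfile ia.fltaten close;         /* free the buffers and close the file */
```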

Buffer Allocation
When the SASFILE statement executes, SAS allocates
the number of buffers based on the number of pages of
the SAS data set and index file.
If the file in memory increases in size during processing
by editing or appending data, the number of buffers also
increases.


Using the SASFILE Statement

Create reports using the PRINT, TABULATE, MEANS,
and FREQ procedures against a single
SAS data set.

Partial PROC PRINT output:

LastName    FirstName     Job Code   Location   Country
FORT        THERESA L.    FLTAT2     CARY       USA
FISHER      ALEC          FLTAT2     CARY       USA
WILLIAMS    ARLENE M.     FLTAT1     CARY       USA
GOODYEAR    GEORGIA       FLTAT1     CARY       USA
CHASE JR.   MARJORIE J.   FLTAT1     CARY       USA

PROC TABULATE output:

              Employee Salary
Job Code      Mean        Median
FLTAT1        29594.12    29000.00
FLTAT2        30691.63    31000.00

The MEANS Procedure

Analysis Variable : Salary Employee Salary

Job Code   N Obs     N       Mean    Std Dev    Minimum    Maximum
FLTAT1       170   170   29594.12    7982.60   16000.00   45000.00
FLTAT2       227   227   30691.63    8848.88   16000.00   45000.00

The FREQ Procedure

Job Code   Frequency   Percent   Cumulative Frequency   Cumulative Percent
FLTAT1           170     42.82                    170                42.82
FLTAT2           227     57.18                    397               100.00

Gender     Frequency   Percent   Cumulative Frequency   Cumulative Percent
F                211     53.15                    211                53.15
M                186     46.85                    397               100.00

Using the SASFILE Statement

sasfile ia.fltaten load;
proc print data = ia.fltaten;
   var LastName FirstName JobCode
       Country Location;
   sum Salary;
run;
proc tabulate data = ia.fltaten;
   class Gender;
   var Salary;
   table Gender, Salary*(mean median);
run;
proc means data = ia.fltaten;
   var Salary;
   class Gender;
   output out = summary sum =;
run;
proc freq data = ia.fltaten;
   tables JobCode Gender;
run;
sasfile ia.fltaten close;

c01s4d3

ia.fltaten is read into memory only once instead of four times. This results in one-fourth as many I/O
operations, increased memory usage, and probably reduced elapsed time.

The SASFILE statement works best for SAS data sets that are small enough to fit entirely in available memory.


Using the SGIO System Option in Windows

(Self-Study)
The SGIO system option performs the following functions:
• activates the Scatter-Read/Gather-Write I/O feature
• improves I/O performance for SAS I/O files when the PC has a large amount of RAM
General form of the SGIO system option:

NOSGIO | SGIO

NOSGIO | SGIO is an invocation option.

The default value is NOSGIO.

• With SAS I/O files (data sets, catalogs, indexes, utility files, and so on), normal sequential reads and
writes go through the Windows File Cache.
• The Windows File Cache provides a great benefit in most cases, but for large SAS I/O files, Scatter-
Read or Gather-Write usually improves performance.

Scatter-Read/Gather-Write is available in Windows 2000 and Windows XP.

Windows NT users must install Service Pack 4.


Using the SGIO System Option in Windows

(Self-Study)
When SGIO is active, SAS does the following:
• uses the number of buffers that are specified by the BUFNO= system option to transfer data between disk and RAM
• bypasses intermediate buffer transfers when reading or writing data
• reads ahead the number of pages specified by the BUFNO= system option and places the data in memory before it is needed
When the data is needed, it is already in memory and is, in effect, a direct memory access.
Try different values of the BUFNO system option
to tune each SAS job or DATA step.
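
One way to try different BUFNO= values is to run the same step under each setting with FULLSTIMER turned on and compare the statistics in the log. The sketch below assumes the ia.sales data set; the values 4 and 16 are arbitrary choices for illustration, not recommendations. The BUFNO= system option stays in effect until it is changed again.

```sas
options fullstimer;

options bufno = 4;                /* arbitrary starting value for the comparison */
data _null_;
   set ia.sales;
run;

options bufno = 16;               /* arbitrary larger value for the comparison */
data _null_;
   set ia.sales;
run;

options nofullstimer;
```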

The Scatter-Read/Gather-Write feature is active only for SAS I/O files that have the following attributes:
• have a page size that is a multiple of 4K (for example, 4096 or 8192) on 32-bit systems
• have a page size that is a multiple of 8K (for example, 8192 or 16384) on 64-bit systems
If an I/O file does not meet these criteria, SGIO is inactive for that file even though the SGIO option is
specified.
To learn more, visit this page: http://support.sas.com/techsup/technote/ts710.html.

Exercises

1. Recording Resource Statistics

a. Open the program, c01ex1Start, and add the appropriate OPTIONS statement to report the
following statistics. Record your results.
1) CPU
2) I/O
3) Memory
b. Turn off the option after you record the statistics.
2. Using the SASFILE Statement
Open the program, c01ex2Start, and add the appropriate statement(s) to open and load the entire data
set ia.UK_fltat into memory. At the end of the program, close the data set.

1.5 Solutions to Exercises

1. Recording Resource Statistics
a. Open the program, c01ex1Start, and add the appropriate OPTIONS statement to report the
following statistics. Record your results.

Each student's results will vary depending on the individual PC.

1) CPU
2) I/O
3) Memory
options fullstimer;

filename rawdata 'saledata.dat';

data sales(keep = FlightID Num1st
                  NumBus NumEcon NumPassTotal);
   infile rawdata;
   input FlightID     $7.   RouteID      $7.
         Origin       $3.   Dest         $3.
         DestType     $13.  FltDate      date9.
         Cap1st       8.    CapBus       8.
         CapEcon      8.    CapPassTotal 8.
         CapCargo     8.    Num1st       8.
         NumBus       8.    NumEcon      8.
         NumPassTotal 8.    Rev1st       8.
         RevBus       8.    RevEcon      8.
         CargoRev     8.    RevTotal     8.
         CargoWeight  8.;
run;
b. Turn off the option after you record the statistics.
options nofullstimer;

2. Using the SASFILE Statement

Open the program, c01ex2Start, and add the appropriate statement(s) to open and load the entire data
set ia.UK_fltat into memory. At the end of the program, close the data set.
sasfile ia.uk_fltat load;

proc print data = ia.uk_fltat;
run;

proc means data = ia.uk_fltat;
   var Salary;
run;

proc freq data = ia.uk_fltat;
   tables JobCode Gender;
run;

proc tabulate data = ia.uk_fltat;
   class Gender JobCode;
   var Salary;
   tables JobCode,Gender*Salary*(Mean Median);
run;

sasfile ia.uk_fltat close;

Chapter 2 Accessing Observations

2.1 Introduction

2.2 Creating a Sample Data Set

2.3 Creating and Using an Index

2.4 Solutions to Exercises

2.1 Introduction

Objectives
Review sequential processing.
Investigate methods for direct access.

Reading SAS Data Sets (Default)

[Diagram sequence: by default, SAS reads a SAS data set sequentially. Data moves from the input SAS data set into SAS buffers in memory, each observation is loaded into the PDV (ID, Flight, Route, Dest), and the results are written to the output SAS data set.]

Sequential processing continues until the pointer reaches the end of file.

Using Direct Access Methods

To override the default sequential processing, you can
use direct access methods.

Method                        Possible use                                     How does it work?
POINT= SET statement option   creating a sample of data from a SAS data set    Locates an observation by observation number
Indexing                      creating a subset of data with a WHERE clause    Locates an observation by variable value(s)

2.2 Creating a Sample Data Set

Objectives
Create a systematic sample that contains five
observations.
Create a systematic sample that contains an unknown
number of observations.
Create a random sample with replacement.
Create a random sample without replacement.

Selecting Observations
International Airlines (IA) is concerned with the accuracy
of the data in ia.sales that contains revenue figures
for 2004 and 2005. The size of the data set makes
auditing all of the data difficult. IA first wants to audit a
small sample to determine if a full audit is necessary.
Partial Output
FlightID RouteID Origin Dest DestType FltDate Cap1st CapBus CapEcon CapPassTotal CapCargo Num1st NumBus NumEcon NumPassTotal

IA10700 0000107 WLG AKL International 01JAN2005 12 . 138 150 36900 11 . 126 137
IA10701 0000107 WLG AKL International 01JAN2005 12 . 138 150 36900 12 . 136 148
IA10702 0000107 WLG AKL International 01JAN2005 12 . 138 150 36900 10 . 112 122
IA10703 0000107 WLG AKL International 01JAN2005 12 . 138 150 36900 12 . 113 125
IA10704 0000107 WLG AKL International 01JAN2005 12 . 138 150 36900 10 . 118 128
IA10705 0000107 WLG AKL International 01JAN2005 12 . 138 150 36900 11 . 117 128
IA10700 0000107 WLG AKL International 02JAN2005 12 . 138 150 36900 10 . 131 141
IA10701 0000107 WLG AKL International 02JAN2005 12 . 138 150 36900 11 . 113 124
IA10702 0000107 WLG AKL International 02JAN2005 12 . 138 150 36900 10 . 134 144
IA10703 0000107 WLG AKL International 02JAN2005 12 . 138 150 36900 11 . 114 125
IA10704 0000107 WLG AKL International 02JAN2005 12 . 138 150 36900 11 . 128 139
IA10705 0000107 WLG AKL International 02JAN2005 12 . 138 150 36900 12 . 131 143
IA10700 0000107 WLG AKL International 03JAN2005 12 . 138 150 36900 10 . 124 134
IA10701 0000107 WLG AKL International 03JAN2005 12 . 138 150 36900 12 . 135 147
IA10702 0000107 WLG AKL International 03JAN2005 12 . 138 150 36900 12 . 127 139


The data set ia.sales used for demonstrations and exercises contains fewer observations than
the data set ia.sales used for the course notes.

Creating a Systematic Sample

Select a five-observation subset by reading every
hundredth observation from observation number 100
to observation number 500.
data work.subset;
   do PickIt = 100 to 500 by 100;   /* 1 */
      set ia.sales
          point = PickIt;           /* 2 */
      output;                       /* 3 */
   end;
   stop;                            /* 4 */
run;

c02s2d1

1. The DO loop assigns a value to the variable PickIt.

2. PickIt is used by the POINT= option to select an observation from the SAS data set.

3. The OUTPUT statement writes the PDV values to the SAS data set.

4. The STOP statement stops the DATA step from continuing to execute after the five observations are
selected. Without a STOP statement, the DATA step continues in an infinite loop.

Using the POINT= Option

To create a sample, use the POINT= option in the
SET statement.
General form of the POINT= option:

SET data-set-name POINT = point-variable;

The point-variable has the following attributes:
• names a temporary numeric variable that contains the observation number of the observation to read
• must be given a value before the execution of the SET statement
• must be a variable (for example, X) and not a constant value (for example, 12)

The POINT= option value should be an integer greater than zero and less than or equal to the number of
observations in the SAS data set. If the value is not integral, the SET statement effectively applies the
FLOOR function to the value.
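
A POINT= value outside the valid range can be guarded against by testing the automatic variable _ERROR_, which SAS sets to 1 when the requested observation number is not valid. The following is a minimal sketch that reuses the systematic sample program from this section; stopping the step with ABORT is one possible reaction, chosen here only for illustration.

```sas
data work.subset;
   do PickIt = 100 to 500 by 100;
      set ia.sales point = PickIt;
      if _error_ then abort;      /* invalid observation number: stop the step */
      output;
   end;
   stop;
run;
```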

Using the STOP Statement

The POINT= option has the following features:
• uses direct-access read mode
• does not detect the end-of-file
To prevent the DATA step from looping continuously, use the STOP statement.
General form of the STOP statement:

STOP;

c02s2d1
data work.subset;
   do PickIt = 100 to 500 by 100;
      set ia.sales
          point = PickIt;
      output;
   end;
   stop;
run;
The PROC PRINT output of work.subset is shown below.
Creating a Systematic Sample of 5 Observations

Obs  FlightID  RouteID  Origin  Dest  DestType       FltDate    Cap1st  CapBus

  1  IA09200   0000092  CCU     DEL   International  01JAN2004      12       .
  2  IA02501   0000025  RDU     IND   Domestic       01JAN2004      12       .
  3  IA01101   0000011  RDU     ORD   Domestic       01JAN2004      12       .
  4  IA04203   0000042  PWM     RDU   Domestic       01JAN2004      12       .
  5  IA04901   0000049  LHR     BRU   International  02JAN2004      14       .

Obs  CapEcon  CapPassTotal  CapCargo  Num1st  NumBus  NumEcon  NumPassTotal  Rev1st     RevBus

  1      138           150     36900      10       .      110           120  $3,360.00       .
  2      138           150     36900      12       .      134           146  $2,472.00       .
  3      138           150     36900      11       .      126           137  $2,915.00       .
  4      138           150     36900      10       .      116           126  $2,890.00       .
  5      125           139     39700      12       .      124           136  $1,020.00       .

Obs  RevEcon     CargoRev   RevTotal  CargoWeight

  1  $12,210.00  $6,708.00  $22,278         12900
  2  $9,112.00   $2,464.00  $14,048          7700
  3  $11,088.00  $3,895.00  $17,898          9500
  4  $11,136.00  $5,148.00  $19,174         11700
  5  $3,472.00   $1,625.00  $6,117          12500


Using the Number of Observations

You must select a subset by reading every hundredth
observation from observation number 100 to the end
of the SAS data set.
data work.subset;
   do PickIt = 100 to TotObs by 100;   /* 2 */
      set ia.sales point = PickIt
                   nobs = TotObs;      /* 1 */
      output;
   end;
   stop;
run;

c02s2d2

1. The NOBS= option creates a temporary variable that contains the total number of observations in the
input data set(s). During compilation, SAS reads the descriptor portion of the data set and assigns the
value of the NOBS= variable.

The total includes deleted observations. Rebuild the data set to remove deleted observations.

2. You can refer to the NOBS= variable in executable statements that appear before the SET statement.
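
Because the NOBS= variable is assigned during compilation, its value is available even when the SET statement never executes. The sketch below uses this to write the observation count of ia.sales to the log without reading any observations. It assumes an engine that can report the count from the descriptor portion, as the Base SAS engine does for a data set on disk.

```sas
data _null_;
   if 0 then set ia.sales nobs = TotObs;   /* condition is never true; NOBS= is set at compile time */
   put 'Observations in ia.sales: ' TotObs;
   stop;                                   /* no observations are read, so end the step explicitly */
run;
```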

Using the Number of Observations

You can use the NOBS= option in the SET statement
to determine how many observations there are in a
SAS data set.
General form of the SET statement:

SET SAS-data-set NOBS = variable;

The NOBS= option creates a temporary variable whose value has the following characteristics:
• is the number of observations in the input data set(s)
• is assigned during compilation
• is retained
• should not be modified during execution

Compilation

data work.subset;
   do PickIt = 100 to TotObs by 100;
      set ia.sales point = PickIt
                   nobs = TotObs;
      output;
   end;
   stop;
run;

c02s2d2

During compilation, SAS builds the PDV in the following order:
1. PickIt is added to the PDV. It is flagged D (drop) because the POINT= variable is temporary.
2. TotObs is added to the PDV. It is also flagged D because the NOBS= variable is temporary.
3. The variables in ia.sales are added to the PDV, and TotObs is assigned the value 329264 from the
descriptor portion of ia.sales.

PickIt (D)  TotObs (D)  FlightID  RouteID  Origin  ...  Rev1st   RevBus  RevEcon  CargoRev  RevTotal  CargoWt
            329264

Execution

1. The DO loop sets PickIt to 100, and the SET statement reads observation 100 into the PDV.

PickIt (D)  TotObs (D)  FlightID  RouteID  Origin  ...  Rev1st   RevBus  RevEcon  CargoRev  RevTotal  CargoWt
100         329264      IA10703   0000107  WLG     ...  1524.00  .       4956.00  2180.00   8660.00   10900.00

2. The OUTPUT statement writes the PDV values to work.subset (explicit output).

3. The DO loop increments PickIt to 200. The PDV still holds the values from observation 100 until the
SET statement executes again.

4. The SET statement reads observation 200 into the PDV, and the OUTPUT statement writes it to
work.subset (explicit output).

PickIt (D)  TotObs (D)  FlightID  RouteID  Origin  ...  Rev1st   RevBus  RevEcon  CargoRev  RevTotal  CargoWt
200         329264      IA10701   0000107  WLG     ...  1270.00  .       5796.00  1460.00   8526.00   7300.00

5. The loop continues in the same way until PickIt is incremented to 329300, which is greater than
TotObs (329264). The loop ends, and the PDV still holds the last observation that was read.

PickIt (D)  TotObs (D)  FlightID  RouteID  Origin  ...  Rev1st   RevBus  RevEcon  CargoRev  RevTotal  CargoWt
329300      329264      IA10801   0000108  AKL     ...  1524.00  .       4998.00  2140.00   8662.00   10700.00

6. The STOP statement executes, and execution of the DATA step stops.

Partial PROC PRINT Output of work.subset

A Systematic Sample of Fares

Obs FlightID RouteID Origin Dest DestType FltDate Cap1st CapBus

1 IA09200 0000092 CCU DEL International 01JAN2004 12 .
2 IA02501 0000025 RDU IND Domestic 01JAN2004 12 .
3 IA01101 0000011 RDU ORD Domestic 01JAN2004 12 .
4 IA04203 0000042 PWM RDU Domestic 01JAN2004 12 .
5 IA04901 0000049 LHR BRU International 02JAN2004 14 .
6 IA06405 0000064 FBU FRA International 02JAN2004 14 .
7 IA05203 0000052 GVA LHR International 02JAN2004 14 .
8 IA02000 0000020 BOS RDU Domestic 02JAN2004 12 .
9 IA10802 0000108 AKL WLG International 02JAN2004 12 .
10 IA08900 0000089 JRS DEL International 03JAN2004 14 30
11 IA01305 0000013 RDU IAD Domestic 03JAN2004 12 .
12 IA03705 0000037 RDU MSY Domestic 03JAN2004 12 .

Obs CapEcon CapPassTotal CapCargo Num1st NumBus NumEcon NumPassTotal Rev1st

1 138 150 36900 10 . 110 120 $3,360.00
2 138 150 36900 12 . 134 146 $2,472.00
3 138 150 36900 11 . 126 137 $2,915.00
4 138 150 36900 10 . 116 126 $2,890.00
5 125 139 39700 12 . 124 136 $1,020.00
6 125 139 39700 14 . 101 115 $3,976.00
7 125 139 39700 12 . 109 121 $2,280.00
8 138 150 36900 11 . 120 131 $2,772.00
9 138 150 36900 11 . 108 119 $1,397.00
10 163 207 82400 12 26 145 183 $12,372.00
11 138 150 36900 12 . 130 142 $1,140.00
12 138 150 36900 11 . 122 133 $3,520.00

Obs RevBus RevEcon CargoRev RevTotal CargoWeight

1 . $12,210.00 $6,708.00 $22,278 12900
2 . $9,112.00 $2,464.00 $14,048 7700
3 . $11,088.00 $3,895.00 $17,898 9500
4 . $11,136.00 $5,148.00 $19,174 11700
5 . $3,472.00 $1,625.00 $6,117 12500
6 . $9,494.00 $7,181.00 $20,651 16700
7 . $6,867.00 $4,495.00 $13,642 15500
8 . $9,960.00 $4,173.00 $16,905 10700
9 . $4,536.00 $2,620.00 $8,553 13100
10 $18,278.00 $49,590.00 $72,364.00 $152,604 45800
11 . $4,160.00 $1,275.00 $6,575 8500
12 . $12,932.00 $5,047.00 $21,499 10300


Creating a Random Sample

There are several random number functions to generate
random numbers from various distributions.
General form of the RANUNI function:

RANUNI(seed)

The UNIFORM function is an alias for the RANUNI function.

The seed is an initial starting point that the RANUNI function uses to generate streams of random
numbers.
The seed must be an integer with a value less than 2^31-1 (2,147,483,647).

A 0 argument for the RANUNI function uses the system clock time, resulting in a different stream
of random numbers each time that the program is run.
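
To get the same stream of random numbers on every run, a positive seed can be used instead of 0. In the sketch below, 12345 is an arbitrary seed chosen for illustration; any positive integer within the allowed range behaves the same way. The values written to the log should be identical each time the step is submitted.

```sas
data _null_;
   do i = 1 to 3;
      x = ranuni(12345);    /* fixed positive seed: the same stream on every run */
      put i= x=;
   end;
run;
```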

Using the RANUNI Function

The RANUNI function returns a number between
0 and 1 (non-inclusive) generated from a uniform
distribution.

   ranuni(seed)

Examples:
   Random number
   .01253689
   .95196500

If you want a number between 0 and 5 (non-inclusive),
multiply the result by 5:

   ranuni(seed) * 5

Examples:
   Random number      * 5
   .01253689     ->   0.06268445
   .95196500     ->   4.75982500

Using the RANUNI and CEIL Functions

If you want an integer between 1 and 5 (inclusive), use
the following:

   CEIL(ranuni(seed) * 5)

Examples:
   Random number      * 5             CEIL( )
   .01253689     ->   0.06268445  ->  1
   .95196500     ->   4.75982500  ->  5

The CEIL function returns the smallest integer that is greater than or equal to the argument.
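
The behavior of this expression can be checked empirically with a sketch like the following. It generates 1,000 values (an arbitrary count) into a hypothetical data set named work.draws and tabulates them. Every value of Pick should be an integer from 1 to 5, and the five frequencies should be roughly equal, although the exact counts vary from run to run with a seed of 0.

```sas
data work.draws;                      /* work.draws is a hypothetical name */
   do i = 1 to 1000;
      Pick = ceil(ranuni(0) * 5);     /* integer from 1 to 5 */
      output;
   end;
run;

proc freq data = work.draws;
   tables Pick;
run;
```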

Creating a Random Sample

c02s2d3
Create a random sample with replacement. A sample with replacement can contain duplicate
observations because an observation can be selected more than one time.

data work.subset (drop = i SampSize);
   SampSize = 10;
   do i = 1 to SampSize;
      PickIt = ceil(ranuni(0)*TotObs);
      set ia.sales point = PickIt nobs = TotObs;
      output;
   end;
   stop;
run;

proc print data = work.subset;
   title 'A Random Sample with Replacement';
run;
Output
A Random Sample with Replacement

Obs FlightID RouteID Origin Dest DestType FltDate Cap1st CapBus

1 IA04604 0000046 GLA LHR International 04APR2005 14 .
2 IA06302 0000063 FRA FBU International 29NOV2005 14 .
3 IA01003 0000010 LAX RDU Domestic 28JUL2004 16 .
4 IA01502 0000015 RDU SEA Domestic 26APR2005 16 .
5 IA09000 0000090 DEL JRS International 05DEC2005 14 30
6 IA02003 0000020 BOS RDU Domestic 09JAN2004 12 .
7 IA03000 0000030 HNL SFO Domestic 28MAY2005 14 30
8 IA01302 0000013 RDU IAD Domestic 20FEB2004 12 .
9 IA01602 0000016 SEA RDU Domestic 06MAY2005 16 .
10 IA06802 0000068 PRG LHR International 21FEB2004 14 .

Obs CapEcon CapPassTotal CapCargo Num1st NumBus NumEcon NumPassTotal Rev1st

1 125 139 39700 13 . 106 119 $1,846.00
2 125 139 39700 14 . 95 109 $3,976.00
3 251 267 77400 16 . 227 243 $14,816.00
4 251 267 77400 15 . 208 223 $14,610.00
5 163 207 82400 13 24 150 187 $13,403.00
6 138 150 36900 10 . 111 121 $2,520.00
7 163 207 82400 13 27 132 172 $12,844.00
8 138 150 36900 11 . 129 140 $1,045.00
9 251 267 77400 13 . 241 254 $12,662.00
10 125 139 39700 12 . 124 136 $3,192.00


Obs RevBus RevEcon CargoRev RevTotal CargoWeight

1 . $4,982.00 $3,498.00 $10,326 15900
2 . $8,930.00 $7,697.00 $20,603 17900
3 . $69,689.00 $40,896.00 $125,401 28800
4 . $67,184.00 $48,872.00 $130,666 32800
5 $16,872.00 $51,300.00 $71,100.00 $152,675 45000
6 . $9,213.00 $4,953.00 $16,686 12700
7 $18,171.00 $43,296.00 $72,960.00 $147,271 48000
8 . $4,128.00 $1,335.00 $6,508 8900
9 . $77,843.00 $39,634.00 $130,139 26600
10 . $10,912.00 $5,125.00 $19,229 12500


c02s2d4 (Self-Study)
Create a random sample without replacement. A sample without replacement cannot contain duplicate
observations because after an observation is output to work.subset, programmatically it cannot be
selected again.

The following program can be used as a template. Replace the following:

• work.subset with the name of your resulting SAS data set
• ia.sales with the name of the data set from which to sample
• the 10 in the SampSize = 10 statement with the number of observations to read
data work.subset(drop = ObsLeft SampSize);
   SampSize = 10;                              /* 1 */
   ObsLeft = TotObs;                           /* 2 */
   do while(SampSize > 0 and ObsLeft > 0);
      PickIt + 1;                              /* 3 */
      if ranuni(0) < SampSize/ObsLeft then
         do;
            set ia.sales point = PickIt
                         nobs = TotObs;
            output;
            SampSize = SampSize - 1;
         end;
      ObsLeft = ObsLeft - 1;
   end;
   stop;
run;

proc print data = work.subset;
   title 'A Random Sample without Replacement';
run;

1. SampSize is the number of observations wanted in the sample. It is decreased by 1 each time an
observation is selected.

2. ObsLeft is the number of observations that have not yet been considered for selection. The start value
is equal to TotObs, the total number of observations in the data set being sampled.

3. PickIt is the observation number of the observation that is currently being considered in the data set
being sampled. Because it is used in a SUM statement, its starting value is 0.


In each iteration of the DO loop, the following occurs:

1. PickIt is incremented by 1.

2. The IF expression ranuni(0) < SampSize/ObsLeft is evaluated:

a. If true, these actions occur:

1) The observation PickIt is selected in the sample.

2) SampSize is decreased by 1.

b. If false, the observation PickIt is skipped.

3. ObsLeft is decreased by 1.

The process ends when SampSize is 0; no additional observations are needed.

Take note of the following:

• Each observation is considered for selection.
• An observation number is considered only once.
• An observation is read from the data set only when its observation number is selected.

This is an adaptation of a sampling routine that has been used by statisticians for many years.
• The sample size is fixed.
• An observation can be selected only once.
• Each observation has an equal probability of being selected.
• The selection probability for an observation is independent of the selection of another
observation.
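
As a small worked check of the equal-probability property (an illustration that is not part of the original course notes), consider a data set with 4 observations and a sample size of 2. Observation 1 is selected with probability SampSize/ObsLeft = 2/4. For observation 2, ObsLeft is 3, and SampSize is 1 if observation 1 was selected or 2 if it was not:

```latex
\begin{align*}
P(\text{obs 1 selected}) &= \frac{2}{4} = \frac{1}{2} \\
P(\text{obs 2 selected}) &= \underbrace{\frac{1}{2}\cdot\frac{1}{3}}_{\text{obs 1 selected}}
  + \underbrace{\frac{1}{2}\cdot\frac{2}{3}}_{\text{obs 1 not selected}}
  = \frac{1}{6} + \frac{2}{6} = \frac{1}{2}
\end{align*}
```

Both observations have the same selection probability, 2/4, which is the sample size divided by the number of observations.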

Output
A Random Sample without Replacement

Obs FlightID RouteID Origin Dest DestType FltDate Cap1st CapBus

1 IA02000 0000020 BOS RDU Domestic 08JAN2004 12 .
2 IA06502 0000065 FRA ARN International 08FEB2004 14 .
3 IA11201 0000112 SFO HND International 23JUN2004 19 35
4 IA01804 0000018 SFO SEA Domestic 15JUL2004 12 .
5 IA04605 0000046 GLA LHR International 08SEP2004 14 .
6 IA01803 0000018 SFO SEA Domestic 09SEP2004 12 .
7 IA02203 0000022 DFW RDU Domestic 18JAN2005 12 .
8 IA05205 0000052 GVA LHR International 23MAR2005 14 .
9 IA03904 0000039 RDU MCI Domestic 23JUN2005 12 .
10 IA04200 0000042 PWM RDU Domestic 10DEC2005 12 .

Cap Num
Pass Num Num Pass
Obs CapEcon Total CapCargo Num1st Bus Econ Total Rev1st

1 138 150 36900 11 . 133 144 $2,772.00

2 125 139 39700 11 . 100 111 $3,377.00
3 201 255 105500 17 32 193 242 $36,091.00
4 138 150 36900 12 . 134 146 $3,360.00
5 125 139 39700 11 . 97 108 $1,562.00
6 138 150 36900 10 . 113 123 $2,800.00
7 138 150 36900 10 . 137 147 $4,350.00
8 125 139 39700 14 . 106 120 $2,660.00
9 138 150 36900 11 . 125 136 $4,092.00
10 138 150 36900 12 . 116 128 $3,468.00

Cargo
Obs RevBus RevEcon CargoRev RevTotal Weight

1 . $11,039.00 $3,159.00 $16,970 8100

2 . $10,200.00 $8,225.00 $21,802 17500
3 $46,304.00 $136,065.00 $186,146.00 $404,606 57100
4 . $12,462.00 $3,311.00 $19,133 7700
5 . $4,559.00 $3,982.00 $10,103 18100
6 . $10,509.00 $5,289.00 $18,598 12300
7 . $19,728.00 $5,025.00 $29,103 7500
8 . $6,678.00 $4,553.00 $13,891 15700
9 . $15,500.00 $5,529.00 $25,121 9700
10 . $11,136.00 $4,972.00 $19,576 11300

With a seed value of 0, you get different results each time that the program is executed, but it is
possible that some of the same observations will be selected as were selected in previous
executions.
Using the SURVEYSELECT Procedure (Self-Study)

The SURVEYSELECT procedure has the following attributes:
• provides a variety of methods for selecting probability-based random samples
• can select a simple random sample or can sample according to a complex multistage sample design
  that includes stratification, clustering, and unequal probabilities of selection
• is part of SAS/STAT

Using the SURVEYSELECT Procedure (Self-Study)

This program creates a SAS data set, sample, containing 100 observations randomly selected from the
ia.sales SAS data set.

proc surveyselect data = ia.sales
                  method = srs n = 100
                  out = sample;
run;

c02s2d5

Using the SURVEYSELECT Procedure (Self-Study)

General form of the SURVEYSELECT procedure:

PROC SURVEYSELECT options;
   STRATA variables;
   CONTROL variables;
   SIZE variable;
   ID variables;
RUN;

The STRATA statement partitions the input data set into non-overlapping groups defined by the
STRATA variables. PROC SURVEYSELECT then selects independent
samples from these strata, according to the selection method and design
parameters specified in the PROC SURVEYSELECT statement. PROC
SURVEYSELECT expects the input data set to be sorted in the order of the
STRATA variables.
The CONTROL statement names variables for sorting the input data set. The CONTROL variables can
be character or numeric. PROC SURVEYSELECT sorts the input data set by
the CONTROL variables before selecting the sample. If you also specify a
STRATA statement, PROC SURVEYSELECT sorts by the CONTROL
variables within the strata.
The SIZE statement names one and only one size measure variable, which contains the size
measures to be used when sampling with probability proportional to size.
The SIZE variable must be numeric. When the value of an observation's
SIZE variable is missing or non-positive, that observation has no chance of
being selected for the sample.
The ID statement names variables from the DATA= input data set to be included in the OUT=
data set of selected units. If there is no ID statement, PROC
SURVEYSELECT includes all variables from the DATA= data set in the
OUT= data set. The ID variables can be character or numeric.
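
As a hedged illustration of the STRATA and ID statements, the sketch below draws three observations from each value of Sex. It assumes sashelp.class as a stand-in input table; the output names work.class and work.classsample and the seed 12345 are arbitrary choices made for this example.

```sas
/* PROC SURVEYSELECT expects the input sorted by the STRATA variables */
proc sort data = sashelp.class out = work.class;
   by Sex;
run;

proc surveyselect data = work.class
                  method = srs n = 3      /* three units per stratum        */
                  seed = 12345            /* fixed seed for a repeatable run */
                  out = work.classsample;
   strata Sex;                            /* independent sample per Sex     */
   id Name Age;                           /* keep only these input variables */
run;

proc print data = work.classsample;
run;
```

With the ID statement present, the output data set carries the listed variables plus the strata variable and the selection statistics, rather than every input variable.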

Using the SURVEYSELECT Procedure (Self-Study)

The PROC SURVEYSELECT statement performs the following tasks:
• invokes the procedure
• optionally identifies input and output data sets
• specifies the sample selection method, the sample size, and other sample design parameters

The PROC SURVEYSELECT statement is the only statement required to create a simple random sample.

Options for the SURVEYSELECT Procedure (Self-Study)

The following options can be specified in the PROC SURVEYSELECT statement:

To do this:                    Use this option:
Specify the input data set     DATA=
Specify output data sets       OUT=
Suppress displayed output      NOPRINT
Specify selection method       METHOD=
Specify sample size            SAMPSIZE=
Specify random number seed     SEED=

Methods Used by the SURVEYSELECT Procedure (Self-Study)

Selected values for the METHOD= option are as follows:

METHOD=
SYS   The method of systematic random sampling selects units at a fixed interval throughout the
      sampling frame or stratum after a random start.
URS   The method of unrestricted random sampling selects units with equal probability and with
      replacement. Because units are selected with replacement, a unit can be selected for the
      sample more than once.
SRS   The method of simple random sampling selects units with equal probability and without
      replacement. The selection probability for each individual unit equals n/N.

These methods correspond to the DATA step examples at the beginning of this section.
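
The options and methods listed above can be tried side by side. This is a sketch under stated assumptions: sashelp.class stands in for ia.sales, and the output data set names and the seed are illustrative. N= (used in the course program) and SAMPSIZE= name the same option.

```sas
/* Simple random sample without replacement, no printed summary */
proc surveyselect data = sashelp.class
                  out = work.srs
                  method = srs
                  sampsize = 5
                  seed = 12345
                  noprint;
run;

/* Unrestricted random sample (with replacement).               */
/* OUTHITS writes one output row per selection, so a unit that  */
/* is picked twice appears twice.                               */
proc surveyselect data = sashelp.class
                  out = work.urs
                  method = urs
                  sampsize = 5
                  seed = 12345
                  outhits
                  noprint;
run;

/* Systematic random sample: fixed interval after a random start */
proc surveyselect data = sashelp.class
                  out = work.sys
                  method = sys
                  sampsize = 5
                  seed = 12345
                  noprint;
run;
```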

Reviewing the SURVEYSELECT Procedure Example (Self-Study)
This program creates a SAS data set, sample,
containing 100 observations randomly selected from the
ia.sales SAS data set.
proc surveyselect data = ia.sales
method = srs n = 100
out = sample;
run;

c02s2d5

The SURVEYSELECT procedure step produces output similar to that of the c02s2d3 example earlier in this
chapter, except that it selects more observations (100 versus 10).

Using the SURVEYSELECT Procedure (Self-Study)

In addition to creating the SAS data set, Sample, PROC SURVEYSELECT provides the following
information in the Output window:

The SURVEYSELECT Procedure

Selection Method         Simple Random Sampling

Input Data Set           SALES
Random Number Seed       955326001
Sample Size              100
Selection Probability    0.000304
Sampling Weight          3292.64
Output Data Set          SAMPLE

Because the SEED= option is not specified in the PROC SURVEYSELECT statement, the seed value is
obtained using the time of day from the computer's clock.

To specify a seed so that you can replicate a sample, use the SEED= option on the PROC
SURVEYSELECT statement.
proc surveyselect data = ia.sales
method = srs n = 100
out = sample
seed = 12345;
run;

Using the SURVEYSELECT Procedure (Self-Study)

In the output shown above, the Selection Probability for each individual unit is calculated as
100/329264 (sample size/number of observations in the input data set).

The Sampling Weight is the inverse of the
selection probability, 329264/100.
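
The two statistics can be recomputed in a short DATA step. This is a sketch that only repeats the arithmetic stated above; the variable names are illustrative and are not the names that PROC SURVEYSELECT uses internally.

```sas
data _null_;
   SampSize = 100;                        /* sample size requested          */
   TotObs   = 329264;                     /* observations in the input data */
   SelectionProb  = SampSize / TotObs;    /* n/N                            */
   SamplingWeight = TotObs / SampSize;    /* inverse of the probability     */
   put SelectionProb= 8.6 SamplingWeight= 10.2;
run;
```

The values written to the log should agree with the Selection Probability and Sampling Weight lines of the procedure output.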


Partial Output from the SAS Data Set SAMPLE

Using PROC SURVEYSELECT to create a Random Sample without Replacement

Flight
Obs ID RouteID Origin Dest DestType FltDate Cap1st CapBus

1 IA06900 0000069 LHR AMS International 29OCT2005 14 .

2 IA01905 0000019 RDU BOS Domestic 14FEB2005 12 .
3 IA01904 0000019 RDU BOS Domestic 22MAY2005 12 .
4 IA04901 0000049 LHR BRU International 26JAN2005 14 .
5 IA10303 0000103 SYD CBR International 30OCT2005 12 .
6 IA09103 0000091 DEL CCU International 08MAR2005 12 .
7 IA09801 0000098 PEK CCU International 23DEC2005 28 52
8 IA04301 0000043 LHR CDG International 15NOV2005 14 .
9 IA06001 0000060 MAD CDG International 23NOV2005 14 .
10 IA06000 0000060 MAD CDG International 26NOV2005 14 .
11 IA05701 0000057 FRA CPH International 05APR2005 14 .
12 IA08500 0000085 FRA CPT International 12JUL2005 19 56

Cap Num
Pass Num Num Pass
Obs CapEcon Total CapCargo Num1st Bus Econ Total Rev1st

1 125 139 39700 13 . 106 119 $1,170.00

2 138 150 36900 11 . 115 126 $2,772.00
3 138 150 36900 12 . 137 149 $3,024.00
4 125 139 39700 14 . 101 115 $1,190.00
5 138 150 36900 12 . 118 130 $768.00
6 138 150 36900 12 . 131 143 $4,032.00
7 157 237 85900 28 48 146 222 $23,324.00
8 125 139 39700 14 . 106 120 $1,274.00
9 125 139 39700 14 . 115 129 $3,710.00
10 125 139 39700 13 . 112 125 $3,445.00
11 125 139 39700 12 . 106 118 $2,088.00
12 163 238 105500 18 50 124 192 $43,344.00

Cargo
Obs RevBus RevEcon CargoRev RevTotal Wt

1 . $3,074.00 $2,226.00 $6,470.00 15900

2 . $9,545.00 $4,563.00 $16,880.00 11700
3 . $11,371.00 $2,769.00 $17,164.00 7100
4 . $2,828.00 $2,171.00 $6,189.00 16700
5 . $2,478.00 $1,090.00 $4,336.00 10900
6 . $14,541.00 $4,316.00 $22,889.00 8300
7 $27,264.00 $40,442.00 $53,120.00 $144,150.00 41500
8 . $3,180.00 $2,198.00 $6,652.00 15700
9 . $10,120.00 $5,699.00 $19,529.00 13900
10 . $9,856.00 $6,027.00 $19,328.00 14700
11 . $6,042.00 $4,347.00 $12,477.00 16100
12 $82,050.00 $99,076.00 $247,599.00 $472,069.00 67100

Comparison of the DATA Step and the SURVEYSELECT Procedure (Self-Study)

DATA Step                            PROC SURVEYSELECT

Full power of DATA step processing   Less coding
Can create multiple output           One output data set with
data sets                            additional statistics
Part of Base SAS                     Part of SAS/STAT

Exercises

1. Generating a Random Sample with Replacement

Generate a random sample with replacement of 50 employees from ia.salcomps to analyze their
current salaries.
If the current salary is over $30,000, then place the employee’s information in the work.over30
SAS data set.
If the current salary is $30,000 or less, then place the employee’s information in the
work.ltoreq30 SAS data set.

If you obtain zero observations in one of the data sets, run the program again. It is possible
that the selected observations might all be over $30,000 or all $30,000 or less.
2. Generating a Random Sample without Replacement (Optional)
Generate a random sample without replacement of ten flights from ia.cap2000.

2.3 Creating and Using an Index

Objectives
• Define indexes.
• List the uses of indexes.
• Use the DATA step to create indexes.
• Use PROC DATASETS to create and maintain indexes.
• Use PROC SQL to create and maintain indexes.

Using Indexes
To decrease the time used to query a heavily used
SAS data set, create an index on ia.sales.
Flight
Obs ID RouteID Origin Dest DestType FltDate . . .

1 IA10700 0000107 WLG AKL International 01JAN2004 . . .

2 IA10701 0000107 WLG AKL International 01JAN2004 . . .
3 IA10702 0000107 WLG AKL International 01JAN2004 . . .
4 IA10703 0000107 WLG AKL International 01JAN2004 . . .
5 IA10704 0000107 WLG AKL International 01JAN2004 . . .
. . . . .
. . . . .
. . . . .

Flight
Obs ID RouteID Origin Dest DestType FltDate . . .

329259 IA10800 0000108 AKL WLG International 30DEC2005 . . .

329260 IA10801 0000108 AKL WLG International 30DEC2005 . . .
329261 IA10802 0000108 AKL WLG International 30DEC2005 . . .
329262 IA10803 0000108 AKL WLG International 30DEC2005 . . .
329263 IA10804 0000108 AKL WLG International 30DEC2005 . . .
329264 IA10805 0000108 AKL WLG International 30DEC2005 . . .

The data set ia.sales used for demonstrations and exercises contains fewer observations than
the data set ia.sales used for the course notes.

Using Indexes
Indexed SAS Data Set

   Obs  FlightID  RouteID  Origin  Dest  DestType       FltDate    . . .
329259  IA10800   0000108  AKL     WLG   International  30DEC2005  . . .
329260  IA10801   0000108  AKL     WLG   International  30DEC2005  . . .
329261  IA10802   0000108  AKL     WLG   International  30DEC2005  . . .
329262  IA10803   0000108  AKL     WLG   International  30DEC2005  . . .

Simplified Index File

Key Variable = Origin

Key Value   Record Identifiers: Page(obs,obs...)
AKL         25(1,2,3,...) 32(...)...
AMS         82(22,23,...) 96(...)...
ANC         75(18,34,...) 96(...)...
.           ...
.           ...
.           ...

The index is stored with the key values in sorted order.

Using Indexes
An index is an optional file that you can create for a SAS data file that does the following:
• points to observations based on the values of one or more key variables
• provides direct access to specific observations

In other words, index usage locates an observation by value.

This section discusses indexes for Base SAS data files. A discussion of indexes for Scalable
Performance Data Engine (SPDE) data files is presented in a later chapter.

The Purpose of Indexes

Indexes can provide direct access to observations in SAS data sets to accomplish the following:
• yield faster access to small subsets (WHERE)
• return observations in sorted order (BY)
• perform table lookup operations (SET with KEY=)
• join observations (PROC SQL)
• modify observations (MODIFY with KEY=)
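
Three of these uses can be sketched on a small table. The example assumes a copy of sashelp.class with a simple index on Name; the data set names work.class, work.first_by_name, work.wanted, and work.found are hypothetical. On a table this small SAS may still choose a sequential read for the WHERE step, so the sketch shows the syntax rather than a performance gain.

```sas
/* Illustrative copy of sashelp.class with a simple index on Name */
data work.class(index = (Name));
   set sashelp.class;
run;

/* WHERE: subset by a key value */
proc print data = work.class;
   where Name = 'Alice';
run;

/* BY: process in Name order without a PROC SORT step */
data work.first_by_name;
   set work.class;
   by Name;
   if first.Name;
run;

/* SET with KEY=: look up each requested name directly */
data work.wanted;
   length Name $ 8;
   input Name;
   datalines;
Alice
Zelda
;

data work.found;
   set work.wanted;
   set work.class key = Name;
   if _iorc_ = 0 then output;     /* match found in the indexed table  */
   else _error_ = 0;              /* no match: reset the error flag    */
run;
```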

Why Use an Index?

data _null_;
set ia.sales;
where FltDate = '02JUL2004'd;
run;

What happens when you submit this program?

Without an Index
[Diagram: All pages of the input SAS data set are loaded from disk into the input buffers in memory (I/O).
The WHERE statement selects observations by reading data sequentially. Selected observations pass
through the PDV (ID, Route, Origin, Dest) to the output buffers and are written to the output data set (I/O).]

With an Index
[Diagram: The index file is checked. Only necessary pages of the input SAS data set are loaded from disk
into the input buffers in memory (I/O). The WHERE statement selects observations by using direct access.
Selected observations pass through the PDV (ID, Route, Origin, Dest) to the output buffers and are written
to the output data set (I/O).]

When SAS uses an index to process data, SAS accomplishes the following:
• performs a binary search on the index file
• positions the index to the first entry containing a qualified value
• transfers a page of data containing the first record identifier for the qualified value to a buffer
• directly accesses the value specified by the record identifier
• positions the index to the next entry containing a qualified value
• transfers the page of data, if it is not already in the buffer
• directly accesses the value specified by the record identifier
• continues to process the data until there is no more data that satisfies the WHERE expression

If the data values are sorted in ascending order by the indexed variables, fewer I/O operations are
required. In addition, if observations with the same key values are near each other in the file, for
whatever reason, I/O will be minimized.
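
To watch this decision in the log, a generated table can be queried with MSGLEVEL=I in effect. The sketch below assumes a hypothetical data set work.big with 100,000 observations and a simple index on Id; the names and the seed are illustrative. When SAS chooses the index, the log should contain an INFO note that names it.

```sas
options msglevel = i;

/* Hypothetical test table with a simple index on Id */
data work.big(index = (Id));
   do Id = 1 to 100000;
      Value = ranuni(12345);
      output;
   end;
run;

/* WHERE on the key variable: eligible for indexed access */
data _null_;
   set work.big;
   where Id = 4321;
run;

/* Subsetting IF: every observation is read, the index is not used */
data _null_;
   set work.big;
   if Id = 4321;
run;
```

Comparing the "observations read" notes of the two steps shows the difference between direct access and a sequential pass.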

Using Indexes
The index file consists of entries that are organized in
a tree structure, and connected by pointers.
When an index is used to process a request, such as
for WHERE processing, SAS searches the index file in
order to locate the requested record(s) rapidly.

[Figure: Directory-based index file naming conventions. The data file sales.sas7bdat holds ia.sales, with
key variables FlightID, Origin, and FltDate. The index file sales.sas7bndx holds the indexes for ia.sales,
DteFlt and Origin.]

Index Terminology
There are two types of indexes.

Type        Based On                        Name                                Example

Simple      the value of only one           Automatically given the same        Origin
            variable                        name as its key variable

Composite   the values of more than one     Must be given a name that is not    DteFlt
            variable concatenated to        the same as any variable or
            form a single value             existing index

Index Terminology
Index options include the following:
UNIQUE Values of the key variable(s) must be
unique. The option prevents an observation
with a duplicate value for the key variable(s)
from being added to the data set.

Flight
ID RouteID Origin Dest DestType FltDate . . .

IA10800 0000108 AKL WLG International 30DEC2005 . . .

IA10801 0000108 AKL WLG International 30DEC2005 . . .
IA10802 0000108 AKL WLG International 30DEC2005 . . .
IA10803 0000108 AKL WLG International 30DEC2005 . . .

The concatenation of the values for FlightID and FltDate forms a unique identifier for a row of data.

In an existing data set, if the variable(s) on which you attempt to create a unique index has duplicate
values, the index is not created and an error message is written to the SAS log.
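
A small case shows this behavior. The data set work.dups and its values are invented for this sketch; the INDEX CREATE statement is expected to fail because Id = 2 occurs twice.

```sas
/* Hypothetical table with a duplicate key value */
data work.dups;
   input Id Value;
   datalines;
1 10
2 20
2 30
;

proc datasets library = work nolist;
   modify dups;
   index create Id / unique;   /* expected to fail: Id is not unique */
quit;
```

After the step runs, PROC CONTENTS on work.dups should show no indexes.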

Creating Indexes
• To create indexes at the same time that you create a data set, use the INDEX= data set option on
  the output data set.
• To create or delete indexes in existing data sets, use one of the following:
  − DATASETS procedure
  − SQL procedure

Indexes can also be created using the SAS Management Console that is part of SAS Business Intelligence
Architecture.

Creating Indexes
When creating the index, you can do the following:
• designate the key variable(s)
• select a valid SAS name for the index (composite index only)
• specify the UNIQUE index option if appropriate

A data set can have these features:
• multiple simple and composite indexes
• character and numeric key variables

For increased efficiency, use the INDEX= option to create indexes when you initially create a
SAS data set.

Creating an Index with the DATA Step

c02s3d1
options msglevel=i;

data ia.Sales(index = (Origin
                       DteFlt = (FltDate FlightID)/unique));
   infile 'sales.dat' lrecl=162;                  * PC and Unix;
   *infile '.prog3.rawdata(sales)' lrecl=162;     * mainframe ;
   input FlightID $7. RouteID $7. Origin $3. Dest $3.
         DestType $13. FltDate date9. Cap1st 8. CapBus 8.
         CapEcon 8. CapPassTotal 8. CapCargo 8. Num1st 8.
         NumBus 8. NumEcon 8. NumPassTotal 8. Rev1st comma8.
         RevBus comma8. RevEcon comma8. CargoRev comma8.
         RevTotal comma8. CargoWeight comma8.;
   format FltDate date9.;
run;
Log
679 options msglevel=i;
680
681 data ia.Sales(index = (Origin
682 DteFlt = (FltDate FlightID)/unique));
683 infile 'sales.dat' lrecl=162; * PC and Unix;
684 *infile '.prog3.rawdata(sales)' lrecl=162; * mainframe ;
685 input FlightID $7. RouteID $7. Origin $3. Dest $3.
686 DestType $13. FltDate date9. Cap1st 8. CapBus 8.
687 CapEcon 8. CapPassTotal 8. CapCargo 8. Num1st 8.
688 NumBus 8. NumEcon 8. NumPassTotal 8. Rev1st comma8.
689 RevBus comma8. RevEcon comma8. CargoRev comma8.
690 RevTotal comma8. CargoWeight comma8.;
691 format FltDate date9.;
692 run;

NOTE: The infile 'C:\workshop\winsas\prog3\sales.dat' is:

File Name=C:\workshop\winsas\prog3\sales.dat,
RECFM=V,LRECL=162

NOTE: 329264 records were read from the infile 'C:\workshop\winsas\prog3\sales.dat'

The minimum record length was 162.
The maximum record length was 162.
NOTE: The data set IA.SALES has 329264 observations and 21 variables.
NOTE: Composite index DteFlt has been defined.
NOTE: Simple index Origin has been defined.
NOTE: DATA statement used (Total process time):
real time 10.76 seconds
cpu time 3.85 seconds

The external file sales used for demonstrations and exercises contains fewer observations than
the external file sales used for the course notes.

Creating Indexes with the DATA Step

When creating a data set in a DATA step, use the INDEX= data set option to create an index at the same
time.
General form of the INDEX= data set option:

DATA SAS-data-file-name(INDEX =
     (index-specification-1</option>
      …<index-specification-n</option>>));

An index-specification takes one of these forms:
• For a simple index, it is the name of the key variable.
• For a composite index, it is index-name = (list of key variables).
You can specify the UNIQUE option with the INDEX= data set option.
The INDEX= data set option can also be used in procedures with OUT= options and also with ODS
OUTPUT statements.
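
For example, a procedure that writes an OUT= data set can index it in the same step. This is a sketch that assumes sashelp.class as input; the output name work.heights and the statistic name MeanHeight are illustrative.

```sas
options msglevel = i;

proc summary data = sashelp.class nway;
   class Sex;
   var Height;
   /* INDEX= is applied to the OUT= data set as it is created */
   output out = work.heights(index = (Sex)) mean = MeanHeight;
run;
```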

Viewing Information about Indexes

To display information in the log concerning index
creation or index usage, change the value of the
MSGLEVEL= system option from its default value
of N to I.
General form of the MSGLEVEL= system option:

OPTIONS MSGLEVEL = N | I;

N   prints only notes, warnings, and error messages. This is the default.
I   also prints informational or INFO notes that pertain to index creation and usage, merge
    processing, and host sort utilities.

Managing Indexes with PROC DATASETS

c02s3d2
proc datasets library = ia nolist;
modify Sales;
index delete Origin;
index delete DteFlt;

index create Origin;

index create DteFlt = (FltDate FlightID) / unique;
quit;

The NOLIST option prevents a list of library members from being printed in the log.

Log
703 options msglevel = i;
704
705 proc datasets library = ia nolist;
706 modify Sales;
707 index delete Origin;
NOTE: Index Origin deleted.
708 index delete DteFlt;
NOTE: All indexes defined on IA.SALES.DATA have been deleted.
709
710 index create Origin;
NOTE: Simple index Origin has been defined.
711 index create DteFlt = (FltDate FlightID) / unique;
NOTE: Composite index DteFlt has been defined.
712 quit;

NOTE: MODIFY was successful for IA.SALES.DATA.

NOTE: PROCEDURE DATASETS used (Total process time):
real time 0.84 seconds
cpu time 0.80 seconds

Managing Indexes with PROC DATASETS

You can use the DATASETS procedure on existing
data sets to create or delete indexes.
General form of the PROC DATASETS step to delete or create indexes:

PROC DATASETS LIBRARY = libref;
   MODIFY SAS-data-set-name;
   INDEX DELETE index-name;
   INDEX CREATE index-specification
                </ options>;
QUIT;

The INDEX CREATE statement in PROC DATASETS fails if an index with the same name already exists.
To replace an existing index, do the following:

• Delete the existing index of the same name.
• Create the new index.
If you delete and create indexes in the same step, delete indexes first so that the newly created
indexes can reuse the space of the deleted indexes.
You can specify the UNIQUE option on the INDEX CREATE statement.
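
The delete-then-create order can be sketched on a small table. The example assumes a copy of sashelp.class named work.class; the first step creates a simple index, and the second replaces it with a unique index of the same name.

```sas
data work.class;
   set sashelp.class;
run;

proc datasets library = work nolist;
   modify class;
   index create Name;                 /* simple index, not unique        */
quit;

proc datasets library = work nolist;
   modify class;
   index delete Name;                 /* remove the existing index first */
   index create Name / unique;        /* then re-create it with UNIQUE   */
quit;
```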

Managing Indexes with PROC SQL

c02s3d3
options msglevel = n;

proc sql;
   drop index Origin
      from ia.Sales;
   drop index DteFlt
      from ia.Sales;

   create index Origin
      on ia.Sales(Origin);
   create unique index DteFlt
      on ia.Sales(FltDate,FlightID);
quit;
Log
739 options msglevel = n;
740
741 proc sql;
742 drop index Origin
743 from ia.Sales;
NOTE: Index Origin has been dropped.
744 drop index DteFlt
745 from ia.Sales;
NOTE: Index DteFlt has been dropped.
746
747 create index Origin
748 on ia.Sales(Origin);
NOTE: Simple index Origin has been defined.
749 create unique index DteFlt
750 on ia.Sales(FltDate,FlightID);
NOTE: Composite index DteFlt has been defined.
751 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.88 seconds
cpu time 0.77 seconds

Managing Indexes with the SQL Procedure

You can use PROC SQL on existing data sets to create
or delete indexes.
General form of the PROC SQL step to create or delete indexes:

PROC SQL;
   DROP INDEX index-name
      FROM table-name;
   CREATE <option> INDEX index-name
      ON table-name(column-name-1,...
                    column-name-n);

The CREATE INDEX statement in PROC SQL fails if an index with the same name already exists.
To replace an existing index, do the following:

1. Drop the existing index of the same name.
2. Create the new index.
In most data processing situations, SAS maintains an index automatically.
The SQL procedure CREATE|DROP INDEX syntax is ANSI standard syntax.
You can specify the UNIQUE option in the CREATE INDEX statement.
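
A self-contained sketch of the same statements follows. It assumes a copy of sashelp.class named work.class; the composite index name SexAge is hypothetical.

```sas
data work.class;
   set sashelp.class;
run;

proc sql;
   /* simple index: the index name must match the column name */
   create unique index Name
      on work.class(Name);

   /* composite index: the name must differ from every column name */
   create index SexAge
      on work.class(Sex, Age);

   /* drop one index; the other remains */
   drop index SexAge
      from work.class;
quit;
```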

Index Documentation
• PROC CONTENTS
• PROC DATASETS
• SAS Explorer
• SAS Management Console

Documenting Indexes
c02s3d4

proc contents data = ia.sales;

run;
Partial Output
The CONTENTS Procedure

Data Set Name IA.SALES Observations 329264

Member Type DATA Variables 21
Engine V9 Indexes 2
Created Monday, March 28, Observation Length 168
2005 05:55:43 PM
Last Modified Monday, March 28, Deleted Observations 0
2005 06:06:25 PM
Protection Compressed NO
Data Set Type Sorted NO
Label
Data Representation WINDOWS_32
Encoding wlatin1 Western (Windows)

< lines of output removed >

Alphabetic List of Indexes and Attributes

# of
Unique Unique
# Index Option Values Variables

1 DteFlt YES 329264 FltDate FlightID

2 Origin 52

The data set ia.sales used for demonstrations and exercises contains fewer observations than
the data set ia.sales used for the course notes.
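
The same information can be requested in other ways. The sketch below assumes an indexed copy of sashelp.class (work.class, with a hypothetical composite index SexAge). It uses the CONTENTS statement of PROC DATASETS and, as a further option that the course notes do not cover, the DICTIONARY.INDEXES table in PROC SQL.

```sas
data work.class(index = (Name SexAge = (Sex Age)));
   set sashelp.class;
run;

/* CONTENTS statement in PROC DATASETS: same report as PROC CONTENTS */
proc datasets library = work nolist;
   contents data = class;
quit;

/* One row per key variable of each index; LIBNAME and MEMNAME */
/* values are stored in uppercase                               */
proc sql;
   select *
      from dictionary.indexes
      where libname = 'WORK' and memname = 'CLASS';
quit;
```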

Exercises

1. Creating Indexes with the DATA Step

2. Deleting Indexes with the SQL Procedure

Use PROC SQL to delete the Depart index from the ia.schedule data set.

3. Creating Indexes with the DATASETS Procedure

Use PROC DATASETS to create a simple index Date based on the Date variable for the
ia.schedule data set.

4. Viewing Index Information

Use PROC CONTENTS or PROC DATASETS to look at the index information.

Index Usage Possible

An index might be used when a WHERE expression references one of the following:
• a simple index key variable
• the primary key variable of a composite index

Using an index to process a WHERE expression might improve performance, and is referred to as
optimizing the WHERE expression.

In a compound expression using the logical operator AND, only one simple index can be used.

Index Usage Possible

Condition                        Examples

Comparison operators and         where FlightID eq 'IA07903';
the IN operator                  where EconomyRev < 5000;
                                 where Origin in ('LHR','CDG');

Comparison operators with NOT    where FlightID ne 'IA07903';
                                 where Origin not in ('LHR','CDG');

Comparison operators with the    where Origin =:'L';
colon modifier

There are simple indexes on the variables FlightID, EconomyRev, and Origin.

The colon modifier indicates a starts with condition. It cannot be used in the SQL procedure.

Index Usage Possible

Condition                        Examples

CONTAINS operator                where Origin contains 'L';

Fully bounded range conditions   where 5000 < EconomyRev < 10000;
specifying both an upper and     where EconomyRev between 5000 and 10000;
lower limit, which includes
the BETWEEN-AND operator

There are simple indexes on the variables EconomyRev and Origin.

Index Usage Possible

Condition                        Examples

Pattern-matching operator LIKE   where Origin like 'L%';
                                 where Origin like 'YY_';

IS NULL or IS MISSING operator   where Origin is null;
                                 where Origin is missing;

TRIM function                    where trim(City)='London';

There are simple indexes on the variables Origin and City.


Index Usage Possible

Condition                        Examples

The SUBSTR function with the     where substr(City,1,2)='Ca';
conditions that the starting
position = 1 and the length is
less than or equal to the
length of the string variable

There is a simple index on the variable City.

General form of the SUBSTR function:

SUBSTR (variable,position,<length>)
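
Several of these conditions can be tried against a generated table with MSGLEVEL=I in effect. The sketch assumes a hypothetical data set work.flights with simple indexes on Origin and Revenue; the names, values, and seed are invented for illustration. Each condition makes the index eligible, but SAS still decides case by case whether to use it, and the INFO notes in the log record that decision.

```sas
options msglevel = i;

/* Hypothetical test table: 100,000 rows, simple indexes on Origin and Revenue */
data work.flights(index = (Origin Revenue));
   length Origin $ 3;
   do Row = 1 to 100000;
      Origin  = scan('LHR CDG RDU SFO AKL', ceil(5 * ranuni(12345)));
      Revenue = ceil(20000 * ranuni(12345));
      output;
   end;
   drop Row;
run;

/* Fully bounded range on an indexed variable (small subset) */
data _null_;
   set work.flights;
   where 5000 < Revenue < 5050;
run;

/* IN operator on an indexed variable (large subset) */
data _null_;
   set work.flights;
   where Origin in ('LHR', 'CDG');
run;

/* SUBSTR starting at position 1 */
data _null_;
   set work.flights;
   where substr(Origin, 1, 1) = 'L';
run;
```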

When Is an Index Not Used?

An index is not used in the following circumstances:
• with a subsetting IF statement in a DATA step
• with particular WHERE expressions
• if SAS determines that it is more efficient to read the data sequentially

The conditions listed here apply to indexed Base SAS data files only. A discussion of when an
index is used with Scalable Performance Data Engine data files is contained in a later chapter.

No Index Usage
SAS does not use an index when a WHERE expression references an indexed variable if the following
conditions exist:
• No single index could supply all required observations.

  where RouteID = '000035' or FlightID = '202';

• Any function other than TRIM or SUBSTR appears in the WHERE expression.

  where weekday(FlightDate) = 6;

No Index Usage
• The SUBSTR function does not search a string beginning at the first position.

  where substr(Destination,2,1) = 'F';

• The sounds-like operator (=*) is used.

  where Destination =* 'lacks';
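
A condition that wraps the key variable in a function can often be restated as a range on the variable itself. The following sketch assumes a hypothetical data set work.dates with 200 observations per day for 2004 and 2005 and a simple index on FltDate. Both steps select the same 6,200 observations (31 days of July 2004 times 200), but only the second leaves the index eligible for use.

```sas
options msglevel = i;

/* Hypothetical test table: 200 rows per date, simple index on FltDate */
data work.dates(index = (FltDate));
   do FltDate = '01JAN2004'd to '31DEC2005'd;
      do Rep = 1 to 200;
         output;
      end;
   end;
   format FltDate date9.;
run;

/* Functions applied to the key variable: index not eligible */
data _null_;
   set work.dates;
   where year(FltDate) = 2004 and month(FltDate) = 7;
run;

/* Same rows as a fully bounded range on the key variable */
data _null_;
   set work.dates;
   where '01JUL2004'd <= FltDate <= '31JUL2004'd;
run;
```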

Compound Optimization
When you write a WHERE expression using all the key
variables in a composite index, you can take advantage
of compound optimization.
Compound optimization means that SAS can use a
composite index to optimize some WHERE expressions
that involve multiple variables.

where FlightID = 'IA10703' and

FltDate = '03DEC2004'd;

There is a composite index, DteFlt, on the variables FltDate and FlightID.

Compound Optimization
For compound optimization to occur, all of the following must be true:
• At least the first two key variables in the composite index must be used in the WHERE conditions.
• The conditions are connected using the AND operator.
• At least one condition must use the EQ or IN operator.
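
A WHERE expression that meets all three conditions can be sketched as follows. The data set work.sched is hypothetical: it holds 300 generated flight IDs for each day of 2004 and a unique composite index named DteFlt on FltDate and FlightID, mirroring the index defined on ia.sales.

```sas
options msglevel = i;

/* Hypothetical table with a unique composite index on (FltDate FlightID) */
data work.sched(index = (DteFlt = (FltDate FlightID) / unique));
   length FlightID $ 7;
   do FltDate = '01JAN2004'd to '31DEC2004'd;
      do FlightNum = 1 to 300;
         FlightID = cats('IA', put(FlightNum, z5.));   /* IA00001 ... IA00300 */
         output;
      end;
   end;
   drop FlightNum;
   format FltDate date9.;
run;

/* Both key variables, joined by AND, each compared with EQ */
data _null_;
   set work.sched;
   where FltDate = '03DEC2004'd and FlightID = 'IA00107';
run;
```

If SAS applies compound optimization, the log should contain an INFO note that names the DteFlt index.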

WHERE Expression Index Usage

To decide whether to use indexed or sequential access, SAS must do the following:
• determine whether the WHERE expression can be satisfied by an existing index
• select the best index if several indexes are available
• estimate the number of observations that qualify
• compare probable resource usage for both methods

Number of Qualified Observations

[Figure: the share of the data set that qualifies, marked at 0%, 3%, and 33.3%. From 0% to 3%, SAS will
use an index. From 3% to 33.3%, SAS will probably use an index. Above 33.3%, SAS might use an index.]

To determine whether it is more efficient to satisfy the WHERE expression by using the index or reading
the data sequentially, SAS uses these guidelines:
• If only a few observations are qualified, it is more efficient to use the index than to do a sequential
search of the entire data file.
• If most or all of the observations qualify, then it is more efficient to read the data file sequentially.

Number of Qualified Observations

To help SAS estimate the number of observations that
would be selected by a WHERE expression, each index
stores 21 statistics called cumulative percentiles or
centiles.
Centiles provide information about the distribution of
values in an index.

For information on updating and viewing the centile information, see the centiles information in the SAS
documentation for the CONTENTS and DATASETS procedures.
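
As a brief sketch of those features, the CENTILES option of PROC CONTENTS prints the stored centiles, and the INDEX CENTILES statement of PROC DATASETS refreshes them. The example assumes a copy of sashelp.class with a simple index on Age; the UPDATECENTILES= value of 10 is an arbitrary choice.

```sas
data work.class(index = (Age));
   set sashelp.class;
run;

/* Print the centiles stored for each index */
proc contents data = work.class centiles;
run;

/* Recompute the centiles now, and again after 10% of the values change */
proc datasets library = work nolist;
   modify class;
   index centiles Age / refresh updatecentiles = 10;
quit;
```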

Comparing Resource Usage

To compare the two access methods, SAS does the following:
• predicts the I/O operations required to read via the index
• calculates the I/O operations needed to read sequentially
• compares the two resource costs

Factors Affecting I/O

• Size of the subset relative to the size of the data file
• Order of data
• Page size of the data file
• Number of buffers allocated
• Cost to uncompress a compressed file for a sequential read

Data Order
Sort order can affect the number of I/O operations required for indexed access.

Unsorted data set

 Obs  FlightID  RouteID  Origin  . . .
 450  IA10803   0000108  AKL     . . .
 451  IA10804   0000108  AKL     . . .
 452  IA10805   0000108  AKL     . . .
   .  .         .        .
   .  .         .        .
 898  IA10800   0000108  AKL     . . .
 899  IA10801   0000108  AKL     . . .
 900  IA10802   0000108  AKL     . . .
 901  IA10803   0000108  AKL     . . .
 902  IA10804   0000108  AKL     . . .
 903  IA10805   0000108  AKL     . . .
   .  .         .        .
   .  .         .        .
1350  IA10800   0000108  AKL     . . .
1351  IA10801   0000108  AKL     . . .
1352  IA10802   0000108  AKL     . . .
1353  IA10803   0000108  AKL     . . .

Sorted data set

 Obs  FlightID  RouteID  Origin  . . .
   1  IA10800   0000108  AKL     . . .
   2  IA10801   0000108  AKL     . . .
   3  IA10802   0000108  AKL     . . .
   4  IA10803   0000108  AKL     . . .
   .  .         .        .
4367  IA10804   0000108  AKL     . . .
4368  IA10805   0000108  AKL     . . .
4369  IA07000   0000070  AMS     . . .
4370  IA07001   0000070  AMS     . . .
4371  IA07002   0000070  AMS     . . .
4372  IA07003   0000070  AMS     . . .

If the data set is sorted on the indexed variable(s), the qualified observations are adjacent to each other.
Fewer pages must be read into the input buffers.

Controlling WHERE Processing Index Usage

You can control index usage for WHERE processing
with these data set options:
IDXWHERE=YES | NO
   overrides the software’s decision regarding whether to use an index.
IDXNAME=index-name
   directs SAS to use a specific index.

IDXWHERE = YES | NO
YES SAS uses the best available index to process the WHERE expression, even if SAS estimates that
processing sequentially is faster.
NO SAS processes the data sequentially, even if SAS estimates that processing with an index is
faster.
You cannot use IDXWHERE= to override the use of an index to process a BY statement.
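The IDXNAME= option can be sketched in the same way. The example below assumes that ia.schedule has a simple index named Date; both names and the date value are illustrative only.

```sas
options msglevel = i;   /* write index-usage INFO messages to the log */

/* Assumption: ia.schedule has a simple index named Date. */
proc print data = ia.schedule(idxname = Date);
   where Date = '14oct2005'd;
run;
```

With MSGLEVEL=I in effect, the log reports which index, if any, was used for the WHERE expression. Specify either IDXNAME= or IDXWHERE= on a data set, not both; SAS documents the two options as mutually exclusive.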

Using the IDXWHERE= Option

Suppose that the variable Country in the data set
ia.freqflyers has the value 'USA' in 71%
of the observations.
To ensure that SAS does not use an index when printing
the data for Country = 'USA', use the following
code:
options msglevel = i;
proc print data = ia.freqflyers
(idxwhere = no);
where Country = 'USA';
run;

c02s3d5

Using the IDXWHERE= Option

Partial Log

18 proc print data = ia.freqflyers

19 (idxwhere = no);
20 where Country = 'USA';
INFO: Data set option (IDXWHERE=NO) forced a sequential pass
of the data rather than use of an index for where-clause
processing.
21 run;

NOTE: There were 65935 observations read from the data set
IA.FREQFLYERS.
WHERE Country='USA';
NOTE: PROCEDURE PRINT used (Total process time):
real time 4.86 seconds
cpu time 0.89 seconds


Guidelines for Indexing

Suggested guidelines for creating indexes:
• Minimize the number of indexes to reduce disk storage and update costs. Create indexes only on
variables that are often used in queries or BY-group processing (when data cannot be sorted).
• Do not create an index if the data file page count is less than three pages. It is faster to access
the data sequentially.
• Consider the cost of an index for a data file that is frequently changed.
• Create indexes on variables that are discriminating. These variables precisely identify
observations that satisfy WHERE expressions.

A variable such as Gender is not discriminating. A discriminating variable is one that enables you to
break the data into many small groups or subsets.

Guidelines for Indexing

• When you create a composite index, make the first key variable the most discriminating.
• Create an index when you intend to retrieve a small subset of observations from a large data file.
• To reduce the number of I/Os performed when you create an index, first sort the data by the key
variable. Then, to improve performance, maintain the data file in sorted order by the key variable.
• Consider how often your applications use an index. An index must be used often in order to
compensate for the resources used in creating and maintaining it.
• When you create an index to process a WHERE expression, do not try to create one index that is
used to satisfy every conceivable query.
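The guideline about sorting before indexing can be sketched as follows. The output data set name work.SalesByOrigin is hypothetical, and the sketch assumes that ia.Sales contains the variable Origin.

```sas
/* Assumption: ia.Sales contains the variable Origin. */
/* Sort a copy first so that qualified observations sit on adjacent pages. */
proc sort data = ia.Sales out = work.SalesByOrigin;
   by Origin;
run;

/* Then build the simple index on the sorted copy. */
proc datasets library = work nolist;
   modify SalesByOrigin;
   index create Origin;
quit;
```

Because the index is built on data that is already in key order, fewer pages have to be read both when the index is created and when it is later used to retrieve a subset.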

Index Trade-offs
BENEFITS
• Fast access to a small subset of observations
• Values returned in sorted order
• Can enforce uniqueness

COSTS
• Extra CPU cycles and I/O operations to create and maintain an index
• Increased CPU to read the data
• Extra disk space to store the index file
• Extra memory to load index pages and SAS C code to use the index

Maintaining Indexes
Data Management Task                        Index Action Taken
Copy the data set with the COPY             Index file constructed for new data file
procedure or the DATASETS procedure.
Move the data set with the MOVE             Index file deleted from IN= library;
option in the COPY procedure.               rebuilt in OUT= library
Copy the data set with drag-and-drop        Index file constructed for new file
in SAS Explorer.


Maintaining Indexes
Data Management Task                        Index Action Taken
Rename data set.                            Index file renamed
Rename variable.                            Variable renamed to new name in index file
Add observations.                           Value/identifier pairs added
Delete observations.                        Value/identifier pairs deleted; space recovered
                                            for re-use
Update observations.                        Value/identifier pairs updated if values change

Indexes are maintained by updates in place, such as using the Viewtable window to update, add, or delete
observations, and the APPEND or SQL procedures to append data. Using the Explorer window or the
DATASETS procedure maintains indexes when data sets or variables are renamed. However, recreating a
data set with the SET, MERGE, or UPDATE statements does not automatically maintain indexes.
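One way to get an index on a data set that is rebuilt with a DATA step is to request it again on the output data set. This is a sketch only: it assumes that ia.Sales contains the variable Origin, and the output name work.SalesCopy is illustrative.

```sas
/* Assumption: ia.Sales contains the variable Origin. */
/* The INDEX= data set option builds a new simple index on the output data set. */
data work.SalesCopy(index = (Origin));
   set ia.Sales;
run;
```

The index on work.SalesCopy is created from scratch as the data set is written; it is not copied from ia.Sales.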

Maintaining Indexes
Data Management Task                        Index Action Taken
Delete a data set.                          Index file deleted
   proc datasets lib = work;
      delete a;
   run;

Rebuild a data set with a DATA step.        Index file deleted
   data a;
      set a;
   run;

Sort the data set in place with the         Index file deleted
FORCE option in the SORT procedure.
   proc sort data = a force;
      by var;
   run;

If you use the UPLOAD procedure or the DOWNLOAD procedure, the index is re-created by default
when you upload or download a single data set and omit the OUT= option, or when you upload or
download a SAS data library. Use the INDEX=NO data set option to upload or download without
re-creating the index.
Index re-created:
proc upload data = schedule;
run;
Index not re-created:
proc download data = Sales(index = no);
run;

Exercises

7. Using an Index
Open the program, c02ex7Start, and submit it. Consult the log and answer the questions following the
program code listed here.
c02ex7Start
options msglevel=I obs = 500;

*** Example 1;

data rdu;
set ia.Sales;
if Origin = 'RDU';
run;

*** Example 2;

proc print data=ia.Sales;

where Origin = 'RDU' or FltDate = '01dec2004'd;
run;

*** Example 3;

proc print data=ia.Sales;

where Origin ne 'RDU';
run;

*** Example 4;

proc print data=ia.Sales;

where Origin='ATH';
run;

**** Example 5;

proc print data=ia.Sales;

where FltDate='24mar2005'd;
run;

*****Example 6;

data SalesCopy;
set ia.Sales;
run;

Questions:
a. Does Example 1 use an index? Why or why not?

b. Does Example 2 use an index? Why or why not?

c. Does Example 3 use an index? Why or why not?

d. Does Example 4 use an index? Why or why not?

e. Does Example 5 use an index? Why or why not?

f. In Example 6, does the data set SalesCopy have an index?


2.4 Solutions to Exercises

1. Generating a Random Sample with Replacement
Generate a random sample with replacement of 50 employees from ia.salcomps to analyze their
current salaries.
If the current salary is over $30,000, then place the employee’s information in the work.over30
SAS data set.
If the current salary is $30,000 or less, then place the employee's information in the
work.ltoreq30 SAS data set.

data over30 ltoreq30;

SampSize = 50;
do i = 1 to SampSize;
PickIt = ceil(ranuni(0)*TotObs);
set ia.salcomps point = PickIt nobs = TotObs;
if Salary > 30000 then output over30;
else output ltoreq30;
end;
stop;
run;
2. Generating a Random Sample without Replacement (Optional)
Generate a random sample without replacement of ten flights from ia.cap2000.

DATA Step Solution:

data work.CapSample(drop = ObsLeft SampSize);
SampSize = 10;
ObsLeft = TotObs;
do while(SampSize > 0 and ObsLeft > 0);
PickIt + 1;
if ranuni(0) < SampSize/ObsLeft then
do;
set ia.cap2000 point = PickIt
nobs = TotObs;
output;
SampSize = SampSize - 1;
end;
ObsLeft = ObsLeft - 1;
end;
stop;
run;
SURVEYSELECT Procedure Solution:
proc surveyselect data=ia.cap2000
method=srs n=10
out=CapSample;
run;

3. Creating Indexes with the DATA Step

Open the program, c02ex3Start, and add the INDEX= option to create two indexes:
• a simple index Depart, based on the Depart variable
• a unique composite index FltDte, based on the Flight and Date variables
data ia.schedule(index = (Depart
FltDte = (Flight Date)/unique));
infile 'schedule.dat'; *PC/Unix;
*infile '.prog3.rawdata(schedule)'; *z/OS;
input Flight $7. Depart time5. Date date9.;
format Depart time5. Date date9.;
run;
4. Deleting Indexes with the SQL Procedure
Use PROC SQL to delete the Depart index from the ia.schedule data set.

proc sql;
drop index Depart
from ia.schedule;
quit;
5. Creating Indexes with the DATASETS Procedure
Use PROC DATASETS to create a simple index Date based on the Date variable for the
ia.schedule data set.
proc datasets library = ia nolist;
modify schedule;
index create Date;
quit;
6. Viewing Index Information
Use PROC CONTENTS to look at the index information.

proc contents data = ia.schedule;

run;
7. Using Indexes
Open the program, c02ex7Start, and submit it. Consult the log and answer the questions following the
program code listed here.
Questions:

a. Does Example 1 use an index? Why or why not?

No, Example 1 does not use an index because the example uses a subsetting IF statement instead
of a WHERE statement.

b. Does Example 2 use an index? Why or why not?

No, Example 2 does not use an index because the WHERE statement uses the OR operator.

c. Does Example 3 use an index? Why or why not?

No, Example 3 does not use an index because the subset is too large for an index to be
appropriate.

d. Does Example 4 use an index? Why or why not?

Yes, Example 4 uses an index because the WHERE statement selects a small subset.

e. Does Example 5 use an index? Why or why not?

Yes, Example 5 uses an index because the WHERE statement selects a small subset. The WHERE
statement is using the composite index, DteFlt, because the subset is on the primary key
variable.

f. In Example 6, does the data set SalesCopy have an index?

No, the data set ia.sales maintains its index, but SalesCopy does not retain the index from
ia.sales.
Chapter 3 Combining Data Horizontally

3.1 Joining Data Sets by Value
3.2 Combining Summary and Detail Data
3.3 Using an Index to Combine Data
3.4 Updating Data
3.5 Combining Summary and Detail Data Using Two SET Statements (Self-Study)
3.6 Solutions to Exercises

3.1 Joining Data Sets by Value

Objectives
• Use the DATA step with a MERGE statement to join more than two SAS data sets.
• Use the SQL procedure to join SAS data sets without a common variable.
• Investigate the differences between the DATA step MERGE and PROC SQL.
• Combine data conditionally.

Business Task
Merge multiple SAS data sets with no common BY variable.
Input data sets and their variables:
ia.expenses    Date, FlightID, Expenses
ia.revenue     Date, Dest, FlightID, Origin, RevBusiness, RevEcon, Rev1st
ia.airports    City, Code, Country, Name

Output data set ia.alldata and the source of each variable:
from ia.expenses    Date, FlightID, Expenses
from ia.revenue     Dest, Origin, RevBusiness, RevEcon, Rev1st
from ia.airports    DestCity, DestApt, OriginCity, OriginApt
calculated          Profit

Methods for the Match-Merge

You can perform a match-merge of two or more
SAS data sets with the following:
• DATA step with the MERGE statement and a BY statement
• PROC SQL join

DATA Step MERGE Statement

DATA data-set-name;
   MERGE SAS-data-sets;
   BY variables;
RUN;

Matches on equal values for like-named variables:

(Diagram: two data sets that each contain an Airport Code variable are matched into a single data set on
equal values of that variable.)

Using the DATA Step to Perform a Match-Merge

c03s1d1
proc sort data = ia.expenses out = expenses;
by FlightID Date;
run;

proc sort data = ia.revenue out = revenue;

by FlightID Date;
run;

data exprev;
merge expenses(in = e) revenue(in = r);
by FlightID Date;
if e and r;
Profit = sum(Rev1st, RevBusiness, RevEcon, -Expenses);
run;

proc sort data = exprev;

by Dest;
run;

proc sort data = ia.airports out = airports;

by Code;
run;

data destinfo; /* c */
merge exprev(in = exp)
airports(keep = City Name Code
rename = (Code = Dest City = DestCity
Name = DestApt));
by Dest;
if exp;
run;

proc sort data = destinfo;

by Origin;
run;

data alldata; /* d */
merge destinfo(in = des)
airports(keep = City Name Code
rename = (Code = Origin City = OriginCity
Name = OriginApt));
by Origin;
if des;
run;

proc print data = alldata(obs=5);

title 'Result of Merging Three Data Sets';
format Date date9.;
run;
c This DATA step creates the city variable for the destination.

d This DATA step creates the city variable for the origin.

Partial Output
Result of Merging Three Data Sets

Flight Rev Rev

Obs ID Date Expenses Origin Dest Rev1st Business Econ Profit DestCity

1 IA03400 02DEC2005 89155 ANC RDU 15829 28420 68688 23782 Raleigh-Durham, NC
2 IA03400 03DEC2005 22008 ANC RDU 15829 26460 68688 88969 Raleigh-Durham, NC
3 IA03400 04DEC2005 71609 ANC RDU 18707 23520 77751 48369 Raleigh-Durham, NC
4 IA03400 05DEC2005 82454 ANC RDU 15829 27440 64872 25687 Raleigh-Durham, NC
5 IA03400 06DEC2005 85174 ANC RDU 17268 27440 67257 26791 Raleigh-Durham, NC

Obs DestApt OriginCity OriginApt

1 Raleigh-Durham International Airport Anchorage, AK Anchorage International Airport

2 Raleigh-Durham International Airport Anchorage, AK Anchorage International Airport
3 Raleigh-Durham International Airport Anchorage, AK Anchorage International Airport
4 Raleigh-Durham International Airport Anchorage, AK Anchorage International Airport
5 Raleigh-Durham International Airport Anchorage, AK Anchorage International Airport

Advantages of DATA Step MERGE

• Multiple values can be returned.
• There is no limit to the size of the table, other than disk space.
• Multiple BY variables enable lookups that depend on more than one variable.
• Multiple data sets can be used to provide access to different tables.
• A merge enables complex business logic to be incorporated into the new data set by using DATA step
processing, such as arrays and DO loops, in addition to merging features.


• The IN= data set option and subsequent IF-THEN/ELSE logic afford comprehensive control over
whether to accept, reject, or process differently depending on which data set contributed each
observation.
• Observations with duplicate BY values are joined one-to-one; they are not expanded into a
Cartesian product, as they are in a PROC SQL join.
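The control that the IN= data set option provides can be sketched with the sorted work.expenses and work.revenue data sets from program c03s1d1. The output data set names both, exponly, and revonly are hypothetical.

```sas
/* Assumption: work.expenses and work.revenue are sorted by FlightID and Date. */
data both exponly revonly;
   merge expenses(in = e) revenue(in = r);
   by FlightID Date;
   if e and r then output both;      /* match found in both data sets */
   else if e then output exponly;    /* expenses with no revenue      */
   else output revonly;              /* revenue with no expenses      */
run;
```

Each observation is routed according to which input data set contributed to it, so matches and both kinds of nonmatches are captured in a single pass of the data.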


Disadvantages of DATA Step MERGE

• Data sets must be sorted by or indexed based on the BY variable(s).
• An exact match on the key value(s) must be found.
• The BY variable(s) must be present in all data sets.
• When more than one data set contributes variables with the same name, the values from the variable
in the rightmost data set overwrite the other like-named variables, and no warning is printed.

Example:
Data set ONE

X Y Z

1 2 3

Data set TWO

X Y W

1 8 9

data three;
merge one two;
by x;
run;
Data set THREE

X Y Z W

1 8 3 9

To avoid this behavior, include every common variable in the BY statement, or use the RENAME=
data set option on an input data set so that the like-named variables get distinct names.
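As a short sketch of the RENAME= approach, the merge of data sets ONE and TWO can be rewritten so that both Y values survive. The new variable name Y2 is an arbitrary choice for illustration.

```sas
data three;
   /* Y from TWO is renamed on input, so it no longer overwrites Y from ONE. */
   merge one two(rename = (Y = Y2));
   by X;
run;
```

With the sample data shown, THREE now contains one observation with X=1, Y=2, Z=3, Y2=8, and W=9.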

The SQL Procedure

General form of the SQL procedure CREATE TABLE
statement:

PROC SQL;
   CREATE TABLE SAS-data-set AS
      SELECT column-1, column-2,…,column-n
         FROM table-1, table-2,…,table-n
         WHERE joining criteria
         ORDER BY sorting criteria;


Using a PROC SQL Join to Perform a Match-Merge

c03s1d2
proc sql;
create table usesql as
select revenue.FlightID, revenue.Date,
Expenses,
Origin, Dest,
Rev1st, RevBusiness, RevEcon,
sum(Rev1st, RevBusiness, RevEcon, -Expenses)
as Profit,
d.City as DestCity, d.Name as DestApt, /* c */
o.City as OriginCity, o.Name as OriginApt /* c */
from ia.expenses, ia.revenue,
ia.airports as d, ia.airports as o /* c */
where expenses.FlightID = revenue.FlightID
and expenses.Date = revenue.Date
and d.Code = revenue.Dest /* c */
and o.Code = revenue.Origin /* c */
order by revenue.FlightID, revenue.Date;
quit;

proc print data = usesql(obs=5);

title 'Result of Joining Three Data Sets';
format Date date9.;
run;
c The data set ia.airports is named twice in the FROM clause so that the airport Code variable
can be used twice in the code and the airport City can be extracted twice: once for the destination
city and once for the city of origin. An alias is required on the duplicated data set names to distinguish
which of the duplicate column names is requested.

Partial Output
Result of Joining Three Data Sets

Flight Rev Rev

Obs ID Date Expenses Origin Dest Rev1st Business Econ Profit DestCity

1 IA00100 02DEC2005 58907 RDU LHR 19200 31610 79650 71553 London, England
2 IA00100 03DEC2005 108543 RDU LHR 17600 25070 80181 14308 London, England
3 IA00100 04DEC2005 21963 RDU LHR 17600 28340 84960 108937 London, England
4 IA00100 05DEC2005 31517 RDU LHR 17600 32700 72216 90999 London, England
5 IA00100 06DEC2005 105682 RDU LHR 22400 29430 74871 21019 London, England

Obs DestApt OriginCity OriginApt

1 Heathrow Airport Raleigh-Durham, NC Raleigh-Durham International Airport

2 Heathrow Airport Raleigh-Durham, NC Raleigh-Durham International Airport
3 Heathrow Airport Raleigh-Durham, NC Raleigh-Durham International Airport
4 Heathrow Airport Raleigh-Durham, NC Raleigh-Durham International Airport
5 Heathrow Airport Raleigh-Durham, NC Raleigh-Durham International Airport

Advantages of PROC SQL Joins

• Multiple data sets can be joined without having common variables in all data sets.
• Data sets do not have to be sorted or indexed.
• Inequality joins can be performed.
• You can create data files (tables), views, or reports.
• PROC SQL follows ANSI standard language definitions, so that you can use knowledge gained
from other languages.
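An inequality join can be sketched with the ia.madrid and ia.rates data sets that are described later in this section. The join condition is a date range rather than an equality. The output table name work.EurosSQL is hypothetical, and the sketch assumes that the date ranges in ia.rates do not overlap.

```sas
/* Assumption: each FltDate falls in exactly one BDate-EDate range of ia.rates. */
proc sql;
   create table work.EurosSQL as
      select m.FlightID, m.FltDate, m.RevTotal,
             m.RevTotal * r.rate as RevEuros
         from ia.madrid as m, ia.rates as r
         where m.FltDate between r.BDate and r.EDate
         order by m.FltDate, m.FlightID;
quit;
```

Neither input has to be sorted, and no common variable is required; the WHERE clause alone decides which rows are paired.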

Disadvantages of PROC SQL Joins

• The maximum number of tables that can be joined at one time is 32.
• PROC SQL might require more resources than the DATA step with the MERGE statement for simple
joins.
• Complex business logic is difficult to incorporate into the join.
• Duplicate BY values are combined into a Cartesian product, which can produce an extremely large
output data set.


Comparison Programs
The following programs are used to generate the results
for the next four result sets.
data three;
merge one two;
by x;
run;

proc sql;
select one.x, one.y, two.z
from one, two
where one.x = two.x;
quit;

The DATA step and SQL procedure code remain constant. The data values change in the
following examples.

MERGE and SQL Join Comparison

ONE-TO-ONE matches produce identical results:
one two
X Y X Z
1 a 1 f
2 b 2 g

X Y Z
1 a f
2 b g

The X values are unique in both data sets one and two.

MERGE and SQL Join Comparison

ONE-TO-MANY matches produce identical results:
one        two
X  Y       X  Z
1  a       1  f
2  b       1  r
           2  g

X  Y  Z
1  a  f
1  a  r
2  b  g

The X values are unique in one but not in two.


MERGE and SQL Join Comparison

MANY-TO-MANY matches produce different results:
one        two
X  Y       X  Z
1  a       1  f
1  c       1  r
2  b       2  g

DATA Step       PROC SQL
X  Y  Z         X  Y  Z
1  a  f         1  a  f
1  c  r         1  a  r
2  b  g         1  c  f
                1  c  r
                2  b  g

The X values in data sets one and two are not unique.

Many-to-many joins are problematic. The question is not efficiency of the technique; rather, the
question is which output do you want? Do you want two or four observations for a 2-to-2 match?

Reference Information

The following DATA step creates a Cartesian product.

data three(drop = temp I);
set one;
do I = 1 to totobs;
set two(rename = (x = temp))
nobs=totobs point = i;
if x = temp then output;
end;
run;

MERGE and SQL Join Comparison

NONMATCHING data produces different results:
one        two
X  Y       X  Z
1  a       1  f
2  b       3  t
3  c       4  w

DATA Step       PROC SQL
X  Y  Z         X  Y  Z
1  a  f         1  a  f
2  b            3  c  t
3  c  t
4     w

Reference Information

The following SQL step produces results that are identical to those of the DATA step when there is
non-matching data.
proc sql;
select coalesce(one.x, two.x) as x, y, z
from one full join two
on one.x = two.x;
quit;

MERGE and SQL Join Comparison

How does the DATA step perform a match-merge?

one        two
X  Y       X  Z
1  a       1  f
1  d       1  r
3  c       3  t
           4  w

The DATA step MERGE statement processes sequentially, top to bottom, by default. The PDV holds the
following values on successive iterations of the DATA step:

Iteration   X  Y  Z
    1       1  a  f
    2       1  d  r
    3       3  c  t
    4       4     w     (data set one has reached EOF)

Both the matches and the non-matches on X remain.

three
X  Y  Z
1  a  f
1  d  r
3  c  t
4     w

MERGE and SQL Join Comparison

How does PROC SQL perform a join?

one        two
X  Y       X  Z
1  a       1  f
1  d       1  r
3  c       3  t
           4  w

Without a WHERE clause

PROC SQL processes by creating a Cartesian product.

one.X  one.Y  two.X  two.Z
  1      a      1      f
  1      a      1      r
  1      a      3      t
  1      a      4      w
  1      d      1      f
  1      d      1      r
  1      d      3      t
  1      d      4      w
  3      c      1      f
  3      c      1      r
  3      c      3      t
  3      c      4      w

Conceptually, PROC SQL creates the result set pictured above. There are optimization routines that make
the process more efficient.

With a WHERE clause (where one.x = two.x)

The non-matches on X are eliminated. All combinations of observations from ONE and TWO with matches
on X remain.

three
X  Y  Z
1  a  f
1  a  r
1  d  f
1  d  r
3  c  t

Exercises

1. Joining Data Sets to Create a New Data Set

Using PROC SQL, join ia.employees, ia.jcodedat, and ia.newsals to create a data set
that contains employee IDs, employee job codes, job code descriptions, current salaries, and new
salaries. Print the resulting data set.
There is no variable common to all three SAS data sets. Use PROC CONTENTS, PROC
DATASETS, or the SAS Explorer to determine the columns on which to join the rows.
Partial Output
Job
EmpID Code Descript Salary NewSalary

E00001 FLTAT3 FLIGHT ATTENDANT GRADE 3 $25,000 $27,420.04

E00003 VICEPR VICE PRESIDENT $120,000 $143,789.80
E00005 GRCREW GROUND CREW $19,000 $20,757.68
E00008 OFFMGR OFFICE MANAGER $85,000 $93,811.78
E00012 MKTCLK MARKETING CLERK $33,000 $38,481.44
E00013 RECEPT RECEPTIONIST $22,000 $23,243.79
E00014 MECH02 MECHANIC GRADE 2 $19,000 $20,434.78
E00017 RESCLK RESERVATIONS CLERK $36,000 $36,241.64
E00018 FACMNT FACILITIES MAINTENANCE OPERATIVE $33,000 $35,947.80
E00022 FACCLK FACILITIES CLERK $27,000 $27,530.65

2. Combining Data with the DATA Step MERGE Statement

Repeat the same task using the DATA step MERGE statement to merge all three data sets. Print the
resulting data set.
Partial Output
Job
EmpID Code Descript Salary NewSalary

E00001 FLTAT3 FLIGHT ATTENDANT GRADE 3 $25,000 $27,420.04

The results should be identical to the previous exercise.


Conditionally Combining Data

Some combinations of data are based on a condition.
For example, the data set ia.madrid contains the
flights from Madrid in March 2005. The revenue amounts
are in dollars.
Partial Data Set
Flight
Obs ID FltDate Rev1st RevBus

1 IA05900 01MAR2005 $3,445.00 .

2 IA05901 01MAR2005 $2,915.00 .
3 IA05902 01MAR2005 $2,915.00 .
4 IA05903 01MAR2005 $2,915.00 .

Obs RevEcon CargoRev RevTotal

1 $8,360.00 $7,421.00 $19,226

2 $10,824.00 $5,289.00 $19,028
3 $8,448.00 $7,503.00 $18,866
4 $9,416.00 $6,601.00 $18,932

Conditionally Combining Data

The data set ia.rates has the conversion rate for
converting from dollars to euros.

Obs BDate EDate rate

1 03/01/2005 03/07/2005 0.76

2 03/08/2005 03/10/2005 0.75
3 03/11/2005 03/13/2005 0.74
4 03/14/2005 03/15/2005 0.75
5 03/16/2005 03/16/2005 0.74
6 03/17/2005 03/20/2005 0.75
7 03/21/2005 03/22/2005 0.76
8 03/23/2005 03/27/2005 0.77
9 03/28/2005 03/28/2005 0.78
10 03/29/2005 03/31/2005 0.77


Conditionally Combining Data

What needs to be done:
Use the current value of rate when FltDate is
between BDate and EDate.

BDate EDate rate

current rate
03/01/2005 03/07/2005 0.76

ID Dest FltDate
IA05900 MAD 01MAR2005

BDate <= FltDate <= EDate?


Conditionally Combining Data

What needs to be done:
Read a new value for rate when FltDate is not
between BDate and EDate.

BDate EDate rate

current rate
03/01/2005 03/07/2005 0.76

ID Dest FltDate
IA05900 MAD 08MAR2005

BDate <= FltDate <= EDate?


Conditionally Combining Data

The MERGE statement cannot be used in this example.
It can only be used to join data when one of the following
conditions is met:
• The data can be joined by comparing values of a common BY variable.
• The data can be combined by observation number. In this case, there is no BY statement in the
DATA step.

Conditionally Combining Data

You can use multiple SET statements to combine
observations from several SAS data sets.
When you use multiple SET statements, the following
occurs:
Processing stops when SAS encounters the end-of-file
marker on either data set.
The variables in the PDV are not reinitialized when a
second SET statement is executed.
Example:
data Euros;
set ia.madrid;
set ia.rates;
run;

Conditionally Combining Data

data Euros;
set ia.Madrid(keep = FlightID FltDate
RevTotal);
do while (not (BDate le FltDate le EDate)); /* c */
set ia.rates;
end;
RevEuros = RevTotal*rate;
run;

ia.madrid must be sorted by FltDate.

ia.rates must be sorted by BDate.

c03s1d3

c The DO WHILE statement executes statements in a DO loop while a condition is true. The expression
is evaluated at the top of the loop. The statements in the loop never execute if the expression is
initially false.