C747 Transcripts Part1

The document outlines the steps to set up practice files in SAS Studio for a course. It involves creating a folder, defining a library, and creating practice data files. Common procedures like PROC FREQ, PROC MEANS, PROC PRINT and PROC UNIVARIATE are demonstrated for exploring, summarizing and validating the data. The Output Delivery System allows exporting output to different formats like PDF, CSV etc.

The steps outlined are: 1) Create a folder and define a library to store practice files. 2) Create SAS programs with code to generate practice data and store in the defined library. 3) Run the setup program each time to access the practice data.

PROC FREQ is used for frequency and crosstab reports. PROC MEANS provides summary statistics with options to request specific stats or group data. PROC PRINT and PROC UNIVARIATE allow validating data through viewing extreme values. PROC MEANS and PROC UNIVARIATE are also used to check ranges for numeric variables.

Setup Instructions for SAS University Edition

To complete the practices in this course, you must follow these steps to set up the practice files for SAS Studio. These
instructions apply to all configurations of SAS Studio, including SAS University Edition.

IMPORTANT: You must perform these two tasks in the same SAS Studio session. Do not close your browser until you
have completed both tasks or you will need to start again.
Task 1: Create a folder for your practice files and define the orion library.
1. Start SAS Studio from your SAS University Edition: Information Center page.
2. At the top of the Files and Folders pane, click New and select Folder.
3. In the Name box, type ecprg193. Click Save.
4. The following code creates a macro variable to store the location of your practice files and defines the orion
library. Copy and paste the following two lines of code into the Code tab in SAS Studio. Don't run the code yet -
you must edit the code before you run it.

%let path=FILEPATH;
libname orion "&path";

5. In the Files and Folders pane, open My Folders. Right-click ecprg193 and select Properties. Highlight the filepath
shown in Location and copy it.
6. In the Code tab in SAS Studio, edit the program as follows:
a. In the first line of the program, highlight FILEPATH and paste the filepath that you copied in the previous
step.
b. Do not change the second line of the program.
7. Click Run to submit this program to SAS.
8. Check the log to make sure that the libref orion was successfully assigned.
a. You should see this message in the log:

NOTE: Libref ORION was successfully assigned


You have defined the orion library and SAS knows where you are storing your practice files.
b. If the program does not run successfully, make sure you created the folder that you are referencing, and
review your program for errors. Correct the errors and resubmit the program.
9. Click the Code tab and click Save. In the Save As window, navigate to ecprg193 and select the folder. Name the
program setup.sas and click Save.
IMPORTANT: Each time you start SAS Studio, you must open the setup.sas program and submit it to define the
orion library so that you have access to your practice files.
10. Optional step: To make this code easy to find, you can add it to the Snippets panel. To do this, right-click setup.sas
and select Add to My Snippets. To run the program, open the Snippets panel, open the My Snippets folder, and
double-click setup. Then, run the code.
11. Close the setup.sas program tab in the workspace.
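
A filled-in setup.sas might look like the following sketch. The path shown is only an assumption (a typical shared-folder location in SAS University Edition); use the Location value that you copied in step 5. The %PUT line is an optional addition, not part of the course program: the LIBREF function returns 0 when the libref is assigned.

%let path=/folders/myfolders/ecprg193;   /* hypothetical path: paste your own */
libname orion "&path";

/* Optional check: writes 0 to the log if the orion libref is assigned */
%put orion libref return code: %sysfunc(libref(orion));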

Task 2: Create the practice files for the course — you only have to do this once.
1. In the Files and Folders pane, click New and select SAS Program.
2. Click here to open a popup window with the SAS code that creates your practice data. Select all of the code
(Ctrl+A or Command+A) and then copy it (Ctrl+C or Command+C).
3. Paste the code (Ctrl+V or Command+V) into the Code tab in SAS Studio.
4. Click Run to submit the program to SAS. This program creates the practice files in the ecprg193 folder.
5. After the code runs, the Results tab displays two tables. The second table lists the data sets that the program
created. Check the log to verify that there are no errors or warnings. There will be many notes because SAS writes
a note for each data set that the program creates.

NOTE: If your program does not run successfully, make sure you completed all the steps in Task 1 correctly.
6. On the Code tab, click Clear All code.

You have set up your practice data and you are ready to work in SAS Studio.
NOTE: Unless you delete your practice files folder, you do not need to perform this task again (if you are using SAS
University Edition on AWS Marketplace, see the note below).

IMPORTANT: Each time you start SAS Studio to practice in this course, navigate to the ecprg193 folder in the Files and
Folders pane and double-click setup.sas to open it. After you run the program, you can access your practice data in the
orion library. Each practice page reminds you to define the orion library and has a link to this setup page.
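
To confirm that the practice files are in place after you run setup.sas, one option (not part of the course instructions) is to list the contents of the library. This sketch assumes the orion libref is already assigned; the NODS option suppresses the per-data-set detail so that only the directory listing is printed.

proc contents data=orion._all_ nods;
run;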

Lesson 1: Getting Started with SAS Programming


Introduction
In this lesson, you’ll get an overview of SAS software and how you can use SAS to access, manage, analyze, and present
your data. You’ll explore the SAS programming process and learn the iterative manner of working with SAS programs. In
short, you'll learn some important SAS concepts that lay the foundation for you as a SAS programmer.

Objectives

In this lesson, you learn to do the following:

• describe SAS capabilities
• explain the SAS programming process
• identify the types of files used in SAS

What Is SAS?
SAS is a suite of business solutions and technologies to help organizations solve business problems. Base SAS is the
centerpiece of all SAS software. It provides a flexible and extensible programming language designed for data access,
transformation, and reporting.

To extend the capabilities of Base SAS, you can add other SAS components. For example, you can use a component to
access third-party data. Other components give you tools for report writing, high-resolution graphics, statistical analysis,
visualization and discovery, and business solutions.

The SAS Framework


Let’s look at all of these SAS capabilities in a simple framework. No matter what type of business or industry you work in,
you need to access your data. You might have data stored in SAS, in a raw data file, in Oracle, in Excel, or in other types
of files. Using SAS, you can read any kind of data.

Once you access your data, you can manage it. For example, you might need to subset data, create variables, validate
and clean data, or combine data to ready it for analysis. SAS gives you excellent data management capabilities. You’ll
probably want to analyze your data as well. You can perform some simple analyses, such as finding frequency counts or
calculating averages. Or you can run more complex analyses, such as regression or forecasting. For statistical analysis,
SAS is the gold standard.

Finally, you'll want to present your data meaningfully. You can create list reports, summary reports, or graphic reports.
And you can print these reports, write them to new data files, or publish them on the web. You have lots of options for
presenting your data.

Exploring the SAS Programming Process


Now that you know a little bit about the power of SAS, let’s take a look at the overall programming process in SAS.

The first step in the programming process is to define the business need. You do this by communicating with the
business team or by reviewing a written specification. After you define the business need, you write a SAS program
based on the desired output, the necessary input, and the required processing. After you finish coding, you run the
program and review your results, which can be reports or notes and messages from SAS regarding your code. As you
review the results, you might find inaccuracies or errors, in which case you might need to debug or modify the program.
Depending on your results, you might need to repeat some of the steps.

Types of Files Used in SAS


As you know, the power of SAS is that you can use it to read any type of data. Let's take a few moments and learn about
the three major file types you'll use in this course: raw data files, SAS data sets, and SAS program files. Keeping the
overall programming process in mind, let’s see how you use each type of file.

Raw data files contain data that has not been processed by any other computer program. They are text files that contain
one record per line, and the record typically contains multiple fields. Raw data files aren't reports; they are unformatted
text.

The second major type of file you’ll use is a SAS data set. This important type of file is specific to SAS. A SAS data set is
your data in a form that SAS can understand. Like raw data files, SAS data sets contain data. But in SAS data sets, the
data is created only by SAS and can be read only by SAS.

Now let’s explore the third major type of file you’ll use in SAS: the SAS program file. SAS program files contain SAS
programming code. These instructions tell SAS how to process your data and what output to create.

Lesson 2: Working with SAS Programs


In this lesson, you'll learn how to work with SAS code. First, you'll learn the main components of SAS programs. Then
you'll learn the syntax rules and formatting guidelines for writing SAS programs. As you work with SAS programs, you'll
add descriptive comments, and identify and correct common syntax errors.

Objectives

In this lesson, you learn to do the following:

• list the components of a SAS program
• identify the characteristics of SAS statements
• define SAS syntax rules
• document a program using comments
• identify common syntax errors
• diagnose and correct syntax errors in a SAS program

Exploring SAS Programs
Understanding SAS Programs
Let's investigate the main components of SAS programs. Generally speaking, a SAS program is a sequence of steps that
you submit to SAS for execution. Each step in the program performs a specific task. Only two kinds of steps make up SAS
programs: DATA steps and PROC steps. A SAS program can contain a DATA step, or a PROC step, or any combination of
DATA steps and PROC steps. The number and kind of steps depend on what tasks you need to perform.

A DATA step typically reads data from an input source, processes it, and creates a SAS data set, which is data in a form
that SAS understands. So, one of the primary purposes of a DATA step is to create a SAS data set. In addition, you can
use a DATA step to create new variables that were not in your original data. In SAS terminology, variables are the
columns in your data.

For example, suppose your raw data file contains the fields Cost Price Per Unit and Quantity Sold. In a DATA step, you
can multiply these variables and assign the value to a new variable named Total_Retail_Price.
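
As a minimal sketch of that calculation, the following DATA step reads two inline records and computes the new variable. The data set name work.order_totals, the underscore-style input variable names, and the sample values are assumptions made for illustration; the course data is not used here.

data work.order_totals;
   input Cost_Price_Per_Unit Quantity_Sold;
   /* New variable: the product of the two input variables */
   Total_Retail_Price = Cost_Price_Per_Unit * Quantity_Sold;
   datalines;
12.50 3
7.25 10
;
run;

proc print data=work.order_totals;
run;

With these sample values, Total_Retail_Price is 37.5 for the first observation and 72.5 for the second.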

A PROC or procedure step typically processes a SAS data set. Various PROC steps generate reports and graphs, manage
data, and sort data.

One way to use these two steps together is to use a DATA step to create a SAS data set, and then use a PROC step to
create a report. Remember, though, that this is just one possible combination of steps in a SAS program. Your SAS
programs might perform other tasks. Now let's learn more about what makes up a SAS step.

SAS Programming Steps


A SAS program is comprised of a sequence of steps, and a step is comprised of a sequence of statements. Every step has
a beginning and ending boundary. These are called step boundaries. SAS compiles and executes each step independently
based on the step boundaries.

A DATA step begins with a DATA statement, and a PROC step begins with a PROC statement. SAS detects the end of a
step when it encounters one of the following: a RUN statement for most steps, a QUIT statement for some procedures,
or the beginning of another step. Occasionally, a user might omit a RUN or QUIT statement, and the step will end
implicitly when the next step begins. It is a best practice to include a RUN or QUIT statement to explicitly end each step
in a SAS program.
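
The following sketch shows the three kinds of step boundary described above. It uses sashelp.class, a sample data set that ships with SAS, so that it does not depend on the course data; the data set name work.teens is hypothetical.

data work.teens;
   set sashelp.class;
   where Age >= 13;
run;                          /* RUN explicitly ends the DATA step */

proc sort data=work.teens;
   by Age;
                              /* no RUN here: the next PROC statement ends this step */
proc print data=work.teens;
run;

proc sql;
   select count(*) as Teen_Count
      from work.teens;
quit;                         /* QUIT ends PROC SQL */

The PROC SORT step still runs, but relying on the implicit boundary makes the program harder to read, which is why the explicit RUN or QUIT is preferred.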

Take a look at this program.

data work.newsalesemps;
   set orion.sales;
   where Country='AU';
run;

title 'New Sales Employees';

proc print data=work.newsalesemps;
run;

proc means data=work.newsalesemps;
   class Job_Title;
   var Salary;
run;

title;

Can you tell how many steps it contains? This program contains three steps: one DATA step and two PROC steps. In the
first line of code, the DATA step creates a temporary SAS data set named work.newsalesemps by reading the
orion.sales data set. In the eighth line, the PROC PRINT step creates a list report of the work.newsalesemps data set. In
line 11, the PROC MEANS step creates a summary report of work.newsalesemps with statistics for the variable Salary
for each value of Job_Title.

In addition to DATA and PROC steps, this SAS program also contains global statements. These statements can lie outside
DATA and PROC steps, and they can affect more than one step. For example, the first TITLE statement, located before the
PROC steps, specifies a title that appears on both reports. The second TITLE statement, located at the end of the
SAS program, turns off all titles for all subsequent output. You'll learn several global statements in this course.

Submitting a SAS Program


In this demonstration, you submit a program and examine the log and results.

1. Copy and paste the following program into the editor.

data work.newsalesemps;
   set orion.sales;
   where Country='AU';
run;

title 'New Sales Employees';

proc print data=work.newsalesemps;
run;

proc means data=work.newsalesemps;
   class Job_Title;
   var Salary;
run;

title;

2. Submit the code and check the log. It's a good programming practice to first check the log, even if the program
appears to produce results. You want to ensure that the code ran successfully before you look at any reports
SAS created. Notice that SAS processed the code without warnings or errors.

3. View the results. The first report is the PROC PRINT report. Recall that this type of report simply lists your data.
You can see columns for the various variables and all of their values. Notice that the title you specified appears
at the top of the report. The next report is the PROC MEANS report. The MEANS procedure provides data
summarization tools to compute descriptive statistics on your data, and displays output by default. Here, SAS
calculated statistics for the analysis variable Salary.
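
By default, PROC MEANS reports N, mean, standard deviation, minimum, and maximum for each analysis variable. As a hedged sketch using the sashelp.class sample data set (an assumption; the demonstration itself uses work.newsalesemps), listing statistic keywords in the PROC MEANS statement limits the output to those statistics:

proc means data=sashelp.class mean max;
   class Sex;        /* one row of statistics per value of Sex */
   var Height;       /* analysis variable */
run;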

Question
Which of the following can represent a step boundary?

a. a RUN statement

b. a QUIT statement

c. a DATA statement
d. a PROC statement

e. all of the above

The correct answer is e. All of these statements can represent a step boundary by indicating either the end of a step or
the beginning of a new step.

Question
What does a DATA step typically create?

a. raw data file

b. program file

c. SAS data set

d. report

The correct answer is c. A DATA step typically creates a SAS data set. However, you can use DATA steps to create raw
data, program files, and reports. The DATA step is very flexible.
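
For instance, a DATA step can write a raw data file instead of a SAS data set. This sketch assumes that the macro variable path has been defined by setup.sas; the output file name class.csv is hypothetical, and sashelp.class is a sample data set supplied with SAS.

data _null_;                         /* _NULL_: no SAS data set is created */
   set sashelp.class;
   file "&path/class.csv" dlm=',';
   put Name Sex Age;                 /* one comma-delimited record per observation */
run;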

Question
What does a PROC step typically create?

a. raw data file

b. program file

c. SAS data set

d. report

The correct answer is d. A PROC step typically creates a report.

Business Scenario
Orion Star management encourages their programmers to write well-formatted, clearly documented SAS programs. So,
you need to know the syntax rules and recommended structure for SAS programming statements, as well as how to use
comments in your SAS programs.

Characteristics of SAS Programs


SAS statements usually begin with an identifying keyword, and they always end with a semicolon. Keywords identify the
type of statement, and semicolons end the statement. For example, in the following SAS program, the second statement
is a SET statement, and the fourth statement is a RUN statement.

data work.newsalesemps;
   set orion.sales;
   where Country='AU';
run;

proc print data=work.newsalesemps;
run;

proc means data=work.newsalesemps;
   class Job_Title;
   var Salary;
run;

Take a look at this program. Can you tell how many statements make up this DATA step?

data work.newsalesemps;
   length First_Name $ 12
          Last_Name $ 18 Job_Title $ 25;
   infile "&path/newemps.csv" dlm=',';
   input First_Name $ Last_Name $
         Job_Title $ Salary;
run;

This step contains five statements: a DATA statement, a LENGTH statement, an INFILE statement, an INPUT statement,
and a RUN statement. Each statement has an identifying keyword and ends in a semicolon.

SAS Program Structure


In the following program, the statements are pretty easy to read.

data work.newsalesemps;
   set orion.sales;
   where Country='AU';
run;

proc print data=work.newsalesemps;
run;

proc means data=work.newsalesemps;
   class Job_Title;
   var Salary;
run;

The DATA, PROC, and RUN statements begin in column one, and the other statements are indented. Each statement
begins on a new line, and a blank line separates each step. Using conventional formatting (that is, structured, consistent
spacing) makes a SAS program easy to read.

However, SAS statements are free format. In other words, they can begin and end anywhere. In SAS, you can have as
much or as little white space as you want. You can begin or end a statement in any column and span multiple lines. You
can also place multiple statements on one line, and unquoted values can be lowercase, uppercase, or mixed case.

The following program takes advantage of the free-format style that SAS permits, but at a cost of being difficult to read:

data work.newsalesemps;set orion.


sales;where Country='AU';run;

proc print data=work.newsalesemps;


run;proc means
data=work.newsalesemps;class
Job_Title;var Salary;run;

Remember the old saying: "Just because you can do something, doesn't mean that you should." Again, in this program,
the SAS syntax rules have been followed, but this unconventional formatting might be especially difficult for other
programmers to read.

Using conventional formatting can take the guesswork out of your programs. It's recommended that you use a
conventional programming style. Click the Information button in the course to learn about automatic formatting in SAS
Enterprise Guide and other SAS environments.

Question
Which of the following rules is required by SAS syntax? Select all that apply.

a. beginning a statement in a certain column


b. ending a statement in a certain column
c. spanning multiple lines
d. adding multiple statements on one line
e. using all uppercase for text that isn't within quotation marks
f. using all lowercase for text that isn't within quotation marks
g. none of the above

The correct answer is g. None of the rules is required by SAS syntax.

Using SAS Comments


In addition to using conventional formatting, another way to make your program easier for others to follow is to add
comments to the program. A comment is text in your program that SAS ignores during processing but writes to the SAS
log. You can use comments anywhere in a SAS program to document the purpose of the program, explain segments of
the program, or mark SAS code as non-executing text. Using comments to mark SAS code as non-executing text is also
called commenting out code.

Comments can also help you test your SAS programs in stages. By commenting out your error-free code, you can use
comments to submit only the steps that you're testing. When your entire program is error-free, you can remove the
comment symbols without damaging the SAS program.

Types of Comments
Let's take a closer look at comments. In SAS, you can create comments in two ways. Using the first method, called a
block comment, you begin with a forward slash and asterisk, your comment text, and then end with an asterisk and a
forward slash.

/* create a temporary data
set, newsalesemps, from
the data set orion.sales */

data work.newsalesemps;
   set orion.sales;
   where Country='AU';
run;

proc print data=work.newsalesemps;
run;

proc means data=work.newsalesemps;
   class Job_Title;
   var Salary;
run;

These comments can be any length, and can contain semicolons. They cannot be nested. You should avoid placing block
comment symbols in the first or second columns. In some operating environments, SAS might interpret block comment
symbols in columns 1 and 2 as a request to end the SAS job or session.
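
A short sketch of these rules, using the sashelp.class sample data set (an assumption, chosen so that the code does not depend on the course data). The opening comment symbols are indented so that they do not fall in columns 1 and 2:

   /* Semicolons inside a block comment are ignored;
      proc print data=sashelp.class; run;
      None of this text is executed. */
proc means data=sashelp.class;
   var Height /* numeric */ Weight /* numeric */;
run;

Because block comments cannot be nested, the first asterisk and forward slash that SAS encounters ends the comment, so avoid wrapping a block comment around code that already contains one.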

The second method is called a comment statement. It begins with an asterisk, followed by the comment text, and ends
with a semicolon.

*create a temporary data set,
newsalesemps, from the data set
orion.sales;

data work.newsalesemps;
   set orion.sales;
   *where Country='AU';
run;

proc print data=work.newsalesemps;
run;

proc means data=work.newsalesemps;
   class Job_Title;
   var Salary;
run;

Comment statements can begin in columns 1 and 2. To comment out a statement in one of these steps, you simply add
an asterisk to the beginning of the statement, as shown above in the WHERE statement. Comments in this form are
complete statements, and they can't contain internal semicolons.
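
Because a comment statement ends at its first semicolon, an asterisk comments out only one statement. In the following sketch (again using sashelp.class as an assumed stand-in for the course data), the CLASS statement is skipped, while the VAR statement on the same line still runs:

proc means data=sashelp.class;
   *class Sex; var Height;
run;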

Adding Comments to Your SAS Programs


In this demonstration, you add comments to a program to make sure that another programmer understands it.

1. Copy and paste the following code into the editor.

data work.newsalesemps;
   set orion.sales;
   where Country='US';
run;

proc print data=work.newsalesemps;
run;

proc means data=work.newsalesemps;
   class Gender;
   var Salary;
run;

2. At the beginning of the program, add a comment statement stating that you're using orion.sales to create
work.newsalesemps.

*This program uses the data set orion.sales to create work.newsalesemps.;


3. In the PROC MEANS step, add a block comment stating that the variable Salary is numeric. Place the comment
immediately following the variable name. Remember that SAS ignores any text between the comment
symbols.

proc means data=work.newsalesemps;
   class Gender;
   var Salary/*numeric variable*/;
run;

4. Next, comment out the PROC PRINT step so that it doesn't run when you submit the code.

/*
proc print data=work.newsalesemps;
run;*/

5. Submit this code and examine the log. Notice that SAS didn't process the portions of code that were commented
out. You can see that the data set was created, and that the PROC MEANS step created output, but the PROC
PRINT step that was commented out produced no other messages or output.

Question
How many comments are in this program?

*This program creates and uses the
data set named work.newsalesemps.;

data work.newsalesemps;
   length First_Name $ 12 Last_Name $ 18
          Job_Title $ 25;
   infile "&path/newemps.csv" dlm=',';
   input First_Name $ Last_Name $
         Job_Title $ Salary /*numeric*/;
run;
/*
proc print data=work.newsalesemps;
run;
*/
proc means data=work.newsalesemps;
   *var Salary;
run;

a. 2
b. 4
c. 5
d. 6

The correct answer is b. This program contains four comments.

The first comment describes the program:

*This program creates and uses the data set named work.newsalesemps.;

The second comment is within a statement:

input First_Name $ Last_Name $
      Job_Title $ Salary /*numeric*/;

The third comment is commenting out a step:

/*
proc print data=work.newsalesemps;
run;
*/

The fourth comment is commenting out a statement:

*var Salary;

Activity
Copy and paste the following program into the editor of your SAS environment, and then submit the program.

Reminder: Make sure you've defined the orion library.

data work.newsalesemps;
   length First_Name $ 12 Last_Name $ 18
          Job_Title $ 25;
   infile "&path/newemps.csv" dlm=',';
   input First_Name $ Last_Name $
         Job_Title $ Salary /*numeric*/;
run;

proc print data=work.newsalesemps;
run;

/*
proc means data=work.newsalesemps;
   var Salary;
run;*/

Which of the following steps executes and produces output?

a. the DATA step


b. the PROC PRINT step
c. the PROC MEANS step
d. all of the above
e. only a and b
The correct answer is e. The DATA step executes and creates an output data set. The PROC PRINT step executes and
produces a report. The PROC MEANS step is commented out, and therefore does not execute.

Business Scenario
As an Orion Star programmer, you work with a lot of code…some that's yours and some that's not. You need to be able
to diagnose and correct syntax errors in any of these SAS programs.

Diagnosing and Correcting Syntax Errors


What Is a Syntax Error?
Syntax errors occur when program statements do not conform to the rules of the SAS language. Some common syntax
errors are misspelled keywords, missing semicolons, and invalid options. The editor uses the color red to indicate a
potential error in your SAS code. Notice that in the first line of the program below SAS displays the misspelled word
DAAT in red.

daat work.newsalesemps;
   length First_Name $ 12
          Last_Name $ 18 Job_Title $ 25;
   infile "&path/newemps.csv" dlm=',';
   input First_Name $ Last_Name $
         Job_Title $ Salary;
run;

proc print data=work.newsalesemps
run;

proc means data=work.newsalesemps average max;
   class Job_Title;
   var Salary;
run;

This misspelling affects other statements following it. Although the following statements in the DATA step are
syntactically correct, they are only permitted in a DATA step. The editor doesn't recognize this as a DATA step though,
due to the misspelled keyword, so SAS also displays the other statements in the DATA step in red.

SAS finds syntax errors during the compilation phase, before it executes the program. So, when you submit a SAS
program, SAS scans each statement for syntax errors. If no errors are found, SAS executes the step when it reaches the
step boundary. Then SAS goes to the next step and repeats the process.

When SAS encounters a syntax error, it writes the following to the SAS log: the word ERROR or WARNING, the location
of the error, and an explanation of the error. SAS continues the syntax scan until it reaches the step boundary, but the
step doesn't execute if errors are found. Then SAS continues scanning the rest of the program, and reports any
additional errors as needed. When you check the log, as all good SAS programmers do, and find a warning or error
message, you need to correct your code.

Viewing and Correcting Syntax Errors


In this demonstration, you diagnose and correct syntax errors in your program.

1. Copy and paste the following program into the editor. As you know, the DATA step keyword is misspelled. Also,
the semicolon is missing from the PROC PRINT statement, and the PROC MEANS step includes an option that is
not valid. As you can see, SAS color-codes the program to indicate the errors.

daat work.newsalesemps;
   length First_Name $ 12
          Last_Name $ 18 Job_Title $ 25;
   infile "&path/newemps.csv" dlm=',';
   input First_Name $ Last_Name $
         Job_Title $ Salary;
run;

proc print data=work.newsalesemps
run;

proc means data=work.newsalesemps average max;
   class Job_Title;
   var Salary;
run;

2. Submit the program and check the log. You should always check the log to make sure that the program ran
successfully, even if output is generated.

Notice that there is a WARNING message and the word DAAT is underlined. In this case, SAS resolved the issue
by assuming that DAAT was simply DATA misspelled. A warning means that SAS was able to perform the action.
In this case, SAS processed the DATA step. But this is a rare situation, as SAS might not always be able to
interpret your misspelled words.

Next, notice that the RUN statement is underlined. In this case, the previous line is missing the semicolon. The
message 'Syntax error, expecting one of the following...' indicates that something was missing. Consider how
SAS processed this step. SAS started with the PROC PRINT statement and kept going until it reached the
semicolon at the end of the RUN statement. So, SAS thought that the PROC PRINT and the RUN statements were
all one statement. SAS interpreted RUN as an option for PROC PRINT and printed an error message about an
invalid option. Notice that SAS did list the semicolon as one of the expected options.

You might be thinking, “Why did SAS report an error in the RUN statement? There's nothing wrong with the RUN
statement.” When you encounter this type of error, always check the statement before the underlined
statement. In many cases you will find that the statement before the error is missing a semicolon.

Now look at the next error message. SAS did not recognize the word AVERAGE as a valid option in the PROC
MEANS statement, so the PROC MEANS step didn't execute. Notice that SAS lists the valid options. The word
MEAN is listed as a valid option and should be used to calculate an average.

3. In the editor, correct the program. First, correct the spelling of DATA, and then add a semicolon to the end of
the PROC PRINT statement. Lastly, change the word AVERAGE to MEAN in the PROC MEANS statement.

data work.newsalesemps;
   length First_Name $ 12
          Last_Name $ 18 Job_Title $ 25;
   infile "&path/newemps.csv" dlm=',';
   input First_Name $ Last_Name $
         Job_Title $ Salary;
run;

proc print data=work.newsalesemps;
run;

proc means data=work.newsalesemps mean max;
   class Job_Title;
   var Salary;
run;

4. Submit the revised code and check the log. The log shows that the code ran successfully. No errors or warnings
appear. Also, SAS produced the reports you requested. As demonstrated, you can easily view and correct syntax
errors in SAS.

Business Scenario
Another common mistake that programmers make is leaving off a matching quotation mark. For example, suppose you
write a program that creates a data set and generates two reports. You submit the program, but it doesn't produce
results. The program might have unbalanced quotation marks.

Unbalanced Quotation Marks


In SAS, a quotation counter keeps count of the quotation marks in your code. SAS expects an even number, or matching
number, of quotation marks. If SAS detects an uneven number of quotation marks, the code won't execute properly.
Also, although SAS allows either single or double quotation marks, you can't mix the types. If you begin with a single
quotation mark, you must end with a single quotation mark; otherwise, SAS considers the quotation marks unbalanced.
When your program contains unbalanced quotation marks, whether from an uneven number or mismatched quotation
marks, SAS misreads both the statement containing the error and any following statements.

You should notice that there's a problem because much of the program will be colored purple in the editor. Purple
represents a quoted string. In this example, the string begins with a single quotation mark followed by a comma, a
semicolon, and then all the remaining statements in the program. Because the string does not contain a matching or
ending quotation mark, SAS reads all of this text as a quoted string.
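
The example that this passage describes is not reproduced in the transcript. As a sketch, and assuming the missing quotation mark is the one that should close the delimiter value in the INFILE statement (so that the faulty statement has only an opening quotation mark, a comma, and then the semicolon after DLM=), the corrected step would look like this. It relies on the path macro variable and the newemps.csv file used elsewhere in this lesson.

data work.newsalesemps;
   length First_Name $ 12
          Last_Name $ 18 Job_Title $ 25;
   infile "&path/newemps.csv" dlm=',';   /* closing quotation mark restored */
   input First_Name $ Last_Name $
         Job_Title $ Salary;
run;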

When you submit a program with unbalanced quotation marks in the SAS windowing environment, the program doesn't
stop running, and the log includes only the code you submitted. You won't see any error or warning messages, nor will
you see any indication that any of the steps executed. You'll also see a message in the banner of the editor stating that
the step is still running. You have to stop an executing program by cancelling the submitted statements. You can then
correct your program by adding the missing quotation mark.

When you submit a program with unbalanced quotation marks in client applications such as SAS Enterprise Guide and
SAS Studio, SAS writes messages to the log to alert you to the error. A warning in the SAS log stating that a quoted string
has become too long, or that a statement containing quotation marks is ambiguous, sometimes indicates unbalanced
quotation marks. In fact, any log message about a quoted string should alert you to the possibility of unbalanced
quotation marks. In client applications, SAS submits additional code, called wrapper code, that includes a single and a
double quotation mark in an attempt to repair any unbalanced quotes in the submitted program. The wrapper code
balances the quotation marks so that the code stops running, but your results will still contain errors and you must
correct the program. To do this, add the missing quotation mark or change the mismatched quotation mark so that both
match, and then resubmit the program.

For more information on correcting unbalanced quotation marks in the SAS windowing environment and client
applications, click the Information button.

Question
Which of the following represents a syntax error? Select all that apply.

a. a statement that begins in column one
b. invalid options
c. missing semicolon
d. multiple statements on one line
e. a statement that spans multiple lines
f. unmatched quotation marks
g. using mixed case letters within quotation marks
h. a misspelled keyword

The correct answers are b, c, f, and h. Common syntax errors include invalid options, missing semicolons, unmatched
quotation marks, and misspelled keywords.

Question
Which of the following samples of code is valid?
a.
title "New Sales Employees';
proc print data=work.NewSalesEmps;
run;

b.
title 'New Sales Employees';
proc print data=work.NewSalesEmps;
run;

The correct answer is b. Although SAS allows either single or double quotation marks, you can’t mix the types. If you use
one type to begin and a different type to end, SAS considers the quotation marks unbalanced.

Summary of Lesson 2: Working with SAS Programs

This summary contains topic summaries, syntax, and sample programs.

Topic Summaries

Exploring SAS Programs


A SAS program consists of DATA steps and PROC steps. A SAS programming step comprises a sequence of
statements. Every step has a beginning and ending step boundary. SAS compiles and executes each step independently,
based on the step boundaries.

A SAS program can also contain global statements, which are outside DATA and PROC steps, and typically affect the SAS
session. A TITLE statement is a global statement. After it is defined, a title is displayed on every report, unless the title is
cleared or canceled.

SAS statements usually begin with an identifying keyword, and always end with a semicolon. SAS statements are free
format and can begin and end in any column. A single statement can span multiple lines, and there can be more than
one statement per line. Unquoted values can be lowercase, uppercase, or mixed case. This flexibility can result in
programs that are difficult to read.

Conventional formatting, also called structured formatting, uses consistent spacing to make a SAS program easy to read.
To follow best practices, begin each statement on a new line, indent statements within each step, and indent
subsequent lines in a multi-line statement.
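
As a brief illustration of conventional formatting, the sketch below writes the same step two ways. It reads sashelp.class, a sample data set that ships with SAS, only so that it runs without the course files. That data set and its variables Sex and Height are assumptions for illustration and are not part of the course programs.

```sas
/* Free format: valid SAS, but hard to read */
proc means data=sashelp.class; class Sex;
var Height
; run;

/* Conventional formatting: one statement per line, */
/* with statements indented within the step         */
proc means data=sashelp.class;
   class Sex;
   var Height;
run;
```

SAS treats both versions identically; only the second one makes the step boundaries and the individual statements easy to see.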

Comments are used to document a program and to mark SAS code as non-executing text. There are two types of
comments: block comments and comment statements.

/* comment */
* comment statement;

Diagnosing and Correcting Syntax Errors
Syntax errors occur when program statements do not conform to the rules of the SAS language. Common syntax errors
include misspelled keywords, missing semicolons, and invalid options. SAS finds syntax errors during the compilation
phase, before it executes the program. When SAS encounters a syntax error, it writes the following to the log: the word
ERROR or WARNING, the location of the error, and an explanation of the error. You should always check the log, even if
the program produces output.

Mismatched or unbalanced quotation marks are considered a syntax error. In some programming environments, this
results in a simple error message. In other environments, it is more difficult to identify this type of error.

Sample Programs

Submitting a SAS Program

data work.newsalesemps;
   set orion.sales;
   where Country='AU';
run;

title 'New Sales Employees';

proc print data=work.newsalesemps;
run;

proc means data=work.newsalesemps;
   class Job_Title;
   var Salary;
run;

title;

Adding Comments to Your SAS Programs

*This program uses the data set orion.sales to create work.newsalesemps.;

data work.newsalesemps;
   set orion.sales;
   where Country='US';
run;

/*
proc print data=work.newsalesemps;
run;*/

proc means data=work.newsalesemps;
   class Gender;
   var Salary/*numeric variable*/;
run;

Viewing and Correcting Syntax Errors

daat work.newsalesemps;
length First_Name $ 12
Last_Name $ 18 Job_Title $ 25;
infile "&path/newemps.csv" dlm=',';
input First_Name $ Last_Name $
Job_Title $ Salary;
run;
proc print data=work.newsalesemps
run;

proc means data=work.newsalesemps average max;


class Job_Title;
var Salary;
run;

Lesson 3: Accessing Data


Lesson Overview
In this lesson, you learn to access and view the Orion Star data sets that you'll be working with. First, you'll learn to write
a simple program to define a SAS library. Then, you'll view the contents of the library and examine individual data sets.
You’ll explore the two portions of a SAS data set, and investigate variable attributes. Finally, you’ll learn the SAS naming
conventions for variables and data sets.

Objectives

In this lesson, you learn to do the following:

• explain the concept of a SAS library
• state the difference between a temporary library and a permanent library
• assign a library reference name to a SAS library by using the LIBNAME statement
• explore the contents of SAS libraries using the CONTENTS procedure
• access a data set in a user-created permanent library
• define the components of a SAS data set
• browse the descriptor portion of a SAS data set using the CONTENTS procedure
• browse the data portion of a SAS data set using the PRINT procedure
• identify the two main types of values (character and numeric) and missing values
• explain the SAS naming conventions for variables and data sets

Accessing SAS Libraries


You know the three major types of files used in SAS: raw data files, SAS data sets, and SAS program files. But you haven’t
yet learned how SAS organizes and stores these files, or more importantly, how you can access them. For example, as a
new SAS programmer at Orion Star, you need to access existing SAS data sets and use them to perform your duties. You
need to know about SAS libraries.

Exploring SAS Libraries


SAS data sets are stored in SAS libraries. A SAS library is a collection of one or more SAS files that are recognized by SAS
and that are referenced and stored as a unit. A library is the highest level of organization for information within SAS. You
can think of a SAS library as a drawer in a filing cabinet and a SAS data set as one of the files in the drawer. Each file is
referred to as a member of the library.

At the beginning of each SAS session, SAS automatically provides one temporary and at least one permanent SAS library
that you can access. These libraries, or drawers, are open and ready for use. SAS provides a temporary library called
work where you can store and access SAS data sets for the duration of the SAS session. At the end of every SAS session,
SAS deletes the work library and its contents.

SAS also provides permanent libraries. SAS data sets in permanent libraries are saved after your SAS session terminates.
For example, sashelp is a permanent library that SAS makes available. It contains sample SAS data sets that you can
access anytime you start a SAS session. Traditionally, SAS defines a permanent library named sasuser that you can use
for storing and accessing SAS data sets. At your site, there might be different permanent libraries that are defined for
you to use. Data sets that you store in a permanent library remain available in later sessions as well.

Accessing SAS Libraries


Regardless of the operating environment you use, you refer to a SAS library by a logical name called a library reference
name, or libref. A libref references a particular physical location that the operating environment recognizes, so you can
think of a libref as a shortcut to that physical location. So, work, sasuser, and sashelp are librefs that refer to physical
locations. The implementation of a SAS library corresponds to the way that your operating environment stores files. In
the Windows operating environment, for example, a libref typically refers to a group of SAS files in the same folder or
directory.

Question
Which statement about SAS libraries is true?

a. You refer to a SAS library by a logical name called a LIBNAME.
b. A SAS library is a collection of one or more SAS files that are referenced and stored as a unit.
c. A single SAS library can contain files that are stored in different physical locations.
d. At the end of each session, SAS deletes the contents of all SAS libraries.

The correct answer is b. You refer to a SAS library by a logical name called a libref. A single SAS library cannot contain
files that are stored in different physical locations. And SAS deletes the contents of temporary SAS libraries but not
permanent SAS libraries.

Using Two-Level Data Set Names


All SAS data sets have a two-level name that consists of the libref and the data set name, separated by a period. When a
data set is in the temporary work library, you can optionally use a one-level name. A one-level name consists of just the
data set name, such as newsalesemps. When you specify a one-level name, SAS assumes that the data set is stored in
the work library. So, work is the default libref.

When the data set is in a permanent library, you must use a two-level name. Let’s take a look at the following program
to further understand how SAS data sets are named and how you refer to them in code.

data work.newsalesemps;
   set orion.sales;
run;

title 'New Sales Employees';

proc print data=work.newsalesemps;
run;

proc means data=work.newsalesemps;
   class Job_Title;
   var Salary;
run;

title;

In this program, the DATA step creates a temporary data set named newsalesemps by using the two-level name
work.newsalesemps. Both of the PROC steps reference this data set. When the current SAS session ends, SAS deletes
the newsalesemps data set, along with any other data sets that are stored in the work library.
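
A minimal sketch of the one-level and two-level forms, assuming default session settings. The input is the sample data set sashelp.class (not part of the course data), chosen so that the code runs without the orion library, and the name classcopy is hypothetical.

```sas
/* One-level name: SAS stores the data set in the work library */
data classcopy;
   set sashelp.class;
run;

/* The two-level name work.classcopy refers to the same data set */
proc print data=work.classcopy;
run;
```

The log for the DATA step should name the new data set as WORK.CLASSCOPY, which confirms that the one-level name was placed in the temporary library.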

Question
In this SAS program, is salesbonus a temporary SAS data set?

proc means data=salesbonus;
   class Job_Title;
   var Amount;
run;

a. yes
b. no

The correct answer is a. The data set salesbonus is a temporary data set because it is referenced using a one-level name.
SAS assumes that the data set is stored in the temporary work library.

Business Scenario
Suppose you know the physical location of the SAS data sets that contain Orion Star data. You want to define a SAS libref
named orion to access and view those data sets. After you define the libref, you can explore the data sets and reference
the data sets in your SAS programs.

Creating SAS Libraries


You can create and access your own SAS libraries. A user-created library has the following characteristics. It’s
permanent, meaning that the data sets are stored there until you delete them. It’s implemented within the operating
environment’s file system. And it’s not automatically available in a SAS session. You must assign a libref to a user-created
library to make it available in a SAS session.

Let’s look at how you define a library. First, you identify the location of the library to SAS. For example, suppose you
have Orion Star data stored in a Microsoft Windows folder, and you want to use the folder as your SAS library. Your
operating system knows about the folder, but SAS doesn’t. To use this folder as your SAS library, you must tell SAS
where it is. In other words, you need to make a connection between the folder containing your data and SAS.

Using the LIBNAME Statement


In SAS code, you use the LIBNAME statement to associate the libref with the physical location of the library, that is, the
physical location of your data. The LIBNAME statement makes the SAS library—your data—available for the duration of
your current SAS session.

Let's explore the syntax.

LIBNAME libref 'SAS-library' <options>;

You begin with the keyword LIBNAME. Next, you specify the name of the libref. A valid libref must follow several rules. It
must have a length of one to eight characters, and must begin with a letter or underscore. The remaining characters
must be letters, numbers, or underscores. You then specify the physical location of the SAS library—your data. You must
reference an existing folder; the LIBNAME statement does not create a new folder. You enclose the physical location in
single or double quotation marks.
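
The three libref rules can be sketched as a small check in a DATA step. This is an illustration only, not part of the course material: the candidate names are hypothetical, and the check relies on the NVALID function, whose 'v7' setting tests for a leading letter or underscore followed by letters, numbers, or underscores. The eight-character limit is tested separately.

```sas
/* Check hypothetical candidate librefs against the three rules */
data _null_;
   length candidate $ 12;
   do candidate = 'orion', '_lib1', 'mylibrary9', '9lives', 'my-lib';
      /* NVALID covers the first-character and remaining-character rules; */
      /* the LENGTHN comparison adds the eight-character limit            */
      is_valid = (lengthn(candidate) <= 8) and nvalid(strip(candidate), 'v7');
      put candidate= is_valid=;
   end;
run;
```

The log shows is_valid=1 for names that satisfy all three rules and is_valid=0 otherwise.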

The following LIBNAME statement creates the libref named orion.

libname orion 'filepath';

In place of filepath, you specify the actual physical location of the library. Here's an example in the Windows operating
environment.

libname orion 'c:\oriondata';

However, depending upon your working environment, the path might look more like this.

libname orion '/oriondata';

The LIBNAME statement is a global statement. It's not part of a DATA or PROC step, and it doesn't need a RUN
statement. You can submit the LIBNAME statement alone, or you can store it with any SAS program so that the SAS
library is defined each time the program runs. If your program needs to reference data sets in multiple locations, you
can use multiple LIBNAME statements, as many as you want.

Question
Which of the following librefs is valid?

a. _orionstar
b. orion/01
c. or_01
d. 1_or_a

The correct answer is c. This libref follows all three rules for valid librefs. It has a length of one to eight characters, it
begins with a letter or underscore, and its remaining characters are letters, numbers, or underscores.

Question
Which of the following correctly assigns the libref myfiles to a SAS library in the c:/mysasfiles folder?

a. libname orion myfiles 'c:/mysasfiles';
b. libname myfiles 'c:/mysasfiles';
c. libref orion myfiles 'c:/mysasfiles';
d. libref myfiles 'c:/mysasfiles';

The correct answer is b. This LIBNAME statement begins with the keyword LIBNAME, followed by the name of the libref,
which is myfiles. It then specifies the physical location of the library, in quotation marks, which is c:/mysasfiles.

Accessing a SAS Library


In this demonstration, you submit a LIBNAME statement to assign the orion libref.

1. Copy and paste the following LIBNAME statement into the editor:
%let path=/dept/dvt/WebDMS/Education/Prg1PracticeFiles;
libname orion "&path";

You might recognize the code from the data setup program you used for this course. The first line of code
creates a macro variable that stores the location of the practice files. Your location will be different from the
one shown here.

The next line is the LIBNAME statement, which is the focus of this demo. You specify the orion libref and then
the physical location of the library. Again, because of the macro variable, the location is simply &path.
Remember that the location must be enclosed in quotation marks. Note that any time you reference a macro
variable within quotation marks, such as in a LIBNAME statement, you must use double quotation marks. So
we'll use double quotation marks here.

2. Submit the code and check the log. Verify that the orion libref was assigned successfully.
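
The rule about double quotation marks and macro variables can be seen in a short sketch. The path below is hypothetical, and the DATA step exists only to write both values to the log.

```sas
/* Hypothetical location; substitute the path to your own practice files */
%let path=/hypothetical/practice/files;

data _null_;
   resolved   = "&path";   /* double quotes: SAS substitutes the value of path */
   unresolved = '&path';   /* single quotes: the text &path is left as is      */
   put resolved= / unresolved=;
run;
```

This is why a LIBNAME statement that uses &path inside single quotation marks would point to a location literally named &path rather than to your practice files.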

Business Scenario
The orion library is now active, so you have access to all of the files that it contains. You can use DATA steps and PROC
steps to work with the files. Suppose you don't know what files are available in the library. To view the contents of the
library, you can write a SAS program that creates a report with general information about the library. The report will also
list the members of the library. Let's find out how to generate this report.

Browsing a Library Programmatically


To create a report that displays the contents of a SAS library, you can write a PROC CONTENTS step.

PROC CONTENTS DATA=libref._ALL_;
RUN;

The syntax of a basic PROC CONTENTS step has two statements: a PROC CONTENTS statement and a RUN statement.
The PROC CONTENTS statement begins with the keywords PROC CONTENTS, followed by the DATA= option.

When you use PROC CONTENTS to list the contents of a SAS library, you indicate the library after the DATA= option by
specifying the libref, a period, and the keyword _ALL_. When you use _ALL_ in the DATA= option, PROC CONTENTS
displays a list of all the SAS files that are contained in the SAS library. Finally, remember that the RUN statement tells SAS
to execute the preceding SAS statements.

Browsing a Library
In this demonstration, you submit a PROC CONTENTS step to browse a SAS library programmatically.

1. Copy and paste the following PROC CONTENTS step into the editor.

proc contents data=orion._all_;
run;

2. Submit the code and view the results. You might see additional details in your PROC CONTENTS results,
depending upon your SAS environment. The first table, Directory, lists general information about the library. The
second table lists all members of the library in alphabetical order, and provides basic information about each
member. As the Member Type column indicates, the orion library contains two types of SAS files: data sets and
indexes. Only the data sets are numbered in the first column. The names of the data sets indicate the type of
Orion Star information that you'll work with.
3. Scroll down in the report. By default, a PROC CONTENTS report also includes information about each individual
data set in the library, called the descriptor portion. If a library has many data sets, a report that includes all the
descriptors can be very long. To suppress the descriptor portions in the report, you specify the NODS option.

4. In the editor, add the NODS option after the _ALL_ keyword. You must use a space to separate _ALL_ from the
NODS option.

proc contents data=orion._all_ nods;
run;

5. Submit the code and view the results. The results now contain only the two tables with the information you're
interested in working with.

Question
Which PROC CONTENTS step prints only general information about a SAS library and a listing of the members of the
library?

a.
proc contents data=orion.country nods;
run;

b.
proc contents data=orion._all_ nods;
run;

c.
proc contents data=orion._all_;
run;

d.
proc contents data=orion.nods _all_;
run;

The correct answer is b. This PROC CONTENTS step generates the specified output. After the keyword _ALL_, you add a
space and then the NODS keyword to suppress the descriptor data for each individual file in the library.

Accessing a Permanent Data Set


Now that you’ve assigned the orion libref, the connection has been made between the folder containing your data and
SAS. You can access the SAS files in the orion library. One way to do this is to use a PROC PRINT step. The syntax is very
similar to the PROC CONTENTS syntax.

PROC PRINT DATA=SAS-data-set;
RUN;

After DATA=, you specify the libref name orion as the first part of the two-level data set name. Then you specify the
name of the data set that you want to display. For example, suppose you know that one of the data sets is named
country. You type data=orion.country in your PROC PRINT step to view the data set.

Viewing a Data Set with PROC PRINT


In this demonstration, you view a data set with PROC PRINT.
1. Copy and paste the following PROC PRINT step into the editor.

proc print data=orion.country;
run;

2. Submit the code and check the log. Verify that SAS read 7 observations from the data set and that there are no
warnings or errors.

3. View the results. You don't need to be concerned with the details of the data set. Just remember that you must
have already assigned the orion libref in order to view and/or work with data in the orion library.

Changing or Cancelling a Libref


Now let's look at how long a libref stays in effect. In an interactive SAS session, a libref that you assign remains in effect
until you cancel the libref, change the libref, or end your SAS session. You can use the CLEAR option in the LIBNAME
statement to cancel, or disassociate, a libref that you previously assigned. Disassociating the libref disconnects the
library from SAS.

For example, in the following program, the LIBNAME statement associates the libref perm with the data in the folder
myfiles.

libname perm 'filepath/myfiles';

proc print data=perm.orders;
   var Order_ID Order_Type Order_Date;
run;

libname perm clear;

At the end of the program, another LIBNAME statement disassociates the perm libref. Suppose you need to specify a
different physical location for the files. To change the location, you can submit a LIBNAME statement with the same
libref name but with a different filepath.
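
One way to confirm that a libref is assigned or cleared is the LIBREF function, which returns 0 when the libref is currently assigned and a nonzero value otherwise. The sketch below is an illustration under one assumption: it points perm at the folder behind the work library, simply because that folder is certain to exist in any session. In practice you would specify your own path.

```sas
/* Assumption: the work library's folder is used only because it is */
/* guaranteed to exist; in practice you would specify your own path */
libname perm "%sysfunc(pathname(work))";
%put NOTE: After assignment, LIBREF returns %sysfunc(libref(perm)).;

/* Disassociate the libref; the files in the folder are not deleted */
libname perm clear;
%put NOTE: After CLEAR, LIBREF returns %sysfunc(libref(perm)).;
```

Submitting a new LIBNAME statement for perm with a different path, instead of the CLEAR option, would redirect the libref rather than cancel it.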

When you end your SAS session, the contents of a permanent library still exist in their physical location in your operating
environment, but SAS deletes everything in the work library. Each time you start a new SAS session, you must resubmit
the LIBNAME statement for the SAS libraries that you need to use.

Examining SAS Data Sets


Business Scenario
As an Orion Star programmer, you’ll work with permanent data sets in the orion library, as well as with temporary data
sets in the work library. All SAS data sets, both permanent and temporary, are data files that SAS creates and that only
SAS can read. Before you begin your work at Orion Star, you first need to understand the components and structure of
SAS data sets.

Examining SAS Data Sets


What is a SAS data set? A SAS data set is a specially structured data file that is displayed as a table with variables and
observations. You might be more familiar with the terms table, columns, and rows, as these terms are commonly used
across different types of databases. In SAS, a table is usually called a data set, a column is called a variable, and a row is
called an observation.

Let’s look at the following example. In the data set work.newsalesemps, the third column is the variable Job_Title, and
observation 3 contains the data for Kevin Lyon. A variable is a container that stores values. The value of the variable
Job_Title in observation 3 is Sales Rep. I.

work.newsalesemps

First_Name   Last_Name    Job_Title       Salary
Satyakam     Denny        Sales Rep. II    26780
Monica       Kletschkus   Sales Rep. IV    30890
Kevin        Lyon         Sales Rep. I     26955
Petrea       Soltau       Sales Rep. II    27440

A SAS data set contains a descriptor portion and a data portion. The descriptor portion contains information about the
attributes of the data set, or metadata. The metadata includes general properties such as the data set name, the
number of observations, and the date and time that the data set was created, as well as variable properties such as
name, type, and length. You can browse the descriptor portion of your SAS data sets using PROC CONTENTS.

The data portion of a SAS data set contains the data values, stored in variables. Remember that the variable names are
part of the descriptor portion, not the data portion. Data values are either character or numeric. For example,
First_Name, Last_Name and Job_Title have character values, and Salary has numeric values. You can use PROC PRINT to
display the data portion of your SAS data sets. You’ll learn more about the data portion of SAS data sets in a bit.

Viewing the Descriptor Portion of a Data Set


In this demonstration, you run a PROC CONTENTS step to display the descriptor portion of the sales data set.

1. Copy and paste the following code into the editor. The sales data set is in the permanent orion library, so you
must use a two-level data set name, orion.sales.

proc contents data=orion.sales;
run;

2. Submit the code and view the results. The PROC CONTENTS output displays the descriptor portion of the data
set in three tables. The first table shows general information about the data set, such as the data set name, and
the date and time the data set was created.

Take a look at the other information in this table. Can you determine how many observations are in this data
set? There are 165 observations in orion.sales.

The second table displays operating environment information, the physical location of the file, and other data
set information. The third table is an alphabetic list of variables in the data set and their attributes.

Viewing the Data Portion of a Data Set


In this demonstration, you run a PROC PRINT step to display the data portion of the data set orion.sales.

1. Copy and paste the following code into the editor. Remember that the data portion contains the data values.

proc print data=orion.sales;
run;

2. Submit the code and check the log. A note in the log confirms that SAS read 165 observations from the data set.
3. View the report. By default, PROC PRINT displays all variables and observations, using the variable names as
column headings. The Obs column is displayed to identify each observation, similar to a row number.

In the PROC CONTENTS output, SAS displayed a table with variable attributes, including the variable type:
character or numeric. You can determine a variable’s type by looking at your PROC PRINT output; SAS
automatically displays character variables left-aligned, and numeric variables right-aligned.

Question
Type the correct letter to match each type of information with the portion of a SAS data set in which it is documented
or stored.

the value of Salary for observation 1      a. descriptor portion
the name of the data set                   b. data portion
the type of the Salary variable
the creation date of the data set

The correct answers from top to bottom are b, a, a, a. The descriptor portion of a SAS data set contains the name of the
data set, the variable's type, and the creation date of the data set. The data portion contains the data values. Although
you can determine a variable's type by the alignment of the variable values in the data portion, SAS documents the type
in the descriptor portion.

Question
How many observations and variables does the data set shown here contain?

Company      Region   Sales
A&MRadio     N        63500
Jack's TV    S        45800
Sound City   S        38900
Music Ltd.   N        99500

a. three observations, three variables
b. four observations, three variables
c. three observations, four variables

The correct answer is b. The SAS data set contains four observations and three variables. Recall that in SAS, observations
are the rows in a data set, and variables are the columns in a data set.

Understanding Missing Values


Now consider this: What happens if you have missing values in your data set? In a SAS data set, a value must exist for
every variable and observation. If a data value is unknown for a particular observation, a missing value is recorded in the
SAS data set. Missing values are valid values in a SAS data set. A variable’s type determines how SAS displays missing
values for a variable.

For character variables such as Job_Title, a blank represents a missing value. For numeric variables such as Salary, SAS
uses a period, by default, to represent a missing value.

work.newsalesemps

First_Name   Last_Name    Job_Title       Salary
Satyakam     Denny        Sales Rep. II    26780
Monica       Kletschkus   Sales Rep. IV        .
Kevin        Lyon                          26955
Petrea       Soltau       Sales Rep. II    27440

You can alter this default with the MISSING= SAS system option, which specifies a character to print for missing numeric
variable values.

MISSING='character'
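
A minimal sketch of the option in use, assuming a small hypothetical data set that is created inline so that the code is self-contained. The data set name missing_demo and the hyphen are illustrative choices.

```sas
/* Hypothetical data with one missing numeric value */
data work.missing_demo;
   input First_Name $ Salary;
   datalines;
Satyakam 26780
Monica .
Kevin 26955
;
run;

options missing='-';      /* print a hyphen for missing numeric values */

proc print data=work.missing_demo;
run;

options missing='.';      /* restore the default period */
```

The option changes only how missing numeric values are printed; the stored value is still a missing value.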

Question
According to the data set shown, what type of variable is ActLevel?

 ID      DoB  ActLevel
134  05MAR59
  .  22MAY41  3
224        .  4
298  12DEC43  2

a. numeric
b. character
c. can't tell from the data shown

The correct answer is b. The variable ActLevel has a missing value in row one that is represented with a blank. Missing
character values in SAS are represented with blanks. In addition, the values for ActLevel are left justified, which also
indicates that ActLevel is a character variable. If the MISSING= system option had been set to a different character, the
missing numeric values for ID and DoB would not appear as periods, as they do here.

Exploring SAS Variable Attributes


When you write SAS programs, it's important to understand the attributes of the variables that you use. Using the
following PROC CONTENTS output of the data set orion.sales, let's take a closer look at the two variable attributes type
and length.

Alphabetic List of Variables and Attributes

#   Variable      Type   Len   Format
8   Birth_Date    Num      8
7   Country       Char     2
1   Employee_ID   Num      8   12.
2   First_Name    Char    12
4   Gender        Char     1
9   Hire_Date     Num      8
6   Job_Title     Char    25
3   Last_Name     Char    18
5   Salary        Num      8

As you know, a variable's type is either character or numeric. Character variables can store any values, such as letters,
numbers, special characters, and blanks.

Character Values

Monica

120101

3Top Sports

Auditing & Wages

Now let's look at some examples of valid numeric values. Numeric variables can store only numeric values, which can
include the digits 0 through 9, a minus sign, a single decimal point, and E for scientific notation.

Numeric Values

26780

-30
-29.92

3.1E6

A variable's length indicates the number of bytes used to store it. The length is related to the variable's type. Character
values are stored with a length of 1 to 32,767 bytes. One byte equals one character. In orion.sales, the First_Name
variable has a length of 12 characters and uses 12 bytes of storage.

Numeric variables have 8 bytes of storage by default, no matter how many digits they contain. When stored in floating
point or binary representation, 8 bytes of storage provide space for 16 or 17 significant digits. SAS variables can have
additional attributes, such as formats, as well as some that are not shown in this PROC CONTENTS output: informat and
label. You will learn more about the additional attributes in a later lesson.
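
To see type and length in practice, the following sketch creates a small data set and displays its descriptor and data portions. The data set name length_demo, the values, and the deliberately short length of 10 for Job_Title are hypothetical and chosen only for illustration.

```sas
data work.length_demo;
   /* Declare character lengths explicitly; Salary is numeric and defaults to 8 bytes */
   length First_Name $ 12 Job_Title $ 10;
   First_Name = 'Monica';
   /* This value has more than 10 characters, so the stored value is truncated */
   Job_Title  = 'Sales Rep. IV';
   Salary     = 30890;
run;

/* Descriptor portion: shows the type and length of each variable */
proc contents data=work.length_demo;
run;

/* Data portion: shows the stored values, including the truncated one */
proc print data=work.length_demo;
run;
```

SAS truncates the long value without writing an error to the log, which is one reason to check variable lengths with PROC CONTENTS before you rely on character data.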

Question
Which of the following values can be stored in the variable Product_Line, based on the attributes shown in this partial
PROC CONTENTS output? Select all that apply.

Alphabetic List of Variables and Attributes

#   Variable           Type   Len   Format   Label
3   Product_Category   Char    25            Product Category
4   Product_Group      Char    25            Product Group
1   Product_ID         Num      8   12.      Product ID
2   Product_Line       Char    20            Product Line
5   Product_Name       Char    45            Product Name
6   Supplier_Country   Char     2            Supplier Country
8   Supplier_ID        Num      8   12.      Supplier ID
7   Supplier_Name      Char    30            Supplier Name

a. Dubby Low Men's Street Shoes
b. Shoes
c. 13198
d. A-Team, Kids
e. Trois Socks (Cush)

The correct answers are b, c, d, and e. All of the values are valid character values, but the first value is longer than the
specified length of 20.

Question
In this PROC CONTENTS output, what is the default length of the variable Street_ID?

Alphabetic List of Variables and Attributes

#   Variable            Type   Len   Format   Label
6   Country             Char     2            Country
3   Street_ID           Num      8   12.      Street ID
5   Sup_Street_Number   Char     8            Supplier Street Number
4   Supplier_Address    Char    45            Supplier Address
1   Supplier_ID         Num      8   12.      Supplier ID
2   Supplier_Name       Char    30            Supplier Name

a. 8 bytes
b. 9 bytes
c. 16 or 17 bytes
d. 32,767 bytes

The correct answer is a. The default length of numeric variables is 8 bytes. A numeric variable of the default length can
hold 16 or 17 significant digits. 32,767 bytes is the largest possible length for a character variable.

Naming SAS Variables and Data Sets


You must follow the SAS naming conventions when naming variables and data sets. SAS variable names can be 1 to 32
characters long. The name must start with a letter or underscore and can continue with any combination of numbers,
letters, or underscores. SAS variable names can be uppercase, lowercase, or mixed case.

Valid SAS Names

Job_Title

_quantity2012_

Customer_FirstName

SAS creates each variable name in the same case that you first specify it, and that is the way it appears in reports. After a
variable has been created, you can refer to it in any case in your code without affecting the way that it is stored. You
apply the same naming conventions to SAS data set names.

proc print data=orion.sales;
   var salary;
run;

Partial output

Obs    Salary
  1    108255
  2     87975
  3     26600
  4     27475
  5     26190
  6     26480

Click the Information button in the course interface to learn about using special characters in variable names.

Question
Which of the following variable names are valid? Select all that apply.
a. data5mon
b. 5monthsdata
c. data#5
d. five months data
e. five_months_data
f. FiveMonthsData

The correct answers are a, e, and f. Valid variable names begin with a letter or underscore, and continue with letters,
numbers, or underscores.

Question
Which of the following is not a valid name for a SAS data set?
a. Customer_Purchases_Quarter2_2007
b. _Sales_MainOffice
c. 2007_Sales
d. TotalSales2007

The correct answer is c. A valid SAS name cannot begin with a number. Valid names must begin with either a letter or an
underscore. Subsequent characters can be letters, underscores, or numerals.

Summary of Lesson 3: Accessing Data

This summary contains topic summaries, syntax, and sample programs.

Topic Summaries
To go to the movie where you learned a task or concept, select a link.
Accessing SAS Libraries
SAS data sets are stored in SAS libraries. A SAS library is a collection of one or more SAS files that are recognized by SAS.
SAS automatically provides one temporary and at least one permanent SAS library in every SAS session.

Work is a temporary library that is used to store and access SAS data sets for the duration of the session. Sasuser and
sashelp are permanent libraries that are available in every SAS session.

You refer to a SAS library by a library reference name, or libref. A libref is a shortcut to the physical location of the SAS
files.

All SAS data sets have a two-level name that consists of the libref and the data set name, separated by a period. Data
sets in the work library can be referenced with a one-level name, consisting of only the data set name, because work is
the default library. Data sets in permanent libraries must be referenced with a two-level name.
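
As a brief illustration of one-level and two-level names, the sketch below copies a data set into the work library and then prints it twice. It assumes the sashelp.class sample data set, which ships with SAS but is not part of the course data. The name class_copy is made up for this example.

/* Two-level names: libref.data-set-name */
proc sort data=sashelp.class
out=work.class_copy;
by Name;
run;

/* Two-level name for a data set in the work library */
proc print data=work.class_copy;
run;

/* One-level name: SAS looks in the work library by default */
proc print data=class_copy;
run;

Both PROC PRINT steps should display the same data set, because the one-level name class_copy is treated as work.class_copy.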

You can create and access your own SAS libraries. User-defined libraries are permanent but are not automatically
available in a SAS session. You must assign a libref to a user-created library to make it available. You use a LIBNAME
statement to associate the libref with the physical location of the library, that is, the physical location of your data. You
can submit the LIBNAME statement alone at the start of a SAS session, or you can store it in a SAS program so that the
SAS library is defined each time the program runs. If your program needs to reference data sets in multiple locations,
you can use multiple LIBNAME statements.

LIBNAME libref 'SAS-library' <options>;

Use PROC CONTENTS with libref._ALL_ to display the contents of a SAS library. The report will list all the SAS files
contained in the library, as well as the descriptor portion of each data set in the library. Use the NODS option in the
PROC CONTENTS statement to suppress the descriptor information for each data set.

PROC CONTENTS DATA=libref._ALL_ NODS;
RUN;

After associating a libref with a permanent library, you can write a PROC PRINT step to display a SAS data set within the
library.

PROC PRINT DATA=libref.SAS-data-set;
RUN;

In an interactive SAS session, a libref remains in effect until you cancel it, change it, or end your SAS session. To cancel a
libref, you submit a LIBNAME statement with the CLEAR option. This clears or disassociates a libref that was previously
assigned. To specify a different physical location, you submit a LIBNAME statement with the same libref name but with a
different filepath.

LIBNAME libref CLEAR;

When a SAS session ends, everything in the work library is deleted. The librefs are also deleted. Remember that the
contents of permanent libraries still exist in the operating environment, but each time you start a new SAS session,
you must resubmit the LIBNAME statement to redefine a libref for each user-created library that you want to access.
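
The life cycle of a libref can be sketched as follows. Both filepath and other-filepath are placeholders for locations on your system, not real paths.

/* Assign the libref */
libname orion "filepath";

/* Point the same libref at a different physical location */
libname orion "other-filepath";

/* Disassociate the libref; the files in the library are not deleted */
libname orion clear;
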
Examining SAS Data Sets
SAS data sets are specially structured data files that SAS creates and that only SAS can read. A SAS data set is displayed
as a table composed of variables and observations. A SAS data set contains a descriptor portion and a data portion.

The descriptor portion contains general information about the data set (such as the data set name and the number of
observations) and information about the variable attributes (such as name, type, and length). There are two types of
variables: character and numeric. A character variable can store any value and can be up to 32,767 bytes long.
Numeric variables store numeric values in floating point or binary representation in 8 bytes of storage by default. Other
attributes include formats, informats, and labels. You can use PROC CONTENTS to browse the descriptor portion of a
data set.

PROC CONTENTS DATA=libref.SAS-data-set;
RUN;

The data portion contains the data values. Data values are either character or numeric. A valid value must exist for every
variable in every observation in a SAS data set. A missing value is a valid value in SAS. A missing character value is
displayed as a blank, and a missing numeric value is displayed as a period. You can specify an alternate character to print
for missing numeric values using the MISSING= SAS system option. You can use PROC PRINT to display the data portion
of a SAS data set.
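
For example, the following sketch prints missing numeric values as a hyphen and then restores the default period. The choice of a hyphen is arbitrary, and the sketch assumes that the orion library is already defined.

/* Display missing numeric values as a hyphen */
options missing='-';

proc print data=orion.sales;
run;

/* Restore the default display character */
options missing='.';

Because MISSING= is a system option, it stays in effect for later steps until you change it again, which is why the sketch resets it at the end.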

SAS variable and data set names must be 1 to 32 characters in length and start with a letter or underscore, followed by
letters, underscores, and numbers. Variable names are not case sensitive.

Sample Programs

Accessing a SAS Library

/*Replace filepath with the physical location of your practice files.*/
%let path=filepath;
libname orion "&path";

Browsing a Library

proc contents data=orion._all_;
run;

proc contents data=orion._all_ nods;
run;

Viewing a Data Set with PROC PRINT

proc print data=orion.country;
run;

Viewing the Descriptor Portion of a Data Set

proc contents data=orion.sales;
run;

Viewing the Data Portion of a SAS Data Set

proc print data=orion.sales;
run;

Lesson 4: Producing Detail Reports


Sometimes, you might want to display all of the information from a SAS data set in a report. But other times, you might
want to display very specific information from the data set. In this lesson, you learn to enhance the way that information
is arranged and formatted in your reports. By adding statements to your PROC PRINT steps, you can select variables to
print, subset observations, sort and group observations, specify titles and footnotes, and assign labels to variables.

Objectives

In this lesson, you learn to do the following:

• select variables to print by using the VAR statement
• calculate column totals by using the SUM statement
• subset observations by using the WHERE statement
• sort observations by using the SORT procedure
• group observations in reports by using the BY statement
• specify titles and footnotes for your reports
• assign temporary labels to variables by using the LABEL statement

Subsetting Report Data


Orion Star management wants you to create a report that displays only the names and salaries of sales employees. They
also want to see a salary total for these employees. You'll need to subset the data in your report to show the required
variables, and create a sum of all the salaries.

Selecting Variables
By default, a PROC PRINT step displays all observations and variables in a data set, and the variables appear in the order
in which they occur in the data set. You can use the VAR statement to modify the default behavior and display only the
variables you want.

VAR variable(s);

In a VAR statement, you list the variables to include in the report. In this scenario, you want to print the last name and
first name of each sales employee, as well as their salaries, so you list the variables in that order, as shown below.

proc print data=orion.sales;
var Last_Name First_Name Salary;
run;

Generating Column Totals
Now you need to generate a salary total for all sales employees and display it in your report. The SUM statement
calculates and displays report totals for the requested numeric variable, which in this case is Salary. The general form of
the SUM statement is shown below.

SUM variable(s);

The following code produces the column total, which appears at the end of the report.

proc print data=orion.sales;
var Last_Name First_Name Salary;
sum Salary;
run;

Question
Which SUM statement will produce column totals for the variables Quantity and Total_Retail_Price?

proc print data=orion.order_fact;
var Customer_ID Order_Date Quantity
Total_Retail_Price;
__________________________________________;
run;

a. sum=Quantity, Total_Retail_Price;
b. sum Quantity, Total_Retail_Price;
c. sum Quantity Total_Retail_Price;
d. sum=Quantity sum=Total_Retail_Price;

The correct answer is c. You specify the variable names separated by blanks to display totals for variables in your report.

Subsetting Your Report


In this demonstration, you subset your report and summarize the values of a variable.

1. Copy and paste the following code into the editor.

proc print data=orion.sales;
run;

2. Submit the code and view the results. Notice that there are nine variables.

3. In the editor, modify the code. You want to display Last_Name, First_Name, and Salary, and you want to
summarize the values of Salary. Add a VAR statement and a SUM statement to the PROC PRINT step.

proc print data=orion.sales;
var Last_Name First_Name Salary;
sum Salary;
run;

4. Submit the code. Check the log to make sure the code ran successfully.

5. View the report. The report shows only the three variables that you requested, and the salary total is in the last
row of the report.

Code Challenge
Write a statement to specify that the variables ID, Name, Company, and Policy be printed for the SAS data set
orion.sales.

proc print data=orion.sales;
;
run;

var ID Name Company Policy;

You specify variable names separated by blanks in the VAR statement in the order in which you want the variables
printed.

Business Scenario
The management team now wants a report that displays the names and salaries of the sales employees earning a salary
that is less than $25,500. To create this report, you'll need to subset the observations.

Subsetting Observations Using the WHERE Statement


You can use the WHERE statement in a PROC PRINT step to subset observations in a report.

WHERE where-expression;

When you use a WHERE statement, your output contains only the observations that meet the conditions specified in the
where-expression. The WHERE expression defines a condition for selecting observations, and can be any valid SAS
expression.

This WHERE statement defines the condition for your task: it selects observations where the value of Salary is less than
$25,500. Notice that we still have the VAR statement, because those are still the variables we want to see. We've
removed the SUM statement here though, because we aren't interested in seeing a salary total anymore.

proc print data=orion.sales;
var Last_Name First_Name Salary;
where Salary<25500;
run;

Now let's explore the WHERE expression a bit. An expression is a sequence of operands and operators that form a set of
instructions. The operands can be constants or variables. A constant operand is a fixed value, such as 25500. Numeric
constants do not use quotation marks or special characters.

This example shows a WHERE statement with a character constant.

proc print data=orion.sales;
var Last_Name First_Name Salary;
where Gender='M';
run;

Character constants must be enclosed in quotation marks and are case sensitive. If the value of the variable Gender is
equal to the uppercase M character constant, SAS will select that observation.

A variable operand must be a variable from the input data set. Orion.sales is the input data set in this program, so the
variable Salary must exist in that data set. SAS selects observations where the value of Salary is less than 25500.

You can use comparison, arithmetic, or logical operators in a WHERE expression. There are also several special WHERE
operators that can only be used in WHERE statements. Let's look at each of these types of operators in more detail.

Comparison and Arithmetic Operators


The comparison operators, which are below, compare a variable with a constant or another variable. You can use either
the symbol or the mnemonic in your code.

Symbol(s)   Mnemonic   Definition
=           EQ         equal to
^= ¬= ~=    NE         not equal to
>           GT         greater than
<           LT         less than
>=          GE         greater than or equal to
<=          LE         less than or equal to
            IN         equal to one of a list

Look over the following examples of comparison operators used in WHERE statements.

where Gender='M';

where Gender eq 'M';

where Salary ne .;

where Salary>50000;

where Salary lt 50000;

where Salary<=60000;

where Country in ('AU','US');

The IN operator selects observations if they are equal to one of a list, and the value list in the IN operator must be
enclosed in parentheses and separated by either commas or blanks. In the last example, SAS selects observations where
the value of Country is equal to AU or US. Remember that character values must be enclosed in quotation marks.

The arithmetic operators indicate that an arithmetic calculation is performed.

Symbol   Definition
**       exponentiation
*        multiplication
/        division
+        addition
-        subtraction

Here's an example of arithmetic operators used in a WHERE statement:

where Salary+Bonus<=10000;
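
A complete step that uses an arithmetic operator might look like the sketch below. It assumes that Salary in orion.sales is an annual amount and selects employees whose monthly equivalent is under 2,500. The threshold is arbitrary and chosen only for illustration.

proc print data=orion.sales;
var Last_Name First_Name Salary;
/* Divide the annual salary by 12 before comparing */
where Salary/12<2500;
run;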

Activity
Copy and paste the following code, which contains two WHERE statements, into the editor. Submit the code.

Reminder: Make sure you've defined the orion library.

proc print data=orion.sales;
where Country='AU';
where Salary<30000;
run;

Which of the following is true?

a. The program executes, applying both WHERE conditions successfully.
b. The program fails and SAS writes an error message to the log.
c. The program executes, but only the first WHERE condition is applied.
d. The program executes, but only the second WHERE condition is applied.

The correct answer is d. The following log message indicates that SAS replaces the first WHERE condition with the
second WHERE condition:

NOTE: WHERE clause has been replaced.

Logical Operators
You can use logical operators in a WHERE statement to combine or modify expressions.

WHERE where-expression-1 AND | OR where-expression-n;


For example, suppose you want to further restrict your report to employees who earn less than $25,500 and who are also
from Australia. You know that when a step contains two WHERE statements, SAS applies only the last one. So to create
the report you need, you combine the conditions with a logical operator. You can use either the symbol or the
mnemonic in your code.

Symbol(s)   Mnemonic   Definition
&           AND        logical and
|           OR         logical or
^ ¬ ~       NOT        logical not

Here, you can see the logical operators that are available and several examples of WHERE statements that use these
operators. The logical operator AND finds observations that satisfy both conditions. The logical operator OR finds
observations that satisfy one or both conditions. And the logical operator NOT modifies a condition by finding the
complement to the specified criteria.

where Country ne 'AU' and Salary>=50000;
where Gender eq 'M' or Salary ge 50000;
where Country='AU' | Country='US';
where Country in ('AU' 'US');
where Country not in ('AU', 'US');
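
When a WHERE expression mixes AND and OR, SAS evaluates AND before OR, so parentheses are a simple way to state the grouping you intend. The sketch below is illustrative only and assumes the orion.sales variables used elsewhere in this lesson.

proc print data=orion.sales;
var Last_Name First_Name Country Gender Salary;
/* Without the parentheses, SAS would read this as
   (Country='AU' and Gender='M') or Salary>=50000 */
where Country='AU' and (Gender='M' or Salary>=50000);
run;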

Selecting Observations
In this demonstration, you select specific observations to display in a report.

1. Copy and paste the following code into the editor.

proc print data=orion.sales;
var Last_Name First_Name Salary;
run;

2. Add a WHERE statement to select only the employees who are from the country Australia and who have a Salary
value that is less than $25,500. Notice that you use the logical operator AND to combine the two expressions.

proc print data=orion.sales;
var Last_Name First_Name Salary;
where Country='AU' and Salary<25500;
run;

3. Add the variable Country to the VAR statement so that it's included in the report.

proc print data=orion.sales;
var Last_Name First_Name Salary Country;
where Country='AU' and Salary<25500;
run;

4. Submit the code and check the log. The log shows that SAS processed two observations from orion.sales due to
the WHERE statement. You might recall that the previous report contained 165 observations.

5. View the report. Only the employees whose Salary and Country values met the WHERE expression conditions
are displayed. But take a look at the Obs column. SAS displays the original observation numbers.

6. To suppress the Obs column, modify the PROC PRINT statement by adding the NOOBS option.

proc print data=orion.sales noobs;
var Last_Name First_Name Salary Country;
where Country='AU' and Salary<25500;
run;

7. Submit the code and verify that the report does not have the Obs column.

Question
Which WHERE statement correctly subsets on the numeric values for May, June, or July and missing character names?

a. where Months in (5-7) and Names=.;
b. where Months in (5,6,7) and Names=' ';
c. where Months in ('5','6','7') and Names='.';

The correct answer is b. You specify the value list in the IN operator in parentheses, and separate the values by either
commas or blanks. Only character values must be enclosed in quotation marks, and a blank represents a missing
character value.

Business Scenario
You need to create a report that lists only the Australian sales representatives with Rep anywhere in their job titles, so
you need to subset by Country and Job_Title. Think about it. You can add another condition using the AND operator, but
how do you specify this condition? Let's find out.

The CONTAINS Operator


You can use the CONTAINS special WHERE operator with the AND operator. The CONTAINS operator selects
observations that include the specified substring.

Symbol(s) Mnemonic Definition

? CONTAINS includes a substring

The position of the substring within the variable's value does not matter. For example, the WHERE statements shown
here will select observations in which the value for Job_Title is Sales Rep I, Sales Rep II, and so on.

where Country='AU' and
Job_Title contains 'Rep';

where Country='AU' and
Job_Title ? 'Rep';

The CONTAINS operator is case sensitive, so you must specify the substring in the exact case that you want it to match.
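
If you are unsure how the values are capitalized, one option is to convert the variable to uppercase inside the WHERE expression and compare it against an uppercase substring. The sketch below uses the UPCASE function, which this lesson has not introduced. Note that it would also match values such as REPORT or Representative.

proc print data=orion.sales noobs;
var Last_Name First_Name Country Job_Title;
/* UPCASE converts the value for the comparison only;
   the stored values of Job_Title are not changed */
where Country='AU' and
upcase(Job_Title) contains 'REP';
run;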

Using the CONTAINS Operator
In this demonstration, you use the CONTAINS operator to select observations for your report.

1. Copy and paste the following program into the editor. The WHERE statement uses the special operator
CONTAINS to specify the conditions for the report.

proc print data=orion.sales noobs;
var Last_Name First_Name Country
Job_Title;
where Country='AU' and
Job_Title contains 'Rep';
run;

2. Submit the code and view the log. The log shows that SAS read 61 observations from orion.sales.

3. View the report. You can see that all of the values for Country are AU and all of the Job_Title values contain the
Rep character string.

Special WHERE Operators


In addition to the general categories of operators that you've seen, there are several other special operators that can
only be used in a WHERE statement. Let's look more closely at these special WHERE operators.

Mnemonic         Definition
BETWEEN-AND      an inclusive range
WHERE SAME AND   augment a where expression
IS NULL          a missing value
IS MISSING       a missing value
LIKE             matches a pattern

The BETWEEN-AND operator selects observations in which the value of a variable falls within an inclusive range of
values. For example, the first WHERE statement in the program below selects observations in which the value of Salary
is between 50000 and 100000. You could write an equivalent statement without using the BETWEEN-AND operator, as
shown in the second WHERE statement. You could use the BETWEEN-AND operator along with the NOT operator to
select values outside of the specified range, as shown in the last WHERE statement.

where Salary between 50000 and 100000;
where 50000<=Salary<=100000;
where Salary not between 50000 and 100000;

Use the WHERE SAME AND operator to add more conditions to an existing WHERE expression later in the program
without retyping the original conditions. The WHERE SAME AND condition augments the original condition. In the
program below, this means that SAS selects observations where Country='AU' and Salary<25500 and Gender='F'.

proc print data=orion.sales;
var First_Name Last_Name Gender Salary Country;
where Country='AU' and Salary<25500;
where same and Gender='F';
run;

The IS NULL and IS MISSING operators select observations in which the value of a variable is missing. These operators
can be used for both character and numeric variables. For example, the WHERE statements shown below both select
observations in which the value of Employee_ID is missing.

where Employee_ID is null;
where Employee_ID is missing;

Here's a question: can you think of another operator you could use to select observations with a missing value? You
could use the equals operator, as long as you knew whether the variable was numeric or character.

where Salary=.;
where Last_Name=' ';

The NOT logical operator can be added to select observations with nonmissing values.

where Employee_ID is not null;
where Employee_ID is not missing;

The LIKE operator selects observations by comparing character values to specified patterns. There are two special
characters available for specifying a pattern: the percent sign specifies that any number of characters, including zero
characters, can occupy that position. The underscore specifies that exactly one character must occupy that position. You
can specify consecutive underscores. You can also specify a percent sign and an underscore in the same pattern.

Symbol   Replaces
%        any number of characters
_        one character

For example, in the following program the first WHERE statement selects observations in which the value of Name ends
in an uppercase N, which is preceded by any number of characters. The second WHERE statement selects observations
in which the value of Name begins with an uppercase T, followed by a single character, followed by a lowercase m,
followed by any number of characters. This statement can match both the name Tom and the name Tommy.

where Name like '%N';
where Name like 'T_m%';
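
As a sketch of how these patterns fit into a complete step, the program below selects employees whose last name starts with an uppercase S and whose first name has a lowercase a as its second character. The specific letters are arbitrary and are not taken from the course data.

proc print data=orion.sales noobs;
var Last_Name First_Name;
/* 'S%'  : uppercase S followed by any number of characters */
/* '_a%' : any one character, then lowercase a, then any number of characters */
where Last_Name like 'S%' and First_Name like '_a%';
run;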

Question
The values for the variable Name in the table below are in the form last name, first name. Which WHERE statement will
return all the observations that have a first name starting with the letter M for the given values?

Name
Elvish, Irenie
Ngan, Christina
Hotstone, Kimiko
Daymond, Lucian
Hofmeister, Fong
Denny, Satyakam
Clarkson, Sharryn
Kletschkus, Monica

a. where Name like '_, M_';
b. where Name like '%, M%';
c. where Name like '_, M%';
d. where Name like '%, M_';

The correct answer is b. By using a percent sign, the pattern specifies last names that contain any number of characters.
The last name must be followed by a comma, a space, and an uppercase M to start the first name. This can be followed
by any number of characters. If you use an underscore, exactly one character must occupy that position.

Business Scenario
The sales manager wants a report that includes only customers who are 21 years old. He also wants the Customer_ID
variable to print at the beginning of each row.

Using the ID Statement


You know that you can use a WHERE statement to subset the data correctly. To specify the variable to print at the
beginning of the row instead of an observation number, you can use the ID statement.

ID variable(s);

The variable you specify replaces the Obs column. We'll specify Customer_ID.

proc print data=orion.customer_dim;
where Customer_Age=21;
id Customer_ID;
run;

Subsetting Observations and Replacing the Obs Column

In this demonstration, you subset observations and replace the Obs column.

1. Copy and paste the following program into the editor.

proc print data=orion.customer_dim;
run;

2. Submit the code and view the report. In the report, you can see that the data set orion.customer_dim contains
quite a few variables, and some of these values are quite lengthy.

3. In the editor, add a WHERE statement to select the observations for customers who are 21 years old. Also add a
VAR statement to display only the variables Customer_ID, Customer_Name, Customer_Gender,
Customer_Country, Customer_Group, Customer_Age_Group, and Customer_Type.

proc print data=orion.customer_dim;
where Customer_Age=21;
var Customer_ID Customer_Name
Customer_Gender Customer_Country
Customer_Group Customer_Age_Group
Customer_Type;
run;

4. Submit the code and check the log. Verify that SAS read 6 observations.

5. View the report and verify that the Customer_Age_Group has the value 15-30 years in all observations.

Consider this: If the observations were to wrap to another line, the observation numbers would make it easy to
find the continuation. But a better technique is to choose a variable that uniquely identifies observations.

6. In the editor, add the ID statement to replace the Obs column and identify observations based on the
Customer_ID variable. Remove Customer_ID from the VAR statement because you don't want the variable to
appear twice.

proc print data=orion.customer_dim;
where Customer_Age=21;
id Customer_ID;
var Customer_Name
Customer_Gender Customer_Country
Customer_Group Customer_Age_Group
Customer_Type;
run;

7. Submit the code and check the log. The log still shows that SAS read 6 observations, and in the report, the
Customer_ID variable identifies the observations rather than the observation numbers.

Business Scenario
The payroll manager wants a report that displays the observations from orion.sales in ascending order by the variable
Salary. As you've probably noticed, PROC PRINT displays observations in the order in which they appear in your data set.
But you can write a SAS program to sort the observations in the data set and then use PROC PRINT to display the sorted
data set.

Using the SORT Procedure


To sort the observations in a data set, you use PROC SORT. Using PROC SORT, you can sort on one variable or multiple
variables, sort on character or numeric variables, and sort in ascending or descending order. Here's how PROC SORT
works. First, SAS rearranges the observations in the input data set. Then, SAS creates a data set that contains the
rearranged observations either by replacing the original data set or by creating a new data set. By default, SAS replaces
the original SAS data set unless you use the OUT= option to specify an output data set. PROC SORT does not generate
printed output.

PROC SORT

A basic PROC SORT step has three statements: a PROC SORT statement, a BY statement, and a RUN statement.

PROC SORT DATA=input-SAS-data-set
<OUT=output-SAS-data-set>;
BY <DESCENDING> by-variable(s);
RUN;

proc sort data=orion.sales
out=work.sales_sort;
by Salary;
run;

As with a PROC PRINT step, you use the DATA= option to specify the input data set. We need to sort orion.sales, so it's
our input data set. Next, we use the OUT= option in the PROC SORT step because we don't want to permanently sort
orion.sales. We want to sort orion.sales and create a new data set, work.sales_sort. Note that in the syntax box, the
general form of the OUT= option is enclosed in angle brackets (<>). These brackets indicate this element is optional.

Every PROC SORT step must include a BY statement. The BY statement specifies one or more variables in the input data
set whose values are used to sort the data. These are called BY variables. The BY statement also indicates whether you
want to sort in ascending or descending order. By default, SAS sorts in ascending order and you don't have to specify
anything additional in the BY statement. In the example above, PROC SORT will sort the observations by the values of
Salary in ascending order.

Sorting a Data Set


In this demonstration, you sort a SAS data set.

1. Copy and paste the following program into the editor. SAS will sort the observations by Salary from lowest to highest,
that is, in ascending order.

proc sort data=orion.sales
out=work.sales_sort;
by Salary;
run;

2. Submit the code and check the log. You can see that SAS read 165 observations from orion.sales and created
work.sales_sort successfully.

You haven't created the report that the payroll manager requested yet. You've only sorted the data. To create
the report, you need to write a PROC PRINT step.

3. Copy and paste the following code into the editor.

proc print data=work.sales_sort;
run;

4. Submit the code and view the report. The report shows that the data set is sorted by the value of the variable
Salary. You can see that the values are in ascending order, with the highest salary in the last observation.

Question
Which step sorts the observations in a SAS data set and overwrites the same data set?

a.
proc sort data=work.empsau
out=work.sorted;
by First;
run;

b.
proc sort data=orion.empsau
out=empsau;
by First;
run;

c.
proc sort data=work.empsau;
by First;
run;

The correct answer is c. PROC SORT replaces the original data set unless you specify an output data set in the OUT=
option.

Code Challenge
Add a statement to specify that the values in work.tests be sorted by the variable TimeMin.

proc sort data=clinic.stress
out=work.tests;
;
run;

by TimeMin;

The BY statement specifies the variable or variables whose values are used to sort observations in the data set.

Business Scenario
Now that you're familiar with sorting, the payroll manager at Orion Star has asked you to create another report that
displays sales employees grouped by Country and displays them in descending Salary order within Country. So within
each country, you want the observations arranged by Salary, from highest to lowest.

Specifying Multiple BY Variables


For this scenario, you'll first sort the data set orion.sales to group the observations. In a PROC SORT step, remember
that you can list multiple variables in the BY statement, separated by spaces. In the following step, SAS first arranges the
data set by the values of the first BY variable, Country, in ascending order. SAS then arranges any observations that have
the same value of the first BY variable by the values of the second BY variable, Salary.

proc sort data=orion.sales
out=work.sales2;
by Country Salary;
run;

To sort on a variable in descending order, you must specify the DESCENDING keyword immediately before each variable
that you want in descending order. SAS will sort the observations from the largest value to the smallest value.

proc sort data=orion.sales
out=work.sales2;
by Country descending Salary;
run;

Question
Which BY statement in a PROC SORT step can produce the output shown here?

Obs  Postal_Code  Employee_ID
1    92173        120807
2    92131        120661
3    92129        121074
4    92128        121128
5    92128        120755
6    92128        120730
7    92126        121049
8    92124        121029
9    92124        121021
10   92122        120744

a. by Postal_Code Employee_ID;
b. by descending Postal_Code Employee_ID;
c. by Postal_Code descending Employee_ID;
d. by descending Postal_Code descending Employee_ID;

The correct answer is d. In the output, the observations are sorted in descending order for Postal_Code and, within each
postal code, in descending order for Employee_ID. The BY statement must specify the keyword DESCENDING before
each variable.

Sorting a Data Set by Multiple Variables


In this demonstration, you sort a data set by multiple variables.

1. Copy and paste the following code into the editor. Notice that you're creating the output data set work.sales2.

proc sort data=orion.sales
out=work.sales2;
by Country descending Salary;
run;

proc print data=work.sales2;
run;

2. Submit the code and check the log. The log shows that the code ran successfully. SAS read 165 observations from
the data set orion.sales.

3. View the report. The first country listed is Australia, or AU. Can you tell whether the Salary variable has been
sorted in descending order? Yes. Scroll down the report to see that the Salary values decrease. And starting with
observation 64, you see the employees from the US, and the highest Salary value is listed first.

The payroll manager wanted the report grouped by Country. This report displays the AU employees first
because you sorted Country in ascending order, but it doesn't actually group the observations by Country.

Specifying Report Groupings


You use a BY statement in PROC PRINT to display the sorted observations grouped by Country.

proc print data=work.sales2 noobs;


by Country;
run;

This BY statement specifies the variable to use to form BY groups. The variables in the BY statement are called BY
variables. Think about this: when you specify variables in the BY statement, what do you need to verify in the input data
set? The input data set must be sorted on the variables specified in the BY statement. The variables must also be sorted
in the order specified, either ascending or descending.

Grouping Observations in Reports


In this demonstration, you create a report that displays observations grouped by the variable Country.

1. Copy and paste the following code into the editor.

proc sort data=orion.sales


out=work.sales2;
by Country descending Salary;
run;

proc print data=work.sales2;


run;

2. In the PROC PRINT step, add a BY statement that groups the data by Country.

proc print data=work.sales2;


by Country;
run;

3. Submit the code and examine the report. You can see that SAS grouped the report by Country. The first table in
the report is for employees who are from Australia, and the second table is for employees who are from the US.
Notice that the Salary values are still listed in descending order.

Activity
Copy and paste the following program into the editor and submit it. View the log.

Reminder: Make sure you've defined the orion library.

proc sort data=orion.sales


out=work.sorted;
by Country Gender;
run;

proc print data=work.sorted;


by Gender;
run;

Does the program execute successfully?

a. yes
b. no

The correct answer is b. The log shows an error message indicating that the program failed. The input data set is sorted
by Country and then by Gender within each country, so it is not in order by Gender alone. Only a portion of the output
is produced.
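One way to make the activity program run without the error is sketched below. It assumes that you want the report
grouped by country and then by gender, so the BY statement in PROC PRINT lists the variables in the same order as the
BY statement in PROC SORT. Using by Country; alone in the PROC PRINT step would also match the sort order.

```sas
/* Sort by Country, then by Gender within each country. */
proc sort data=orion.sales
          out=work.sorted;
   by Country Gender;
run;

/* The BY variables appear in the same order as in the sort,
   so PROC PRINT can form the BY groups. */
proc print data=work.sorted;
   by Country Gender;
run;
```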

Question
In which PROC step would you add the following WHERE statement to result in the most efficient processing?

where Salary<25500;

proc sort data=orion.sales


out=work.sales3;
by Country descending Salary;
run;

proc print data=work.sales3 noobs;


by Country;
sum Salary;
var First_Name Last_Name Gender Salary;
run;

a. Add the WHERE statement to the PROC SORT step.


b. Add the WHERE statement to the PROC PRINT step.
c. Add the WHERE statement to both PROC steps.
d. It doesn't matter which PROC step contains the WHERE statement; either placement is equally
efficient.

The correct answer is a. Subsetting in the PROC SORT step is more efficient. It selects and sorts only the required
observations.
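As a sketch of the answer, here is the same program with the WHERE statement placed in the PROC SORT step. Because
work.sales3 then contains only the selected observations, the PROC PRINT step needs no WHERE statement of its own.

```sas
/* Subset before sorting so that only the qualifying observations are sorted. */
proc sort data=orion.sales
          out=work.sales3;
   where Salary<25500;
   by Country descending Salary;
run;

/* work.sales3 already holds only the observations with Salary below 25500. */
proc print data=work.sales3 noobs;
   by Country;
   sum Salary;
   var First_Name Last_Name Gender Salary;
run;
```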

Business Scenario
Suppose you need to share one of your reports at an upcoming staff meeting. You know that the information in the
report is accurate, but your report could use some improvements in the way it looks. You can enhance the report by
adding titles and footnotes.

Assigning Titles and Footnotes


You'd like to add a title that's specific to your report, such as Orion Star Sales Staff Salary Report. At the bottom of the
report, you want to display the footnote Confidential. Titles appear at the top and footnotes appear at the bottom of the
output from each procedure, no matter how long the output is. If you don't specify a title, the default title is The SAS
System.

To add a title to your report, you use the TITLE statement, and to add a footnote, you use the FOOTNOTE statement.
Aside from the keyword at the beginning, these two statements have the same syntax.
TITLEn 'text';
FOOTNOTEn 'text';

Following the keyword, you specify a value for n: a number from 1 to 10 that indicates the line on which the title or
footnote appears. In the title or footnote area, line 1 is the first line and line 10 is the last. If you don't specify a number,
SAS assumes that you're referring to line 1.

Then, within quotation marks, you specify the text that you want to appear in the title or footnote. You can use either a
set of single quotation marks or a set of double quotation marks to enclose the text string.

In the following example, the TITLE statement specifies a title for line 1, Orion Star Sales Staff. You can specify only one
title or footnote in a single statement.

title1 'Orion Star Sales Staff';


footnote1 'Confidential';

proc print data=orion.sales;


var Employee_ID Last_Name Salary;
run;

You'd like the title Salary Report to appear on the second title line, so you need to add another TITLE statement, TITLE2.
The FOOTNOTE statement specifies Confidential.

title1 'Orion Star Sales Staff';


title2 'Salary Report';
footnote1 'Confidential';

proc print data=orion.sales;


var Employee_ID Last_Name Salary;
run;

TITLE and FOOTNOTE statements are global statements, so they can stand alone. Also, any titles or footnotes that you
assign remain in effect until you change them, cancel them, or end your SAS session.

Displaying Titles and Footnotes in a Report


In this demonstration, you display titles and footnotes in a report.

1. Copy and paste the following code into the editor to add titles and a footnote to your report.

title1 'Orion Star Sales Staff';


title2 'Salary Report';
footnote1 'Confidential';

proc print data=orion.sales;


var Employee_ID Last_Name Salary;
run;

2. Submit the code and examine the report. You can see that the titles Orion Star Sales Staff and Salary Report
have been added to the top, and the footnote Confidential to the bottom of the report.

Suppose you need to run another PROC PRINT step.

3. Copy and paste the following code into the editor, and then submit it.
title1 'Orion Star Sales Staff';
title2 'Salary Report';
footnote1 'Confidential';

proc print data=orion.sales;


var Employee_ID Last_Name Salary;
run;

proc print data=orion.sales;


var Employee_ID First_Name Last_Name Job_Title Hire_Date;
run;

4. Examine the new report. This report shows the same titles and the same footnote as the previous report, but
you don't want these titles. You want different titles to appear that are meaningful for this information.

Remember that titles and footnotes are global statements, and when you assign them, they remain in effect
until you change them, cancel them, or end your SAS session.

Code Challenge
Write a statement to specify the text Wellness Clinic Insurance on the third title line of the output from the PRINT step
below.

;
proc print data=clinic.insure;
var id name company policy;
where id between 1250 and 7590;
run;

title3 'Wellness Clinic Insurance';

You specify the keyword TITLE followed by the number of the line where the title is to appear, and then the text in
quotation marks.

Changing Titles and Footnotes


To change a previously defined title, you add another TITLE statement that has the same number as the one you want to
replace, but with different text. In the following example, we start by defining three lines of titles.

title1 'Orion Star Sales Staff';


title2 'Salary Report';
title3 'Human Resources';

When we add another TITLE1 statement, the new text replaces the current text for TITLE1.

title1 'Orion Star Sales Staff';


title2 'Salary Report';
title3 'Human Resources';

title1 'Salary Report';

Redefining a title also cancels all higher-numbered titles, so this TITLE1 statement changes title 1 and cancels titles 2 and
3. You change footnotes the same way.

To cancel all previously defined titles and footnotes, you can specify null TITLE and FOOTNOTE statements, which have
no numbers and no text.
title1 'Orion Star Sales Staff';
title2 'Salary Report';
title3 'Human Resources';

title1 'Salary Report';

title;
footnote;

It's a good practice to cancel all titles and footnotes at the end of your program so that no unexpected titles and
footnotes appear on output that you generate later in your SAS session.

Question
Which footnote or footnotes appear in the second procedure results?

footnote1 'Orion Star';


footnote2 'Sales Employees';
footnote3 'Confidential';
proc print data=orion.sales;
run;

footnote2 'Non Sales Employees';


proc print data=orion.nonsales;
run;

a. Non Sales Employees

b. Orion Star
Non Sales Employees

c. Non Sales Employees


Confidential

d. Orion Star
Non Sales Employees
Confidential

The correct answer is b. When you run the second PROC PRINT step, the FOOTNOTE2 statement replaces the previous
footnote with the same number: Non Sales Employees replaces Sales Employees. It also cancels all footnotes with higher
numbers, so FOOTNOTE3, Confidential, does not appear in the results. The resulting footnotes are Orion Star and Non
Sales Employees.
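If you wanted the footnote Confidential to appear in the second report as well, one approach (a sketch, not part of the
original question) is to restate FOOTNOTE3 after you change FOOTNOTE2.

```sas
footnote1 'Orion Star';
footnote2 'Sales Employees';
footnote3 'Confidential';
proc print data=orion.sales;
run;

/* This FOOTNOTE2 statement replaces line 2 and cancels line 3. */
footnote2 'Non Sales Employees';
/* Restating FOOTNOTE3 puts the third footnote line back. */
footnote3 'Confidential';
proc print data=orion.nonsales;
run;

/* Cancel all footnotes when you are finished. */
footnote;
```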

Changing and Canceling Titles and Footnotes


In this demonstration, you change the title for the second report and then cancel all titles and footnotes.

1. Copy and paste the following code into the editor to add a new title to the second report. This code adds a TITLE
statement before the second PROC PRINT step. You want this title to replace the previous TITLE1 statement, so
you number it TITLE1.

title1 'Orion Star Sales Staff';


title2 'Salary Report';
footnote1 'Confidential';

proc print data=orion.sales;


var Employee_ID Last_Name Salary;
run;

title1 'Employee Information';


proc print data=orion.sales;
var Employee_ID First_Name Last_Name Job_Title Hire_Date;
run;

2. Submit the code and view the report. As you can see, the first report has the titles Orion Star Sales Staff and
Salary Report. The second report now has its own title.

3. As part of good programming practice, add the null TITLE and FOOTNOTE statements to the bottom of the
program to cancel all titles and footnotes for any output that you might generate later.

title;
footnote;

4. Submit these statements, and now your session is ready for the next task.

Business Scenario
So far, you haven't made any changes to the appearance of variable names in the body of your reports. In the reports
that you've created, the variable names appear exactly as they are stored in the input data set. For your upcoming
meeting, suppose you want your reports to display more descriptive text instead of the variable names. For example,
you want to change the appearance of the variables Employee_ID, Last_Name, and Salary in your report.

Assigning Temporary Labels by Using the LABEL Statement


To display temporary labels in your report instead of variable names, you can use the LABEL statement in your PROC
PRINT step. You use the keyword LABEL, followed by the variable name, an equal sign, and a descriptive label in
quotation marks.

LABEL variable='label'
variable='label'...;

In this example, you want to display the variable Employee_ID as Sales ID, the variable Last_Name as Last Name with no
underscore, and the variable Salary as Annual Salary.

title1 'Orion Star Sales Staff';


title2 'Salary Report';
footnote1 'Confidential';

proc print data=orion.sales;


var Employee_ID Last_Name Salary;
label Employee_ID='Sales ID'
Last_Name='Last Name'
Salary='Annual Salary';
run;
title;
footnote;

A label can be up to 256 characters long. You can specify labels for multiple variables in one LABEL statement, or you can
use a separate LABEL statement for each variable.

title1 'Orion Star Sales Staff';


title2 'Salary Report';
footnote1 'Confidential';
proc print data=orion.sales;
var Employee_ID Last_Name Salary;
label Employee_ID='Sales ID';
label Last_Name='Last Name';
label Salary='Annual Salary';
run;
title;
footnote;

Most SAS procedures display labels automatically, but PROC PRINT does not. You have to add the LABEL option to your
PROC PRINT statement to tell SAS to display the labels.

title1 'Orion Star Sales Staff';


title2 'Salary Report';
footnote1 'Confidential';

proc print data=orion.sales label;


var Employee_ID Last_Name Salary;
label Employee_ID='Sales ID';
label Last_Name='Last Name';
label Salary='Annual Salary';
run;
title;
footnote;

Displaying Labels in a Report


In this demonstration, you create temporary labels and display the labels rather than the variable names in a report.

1. Copy and paste the following step into the editor.

proc print data=orion.sales;


var Employee_ID Last_Name Salary;
run;

2. Submit the step and view the variable names as they currently appear in the data set orion.sales.

3. In the editor, copy and paste the following code to replace the existing code. You can see the labels for the three
variables. Remember that in order to print the labels, you must include the LABEL option in the PROC PRINT
step.

title1 'Orion Star Sales Staff';


title2 'Salary Report';
footnote1 'Confidential';

proc print data=orion.sales label;


var Employee_ID Last_Name Salary;
label Employee_ID='Sales ID'
Last_Name='Last Name'
Salary='Annual Salary';
run;
title;
footnote;

4. Submit this code and then check the log. Notice that SAS doesn't print any messages regarding whether labels
were applied. But the code ran without errors.

5. View the report. As you can see, the temporary labels now appear instead of the variable names.

Using the SPLIT= Option
The SPLIT= option in PROC PRINT specifies a split character to control line breaks in column headings. This option is most
useful if you are using text or listing output rather than HTML output. Here's the syntax.

SPLIT='split-character';

After SPLIT=, you specify a split-character, which will indicate where to wrap the label in the report. In the following
example, the split character is an asterisk.

proc print data=orion.sales split='*';
   var Employee_ID Last_Name Salary;
   label Employee_ID='Sales ID'
         Last_Name='Last Name'
         Salary='Annual Salary';
run;

Next, in the LABEL statement, you add the split character to the label text at the place where you want the label to wrap
to the next line. Here, an asterisk appears between the two words of the label Annual Salary. In the output, these two
words will appear on different lines.

proc print data=orion.sales split='*';
   var Employee_ID Last_Name Salary;
   label Employee_ID='Sales ID'
         Last_Name='Last Name'
         Salary='Annual*Salary';
run;

Do you think the LABEL option is necessary in the PROC PRINT statement now? Actually, SAS knows to print the labels
because you're using the SPLIT= option. The PRINT procedure uses labels only when the LABEL or SPLIT= option is
specified.
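The following sketch shows a complete step that relies on SPLIT= alone. Placing a split character in all three labels is
an illustrative choice, not something the lesson requires.

```sas
/* SPLIT= names the split character and also tells PROC PRINT to use labels,
   so the LABEL option is omitted here. */
proc print data=orion.sales split='*';
   var Employee_ID Last_Name Salary;
   label Employee_ID='Sales*ID'
         Last_Name='Last*Name'
         Salary='Annual*Salary';
run;
```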

Summary of Lesson 4: Producing Detail Reports

This summary contains topic summaries, syntax, and sample programs.

Topic Summaries

Subsetting Report Data


You can use the VAR statement in a PROC PRINT step to subset the variables in a report. You specify the variables to
include and list them in the order in which they are to be displayed.

You can use the SUM statement in a PROC PRINT step to calculate and display report totals for the requested numeric
variables.

PROC PRINT DATA=SAS-data-set;


VAR variable(s);
SUM variable(s);
RUN;

The WHERE statement in a PROC PRINT step subsets the observations in a report. When you use a WHERE statement,
the output contains only the observations that meet the conditions specified in the WHERE expression. This expression
is a sequence of operands and operators that form a set of instructions that define the condition. The operands can be
constants or variables. Remember that variable operands must be defined in the input data set. Operators include
comparison, arithmetic, logical, and special WHERE operators.

WHERE where-expression;

You can use the ID statement in a PROC PRINT step to specify a variable to print at the beginning of the row instead of
an observation number. The variable that you specify replaces the Obs column.

ID variable(s);

Sorting and Grouping Report Data


The SORT procedure sorts the observations in a data set. You can sort on one variable or multiple variables, sort on
character or numeric variables, and sort in ascending or descending order. By default, SAS replaces the original SAS data
set unless you use the OUT= option to specify an output data set. PROC SORT does not generate printed output.

Every PROC SORT step must include a BY statement to specify one or more BY variables. These are variables in the input
data set whose values are used to sort the data. By default, SAS sorts in ascending order, but you can use the keyword
DESCENDING to specify that the values of a variable are to be sorted in descending order. When your SORT step has
multiple BY variables, some variables can be in ascending and others in descending order.

You can also use a BY statement in PROC PRINT to display observations grouped by a particular variable or variables. The
groups are referred to as BY groups. Remember that the input data set must be sorted on the variables specified in the
BY statement.

PROC SORT DATA=input-SAS-data-set
          <OUT=output-SAS-data-set>;
   BY <DESCENDING> by-variable(s);
RUN;

Enhancing Reports
You can enhance a report by adding titles, footnotes, and column labels. Use the global TITLE statement to define up to
10 lines of titles to be displayed at the top of the output from each procedure. Use the global FOOTNOTE statement to
define up to 10 lines of footnotes to be displayed at the bottom of the output from each procedure.

TITLEn 'text';
FOOTNOTEn 'text';

Titles and footnotes remain in effect until you change or cancel them, or until you end your SAS session. Use a null TITLE
statement to cancel all titles, and a null FOOTNOTE statement to cancel all footnotes.

Use the LABEL statement in a PROC PRINT step to define temporary labels to display in the report instead of variable
names. Labels can be up to 256 characters in length. Most procedures use labels automatically, but PROC PRINT does
not. Use the LABEL option in the PROC PRINT statement to tell SAS to display the labels. Alternatively, the SPLIT= option
tells PROC PRINT to use the labels and also specifies a split character to control line breaks in column headings.
PROC PRINT DATA=SAS-data-set LABEL;
LABEL variable='label'
variable='label'
... ;
RUN;

SPLIT='split-character';

Sample Programs

Subsetting Your Report

proc print data=orion.sales;


var Last_Name First_Name Salary;
sum Salary;
run;

Selecting Observations

proc print data=orion.sales noobs;


var Last_Name First_Name Salary Country;
where Country='AU' and Salary<25500;
run;

Using the CONTAINS Operator

proc print data=orion.sales noobs;


var Last_Name First_Name Country Job_Title;
where Country='AU' and Job_Title contains 'Rep';
run;

Subsetting Observations and Replacing the Obs Column

proc print data=orion.customer_dim;


where Customer_Age=21;
id Customer_ID;
var Customer_Name
Customer_Gender Customer_Country
Customer_Group Customer_Age_Group
Customer_Type;
run;

Sorting a Data Set

proc sort data=orion.sales


out=work.sales_sort;
by Salary;
run;

proc print data=work.sales_sort;


run;

Sorting a Data Set by Multiple Variables

proc sort data=orion.sales


out=work.sales2;
by Country descending Salary;
run;

proc print data=work.sales2;


run;

Grouping Observations in Reports

proc sort data=orion.sales


out=work.sales2;
by Country descending Salary;
run;

proc print data=work.sales2;


by Country;
run;

Displaying Titles and Footnotes in a Report

title1 'Orion Star Sales Staff';


title2 'Salary Report';
footnote1 'Confidential';

proc print data=orion.sales;


var Employee_ID Last_Name Salary;
run;

proc print data=orion.sales;


var Employee_ID First_Name Last_Name Job_Title Hire_Date;
run;

Changing and Canceling Titles and Footnotes

title1 'Orion Star Sales Staff';


title2 'Salary Report';
footnote1 'Confidential';

proc print data=orion.sales;


var Employee_ID Last_Name Salary;
run;

title1 'Employee Information';


proc print data=orion.sales;
var Employee_ID First_Name Last_Name Job_Title Hire_Date;
run;

Displaying Labels in a Report

title1 'Orion Star Sales Staff';


title2 'Salary Report';
footnote1 'Confidential';

proc print data=orion.sales label;


var Employee_ID Last_Name Salary;
label Employee_ID = 'Sales ID'
Last_Name = 'Last Name'
Salary = 'Annual Salary';
run;
title;
footnote;

Lesson 5: Formatting Data Values


Lesson Overview
In your SAS reports, formats control the way data values are displayed. You might want to make some data values more
understandable or descriptive. In this lesson, you learn to enhance the way that variable values are displayed and
formatted in your reports by associating existing SAS formats with variables. You also learn how to create and apply your
own custom formats.

Objectives

In this lesson, you learn to do the following:

• describe SAS formats
• apply SAS formats with the FORMAT statement
• create user-defined formats using the FORMAT procedure
• apply user-defined formats using the FORMAT statement
• use formats to recode data values
• use formats to collapse or aggregate data

Using SAS Formats


Suppose you want to enhance the appearance of variable values in your reports. For example, by default, your reports
display values as they are stored in the input data set. But those values are not always formatted in a way that's easy to
understand. This PROC PRINT report shows hire dates for employees, but as unformatted numeric values. Who knows
what these mean?

Last_Name    First_Name   Country   Job_Title        Salary   Hire_Date

Zhou         Tom          AU        Sales Manager    108255   12205
Dawes        Wilson       AU        Sales Manager    87975    6575
Elvish       Irenie       AU        Sales Rep. II    26600    6575
Ngan         Christina    AU        Sales Rep. II    27475    8217
Hotstone     Kimiko       AU        Sales Rep. I     26190    10866
Daymond      Lucian       AU        Sales Rep. I     26480    8460
Hofmeister   Fong         AU        Sales Rep. IV    32040    8460


You need to create a report with more easily understood variable values. Displaying these SAS dates as calendar dates
would improve the report. You could also add dollar signs and commas to the variable Salary to improve its appearance.

Using the FORMAT Statement


To control how values appear in your reports, you can specify temporary SAS formats by adding the FORMAT statement
to your PROC PRINT step.

FORMAT variable(s) format;

proc print data=orion.sales noobs label;


where Country='AU' and
Job_Title contains 'Rep';
label Job_Title='Sales Title'
Hire_Date='Date Hired';
format Salary dollar8.;
format Hire_Date mmddyy10.;
var Last_Name First_Name Country Job_Title
Salary Hire_Date;
run;

You use the keyword FORMAT, followed by the variable and the SAS format that you want to apply to the variable. You
can use a separate FORMAT statement for each variable, or you can format several variables using either the same
format or different formats in a single FORMAT statement.

format Salary dollar8. Hire_Date mmddyy10.;
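As a sketch of applying one format to several variables, the step below lists two date variables before a single format.
It assumes that orion.sales also contains a Birth_Date variable, which this lesson has not shown.

```sas
proc print data=orion.sales noobs;
   /* Birth_Date is assumed to exist in orion.sales. */
   var Last_Name Salary Birth_Date Hire_Date;
   /* A format applies to every variable listed before it:
      both date variables receive MMDDYY10., and Salary receives DOLLAR8. */
   format Birth_Date Hire_Date mmddyy10.
          Salary dollar8.;
run;
```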

SAS Formats
What exactly is a format? A format is an instruction that tells SAS how to display data values. For example, you can
display a numeric value with commas and a dollar sign.

There are many existing SAS formats that you can use, and they all use the same form, as shown here.

<$>format<w>.<d>

The dollar sign indicates a character format and precedes the name of the SAS format. Then you specify the total format
width, including decimal places and special characters. The period is required syntax. Finally, you can specify the number
of decimal places in numeric formats.

Let's look at a few general examples. This list shows the general form for several common SAS formats. Take a moment
to read through these definitions.

Format        Definition

$w.           is a standard character format. It is used to write character data in a field
              w positions wide.

w.d           is a standard numeric format. It is used to write numeric data in a field w
              positions wide with d decimal places. The value of w includes the decimal
              point and decimal places.

COMMAw.d      writes numeric values with a comma separating every three digits and a
              period separating the decimal fraction.

DOLLARw.d     writes numeric values with a leading dollar sign, a comma that separates
              every three digits, and a period that separates the decimal fraction.

COMMAXw.d     is a non-US numeric format. It writes numeric values with a period
              separating every three digits and a comma separating the decimal fraction.

EUROXw.d      is like COMMAXw.d, but it adds a leading euro symbol (€).

Examples of SAS Formats


This table shows several specific SAS formats and their effect on stored values.

Format Stored Value Displayed Value

$4. Programming Prog

12. 27134.5864 27135

12.2 27134.5864 27134.59

COMMA12.2 27134.5864 27,134.59

DOLLAR12.2 27134.5864 $27,134.59

COMMAX12.2 27134.5864 27.134,59

EUROX12.2 27134.5864 €27.134,59

Character values are truncated if they do not fit in the specified width. In this first example, you can see that the stored
value, Programming, is displayed as Prog because the assigned format only has a width of 4. If you do not specify a
width that is large enough to accommodate a numeric value, the displayed value is automatically adjusted to fit into the
width.

Let's look at some of the numeric formats. The 12. format, which is the same as 12.0, doesn't specify the number of
decimal places, so none are displayed. SAS rounds the displayed value to the nearest integer. With the 12.2 format, the
value is displayed in a field 12 positions wide with 2 decimal places. The decimal value is rounded to the nearest
hundredth. The COMMA12.2 format inserts a comma to separate every three digits. The DOLLAR12.2 format also inserts a
dollar sign in the displayed value. 12 is the total width of the displayed value, including the dollar sign, commas,
decimal point, and decimal places. The COMMAX12.2 format inserts a period to separate every three digits, and a comma
separates the decimal fraction. Lastly, in the EUROX12.2 format, a euro symbol is inserted in the displayed value.

This table shows additional examples.

Format Stored Value Displayed Value

DOLLAR12.2 27134.5864 $27,134.59

DOLLAR9.2 27134.5864 $27134.59

DOLLAR8.2 27134.5864 27134.59

DOLLAR5.2 27134.5864 27135

DOLLAR4.2 27134.5864 27E3

In the first row of this table, 12 is wide enough to display the value, including the dollar sign, comma, decimal point, and
decimal places. You can see that the displayed value is 10 positions wide. What if you specify a width of 9, as in the
second row? The comma is not displayed. With a width of 8 in the third row, the dollar sign is also dropped. The fourth
row shows that a width of 5 results in the omission of decimal places in the displayed value. And the last row shows that
when a width of 4 is specified, the value is rounded to 27,000 and displayed in E-notation. But remember, the format
only affects the displayed value. The stored value is not affected by a format.
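To try these widths yourself, you could write the value to the log with a short DATA step, as sketched below. The DATA
step, the PUT statement, and the variable name amount are assumptions for illustration and have not been covered in this
lesson. The log lines should correspond to the rows of the table above.

```sas
/* Write one stored value to the log with progressively narrower widths. */
data _null_;
   amount = 27134.5864;
   put 'DOLLAR12.2: ' amount dollar12.2;
   put 'DOLLAR9.2:  ' amount dollar9.2;
   put 'DOLLAR8.2:  ' amount dollar8.2;
   put 'DOLLAR5.2:  ' amount dollar5.2;
   put 'DOLLAR4.2:  ' amount dollar4.2;
   /* The formats change only what is displayed; the stored value is the same. */
   put amount=;
run;
```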

Question
Which format creates the displayed value shown here?
$5,950.35
a. DOLLAR4.2
b. COMMA8.2
c. DOLLAR9.2
d. $12.

The correct answer is c. The DOLLARw.d format writes numeric values with a leading dollar sign, a comma that separates
every three digits, and a period that separates the decimal fraction. The displayed value is nine characters wide, so the
total format width, w, is set to 9. This includes the special characters and decimal places. The displayed value contains
two decimal places, so d is set to 2.

Working with SAS Date Values and SAS Date Formats


SAS date values are a special category of numeric values. SAS stores date values as the number of days between January
1, 1960, and a specific date.
For example, SAS stores January 1, 1960, as 0, and January 2, 1960, as 1, and so on. Notice that dates earlier than
January 1, 1960, have negative SAS date values. When your report displays a SAS date like 12205, it's difficult to know
what date this really is! To make the dates in your report recognizable and meaningful, you must apply a SAS date
format to the SAS date values.
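The sketch below writes a few SAS date values to the log so that you can see the stored numbers. The DATA step, the date
constants such as '01JAN1960'd, and the variable names are assumptions for illustration and are not part of this lesson.

```sas
data _null_;
   /* A date constant in the form 'ddMONyyyy'd is stored as a SAS date value. */
   day_zero  = '01JAN1960'd;
   next_day  = '02JAN1960'd;
   day_prior = '31DEC1959'd;
   /* With no format, PUT shows the stored numbers: 0, 1, and -1. */
   put day_zero= next_day= day_prior=;

   /* 12205 is one of the unformatted Hire_Date values shown earlier. */
   hire = 12205;
   put hire=;
   /* The same value, displayed as a calendar date. */
   put hire mmddyy10.;
run;
```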

This table lists several common SAS date formats and shows how each one affects a stored SAS date value.

Format Stored Value Displayed Value

MMDDYY6. 0 010160

MMDDYY8. 0 01/01/60

MMDDYY10. 0 01/01/1960

DDMMYY6. 365 311260

DDMMYY8. 365 31/12/60

DDMMYY10. 365 31/12/1960

Let's look at the first three formats, the MMDDYY formats. These formats display values as a numeric month, day, and
year. The number at the end of the format is the width of the displayed field. It determines whether a forward slash
separator will be used, and if the year displays as two digits or four digits. A width of 6 does not provide enough room
for a separator and only allows for a two-digit year. A width of 8 allows for a separator and a two-digit year. With a
width of 10, the value displays with a separator and a four-digit year. So the format MMDDYY10. displays a date value
with a width of 10. It includes slashes to separate the month, day, and year values, and displays the year as a four-digit
value.

The DDMMYY formats are similar to the MMDDYY formats except that they display values as a numeric day, month, and
year. Click the Information button in the course to see more SAS date format examples, as well as the tables of formats
you saw earlier.
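As a sketch, the following DATA step writes the stored values 0 and 365 to the log with each of the widths discussed
above. The DATA step and the variable names are assumptions for illustration. The log lines should correspond to the
rows of the date format table.

```sas
data _null_;
   first_day = 0;      /* January 1, 1960 */
   last_day  = 365;    /* December 31, 1960 (1960 is a leap year) */
   /* +2 moves the pointer two columns to separate the displayed values. */
   put first_day mmddyy6. +2 first_day mmddyy8. +2 first_day mmddyy10.;
   put last_day  ddmmyy6. +2 last_day  ddmmyy8. +2 last_day  ddmmyy10.;
run;
```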

Question
Which FORMAT statement formats the variable values as shown below?

Birth_Date Emp_Hire_Date Emp_Term_Date


28/09/1968 01/10/1989 01/31/09

a. format Birth_Date Emp_Hire_Date mmddyy10. Emp_Term_Date ddmmyy10.;
b. format Birth_Date Emp_Hire_Date ddmmyyyy. Emp_Term_Date mmmyyyy.;
c. format Birth_Date Emp_Hire_Date ddmmyy10. Emp_Term_Date mmddyy8.;

The correct answer is c. The variables Birth_Date and Emp_Hire_Date are both displayed as a two-digit day, a two-digit
month, and a four-digit year; the day precedes the month. The displayed values have a length of 10. This is the
DDMMYY10. format.

The variable Emp_Term_Date is displayed as a two-digit month, a two-digit day, and a two-digit year; the month
precedes the day. The displayed value has a length of 8. This is the MMDDYY8. format.

There is no DDMMYYYY or MMMYYYY SAS format.

Applying Temporary Formats


In this demonstration, you apply SAS formats to variables to make them easier to read.

1. Copy and paste the following program into the editor to view the variables in the orion.sales data set before you
apply formats.

proc print data=orion.sales label noobs;


where Country='AU' and
Job_Title contains 'Rep';
label Job_Title='Sales Title'
Hire_Date='Date Hired';
var Last_Name First_Name Country Job_Title
Salary Hire_Date;
run;

2. Submit the program and view the results. As you can clearly see, the Hire_Date values are difficult to decipher.
Remember that these date values are shown as numeric values. Now let's format the variables for your new
report.

3. In the editor, add a FORMAT statement that assigns the MMDDYY10. format to the variable Hire_Date and
assigns the DOLLAR8. format to the variable Salary.

proc print data=orion.sales label noobs;


where Country='AU' and
Job_Title contains 'Rep';
label Job_Title='Sales Title'
Hire_Date='Date Hired';
format Hire_Date mmddyy10. Salary dollar8.;
var Last_Name First_Name Country Job_Title
Salary Hire_Date;
run;

4. Submit this code and look at the results. Notice that Salary now has a leading dollar sign and commas that separate
each group of three digits, and is displayed in a column that is 8 positions wide. Hire_Date is displayed as a
two-digit month, a slash, a two-digit day, a slash, and a four-digit year. This column is 10 positions wide. This
report is much easier to understand now.

Creating and Applying User-Defined Formats


The Sales Department requested a special report in which the full country names appear instead of the country
codes. SAS provides many formats but cannot provide every format that you might need. Fortunately, you can create and
apply your own formats to your reports.

Creating and Applying User-Defined Formats


To create and apply your own formats, you must use two PROC steps. First, you use PROC FORMAT to create the user-
defined format. Then you use a FORMAT statement in the PROC PRINT step to apply the format to a specific variable.
When you create a user-defined format, you don't associate it with a particular variable or data set. Instead, you create
it based on values that you want to display differently.

PROC FORMAT
The general form of the PROC FORMAT step is shown below:

PROC FORMAT;
VALUE format-name value-or-range1='formatted-value1'
value-or-range2='formatted-value2'
...;
RUN;

In a basic PROC FORMAT step, the PROC FORMAT statement consists only of the keywords PROC FORMAT. The VALUE
statement defines the format. First, you specify a format name. Then you specify a value or range of values, and lastly
you specify the formatted value, or how you want the value to be displayed.

These examples illustrate the rules for constructing a format name.

Type        Format-name
character   $CTRYFMT
character   $_ST3FMT_
numeric     ORIONSTAR_SALRANGE2_FMT_
numeric     _SALRANGE

A format name can have a maximum of 32 characters. The name of a format that applies to character values must begin
with a dollar sign ($), followed by a letter or underscore. The name of a format that applies to numeric values must
begin with a letter or underscore. A format name cannot end in a number. All remaining characters can be letters,
underscores, or numbers. A user-defined format name cannot be the name of a SAS format. Also, notice that a format
name does not end with a period in the VALUE statement. Later, when you refer to the format in a FORMAT statement,
you'll specify the period.

Using the VALUE Statement


Now let's see how you specify the way you want the data values to appear in your output. You have already seen the
general form of the PROC FORMAT step. You use the VALUE statement in the PROC FORMAT step to specify one or more
expressions that we'll call value-range sets. Each value-range set has three parts: the value or range, which specifies one
or more values to be formatted, an equal sign, and then the formatted value that you want SAS to display instead of the
stored value or values.
You can specify the value-or-range in several ways: as an individual value, as a range of values, or as a list of
values. In each row of examples shown below, the first column has character values and the second column has
numeric values.

value-or-range (character)   value-or-range (numeric)   formatted-value
'AU'                         1                          'Australia'
'B'-'D'                      0-50000                    'Tier 1'
'U', 'V'                     1,2,3                      'Below 49.9'

When you specify the value-or-range, you must enclose character values in quotation marks. The character values that
you specify must match the case of the variable's values. You do not enclose numeric values in quotation marks.

In a range, a hyphen (-) separates the values that define the endpoints of the range. When you specify a range of
character values, be careful not to enclose the entire range in quotation marks. If you do this, SAS assumes that all of the
characters, including the hyphen, are part of a single character value.

In a list, commas separate the individual values. The formatted value is always a character string, no matter whether the
format applies to character values or numeric values. A character string can consist of any type of character. Usually,
each formatted value is enclosed in quotation marks, as shown above. However, SAS does not require the quotation
marks for a formatted value. Formatted values can be up to 32,767 characters in length.

When you specify a value-or-range, you can also use the keyword OTHER to specify values that do not match any other
value-or-range. If you do not include the keyword OTHER, then SAS applies the format only to values that match the
value-range sets that you specify. If SAS encounters a value that you did not anticipate, SAS cannot apply the format but
instead displays that value as it's stored in the data set.

value-or-range = formatted-value

OTHER = 'Australia'
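
To see what happens when a value matches none of the value-range sets, you can try a format that deliberately leaves out OTHER. This is a minimal sketch. The format name $CTRY_DEMO and the code 'DE' are made up for illustration, and the PUT function is used here only to apply the format inside a DATA step.

```sas
proc format;
   /* Hypothetical format with no OTHER range */
   value $ctry_demo 'AU'='Australia'
                    'US'='United States';
run;

data _null_;
   length Code $2;
   do Code = 'AU', 'US', 'DE';
      /* PUT applies the format and returns the displayed text */
      Shown = put(Code, $ctry_demo.);
      put Code= Shown=;
   end;
run;
```

In the log, AU and US should appear with their full country names, while DE should appear unchanged because no value-range set matches it. Adding other='Miscoded' to the VALUE statement would label DE as Miscoded instead.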

Using PROC FORMAT


Here's the PROC FORMAT step for our scenario:

proc format;
value $ctryfmt 'AU'='Australia'
'US'='United States'
other='Miscoded';
run;

Remember that the report needs to display full country names rather than the country codes. The country values are
character, so you need a character format. Its name must begin with a dollar sign. Here we chose the name $CTRYFMT.
We can call this the $ country format. Notice that there is no period at the end of the format name.

Now look at our value ranges. When you apply this format to a variable later, the output will display full country names
instead of country codes. Notice that the last value-range set specifies the keyword OTHER to include all values that do
not match any other value or range. In this example, the output will label any values other than the two country codes
as miscoded.

Each VALUE statement defines exactly one format. However, you can define multiple formats in a single PROC
FORMAT step by adding multiple VALUE statements.

proc format;
value $ctryfmt 'AU'='Australia'
'US'='United States'
other='Miscoded';
value $sports
'FB'='Football'
'BK'='Basketball'
'BS'='Baseball';
run;

Question
Which user-defined format names are valid? Select all that apply.
a. $STFMT
b. $3LEVELS
c. _4YEARS
d. SALRANGES
e. DOLLAR

The correct answers are a, c, and d. Character formats begin with a dollar sign and must be followed by a letter or
underscore. Answer choice b has a dollar sign followed by a number. Also, user-defined formats cannot be the name of a
SAS format, as in answer choice e.

Code Challenge
Complete the PROC FORMAT step by adding the label Soccer to the value SK.

proc format;
value $sports
'FB'='Football'
'BK'='Basketball'
'BS'='Baseball'

;
run;

'SK'='Soccer';

The label is enclosed in quotation marks and assigned to the character value, which is also enclosed in quotation marks.
The VALUE statement ends with a semicolon.

Using PROC PRINT to Apply a User-Defined Format

Now that you know how to create your own formats, let's look at the second PROC step that you can use to apply the
format. You use the FORMAT statement in a PROC PRINT step to apply your formats to variables. The following FORMAT
statement applies the user-defined format $CTRYFMT to the variable Country.

proc print data=orion.sales label;
format Salary dollar10.
Birth_Date Hire_Date monyy7.
Country $ctryfmt.;
run;

When you refer to a user-defined format in the FORMAT statement, notice that you must specify a period after the
format name, the same way that you do for a SAS format name. However, remember that you do not have to include a
period after a user-defined format name when you create it.

We could have added another FORMAT statement, but both SAS and user-defined formats can be applied in a single
FORMAT statement.

Specifying a User-Defined Format for a Character Variable


In this demonstration, you create a user-defined format and assign the format to a character variable.

1. Copy and paste the following program into the editor. The PROC FORMAT step creates the format $CTRYFMT.
The PROC PRINT step assigns the format to the variable Country.

proc format;
value $ctryfmt 'AU'='Australia'
'US'='United States'
other='Miscoded';
run;

proc print data=orion.sales label;
var Employee_ID Job_Title Salary
Country Birth_Date Hire_Date;
label Employee_ID='Sales ID'
Job_Title='Job Title'
Salary='Annual Salary'
Birth_Date='Date of Birth'
Hire_Date='Date of Hire';
format Salary dollar10.
Birth_Date Hire_Date monyy7.
Country $ctryfmt.;
run;

2. Submit the program and check the log. You can see that SAS created the format because there's a note that
states that the format $CTRYFMT has been output.

3. View the report. The Country values are displayed as Australia and as United States. None of the Country values
are displayed as Miscoded.

Business Scenario

Now let's look at a numeric example. The Orion Star HR manager has asked for a report showing employee salaries
organized into three user-defined groups, or tiers. You need to display the tiers in the report instead of the dollar
amount.

Specifying Ranges of Values

Let's create a format that applies to numeric values. Suppose we know that the Salary values in orion.sales are between
20,000 and 250,000. Let's say that Tier1 includes salaries from 20,000 to 49,999, Tier2 includes salaries from 50,000 to
99,999, and Tier3 includes salaries from 100,000 to 250,000. This PROC FORMAT step defines the TIERS format.

proc format;
value tiers 20000-49999='Tier1'
50000-99999='Tier2'
100000-250000='Tier3';
run;

In this VALUE statement, each value-range set specifies an inclusive range of values. An inclusive range includes the first
value and the last value. Think about this. After you create the TIERS format, how will the Salary value 99,999.87 appear
in your report? Oops! The value falls in the gap between the end of the Tier2 range (99,999) and the start of the Tier3
range (100,000), so SAS cannot apply a tier label and displays the stored value instead. The ranges defined here
assume that the values of Salary are stored as whole numbers.

To create a set of ranges that have no gaps between them, you can add the less-than (<) symbol to exclude one or both
numbers in individual ranges. First, you make sure that the last number in the range is the same as the number at the
beginning of the next range. The number at the end of the first range becomes 50,000. The number at the end of the
middle range becomes 100,000.

Let's look at all possible ways that you can use the less-than symbol in a range. We'll use the middle range in the TIERS
format as an example. The last column lists the value or values that each form excludes from the range.

First Value   Symbol(s)   Last Value   Excluded from the range
50000         -           100000       none
50000         <-          100000       50000
50000         -<          100000       100000
50000         <-<         100000       50000 and 100000

To exclude the first value in a range, you put the less-than symbol after the first value. To exclude the last value in a
range, you put the less-than symbol before the last value. And, to exclude both the first and last values, you put a less-
than symbol in both places.

Now, let's add the less-than symbol to the ranges in this PROC FORMAT step so that the ranges have no gaps
between them.

proc format;
value tiers 20000-<50000='Tier1'
50000-<100000='Tier2'
100000-250000='Tier3';
run;

In the VALUE statement, the ranges are now defined by using the less-than symbol. Here's a question. If SAS applies the
TIERS format to the value 100,000, how do you think it will be displayed? 100,000 is the end value of the second range.
A less-than symbol appears in front of this number, so it is excluded in the second range. This value will appear as Tier3
in the report.
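
You can check the boundary behavior yourself by applying the format to a few test values. The following sketch uses a copy of the format under the made-up name TIERS_DEMO and a handful of hypothetical Salary values, and it writes the results to the log with the PUT function.

```sas
proc format;
   /* Same ranges as the TIERS format above, under a hypothetical name */
   value tiers_demo 20000-<50000  ='Tier1'
                    50000-<100000 ='Tier2'
                    100000-250000 ='Tier3';
run;

data _null_;
   /* Test values chosen to sit on or near the range boundaries */
   do Salary = 49999.99, 50000, 99999.87, 100000;
      Tier = put(Salary, tiers_demo.);
      put Salary= Tier=;
   end;
run;
```

With these ranges, 49999.99 should be labeled Tier1, both 50000 and 99999.87 should be labeled Tier2, and 100000 should be labeled Tier3. The value 99999.87 no longer falls into a gap.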

Defining a Continuous Range


So far, we've assumed that the values of Salary are between 20,000 and 250,000. However, what if we don't know the
highest and lowest values? To specify the lowest possible value of a variable, you can use the keyword LOW. And to
specify the highest possible value, you can use the keyword HIGH.

proc format;
value tiers low-<50000='Tier1'
50000-<100000='Tier2'
100000-high='Tier3';
run;

The LOW keyword can be used to define ranges that apply to character values as well as to numeric values. It's
important to know that, for character values, the LOW keyword treats missing values as the lowest possible values.
However, for numeric values, LOW does not include missing values.

Consider this. If you apply the TIERS format to a variable, what does SAS display in the report for a missing value? The
TIERS format applies to numeric values, so a missing value will appear as a period in the report.
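
If you would rather show a label than a period for missing values, one approach is to add a value-range set for the missing value itself. This is a sketch under assumptions: the format name TIERS_MISS, the label Unknown, and the test values are all made up for illustration and are not part of the course scenario.

```sas
proc format;
   /* Hypothetical variant of TIERS with an explicit entry for missing */
   value tiers_miss .             ='Unknown'
                    low-<50000    ='Tier1'
                    50000-<100000 ='Tier2'
                    100000-high   ='Tier3';
run;

data _null_;
   /* A missing value, a typical value, and a value above 250000 */
   do Salary = ., 42000, 250001;
      Tier = put(Salary, tiers_miss.);
      put Salary= Tier=;
   end;
run;
```

Here the missing value should be labeled Unknown, 42000 should be labeled Tier1, and 250001 should be labeled Tier3 because HIGH covers every value from 100000 upward.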

Question
How will a value of 50000 be displayed if the TIERS format below is applied to the value?

proc format;
value tiers 20000-<50000 ='Tier1'
50000-<100000='Tier2'
100000-250000='Tier3';
run;

a. Tier1
b. Tier2
c. 50000
d. a missing value

The correct answer is b. In Tier1, the less-than symbol is before 50000, which means that value will be excluded from the
range. In Tier2, however, 50000 is the starting value of the range and will be included in the tier.

Specifying a User-Defined Format for a Numeric Variable


In this demonstration, you create a user-defined format and assign the format to a numeric variable.

1. Copy and paste the following program into the editor. In the VALUE statement, Tier 1 runs from the LOW keyword
up to but not including 50000. Tier 2 runs from 50000 through 100000, including both endpoints. Tier 3
excludes 100000 and includes every value above it.

proc format;
value tiers low-<50000='Tier 1'
50000-100000='Tier 2'
100000<-high='Tier 3';
run;

2. To apply the TIERS format to the variable Salary in the report, copy and paste the following PROC PRINT step
into the editor. Recall that when you refer to a user-defined format in the FORMAT statement, you must specify
a period after the format name, the same way that you do for a SAS format name.

proc print data=orion.sales;
var Employee_ID Job_Title Salary
Country Birth_Date Hire_Date;
format Birth_Date Hire_Date monyy7.
Salary tiers.;
run;

3. Submit these steps and then check the log to confirm that SAS created the format. A note states that the format
TIERS has been output.

4. View the report. The Salary values have been replaced with the appropriate tier values.

Question
Can you include multiple VALUE statements in a single PROC FORMAT step?

proc format;
value $ctryfmt 'AU'='Australia'
'US'='United States'
other='Miscoded';
value tiers low-<50000 ='Tier1'
50000-<100000='Tier2'
100000-high ='Tier3';
run;

a. yes
b. no

The correct answer is a. You can create multiple user-defined formats in the same PROC FORMAT step by specifying
multiple VALUE statements.

Summary of Lesson 5: Formatting Data Values

This summary contains topic summaries, syntax, and sample programs.

Topic Summaries

Using SAS Formats


A format is an instruction that tells SAS how to display data values in output reports. You can add a FORMAT statement
to a PROC PRINT step to specify temporary SAS formats that control how values appear in the report. There are many
existing SAS formats that you can use. Character formats begin with a dollar sign, but numeric formats do not.

FORMAT variable(s) format;

SAS stores date values as the number of days between January 1, 1960, and a specific date. To make the dates in your
report recognizable and meaningful, you must apply a SAS date format to the SAS date values.

Creating and Applying User-Defined Formats


You can create your own user-defined formats. When you create a user-defined format, you don't associate it with a
particular variable or data set. Instead, you create it based on values that you want to display differently. The formats
will be available for the remainder of your SAS session. You can apply user-defined formats to a specific variable in a
PROC PRINT step.

You use the FORMAT procedure to create a format. You assign a format name that can have up to 32 characters. The
name of a character format must begin with a dollar sign, followed by a letter or underscore, followed by letters,
numbers, and underscores. Names for numeric formats must begin with a letter or underscore, followed by letters,
numbers, and underscores. A format name cannot end in a number and cannot be the name of a SAS format.

You use a VALUE statement in a PROC FORMAT step to specify the way that you want the data values to appear in your
output. You define value-range sets to specify the values to be formatted and the formatted values to display instead of
the stored value or values. The value portion of a value-range set can include an individual value, a range of values, a list
of values, or a keyword. The keyword OTHER is used to define a value to display if the stored data value does not match
any of the defined value-ranges.

PROC FORMAT;
VALUE format-name value-or-range1='formatted-value1'
value-or-range2='formatted-value2'
...;
RUN;

When you define a numeric format, it is often convenient to use numeric ranges in the value-range sets. Ranges are
inclusive by default. To exclude the endpoints, use a less-than symbol after the low end of the range or before the high
end.

The LOW and HIGH keywords are used to define a continuous range when the lowest and highest values are not known.
Remember that for character values, the LOW keyword treats missing values as the lowest possible values. However, for
numeric values, LOW does not include missing values.

Sample Programs

Applying Temporary Formats

proc print data=orion.sales label noobs;
where Country='AU' and
Job_Title contains 'Rep';
label Job_Title='Sales Title'
Hire_Date='Date Hired';
var Last_Name First_Name Country Job_Title
Salary Hire_Date;
run;

proc print data=orion.sales label noobs;
where Country='AU' and
Job_Title contains 'Rep';
label Job_Title='Sales Title'
Hire_Date='Date Hired';
format Hire_Date mmddyy10. Salary dollar8.;
var Last_Name First_Name Country Job_Title
Salary Hire_Date;
run;

Specifying a User-Defined Format for a Character Variable

proc format;
value $ctryfmt 'AU'='Australia'
'US'='United States'
other='Miscoded';
run;

proc print data=orion.sales label;
var Employee_ID Job_Title Salary
Country Birth_Date Hire_Date;
label Employee_ID='Sales ID'
Job_Title='Job Title'
Salary='Annual Salary'
Birth_Date='Date of Birth'
Hire_Date='Date of Hire';
format Salary dollar10.
Birth_Date Hire_Date monyy7.
Country $ctryfmt.;
run;

Specifying a User-Defined Format for a Numeric Variable

proc format;
value tiers low-<50000='Tier1'
50000-100000='Tier2'
100000<-high='Tier3';
run;

proc print data=orion.sales;
var Employee_ID Job_Title Salary
Country Birth_Date Hire_Date;
format Birth_Date Hire_Date monyy7.
Salary tiers.;
run;

Lesson 6: Reading SAS Data Sets


One common SAS programming task is to create a SAS data set. For example, information on Orion Star sales employees
might reside in several different input sources, such as a SAS data set, a raw data file, or even a Microsoft Excel
worksheet. You can create a SAS data set from any of these types of data. In this lesson, you use a DATA step to create a
SAS data set from an existing SAS data set. You also learn to create new variables, select variables and observations, and
add permanent attributes to variables.

Objectives
In this lesson, you learn to do the following:

• use a DATA step to create a SAS data set from an existing SAS data set
• subset observations by using the WHERE statement
• create a new variable by using the assignment statement
• subset variables by using the DROP and KEEP statements
• describe the compilation and execution phases of the DATA step
• store labels and formats in the descriptor portion of a SAS data set

Reading a SAS Data Set


Orion.sales is a SAS data set that contains information about Orion Star sales employees from Australia and from the
United States. Suppose you want to use the data in orion.sales to create a new SAS data set that contains a subset of
this data. For example, you want the new data set, work.subset1, to contain only the Australian sales representatives
with the substring Rep in their job title.

Question
What types of files can a DATA step read as input data?

a. SAS data sets
b. Microsoft Excel worksheets
c. raw data files
d. all of the above

The correct answer is d. A DATA step can read a SAS data set, an Excel worksheet, or a raw data file as input data.

Using the DATA Step


To create a new SAS data set from an existing SAS data set, you use a DATA step. The following DATA step contains a
DATA statement, a SET statement, and a RUN statement.

DATA output-SAS-data-set;
SET input-SAS-data-set;
RUN;

You begin the DATA step with the DATA statement, which provides the name of the SAS data set that you're creating.
The data set can be temporary or permanent. In the following example, you're creating the temporary SAS data set
subset1 in the work library.

data work.subset1;
set orion.sales;
run;

The SET statement names orion.sales as the existing SAS data set that you want to read in as input data. Can you tell
whether this is a permanent or temporary data set? Yes, you can. The data is in the permanent library, orion.

By default, a SET statement reads all observations and all variables from the input data set sequentially. You use the
WHERE statement to subset the input data set by selecting only the observations that meet a particular condition. The
following WHERE statement selects only those observations where the variable Country has a value of AU and where
the value of Job_Title contains the substring Rep.

data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
run;

If either of these expressions is false, SAS will not process the observation, and therefore, the observation won't be
included in the output data set.

Here's a question. Can you think of a way to write this WHERE statement using different operators or symbols? You
could write it like this.

data work.subset1;
set orion.sales;
where Country eq 'AU' and
Job_Title like '%Rep%';
run;

These statements will return the same results.
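
One way to confirm that the two WHERE statements select the same observations is to create both subsets and compare them. This is a minimal sketch that assumes the orion library is already assigned. The data set names work.rep_contains and work.rep_like are made up for illustration.

```sas
/* Subset 1: the = and CONTAINS operators */
data work.rep_contains;
   set orion.sales;
   where Country='AU' and
         Job_Title contains 'Rep';
run;

/* Subset 2: the EQ and LIKE operators; % matches any number of characters */
data work.rep_like;
   set orion.sales;
   where Country eq 'AU' and
         Job_Title like '%Rep%';
run;

/* Compare the two subsets and report any differences */
proc compare base=work.rep_contains compare=work.rep_like;
run;
```

If the two forms are equivalent for this data, the PROC COMPARE report should find no unequal values, and the log should show the same number of observations read in both DATA steps.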

Subsetting Observations in the DATA Step


In this demonstration, you subset observations in the DATA step.

1. Copy and paste the following PROC PRINT step into the editor to examine the data set orion.sales.

proc print data=orion.sales;
run;

2. Submit the step and then check the log. As you can see, SAS read 165 observations from orion.sales.

3. View the report. You can see that there are nine variables. Notice that the variable Job_Title includes mostly
titles that contain the substring Rep already, but a few of them don't. The Country variable includes both AU and
US values.

4. Copy and paste the following program into the editor. The DATA step uses a WHERE statement to subset the
observations and create the new data set, work.subset1. The PROC PRINT step prints the new data set.

data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
run;

proc print data=work.subset1;
run;

5. Submit the code and check the log. The log shows that SAS read 61 observations from orion.sales. In the report,
notice that all of the observations have AU as the value for Country and have Rep somewhere in their value for
Job_Title.

Question
Examine the following program and then decide which statement is true.

data us;
set orion.sales;
where Country='US';
run;

a. The program reads a temporary data set and creates a permanent data set.
b. The program reads a permanent data set and creates a temporary data set.
c. The program contains a syntax error and will not execute.
d. The program will not execute because you cannot work with permanent and temporary data sets in the
same step.

The correct answer is b. The DATA statement doesn't specify a libref, so it's creating the temporary data set us. The SET
statement reads orion.sales, which is a permanent data set. There are no syntax errors in the program.

Code Challenge
Write a statement to specify the data set emp.salary as the data set to be read.

data march.payroll;

;
run;

set emp.salary;

You specify the keyword SET and the two-level name of the data set (libref.filename) to be read.

Business Scenario
Suppose that the management staff wants to give a 10% bonus to each Australian sales representative hired before
January 1, 2000. You'll use the SAS data set orion.sales to create a new data set named work.subset1. In addition to
subsetting for AU and the substring Rep, you'll subset the data based on the employee hire date. You'll also calculate a
10% bonus for these employees based on their salary.

Using a SAS Date Constant


In this scenario, you need to subset the data based on the variable Hire_Date, which contains a SAS date value. How do
you think you can compare a SAS date value to a calendar date? You can use a SAS date constant. A SAS date constant is
a date written as a one- or two-digit day, followed by a three-letter month abbreviation, and then a two- or four-digit
year, enclosed in quotation marks and immediately followed by the letter D.

'ddmmm<yy>yy'D

SAS will automatically convert a date constant to a SAS date value. The following table shows some examples of SAS
date constants.

Examples

'01JAN2000'D
'31Dec11'd
'1jan04'd
'06Nov2000'D

You can use a date constant in any SAS expression, including a WHERE expression.

In the following example, we want the employees whose Hire_Date value is before January 1, 2000, so we use the less-
than symbol to indicate this.

data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep' and
Hire_Date<'01jan2000'd;
run;

You enclose a SAS date constant in quotation marks. Notice that this program already contains the WHERE expression
for the Australian employees whose Job_Title value includes the substring Rep.
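
To see the conversion from a date constant to a SAS date value for yourself, you can write one to the log with and without a format. This is a minimal sketch. The variable name Cutoff is made up for illustration.

```sas
data _null_;
   /* The date constant is converted to a SAS date value (a number) */
   Cutoff = '01jan2000'd;

   put Cutoff=;          /* the stored numeric value        */
   put Cutoff= date9.;   /* the same value shown with DATE9. */
run;
```

The first PUT statement should write Cutoff=14610, which is the number of days from January 1, 1960, to January 1, 2000. The second should write Cutoff=01JAN2000. This stored number is what the WHERE expression compares against Hire_Date.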

Using the Assignment Statement


Our next task in the scenario is to calculate a 10% bonus for these employees. We're going to create the new variable
Bonus to store the bonus value. To modify existing values or to create new variables, you can use an assignment
statement in a DATA step. The assignment statement evaluates an expression and assigns the resulting value to a new or
existing variable.

variable=expression;

Notice that the assignment statement is one of the few SAS statements that doesn't begin with a keyword. Variable
names an existing or new variable, and expression is a set of instructions that produces a value.

As in the WHERE statement, an expression is a sequence of operands and operators. Operands are character constants,
numeric constants, date constants, character variables, or numeric variables. Operators are either symbols that
represent an arithmetic calculation or they are SAS functions.

Here are some examples of assignment statements.

Example                        Type
Salary=26960;                  numeric constant
Gender='F';                    character constant
Hire_Date='21JAN1995'd;        date constant
Bonus=Salary*.10;              arithmetic expression
BonusMonth=month(Hire_Date);   SAS functions

In the first row of the table, we're assigning the numeric constant 26960 to Salary. The next row shows how to assign a
character constant. Remember that it needs to be in quotes, either single or double. Next you can see how to use a date
constant. In the fourth row, we're using an arithmetic expression. We're multiplying Salary by .10 to calculate the 10%
bonus and assigning the resulting value to the new variable Bonus. In the last row, you can see how to use a SAS
function. This function extracts the month from the Hire_Date value.

You should be mindful when using arithmetic operators in an assignment statement. When you use more than one
arithmetic operator in an expression, SAS performs operations based on priority, as is the case normally in math
equations. You can use parentheses to clarify or alter the order of operations. Also, if any operand in the expression has
a missing value in the observation, the result is a missing value.

Symbol   Definition       Priority
**       exponentiation   I
*        multiplication   II
/        division         II
+        addition         III
-        subtraction      III

Question
What is the result of the following assignment statement?

num=4+10/2;

a. .(missing)
b. 0
c. 7
d. 9

The correct answer is d. The order of operations is division and multiplication, followed by addition and subtraction. So
10 divided by 2 equals 5, and 5 plus 4 equals 9.

Question
What is the result of this assignment statement given the values of var1 and var2?

num=var1+var2/2;

var1   var2
.      10

a. . (missing)
b. 0
c. 5
d. 10

The correct answer is a. If an operand in an arithmetic expression has a missing value, the result is a missing
value.

. = . + 10/2
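
The two questions above can be checked in a short DATA _NULL_ step. This is a sketch for illustration. The variable names a, b, c, and d are made up, and the SUM function in the last assignment goes beyond what this lesson covers; it is shown only as one possible contrast with the plus operator.

```sas
data _null_;
   a = 4 + 10 / 2;           /* division first, then addition      */
   b = (4 + 10) / 2;         /* parentheses change the order       */

   var1 = .;
   var2 = 10;
   c = var1 + var2 / 2;      /* a missing operand makes c missing  */
   d = sum(var1, var2 / 2);  /* SUM ignores missing arguments      */

   put a= b= c= d=;
run;
```

The log should show a=9, b=7, c=. and d=5. SAS also writes a note to the log saying that missing values were generated as a result of performing an operation on missing values.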

Subsetting Observations and Creating a New Variable


In this demonstration, you subset observations and create a new variable in the DATA step.

1. Copy and paste the following program into the editor. The DATA step subsets the data set orion.sales by the
Australian sales representatives based on their hire date. Using an assignment statement, the program creates
the new variable Bonus. The PROC PRINT step displays the report and includes a FORMAT statement to format
Hire_Date values with the DATE9. format.

data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep' and
Hire_Date<'01jan2000'd;
Bonus=Salary*.10;
run;

proc print data=work.subset1 noobs;
var First_name Last_Name Salary
Job_Title Bonus Hire_Date;
format Hire_Date date9.;
run;

2. Submit the program and check the log. The log shows that SAS read 29 observations from orion.sales, and 29
observations and 10 variables were output to work.subset1. Originally, orion.sales contained nine variables, so
it looks like SAS created the new variable.

3. View the report. The new variable, Bonus, is displayed. In the first observation, notice that the calculation was
performed correctly: 26600 multiplied by .10 equals 2660. Also, notice that the variable Hire_Date does not
include any dates on or after January 1, 2000.

Customizing a SAS Data Set


You've seen how to modify a DATA step so that the output data set contains a subset of the observations in the input
data set. Suppose that now you want to modify your DATA step further. You want all of the Australian sales reps to
receive a bonus, regardless of hire date, and you only want to include some of the variables from orion.sales in your
new data set. That is, you want your output data set to contain a subset of the variables in the input data set, as well as
a subset of the observations in the input data set.

Using the DROP and KEEP Statements

As you've seen, the SET statement reads all of the variables from the input data set and writes them to the output data
set. You can exclude variables from your output data set by using a DROP statement or a KEEP statement in a DATA step.

DROP variable-list;
KEEP variable-list;

You use the DROP statement to specify the variables to exclude from the output data set. The DROP statement begins
with the keyword DROP, followed by a space-separated list of the variables that you want to drop from the output data
set.

data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
drop Employee_ID Gender Country
Birth_Date;
run;

You use the KEEP statement to specify a list of variables to include in the output data set.

data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
keep First_Name Last_Name
Salary Job_Title Hire_Date
Bonus;
run;

How can you decide which statement to use? You might want to use the KEEP statement instead of the DROP statement
if the number of variables to keep is significantly smaller than the number to drop. Also, if you use a KEEP statement,
you must include every variable to be written, including any new variables. One more note: the DROP and KEEP
statements have no effect on the input data set.

Subsetting Variables in a DATA Step: DROP and KEEP


In this demonstration, you subset variables in a DATA step using the DROP and KEEP statements.

1. Copy and paste the following program into the editor. This DATA step contains the DROP statement to exclude
the variables Employee_ID, Gender, Country, and Birth_Date from the output data set work.subset1.

data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
drop Employee_ID Gender Country Birth_Date;
run;

proc print data=work.subset1;
run;

2. Submit the code and then check the log. The log shows that SAS read 61 observations from orion.sales. The new
data set, work.subset1, contains six variables. You added the variable Bonus and dropped four others.

3. View the report. You can see that the report displays the six variables, including the new variable Bonus.

4. Copy and paste the following program into the editor. This DATA step contains the KEEP statement and lists the
variables to include in the output data set.

data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
keep First_Name Last_Name Salary Job_Title Hire_Date Bonus;
run;

proc print data=work.subset1;
run;

5. Submit the program and check the log. Again, SAS read 61 observations, and work.subset1 contains six
variables. The report is exactly the same. You can use the DROP or KEEP statements to produce the same output
data set.

Question
If you submit the DATA step below, which variables appear in the work.mysubset data set? Select all that apply.

data work.mysubset;
set mylib.salesforce;
drop Gender Salary;
run;

mylib.salesforce

Emp_ID   Name            Gender   Salary   Job_Title
120102   Zhou, Tom       M        108255   Sales Manager
120103   Dawes, Wilson   M         87975   Sales Manager

a. Emp_ID
b. Name
c. Gender
d. Salary
e. Job_Title

The correct answer is a, b, and e. The DATA step omits two variables from the output data set: Gender and Salary. The
remaining three variables are included.

How SAS Processes the DATA Step


Let's investigate how SAS processes the DATA step. SAS processes the DATA step in two phases: the compilation phase
and the execution phase. During the compilation phase, SAS scans each DATA step statement for syntax errors. It
converts the program to machine code if no syntax errors are found. SAS also creates the program data vector to hold
the current observation.

When the compilation phase is complete, SAS creates the descriptor portion of the new data set. Remember, the
descriptor portion contains information such as the data set name and the names of all the data set's variables.

Compilation Phase
Let's walk through the compilation phase of our previous program.

data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
drop Employee_ID Gender Country
Birth_Date;
run;

During the compilation phase, SAS creates the program data vector, or PDV. The PDV is an area of memory where SAS
builds one observation. The PDV contains two automatic variables that can be used for processing, but that are not
written to the data set as part of an observation. _N_ is the iteration number of the DATA step, and _ERROR_ signals the
occurrence of an error that is caused by the data during execution. The default value of _ERROR_ is 0, which means
there is no error. When one or more errors occur, the value is set to 1.

PDV

_N_ _ERROR_
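
You can see these automatic variables for yourself by writing them to the log. This short sketch is an illustration
only. It assumes the sample table sashelp.class, and it uses DATA _NULL_ so that no output data set is created.

```sas
/* Assumption: sashelp.class is used only because it ships with SAS. */
data _null_;
   set sashelp.class (obs=3);      /* read only the first three observations */
   putlog _N_= _ERROR_= Name=;     /* write the automatic variables to the log */
run;
```

The log should show _N_ increasing by 1 on each iteration, and _ERROR_ staying at 0 as long as no data errors occur.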

SAS scans each statement in the DATA step, looking for syntax errors such as missing or misspelled keywords, invalid
variable names, missing or invalid punctuation, or invalid options.

In our example code above, SAS scans the DATA step. When SAS compiles the SET statement, a slot is added to the PDV
for each variable in the input data set: Employee_ID, First_Name, Last_Name, Gender, Salary, Job_Title, Country,
Birth_Date, and Hire_Date. The descriptor portion of the input SAS data set, orion.sales, supplies the variable names, as
well as attributes such as type and length. Then SAS adds the new variable Bonus to the PDV based on the assignment
statement.

PDV

_N_  _ERROR_  Employee_ID  First_Name  Last_Name  Gender  Salary  Job_Title  Country  Birth_Date  Hire_Date  Bonus
              N8           $8          $8         $8      N8      $8         $8       N8          N8         N8

SAS determines that Bonus is a numeric variable because the expression on the right side of the assignment statement,
Salary*.10, produces a numeric value. SAS then flags the variables to be dropped from the output. In this case, the
variables Employee_ID, Gender, Country, and Birth_Date are marked to be dropped.

At the bottom of the DATA step, the compilation phase is complete, and the descriptor portion of the new SAS data set
work.subset1 is created.

Descriptor portion of work.subset1

First_Name  Last_Name  Salary  Job_Title  Hire_Date  Bonus
$8          $8         N8      $8         N8         N8
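
One way to see that the descriptor portion is built at compile time is to stop the DATA step before it reads any data.
This is only an illustrative sketch. It assumes sashelp.class as the input, and work.shell and Bonus are hypothetical
names.

```sas
/* Assumption: sashelp.class stands in for the input data set. */
data work.shell;
   stop;                     /* end execution before any observation is read */
   set sashelp.class;        /* still compiled, so its variables are added to the PDV */
   Bonus=Weight*.10;         /* still compiled, so Bonus is defined as numeric */
   drop Sex Height;          /* flagged at compile time */
run;

/* Expect zero observations, with Name, Age, Weight, and Bonus in the variable list. */
proc contents data=work.shell;
run;
```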

Execution Phase
If the DATA step compiles successfully, then the execution phase begins. During the execution phase, the DATA step
reads and processes the observations from the input data set, and creates observations in the data portion of the output
data set. By default, the DATA step executes once for each observation in the input data set.

At the start of the execution phase, SAS initializes the PDV to missing.

PDV

_N_  _ERROR_  Employee_ID  First_Name  Last_Name  Gender  Salary  Job_Title  Country  Birth_Date  Hire_Date  Bonus
              N8           $8          $8         $8      N8      $8         $8       N8          N8         N8
              .                                           .                           .           .          .

Remember that missing character values are displayed as blanks, and missing numeric values are displayed as a period.
In our example program, when the SET statement executes, SAS reads the first observation from orion.sales into the
PDV, providing a value for each variable.

PDV

_N_  _ERROR_  Employee_ID  First_Name  Last_Name  Gender  Salary  Job_Title      Country  Birth_Date
              N8           $8          $8         $8      N8      $8             $8       N8
              120121       Irenie      Elvish     F       26600   Sales Rep. II  AU       -4169

The value of Bonus is missing because Bonus doesn't come from the input data set. It's a new variable being created in
this DATA step. When SAS executes the assignment statement, it assigns a value to Bonus. At the bottom of the DATA
step, SAS uses the values in the PDV to write the first observation to the new SAS data set. SAS doesn't write the
variables in the DROP statement to work.subset1.

work.subset1

First_Name  Last_Name  Salary  Job_Title      Hire_Date  Bonus
Irenie      Elvish     26600   Sales Rep. II  6575       2660.00

Then control returns to the top of the DATA step for the next iteration. This is referred to as implicit output and implicit
return. SAS retains the values of variables that were read from the input data set in the PDV. These values will be
overwritten when the next observation is read into the PDV. SAS reinitializes the value of the new variable, Bonus, to
missing.

_N_  _ERROR_  Employee_ID  First_Name  Last_Name  Gender  Salary  Job_Title      Country  Birth_Date
              N8           $8          $8         $8      N8      $8             $8       N8
              120121       Irenie      Elvish     F       26600   Sales Rep. II  AU       -4169

As the SET statement executes on the second iteration of the DATA step, SAS reads the second observation into the PDV.
It overwrites previous values in the PDV. SAS calculates Bonus for this observation, and then uses the values in the PDV
to write the second observation to the new data set.

work.subset1

First_Name  Last_Name  Salary  Job_Title      Hire_Date  Bonus
Irenie      Elvish     26600   Sales Rep. II  6575       2660.00
Christina   Ngan       27475   Sales Rep. II  8217       2747.50

Then control returns to the top of the DATA step for the next iteration. This process continues until all of the
observations are read.
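
To watch this behavior, you can write the PDV values to the log at two points in each iteration. The sketch below is an
illustration under assumptions: it reads sashelp.class instead of orion.sales, and Doubled is a hypothetical new variable
that plays the role of Bonus.

```sas
/* Assumption: sashelp.class stands in for orion.sales. */
data work.demo;
   set sashelp.class (obs=3);
   /* Doubled is missing here on every iteration because SAS reinitializes new variables. */
   putlog 'After SET:        ' _N_= Name= Age= Doubled=;
   Doubled=Age*2;
   putlog 'After assignment: ' _N_= Name= Age= Doubled=;
run;
```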

Question
Type the letter of the word or phrase (a through d) that completes each of the numbered statements.

1. When you submit a DATA step, SAS processes the step in the __________________ phase first.
2. When you submit a DATA step, SAS processes the step in the __________________ phase second.
3. During the compilation phase, SAS creates the _____________ of the output data set.
4. During the execution phase, SAS creates the _____________ of the output data set.

a. descriptor portion
b. compilation
c. execution
d. data portion

The correct answers from top to bottom are b, c, a, d. The compilation phase precedes the execution phase. SAS creates
the descriptor portion of the data set during the compilation phase and the data portion during the execution phase.

Business Scenario
Suppose that you want to create a new SAS data set that contains only the Australian employees whose Bonus is at least
$3000. In this situation, you want to subset observations based on the variable Country and the variable Bonus. Country
is a part of the orion.sales input data set. Recall that the variable Bonus does not exist in our input data set. We created
it with an assignment statement in the DATA step.

You could consider using a WHERE statement, but the WHERE statement selects observations when they are read from
the input data set to the PDV. Let's find out what happens if you use a variable that does not exist in the input data set in
a WHERE statement.

Activity
Copy and paste this program, which includes a WHERE statement to subset on the Bonus amount, into the editor and
submit it.

Reminder: Make sure you've defined the orion library.

data work.subset1;
set orion.sales;
Bonus=Salary*.10;
where Country='AU' and
Bonus>=3000;
run;

proc print data=work.subset1;
run;

Is the output data set created successfully?

a. yes
b. no

The correct answer is b. No, the output data set is not created successfully. Because Bonus is a new variable being
created in this DATA step and is not in orion.sales, it cannot be used in a WHERE statement. The log contains an error
message, and SAS stopped processing the DATA step.

The Subsetting IF Statement
To subset observations based on the value of a variable you create, you can use the subsetting IF statement. The syntax
for the subsetting IF statement is the keyword IF and an expression that you want to evaluate.

IF expression;

Remember that an expression is a sequence of operands and operators that form a set of instructions. You can specify
multiple expressions in a subsetting IF statement.

if Salary>5000;

if Hire_Date='15APR2008'd;

if Country not in ('GB', 'FR', 'NL');

if Country='US' and Salary>75000;

Although IF expressions are similar to WHERE expressions, you cannot use special WHERE operators, such as CONTAINS, in
IF expressions.
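
For example, CONTAINS is a special WHERE operator, so it isn't available in a subsetting IF. One possible workaround,
offered here as a sketch rather than as something from the course, is a character function such as INDEX, which returns
the position of a substring, or 0 if the substring isn't found. The data set name work.reps is hypothetical, and the
orion library must already be defined.

```sas
data work.reps;
   set orion.sales;
   /* CONTAINS is not valid in an IF expression, so test the position that INDEX returns. */
   if index(Job_Title,'Rep')>0;
run;
```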

The subsetting IF statement causes the DATA step to continue processing only those observations that meet the
condition of the expression that you specify. That is, if the expression is true for the observation, SAS continues to
execute statements in the DATA step and writes the current observation to the output data set. The resulting SAS data
set contains a subset of the original SAS data set. If the expression is false, the remaining program statements in the
DATA step are not executed for that observation, and the observation is not written to the output data set. SAS
immediately returns to the beginning of the DATA step for the next iteration.
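
A quick way to see this flow is to write a message to the log before and after the subsetting IF. In this sketch, which
assumes sashelp.class as the input and uses the hypothetical name work.teens, the second message appears only for
observations that pass the condition.

```sas
/* Assumption: sashelp.class stands in for orion.sales. */
data work.teens;
   set sashelp.class;
   putlog 'Read: ' Name= Age=;     /* runs for every observation that is read */
   if Age>=15;                     /* when false, SAS returns to the top of the step */
   putlog '  Kept: ' Name=;        /* runs only when the condition is true */
run;
```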

Question
When you use the subsetting IF statement, how are observations excluded?
a. If the expression is true, SAS excludes the observation from the input data set.
b. If the expression is false, SAS excludes the observation from the output data set.
c. If the expression is false, SAS excludes the observation from the PDV.
d. If the expression is true, SAS excludes the observation from the PDV.

The correct answer is b. When the expression is false, SAS excludes the observation from the output data set and
returns to the top of the DATA step to process the next observation.

Selecting Observations by Using the Subsetting IF Statement


In this demonstration, you select observations from a data set using the subsetting IF statement in your DATA step.

1. Copy and paste the following program into the editor. You use orion.sales to create the temporary data set
auemps. The WHERE statement selects only the observations where the value of Country is equal to AU. The
assignment statement creates the variable Bonus by calculating 10% of the employees' salaries. The subsetting
IF statement specifies that you only want the observations where the value of Bonus is equal to or greater than
3000. The PROC PRINT step creates the report.

data work.auemps;
set orion.sales;
where Country='AU';
Bonus=Salary*.10;
if Bonus>=3000;
run;

proc print data=work.auemps;
run;

2. Submit the code and check the log. You can see that of the 165 observations in orion.sales, 63 were read into
the PDV for processing, and only 12 were written to work.auemps.

3. View the report. Notice that only AU values for Country and only Bonus values equal to or greater than 3000 are
displayed.

Activity
Copy and paste this program into the editor. This program is a variation of the program in the previous demonstration,
with both conditions combined into a single subsetting IF. Submit the program and review the log and results.

Reminder: Make sure you've defined the orion library.

data work.auemps;
set orion.sales;
Bonus=Salary*.10;
if Country='AU' and Bonus>=3000;
run;

proc print data=work.auemps;
run;

Are the results the same as what you saw in the previous demonstration?
a. yes
b. no

The correct answer is a. The results are the same, but the processing isn't as efficient. The log shows that SAS read all
165 observations from orion.sales, rather than the 63 observations that the previous program read. You should subset as
early as possible in your program for more efficient processing.

Choosing a Statement for Subsetting Observations


It can be confusing to determine which statement to use—the WHERE statement or the subsetting IF statement. Let's
discuss how you decide. If you're subsetting observations in a PROC step, you must use a WHERE statement. You cannot
use a subsetting IF statement. That's easy.

If you're subsetting observations in a DATA step, you can always use a subsetting IF statement. That's easy, too. The
tricky part is knowing when you can use a WHERE statement in the DATA step. You only have to remember one rule:
when you use a WHERE statement in the DATA step, the WHERE expression must reference only variables from the
input data set.

If you're trying to subset based on a variable that SAS is reading from a single data set using the SET statement, you can
use a WHERE statement. If the SET statement reads more than one data set and the variable is not in all of them, you
can't use a WHERE statement.

Why can't you use a WHERE statement based on a variable that's created with an assignment statement? A variable
that's created using an assignment statement doesn't exist in the input data set. Remember, the WHERE statement
subsets data as SAS reads the data into the PDV.
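
The rules above can be sketched in a few lines. This example is an illustration only. It assumes sashelp.class as the
input, and work.tall_teens and HeightCm are hypothetical names.

```sas
/* In a PROC step, WHERE is the only choice. */
proc print data=sashelp.class;
   where Age>=15;
run;

data work.tall_teens;
   set sashelp.class;
   where Age>=15;              /* Age is in the input data set, so WHERE is allowed */
   HeightCm=Height*2.54;       /* new variable created in this step */
   if HeightCm>165;            /* HeightCm is not in the input, so use a subsetting IF */
run;
```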

Question
Select the situation(s) in which you can use the WHERE statement to subset observations. Select all that apply.

a. in a PROC step
b. in a DATA step, when the variable in the condition is created
c. in a DATA step, when the variable in the condition is in the input data set

The correct answer is a and c. You can use a WHERE statement to subset observations in situations a and c. A subsetting
IF statement can be used in situations b and c.

Business Scenario
Now that you know how to subset observations and variables to create a customized SAS data set, suppose that you
want to create work.subset1 so that it includes permanent labels and formats. In other words, you want to permanently
associate labels and formats with the variables and store them in the descriptor portion of the data set.

Assigning Permanent Labels in a DATA Step


You use a LABEL statement in a PROC step to assign descriptive labels to variables in your reports.

LABEL variable='label'
variable='label'...;

These labels are temporary. When you use the LABEL statement in a DATA step, SAS permanently associates
the labels with the variables. In the following DATA step, we're assigning the label Sales Title to Job_Title, and
the label Date Hired to Hire_Date.

data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
label Job_Title='Sales Title'
Hire_Date='Date Hired';
drop Employee_ID Gender Country
Birth_Date;
run;

SAS will add these labels to the descriptor portion of the data set. Remember that the descriptor portion of a SAS data
set stores variable attributes including the name, type, and length of the variable.

Question
If you submit this program, which of the following column headings will display for Job_Title in the resulting report?

data work.us;
set orion.sales;
where Country='US';
Bonus=Salary*.10;
label Job_Title='Sales Title';
drop Employee_ID Gender Country Birth_Date;
run;
proc print data=work.us label;
label Job_Title='Title';
run;

a. Sales Title
b. Job_Title
c. Title

The correct answer is c. The column heading will be Title, the label specified in the PROC PRINT step. Labels and formats
that you specify in a PROC step override the permanent labels and formats for that step only. The permanent attributes
stored in the data set are not changed.
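
If you want to verify that a temporary label doesn't replace the permanent one, you can run PROC CONTENTS after the
PROC PRINT step. The following sketch assumes sashelp.class as the input, and the data set name and both labels are
made up for illustration.

```sas
/* Assumption: sashelp.class stands in for orion.sales. */
data work.class_lbl;
   set sashelp.class;
   label Name='Student Name';        /* permanent label, stored in the descriptor portion */
run;

proc print data=work.class_lbl label;
   label Name='Pupil';               /* temporary label, used for this step only */
run;

proc contents data=work.class_lbl;   /* still lists Student Name as the label for Name */
run;
```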

Adding Permanent Labels to a SAS Data Set

In this demonstration, you add permanent labels to the descriptor portion of a SAS data set and then print the
labels in a report.

1. Copy and paste the following program into the editor. The DATA step includes the LABEL statement, which
specifies labels for the variables Job_Title and Hire_Date. The PROC CONTENTS step displays the descriptor portion of
work.subset1.

data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
label Job_Title='Sales Title'
Hire_Date='Date Hired';
drop Employee_ID Gender Country Birth_Date;
run;

proc contents data=work.subset1;
run;

2. Submit the program and check the log. The log shows that the program ran successfully.

3. Examine the results. The PROC CONTENTS report shows that the labels are now associated with the variables.
Consider this: If you write a PROC PRINT step to display work.subset1, will these new labels appear in the
report? No. The report will only include these descriptive labels if you add the LABEL option to the PROC PRINT
step. So even though the labels are permanently associated with the variables, you have the choice of how to
display the variables in your output.

4. Copy and paste the following PROC PRINT step, which includes the LABEL option, into the editor and submit it.

proc print data=work.subset1 label;
run;

5. View the results. The report shows that Sales Title and Date Hired have replaced their variable names as
headings.

Using the FORMAT Statement in a DATA Step
As with the LABEL statement, you can use the FORMAT statement in a DATA step to permanently associate formats with
variables.

FORMAT variable(s) format;

The format information is also stored in the descriptor portion of the data set.

In the data set work.subset1, you can apply SAS formats to the variables Salary, Hire_Date, and Bonus to permanently
format the values so that they are easier to understand.

Partial work.subset1

First_Name Last_Name Salary Job_Title Hire_Date Bonus

Irenie Elvish 26600 Sales Rep. II 6575 2660.00

Christina Ngan 27475 Sales Rep. II 8217 2747.50

Adding Permanent Formats to a SAS Data Set


In this demonstration, you add permanent formats to the descriptor portion of a SAS data set.

1. Copy and paste the following program into the editor. The FORMAT statement applies the format DOLLAR12. to
both the Salary and Bonus variables, and the format DDMMYY10. to the Hire_Date variable. Notice that you
specify the variable name, not the label, in the FORMAT statement. The PROC CONTENTS step displays the descriptor
portion of work.subset1, and the PROC PRINT step prints the data set.

data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
label Job_Title='Sales Title'
Hire_Date='Date Hired';
format Salary Bonus dollar12.
Hire_Date ddmmyy10.;
drop Employee_ID Gender Country Birth_Date;
run;

proc contents data=work.subset1;
run;

proc print data=work.subset1 label;
run;

2. Submit this program and then check the log. The log shows that the program ran successfully.

3. Examine the results. PROC CONTENTS shows that the formats are associated with the variables. Notice that
the labels are also still associated. In the report, the Salary and Bonus values now have dollar signs and
commas, and the Date Hired values are much easier to read and understand.

Summary of Lesson 6: Reading SAS Data Sets

This summary contains topic summaries, syntax, and sample programs.

Topic Summaries

Reading a SAS Data Set


You use a DATA step to create a new SAS data set from an existing SAS data set. The DATA step begins with a DATA
statement, which provides the name of the SAS data set to create. Include a SET statement to name the existing SAS
data set to be read in as input.

You use the WHERE statement to subset the input data set by selecting only the observations that meet a particular
condition. To subset based on a SAS date value, you can use a SAS date constant in the WHERE expression. SAS
automatically converts a date constant to a SAS date value.

DATA output-SAS-data-set;
SET input-SAS-data-set;
WHERE where-expression;
RUN;
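
To see what a date constant converts to, you can write one to the log with and without a format. This is a small
sketch, and the variable name Cutoff is hypothetical. A SAS date value is the number of days since January 1, 1960.

```sas
data _null_;
   Cutoff='01jan2000'd;       /* date constant */
   putlog Cutoff=;            /* Cutoff=14610, the stored SAS date value */
   putlog Cutoff= date9.;     /* Cutoff=01JAN2000, the same value with a format applied */
run;
```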

You use an assignment statement to create a new variable. The assignment statement evaluates an expression and
assigns the resulting value to a new or existing variable. The expression is a sequence of operands and operators. If the
expression includes arithmetic operators, SAS performs the numeric operations based on priority, as in math equations.
You can use parentheses to clarify or alter the order of operations.

variable=expression;
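
As a quick illustration of operator priority, the sketch below evaluates the same numbers with and without
parentheses. The variable names are arbitrary.

```sas
data _null_;
   NoParens=2+3*4;        /* multiplication is performed first: 14 */
   WithParens=(2+3)*4;    /* parentheses change the order: 20 */
   putlog NoParens= WithParens=;
run;
```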

Customizing a SAS Data Set


By default, the SET statement reads all of the observations and variables from the input data set and writes them to the
output data set. You can customize the new data set by selecting only the observations and variables that you want to
include. You can use a WHERE statement to select the observations, as long as the variables included in the condition
come from the input data set. You can use a DROP statement to list the variables to exclude from the new data set, or
use a KEEP statement to list the variables to include. If you use a KEEP statement, you must include every variable to be
written, including any new variables.

DROP variable-list;
KEEP variable-list;

SAS processes the DATA step in two phases: the compilation phase and the execution phase.

You can subset the original data set with a WHERE statement for variables that are defined in the input data set, and a
subsetting IF statement for new variables that are created in the DATA step. Remember that, although IF expressions are
similar to WHERE expressions, you cannot use special WHERE operators in IF expressions.

IF expression;

To subset observations in a PROC step, you must use a WHERE statement. You cannot use a subsetting IF statement in a
PROC step. To subset observations in a DATA step, you can always use a subsetting IF statement. However, a WHERE
statement can make your DATA step more efficient because it subsets on input.

Adding Permanent Attributes


When you use the LABEL statement in a DATA step, SAS permanently associates the labels with the variables by storing the
labels in the descriptor portion of the data set. Using a FORMAT statement in a DATA step permanently associates
formats with variables. The format information is also stored in the descriptor portion of the data set. You can use PROC
CONTENTS to view the label and format information. PROC PRINT does not display permanent labels unless you use the
LABEL or SPLIT= option.

LABEL variable='label'
variable='label'
... ;

FORMAT variable(s) format ...;

Sample Programs

Subsetting Observations in the DATA Step

proc print data=orion.sales;
run;

data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
run;

proc print data=work.subset1;
run;

Subsetting Observations and Creating a New Variable

data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep' and
Hire_Date<'01jan2000'd;
Bonus=Salary*.10;
run;

proc print data=work.subset1 noobs;
var First_Name Last_Name Salary
Job_Title Bonus Hire_Date;
format Hire_Date date9.;
run;

Subsetting Variables in a DATA Step: DROP and KEEP

data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
drop Employee_ID Gender Country Birth_Date;
run;

proc print data=work.subset1;
run;

data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
keep First_Name Last_Name Salary Job_Title Hire_Date Bonus;
run;

proc print data=work.subset1;
run;

Selecting Observations by Using the Subsetting IF Statement

data work.auemps;
set orion.sales;
where Country='AU';
Bonus=Salary*.10;
if Bonus>=3000;
run;

proc print data=work.auemps;
run;

Adding Permanent Labels to a SAS Data Set

data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
label Job_Title='Sales Title'
Hire_Date='Date Hired';
drop Employee_ID Gender Country Birth_Date;
run;

proc contents data=work.subset1;
run;

proc print data=work.subset1 label;
run;

Adding Permanent Formats to a SAS Data Set

data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
label Job_Title='Sales Title'
Hire_Date='Date Hired';
format Salary Bonus dollar12.
Hire_Date ddmmyy10.;
drop Employee_ID Gender Country Birth_Date;
run;

proc contents data=work.subset1;
run;

proc print data=work.subset1 label;
run;

Lesson 7: Reading Spreadsheet and Database Data


In this lesson, you'll learn how to access a Microsoft Excel workbook in SAS. You'll discover how you can treat a
worksheet in the Excel workbook just as if it is a SAS data set. You'll create a SAS data set using a subset of the
worksheet data, and add labels and formats to the variables. Finally, you'll use a similar technique to access an Oracle
table from within SAS.

Objectives

In this lesson, you learn to do the following:

• assign a libref to a Microsoft Excel workbook using the SAS/ACCESS LIBNAME statement
• access an Excel worksheet as though it is a SAS data set using a SAS two-level name
• use the DATA step to create a SAS data set that contains a subset of worksheet data
• assign a libref to an Oracle database using the SAS/ACCESS LIBNAME statement
• access an Oracle table using a SAS two-level name
• create a SAS data set that contains a subset of an Oracle table

Reading Spreadsheet Data


As a programmer at Orion Star, you’ve worked with many SAS data sets. But suppose you also need to work with data
that’s stored in a Microsoft Excel workbook. For example, your manager has requested a report on Orion Star sales
employees from Australia and the United States, and the input data is in the Excel workbook, sales.xls. You need to
access this data in SAS.

Fortunately, SAS provides several ways for you to access this data. You can use SAS/ACCESS Interface to PC Files to read
the worksheets within the sales.xls workbook as if they are SAS data sets. Alternatively, you can use the IMPORT
procedure to read the worksheet and write the data to a SAS data set. To learn about the IMPORT procedure, visit the
online documentation at support.sas.com.

Examining the Workbook


In this demonstration, you examine an Excel workbook in Microsoft Excel.

1. Open the sales.xls workbook. Either navigate to the file via Microsoft Excel, or open the file from the location
where you stored your practice files for this course.

2. Notice that the data for each country is on a separate tab. You’ll need each country in its own SAS data set. Look
at the date fields in columns H and I. These fields each have a different Excel date format applied. You might
want to address this, but don't worry about it right now. Everything else about this data looks pretty standard.

Exploring SAS/ACCESS
You can use the SAS/ACCESS LIBNAME statement to assign a libref to an Excel workbook. Then, SAS treats each
worksheet in the workbook as though it is a SAS data set. You have access! The SAS/ACCESS interface provides data
connectivity and integration between SAS and third-party data sources, including Microsoft Excel workbooks and various
databases. Using SAS/ACCESS interfaces, which are each licensed separately, your SAS programs can read data from and
write data to a third-party data source in the same way as reading from or writing to a SAS library.

SAS/ACCESS uses data access engines to read, write, and update data, regardless of the data source or platform. One of
the requirements for accessing relational databases is that your SAS installation must include the appropriate
SAS/ACCESS interfaces for the types of files you want to access. Excel and Oracle are examples of engines that are
available in SAS/ACCESS.

Some details about accessing database management systems are specific to your operating environment and to your
SAS installation, so this lesson does not contain practices for accessing relational database files. Click the Information
button in the course interface to learn how to determine which SAS products are licensed and installed in your
environment.

Using the SAS/ACCESS LIBNAME Statement


Let’s take a look at the syntax for the SAS/ACCESS LIBNAME statement.

LIBNAME libref <engine> "workbook-name" <options>;

After the keyword LIBNAME, you specify a libref. The libref must follow the same rules as for any other SAS libref. You
then must specify the SAS/ACCESS engine name, and then the physical file name of the Excel workbook in quotes,
including the path, filename, and extension.

Let’s explore how you determine the proper SAS/ACCESS engine to specify in our scenario. Both SAS and Microsoft
Office offer 32-bit and 64-bit versions. Different SAS/ACCESS engines are needed based on matching and non-matching
number of bits, also known as bitness. If the bitness of both products is the same, the default SAS/ACCESS engine, excel,
can be used. If the bitness of both products is not the same, you must use the PC Files Server engine, pcfiles.

In our scenario, we’re using 64-bit SAS and 32-bit Microsoft Office, so we’ll use the pcfiles engine in the SAS/ACCESS
LIBNAME statement to read the sales.xls workbook. This engine is supported by SAS/ACCESS Interface to PC Files.

Here’s the code.

libname orionx pcfiles path="&path/sales.xls";

After the keyword LIBNAME, we’ll specify the libref orionx, followed by the engine name pcfiles. With this engine, we
must also specify path= in front of the workbook name. Then we enter the workbook name in quotation marks. Click the
Information button in the course interface to learn more about the SAS/ACCESS engines and the appropriate
SAS/ACCESS LIBNAME statement syntax.
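
For comparison, here is a sketch of how the statement might look in each bitness situation. It assumes that the macro
variable path points to your practice files and that the relevant SAS/ACCESS Interface to PC Files components are
licensed and installed, so treat it as a template rather than code that runs in every environment.

```sas
/* Matching bitness: the excel engine takes the workbook name directly. */
libname orionx excel "&path/sales.xls";

/* Release the workbook when you're finished with it. */
libname orionx clear;

/* Non-matching bitness: the pcfiles engine requires PATH= in front of the workbook name. */
libname orionx pcfiles path="&path/sales.xls";
libname orionx clear;
```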

Accessing Excel Worksheets in SAS


If you are using a client application (such as SAS Enterprise Guide or SAS Studio) to access SAS on a remote server, you
cannot use the SAS/ACCESS Interface to PC Files engine that is necessary to assign a libref to an Excel file.
The demonstrations in this lesson use the SAS windowing environment.

In this demonstration, you access Excel files in SAS.

1. Copy and paste the following SAS/ACCESS LIBNAME statement into the editor and submit it to see how the
resulting libref enables you to access the Excel worksheets.

libname orionx pcfiles path="&path/sales.xls";

2. The log shows that the orionx libref was successfully assigned. This means that you have access to the data. You
can refer to orionx as if it is a SAS library, and you can access each of the worksheets as if they are data sets in
the library.

3. Navigate to the explorer window. Double-click Libraries to see that the library orionx is active. Notice that the
icon looks a little different. It has a globe on the folder, indicating that the data is outside of SAS.

4. Double-click the icon to see its contents. Note: In SAS Enterprise Guide, you can drill into the library using the
Server List window.

The worksheets in the workbook appear with a dollar sign at the end of the name. If the worksheet has named
ranges, the name will also appear, but will not have a dollar sign.

5. Copy and paste the following PROC CONTENTS step into the editor to explore the library. Remember that you
can use the CONTENTS procedure to list the contents of a SAS library. This step specifies orionx._all_ to list the
worksheets and their descriptor portions. Submit this step.
proc contents data=orionx._all_;
run;

6. The log shows that the code ran successfully.

7. Examine the results. In the second table of the PROC CONTENTS output, notice that some member names end in
dollar signs and others do not. Again, the members whose names end with a dollar sign are the worksheets.
The ones that do not end with a dollar sign are named ranges. You'll access the worksheet, Australia$, as if it is
a SAS data set. But remember, SAS data set names cannot include special characters. You learn more about
these dollar sign references and how to deal with them in your code later in this lesson.

8. Look at the first table for the Australia$ worksheet. Notice that the Data Set Name has the two-level worksheet
name and that the second part of the name includes a dollar sign, is enclosed in quotation marks, and is followed
by the letter n. This is a SAS name literal. You can also see that the Member Type is DATA, and the Engine is
PCFILES.

9. Look at the Variables and Attributes table for the Australia$ worksheet. The original column headings contain
embedded spaces. In the SAS windowing environment, embedded spaces in column headings are replaced with
underscores to create valid SAS variable names. In SAS Enterprise Guide, the column headings are used as
variable names without modification because special characters are allowed in variable names. You can set the
VALIDVARNAME=V7 option in SAS Enterprise Guide to cause it to behave the same as in the SAS windowing
environment. The column headings are stored as labels in both environments. Notice that Birth_Date and
Hire_Date are both listed as numeric variables with DATE9. formats because they were both formatted as dates
in the original file.
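The VALIDVARNAME= setting mentioned in this demonstration can be tried with a short program. This is only a minimal sketch, assuming the same &path macro variable and sales.xls workbook used in this lesson; the option needs to be in effect before SAS reads the worksheet.

```sas
/* Sketch: request V7 naming rules so that column headings with   */
/* embedded spaces become variable names with underscores.        */
options validvarname=v7;

libname orionx pcfiles path="&path/sales.xls";

/* Inspect one worksheet and check the variable names and labels. */
proc contents data=orionx.'Australia$'n;
run;

/* Release the workbook when finished. */
libname orionx clear;
```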

Referencing Excel Worksheets in SAS


Now that you’ve assigned the orionx libref to the Excel workbook and can refer to the spreadsheets within the
workbook as though they are SAS data sets, you need to complete your assignment: create a report of Orion Star sales
employees from Australia and the United States. You’ll use PROC PRINT to create the report, but you need to know a
few things first.
As you saw in the demonstration, the dollar sign is part of each Excel worksheet name, but a valid SAS data set name
can’t contain a dollar sign. So, you need to refer to an Excel worksheet in a special way to account for that special
character. You can use a SAS name literal.

A SAS name literal is a name token that is expressed as a string within quotation marks, followed by the upper- or
lowercase letter n. It enables special characters or blanks in data set names.

libref.'worksheetname$'n

After the libref orionx, you enclose the name of the Excel worksheet, Australia$, in quotation marks followed by the
letter n to print the contents of the Australia$ worksheet. Notice that we’re using the two-level data set name.

proc print data=orionx.'Australia$'n;
run;

Question
Which PROC PRINT step displays the UnitedStates worksheet?

a.
proc print data=orionx.'United States';
run;

b.
proc print data=orionx.'UnitedStates$';
run;

c.
proc print data=orionx.'UnitedStates'n;
run;

d.
proc print data=orionx.'UnitedStates$'n;
run;

The correct answer is d. You use a SAS name literal to refer to an Excel worksheet in SAS code. You enclose the name of
the worksheet, including the dollar sign, in quotation marks followed by the letter n.

Printing an Excel Worksheet


If you are using a client application (such as SAS Enterprise Guide or SAS Studio) to access SAS on a remote server, you
cannot use the SAS/ACCESS Interface to PC Files engine that is necessary to assign a LIBNAME statement to an Excel file.
The demonstrations in this lesson use the SAS windowing environment.

In this demonstration, you create a report of the Orion Star sales employees from Australia from an Excel worksheet.

1. Copy and paste the following step into the editor and submit it.

proc print data=orionx.'Australia$'n;
run;

2. Check the log and ensure that the code ran without errors or warnings.

3. Examine the report. You are using the Excel worksheet as if it is a SAS data set, and as you can see, the report
looks just like all the other SAS data set reports you've created. Now consider this. Do you think that you can
select only a subset of the worksheet data to print? After all, SAS is treating the worksheet just like a SAS data
set.

4. Copy and paste the following step into the editor. You're adding a WHERE statement to select only those
employees who have IV in their job titles, and a VAR statement to include only the variables Employee_ID,
Last_Name, Job_Title, and Salary in the report. To suppress the Obs column, you're adding the NOOBS option to
the PROC PRINT statement.

proc print data=orionx.'Australia$'n noobs;
   where Job_Title ? 'IV';
   var Employee_ID Last_Name Job_Title Salary;
run;

5. Submit this step and examine the report. The report looks great. You successfully created a subset of the
worksheet data.

Disassociating a Libref
It’s important to disassociate a libref when you are finished using it. If SAS has a libref assigned to an Excel workbook,
the workbook cannot be opened in Excel. SAS puts a lock on the Excel file when the libref is assigned. To disassociate a
libref, you use a LIBNAME statement and specify the libref and the CLEAR option.

libname orionx clear;

SAS disconnects from the data source and closes any resources that are associated with the connection.

Business Scenario
Now that you know how to access the sales workbook in SAS, you’ve been asked to perform another task. You need to
create a SAS data set using the workbook as input. The data set work.subset should include only those employees from
Australia who have the word Rep in their job title, a Bonus variable that is 10% of Salary, and permanent labels and
formats.

Creating a SAS Data Set from an Excel Worksheet


If you are using a client application (such as SAS Enterprise Guide or SAS Studio) to access SAS on a remote server, you
cannot use the SAS/ACCESS Interface to PC Files engine that is necessary to assign a LIBNAME statement to an Excel file.
The demonstrations in this lesson use the SAS windowing environment.

In this demonstration, you use a DATA step to read input data from a Microsoft Excel worksheet and create a report of
the Orion Star sales employees from Australia.

1. Copy and paste the following program into the editor. You use a SAS/ACCESS LIBNAME statement to access the
worksheet. This is the same one you saw earlier that uses the PC Files engine.

Next, in a DATA step, you specify the data set work.subset as the output data set. In the SET statement, you
specify the Australia$ worksheet as the input data. To include the employees with the word Rep in their job
title, you use a WHERE statement. In an assignment statement, you create the new variable Bonus and set it
equal to 10% of Salary. You define labels using a LABEL statement. You want to display Job_Title as Sales Title,
and Hire_Date as Date Hired. And finally, you format the variable Salary with the COMMA10. format, the
variable Hire_Date with the MMDDYY10. format, and the variable Bonus with the COMMA8.2 format.
Aside from using the SAS name literal in the SET statement, this DATA step looks the same as if you were using a
SAS data set as your input data.

The program includes a PROC CONTENTS step so you can verify that SAS stored the formats and labels in the
descriptor portion of work.subset, and a PROC PRINT step to create a report of the data set. Do you recall what
you need to add to the PROC PRINT step to tell SAS to print the labels specified? You add the LABEL option.

libname orionx pcfiles path="&path/sales.xls";

data work.subset;
   set orionx.'Australia$'n;
   where Job_Title contains 'Rep';
   Bonus=Salary*.10;
   label Job_Title='Sales Title'
         Hire_Date='Date Hired';
   format Salary comma10. Hire_Date mmddyy10.
          Bonus comma8.2;
run;

proc contents data=work.subset;
run;

proc print data=work.subset label;
run;

2. Submit the code and check the log. SAS read 61 observations from the data set. There were 63 rows in the
spreadsheet, so you can see that the subsetting was successful. And no warnings or errors are present.

3. Examine the results. From the PROC CONTENTS results, you can see the labels you defined, as well as the labels
that SAS added automatically. Remember that the labels are actually the column headings in the Excel
spreadsheet, but Hire_Date and Job_Title now reflect the labels you specified. The formats for the numeric
variables Salary, Hire_Date, and Bonus have been stored correctly. In the PROC PRINT results, you can see that
SAS created a subset of the data based on your specifications. All of the job title values contain the word Rep.
SAS calculated Bonus as 10% of Salary values, and the labels and formats have been applied.

4. What do you need to do now that you've finished using the orionx libref? Right, you need to disassociate it.
Copy and paste the following LIBNAME statement into the editor and submit it to disassociate the libref.

libname orionx clear;

5. The log shows that the libref orionx has been deassigned.

Reading Database Data


The Northeast Sales Manager has requested a report listing the supervisors from New York and New Jersey. This time,
the input data you need is in an Oracle database. Do you think that you can write a SAS program to access the tables in a
relational database? Of course you can. You can use SAS/ACCESS to read the tables within the database as if they are
SAS data sets.

Using SAS/ACCESS
Just as with the Excel input data, you'll use the SAS/ACCESS LIBNAME statement to assign a libref to the database.

LIBNAME libref engine <SAS/ACCESS options>;

We’ll use the LIBNAME statement supported by the SAS/ACCESS interface to Oracle.
You begin with the keyword LIBNAME and then you specify a libref. In our program, we'll assign the libref oralib.
Following the libref, you specify the engine name, such as Oracle or DB2. This is the SAS/ACCESS component that reads
and writes to your DBMS, and it's required. We'll specify the oracle engine.

Now we need to specify additional connection options. These options provide connection information and control how
SAS manages the timing and concurrence of the connection to the DBMS. These arguments are different for each
database. USER= specifies an optional Oracle user name. In our code, we'll specify user=edu101. USER= must be used
with PASSWORD=. PASSWORD=, or PW=, specifies an optional Oracle password that is associated with the Oracle user
name. We'll specify pw=edu101.

PATH= specifies the Oracle driver, node, and database. SAS/ACCESS uses the same Oracle path designation that you use
to connect to Oracle directly. See your database administrator to determine the databases that have been set up in your
operating environment, and to determine the default values if you do not specify a database. We'll specify
path=dbmssrv.

Next, you specify the SCHEMA= option in the SAS/ACCESS LIBNAME statement to connect to the Oracle schema in which
the database resides. SCHEMA= enables you to read database objects, such as tables and views, in the specified schema.
If this option is omitted, you will connect to the default schema for your DBMS. We'll specify schema=educ.

When you submit this LIBNAME statement, SAS treats the Oracle database like a SAS library, and any table in the
database can be referenced using a SAS two-level name.

libname oralib oracle user=edu101 pw=edu101
        path=dbmssrv schema=educ;
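
Once the libref is assigned, an Oracle table can be named in a procedure step just like a SAS data set. The sketch below is illustrative only: the table name supervisors and the column name State are assumptions made for this example, not names confirmed by this lesson, and the connection values are the sample values from the text.

```sas
libname oralib oracle user=edu101 pw=edu101
        path=dbmssrv schema=educ;

/* supervisors and State are hypothetical names used for illustration. */
proc print data=oralib.supervisors;
   where State in ('NY' 'NJ');
run;

/* Disconnect from the database when finished. */
libname oralib clear;
```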

Summary of Lesson 7: Reading Spreadsheet and Database Data

This summary contains topic summaries, syntax, and sample programs.

Topic Summaries

Reading Spreadsheet Data


You can use SAS/ACCESS Interface to PC Files to read the worksheets within a Microsoft Excel workbook. After you
submit a SAS/ACCESS LIBNAME statement, SAS treats the Excel workbook as if it were a SAS library and treats the
worksheets as if they were SAS data sets within that library. You submit a LIBNAME statement to specify a libref, an
engine name, and the location and name of the workbook. The engine tells SAS the type of input file and which engine
to use to read the input data.

LIBNAME libref <engine> "workbook-name" <options>;


LIBNAME libref <engine> <PATH=> "workbook-name" <options>;

When you browse the library, you might see worksheets and named ranges. Worksheet names end with a dollar sign,
and named ranges do not. Because the dollar sign is a special character, you must use a SAS name literal when you refer
to a worksheet in a program.
libref.'worksheetname$'n

When you assign a libref to an Excel workbook in SAS, the workbook cannot be opened in Excel. To disassociate a libref,
you submit a LIBNAME statement specifying the libref and the CLEAR option. SAS disconnects from the data source and
closes any resources that are associated with the connection.

Reading Database Data


You can also read database tables as if they were SAS data sets by using the LIBNAME statement supported by
SAS/ACCESS Interface to Oracle. This SAS/ACCESS LIBNAME statement includes a libref, an engine name, and additional
connection options that are site- and installation-specific. After you submit the LIBNAME statement, SAS treats the
Oracle database as if it were a SAS library, and any table in the database can be referenced using a SAS two-level name,
as if it were a SAS data set.

LIBNAME libref engine <SAS/ACCESS Oracle options>;

Sample Programs

Accessing Excel Worksheets in SAS

libname orionx pcfiles path="&path/sales.xls";

proc contents data=orionx._all_;
run;

Printing an Excel Worksheet

proc print data=orionx.'Australia$'n;
run;

proc print data=orionx.'Australia$'n noobs;
   where Job_Title ? 'IV';
   var Employee_ID Last_Name Job_Title Salary;
run;

Creating a SAS Data Set from an Excel Worksheet

libname orionx pcfiles path="&path/sales.xls";

data work.subset;
   set orionx.'Australia$'n;
   where Job_Title contains 'Rep';
   Bonus=Salary*.10;
   label Job_Title='Sales Title'
         Hire_Date='Date Hired';
   format Salary comma10. Hire_Date mmddyy10.
          Bonus comma8.2;
run;

proc contents data=work.subset;
run;

proc print data=work.subset label;
run;

libname orionx clear;

Lesson 8: Reading Raw Data Files


You can create a SAS data set from another SAS data set or from a Microsoft Excel worksheet. But what if your input
data is stored as a simple text file? In this lesson, you learn how to use delimited raw data files as input for a DATA step.
You also learn how to use the DATA step programming techniques that you already know to work with raw data. Lastly,
you learn how to identify data errors in your raw data files and how to handle missing data.

Objectives

In this lesson, you learn to do the following:

• identify types of raw data files and input styles
• define the terms standard and nonstandard data
• use list input to create a SAS data set from a delimited raw data file
• examine the compilation and execution phases of the DATA step when reading a raw data file
• explicitly define the length of a variable
• use informats to read character data and nonstandard data
• subset observations and add permanent attributes
• create a SAS data set from instream data
• identify data errors
• use the DSD option to read consecutive delimiters as missing values
• use the MISSOVER option to recognize missing values as the end of a record

Introduction to Reading Raw Data Files


Suppose that you have information about Orion Star sales employees from Australia and the United States stored in a
raw data file, or flat file. You need to create a SAS data set from this data source. As a programmer, you first need to be
able to identify the layout and type of information in the raw data file.

Raw Data Files


A raw data file can be a text file, a CSV file, or an ASCII file. Basically, a raw data file is an external text file that contains
one record per line, and a record typically contains multiple fields. Raw data files are not software specific. Fields in a
raw data file can be delimited or arranged in fixed columns.

Fields in a delimited raw data file are identified by their sequential order and the data values are separated by spaces or
other special characters. That is, the data is not arranged in columns. A given field might begin in a different column in
every record and have varying widths, but will be in the same relative position.

The delimited raw data file shown below contains information about the Orion Star sales employees. Notice that
there are no column headings. Typically, you'll have external documentation called a record layout to explain
the values. For example, suppose you see the number 87975. Is it an ID, a salary, a phone extension? Without a
record layout, it is sometimes impossible to know.

Partial sales.csv

1---5---10---15---20---25---30---35---40---45---50---55---60---65---70---75
120102,Tom,Zhou,M,108255,Sales Manager,AU,11AUG1973,06/01/1993
120103,Wilson,Dawes,M,87975,Sales Manager,AU,22JAN1953,01/01/1978
120121,Irenie,Elvish,F,26600,Sales Rep. II,AU,01AUG1948,01/01/1978
120122,Christina,Ngan,F,27475,Sales Rep. II,AU,27JUL1958,07/01/1982

Fields in a fixed-column raw data file are identified by their starting and ending column. A given field will begin in the
same column and have the same width in every record. In this course, you’ll learn to work with delimited raw data files.

1---5---10---15---20---25---30---35---40---45---50---55---60---65---
120102Tom       Zhou   Sales Manager 108255AU
120103Wilson    Dawes  Sales Manager  87975AU
120121Irenie    Elvish Sales Rep. II  26600AU
120122Christina Ngan   Sales Rep. II  27475AU

Question
Which of the following statements correctly describes a delimited raw data file? Select all that apply.
a. It is external to SAS.
b. It is not software-specific.
c. The values are arranged in columns.
d. The values are separated by spaces or other characters.
e. The values are labeled with field names.

The correct answer is a, b, and d. A delimited raw data file is an external text file in which the values are separated by
spaces or other special characters. The file is not software-specific.

Reading Raw Data Files


In order for SAS to read a raw data file, you must specify the following information about each field: the location of the
data value in the record, the name of the SAS variable in which to store the data, and the type of SAS variable. You can
use different techniques, or styles, for reading raw data files in SAS.

Specification     Type of Data                  Arrangement
list input        standard and/or nonstandard   separated by delimiter
column input      standard                      in columns
formatted input   standard and/or nonstandard   in columns

List input can read standard or nonstandard data, and the values must be separated by a delimiter. If your raw data file
is arranged in columns rather than being delimited, you use either column input or formatted input. Column input reads
standard data arranged in columns. Formatted input reads standard and/or nonstandard data arranged in columns.

In this course, you’ll use list input to work with delimited raw data files that contain both standard and nonstandard
data.
Standard and Nonstandard Data
So what is the difference between standard and nonstandard data? Standard data is data that SAS can read without any
special instructions. Nonstandard data includes values like dates or numeric values that include special characters like
dollar signs. Here are a few examples of standard and nonstandard numeric data.

Standard Data   Nonstandard Data
58              (23)
67.23           5,823
5.67E5          $67.23
-23             01/12/2010
00.99           12May2009
1.2E-2

SAS needs extra instructions to read nonstandard data.
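
Those extra instructions are informats, which this lesson introduces later. As a preview, the sketch below converts a few of the sample values with the INPUT function and writes the results to the log. The INPUT function and the specific informat widths are choices made for this illustration and are not part of the lesson's own programs.

```sas
data _null_;
   /* Standard numeric text: the plain w. informat reads it as is. */
   std1 = input('67.23', 8.);
   std2 = input('5.67E5', 8.);

   /* Nonstandard text: the informat tells SAS how to interpret it. */
   amt1 = input('5,823', comma5.);
   amt2 = input('$67.23', comma6.);
   amt3 = input('(23)', comma4.);
   day1 = input('01/12/2010', mmddyy10.);
   day2 = input('12May2009', date9.);

   /* Write the converted values to the log for inspection. */
   putlog std1= std2= amt1= amt2= amt3=;
   putlog day1= date9. day2= date9.;
run;
```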

Reading Standard Delimited Data


Now that you understand raw data files a bit, you’re ready to start your task. You need to create a new SAS data set
from the raw data file sales.csv. This is a comma-delimited raw data file that contains standard data values, so you’ll use
list input to read the file.

Using the DATA Step to Read Raw Data


You know that you can use a DATA step with a SET statement to read input data from a SAS data set and create a new
SAS data set.

DATA output-SAS-data-set;
SET input-SAS-data-set;
RUN;

You can also use the DATA step to create a SAS data set from a raw data file, but the syntax is slightly different. Instead
of the SET statement, you use the INFILE statement and the INPUT statement.

DATA output-SAS-data-set;
INFILE 'raw-data-file-name';
INPUT specifications;
RUN;
The INFILE statement identifies the physical name and location of the raw data file. The INPUT statement describes the
arrangement of values in the raw data file and assigns input values to the corresponding SAS variables.

The INFILE Statement


Let's take a closer look at the INFILE statement.

data work.sales1;
   infile "&path/sales.csv" dlm=',';
   input Employee_ID First_Name $
         Last_Name $ Gender $ Salary
         Job_Title $ Country $;
run;

The INFILE statement identifies the location of the external text file that contains the input data. You specify the full
path and filename, including the extension, for the raw data file sales.csv. As you can see, we're using the &path macro
variable reference, which makes our program more flexible. Be sure to use double quotation marks when referencing a
macro variable within a quoted string.

Here's a question. In our raw data file, what separates the data values?

Partial sales.csv

1---5---10---15---20---25---30---35---40---45---50---55---60---65---70---75
120102,Tom,Zhou,M,108255,Sales Manager,AU,11AUG1973,06/01/1993
120103,Wilson,Dawes,M,87975,Sales Manager,AU,22JAN1953,01/01/1978
120121,Irenie,Elvish,F,26600,Sales Rep. II,AU,01AUG1948,01/01/1978
120122,Christina,Ngan,F,27475,Sales Rep. II,AU,27JUL1958,07/01/1982

Commas separate the values in sales.csv. SAS considers a space, or blank, to be the default delimiter between values in
a delimited raw data file. If your file uses any other character to separate data values, you need to indicate what the
delimiter is in the INFILE statement. You use the DLM= option to specify an alternate delimiter.

DATA output-SAS-data-set;
INFILE 'raw-data-file-name' DLM='delimiter';
INPUT specifications;
RUN;
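
As a rough illustration of the DLM= option with other delimiters, the steps below read a pipe-delimited file and a tab-delimited file. Both file names and their layouts are hypothetical, and the hexadecimal constant '09'x assumes an ASCII-based system, where it represents the tab character.

```sas
/* Hypothetical pipe-delimited file, for example: 120102|Tom|Zhou|M|108255 */
data work.sales_pipe;
   infile "&path/sales_pipe.txt" dlm='|';
   input Employee_ID First_Name $ Last_Name $ Gender $ Salary;
run;

/* Hypothetical tab-delimited file; '09'x is the tab character in ASCII. */
data work.sales_tab;
   infile "&path/sales_tab.txt" dlm='09'x;
   input Employee_ID First_Name $ Last_Name $ Gender $ Salary;
run;
```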

INPUT Statement for List Input


Now let's examine the INPUT statement. The INPUT statement names the variables in the output data set and provides a
description of the data values to be read and the type of each variable to be created. Standard data fields require only a
variable name and type.

INPUT variable1 <$> variable2 <$> ... variableN <$>;


In the INPUT statement, you specify the variables in the order that they appear in the raw data file, from left to right.
You specify a dollar sign after any variable name that you want to create as a character variable. SAS creates the
variables in the PDV in the same order and case in which they are specified in the INPUT statement. Remember that you
must follow the SAS naming conventions when you name variables.

Why do you think this INPUT statement includes only seven variables even though the raw data file contains nine values
in each record? It's because the last two values in each record are date values, which are not standard values. You learn
to deal with date values later in this lesson.

What’s important to know at this point is that SAS will read the fields in the order in which they appear in the raw data
file, and you cannot skip over fields. You don’t have to read all the fields, but you must read up to the last one that you
need.

Lastly, when using list input, the default length for all variables is 8 bytes, regardless of type.
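
To make the rule about field order concrete, here is a small sketch that needs only Employee_ID and Salary from sales.csv. Because Salary is the fifth field, the step still reads the first five fields in order and then drops the ones it does not need. The data set name work.salaries is made up for this example.

```sas
data work.salaries;
   infile "&path/sales.csv" dlm=',';
   /* Salary is the fifth field, so fields 1 through 5 must all be read. */
   /* Fields 6 through 9 are not named, so SAS does not read them.       */
   input Employee_ID First_Name $ Last_Name $ Gender $ Salary;
   /* Discard the variables that were read only to reach Salary. */
   drop First_Name Last_Name Gender;
run;
```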

Question
Suppose you want to write a DATA step that reads a raw data file. Do you need to use a LIBNAME statement to assign a
libref to the directory in which the raw data file is stored?

a. yes
b. no

The correct answer is b. A libref is used to access SAS data sets in a SAS data library. The INFILE statement references the
raw data file, so you do not need to use a libref to point to it.

Code Challenge
Write an INPUT statement that uses list specification to name the variables Country, First_Name, and Salary,
in that order, in the output data set.

1---5---10---15---20---25
AU,Tom,108255
AU,Christina,24475
AU,Irenie,26600

input Country $ First_Name $ Salary;


The INPUT statement reads the data fields sequentially, and specifies the variable name and type. The dollar sign
indicates a character variable.

Creating a SAS Data Set from a Delimited Raw Data File


In this demonstration, you read the delimited raw data file sales.csv and create the new SAS data set work.sales1.

1. Copy and paste the following program into the editor. The DATA step specifies the variables Employee_ID,
First_Name, Last_Name, Gender, Salary, Job_Title, and Country. Remember that SAS will read the fields in the
order in which they appear in the delimited file, and you cannot skip over fields.
data work.sales1;
   infile "&path/sales.csv" dlm=',';
   input Employee_ID First_Name $
         Last_Name $ Gender $ Salary
         Job_Title $ Country $;
run;

2. Submit the step and then check the log. The log shows that the code ran successfully. SAS read 165 records from
sales.csv, and work.sales1 contains 165 observations and 7 variables.

3. Copy and paste this PROC PRINT step into the editor to view the new data set.

proc print data=work.sales1;
run;

4. Submit this step and examine the report. Take a look at the values for Job_Title. Do you see anything strange?
The values are truncated. In fact, some values for First_Name and Last_Name are also getting truncated in this
report. You'll learn how to correct this problem next.

How SAS Processes the DATA Step


Let’s investigate how SAS processes the DATA step when reading a raw data file. During the compilation phase, SAS
scans each DATA step statement for syntax errors. Syntax errors include missing or misspelled keywords, invalid variable
names, missing or invalid punctuation, and invalid options.

data work.sales1;
   infile "&path/sales.csv" dlm=',';
   input Employee_ID First_Name $ Last_Name $
         Gender $ Salary Job_Title $ Country $;
run;

SAS also creates an input buffer to hold a record from the raw data file. The input buffer is an area of memory that SAS
creates only when reading raw data, not when reading a SAS data set.

Then SAS creates the PDV. Remember, the PDV is an area of memory where SAS builds an observation. Remember also
that the PDV contains two automatic variables that are not written to the output data set: _N_ is the iteration counter,
and _ERROR_ signals the occurrence of a data error during that iteration of the DATA step.

PDV

_N_   _ERROR_   Employee_ID   First_Name   Last_Name   Gender   Salary   Job_Title   Country
                N8            $8           $8          $8       N8       $8          $8

As the INPUT statement compiles, SAS adds a slot to the PDV for each variable in the new data set. Generally, SAS
determines variable attributes such as length and type the first time it encounters a variable. With list input, the default
length for all variables is 8 bytes. Hmmm…is that why some variable values were truncated? Finally, SAS creates the
descriptor portion of the output data set.

Descriptor portion of work.sales1

Employee_ID   First_Name   Last_Name   Gender   Salary   Job_Title   Country
N8            $8           $8          $8       N8       $8          $8

Execution Phase
If the DATA step compiles successfully, then the execution phase begins. At the beginning of the execution phase, SAS
initializes the PDV. _N_ is set to 1, _ERROR_ is set to 0, and every other variable in the PDV is set to missing. Remember,
SAS represents missing numeric values with periods, and it represents missing character values with blanks.

PDV

_N_   _ERROR_   Employee_ID   First_Name   Last_Name   Gender   Salary   Job_Title   Country
                N8            $8           $8          $8       N8       $8          $8
1     0         .                                               .

In the first iteration of the DATA step, SAS reads a record from the input data file and holds it in the input buffer.

1---5---10---15---20---25---30---35---40---45---50---55---60---65---70---75
120102,Tom,Zhou,M,108255,Sales Manager,AU,11AUG1973,06/01/1993

SAS reads from non-delimiter to delimiter (the comma) and assigns the data value to the corresponding variable,
Employee_ID, in the PDV. So, the value is converted from text to a floating point numeric value and copied into the PDV.

PDV

_N_   _ERROR_   Employee_ID   First_Name   Last_Name   Gender   Salary   Job_Title   Country
                N8            $8           $8          $8       N8       $8          $8
1     0         120102                                          .

SAS skips over the delimiter (the comma) and begins at the next non-delimiter, again, reading until it reaches a
delimiter, and then assigns the data value to the corresponding variable, First_Name, in the PDV. The text value is
copied to the PDV without conversion.

PDV

_N_   _ERROR_   Employee_ID   First_Name   Last_Name   Gender   Salary   Job_Title   Country
                N8            $8           $8          $8       N8       $8          $8
1     0         120102        Tom                               .

This process continues for all variables in the INPUT statement. At the bottom of the DATA step, SAS writes the values
from the PDV to the new SAS data set. Remember, _N_ and _ERROR_ are not included in the new SAS data set. Then
control returns to the top of the DATA step for the next iteration. Here is the information in the output data set after the
first iteration of the DATA step. Notice that the value for Job_Title is already truncated.

work.sales1

Employee_ID   First_Name   Last_Name   Gender   Salary   Job_Title   Country
120102        Tom          Zhou        M        108255   Sales Ma    AU

Control returns to the top of the DATA step, and _N_ is incremented to 2. SAS reinitializes the variables in the PDV to
missing for all of the values being read from the raw data file before the next iteration begins.

PDV

_N_   _ERROR_   Employee_ID   First_Name   Last_Name   Gender   Salary   Job_Title   Country
                N8            $8           $8          $8       N8       $8          $8
2     0         .                                               .

Remember that variables from a SAS data set are not reinitialized, but in this case, no variables are coming from a data
set. They are all new variables, and are therefore all reinitialized. If we were creating other new variables in this DATA
step, they too would be reinitialized.

SAS reads a record from the raw data file each time the INPUT statement executes, so now it reads the second record
from sales.csv into the input buffer.

PDV

_N_   _ERROR_   Employee_ID   First_Name   Last_Name   Gender   Salary   Job_Title   Country
                N8            $8           $8          $8       N8       $8          $8
2     0         120103        Wilson       Dawes       M        87975    Sales Ma    AU

Once again, SAS reads from non-delimiter to delimiter, assigning data values to the variables named in the INPUT
statement. At the bottom of the DATA step, SAS writes the second observation to the new data set and control returns
to the top of the DATA step. Execution continues in this way until there are no more records in the raw data file to
read.
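
One way to watch this iteration-by-iteration behavior is to write a few PDV values to the log from inside the DATA step. The sketch below is not part of the lesson's demonstrations; it assumes the same sales.csv file, uses the made-up data set name work.trace, and limits the run to three records with the OBS= option.

```sas
data work.trace;
   /* OBS=3 stops reading after the third record of the raw data file. */
   infile "&path/sales.csv" dlm=',' obs=3;
   input Employee_ID First_Name $ Last_Name $
         Gender $ Salary Job_Title $ Country $;
   /* Write selected PDV values to the log once per iteration. */
   putlog _N_= _ERROR_= Employee_ID= Job_Title=;
run;
```

The log should show _N_ increasing by one on each iteration, along with the truncated Job_Title values discussed above.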

Question
Which of the following statements is true?
a. SAS creates an input buffer only if reading data from a raw data file.
b. At compile time, the PDV holds the variable name, type, length, and initial value.
c. The descriptor portion of the data set is the first item that SAS creates during the compilation phase.
The correct answer is a. SAS uses an input buffer only if the input data is a raw data file. At compile time, the PDV holds
the variable name, type, and length, but not the initial value. The descriptor portion is the last item created during
compilation.

Question
Which statement is true of a DATA step when reading from a raw data file?

a. SAS reads data from the raw data file into the PDV.
b. The size of the input buffer adjusts automatically based on the length of the input record.
c. At the bottom of the DATA step, SAS writes the contents of the PDV to the output SAS data set.

The correct answer is c. When reading a raw data file, SAS reads the data from the file into the input buffer, and the
input buffer is an area of memory whose default length depends on the operating system. The only true statement here
is that at the bottom of the DATA step, SAS writes the contents of the PDV to the output data set.

Business Scenario
In the last demonstration, you saw that the values for Job_Title were truncated when SAS read them out of sales.csv.
That's because, with list input, SAS creates each variable with a length of 8 bytes, regardless of the type of variable. The
values for Job_Title are longer than 8 characters in the raw data file, so they are truncated in the new SAS data set. You
want to alter your DATA step to specify the correct length for Job_Title and other variables whose values are either
longer or shorter than 8 bytes.

Using the LENGTH Statement


You use a LENGTH statement in your DATA step to explicitly define the length of a character variable.

LENGTH variable(s) <$> length;

Remember, character variables can have a length of 1 to 32,767. We will leave the default length of 8 for the numeric
variables; 8 bytes is large enough to hold 16 to 17 significant digits.

The LENGTH statement begins with the keyword LENGTH. Then you specify the variable name, a dollar sign if it is a
character variable, and the length. You can specify multiple variables in one LENGTH statement. For example, this
LENGTH statement assigns a length of 12 to two variables, First_Name and Last_Name, and a length of 1 to the variable
Gender. Notice that you only need one dollar sign and length specification for the first two variables.

length First_Name Last_Name $ 12 Gender $ 1;

The LENGTH statement below assigns different lengths to three variables: First_Name, Last_Name, and Gender. Notice
that each variable is followed by a dollar sign and the length.

length First_Name $ 12 Last_Name $ 18 Gender $ 1;

Remember that SAS determines variable attributes such as name, length, and type the first time it encounters a variable
during compilation. So for the LENGTH statement to define the length for the variables in the output data set, it needs to
precede the INPUT statement in the DATA step. Also, make sure that you type the variable names exactly as you want
them to be stored in the data set. If you type the variable names in lowercase in the LENGTH statement, SAS will store
them in lowercase in the data set.

data work.sales2;
length First_Name $ 12 Last_Name $ 18
Gender $ 1 Job_Title $ 25
Country $ 2;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name $ Last_Name $
Gender $ Salary Job_Title $ Country $;
run;
PDV

First_Name   Last_Name   Gender   Job_Title   Country
$ 12         $ 18        $ 1      $ 25        $ 2

As the INPUT statement compiles, SAS adds a slot to the PDV for each variable that was not already listed in the LENGTH
statement.

PDV

First_Name   Last_Name   Gender   Job_Title   Country   Employee_ID   Salary
$ 12         $ 18        $ 1      $ 25        $ 2       N 8           N 8
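
To see both effects on a small scale, here is a minimal sketch that reads one instream record instead of sales.csv. It uses the DATALINES statement, which is covered later in this lesson. The data set name work.length_demo and the variable Short_Title are hypothetical and exist only for illustration. Short_Title relies on the default length of 8, and Job_Title is declared in a LENGTH statement that precedes the INPUT statement.

```sas
/* Sketch only: work.length_demo and Short_Title are hypothetical names. */
data work.length_demo;
   /* Declared first, so Job_Title gets a length of 25 and is first in the PDV */
   length Job_Title $ 25;
   infile datalines dlm=',';
   /* Short_Title has no LENGTH entry, so list input gives it a length of 8 */
   input Employee_ID Short_Title $ Job_Title $;
   datalines;
120102,Sales Manager,Sales Manager
;
run;

proc contents data=work.length_demo;
run;

proc print data=work.length_demo;
run;
```

Short_Title should be truncated to its first 8 characters, and Job_Title should hold the full value. Job_Title should also be the first column in the PROC PRINT report, because it was the first variable added to the PDV.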

Question
Which of the LENGTH statements below creates the character variable SalesRep with a length of 18?

a. length $ SalesRep 18;
b. length SalesRep 18;
c. length SalesRep $ 18;
d. length SalesRep=18;

The correct answer is c. For character variables, you specify the variable name, followed by the dollar sign, and then
specify the length.

Question
Which of the DATA steps below creates the variable SalesRep with a length of 18?
a.
data work.mydata;
infile 'filepath/sales.csv';
length SalesRep $ 18;
input Amount SalesRep $ Customer $;
run;

b.
data work.mydata;
infile 'filepath/sales.csv';
input Amount SalesRep $ Customer $;
length SalesRep $ 18;
run;

The correct answer is a. The LENGTH statement must precede the INPUT statement in order to correctly set the length
of the variable.

Specifying the Lengths of Variables Explicitly


In this demonstration, you specify the lengths of variables explicitly in a DATA step.

1. Copy and paste the following program into the editor. The LENGTH statement defines lengths for the variables
First_Name, Last_Name, Gender, Job_Title, and Country. The PROC CONTENTS step will print the attributes of
the variables, and the PROC PRINT step will print the new data set.

data work.sales2;
length First_Name $ 12 Last_Name $ 18
Gender $ 1 Job_Title $ 25
Country $ 2;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name $ Last_Name $
Gender $ Salary Job_Title $ Country $;
run;

proc contents data=work.sales2;
run;

proc print data=work.sales2;
run;

2. Submit the program and then check the log. The log shows that the code ran without errors.

3. Now look at the results. In the PROC CONTENTS output, you can see that each character variable has been
assigned the length you specified. The numeric variables Employee_ID and Salary are unchanged. In the PROC
PRINT results, the character values are no longer truncated, but the order of the variables has changed. Do you
know why? It’s because the order of the variables in the PDV has changed.

If you look back at your code, you’ll see that the variables in the LENGTH statement are created first, in the
order they are listed. Then SAS proceeds to scan the INPUT statement to see if any other variables are listed.
Employee_ID and Salary are then added to the PDV in the last two columns and, therefore, are in the last two
columns of our data set.

Suppose you want the order of the variables in work.sales2 to match the order of the fields in sales.csv. You
can do this by including the numeric variables in the LENGTH statement.

4. In the editor, modify the LENGTH statement and include Employee_ID and Salary with a length of 8. Why do
you need to specify the length though? Numeric variables have a default of 8 bytes, right? If you don't specify
the length, SAS will assume that the length of 12 is assigned to both Employee_ID and First_Name, and SAS will
read Employee_ID as a character variable. Also, although numeric variables can be smaller than 8, it's a good
idea not to change the length.

Next, in the PROC CONTENTS step, add the VARNUM option to display the variables in their creation order.

data work.sales2;
length Employee_ID 8 First_Name $ 12
Last_Name $ 18 Gender $ 1
Salary 8 Job_Title $ 25
Country $ 2;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name $ Last_Name $
Gender $ Salary Job_Title $ Country $;
run;

proc contents data=work.sales2 varnum;
run;

proc print data=work.sales2;
run;

5. Submit the revised program and examine the results. The PROC CONTENTS shows the variables in the order of
the fields in sales.csv, and the work.sales2 data set does as well. The variables have the correct lengths
assigned.

Activity
Copy and paste the following program into the editor. Submit the program and view the log.

Reminder: Make sure you've defined the orion library.

data work.nonsales2;
infile "&path/nonsales.csv" dlm=',';
input Employee_ID First $ Last;
run;

proc print data=work.nonsales2;
run;

Which statement best describes the reason for the log messages, as well as the missing values for Last in the results?
a. The data in the raw data file is bad.
b. The programmer incorrectly read the data.

The correct answer is b. The programmer read the data incorrectly. The raw data values for Last are character values, as
they should be. But the INPUT statement doesn't specify Last as a character variable, so SAS read it as numeric.

Reading Nonstandard Delimited Data


You've seen how to read the comma-delimited standard data values in sales.csv using the INPUT statement. You’ve also
seen how you can explicitly define the lengths for variables using the LENGTH statement. However, the program you
wrote ignored the last two data values in each record. In other words, they were excluded from the data set. These
values are both dates, and are nonstandard values. You now need to include these date values in the output data set
that you create from sales.csv.

Let’s see how you can include both standard and nonstandard values from a comma-delimited raw data file, and how
you can define lengths for variables without using the LENGTH statement.

Using Modified List Input

Your task is to create a temporary SAS data set by reading both standard and nonstandard data values. Remember that
nonstandard data is data that SAS cannot read without extra instructions. You can use what’s called modified list input to
read all of the fields from sales.csv.

Partial sales.csv

1---5---10---15---20---25---30---35---40---45---50---55---60---65---70---75
120102,Tom,Zhou,M,108255,Sales Manager,AU,11AUG1973,06/01/1993
120103,Wilson,Dawes,M,87975,Sales Manager,AU,22JAN1953,01/01/1978
120121,Irenie,Elvish,F,26600,Sales Rep. II,AU,01AUG1948,01/01/1978
120122,Christina,Ngan,F,27475,Sales Rep. II,AU,27JUL1958,07/01/1982

Let’s first look at how you can use modified list input to read the standard character variables. With modified list input,
you can use informats and the colon format modifier to specify the length of the character variables.

INPUT variable <$> variable<:informat>;

Let’s take a look at how this works. Here’s one record from the sales.csv file, and an INPUT statement that lists all of the
variables to be read.

1---5---10---15---20---25---30---35---40---45---50---55---60---65---70---75
120102,Tom,Zhou,M,108255,Sales Manager,AU,11AUG1973,06/01/1993

data work.sales2;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name :$12.
Last_Name :$18. Gender :$1. Salary
Job_Title :$25. Country :$2.;
run;

Notice that First_Name is followed by a colon, a dollar sign, the number 12, and a period. Doesn’t $12. look like a
format? Well, it's called an informat.

Informats are similar to formats except that formats provide instruction on how to write a value, and in general,
informats provide instruction on how to read a value. But in this case, the informat is used to specify the length of a
variable. The $12. informat tells SAS to create First_Name in the PDV as a character variable with a length of 12.
Informats also tell SAS how many characters to read, so this would cause SAS to read 12 characters for First_Name.

But wait! In this input record, the first name is Tom, so it is only three characters long. That’s where the colon format
modifier comes in. It tells SAS to read only until it reaches a delimiter, in this case, the comma following T-O-M.
Remember, the $12. informat specifies the type and length of the variable, and the colon format modifier causes SAS to
read up to the delimiter.
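
The behavior described above can be tried on a small scale with instream data. This is a minimal sketch, not part of the course files: the data set name work.colon_demo is hypothetical, and the two records are shortened copies of rows from sales.csv. It uses the DATALINES statement, which is covered later in this lesson.

```sas
/* Sketch only: work.colon_demo is a hypothetical data set name. */
data work.colon_demo;
   infile datalines dlm=',';
   /* :$12. and :$18. set the type and length; the colon makes SAS */
   /* stop reading each value at the next comma.                   */
   input Employee_ID First_Name :$12. Last_Name :$18.;
   datalines;
120102,Tom,Zhou
120122,Christina,Ngan
;
run;

/* Shows the lengths 12 and 18 assigned by the informats */
proc contents data=work.colon_demo;
run;

proc print data=work.colon_demo;
run;
```

PROC CONTENTS should report a length of 12 for First_Name and 18 for Last_Name, and the nine-character name Christina should appear in full rather than being cut off at 8 characters.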

Exploring Modified List Input Processing

Now let’s see what happens if you omit the colon format modifier.

data work.sales2;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name $12.
Last_Name :$18. Gender :$1. Salary
Job_Title :$25. Country :$2.;
run;

Notice that only one of the character variables, First_Name, does not include the colon format modifier. The
other character variables do. Let’s walk through how SAS reads this data. For demonstration purposes, we’ll
leave out the input buffer processing.

Partial sales.csv

1---5---10---15---20---25---30---35---40---45---50---55---60---65---70---75
120102,Tom,Zhou,M,108255,Sales Manager,AU,11AUG1973,06/01/1993

To start, SAS reads the first record from non-delimiter to delimiter and stores 120102 as the value for the numeric
variable Employee_ID. Next, because we omitted the colon format modifier with the variable First_Name, SAS reads 12
characters; it doesn’t stop at the delimiters. SAS assigns the value Tom,Zhou,M,1 to First_Name. Omitting the colon
format modifier has caused unexpected results.

For Last_Name, we included the colon format modifier and specified 18 characters. SAS reads from the current position,
the 0, to the next delimiter and assigns 08255 to Last_Name. Wait, how can SAS store a numeric value in a character
variable? Remember, a character variable can hold any data.

The $1. informat defines Gender as a character variable with a length of 1. Because of the colon format modifier, SAS
reads up to the next delimiter, reading Sales Manager. Gender can only hold one character, so the S is stored and the
remaining letters are truncated.

For the numeric variable Salary, SAS begins at the next non-delimiter and reads AU, but SAS can’t convert AU to a
numeric value, so the result is a missing value.

Then SAS reads from the next non-delimiter to delimiter and assigns 11AUG1973 to Job_Title. Lastly, we told SAS to
store two characters for Country. So SAS reads the entire date value, but only stores 06.

The resulting PDV is clearly not accurate. Hopefully, from this example, you can see how important the colon format
modifier is.

PDV

Employee_ID   First_Name     Last_Name   Gender   Salary   Job_Title   Country
N 8           $ 12           $ 18        $ 1      N 8      $ 25        $ 2
120102        Tom,Zhou,M,1   08255       S        .        11AUG1973   06

Reading Nonstandard Numeric Data


Now let’s learn how to read the nonstandard numeric data in sales.csv. An informat is required to read nonstandard
numeric data. We need to give SAS an idea of what to expect from these values. For example, with the date 06/01/1993,
is 06 the month or the day? Is 01 the month or the day?
We’ll use informats to specify the style of the date fields so that they can be read and converted to SAS dates. The layout
for the birth date values is numeric day, character month, and numeric year. We can use the colon format modifier and
the DATE. informat to describe these values. The DATE. informat reads date values in the form ddmmm followed by a
two- or four-digit year.

input Employee_ID First_Name $ Last_Name $
Gender $ Salary Job_Title $ Country $
Birth_Date :date. Hire_Date;

The layout for the hire date values is numeric month, numeric day, and numeric year, so we can use the MMDDYY.
informat for those values. The MMDDYY. informat reads date values in the form mmdd followed by a two- or four-digit
year, with or without slashes.

input Employee_ID First_Name $ Last_Name $
Gender $ Salary Job_Title $ Country $
Birth_Date :date. Hire_Date :mmddyy.;

SAS Informats
Let’s take a closer look at informats. SAS informats use the form shown here: a dollar sign for character informats, the
name of the informat, an optional width, followed by a dot.

<$><informat><w>.

The width is typically not used when reading numeric values in list input because SAS will read each field until it
encounters a delimiter.

This table shows several SAS informats for nonstandard numeric values and their definitions. Take a moment to read
through these definitions.

Informat      Definition

COMMA.        reads nonstandard numeric data and removes embedded commas,
DOLLAR.       blanks, dollar signs, percent signs, and dashes.

COMMAX.       reads nonstandard numeric data and removes embedded non-numeric
DOLLARX.      characters; reverses the roles of the decimal point and the comma.

EUROX.        reads nonstandard numeric data and removes embedded non-numeric
              characters in European currency.

$CHAR.        reads character values and preserves leading blanks.

$UPCASE.      reads character values and converts them to uppercase.

COMMA. and DOLLAR. read nonstandard numeric data and remove embedded commas, blanks, dollar signs, percent
signs, and dashes. COMMAX. and DOLLARX. read nonstandard numeric data and remove embedded non-numeric
characters; these informats also reverse the roles of the decimal point and the comma.

EUROX. reads nonstandard numeric data and removes embedded non-numeric characters in European currency. $CHAR.
reads character values and preserves leading blanks. $UPCASE. reads character values and converts them to uppercase.

This table shows several specific examples of SAS informats and their effect on raw data values.

Informat      Raw Data Value   SAS Data Value

COMMA.        $12,345          12345
DOLLAR.

COMMAX.       $12.345          12345
DOLLARX.

EUROX.        €12.345          12345

$CHAR.        ##Australia      ##Australia

$UPCASE.      au               AU

The pound character in the fourth row represents a blank. In the first example, COMMA. or DOLLAR. can be used to read
the raw data value. The dollar sign and comma are removed, and the result is the numeric value 12345.

In the second example, either COMMAX. or DOLLARX. can be used to read the raw data value. With these informats, the
period is not treated as a decimal point, but rather as a separator between each group of three digits. When the raw
data value is read, the dollar sign and period are removed, resulting in the same numeric value, 12345.

Similarly, in the third example, EUROX. treats the period as a separator between each group of three digits. When the
raw data value is read, the euro sign and period are removed, again resulting in the numeric value, 12345.

When reading character data, leading blanks are removed. But in the fourth example, the $CHAR. informat preserves
the leading blanks in the SAS data value. In the last example, the characters au are read in and converted to uppercase.
Notice the resulting SAS data value is uppercase AU.
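
A few of these informats can be tried with one instream record. The sketch below is illustrative only: the data set name work.informat_demo and the variable names are hypothetical. The values are separated by blanks, so the commas and periods inside them are treated as part of each value. It uses the DATALINES statement, which is covered later in this lesson.

```sas
/* Sketch only: work.informat_demo and its variable names are hypothetical. */
data work.informat_demo;
   /* The default delimiter is a blank, so no INFILE statement is needed */
   input Amount_US :comma.       /* removes the dollar sign and comma        */
         Amount_EU :commax.      /* treats the period as a digit separator   */
         Country   :$upcase2.;   /* converts the value to uppercase          */
   datalines;
$12,345 $12.345 au
;
run;

proc print data=work.informat_demo;
run;
```

Both amounts should be stored as the numeric value 12345, and Country should be stored as AU.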

This table shows several specific SAS informats for date values and their effect on raw data values.

Informat      Raw Data Value   SAS Data Value

MMDDYY.       010160           0
              01/01/60
              01/01/1960
              1/1/1960

DDMMYY.       311260           365
              31/12/60
              31/12/1960

DATE.         31DEC59          -1
              31DEC1959

Notice that the informat can read a variety of raw data widths and achieve the same SAS data value. In the first section,
each raw data value represents January 1, 1960. Some of the values have separators, and some don’t. Some have two-
digit years, others have four-digit years, but it doesn’t matter. The MMDDYY. informat tells SAS to expect a value for
month, followed by day and then year, and all result in the same SAS date, 0.

The dates in the second section all represent December 31, 1960. The DDMMYY. informat is specified to let SAS know to
expect a value for day, then for month, and then for year. All result in the SAS date 365. Finally, the dates in the third
section are a different length, but both represent December 31, 1959. When read using the DATE. informat, both result
in the SAS date value of -1.
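
The rows of this table can be checked with a short instream test. This is a sketch under stated assumptions: the data set name work.date_demo and the variable names are hypothetical, and the two-digit years are assumed to fall in the 1900s, which depends on the YEARCUTOFF= system option in your session. It uses the DATALINES statement, which is covered later in this lesson.

```sas
/* Sketch only: work.date_demo and its variable names are hypothetical. */
data work.date_demo;
   infile datalines dlm=',';
   /* One column per informat; the colon makes SAS read each field */
   /* up to the comma regardless of how wide the value is.         */
   input US_Date :mmddyy. EU_Date :ddmmyy. Text_Date :date.;
   datalines;
010160,311260,31DEC59
01/01/1960,31/12/1960,31DEC1959
1/1/1960,31/12/60,31DEC1959
;
run;

/* No formats are applied, so the raw SAS date values are printed */
proc print data=work.date_demo;
run;
```

Under that assumption, every observation should show 0 for US_Date, 365 for EU_Date, and -1 for Text_Date.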

Question
Which of these statements is not true about a SAS informat?
a. A SAS informat provides instructions for reading nonstandard values.
b. A SAS informat provides a length for character variables.
c. A SAS informat must include a period.
d. A SAS informat controls the way nonstandard data values are stored in a SAS data set.
e. A SAS informat must include a numeric value to specify the width of the input field.

The correct answer is e. The width is typically not used when reading numeric values in list input because SAS will read
each field until it encounters a delimiter. When you use an informat with the colon format modifier, SAS ignores the
width and reads up to the next delimiter.

Using Modified List Input to Read Nonstandard Data


As it turns out, the date values that you're working with are not all stored with the same number of characters.

1---5---10---15---20---25---30---35---40---45---50---55---60---65---70---75
120102,Tom,Zhou,M,108255,Sales Manager,AU,11AUG1973,06/01/1993
120103,Wilson,Dawes,M,87975,Sales Manager,AU,22JAN1953,01/01/1978
120121,Irenie,Elvish,F,26600,Sales Rep. II,AU,01AUG1948,01/01/1978
120122,Christina,Ngan,F,27475,Sales Rep. II,AU,27JUL1958,07/01/1982

We know that we can use the DATE. informat to describe the data value. If you look at the SAS documentation,
however, you will see that the DATE. informat has a default width of 7, but the value in the first record is 9 characters
wide.

Once again, we can see the importance of the colon format modifier. Without it, SAS would use the default width and
read only the first seven characters. We must use the colon format modifier with the informat. The colon format
modifier tells SAS to ignore the width and read up to the next delimiter.

input Employee_ID First_Name $ Last_Name $
Gender $ Salary Job_Title $ Country $
Birth_Date :date. Hire_Date :mmddyy.;

Similarly, considering the second date field, when the colon modifier precedes it, the MMDDYY. informat enables SAS to
read any of the date values shown here.

01/07/2008 1/7/2008

1/07/2008 01/07/08

01/7/2008 1/7/08

Specifying Informats in the INPUT Statement


In this demonstration, you use informats to read both the standard and nonstandard data values from sales.csv.

1. Copy and paste the following program into the editor. This DATA step reads the standard values with modified
list input. The INPUT statement includes the last two fields, Birth_Date and Hire_Date, from the raw data file.
You read Birth_Date with the colon format modifier and the DATE. informat, and Hire_Date with the colon
format modifier and the MMDDYY. informat.

data work.sales2;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name :$12. Last_Name :$18.
Gender :$1. Salary Job_Title :$25. Country :$2.
Birth_Date :date. Hire_Date :mmddyy.;
run;

proc print data=work.sales2;
run;

2. Submit the program and check the log. You can see that the program executed successfully.

3. Examine your results. Are all of the character variables displayed properly, without truncation? Yes, they all look
good. What about the Birth_Date and Hire_Date values? You can see that the data set does include the two
new date variables, but the values don't look like dates. Oh, that’s right! Remember that these are SAS date
values, which are stored as numbers. To display them as recognizable dates in reports, you would need to
add formats.

As you’ve seen, using modified list input in your program gave SAS the proper instruction to correctly read in the raw
data.
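
As a minimal sketch of adding formats to the date values, the following step reads two shortened records from sales.csv as instream data and applies formats in the PROC PRINT step. The data set name work.dates_fmt is hypothetical, and DATE9. and MMDDYY10. are just one possible choice of formats. The DATALINES statement is covered later in this lesson.

```sas
/* Sketch only: work.dates_fmt is a hypothetical data set name. */
data work.dates_fmt;
   infile datalines dlm=',';
   /* The informats convert the raw text to numeric SAS date values */
   input Employee_ID Birth_Date :date. Hire_Date :mmddyy.;
   datalines;
120102,11AUG1973,06/01/1993
120103,22JAN1953,01/01/1978
;
run;

proc print data=work.dates_fmt;
   /* The formats change only how the stored numbers are displayed */
   format Birth_Date date9. Hire_Date mmddyy10.;
run;
```

With these formats, the first observation should display 11AUG1973 and 06/01/1993. If you remove the FORMAT statement, the same observation shows the underlying numeric date values instead.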

Business Scenario
Suppose you want your new data set work.subset to only include a subset of the fields in the sales.csv file. And what if
you want to add formats and labels to the new data set? You can do all of those things by adding other statements to
the DATA step, such as DROP or KEEP, LABEL, FORMAT, and subsetting IF.

Subsetting and Adding Permanent Attributes


In this demonstration, you subset observations and add permanent attributes to variables in a data set created from a
raw data file.

1. Copy and paste the following program into the editor.

data work.subset;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name :$12.
Last_Name :$18. Gender :$1. Salary
Job_Title :$25. Country :$2.
Birth_Date :date. Hire_Date :mmddyy.;

2. Suppose you want to see only the sales employees from Australia. You’ve seen how to use a WHERE statement
to select observations, but you can’t use a WHERE statement in this program because the input is a raw data
file, not a SAS data set. Add the following subsetting IF statement to your program to select those observations.

data work.subset;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name :$12.
Last_Name :$18. Gender :$1. Salary
Job_Title :$25. Country :$2.
Birth_Date :date. Hire_Date :mmddyy.;
if Country='AU';

3. Add a KEEP statement and include only the variables First_Name, Last_Name, Salary, Job_Title, and Hire_Date
in the output data set. To add more descriptive labels, add a LABEL statement and give Job_Title a label of Sales
Title, and Hire_Date a label of Date Hired. Lastly, format the numeric variables so that they are more
understandable. In a FORMAT statement, assign the DOLLAR12. format to Salary, and the MONYY7. format to
Hire_Date. In a PROC PRINT step, add the LABEL option so that SAS will print the new labels. Add the following
KEEP, LABEL, and FORMAT statements to your program, as well as the PROC PRINT step.

data work.subset;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name :$12.
Last_Name :$18. Gender :$1. Salary
Job_Title :$25. Country :$2.
Birth_Date :date. Hire_Date :mmddyy.;
if Country='AU';
keep First_Name Last_Name Salary
Job_Title Hire_Date;
label Job_Title='Sales Title'
Hire_Date='Date Hired';
format Salary dollar12. Hire_Date monyy7.;
run;

proc print data=work.subset label;
run;

4. Submit the program and check the log. No warnings or errors are generated.

5. Examine the results. You can see that the report contains only the variables listed in the KEEP statement. The
new labels have been added, and the values have been formatted so that you can better understand them.

Business Scenario
Throughout this lesson, you've worked with programs that use an INFILE statement to identify an external file to read as
input data. You can also read instream data, which is lines of data that you enter directly into your SAS program. Reading
instream data is extremely helpful if you want to create data and test your programming statements on a few
observations.

The DATALINES Statement


To read instream data, you can use a DATALINES statement in a DATA step. The DATALINES statement is the last
statement in the DATA step, and immediately precedes the first data line. You use a null statement, which is a single
semicolon, to indicate the end of the input data.

DATALINES;
...
;

The following program creates the variables Name and Age using instream raw data lines:

data new;
input name $ age;
datalines;
john 25
henry 55
cynthia 44
karen 21
;
run;

Question
Which of the following DATA steps correctly reads instream data as input for the data set?
a.
data work.mydata;
input Amount SalesRep $ Customer $;
datalines;
250.35 Phelps Torres
178.50 Deng Horekova
;

b.
data work.mydata;
datalines;
250.35 Phelps Torres
178.50 Deng Horekova
input Amount SalesRep $ Customer $;
;

The correct answer is a. You precede the instream data with the DATALINES statement and follow it with a null
statement. The instream data should be the last part of the DATA step except for a null statement.

Reading Instream Data

In this demonstration, you use the DATALINES statement to read instream data.

1. Copy and paste the following program into the editor. The DATA statement specifies the data set name,
work.newemps. Instream data, like raw data that is stored in external files, might have various layouts. In this
example, the data is delimited with blanks, so you use an INPUT statement for list input. Each record of this data
contains four values: first name, last name, job title, and salary. You specify a dollar sign for each character variable, as well as an
informat to read the nonstandard salary values. Next is the DATALINES statement before the lines of data, and
then a null statement after the data.

data work.newemps;
input First_Name $ Last_Name $
Job_Title $ Salary :dollar8.;
datalines;
Steven Worton Auditor $40,450
Merle Hieds Trainee $24,025
Marta Bamberger Manager $32,000
;

2. Submit the step and check the log. You can see that the data set work.newemps was created successfully with
three observations and four variables.

3. Copy and paste the PROC PRINT step into the editor and submit it.

proc print data=work.newemps;
run;

4. View the results. You can see that the values in the data set are exactly what you entered in the DATA step.

5. Copy and paste the following program into the editor. When your instream data is delimited with commas, you
use the INFILE statement with the DATALINES keyword and the DLM= option. You specify a comma as the delimiter.

data work.newemps2;
infile datalines dlm=',';
input First_Name $ Last_Name $
Job_Title $ Salary :dollar8.;
datalines;
Steven,Worton,Auditor,$40450
Merle,Hieds,Trainee,$24025
Marta,Bamberger,Manager,$32000
;

proc print data=work.newemps2;
run;

6. Submit the step and examine the results. You can see that the resulting data set looks exactly like the first data
set.

Business Scenario
Suppose you have a raw data file that includes some invalid data values. For example, the file sales3inv.csv contains
some invalid data, and you need to create the new SAS data set work.sales4 from this file.

Studying the Data


Based on your work with similar data, you know that some of the variables in your final SAS data set need to be
character, and others need to be numeric. For example, you know that Employee_ID and Salary must be numeric
variables, and First, Last, Job_Title, and Country must be character variables.

Here's part of the file that you need to read, as well as the DATA step.

Partial sales3inv.csv

1---5---10---15---20---25---30---35---40---45---50
120102,Tom,Zhou,Sales Manager,108255,AU
120103,Wilson,Dawes,Sales Manager,87975,AU
120121,Irenie,Elvish,Sales Rep. II,26600,AU
120122,Christina,Ngan,Sales Rep. II,n/a,AU
120123,Kimiko,Hotstone,Sales Rep. I,26190,AU
120124,Lucian,Daymond,Sales Rep. I,26480,12
120125,Fong,Hofmeister,Sales Rep IV,32040,AU

Now consider this question. What problems will SAS have reading the numeric data Salary and the character data
Country? The fourth record has the value n/a for Salary. This is not a numeric value. The sixth record has the value 12
for Country. Although 12 can be a valid character value, it is not a valid Country value.

When data values aren't appropriate for the SAS statements in your program, they cause data errors when the program
runs. For example, if you define a variable as numeric when the data contains character values, you create a data error.
As you might guess, anytime you use DATA, INFILE, and INPUT statements, you are likely to encounter data errors.

Reading a Raw Data File That Contains Data Errors


In this demonstration, you read a raw data file that contains data errors.

1. The sales3inv.csv file contains three instances of n/a, as well as a missing value for Salary. The sixth record has a
value of 12 for Country. Copy and paste the following program into the editor. The DATA step creates
work.sales4 with data from the sales3inv.csv file. The INPUT statement lists the variables and their types.

data work.sales4;
infile "&path/sales3inv.csv" dlm=',';
input Employee_ID First $ Last $
Job_Title $ Salary Country $;
run;

proc print data=work.sales4;
run;

2. Submit the program and check the log. The log shows that the DATA step creates the new data set, work.sales4.
The following information is written to the log: a note about the error, the contents of the input buffer and the
contents of the PDV. This note indicates that SAS detected invalid data for the variable Salary in line 4. SAS
displays a ruler above the input buffer to help you locate the invalid data. Notice that SAS assigns a missing value
to the variable that the invalid data affected.

Remember the two automatic variables that SAS creates during processing? The variables _N_ and _ERROR_ are
the temporary variables that SAS creates, and they are helpful when examining your log and data for data
errors. You can see that SAS encountered a data error because the value of _ERROR_ is 1. The error occurred
during the fourth iteration of the DATA step because the value of _N_ is 4. After writing the messages to the log,
SAS continues processing.

Next you can see the two other notes about invalid data for Salary. What about the missing Salary value? The
missing Salary value was not reported as a data error because missing is a valid value in SAS.

3. View the results. Notice that there are four observations with missing Salary values, just as the log indicated.
What about the value of 12 for Country? Even though you know that 12 is not a valid country code, SAS has no
trouble storing it in the Country variable. Remember, Country is a character variable and can store any
characters at all, including numerals. Later you will see that SAS has other procedures that you can use to find
invalid values.
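
If you don't have sales3inv.csv at hand, the same behavior can be reproduced with a few instream records. This is a sketch, not the course program: the data set name work.sales4_demo is hypothetical, the three records are copied from the partial listing shown earlier, and informats are added to the character variables only to avoid truncation at 8 characters.

```sas
/* Sketch only: work.sales4_demo is a hypothetical data set name. */
data work.sales4_demo;
   infile datalines dlm=',';
   /* Salary is numeric, so the value n/a in the second record is invalid */
   input Employee_ID First :$12. Last :$18.
         Job_Title :$25. Salary Country :$2.;
   datalines;
120121,Irenie,Elvish,Sales Rep. II,26600,AU
120122,Christina,Ngan,Sales Rep. II,n/a,AU
120124,Lucian,Daymond,Sales Rep. I,26480,12
;
run;

proc print data=work.sales4_demo;
run;
```

The log should contain one note about invalid data for Salary, followed by the ruler, the record, and the PDV contents with _ERROR_=1 and _N_=2. The data set should still contain all three observations, with a missing Salary in the second observation and the value 12 stored in Country without any message.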

Question
What does SAS do when it encounters a data error in a raw data record? Select all that apply.

a. prints a ruler and the raw data record in the SAS log
b. stops processing the program
c. assigns a missing value to the variable that the invalid data affects
d. prints a note about the error in the SAS log
e. prints the variable values in the corresponding SAS observation in the SAS log

The correct answers are a, c, d, and e. When SAS encounters a data error, it prints messages and a ruler in the log and
assigns a missing value to the affected variable. Then SAS continues processing.

Business Scenario
Now let’s examine how to handle missing data. Programmers at Orion Star have discovered that some files have records
with missing data in one or more fields. The records in phone2.csv have a contact name, phone number, and a mobile
phone number. The phone number is missing from some of the records. The missing data is indicated by consecutive
delimiters.

phone2.csv

1---5---10---15---20---25---30---35---40---45---50
James Kvarniq,(704) 293-8126,(701) 281-8923
Sandrina Stephano,, (919) 271-4592
Cornelia Krahl,(212) 891-3241,(212) 233-5413
Karen Ballinger,, (714) 644-9090
Elke Wallstab,(910) 763-5561,(910) 545-3421

How do you think SAS will handle these missing data values?

Activity
Copy and paste the following program into the editor. Submit the program and view the log.

Reminder: Make sure you've defined the orion library.

data work.contacts;
length Name $ 20 Phone Mobile $ 14;
infile "&path/phone2.csv" dlm=',';
input Name $ Phone $ Mobile $;
run;

proc print data=work.contacts noobs;
run;

1. How many input records were read and how many observations were created?

In the log, you can see that SAS read 5 records from the input file and created 3 observations. SAS writes the following
note: SAS went to a new line when INPUT statement reached past the end of a line.

2. Examine the report. Does it look correct?

The report does not look correct. Because the second record has missing values, SAS loads the next record to finish the
observation. As you can see, the value for Mobile in the second observation is Cornelia Krahl. That's definitely not
correct.

Using the DSD Option


List input treats two or more consecutive delimiters as a single delimiter and not as a missing value. When there are
missing data values in a record, SAS loads the next record to finish the observation and writes a note to the log.

work.contacts

Name               Phone           Mobile
James Kvarniq      (704) 293-8126  (701) 281-8923
Sandrina Stephano  (919) 271-4592  Cornelia Krahl
Karen Ballinger    (714) 644-9090  Elke Wallstab

You can use the DSD option in your INFILE statement to correctly read the raw data file phone2.csv. DSD stands for
delimiter sensitive data.

INFILE 'raw-data-file' <DLM=> DSD;

The DSD option sets the default delimiter to a comma, treats consecutive delimiters as missing values, and enables SAS
to read values with embedded delimiters if the value is surrounded by quotation marks.
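As a hedged illustration of these three behaviors, the following sketch reads two made-up instream records instead of phone2.csv. The data set name work.dsd_demo and the sample values are assumptions for illustration only and are not part of the course files.

```sas
/* Minimal sketch: DSD with instream data (hypothetical names and values). */
data work.dsd_demo;
   length Name $ 20 Phone Mobile $ 14;
   /* DSD makes the comma the default delimiter, so DLM= is omitted. */
   infile datalines dsd;
   input Name $ Phone $ Mobile $;
   datalines;
Ann Example,,(555) 010-0001
"Example, Bob",(555) 010-0002,(555) 010-0003
;

proc print data=work.dsd_demo noobs;
run;
```

In the first record, the two consecutive commas should produce a missing value for Phone. In the second record, the comma inside the quotation marks should be read as part of Name rather than as a delimiter, and the quotation marks themselves are not stored.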

Reading a Raw Data File That Contains Missing Data


In this demonstration, you read a raw data file that contains missing data.

1. Copy and paste the following program into the editor to correctly read the phone2.csv file and create a new SAS
data set, contacts. The INFILE statement includes the DSD option. You can use the DLM= option with the DSD
option, but it is not needed for comma-delimited files. The INPUT statement reads three character variables. The
PROC PRINT step creates the report.

data work.contacts;
length Name $ 20 Phone Mobile $ 14;
infile "&path/phone2.csv" dsd;
input Name $ Phone $ Mobile $;
run;
proc print data=work.contacts noobs;
run;

2. Submit the code and then check the log. You can see that SAS read five records from the file and wrote all five of
them to the new data set, contacts. There is no note in the log about reaching the end of a line.

3. View the report. It shows that the data is correctly assigned now that you added the DSD option.

Business Scenario
Orion Star programmers have also discovered that some raw data files have records with missing data at the end of the
record, so there are fewer fields in the record than specified in the INPUT statement. For example, the raw data file
phone.csv contains missing values at the end of some records.

phone.csv

1---5---10---15---20---25---30---35---40---45---50
James Kvarniq,(704) 293-8126,(701) 281-8923
Sandrina Stephano,(919) 271-4592
Cornelia Krahl,(212) 891-3241,(212) 233-5413
Karen Ballinger,(714) 644-9090
Elke Wallstab,(910) 763-5561,(910) 545-3421

Here’s a question. Can the programmers use the DSD option to correctly read the raw data file? No, the DSD option isn’t
appropriate because the missing data isn’t marked by consecutive delimiters.

Using the MISSOVER Option


You can use the MISSOVER option in your INFILE statement to prevent SAS from loading a new record when it reaches
the end of the current record. If SAS reaches the end of a record without finding values for all fields, variables without
values are set to missing.

INFILE 'raw-data-file' MISSOVER;

Reading a Raw Data File Using the MISSOVER Option


In this demonstration, you read a raw data file and use the MISSOVER option in your INFILE statement.

1. Copy and paste the following program into the editor. The DATA statement creates the new SAS data set
contacts2. The INFILE statement specifies the raw data file, followed by the DLM= option because the file is
comma delimited, and then the MISSOVER option. The INPUT statement reads the three variables, Name,
Phone, and Mobile, and indicates the type as character.

data work.contacts2;
infile "&path/phone.csv" dlm=',' missover;
input Name $ Phone $ Mobile $;
run;

proc print data=contacts2 noobs;
run;

2. Submit the program and then check the log. Verify that SAS read five records in and five observations out.
3. Examine the results. You can see that the data seems to be read into the correct variables, but something isn’t
quite right. Do you know what’s wrong? The variable values are truncated. You need to add a LENGTH statement
to the program to specify the proper lengths for the character variables.

4. Add a LENGTH statement before the INFILE statement to define a length of 20 for Name and a length of 14 for
both Phone and Mobile.

data work.contacts2;
length Name $ 20 Phone Mobile $ 14;
infile "&path/phone.csv" dlm=',' missover;
input Name $ Phone $ Mobile $;
run;

proc print data=contacts2 noobs;
run;

5. Submit the revised program and view the results. The report looks great. The values are not truncated, and SAS
assigned a missing value to Mobile for the records that have no third field.

Summary of Lesson 8: Reading Raw Data Files

This summary contains topic summaries, syntax, and sample programs.

Topic Summaries
Introduction to Reading Raw Data Files


A raw data file is an external text file that contains one record per line, and a record typically contains multiple fields.
The fields can be delimited or arranged in fixed columns. Typically, there are no column headings. The file is usually
described in an external document called a record layout.

In order for SAS to read a raw data file, you must specify the location of each data value in the record, along with the
names and types of the SAS variables in which to store the values. Three styles of input are available: list input, column
input, and formatted input. List input reads delimited files, and column and formatted input read fixed column files. List
input and formatted input can read both standard and nonstandard data, and column input can read only standard data.
In this course, we are reading delimited raw data files, so list input is used.

Reading Standard Delimited Data


You use a DATA step with INFILE and INPUT statements to read data from a raw data file. The INFILE statement identifies
the name and location of the input file. You use the DLM= option if the file has a delimiter other than a blank space. The
INPUT statement tells SAS how to read the values, and specifies the name and type for each variable to be created. In
the INPUT statement, you list the variables in the order that the corresponding values appear in the raw data file, from
left to right. You specify character variables by adding a dollar sign after the variable name. With list input, the default
length for all variables is 8 bytes, regardless of type.

DATA output-SAS-data-set-name;
INFILE 'raw-data-file-name' DLM='delimiter';
INPUT variable1 <$> variable2 <$> ... variableN <$>;
RUN;

SAS processes the DATA step in two phases: compilation and execution. During compilation, SAS creates an input buffer
to hold a record from the raw data file. The input buffer is an area of memory that SAS creates only when reading raw
data, not when reading a SAS data set. SAS also creates the PDV, an area of memory where an observation is built. In
addition to the variables named in the INPUT statement, SAS creates the iteration counter, _N_, and the error indicator,
_ERROR_, in the PDV. These temporary variables are not written to the output data set. At the end of the compilation,
SAS creates the descriptor portion of the output data set.

At the start of the execution phase, SAS initializes the PDV and then reads the first record from the raw data file into the
input buffer. It scans the input buffer from non-delimiter to delimiter and assigns each value to the corresponding
variable in the PDV. SAS ignores delimiters. At the bottom of the DATA step, SAS writes the values from the PDV to the
new SAS data set and then returns to the top of the DATA step.
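To watch the PDV during execution, one option is to write its contents to the log on each iteration. The sketch below is not part of the lesson. It uses the PUT statement with the _ALL_ keyword, and the data set name work.pdv_demo and the instream values are made up for illustration.

```sas
/* Minimal sketch: inspect the PDV on each iteration (hypothetical data). */
data work.pdv_demo;
   infile datalines dlm=',';
   input Name $ Age;
   /* Write every variable in the PDV, including _N_ and _ERROR_, to the log. */
   put _all_;
   datalines;
Ann,34
Bob,41
;
```

The log should show one line per iteration, with _N_ increasing by 1 each time and _ERROR_ equal to 0. The output data set contains only Name and Age, because the temporary variables are not written out.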

Truncation often occurs with list input, because character variables are created with a length of 8 bytes, by default. You
can use a LENGTH statement before the INPUT statement in a DATA step to explicitly define the length of character
variables. Numeric variables can be included in the LENGTH statement to preserve the order of variables, but you need
to specify a length of 8 for each numeric variable.

LENGTH variable(s) <$> length;

Reading Nonstandard Delimited Data


You can use modified list input to read standard and nonstandard data from a delimited raw data file. Modified list input
uses an informat and a colon format modifier for each field to be read. An informat tells SAS how to read data values,
including the number of characters. When SAS reads character data, a standard character informat, such as $12., is often
used instead of a LENGTH statement. With list input, the data fields vary in length. The colon format modifier tells SAS to
ignore the specified length when it reads data values, and instead to read only until it reaches a delimiter. Omitting the
colon format modifier is likely to result in data errors.

INPUT variable <$> variable <:informat>;

An informat is required to read nonstandard numeric data, such as calendar dates, and numbers with dollar signs and
commas. Many SAS informats are available for nonstandard numeric values. Every informat has a width, whether stated
explicitly or set by default.

When reading a raw data file, you can use a DROP or KEEP statement to write a subset of variables to the new data set.
You must use a subsetting IF statement to select observations, because the variables are not coming from an input SAS
data set. You can use LABEL and FORMAT statements to permanently store label and format information in the new data
set.

A DATA step can also read instream data, which is data that is within a SAS program. To specify instream data, you use a
DATALINES statement in a DATA step, followed by the lines of data, followed by a null statement.

DATALINES;
<data line 1>
<data line 2>
...
;

Validating Data
When data values in the input file aren't appropriate for the INPUT statement in a program, a data error occurs during
program execution. SAS records the error in the log by writing a note about the error, along with a ruler and the
contents of the input buffer and the PDV. The variable _ERROR_ is set to 1, a missing value is assigned to the
corresponding variable, and execution continues.
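The following sketch shows one way to trigger a data error on purpose so that you can study the log. The data set name work.err_demo and the instream records are assumptions made for illustration. The second record places text where the INPUT statement expects a number.

```sas
/* Minimal sketch: a deliberate data error (hypothetical data). */
data work.err_demo;
   infile datalines dlm=',';
   /* Salary is read as numeric, so the text in record 2 is invalid. */
   input ID Name $ Salary;
   datalines;
101,Ann,40000
102,Bob,unknown
103,Cal,52000
;

proc print data=work.err_demo noobs;
run;
```

The log should contain a note that the data for Salary is invalid, followed by a ruler, the contents of the record, and the PDV values with _ERROR_=1. The data set should still contain three observations, with a missing Salary in the second one.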

You can use the DSD option in the INFILE statement if data values are missing in the middle of a record. When you use
the DSD option, SAS assumes that the file is comma delimited, treats consecutive delimiters as missing data, and allows
embedded delimiters in a field that is enclosed in quotation marks. If you have missing data values at the end of a
record, you can use the MISSOVER option in the INFILE statement. SAS sets the variable values to missing.

INFILE 'raw-data-file-name' <DLM=> <DSD> <MISSOVER>;
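When a file might have missing values both in the middle and at the end of records, the two options can appear together on one INFILE statement. The sketch below assumes made-up instream records and a hypothetical data set name, work.both_demo.

```sas
/* Minimal sketch: DSD and MISSOVER together (hypothetical data). */
data work.both_demo;
   length Name $ 20 Phone Mobile $ 14;
   /* DSD handles consecutive commas; MISSOVER handles short records. */
   infile datalines dsd missover;
   input Name $ Phone $ Mobile $;
   datalines;
Ann Example,,(555) 010-0001
Bob Example,(555) 010-0002
;

proc print data=work.both_demo noobs;
run;
```

The first observation should have a missing Phone, and the second should have a missing Mobile. SAS should create two observations and should not go to a new line to finish the second one.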

Sample Programs

Creating a SAS Data Set from a Delimited Raw Data File

data work.sales1;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name $
Last_Name $ Gender $ Salary
Job_Title $ Country $;
run;

proc print data=work.sales1;
run;

Specifying the Lengths of Variables Explicitly

data work.sales2;
length First_Name $ 12 Last_Name $ 18
Gender $ 1 Job_Title $ 25
Country $ 2;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name $ Last_Name $
Gender $ Salary Job_Title $ Country $;
run;

proc contents data=work.sales2;
run;

proc print data=work.sales2;
run;

data work.sales2;
length Employee_ID 8 First_Name $ 12
Last_Name $ 18 Gender $ 1
Salary 8 Job_Title $ 25
Country $ 2;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name $ Last_Name $
Gender $ Salary Job_Title $ Country $;
run;

proc contents data=work.sales2 varnum;
run;

proc print data=work.sales2;
run;

Specifying Informats in the INPUT Statement

data work.sales2;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name :$12. Last_Name :$18.
Gender :$1. Salary Job_Title :$25. Country :$2.
Birth_Date :date. Hire_Date :mmddyy.;
run;

proc print data=work.sales2;
run;

Subsetting and Adding Permanent Attributes

data work.subset;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name :$12.
Last_Name :$18. Gender :$1. Salary
Job_Title :$25. Country :$2.
Birth_Date :date. Hire_Date :mmddyy.;
if Country='AU';
keep First_Name Last_Name Salary
Job_Title Hire_Date;
label Job_Title='Sales Title'
Hire_Date='Date Hired';
format Salary dollar12. Hire_Date monyy7.;
run;

proc print data=work.subset label;
run;

Reading Instream Data

data work.newemps;
input First_Name $ Last_Name $
Job_Title $ Salary :dollar8.;
datalines;
Steven Worton Auditor $40,450
Merle Hieds Trainee $24,025
Marta Bamberger Manager $32,000
;

proc print data=work.newemps;
run;

data work.newemps2;
infile datalines dlm=',';
input First_Name $ Last_Name $
Job_Title $ Salary :dollar8.;
datalines;
Steven,Worton,Auditor,$40450
Merle,Hieds,Trainee,$24025
Marta,Bamberger,Manager,$32000
;

proc print data=work.newemps2;
run;

Reading a Raw Data File That Contains Data Errors

data work.sales4;
infile "&path/sales3inv.csv" dlm=',';
input Employee_ID First $ Last $
Job_Title $ Salary Country $;
run;

proc print data=work.sales4;
run;

Reading a Raw Data File That Contains Missing Data

data work.contacts;
length Name $ 20 Phone Mobile $ 14;
infile "&path/phone2.csv" dsd;
input Name $ Phone $ Mobile $;
run;

proc print data=work.contacts noobs;
run;

Reading a Raw Data File Using the MISSOVER Option

data work.contacts2;
infile "&path/phone.csv" dlm=',' missover;
input Name $ Phone $ Mobile $;
run;

proc print data=contacts2 noobs;
run;

data work.contacts2;
length Name $ 20 Phone Mobile $ 14;
infile "&path/phone.csv" dlm=',' missover;
input Name $ Phone $ Mobile $;
run;

proc print data=contacts2 noobs;
run;

Lesson 9: Manipulating Data


As you know, your data sets do not always contain the exact information you need to perform certain tasks. You often
need to create new variables, perform calculations on existing variables, or assign values to variables based on specific
conditions in your data, among other things. In this lesson, you'll learn how to manipulate the data in your SAS data sets.
You'll learn to create variables using SAS functions and assign variable values based on conditions, and learn why you
might need to explicitly define the length of new variables.

Objectives
In this lesson, you learn to do the following:

• create data values using SAS functions
• process data conditionally using IF-THEN/ELSE statements
• assign values to variables conditionally
• execute multiple statements conditionally using DO and END statements
• control the length of character variables using the LENGTH statement

Using SAS Functions


Suppose Orion Star's Human Resources Department asks you to provide information about each employee's bonus
amount, total compensation amount, and month for receiving the bonus. You need to write SAS programs to provide
this information about sales staff from the United States and Australia.

You begin with the orion.sales data set, which includes the variables Employee_ID, Salary, and Hire_Date, among
others. But orion.sales doesn't contain the information you need for Human Resources. You'll need to create the
variables Bonus, Compensation, and BonusMonth in a DATA step and then store the new data in the work.comp data
set.

Using Assignment Statements to Create Variables


Let's get started on the DATA step. You know that you need to create the new variable Bonus. The Human Resources
Department has indicated that the bonus for all employees this year is $500, so Bonus is a constant. Remember that the
assignment statement evaluates an expression and assigns the resulting value to a new or existing variable.

variable=expression;

The expression can be a numeric constant, so you can use an assignment statement to assign the value 500 to the
variable Bonus.

Bonus=500;

Question
Which of the following statements will create the numeric variable Bonus with a value of 500?

work.comp

Bonus  Compensation  BonusMonth
500    108755        6
500    88475         1
500    27100         1

a. Bonus=$500;
b. Bonus=500;
c. label Bonus='500';
d. format Bonus 500.;
The correct answer is b. You use an assignment statement to set the value of the variable Bonus equal to 500. Numeric
constants do not include commas or currency symbols.

Using the SUM Function in Assignment Statements


Next, you need to calculate each employee's total compensation amount. The total compensation amount is the sum of
the employee's salary and bonus. To find the sum, you can use the SUM function, which is a descriptive statistic function
that returns the sum of the arguments.

SUM(argument1,argument2,...);

The arguments can be numeric constants, numeric variables, or arithmetic expressions, but the arguments must be
numeric values. Both Salary and Bonus are numeric values. The arguments must be enclosed in parentheses and
separated with commas. Notice that SAS functions can be used within an assignment statement.

data work.comp;
set orion.sales;
Bonus=500;
Compensation=sum(Salary,Bonus);
run;

Consider this: If Salary has a missing value, what do you think the value of Compensation will be? Its value will be 500.
The SUM function ignores missing values, so if an argument has a missing value, the result of the SUM function is the
sum of the nonmissing values. What if you were to calculate Compensation by simply adding Salary and Bonus in an
assignment statement?

data work.comp;
set orion.sales;
Bonus=500;
Compensation=Salary+Bonus;
run;

If one of these variables contains a missing value, the expression evaluates to missing, and a missing value is assigned to
Compensation. In this case, using the SUM function is the better choice. Compensation will always contain a nonmissing
value.
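To compare the two approaches side by side, you can compute both results in one DATA step. This is a minimal sketch with made-up instream values. The data set name work.sum_demo and the variables Comp_Sum and Comp_Plus are hypothetical and do not appear in the course data.

```sas
/* Minimal sketch: SUM function versus the + operator (hypothetical data). */
data work.sum_demo;
   input Salary;
   Bonus=500;
   /* SUM ignores the missing Salary and returns 500. */
   Comp_Sum=sum(Salary,Bonus);
   /* The + operator propagates the missing value. */
   Comp_Plus=Salary+Bonus;
   datalines;
30000
.
;

proc print data=work.sum_demo noobs;
run;
```

In the first observation, both results should be 30500. In the second observation, Comp_Sum should be 500 and Comp_Plus should be missing. The log should also note that missing values were generated as a result of performing an operation on missing values.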

Using Date Functions in Assignment Statements


The last variable we need to create is BonusMonth. Bonuses are paid in the month in which the employee was hired, so
BonusMonth is the hire month. To create this variable, you use another type of function, a date function. You might
remember that in SAS, the date 0 is January 1, 1960. From that date, all SAS dates in the future are positive numbers and
all SAS dates in the past are negative numbers. To calculate days, months, and years, you can specify a SAS date as an
argument in a date function.

You can use different types of SAS date functions to create variable values. For example, these date functions extract
date information from the date value that SAS stores.

Date Function      Value Extracted       Value Returned
YEAR(SAS-date)     the year              a four-digit year
QTR(SAS-date)      the quarter           a number from 1 to 4
MONTH(SAS-date)    the month             a number from 1 to 12
DAY(SAS-date)      the day of the month  a number from 1 to 31
WEEKDAY(SAS-date)  the day of the week   a number from 1 to 7 (1=Sunday, 2=Monday, and so on)

The YEAR function extracts the year from a SAS date and returns a four-digit value for year. The QTR function extracts
the quarter from a SAS date and returns a number from 1 to 4. The MONTH function extracts the month from a SAS date
and returns a number from 1 to 12. The DAY function extracts the day of the month from a SAS date and returns a
number from 1 to 31. The WEEKDAY function extracts the day of the week from a SAS date and returns a number from 1
to 7, where 1 represents Sunday and so on.

These date functions create a SAS date value.

Date Function        SAS Date Value Created
TODAY()              the current date
DATE()               the current date
MDY(month,day,year)  a date with numeric month, day, and year

The TODAY function returns the current date as a SAS date value. DATE is an alias for TODAY. It works the same way.
Notice that there are no arguments inside the parentheses when calling the TODAY or DATE function. That is because
these functions do not need any information from the program. They use the system clock to obtain the current date
and convert it to a SAS date. It is important to use parentheses even when no arguments are passed. Without the
parentheses SAS would think TODAY or DATE were variables instead of functions. The MDY function returns a SAS date
value from numeric month, day, and year values.
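The following sketch applies each of these functions to a single date so that you can check the returned values. The data set name work.date_demo and all variable names are hypothetical. The date used, June 1, 1993, was a Tuesday.

```sas
/* Minimal sketch: date functions applied to one SAS date (hypothetical names). */
data work.date_demo;
   HireDate='01JUN1993'd;     /* SAS date constant */
   Yr=year(HireDate);         /* 1993 */
   Q=qtr(HireDate);           /* 2 */
   Mon=month(HireDate);       /* 6 */
   D=day(HireDate);           /* 1 */
   Wd=weekday(HireDate);      /* 3, because 1=Sunday and the date is a Tuesday */
   Built=mdy(6,1,1993);       /* the same SAS date value as HireDate */
   Now=today();               /* depends on the day you run the program */
   format HireDate Built Now date9.;
run;

proc print data=work.date_demo noobs;
run;
```

Because HireDate and Built hold the same SAS date value, they should print identically. The value of Now changes with the system clock.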

Now let's return to creating the BonusMonth variable. The hire date for each employee is stored in the Hire_Date
variable. At Orion Star, an employee who was hired in April receives an annual bonus in April. So you need to extract the
month the employee was hired from the Hire_Date variable and assign its value to the BonusMonth variable. You can
do this by using the MONTH function. You use the function named MONTH, followed by one argument in parentheses.
The argument for month must be a SAS date.

MONTH(SAS-date);

Here, the MONTH function extracts the month of hire from Hire_Date and returns a number from 1 to 12. The returned
value is assigned to BonusMonth.

Question
Which of these statements specifies SAS functions correctly? Select all that apply.
a. Deadline=sum(TimeSpent,Last_Name);
b. FingersToes=sum(10,10);
c. Birthday=mdy(8,27,90);
d. Review_Quarter=qtr(Hire_Date+372);
e. GreatDay=today();
f. BirthdayYear=year(Birth_Date);
g. BirthdayYear=year('12dec1987'd);

The correct answers are b, c, d, e, f, and g. Except for the TODAY() function, these numeric functions must specify
appropriate numeric arguments in parentheses following the function keyword. You can specify numeric constants,
including SAS date constants, numeric variables, or arithmetic expressions, as numeric arguments. The TODAY() function
doesn't require an argument.

Creating Variables by Using Functions


In this demonstration, you create one variable using an assignment statement and two variables using functions.

1. Copy and paste the following program into the editor.

data work.comp;
set orion.sales;
Bonus=500;
Compensation=sum(Salary,Bonus);
BonusMonth=month(Hire_Date);
run;

proc print data=work.comp;
var Employee_ID First_Name Last_Name
Salary Bonus Compensation BonusMonth;
run;

2. Submit the program and then check the log. You can see that SAS read 165 observations from orion.sales, and
the work.comp data set contains 12 variables. You might recall that orion.sales contains 9 variables, so it looks
as though SAS created the new variables.

3. View the results. You can see that SAS created the three new variables, and they appear as the last three in the
report. The Bonus values are all 500, which is what you specified. The Compensation values should be the sum
of the Bonus and Salary values. You can verify this in the first observation. Tom's salary is 108255, and when
added to 500, the total is 108755. This looks great. Lastly, the BonusMonth values should all be a number from 1
to 12. Scroll through the report to verify the other values.

Question
A DROP statement has been added to this DATA step. Will the program calculate Compensation and BonusMonth
correctly?

data work.comp;
set orion.sales;
drop Gender Salary Job_Title Country
Birth_Date Hire_Date;
Bonus=500;
Compensation=sum(Salary,Bonus);
BonusMonth=month(Hire_Date);
run;

a. yes
b. no

The correct answer is a. The DROP statement is a compile-time only statement. SAS sets a drop flag for the
dropped variables, but the variables are in the PDV and, therefore, are available for processing.

Obs  Employee_ID  First_Name  Last_Name   Bonus  Compensation  BonusMonth
1    120102       Tom         Zhou        500    108755        6
2    120103       Wilson      Dawes       500    88475         1
3    120121       Irenie      Elvish      500    27100         1
4    120122       Christina   Ngan        500    27975         7
5    120123       Kimiko      Hotstone    500    26690         10
6    120124       Lucian      Daymond     500    26980         3
7    120125       Fong        Hofmeister  500    32540         3
8    120126       Satyakam    Denny       500    27280         8
9    120127       Sharryn     Clarkson    500    28600         11
10   120128       Monica      Kletschkus  500    31390         11

Conditional Processing
The managers at Orion Star plan to give each sales employee a bonus based on his or her job title. Employees with the
job title Sales Rep. IV will receive a $1000 bonus, those with the Sales Manager title will receive a $1500 bonus, Senior
Sales Managers will receive a $2000 bonus, and the Chief Sales Officer will receive a $2500 bonus.

For this task, you need to write a SAS program and use orion.sales to create the new data set work.comp. You need to
include a new variable, Bonus, with a value based on the variable Job_Title. Let's find out how you can complete your
task.

Using IF-THEN Statements


You want SAS to assign a bonus amount based on the employee's job title. To do this, you can use conditional
statements. The IF-THEN statement is a conditional statement. It executes a SAS statement for observations that meet
specific conditions.

IF expression THEN statement;

In the IF-THEN statement, as in the assignment statement, expression is a sequence of operands and operators that
define a condition for selecting observations.

Operands: character constants, numeric constants, date constants, character variables, numeric variables

Operators: symbols that represent a comparison, a logical operation, or an arithmetic calculation (for example, =, >, <,
~, &, |, /, *, -), and SAS functions

After the THEN keyword, statement is any executable statement, such as the assignment statement. If the expression is
true, the THEN statement executes. The IF-THEN statement executes for each observation in the data set.

if Job_Title='Sales Rep. IV' then Bonus=1000;

Conditional Processing with IF-THEN Statements


Here's the program for your task. It contains four IF-THEN statements: one for each of the four job titles.

data work.comp;
set orion.sales;
if Job_Title='Sales Rep. IV' then
Bonus=1000;
if Job_Title='Sales Manager' then
Bonus=1500;
if Job_Title='Senior Sales Manager' then
Bonus=2000;
if Job_Title='Chief Sales Officer' then
Bonus=2500;
run;

The value that SAS assigns to Bonus is determined by testing for various values of Job_Title. If the expression is true,
then SAS assigns Bonus the value that corresponds to the IF-THEN statement. If the expression is false, then SAS moves
to the next IF-THEN statement.

Let's see how SAS processes this DATA step in more detail. At the start of the execution phase, SAS initializes the PDV to
missing. When the SET statement executes, SAS reads the first observation of orion.sales into the PDV. The value of
Bonus is missing because Bonus doesn't come from the input data set. It's a new variable that we are creating in this
DATA step.

Partial sales.csv

1---5---10---15---20---25---30---35---40---45---50---55---60---65---70---75
120102,Tom,Zhou,M,108255,Sales Manager,AU,11AUG1973,06/01/1993
120103,Wilson,Dawes,M,87975,Sales Manager,AU,22JAN1953,01/01/1978
120121,Irenie,Elvish,F,26600,Sales Rep. II,AU,01AUG1948,01/01/1978
120122,Christina,Ngan,F,27475,Sales Rep. II,AU,27JUL1958,07/01/1982

PDV

Employee_ID  First_Name  Last_Name  Gender  Salary  Job_Title      Country  Bonus
120102       Tom         Zhou       M       108255  Sales Manager  AU       .

When SAS executes the first IF-THEN statement, the value of Job_Title in the PDV does not match the specified value, so
the expression is false. Bonus remains missing. When SAS executes the second IF-THEN statement, the value of Job_Title
in the PDV matches the specified value, so the expression is true. SAS assigns 1500 to Bonus.

PDV

Employee_ID  First_Name  Last_Name  Gender  Salary  Job_Title      Country  Bonus
120102       Tom         Zhou       M       108255  Sales Manager  AU       1500

When SAS executes the third IF-THEN statement, the value of Job_Title in the PDV does not match the specified value,
so the expression is false. Bonus remains 1500. When SAS executes the fourth IF-THEN statement, the value of Job_Title
in the PDV does not match the specified value, so the expression is false. Bonus remains 1500.

At the bottom of the DATA step, SAS uses the values in the PDV to write the first observation into the new data set,
work.comp. Then control returns to the top of the DATA step for the next iteration. This process continues until SAS
reaches the end of the input data set.

Assigning Values Conditionally


In this demonstration, you assign values to variables conditionally using IF-THEN statements.

1. Copy and paste the following program into the editor.

data work.comp;
set orion.sales;
if Job_Title='Sales Rep. IV' then
Bonus=1000;
if Job_Title='Sales Manager' then
Bonus=1500;
if Job_Title='Senior Sales Manager' then
Bonus=2000;
if Job_Title='Chief Sales Officer' then
Bonus=2500;
run;

proc print data=work.comp;
var Last_Name Job_Title Bonus;
run;

2. Submit the program and then check the log. The log shows that SAS read 165 observations from orion.sales. The
code ran without errors.

3. View the report. As you can see, many values for Bonus are missing, as indicated by the periods. Can you think
of why there are missing values? Not all employees in orion.sales met the conditions you specified. After all, you
only assigned bonus amounts to four job titles. The observations that do have values, however, appear to be
correct.

Question
In this program, is it possible for more than one condition to be true for a single observation?

data work.comp;
set orion.sales;
if Job_Title='Sales Rep. IV' then
Bonus=1000;
if Job_Title='Sales Manager' then
Bonus=1500;
if Job_Title='Senior Sales Manager' then
Bonus=2000;
if Job_Title='Chief Sales Officer' then
Bonus=2500;
run;

a. Yes, more than one condition can be true.


b. No, the conditions are mutually exclusive, so only one condition can be true.

The correct answer is b. The conditions are mutually exclusive, so only one condition can be true. For each observation,
there is only one value for Job_Title. If that value matches one of the conditions, then it cannot match any other
condition.

Question
Which statement below correctly assigns the value SE to the new variable Region if the variable City has the
value Atlanta?

City ProductNumber
Tampa K445
Atlanta K702
Boston F065
a. if city='Atlanta' then
Region='se';
b. if Region='SE' then
city='Atlanta';
c. if city='Atlanta' then
Region='SE';
d. if Region='Atlanta' then
city='SE';

The correct answer is c. The IF expression tests for the value of City. If the expression is true, SAS executes the
corresponding statement and assigns the value SE to Region.

Using the ELSE Statement


The IF conditions in our program are mutually exclusive. Once SAS encounters a true condition, checking the other
conditions isn't necessary. However, when the DATA step executes, SAS tests each IF statement, even after a condition
is true. This wastes system resources and slows the processing of your program.

To make our program more efficient, we can use the ELSE statement to specify an alternative action to be performed
when the condition in an IF-THEN statement is false. Notice the syntax.

ELSE statement;

You use the keyword ELSE, followed by a statement, which can be any executable SAS statement, including another IF-
THEN statement. The ELSE statement must immediately follow the IF-THEN statement in your program, and it executes
only if the previous IF-THEN statement is false.

IF expression THEN statement;
<ELSE IF expression THEN statement;>
<ELSE IF expression THEN statement;>

In the program below, SAS evaluates the IF-THEN statements sequentially. When an expression is true, SAS executes the
associated statement and skips subsequent ELSE statements.

data work.comp;
set orion.sales;
if Job_Title='Sales Rep. IV' then
Bonus=1000;
else if Job_Title='Sales Manager' then
Bonus=1500;
else if Job_Title='Senior Sales Manager' then
Bonus=2000;
else if Job_Title='Chief Sales Officer' then
Bonus=2500;
run;

Conditional Processing with IF-THEN-ELSE Statements


Let's look at our program.

data work.comp;
set orion.sales;
if Job_Title='Sales Rep. IV' then
Bonus=1000;
else if Job_Title='Sales Manager' then
Bonus=1500;
else if Job_Title='Senior Sales Manager' then
Bonus=2000;
else if Job_Title='Chief Sales Officer' then
Bonus=2500;
run;

Notice that only the first IF-THEN statement begins with the keyword IF. The subsequent statements begin with the
keyword ELSE. Let's take a look at how SAS processes this code. At the start of the execution phase, SAS initializes the
PDV to missing. When the SET statement executes, SAS reads the first observation of orion.sales into the PDV.

When SAS executes the first IF-THEN statement, the value of Job_Title in the PDV does not match the specified value, so
the expression is false. Bonus remains missing.

PDV

Employee_ID  First_Name  Last_Name  Gender  Salary  Job_Title      Country  Bonus
120102       Tom         Zhou       M       108255  Sales Manager  AU       .

When SAS executes the first ELSE statement, the value of Job_Title in the PDV matches the specified value, so the
expression is true. SAS assigns 1500 to Bonus.

PDV

Employee_ID First_Name Last_Name Gender Salary Job_Title Country Bonus

120102 Tom Zhou M 108255 Sales Manager AU 1500

Because that expression is true, SAS skips the remaining ELSE statements without evaluating their conditions. Bonus
remains 1500.

At the bottom of the DATA step, SAS uses the values in the PDV to write the first observation into the new data set,
work.comp. Then control returns to the top of the DATA step for the next iteration. This process continues until SAS
reaches the end of the file.

You can imagine the kind of efficiency this way of programming creates when you have data sets with thousands of
observations!
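
If you want to trace this behavior without the orion library, the following minimal sketch may help. It builds a small
data set in the Work library and applies the same IF-THEN/ELSE chain. The data set names work.titles and
work.comp_demo, and the rows for Smith and Jones, are made up for illustration; only the Zhou row mirrors the lesson.

```sas
/* Hypothetical practice data: only the Zhou row mirrors the lesson. */
data work.titles;
   infile datalines dsd;                  /* values are comma-separated */
   length Last_Name $ 18 Job_Title $ 25;
   input Last_Name $ Job_Title $;
   datalines;
Zhou,Sales Manager
Smith,Sales Rep. IV
Jones,Trainee
;
run;

data work.comp_demo;
   set work.titles;
   /* SAS stops checking conditions at the first true expression. */
   if Job_Title='Sales Rep. IV' then Bonus=1000;
   else if Job_Title='Sales Manager' then Bonus=1500;
   else if Job_Title='Senior Sales Manager' then Bonus=2000;
   else if Job_Title='Chief Sales Officer' then Bonus=2500;
run;

proc print data=work.comp_demo;
run;
```

In the report, Zhou gets 1500 and Smith gets 1000. Jones matches none of the conditions, so Bonus stays missing for
that observation.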

Business Scenario
The managers at Orion Star plan to give each sales employee a bonus based on his or her job title. Employees with the
job title Sales Rep. IV will receive a $1000 bonus, those with the Sales Manager title will receive a $1500 bonus, Senior
Sales Managers will receive a $2000 bonus, and the Chief Sales Officer will receive a $2500 bonus.

For this task, you need to write a SAS program and use orion.sales to create the new data set work.comp. You need to
include a new variable, Bonus, with a value based on the variable Job_Title. Let's find out how you can complete your
task.

Using Conditional Processing


When you write IF-THEN statements, you can use logical operators to create compound conditions. For example, in our
scenario, we need to assign a bonus value of $1000 to employees with either the Sales Rep. III job title or the Sales
Rep. IV job title. An employee doesn't need to meet both conditions; meeting either one is enough. We can use the OR
operator to combine the two conditions. The statement that follows THEN executes if either condition is true.

data work.comp;
set orion.sales;
if Job_Title='Sales Rep. III' or
Job_Title='Sales Rep. IV' then
Bonus=1000;
else if Job_Title='Sales Manager' then
Bonus=1500;
else if Job_Title='Senior Sales Manager' then
Bonus=2000;
else if Job_Title='Chief Sales Officer' then
Bonus=2500;
run;

Now, how will you assign a bonus amount of $500 to all other job titles? You certainly don't want to continue writing
ELSE statements for every title in the company! Luckily, you can use an optional final ELSE statement in your program:
else Bonus=500. This statement gives an alternative action if none of the conditions are true.

IF expression THEN statement;


<ELSE IF expression THEN statement;>
<ELSE statement;>

data work.comp;
set orion.sales;
if Job_Title='Sales Rep. III' or
Job_Title='Sales Rep. IV' then
Bonus=1000;
else if Job_Title='Sales Manager' then
Bonus=1500;
else if Job_Title='Senior Sales Manager' then
Bonus=2000;
else if Job_Title='Chief Sales Officer' then
Bonus=2500;
else Bonus=500;
run;

Using Compound Conditions


In this demonstration, you use compound conditions to assign values to variables.

1. Copy and paste the following program into the editor.

data work.comp;
set orion.sales;
if Job_Title='Sales Rep. III' or
Job_Title='Sales Rep. IV' then
Bonus=1000;
else if Job_Title='Sales Manager' then
Bonus=1500;
else if Job_Title='Senior Sales Manager'
then Bonus=2000;
else if Job_Title='Chief Sales Officer'
then Bonus=2500;
else Bonus=500;
run;

proc print data=work.comp;
var Last_Name Job_Title Bonus;
run;

2. Submit the program and then view the report. As you can see, every observation has a bonus value assigned.
Both Sales Rep. III and Sales Rep. IV employees have a bonus value of $1000, and employees with a bonus value
of 500 have job titles that were not listed in our IF-THEN statements.
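
Scrolling through a long report is one way to check the result. As a possible alternative, a sketch like the following
summarizes every combination of Job_Title and Bonus in one short table. It assumes that work.comp was created by the
program in step 1. The LIST and MISSING options are a suggestion here, not part of the original demonstration.

```sas
/* One row per Job_Title/Bonus combination, including missing values. */
proc freq data=work.comp;
   tables Job_Title*Bonus / list missing;
run;
```

Each job title should appear with exactly one bonus value. A title paired with an unexpected or missing bonus points to
a condition that needs another look.
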
Code Challenge
Write statements to assign the value pass to the variable Grade when the value of the variable Points is greater than or
equal to 70. Otherwise, assign the value fail to Grade.

if Points>=70 then Grade='pass';
else Grade='fail';
In the IF-THEN statement, expression is a sequence of operands and operators that define a condition for selecting
observations. After the THEN keyword, statement is any executable statement. You use an ELSE statement to specify an
alternative action to be performed when the condition in the IF-THEN statement is false.

Business Scenario
Suppose the Orion Star managers are considering another type of bonus for employees, a country-based bonus. They'd
like to assign a bonus amount of $500 to employees in the US, and assign a bonus amount of $300 to employees in
Australia. You need to create a new data set, work.bonus, with this information.

Using IF-THEN-ELSE Statements


In this demonstration, you use IF-THEN-ELSE statements to assign values to a variable.

1. Copy and paste the following program into the editor.

You know that the orion.sales data set has been validated and only includes the Country values US and AU. So
you can use an IF-THEN-ELSE statement for this scenario: if Country equals US, then Bonus equals 500. Else,
Bonus equals 300. You can omit the conditional clause from the ELSE statement because all observations not
equal to US will get a bonus of $300. This technique should be used only when you know that the final ELSE
statement must be executed for all other observations.

data work.bonus;
set orion.sales;
if Country='US' then Bonus=500;
else Bonus=300;
run;

proc print data=work.bonus;
var First_Name Last_Name Country Bonus;
run;

2. Submit the program and then view the report. As you scroll through the report, you can see that all AU Country
values have a $300 bonus and all US Country values have a $500 bonus.

Activity
Copy and paste this program into the editor. This program reads orion.nonsales, which is a non-validated data set.
Submit the code.

Reminder: Make sure you've defined the orion library.


data work.bonus;
set orion.nonsales;
if Country='US' then Bonus=500;
else Bonus=300;
run;

proc print data=work.bonus;
run;

When you examine observations 88 through 235, do you see any values of 300 for Bonus?

a. yes
b. no

The correct answer is a. Bonus is equal to 300 in observations 125, 197, and 200. This is because the variable Country
has some mixed case values in orion.nonsales. Observations with a Country value of US are assigned 500; all others are
assigned 300, including us.
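
Before you rely on an ELSE statement without a condition, it can help to list the distinct values of the variable that
you are testing. The following sketch uses PROC FREQ for that purpose. The MISSING option is added here so that any
missing Country values would also show up in the table.

```sas
/* List each distinct value of Country and how often it occurs. */
proc freq data=orion.nonsales;
   tables Country / missing;
run;
```

Because character comparisons in SAS are case sensitive, US and us appear as separate rows in the table.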

Testing for Invalid Data


In the data set orion.nonsales, you need to test for uppercase and lowercase values of Country. You could use the OR
operator to create a compound condition. But in this example, let's use the IN operator. In your IF-THEN statement, you
use the comparison operator IN, and in parentheses you specify the uppercase and lowercase values of Country. If the
value of Country is equal to one of the values in this list, then the condition is true, and SAS assigns 500 to Bonus.

data work.bonus;
set orion.nonsales;
if Country in ('US', 'us')
then Bonus=500;
else Bonus=300;
run;

Alternatively, you can use the UPCASE function in the expression. The UPCASE function converts all letters in its
argument to uppercase.

data work.bonus;
set orion.nonsales;
if upcase(Country)='US'
then Bonus=500;
else Bonus=300;
run;

You can also clean the data before checking the value in your IF-THEN statement. This program shows how you can first
uppercase all values of the variable Country. It's a best practice to clean the data at the source, but in some cases that is
not possible. With this method, you are creating a clean data set.

data work.bonus;
set orion.nonsales;
Country=upcase(Country);
if Country='US' then
Bonus=500;
else Bonus=300;
run;
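
If you aren't sure that uppercasing resolves every problem value, you might test for both countries explicitly and
report anything else instead of assigning it $300. The following is a sketch under that assumption. The log message
text is made up for illustration, and Bonus is simply left missing for unexpected values.

```sas
data work.bonus;
   set orion.nonsales;
   Country=upcase(Country);
   if Country='US' then Bonus=500;
   else if Country='AU' then Bonus=300;
   /* Anything else: leave Bonus missing and write a line to the log. */
   else put 'Unexpected Country value: ' Country= _N_=;
run;
```

The PUT statement writes the Country value and the DATA step iteration number (_N_) to the log, so you can locate the
observation in the input data set.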

Business Scenario
Now let's look at a more complex bonus situation. The Human Resources Department stepped in and decided that Orion
Star employees will receive a bonus once or twice a year, depending on their country. US employees will receive a $500
bonus once a year, and Australian employees will receive a $300 bonus twice a year. So, in addition to creating the
variable Bonus, you also need to create the variable Freq, which is the frequency that the employee receives the bonus.
Freq is equal to Once a Year for United States employees, and is equal to Twice a Year for Australian employees. The
variable Freq isn't related to the FREQ procedure.

Creating Variables Conditionally


Let's think about what you need to do. You want to assign values to two variables based on one condition. You want SAS
to determine whether the employee is from the US or from Australia. Then, based on that condition, you want SAS to
assign values to Bonus and Freq. The IF-THEN/ELSE statements that we've been using allow for only one executable
statement. For our scenario, two statements must be executed for each true expression.

Using DO Groups to Execute Multiple Statements


To execute multiple statements, you can use a DO group with an IF-THEN or an ELSE statement. This enables SAS to
perform multiple actions based on one condition.

IF expression THEN
DO;
executable statements
END;
ELSE IF expression THEN
DO;
executable statements
END;

The DO group consists of a DO statement, the SAS statements to be executed when the condition is true, and the END
statement. Multiple statements are permitted in a DO group.

Creating Two Variables Conditionally


In this demonstration, you create two variables conditionally by using DO groups.

1. Copy and paste the following program into the editor. At the end of the IF-THEN statement, you specify the DO
statement and then the statements to execute if the condition is true. In the first group, you want to assign US
employees the bonus amount 500 and the bonus frequency Once a Year. Notice that you must end each DO
group with an END statement. Next, the ELSE statement contains a second DO group. You assign the Australian
employees the bonus amount 300 and the bonus frequency Twice a Year.

The PROC PRINT step displays only the variables First_Name, Last_Name, Country, Bonus, and Freq.

data work.bonus;
set orion.sales;
if Country='US' then
do;
Bonus=500;
Freq='Once a Year';
end;
else if Country='AU' then
do;
Bonus=300;
Freq='Twice a Year';
end;
run;
proc print data=work.bonus;
var First_Name Last_Name Country
Bonus Freq;
run;

2. Submit the program and then view the report. You can see that the new variables have been added to the data
set. And it appears that the values for Bonus and Freq are based on the value for Country.

But, wait! There does seem to be a problem with some values for Freq. For the value Twice a Year, the final r
seems to be cut off. You'll learn why this value is truncated next.

Avoiding Truncated Values When Creating Variables


To determine why a value might get truncated, as the value Twice a Year does, let's consider how SAS assigns lengths to
variables by default.

data work.bonus;
set orion.sales;
if Country='US' then
do;
Bonus=500;
Freq='Once a Year';
end;
else if Country='AU' then
do;
Bonus=300;
Freq='Twice a Year';
end;
run;

SAS begins by creating the PDV from the variables in orion.sales. Remember that SAS lists the variable name and type
(numeric or character), and then allocates space for the variable values.

PDV

Employee_ID First_Name Last_Name Gender Salary Job_Title Country Birth_Date Hire_Date


N8 $ 12 $ 18 $1 N8 $ 25 $2 N8 N8

Next, SAS creates the new variables. SAS adds Bonus, which is a numeric variable, to the PDV and allocates 8 bytes for it.
From the string of characters that are assigned, SAS deduces that Freq is a character variable. Because SAS wasn't given
any instruction about the variable, SAS assumes that the length is the amount of space taken up by the value Once a
Year, which is the first value assigned to the variable. Once a Year equals 11 characters, so SAS assigns a length of 11 to
the character variable, Freq.

PDV

Employee_ID First_Name Last_Name Gender Salary Job_Title Country Birth_Date Hire_Date Bonus Freq
N8 $ 12 $ 18 $1 N8 $ 25 $2 N8 N8 N8 $ 11
Look at the number of characters for the next string assigned to the variable Freq. Twice a Year requires 12 characters.
Now you know why the last character is getting left off in the output: SAS only allowed for 11 characters. How would
you modify your code to allow the full 12 characters?

You can change your code in several ways to deal with the situation. You can specify the condition for Australian
employees first so that SAS allocates 12 bytes for the character variable, Freq,

data work.bonus;
set orion.sales;
if Country='AU' then
do;
Bonus=300;
Freq='Twice a Year';
end;
else if Country='US' then
do;
Bonus=500;
Freq='Once a Year';
end;
run;
or you can pad the value with blanks in the first occurrence,

data work.bonus;
set orion.sales;
if Country='US' then
do;
Bonus=500;
Freq='Once a Year ';
end;
else if Country='AU' then
do;
Bonus=300;
Freq='Twice a Year';
end;
run;
or you can use the LENGTH statement to declare the byte size of the variable up front.

data work.bonus;
set orion.sales;
length Freq $ 12;
if Country='US' then
do;
Bonus=500;
Freq='Once a Year';
end;
else if Country='AU' then
do;
Bonus=300;
Freq='Twice a Year';
end;
run;
The LENGTH statement is a straightforward approach, so let's use it to avoid truncation. SAS assigns variable lengths
based on the first occurrence of the variable in the DATA step. So it's important to put the LENGTH statement early
enough in the program to make sure that's where SAS first encounters the variable Freq.

PDV
Employee_ID First_Name Last_Name Gender Salary Job_Title Country Birth_Date Hire_Date Bonus Freq
N8 $ 12 $ 18 $1 N8 $ 25 $2 N8 N8 N8 $ 12
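
To see the default length for yourself, you could run a small sketch like the one below. It doesn't use orion.sales.
The data set name work.len_demo and the variable Freq_Len are made up for illustration. The VLENGTH function returns
the length that SAS assigned to a variable at compile time.

```sas
data work.len_demo;
   input Country $;
   /* Freq is first referenced here, so its length comes from 'Once a Year'. */
   if Country='US' then Freq='Once a Year';
   else Freq='Twice a Year';
   Freq_Len=vlength(Freq);   /* length set at compile time */
   datalines;
US
AU
;
run;

proc print data=work.len_demo;
run;
```

Freq_Len is 11 in both observations, and the AU row shows the truncated value Twice a Yea. If you add length Freq $ 12;
before the IF statement and resubmit, Freq_Len becomes 12 and the full value is stored.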

Question
Which of the following can determine the length of a new variable?
a. the length of the variable's first value
b. the assignment statement
c. the LENGTH statement
d. all of the above

The correct answer is d. In the DATA step, the first reference to a variable determines its length. The first reference to a
new variable can be in a LENGTH statement, an assignment statement, or another statement such as an INPUT
statement. After a variable is created in the PDV, the length of the variable's first value doesn't matter.

Adjusting the Program


In this demonstration, you set the length of a variable and remove the condition from the ELSE statement.

1. Copy and paste the following program into the editor. This program is from the previous demonstration, but
includes a LENGTH statement to correct the truncated value for Freq. Remember that it's important to put the
LENGTH statement before any other reference to the variable.

data work.bonus;
set orion.sales;
length Freq $ 12;
if Country='US' then
do;
Bonus=500;
Freq='Once a Year';
end;
else if Country='AU' then
do;
Bonus=300;
Freq='Twice a Year';
end;
run;
proc print data=work.bonus;
var First_Name Last_Name Country
Bonus Freq;
run;

2. Submit the program and then view the report. The full value Twice a Year is displayed for the variable Freq. You
resolved the problem by using the LENGTH statement. If you were sure of the values for Country, you could
specify the ELSE statement without the condition. A final ELSE statement without a condition executes for
every observation that doesn't meet any of the previous conditions.

3. Copy and paste the following program, which no longer includes the condition in the ELSE statement.

data work.bonus;
set orion.sales;
length Freq $ 12;
if Country='US' then
do;
Bonus=500;
Freq='Once a Year';
end;
else do;
Bonus=300;
Freq='Twice a Year';
end;
run;

proc print data=work.bonus;
var First_Name Last_Name Country
Bonus Freq;
run;

4. Submit the revised program and view the results. Notice that the values for Bonus and Freq are correctly
assigned based on the value for Country, and the full value Twice a Year is displayed for Freq.

Summary of Lesson 9: Manipulating Data

This summary contains topic summaries, syntax, and sample programs.

Topic Summaries

Using SAS Functions


You use an assignment statement in a DATA step to evaluate an expression and assign the result to a new or existing
variable. The expression on the right side of an assignment statement can include calls to SAS functions. A SAS function
is a routine that accepts arguments and returns a value.

variable=expression;

The SUM function, a descriptive statistics function, returns the sum of its arguments and ignores missing values. The
arguments can be numeric constants, numeric variables, or arithmetic expressions, but the arguments must be numeric
values and must be enclosed in parentheses and separated with commas. The parentheses are required, even if no
arguments are passed to the function.

SUM(argument1,argument2, ...)

In addition to descriptive statistics functions, many SAS date functions are available. Some of these functions create SAS
dates, and others extract information from SAS dates. The MONTH function extracts and returns the numeric month
from a SAS date.

MONTH(SAS-date)
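
As a quick illustration of both functions, consider this minimal sketch. The data set name work.func_demo and all of the
values in it are made up for the example. It contrasts the SUM function with the plus operator when one of the values is
missing.

```sas
data work.func_demo;
   Salary=50000;
   Bonus=.;                          /* missing value */
   Total_Plus=Salary+Bonus;          /* missing: + propagates missing values */
   Total_Sum=sum(Salary,Bonus);      /* 50000: SUM ignores missing values */
   Hire_Date='15MAR2010'd;           /* SAS date constant */
   Hire_Month=month(Hire_Date);      /* 3 */
run;

proc print data=work.func_demo;
run;
```

The log includes a note that missing values were generated, which refers to the Total_Plus calculation.
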
Conditional Processing
The IF-THEN statement is a conditional statement. It executes a SAS statement for observations that meet specific
conditions. The statement includes an expression and a SAS program statement. The expression defines a condition that
must be true for the statement to be executed. The expression is evaluated during each iteration of the DATA step. If the
condition is true, the statement following the THEN keyword is executed; otherwise, SAS skips the statement.

IF expression THEN statement;

A program often includes a sequence of IF statements with mutually exclusive conditions. When SAS encounters a true
condition in this series, evaluating the other conditions isn't necessary. You can use the ELSE statement to specify an
alternative action to be performed when the condition in an IF-THEN statement is false. This increases the efficiency of
the program.

You can use the logical operators AND and OR to combine conditions in an IF expression. You use the AND operator
when both conditions must be true, and you use the OR operator when only one of the conditions must be true. An
optional final ELSE statement can be used at the end of a series of IF-THEN/ELSE statements. The statement following
the final ELSE executes if none of the IF expressions is true.
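
The lesson's programs use the OR operator. For comparison, here is a sketch that also uses AND. The bonus rule and the
data set name work.bonus_and are hypothetical and are not part of the Orion Star scenario.

```sas
data work.bonus_and;
   set orion.sales;
   /* AND: both conditions must be true. */
   if Country='US' and Job_Title='Sales Manager' then Bonus=2000;
   /* OR: only one of the conditions must be true. */
   else if Country='US' or Country='AU' then Bonus=500;
   /* Final ELSE: executes if none of the expressions above is true. */
   else Bonus=0;
run;
```

Because the conditions are linked with ELSE, a US Sales Manager receives 2000 from the first statement and is never
tested against the second condition.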

Use a DO group with an IF-THEN or an ELSE statement when multiple statements must be executed based on one
condition. The DO group consists of a DO statement, the SAS statements to be executed, and an END statement. Each
DO statement must have a corresponding END statement.

IF expression THEN
DO;
executable statements
END;
ELSE IF expression THEN
DO;
executable statements
END;

Truncation can occur when new variables are assigned values within conditional program statements. During
compilation, SAS creates a variable in the PDV the first time it encounters the variable in the program. If this is in
conditional code, be sure that it is created with a length long enough to store all possible values. It is a best practice to
use a LENGTH statement to explicitly define the length.

Sample Programs

Creating Variables by Using Functions

data work.comp;
set orion.sales;
Bonus=500;
Compensation=sum(Salary,Bonus);
BonusMonth=month(Hire_Date);
run;

proc print data=work.comp;
var Employee_ID First_Name Last_Name
Salary Bonus Compensation BonusMonth;
run;

Assigning Values Conditionally

data work.comp;
set orion.sales;
if Job_Title='Sales Rep. IV' then
Bonus=1000;
if Job_Title='Sales Manager' then
Bonus=1500;
if Job_Title='Senior Sales Manager' then
Bonus=2000;
if Job_Title='Chief Sales Officer' then
Bonus=2500;
run;

proc print data=work.comp;
var Last_Name Job_Title Bonus;
run;

Using Compound Conditions

data work.comp;
set orion.sales;
if Job_Title='Sales Rep. III' or
Job_Title='Sales Rep. IV' then
Bonus=1000;
else if Job_Title='Sales Manager' then
Bonus=1500;
else if Job_Title='Senior Sales Manager' then
Bonus=2000;
else if Job_Title='Chief Sales Officer' then
Bonus=2500;
else Bonus=500;
run;

proc print data=work.comp;
var Last_Name Job_Title Bonus;
run;

Using IF-THEN/ELSE Statements

data work.bonus;
set orion.sales;
if Country='US' then Bonus=500;
else Bonus=300;
run;

proc print data=work.bonus;
var First_Name Last_Name Country Bonus;
run;

Creating Two Variables Conditionally

data work.bonus;
set orion.sales;
if Country='US' then
do;
Bonus=500;
Freq='Once a Year';
end;
else if Country='AU' then
do;
Bonus=300;
Freq='Twice a Year';
end;
run;
proc print data=work.bonus;
var First_Name Last_Name Country Bonus Freq;
run;

Adjusting the Program

data work.bonus;
set orion.sales;
length Freq $ 12;
if Country='US' then
do;
Bonus=500;
Freq='Once a Year';
end;
else if Country='AU' then
do;
Bonus=300;
Freq='Twice a Year';
end;
run;

proc print data=work.bonus;
var First_Name Last_Name Country Bonus Freq;
run;

data work.bonus;
set orion.sales;
length Freq $ 12;
if Country='US' then
do;
Bonus=500;
Freq='Once a Year';
end;
else do;
Bonus=300;
Freq='Twice a Year';
end;
run;

proc print data=work.bonus;
var First_Name Last_Name Country
Bonus Freq;
run;

Lesson 10: Combining SAS Data Sets


Sometimes, a single data set contains all the data you need to perform a single business task. However, sometimes you'll
find that the data you need is stored in multiple data sets. Using SAS, you can combine multiple data sets into a single
data set. Then you can reference the combined data set in your SAS programs. In this lesson, you learn to concatenate
SAS data sets and you learn to merge SAS data sets.
Objectives

In this lesson, you learn to do the following:

• concatenate two or more SAS data sets using the SET statement in a DATA step
• rename variables using the RENAME= data set option
• prepare data sets for merging by using the SORT procedure
• merge SAS data sets one-to-one based on a common variable
• merge SAS data sets one-to-many based on a common variable
• control the observations in the output data set by using the IN= data set option

Concatenating Data Sets


Suppose you've been asked to combine data sets containing information about Orion Star employees from Denmark and
France into a new data set. The data set empsdk contains employees from Denmark, and the data set empsfr contains
employees from France. In an output data set named empsall1, you want all the observations from empsdk to appear
first, followed by all the observations from empsfr.

empsdk empsfr

First Gender Country First Gender Country

Lars M Denmark Pierre M France

Kari F Denmark Sophie F France

Jonas M Denmark

empsall1

First Gender Country

Lars M Denmark

Kari F Denmark

Jonas M Denmark

Pierre M France

Sophie F France

You can concatenate these data sets. Concatenating copies all the observations from the first data set and then copies
all observations from one or more additional data sets into a new data set. The original data sets are unchanged.
Knowing the Structure and Contents of Your Input Data
To choose the best way to combine your data, you need to understand the structure and contents of your input data
sets. So, for example, it's helpful to examine both the descriptor portion of the data sets, as well as the data portions.
When you combine data sets vertically, one of the most important questions to ask is: Do the data sets have variables in
common?

Suppose you're combining data sets one and two. You can see that the variables have common names: A, B, and C. By
examining the data further, you can make sure that variables that have the same names contain the same type of data.

one two

A B C A B C

You might find that variables that have different names across data sets contain the same data. Suppose data set one
has a variable named Last and data set two has a variable named Last_Name. In this situation, you might want SAS to
combine these two variables when you combine data sets.

On the other hand, you might find that variables that have the same name in different data sets contain different data.
For example, suppose both data sets have a variable named Date. One Date variable might store order dates, but the
other Date variable might store shipping dates. You would not want SAS to combine these variables if they hold
different information.

When you're combining vertically, it's easier to combine data sets that have identical variables. However, you can also
combine data sets that have different variables. As you can see, the data sets in our scenario, empsdk and empsfr, have
the same variable names: First, Gender, and Country. For this example, assume that the variables also have the same
attributes.

Specifying Multiple Data Sets in the SET Statement


You can use the DATA step to combine multiple data sets into a single data set. In the DATA statement, you specify the
name of the new data set. In the SET statement, you specify any number of input data sets. You separate the names by a
space, not by a comma.

DATA SAS-data-set;
SET SAS-data-set1 SAS-data-set2 ...;
RUN;

When you specify multiple data sets, SAS combines them into a single data set. In other words, SAS concatenates the
observations from the input data sets. In the combined data set, the observations appear in the order in which the data
sets are listed in the SET statement. In this example, the observations from empsdk will appear before the observations
from empsfr because empsdk is listed first.
data empsall1;
set empsdk empsfr;
run;

How SAS Concatenates Data Sets with the Same Variables


Let's see how SAS processes the DATA step to concatenate the data sets listed in the SET statement.

data empsall1;
set empsdk empsfr;
run;

During compilation, SAS reads the descriptor portion of the first data set, empsdk, and determines that it has three
variables. SAS also determines the attributes of the variables. Then SAS creates the PDV with slots for the three
variables.

PDV

First Gender Country


$8 $8 $8

SAS then looks at the second data set, empsfr, to see if it has additional variables that must be added to the PDV. Here,
empsfr has no additional variables, so SAS makes no further changes to the PDV. At the bottom of the DATA step, the
compilation phase is complete, and the descriptor portion of the new SAS data set empsall1 is created.

empsall1

First Gender Country

Now SAS is ready to execute the DATA step and create the data portion of the output data set. To start, SAS initializes
the PDV. This means that SAS sets the value of each variable to missing. Remember that SAS makes a pass, or iteration,
through the DATA step for each observation that's read from an input data set. Consider this: Which observation does
SAS look at first? SAS reads the first observation in empsdk, the first data set specified in the SET statement. SAS reads
the values directly into the PDV. At the bottom of the DATA step, SAS writes the data from the PDV to the output data
set as the first observation.

empsall1

First Gender Country

Lars M Denmark

Now SAS returns to the top of the DATA step for the next iteration. Because SAS continues reading observations from
the same input data set, SAS does not reinitialize the PDV. SAS reads the second observation in empsdk into the PDV
and then writes the data to the output data set as the second observation.
empsall1

First Gender Country

Lars M Denmark

Kari F Denmark

Returning to the top of the DATA step, SAS now reads the third observation from empsdk into the PDV and then writes
it to the output data set. When control returns to the top of the DATA step again, SAS reaches the end of empsdk.

empsall1

First Gender Country

Lars M Denmark

Kari F Denmark

Jonas M Denmark

SAS reinitializes the PDV before switching to the second data set.

PDV

First Gender Country

Now SAS reads the first observation from the data set empsfr into the PDV, and then writes the values to the output
data set as the fourth observation. SAS reads in and writes out each observation in the second data set until it reaches
the end of that file. When SAS finishes executing the DATA step, the output data set is complete.

empsall1

First Gender Country

Lars M Denmark

Kari F Denmark
Jonas M Denmark

Pierre M France

Sophie F France
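
If you'd like to watch this order of events in the log, the sketch below may be useful. It creates empsdk and empsfr in
the Work library from the values shown in this lesson, then concatenates them. The PUT statement is an addition for
tracing and is not part of the original program.

```sas
data empsdk;
   input First $ Gender $ Country $;
   datalines;
Lars M Denmark
Kari F Denmark
Jonas M Denmark
;
run;

data empsfr;
   input First $ Gender $ Country $;
   datalines;
Pierre M France
Sophie F France
;
run;

data empsall1;
   set empsdk empsfr;
   /* Write one log line per iteration to trace the read order. */
   put _N_= First= Gender= Country=;
run;
```

The log shows five lines, one for each iteration of the DATA step. The three Denmark observations come first, followed
by the two France observations.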

Code Challenge
Write a SET statement to concatenate clinic.stress98 and clinic.stress99, in that order.

data clinic.testtime;

;
run;

set clinic.stress98 clinic.stress99;

You use the SET statement to name the data sets to be concatenated.

Business Scenario
Suppose you want to concatenate two other data sets that contain employee data. The data set empscn contains
employees from China, and the data set empsjp contains employees from Japan.

empscn empsjp

First Gender Country First Gender Region

Chang M China Cho F Japan

Li M China Tomi M Japan

Ming F China

Here's a question. How many variables have a common name across data sets? Only two of the three variables have a
common name: First and Gender. You want to create a new data set named empsall2 that has three variables. As in
empscn, you want the third variable to be named Country, not Region.

Concatenating Data Sets with Different Variables


In this demonstration, you concatenate data sets that contain different variables.

1. Copy and paste the following program into the editor. The first two DATA steps create the data for this
demonstration.

In the third DATA step, the DATA statement specifies the name of the output data set, empsall2. You want the
employees in China to be listed before the employees in Japan, so the SET statement specifies empscn, a space,
and then empsjp. The PROC PRINT step creates the report.
data empscn;
input First $ Gender $ Country $;
datalines;
Chang M China
Li M China
Ming F China
;
run;

data empsjp;
input First $ Gender $ Region $;
datalines;
Cho F Japan
Tomi M Japan
;
run;

data empsall2;
set empscn empsjp;
run;

proc print data=empsall2;
run;

2. Submit the program and then check the log. You can see messages indicating that SAS ran successfully. Try
answering this question: How many variables does the output data set have? The data set empsall2 has four
variables.

3. View the PROC PRINT output. You can see the four variables First, Gender, Country, and Region. Notice that
some observations have missing values for Country and others have missing values for Region. This isn't the
output that you want. Next you'll learn to create the report you need.

How SAS Concatenates Data Sets with Different Variables


Let's take a closer look at how SAS created the output for our scenario, and learn how SAS concatenates data sets by
default when they have different variables. During compilation, SAS identifies three variables in the first data set,
empscn. SAS then creates a PDV with slots for the three variables First, Gender, and Country.

data empsall2;
set empscn empsjp;
run;

PDV

First Gender Country


$8 $8 $8

In the second data set, empsjp, SAS finds the additional variable Region. SAS then adds Region as the fourth variable in
the PDV. SAS also creates the descriptor portion of the output data set, which contains the four variables.

PDV
First Gender Country Region
$8 $8 $8 $8

Descriptor portion of empsall2

First Gender Country Region


$8 $8 $8 $8

At the start of execution, SAS reads the first observation from empscn. SAS reads the values of First, Gender, and
Country into the PDV. But the data set empscn doesn’t contain the Region variable. Think about this: What value does
SAS assign to Region in the PDV? Because there is no value to be read into Region, it remains missing.

empsall2

First Gender Country Region

Chang M China

When SAS reaches the end of file in empscn, SAS reads the second data set listed in the SET statement, empsjp.
Remember that SAS reinitializes the PDV before switching to this data set. SAS reads the first observation from empsjp
into the PDV. Now the variable Country has a missing value due to the PDV reinitialization.

Here's the final output data set, but this is not the output that we want.

empsall2

First Gender Country Region

Chang M China

Li M China

Ming F China

Cho F Japan

Tomi M Japan
We want our output data set to have only the first three variables. Also, we want to move the values in Region to the
Country variable. To get this output, we can modify our DATA step to rename the variable Region to Country in empsjp.

The RENAME= Data Set Option


We want to rename a variable in the empsjp data set from Region to Country. To change the name of one or more
variables, you can use the RENAME= data set option in your DATA step.

SAS-data-set(RENAME=(old-name-1=new-name-1
old-name-2=new-name-2
...
old-name-n=new-name-n))

You specify the RENAME= option immediately after the associated SAS data set name. Notice that you enclose the
RENAME= option within an outer set of parentheses. In the inner set of parentheses, you specify one or more variables
that you want to rename. For each variable, you specify the existing variable name, an equal sign, and the new name. Of
course, the new name must be a valid SAS name. If you are changing multiple variable names for the same data set, you
add a space, not a comma, between variables.

If the RENAME= option is associated with an input data set in the SET statement, as in this example, the action applies to
the data set that is being read. The name change affects the PDV and the output data set, but has no effect on the input
data set.

data empsall2;
set empscn
empsjp(rename=(Region=Country));
run;
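Here is one way you might confirm that the name change affects only the output data set. This sketch assumes the same small data sets shown in the tables for this scenario; the two PROC CONTENTS steps are added only as a check.

```sas
/* Recreate the input data sets (values copied from the tables in this lesson). */
data empscn;
   input First $ Gender $ Country $;
   datalines;
Chang M China
Li M China
Ming F China
;
run;

data empsjp;
   input First $ Gender $ Region $;
   datalines;
Cho F Japan
Tomi M Japan
;
run;

/* Rename Region to Country as empsjp is read. */
data empsall2;
   set empscn
       empsjp(rename=(Region=Country));
run;

/* The output data set should list First, Gender, and Country. */
proc contents data=empsall2 varnum;
run;

/* The input data set is unchanged and should still list Region. */
proc contents data=empsjp varnum;
run;
```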

The RENAME= Data Set Option: Additional Examples


Now let's look at a couple of examples that use the RENAME= data set option in other ways. Suppose you want to
rename the third variable in the first data set, empscn, from Country to Region. You specify the RENAME= data set
option immediately after empscn.

empscn

First Gender Country

Chang M China

Li M China

Ming F China

set empscn(rename=(Country=Region))
empsjp;
The next example shows that you can use the RENAME= data set option for multiple data sets in the same SET
statement. Suppose you want to rename two variables in the first data set, empscn, and one variable in the second data
set, empsjp. Even though the first variable in the two data sets is currently the same, you want to change First to Fname
in both data sets. So you specify this variable name change in the RENAME= data set option after both empscn and
empsjp. In the first data set, you also want to change Country to Region, so you add this to the RENAME= data set
option after empscn.

empsjp

First Gender Region

Cho F Japan

Tomi M Japan

set empscn(rename=(First=Fname
Country=Region))
empsjp(rename=(First=Fname));
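
To see this SET statement in a complete step, you could run a sketch like the following. The output data set name empsall3 is a hypothetical name chosen for this illustration, and the data values come from the tables in this lesson.

```sas
/* Recreate the input data sets (values copied from the tables in this lesson). */
data empscn;
   input First $ Gender $ Country $;
   datalines;
Chang M China
Li M China
Ming F China
;
run;

data empsjp;
   input First $ Gender $ Region $;
   datalines;
Cho F Japan
Tomi M Japan
;
run;

/* empsall3 is a hypothetical output name used only for this sketch. */
data empsall3;
   set empscn(rename=(First=Fname
                      Country=Region))
       empsjp(rename=(First=Fname));
run;

/* After the renames, both inputs supply Fname, Gender, and Region. */
proc print data=empsall3;
run;
```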

Question
Which SET statement has correct syntax?
a.
set empscn(rename(Country=Location))
empsjp(rename(Region=Location));

b.
set empscn(rename=(Country=Location))
empsjp(rename=(Region=Location));

c.
set empscn rename=(Country=Location)
empsjp rename=(Region=Location);

The correct answer is b. You specify the keyword RENAME, followed by the equals sign. You enclose the RENAME=
option within an outer set of parentheses. In the inner set of parentheses, you specify one or more variables that you
want to rename.

Code Challenge
In the code below, rename the variable Office in the sales.rep data set to OfficeNumber.

data condata.emppay;

set sales.rep
empinfo.sales empinfo.bonuses;
run;

(rename=(Office=OfficeNumber))
The RENAME= data set option specifies the variable or variables to be renamed. The variables and their new names are
listed in parentheses after the data set name.

How SAS Concatenates Data Sets When a Variable Is Renamed


Let's see how SAS processes our DATA step now that we've added the RENAME= data set option. During compilation,
SAS creates the PDV with the three variables from empscn, as before.

empscn empsjp

First Gender Country First Gender Region

Chang M China Cho F Japan

Li M China Tomi M Japan

Ming F China

data empsall2;
set empscn
empsjp(rename=(Region=Country));
run;

PDV

First Gender Country

In empsjp, SAS finds the additional variable Region. However, the RENAME= option in the SET statement tells SAS to
treat the variable Region as if it is named Country. So, the PDV and the descriptor portion of the output data set have
only three variables. At execution, when SAS reads the observations in empsjp, SAS stores the values of Region in the
Country variable.

Merging SAS Data Sets One-to-One


You've been asked to combine two data sets about Orion Star employees. The first data set, empsau, contains the first
name, gender, and employee ID number for several employees. The second data set, phoneh, contains the employee ID
number and phone number for the same set of employees. You want to combine these data sets horizontally so that
each observation provides all possible information about one employee.

empsau phoneh

First Gender EmpID EmpID Phone

Togar M 121150 121150 +61(2)5555-1793


Kylie F 121151 121151 +61(2)5555-1849

Birin M 121152 121152 +61(2)5555-1665

empsauh

First Gender EmpID Phone

Togar M 121150 +61(2)5555-1793

Kylie F 121151 +61(2)5555-1849

Birin M 121152 +61(2)5555-1665

Merging Data Sets


Let's look at how you combine data sets horizontally by merging. We'll use the sample data sets, one and two. Merging
combines observations from two or more SAS data sets into a single observation in a new data set. You can merge
observations based on either their positions in the original data sets, or by the values of one or more common variables.
This lesson focuses on merging based on the values of one or more common variables, a process called match-merging.

one two

A B C C D E

both

A B C D E
Knowing the Structure and Contents of Your Data
When you combine data sets horizontally, or match-merge data sets, you might want to ask the question: What is the
relationship between observations in the input data sets? The observations can be related in several different ways.

In a one-to-one relationship, a single observation in one data set is related to one, and only one, observation in another
data set based on the values of one or more common variables. For example, suppose two data sets contain employee
identification numbers for the same group of employees. Each employee ID number appears once in each data set, and
each observation in one data set has one matching observation in the other data set.

one two

A B ID ID D E

1 1

2 2

3 3

In a one-to-many relationship, a single observation in one data set is related to one or more observations in another
data set.

one two

A B ID ID D E

1 1

2 1

In a many-to-one relationship, multiple observations in one data set are related to one observation in another data set.

one two

A B ID ID D E

1 1

1 2

In a many-to-many relationship, multiple observations in one data set are related to multiple observations in another
data set.

one two

A B ID ID D E

1 1

1 1

2 2

Sometimes, the data sets have non-matches. At least one observation in one of the data sets is unrelated to any
observation in another data set based on the values of one or more common variables.

one two

A B ID ID D E

1 2

2 3

4 4
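
One way to examine the relationship before you merge is to count how often each value of the common variable occurs in each data set. The sketch below is not part of the course materials. It uses PROC SQL on the empsau and phoneh data sets from this scenario, and the column name Occurrences is a hypothetical label.

```sas
/* Recreate the scenario data sets (values copied from the tables in this lesson). */
data empsau;
   input First $ Gender $ EmpID;
   datalines;
Togar M 121150
Kylie F 121151
Birin M 121152
;
run;

data phoneh;
   input EmpID Phone $15.;
   datalines;
121150 +61(2)5555-1793
121151 +61(2)5555-1849
121152 +61(2)5555-1665
;
run;

proc sql;
   /* Any row returned here means an EmpID value repeats in empsau. */
   select EmpID, count(*) as Occurrences
      from empsau
      group by EmpID
      having count(*) > 1;

   /* Any row returned here means an EmpID value repeats in phoneh. */
   select EmpID, count(*) as Occurrences
      from phoneh
      group by EmpID
      having count(*) > 1;
quit;
```

If neither query returns rows, no BY value repeats in either data set, so the relationship is not one-to-many, many-to-one, or many-to-many. You would still need to compare the EmpID values across the data sets to check for non-matches.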

Now take a look at the data sets for your scenario: empsau and phoneh.

empsau phoneh

First Gender EmpID EmpID Phone


Togar M 121150 121150 +61(2)5555-1793

Kylie F 121151 121151 +61(2)5555-1849

Birin M 121152 121152 +61(2)5555-1665

Think about this: Which variable can you use to match-merge these data sets? You can use the EmpID variable for the
match-merge. And what about this: Do these data sets have a one-to-one relationship? Yes, each data set contains the
same three employee ID numbers. One last question: How many variables will the new data set empsauh contain?
Empsauh will contain four variables: the first two variables come from the empsau data set. The third variable, the BY
variable, is common to the two input data sets. And the last variable comes from the phoneh data set.

empsauh

First Gender EmpID Phone

Togar M 121150 +61(2)5555-1793

Kylie F 121151 +61(2)5555-1849

Birin M 121152 +61(2)5555-1665

The MERGE and BY Statements in the DATA Step


You can use the DATA step to merge multiple data sets into a single data set. Instead of the SET statement, you use the
MERGE statement.

DATA SAS-data-set;
MERGE SAS-data-set1 SAS-data-set2 ...;
BY <DESCENDING> BY-variable(s);
<additional SAS statements>
RUN;

The MERGE statement joins observations from two or more SAS data sets into single observations, so you must specify
at least two data sets in the MERGE statement. If you specify only one data set, SAS treats the MERGE statement like a
SET statement.

In this example, we'll specify the data sets empsau and phoneh as the data sets to merge. Next, the BY statement
indicates a match-merge. You specify the common variable or variables to match, which in this case is EmpID.

data empsauh;
merge empsau phoneh;
by EmpID;
run;
The BY variables must be common to all data sets, and the data sets must be sorted by the variables listed in the BY
statement. What can you use to sort the data sets? You can use PROC SORT to sort the empsau and phoneh data sets by
the common variable EmpID. Fortunately, the two data sets that you're working with are already sorted on the BY
variable, EmpID.
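
If your data sets were not already in order, the sort-then-merge pattern might look like the sketch below. The rows are entered out of order on purpose for this illustration, and the names empsau_sorted and phoneh_sorted are hypothetical.

```sas
/* Same values as the scenario data, deliberately entered out of order. */
data empsau;
   input First $ Gender $ EmpID;
   datalines;
Birin M 121152
Togar M 121150
Kylie F 121151
;
run;

data phoneh;
   input EmpID Phone $15.;
   datalines;
121151 +61(2)5555-1849
121152 +61(2)5555-1665
121150 +61(2)5555-1793
;
run;

/* Sort each input by the BY variable; OUT= leaves the originals untouched. */
proc sort data=empsau out=empsau_sorted;
   by EmpID;
run;

proc sort data=phoneh out=phoneh_sorted;
   by EmpID;
run;

/* Match-merge the sorted copies. */
data empsauh;
   merge empsau_sorted phoneh_sorted;
   by EmpID;
run;

proc print data=empsauh;
run;
```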

Merging Data Sets One-to-One


In this demonstration, you merge data sets that have a one-to-one relationship.

1. Copy and paste the following program into the editor. The first two DATA steps create the data for this
demonstration.

In the third DATA step, the DATA statement specifies the name of the output data set, empsauh. You want to
store all of the employee information in one data set. The PROC PRINT step creates the report.

data empsau;
input First $ Gender $ EmpID;
datalines;
Togar M 121150
Kylie F 121151
Birin M 121152
;
run;

data phoneh;
input EmpID Phone $15.;
datalines;
121150 +61(2)5555-1793
121151 +61(2)5555-1849
121152 +61(2)5555-1665
;
run;

********** Match-Merge One-to-One**********;


data empsauh;
merge empsau phoneh;
by EmpID;
run;

proc print data=empsauh;


run;

2. Submit the program and then check the log. You can see that SAS ran successfully. SAS read three observations
from each of the input data sets and created a data set that has three observations and four variables.

3. View the PROC PRINT output. You can see that empsauh contains the merged employee information.

Activity
Copy and paste this program into the editor. Complete the DATA step to match-merge the sorted SAS data sets
referenced in the PROC SORT steps. Submit your code. Correct and resubmit if necessary.

Reminder: Make sure you've defined the orion library.

proc sort data=orion.employee_payroll


out=work.payroll;
by Employee_ID;
run;

proc sort data=orion.employee_addresses


out=work.addresses;
by Employee_ID;
run;

data work.payadd;
merge ;
by ;
run;

proc print data=work.payadd;


var Employee_ID Employee_Name
Birth_Date Salary;
format Birth_Date weekdate.;
run;

Which of the following statements correctly match-merges the data sets?

a.

merge work.payroll
work.addresses;
by Employee_ID;

b.

merge orion.employee_payroll
orion.employee_addresses;
by Employee_ID;

c.

merge work.payroll,
work.addresses;
by Employee_ID;

The correct answer is a.

The PROC SORT steps create two new temporary output data sets in the OUT= option. In the MERGE
statement, you specify these data sets, separated by a space. In the BY statement, you specify the common
variable, Employee_ID.

Code Challenge
Complete this program to match-merge the data sets work.reps, empinfo.sales, and empinfo.bonuses, in that order, by
the common variable Emp_ID. Assume that the data sets have been sorted by Emp_ID.

data mergedata.emppay;

;
;
run;

merge work.reps empinfo.sales empinfo.bonuses;


by Emp_ID;

The MERGE statement specifies the input data sets to be merged. Data sets are merged in the order that they appear in
the MERGE statement. In the BY statement, you specify the common variable Emp_ID.

Merging SAS Data Sets One-to-Many


Suppose you want to combine two data sets that contain information about employees in Australia. The data set
empsau contains first names, gender, and employee ID numbers. The data set phones contains employee ID numbers,
as well as home, work, and cell phone numbers.

empsau phones

First Gender EmpID EmpID Type Phone

Togar M 121150 121150 Home +61(2)5555-1793

Kylie F 121151 121150 Work +61(2)5555-1794

Birin M 121152 121151 Home +61(2)5555-1849

121152 Work +61(2)5555-1850

121152 Home +61(2)5555-1665

121152 Cell +61(2)5555-1666

You want to match-merge these data sets based on the common variable EmpID to obtain the phone numbers for each
employee. You assume that the same employees are listed in both data sets, but it's a good idea to examine your data
first.

Now think about this: What is the relationship of these two data sets? One observation in empsau matches one, two, or
three observations in phones, so these data sets have a one-to-many relationship.

How SAS Performs a One-to-Many Match-Merge


To match-merge these data sets, you write the same kind of DATA step that you wrote before.

data empphones;
merge empsau phones;
by EmpID;
run;
The DATA statement identifies the output data set as empphones. The MERGE statement lists the two input data sets,
and the BY statement specifies the BY variable EmpID, which SAS uses to combine the observations. All observations
that have the same value of the BY variable are in the same BY group. Notice that you don't need to sort these data sets
because they are already in order by EmpID.

Let's see how SAS processes this DATA step when the data sets have a one-to-many relationship. At the end of the
compilation phase, SAS has created the PDV as well as the descriptor portion of the output data set. The output data set
isn't shown here; you'll have a chance to see it later. The PDV has five variables: First and Gender from empsau; EmpID,
which appears in both data sets; and Type and Phone from phones.

PDV

First Gender EmpID Type Phone

To start, SAS sets the values in the PDV to missing.

PDV

First Gender EmpID Type Phone

Now, at the start of the execution phase, SAS is ready to combine observations. SAS looks at the first observation in each
of the two data sets to determine which BY group should appear first in the output data set. Here's a question. Do the
EmpID values match? Yes. These two observations have the same BY value, so they are in the same BY group.

empsau phones

First Gender EmpID EmpID Type Phone

Togar M 121150 121150 Home +61(2)5555-1793

Kylie F 121151 121150 Work +61(2)5555-1794

Birin M 121152 121151 Home +61(2)5555-1849

121152 Work +61(2)5555-1850

121152 Home +61(2)5555-1665


121152 Cell +61(2)5555-1666

data empphones;
merge empsau phones;
by EmpID;
run;

The DATA step reads the values from the two data sets into the PDV, in the order they appear in the MERGE statement.
The PDV now contains data for Togar's home phone number.

PDV

First Gender EmpID Type Phone

Togar M 121150 Home +61(2)5555-1793

SAS then writes the contents of the PDV to the output data set as the first observation. SAS is now ready to start another
iteration of the DATA step.

At the beginning of each DATA step iteration, SAS reinitializes any new variables in the PDV. In this example, SAS does
not reinitialize any variables, because they all come from the input data sets. However, if the DATA step had an
assignment statement that created new variables, SAS would reset the values of the new variables to missing in the PDV.

SAS now moves to the second observation in each data set. Do the EmpID values match?

No, they don't. Does either EmpID match the EmpID in the PDV? Yes. The second observation in phones is in the same
BY group. The second observation in phones contains Togar's work phone number. The observation in empsau has a
different BY value. This observation is in a different BY group, and SAS will process it in the next iteration.

Now, SAS reads the values of Type and Phone from the observation in phones into the PDV. These new values replace
the previous values of Type and Phone in the PDV. However, the values of First, Gender, and EmpID remain the same as
before.

PDV

First Gender EmpID Type Phone

Togar M 121150 Work +61(2)5555-1794

Finally, SAS writes the values in the PDV to the output data set, creating the second observation.

Once again, SAS retains the values in the PDV. In empsau, SAS is still looking at the second observation because this
observation has not yet been read into the PDV. However, in phones, SAS moves down to the third observation.
empsau phones

First Gender EmpID EmpID Type Phone

Togar M 121150 121150 Home +61(2)5555-1793

Kylie F 121151 121150 Work +61(2)5555-1794

Birin M 121152 121151 Home +61(2)5555-1849

121152 Work +61(2)5555-1850

121152 Home +61(2)5555-1665

121152 Cell +61(2)5555-1666

Do the EmpID values match? Yes, these observations are in the same BY group. Does either EmpID match the EmpID in
the PDV? No. These observations are in a new BY group, so SAS sets all the values in the PDV to missing.

PDV

First Gender EmpID Type Phone

SAS reads the values from the empsau observation, and then the phones observation, into the PDV.

PDV

First Gender EmpID Type Phone

Kylie F 121151 Home +61(2)5555-1849

Then SAS writes the PDV values to the output data set as the third observation. The DATA step continues executing in
this way until SAS reaches the end of file for both data sets. Here is the final output data set. Notice that the output data
set has multiple observations for each employee.

empphones

First Gender EmpID Type Phone


Togar M 121150 Home +61(2)5555-1793

Togar M 121150 Work +61(2)5555-1794

Kylie F 121151 Home +61(2)5555-1849

Birin M 121152 Work +61(2)5555-1850

Birin M 121152 Home +61(2)5555-1665

Birin M 121152 Cell +61(2)5555-1666
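
If you'd like to follow the PDV yourself, one option is to add a PUT _ALL_ statement to the DATA step so that SAS writes the PDV contents to the log on every iteration. This is a sketch for exploration, not part of the original program. The data values are copied from the tables above.

```sas
/* Recreate the scenario data sets (values copied from the tables in this lesson). */
data empsau;
   input First $ Gender $ EmpID;
   datalines;
Togar M 121150
Kylie F 121151
Birin M 121152
;
run;

data phones;
   input EmpID Type $ Phone $15.;
   datalines;
121150 Home +61(2)5555-1793
121150 Work +61(2)5555-1794
121151 Home +61(2)5555-1849
121152 Work +61(2)5555-1850
121152 Home +61(2)5555-1665
121152 Cell +61(2)5555-1666
;
run;

data empphones;
   merge empsau phones;
   by EmpID;
   /* Write every PDV variable, including automatic ones such as _N_, to the log. */
   put _all_;
run;
```

Each log line shows the PDV just before SAS writes the observation, so you can watch First and Gender being retained within a BY group while Type and Phone change.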

Activity
Copy and paste this program into the editor and submit it. Examine the results. Then reverse the order of the data sets
in the MERGE statement and submit it again. Examine the results.

Reminder: Make sure you've defined the orion library.

********** Create Data **********;


data empsau;
input First $ Gender $ EmpID;
datalines;
Togar M 121150
Kylie F 121151
Birin M 121152
;

data phones;
input EmpID Type $ Phone $15.;
datalines;
121150 Home +61(2)5555-1793
121150 Work +61(2)5555-1794
121151 Home +61(2)5555-1849
121152 Work +61(2)5555-1850
121152 Home +61(2)5555-1665
121152 Cell +61(2)5555-1666
;

********** One-to-Many Merge **********;


data empphones;
merge phones empsau;
by EmpID;
run;

proc print data=empphones;


run;

In a one-to-many merge, does it matter which data set is listed first in the MERGE statement?
a. Yes

b. No

The correct answer is b. When you reverse the order of the data sets in the MERGE statement, the results are the same,
but the order of the variables is different. SAS performs a many-to-one merge.
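
To check this for yourself, you could create both versions and compare them, as in the sketch below. The data set name phonesemp is a hypothetical name for the reversed merge, and the PROC CONTENTS and PROC COMPARE steps are additions used only for checking.

```sas
/* Recreate the scenario data sets (values copied from the activity above). */
data empsau;
   input First $ Gender $ EmpID;
   datalines;
Togar M 121150
Kylie F 121151
Birin M 121152
;
run;

data phones;
   input EmpID Type $ Phone $15.;
   datalines;
121150 Home +61(2)5555-1793
121150 Work +61(2)5555-1794
121151 Home +61(2)5555-1849
121152 Work +61(2)5555-1850
121152 Home +61(2)5555-1665
121152 Cell +61(2)5555-1666
;
run;

/* One-to-many: empsau is listed first. */
data empphones;
   merge empsau phones;
   by EmpID;
run;

/* Many-to-one: phones is listed first (phonesemp is a hypothetical name). */
data phonesemp;
   merge phones empsau;
   by EmpID;
run;

/* VARNUM shows the variable order, which depends on the MERGE order. */
proc contents data=empphones varnum;
run;

proc contents data=phonesemp varnum;
run;

/* PROC COMPARE matches variables by name, so position does not matter here. */
proc compare base=empphones compare=phonesemp;
run;
```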

Merging SAS Data Sets that Have Non-Matches


An Orion Star manager in Australia has requested an inventory of employees with company phones. As you know, the
data set empsau contains the first name, gender, and employee ID. The second data set, phonec, contains employee IDs
and phone numbers. Both data sets are already sorted by the common variable, EmpID.

empsau phonec

First Gender EmpID EmpID Phone

Togar M 121150 121150 +61(2)5555-1795

Kylie F 121151 121152 +61(2)5555-1667

Birin M 121152 121153 +61(2)5555-1348

You need to match-merge these data sets, but when you examine them, you notice something interesting. There's an
observation in empsau that does not have a match in phonec, and there's an observation in phonec that does not have
a match in empsau. You want the output data set, empsauc, to contain only the observations that match across the
input data sets.

How SAS Match-Merges Data Sets with Non-Matches


By default, the DATA step includes both the matching and non-matching observations in a merged data set. Let's see
how SAS processes the DATA step in this scenario. We'll start at the beginning of the execution phase. SAS has already
created the PDV and the descriptor portion of the output data set with the four variables First, Gender, EmpID, and
Phone.

empsau phonec

First Gender EmpID EmpID Phone

Togar M 121150 121150 +61(2)5555-1795

Kylie F 121151 121152 +61(2)5555-1667

Birin M 121152 121153 +61(2)5555-1348


data empsauc;
merge empsau phonec;
by EmpID;
run;

SAS looks at the first observation in each data set to determine which BY group should appear first. These two
observations are in the same BY group. So SAS reads the values from the current observation in each data set into the
PDV.

PDV

First Gender EmpID Phone

Togar M 121150 +61(2)5555-1795

Then SAS writes the contents of the PDV to the output data set as the first observation.

empsauc

First Gender EmpID Phone

Togar M 121150 +61(2)5555-1795

The values remain in the PDV as SAS begins the next iteration of the DATA step.

PDV

First Gender EmpID Phone

Togar M 121150 +61(2)5555-1795

SAS now looks at the second observation in both data sets.

empsau phonec

First Gender EmpID EmpID Phone

Togar M 121150 121150 +61(2)5555-1795

Kylie F 121151 121152 +61(2)5555-1667

Birin M 121152 121153 +61(2)5555-1348

Do these EmpID values match? No, they don't. Does either EmpID match the EmpID in the PDV? No. Neither of these
observations is in the same BY group as the PDV. This is the first non-matching observation that SAS has identified.
Because the current observations are not in the same BY group as the values in the PDV, SAS reinitializes the PDV.

PDV

First Gender EmpID Phone

Now think about this. Which EmpID value comes first sequentially? In the current observations, the EmpID value ending
in 151 comes before the value ending in 152. So SAS reads the second observation in empsau into the PDV. In the PDV,
Phone is still set to missing because there is no phone number for this employee in phonec.

PDV

First Gender EmpID Phone

Kylie F 121151

SAS writes the data in the PDV to the output data set as the second observation.

empsauc

First Gender EmpID Phone

Togar M 121150 +61(2)5555-1795

Kylie F 121151

Once again, SAS returns to the top of the DATA step and moves down to the third observation in empsau.

empsau phonec

First Gender EmpID EmpID Phone

Togar M 121150 121150 +61(2)5555-1795

Kylie F 121151 121152 +61(2)5555-1667

Birin M 121152 121153 +61(2)5555-1348


In phonec, SAS is still looking at the second observation. Do the EmpID values match? Yes, they do. Does either EmpID
match the EmpID in the PDV? No. These two observations are not in the same BY group as in the PDV, so SAS
reinitializes the PDV.

PDV

First Gender EmpID Phone

Then, SAS reads the values from the empsau observation, and then the phonec observation into the PDV.

PDV

First Gender EmpID Phone

Birin M 121152 +61(2)5555-1667

SAS writes the data to the output data set as the third observation.

empsauc

First Gender EmpID Phone

Togar M 121150 +61(2)5555-1795

Kylie F 121151

Birin M 121152 +61(2)5555-1667

Once again, SAS returns to the top of the DATA step. SAS has reached the end of the file in empsau, but not in phonec.
SAS looks at the third observation in phonec.

empsau phonec

First Gender EmpID EmpID Phone

Togar M 121150 121150 +61(2)5555-1795

Kylie F 121151 121152 +61(2)5555-1667


Birin M 121152 121153 +61(2)5555-1348

Does this EmpID match the EmpID in the PDV?

PDV

First Gender EmpID Phone

Birin M 121152 +61(2)5555-1667

No, this observation is not in the same BY group, so SAS reinitializes the PDV.

PDV

First Gender EmpID Phone

SAS reads the values from the phonec observation into the PDV,

PDV

First Gender EmpID Phone

121153 +61(2)5555-1348

and then writes the data to the output data set as the fourth observation.

empsauc

First Gender EmpID Phone

Togar M 121150 +61(2)5555-1795

Kylie F 121151

Birin M 121152 +61(2)5555-1667

121153 +61(2)5555-1348
SAS returns to the top of the DATA step. SAS has reached the end of file in both data sets.

Match-Merging Data Sets with Non-Matches


In this demonstration, you match-merge data sets that contain non-matches.

1. Copy and paste the following program into the editor. The first two DATA steps create the data for this
demonstration.

In the third DATA step, the DATA statement specifies the name of the output data set, empsauc. The PROC
PRINT step creates the report.

data empsau;
input First $ Gender $ EmpID;
datalines;
Togar M 121150
Kylie F 121151
Birin M 121152
;
run;

data phonec;
input EmpID Phone $15.;
datalines;
121150 +61(2)5555-1795
121152 +61(2)5555-1667
121153 +61(2)5555-1348
;
run;

********** Match-Merge with Non-Matches**********;


data empsauc;
merge empsau phonec;
by EmpID;
run;

proc print data=empsauc;


run;

2. Submit the program and then check the log. You can see messages indicating that SAS ran successfully. SAS read
3 observations from empsau and from phonec, but the new data set has 4 observations.

3. View the PROC PRINT output. The report contains both matches and non-matches. Matches are observations
that contain data from both input data sets. Non-matches are observations that contain data from only one
input data set. This data set has two non-matches, one from each of the input data sets. The manager requested
a report listing employees with company phones. Next you'll learn how to provide a more accurate report for the task.

Question
Which data set(s) contributed information to the first observation in the output data set empsauc?

Partial empsauc
First Gender EmpID Phone
Togar M 121150 +61(2)5555-1795
Kylie F 121151
Birin M 121152 +61(2)5555-1667
121153 +61(2)5555-1348
a. empsau
b. phonec
c. both empsau and
phonec
d. insufficient information

The correct answer is c. Both data sets contributed to the first observation. If one of the data sets had not contributed,
you would see missing values for at least one variable in the observation.

Business Scenario
Given the data for your task, you decide that you'd like to provide the manager with three separate phone inventory
reports: one for employees with company phones, one for employees without company phones, and one for those with
an EmpID not found in the empsau data set. To do this, you'll need to know which data sets contributed to each
observation in the merged data set. You can use a data set option to identify the data set contributors.

Using the IN= Data Set Option


You can use the IN= data set option in a MERGE statement to identify which input data sets contributed to each
observation in your output. After a SAS data set name, you specify the IN= option in parentheses, followed by a valid SAS
variable name.

MERGE SAS-data-set (IN=variable)...

When you specify the IN= option after an input data set in the MERGE statement, SAS creates a temporary numeric
variable that indicates whether the data set contributed data to the current observation. The temporary variable has
two possible values. If the value of the variable is 0, it indicates that the data set did not contribute to the current
observation. If the value of the variable is 1, the data set did contribute to the current observation.

In this example, the IN= option is specified after each of the input data sets.

data empsauc;
merge empsau(in=Emps)
phonec(in=Cell);
by EmpID;
run;

We want to know when either of these data sets contributes to the current observation. We've chosen the variable
names Emps and Cell. Here's another example using just E and P as the variable names.

data empsauc;
merge empsau(in=E)
phonec(in=P);
by EmpID;
run;

This last example shows how you can use the IN= option on just one of the data sets in a MERGE statement.

data empsauc;
merge empsau(in=AU)
phonec;
by EmpID;
run;

How SAS Processes the IN= Data Set Option


Here's how SAS processes the IN= data set option in the DATA step.

empsau phonec

First Gender EmpID EmpID Phone

Togar M 121150 121150 +61(2)5555-1795

Kylie F 121151 121152 +61(2)5555-1667

Birin M 121152 121153 +61(2)5555-1348

data empsauc;
merge empsau(in=Emps)
phonec(in=Cell);
by EmpID;
run;

During the compilation phase, SAS creates a temporary variable in the PDV for each instance of the IN= data set option in
your code. During execution, each time SAS reads data into the PDV, SAS assigns a value to the temporary variables Emps
and Cell to indicate whether the associated data set contributed data to the current observation.

PDV

First Gender EmpID Emps Phone Cell

In the first iteration, both data sets contributed to the data that is in the PDV, so the value of both temporary variables is
1. We have a match.

PDV

First Gender EmpID Emps Phone Cell

Togar M 121150 1 +61(2)5555-1795 1

In the second iteration, the data set phonec did not contribute, so the value of the temporary variable Cell is set to 0.
We have a non-match.
empsau phonec

First Gender EmpID EmpID Phone

Togar M 121150 121150 +61(2)5555-1795

Kylie F 121151 121152 +61(2)5555-1667

Birin M 121152 121153 +61(2)5555-1348

PDV

First Gender EmpID Emps Phone Cell

Kylie F 121151 1 0

In the third iteration, both data sets contributed to the data that is in the PDV, so SAS assigns the value 1 to both Emps
and Cell. We have another match.

empsau phonec

First Gender EmpID EmpID Phone

Togar M 121150 121150 +61(2)5555-1795

Kylie F 121151 121152 +61(2)5555-1667

Birin M 121152 121153 +61(2)5555-1348

PDV

First Gender EmpID Emps Phone Cell

Birin M 121152 1 +61(2)5555-1667 1

Take a moment to look at the values after the fourth iteration.

empsau phonec
First Gender EmpID EmpID Phone

Togar M 121150 121150 +61(2)5555-1795

Kylie F 121151 121152 +61(2)5555-1667

Birin M 121152 121153 +61(2)5555-1348

What are the values of Emps and Cell for this data? This data is a non-match that comes from phonec but not from
empsau. SAS sets the variable Emps to 0 and Cell to 1.

PDV

First Gender EmpID Emps Phone Cell

121153 0 +61(2)5555-1348 1

You might be wondering whether variables that are created with the IN= data set option appear in the output data set.
These variables are only available during execution. As you can see by looking at the partial output data set here, SAS
does not write these temporary variables to the output data set.

empsauc

First Gender EmpID Phone

Togar M 121150 +61(2)5555-1795

Kylie F 121151
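
Because the IN= variables are not written to the output data set, one way to see their values is to copy them into ordinary variables with assignment statements. The sketch below does this for exploration only. FromEmps and FromCell are hypothetical names, not part of the course program.

```sas
/* Recreate the scenario data sets (values copied from the tables in this lesson). */
data empsau;
   input First $ Gender $ EmpID;
   datalines;
Togar M 121150
Kylie F 121151
Birin M 121152
;
run;

data phonec;
   input EmpID Phone $15.;
   datalines;
121150 +61(2)5555-1795
121152 +61(2)5555-1667
121153 +61(2)5555-1348
;
run;

data empsauc;
   merge empsau(in=Emps)
         phonec(in=Cell);
   by EmpID;
   /* Copy the temporary IN= values into regular variables so they are kept. */
   FromEmps = Emps;
   FromCell = Cell;
run;

proc print data=empsauc;
run;
```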

Selecting Observations by Using the Subsetting IF Statement


Now that you've added the IN= data set option to your DATA step, you can test the values of the IN= variables using
conditional logic. You can add a subsetting IF statement to your DATA step that refers to the variables you created using
IN=.

IF expression;

This way, you can select only the matches or only the non-matches for your output data set.

Question
Which subsetting IF statement can be added to the DATA step to only output the matches?
data empsauc;
merge empsau(in=Emps)
phonec(in=Cell);
by EmpID;
run;

a. if Emps=1 and Cell=0;


b. if Emps=1 and Cell=1;
c. if Emps=1;
d. if Cell=0;

The correct answer is b. If the values of both Emps and Cell equal 1, then both data sets contributed to the observation.
This subsetting IF statement selects only the matches.

Selecting Only Matches


To create a merged data set that includes only matches, you need to modify your DATA step. First, you use the IN= data
set option to identify which input data sets contributed data to each observation that SAS outputs. Then you can use the
subsetting IF statement to output only those observations that contain data from all of the input data sets: if Emps=1
and Cell=1.

data empsauc;
merge empsau(in=Emps)
phonec(in=Cell);
by EmpID;
if Emps=1 and Cell=1;
run;

Look at the tables below.

empsau phonec

First Gender EmpID EmpID Phone

Togar M 121150 121150 +61(2)5555-1795

Kylie F 121151 121152 +61(2)5555-1667

Birin M 121152 121153 +61(2)5555-1348

How many observations will the output data set contain?

The output data set contains only two observations for the two employee ID numbers that match across the input data
sets. This generates what we need for part of our task: the first report for employees with company phones.

empsauc

First Gender EmpID Phone

Togar M 121150 +61(2)5555-1795

Birin M 121152 +61(2)5555-1667

Selecting Non-Matches
In this demonstration, you select non-matches from data sets using the IN= data set option.

1. Copy and paste the following program into the editor. The first two DATA steps create the data for this
demonstration.

data empsau;
input First $ Gender $ EmpID;
datalines;
Togar M 121150
Kylie F 121151
Birin M 121152
;
run;

data phonec;
input EmpID Phone $15.;
datalines;
121150 +61(2)5555-1795
121152 +61(2)5555-1667
121153 +61(2)5555-1348
;
run;

********** Non-Matches from empsau **********;


data empsauc;
merge empsau(in=Emps)
phonec(in=Cell);
by EmpID;
run;

proc print data=empsauc;


run;

2. To provide management with the final two reports, one for employees without company phones and one for
employee IDs not found in the empsau data set, you need to select non-matches. First, to select only the
employees without company phones, you want the non-matches from the empsau data set.

Think about what IF expression you can use to select only the non-matches from empsau. You can use if Emps=1
and Cell=0. Add the IF expression after the BY statement as shown below. In both the DATA step and PROC
PRINT step, change the data set name to empsauc2.

data empsauc2;
merge empsau(in=Emps)
phonec(in=Cell);
by EmpID;
if Emps=1 and Cell=0;
run;

proc print data=empsauc2;


run;

3. Submit the program and then check the log. You can see messages indicating that SAS ran successfully. SAS
wrote one observation to the output data set.

4. View the PROC PRINT output. The report shows that Kylie is the only employee from these data sets who does
not have a company phone.

5. You need to modify the DATA step and PROC PRINT step to create the final report, the one for employee IDs not
found in the empsau data set. Which data set needs to contribute to create this report? Right, phonec needs to
contribute. So the IF expression you need to use is if Emps=0 and Cell=1. Modify the IF statement and change
the output data set name to empsauc3 as shown below.

********** Non-Matches from phonec **********;


data empsauc3;
merge empsau(in=Emps)
phonec(in=Cell);
by EmpID;
if Emps=0 and Cell=1;
run;

proc print data=empsauc3;


run;

6. Submit this code and check the log. The log shows that SAS wrote one observation to the output data set in this case as well.

7. View the report. It shows that the phone number ending in 1348 is unassigned.

Question
Which of the following DATA steps selects non-matches from either data set?

empsau                        phonec
First   Gender  EmpID         EmpID    Phone
Togar   M       121150        121150   +61(2)5555-1795
Kylie   F       121151        121152   +61(2)5555-1667
Birin   M       121152        121153   +61(2)5555-1348

a.
data empsauc4;
merge empsau(in=Emps)
phonec(in=Cell);
by EmpID;
if Emps=0 and Cell=0;
run;

b.
data empsauc4;
merge empsau(in=Emps)
phonec(in=Cell);
by EmpID;
if Emps=0 or Cell=0;
run;

The correct answer is b. You use the OR operator to select non-matches from either data set. The resulting output data
set would contain two observations: one for Kylie from empsau, and one for EmpID 121153 from phonec.

Using Alternate Syntax

When you are checking a variable for a value of 1 or 0, as in the previous scenario, you can use alternate syntax.
For example, instead of using if Emps=1 and Cell=1, you could use if Emps and Cell. Both versions will
return the matches only.

if Emps=1 and Cell=1;

if Emps and Cell;

Instead of using if Emps=1 and Cell=0, you could use if Emps and not Cell. In this case, SAS will find the non-matches
from the first data set.

if Emps=1 and Cell=0;

if Emps and not Cell;

Another example is instead of using if Emps=0 and Cell=1, you could use if not Emps and Cell.

Can you determine what SAS will select in this case?

if Emps=0 and Cell=1;

if not Emps and Cell;

SAS will find the non-matches from the second data set. And in the last example, instead of using if Emps=0 or Cell=0,
you could use if not Emps or not Cell.

if Emps=0 or Cell=0;

if not Emps or not Cell;

SAS selects the non-matches from either data set.
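The shorthand conditions above can be tried side by side. This is a sketch under stated assumptions: it reuses the demonstration data from this lesson, and the output data set names matches, emps_only, and cell_only are hypothetical names chosen for illustration.

```sas
data empsau;
   input First $ Gender $ EmpID;
   datalines;
Togar M 121150
Kylie F 121151
Birin M 121152
;
run;

data phonec;
   input EmpID Phone $15.;
   datalines;
121150 +61(2)5555-1795
121152 +61(2)5555-1667
121153 +61(2)5555-1348
;
run;

/* Matches only: same result as if Emps=1 and Cell=1 */
data matches;
   merge empsau(in=Emps) phonec(in=Cell);
   by EmpID;
   if Emps and Cell;
run;

/* Non-matches from empsau: same result as if Emps=1 and Cell=0 */
data emps_only;
   merge empsau(in=Emps) phonec(in=Cell);
   by EmpID;
   if Emps and not Cell;
run;

/* Non-matches from phonec: same result as if Emps=0 and Cell=1 */
data cell_only;
   merge empsau(in=Emps) phonec(in=Cell);
   by EmpID;
   if not Emps and Cell;
run;
```

With this data, matches contains two observations (121150 and 121152), emps_only contains one (Kylie, 121151), and cell_only contains one (121153).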

Code Challenge
Write a subsetting IF statement that selects observations for subsequent processing only if all three input data sets
contributed to the current observation.

data mergedata.emppay;
merge sales.reps(rename=(office=OfficeNumber) in=inreps)
empinfo.sales(in=insales)
empinfo.bonuses(in=inbonus);
by Emp_ID;

;
run;

if inreps and insales and inbonus;

To select only observations composed of values from all input data sets, the subsetting IF statement specifies all
three IN= variables in the IF condition. The AND operator joins expressions in the condition.

Summary of Lesson 10: Combining SAS Data Sets

This summary contains topic summaries, syntax, and sample programs.

Topic Summaries

Concatenating Data Sets


You can concatenate two or more data sets by combining them vertically to create a new data set. It is important to
know the structure and contents of the input data sets.

You use a DATA step to concatenate multiple data sets into a single, new data set. In the SET statement, you can specify
any number of input data sets to concatenate. During compilation, SAS uses the descriptor portion of the first data set
to create variables in the PDV, and then continues with each subsequent data set, creating additional variables in the
PDV as needed. During execution, SAS processes the data sets in the order in which they are listed in the SET statement.

DATA SAS-data-set;
SET SAS-data-set1 SAS-data-set2 ...;
RUN;

If the data sets have differently named variables, every variable is created in the new data set, and some observations
have missing values for the differently named variables. You can use the RENAME= data set option to change variable
names in one or more data sets. After they are renamed, they are treated as the same variable during compilation and
execution, and in the new data set.

SAS-data-set (RENAME=(old-name-1=new-name-1
old-name-2=new-name-2
...
old-name-n=new-name-n))
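As an illustration of the RENAME= syntax, the sketch below concatenates two small data sets in which the same information is stored under different variable names, Country and Region. The data values follow the sample programs in this lesson. The output data set name empsall_renamed is a hypothetical name used only for this example.

```sas
data empscn;
   input First $ Gender $ Country $;
   datalines;
Chang M China
Li M China
Ming F China
;
run;

data empsjp;
   input First $ Gender $ Region $;
   datalines;
Cho F Japan
Tomi M Japan
;
run;

/* Rename Region to Country as empsjp is read, so that both   */
/* input data sets supply the same three variables to the PDV */
data empsall_renamed;
   set empscn
       empsjp(rename=(Region=Country));
run;

proc print data=empsall_renamed;
run;
```

The result has five observations and three variables (First, Gender, Country), with no missing values caused by mismatched names.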

Merging SAS Data Sets One-to-One

Merging combines observations from two or more SAS data sets into a single observation in a new data set. A simple
merge combines observations based on their positions in the original data sets. A match-merge combines them based
on the values of one or more common variables. The result of a match-merge is dependent on the relationship between
observations in the input data sets.

You use a DATA step with a MERGE statement to merge multiple data sets into a single data set. The BY statement
indicates a match-merge and specifies the common variable or variables to match. The common variables are referred
to as BY variables. The BY variables must exist in every data set, and each data set must be sorted by the value of the BY
variables.

DATA SAS-data-set;
MERGE SAS-data-set1 SAS-data-set2 ...;
BY <DESCENDING> BY-variable(s);
<additional SAS statements>
RUN;

Merging SAS Data Sets One-to-Many


In a one-to-many merge, a single observation in one data set matches more than one observation in another data set.
The DATA step is the same, regardless of the relationship between the data sets being merged. SAS processes each BY
group before reinitializing the PDV.
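The sketch below shows a one-to-many merge with the same DATA step structure as a one-to-one merge. The phones data set, its Type variable, and the output name empphones are hypothetical and were made up for this illustration. Because phones is not in EmpID order, it is sorted first.

```sas
data empsau;
   input First $ Gender $ EmpID;
   datalines;
Togar M 121150
Kylie F 121151
Birin M 121152
;
run;

/* Hypothetical data: employee 121150 has two phone numbers */
data phones;
   input EmpID Type $ Phone $15.;
   datalines;
121152 Work +61(2)5555-1667
121150 Home +61(2)5555-1793
121150 Work +61(2)5555-1795
121151 Home +61(2)5555-1849
;
run;

/* A match-merge requires each input data set to be sorted by the BY variable */
proc sort data=phones;
   by EmpID;
run;

data empphones;
   merge empsau phones;
   by EmpID;
run;

proc print data=empphones;
run;
```

The output contains four observations. The values of First and Gender for Togar are carried onto both of the 121150 phone records because SAS retains them within the BY group.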

Merging SAS Data Sets That Have Non-Matches


When you merge data sets, observations in one data set might not have a matching observation in another data set.
These are called non-matches. By default, both matches and non-matches are included in a merged data set. The
observations that are matches contain data from every input data set. The non-matching observations do not contain
data from every input data set.

You can use the IN= data set option in a MERGE statement to create a temporary variable that indicates whether a data
set contributed information to the observation in the PDV. The IN= variables have two possible values: 0 and 1. You can
test the value of this variable using subsetting IF statements to output only the matches or only the non-matches to the
merged data set.

MERGE SAS-data-set1 <(IN=variable)>...

Sample Programs

Concatenating Data Sets with Different Variables

********** Create Data **********;


data empscn;
input First $ Gender $ Country $;
datalines;
Chang M China
Li M China
Ming F China
;
run;
data empsjp;
input First $ Gender $ Region $;
datalines;
Cho F Japan
Tomi M Japan
;
run;

********** Unlike-Structured Data Sets **********;


data empsall2;
set empscn empsjp;
run;

proc print data=empsall2;


run;

Merging Data Sets One-to-One

********** Create Data **********;


data empsau;
input First $ Gender $ EmpID;
datalines;
Togar M 121150
Kylie F 121151
Birin M 121152
;
run;

data phoneh;
input EmpID Phone $15.;
datalines;
121150 +61(2)5555-1793
121151 +61(2)5555-1849
121152 +61(2)5555-1665
;
run;

********** Match-Merge One-to-One**********;


data empsauh;
merge empsau phoneh;
by EmpID;
run;

proc print data=empsauh;


run;

Match-Merging Data Sets with Non-Matches

********** Create Data **********;


data empsau;
input First $ Gender $ EmpID;
datalines;
Togar M 121150
Kylie F 121151
Birin M 121152
;
run;

data phonec;
input EmpID Phone $15.;
datalines;
121150 +61(2)5555-1795
121152 +61(2)5555-1667
121153 +61(2)5555-1348
;
run;

********** Match-Merge with Non-Matches**********;


data empsauc;
merge empsau phonec;
by EmpID;
run;

proc print data=empsauc;


run;

Selecting Non-Matches

********** Create Data **********;


data empsau;
input First $ Gender $ EmpID;
datalines;
Togar M 121150
Kylie F 121151
Birin M 121152
;
run;

data phonec;
input EmpID Phone $15.;
datalines;
121150 +61(2)5555-1795
121152 +61(2)5555-1667
121153 +61(2)5555-1348
;
run;

********** Non-Matches from empsau Only **********;


data empsauc2;
merge empsau(in=Emps)
phonec(in=Cell);
by EmpID;
if Emps=1 and Cell=0;
run;

proc print data=empsauc2;


run;

Lesson 11: Creating Summary Reports


You can use PROC steps to generate reports for many different purposes. For example, in this lesson, you'll use SAS
procedures to summarize data. Summary reports provide a concise overview of information by using descriptive
statistics such as frequency counts and percentages, sums, means, ranges, and many more. A summary report
consolidates data so that each row represents multiple observations of input data.

Using PROC FREQ and PROC MEANS, you'll learn to create summary reports. You'll also learn how to use these
procedures, as well as PROC PRINT and PROC UNIVARIATE, to validate your data.

Now suppose you want to send the output you create to external files. In this lesson, you'll also work with the SAS
Output Delivery System, or ODS. You'll use ODS statements to send the output from your SAS procedures to many types
of external files.

Objectives

In this lesson, you learn to do the following:

• produce one-way and two-way frequency tables by using the FREQ procedure
• enhance frequency tables by using options
• use PROC FREQ to validate data in a SAS data set
• calculate summary statistics and multilevel summaries by using the MEANS procedure
• enhance summary tables by using options
• identify extreme and missing values by using the UNIVARIATE procedure
• define the Output Delivery System and ODS destinations
• use ODS statements to direct report output to various ODS destinations
• specify a style definition by using the STYLE= option
• create report output that can be viewed in Microsoft Excel

Business Scenario
Suppose your manager at Orion Star wants to know the number of male and female sales employees in Australia. You
need to use orion.sales to analyze the number of occurrences of each value for Gender. In other words, you want to
know the frequency of the variable Gender.

Using PROC FREQ


To find the frequency of Gender in orion.sales, you can use PROC FREQ to create frequency tables. PROC FREQ produces
frequency tables that report the distribution of any or all variable values in a SAS data set.

PROC FREQ DATA=SAS-data-set <option(s)>;
TABLES variable(s) </ option(s)>;
<additional statements>
RUN;

In a PROC FREQ statement, you specify the input data set.

In the TABLES statement, you specify the frequency tables to produce. You can define one or more frequency tables in a
single TABLES statement.

proc freq data=orion.sales;


tables Gender Country;
where Country='AU';
run;

To create one-way frequency tables, you specify one or more variable names separated by a space. If you omit the
TABLES statement, SAS produces a one-way frequency table for every variable in the data set. This can produce
tremendous output and is seldom desired.

For our task, we'll specify the variable Gender. Notice that we also have a WHERE statement to subset the data by the
Country variable.

proc freq data=orion.sales;


tables Gender;
where Country='AU';
run;

Creating a One-Way Frequency Report


In this demonstration, you create a one-way frequency report.

1. Copy and paste the following program into the editor. PROC FREQ automatically displays output in a report, so
you don't need to add a PROC PRINT step.

proc freq data=orion.sales;


tables Gender;
where Country='AU';
run;

2. Submit the code and then check the log. The log shows that SAS read 63 observations from orion.sales.

3. Examine the report. This frequency table shows frequency statistics for the variable Gender. By default, the first
column displays the values of the specified variable. Each unique value is called a level of the variable. In this
table, Gender has two levels: F for female and M for male.

For each value, a one-way frequency table shows four statistics by default. The Frequency column displays the
frequency count, which is the number of observations in a level. The Percent column displays the percentage of
the total number of observations. The Cumulative Frequency column displays the cumulative frequency count,
which is the sum of the frequency counts for that level and for all levels listed above it. The Cumulative Percent
column displays the percentage of the total number of observations in that level and in all other levels listed
above it.

Can you determine what percent of the Orion Star sales force in Australia is female? Females represent about
43% of the sales force.
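To make the four statistics concrete, the sketch below recomputes them by hand in a DATA step. It assumes the counts that the Australian report displays (27 female and 36 male employees, 63 in total). The data set and variable names are hypothetical, and PROC FREQ does this work for you, so the step is only a check on the arithmetic.

```sas
/* Hypothetical input: the frequency counts from the Gender report */
data gender_counts;
   input Gender $ Frequency;
   datalines;
F 27
M 36
;
run;

data gender_stats;
   set gender_counts;
   Percent = 100 * Frequency / 63;          /* 63 = 27 + 36, the total count     */
   CumFrequency + Frequency;                /* sum statement: running total      */
   CumPercent = 100 * CumFrequency / 63;    /* running total as a percent of 63  */
   format Percent CumPercent 6.2;
run;

proc print data=gender_stats noobs;
run;
```

The F row shows 42.86 for both Percent and CumPercent, and the M row shows 57.14 and 100.00, with cumulative frequencies of 27 and 63.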

Suppressing Statistics in One-Way Frequency Tables


This one-way frequency table shows the four statistics that PROC FREQ displays by default.

proc freq data=orion.sales;


tables Gender;
where Country='AU';
run;

Gender   Frequency   Percent   Cumulative Frequency   Cumulative Percent

F        27          42.86     27                     42.86

M        36          57.14     63                     100.00

Suppose you want your frequency table to display only a subset of the available statistics. By specifying options in the
TABLES statement, you can suppress one or more of the statistics in frequency tables.

For example, you can add the NOCUM and NOPERCENT options in the TABLES statement. Let's see how each of these
options affects a one-way frequency table.
The NOCUM option suppresses the display of cumulative frequency and cumulative percentages.

proc freq data=orion.sales;


tables Gender/nocum;
where Country='AU';
run;

Gender Frequency Percent

F 27 42.86

M 36 57.14

Notice that you must specify a forward slash before you specify options. The NOPERCENT option suppresses the display
of all percentages.

proc freq data=orion.sales;


tables Gender/nopercent;
where Country='AU';
run;

Gender   Frequency   Cumulative Frequency

F        27          27

M        36          63

You can also specify both NOCUM and NOPERCENT. When you specify multiple options, you separate them by a space.

proc freq data=orion.sales;


tables Gender/nocum nopercent;
where Country='AU';
run;

If you specify both NOCUM and NOPERCENT, which statistics appear in the frequency table? The one-way frequency
table displays only frequencies.

Gender Frequency

F 27

M 36

Activity
Copy and paste the following program into the editor. Submit the program and check the log. Correct the program and
resubmit it.

Reminder: Make sure you've defined the orion library.

proc freq data=orion.sales;


tables Country nocum nopercent;
run;

What change was needed in the TABLES statement?

a. a comma between NOCUM and NOPERCENT

b. a forward slash before the TABLES statement options

c. an equals sign after the TABLES statement variable

d. none of the above

The correct answer is b. You must use a forward slash before you specify options in the TABLES statement.
When you specify multiple options, you separate them by a space, not a comma.

tables Country / nocum nopercent;

Code Challenge
Complete the FREQ procedure shown below. Compute frequency statistics for the variable Test1. Suppress the
display of all cumulative statistics.

proc freq data=students.bio;

;
run;

tables Test1/nocum;

In the TABLES statement, you list the variable Test1, followed by a forward slash, and then the NOCUM
option. The NOCUM option suppresses the display of cumulative frequency and cumulative percentages.

Selecting Variables for Frequency Tables


When you use PROC FREQ to create summary reports, it's more useful to display the distribution of values for some
variables than for others. Consider this. Which of these variables is probably not useful to include in a PROC FREQ
summary report?

Variables Values

Employee_ID
First_Name

Last_Name

Gender

Salary

Job_Title

Country

Birth_Date

Hire_Date

The variables Employee_ID, First_Name, and Last_Name are not good choices for a PROC FREQ summary report. Every
employee has a unique employee ID number, and most employees have unique names. When you're summarizing data,
there's no need to show a frequency distribution for variables that have a large number of distinct values.

Frequency distributions work best with variables whose values meet two criteria. First, the values of the variable are
categorical. Second, the values are best summarized by counts instead of averages. For example, the values of Gender
fall into two categories: female and male. Gender is a character variable, so its values must be counted and cannot be
averaged.

What are two other categorical variables in this list? Job_Title and Country are also categorical variables, so they are
probably good choices for a frequency distribution.

What about Salary, Birth_Date, and Hire_Date? Variables that have continuous numeric values, such as dollar amounts
and dates, can result in a lengthy and meaningless frequency table. To create a useful frequency report for these
variables, you can group the variable values into categories. How can you do that? You can group the values of a variable
into categories by applying formats. You can use existing SAS formats or user-defined formats. For example, you could
use the TIERS format to categorize salary values.

proc format;
value tiers 20000-<50000='Tier1'
50000-<100000='Tier2'
100000-250000='Tier3';
run;

After you group the values of a continuous numeric variable into categories, you can create a meaningful frequency
table for that variable.

Using Formats in PROC FREQ


In this demonstration, you use a format in a PROC FREQ step to categorize values.
1. Copy and paste the following program into the editor. The PROC FORMAT step creates the TIERS format. It
groups ranges of Salary values into four tier levels. The PROC FREQ step specifies orion.sales as the input data
set and lists the Salary variable in the TABLES statement.

How can you apply the TIERS format to Salary? The FORMAT statement specifies Salary, followed by the TIERS.
format.

proc format;
value tiers low-25000='Tier1'
25000<-50000='Tier2'
50000<-100000='Tier3'
100000<-high='Tier4';
run;

proc freq data=orion.sales;


tables Salary;
format Salary tiers.;
run;

2. Submit the code and then check the log. The log shows that the format TIERS was created and that the PROC
FREQ step ran successfully.

3. Examine the report. As you can see, the Salary values have been categorized into the four tiers. For each tier,
you can see the frequency of employees. Notice that Tier2 has the highest number of employees.

Question
Which variable would be a poor choice for frequency tables?

Variable   Type        Length   Description

Name       character   40       first and last name
AgeRange   numeric     1        five coded levels
Gender     character   1        F or M
Region     numeric     1        six coded zones

a. Name
b. AgeRange
c. Gender
d. Region

The correct answer is a. Name is likely to contain unique values, producing lengthy and meaningless output. It is not a
good choice for the FREQ procedure.

Business Scenario
You've analyzed orion.sales for the frequency of Gender. Suppose you've now been asked to determine the number of
female and male sales employees in each country, Australia and the US. Let's see how you can modify your PROC FREQ
program to generate these results.

Specifying Variables in the TABLES Statement


In a TABLES statement, you can list multiple variables separated by a space. Because you want to know gender by
country, let's try specifying the variables Gender and Country in our TABLES statement.
proc freq data=orion.sales;
tables Gender Country;
run;

The order in which the variables appear in the TABLES statement determines the order in which the one-way frequency
tables appear in the report. So think about the output that SAS will produce: a frequency table for Gender and a
frequency table for Country. Let's see if this answers the question for our task.

Listing Multiple Variables in a TABLES Statement


In this demonstration, you list multiple variables in a TABLES statement.

1. Copy and paste the following program into the editor. This PROC FREQ step will create two tables, one for
Gender and one for Country.

proc freq data=orion.sales;


tables Gender Country;
run;

2. Submit the program and then check the log. The log shows that SAS read 165 observations from orion.sales.

3. Examine the results. You can easily see how many female and male sales employees there are, but can you
determine how many females are in Australia? No, you can't determine that information in this report.

The Country table lists the frequency of employees in each country, but you don't know how many females are
in a particular country. Do you think you can use PROC FREQ to determine this kind of information? What you
really want is a separate analysis for each group.

4. To find the frequency of Gender by Country, you need to add a BY statement: by Country. Do you recall what
you first need to do to the input data set before you submit this code? You need to sort orion.sales by Country.
Whenever you use the BY statement, the data set must be sorted by the variable named in the statement. The
PROC SORT step includes the OUT= option and names the output data set sorted. In the PROC FREQ step, you
change the input data set to sorted. Copy and paste the following program into the editor and submit it.

proc sort data=orion.sales out=sorted;


by Country;
run;

proc freq data=sorted;


tables Gender;
by Country;
run;

5. Check the log. The log shows that everything ran successfully.

6. Examine the new report. You can see that each group is in a separate frequency table. The Australian female
and male frequencies are in the first table, followed by the US female and male frequencies. Notice that there
are 27 female sales employees in Australia. As you can see, you can easily modify your analysis to produce the
results you need.

Using PROC FREQ to Create Crosstabulation Tables


The code you saw in the previous demonstration creates two one-way frequency tables: a table for gender frequencies
in Australia followed by a table for gender frequencies in the US. However, sometimes it's helpful to view a single table
with statistics for each distinct combination of values of the selected variables.
PROC FREQ can generate crosstabulation tables, which summarize data for two or more categorical variables by showing
the number of observations for each combination of variable values. The simplest crosstabulation table is a two-way
table. In the TABLES statement, you specify an asterisk instead of a space between the variable names Gender and
Country.

proc freq data=orion.sales;


tables Gender*Country;
run;

Table of Gender by Country

Frequency
Percent
Row Pct
Col Pct
                   Country
Gender       AU       US       Total

F            27       41       68
             16.36    24.85    41.21
             39.71    60.29
             42.86    40.20

M            36       61       97
             21.82    36.97    58.79
             37.11    62.89
             57.14    59.80

Total        63       102      165
             38.18    61.82    100.00

In a two-way table, the first variable specifies the table rows and the second variable specifies the table columns.

Creating a Crosstabulation Table


In this demonstration, you use PROC FREQ to create a crosstabulation table.

1. Copy and paste the following program into the editor. This PROC FREQ step will create a two-way
crosstabulation table for the variables Gender and Country.

proc freq data=orion.sales;


tables Gender*Country;
run;

2. Submit the program and then check the log. The log shows that SAS read 165 observations from orion.sales.
3. Examine the results. This two-way crosstabulation table for Gender and Country shows the distribution of
females and males among Orion Star sales employees in Australia and the United States.

By default, PROC FREQ displays two-way crosstabulation tables in table cell format. The row variable values
appear on the side of the table, and the column variable values appear across the top. Each of the main cells
represents a combination of a row variable level and a column variable level. For example, this cell contains
frequency statistics for female employees in the United States. In addition, the last column and the last row
provide totals. There is a legend in the top left corner of the output.

Four statistics appear by default. The frequency statistic indicates the number of observations with the unique
combination of values represented in that cell. The percent statistic indicates the cell's percentage of the total
frequency. The row percentage is the cell's percentage of the total frequency for its row. And the column
percentage is the cell's percentage of the total frequency for its column.

Take another look at the cell for female employees in the United States. These statistics indicate that 41
observations in the data set have a value of F for Gender and a value of US for Country. These 41 observations
represent 24.85% of the data set. The row percentage indicates that 60.29% of female employees are in the
United States. The column percentage indicates that 40.20% of United States employees are female.

Which two statistics are the same in both a one-way frequency table and a crosstabulation table? The frequency
and percentage appear in both tables. Cumulative frequencies and cumulative percentages appear only in a
one-way frequency table. Row percentages and column percentages appear only in a crosstabulation table.
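The three percentages for the female, United States cell can be checked by hand. The sketch below is only an arithmetic check, not something PROC FREQ requires. It assumes the counts shown in the report (41 in the cell, 68 in the F row, 102 in the US column, 165 overall), and the variable names are hypothetical.

```sas
data _null_;
   cell        = 41;    /* Gender=F and Country=US */
   row_total   = 68;    /* all observations with Gender=F */
   col_total   = 102;   /* all observations with Country=US */
   grand_total = 165;   /* all observations in the table */

   Percent = 100 * cell / grand_total;   /* cell as a share of the whole table */
   RowPct  = 100 * cell / row_total;     /* cell as a share of its row */
   ColPct  = 100 * cell / col_total;     /* cell as a share of its column */

   /* Write the three values to the log with two decimal places */
   put Percent= 6.2 RowPct= 6.2 ColPct= 6.2;
run;
```

The values written to the log are 24.85, 60.29, and 40.20, which match the Percent, Row Pct, and Col Pct entries in that cell.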

Code Challenge
Complete the program below so that PROC FREQ creates a two-way crosstabulation of AgeRange and
MovingViolation. Use the values of MovingViolation for the table rows.

proc freq data=insure.auto;

;
run;

tables MovingViolation*AgeRange;

You specify the keyword TABLES, followed by the variables joined by an asterisk. MovingViolation is listed first to form
the table rows.

Suppressing Statistics in Crosstabulation Tables


You can also suppress the display of statistics in a crosstabulation table by adding options in the TABLES statement.
Remember that only some of the statistics shown in crosstabulation tables and one-way frequency tables are the same.
Do you think you can use the NOCUM option for a crosstabulation table? No, you can't use the NOCUM option for a
crosstabulation table because crosstabulation tables do not display cumulative statistics.

You can use the NOPERCENT option for crosstabulation tables as well as one-way frequency tables. In a crosstabulation
table, NOPERCENT suppresses the display of overall percentages. These are the second value in each cell, including the
cells in the Total row and the Total column.

proc freq data=orion.sales;


tables Gender* Country/nopercent;
run;

Table of Gender by Country

Frequency
Row Pct
Col Pct
                   Country
Gender       AU       US       Total

F            27       41       68
             39.71    60.29
             42.86    40.20

M            36       61       97
             37.11    62.89
             57.14    59.80

Total        63       102      165

Now let's look at the additional options that you can specify in the TABLES statement to suppress statistics in a
crosstabulation table. The NOFREQ option suppresses the display of cell frequencies.

proc freq data=orion.sales;


tables Gender* Country/nofreq;
run;

Table of Gender by Country

Percent
Row Pct
Col Pct
                   Country
Gender       AU       US       Total

F            16.36    24.85    41.21
             39.71    60.29
             42.86    40.20

M            21.82    36.97    58.79
             37.11    62.89
             57.14    59.80

Total        63       102      165
             38.18    61.82    100.00

NOROW suppresses the display of row percentages.

proc freq data=orion.sales;


tables Gender* Country/norow;
run;

Table of Gender by Country

Frequency
Percent
Col Pct
                   Country
Gender       AU       US       Total

F            27       41       68
             16.36    24.85    41.21
             42.86    40.20

M            36       61       97
             21.82    36.97    58.79
             57.14    59.80

Total        63       102      165
             38.18    61.82    100.00

And the NOCOL option suppresses the display of column percentages.

proc freq data=orion.sales;


tables Gender* Country/nocol;
run;

Table of Gender by Country

Frequency
Percent
Row Pct
                   Country
Gender       AU       US       Total

F            27       41       68
             16.36    24.85    41.21
             39.71    60.29

M            36       61       97
             21.82    36.97    58.79
             37.11    62.89

Total        63       102      165
             38.18    61.82    100.00

Question
Which TABLES statement correctly creates this report?

a.
tables Gender*Country
nofreq norow nocol;

b.
tables Gender*Country
nocum norow nocol;

c.
tables Gender*Country/
nofreq norow nocol;

d.
tables Gender*Country/
nocum norow nocol;

The correct answer is c. You specify the options that suppress cell frequencies and total frequencies, row percentages,
and column percentages. The only remaining statistic is cell percentages. You list TABLES statement options after a
forward slash.

Changing the Table Format in Crosstabulation Tables


Suppose you find the crosstabulation tables that PROC FREQ produces by default to be difficult to read. To simplify the
format of these tables, you can specify additional options in the TABLES statement. To display crosstabulation tables in
list format, you specify the LIST option.

proc freq data=orion.sales;


tables Gender* Country/list;
run;

Gender   Country   Frequency   Percent   Cumulative Frequency   Cumulative Percent

F        AU        27          16.36     27                     16.36

F        US        41          24.85     68                     41.21

M        AU        36          21.82     104                    63.03

M        US        61          36.97     165                    100.00

In the list version of this two-way crosstabulation table, notice that the first two columns specify each possible
combination of the two variables. All statistics for each combination are displayed in a single row.

There are differences between the statistics in the default version and those in the list version. The statistics in the list
version are the same as in a one-way frequency table. The cumulative frequency and cumulative percentage appear
instead of the row percentage and the column percentage.

To format your crosstabulation table in the crosslist format, you specify the CROSSLIST option in the TABLES statement.

proc freq data=orion.sales;


tables Gender* Country/crosslist;
run;

Table of Gender by Country

Gender   Country   Frequency   Percent   Row Percent   Column Percent

F        AU        27          16.36     39.71         42.86
         US        41          24.85     60.29         40.20
         Total     68          41.21     100.00

M        AU        36          21.82     37.11         57.14
         US        61          36.97     62.89         59.80
         Total     97          58.79     100.00

Total    AU        63          38.18                   100.00
         US        102         61.82                   100.00
         Total     165         100.00

Like the list format, the crosslist format might be easier to read than the default format. However, notice that the
crosslist format displays the same statistics as the default crosstabulation table.

Specifying a Format for Frequencies in Crosstabulation Tables


You already know that you can apply user-defined formats to variables in frequency tables by using the FORMAT
statement in your PROC FREQ step. But what if you apply a format to display variables with alternate text, and the text
wraps to the next line in your output? For example, suppose you've applied a format to the variable Country to display
the full country names, Australia and United States, in your crosstabulation table. But you actually want to display a
longer value for US: United States of America. Do you think the cell width will automatically adjust for this longer value?

Actually, SAS applies a default format to all of the frequency values, which controls the column width. So it's possible
that changing the length of a variable value could make that value wrap to the next line. This also depends on whether
you are using the SAS windowing environment or a client application such as SAS Enterprise Guide or SAS Studio.

To change the format that SAS applies, you can add another option to the TABLES statement, the FORMAT= option. This
option allows you to format the frequency value and to change the width of the column. In the FORMAT= option, you
can specify any standard SAS numeric format or a user-defined numeric format. The format length cannot exceed 24.

proc freq data=orion.sales;
   tables Gender*Country / format=24.;
   format Country $ctryfmt.;
run;
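
The step above assumes that the $ctryfmt format already exists. As a minimal sketch, assuming a format that maps the codes AU and US to full country names, you could define it with PROC FORMAT before you run the step. The labels and the width in the FORMAT= option are illustrative choices, not the course's own definition.

```sas
/* Hypothetical definition of $ctryfmt; the labels are assumptions. */
proc format;
   value $ctryfmt 'AU'='Australia'
                  'US'='United States of America';
run;

/* FORMAT=12. is an arbitrary width chosen only for illustration. */
proc freq data=orion.sales;
   tables Gender*Country / format=12.;
   format Country $ctryfmt.;
run;
```

You can try different widths in the FORMAT= option to see how the cell layout changes in your environment.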

The FORMAT= option applies only to crosstabulation tables displayed in the default format. It doesn't apply to
crosstabulation tables produced with the LIST or CROSSLIST option.

proc freq data=orion.sales;
   tables Gender*Country / list;
run;

proc freq data=orion.sales;
   tables Gender*Country / crosslist;
run;

Using PROC FREQ for Data Validation


The Orion Star HR Department has given you data that needs to be validated, or cleaned. The data set orion.nonsales2
might include invalid and missing values. You can use PROC FREQ to screen for invalid, missing, and duplicate data
values. You have some requirements for the data beyond just being numeric or character. For example, Employee_ID
must be unique and not missing. Gender must be F or M. Job_Title must not be missing. Country must have a value of
AU or US. And Salary must be in the numeric range of 24000 – 500000.

Variable      Additional Requirement

Employee_ID   unique and not missing
Gender        F or M
Job_Title     not missing
Country       AU or US
Salary        24000 - 500000

Let's see how you can use a PROC FREQ step with the TABLES statement to detect invalid numeric and character data by
looking at distinct values.

Examining Your Data


In this demonstration, you examine the data set orion.nonsales2 to detect invalid numeric and character data.

1. Copy and paste the following PROC PRINT step to print the first 20 observations.

proc print data=orion.nonsales2 (obs=20);
run;

2. Submit the step and then check the log. The log doesn't (and won't) display problems that violate your data
requirements because the problems don't constitute data errors.

3. Examine the report. As you scan the report, try to identify observations that do not meet your data
requirements.

In observation 2, notice that the value of Country is lowercase au. It should be uppercase. In observation 4,
Salary has a missing value. In observation 10, Job_Title has a missing value. Observation 12 has a value of G for
Gender. Gender must have a value of F or M. Observation 13 contains a Salary value that is less than our
requirement of at least 24000. Observation 14 has a missing value for Employee_ID.

So this data needs a lot of work. This PROC PRINT report has shown you the kinds of issues this data set has.

Validating Data with PROC FREQ


So let's see how you can use PROC FREQ to validate the data. The FREQ procedure lists all discrete values for a variable
and reports missing values. For example, this PROC FREQ step produces tables of unique values for Gender and Country.

proc freq data=orion.nonsales2;
   tables Gender Country / nocum nopercent;
run;

The Gender frequency table shows that one observation contains the value G and another contains a missing value. The
Country frequency table shows that six observations contain invalid data: three for lowercase au and three for
lowercase us.

Now let's consider another variable, Employee_ID, which should have a different value for each observation. What
would you have to do to find duplicate values? To find any duplicates, you can look through the list of Employee_ID
frequencies to find values that are greater than 1. To make this easier, you can use the ORDER=FREQ option in the PROC
FREQ statement to display the results in descending frequency order.

proc freq data=orion.nonsales2 order=freq;
   tables Employee_ID / nocum nopercent;
run;

Partial output

Employee_ID   Frequency

120108                2
120101                1
120104                1
120105                1
120106                1
120107                1
120110                1
120111                1
120112                1
120113                1
121146                1
121147                1
121148                1

Frequency Missing = 1
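
Scanning a long frequency list works, but you can also let SAS filter it for you. The following is a sketch that goes beyond the course demonstration: it assumes that you write the counts to a temporary data set (the name work.id_counts is made up for this example) by using the OUT= option in the TABLES statement, and then print only the rows whose count is greater than 1. In the OUT= data set, PROC FREQ stores the frequency in a variable named COUNT.

```sas
/* Write one row per Employee_ID value, with its frequency, to a data set. */
proc freq data=orion.nonsales2 noprint;
   tables Employee_ID / out=work.id_counts;   /* work.id_counts is a hypothetical name */
run;

/* List only the Employee_ID values that occur more than once. */
proc print data=work.id_counts;
   where Count > 1;
run;
```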

Using PROC FREQ Options to Validate Your Data


In this demonstration, you use PROC FREQ options to validate your data.

1. Copy and paste the following PROC FREQ step into the editor to validate the Employee_ID frequency.

proc freq data=orion.nonsales2 order=freq;
   tables Employee_ID / nocum nopercent;
run;

2. Submit the code and then check the log. The code ran successfully.

3. Examine the results. Notice that the Employee_ID 120108 has a frequency of 2, and you can easily find it
because it's listed first.

Another option for validating Employee_ID is to use the NLEVELS option in the PROC FREQ statement.

4. In the PROC FREQ statement, remove the ORDER=FREQ option and add the NLEVELS option. Also, add the
variables Gender and Country to the TABLES statement.

proc freq data=orion.nonsales2 nlevels;
   tables Gender Country Employee_ID / nocum nopercent;
run;

5. Submit this code and view the results. When you specify NLEVELS, PROC FREQ displays a table of the distinct
values, or levels, for each variable in the TABLES statement. The Number of Variable Levels table appears before
the individual frequency tables. If you know the number of levels that should occur, the Number of Variable
Levels table can indicate whether a variable contains duplicate values.

For example, the Number of Variable Levels table shows that Employee_ID has 234 levels, one missing and the rest
nonmissing. Here's a question: Assuming that orion.nonsales2 contains 235 observations, how many duplicate values
does Employee_ID contain? Given the 234 levels and 235 observations, Employee_ID must contain one duplicate
value. Since the table indicates a missing value for Employee_ID, the duplicate value might be either missing or
nonmissing.

If you only want to see the Number of Variable Levels table and not the individual frequency tables, you can add
the NOPRINT option to the TABLES statement.

6. In the editor, add the NOPRINT option and submit the code.

proc freq data=orion.nonsales2 nlevels;
   tables Gender Country Employee_ID / nocum nopercent noprint;
run;

7. View the results. As you can see, the frequency tables have been suppressed.

Activity
Copy and paste this program into the editor and then submit it.

Reminder: Make sure you've defined the orion library.

proc freq data=orion.nonsales2 nlevels order=freq;
   tables Job_Title / nocum nopercent;
run;

1. How many unique, nonmissing job titles exist?

124

2. Which job title occurs most frequently?

Trainee

3. What is the frequency of missing job titles?

Identifying Observations with Invalid Data


Using PROC FREQ, you uncovered the existence of invalid data values for the categorical variables Employee_ID,
Gender, and Country. You can use PROC PRINT to display the observations containing the invalid values, and provide this
report to the HR Department as part of your task. But what can you use to subset the data based on specific conditions?
You can use a WHERE statement.

For example, here are some WHERE expressions to validate the data.

proc print data=orion.nonsales2;
   where Gender not in ('F','M') or
         Job_Title is null or
         Country not in ('AU','US') or
         Salary not between 24000 and 500000 or
         Employee_ID is missing;
run;

The first expression selects observations in which the value of Gender is not F or M. The second expression selects
observations in which the value of Job_Title is null. The third expression selects observations in which the value of
Country is not uppercase AU or US. The fourth expression selects observations in which the value of Salary is not
between 24000 and 500000. The last expression selects observations in which the value of Employee_ID is missing.

Is there a way to test whether an Employee_ID value is unique, which is one of our data requirements? Yes there is.
From our PROC FREQ reports we know that Employee_ID 120108 has a frequency of 2. So, in our final WHERE
expression, we can select the observations in which the value of Employee_ID is equal to 120108.

proc print data=orion.nonsales2;
   where Gender not in ('F','M') or
         Job_Title is null or
         Country not in ('AU','US') or
         Salary not between 24000 and 500000 or
         Employee_ID is missing or
         Employee_ID=120108;
run;

Consider this. What would happen if you used the AND operator instead of the OR operator between these expressions?

proc print data=orion.nonsales2;
   where Gender not in ('F','M') and
         Job_Title is null and
         Country not in ('AU','US') and
         Salary not between 24000 and 500000 and
         Employee_ID is missing and
         Employee_ID=120108;
run;

SAS would select an observation only if every one of these conditions exists. In this example, no observations would be
selected from orion.nonsales2. You must use the OR operator.
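
To see the difference between OR and AND on a small scale, here is a sketch that uses a made-up three-row data set (work.demo and its values are hypothetical, not taken from orion.nonsales2). The first row is valid, the second has an invalid Gender value, and the third has an invalid Country value and a Salary that is too low.

```sas
/* Hypothetical data, created only to illustrate OR versus AND. */
data work.demo;
   input Employee_ID Gender $ Country $ Salary;
   datalines;
1001 F AU 30000
1002 G US 45000
1003 M us 2500
;
run;

/* OR: selects 1002 and 1003, because each breaks at least one rule. */
proc print data=work.demo;
   where Gender not in ('F','M') or
         Country not in ('AU','US') or
         Salary not between 24000 and 500000;
run;

/* AND: selects no rows, because no single row breaks all three rules. */
proc print data=work.demo;
   where Gender not in ('F','M') and
         Country not in ('AU','US') and
         Salary not between 24000 and 500000;
run;
```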

Using PROC PRINT to Validate Data


In this demonstration, you use PROC PRINT to validate your data.

1. Copy and paste the following PROC PRINT step into the editor to print all of the invalid data.

proc print data=orion.nonsales2;
   where Gender not in ('F','M') or
         Country not in ('AU','US') or
         Job_Title is null or
         Salary not between 24000 and 500000 or
         Employee_ID is missing or
         Employee_ID=120108;
run;

2. Submit the code and then check the log. The log shows that SAS read 15 observations from orion.nonsales2.

3. View the report. As you can see, all of the invalid data, or data that doesn't meet your data requirements, is in
one report. This makes the task of cleaning the data a lot easier. You now know exactly which observations to
correct.

Using the MEANS and UNIVARIATE Procedures

Suppose you need to analyze the salaries of Orion Star sales employees. For example, the payroll manager has asked you
for a report of the average salary of all employees. To create the output required, you can use PROC MEANS.

The MEANS Procedure


PROC MEANS produces summary reports with descriptive statistics.

PROC MEANS DATA=SAS-data-set <statistic(s)>;
   VAR analysis-variable(s);
RUN;

It automatically displays output in a report, and you can also save the output in a SAS data set. By default, PROC MEANS
reports the number of nonmissing values, the mean, the standard deviation, the minimum value, and the maximum
value of every numeric variable in a data set. You can use the VAR statement to identify which variables to use in the
analysis and to specify the order in which they appear in the results. By using additional statements and options in a PROC
MEANS step, you can create a more complex PROC MEANS report.

For your task, you want to analyze the variable Salary. You start with the PROC MEANS statement and list the input data
set, orion.sales. You add a VAR statement to list the analysis variables, which are the numeric variables for which
statistics are to be computed. You list Salary in your VAR statement.

proc means data=orion.sales;
   var Salary;
run;
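
As mentioned above, you can also save PROC MEANS output in a SAS data set. This lesson doesn't show how, so treat the following as a minimal sketch: it assumes that you use the OUTPUT statement, and the data set name work.salary_stats and the variable names AvgSalary, MinSalary, and MaxSalary are made up for illustration.

```sas
/* NOPRINT suppresses the report; the statistics go to a data set instead. */
proc means data=orion.sales noprint;
   var Salary;
   output out=work.salary_stats      /* hypothetical output data set name */
          mean=AvgSalary             /* hypothetical variable names       */
          min=MinSalary
          max=MaxSalary;
run;

proc print data=work.salary_stats;
run;
```

In addition to the statistics that you request, the output data set contains the automatic variables _TYPE_ and _FREQ_.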

Creating a Summary Report with PROC MEANS

In this demonstration, you create a summary report using PROC MEANS.

1. Copy and paste the following PROC MEANS step into the editor. This code creates a report that displays the
default PROC MEANS statistics for the analysis variable Salary.

proc means data=orion.sales;
   var Salary;
run;

2. Submit the code and then check the log. Everything looks good. SAS read 165 observations from orion.sales.

3. View the report. As you can see, the mean salary at Orion Star is 31160.12.

Business Scenario
You found the average salary at Orion Star. But now you want to see these statistics grouped by Gender and by Country.
In other words, you want to be able to see what the mean salary is for male employees in Australia, or the mean salary
of female employees in the US. Using PROC MEANS, you can create statistics for groups of observations.

Grouping Observations by Using the CLASS Statement


To calculate statistics for groups of observations, you use the CLASS statement. You already know that the VAR
statement specifies one or more numeric variables that PROC MEANS analyzes. The CLASS statement specifies variables
that PROC MEANS uses to group the data. These variables are called classification variables, or class variables. Each
combination of class variable values is called a class level.

PROC MEANS DATA=SAS-data-set <statistic(s)>;
   VAR analysis-variable(s);
   CLASS classification-variable(s);
RUN;

Classification variables are character or numeric, and they typically have few discrete values. Your data set does not
need to be sorted or indexed by the class variables. In this example, PROC MEANS reports the statistics for Salary for
each combination of values—or each level—of the class variables Gender and Country. In your PROC MEANS output, the
class variables appear in the order that you list them in the CLASS statement.

proc means data=orion.sales;
   var Salary;
   class Gender Country;
run;

Analysis Variable : Salary

Gender Country N Obs N Mean Std Dev Minimum Maximum

F AU 27 27 27702.41 1728.23 25185.00 30890.00

US 41 41 29460.98 8847.03 25390.00 83505.00

M AU 36 36 32001.39 16592.45 25745.00 108255.00

US 61 61 33336.15 29592.69 22710.00 243190.00

Creating a PROC MEANS Report with Grouped Data


In this demonstration, you create a PROC MEANS report with grouped data.

1. Copy and paste the following PROC MEANS step into the editor.

proc means data=orion.sales;
   var Salary;
   class Gender Country;
run;

2. Submit the step and view the report. Notice that this report has three more columns and more rows than the
basic PROC MEANS report you saw earlier. The table now has a column for each of the class variables, Gender
and Country.

PROC MEANS reports statistics for each class level, so there's a row for each combination of class variable
values. Can you tell what the mean salary for male employees in Australia is? It's 32001.39.

When you use the CLASS statement, PROC MEANS displays the additional statistic N Obs in your report.
Remember that the N statistic counts all nonmissing values of the analysis variable. However, N Obs reports the
number of observations with each unique combination of class variables, whether or not there are missing
values. If a data set has missing values, the values of N and N Obs for some levels are different.

Based on the values of N and N Obs, do you think this data set has missing values of Salary? For each level, the
values of N and N Obs are identical, so this data set does not have missing values.
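
Because orion.sales has no missing Salary values, you can't see N and N Obs differ in that report. As a sketch, the following step uses a small made-up data set (work.pay is hypothetical) in which one female employee has a missing Salary. For Gender F, N Obs is 2 but N is 1; for Gender M, both are 2.

```sas
/* Hypothetical data: the second row has a missing Salary. */
data work.pay;
   input Gender $ Salary;
   datalines;
F 30000
F .
M 28000
M 32000
;
run;

/* N counts nonmissing Salary values; N Obs counts rows in each class level. */
proc means data=work.pay n nmiss mean;
   var Salary;
   class Gender;
run;
```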

Specifying Statistics in the PROC MEANS Statement


The statistics that PROC MEANS displays by default are not always the ones that you need or want. You might prefer to
display additional statistics, display only a subset of statistics, or display the statistics in a different order. To control the
statistics that PROC MEANS displays, you include statistic keywords in the PROC MEANS statement. The requested
statistics override the default statistics, so your PROC MEANS output includes the statistics that you specify, in the order
that you specify them.

proc means data=orion.sales min max sum;
   var Salary;
   class Gender Country;
run;

Requesting Specific Statistics in PROC MEANS


In this demonstration, you request specific statistics in your PROC MEANS report.

1. Copy and paste the following PROC MEANS step into the editor. This step analyzes the variable Salary. You want
to see only the number of observations with nonmissing values and the mean for Salary. You use the keywords
N and MEAN in the PROC MEANS statement.

proc means data=orion.sales n mean;
   var Salary;
run;

2. Submit the step and view the report. As you can see, the table includes only these two statistics. Now suppose
you want to see the minimum, maximum, and sum statistics for grouped data.

3. In the editor, add the CLASS statement with the variables Gender and Country, and then change the keywords
to MIN, MAX, and SUM in the PROC MEANS statement.

proc means data=orion.sales min max sum;
   var Salary;
   class Gender Country;
run;

4. Submit this code and view the report. The report shows the new statistics for each combination of class
variables. For example, you can see that the minimum salary of all male employees in the US is 22710. The
report also now has an N Obs column due to the CLASS statement. MIN, MAX, and SUM are just three of the
many statistic keywords that are available for use in PROC MEANS.

PROC MEANS Statement Options


As with other SAS statements, you can add options to your PROC MEANS statement to control the output. In your PROC
MEANS reports, you might want to control the number of decimal places in the numeric values that are displayed. When
your code does not specify a format for writing numeric values, SAS uses the BESTw. format as the default format. The
BESTw. format writes values with the maximum precision, as determined by the field width. Integers, like the values of N
Obs, are written without decimals.

proc means data=orion.sales min max sum;
   var Salary;
   class Gender Country;
run;

Analysis Variable : Salary

Gender Country N Obs Minimum Maximum Sum

F AU 27 25185.00 30890.00 747965.00

US 41 25390.00 83505.00 1207900.00

M AU 36 25745.00 108255.00 1152050.00

US 61 22710.00 243190.00 2033505.00

To control the number of decimal places in the numeric values that appear in output, you can specify the MAXDEC=
option in the PROC MEANS statement. In the MAXDEC= option, number is an integer that specifies the maximum
number of decimal places.

MAXDEC=number

In this example, MAXDEC=0 specifies that SAS write numeric values with no decimal places.

proc means data=orion.sales min max sum maxdec=0;
   var Salary;
   class Gender Country;
run;

Analysis Variable : Salary

Gender Country N Obs Minimum Maximum Sum

F AU 27 25185 30890 747965

US 41 25390 83505 1207900

M AU 36 25745 108255 1152050

US 61 22710 243190 2033505

To specify one decimal place, you set the value to 1.
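
As a quick sketch, the same step with one decimal place differs only in the value of the MAXDEC= option:

```sas
/* MAXDEC=1 limits the displayed statistics to one decimal place. */
proc means data=orion.sales min max sum maxdec=1;
   var Salary;
   class Gender Country;
run;
```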

The PROC MEANS statement shown here does not specify any statistical keywords, so the report displays the default
statistics.

proc means data=orion.sales;
   var Salary;
   class Gender Country;
run;

Analysis Variable : Salary

Gender Country N Obs N Mean Std Dev Minimum Maximum

F AU 27 27 27702.41 1728.23 25185.00 30890.00

US 41 41 29460.98 8847.03 25390.00 83505.00

M AU 36 36 32001.39 16592.45 25745.00 108255.00

US 61 61 33336.15 29592.69 22710.00 243190.00

Suppose you don't want the N Obs column to appear in your report. In this example, the values of N Obs and N are
identical for each class level, which indicates that Salary has no missing values. To suppress the N Obs column, you can
specify the NONOBS option in the PROC MEANS statement. The N Obs column will no longer appear in PROC MEANS
output.

proc means data=orion.sales nonobs;
   var Salary;
   class Gender Country;
run;

Analysis Variable : Salary

Gender Country N Mean Std Dev Minimum Maximum

F AU 27 27702.41 1728.23 25185.00 30890.00

US 41 29460.98 8847.03 25390.00 83505.00

M AU 36 32001.39 16592.45 25745.00 108255.00

US 61 33336.15 29592.69 22710.00 243190.00

These tables show other PROC MEANS statistics that are available.

Descriptive Statistic Keywords

CLM        SKEWNESS
MEAN       UCLM
KURTOSIS   LCLM
SUM        N
CSS        STDDEV
MIN        USS
RANGE      MAX
SUMWGT     NMISS
CV         STDERR
MODE       VAR

Quantile Statistic Keywords

MEDIAN | P50   Q3 | P75
P1             P90
P5             P95
P10            P99
Q1 | P25       QRANGE

Hypothesis Testing Keywords

PROBT   T

The first category shows the descriptive statistic keywords, followed by the quantile statistic keywords, and then the
hypothesis testing keywords. If you'd like to see descriptions of these statistics, click the Information button in the
course interface.
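
You request any of these keywords the same way that you request MIN, MAX, and SUM. As a sketch, assuming that you want a quartile summary of Salary in orion.sales, you might combine several quantile keywords in one step. The choice of keywords and MAXDEC=0 here is only an illustration.

```sas
/* Median, first and third quartiles, and interquartile range of Salary. */
proc means data=orion.sales median q1 q3 qrange maxdec=0;
   var Salary;
run;
```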

Question
Which PROC MEANS step creates this output?

a.

proc means data=orion.employee_donations
           n mean min max
           nonobs maxdec=2;
   var Qtr4;
   class Paid_By;
run;

b.

proc means data=orion.employee_donations maxdec=2;
   var Qtr4;
   class Paid_By;
run;

c.

proc means data=orion.employee_donations Qtr4;
   class Paid_By;
run;

The correct answer is a. The PROC MEANS statement specifies the four statistics that appear in the table. The NONOBS
option suppresses the N Obs statistic. The MAXDEC= option specifies two decimal places for the numeric values.

Code Challenge
Complete this PROC MEANS statement to produce the mean and sum for numeric variables in the data set
orion.nonsales. Limit the number of decimal places to two.

proc means ____________________;
run;

data=orion.nonsales mean sum maxdec=2;

You specify the data set orion.nonsales, followed by the statistic keywords MEAN and SUM. To change the default
number of decimal places to two, you add MAXDEC=2 as an option.

Using PROC MEANS and PROC UNIVARIATE for Data Validation


The Orion Star HR Department has given you the data set orion.nonsales2 to validate the numeric range of 24000 to
500000 for the variable Salary. You're familiar with the requirements of the values in this data set. You're also familiar
with using PROC FREQ and PROC PRINT to identify the invalid values. As a SAS programmer though, you probably want a
number of validation techniques to choose from. Luckily, SAS provides you with options. You can use PROC MEANS and
PROC UNIVARIATE as alternative procedures for validating the data.

Using the NMISS Option in PROC MEANS


As you know, PROC MEANS displays these summary statistics by default for all numeric variables: Mean, N, Minimum,
Maximum, Std Dev. To validate the range of values for Salary, let's examine how these statistics can be helpful.

Analysis Variable : Salary

N Mean Std Dev Minimum Maximum

234 43954.60 38354.77 2401.00 433800.00

To start, you know that N represents the number of observations with nonmissing values. If you know how many
observations are in the data set, this number could be helpful, but otherwise, it's not. Next, look at the mean salary
value. This doesn't help you either. The same is true for the standard deviation.

But, the values for minimum and maximum can be helpful. Does the minimum value of 2401 fall within the required
salary range of "24000 - 500000"? No, it doesn't. And does the maximum value of 433800 fall within the required salary
range? Yes, it does. Based on this report, you know that at least one observation falls below the required range. But you
don't know if any of the observations have missing values. You can specify the NMISS option to display the number of
observations with missing values.

Validating Data Using PROC MEANS


In this demonstration, you validate data using PROC MEANS.

1. Copy and paste the following PROC MEANS step into the editor. This step specifies N, NMISS, MIN, and MAX as
the descriptive statistics to display.

proc means data=orion.nonsales2 n nmiss min max;
   var Salary;
run;

2. Submit the step and then check the log. The log shows that the code ran successfully. You can see that
orion.nonsales2 contains 235 observations.

3. Examine the results. Now the value of N is more meaningful. It tells you that 234 out of the 235 observations
have nonmissing values. Also, the value of 1 for N MISS reinforces this information. There's one observation with
a missing value for Salary.

Consider this. Given the minimum value, can you tell how many values are too low? No. Based on this report,
you don't know how many values are too low.
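
If you want PROC MEANS itself to count the values that are too low, one possible approach (a sketch, not part of the course demonstration) is to add a WHERE statement so that only the low salaries are summarized. Keep in mind that SAS treats a missing numeric value as smaller than any number, so this sketch excludes missing salaries explicitly.

```sas
/* N now reports how many nonmissing Salary values fall below 24000. */
proc means data=orion.nonsales2 n min max;
   where Salary is not missing and Salary < 24000;
   var Salary;
run;
```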

Using PROC UNIVARIATE to Detect Data Outliers


A procedure that's useful for detecting data outliers, which are data values that fall outside the expected ranges, is PROC
UNIVARIATE. The syntax for PROC UNIVARIATE is very similar to PROC MEANS syntax, and like PROC MEANS, PROC
UNIVARIATE produces summary reports of descriptive statistics. You start with the PROC UNIVARIATE statement and list
the input data set. You add a VAR statement to list the analysis variables, which are the numeric variables for which
statistics are to be computed.

PROC UNIVARIATE DATA=SAS-data-set;
   VAR variable(s);
RUN;

This table compares the various procedures you've learned.

Procedure    Numeric  Character  Method

FREQ         X        X          looking at distinct values
                                 with TABLES statement

PRINT        X        X          subsetting observations based on conditions
                                 with WHERE statements

MEANS        X                   using summary statistics
                                 with VAR statement

UNIVARIATE   X                   looking at extreme values, missing values
                                 with VAR statements

PROC UNIVARIATE displays extreme values, missing values, and other statistics for the variables named in the VAR
statement. If you omit the VAR statement, PROC UNIVARIATE analyzes all numeric variables in the data set.

Validating Data Using PROC UNIVARIATE


In this demonstration, you use PROC UNIVARIATE to validate the range of values for the variable Salary.

1. Copy and paste the following PROC UNIVARIATE step into the editor.

proc univariate data=orion.nonsales2;
   var Salary;
run;

2. Submit the step and then check the log. The procedure ran without error.

3. Examine the results. Notice that PROC UNIVARIATE creates quite a bit of output. For validating data, you're most
interested in the Extreme Observations table. This table shows the five lowest and the five highest values of
Salary, by default. The Obs values indicate the observation number, not the count of observations with that
value. From this table, you can see that there are two observations with Salary values that fall below 24000.
None of the salaries are higher than 500000.

Suppose you'd like to see fewer than the default five values in the Extreme Observations table. To specify the
number of extreme observations that PROC UNIVARIATE lists, you can use the NEXTROBS= option in the PROC
UNIVARIATE statement.

4. In the editor, add the NEXTROBS= option to the PROC UNIVARIATE statement and set the value to 3.

proc univariate data=orion.nonsales2 nextrobs=3;
   var Salary;
run;

5. Submit the code and view the report. As you can see, SAS displays only the three lowest and the three highest
Salary values now. Suppose you'd like to see the employee IDs that correspond to these observation numbers.
In other words, you want to know which Employee_ID has a Salary value of 2401. You can add the ID statement
to your program.

6. In the editor, add the ID statement and specify Employee_ID.

proc univariate data=orion.nonsales2 nextrobs=3;
   var Salary;
   id Employee_ID;
run;

7. Submit the step and view the report. SAS displays the Employee_ID column in the table now, and you can easily
determine which employees have salaries that are below the required salary range.
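
Because the Extreme Observations table is the only part of the output that you need for validation, you might want to display just that table. The following sketch goes beyond the demonstration: it assumes that you add an ODS SELECT statement before the step, and it uses ExtremeObs, the ODS name of the Extreme Observations table in PROC UNIVARIATE.

```sas
/* Display only the Extreme Observations table from the next step. */
ods select ExtremeObs;

proc univariate data=orion.nonsales2 nextrobs=3;
   var Salary;
   id Employee_ID;
run;
```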

Question
PROC UNIVARIATE identified two observations with Salary values less than 24000. What procedure can you use to
display the observations containing the invalid values?

a. PROC MEANS

b. PROC PRINT

c. PROC FREQ

The correct answer is b. You can use PROC PRINT with a WHERE statement to print only those observations
where Salary is less than 24000.

proc print data=orion.nonsales2;
   where Salary<24000;
run;

Using the SAS Output Delivery System


You've generated some meaningful reports for Orion Star managers, such as these summary reports. But not every
manager has access to SAS, so how can every manager view the reports? Using the SAS Output Delivery System, you can
create and distribute the results of your SAS procedures in a variety of external formats. Managers can view and work
with the files in applications other than SAS.

Using the SAS Output Delivery System


In this demonstration, you'll use ODS statements to create results that can be viewed in applications other than SAS.

1. Copy and paste the following PROC MEANS step into the editor.

proc means data=orion.sales min max sum;
   var Salary;
   class Gender Country;
run;

2. To create a PDF file of the results that can be viewed in Adobe Acrobat Reader, you just need to add two ODS
statements to the program. The first statement specifies the ODS destination type, and the file in which to store
the PDF content. The path that you specify in the FILE= option needs to include the full path to a location in your
operating environment, and the filename needs to include the appropriate extension for the specified
destination. In this example, we're specifying a PDF destination, and writing to a file named salaries.pdf in the
output folder on the C: drive.

ods pdf file="c:/output/salaries.pdf";

proc means data=orion.sales min max sum;
   var Salary;
   class Gender Country;
run;

3. The second statement closes the PDF destination, allowing you to open the file outside of the SAS environment
with a PDF reader such as Adobe Acrobat.

ods pdf file="c:/output/salaries.pdf";

proc means data=orion.sales min max sum;
   var Salary;
   class Gender Country;
run;

ods pdf close;

4. You can create HTML and RTF files just as easily, using the appropriate ODS destination and file extension. You
can also create csv, html, and xml files that can be viewed as worksheets in Microsoft Excel. Modify the program
to create a csv file using the CSVALL destination.

ods csvall file="c:/output/salaries.csv";

proc means data=orion.sales min max sum;
   var Salary;
   class Gender Country;
run;

ods csvall close;

5. That's all there is to it! Now you can use Microsoft Excel to open the file. Once opened, you can save it in xls or
xlsx format.

6. You can have multiple destinations open, and execute multiple procedures. All generated output will be sent to
every open destination. You might not be able to view the file, or the most updated file, outside of SAS until you
close the destination.

ods csvall file="c:/output/salaries.csv";
ods rtf file="c:/output/salaries.rtf";

proc means data=orion.sales min max sum;
   var Salary;
   class Gender Country;
run;

proc print data=orion.country;
run;

ods csvall close;
ods rtf close;

7. In some SAS environments, you can select an output destination in the user interface without using ODS
statements, and then you can save the resulting output in a file. The Output Delivery System can be used in any
operating environment, with any version of SAS, to generate different types of files that can be opened in third-
party software applications.

8. Click the Information button to learn more about using the SAS Output Delivery System.
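
Step 4 mentions that HTML files follow the same pattern as PDF files. As a sketch, assuming the same c:/output folder that this demonstration uses (substitute a path that exists in your own environment), an HTML version might look like this:

```sas
/* Open the HTML destination and name the file to create (path is an assumption). */
ods html file="c:/output/salaries.html";

proc means data=orion.sales min max sum;
   var Salary;
   class Gender Country;
run;

/* Close the destination so that the file can be opened outside of SAS. */
ods html close;
```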

Summary of Lesson 11: Creating Summary Reports

This summary contains topic summaries, syntax, and sample programs.

Topic Summaries

Using PROC FREQ to Create Summary Reports


You can use PROC FREQ to produce frequency tables that report the distribution of any or all variable values in a SAS
data set. You use the TABLES statement to specify the frequency tables to produce. You can include a WHERE statement
in the PROC FREQ step to subset the observations. SAS has TABLES statement options that you can use to suppress the
default statistics.

PROC FREQ DATA=SAS-data-set <option(s)>;
   TABLES variable(s) </option(s)>;
   <additional SAS statements>
RUN;

Frequency distributions work best with variables whose values are categorical and best summarized by counts instead of
averages. Variables that have continuous numeric values, such as dollar amounts and dates, or many discrete values, can
result in a lengthy and meaningless frequency table. To create a useful frequency report for these variables, you can
apply a SAS or user-defined format to group the values into categories.

You can list multiple variables in a TABLES statement, separated by spaces. This creates a one-way frequency table for
each variable. You can request a separate analysis for each group by including a BY statement. You can request a two-
way frequency table by separating the variables with an asterisk instead of a space. The resulting crosstabulation table
displays statistics for each distinct combination of values of the selected variables. You can use TABLES statement
options to suppress statistics, change the table format, and format the displayed values.
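
The following sketch illustrates two of these TABLES statement options on a crosstabulation of Gender and Country. It is an illustrative example rather than one of the lesson's sample programs: the first TABLES statement suppresses the overall, row, and column percentages so that each cell shows only the frequency, and the second requests the same crosstabulation in list format.

```sas
proc freq data=orion.sales;
   /* Suppress overall, row, and column percentages; keep only the cell counts. */
   tables Gender*Country / nopercent norow nocol;
   /* Display the crosstabulation as a list instead of a grid. */
   tables Gender*Country / list;
run;
```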

Using PROC FREQ for Data Validation


The FREQ procedure can also be used to validate a data set. A one-way frequency table displays every discrete value
of a variable and reports the number of missing values, so invalid or missing values are easy to spot. To identify
duplicate values, you can use the ORDER=FREQ option, which lists values in descending order of frequency so that any
value that occurs more than once appears at the top, and the NLEVELS option, which reports the number of distinct
values of each variable. After you've identified invalid values, you can use PROC PRINT to display the corresponding
observations.
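
One possible alternative for isolating duplicates, not shown in the lesson, is to write the frequency counts to a data set and print only the values that occur more than once. This is a sketch under that assumption: the data set name work.id_counts is hypothetical, and Count is the variable that the OUT= option of the TABLES statement creates to hold each frequency.

```sas
/* Write the Employee_ID frequencies to a temporary data set instead of a report. */
proc freq data=orion.nonsales2 noprint;
   tables Employee_ID / out=work.id_counts;   /* work.id_counts is a hypothetical name */
run;

/* List only the Employee_ID values that appear more than once. */
proc print data=work.id_counts;
   where Count > 1;
run;
```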

Using the MEANS and UNIVARIATE Procedures


You can use PROC MEANS to produce summary reports with descriptive statistics. By default, it reports the number of
nonmissing values, the mean, the standard deviation, the minimum, and the maximum value of every numeric variable
in a data set. You can use the VAR statement to specify the numeric variables to analyze, and add a CLASS statement to
request statistics for groups of observations. The variables listed in the CLASS statement are called classification
variables, or class variables, and each combination of class variable values is called a class level.

PROC MEANS DATA=SAS-data-set <statistic(s)>;
   VAR analysis-variable(s);
   CLASS classification-variable(s);
RUN;

When you use the CLASS statement, the output includes N Obs, which reports the number of observations with each
unique combination of class variable values. You can request specific statistics by listing them as options in the
PROC MEANS statement. Other options are available to control the output.
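
As an illustration of options that control the output, the sketch below requests the mean and median, uses MAXDEC= to limit the displayed statistics to two decimal places, and uses NONOBS to suppress the N Obs column. The choice of statistics and class variable here is only an example.

```sas
proc means data=orion.sales mean median maxdec=2 nonobs;
   var Salary;
   class Country;
run;
```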

You can also use PROC MEANS to validate a data set. The MIN, MAX, and NMISS statistics can be used to validate
numeric data when you know the range of valid values. PROC UNIVARIATE can be more useful because it displays the
extreme observations, or outliers. By default, it displays the five highest and five lowest values of the analysis variable,
and the number of the observation with each extreme value. You can use the NEXTROBS= option to display a different
number of extreme observations.

PROC UNIVARIATE DATA=SAS-data-set;
   VAR variable(s);
RUN;

Using the Output Delivery System


You can use the SAS Output Delivery System to create different output formats by directing output to various ODS
destinations. For each type of formatted output that you want to create, you use an ODS statement to open that
destination, submit one or more procedures that generate output, and then close the destination. A file is created for
each open destination.

ODS destination FILE="filename" <options>;
<SAS code to generate the report>
ODS destination CLOSE;
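
To show where an option fits in this syntax, the sketch below adds the STYLE= option to an ODS PDF statement so that the report uses the SAS-supplied Journal style. The file path is a placeholder and assumes a location where you have Write access.

```sas
/* The path below is a placeholder; use a location where you have Write access. */
ods pdf file="c:/output/salaries_journal.pdf" style=journal;

proc means data=orion.sales min max sum;
   var Salary;
   class Gender Country;
run;

ods pdf close;
```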

Sample Programs

Creating a One-Way Frequency Report

proc freq data=orion.sales;
   tables Gender;
   where Country='AU';
run;

Using Formats in PROC FREQ

proc format;
   value Tiers low-25000='Tier1'
               25000<-50000='Tier2'
               50000<-100000='Tier3'
               100000<-high='Tier4';
run;

proc freq data=orion.sales;
   tables Salary;
   format Salary Tiers.;
run;

Listing Multiple Variables on a TABLES Statement

proc freq data=orion.sales;
   tables Gender Country;
run;

proc sort data=orion.sales out=sorted;
   by Country;
run;

proc freq data=sorted;
   tables Gender;
   by Country;
run;

Creating a Crosstabulation Table

proc freq data=orion.sales;
   tables Gender*Country;
run;

Examining Your Data

proc print data=orion.nonsales2 (obs=20);
run;

Using PROC FREQ Options to Validate Your Data

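/* ORDER=FREQ lists the most frequent values first, so duplicate IDs appear at the top. */
proc freq data=orion.nonsales2 order=freq;
   tables Employee_ID / nocum nopercent;
run;

/* NLEVELS adds a table that reports the number of distinct values of each variable. */
proc freq data=orion.nonsales2 nlevels;
   tables Gender Country Employee_ID / nocum nopercent;
run;

/* NOPRINT suppresses the frequency tables, leaving only the Number of Variable Levels table. */
proc freq data=orion.nonsales2 nlevels;
   tables Gender Country Employee_ID / nocum nopercent noprint;
run;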

Using PROC PRINT to Validate Your Data

proc print data=orion.nonsales2;
   where Gender not in ('F','M') or
         Country not in ('AU','US') or
         Job_Title is null or
         Salary not between 24000 and 500000 or
         Employee_ID is missing or
         Employee_ID=120108;
run;

Creating a Summary Report with PROC MEANS

proc means data=orion.sales;
   var Salary;
run;

Creating a PROC MEANS Report with Grouped Data

proc means data=orion.sales;
   var Salary;
   class Gender Country;
run;

Requesting Specific Statistics in PROC MEANS

proc means data=orion.sales n mean;
   var Salary;
run;

proc means data=orion.sales min max sum;
   var Salary;
   class Gender Country;
run;

Validating Data Using PROC MEANS

proc means data=orion.nonsales2 n nmiss min max;
   var Salary;
run;

Validating Data Using PROC UNIVARIATE

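/* Default: displays the five lowest and five highest values of Salary. */
proc univariate data=orion.nonsales2;
   var Salary;
run;

/* NEXTROBS=3 displays the three lowest and three highest values instead. */
proc univariate data=orion.nonsales2 nextrobs=3;
   var Salary;
run;

/* The ID statement labels each extreme observation with its Employee_ID value. */
proc univariate data=orion.nonsales2 nextrobs=3;
   var Salary;
   id Employee_ID;
run;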

Using the SAS Output Delivery System

/*Use a filepath to a location where you have Write access.*/
ods pdf file="c:/output/salaries.pdf";

proc means data=orion.sales min max sum;
   var Salary;
   class Gender Country;
run;

ods pdf close;

ods csv file="c:/output/salarysummary.csv";

proc means data=orion.sales min max sum;
   var Salary;
   class Gender Country;
run;

ods csv close;
