C747 Transcripts Part1
To complete the practices in this course, you must follow these steps to set up the practice files for SAS Studio. These
instructions apply to all configurations of SAS Studio, including SAS University Edition.
IMPORTANT: You must perform these two tasks in the same SAS Studio session. Do not close your browser until you
have completed both tasks or you will need to start again.
Task 1: Create a folder for your practice files and define the orion library.
1. Start SAS Studio from your SAS University Edition: Information Center page.
2. At the top of the Files and Folders pane, click New and select Folder.
3. In the Name box, type ecprg193. Click Save.
4. The following code creates a macro variable to store the location of your practice files and defines the orion
library. Copy and paste the following two lines of code into the Code tab in SAS Studio. Don't run the code yet;
you must edit the code before you run it.
%let path=FILEPATH;
libname orion "&path";
5. In the Files and Folders pane, open My Folders. Right-click ecprg193 and select Properties. Highlight the filepath
shown in Location and copy it.
6. In the Code tab in SAS Studio, edit the program as follows:
a. In the first line of the program, highlight FILEPATH and paste the filepath that you copied in the previous
step.
b. Do not change the second line of the program.
7. Click Run to submit this program to SAS.
8. Check the log to make sure that the libref orion was successfully assigned.
a. You should see this message in the log:
Task 2: Create the practice files for the course — you only have to do this once.
1. In the Files and Folders pane, click New and select SAS Program.
2. Click here to open a popup window with the SAS code that creates your practice data. Select all of the code
(Ctrl+A or Command+A) and then copy it (Ctrl+C or Command+C).
3. Paste the code (Ctrl+V or Command+V) into the Code tab in SAS Studio.
4. Click Run to submit the program to SAS. This program creates the practice files in the ecprg193 folder.
5. After the code runs, the Results tab displays two tables. The second table lists the data sets that the program
created. Check the log to verify that there are no errors or warnings. There will be many notes because SAS writes
a note for each data set that the program creates.
NOTE: If your program does not run successfully, make sure you completed all the steps in Task 1 correctly.
6. On the Code tab, click Clear All code.
You have set up your practice data and you are ready to work in SAS Studio.
NOTE: Unless you delete your practice files folder, you do not need to perform this task again (if you are using SAS
University Edition on AWS Marketplace, see the note below).
IMPORTANT: Each time you start SAS Studio to practice in this course, navigate to the ecprg193 folder in the Files and
Folders pane and double-click setup.sas to open it. After you run the program, you can access your practice data in the
orion library. Each practice page reminds you to define the orion library and has a link to this setup page.
Objectives
What Is SAS?
SAS is a suite of business solutions and technologies to help organizations solve business problems. Base SAS is the
centerpiece of all SAS software. It provides a flexible and extensible programming language designed for data access,
transformation, and reporting.
To extend the capabilities of Base SAS, you can add other SAS components. For example, you can use a component to
access third-party data. Other components give you tools for report writing, high-resolution graphics, statistical analysis,
visualization and discovery, and business solutions.
Once you access your data, you can manage it. For example, you might need to subset data, create variables, validate
and clean data, or combine data to ready it for analysis. SAS gives you excellent data management capabilities. You’ll
probably want to analyze your data as well. You can perform some simple analyses, such as finding frequency counts or
calculating averages. Or you can run more complex analyses, such as regression or forecasting. For statistical analysis,
SAS is the gold standard.
Finally, you'll want to present your data meaningfully. You can create list reports, summary reports, or graphic reports.
And you can print these reports, write them to new data files, or publish them on the web. You have lots of options for
presenting your data.
The first step in the programming process is to define the business need. You do this by communicating with the
business team or by reviewing a written specification. After you define the business need, you write a SAS program
based on the desired output, the necessary input, and the required processing. After you finish coding, you run the
program and review your results, which can be reports or notes and messages from SAS regarding your code. As you
review the results, you might find inaccuracies or errors, in which case you might need to debug or modify the program.
Depending on your results, you might need to repeat some of the steps.
Raw data files contain data that has not been processed by any other computer program. They are text files that contain
one record per line, and the record typically contains multiple fields. Raw data files aren't reports; they are unformatted
text.
The second major type of file you’ll use is a SAS data set. This important type of file is specific to SAS. A SAS data set is
your data in a form that SAS can understand. Like raw data files, SAS data sets contain data. But in SAS data sets, the
data is created only by SAS and can be read only by SAS.
Now let’s explore the third major type of file you’ll use in SAS: the SAS program file. SAS program files contain SAS
programming code. These instructions tell SAS how to process your data and what output to create.
Objectives
A DATA step typically reads data from an input source, processes it, and creates a SAS data set, which is data in a form
that SAS understands. So, one of the primary purposes of a DATA step is to create a SAS data set. In addition, you can
use a DATA step to create new variables that were not in your original data. In SAS terminology, variables are the
columns in your data.
For example, suppose your raw data file contains the fields Cost Price Per Unit and Quantity Sold. In a DATA step, you
can multiply these variables and assign the value to a new variable named Total_Retail_Price.
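The calculation described above can be sketched as a short DATA step. The data set name work.unit_sales, the variable names Cost_Price_Per_Unit and Quantity_Sold, and the two in-stream data lines are assumptions made for illustration; they are not part of the course data.

```sas
* Sketch only. The data set name, variable names, and values are hypothetical.;
data work.unit_sales;
   /* Read two numeric fields from the in-stream records below. */
   input Cost_Price_Per_Unit Quantity_Sold;
   /* Create a new variable that was not in the input data. */
   Total_Retail_Price = Cost_Price_Per_Unit * Quantity_Sold;
   datalines;
12.50 4
3.25 10
;
run;

* List the result, including the new column.;
proc print data=work.unit_sales;
run;
```

Each observation in the report should show the product of the two input values in the new Total_Retail_Price column.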
A PROC or procedure step typically processes a SAS data set. Various PROC steps generate reports and graphs, manage
data, and sort data.
One way to use these two steps together is to use a DATA step to create a SAS data set, and then use a PROC step to
create a report. Remember, though, that this is just one possible combination of steps in a SAS program. Your SAS
programs might perform other tasks. Now let's learn more about what makes up a SAS step.
A DATA step begins with a DATA statement, and a PROC step begins with a PROC statement. SAS detects the end of a
step when it encounters one of the following: a RUN statement for most steps, a QUIT statement for some procedures,
or the beginning of another step. Occasionally, a user might omit a RUN or QUIT statement, and the step will end
implicitly when the next step begins. It is a best practice to include a RUN or QUIT statement to explicitly end each step
in a SAS program.
data work.newsalesemps;
set orion.sales;
where Country='AU';
run;
title;
Can you tell how many steps it contains? This program contains three steps: one DATA step and two PROC steps. In the
first line of code, the DATA step creates a temporary SAS data set named work.newsalesemps by reading the
orion.sales data set. In the eighth line, the PROC PRINT step creates a list report of the work.newsalesemps data set. In
line 11, the PROC MEANS step creates a summary report of work.newsalesemps with statistics for the variable Salary
for each value of Job_Title.
In addition to DATA and PROC steps, this SAS program also contains global statements. These statements can lie outside
DATA and PROC steps, and they can affect more than one step. For example, the first TITLE statement, located before the
PROC statements, specifies a title that appears on both reports. The second TITLE statement, located at the end of the
SAS program, turns off all titles for all subsequent output. You'll learn several global statements in this course.
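The program discussed here is only partly preserved in this transcript. A sketch that is consistent with the description (a DATA step, a TITLE statement, a PROC PRINT step, a PROC MEANS step that summarizes Salary for each value of Job_Title, and a closing TITLE statement) might look like the following. The title text and the use of a CLASS statement are assumptions, and the orion libref must already be assigned.

```sas
data work.newsalesemps;              /* line 1: the DATA step begins */
   set orion.sales;
   where Country='AU';
run;

title 'New Sales Employees';         /* global statement; the title text is assumed */

proc print data=work.newsalesemps;   /* line 8: list report */
run;

proc means data=work.newsalesemps;   /* line 11: summary report */
   class Job_Title;                  /* assumed: one group per Job_Title value */
   var Salary;
run;

title;                               /* clears all titles */
```

The comments are placed at the ends of lines so that the line positions mentioned in the text still apply.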
data work.newsalesemps;
set orion.sales;
where Country='AU';
run;
title;
2. Submit the code and check the log. It's a good programming practice to first check the log, even if the program
appears to produce results. You want to ensure that the code ran successfully before you look at any reports
SAS created. Notice that SAS processed the code without warnings or errors.
3. View the results. The first report is the PROC PRINT report. Recall that this type of report simply lists your data.
You can see columns for the various variables and all of their values. Notice that the title you specified appears
at the top of the report. The next report is the PROC MEANS report. The MEANS procedure provides data
summarization tools to compute descriptive statistics on your data, and displays output by default. Here, SAS
calculated statistics for the analysis variable Salary.
Question
Which of the following can represent a step boundary?
a. a RUN statement
b. a QUIT statement
c. a DATA statement
d. a PROC statement
e. all of the above
The correct answer is e. All of these statements can represent a step boundary by indicating either the end of a step or
the beginning of a new step.
Question
What does a DATA step typically create?
b. program file
d. report
The correct answer is c. A DATA step typically creates a SAS data set. However, you can use DATA steps to create raw
data, program files, and reports. The DATA step is very flexible.
Question
What does a PROC step typically create?
b. program file
d. report
Business Scenario
Orion Star management encourages their programmers to write well-formatted, clearly documented SAS programs. So,
you need to know the syntax rules and recommended structure for SAS programming statements, as well as how to use
comments in your SAS programs.
Take a look at this program. Can you tell how many statements make up this DATA step?
data work.newsalesemps;
length First_Name $ 12
Last_Name $ 18 Job_Title $ 25;
infile "&path/newemps.csv" dlm=',';
input First_Name $ Last_Name $
Job_Title $ Salary;
run;
This step contains five statements: a DATA statement, a LENGTH statement, an INFILE statement, an INPUT statement,
and a RUN statement. Each statement has an identifying keyword and ends in a semicolon.
data work.newsalesemps;
   set orion.sales;
   where Country='AU';
run;
The DATA, PROC, and RUN statements begin in column one, and the other statements are indented. Each statement
begins on a new line, and a blank line separates each step. Using conventional formatting (that is, structured, consistent
spacing) makes a SAS program easy to read.
However, SAS statements are free format. In other words, they can begin and end anywhere. In SAS, you can have as
much or as little white space as you want. You can begin or end a statement in any column and span multiple lines. You
can also place multiple statements on one line, and unquoted values can be lowercase, uppercase, or mixed case.
The following program takes advantage of the free-format style that SAS permits, but at a cost of being difficult to read:
Remember the old saying: "Just because you can do something, doesn't mean that you should." Again, in this program,
the SAS syntax rules have been followed, but this unconventional formatting might be especially difficult for other
programmers to read.
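The program that this passage refers to is not reproduced in the transcript. As an illustration only, the DATA step shown earlier could be written in free format like this. The layout is an invented example, and it assumes that the orion libref is assigned.

```sas
   /* Valid SAS, but statements start and wrap at arbitrary points. */
data work.newsalesemps; set
orion.sales; WHERE
      Country='AU'; run;
```

SAS processes this exactly as it does the conventionally formatted version. Only the quoted value 'AU' is case sensitive.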
Using conventional formatting can take the guesswork out of your programs. It's recommended that you use a
conventional programming style. Click the Information button in the course to learn about automatic formatting in SAS
Enterprise Guide and other SAS environments.
Question
Which of the following rules is required by SAS syntax? Select all that apply.
Comments can also help you test your SAS programs in stages. By commenting out your error-free code, you can use
comments to submit only the steps that you're testing. When your entire program is error-free, you can remove the
comment symbols without damaging the SAS program.
Types of Comments
Let's take a closer look at comments. In SAS, you can create comments in two ways. Using the first method, called a
block comment, you begin with a forward slash and asterisk, your comment text, and then end with an asterisk and a
forward slash.
data work.newsalesemps;
set orion.sales;
where Country='AU';
run;
These comments can be any length, and can contain semicolons. They cannot be nested. You should avoid placing block
comment symbols in the first or second columns. In some operating environments, SAS might interpret block comment
symbols in columns 1 and 2 as a request to end the SAS job or session.
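A minimal sketch of block comments, using the DATA step from this section. The comment text is invented for illustration, the comment symbols are indented so that they do not start in columns 1 and 2, and the orion libref is assumed to be assigned.

```sas
   /* Create work.newsalesemps from orion.sales; keep Australian rows only. */
data work.newsalesemps;
   set orion.sales;
   where Country='AU';   /* a block comment can also follow a statement */
run;
```

The first comment contains a semicolon, which is allowed inside a block comment.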
The second method is called a comment statement. It begins with an asterisk, followed by the comment text, and ends
with a semicolon.
data work.newsalesemps;
set orion.sales;
*where Country='AU';
run;
Comment statements can begin in columns 1 and 2. To comment out a statement in one of these steps, you simply add
an asterisk to the beginning of the statement, as shown above in the WHERE statement. Comments in this form are
complete statements, and they can't contain internal semicolons.
data work.newsalesemps;
set orion.sales;
where Country='US';
run;
2. At the beginning of the program, add a comment statement stating that you're using orion.sales to create
work.newsalesemps.
4. Next, comment out the PROC PRINT step so that it doesn't run when you submit the code.
/*
proc print data=work.newsalesemps;
run;*/
5. Submit this code and examine the log. Notice that SAS didn't process the portions of code that were commented
out. You can see that the data set was created, and that the PROC MEANS step created output, but the PROC
PRINT step that was commented out produced no other messages or output.
Question
How many comments are in this program?
data work.newsalesemps;
length First_Name $ 12 Last_Name $ 18
Job_Title $ 25;
infile "&path/newemps.csv" dlm=',';
input First_Name $ Last_Name $
Job_Title $ Salary /*numeric*/;
run;
/*
proc print data=work.newsalesemps;
run;
*/
proc means data=work.newsalesemps;
*var Salary;
run;
a. 2
b. 4
c. 5
d. 6
*var Salary;
Activity
Copy and paste the following program into the editor of your SAS environment, and then submit the program.
data work.newsalesemps;
length First_Name $ 12 Last_Name $ 18
Job_Title $ 25;
infile "&path/newemps.csv" dlm=',';
input First_Name $ Last_Name $
Job_Title $ Salary /*numeric*/;
run;
/*
proc means data=work.newsalesemps;
var Salary;
run;*/
Business Scenario
As an Orion Star programmer, you work with a lot of code…some that's yours and some that's not. You need to be able
to diagnose and correct syntax errors in any of these SAS programs.
This misspelling affects other statements following it. Although the following statements in the DATA step are
syntactically correct, they are only permitted in a DATA step. The editor doesn't recognize this as a DATA step though,
due to the misspelled keyword, so SAS also displays the other statements in the DATA step in red.
SAS finds syntax errors during the compilation phase, before it executes the program. So, when you submit a SAS
program, SAS scans each statement for syntax errors. If no errors are found, SAS executes the step when it reaches the
step boundary. Then SAS goes to the next step and repeats the process.
When SAS encounters a syntax error, it writes the following to the SAS log: the word ERROR or WARNING, the location
of the error, and an explanation of the error. SAS continues the syntax scan until it reaches the step boundary, but the
step doesn't execute if errors are found. Then SAS continues scanning the rest of the program, and reports any
additional errors as needed. When you check the log, as all good SAS programmers do, and find a warning or error
message, you need to correct your code.
1. Copy and paste the following program into the editor. As you know, the DATA step keyword is misspelled. Also,
the semicolon is missing from the PROC PRINT statement, and the PROC MEANS step includes an option that is
not valid. As you can see, SAS color-codes the program to indicate the errors.
daat work.newsalesemps;
length First_Name $ 12
Last_Name $ 18 Job_Title $ 25;
infile "&path/newemps.csv" dlm=',';
input First_Name $ Last_Name $
Job_Title $ Salary;
run;
2. Submit the program and check the log. You should always check the log to make sure that the program ran
successfully, even if output is generated.
Notice that there is a WARNING message and the word DAAT is underlined. In this case, SAS resolved the issue
by assuming that DAAT was simply DATA misspelled. A warning means that SAS was able to perform the action.
In this case, SAS processed the DATA step. But this is a rare situation, as SAS might not always be able to
interpret your misspelled words.
Next, notice that the RUN statement is underlined. In this case, the previous line is missing the semicolon. The
message 'Syntax error, expecting one of the following...' indicates that something was missing. Consider how
SAS processed this step. SAS started with the PROC PRINT statement and kept going until it reached the
semicolon at the end of the RUN statement. So, SAS thought that the PROC PRINT and the RUN statements were
all one statement. SAS interpreted RUN as an option for PROC PRINT and printed an error message about an
invalid option. Notice that SAS did list the semicolon as one of the expected options.
You might be thinking, “Why did SAS report an error in the RUN statement? There's nothing wrong with the RUN
statement.” When you encounter this type of error, always check the statement before the underlined
statement. In many cases you will find that the statement before the error is missing a semicolon.
Now look at the next error message. SAS did not recognize the word AVERAGE as a valid option in the PROC
MEANS statement, so the PROC MEANS step didn't execute. Notice that SAS lists the valid options. The word
MEAN is listed as a valid option and should be used to calculate an average.
3. In the editor, correct the program. First, correct the spelling of DATA, and then add a semicolon to the end of
the PROC PRINT statement. Lastly, change the word AVERAGE to MEAN in the PROC MEANS statement.
data work.newsalesemps;
length First_Name $ 12
Last_Name $ 18 Job_Title $ 25;
infile "&path/newemps.csv" dlm=',';
input First_Name $ Last_Name $
Job_Title $ Salary;
run;
4. Submit the revised code and check the log. The log shows that the code ran successfully. No errors or warnings
appear. Also, SAS produced the reports you requested. As demonstrated, you can easily view and correct syntax
errors in SAS.
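The transcript shows only the DATA step of this program. A sketch of the two PROC steps after correction follows, with the errors that the demonstration describes noted in comments. The remaining details of the PROC MEANS step, such as the VAR statement, are assumptions, and the sketch expects work.newsalesemps to exist already.

```sas
proc print data=work.newsalesemps;        /* this semicolon was missing */
run;

proc means data=work.newsalesemps mean;   /* MEAN replaces the invalid word AVERAGE */
   var Salary;                            /* assumed analysis variable */
run;
```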
Business Scenario
Another common mistake that programmers make is leaving off a matching quotation mark. For example, suppose you
write a program that creates a data set and generates two reports. You submit the program, but it doesn't produce
results. The program might have unbalanced quotation marks.
You should notice that there's a problem because much of the program will be colored purple in the editor. Purple
represents a quoted string. In this example, the string begins with a single quotation mark followed by a comma, a
semicolon, and then all the remaining statements in the program. Because the string does not contain a matching or
ending quotation mark, SAS reads all of this text as a quoted string.
When you submit a program with unbalanced quotation marks in the SAS windowing environment, the program doesn't
stop running, and the log includes only the code you submitted. You won't see any error or warning messages, nor will
you see any indication that any of the steps executed. You'll also see a message in the banner of the editor stating that
the step is still running. You have to stop an executing program by cancelling the submitted statements. You can then
correct your program by adding the missing quotation mark.
When you submit a program with unbalanced quotation marks in client applications such as SAS Enterprise Guide and
SAS Studio, SAS writes messages to the log to alert you to the error. A warning in the SAS log stating that a quoted string
has become too long, or that a statement containing quotation marks is ambiguous, sometimes indicates unbalanced
quotation marks. In fact, any log message about a quoted string should alert you to the possibility of unbalanced
quotation marks. In client applications, SAS submits additional code, or wrapper code, including a single and double
quotation mark. SAS is attempting to repair any potential unbalanced quotes in a submitted program. The wrapper code
balances quotation marks and the code stops running, but your results will still contain errors and you must correct the
program. To do this, you either add the missing quotation mark, or match the quotation mark, and then resubmit the
program.
For more information on correcting unbalanced quotation marks in the SAS windowing environment and client
applications, click the Information button.
Question
Which of the following represents a syntax error? Select all that apply.
The correct answer is b, c, f, and h. Common syntax errors include invalid options, missing semicolons, unmatched
quotation marks, and misspelled keywords.
Question
Which of the following samples of code is valid?
a.
title "New Sales Employees';
proc print data=work.NewSalesEmps;
run;
b.
title 'New Sales Employees';
proc print data=work.NewSalesEmps;
run;
The correct answer is b. Although SAS allows either single or double quotation marks, you can’t mix the types. If you use
one type to begin and a different type to end, SAS considers the quotation marks unbalanced.
Topic Summaries
A SAS program can also contain global statements, which are outside DATA and PROC steps, and typically affect the SAS
session. A TITLE statement is a global statement. After it is defined, a title is displayed on every report, unless the title is
cleared or canceled.
SAS statements usually begin with an identifying keyword, and always end with a semicolon. SAS statements are free
format and can begin and end in any column. A single statement can span multiple lines, and there can be more than
one statement per line. Unquoted values can be lowercase, uppercase, or mixed case. This flexibility can result in
programs that are difficult to read.
Conventional formatting, also called structured formatting, uses consistent spacing to make a SAS program easy to read.
To follow best practices, begin each statement on a new line, indent statements within each step, and indent
subsequent lines in a multi-line statement.
Comments are used to document a program and to mark SAS code as non-executing text. There are two types of
comments: block comments and comment statements.
/* comment */
* comment statement;
Diagnosing and Correcting Syntax Errors
Syntax errors occur when program statements do not conform to the rules of the SAS language. Common syntax errors
include misspelled keywords, missing semicolons, and invalid options. SAS finds syntax errors during the compilation
phase, before it executes the program. When SAS encounters a syntax error, it writes the following to the log: the word
ERROR or WARNING, the location of the error, and an explanation of the error. You should always check the log, even if
the program produces output.
Mismatched or unbalanced quotation marks are considered a syntax error. In some programming environments, this
results in a simple error message. In other environments, it is more difficult to identify this type of error.
Sample Programs
data work.newsalesemps;
set orion.sales;
where Country='AU';
run;
title;
/*
proc print data=work.newsalesemps;
run;*/
proc means data=work.newsalesemps;
class Gender;
var Salary/*numeric variable*/;
run;
daat work.newsalesemps;
length First_Name $ 12
Last_Name $ 18 Job_Title $ 25;
infile "&path/newemps.csv" dlm=',';
input First_Name $ Last_Name $
Job_Title $ Salary;
run;
proc print data=work.newsalesemps
run;
Objectives
At the beginning of each SAS session, SAS automatically provides one temporary and at least one permanent SAS library
that you can access. These libraries, or drawers, are open and ready for use. SAS provides a temporary library called
work where you can store and access SAS data sets for the duration of the SAS session. At the end of every SAS session,
SAS deletes the work library and its contents.
SAS also provides permanent libraries. SAS data sets in permanent libraries are saved after your SAS session terminates.
For example, sashelp is a permanent library that SAS makes available. It contains sample SAS data sets that you can
access anytime you start a SAS session. Traditionally, SAS defines a permanent library named sasuser that you can use
for storing and accessing SAS data sets during your SAS session. At your site, there might be different permanent
libraries that are defined for you to use. Data sets that you store in a permanent library remain available in later sessions as well.
Question
Which statement about SAS libraries is true?
The correct answer is b. You refer to a SAS library by a logical name called a libref. A single SAS library cannot contain
files that are stored in different physical locations. And SAS deletes the contents of temporary SAS libraries but not
permanent SAS libraries.
When the data set is in a permanent library, you must use a two-level name. Let’s take a look at the following program
to further understand how SAS data sets are named and how you refer to them in code.
data work.newsalesemps;
set orion.sales;
run;
title;
In this program, the DATA step creates a temporary data set named newsalesemps by using the two-level name
work.newsalesemps. Both of the PROC steps reference this data set. When the current SAS session ends, SAS deletes
the newsalesemps data set, along with any other data sets that are stored in the work library.
Question
In this SAS program, is salesbonus a temporary SAS data set?
a. yes
b. no
The correct answer is a. The data set salesbonus is a temporary data set because it is referenced using a one-level name.
SAS assumes that the data set is stored in the temporary work library.
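The program for this question is not shown in the transcript. The naming rule can be sketched as follows. The variable names and data values are hypothetical.

```sas
* A one-level name. By default SAS stores the data set in the work library.;
data salesbonus;
   input Employee_ID Bonus;
   datalines;
120102 500
120103 250
;
run;

* The two-level name work.salesbonus refers to the same data set.;
proc print data=work.salesbonus;
run;
```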
Business Scenario
Suppose you know the physical location of the SAS data sets that contain Orion Star data. You want to define a SAS libref
named orion to access and view those data sets. After you define the libref, you can explore the data sets and reference
the data sets in your SAS programs.
Let’s look at how you define a library. First, you identify the location of the library to SAS. For example, suppose you
have Orion Star data stored in a Microsoft Windows folder, and you want to use the folder as your SAS library. Your
operating system knows about the folder, but SAS doesn’t. To use this folder as your SAS library, you must tell SAS
where it is. In other words, you need to make a connection between the folder containing your data and SAS.
You begin with the keyword LIBNAME. Next, you specify the name of the libref. A valid libref must follow several rules. It
must have a length of one to eight characters, and must begin with a letter or underscore. The remaining characters
must be letters, numbers, or underscores. You then specify the physical location of the SAS library—your data. You must
reference an existing folder; the LIBNAME statement does not create a new folder. You enclose the physical location in
single or double quotation marks.
In place of filepath, you specify the actual physical location of the library. Here's an example in the Windows operating
environment.
However, depending upon your working environment, the path might look more like this.
The LIBNAME statement is a global statement. It's not part of a DATA or PROC step, and it doesn't need a RUN
statement. You can submit the LIBNAME statement alone, or you can store it with any SAS program so that the SAS
library is defined each time the program runs. If your program needs to reference data sets in multiple locations, you
can use multiple LIBNAME statements, as many as you want.
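The syntax line and the example paths are missing from this transcript. Under the rules described above, LIBNAME statements might look like the following sketch. Both folder paths are hypothetical placeholders, so substitute an existing folder, and submit only the form that matches your operating environment.

```sas
* General form: the keyword LIBNAME, the libref, and the quoted location.;

* Windows-style location. The folder is a hypothetical placeholder.;
libname orion "C:\workshop\orion";

* UNIX-style location, for example on a server. Also a placeholder.;
libname orion "/home/student/orion";
```

If both statements are submitted, the second one reassigns the orion libref to the second location.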
Question
Which of the following librefs is valid?
a. _orionstar
b. orion/01
c. or_01
d. 1_or_a
The correct answer is c. This libref follows all three rules for valid librefs. It has a length of one to eight characters, it
begins with a letter or underscore, and its remaining characters are letters, numbers, or underscores.
Question
Which of the following correctly assigns the libref myfiles to a SAS library in the c:/mysasfiles folder?
The correct answer is b. This LIBNAME statement begins with the keyword LIBNAME, followed by the name of the libref,
which is myfiles. It then specifies the physical location of the library, in quotation marks, which is c:/mysasfiles.
1. Copy and paste the following LIBNAME statement into the editor:
%let path=/dept/dvt/WebDMS/Education/Prg1PracticeFiles;
libname orion "&path";
You might recognize the code from the data setup program you used for this course. The first line of code
creates a macro variable that stores the location of the practice files. Your location will be different from the
one shown here.
The next line is the LIBNAME statement, which is the focus of this demo. You specify the orion libref and then
the physical location of the library. Again, because of the macro variable, the location is simply &path.
Remember that the location must be enclosed in quotation marks. Note that any time you reference a macro
variable within quotation marks, such as in a LIBNAME statement, you must use double quotation marks. So
we'll use double quotation marks here.
2. Submit the code and check the log. Verify that the orion libref was assigned successfully.
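The quotation mark rule mentioned in step 1 can be checked with a small sketch that writes to the log. The folder shown is a hypothetical placeholder, and the %PUT statements are added here only for illustration.

```sas
* The folder below is a hypothetical placeholder.;
%let path=/home/student/ecprg193;

* Double quotation marks. The macro variable reference is resolved.;
%put Double quotes: "&path";

* Single quotation marks. The text is left exactly as typed.;
%put Single quotes: '&path';
```

In the log, the first line should show the folder in place of the macro variable reference, and the second should still show the reference itself. This is why the LIBNAME statement in this demonstration uses double quotation marks.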
Business Scenario
The orion library is now active, so you have access to all of the files that it contains. You can use DATA steps and PROC
steps to work with the files. Suppose you don't know what files are available in the library. To view the contents of the
library, you can write a SAS program that creates a report with general information about the library. The report will also
list the members of the library. Let's find out how to generate this report.
The syntax of a basic PROC CONTENTS step has two statements: a PROC CONTENTS statement and a RUN statement.
The PROC CONTENTS statement begins with the keywords PROC CONTENTS, followed by the DATA= option.
When you use PROC CONTENTS to list the contents of a SAS library, you indicate the library after the DATA= option by
specifying the libref, a period, and the keyword _ALL_. When you use _ALL_ in the DATA= option, PROC CONTENTS
displays a list of all the SAS files that are contained in the SAS library. Finally, remember that the RUN statement tells SAS
to execute the preceding SAS statements.
Browsing a Library
In this demonstration, you submit a PROC CONTENTS step to browse a SAS library programmatically.
1. Copy and paste the following PROC CONTENTS step into the editor.
2. Submit the code and view the results. You might see additional details in your PROC CONTENTS results,
depending upon your SAS environment. The first table, Directory, lists general information about the library. The
second table lists all members of the library in alphabetical order, and provides basic information about each
member. As the Member Type column indicates, the orion library contains two types of SAS files: data sets and
indexes. Only the data sets are numbered in the first column. The names of the data sets indicate the type of
Orion Star information that you'll work with.
3. Scroll down in the report. By default, a PROC CONTENTS report also includes information about each individual
data set in the library, called the descriptor portion. If a library has many data sets, a report that includes all the
descriptors can be very long. To suppress the descriptor portions in the report, you specify the NODS option.
4. In the editor, add the NODS option after the _ALL_ keyword. You must use a space to separate _ALL_ from the
NODS option.
5. Submit the code and view the results. The results now contain only the two tables with the information you're
interested in working with.
Question
Which PROC CONTENTS step prints only general information about a SAS library and a listing of the members of the
library?
a.
proc contents data=orion.country nods;
run;
b.
proc contents data=orion._all_ nods;
run;
c.
proc contents data=orion._all_;
run;
d.
proc contents data=orion.nods _all_;
run;
The correct answer is b. This PROC CONTENTS step generates the specified output. After the keyword _ALL_, you add a
space and then the NODS keyword to suppress the descriptor data for each individual file in the library.
After DATA=, you specify the libref name orion as the first part of the two-level data set name. Then you specify the
name of the data set that you want to display. For example, suppose you know that one of the data sets is named
country. You type data=orion.country in your PROC PRINT step to view the data set.
2. Submit the code and check the log. Verify that SAS read 7 observations from the data set and that there are no
warnings or errors.
3. View the results. You don't need to be concerned with the details of the data set. Just remember that you must
have already assigned the orion libref in order to view and/or work with data in the orion library.
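As a sketch of the step described above, the PROC PRINT step might look like the following. It assumes that the orion libref is assigned and that the library contains the country data set.

```sas
/* Display the data portion of orion.country.              */
/* The two-level name is libref.data-set-name.             */
proc print data=orion.country;
run;
```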
For example, in the following program, the LIBNAME statement associates the libref perm with the data in the folder
myfiles.
At the end of the program, another LIBNAME statement disassociates the perm libref. Suppose you need to specify a
different physical location for the files. To change the location, you can submit a LIBNAME statement with the same
libref name but with a different filepath.
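The sequence described above could be sketched as follows. Both filepaths are hypothetical placeholders, so substitute locations that exist in your own environment.

```sas
/* Associate the libref perm with a folder (hypothetical path). */
libname perm "/folders/myfolders/myfiles";

/* Work with the library, for example by listing its members. */
proc contents data=perm._all_ nods;
run;

/* Point the same libref at a different location (hypothetical path). */
libname perm "/folders/myfolders/otherfiles";

/* Disassociate the libref when you no longer need it. */
libname perm clear;
```

After the CLEAR option runs, the files are still in their physical location. Only the association between the libref and the location is removed.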
When you end your SAS session, the contents of a permanent library still exist in their physical location in your operating
environment, but SAS deletes everything in the work library. Each time you start a new SAS session, you must resubmit
the LIBNAME statement for the SAS libraries that you need to use.
Let’s look at the following example. In the data set work.newsalesemps, the third column is the variable Job_Title, and
observation 3 contains the data for Kevin Lyon. A variable is a container that stores values. The value of the variable
Job_Title in observation 3 is Sales Rep. I.
work.newsalesemps
First_Name Last_Name Job_Title Salary
A SAS data set contains a descriptor portion and a data portion. The descriptor portion contains information about the
attributes of the data set, or metadata. The metadata includes general properties such as the data set name, the
number of observations, and the date and time that the data set was created, as well as variable properties such as
name, type, and length. You can browse the descriptor portion of your SAS data sets using PROC CONTENTS.
The data portion of a SAS data set contains the data values, stored in variables. Remember that the variable names are
part of the descriptor portion, not the data portion. Data values are either character or numeric. For example,
First_Name, Last_Name and Job_Title have character values, and Salary has numeric values. You can use PROC PRINT to
display the data portion of your SAS data sets. You’ll learn more about the data portion of SAS data sets in a bit.
1. Copy and paste the following code into the editor. The sales data set is in the permanent orion library, so you
must use a two-level data set name, orion.sales.
2. Submit the code and view the results. The PROC CONTENTS output displays the descriptor portion of the data
set in three tables. The first table shows general information about the data set, such as the data set name, and
the date and time the data set was created.
Take a look at the other information in this table. Can you determine how many observations are in this data
set? There are 165 observations in orion.sales.
The second table displays operating environment information, the physical location of the file, and other data
set information. The third table is an alphabetic list of variables in the data set and their attributes.
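A minimal sketch of the step used in this demonstration, assuming the orion libref is assigned:

```sas
/* Display the descriptor portion of a single data set. */
proc contents data=orion.sales;
run;
```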
1. Copy and paste the following code into the editor. Remember that the data portion contains the data values.
2. Submit the code and check the log. A note in the log confirms that SAS read 165 observations from the data set.
3. View the report. By default, PROC PRINT displays all variables and observations, using the variable names as
column headings. The Obs column is displayed to identify each observation, similar to a row number.
In the PROC CONTENTS output, SAS displayed a table with variable attributes, including the variable type:
character or numeric. You can determine a variable’s type by looking at your PROC PRINT output; SAS
automatically displays character variables left-aligned, and numeric variables right-aligned.
Question
Type the correct letter to match the types of information with the portion of a SAS data set in which each is documented
or stored.
The correct answers from top to bottom are b, a, a, a. The descriptor portion of a SAS data set contains the name of the
data set, the variable's type, and the creation date of the data set. The data portion contains the data values. Although
you can determine a variable's type by the alignment of the variable values in the data portion, SAS documents the type
in the descriptor portion.
Question
How many observations and variables does the data set shown here contain?
The correct answer is b. The SAS data set contains four observations and three variables. Recall that in SAS, observations
are the rows in a data set, and variables are the columns in a data set.
For character variables such as Job_Title, a blank represents a missing value. For numeric variables such as Salary, SAS
uses a period, by default, to represent a missing value.
work.newsalesemps
First_Name Last_Name Job_Title Salary
Satyakam Denny Sales Rep. II 26780
Monica Kletschkus Sales Rep. IV .
Kevin Lyon 26955
Petrea Soltau Sales Rep. II 27440
You can alter this default with the MISSING= SAS system option, which specifies a character to print for missing numeric
variable values.
MISSING='character'
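As an illustration, the following sketch prints missing numeric values as an asterisk and then restores the default. The asterisk is an arbitrary choice, and the step assumes that work.newsalesemps exists in your session.

```sas
/* Print missing numeric values as * instead of a period. */
options missing='*';

proc print data=work.newsalesemps;
run;

/* Restore the default period. */
options missing='.';
```

The option changes only how missing numeric values are printed. Missing character values are still displayed as blanks.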
Question
According to the data set shown, what type of variable is ActLevel?
ID DoB ActLevel
134 05MAR59
. 22MAY41 3
224 . 4
298 12DEC43 2
a. numeric
b. character
c. can't tell from the data shown
The correct answer is b. The variable ActLevel has a missing value in row one that is represented with a blank. Missing
character values in SAS are represented with blanks. In addition, the values for ActLevel are left justified, which also
indicates that ActLevel is a character variable. If the MISSING= SAS system option were set to another character, the
missing numeric values for ID and DoB would display that character rather than the periods shown here.
8 Birth_Date Num 8
7 Country Char 2
2 First_Name Char 12
4 Gender Char 1
9 Hire_Date Num 8
6 Job_Title Char 25
3 Last_Name Char 18
5 Salary Num 8
As you know, a variable's type is either character or numeric. Character variables can store any values, such as letters,
numbers, special characters, and blanks.
Character Values
Monica
120101
3Top Sports
Now let's look at some examples of valid numeric values. Numeric variables can store only numeric values, which can
include the digits 0 through 9, a minus sign, a single decimal point, and E for scientific notation.
Numeric Values
26780
-30
-29.92
3.1E6
A variable's length indicates the number of bytes used to store it. The length is related to the variable's type. Character
values are stored with a length of 1 to 32,767 bytes. One byte equals one character. In orion.sales, the First_Name
variable has a length of 12 characters and uses 12 bytes of storage.
Numeric variables have 8 bytes of storage by default, no matter how many digits they contain. When stored in floating
point or binary representation, 8 bytes of storage provide space for 16 or 17 significant digits. SAS variables can have
additional attributes, such as formats, as well as some that are not shown in this PROC CONTENTS output: informat and
label. You will learn more about the additional attributes in a later lesson.
Question
Which of the following values can be stored in the variable Product_Line, based on the attributes shown in this partial
PROC CONTENTS output? Select all that apply.
The correct answers are b, c, d, and e. All of the values are valid character values, but the first value is longer than the
specified length of 20.
Question
In this PROC CONTENTS output, what is the default length of the variable Street_ID?
a. 8 bytes
b. 9 bytes
c. 16 or 17 bytes
d. 32,767 bytes
The correct answer is a. The default length of numeric variables is 8 bytes. A numeric variable of the default length can
hold 16 or 17 significant digits. 32,767 bytes is the largest possible length for a character variable.
Job_Title
_quantity2012_
Customer_FirstName
SAS creates each variable name in the same case that you first specify it, and that is the way it appears in reports. After a
variable has been created, you can refer to it in any case in your code without affecting the way that it is stored. You
apply the same naming conventions to SAS data set names.
Partial output
Obs Salary
1 108255
2 87975
3 26600
4 27475
5 26190
6 26480
Click the Information button in the course interface to learn about using special characters in variable names.
Question
Which of the following variable names are valid? Select all that apply.
a. data5mon
b. 5monthsdata
c. data#5
d. five months data
e. five_months_data
f. FiveMonthsData
The correct answers are a, e, and f. Valid variable names begin with a letter or underscore, and continue with letters,
numbers, or underscores.
Question
Which of the following is not a valid name for a SAS data set?
a. Customer_Purchases_Quarter2_2007
b. _Sales_MainOffice
c. 2007_Sales
d. TotalSales2007
The correct answer is c. A valid SAS name cannot begin with a number. Valid names must begin with either a letter or an
underscore. Subsequent characters can be letters, underscores, or numerals.
Topic Summaries
To go to the movie where you learned a task or concept, select a link.
Accessing SAS Libraries
SAS data sets are stored in SAS libraries. A SAS library is a collection of one or more SAS files that are recognized by SAS.
SAS automatically provides one temporary and at least one permanent SAS library in every SAS session.
Work is a temporary library that is used to store and access SAS data sets for the duration of the session. Sasuser and
sashelp are permanent libraries that are available in every SAS session.
You refer to a SAS library by a library reference name, or libref. A libref is a shortcut to the physical location of the SAS
files.
All SAS data sets have a two-level name that consists of the libref and the data set name, separated by a period. Data
sets in the work library can be referenced with a one-level name, consisting of only the data set name, because work is
the default library. Data sets in permanent libraries must be referenced with a two-level name.
You can create and access your own SAS libraries. User-defined libraries are permanent but are not automatically
available in a SAS session. You must assign a libref to a user-created library to make it available. You use a LIBNAME
statement to associate the libref with the physical location of the library, that is, the physical location of your data. You
can submit the LIBNAME statement alone at the start of a SAS session, or you can store it in a SAS program so that the
SAS library is defined each time the program runs. If your program needs to reference data sets in multiple locations,
you can use multiple LIBNAME statements.
Use PROC CONTENTS with libref._ALL_ to display the contents of a SAS library. The report will list all the SAS files
contained in the library, as well as the descriptor portion of each data set in the library. Use the NODS option in the
PROC CONTENTS statement to suppress the descriptor information for each data set.
After associating a libref with a permanent library, you can write a PROC PRINT step to display a SAS data set within the
library.
In an interactive SAS session, a libref remains in effect until you cancel it, change it, or end your SAS session. To cancel a
libref, you submit a LIBNAME statement with the CLEAR option. This clears or disassociates a libref that was previously
assigned. To specify a different physical location, you submit a LIBNAME statement with the same libref name but with a
different filepath.
When a SAS session ends, everything in the work library is deleted. The librefs are also deleted. Remember that the
contents of permanent libraries still exist in the operating environment, but each time you start a new SAS session,
you must resubmit the LIBNAME statement to redefine a libref for each user-created library that you want to access.
Examining SAS Data Sets
SAS data sets are specially structured data files that SAS creates and that only SAS can read. A SAS data set is displayed
as a table composed of variables and observations. A SAS data set contains a descriptor portion and a data portion.
The descriptor portion contains general information about the data set (such as the data set name and the number of
observations) and information about the variable attributes (such as name, type, and length). There are two types of
variables: character and numeric. A character variable can store any value and can be up to 32,767 characters long.
Numeric variables store numeric values in floating point or binary representation in 8 bytes of storage by default. Other
attributes include formats, informats, and labels. You can use PROC CONTENTS to browse the descriptor portion of a
data set.
The data portion contains the data values. Data values are either character or numeric. A valid value must exist for every
variable in every observation in a SAS data set. A missing value is a valid value in SAS. A missing character value is
displayed as a blank, and a missing numeric value is displayed as a period. You can specify an alternate character to print
for missing numeric values using the MISSING= SAS system option. You can use PROC PRINT to display the data portion
of a SAS data set.
SAS variable and data set names must be 1 to 32 characters in length and start with a letter or underscore, followed by
letters, underscores, and numbers. Variable names are not case sensitive.
Sample Programs
Browsing a Library
Objectives
Selecting Variables
By default, a PROC PRINT step displays all observations and variables in a data set, and the variables appear in the order
in which they occur in the data set. You can use the VAR statement to modify the default behavior and display only the
variables you want.
VAR variable(s);
In a VAR statement, you list the variables to include in the report. In this scenario, you want to print the last name and
first name of each sales employee, as well as their salaries, so you list the variables in that order, as shown below.
SUM variable(s);
This is the code that produces the column totals that will appear at the end of the report.
proc print data=orion.sales;
var Last_Name First_Name Salary;
sum Salary;
run;
Question
Which SUM statement will produce column totals for the variables Quantity and Total_Retail_Price?
a. sum=Quantity, Total_Retail_Price;
b. sum Quantity, Total_Retail_Price;
c. sum Quantity Total_Retail_Price;
d. sum=Quantity sum=Total_Retail_Price;
The correct answer is c. You specify the variable names separated by blanks to display totals for variables in your report.
2. Submit the code and view the results. Notice that there are nine variables.
3. In the editor, modify the code. You want to display Last_Name, First_Name, and Salary, and you want to
summarize the values of Salary. Add a VAR statement and a SUM statement to the PROC PRINT step.
4. Submit the code. Check the log to make sure the code ran successfully.
5. View the report. The report shows only the three variables that you requested, and the salary total is in the last
row of the report.
Code Challenge
Write a statement to specify that the variables ID, Name, Company, and Policy be printed for the SAS data set
orion.sales.
You specify variable names separated by blanks in the VAR statement in the order in which you want the variables
printed.
Business Scenario
The management team now wants a report that displays the names and salaries of the sales employees earning a salary
that is less than $25,500. To create this report, you'll need to subset the observations.
WHERE where-expression;
When you use a WHERE statement, your output contains only the observations that meet the conditions specified in the
where-expression. The WHERE expression defines a condition for selecting observations, and can be any valid SAS
expression.
This WHERE statement defines the condition for your task: it selects observations where the value of Salary is less than
$25,500. Notice that we still have the VAR statement, because those are still the variables we want to see. We've
removed the SUM statement here though, because we aren't interested in seeing a salary total anymore.
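Based on the description above, the step might look like this sketch. The variable order in the VAR statement is carried over from the earlier example.

```sas
/* List sales employees who earn less than 25500. */
proc print data=orion.sales;
   var Last_Name First_Name Salary;
   where Salary<25500;   /* numeric constant: no quotation marks, $ or comma */
run;
```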
Now let's explore the WHERE expression a bit. An expression is a sequence of operands and operators that form a set of
instructions. The operands can be constants or variables. A constant operand is a fixed value, such as 25500. Numeric
constants do not use quotation marks or special characters.
Character constants must be enclosed in quotation marks and are case sensitive. If the value of the variable Gender is
equal to the uppercase M character constant, SAS will select that observation.
A variable operand must be a variable from the input data set. Orion.sales is the input data set in this program, so the
variable Salary must exist in that data set. SAS selects observations where the value of Salary is less than 25500.
You can use comparison, arithmetic, or logical operators in a WHERE expression. There are also several special WHERE
operators that can only be used in WHERE statements. Let's look at each of these types of operators in more detail.
= EQ equal to
^= ¬= ~= NE not equal to
Look over the following examples of comparison operators used in WHERE statements.
where Gender='M';
where Salary ne .;
where Salary>50000;
where Salary<=60000;
The IN operator selects observations in which the value of a variable is equal to one of the values in a list. The value
list must be enclosed in parentheses, and the values must be separated by either commas or blanks. In the last example,
SAS selects observations where the value of Country is equal to AU or US. Remember that character values must be
enclosed in quotation marks.
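A short sketch of the IN operator in a complete step, assuming the orion libref is assigned:

```sas
/* Select observations where Country is AU or US. */
proc print data=orion.sales;
   var Last_Name First_Name Country;
   where Country in ('AU','US');
   /* Equivalent form with blanks instead of commas: */
   /* where Country in ('AU' 'US');                  */
run;
```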
Symbol Definition
** exponentiation
* multiplication
/ division
+ addition
- subtraction
where Salary+Bonus<=10000;
Activity
Copy and paste the following code, which contains two WHERE statements, into the editor. Submit the code.
The correct answer is d. The following log message indicates that SAS replaces the first WHERE condition with the
second WHERE condition:
Logical Operators
You can use logical operators in a WHERE statement to combine or modify expressions.
| OR logical or
Here, you can see the logical operators that are available and several examples of WHERE statements that use these
operators. The logical operator AND finds observations that satisfy both conditions. The logical operator OR finds
observations that satisfy one or both conditions. And the logical operator NOT modifies a condition by finding the
complement to the specified criteria.
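The three logical operators can be sketched as follows. The salary threshold of 30000 is an arbitrary value chosen for illustration.

```sas
/* AND: both conditions must be true. */
proc print data=orion.sales;
   var Last_Name Country Salary;
   where Country='AU' and Salary<30000;
run;

/* OR: at least one condition must be true. */
proc print data=orion.sales;
   var Last_Name Country Salary;
   where Country='AU' or Salary<30000;
run;

/* NOT: selects the complement of the condition. */
proc print data=orion.sales;
   var Last_Name Country Salary;
   where not (Country='AU');
run;
```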
2. Add a WHERE statement to select only the employees who are from the country Australia and who have a Salary
value that is less than $25,500. Notice that you use the logical operator AND to combine the two expressions.
3. Add the variable Country to the VAR statement so that it's included in the report.
5. View the report. Only the employees whose Salary and Country values met the WHERE expression conditions
are displayed. But take a look at the Obs column. SAS displays the original observation numbers.
6. To suppress the Obs column, modify the PROC PRINT statement by adding the NOOBS option.
7. Submit the code and verify that the report does not have the Obs column.
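After the steps above, the finished program might look like this sketch. The variable order in the VAR statement is an assumption.

```sas
/* Australian employees earning less than 25500, without the Obs column. */
proc print data=orion.sales noobs;
   var Last_Name First_Name Salary Country;
   where Country='AU' and Salary<25500;
run;
```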
Question
Which WHERE statement correctly subsets on the numeric values for May, June, or July and missing character names?
The correct answer is b. You specify the value list in the IN operator in parentheses, and separate the values by either
commas or blanks. Only character values must be enclosed in quotation marks, and a blank represents a missing
character value.
Business Scenario
You need to create a report that lists only the Australian sales representatives with Rep anywhere in their job titles, so
you need to subset by Country and Job_Title. Think about it. You can add another condition using the AND operator, but
how do you specify this condition? Let's find out.
The position of the substring within the variable's value does not matter. For example, the WHERE statements shown
here will select observations in which the value for Job_Title is Sales Rep. I, Sales Rep. II, and so on.
The CONTAINS operator is case sensitive, so you must specify the substring in the exact case that you want it to match.
Using the CONTAINS Operator
In this demonstration, you use the CONTAINS operator to select observations for your report.
1. Copy and paste the following program into the editor. The WHERE statement uses the special operator
CONTAINS to specify the conditions for the report.
2. Submit the code and view the log. The log shows that SAS read 61 observations from orion.sales.
3. View the report. You can see that all of the values for Country are AU and all of the Job_Title values contain the
Rep character string.
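A sketch of a program that fits this demonstration is shown here. The VAR statement is an assumption, added only to keep the report narrow.

```sas
/* Australian employees with Rep anywhere in Job_Title. */
proc print data=orion.sales noobs;
   var Last_Name First_Name Country Job_Title;
   where Country='AU' and Job_Title contains 'Rep';
   /* The question mark is the symbol form of CONTAINS: */
   /* where Country='AU' and Job_Title ? 'Rep';         */
run;
```

Because CONTAINS is case sensitive, the substring 'rep' would not match these job titles.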
Mnemonic Definition
The BETWEEN-AND operator selects observations in which the value of a variable falls within an inclusive range of
values. For example, the first WHERE statement in the program below selects observations in which the value of Salary
is between 50000 and 100000. You could write an equivalent statement without using the BETWEEN-AND operator, as
shown in the second WHERE statement. You could use the BETWEEN-AND operator along with the NOT operator to
select values outside of the specified range, as shown in the last WHERE statement.
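The three WHERE statements described above can be sketched as three separate steps, because a second WHERE statement in the same step would replace the first.

```sas
/* Salary within the inclusive range 50000 to 100000. */
proc print data=orion.sales;
   var Last_Name First_Name Salary;
   where Salary between 50000 and 100000;
run;

/* Equivalent condition without BETWEEN-AND. */
proc print data=orion.sales;
   var Last_Name First_Name Salary;
   where Salary>=50000 and Salary<=100000;
run;

/* Salary outside the range. */
proc print data=orion.sales;
   var Last_Name First_Name Salary;
   where Salary not between 50000 and 100000;
run;
```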
Use the WHERE SAME AND operator to add more conditions to an existing WHERE expression later in the program
without retyping the original conditions. The WHERE SAME AND condition augments the original condition. This means
that SAS selects observations where Country='AU' and Salary<25500 and Gender='F'.
proc print data=orion.sales;
var First_Name Last_Name Gender Salary Country;
where Country='AU' and Salary<25500;
where same and Gender='F';
run;
The IS NULL and IS MISSING operators select observations in which the value of a variable is missing. These operators
can be used for both character and numeric variables. For example, the WHERE statements shown below both select
observations in which the value of Employee_ID is missing.
Here's a question: can you think of another operator you could use to select observations with a missing value? You
could use the equals operator, as long as you knew whether the variable was numeric or character.
where Salary=.;
where Last_Name=' ';
The NOT logical operator can be added to select observations with nonmissing values.
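Here is a sketch of these operators, assuming that Employee_ID is a variable in orion.sales. IS NULL and IS MISSING work the same way for character and numeric variables.

```sas
/* Observations with a missing Employee_ID. */
proc print data=orion.sales;
   where Employee_ID is missing;
   /* Equivalent: where Employee_ID is null; */
run;

/* Observations with a nonmissing Employee_ID. */
proc print data=orion.sales;
   where Employee_ID is not missing;
run;
```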
The LIKE operator selects observations by comparing character values to specified patterns. There are two special
characters available for specifying a pattern: the percent sign specifies that any number of characters, including zero
characters, can occupy that position. The underscore specifies that exactly one character must occupy that position. You
can specify consecutive underscores. You can also specify a percent sign and an underscore in the same pattern.
Symbol Replaces
% any number of characters
_ one character
For example, in the following program the first WHERE statement selects observations in which the value of Name ends
in an uppercase N, which is preceded by any number of characters. The second WHERE statement selects observations
in which the value of Name begins with an uppercase T, followed by a single character, followed by a lowercase m,
followed by any number of characters. This statement can match both the name Tom and the name Tommy.
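The two patterns can be tried with a small test data set. The data set work.names and its values are hypothetical, created here only for illustration.

```sas
/* Hypothetical test data. */
data work.names;
   input Name $;
   datalines;
MARTIN
Tom
Tommy
Dana
;
run;

/* Names that end in an uppercase N. */
proc print data=work.names;
   where Name like '%N';
run;

/* Uppercase T, any single character, lowercase m, then anything. */
proc print data=work.names;
   where Name like 'T_m%';
run;
```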
Question
The values for the variable Name in the table below are in the form last name, first name. Which WHERE statement will
return all the observations that have a first name starting with the letter M for the given values?
Name
Elvish, Irenie
Ngan, Christina
Hotstone, Kimiko
Daymond, Lucian
Hofmeister, Fong
Denny, Satyakam
Clarkson, Sharryn
Kletschkus, Monica
a. where Name like '_, M_';
b. where Name like '%, M%';
c. where Name like '_, M%';
d. where Name like '%, M_';
The correct answer is b. By using a percent sign, the pattern specifies last names that contain any number of characters.
The last name must be followed by a comma, a space, and an uppercase M to start the first name. This can be followed
by any number of characters. If you use an underscore, exactly one character must occupy that position.
Business Scenario
The sales manager wants a report that includes only customers who are 21 years old. He also wants the Customer_ID
variable to print at the beginning of each row.
ID variable(s);
The variable you specify replaces the Obs column. We'll specify Customer_ID.
3. In the editor, add a WHERE statement to select the observations for customers who are 21 years old. Also add a
VAR statement to display only the variables Customer_ID, Customer_Name, Customer_Gender,
Customer_Country, Customer_Group, Customer_Age_Group, and Customer_Type.
4. Submit the code and check the log. Verify that SAS read 6 observations.
5. View the report and verify that the Customer_Age_Group has the value 15-30 years in all observations.
Consider this: If the observations were to wrap to another line, the observation numbers would make it easy to
find the continuation. But a better technique is to choose a variable that uniquely identifies observations.
6. In the editor, add the ID statement to replace the Obs column and identify observations based on the
Customer_ID variable. Remove Customer_ID from the VAR statement because you don't want the variable to
appear twice.
7. Submit the code and check the log. The log still shows that SAS read 6 observations, and in the report, the
Customer_ID variable identifies the observations rather than the observation numbers.
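The finished program might resemble the sketch below. The data set name orion.customer_dim and the variable Customer_Age are assumptions, because the original program is not shown here.

```sas
/* Customers who are 21 years old, identified by Customer_ID. */
/* orion.customer_dim and Customer_Age are assumed names.     */
proc print data=orion.customer_dim;
   where Customer_Age=21;
   id Customer_ID;
   var Customer_Name Customer_Gender Customer_Country
       Customer_Group Customer_Age_Group Customer_Type;
run;
```

The ID variable appears in the first column in place of Obs, so it is left out of the VAR statement.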
Business Scenario
The payroll manager wants a report that displays the observations from orion.sales in ascending order by the variable
Salary. As you've probably noticed, PROC PRINT displays observations in the order in which they appear in your data set.
But you can write a SAS program to sort the observations in the data set and then use PROC PRINT to display the sorted
data set.
PROC SORT
A basic PROC SORT step has three statements: a PROC SORT statement, a BY statement, and a RUN statement.
PROC SORT DATA= input-SAS-data-set
<OUT=output-SAS-data-set>;
BY <DESCENDING> by-variable(s);
RUN;
As with a PROC PRINT step, you use the DATA= option to specify the input data set. We need to sort orion.sales, so it's
our input data set. Next, we use the OUT= option in the PROC SORT step because we don't want to permanently sort
orion.sales. We want to sort orion.sales and create a new data set, work.sales_sort. Note that in the syntax box, the
general form of the OUT= option is enclosed in angle brackets (<>). These brackets indicate this element is optional.
Every PROC SORT step must include a BY statement. The BY statement specifies one or more variables in the input data
set whose values are used to sort the data. These are called BY variables. The BY statement also indicates whether you
want to sort in ascending or descending order. By default, SAS sorts in ascending order and you don't have to specify
anything additional in the BY statement. In the example above, PROC SORT will sort the observations by the values of
Salary in ascending order.
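Putting the pieces described above together, a sketch of the sort and the report might look like this:

```sas
/* Sort a copy of orion.sales by Salary, lowest to highest.   */
/* OUT= writes the sorted rows to work.sales_sort and leaves  */
/* orion.sales unchanged.                                     */
proc sort data=orion.sales
          out=work.sales_sort;
   by Salary;
run;

/* PROC SORT produces no report, so print the sorted data set. */
proc print data=work.sales_sort;
run;
```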
1. Copy and paste the following program into the editor. SAS will display the Salary values from lowest to highest,
that is, in ascending order.
2. Submit the code and check the log. You can see that SAS read 165 observations from orion.sales and created
work.sales_sort successfully.
You haven't created the report that the payroll manager requested yet. You've only sorted the data. To create
the report, you need to write a PROC PRINT step.
4. Submit the code and view the report. The report shows that the data set is sorted by the value of the variable
Salary. You can see that the values are in ascending order, with the highest salary in the last observation.
Question
Which step sorts the observations in a SAS data set and overwrites the same data set?
a.
proc sort data=work.empsau
out=work.sorted;
by First;
run;
b.
proc sort data=orion.empsau
out=empsau;
by First;
run;
c.
proc sort data=work.empsau;
by First;
run;
The correct answer is c. PROC SORT replaces the original data set unless you specify an output data set in the OUT=
option.
Code Challenge
Add a statement to specify that the values in work.tests be sorted by the variable TimeMin.
proc sort data=work.tests;
by TimeMin;
run;
The BY statement specifies the variable or variables whose values are used to sort observations in the data set.
Business Scenario
Now that you're familiar with sorting, the payroll manager at Orion Star has asked you to create another report that
displays sales employees grouped by Country and displays them in descending Salary order within Country. So within
each country, you want the observations arranged by Salary, from highest to lowest.
To sort on a variable in descending order, you must specify the DESCENDING keyword immediately before each variable
that you want in descending order. SAS will sort the observations from the largest value to the smallest value.
Question
Which BY statement in a PROC SORT step can produce the output shown here?
a. by Postal_Code Employee_ID;
b. by descending Postal_Code Employee_ID;
c. by Postal_Code descending Employee_ID;
d. by descending Postal_Code descending Employee_ID;
The correct answer is d. In the output, the observations are sorted in descending order for Postal_Code and, within each
postal code, in descending order for Employee_ID. The BY statement must specify the keyword DESCENDING before
each variable.
1. Copy and paste the following code into the editor. Notice that you're creating the output data set work.sales2.
2. Submit the code and check the log. The log shows that the code ran successfully. SAS read 165 observations from
the data set orion.sales.
3. View the report. The first country listed is Australia, or AU. Can you tell whether the Salary variable has been
sorted in descending order? Yes. Scroll down the report to see that the Salary values decrease. And starting with
observation 64, you see the employees from the US, and the highest Salary value is listed first.
The payroll manager wanted the report grouped by Country. This report displays the AU employees first
because you sorted Country in ascending order, but it doesn't actually group the observations by Country.
This BY statement specifies the variable to use to form BY groups. The variables in the BY statement are called BY
variables. Think about this: when you specify variables in the BY statement, what do you need to verify in the input data
set? The input data set must be sorted on the variables specified in the BY statement. The variables must also be sorted
in the order specified, either ascending or descending.
2. In the PROC PRINT step, add a BY statement that groups the data by Country.
3. Submit the code and examine the report. You can see that SAS grouped the report by Country. The first table in
the report is for employees who are from Australia, and the second table is for employees who are from the US.
Notice that the Salary values are still listed in descending order.
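As a rough sketch of step 2, the BY statement goes inside the PROC PRINT step. This assumes that work.sales2 has already been sorted by Country, for example by an earlier PROC SORT step.

```sas
/* The input data set must already be sorted by the BY variable. */
proc print data=work.sales2;
   by Country;   /* produces a separate table for each Country value */
run;
```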
Activity
Copy and paste the following program into the editor and submit it. View the log.
a. yes
b. no
The correct answer is b. The log shows an error message indicating that the program failed. The input data set is not
sorted by Gender. Only a portion of the output is produced.
Question
In which PROC step would you add the following WHERE statement to result in the most efficient processing?
where Salary<25500;
The correct answer is a. Subsetting in the PROC SORT step is more efficient. It selects and sorts only the required
observations.
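The idea can be sketched as follows. The output data set name work.lowpaid and the BY variables are assumptions chosen for illustration; only the WHERE statement comes from the question.

```sas
/* Subset while sorting: only observations with Salary<25500 are read and sorted. */
proc sort data=orion.sales out=work.lowpaid;   /* work.lowpaid is a hypothetical name */
   where Salary<25500;
   by Country descending Salary;
run;

/* The PROC PRINT step no longer needs a WHERE statement. */
proc print data=work.lowpaid;
   by Country;
run;
```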
Business Scenario
Suppose you need to share one of your reports at an upcoming staff meeting. You know that the information in the
report is accurate, but your report could use some improvements in the way it looks. You can enhance the report by
adding titles and footnotes.
To add a title to your report, you use the TITLE statement, and to add a footnote, you use the FOOTNOTE statement.
Aside from the keyword at the beginning, these two statements have the same syntax.
TITLEn 'text'; FOOTNOTEn 'text';
Following the keyword, you specify a value for n: a number from 1 to 10 that indicates the line on which the title or
footnote appears. In the title or footnote area, line 1 is the first line and line 10 is the last. If you don't specify a number,
SAS assumes that you're referring to line 1.
Then, within quotation marks, you specify the text that you want to appear in the title or footnote. You can use either a
set of single quotation marks or a set of double quotation marks to enclose the text string.
In the following example, the TITLE statement specifies a title for line 1, Orion Star Sales Staff. You can specify only one
title or footnote in a single statement.
You'd like the title Salary Report to appear on the second title line, so you need to add another TITLE statement, TITLE2.
The FOOTNOTE statement specifies Confidential.
TITLE and FOOTNOTE statements are global statements, so they can stand alone. Also, any titles or footnotes that you
assign remain in effect until you change them, cancel them, or end your SAS session.
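A short sketch of the example described above. The title and footnote text comes from the narration; the VAR list in the PROC PRINT step is an assumption.

```sas
/* Global statements: they can appear outside of any step. */
title1 'Orion Star Sales Staff';
title2 'Salary Report';
footnote1 'Confidential';

/* The titles and footnote apply to this report and to any later output */
/* until they are changed or canceled.                                   */
proc print data=orion.sales;
   var Employee_ID Last_Name Salary;
run;
```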
1. Copy and paste the following code into the editor to add titles and a footnote to your report.
2. Submit the code and examine the report. You can see that the titles Orion Star Sales Staff and Salary Report
have been added to the top, and the footnote Confidential to the bottom of the report.
3. Copy and paste the following code into the editor, and then submit it.
title1 'Orion Star Sales Staff';
title2 'Salary Report';
footnote1 'Confidential';
4. Examine the new report. This report shows the same titles and the same footnote as the previous report, but
you don't want these titles. You want different titles to appear that are meaningful for this information.
Remember that titles and footnotes are global statements, and when you assign them, they remain in effect
until you change them, cancel them, or end your SAS session.
Code Challenge
Write a statement to specify the text Wellness Clinic Insurance on the third title line of the output from the PRINT step
below.
;
proc print data=clinic.insure;
var id name company policy;
where id between 1250 and 7590;
run;
You specify the keyword TITLE followed by the number of the line where the title is to appear, and then the text in
quotation marks.
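The answer itself is not shown in this transcript. Following the rule just stated, one statement that fills the blank would be:

```sas
/* Goes in place of the blank, before the PROC PRINT step. */
title3 'Wellness Clinic Insurance';
```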
When we add another TITLE1 statement, the new text replaces the current text for TITLE1.
Redefining a title also cancels all higher-numbered titles, so this TITLE1 statement changes title 1 and cancels titles 2 and
3. You change footnotes the same way.
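The replacement behavior can be sketched like this. The first three titles are the ones used in this lesson; the replacement text Employee Information and the VAR lists are hypothetical.

```sas
title1 'Orion Star Sales Staff';
title2 'Salary Report';
title3 'Human Resources';

/* This report is printed with all three title lines. */
proc print data=orion.sales;
   var Employee_ID Last_Name Salary;
run;

/* Redefining TITLE1 replaces title 1 and cancels titles 2 and 3. */
title1 'Employee Information';   /* hypothetical title text */

/* This report is printed with the single new title. */
proc print data=orion.sales;
   var Employee_ID Last_Name Job_Title;
run;
```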
To cancel all previously defined titles and footnotes, you can specify null TITLE and FOOTNOTE statements, which have
no numbers and no text.
title1 'Orion Star Sales Staff';
title2 'Salary Report';
title3 'Human Resources';
title;
footnote;
It's a good practice to cancel all titles and footnotes at the end of your program so that no unexpected titles and
footnotes appear on output that you generate later in your SAS session.
Question
Which footnote or footnotes appear in the second procedure results?
b. Orion Star
Non Sales Employees
d. Orion Star
Non Sales Employees
Confidential
The correct answer is b. When you run the second PROC PRINT step, the FOOTNOTE2 statement replaces the previous
footnote with the same number: Non Sales Employees replaces Sales Employees. It also cancels all footnotes with higher
numbers, so FOOTNOTE3, Confidential, does not appear in the results. The resulting footnotes are Orion Star and Non
Sales Employees.
1. Copy and paste the following code into the editor to add a new title to the second report. This code adds a TITLE
statement before the second PROC PRINT step. You want this title to replace the previous TITLE1 statement, so
you number it TITLE1.
2. Submit the code and view the report. As you can see, the first report has the titles Orion Star Sales Staff and
Salary Report. The second report now has its own title.
3. As part of good programming practice, add the null TITLE and FOOTNOTE statements to the bottom of the
program to cancel all titles and footnotes for any output that you might generate later.
title;
footnote;
4. Submit these statements, and now your session is ready for the next task.
Business Scenario
So far, you haven't made any changes to the appearance of variable names in the body of your reports. In the reports
that you've created, the variable names appear exactly as they are stored in the input data set. For your upcoming
meeting, suppose you want your reports to display more descriptive text instead of the variable names. For example,
you want to change the appearance of the variables Employee_ID, Last_Name, and Salary in your report.
LABEL variable='label'
variable='label'...;
In this example, you want to display the variable Employee_ID as Sales ID, the variable Last_Name as Last Name with no
underscore, and the variable Salary as Annual Salary.
A label can be up to 256 characters long. You can specify labels for multiple variables in one LABEL statement, or you can
use a separate LABEL statement for each variable.
Most SAS procedures display labels automatically, but PROC PRINT does not. You have to add the LABEL option to your
PROC PRINT statement to tell SAS to display the labels.
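A minimal sketch of the labels described above. The VAR statement is an assumption; the labels are the ones named in the text.

```sas
/* The LABEL option tells PROC PRINT to display labels instead of variable names. */
proc print data=orion.sales label;
   var Employee_ID Last_Name Salary;
   /* Temporary labels: they apply only to this step. */
   label Employee_ID='Sales ID'
         Last_Name='Last Name'
         Salary='Annual Salary';
run;
```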
2. Submit the step and view the variable names as they currently appear in the data set orion.sales.
3. In the editor, copy and paste the following code to replace the existing code. You can see the labels for the three
variables. Remember that in order to print the labels, you must include the LABEL option in the PROC PRINT
step.
4. Submit this code and then check the log. Notice that SAS doesn't print any messages regarding whether labels
were applied. But the code ran without errors.
5. View the report. As you can see, the temporary labels now appear instead of the variable names.
Using the SPLIT= Option
The SPLIT= option in PROC PRINT specifies a split character to control line breaks in column headings. This option is most
useful if you are using text or listing output rather than HTML output. Here's the syntax.
SPLIT='split-character';
After SPLIT=, you specify a split-character, which will indicate where to wrap the label in the report. In the following
example, the split character is an asterisk.
Next, in the LABEL statement, you add the split character to the label text at the place where you want the label to wrap
to the next line. Here, an asterisk appears between the two words of the label Annual Salary. In the output, these two
words will appear on different lines.
Do you think the LABEL option is necessary in the PROC PRINT statement now? Actually, SAS knows to print the labels
because you're using the SPLIT= option. The PRINT procedure uses labels only when the LABEL or SPLIT= option is
specified.
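The example is not reproduced in this transcript. A sketch under the same assumptions (asterisk as the split character, Annual Salary as the label to wrap; the VAR statement is assumed) might look like this:

```sas
/* SPLIT= names the split character and also tells PROC PRINT to use labels, */
/* so the LABEL option is not needed here.                                   */
proc print data=orion.sales split='*';
   var Employee_ID Last_Name Salary;
   label Employee_ID='Sales ID'
         Last_Name='Last Name'
         Salary='Annual*Salary';   /* heading wraps at the asterisk */
run;
```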
Topic Summaries
You can use the SUM statement in a PROC PRINT step to calculate and display report totals for the requested numeric
variables.
The WHERE statement in a PROC PRINT step subsets the observations in a report. When you use a WHERE statement,
the output contains only the observations that meet the conditions specified in the WHERE expression. This expression
is a sequence of operands and operators that form a set of instructions that define the condition. The operands can be
constants or variables. Remember that variable operands must be defined in the input data set. Operators include
comparison, arithmetic, logical, and special WHERE operators.
WHERE where-expression;
You can use the ID statement in a PROC PRINT step to specify a variable to print at the beginning of the row instead of
an observation number. The variable that you specify replaces the Obs column.
ID variable(s);
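The three statements summarized above can be combined in one step. This is only an illustrative sketch; the particular variables and WHERE condition are assumptions.

```sas
proc print data=orion.sales;
   id Employee_ID;                        /* replaces the Obs column */
   var Last_Name Job_Title Salary;
   where Country='AU' and Salary<25500;   /* subsets the observations */
   sum Salary;                            /* adds a total for Salary */
run;
```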
Every PROC SORT step must include a BY statement to specify one or more BY variables. These are variables in the input
data set whose values are used to sort the data. By default, SAS sorts in ascending order, but you can use the keyword
DESCENDING to specify that the values of a variable are to be sorted in descending order. When your SORT step has
multiple BY variables, some variables can be in ascending and others in descending order.
You can also use a BY statement in PROC PRINT to display observations grouped by a particular variable or variables. The
groups are referred to as BY groups. Remember that the input data set must be sorted on the variables specified in the
BY statement.
Enhancing Reports
You can enhance a report by adding titles, footnotes, and column labels. Use the global TITLE statement to define up to
10 lines of titles to be displayed at the top of the output from each procedure. Use the global FOOTNOTE statement to
define up to 10 lines of footnotes to be displayed at the bottom of the output from each procedure.
TITLEn 'text';
FOOTNOTEn 'text';
Titles and footnotes remain in effect until you change or cancel them, or until you end your SAS session. Use a null TITLE
statement to cancel all titles, and a null FOOTNOTE statement to cancel all footnotes.
Use the LABEL statement in a PROC PRINT step to define temporary labels to display in the report instead of variable
names. Labels can be up to 256 characters in length. Most procedures use labels automatically, but PROC PRINT does
not. Use the LABEL option in the PROC PRINT statement to tell SAS to display the labels. Alternatively, the SPLIT= option
tells PROC PRINT to use the labels and also specifies a split character to control line breaks in column headings.
PROC PRINT DATA=SAS-data-set LABEL;
LABEL variable='label'
variable='label'
... ;
RUN;
SPLIT='split-character';
Sample Programs
Selecting Observations
Objectives
You use the keyword FORMAT, followed by the variable and the SAS format that you want to apply to the variable. You
can use a separate FORMAT statement for each variable, or you can format several variables using either the same
format or different formats in a single FORMAT statement.
SAS Formats
What exactly is a format? A format is an instruction that tells SAS how to display data values. For example, you can
display a numeric value with commas and a dollar sign.
There are many existing SAS formats that you can use, and they all use the same form, as shown here.
<$>format<w>.<d>
The dollar sign indicates a character format and precedes the name of the SAS format. Then you specify the total format
width, including decimal places and special characters. The period is required syntax. Finally, you can specify the number
of decimal places in numeric formats.
Let's look at a few general examples. This list shows the general form for several common SAS formats. Take a moment
to read through these definitions.
Format Definition
COMMAw.d writes numeric values with a comma separating every three digits and a
period separating the decimal fraction.
DOLLARw.d writes numeric values with a leading dollar sign, a comma that separates
every three digits, and a period that separates the decimal fraction.
Character values are truncated if they do not fit in the specified width. In this first example, you can see that the stored
value, Programming, is displayed as Prog because the assigned format only has a width of 4. If you do not specify a
width that is large enough to accommodate a numeric value, the displayed value is automatically adjusted to fit into the
width.
Let's look at some of the numeric formats. The 12. format, which is the same as 12.0, doesn't specify the number of
decimal places, so none are displayed. SAS rounds the displayed value to the nearest integer. With the 12.2 format, the
value is displayed in a field 12 positions wide with 2 decimal places. The decimal value is rounded to the nearest
hundredth. The COMMA12.2 format inserts a comma to separate every three digits. The DOLLAR12.2 format also inserts a
leading dollar sign in the displayed value. 12 is the total width of the displayed value, including the dollar sign, commas,
decimal point, and decimal places. The COMMAX12.2 format reverses the two separators: a period separates every three
digits, and a comma separates the decimal fraction. Lastly, in the EUROX12.2 format, a euro symbol is inserted in the
displayed value.
In the first row of this table, 12 is wide enough to display the value, including the dollar sign, comma, decimal point, and
decimal places. You can see that the displayed value is 10 positions wide. What if you specify a width of 9, as in the
second row? The comma is not displayed. With a width of 8 in the third row, the dollar sign is also dropped. The fourth
row shows that a width of 5 results in the omission of decimal places in the displayed value. And the last row shows that
when a width of 4 is specified, the value is rounded to 27,000 and displayed in E-notation. But remember, the format
only affects the displayed value. The stored value is not affected by a format.
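One way to see this behavior for yourself is to write a value to the log with each width. The table is not reproduced in this transcript, so the value below is a hypothetical stand-in, and the widths are the ones mentioned in the narration.

```sas
data _null_;
   Amount=27134.57;          /* hypothetical stored value */
   put Amount dollar12.2;    /* wide enough for every character */
   put Amount dollar9.2;
   put Amount dollar8.2;
   put Amount dollar5.2;
   put Amount dollar4.2;
   put Amount=;              /* the stored value itself is unchanged */
run;
```

Compare the lines in the log with the description above to see how SAS adjusts the displayed value as the width shrinks.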
Question
Which format creates the displayed value shown here?
$5,950.35
a. DOLLAR4.2
b. COMMA8.2
c. DOLLAR9.2
d. $12.
The correct answer is c. The DOLLARw.d format writes numeric values with a leading dollar sign, a comma that separates
every three digits, and a period that separates the decimal fraction. The displayed value is nine characters wide, so the
total format width, w, is set to 9. This includes the special characters and decimal places. The displayed value contains
two decimal places, so d is set to 2.
This table lists several common SAS date formats and shows how each one affects a stored SAS date value.
Format       Stored Value   Displayed Value
MMDDYY6.     0              010160
MMDDYY8.     0              01/01/60
MMDDYY10.    0              01/01/1960
Let's look at the first three formats, the MMDDYY formats. These formats display values as a numeric month, day, and
year. The number at the end of the format is the width of the displayed field. It determines whether a forward slash
separator will be used, and if the year displays as two digits or four digits. A width of 6 does not provide enough room
for a separator and only allows for a two-digit year. A width of 8 allows for a separator and a two-digit year. With a
width of 10, the value displays with a separator and a four-digit year. So the format MMDDYY10. displays a date value
with a width of 10. It includes slashes to separate the month, day, and year values, and displays the year as a four-digit
value.
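As a quick check, the three MMDDYY widths can be written to the log from a DATA _NULL_ step. The variable name d is an assumption; the stored value 0 is the one used in the table.

```sas
data _null_;
   d=0;               /* SAS date value 0 is January 1, 1960 */
   put d mmddyy6.;    /* no separator, two-digit year */
   put d mmddyy8.;    /* slashes, two-digit year */
   put d mmddyy10.;   /* slashes, four-digit year */
run;
```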
The DDMMYY formats are similar to the MMDDYY formats except that they display values as a numeric day, month, and
year. Click the Information button in the course to see more SAS date format examples, as well as the tables of formats
you saw earlier.
Question
Which FORMAT statement formats the variable values as shown below?
The variable Emp_Term_Date is displayed as a two-digit month, a two-digit day, and a two-digit year; the month
precedes the day. The displayed value has a length of 8. This is the MMDDYY8. format.
1. Copy and paste the following program into the editor to view the variables in the orion.sales data set before you
apply formats.
2. Submit the program and view the results. As you can clearly see, the Hire_Date values are difficult to decipher.
Remember that these date values are shown as numeric values. Now let's format the variables for your new
report.
3. In the editor, add a FORMAT statement that assigns the MMDDYY10. format to the variable Hire_Date and
assigns the DOLLAR8. format to the variable Salary.
4. Submit this code and look at the results. Notice that Salary has a leading dollar sign and commas inserted to separate
each group of three digits, and is displayed in a column that is 8 positions wide. Hire_Date is displayed as a two-digit month, a
slash, a two-digit day, a slash, and a four-digit year. This column is 10 positions wide. This report is much easier
to understand now.
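The program for these steps is not reproduced in this transcript. A sketch of what step 3 describes might look like this; the VAR list is an assumption, and the two formats are the ones named in the step.

```sas
proc print data=orion.sales;
   var Employee_ID Last_Name Salary Hire_Date;   /* assumed variable list */
   /* One FORMAT statement can assign different formats to several variables. */
   format Hire_Date mmddyy10. Salary dollar8.;
run;
```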
PROC FORMAT
The general form of the PROC FORMAT step is shown below:
PROC FORMAT;
VALUE format-name value-or-range1='formatted-value1'
value-or-range2='formatted-value2'
...;
RUN;
In a basic PROC FORMAT step, the PROC FORMAT statement consists only of the keywords PROC FORMAT. The VALUE
statement defines the format. First, you specify a format name. Then you specify a value or range of values, and lastly
you specify the formatted value, or how you want the value to be displayed.
Type Format-name
character $CTRYFMT
character $_ST3FMT_
numeric ORIONSTAR_SALRANGE2_FMT_
numeric _SALRANGE
A format name can have a maximum of 32 characters. The name of a format that applies to character values must begin
with a dollar sign ($), followed by a letter or underscore. The name of a format that applies to numeric values must
begin with a letter or underscore. A format name cannot end in a number. All remaining characters can be letters,
underscores, or numbers. A user-defined format name cannot be the name of a SAS format. Also, notice that a format
name does not end with a period in the VALUE statement. Later, when you refer to the format in a FORMAT statement,
you'll specify the period.
value-or-range = formatted-value
'AU' or 1 = 'Australia'
When you specify the value-or-range, you must enclose character values in quotation marks. The character values that
you specify must match the case of the variable's values. You do not enclose numeric values in quotation marks.
In a range, a hyphen (-) separates the values that define the endpoints of the range. When you specify a range of
character values, be careful not to enclose the entire range in quotation marks. If you do this, SAS assumes that all of the
characters, including the hyphen, are part of a single character value.
In a list, commas separate the individual values. The formatted value is always a character string, no matter whether the
format applies to character values or numeric values. A character string can consist of any type of character. Usually,
each formatted value is enclosed in quotation marks, as shown above. However, SAS does not require the quotation
marks for a formatted value. Formatted values can be up to 32,767 characters in length.
When you specify a value-or-range, you can also use the keyword OTHER to specify values that do not match any other
value-or-range. If you do not include the keyword OTHER, then SAS applies the format only to values that match the
value-range sets that you specify. If SAS encounters a value that you did not anticipate, SAS cannot apply the format but
instead displays that value as it's stored in the data set.
value-or-range = formatted-value
OTHER = 'Australia'
proc format;
value $ctryfmt 'AU'='Australia'
'US'='United States'
other='Miscoded';
run;
Remember that the report needs to display full country names rather than the country codes. The country values are
character, so you need a character format. Its name must begin with a dollar sign. Here we chose the name $CTRYFMT.
We can call this the $ country format. Notice that there is no period at the end of the format name.
Now look at our value ranges. When you apply this format to a variable later, the output will display full country names
instead of country codes. Notice that the last value-range set specifies the keyword OTHER to include all values that do
not match any other value or range. In this example, the output will label any values other than the two country codes
as miscoded.
Each VALUE statement defines only one format. However, you can define multiple formats in a single PROC
FORMAT step by adding multiple VALUE statements.
proc format;
value $ctryfmt 'AU'='Australia'
'US'='United States'
other='Miscoded';
value $sports
'FB'='Football'
'BK'='Basketball'
'BS'='Baseball';
run;
Question
Which user-defined format names are valid? Select all that apply.
a. $STFMT
b. $3LEVELS
c. _4YEARS
d. SALRANGES
e. DOLLAR
The correct answers are a, c, and d. Character formats begin with a dollar sign and must be followed by a letter or
underscore. Answer choice b has a dollar sign followed by a number. Also, user-defined formats cannot be the name of a
SAS format, as in answer choice e.
Code Challenge
Complete the PROC FORMAT step by adding the label Soccer to the value SK.
proc format;
value $sports
'FB'='Football'
'BK'='Basketball'
'BS'='Baseball'
;
run;
'SK'='Soccer';
The label is enclosed in quotation marks and assigned to the character value, which is also enclosed in quotation marks.
The VALUE statement ends with a semicolon.
Using PROC PRINT to Apply a User-Defined Format
Now that you know how to create your own formats, let's look at the second PROC step that you can use to apply the
format. You use the FORMAT statement in a PROC PRINT step to apply your formats to variables. The following FORMAT
statement applies the user-defined format $CTRYFMT to the variable Country.
When you refer to a user-defined format in the FORMAT statement, notice that you must specify a period after the
format name, the same way that you do for a SAS format name. However, remember that you do not have to include a
period after a user-defined format name when you create it.
We could have added another FORMAT statement, but both SAS and user-defined formats can be applied in a single
FORMAT statement.
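The FORMAT statement discussed here is not reproduced in this transcript. A sketch of how it might look follows; the VAR list and the choice of SAS formats for Salary and Hire_Date are assumptions. The PROC FORMAT step is repeated so that the example runs on its own.

```sas
proc format;
   value $ctryfmt 'AU'='Australia'
                  'US'='United States'
                  other='Miscoded';
run;

proc print data=orion.sales;
   var Employee_ID Salary Country Hire_Date;   /* assumed variable list */
   /* SAS formats and a user-defined format in a single FORMAT statement. */
   /* Note the period after $ctryfmt when the format is applied.          */
   format Salary dollar10. Hire_Date mmddyy10. Country $ctryfmt.;
run;
```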
1. Copy and paste the following program into the editor. The PROC FORMAT step creates the format $CTRYFMT.
The PROC PRINT step assigns the format to the variable Country.
proc format;
value $ctryfmt 'AU'='Australia'
'US'='United States'
other='Miscoded';
run;
2. Submit the program and check the log. You can see that SAS created the format because there's a note that
states that the format $CTRYFMT has been output.
3. View the report. The Country values are displayed as Australia and as United States. None of the Country values
are displayed as Miscoded.
Business Scenario
Now let's look at a numeric example. The Orion Star HR manager has asked for a report showing employee salaries
organized into three user-defined groups, or tiers. You need to display the tiers in the report instead of the dollar
amount.
Specifying Ranges of Values
Let's create a format that applies to numeric values. Suppose we know that the Salary values in orion.sales are between
20,000 and 250,000. Let's say that Tier1 includes salaries from 20,000 to 49,999, Tier2 includes salaries from 50,000 to
99,999, and Tier3 includes salaries from 100,000 to 250,000. This PROC FORMAT step defines the TIERS format.
proc format;
value tiers 20000-49999='Tier1'
50000-99999='Tier2'
100000-250000='Tier3';
run;
In this VALUE statement, each value-range set specifies an inclusive range of values. An inclusive range includes the first
value and the last value. Think about this. After you create the TIERS format, how will the Salary value 99,999.87 appear
in your report? Oops! The value falls outside of the ranges that are specified here: between Tier2 and Tier3. The ranges
defined here assume that the values of Salary are stored as whole numbers.
To create a set of ranges that have no gaps between them, you can add the less-than (<) symbol to exclude one or both
numbers in individual ranges. First, you make sure that the last number in the range is the same as the number at the
beginning of the next range. The number at the end of the first range becomes 50,000. The number at the end of the
middle range becomes 100,000.
Let's look at all possible ways that you can use the less-than symbol in a range. We'll use the middle range in the TIERS
format as an example. The excluded values are listed next to each range.
Range               Excluded values
50000 - 100000      none
50000 <- 100000     50000
50000 -< 100000     100000
50000 <-< 100000    50000 and 100000
To exclude the first value in a range, you put the less-than symbol after the first value. To exclude the last value in a
range, you put the less-than symbol before the last value. And, to exclude both the first and last values, you put a less-
than symbol in both places.
Now, let's add the less-than symbol to the ranges in this PROC FORMAT step so that the ranges have no gaps
between them.
proc format;
value tiers 20000-<50000='Tier1'
50000-<100000='Tier2'
100000-250000='Tier3';
run;
In the VALUE statement, the ranges are now defined by using the less-than symbol. Here's a question. If SAS applies the
TIERS format to the value 100,000, how do you think it will be displayed? 100,000 is the end value of the second range.
A less-than symbol appears in front of this number, so it is excluded from the second range. This value will appear as Tier3
in the report.
proc format;
value tiers low-<50000='Tier1'
50000-<100000='Tier2'
100000-high='Tier3';
run;
The LOW keyword can be used to define ranges that apply to character values as well as to numeric values. It's
important to know that, for character values, the LOW keyword treats missing values as the lowest possible values.
However, for numeric values, LOW does not include missing values.
Consider this. If you apply the TIERS format to a variable, what does SAS display in the report for a missing value? The
TIERS format applies to numeric values, so a missing value will appear as a period in the report.
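To check how boundary and missing values are handled, you can write a few test values to the log. This is only a sketch; the test values are assumptions chosen to sit on or near the range endpoints.

```sas
proc format;
   value tiers low-<50000='Tier1'
               50000-<100000='Tier2'
               100000-high='Tier3';
run;

data _null_;
   /* A missing value, then values just below and exactly on each boundary. */
   do Salary=., 49999.87, 50000, 99999.87, 100000;
      put Salary= tiers.;
   end;
run;
```

By the rules described above, the missing value should appear as a period, the two values just below a boundary should fall in the lower tier, and 50000 and 100000 should fall in Tier2 and Tier3.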
Question
How will a value of 50000 be displayed if the TIERS format below is applied to the value?
proc format;
value tiers 20000-<50000 ='Tier1'
50000-<100000='Tier2'
100000-250000='Tier3';
run;
a. Tier1
b. Tier2
c. 50000
d. a missing value
The correct answer is b. In Tier1, the less-than symbol is before 50000, which means that value will be excluded from the
range. In Tier2, however, 50000 is the starting value of the range and will be included in the tier.
1. Copy and paste the following program into the editor. The VALUE statement uses the LOW keyword up to and
excluding 50000 for Tier1. For Tier2, you're including 50000 up to and including 100000. And for Tier3, you're
excluding 100000 and including every value above it.
proc format;
value tiers low-<50000='Tier 1'
50000-100000='Tier 2'
100000<-high='Tier 3';
run;
2. To apply the TIERS format to the variable Salary in the report, copy and paste the following PROC PRINT step
into the editor. Recall that when you refer to a user-defined format in the FORMAT statement, you must specify
a period after the format name, the same way that you do for a SAS format name.
3. Submit these steps and then check the log to ensure SAS creates the format. A note states that the format TIERS
has been output.
4. View the report. The Salary values have been replaced with the appropriate tier values.
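The PROC PRINT step for step 2 is not reproduced in this transcript. A minimal sketch, assuming that the TIERS format from step 1 has already been created in the current session and using an assumed VAR list:

```sas
proc print data=orion.sales;
   var Employee_ID Job_Title Salary;   /* assumed variable list */
   format Salary tiers.;               /* period required after the format name */
run;
```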
Question
Can you include multiple VALUE statements in a single PROC FORMAT step?
proc format;
value $ctryfmt 'AU'='Australia'
'US'='United States'
other='Miscoded';
value tiers low-<50000 ='Tier1'
50000-<100000='Tier2'
100000-high ='Tier3';
run;
a. yes
b. no
The correct answer is a. You can create multiple user-defined formats in the same PROC FORMAT step by specifying
multiple VALUE statements.
Topic Summaries
SAS stores date values as the number of days between January 1, 1960, and a specific date. To make the dates in your
report recognizable and meaningful, you must apply a SAS date format to the SAS date values.
You use the FORMAT procedure to create a format. You assign a format name that can have up to 32 characters. The
name of a character format must begin with a dollar sign, followed by a letter or underscore, followed by letters,
numbers, and underscores. Names for numeric formats must begin with a letter or underscore, followed by letters,
numbers, and underscores. A format name cannot end in a number and cannot be the name of a SAS format.
You use a VALUE statement in a PROC FORMAT step to specify the way that you want the data values to appear in your
output. You define value-range sets to specify the values to be formatted and the formatted values to display instead of
the stored value or values. The value portion of a value-range set can include an individual value, a range of values, a list
of values, or a keyword. The keyword OTHER is used to define a value to display if the stored data value does not match
any of the defined value-ranges.
PROC FORMAT;
VALUE format-name value-or-range1='formatted-value1'
value-or-range2='formatted-value2'
...;
RUN;
When you define a numeric format, it is often convenient to use numeric ranges in the value-range sets. Ranges are
inclusive by default. To exclude the endpoints, use a less-than symbol after the low end of the range or before the high
end.
The LOW and HIGH keywords are used to define a continuous range when the lowest and highest values are not known.
Remember that for character values, the LOW keyword treats missing values as the lowest possible values. However, for
numeric values, LOW does not include missing values.
Sample Programs
proc format;
value $ctryfmt 'AU'='Australia'
'US'='United States'
other='Miscoded';
run;
proc format;
value tiers low-<50000='Tier1'
50000-100000='Tier2'
100000<-high='Tier3';
run;
Objectives
In this lesson, you learn to do the following:
use a DATA step to create a SAS data set from an existing SAS data set
subset observations by using the WHERE statement
create a new variable by using the assignment statement
subset variables by using the DROP and KEEP statements
describe the compilation and execution phases of the DATA step
store labels and formats in the descriptor portion of a SAS data set
Question
What types of files can a DATA step read as input data?
The correct answer is d. A DATA step can read a SAS data set, an Excel worksheet, or a raw data file as input data.
DATA output-SAS-data-set;
SET input-SAS-data-set;
RUN;
You begin the DATA step with the DATA statement, which provides the name of the SAS data set that you're creating.
The data set can be temporary or permanent. In the following example, you're creating the temporary SAS data set
subset1 in the work library.
data work.subset1;
set orion.sales;
run;
The SET statement names orion.sales as the existing SAS data set that you want to read in as input data. Can you tell
whether this is a permanent or temporary data set? Yes, you can. The data is in the permanent library, orion.
By default, a SET statement reads all observations and all variables from the input data set sequentially. You use the
WHERE statement to subset the input data set by selecting only the observations that meet a particular condition. The
following WHERE statement selects only those observations where the variable Country has a value of AU and where
the value of Job_Title contains the substring Rep.
data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
run;
If either of these expressions is false, SAS will not process the observation, and therefore, the observation won't be
included in the output data set.
Here's a question. Can you think of a way to write this WHERE statement using different operators or symbols? You
could write it like this.
data work.subset1;
set orion.sales;
where Country eq 'AU' and
Job_Title like '%Rep%';
run;
1. Copy and paste the following PROC PRINT step into the editor to examine the data set orion.sales.
2. Submit the step and then check the log. As you can see, SAS read 165 observations from orion.sales.
3. View the report. You can see that there are nine variables. Notice that the variable Job_Title includes mostly
titles that contain the substring Rep already, but a few of them don't. The Country variable includes both AU and
US values.
4. Copy and paste the following program into the editor. The DATA step uses a WHERE statement to subset the
observations and create the new data set, work.subset1. The PROC PRINT step prints the new data set.
data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
run;
proc print data=work.subset1;
run;
5. Submit the code and check the log. The log shows that SAS read 61 observations from orion.sales. In the report,
notice that all of the observations have AU as the value for Country and have Rep somewhere in their value for
Job_Title.
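The PROC PRINT step that step 1 refers to is not reproduced in this transcript. A minimal sketch, assuming you want the default listing with no options, might look like this:

```sas
/* List every observation and every variable in the input data set */
proc print data=orion.sales;
run;
```

After you submit it, the log note for this step reports how many observations SAS read from orion.sales.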
Question
Examine the following program and then decide which statement is true.
data us;
set orion.sales;
where Country='US';
run;
a. The program reads a temporary data set and creates a permanent data set.
b. The program reads a permanent data set and creates a temporary data set.
c. The program contains a syntax error and will not execute.
d. The program will not execute because you cannot work with permanent and temporary data sets in the
same step.
The correct answer is b. The DATA statement doesn't specify a libref, so it's creating the temporary data set us. The SET
statement reads orion.sales, which is a permanent data set. There are no syntax errors in the program.
Code Challenge
Write a statement to specify the data set emp.salary as the data set to be read.
data march.payroll;
____________________ ;
run;
The correct answer is:
set emp.salary;
You specify the keyword SET and the two-level name of the data set (libref.filename) to be read.
Business Scenario
Suppose that the management staff wants to give a 10% bonus to each Australian sales representative hired before
January 1, 2000. You'll use the SAS data set orion.sales to create a new data set named work.subset1. In addition to
subsetting for AU and the substring Rep, you'll subset the data based on the employee hire date. You'll also calculate a
10% bonus for these employees based on their salary.
'ddmmm<yy>yy'D
SAS will automatically convert a date constant to a SAS date value. The following table shows some examples of SAS
date constants.
Examples
'01JAN2000'D
'31Dec11'd
'1jan04'd
'06Nov2000'D
You can use a date constant in any SAS expression, including a WHERE expression.
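If you want to see the conversion for yourself, here is a small sketch. The variable name d is only illustrative and is not part of the course data. A SAS date value is the number of days since January 1, 1960, so the date constant is stored as an ordinary number.

```sas
data _null_;
   d='01jan2000'd;   /* the date constant becomes a numeric SAS date value */
   put d=;           /* writes the unformatted number to the log */
   put d= date9.;    /* writes the same value using the DATE9. format */
run;
```

The log should show d=14610 on the first line and d=01JAN2000 on the second.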
In the following example, we want the employees whose Hire_Date value is before January 1, 2000, so we use the less-
than symbol to indicate this.
data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep' and
Hire_Date<'01jan2000'd;
run;
You write a SAS date constant as a date in quotation marks, followed immediately by the letter D. Notice that this
program already contains the WHERE expression for the Australian employees whose Job_Title value includes the
substring Rep.
variable=expression;
Notice that the assignment statement is one of the few SAS statements that doesn't begin with a keyword. Variable
names an existing or new variable, and expression is a set of instructions that produces a value.
As in the WHERE statement, an expression is a sequence of operands and operators. Operands are character constants,
numeric constants, date constants, character variables, or numeric variables. Operators are either symbols that
represent an arithmetic calculation or they are SAS functions.
Example Type
In the first row of the table, we're assigning the numeric constant 26960 to Salary. The next row shows how to assign a
character constant. Remember that it needs to be in quotes, either single or double. Next you can see how to use a date
constant. In the fourth row, we're using an arithmetic expression. We're multiplying Salary by .10 to calculate the 10%
bonus and assigning the resulting value to the new variable Bonus. In the last row, you can see how to use a SAS
function. This function extracts the month from the Hire_Date value.
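The table itself is not reproduced in this transcript, so here is a sketch of the five kinds of assignment statements that the paragraph above describes. The data set name work.assign_demo, the Gender and Hire_Date values, and the variable name BonusMonth are assumptions made for illustration only.

```sas
data work.assign_demo;             /* hypothetical data set name */
   Salary=26960;                   /* numeric constant */
   Gender='F';                     /* character constant, in quotation marks */
   Hire_Date='21JAN1995'd;         /* date constant */
   Bonus=Salary*.10;               /* arithmetic expression */
   BonusMonth=month(Hire_Date);    /* SAS function: extracts the month from Hire_Date */
run;
```

Because this step has no SET statement, it executes once and writes a single observation.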
You should be mindful when using arithmetic operators in an assignment statement. When you use more than one
arithmetic operator in an expression, SAS performs operations based on priority, as is the case normally in math
equations. You can use parentheses to clarify or alter the order of operations. Also, if any operand in the expression has
a missing value in the observation, the result is a missing value.
Operator   Operation        Priority
**         exponentiation   I
*          multiplication   II
/          division         II
+          addition         III
-          subtraction      III
Question
What is the result of the following assignment statement?
num=4+10/2;
a. .(missing)
b. 0
c. 7
d. 9
The correct answer is d. The order of operations is division and multiplication, followed by addition and subtraction. So
10 divided by 2 equals 5, and 5 plus 4 equals 9.
Question
What is the result of this assignment statement given the values of var1 and var2?
num=var1+var2/2;
var1   var2
.      10
a. . (missing)
b. 0
c. 5
d. 10
The correct answer is a. If an operand in an arithmetic expression has a missing value, the result is a missing
value.
. = . + 10/2
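You can check both questions in the log with a short sketch. The variable names num1, num2, and num3 are illustrative only.

```sas
data _null_;
   num1=4+10/2;        /* division is performed first: 4+5 */
   num2=(4+10)/2;      /* parentheses change the order: 14/2 */
   var1=.;
   var2=10;
   num3=var1+var2/2;   /* var1 is missing, so the result is missing */
   put num1= num2= num3=;
run;
```

The log should show num1=9 num2=7 num3=. along with a note that missing values were generated.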
1. Copy and paste the following program into the editor. The DATA step subsets the data set orion.sales by the
Australian sales representatives based on their hire date. Using an assignment statement, the program creates
the new variable Bonus. The PROC PRINT step displays the report and includes a FORMAT statement to format
Hire_Date values with the DATE9. format.
data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep' and
Hire_Date<'01jan2000'd;
Bonus=Salary*.10;
run;
proc print data=work.subset1;
format Hire_Date date9.;
run;
2. Submit the program and check the log. The log shows that SAS read 29 observations from orion.sales, and 29
observations and 10 variables were output to work.subset1. Originally, orion.sales contained nine variables, so
it looks like SAS created the new variable.
3. View the report. The new variable, Bonus, is displayed. In the first observation, notice that the calculation was
performed correctly: 26600 multiplied by .10 equals 2660. Also, notice that the variable Hire_Date does not
include any dates on or after January 1, 2000.
You use the DROP statement to specify the variables to exclude from the output data set. The DROP statement begins
with the keyword DROP, followed by a space-separated list of the variables that you want to drop from the output data
set.
data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
drop Employee_ID Gender Country
Birth_Date;
run;
You use the KEEP statement to specify a list of variables to include in the output data set.
data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
keep First_Name Last_Name
Salary Job_Title Hire_Date
Bonus;
run;
How can you decide which statement to use? You might want to use the KEEP statement instead of the DROP statement
if the number of variables to keep is significantly smaller than the number to drop. Also, if you use a KEEP statement,
you must include every variable to be written, including any new variables. One more note: the DROP and KEEP
statements have no effect on the input data set.
1. Copy and paste the following program into the editor. This DATA step contains the DROP statement to exclude
the variables Employee_ID, Gender, Country, and Birth_Date from the output data set work.subset1.
data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
drop Employee_ID Gender Country Birth_Date;
run;
3. View the report. You can see that the report displays the six variables, including the new variable Bonus.
4. Copy and paste the following program into the editor. This DATA step contains the KEEP statement and lists the
variables to include in the output data set.
data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
keep First_Name Last_Name Salary Job_Title Hire_Date Bonus;
run;
5. Submit the program and check the log. Again, SAS read 61 observations, and work.subset1 contains six
variables. The report is exactly the same. You can use the DROP or KEEP statements to produce the same output
data set.
Question
If you submit the DATA step below, which variables appear in the work.mysubset data set? Select all that apply.
data work.mysubset;
set mylib.salesforce;
drop Gender Salary;
run;
mylib.salesforce
Emp_ID   Name            Gender   Salary   Job_Title
120102   Zhou, Tom       M        108255   Sales Manager
120103   Dawes, Wilson   M        87975    Sales Manager
a. Emp_ID
b. Name
c. Gender
d. Salary
e. Job_Title
The correct answer is a, b, and e. The DATA step omits two variables from the output data set: Gender and Salary. The
remaining three variables are included.
When the compilation phase is complete, SAS creates the descriptor portion of the new data set. Remember, the
descriptor portion contains information such as the data set name and the names of all the data set's variables.
Compilation Phase
Let's walk through the compilation phase of our previous program.
data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
drop Employee_ID Gender Country
Birth_Date;
run;
During the compilation phase, SAS creates the program data vector, or PDV. The PDV is an area of memory where SAS
builds one observation. The PDV contains two automatic variables that can be used for processing, but that are not
written to the data set as part of an observation. _N_ is the iteration number of the DATA step, and _ERROR_ signals the
occurrence of an error that is caused by the data during execution. The default value of _ERROR_ is 0, which means
there is no error. When one or more errors occur, the value is set to 1.
PDV
_N_ _ERROR_
SAS scans each statement in the DATA step, looking for syntax errors such as missing or misspelled keywords, invalid
variable names, missing or invalid punctuation, or invalid options.
In our example code above, SAS scans the DATA step. When SAS compiles the SET statement, a slot is added to the PDV
for each variable in the input data set: Employee_ID, First_Name, Last_Name, Gender, Salary, Job_Title, Country,
Birth_Date, and Hire_Date. The descriptor portion of the input SAS data set, orion.sales, supplies the variable names, as
well as attributes such as type and length. Then SAS adds the new variable Bonus to the PDV based on the assignment
statement.
PDV
SAS determines that Bonus is a numeric variable because the expression on the right, Salary*.10, is an arithmetic
expression that produces a numeric value. SAS then flags the variables to be dropped from the output. In this case, the
variables Employee_ID, Gender, Country, and Birth_Date are marked to be dropped.
At the bottom of the DATA step, the compilation phase is complete, and the descriptor portion of the new SAS data set
work.subset1 is created.
Employee_ID  First_Name  Last_Name  Gender  Salary  Job_Title  Country  Birth_Date  Hire_Date  Bonus
N8           $8          $8         $8      N8      $8         $8       N8          N8         N8
Execution Phase
If the DATA step compiles successfully, then the execution phase begins. During the execution phase, the DATA step
reads and processes the observations from the input data set, and creates observations in the data portion of the output
data set. By default, the DATA step executes once for each observation in the input data set.
At the start of the execution phase, SAS initializes the PDV to missing.
PDV
. . . . .
Remember that missing character values are displayed as blanks, and missing numeric values are displayed as a period.
In our example program, when the SET statement executes, SAS reads the first observation from orion.sales into the
PDV, providing a value for each variable.
PDV
The value of Bonus is missing because Bonus doesn't come from the input data set. It's a new variable being created in
this DATA step. When SAS executes the assignment statement, it assigns a value to Bonus. At the bottom of the DATA
step, SAS uses the values in the PDV to write the first observation to the new SAS data set. SAS doesn't write the
variables in the DROP statement to work.subset1.
work.subset1
Then control returns to the top of the DATA step for the next iteration. This is referred to as implicit output and implicit
return. SAS retains the values of variables that were read from the input data set in the PDV. These values will be
overwritten when the next observation is read into the PDV. SAS reinitializes the value of the new variable, Bonus, to
missing.
As the SET statement executes on the second iteration of the DATA step, SAS reads the second observation into the PDV.
It overwrites previous values in the PDV. SAS calculates Bonus for this observation, and then uses the values in the PDV
to write the second observation to the new data set.
work.subset1
Then control returns to the top of the DATA step for the next iteration. This process continues until all of the
observations are read.
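One way to watch the PDV during execution is to write its contents to the log on each iteration. The following is only a sketch: the data set name work.pdv_demo is hypothetical, the WHERE statement is left out for brevity, and the OBS= data set option is added so that the log stays short.

```sas
data work.pdv_demo;            /* hypothetical data set name */
   set orion.sales(obs=3);     /* OBS=3 limits the input to three observations for this sketch */
   Bonus=Salary*.10;
   put _all_;                  /* writes every PDV variable, including _N_ and _ERROR_, to the log */
   drop Employee_ID Gender Country Birth_Date;
run;
```

Notice in the log that the dropped variables still appear in the PDV. The DROP statement only affects what SAS writes to the output data set.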
Question
Type the letter of the word or phrase on the right that completes the statements on the left.
When you submit a DATA step, SAS processes the step in the __________________ phase first.
a. descriptor portion
The correct answers from top to bottom are b, c, a, d. The compilation phase precedes the execution phase. SAS creates
the descriptor portion of the data set during the compilation phase and the data portion during the execution phase.
Business Scenario
Suppose that you want to create a new SAS data set that contains only the Australian employees whose Bonus is at least
$3000. In this situation, you want to subset observations based on the variable Country and the variable Bonus. Country
is a part of the orion.sales input data set. Recall that the variable Bonus does not exist in our input data set. We created
it with an assignment statement in the DATA step.
You could consider using a WHERE statement, but the WHERE statement selects observations when they are read from
the input data set to the PDV. Let's find out what happens if you use a variable that does not exist in the input data set in
a WHERE statement.
Activity
Copy and paste this program, which includes a WHERE statement to subset on the Bonus amount, into the editor and
submit it.
data work.subset1;
set orion.sales;
Bonus=Salary*.10;
where Country='AU' and
Bonus>=3000;
run;
Is the output data set created successfully?
a. yes
b. no
The correct answer is b. No, the output data set is not created successfully. The log contains an error message, and SAS
stopped processing the step. Because Bonus is a new variable being created in this DATA step and is not in orion.sales, it
cannot be used in a WHERE statement.
The Subsetting IF Statement
To subset observations based on the value of a variable you create, you can use the subsetting IF statement. The syntax
for the subsetting IF statement is the keyword IF and an expression that you want to evaluate.
IF expression;
Remember that an expression is a sequence of operands and operators that form a set of instructions. You can specify
multiple expressions in a subsetting IF statement.
if Salary>5000;
if Hire_Date='15APR2008'd;
Although IF expressions are similar to WHERE expressions, you cannot use special WHERE operators in IF expressions.
The subsetting IF statement causes the DATA step to continue processing only those observations that meet the
condition of the expression that you specify. That is, if the expression is true for the observation, SAS continues to
execute statements in the DATA step and writes the current observation to the output data set. The resulting SAS data
set contains a subset of the original SAS data set. If the expression is false, SAS does not execute the remaining
statements in the DATA step for that observation and does not write the observation to the output data set. SAS
immediately returns to the beginning of the DATA step for the next iteration.
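Because special WHERE operators such as CONTAINS and LIKE are not available in IF expressions, a SAS function can take their place. Here is a sketch that uses the INDEX function, which returns the position of a substring or 0 if the substring is not found. The data set name work.subset_if is hypothetical.

```sas
data work.subset_if;                 /* hypothetical data set name */
   set orion.sales;
   Bonus=Salary*.10;
   /* index(...)>0 stands in for: Job_Title contains 'Rep' */
   if index(Job_Title,'Rep')>0 and Bonus>=3000;
run;
```

Only observations for which both conditions are true reach the bottom of the DATA step and are written to the output data set.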
Question
When you use the subsetting IF statement, how are observations excluded?
a. If the expression is true, SAS excludes the observation from the input data set.
b. If the expression is false, SAS excludes the observation from the output data set.
c. If the expression is false, SAS excludes the observation from the PDV.
d. If the expression is true, SAS excludes the observation from the PDV.
The correct answer is b. When the expression is false, SAS excludes the observation from the output data set and
returns to the top of the DATA step to process the next observation.
1. Copy and paste the following program into the editor. You use orion.sales to create the temporary data set
auemps. The WHERE statement selects only the observations where the value of Country is equal to AU. The
assignment statement creates the variable Bonus by calculating 10% of the employees' salaries. The subsetting
IF statement specifies that you only want the observations where the value of Bonus is equal to or greater than
3000. The PROC PRINT step creates the report.
data work.auemps;
set orion.sales;
where Country='AU';
Bonus=Salary*.10;
if Bonus>=3000;
run;
proc print data=work.auemps;
run;
2. Submit the code and check the log. You can see that of the 165 observations in orion.sales, 63 were read into
the PDV for processing, and only 12 were written to work.auemps.
3. View the report. Notice that only AU values for Country and only Bonus values equal to or greater than 3000 are
displayed.
Activity
Copy and paste this program into the editor. This program is a variation of the program in the previous demonstration,
with both conditions combined into a single subsetting IF. Submit the program and review the log and results.
data work.auemps;
set orion.sales;
Bonus=Salary*.10;
if Country='AU' and Bonus>=3000;
run;
Are the results the same as what you saw in the previous demonstration?
a. yes
b. no
The correct answer is a. The results are the same, but the log shows that the processing isn't as efficient. SAS reads all
165 observations from orion.sales rather than the 63 observations in the previous program. You should subset as early
as possible in your program for more efficient processing.
If you're subsetting observations in a DATA step, you can always use a subsetting IF statement. That part is easy. The
tricky part is knowing when you can use a WHERE statement in the DATA step. You only have to remember one rule:
when you use a WHERE statement in the DATA step, the WHERE expression must reference only variables from the
input data set.
If you're trying to subset based on a variable that SAS is reading from a single data set using the SET statement, you can
use a WHERE statement. If the step reads more than one data set and the variable is not in all of them, you can't use a
WHERE statement.
Why can't you use a WHERE statement based on a variable that's created with an assignment statement? A variable
that's created using an assignment statement doesn't exist in the input data set. Remember, the WHERE statement
subsets data as SAS reads the data into the PDV.
Question
Select the situation(s) in which you can use the WHERE statement to subset observations. Select all that apply.
a. in a PROC step
b. in a DATA step, when the variable in the condition is created
c. in a DATA step, when the variable in the condition is in the input data set
The correct answer is a and c. You can use a WHERE statement to subset observations in situations a and c. A subsetting
IF statement can be used in situations b and c.
Business Scenario
Now that you know how to subset observations and variables to create a customized SAS data set, suppose that you
want to create work.subset1 so that it includes permanent labels and formats. In other words, you want to permanently
associate labels and formats with the variables and store them in the descriptor portion of the data set.
LABEL variable='label'
variable='label'...;
When you use the LABEL statement in a PROC step, the labels are temporary: they apply only to that step. When you use
the LABEL statement in a DATA step, SAS permanently associates the labels with the variables. In the following DATA
step, we're assigning the label Sales Title to Job_Title, and the label Date Hired to Hire_Date.
data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
label Job_Title='Sales Title'
Hire_Date='Date Hired';
drop Employee_ID Gender Country
Birth_Date;
run;
SAS will add these labels to the descriptor portion of the data set. Remember that the descriptor portion of a SAS data
set stores variable attributes including the name, type, and length of the variable.
Question
If you submit this program, which of the following column headings will display for Job_Title in the resulting report?
data work.us;
set orion.sales;
where Country='US';
Bonus=Salary*.10;
label Job_Title='Sales Title';
drop Employee_ID Gender Country Birth_Date;
run;
proc print data=work.us label;
label Job_Title='Title';
run;
a. Sales Title
b. Job_Title
c. Title
The correct answer is c. The column heading will be Title, the label specified in the PROC PRINT step. Labels and formats
that you specify in PROC steps override the permanent labels and formats for that step only. The permanent labels and
formats are not changed.
In this demonstration, you add permanent labels to the descriptor portion of a SAS data set and then print the
labels in a report.
1. Copy and paste the following program into the editor. The DATA step includes the LABEL statement, which
specifies labels for the variables Job_Title and Hire_Date. The PROC CONTENTS step displays the descriptor
portion of work.subset1.
data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
label Job_Title='Sales Title'
Hire_Date='Date Hired';
drop Employee_ID Gender Country Birth_Date;
run;
proc contents data=work.subset1;
run;
2. Submit the program and check the log. The log shows that the program ran successfully.
3. Examine the results. The PROC CONTENTS report shows that the labels are now associated with the variables.
Consider this: If you write a PROC PRINT step to display work.subset1, will these new labels appear in the
report? No. The report will only include these descriptive labels if you add the LABEL option to the PROC PRINT
step. So even though the labels are permanently associated with the variables, you have the choice of how to
display the variables in your output.
4. Copy and paste the following PROC PRINT step, which includes the LABEL option, into the editor and submit it.
5. View the results. The report shows that Sales Title and Date Hired have replaced their variable names as
headings.
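The PROC PRINT step that step 4 refers to is not reproduced in this transcript. A minimal sketch, assuming no options other than LABEL, might look like this:

```sas
/* The LABEL option tells PROC PRINT to use variable labels as column headings */
proc print data=work.subset1 label;
run;
```

Without the LABEL option, the same step displays the variable names Job_Title and Hire_Date as headings, even though the labels are stored in the descriptor portion.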
Using the FORMAT Statement in a DATA Step
As with the LABEL statement, you can use the FORMAT statement in a DATA step to permanently associate formats with
variables.
The format information is also stored in the descriptor portion of the data set.
In the data set work.subset1, you can apply SAS formats to the variables Salary, Hire_Date, and Bonus to permanently
format the values so that they are easier to understand.
Partial work.subset1
1. Copy and paste the following program into the editor. The FORMAT statement applies the format DOLLAR12. to
both the Salary and Bonus variables, and the format DDMMYY10. to the Hire_Date variable. Notice that you
format the variable name and not the label name. The PROC CONTENTS step displays the descriptor portion of
work.subset1, and the PROC PRINT step prints the data set.
data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
label Job_Title='Sales Title'
Hire_Date='Date Hired';
format Salary Bonus dollar12.
Hire_Date ddmmyy10.;
drop Employee_ID Gender Country Birth_Date;
run;
2. Submit this program and then check the log. The log shows that SAS ran successfully.
3. Examine the results. PROC CONTENTS shows that the formats were associated with our variables. Notice that
the labels are also still associated. In the report, the Salary and Bonus variable values now have dollar signs and
commas, and the Date Hired values are much easier to read and understand.
Summary of Lesson 6: Reading SAS Data Sets
Topic Summaries
You use the WHERE statement to subset the input data set by selecting only the observations that meet a particular
condition. To subset based on a SAS date value, you can use a SAS date constant in the WHERE expression. SAS
automatically converts a date constant to a SAS date value.
DATA output-SAS-data-set;
SET input-SAS-data-set;
WHERE where-expression;
RUN;
You use an assignment statement to create a new variable. The assignment statement evaluates an expression and
assigns the resulting value to a new or existing variable. The expression is a sequence of operands and operators. If the
expression includes arithmetic operators, SAS performs the numeric operations based on priority, as in math equations.
You can use parentheses to clarify or alter the order of operations.
variable=expression;
DROP variable-list;
KEEP variable-list;
SAS processes the DATA step in two phases: the compilation phase and the execution phase.
You can subset the original data set with a WHERE statement for variables that are defined in the input data set, and a
subsetting IF statement for new variables that are created in the DATA step. Remember that, although IF expressions are
similar to WHERE expressions, you cannot use special WHERE operators in IF expressions.
IF expression;
To subset observations in a PROC step, you must use a WHERE statement. You cannot use a subsetting IF statement in a
PROC step. To subset observations in a DATA step, you can always use a subsetting IF statement. However, a WHERE
statement can make your DATA step more efficient because it subsets on input.
LABEL variable='label'
variable='label'
... ;
Sample Programs
data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
run;
data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep' and
Hire_Date<'01jan2000'd;
Bonus=Salary*.10;
run;
data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
drop Employee_ID Gender Country Birth_Date;
run;
data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
keep First_Name Last_Name Salary Job_Title Hire_Date Bonus;
run;
data work.auemps;
set orion.sales;
where Country='AU';
Bonus=Salary*.10;
if Bonus>=3000;
run;
data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
label Job_Title='Sales Title'
Hire_Date='Date Hired';
drop Employee_ID Gender Country Birth_Date;
run;
data work.subset1;
set orion.sales;
where Country='AU' and
Job_Title contains 'Rep';
Bonus=Salary*.10;
label Job_Title='Sales Title'
Hire_Date='Date Hired';
format Salary Bonus dollar12.
Hire_Date ddmmyy10.;
drop Employee_ID Gender Country Birth_Date;
run;
Objectives
assign a libref to a Microsoft Excel workbook using the SAS/ACCESS LIBNAME statement
access an Excel worksheet as though it is a SAS data set using a SAS two-level name
use the DATA step to create a SAS data set that contains a subset of worksheet data
assign a libref to an Oracle database using the SAS/ACCESS LIBNAME statement
access an Oracle table using a SAS two-level name
create a SAS data set that contains a subset of an Oracle table
Fortunately, SAS provides several ways for you to access this data. You can use SAS/ACCESS Interface to PC Files to read
the worksheets within the sales.xls workbook as if they are SAS data sets. Alternatively, you can use the IMPORT
procedure to read the worksheet and write the data to a SAS data set. To learn about the IMPORT procedure, visit the
online documentation at support.sas.com.
1. Open the sales.xls workbook. Either navigate to the file via Microsoft Excel, or open the file from the location
where you stored your practice files for this course.
2. Notice that the data for each country is on a separate tab. You’ll need each country in its own SAS data set. Look
at the date fields in columns H and I. These fields each have a different Excel date format applied. You might
want to address this, but don't worry about it right now. Everything else about this data looks pretty standard.
Exploring SAS/ACCESS
You can use the SAS/ACCESS LIBNAME statement to assign a libref to an Excel workbook. Then, SAS treats each
worksheet in the workbook as though it is a SAS data set. You have access! The SAS/ACCESS interface provides data
connectivity and integration between SAS and third-party data sources, including Microsoft Excel workbooks and various
databases. Using SAS/ACCESS interfaces, which are each licensed separately, your SAS programs can read data from and
write data to a third-party data source in the same way as reading from or writing to a SAS library.
SAS/ACCESS uses data access engines to read, write, and update data, regardless of the data source or platform. One of
the requirements for accessing relational databases is that your SAS installation must include the appropriate
SAS/ACCESS interfaces for the types of files you want to access. Excel and Oracle are examples of engines that are
available in SAS/ACCESS.
Some details about accessing database management systems are specific to your operating environment and to your
SAS installation, so this lesson does not contain practices for accessing relational database files. Click the Information
button in the course interface to learn how to determine which SAS products are licensed and installed in your
environment.
After the keyword LIBNAME, you specify a libref. The libref must follow the same rules as for any other SAS libref. You
then must specify the SAS/ACCESS engine name, and then the physical file name of the Excel workbook in quotes,
including the path, filename, and extension.
Let’s explore how you determine the proper SAS/ACCESS engine to specify in our scenario. Both SAS and Microsoft
Office offer 32-bit and 64-bit versions. Different SAS/ACCESS engines are needed based on matching and non-matching
number of bits, also known as bitness. If the bitness of both products is the same, the default SAS/ACCESS engine, excel,
can be used. If the bitness of both products is not the same, you must use the PC Files Server engine, pcfiles.
In our scenario, we’re using 64-bit SAS and 32-bit Microsoft Office, so we’ll use the pcfiles engine in the SAS/ACCESS
LIBNAME statement to read the sales.xls workbook. This engine is supported by SAS/ACCESS Interface to PC Files.
After the keyword LIBNAME, we’ll specify the libref orionx, followed by the engine name pcfiles. With this engine, we
must also specify path= in front of the workbook name. Then we enter the workbook name in quotation marks. Click the
Information button in the course interface to learn more about the SAS/ACCESS engines and the appropriate
SAS/ACCESS LIBNAME statement syntax.
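The LIBNAME statement described above might look like the following sketch. It assumes that the macro variable path from the course setup points to the folder that contains sales.xls, and that SAS/ACCESS Interface to PC Files and a PC Files Server are available in your environment. Your own path and engine might differ.

```sas
/* libref orionx, engine pcfiles, workbook identified with PATH= */
/* &path is assumed to hold the practice-file location from the course setup */
libname orionx pcfiles path="&path/sales.xls";
```

If the bitness of SAS and Microsoft Office match, the default excel engine could be used instead. In that case the statement would name the excel engine and give the workbook name without path=.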
1. Copy and paste the following SAS/ACCESS LIBNAME statement into the editor and submit it to see how the
resulting libref enables you to access the Excel worksheets.
2. The log shows that the orionx libref was successfully assigned. This means that you have access to the data. You
can refer to orionx as if it is a SAS library, and you can access each of the worksheets as if they are data sets in
the library.
3. Navigate to the explorer window. Double-click Libraries to see that the library orionx is active. Notice that the
icon looks a little different. It has a globe on the folder, indicating that the data is outside of SAS.
4. Double-click the icon to see its contents. Note: In SAS Enterprise Guide, you can drill into the library using the
Server List window.
The worksheets in the workbook appear with a dollar sign at the end of the name. If the worksheet has named
ranges, the name will also appear, but will not have a dollar sign.
5. Copy and paste the following PROC CONTENTS step into the editor to explore the library. Remember that you
can use the CONTENTS procedure to list the contents of a SAS library. This step specifies orionx._all_ to list the
worksheets and their descriptor portions. Submit this step.
proc contents data=orionx._all_;
run;
6. Examine the results. In the second table of the PROC CONTENTS output, notice that some member names end in
dollar signs and others do not. Again, the members whose names end with a dollar sign are the worksheets.
The ones that do not end with a dollar sign are named ranges. You'll access the worksheet, Australia$, as if it is
a SAS data set. But remember, SAS data set names cannot include special characters. You learn more about
these dollar sign references and how to deal with them in your code later in this lesson.
7. Look at the first table for the Australia$ worksheet. Notice that the Data Set Name has the two-level worksheet
name and that the second part of the name includes a dollar sign, is enclosed in quotation marks, and is followed
by the letter n. This is a SAS name literal. You can also see that the Member Type is DATA, and the Engine is
PCFILES.
8. Look at the Variables and Attributes table for the Australia$ worksheet. The original column headings contain
embedded spaces. In the SAS windowing environment, embedded spaces in column headings are replaced with
underscores to create valid SAS variable names. In SAS Enterprise Guide, the column headings are used as
variable names without modification because special characters are allowed in variable names. You can set the
VALIDVARNAME=V7 option in SAS Enterprise Guide to cause it to behave the same as in the SAS windowing
environment. The column headings are stored as labels in both environments. Notice that Birth_Date and
Hire_Date are both listed as numeric variables with DATE9. formats because they were both formatted as dates
in the original file.
A SAS name literal is a name token that is expressed as a string within quotation marks, followed by the upper- or
lowercase letter n. It enables special characters or blanks in data set names.
libref.'worksheetname$'n
After the libref orionx, you enclose the name of the Excel worksheet, Australia$, in quotation marks followed by the
letter n to print the contents of the Australia$ worksheet. Notice that we’re using the two-level data set name.
Question
Which PROC PRINT step displays the UnitedStates worksheet?
a.
proc print data=orionx.'United States';
run;
b.
proc print data=orionx.'UnitedStates$';
run;
c.
proc print data=orionx.'UnitedStates'n;
run;
d.
proc print data=orionx.'UnitedStates$'n;
run;
The correct answer is d. You use a SAS name literal to refer to an Excel worksheet in SAS code. You enclose the name of
the worksheet, including the dollar sign, in quotation marks followed by the letter n.
In this demonstration, you create a report of the Orion Star sales employees from Australia from an Excel worksheet.
1. Copy and paste the following step into the editor and submit it.
proc print data=orionx.'Australia$'n;
run;
2. Check the log and ensure that the code ran without errors or warnings.
3. Examine the report. You are using the Excel worksheet as if it is a SAS data set, and as you can see, the report
looks just like all the other SAS data set reports you've created. Now consider this. Do you think that you can
select only a subset of the worksheet data to print? After all, SAS is treating the worksheet just like a SAS data
set.
4. Copy and paste the following step into the editor. You're adding a WHERE statement to select only those
employees who have IV in their job titles, and a VAR statement to include only the variables Employee_ID,
Last_Name, Job_Title, and Salary in the report. To suppress the Obs column, you're adding the NOOBS option to
the PROC PRINT statement.
proc print data=orionx.'Australia$'n noobs;
   where Job_Title ? 'IV';
   var Employee_ID Last_Name Job_Title Salary;
run;
5. Submit this step and examine the report. The report looks great. You successfully created a subset of the
worksheet data.
Disassociating a Libref
It’s important to disassociate a libref when you are finished using it. If SAS has a libref assigned to an Excel workbook,
the workbook cannot be opened in Excel. SAS puts a lock on the Excel file when the libref is assigned. To disassociate a
libref, you use a LIBNAME statement and specify the libref and the CLEAR option.
SAS disconnects from the data source and closes any resources that are associated with the connection.
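For the orionx libref used in this lesson, the statement described above would look like this:

```sas
/* Release the orionx libref so that the workbook can be opened in Excel again */
libname orionx clear;
```

After this statement runs, the log should report that the libref has been deassigned, and any later reference to orionx fails until the libref is assigned again.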
Business Scenario
Now that you know how to access the sales workbook in SAS, you’ve been asked to perform another task. You need to
create a SAS data set using the workbook as input. The data set work.subset should include only those employees from
Australia who have the word Rep in their job title, a Bonus variable that is 10% of Salary, and permanent labels and
formats.
In this demonstration, you use a DATA step to read input data from a Microsoft Excel worksheet and create a report of
the Orion Star sales employees from Australia.
1. Copy and paste the following program into the editor. You use a SAS/ACCESS LIBNAME statement to access the
worksheet. This is the same one you saw earlier that uses the PC Files engine.
Next, in a DATA step, you specify the data set work.subset as the output data set. In the SET statement, you
specify the Australia$ worksheet as the input data. To include the employees with the word Rep in their job
title, you use a WHERE statement. In an assignment statement, you create the new variable Bonus and set it
equal to 10% of Salary. You define labels using a LABEL statement. You want to display Job_Title as Sales Title,
and Hire_Date as Date Hired. And finally, you format the variable Salary with the COMMA10. format, the
variable Hire_Date with the MMDDYY10. format, and the variable Bonus with the COMMA8.2 format.
Aside from using the SAS name literal in the SET statement, this DATA step looks the same as if you were using a
SAS data set as your input data.
The program includes a PROC CONTENTS step so you can verify that SAS stored the formats and labels in the
descriptor portion of work.subset, and a PROC PRINT step to create a report of the data set. Do you recall what
you need to add to the PROC PRINT step to tell SAS to print the labels specified? You add the LABEL option.
data work.subset;
set orionx.'Australia$'n;
where Job_Title contains 'Rep';
Bonus=Salary*.10;
label Job_Title='Sales Title'
Hire_Date='Date Hired';
format Salary comma10. Hire_Date mmddyy10.
Bonus comma8.2;
run;
proc contents data=work.subset;
run;
proc print data=work.subset label;
run;
2. Submit the code and check the log. SAS read 61 observations from the data set. There were 63 rows in the
spreadsheet, so you can see that the subsetting was successful. And no warnings or errors are present.
3. Examine the results. From the PROC CONTENTS results, you can see the labels you defined, as well as the labels
that SAS added automatically. Remember that the labels are actually the column headings in the Excel
spreadsheet, but Hire_Date and Job_Title now reflect the labels you specified. The formats for the numeric
variables Salary, Hire_Date, and Bonus have been stored correctly. In the PROC PRINT results, you can see that
SAS created a subset of the data based on your specifications. All of the job title values contain the word Rep.
SAS calculated Bonus as 10% of Salary values, and the labels and formats have been applied.
4. What do you need to do now that you've finished using the orionx libref? Right, you need to disassociate it.
Copy and paste the following LIBNAME statement into the editor and submit it to disassociate the libref.
5. The log shows that the libref orionx has been deassigned.
Using SAS/ACCESS
Just as with the Excel input data, you'll use the SAS/ACCESS LIBNAME statement to assign a libref to the database.
We’ll use the LIBNAME statement supported by the SAS/ACCESS interface to Oracle.
You begin with the keyword LIBNAME and then you specify a libref. In our program, we'll assign the libref oralib.
Following the libref, you specify the engine name, such as Oracle or DB2. This is the SAS/ACCESS component that reads
and writes to your DBMS, and it's required. We'll specify the oracle engine.
Now we need to specify additional connection options. These options provide connection information and control how
SAS manages the timing and concurrence of the connection to the DBMS. These arguments are different for each
database. USER= specifies an optional Oracle user name. In our code, we'll specify user=edu101. USER= must be used
with PASSWORD=. PASSWORD=, or PW=, specifies an optional Oracle password that is associated with the Oracle user
name. We'll specify pw=edu101.
PATH= specifies the Oracle driver, node, and database. SAS/ACCESS uses the same Oracle path designation that you use
to connect to Oracle directly. See your database administrator to determine the databases that have been set up in your
operating environment, and to determine the default values if you do not specify a database. We'll specify
path=dbmssrv.
Next, you specify the SCHEMA= option in the SAS/ACCESS LIBNAME statement to connect to the Oracle schema in which
the database resides. SCHEMA= enables you to read database objects, such as tables and views, in the specified schema.
If this option is omitted, you will connect to the default schema for your DBMS. We'll specify schema=educ.
When you submit this LIBNAME statement, SAS treats the Oracle database like a SAS library, and any table in the
database can be referenced using a SAS two-level name.
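Putting the pieces described above together gives a statement along the following lines. The option values are the ones named in this lesson; they work only where that Oracle server, user, and schema actually exist. The PROC CONTENTS step is an added illustration, not part of the original demonstration.

```sas
/* libref  engine  connection options (values taken from the lesson text) */
libname oralib oracle user=edu101 pw=edu101
        path=dbmssrv schema=educ;

/* Illustration only: list the tables that the libref exposes.            */
/* NODS suppresses the descriptor details for each individual member.     */
proc contents data=oralib._all_ nods;
run;

/* Release the connection when you are finished */
libname oralib clear;
```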
Topic Summaries
When you browse the library, you might see worksheets and named ranges. Worksheet names end with a dollar sign,
and named ranges do not. Because the dollar sign is a special character, you must use a SAS name literal when you refer
to a worksheet in a program.
libref.'worksheetname$'n
When you assign a libref to an Excel workbook in SAS, the workbook cannot be opened in Excel. To disassociate a libref,
you submit a LIBNAME statement specifying the libref and the CLEAR option. SAS disconnects from the data source and
closes any resources that are associated with the connection.
Sample Programs
data work.subset;
set orionx.'Australia$'n;
where Job_Title contains 'Rep';
Bonus=Salary*.10;
label Job_Title='Sales Title'
Hire_Date='Date Hired';
format Salary comma10. Hire_Date mmddyy10.
Bonus comma8.2;
run;
Objectives
Fields in a delimited raw data file are identified by their sequential order and the data values are separated by spaces or
other special characters. That is, the data is not arranged in columns. A given field might begin in a different column in
every record and have varying widths, but will be in the same relative position.
The delimited raw data file shown below contains information about the Orion Star sales employees. Notice that
there are no column headings. Typically, you'll have external documentation called a record layout to explain
the values. For example, suppose you see the number 87975. Is it an ID, a salary, a phone extension? Without a
record layout, it is sometimes impossible to know.
Partial sales.csv
1---5---10---15---20---25---30---35---40---45---50---55---60---65---70---75
120102,Tom,Zhou,M,108255,Sales Manager,AU,11AUG1973,06/01/1993
120103,Wilson,Dawes,M,87975,Sales Manager,AU,22JAN1953,01/01/1978
120121,Irenie,Elvish,F,26600,Sales Rep. II,AU,01AUG1948,01/01/1978
120122,Christina,Ngan,F,27475,Sales Rep. II,AU,27JUL1958,07/01/1982
Fields in a fixed-column raw data file are identified by their starting and ending column. A given field will begin in the
same column and have the same width in every record. In this course, you’ll learn to work with delimited raw data files.
1---5---10---15---20---25---30---35---40---45---50---55---60---65---
120102Tom Zhou Sales Manager 108225AU
120103Wilson Dawes Sales Manager 87975AU
120121Irenie Elvish Sales Rep. II 26600AU
120122Christina Ngan Sales Rep. II 27475AU
Question
Which of the following statements correctly describes a delimited raw data file? Select all that apply.
a. It is external to SAS.
b. It is not software-specific.
c. The values are arranged in columns.
d. The values are separated by spaces or other characters.
e. The values are labeled with field names.
The correct answer is a, b, and d. A delimited raw data file is an external text file in which the values are separated by
spaces or other special characters. The file is not software-specific.
List input can read standard or nonstandard data, and the values must be separated by a delimiter. If your raw data file
is arranged in columns rather than being delimited, you use either column input or formatted input. Column input reads
standard data arranged in columns. Formatted input reads standard and nonstandard data arranged in columns.
In this course, you’ll use list input to work with delimited raw data files that contain both standard and nonstandard
data.
Standard and Nonstandard Data
So what is the difference between standard and nonstandard data? Standard data is data that SAS can read without any
special instructions. Nonstandard data includes values like dates or numeric values that include special characters like
dollar signs. Here are a few examples of standard and nonstandard numeric data.
Standard     Nonstandard
58           (23)
67.23        5,823
5.67E5       $67.23
-23          01/12/2010
00.99        12May2009
1.2E-2
DATA output-SAS-data-set;
SET input-SAS-data-set;
RUN;
You can also use the DATA step to create a SAS data set from a raw data file, but the syntax is slightly different. Instead
of the SET statement, you use the INFILE statement and the INPUT statement.
DATA output-SAS-data-set;
INFILE 'raw-data-file-name';
INPUT specifications;
RUN;
The INFILE statement identifies the physical name and location of the raw data file. The INPUT statement describes the
arrangement of values in the raw data file and assigns input values to the corresponding SAS variables.
data work.sales1;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name $
Last_Name $ Gender $ Salary
Job_Title $ Country $;
run;
The INFILE statement identifies the location of the external text file that contains the input data. You specify the full
path and filename, including the extension, for the raw data file sales.csv. As you can see, we're using the &path macro
variable reference, which makes our program more flexible. Be sure to use double quotation marks when referencing a
macro variable within a quoted string.
Here's a question. In our raw data file, what separates the data values?
Partial sales.csv
1---5---10---15---20---25---30---35---40---45---50---55---60---65---70---75
120102,Tom,Zhou,M,108255,Sales Manager,AU,11AUG1973,06/01/1993
120103,Wilson,Dawes,M,87975,Sales Manager,AU,22JAN1953,01/01/1978
120121,Irenie,Elvish,F,26600,Sales Rep. II,AU,01AUG1948,01/01/1978
120122,Christina,Ngan,F,27475,Sales Rep. II,AU,27JUL1958,07/01/1982
Commas separate the values in sales.csv. SAS considers a space, or blank, to be the default delimiter between values in
a delimited raw data file. If your file uses any other character to separate data values, you need to indicate what the
delimiter is in the INFILE statement. You use the DLM= option to specify an alternate delimiter.
DATA output-SAS-data-set;
INFILE 'raw-data-file-name' DLM='delimiter';
INPUT specifications;
RUN;
Why do you think this INPUT statement includes only seven variables even though the raw data file contains nine values
in each record? It's because the last two values in each record are date values, which are not standard values. You learn
to deal with date values later in this lesson.
What’s important to know at this point is that SAS will read the fields in the order in which they appear in the raw data
file, and you cannot skip over fields. You don’t have to read all the fields, but you must read up to the last one that you
need.
Lastly, when using list input, the default length for all variables is 8 bytes, regardless of type.
Question
Suppose you want to write a DATA step that reads a raw data file. Do you need to use a LIBNAME statement to assign a
libref to the directory in which the raw data file is stored?
a. yes
b. no
The correct answer is b. A libref is used to access SAS data sets in a SAS data library. The INFILE statement references the
raw data file, so you do not need to use a libref to point to it.
Code Challenge
Write an INPUT statement that uses list specification to name the variables Country, First_Name, and Salary,
in that order, in the output data set.
1---5---10---15---20---25
AU,Tom,108255
AU,Christina,24475
AU,Irenie,26600
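One possible answer is sketched below. To keep it self-contained, the three records are read with DATALINES rather than from an external file, and the data set name work.challenge is made up for this illustration.

```sas
data work.challenge;
   infile datalines dlm=',';
   /* Fields are read in record order: Country, First_Name, Salary. */
   /* The dollar sign marks Country and First_Name as character.    */
   input Country $ First_Name $ Salary;
   datalines;
AU,Tom,108255
AU,Christina,24475
AU,Irenie,26600
;
run;

proc print data=work.challenge;
run;
```

Because list input gives each variable a default length of 8 bytes, a nine-character value such as Christina is truncated here. You learn how to correct that later in this lesson.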
1. Copy and paste the following program into the editor. The DATA step specifies the variables Employee_ID,
First_Name, Last_Name, Gender, Salary, Job_Title, and Country. Remember that SAS will read the fields in the
order in which they appear in the delimited file, and you cannot skip over fields.
data work.sales1;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name $
Last_Name $ Gender $ Salary
Job_Title $ Country $;
run;
2. Submit the step and then check the log. The log shows that the code ran successfully. SAS read 165 records from
sales.csv, and work.sales1 contains 165 observations and 7 variables.
3. Copy and paste this PROC PRINT step into the editor to view the new data set.
4. Submit this step and examine the report. Take a look at the values for Job_Title. Do you see anything strange?
The values are truncated. In fact, some values for First_Name and Last_Name are also getting truncated in this
report. You'll learn how to correct this problem next.
data work.sales1;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name $ Last_Name $
Gender $ Salary Job_Title $ Country $;
run;
SAS also creates an input buffer to hold a record from the raw data file. The input buffer is an area of memory that SAS
creates only when reading raw data, not when reading a SAS data set.
Then SAS creates the PDV. Remember, the PDV is an area of memory where SAS builds an observation. Remember also
that the PDV contains two automatic variables that are not written to the output data set: _N_ is the iteration counter,
and _ERROR_ signals the occurrence of a data error during that iteration of the DATA step.
PDV
As the INPUT statement compiles, SAS adds a slot to the PDV for each variable in the new data set. Generally, SAS
determines variable attributes such as length and type the first time it encounters a variable. With list input, the default
length for all variables is 8 bytes. Hmmm…is that why some variable values were truncated? Finally, SAS creates the
descriptor portion of the output data set.
Execution Phase
If the DATA step compiles successfully, then the execution phase begins. At the beginning of the execution phase, SAS
initializes the PDV. _N_ is set to 1, _ERROR_ is set to 0, and every other variable in the PDV is set to missing. Remember,
SAS represents missing numeric values with periods, and it represents missing character values with blanks.
PDV
_N_=1  _ERROR_=0  Employee_ID=.  Salary=.  (character variables are blank)
In the first iteration of the DATA step, SAS reads a record from the input data file and holds it in the input buffer.
1---5---10---15---20---25---30---35---40---45---50---55---60---65---70---75
120102,Tom,Zhou,M,108255,Sales Manager,AU,11AUG1973,06/01/1993
SAS reads from non-delimiter to delimiter (the comma) and assigns the data value to the corresponding variable,
Employee_ID, in the PDV. So, the value is converted from text to a floating point numeric value and copied into the PDV.
PDV
_N_=1  _ERROR_=0  Employee_ID=120102  Salary=.  (character variables are blank)
SAS skips over the delimiter (the comma) and begins at the next non-delimiter, again, reading until it reaches a
delimiter, and then assigns the data value to the corresponding variable, First_Name, in the PDV. The text value is
copied to the PDV without conversion.
PDV
_N_=1  _ERROR_=0  Employee_ID=120102  First_Name=Tom  Salary=.  (other character variables are blank)
This process continues for all variables in the INPUT statement. At the bottom of the DATA step, SAS writes the values
from the PDV to the new SAS data set. Remember, _N_ and _ERROR_ are not included in the new SAS data set. Then
control returns to the top of the DATA step for the next iteration. Here is the information in the output data set after the
first iteration of the DATA step. Notice that the value for Job_Title is already truncated.
work.sales1
Control returns to the top of the DATA step, and _N_ is incremented to 2. SAS reinitializes the variables in the PDV to
missing for all of the values being read from the raw data file before the next iteration begins.
PDV
_N_=2  _ERROR_=0  Employee_ID=.  Salary=.  (character variables are blank)
Remember that variables from a SAS data set are not reinitialized, but in this case, no variables are coming from a data
set. They are all new variables, and are therefore all reinitialized. If we were creating other new variables in this DATA
step, they too would be reinitialized.
SAS reads a record from the raw data file each time the INPUT statement executes, so now it reads the second record
from sales.csv into the input buffer.
PDV
Once again, SAS reads from non-delimiter to delimiter, assigning data values to the variables named in the INPUT
statement. At the bottom of the DATA step, SAS writes the second observation to the new data set and control returns
to the top of the DATA step. Execution continues in this way until there are no more records in the raw data file to
read.
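If you want to watch this processing yourself, one way is to write the input buffer and the PDV to the log on every iteration. The following is a minimal sketch and not part of the course demonstration; it reads two shortened records with DATALINES, and the data set name work.trace is made up.

```sas
data work.trace;
   infile datalines dlm=',';
   input Employee_ID First_Name $ Last_Name $;
   /* _INFILE_ refers to the current contents of the input buffer */
   putlog 'Input buffer: ' _infile_;
   /* _ALL_ writes every variable in the PDV, including _N_ and _ERROR_ */
   putlog _all_;
   datalines;
120102,Tom,Zhou
120103,Wilson,Dawes
;
run;
```

Each iteration adds two lines to the log: the raw record as it sits in the input buffer, and the PDV values after the INPUT statement has executed.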
Question
Which of the following statements is true?
a. SAS creates an input buffer only if reading data from a raw data file.
b. At compile time, the PDV holds the variable name, type, length, and initial value.
c. The descriptor portion of the data set is the first item that SAS creates during the compilation phase.
The correct answer is a. SAS uses an input buffer only if the input data is a raw data file. At compile time, the PDV holds
the variable name, type, and length, but not the initial value. The descriptor portion is the last item created during
compilation.
Question
Which statement is true of a DATA step when reading from a raw data file?
a. SAS reads data from the raw data file into the PDV.
b. The size of the input buffer adjusts automatically based on the length of the input record.
c. At the bottom of the DATA step, SAS writes the contents of the PDV to the output SAS data set.
The correct answer is c. When reading a raw data file, SAS reads the data from the file into the input buffer, and the
input buffer is an area of memory whose default length depends on the operating system. The only true statement here
is that at the bottom of the DATA step, SAS writes the contents of the PDV to the output data set.
Business Scenario
In the last demonstration, you saw that the values for Job_Title were truncated when SAS read them out of sales.csv.
That's because, with list input, SAS creates each variable with a length of 8 bytes, regardless of the type of variable. The
values for Job_Title are longer than 8 characters in the raw data file, so they are truncated in the new SAS data set. You
want to alter your DATA step to specify the correct length for Job_Title and other variables whose values are either
longer or shorter than 8 bytes.
Remember, character variables can have a length of 1 to 32,767. We will leave the default length of 8 for the numeric
variables; 8 bytes is large enough to hold 16 to 17 significant digits.
The LENGTH statement begins with the keyword LENGTH. Then you specify the variable name, a dollar sign if it is a
character variable, and the length. You can specify multiple variables in one LENGTH statement. For example, this
LENGTH statement assigns a length of 12 to two variables, First_Name and Last_Name, and a length of 1 to the variable
Gender. Notice that you only need one dollar sign and length specification for the first two variables.
The LENGTH statement below assigns different lengths to three variables: First_Name, Last_Name, and Gender. Notice
that each variable is followed by a dollar sign and the length.
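The two forms described above can be sketched as follows. The lengths of 12 and 1 come from the text; the length of 18 for Last_Name in the second form is borrowed from the program later in this lesson, and the data set name and sample records are made up for illustration.

```sas
data work.len_demo;
   /* Form 1: one dollar sign and one length apply to both        */
   /* First_Name and Last_Name; Gender gets its own specification */
   length First_Name Last_Name $ 12 Gender $ 1;

   /* Form 2 (alternative): a separate length for each variable   */
   /* length First_Name $ 12 Last_Name $ 18 Gender $ 1;           */

   infile datalines dlm=',';
   input First_Name $ Last_Name $ Gender $;
   datalines;
Christina,Ngan,F
Wilson,Dawes,M
;
run;

/* Confirm the lengths that were stored in the descriptor portion */
proc contents data=work.len_demo;
run;
```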
Remember that SAS determines variable attributes such as name, length, and type the first time it encounters a variable
during compilation. So for the LENGTH statement to define the length for the variables in the output data set, it needs to
precede the INPUT statement in the DATA step. Also, make sure that you type the variable names exactly as you want
them to be stored in the data set. If you type the variable names in lowercase in the LENGTH statement, SAS will store
them in lowercase in the data set.
data work.sales2;
length First_Name $ 12 Last_Name $ 18
Gender $ 1 Job_Title $ 25
Country $ 2;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name $ Last_Name $
Gender $ Salary Job_Title $ Country $;
run;
PDV
As the INPUT statement compiles, SAS adds a slot to the PDV for each variable that was not already listed in the LENGTH
statement.
PDV
Question
Which of the LENGTH statements below creates the character variable SalesRep with a length of 18?
The correct answer is c. For character variables, you specify the variable name, followed by the dollar sign, and then
specify the length.
Question
Which of the DATA steps below creates the variable SalesRep with a length of 18?
a.
data work.mydata;
infile 'filepath/sales.csv';
length SalesRep $ 18;
input Amount SalesRep $ Customer $;
run;
b.
data work.mydata;
infile 'filepath/sales.csv';
input Amount SalesRep $ Customer $;
length SalesRep $ 18;
run;
The correct answer is a. The LENGTH statement must precede the INPUT statement in order to correctly set the length
of the variable.
1. Copy and paste the following program into the editor. The LENGTH statement defines lengths for the variables
First_Name, Last_Name, Gender, Job_Title, and Country. The PROC CONTENTS step will print the attributes of
the variables, and the PROC PRINT step will print the new data set.
data work.sales2;
length First_Name $ 12 Last_Name $ 18
Gender $ 1 Job_Title $ 25
Country $ 2;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name $ Last_Name $
Gender $ Salary Job_Title $ Country $;
run;
proc contents data=work.sales2;
run;
proc print data=work.sales2;
run;
2. Submit the program and then check the log. The log shows that the code ran without errors.
3. Now look at the results. In the PROC CONTENTS output, you can see that each character variable has been
assigned the length you specified. The numeric variables Employee_ID and Salary are unchanged. In the PROC
PRINT results, the character values are no longer truncated, but the order of the variables has changed. Do you
know why? It’s because the order of the variables in the PDV has changed.
If you look back at your code, you’ll see that the variables in the LENGTH statement are created first, in the
order they are listed. Then SAS proceeds to scan the INPUT statement to see if any other variables are listed.
Employee_ID and Salary are then added to the PDV in the last two columns and, therefore, are in the last two
columns of our data set.
Suppose you want the order of the variables in work.sales2 to match the order of the fields in sales.csv? You
can include the numeric variables in the LENGTH statement.
4. In the editor, modify the LENGTH statement and include Employee_ID and Salary with a length of 8. Why do
you need to specify the length though? Numeric variables have a default of 8 bytes, right? If you don't specify
the length, SAS will assume that the length of 12 is assigned to both Employee_ID and First_Name, and SAS will
read Employee_ID as a character variable. Also, although numeric variables can be smaller than 8, it's a good
idea not to change the length.
Next, in the PROC CONTENTS step, add the VARNUM option to display the variables in their creation order.
data work.sales2;
length Employee_ID 8 First_Name $ 12
Last_Name $ 18 Gender $ 1
Salary 8 Job_Title $ 25
Country $ 2;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name $ Last_Name $
Gender $ Salary Job_Title $ Country $;
run;
proc contents data=work.sales2 varnum;
run;
proc print data=work.sales2;
run;
5. Submit the revised program and examine the results. The PROC CONTENTS output shows the variables in the order of
the fields in sales.csv, and the work.sales2 data set does as well. The variables have the correct lengths
assigned.
Activity
Copy and paste the following program into the editor. Submit the program and view the log.
data work.nonsales2;
infile "&path/nonsales.csv" dlm=',';
input Employee_ID First $ Last;
run;
Which statement best describes the reason for the log messages, as well as the missing values for Last in the results?
a. The data in the raw data file is bad.
b. The programmer incorrectly read the data.
The correct answer is b. The programmer read the data incorrectly. The raw data values for Last are character values, as
they should be. But the INPUT statement doesn't specify Last as a character variable, so SAS read it as numeric.
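A corrected version of the activity program might look like this. The only change is the dollar sign after Last; as in the original, it assumes that &path points to the folder that contains nonsales.csv.

```sas
data work.nonsales2;
   infile "&path/nonsales.csv" dlm=',';
   /* The dollar sign after Last tells SAS to read the field as character */
   input Employee_ID First $ Last $;
run;
```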
Let’s see how you can include both standard and nonstandard values from a comma-delimited raw data file, and how
you can define lengths for variables without using the LENGTH statement.
Using Modified List Input
Your task is to create a temporary SAS data set by reading both standard and nonstandard data values. Remember that
nonstandard data is data that SAS cannot read without extra instructions. You can use what’s called modified list input to
read all of the fields from sales.csv.
Partial sales.csv
1---5---10---15---20---25---30---35---40---45---50---55---60---65---70---75
120102,Tom,Zhou,M,108255,Sales Manager,AU,11AUG1973,06/01/1993
120103,Wilson,Dawes,M,87975,Sales Manager,AU,22JAN1953,01/01/1978
120121,Irenie,Elvish,F,26600,Sales Rep. II,AU,01AUG1948,01/01/1978
120122,Christina,Ngan,F,27475,Sales Rep. II,AU,27JUL1958,07/01/1982
Let’s first look at how you can use modified list input to read the standard character variables. With modified list input,
you can use informats and the colon format modifier to specify the length of the character variables.
Let’s take a look at how this works. Here’s one record from the sales.csv file, and an INPUT statement that lists all of the
variables to be read.
1---5---10---15---20---25---30---35---40---45---50---55---60---65---70---75
120102,Tom,Zhou,M,108255,Sales Manager,AU,11AUG1973,06/01/1993
data work.sales2;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name :$12.
Last_Name :$18. Gender :$1. Salary
Job_Title :$25. Country :$2.;
run;
Notice that First_Name is followed by a colon, a dollar sign, the number 12, and a period. Doesn’t $12. look like a
format? Well, it's called an informat.
Informats are similar to formats except that formats provide instruction on how to write a value, and in general,
informats provide instruction on how to read a value. But in this case, the informat is used to specify the length of a
variable. The $12. informat tells SAS to create First_Name in the PDV as a character variable with a length of 12.
Informats also tell SAS how many characters to read, so this would cause SAS to read 12 characters for First_Name.
But wait! In this input record, the first name is Tom, so it is only three characters long. That’s where the colon format
modifier comes in. It tells SAS to read only until it reaches a delimiter, in this case, the comma following T-O-M.
Remember, the $12. informat specifies the type and length of the variable, and the colon format modifier causes SAS to
read up to the delimiter.
Now let’s see what happens if you omit the colon format modifier.
data work.sales2;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name $12.
Last_Name :$18. Gender :$1. Salary
Job_Title :$25. Country :$2.;
run;
Notice that only one of the character variables, First_Name, does not include the colon format modifier. The
other character variables do. Let’s walk through how SAS reads this data. For demonstration purposes, we’ll
leave out the input buffer processing.
Partial sales.csv
1---5---10---15---20---25---30---35---40---45---50---55---60---65---70---75
120102,Tom,Zhou,M,108255,Sales Manager,AU,11AUG1973,06/01/1993
To start, SAS reads the first record from non-delimiter to delimiter and stores 120102 as the value for the numeric
variable Employee_ID. Next, because we omitted the colon format modifier with the variable First_Name, SAS reads 12
characters; it doesn’t stop at the delimiters. SAS assigns the value Tom,Zhou,M,1 to First_Name. Omitting the colon
format modifier has caused unexpected results.
For Last_Name, we included the colon format modifier and specified 18 characters. SAS reads from the current position,
the 0, to the next delimiter and assigns 08255 to Last_Name. Wait, how can SAS store a numeric value in a character
variable? Remember, a character variable can hold any characters, including digits.
The $1. informat defines Gender as a character variable with a length of 1. Because of the colon format modifier, SAS
reads up to the next delimiter, reading Sales Manager. Gender can only hold one character, so the S is stored and the
remaining letters are truncated.
For the numeric variable Salary, SAS begins at the next non-delimiter and reads AU, but SAS can’t convert AU to a
numeric value, so the result is a missing value.
Then SAS reads from the next non-delimiter to delimiter and assigns 11AUG1973 to Job_Title. Lastly, we told SAS to
store two characters for Country. So SAS reads the entire date value, but only stores 06.
The resulting PDV is clearly not accurate. Hopefully, from this example, you can see how important the colon format
modifier is.
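The walkthrough above can be reproduced without the external file. The following is a minimal sketch, assuming the first record of sales.csv is supplied as instream data. The data set name work.nocolon is hypothetical and is not part of the course files. The PUT statement writes the PDV values to the log so that you can compare them with the description above.

```sas
data work.nocolon;                     /* hypothetical data set name */
   infile datalines dlm=',';           /* instream copy of the first sales.csv record */
   input Employee_ID First_Name $12.   /* no colon modifier: SAS reads 12 columns */
         Last_Name :$18. Gender :$1. Salary
         Job_Title :$25. Country :$2.;
   put _all_;                          /* write every PDV value to the log */
datalines;
120102,Tom,Zhou,M,108255,Sales Manager,AU,11AUG1973,06/01/1993
;
run;
```

Expect an invalid data note for Salary in the log, because SAS tries to read AU as a number.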
PDV
The layout for the hire date values is numeric month, numeric day, and numeric year, so we can use the MMDDYY.
informat for those values. The MMDDYY. informat reads date values in the form mmdd followed by a two- or four-digit
year, with or without slashes.
SAS Informats
Let’s take a closer look at informats. SAS informats use the form shown here: a dollar sign for character informats, the
name of the informat, an optional width, followed by a dot.
<$><informat><w>.
The width is typically not used when reading numeric values in list input because SAS will read each field until it
encounters a delimiter.
This table shows several SAS informats for nonstandard numeric values and their definitions. Take a moment to read
through these definitions.
Informat             Definition
COMMA., DOLLAR.      Read nonstandard numeric data and remove embedded commas, blanks, dollar signs,
                     percent signs, and dashes.
COMMAX., DOLLARX.    Read nonstandard numeric data and remove embedded non-numeric characters; these
                     informats also reverse the roles of the decimal point and the comma.
EUROX.               Reads nonstandard numeric data and removes embedded non-numeric characters in
                     European currency.
$CHAR.               Reads character values and preserves leading blanks.
$UPCASE.             Reads character values and converts them to uppercase.
This table shows several specific examples of SAS informats and their effect on raw data values.
$UPCASE. au AU
The pound character in the fourth row represents a blank. In the first example, COMMA. or DOLLAR. can be used to read
the raw data value. The dollar sign and comma are removed, and the result is the numeric value 12345.
In the second example, either COMMAX. or DOLLARX. can be used to read the raw data value. With these informats, the
period is not treated as a decimal point, but rather as a separator between each group of three digits. When the raw
data value is read, the dollar sign and period are removed, resulting in the same numeric value, 12345.
Similarly, in the third example, EUROX. treats the period as a separator between each group of three digits. When the
raw data value is read, the euro sign and period are removed, again resulting in the numeric value, 12345.
When reading character data, leading blanks are removed. But in the fourth example, the $CHAR. informat preserves
the leading blanks in the SAS data value. In the last example, the characters au are read in and converted to uppercase.
Notice the resulting SAS data value is uppercase AU.
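To try these informats yourself, here is a minimal sketch that uses instream data. The raw values $12,345 and $12.345 are assumptions chosen to match the descriptions above; they are not taken from the course table. Because the default delimiter is a blank, the embedded comma and period stay inside each field.

```sas
data _null_;
   /* The colon modifier reads each field up to the next blank */
   input Amount :comma10. AmountX :commax10. Code :$upcase2.;
   put Amount= AmountX= Code=;      /* write the converted values to the log */
datalines;
$12,345 $12.345 au
;
run;
```

If the informats behave as described, both numeric variables should hold 12345 and Code should hold AU.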
This table shows several specific SAS informats for date values and their effect on raw data values.
Informat    Raw Data Value    SAS Data Value
MMDDYY.     010160            0
            01/01/60
            01/01/1960
            1/1/1960
DDMMYY.     311260            365
            31/12/60
            31/12/1960
DATE.       31DEC59           -1
            31DEC1959
Notice that the informat can read a variety of raw data widths and achieve the same SAS data value. In the first section,
each raw data value represents January 1, 1960. Some of the values have separators, and some don’t. Some have two-
digit years, others have four-digit years, but it doesn’t matter. The MMDDYY. informat tells SAS to expect a value for
month, followed by day and then year, and all result in the same SAS date, 0.
The dates in the second section all represent December 31, 1960. The DDMMYY. informat is specified to let SAS know to
expect a value for day, then for month, and then for year. All result in the SAS date 365. Finally, the dates in the third
section are a different length, but both represent December 31, 1959. When read using the DATE. informat, both result
in the SAS date value of -1.
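You can check a few of these values with a short test step. This is a sketch that assumes the YEARCUTOFF= system option places the two-digit years 59 and 60 in the 1900s, which is the case for the usual default settings. The variable names d1, d2, and d3 are hypothetical.

```sas
data _null_;
   /* One column per informat; the colon modifier reads up to the next blank */
   input d1 :mmddyy. d2 :ddmmyy. d3 :date.;
   put d1= d2= d3=;                 /* unformatted SAS date values */
datalines;
01/01/1960 31/12/1960 31DEC1959
010160 311260 31DEC59
;
run;
```

According to the table, both records should produce the same three SAS date values: 0, 365, and -1.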
Question
Which of these statements is not true about a SAS informat?
a. A SAS informat provides instructions for reading nonstandard values.
b. A SAS informat provides a length for character variables.
c. A SAS informat must include a period.
d. A SAS informat controls the way nonstandard data values are stored in a SAS data set.
e. A SAS informat must include a numeric value to specify the width of the input field.
The correct answer is e. The width is typically not used when reading numeric values in list input because SAS will read
each field until it encounters a delimiter. When you use an informat with the colon format modifier, SAS ignores the
width and reads up to the next delimiter.
1---5---10---15---20---25---30---35---40---45---50---55---60---65---70---75
120102,Tom,Zhou,M,108255,Sales Manager,AU,11AUG1973,06/01/1993
120103,Wilson,Dawes,M,87975,Sales Manager,AU,22JAN1953,01/01/1978
120121,Irenie,Elvish,F,26600,Sales Rep. II,AU,01AUG1948,01/01/1978
120122,Christina,Ngan,F,27475,Sales Rep. II,AU,27JUL1958,07/01/1982
We know that we can use the DATE. informat to describe the data value. If you look at the SAS documentation,
however, you will see that the DATE. informat has a default width of 7, but the value in the first record is 9 characters
wide.
Once again, we can see the importance of the colon format modifier. Without it, SAS would use the default width and
read only the first seven characters. We must use the colon format modifier with the informat. The colon format
modifier tells SAS to ignore the width and read up to the next delimiter.
Similarly, considering the second date field, when the colon modifier precedes it, the MMDDYY. informat enables SAS to
read any of the date values shown here.
01/07/2008 1/7/2008
1/07/2008 01/07/08
01/7/2008 1/7/08
1. Copy and paste the following program into the editor. This DATA step reads the values with modified
list input. The INPUT statement includes the last two fields, Birth_Date and Hire_Date, from the raw data file.
You read Birth_Date with the colon format modifier and the DATE. informat, and Hire_Date with the colon
format modifier and the MMDDYY. informat.
data work.sales2;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name :$12. Last_Name :$18.
Gender :$1. Salary Job_Title :$25. Country :$2.
Birth_Date :date. Hire_Date :mmddyy.;
run;
2. Submit the program and check the log. You can see that the program executed successfully.
3. Examine your results. Are all of the character variables displayed properly, without truncation? Yes, they all look
good. What about the Birth_Date and Hire_Date values? You can see that the data set does include the two
new date variables, but the values don't look like dates. Oh, that’s right! Remember that these are SAS date
values, which are stored as the number of days since January 1, 1960. To display them as recognizable dates in
reports, you need to add formats.
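One way to make the date values readable is to apply a format when you print them. The following sketch assumes that work.sales2 was created by the step above; the choice of the DATE9. format is only an illustration, and any SAS date format would work.

```sas
proc print data=work.sales2;
   var First_Name Last_Name Birth_Date Hire_Date;
   format Birth_Date Hire_Date date9.;   /* display SAS dates as ddMMMyyyy */
run;
```

The FORMAT statement here affects only this report. The stored values in work.sales2 are unchanged.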
As you’ve seen, using modified list input in your program gave SAS the proper instruction to correctly read in the raw
data.
Business Scenario
Suppose you want your new data set work.subset to only include a subset of the fields in the sales.csv file. And what if
you want to add formats and labels to the new data set? You can do all of those things by adding other statements to
the DATA step, such as DROP or KEEP, LABEL, FORMAT, and subsetting IF.
data work.subset;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name :$12.
Last_Name :$18. Gender :$1. Salary
Job_Title :$25. Country :$2.
Birth_Date :date. Hire_Date :mmddyy.;
run;
2. Suppose you want to see only the sales employees from Australia. You’ve seen how to use a WHERE statement
to select observations too, but you can’t use a WHERE statement in this program because the input is a raw data
file, not a SAS data set. Add the following subsetting IF statement to your program to select those observations.
data work.subset;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name :$12.
Last_Name :$18. Gender :$1. Salary
Job_Title :$25. Country :$2.
Birth_Date :date. Hire_Date :mmddyy.;
if Country='AU';
run;
3. Add a KEEP statement and include only the variables First_Name, Last_Name, Salary, Job_Title, and Hire_Date
in the output data set. To add more descriptive labels, add a LABEL statement and give Job_Title a label of Sales
Title, and Hire_Date a label of Date Hired. Lastly, format the numeric variables so that they are more
understandable. In a FORMAT statement, assign the DOLLAR12. format to Salary, and the MONYY7. format to
Hire_Date. In a PROC PRINT step, add the LABEL option so that SAS will print the new labels. Add the following
KEEP, LABEL, and FORMAT statements to your program, as well as the PROC PRINT step.
data work.subset;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name :$12.
Last_Name :$18. Gender :$1. Salary
Job_Title :$25. Country :$2.
Birth_Date :date. Hire_Date :mmddyy.;
if Country='AU';
keep First_Name Last_Name Salary
Job_Title Hire_Date;
label Job_Title='Sales Title'
Hire_Date='Date Hired';
format Salary dollar12. Hire_Date monyy7.;
run;
proc print data=work.subset label;
run;
5. Examine the results. You can see that the report contains only the variables listed in the KEEP statement. The
new labels have been added, and the values have been formatted so that you can better understand them.
Business Scenario
Throughout this lesson, you've worked with programs that use an INFILE statement to identify an external file to read as
input data. You can also read instream data, which is lines of data that you enter directly into your SAS program. Reading
instream data is extremely helpful if you want to create data and test your programming statements on a few
observations.
DATALINES;
...
;
The following program creates the variables Name and Age using instream raw data lines:
data new;
input name $ age;
datalines;
john 25
henry 55
cynthia 44
karen 21
;
run;
Question
Which of the following DATA steps correctly reads instream data as input for the data set?
a.
data work.mydata;
input Amount SalesRep $ Customer $;
datalines;
250.35 Phelps Torres
178.50 Deng Horekova
;
b.
data work.mydata;
datalines;
250.35 Phelps Torres
178.50 Deng Horekova
input Amount SalesRep $ Customer $;
;
The correct answer is a. You precede the instream data with the DATALINES statement and follow it with a null
statement. The instream data should be the last part of the DATA step except for a null statement.
Reading Instream Data
In this demonstration, you use the DATALINES statement to read instream data.
1. Copy and paste the following program into the editor. The DATA statement specifies the data set name,
work.newemps. Instream data, like raw data that is stored in external files, might have various layouts. In this
example, the data is delimited with blanks, so you use an INPUT statement for list input. Each record of this data
contains four values: first name, last name, job title, and salary. You use a dollar sign to identify the character
variables, and an informat to read the nonstandard salary values. Next is the DATALINES statement before the lines of
data, and then a null statement after the data.
data work.newemps;
input First_Name $ Last_Name $
Job_Title $ Salary :dollar8.;
datalines;
Steven Worton Auditor $40,450
Merle Hieds Trainee $24,025
Marta Bamberger Manager $32,000
;
2. Submit the step and check the log. You can see that the data set work.newemps was created successfully with
three observations and four variables.
3. Copy and paste the PROC PRINT step into the editor and submit it.
4. View the results. You can see that the values in the data set are exactly what you entered in the DATA step.
5. Copy and paste the following program into the editor. When your instream data is delimited with commas, you
use the INFILE statement with the DLM= option. You specify a comma as the delimiter.
data work.newemps2;
infile datalines dlm=',';
input First_Name $ Last_Name $
Job_Title $ Salary :dollar8.;
datalines;
Steven,Worton,Auditor,$40450
Merle,Hieds,Trainee,$24025
Marta,Bamberger,Manager,$32000
;
6. Submit the step and examine the results. You can see that the resulting data set looks exactly like the first data
set.
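The PROC PRINT step mentioned in step 3 is not reproduced in this transcript. A minimal version, assuming that work.newemps was created in step 2, might look like this:

```sas
proc print data=work.newemps;   /* list the observations created from the instream data */
run;
```

To print the second data set, change the DATA= value to work.newemps2.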
Business Scenario
Suppose you have a raw data file that includes some invalid data values. For example, the file sales3inv.csv contains
some invalid data, and you need to create the new SAS data set work.sales4 from this file.
Partial sales3inv.csv
1---5---10---15---20---25---30---35---40---45---50
120102,Tom,Zhou,Sales Manager,108255,AU
120103,Wilson,Dawes,Sales Manager,87975,AU
120121,Irenie,Elvish,Sales Rep. II,26600,AU
120122,Christina,Ngan,Sales Rep. II,n/a,AU
120123,Kimiko,Hotstone,Sales Rep. I,26190,AU
120124,Lucian,Daymond,Sales Rep. I,26480,12
120125,Fong,Hofmeister,Sales Rep IV,32040,AU
Now consider this question. What problems will SAS have reading the numeric data Salary and the character data
Country? The fourth record has the value n/a for Salary. This is not a numeric value. The sixth record has the value 12
for Country. Although 12 can be a valid character value, it is not a valid Country value.
When data values aren't appropriate for the SAS statements in your program, they cause data errors when the program
runs. For example, if you define a variable as numeric when the data contains character values, you create a data error.
As you might guess, anytime you use DATA, INFILE, and INPUT statements, you are likely to encounter data errors.
1. The sales3inv.csv file contains three instances of n/a, as well as a missing value for Salary. The sixth record has a
value of 12 for Country. Copy and paste the following program into the editor. The DATA step creates
work.sales4 with data from the sales3inv.csv file. The INPUT statement lists the variables and their types.
data work.sales4;
infile "&path/sales3inv.csv" dlm=',';
input Employee_ID First $ Last $
Job_Title $ Salary Country $;
run;
2. Submit the program and check the log. The log shows that the DATA step creates the new data set, work.sales4.
The following information is written to the log: a note about the error, the contents of the input buffer and the
contents of the PDV. This note indicates that SAS detected invalid data for the variable Salary in line 4. SAS
displays a ruler above the input buffer to help you locate the invalid data. Notice that SAS assigns a missing value
to the variable that the invalid data affected.
Remember the two automatic variables that SAS creates during processing? The variables _N_ and _ERROR_ are
the temporary variables that SAS creates, and they are helpful when examining your log and data for data
errors. You can see that SAS encountered a data error because the value of _ERROR_ is 1. The error occurred
during the fourth iteration of the DATA step because the value of _N_ is 4. After writing the messages to the log,
SAS continues processing.
Next you can see the two other notes about invalid data for Salary. What about the missing Salary value? The
missing Salary value was not reported as a data error because missing is a valid value in SAS.
3. View the results. Notice that there are four observations with missing Salary values, just as the log indicated.
What about the value of 12 for Country? Even though you know that 12 is not a valid country code, SAS has no
trouble storing it in the Country variable. Remember, Country is a character variable and can store any
characters at all, including numerals. Later you will see that SAS has other procedures that you can use to find
invalid values.
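The demonstration does not say which procedures it has in mind. As one possible approach, a frequency report lists every distinct value of a variable, which makes an unexpected value such as 12 easy to spot. This sketch assumes that work.sales4 was created by the step above.

```sas
proc freq data=work.sales4;
   tables Country;              /* one row per distinct Country value */
run;
```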
Question
What does SAS do when it encounters a data error in a raw data record? Select all that apply.
a. prints a ruler and the raw data record in the SAS log
b. stops processing the program
c. assigns a missing value to the variable that the invalid data affects
d. prints a note about the error in the SAS log
e. prints the variable values in the corresponding SAS observation in the SAS log
The correct answers are a, c, d, and e. When SAS encounters a data error, it prints messages and a ruler in the log and
assigns a missing value to the affected variable. Then SAS continues processing.
Business Scenario
Now let’s examine how to handle missing data. Programmers at Orion Star have discovered that some files have records
with missing data in one or more fields. The records in phone2.csv have a contact name, phone number, and a mobile
phone number. The phone number is missing from some of the records. The missing data is indicated by consecutive
delimiters.
phone2.csv
1---5---10---15---20---25---30---35---40---45---50
James Kvarniq,(704) 293-8126,(701) 281-8923
Sandrina Stephano,, (919 271-4592
Cornelia Krahl,(212) 891-3241,(212) 233-5413
Karen Ballinger,, (714) 644-9090
Elke Wallstab,(910) 763-5561,(910) 545-3421
How do you think SAS will handle these missing data values?
Activity
Copy and paste the following program into the editor. Submit the program and view the log.
data work.contacts;
length Name $ 20 Phone Mobile $ 14;
infile "&path/phone2.csv" dlm=',';
input Name $ Phone $ Mobile $;
run;
1. How many input records were read and how many observations were created?
In the log, you can see that SAS read 5 records from the input file and created 3 observations. SAS writes the following
note: SAS went to a new line when INPUT statement reached past the end of a line.
The report does not look correct. Because the second record has missing values, SAS loads the next record to finish the
observation. As you can see, the value for Mobile in the second observation is Cornelia Krahl. That's definitely not
correct.
work.contacts
You can use the DSD option in your INFILE statement to correctly read the raw data file phone2.csv. DSD stands for
delimiter sensitive data.
The DSD option sets the default delimiter to a comma, treats consecutive delimiters as missing values, and enables SAS
to read values with embedded delimiters if the value is surrounded by quotation marks.
1. Copy and paste the following program into the editor to correctly read the phone2.csv file and create a new SAS
data set, contacts. The INFILE statement includes the DSD option. You can use the DLM= option with the DSD
option, but it is not needed for comma-delimited files. The INPUT statement reads three character variables. The
PROC PRINT step creates the report.
data work.contacts;
length Name $ 20 Phone Mobile $ 14;
infile "&path/phone2.csv" dsd;
input Name $ Phone $ Mobile $;
run;
proc print data=work.contacts noobs;
run;
2. Submit the code and then check the log. You can see that SAS read five records from the file and wrote all five of
them to the new data set, contacts. There is no note in the log about reaching the end of a line.
3. View the report. It shows that the data is correctly assigned now that you added the DSD option.
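The phone2.csv file does not exercise the third DSD behavior, which is reading a quoted value that contains a delimiter. The following sketch uses instream data to show it. The data set name work.dsd_demo and the quoted name in the first record are assumptions made for illustration.

```sas
data work.dsd_demo;                      /* hypothetical data set name */
   length Name $ 20 Phone Mobile $ 14;
   infile datalines dsd;                 /* comma delimiter; consecutive commas mean missing */
   input Name $ Phone $ Mobile $;
   put Name= Phone= Mobile=;             /* write the values to the log */
datalines;
"Kvarniq, James",(704) 293-8126,(701) 281-8923
Karen Ballinger,,(714) 644-9090
;
run;
```

In the first record, the comma inside the quotation marks should be kept as part of Name. In the second record, Phone should be missing.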
Business Scenario
Orion Star programmers have also discovered that some raw data files have records with missing data at the end of the
record, so there are fewer fields in the record than specified in the INPUT statement. For example, the raw data file
phone.csv contains missing values at the end of some records.
phone.csv
1---5---10---15---20---25---30---35---40---45---50
James Kvarniq,(704) 293-8126,(701) 281-8923
Sandrina Stephano,(919 271-4592
Cornelia Krahl,(212) 891-3241,(212) 233-5413
Karen Ballinger,(714) 644-9090
Elke Wallstab,(910) 763-5561,(910) 545-3421
Here’s a question. Can the programmers use the DSD option to correctly read the raw data file? No, the DSD option isn’t
appropriate because the missing data isn’t marked by consecutive delimiters.
1. Copy and paste the following program into the editor. The DATA statement creates the new SAS data set
contacts2. The INFILE statement specifies the raw data file, followed by the DLM= option because the file is
comma delimited, and then the MISSOVER option. The INPUT statement reads the three variables, Name,
Phone, and Mobile, and indicates the type as character.
data work.contacts2;
infile "&path/phone.csv" dlm=',' missover;
input Name $ Phone $ Mobile $;
run;
2. Submit the program and then check the log. Verify that SAS read five records in and five observations out.
3. Examine the results. You can see that the data seems to be read into the correct variables, but something isn’t
quite right. Do you know what’s wrong? The variable values are truncated. You need to add a LENGTH statement
to the program to specify the proper lengths for the character variables.
4. Add a LENGTH statement before the INFILE statement to define a length of 20 for Name and a length of 14 for
both Phone and Mobile.
data work.contacts2;
length Name $ 20 Phone Mobile $ 14;
infile "&path/phone.csv" dlm=',' missover;
input Name $ Phone $ Mobile $;
run;
5. Submit the revised program and view the results. The report looks great. The values are not truncated, and SAS
assigned missing values to the fields that are absent from the end of some records in the raw data file.
Topic Summaries
In order for SAS to read a raw data file, you must specify the location of each data value in the record, along with the
names and types of the SAS variables in which to store the values. Three styles of input are available: list input, column
input, and formatted input. List input reads delimited files, and column and formatted input read fixed column files. List
input and formatted input can read both standard and nonstandard data, and column input can read only standard data.
In this course, we are reading delimited raw data files, so list input is used.
DATA output-SAS-data-set-name;
INFILE 'raw-data-file-name' DLM='delimiter';
INPUT variable1 <$> variable2 <$> ... variableN <$>;
RUN;
SAS processes the DATA step in two phases: compilation and execution. During compilation, SAS creates an input buffer
to hold a record from the raw data file. The input buffer is an area of memory that SAS creates only when reading raw
data, not when reading a SAS data set. SAS also creates the PDV, an area of memory where an observation is built. In
addition to the variables named in the INPUT statement, SAS creates the iteration counter, _N_, and the error indicator,
_ERROR_, in the PDV. These temporary variables are not written to the output data set. At the end of the compilation,
SAS creates the descriptor portion of the output data set.
At the start of the execution phase, SAS initializes the PDV and then reads the first record from the raw data file into the
input buffer. It scans the input buffer from non-delimiter to delimiter and assigns each value to the corresponding
variable in the PDV. SAS ignores delimiters. At the bottom of the DATA step, SAS writes the values from the PDV to the
new SAS data set and then returns to the top of the DATA step.
Truncation often occurs with list input, because character variables are created with a length of 8 bytes, by default. You
can use a LENGTH statement before the INPUT statement in a DATA step to explicitly define the length of character
variables. Numeric variables can be included in the LENGTH statement to preserve the order of variables, but you need
to specify a length of 8 for each numeric variable.
An informat is required to read nonstandard numeric data, such as calendar dates, and numbers with dollar signs and
commas. Many SAS informats are available for nonstandard numeric values. Every informat has a width, whether stated
explicitly or set by default.
When reading a raw data file, you can use a DROP or KEEP statement to write a subset of variables to the new data set.
You must use a subsetting IF statement to select observations, because the variables are not coming from an input SAS
data set. You can use LABEL and FORMAT statements to permanently store label and format information in the new data
set.
A DATA step can also read instream data, which is data that is within a SAS program. To specify instream data, you use a
DATALINES statement in a DATA step, followed by the lines of data, followed by a null statement.
DATALINES;
<data line 1>
<data line 2>
...
;
Validating Data
When data values in the input file aren't appropriate for the INPUT statement in a program, a data error occurs during
program execution. SAS records the error in the log by writing a note about the error, along with a ruler and the
contents of the input buffer and the PDV. The variable _ERROR_ is set to 1, a missing value is assigned to the
corresponding variable, and execution continues.
You can use the DSD option in the INFILE statement if data values are missing in the middle of a record. When you use
the DSD option, SAS assumes that the file is comma delimited, treats consecutive delimiters as missing data, and allows
embedded delimiters in a field that is enclosed in quotation marks. If you have missing data values at the end of a
record, you can use the MISSOVER option in the INFILE statement. SAS sets the variable values to missing.
Sample Programs
data work.sales1;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name $
Last_Name $ Gender $ Salary
Job_Title $ Country $;
run;
data work.sales2;
length First_Name $ 12 Last_Name $ 18
Gender $ 1 Job_Title $ 25
Country $ 2;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name $ Last_Name $
Gender $ Salary Job_Title $ Country $;
run;
data work.sales2;
length Employee_ID 8 First_Name $ 12
Last_Name $ 18 Gender $ 1
Salary 8 Job_Title $ 25
Country $ 2;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name $ Last_Name $
Gender $ Salary Job_Title $ Country $;
run;
data work.sales2;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name :$12. Last_Name :$18.
Gender :$1. Salary Job_Title :$25. Country :$2.
Birth_Date :date. Hire_Date :mmddyy.;
run;
data work.subset;
infile "&path/sales.csv" dlm=',';
input Employee_ID First_Name :$12.
Last_Name :$18. Gender :$1. Salary
Job_Title :$25. Country :$2.
Birth_Date :date. Hire_Date :mmddyy.;
if Country='AU';
keep First_Name Last_Name Salary
Job_Title Hire_Date;
label Job_Title='Sales Title'
Hire_Date='Date Hired';
format Salary dollar12. Hire_Date monyy7.;
run;
data work.newemps;
input First_Name $ Last_Name $
Job_Title $ Salary :dollar8.;
datalines;
Steven Worton Auditor $40,450
Merle Hieds Trainee $24,025
Marta Bamberger Manager $32,000
;
data work.newemps2;
infile datalines dlm=',';
input First_Name $ Last_Name $
Job_Title $ Salary :dollar8.;
datalines;
Steven,Worton,Auditor,$40450
Merle,Hieds,Trainee,$24025
Marta,Bamberger,Manager,$32000
;
data work.sales4;
infile "&path/sales3inv.csv" dlm=',';
input Employee_ID First $ Last $
Job_Title $ Salary Country $;
run;
data work.contacts2;
infile "&path/phone.csv" dlm=',' missover;
input Name $ Phone $ Mobile $;
run;
data work.contacts2;
length Name $ 20 Phone Mobile $ 14;
infile "&path/phone.csv" dlm=',' missover;
input Name $ Phone $ Mobile $;
run;
Objectives
In this lesson, you learn to do the following:
You begin with the orion.sales data set, which includes the variables Employee_ID, Salary, and Hire_Date, among
others. But orion.sales doesn't contain the information you need for Human Resources. You'll need to create the
variables Bonus, Compensation, and BonusMonth in a DATA step and then store the new data in the work.comp data
set.
variable=expression;
The expression can be a numeric constant, so you can use an assignment statement to assign the value 500 to the
variable Bonus.
Bonus=500;
Question
Which of the following statements will create the numeric variable Bonus with a value of 500?
work.comp
a. Bonus=$500;
b. Bonus=500;
c. label Bonus='500';
d. format Bonus 500.;
The correct answer is b. You use an assignment statement to set the value of the variable Bonus equal to 500. Numeric
constants do not include commas or currency symbols.
SUM(argument1,argument2,...);
The arguments can be numeric constants, numeric variables, or arithmetic expressions, but the arguments must be
numeric values. Both Salary and Bonus are numeric values. The arguments must be enclosed in parentheses and
separated with commas. Notice that SAS functions can be used within an assignment statement.
data work.comp;
set orion.sales;
Bonus=500;
Compensation=sum(Salary,Bonus);
run;
Consider this: If Salary has a missing value, what do you think the value of Compensation will be? Its value will be 500.
The SUM function ignores missing values, so if an argument has a missing value, the result of the SUM function is the
sum of the nonmissing values. What if you were to calculate Compensation by simply adding Salary and Bonus in an
assignment statement?
data work.comp;
set orion.sales;
Bonus=500;
Compensation=Salary+Bonus;
run;
If one of these variables contains a missing value, the expression evaluates to missing, and a missing value is assigned to
Compensation. In this case, using the SUM function is the better choice. Compensation will always contain a nonmissing
value.
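You can see the difference between the two approaches in a short test step. This sketch sets Salary to missing by hand; the variable names BySum and ByPlus are hypothetical.

```sas
data _null_;
   Salary = .;                      /* simulate a missing salary */
   Bonus  = 500;
   BySum  = sum(Salary, Bonus);     /* SUM ignores the missing argument */
   ByPlus = Salary + Bonus;         /* the + operator propagates the missing value */
   put BySum= ByPlus=;
run;
```

The log should show BySum=500 and ByPlus=. along with a note that missing values were generated.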
You can use different types of SAS date functions to create variable values. For example, these date functions extract
date information from the date value that SAS stores.
The YEAR function extracts the year from a SAS date and returns a four-digit value for year. The QTR function extracts
the quarter from a SAS date and returns a number from 1 to 4. The MONTH function extracts the month from a SAS date
and returns a number from 1 to 12. The DAY function extracts the day of the month from a SAS date and returns a
number from 1 to 31. The WEEKDAY function extracts the day of the week from a SAS date and returns a number from 1
to 7, where 1 represents Sunday and so on.
The TODAY function returns the current date as a SAS date value. DATE is an alias for TODAY. It works the same way.
Notice that there are no arguments inside the parentheses when calling the TODAY or DATE function. That is because
these functions do not need any information from the program. They use the system clock to obtain the current date
and convert it to a SAS date. It is important to use parentheses even when no arguments are passed. Without the
parentheses SAS would think TODAY or DATE were variables instead of functions. The MDY function returns a SAS date
value from numeric month, day, and year values.
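The following sketch calls each of these functions on a single date constant so that you can compare the results in the log. The date 12DEC1987 and the variable names are chosen only for illustration.

```sas
data _null_;
   d       = '12dec1987'd;          /* SAS date constant */
   y       = year(d);               /* four-digit year */
   q       = qtr(d);                /* quarter, 1 to 4 */
   m       = month(d);              /* month, 1 to 12 */
   dom     = day(d);                /* day of the month, 1 to 31 */
   dow     = weekday(d);            /* day of the week, 1 = Sunday */
   rebuilt = mdy(m, dom, y);        /* build a SAS date from its parts */
   now     = today();               /* current date from the system clock */
   put y= q= m= dom= dow= rebuilt= date9. now= date9.;
run;
```

Because MDY reassembles the same month, day, and year, rebuilt should display as the original date.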
Now let's return to creating the BonusMonth variable. The hire date for each employee is stored in the Hire_Date
variable. At Orion Star, an employee who was hired in April receives an annual bonus in April. So you need to extract the
month the employee was hired from the Hire_Date variable and assign its value to the BonusMonth variable. You can
do this by using the MONTH function. You use the function named MONTH, followed by one argument in parentheses.
The argument for month must be a SAS date.
MONTH(SAS-date);
Here, the MONTH function extracts the month of hire from Hire_Date and returns a number from 1 to 12. The returned
value is assigned to BonusMonth.
Question
Which of these statements specifies SAS functions correctly? Select all that apply.
a. Deadline=sum(TimeSpent,Last_Name);
b. FingersToes=sum(10,10);
c. Birthday=mdy(8,27,90);
d. Review_Quarter=qtr(Hire_Date+372);
e. GreatDay=today();
f. BirthdayYear=year(Birth_Date);
g. BirthdayYear=year('12dec1987'd);
The correct answers are b, c, d, e, f, and g. Except for the TODAY() function, these numeric functions must specify
appropriate numeric arguments in parentheses following the function keyword. You can specify numeric constants,
including SAS date constants, numeric variables, or arithmetic expressions, as numeric arguments. The TODAY() function
doesn't require an argument.
data work.comp;
set orion.sales;
Bonus=500;
Compensation=sum(Salary,Bonus);
BonusMonth=month(Hire_Date);
run;
2. Submit the program and then check the log. You can see that SAS read 165 observations from orion.sales, and
the work.comp data set contains 12 variables. You might recall that orion.sales contains 9 variables, so it looks
as though SAS created the new variables.
3. View the results. You can see that SAS created the three new variables, and they appear as the last three in the
report. The Bonus values are all 500, which is what you specified. The Compensation values should be the sum
of the Bonus and Salary values. You can verify this in the first observation. Tom's salary is 108255, and when
added to 500, the total is 108755. This looks great. Lastly, the BonusMonth values should all be a number from 1
to 12. Scroll through the report to verify the other values.
Question
A DROP statement has been added to this DATA step. Will the program calculate Compensation and BonusMonth
correctly?
data work.comp;
set orion.sales;
drop Gender Salary Job_Title Country
Birth_Date Hire_Date;
Bonus=500;
Compensation=sum(Salary,Bonus);
BonusMonth=month(Hire_Date);
run;
a. yes
b. no
The correct answer is a. The DROP statement is a compile-time only statement. SAS sets a drop flag for the
dropped variables, but the variables are in the PDV and, therefore, are available for processing.
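To see this behavior without the orion library, here is a small sketch that uses a made-up two-row data set,
work.sales_sample (an assumption for illustration only). The dropped variables are still used in the calculations, but
they should not appear in the output data set.

```sas
/* Hypothetical sample data: two salaries and hire dates */
data work.sales_sample;
   input Salary Hire_Date :date9.;
   datalines;
108255 01JUN1993
87975 01JAN1978
;
run;

data work.comp;
   set work.sales_sample;
   drop Salary Hire_Date;            /* flagged at compile time, still in the PDV */
   Bonus=500;
   Compensation=sum(Salary,Bonus);   /* 108755 and 88475 */
   BonusMonth=month(Hire_Date);      /* 6 and 1 */
run;

proc print data=work.comp;
run;
```

The report should show only Bonus, Compensation, and BonusMonth, with the calculated values intact.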
Conditional Processing
The managers at Orion Star plan to give each sales employee a bonus based on his or her job title. Employees with the
job title Sales Rep. IV will receive a $1000 bonus, those with the Sales Manager title will receive a $1500 bonus, Senior
Sales Managers will receive a $2000 bonus, and the Chief Sales Officer will receive a $2500 bonus.
For this task, you need to write a SAS program and use orion.sales to create the new data set work.comp. You need to
include a new variable, Bonus, with a value based on the variable Job_Title. Let's find out how you can complete your
task.
In the IF-THEN statement, as in the assignment statement, expression is a sequence of operands and operators. Here, the
expression defines a condition that SAS tests for each observation.
Operands include character constants and numeric variables.
After the THEN keyword, statement is any executable statement, such as the assignment statement. If the expression is
true, the THEN statement executes. The IF-THEN statement executes for each observation in the data set.
data work.comp;
set orion.sales;
if Job_Title='Sales Rep. IV' then
Bonus=1000;
if Job_Title='Sales Manager' then
Bonus=1500;
if Job_Title='Senior Sales Manager' then
Bonus=2000;
if Job_Title='Chief Sales Officer' then
Bonus=2500;
run;
The value that SAS assigns to Bonus is determined by testing for various values of Job_Title. If the expression is true,
then SAS assigns Bonus the value that corresponds to the IF-THEN statement. If the expression is false, then SAS moves
to the next IF-THEN statement.
Let's see how SAS processes this DATA step in more detail. At the start of the execution phase, SAS initializes the PDV to
missing. When the SET statement executes, SAS reads the first observation of orion.sales into the PDV. The value of
Bonus is missing because Bonus doesn't come from the input data set. It's a new variable that we are creating in this
DATA step.
Partial sales.csv
1---5---10---15---20---25---30---35---40---45---50---55---60---65---70---75
120102,Tom,Zhou,M,108255,Sales Manager,AU,11AUG1973,06/01/1993
120103,Wilson,Dawes,M,87975,Sales Manager,AU,22JAN1953,01/01/1978
120121,Irenie,Elvish,F,26600,Sales Rep. II,AU,01AUG1948,01/01/1978
120122,Christina,Ngan,F,27475,Sales Rep. II,AU,27JUL1958,07/01/1982
PDV
Employee_ID First_Name Last_Name Gender Salary Job_Title Country Bonus
When SAS executes the first IF-THEN statement, the value of Job_Title in the PDV does not match the specified value, so
the expression is false. Bonus remains missing. When SAS executes the second IF-THEN statement, the value of Job_Title
in the PDV matches the specified value, so the expression is true. SAS assigns 1500 to Bonus.
PDV
When SAS executes the third IF-THEN statement, the value of Job_Title in the PDV does not match the specified value,
so the expression is false. Bonus remains 1500. When SAS executes the fourth IF-THEN statement, the value of Job_Title
in the PDV does not match the specified value, so the expression is false. Bonus remains 1500.
At the bottom of the DATA step, SAS uses the values in the PDV to write the first observation into the new data set,
work.comp. Then control returns to the top of the DATA step for the next iteration. This process continues until SAS
reaches the end of the input data set.
data work.comp;
set orion.sales;
if Job_Title='Sales Rep. IV' then
Bonus=1000;
if Job_Title='Sales Manager' then
Bonus=1500;
if Job_Title='Senior Sales Manager' then
Bonus=2000;
if Job_Title='Chief Sales Officer' then
Bonus=2500;
run;
2. Submit the program and then check the log. The log shows that SAS read 165 observations from orion.sales. The
code ran without errors.
3. View the report. As you can see, many values for Bonus are missing, as indicated by the periods. Can you think
of why there are missing values? Not all employees in orion.sales met the conditions you specified. After all, you
only assigned bonus amounts to four job titles. The observations that do have values, however, appear to be
correct.
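One way to check which job titles were left without a bonus is to count the observations where Bonus is missing. This
is only a sketch: it assumes that work.comp was just created by the program above, and it uses PROC FREQ, which is one
of several procedures that could produce such a count.

```sas
/* Assumes work.comp was created by the preceding DATA step */
proc freq data=work.comp;
   where Bonus=.;        /* keep only observations with a missing Bonus */
   tables Job_Title;     /* count the remaining observations by job title */
run;
```

None of the four job titles named in the IF-THEN statements should appear in this table.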
Question
In this program, is it possible for more than one condition to be true for a single observation?
data work.comp;
set orion.sales;
if Job_Title='Sales Rep. IV' then
Bonus=1000;
if Job_Title='Sales Manager' then
Bonus=1500;
if Job_Title='Senior Sales Manager' then
Bonus=2000;
if Job_Title='Chief Sales Officer' then
Bonus=2500;
run;
a. yes
b. no
The correct answer is b. The conditions are mutually exclusive, so only one condition can be true. For each observation,
there is only one value for Job_Title. If that value matches one of the conditions, then it cannot match any other
condition.
Question
Which statement below correctly assigns the value SE to the new variable Region if the variable City has the
value Atlanta?
City ProductNumber
Tampa K445
Atlanta K702
Boston F065
a. if city='Atlanta' then
Region='se';
b. if Region='SE' then
city='Atlanta';
c. if city='Atlanta' then
Region='SE';
d. if Region='Atlanta' then
city='SE';
The correct answer is c. The IF expression tests for the value of City. If the expression is true, SAS executes the
corresponding statement and assigns the value SE to Region.
IF expression THEN statement;
ELSE statement;
You use the keyword ELSE, followed by a statement, which can be any executable SAS statement, including another IF-THEN
statement. The ELSE statement must immediately follow the IF-THEN statement in your program, and it executes
only if the expression in the previous IF-THEN statement is false.
In the program below, SAS evaluates the IF-THEN statements sequentially. When an expression is true, SAS executes the
associated statement and skips subsequent ELSE statements.
data work.comp;
set orion.sales;
if Job_Title='Sales Rep. IV' then
Bonus=1000;
else if Job_Title='Sales Manager' then
Bonus=1500;
else if Job_Title='Senior Sales Manager' then
Bonus=2000;
else if Job_Title='Chief Sales Officer' then
Bonus=2500;
run;
Notice that only the first IF-THEN statement begins with the keyword IF. The subsequent statements begin with the
keyword ELSE. Let's take a look at how SAS processes this code. At the start of the execution phase, SAS initializes the
PDV to missing. When the SET statement executes, SAS reads the first observation of orion.sales into the PDV.
When SAS executes the first IF-THEN statement, the value of Job_Title in the PDV does not match the specified value, so
the expression is false. Bonus remains missing.
PDV
Employee_ID First_Name Last_Name Gender Salary Job_Title Country Bonus
When SAS executes the ELSE statement, the value of Job_Title in the PDV matches the specified value, so the expression
is true. SAS assigns 1500 to Bonus.
PDV
Because this expression is true, SAS skips the remaining ELSE statements. The conditions for Senior Sales Manager and
Chief Sales Officer are not evaluated for this observation, and Bonus remains 1500.
At the bottom of the DATA step, SAS uses the values in the PDV to write the first observation into the new data set,
work.comp. Then control returns to the top of the DATA step for the next iteration. This process continues until SAS
reaches the end of the file.
You can imagine the kind of efficiency this way of programming creates when you have data sets with thousands of
observations!
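Skipping the remaining conditions also matters when conditions are not mutually exclusive. The following sketch
compares the two styles side by side. The salary thresholds and the variables TierA and TierB are assumptions made up
for this illustration; they are not part of the Orion Star scenario.

```sas
data work.tiers;
   input Salary;
   /* Separate IF statements: every condition is tested, so a later
      true condition overwrites the earlier assignment */
   if Salary>=50000 then TierA=1;
   if Salary>=25000 then TierA=2;
   /* IF-THEN/ELSE: testing stops at the first true condition */
   if Salary>=50000 then TierB=1;
   else if Salary>=25000 then TierB=2;
   datalines;
108255
26600
;
run;

proc print data=work.tiers;
run;
```

For the salary 108255, TierA ends up as 2 because the second IF statement also executes, whereas TierB stays 1. For the
salary 26600, both variables are 2.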
Using Conditional Processing
data work.comp;
set orion.sales;
if Job_Title='Sales Rep. III' or
Job_Title='Sales Rep. IV' then
Bonus=1000;
else if Job_Title='Sales Manager' then
Bonus=1500;
else if Job_Title='Senior Sales Manager' then
Bonus=2000;
else if Job_Title='Chief Sales Officer' then
Bonus=2500;
run;
Now, how will you assign a bonus amount of $500 to all other job titles? You certainly don't want to continue writing
ELSE statements for every title in the company! Luckily, you can use an optional final ELSE statement in your program:
else Bonus=500. This statement gives an alternative action if none of the conditions are true.
data work.comp;
set orion.sales;
if Job_Title='Sales Rep. III' or
Job_Title='Sales Rep. IV' then
Bonus=1000;
else if Job_Title='Sales Manager' then
Bonus=1500;
else if Job_Title='Senior Sales Manager' then
Bonus=2000;
else if Job_Title='Chief Sales Officer' then
Bonus=2500;
else Bonus=500;
run;
2. Submit the program and then view the report. As you can see, every observation has a bonus value assigned.
Both Sales Rep. III and Sales Rep. IV employees have a bonus value of $1000, and employees with a bonus value
of 500 have job titles that were not listed in our IF-THEN statements.
Code Challenge
Write statements to assign the value pass to the variable Grade when the value of the variable Points is greater than or
equal to 70. Otherwise, assign the value fail to Grade.
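One possible solution is sketched below. The two statements that answer the challenge are the IF-THEN and ELSE
statements. The data set name work.grades and the three sample Points values are assumptions added so that the step can
run on its own.

```sas
data work.grades;
   input Points;
   /* Assign pass for 70 or more points, fail otherwise */
   if Points>=70 then Grade='pass';
   else Grade='fail';
   datalines;
85
70
42
;
run;

proc print data=work.grades;
run;
```

Both values are four characters long, so the length that SAS derives from the first assignment is enough for either
value.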
Business Scenario
Suppose the Orion Star managers are considering another type of bonus for employees, a country-based bonus. They'd
like to assign a bonus amount of $500 to employees in the US, and assign a bonus amount of $300 to employees in
Australia. You need to create a new data set, work.bonus, with this information.
You know that the orion.sales data set has been validated and only includes the Country values US and AU. So
you can use an IF-THEN-ELSE statement for this scenario: if Country equals US, then Bonus equals 500. Else,
Bonus equals 300. You can omit the conditional clause from the ELSE statement because all observations not
equal to US will get a bonus of $300. This technique should be used only when you know that the final ELSE
statement must be executed for all other observations.
data work.bonus;
set orion.sales;
if Country='US' then Bonus=500;
else Bonus=300;
run;
2. Submit the program and then view the report. As you scroll through the report, you can see that all AU Country
values have a $300 bonus and all US Country values have a $500 bonus.
Activity
Copy and paste this program into the editor. This program reads orion.nonsales, which is a non-validated data set.
Submit the code.
When you examine observations 88 through 235, do you see any values of 300 for Bonus?
a. yes
b. no
The correct answer is a. Bonus is equal to 300 in observations 125, 197, and 200. This is because the variable Country
has some mixed-case values in orion.nonsales. Only observations with the exact Country value US are assigned 500; all
others, including the lowercase value us, are assigned 300.
data work.bonus;
set orion.nonsales;
if Country in ('US', 'us')
then Bonus=500;
else Bonus=300;
run;
Alternatively, you can use the UPCASE function in the expression. The UPCASE
function converts all letters in its argument to uppercase.
data work.bonus;
set orion.nonsales;
if upcase(Country)='US'
then Bonus=500;
else Bonus=300;
run;
You can also clean the data before checking the value in your IF-THEN statement. This program shows how you can first
uppercase all values of the variable Country. It's a best practice to clean the data at the source, but in some cases that is
not possible. With this method, you are creating a clean data set.
data work.bonus;
set orion.nonsales;
Country=upcase(Country);
if Country='US' then
Bonus=500;
else Bonus=300;
run;
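A quick way to confirm the result is to cross-tabulate Country and Bonus. This is a sketch that assumes work.bonus was
just created by the program above; the LIST and MISSING options are used only to keep the table compact and to show any
missing Country values.

```sas
/* Assumes work.bonus was created by the preceding DATA step */
proc freq data=work.bonus;
   tables Country*Bonus / list missing;
run;
```

After the cleaning step, no lowercase us value should remain, and every US row should be paired with a Bonus of 500.
Any other Country value that exists in the data would still fall through to the final ELSE statement and receive 300.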
Business Scenario
Now let's look at a more complex bonus situation. The Human Resources Department stepped in and decided that Orion
Star employees will receive a bonus once or twice a year, depending on their country. US employees will receive a $500
bonus once a year, and Australian employees will receive a $300 bonus twice a year. So, in addition to creating the
variable Bonus, you also need to create the variable Freq, which is the frequency that the employee receives the bonus.
Freq is equal to Once a Year for United States employees, and is equal to Twice a Year for Australian employees. The
variable Freq isn't related to the FREQ procedure.
IF expression THEN
DO;
executable statements
END;
ELSE IF expression THEN
DO;
executable statements
END;
The DO group consists of a DO statement, the SAS statements to be executed when the condition is true, and the END
statement. Multiple statements are permitted in a DO group.
1. Copy and paste the following program into the editor. At the end of the IF-THEN statement, you specify the DO
statement and then the statements to execute if the condition is true. In the first group, you want to assign US
employees the bonus amount 500 and the bonus frequency Once a Year. Notice that you must end each DO
group with an END statement. Next, the ELSE statement contains a second DO group. You assign the Australian
employees the bonus amount 300 and the bonus frequency Twice a Year.
The PROC PRINT step displays only the variables First_Name, Last_Name, Country, Bonus, and Freq.
data work.bonus;
set orion.sales;
if Country='US' then
do;
Bonus=500;
Freq='Once a Year';
end;
else if Country='AU' then
do;
Bonus=300;
Freq='Twice a Year';
end;
run;
proc print data=work.bonus;
var First_Name Last_Name Country
Bonus Freq;
run;
2. Submit the program and then view the report. You can see that the new variables have been added to the data
set. And it appears that the values for Bonus and Freq are based on the value for Country.
But, wait! There does seem to be a problem with some values for Freq. For the value Twice a Year, the final r
seems to be cut off. You'll learn why this value is truncated next.
data work.bonus;
set orion.sales;
if Country='US' then
do;
Bonus=500;
Freq='Once a Year';
end;
else if Country='AU' then
do;
Bonus=300;
Freq='Twice a Year';
end;
run;
SAS begins by creating the PDV from the variables in orion.sales. Remember that SAS lists the variable name and type
(numeric or character), and then allocates space for the variable values.
PDV
Next, SAS creates the new variables. SAS adds Bonus, which is a numeric variable, to the PDV and allocates 8 bytes for it.
From the string of characters that are assigned, SAS deduces that Freq is a character variable. Because SAS wasn't given
any instruction about the variable, SAS assumes that the length is the amount of space taken up by the value Once a
Year, which is the first value assigned to the variable. Once a Year equals 11 characters, so SAS assigns a length of 11 to
the character variable, Freq.
PDV
Employee_ID  First_Name  Last_Name  Gender  Salary  Job_Title  Country  Birth_Date  Hire_Date  Bonus  Freq
N 8          $ 12        $ 18       $ 1     N 8     $ 25       $ 2      N 8         N 8        N 8    $ 11
Look at the number of characters for the next string assigned to the variable Freq. Twice a Year requires 12 characters.
Now you know why the last character is getting left off in the output: SAS only allowed for 11 characters. How would
you modify your code to allow the full 12 characters?
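Before changing the program, you can confirm the length that SAS assigned. The sketch below reproduces the truncation
with a tiny made-up data set, work.demo, and uses the VLENGTH function to report the defined length of Freq. The data
set and the variable FreqLen are assumptions for illustration.

```sas
data work.demo;
   input Country $;
   /* The first assignment that SAS compiles sets the length of Freq */
   if Country='US' then Freq='Once a Year';
   else Freq='Twice a Year';
   FreqLen=vlength(Freq);   /* defined length of Freq, in bytes */
   datalines;
US
AU
;
run;

proc print data=work.demo;
run;
```

FreqLen should be 11 in both observations, and the AU observation should show the truncated value Twice a Yea.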
You can change your code in several ways to deal with the situation. You can specify the condition for Australian
employees first so that SAS allocates 12 bytes for the character variable, Freq,
data work.bonus;
set orion.sales;
if Country='AU' then
do;
Bonus=300;
Freq='Twice a Year';
end;
else if Country='US' then
do;
Bonus=500;
Freq='Once a Year';
end;
run;
or you can pad the value with blanks in the first occurrence,
data work.bonus;
set orion.sales;
if Country='US' then
do;
Bonus=500;
Freq='Once a Year ';
end;
else if Country='AU' then
do;
Bonus=300;
Freq='Twice a Year';
end;
run;
or you can use the LENGTH statement to declare the byte size of the variable up front.
data work.bonus;
set orion.sales;
length Freq $ 12;
if Country='US' then
do;
Bonus=500;
Freq='Once a Year';
end;
else if Country='AU' then
do;
Bonus=300;
Freq='Twice a Year';
end;
run;
The LENGTH statement is a straightforward approach, so let's use it to avoid truncation. SAS assigns variable lengths
based on the first occurrence of the variable in the DATA step. So it's important to put the LENGTH statement early
enough in the program to make sure that's where SAS first encounters the variable Freq.
PDV
Employee_ID  First_Name  Last_Name  Gender  Salary  Job_Title  Country  Birth_Date  Hire_Date  Bonus  Freq
N 8          $ 12        $ 18       $ 1     N 8     $ 25       $ 2      N 8         N 8        N 8    $ 12
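To illustrate why the position of the LENGTH statement matters, here is a sketch in which the LENGTH statement comes
after the first assignment. The data set name work.late_length and the variable Len are assumptions for this
illustration.

```sas
data work.late_length;
   Freq='Once a Year';    /* first reference: the length is set to 11 here */
   length Freq $ 12;      /* too late: the length of Freq is already set */
   Len=vlength(Freq);     /* reports the defined length of Freq */
run;

proc print data=work.late_length;
run;
```

Len should be 11, not 12, and SAS is expected to write a warning to the log saying that the length of the character
variable has already been set.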
Question
Which of the following can determine the length of a new variable?
a. the length of the variable's first value
b. the assignment statement
c. the LENGTH statement
d. all of the above
The correct answer is d. In the DATA step, the first reference to a variable determines its length. The first reference to a
new variable can be in a LENGTH statement, an assignment statement, or another statement such as an INPUT
statement. After a variable is created in the PDV, the length of the variable's first value doesn't matter.
1. Copy and paste the following program into the editor. This program is from the previous demonstration, but
includes a LENGTH statement to correct the truncated value for Freq. Remember that it's important to put the
LENGTH statement before any other reference to the variable.
data work.bonus;
set orion.sales;
length Freq $ 12;
if Country='US' then
do;
Bonus=500;
Freq='Once a Year';
end;
else if Country='AU' then
do;
Bonus=300;
Freq='Twice a Year';
end;
run;
proc print data=work.bonus;
var First_Name Last_Name Country
Bonus Freq;
run;
2. Submit the program and then view the report. The full value Twice a Year is displayed for the variable Freq. You
resolved the problem by using the LENGTH statement. If you were sure of the values for Country, you could
specify the ELSE statement without the condition. A final ELSE statement without a condition executes for every
observation that does not satisfy any of the previous conditions.
3. Copy and paste the following program, which no longer includes the condition in the ELSE statement.
data work.bonus;
set orion.sales;
length Freq $ 12;
if Country='US' then
do;
Bonus=500;
Freq='Once a Year';
end;
else do;
Bonus=300;
Freq='Twice a Year';
end;
run;
4. Submit the revised program and view the results. Notice that the values for Bonus and Freq are correctly
assigned based on the value for Country, and the full value Twice a Year is displayed for Freq.
Topic Summaries
variable=expression;
The SUM function, a descriptive statistics function, returns the sum of its arguments and ignores missing values. The
arguments can be numeric constants, numeric variables, or arithmetic expressions, but the arguments must be numeric
values and must be enclosed in parentheses and separated with commas. The parentheses are required, even if no
arguments are passed to the function.
SUM(argument1,argument2, ...)
In addition to descriptive statistics functions, many SAS date functions are available. Some of these functions create SAS
dates, and others extract information from SAS dates. The MONTH function extracts and returns the numeric month
from a SAS date.
MONTH(SAS-date)
Conditional Processing
The IF-THEN statement is a conditional statement. It executes a SAS statement for observations that meet specific
conditions. The statement includes an expression and a SAS program statement. The expression defines a condition that
must be true for the statement to be executed. The expression is evaluated during each iteration of the DATA step. If the
condition is true, the statement following the THEN keyword is executed; otherwise, SAS skips the statement.
A program often includes a sequence of IF statements with mutually exclusive conditions. When SAS encounters a true
condition in this series, evaluating the other conditions isn't necessary. You can use the ELSE statement to specify an
alternative action to be performed when the condition in an IF-THEN statement is false. This increases the efficiency of
the program.
You can use the logical operators AND and OR to combine conditions in an IF expression. You use the AND operator
when both conditions must be true, and you use the OR operator when at least one of the conditions must be true. An
optional final ELSE statement can be used at the end of a series of IF-THEN/ELSE statements. The statement following
the final ELSE executes if none of the IF expressions is true.
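The difference between AND and OR can be sketched with a small made-up data set. The data set work.logic, the salary
threshold, and the flag variables BothTrue and EitherTrue are assumptions for illustration only.

```sas
data work.logic;
   input Country $ Salary;
   /* AND: both conditions must be true */
   if Country='US' and Salary>50000 then BothTrue=1;
   else BothTrue=0;
   /* OR: at least one condition must be true */
   if Country='US' or Salary>50000 then EitherTrue=1;
   else EitherTrue=0;
   datalines;
US 60000
US 30000
AU 60000
AU 30000
;
run;

proc print data=work.logic;
run;
```

Only the first observation sets BothTrue to 1, while the first three observations set EitherTrue to 1.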
Use a DO group with an IF-THEN or an ELSE statement when multiple statements must be executed based on one
condition. The DO group consists of a DO statement, the SAS statements to be executed, and an END statement. Each
DO statement must have a corresponding END statement.
IF expression THEN
DO;
executable statements
END;
ELSE IF expression THEN
DO;
executable statements
END;
Truncation can occur when new variables are assigned values within conditional program statements. During
compilation, SAS creates a variable in the PDV the first time it encounters the variable in the program. If this is in
conditional code, be sure that it is created with a length long enough to store all possible values. It is a best practice to
use a LENGTH statement to explicitly define the length.
Sample Programs
data work.comp;
set orion.sales;
Bonus=500;
Compensation=sum(Salary,Bonus);
BonusMonth=month(Hire_Date);
run;
data work.comp;
set orion.sales;
if Job_Title='Sales Rep. IV' then
Bonus=1000;
if Job_Title='Sales Manager' then
Bonus=1500;
if Job_Title='Senior Sales Manager' then
Bonus=2000;
if Job_Title='Chief Sales Officer' then
Bonus=2500;
run;
data work.comp;
set orion.sales;
if Job_Title='Sales Rep. III' or
Job_Title='Sales Rep. IV' then
Bonus=1000;
else if Job_Title='Sales Manager' then
Bonus=1500;
else if Job_Title='Senior Sales Manager' then
Bonus=2000;
else if Job_Title='Chief Sales Officer' then
Bonus=2500;
else Bonus=500;
run;
data work.bonus;
set orion.sales;
if Country='US' then Bonus=500;
else Bonus=300;
run;
data work.bonus;
set orion.sales;
if Country='US' then
do;
Bonus=500;
Freq='Once a Year';
end;
else if Country='AU' then
do;
Bonus=300;
Freq='Twice a Year';
end;
run;
proc print data=work.bonus;
var First_Name Last_Name Country Bonus Freq;
run;
data work.bonus;
set orion.sales;
length Freq $ 12;
if Country='US' then
do;
Bonus=500;
Freq='Once a Year';
end;
else if Country='AU' then
do;
Bonus=300;
Freq='Twice a Year';
end;
run;
data work.bonus;
set orion.sales;
length Freq $ 12;
if Country='US' then
do;
Bonus=500;
Freq='Once a Year';
end;
else do;
Bonus=300;
Freq='Twice a Year';
end;
run;
concatenate two or more SAS data sets using the SET statement in a DATA step
rename variables using the RENAME= data set option
prepare data sets for merging by using the SORT procedure
merge SAS data sets one-to-one based on a common variable
merge SAS data sets one-to-many based on a common variable
control the observations in the output data set by using the IN= data set option
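empsdk
First   Gender  Country
Lars    M       Denmark
Kari    F       Denmark
Jonas   M       Denmark
empsfr
First   Gender  Country
Pierre  M       France
Sophie  F       France
empsall1
First   Gender  Country
Lars    M       Denmark
Kari    F       Denmark
Jonas   M       Denmark
Pierre  M       France
Sophie  F       France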
You can concatenate these data sets. Concatenating copies all the observations from the first data set and then copies
all observations from one or more additional data sets into a new data set. The original data sets are unchanged.
Knowing the Structure and Contents of Your Input Data
To choose the best way to combine your data, you need to understand the structure and contents of your input data
sets. So, for example, it's helpful to examine both the descriptor portion of the data sets, as well as the data portions.
When you combine data sets vertically, one of the most important questions to ask is: Do the data sets have variables in
common?
Suppose you're combining data sets one and two. You can see that the variables have common names: A, B, and C. By
examining the data further, you can make sure that variables that have the same names contain the same type of data.
one two
A B C A B C
You might find that variables that have different names across data sets contain the same data. Suppose data set one
has a variable named Last and data set two has a variable named Last_Name. In this situation, you might want SAS to
combine these two variables when you combine data sets.
On the other hand, you might find that variables that have the same name in different data sets contain different data.
For example, suppose both data sets have a variable named Date. One Date variable might store order dates, but the
other Date variable might store shipping dates. You would not want SAS to combine these variables if they hold
different information.
When you're combining vertically, it's easier to combine data sets that have identical variables. However, you can also
combine data sets that have different variables. As you can see, the data sets in our scenario, empsdk and empsfr, have
the same variable names: First, Gender, and Country. For this example, assume that the variables also have the same
attributes.
DATA SAS-data-set;
SET SAS-data-set1 SAS-data-set2 ...;
RUN;
When you specify multiple data sets, SAS combines them into a single data set. In other words, SAS concatenates the
observations from the input data sets. In the combined data set, the observations appear in the order in which the data
sets are listed in the SET statement. In this example, the observations from empsdk will appear before the observations
from empsfr because empsdk is listed first.
data empsall1;
set empsdk empsfr;
run;
During compilation, SAS reads the descriptor portion of the first data set, empsdk, and determines that it has three
variables. SAS also determines the attributes of the variables. Then SAS creates the PDV with slots for the three
variables.
PDV
SAS then looks at the second data set, empsfr, to see if it has additional variables that must be added to the PDV. Here,
empsfr has no additional variables, so SAS makes no further changes to the PDV. At the bottom of the DATA step, the
compilation phase is complete, and the descriptor portion of the new SAS data set empsall1 is created.
empsall1
Now SAS is ready to execute the DATA step and create the data portion of the output data set. To start, SAS initializes
the PDV. This means that SAS sets the value of each variable to missing. Remember that SAS makes a pass, or iteration,
through the DATA step for each observation that's read from an input data set. Consider this: Which observation does
SAS look at first? SAS reads the first observation in empsdk, the first data set specified in the SET statement. SAS reads
the values directly into the PDV. At the bottom of the DATA step, SAS writes the data from the PDV to the output data
set as the first observation.
empsall1
Lars M Denmark
Now SAS returns to the top of the DATA step for the next iteration. Because SAS continues reading observations from
the same input data set, SAS does not reinitialize the PDV. SAS reads the second observation in empsdk into the PDV
and then writes the data to the output data set as the second observation.
empsall1
Lars M Denmark
Kari F Denmark
Returning to the top of the DATA step, SAS now reads the third observation from empsdk into the PDV and then writes
it to the output data set. When control returns to the top of the DATA step, SAS reaches the end of empsdk.
empsall1
Lars M Denmark
Kari F Denmark
Jonas M Denmark
SAS reinitializes the PDV before switching to the second data set.
PDV
Now SAS reads the first observation from the data set empsfr into the PDV, and then writes the values to the output
data set as the fourth observation. SAS reads in and writes out each observation in the second data set until it reaches
the end of that file. When SAS finishes executing the DATA step, the output data set is complete.
empsall1
Lars M Denmark
Kari F Denmark
Jonas M Denmark
Pierre M France
Sophie F France
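If you want to reproduce this walkthrough, the following sketch creates the two input data sets from the values shown
above and then concatenates them. Reading the values with list input in DATALINES is an assumption about how the data
sets might be built; the course data sets themselves may have been created differently.

```sas
/* Create the two input data sets from the values shown in the walkthrough */
data empsdk;
   input First $ Gender $ Country $;
   datalines;
Lars M Denmark
Kari F Denmark
Jonas M Denmark
;
run;

data empsfr;
   input First $ Gender $ Country $;
   datalines;
Pierre M France
Sophie F France
;
run;

/* Concatenate: all of empsdk first, then all of empsfr */
data empsall1;
   set empsdk empsfr;
run;

proc print data=empsall1;
run;
```

The report should list the three Denmark observations followed by the two France observations.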
Code Challenge
Write a SET statement to concatenate clinic.stress98 and clinic.stress99, in that order.
data clinic.testtime;
;
run;
You use the SET statement to name the data sets to be concatenated.
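One possible answer is sketched below. It assumes that the clinic libref is already assigned and that both input data
sets exist in that library.

```sas
/* Assumes the clinic libref is assigned and both input data sets exist */
data clinic.testtime;
   set clinic.stress98 clinic.stress99;   /* stress98 observations come first */
run;
```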
Business Scenario
Suppose you want to concatenate two other data sets that contain employee data. The data set empscn contains
employees from China, and the data set empsjp contains employees from Japan.
empscn
First  Gender  Country
Chang  M       China
Li     M       China
Ming   F       China
empsjp
First  Gender  Region
Cho    F       Japan
Tomi   M       Japan
Here's a question. How many variables have a common name across data sets? Only two of the three variables have a
common name: First and Gender. You want to create a new data set named empsall2 that has three variables. As in
empscn, you want the third variable to be named Country, not Region.
1. Copy and paste the following program into the editor. The first two DATA steps create the data for this
demonstration.
In the third DATA step, the DATA statement specifies the name of the output data set, empsall2. You want the
employees in China to be listed before the employees in Japan, so the SET statement specifies empscn, a space,
and then empsjp. The PROC PRINT step creates the report.
data empscn;
input First $ Gender $ Country $;
datalines;
Chang M China
Li M China
Ming F China
;
run;
data empsjp;
input First $ Gender $ Region $;
datalines;
Cho F Japan
Tomi M Japan
;
run;
data empsall2;
set empscn empsjp;
run;
2. Submit the program and then check the log. You can see messages indicating that SAS ran successfully. Try
answering this question: How many variables does the output data set have? The data set empsall2 has four
variables.
3. View the PROC PRINT output. You can see the four variables First, Gender, Country, and Region. Notice that
some observations have missing values for Country and others have missing values for Region. This isn't the
output that you want. Next you'll learn to create the report you need.
data empsall2;
set empscn empsjp;
run;
PDV
In the second data set, empsjp, SAS finds the additional variable Region. SAS then adds Region as the fourth variable in
the PDV. SAS also creates the descriptor portion of the output data set, which contains the four variables.
PDV
First Gender Country Region
$8 $8 $8 $8
At the start of execution, SAS reads the first observation from empscn. SAS reads the values of First, Gender, and
Country into the PDV. But the data set empscn doesn’t contain the Region variable. Think about this: What value does
SAS assign to Region in the PDV? Because there is no value to be read into Region, it remains missing.
empsall2
Chang M China
When SAS reaches the end of file in empscn, SAS reads the second data set listed in the SET statement, empsjp.
Remember that SAS reinitializes the PDV before switching to this data set. SAS reads the first observation from empsjp
into the PDV. Now the variable Country has a missing value due to the PDV reinitialization.
Here's the final output data set, but this is not the output that we want.
empsall2
First  Gender  Country  Region
Chang  M       China
Li     M       China
Ming   F       China
Cho    F                Japan
Tomi   M                Japan
We want our output data set to have only the first three variables. Also, we want to move the values in Region to the
Country variable. To get this output, we can modify our DATA step to rename the variable Region to Country in empsjp.
SAS-data-set(RENAME=(old-name-1=new-name-1
old-name-2=new-name-2
...
old-name-n=new-name-n))
You specify the RENAME= option immediately after the associated SAS data set name. Notice that you enclose the
RENAME= option within an outer set of parentheses. In the inner set of parentheses, you specify one or more variables
that you want to rename. For each variable, you specify the existing variable name, an equal sign, and the new name. Of
course, the new name must be a valid SAS name. If you are changing multiple variable names for the same data set, you
add a space, not a comma, between variables.
If the RENAME= option is associated with an input data set in the SET statement, as in this example, the action applies to
the data set that is being read. The name change affects the PDV and the output data set, but has no effect on the input
data set.
data empsall2;
set empscn
empsjp(rename=(Region=Country));
run;
empscn
Chang M China
Li M China
Ming F China
set empscn(rename=(Country=Region))
empsjp;
The next example shows that you can use the RENAME= data set option for multiple data sets in the same SET
statement. Suppose you want to rename two variables in the first data set, empscn, and one variable in the second data
set, empsjp. Even though the first variable in the two data sets is currently the same, you want to change First to Fname
in both data sets. So you specify this variable name change in the RENAME= data set option after both empscn and
empsjp. In the first data set, you also want to change Country to Region, so you add this to the RENAME= data set
option after empscn.
empsjp
Cho F Japan
Tomi M Japan
set empscn(rename=(First=Fname
Country=Region))
empsjp(rename=(First=Fname));
Question
Which SET statement has correct syntax?
a.
set empscn(rename(Country=Location))
empsjp(rename(Region=Location));
b.
set empscn(rename=(Country=Location))
empsjp(rename=(Region=Location));
c.
set empscn rename=(Country=Location)
empsjp rename=(Region=Location);
The correct answer is b. You specify the keyword RENAME, followed by the equals sign. You enclose the RENAME=
option within an outer set of parentheses. In the inner set of parentheses, you specify one or more variables that you
want to rename.
Code Challenge
In the code below, rename the variable Office in the sales.rep data set to OfficeNumber.
data condata.emppay;
set sales.rep
empinfo.sales empinfo.bonuses;
run;
(rename=(Office=OfficeNumber))
The RENAME= data set option specifies the variable or variables to be renamed. The variables and their new names are
listed in parentheses after the data set name.
empscn empsjp
Ming F China
data empsall2;
set empscn
empsjp(rename=(Region=Country));
run;
PDV
In empsjp, SAS finds the additional variable Region. However, the RENAME= option in the SET statement tells SAS to
treat the variable Region as if it is named Country. So, the PDV and the descriptor portion of the output data set have
only three variables. At execution, when SAS reads the observations in empsjp, SAS stores the values of Region in the
Country variable.
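One way to check this behavior is to compare the descriptor portions of the output and input data sets. The following is a minimal sketch that reuses the demonstration data; the PROC CONTENTS steps are an addition for illustration and are not part of the original demonstration.

```sas
/* Recreate the two input data sets from the demonstration */
data empscn;
   input First $ Gender $ Country $;
   datalines;
Chang M China
Li M China
Ming F China
;
run;

data empsjp;
   input First $ Gender $ Region $;
   datalines;
Cho F Japan
Tomi M Japan
;
run;

/* Rename Region to Country as empsjp is read */
data empsall2;
   set empscn
       empsjp(rename=(Region=Country));
run;

/* The output data set should list First, Gender, and Country */
proc contents data=empsall2 varnum;
run;

/* The input data set is unchanged and should still list Region */
proc contents data=empsjp varnum;
run;
```

The VARNUM option lists the variables in their order of creation, which makes it easier to see that no fourth variable was added.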
empsau phoneh
empsauh
one two
A B C C D E
both
A B C D E
Knowing the Structure and Contents of Your Data
When you combine data sets horizontally, or match-merge data sets, you might want to ask the question: What is the
relationship between observations in the input data sets? The observations can be related in several different ways.
In a one-to-one relationship, a single observation in one data set is related to one, and only one, observation in another
data set based on the values of one or more common variables. For example, suppose two data sets contain employee
identification numbers for the same group of employees. Each employee ID number appears once in each data set, and
each observation in one data set has one matching observation in the other data set.
one two
A B ID ID D E
1 1
2 2
3 3
In a one-to-many relationship, a single observation in one data set is related to one or more observations in another
data set.
one two
A B ID
ID D E
1
1
2
1
In a many-to-one relationship, multiple observations in one data set are related to one observation in another data set.
one two
A B ID
ID D E
1 1
1 2
In a many-to-many relationship, multiple observations in one data set are related to multiple observations in another
data set.
one two
A B ID
ID D E
1
1
1
1
2
2
Sometimes, the data sets have non-matches. At least one observation in one of the data sets is unrelated to any
observation in another data set based on the values of one or more common variables.
one two
A B ID
ID D E
1
2
2
3
4
4
Now take a look at the data sets for your scenario: empsau and phoneh.
empsau phoneh
Think about this: Which variable can you use to match-merge these data sets? You can use the EmpID variable for the
match-merge. And what about this: Do these data sets have a one-to-one relationship? Yes, each data set contains the
same three employee ID numbers. One last question: How many variables will the new data set empsauh contain?
Empsauh will contain four variables: the first two variables come from the empsau data set. The third variable, the BY
variable, is common to the two input data sets. And the last variable comes from the phoneh data set.
empsauh
DATA SAS-data-set;
MERGE SAS-data-set1 SAS-data-set2 ...;
BY <DESCENDING> BY-variable(s);
<additional SAS statements>
RUN;
The MERGE statement joins observations from two or more SAS data sets into single observations, so you must specify
at least two data sets in the MERGE statement. If you specify only one data set, SAS treats the MERGE statement like a
SET statement.
In this example, we'll specify the data sets empsau and phoneh as the data sets to merge. Next, the BY statement
indicates a match-merge. You specify the common variable or variables to match, which in this case is EmpID.
data empsauh;
merge empsau phoneh;
by EmpID;
run;
The BY variables must be common to all data sets, and the data sets must be sorted by the variables listed in the BY
statement. What can you use to sort the data sets? You can use PROC SORT to sort the empsau and phoneh data sets by
the common variable EmpID. Fortunately, the two data sets that you're working with are already sorted on the BY
variable, EmpID.
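If a data set were not already in order, a PROC SORT step before the merge would be needed. The sketch below uses a hypothetical unsorted copy of the phone data (phoneh_unsorted and phoneh_sorted are illustrative names, not part of the course data).

```sas
/* Hypothetical unsorted copy of the phone data, for illustration only */
data phoneh_unsorted;
   input EmpID Phone $15.;
   datalines;
121152 +61(2)5555-1665
121150 +61(2)5555-1793
121151 +61(2)5555-1849
;
run;

/* Sort by the BY variable; OUT= writes the sorted rows to a new data set
   and leaves the original data set unchanged */
proc sort data=phoneh_unsorted out=phoneh_sorted;
   by EmpID;
run;

proc print data=phoneh_sorted;
run;
```

Using OUT= is a design choice here: without it, PROC SORT replaces the input data set with the sorted version.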
1. Copy and paste the following program into the editor. The first two DATA steps create the data for this
demonstration.
In the third DATA step, the DATA statement specifies the name of the output data set, empsauh. You want to
store all of the employee information in one data set. The PROC PRINT step creates the report.
data empsau;
input First $ Gender $ EmpID;
datalines;
Togar M 121150
Kylie F 121151
Birin M 121152
;
run;
data phoneh;
input EmpID Phone $15.;
datalines;
121150 +61(2)5555-1793
121151 +61(2)5555-1849
121152 +61(2)5555-1665
;
run;
data empsauh;
merge empsau phoneh;
by EmpID;
run;
proc print data=empsauh;
run;
2. Submit the program and then check the log. You can see that SAS ran successfully. SAS read three observations
from each of the input data sets and created a data set that has three observations and four variables.
3. View the PROC PRINT output. You can see that empsauh contains the merged employee information.
Activity
Copy and paste this program into the editor. Complete the DATA step to match-merge the sorted SAS data sets
referenced in the PROC SORT steps. Submit your code. Correct and resubmit if necessary.
data work.payadd;
merge ;
by ;
run;
a.
merge work.payroll
work.addresses;
by Employee_ID;
b.
merge orion.employee_payroll
orion.employee_addresses;
by Employee_ID;
c.
merge work.payroll,
work.addresses;
by Employee_ID;
The correct answer is a. The PROC SORT steps create two new temporary data sets, which are named in the OUT= option.
In the MERGE statement, you specify these data sets, separated by a space. In the BY statement, you specify the
common variable, Employee_ID.
Code Challenge
Complete this program to match-merge the data sets work.reps, empinfo.sales, and empinfo.bonuses, in that order, by
the common variable Emp_ID. Assume that the data sets have been sorted by Emp_ID.
data mergedata.emppay;
;
;
run;
The MERGE statement specifies the input data sets to be merged. Data sets are merged in the order that they appear in
the MERGE statement. In the BY statement, you specify the common variable Emp_ID.
empsau phones
You want to match-merge these data sets based on the common variable EmpID to obtain the phone numbers for each
employee. You assume that the same employees are listed in both data sets, but it's a good idea to examine your data
first.
Now think about this: What is the relationship of these two data sets? One observation in empsau matches one, two, or
three observations in phones, so these data sets have a one-to-many relationship.
data empphones;
merge empsau phones;
by EmpID;
run;
The DATA statement identifies the output data set as empphones. The MERGE statement lists the two input data sets,
and the BY statement specifies the BY variable EmpID, which SAS uses to combine the observations. All observations
that have the same value of the BY variable are in the same BY group. Notice that you don't need to sort these data sets
because they are already in order by EmpID.
Let's see how SAS processes this DATA step when the data sets have a one-to-many relationship. At the end of the
compilation phase, SAS has created the PDV as well as the descriptor portion of the output data set. The output data set
isn't shown here; you'll have a chance to see it later. The PDV has five variables: First and Gender from empsau; EmpID,
which appears in both data sets; and Type and Phone from phones.
PDV
First Gender EmpID Type Phone
Now, at the start of the execution phase, SAS is ready to combine observations. SAS looks at the first observation in each
of the two data sets to determine which BY group should appear first in the output data set. Here's a question. Do the
EmpID values match? Yes. These two observations have the same BY value, so they are in the same BY group.
empsau phones
data empphones;
merge empsau phones;
by EmpID;
run;
The DATA step reads the values from the two data sets into the PDV, in the order they appear in the MERGE statement.
The PDV now contains data for Togar's home phone number.
PDV
SAS then writes the contents of the PDV to the output data set as the first observation. SAS is now ready to start another
iteration of the DATA step.
At the beginning of each DATA step iteration, SAS reinitializes any new variables in the PDV. In this example, SAS does
not reinitialize any variables, because they all come from the input data sets. However, if the DATA step had an
assignment statement that created new variables, SAS would reset the values of the new variables to missing in the PDV.
SAS now moves to the second observation in each data set. Do the EmpID values match?
No, they don't. Does either EmpID match the EmpID in the PDV? Yes. The second observation in phones is in the same
BY group. The second observation in phones contains Togar's work phone number. The observation in empsau has a
different BY value. This observation is in a different BY group, and SAS will process it in the next iteration.
Now, SAS reads the values of Type and Phone from the observation in phones into the PDV. These new values replace
the previous values of Type and Phone in the PDV. However, the values of First, Gender, and EmpID remain the same as
before.
PDV
Finally, SAS writes the values in the PDV to the output data set, creating the second observation.
Once again, SAS retains the values in the PDV. In empsau, SAS is still looking at the second observation because this
observation has not yet been read into the PDV. However, in phones, SAS moves down to the third observation.
empsau phones
Do the EmpID values match? Yes, these observations are in the same BY group. Does either EmpID match the EmpID in
the PDV? No. These observations are in a new BY group, so SAS sets all the values in the PDV to missing.
PDV
SAS reads the values from the empsau observation, and then the phones observation, into the PDV.
PDV
Then SAS writes the PDV values to the output data set as the third observation. The DATA step continues executing in
this way until SAS reaches the end of file for both data sets. Here is the final output data set. Notice that the output data
set has multiple observations for each employee.
empphones
Activity
Copy and paste this program into the editor and submit it. Examine the results. Then reverse the order of the data sets
in the MERGE statement and submit it again. Examine the results.
data phones;
input EmpID Type $ Phone $15.;
datalines;
121150 Home +61(2)5555-1793
121150 Work +61(2)5555-1794
121151 Home +61(2)5555-1849
121152 Work +61(2)5555-1850
121152 Home +61(2)5555-1665
121152 Cell +61(2)5555-1666
;
In a one-to-many merge, does it matter which data set is listed first in the MERGE statement?
a. Yes
b. No
The correct answer is a. When you reverse the order of the data sets in the MERGE statement, the output contains the same observations and values,
but the order of the variables is different. SAS performs a many-to-one merge.
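The effect of the data set order can be seen by running both versions side by side. This is a sketch that uses the empsau and phones data from this lesson; the output name phonesemp is a hypothetical name chosen for illustration.

```sas
data empsau;
   input First $ Gender $ EmpID;
   datalines;
Togar M 121150
Kylie F 121151
Birin M 121152
;
run;

data phones;
   input EmpID Type $ Phone $15.;
   datalines;
121150 Home +61(2)5555-1793
121150 Work +61(2)5555-1794
121151 Home +61(2)5555-1849
121152 Work +61(2)5555-1850
121152 Home +61(2)5555-1665
121152 Cell +61(2)5555-1666
;
run;

/* One-to-many: variable order is First, Gender, EmpID, Type, Phone */
data empphones;
   merge empsau phones;
   by EmpID;
run;

/* Many-to-one: variable order is EmpID, Type, Phone, First, Gender */
data phonesemp;
   merge phones empsau;
   by EmpID;
run;

proc print data=empphones;
run;

proc print data=phonesemp;
run;
```

The variable order follows the order in which SAS encounters the variables during compilation, so the first data set listed in the MERGE statement supplies the leftmost columns.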
empsau phonec
You need to match-merge these data sets, but when you examine them, you notice something interesting. There's an
observation in empsau that does not have a match in phonec, and there's an observation in phonec that does not have
a match in empsau. You want the output data set, empsauc, to contain only the observations that match across the
input data sets.
empsau phonec
SAS looks at the first observation in each data set to determine which BY group should appear first. These two
observations are in the same BY group. So SAS reads the values from the current observation in each data set into the
PDV.
PDV
Then SAS writes the contents of the PDV to the output data set as the first observation.
empsauc
The values remain in the PDV as SAS begins the next iteration of the DATA step.
PDV
empsau phonec
Do these EmpID values match? No, they don't. Does either EmpID match the EmpID in the PDV? No. Neither of these
observations is in the same BY group as the PDV. This is the first non-matching observation that SAS has identified.
Because the current observations are not in the same BY group as the values in the PDV, SAS reinitializes the PDV.
PDV
Now think about this. Which EmpID value comes first sequentially? In the current observations, the EmpID value ending
in 151 comes before the value ending in 152. So SAS reads the second observation in empsau into the PDV. In the PDV,
Phone is still set to missing because there is no phone number for this employee in phonec.
PDV
Kylie F 121151
SAS writes the data in the PDV to the output data set as the second observation.
empsauc
Kylie F 121151
Once again, SAS returns to the top of the DATA step and moves down to the third observation in empsau.
empsau phonec
PDV
Then, SAS reads the values from the empsau observation, and then the phonec observation into the PDV.
PDV
SAS writes the data to the output data set as the third observation.
empsauc
Kylie F 121151
Once again, SAS returns to the top of the DATA step. SAS has reached the end of the file in empsau, but not in phonec.
SAS looks at the third observation in phonec.
empsau phonec
PDV
No, this observation is not in the same BY group, so SAS reinitializes the PDV.
PDV
SAS reads the values from the phonec observation into the PDV,
PDV
121153 +61(2)5555-1348
and then writes the data to the output data set as the fourth observation.
empsauc
Kylie F 121151
121153 +61(2)5555-1348
SAS returns to the top of the DATA step. SAS has reached the end of file in both data sets.
1. Copy and paste the following program into the editor. The first two DATA steps create the data for this
demonstration.
In the third DATA step, the DATA statement specifies the name of the output data set, empsauc. The PROC
PRINT step creates the report.
data empsau;
input First $ Gender $ EmpID;
datalines;
Togar M 121150
Kylie F 121151
Birin M 121152
;
run;
data phonec;
input EmpID Phone $15.;
datalines;
121150 +61(2)5555-1795
121152 +61(2)5555-1667
121153 +61(2)5555-1348
;
run;
data empsauc;
merge empsau phonec;
by EmpID;
run;
proc print data=empsauc;
run;
2. Submit the program and then check the log. You can see messages indicating that SAS ran successfully. SAS read
3 observations from empsau and from phonec, but the new data set has 4 observations.
3. View the PROC PRINT output. The report contains both matches and non-matches. Matches are observations
that contain data from both input data sets. Non-matches are observations that contain data from only one
input data set. This data set has two non-matches, one from each of the input data sets. The manager requested
a report listing employees with cell phones. Next you'll learn how to provide a more accurate report for the task.
Question
Which data set(s) contributed information to the first observation in the output data set empsauc?
Partial empsauc
First Gender EmpID Phone
Togar M 121150 +61(2)5555-1795
Kylie F 121151
Birin M 121152 +61(2)5555-1667
121153 +61(2)5555-1348
a. empsau
b. phonec
c. both empsau and phonec
d. insufficient information
The correct answer is c. Both data sets contributed to the first observation. If one of the data sets had not contributed,
you would see missing values for at least one variable in the observation.
Business Scenario
Given the data for your task, you decide that you'd like to provide the manager with three separate phone inventory
reports: one for employees with company phones, one for employees without company phones, and one for those with
an EmpID not found in the empsau data set. To do this, you'll need to know which data sets contributed to each
observation in the merged data set. You can use a data set option to identify the data set contributors.
When you specify the IN= option after an input data set in the MERGE statement, SAS creates a temporary numeric
variable that indicates whether the data set contributed data to the current observation. The temporary variable has
two possible values. If the value of the variable is 0, it indicates that the data set did not contribute to the current
observation. If the value of the variable is 1, the data set did contribute to the current observation.
In this example, the IN= option is specified after each of the input data sets.
data empsauc;
merge empsau(in=Emps)
phonec(in=Cell);
by EmpID;
run;
We want to know when either of these data sets contributes to the current observation. We've chosen the variable
names Emps and Cell. Here's another example using just E and P as the variable names.
data empsauc;
merge empsau(in=E)
phonec(in=P);
by EmpID;
run;
This last example shows how you can use the IN= option on just one of the data sets in a MERGE statement.
data empsauc;
merge empsau(in=AU)
phonec;
by EmpID;
run;
empsau phonec
data empsauc;
merge empsau(in=Emps)
phonec(in=Cell);
by EmpID;
run;
During the execution phase, SAS creates a temporary variable in the PDV for each instance of the IN= data set option in
your code. Each time SAS reads data into the PDV, SAS assigns a value to the temporary variables Emps and Cell to
indicate whether the associated data set contributed data to the current observation.
PDV
In the first iteration, both data sets contributed to the data that is in the PDV, so the value of both temporary variables is
1. We have a match.
PDV
In the second iteration, the data set phonec did not contribute, so the value of the temporary variable Cell is set to 0.
We have a non-match.
empsau phonec
PDV
Kylie F 121151 1 0
In the third iteration, both data sets contributed to the data that is in the PDV, so SAS assigns the value 1 to both Emps
and Cell. We have another match.
empsau phonec
PDV
empsau phonec
First Gender EmpID EmpID Phone
What are the values of Emps and Cell for this data? This data is a non-match that comes from phonec but not from
empsau. SAS sets the variable Emps to 0 and Cell to 1.
PDV
121153 0 +61(2)5555-1348 1
You might be wondering whether variables that are created with the IN= data set option appear in the output data set.
These variables are only available during execution. As you can see by looking at the partial output data set here, SAS
does not write these temporary variables to the output data set.
empsauc
Kylie F 121151
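Because the IN= variables are not written to the output data set, one way to inspect them is to copy their values into ordinary variables. The sketch below does this with two hypothetical variables, FromEmps and FromCell, and a hypothetical output data set name, empsauc_flags; none of these names come from the course.

```sas
data empsau;
   input First $ Gender $ EmpID;
   datalines;
Togar M 121150
Kylie F 121151
Birin M 121152
;
run;

data phonec;
   input EmpID Phone $15.;
   datalines;
121150 +61(2)5555-1795
121152 +61(2)5555-1667
121153 +61(2)5555-1348
;
run;

data empsauc_flags;
   merge empsau(in=Emps)
         phonec(in=Cell);
   by EmpID;
   /* Copy the temporary IN= values into ordinary variables
      so that they are written to the output data set */
   FromEmps=Emps;
   FromCell=Cell;
run;

proc print data=empsauc_flags;
run;
```

In the printed report, FromEmps and FromCell should show the same 1 and 0 values that the walkthrough above describes for Emps and Cell.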
IF expression;
This way, you can select only the matches or only the non-matches for your output data set.
Question
Which subsetting IF statement can be added to the DATA step to only output the matches?
data empsauc;
merge empsau(in=Emps)
phonec(in=Cell);
by EmpID;
run;
The correct answer is b. If the values of both Emps and Cell equal 1, then both data sets contributed to the observation.
This subsetting IF statement selects only the matches.
data empsauc;
merge empsau(in=Emps)
phonec(in=Cell);
by EmpID;
if Emps=1 and Cell=1;
run;
empsau phonec
First Gender EmpID EmpID Phone
Togar M 121150 121150 +61(2)5555-1795
Kylie F 121151 121152 +61(2)5555-1667
Birin M 121152 121153 +61(2)5555-1348
How many observations will the output data set contain?
The output data set contains only two observations for the two employee ID numbers that match across the input data
sets. This generates what we need for part of our task: the first report for employees with company phones.
empsauc
Selecting Non-Matches
In this demonstration, you select non-matches from data sets using the IN= data set option.
1. Copy and paste the following program into the editor. The first two DATA steps create the data for this
demonstration.
data empsau;
input First $ Gender $ EmpID;
datalines;
Togar M 121150
Kylie F 121151
Birin M 121152
;
run;
data phonec;
input EmpID Phone $15.;
datalines;
121150 +61(2)5555-1795
121152 +61(2)5555-1667
121153 +61(2)5555-1348
;
run;
2. To provide management with the final two reports, one for employees without company phones and one for
employee IDs not found in the empsau data set, you need to select non-matches. First, to select only the
employees without company phones, you want the non-matches from the empsau data set.
Think about what IF expression you can use to select only the non-matches from empsau. You can use if Emps=1
and Cell=0. Add the IF expression after the BY statement as shown below. In both the DATA step and PROC
PRINT step, change the data set name to empsauc2.
data empsauc2;
merge empsau(in=Emps)
phonec(in=Cell);
by EmpID;
if Emps=1 and Cell=0;
run;
proc print data=empsauc2;
run;
3. Submit the program and then check the log. You can see messages indicating that SAS ran successfully. SAS
wrote one observation to the output data set.
4. View the PROC PRINT output. The report shows that Kylie is the only employee from these data sets who does
not have a company phone.
5. You need to modify the DATA step and PROC PRINT step to create the final report, the one for employee IDs not
found in the empsau data set. Which data set needs to contribute to create this report? Right, phonec needs to
contribute. So the IF expression you need to use is if Emps=0 and Cell=1. Modify the IF statement and change
the output data set name to empsauc3 as shown below.
6. Submit this code and check the log. The log shows that SAS wrote one observation to the output data set in this case as well.
7. View the report. It shows that the phone number ending in 1348 is unassigned.
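The modified program for steps 5 through 7 might look like the following sketch. It assumes that the empsau and phonec data sets from step 1 already exist in the Work library in the current SAS session.

```sas
/* Assumes empsau and phonec from step 1 already exist in Work */
data empsauc3;
   merge empsau(in=Emps)
         phonec(in=Cell);
   by EmpID;
   /* Keep only observations that come from phonec alone */
   if Emps=0 and Cell=1;
run;

proc print data=empsauc3;
run;
```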
Question
Which of the following DATA steps selects non-matches from either data set?
empsau phonec
First Gender EmpID EmpID Phone
Togar M 121150 121150 +61(2)5555-1795
Kylie F 121151 121152 +61(2)5555-1667
Birin M 121152 121153 +61(2)5555-1348
a.
data empsauc4;
merge empsau(in=Emps)
phonec(in=Cell);
by EmpID;
if Emps=0 and Cell=0;
run;
b.
data empsauc4;
merge empsau(in=Emps)
phonec(in=Cell);
by EmpID;
if Emps=0 or Cell=0;
run;
The correct answer is b. You use the OR operator to select non-matches from either data set. The resulting output data
set would contain two observations: one for Kylie from empsau, and one for EmpID 121153 from phonec.
When you are checking a variable for a value of 1 or 0, as in the previous scenario, you can use alternate syntax.
Instead of if Emps=1 and Cell=1, you could use if Emps and Cell. Both versions return only the matches.
Instead of if Emps=1 and Cell=0, you could use if Emps and not Cell. SAS selects the non-matches from the first data set.
Instead of if Emps=0 and Cell=1, you could use if not Emps and Cell. SAS selects the non-matches from the second data set.
Instead of if Emps=0 or Cell=0, you could use if not Emps or not Cell. SAS selects the non-matches from either data set.
if Emps=0 or Cell=0;
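To see that the two forms behave the same way, you could run both against the same input. This sketch assumes that empsau and phonec from the earlier demonstration exist in the Work library; the output names nomatch_long and nomatch_short are hypothetical.

```sas
/* Assumes empsau and phonec already exist in Work */

/* Explicit comparison with 1 and 0 */
data nomatch_long;
   merge empsau(in=Emps)
         phonec(in=Cell);
   by EmpID;
   if Emps=1 and Cell=0;
run;

/* Alternate syntax: a value of 1 is treated as true, 0 as false */
data nomatch_short;
   merge empsau(in=Emps)
         phonec(in=Cell);
   by EmpID;
   if Emps and not Cell;
run;

proc print data=nomatch_long;
run;

proc print data=nomatch_short;
run;
```

Both reports should contain the same single observation, the one for Kylie.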
Code Challenge
Write a subsetting IF statement that selects observations for subsequent processing only if all three input data sets
contributed to the current observation.
data mergedata.emppay;
merge sales.reps(rename=(office=OfficeNumber) in=inreps)
empinfo.sales(in=insales)
empinfo.bonuses(in=inbonus);
by Emp_ID;
;
run;
To select only observations composed of values from all input data sets, the subsetting IF statement specifies all
three IN= variables in the IF condition. The AND operator joins expressions in the condition.
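The following sketch shows one way the completed step could look. Because the sales and empinfo libraries are not available here, it uses small hypothetical Work data sets (reps, sales, and bonuses, with made-up values) as stand-ins for sales.reps, empinfo.sales, and empinfo.bonuses.

```sas
/* Hypothetical stand-ins for the three input data sets */
data reps;
   input Emp_ID Office;
   datalines;
101 1
102 2
103 3
;
run;

data sales;
   input Emp_ID Amount;
   datalines;
101 5000
103 7000
;
run;

data bonuses;
   input Emp_ID Bonus;
   datalines;
101 250
102 100
103 400
;
run;

data emppay;
   merge reps(rename=(Office=OfficeNumber) in=inreps)
         sales(in=insales)
         bonuses(in=inbonus);
   by Emp_ID;
   /* Continue only when all three data sets contributed */
   if inreps=1 and insales=1 and inbonus=1;
run;

proc print data=emppay;
run;
```

With this made-up data, Emp_ID 102 has no observation in sales, so only Emp_ID 101 and 103 should remain in emppay.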
Topic Summaries
To go to the movie where you learned a task or concept, select a link.
You use a DATA step to concatenate multiple data sets into a single, new data set. In the SET statement, you can specify
any number of input data sets to concatenate. During compilation, SAS uses the descriptor portion of the first data set
to create variables in the PDV, and then continues with each subsequent data set, creating additional variables in the
PDV as needed. During execution, SAS processes the data sets in the order in which they are listed in the SET statement.
DATA SAS-data-set;
SET SAS-data-set1 SAS-data-set2 ...;
RUN;
If the data sets have differently named variables, every variable is created in the new data set, and some observations
have missing values for the differently named variables. You can use the RENAME= data set option to change variable
names in one or more data sets. After they are renamed, they are treated as the same variable during compilation and
execution, and in the new data set.
SAS-data-set (RENAME=(old-name-1=new-name-1
old-name-2=new-name-2
...
old-name-n=new-name-n))
Merging SAS Data Sets One-to-One
Merging combines observations from two or more SAS data sets into a single observation in a new data set. A simple
merge combines observations based on their positions in the original data sets. A match-merge combines them based
on the values of one or more common variables. The result of a match-merge is dependent on the relationship between
observations in the input data sets.
You use a DATA step with a MERGE statement to merge multiple data sets into a single data set. The BY statement
indicates a match-merge and specifies the common variable or variables to match. The common variables are referred
to as BY variables. The BY variables must exist in every data set, and each data set must be sorted by the value of the BY
variables.
DATA SAS-data-set;
MERGE SAS-data-set1 SAS-data-set2 ...;
BY <DESCENDING> BY-variable(s);
<additional SAS statements>
RUN;
You can use the IN= data set option in a MERGE statement to create a temporary variable that indicates whether a data
set contributed information to the observation in the PDV. The IN= variables have two possible values: 0 and 1. You can
test the value of this variable using subsetting IF statements to output only the matches or only the non-matches to the
merged data set.
Sample Programs
data phoneh;
input EmpID Phone $15.;
datalines;
121150 +61(2)5555-1793
121151 +61(2)5555-1849
121152 +61(2)5555-1665
;
run;
data phonec;
input EmpID Phone $15.;
datalines;
121150 +61(2)5555-1795
121152 +61(2)5555-1667
121153 +61(2)5555-1348
;
run;
Selecting Non-Matches
data phonec;
input EmpID Phone $15.;
datalines;
121150 +61(2)5555-1795
121152 +61(2)5555-1667
121153 +61(2)5555-1348
;
run;
Using PROC FREQ and PROC MEANS, you'll learn to create summary reports. You'll also learn how to use these
procedures, as well as PROC PRINT and PROC UNIVARIATE, to validate your data.
Now suppose you want to send the output you create to external files. In this lesson, you'll also work with the SAS
Output Delivery System, or ODS. You'll use ODS statements to send the output from your SAS procedures to many types
of external files.
Objectives
produce one-way and two-way frequency tables by using the FREQ procedure
enhance frequency tables by using options
use PROC FREQ to validate data in a SAS data set
calculate summary statistics and multilevel summaries by using the MEANS procedure
enhance summary tables by using options
identify extreme and missing values by using the UNIVARIATE procedure
define the Output Delivery System and ODS destinations
use ODS statements to direct report output to various ODS destinations
specify a style definition by using the STYLE= option
create report output that can be viewed in Microsoft Excel
Business Scenario
Suppose your manager at Orion Star wants to know the number of male and female sales employees in Australia. You
need to use orion.sales to analyze the number of occurrences of each value for Gender. In other words, you want to
know the frequency of the variable Gender.
In the TABLES statement, you specify the frequency tables to produce. You can define one or more frequency tables in a
single TABLES statement.
To create one-way frequency tables, you specify one or more variable names separated by a space. If you omit the
TABLES statement, SAS produces a one-way frequency table for every variable in the data set. This can produce
tremendous output and is seldom desired.
For our task, we'll specify the variable Gender. Notice that we also have a WHERE statement to subset the data by the
Country variable.
1. Copy and paste the following program into the editor. PROC FREQ automatically displays output in a report, so
you don't need to add a PROC PRINT step.
2. Submit the code and then check the log. The log shows that SAS read 63 observations from orion.sales.
3. Examine the report. This frequency table shows frequency statistics for the variable Gender. By default, the first
column displays the values of the specified variable. Each unique value is called a level of the variable. In this
table, Gender has two levels: F for female and M for male.
For each value, a one-way frequency table shows four statistics by default. The Frequency column displays the
frequency count, which is the number of observations in a level. The Percent column displays the percentage of
the total number of observations. The Cumulative Frequency column displays the cumulative frequency count,
which is the sum of the frequency counts for that level and for all levels listed above it. The Cumulative Percent
column displays the percentage of the total number of observations in that level and in all other levels listed
above it.
Can you determine what percent of the Orion Star sales force in Australia is female? Females represent about
43% of the sales force.
Gender Frequency Percent Cumulative Frequency Cumulative Percent
F 27 42.86 27 42.86
M 36 57.14 63 100.00
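The program itself isn't shown in this transcript, so here is a minimal sketch of what the step might look like. It assumes that Australia is stored as the value 'AU' in Country, which matches the column labels used elsewhere in this lesson.

```sas
/* One-way frequency table for Gender, Australian employees only. */
/* Assumption: the Country value for Australia is 'AU'.           */
proc freq data=orion.sales;
   tables Gender;
   where Country='AU';
run;
```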
Suppose you want your frequency table to display only a subset of the available statistics. By specifying options in the
TABLES statement, you can suppress one or more of the statistics in frequency tables.
For example, you can add the NOCUM and NOPERCENT options in the TABLES statement. Let's see how each of these
options affects a one-way frequency table.
The NOCUM option suppresses the display of cumulative frequency and cumulative percentages.
Gender   Frequency   Percent
F        27          42.86
M        36          57.14
Notice that you must specify a forward slash before you specify options. The NOPERCENT option suppresses the display
of all percentages.
Gender   Frequency   Cumulative Frequency
F        27          27
M        36          63
You can also specify both NOCUM and NOPERCENT. When you specify multiple options, you separate them by a space.
If you specify both NOCUM and NOPERCENT, which statistics appear in the frequency table? The one-way frequency
table displays only frequencies.
Gender Frequency
F 27
M 36
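As a sketch, a TABLES statement that uses both options could be written like this. The WHERE condition is an assumption carried over from the Australia-only example.

```sas
/* Options follow a forward slash and are separated by spaces. */
/* Assumption: the same Country='AU' subset as before.         */
proc freq data=orion.sales;
   tables Gender / nocum nopercent;
   where Country='AU';
run;
```

To suppress only one kind of statistic, you would keep just NOCUM or just NOPERCENT after the slash.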
Activity
Copy and paste the following program into the editor. Submit the program and check the log. Correct the program and
resubmit it.
The correct answer is b. You must use a forward slash before you specify options in the TABLES statement.
When you specify multiple options, you separate them by a space, not a comma.
Code Challenge
Complete the FREQ procedure shown below. Compute frequency statistics for the variable Test1. Suppress the
display of all cumulative statistics.
;
run;
tables Test1/nocum;
In the TABLES statement, you list the variable Test1, followed by a forward slash, and then the NOCUM
option. The NOCUM option suppresses the display of cumulative frequency and cumulative percentages.
Variables Values
Employee_ID
First_Name
Last_Name
Gender
Salary
Job_Title
Country
Birth_Date
Hire_Date
The variables Employee_ID, First_Name, and Last_Name are not good choices for a PROC FREQ summary report. Every
employee has a unique employee ID number, and most employees have unique names. When you're summarizing data,
there's no need to show a frequency distribution for variables that have a large number of distinct values.
Frequency distributions work best with variables whose values meet two criteria. First, the values of the variable are
categorical. Second, the values are best summarized by counts instead of averages. For example, the values of Gender
fall into two categories: female and male. Gender is a character variable, so its values must be counted and cannot be
averaged.
What are two other categorical variables in this list? Job_Title and Country are also categorical variables, so they are
probably good choices for a frequency distribution.
What about Salary, Birth_Date, and Hire_Date? Variables that have continuous numeric values, such as dollar amounts
and dates, can result in a lengthy and meaningless frequency table. To create a useful frequency report for these
variables, you can group the variable values into categories. How can you do that? You can group the values of a variable
into categories by applying formats. You can use existing SAS formats or user-defined formats. For example, you could
use the TIERS format to categorize salary values.
proc format;
value tiers 20000-<50000='Tier1'
50000-<100000='Tier2'
100000-250000='Tier3';
run;
After you group the values of a continuous numeric variable into categories, you can create a meaningful frequency
table for that variable.
How can you apply the TIERS format to Salary? The FORMAT statement specifies Salary, followed by the TIERS.
format.
proc format;
value tiers low-25000='Tier1'
25000<-50000='Tier2'
50000<-100000='Tier3'
100000<-high='Tier4';
run;
2. Submit the code and then check the log. The log shows that the format TIERS was created and that the PROC
FREQ step ran successfully.
3. Examine the report. As you can see, the Salary values have been categorized into the four tiers. For each tier,
you can see the frequency of employees. Notice that Tier2 has the highest number of employees.
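The PROC FREQ step that applies the format isn't shown in this transcript. A minimal sketch might look like the following. It assumes that the TIERS format has already been created in the current session and that the input data set is orion.sales.

```sas
/* Group Salary into tiers for the frequency table.                */
/* Assumptions: TIERS. was created by an earlier PROC FORMAT step, */
/* and orion.sales is the input data set.                          */
proc freq data=orion.sales;
   tables Salary;
   format Salary tiers.;
run;
```

The period after the format name is required. Without it, SAS would read tiers as a variable name.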
Question
Which variable would be a poor choice for frequency tables?
The correct answer is a. Name is likely to contain unique values, producing lengthy and meaningless output. It is not a
good choice for the FREQ procedure.
Business Scenario
You've analyzed orion.sales for the frequency of Gender. Suppose you've now been asked to determine the number of
female and male sales employees in each country, Australia and the US. Let's see how you can modify your PROC FREQ
program to generate these results.
The order in which the variables appear in the TABLES statement determines the order in which the one-way frequency
tables appear in the report. So think about the output that SAS will produce: a frequency table for Gender and a
frequency table for Country. Let's see if this answers the question for our task.
1. Copy and paste the following program into the editor. This PROC FREQ step will create two tables, one for
Gender and one for Country.
2. Submit the program and then check the log. The log shows that SAS read 165 observations from orion.sales.
3. Examine the results. You can easily see how many female and male sales employees there are, but can you
determine how many females are in Australia? No, you can't determine that information in this report.
The Country table lists the frequency of employees in each country, but you don't know how many females are
in a particular country. Do you think you can use PROC FREQ to determine this kind of information? What you
really want is a separate analysis for each group.
4. To find the frequency of Gender by Country, you need to add a BY statement: by Country. Do you recall what
you first need to do to the input data set before you submit this code? You need to sort orion.sales by Country.
Whenever you use the BY statement, the data set must be sorted by the variable named in the statement. The
PROC SORT step includes the OUT= option and names the output data set sorted. In the PROC FREQ step, you
change the input data set to sorted. Copy and paste the following program into the editor and submit it.
5. Check the log. The log shows that everything ran successfully.
6. Examine the new report. You can see that each group is in a separate frequency table. The Australian female
and male frequencies are in the first table, followed by the US female and male frequencies. Notice that there
are 27 female sales employees in Australia. As you can see, you can easily modify your analysis to produce the
results you need.
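The two programs in these steps aren't shown in this transcript. Based on the description, they might look like the following sketch. The data set name sorted comes from the text, and SAS writes it to the temporary Work library.

```sas
/* First attempt: two separate one-way tables, Gender and then Country. */
proc freq data=orion.sales;
   tables Gender Country;
run;

/* BY-group processing requires data sorted by the BY variable. */
proc sort data=orion.sales out=sorted;
   by Country;
run;

/* One Gender frequency table for each value of Country. */
proc freq data=sorted;
   tables Gender;
   by Country;
run;
```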
Gender (rows) by Country (columns), partial output
Gender          AU      US      Total
F (Frequency)   27      41      68
M (Col Pct)     57.14   59.80
Total           63      102     165
                38.18   61.82   100.00
In a two-way table, the first variable specifies the table rows and the second variable specifies the table columns.
1. Copy and paste the following program into the editor. This PROC FREQ step will create a two-way
crosstabulation table for the variables Gender and Country.
2. Submit the program and then check the log. The log shows that SAS read 165 observations from orion.sales.
3. Examine the results. This two-way crosstabulation table for Gender and Country shows the distribution of
females and males among Orion Star sales employees in Australia and the United States.
By default, PROC FREQ displays two-way crosstabulation tables in table cell format. The row variable values
appear on the side of the table, and the column variable values appear across the top. Each of the main cells
represents a combination of a row variable level and a column variable level. For example, this cell contains
frequency statistics for female employees in the United States. In addition, the last column and the last row
provide totals. There is a legend in the top left corner of the output.
Four statistics appear by default. The frequency statistic indicates the number of observations with the unique
combination of values represented in that cell. The percent statistic indicates the cell's percentage of the total
frequency. The row percentage is the cell's percentage of the total frequency for its row. And the column
percentage is the cell's percentage of the total frequency for its column.
Take another look at the cell for female employees in the United States. These statistics indicate that 41
observations in the data set have a value of F for Gender and a value of US for Country. These 41 observations
represent 24.85% of the data set. The row percentage indicates that 60.29% of female employees are in the
United States. The column percentage indicates that 40.20% of United States employees are female.
Which two statistics are the same in both a one-way frequency table and a crosstabulation table? The frequency
and percentage appear in both tables. Cumulative frequencies and cumulative percentages appear only in a
one-way frequency table. Row percentages and column percentages appear only in a crosstabulation table.
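As a sketch of the step described above, you request a two-way table by joining the two variables with an asterisk. The row variable goes first.

```sas
/* Gender forms the rows and Country forms the columns. */
proc freq data=orion.sales;
   tables Gender*Country;
run;
```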
Code Challenge
Complete the program below so that PROC FREQ creates a two-way crosstabulation of AgeRange and
MovingViolation. Use the values of MovingViolation for the table rows.
;
run;
tables MovingViolation*AgeRange;
You specify the keyword TABLES, followed by the variables joined by an asterisk. MovingViolation is listed first to form
the table rows.
You can use the NOPERCENT option for crosstabulation tables as well as one-way frequency tables. In a crosstabulation
table, NOPERCENT suppresses the display of overall percentages. These percentages include the second row for each
level and the second row in the Total level.
Gender (rows) by Country (columns)
Gender   Statistic   AU      US      Total
F        Frequency   27      41      68
         Row Pct     39.71   60.29
         Col Pct     42.86   40.20
M        Frequency   36      61      97
         Row Pct     37.11   62.89
         Col Pct     57.14   59.80
Now let's look at the additional options that you can specify in the TABLES statement to suppress statistics in a
crosstabulation table. The NOFREQ option suppresses the display of cell frequencies.
Gender   Statistic   AU      US      Total
M        Row Pct     37.11   62.89
         Col Pct     57.14   59.80
Total                63      102     165
                     38.18   61.82   100.00
Gender   Statistic   AU      US      Total
F        Frequency   27      41      68
M        Frequency   36      61      97
         Col Pct     57.14   59.80
Total                63      102     165
                     38.18   61.82   100.00
Gender   Statistic   AU      US      Total
F        Row Pct     39.71   60.29
M        Frequency   36      61      97
         Row Pct     37.11   62.89
Total                63      102     165
                     38.18   61.82   100.00
Question
Which TABLES statement correctly creates this report?
a.
tables Gender*Country
nofreq norow nocol;
b.
tables Gender*Country
nocum norow nocol;
c.
tables Gender*Country/
nofreq norow nocol;
d.
tables Gender*Country/
nocum norow nocol;
The correct answer is c. You specify the options that suppress cell frequencies and total frequencies, row percentages,
and column percentages. The only remaining statistic is cell percentages. You list TABLES statement options after a
forward slash.
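To compare the suppression options side by side, you could submit something like the following sketch. It relies on the fact that PROC FREQ accepts more than one TABLES statement in a step, and each statement produces its own table.

```sas
proc freq data=orion.sales;
   tables Gender*Country / nopercent;    /* no overall cell percentages  */
   tables Gender*Country / nofreq;       /* no cell frequencies          */
   tables Gender*Country / norow nocol;  /* no row or column percentages */
run;
```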
Gender   Country   Frequency   Percent   Cumulative Frequency   Cumulative Percent
F        AU        27          16.36     27                     16.36
F        US        41          24.85     68                     41.21
In the list version of this two-way crosstabulation table, notice that the first two columns specify each possible
combination of the two variables. All statistics for each combination are displayed in a single row.
There are differences between the statistics in the default version and those in the list version. The statistics in the list
version are the same as in a one-way frequency table. The cumulative frequency and cumulative percentage appear
instead of the row percentage and the column percentage.
To format your crosstabulation table in the crosslist format, you specify the CROSSLIST option in the TABLES statement.
Gender   Country   Frequency   Percent   Row Percent   Column Percent
Like the list format, the crosslist format might be easier to read than the default format. However, notice that the
crosslist format displays the same statistics as the default crosstabulation table.
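Here is a short sketch that requests both alternative layouts for the same pair of variables, so that you can compare them with the default table cell format.

```sas
proc freq data=orion.sales;
   tables Gender*Country / list;       /* one row per combination, cumulative statistics */
   tables Gender*Country / crosslist;  /* list-style layout with row and column percents */
run;
```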
Actually, SAS applies a default format to all of the frequency values, which controls the column width. So it's possible
that changing the length of a variable value could make that value wrap to the next line. This also depends on whether
you are using the SAS windowing environment or a client application such as SAS Enterprise Guide or SAS Studio.
To change the format that SAS applies, you can add another option to the TABLES statement, the FORMAT= option. This
option allows you to format the frequency value and to change the width of the column. In the FORMAT= option, you
can specify any standard SAS numeric format or a user-defined numeric format. The format length cannot exceed 24.
The FORMAT= option applies only to crosstabulation tables displayed in the default format. It doesn't apply to
crosstabulation tables produced with the LIST or CROSSLIST option.
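As an illustration, the following sketch applies a numeric format to the cell frequencies. The width of 12 is an arbitrary choice for this example, not a value taken from the course.

```sas
/* FORMAT= changes how frequencies display in the default crosstabulation layout. */
/* Assumption: the 12. format is used here only as an example width.              */
proc freq data=orion.sales;
   tables Gender*Country / format=12.;
run;
```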
Gender F or M
Country AU or US
Let's see how you can use a PROC FREQ step with the TABLES statement to detect invalid numeric and character data by
looking at distinct values.
1. Copy and paste the following PROC PRINT step to print the first 20 observations.
2. Submit the step and then check the log. The log doesn't (and won't) display problems that violate your data
requirement because the problems don't constitute data errors.
3. Examine the report. As you scan the report, try to identify observations that do not meet your data
requirements.
In observation 2, notice that the value of Country is lowercase au. It should be uppercase. In observation 4,
Salary has a missing value. In observation 10, Job_Title has a missing value. Observation 12 has a value of G for
Gender. Gender must have a value of F or M. Observation 13 contains a Salary value that is less than our
requirement of at least 24000. Observation 14 has a missing value for Employee_ID.
So this data needs a lot of work. This PROC PRINT report has shown you the kinds of issues this data set has.
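The PROC PRINT step isn't shown in this transcript. A minimal sketch might look like this, assuming that the data set being validated is orion.nonsales2, which is the name used later in this lesson.

```sas
/* OBS=20 stops processing after the 20th observation. */
proc print data=orion.nonsales2 (obs=20);
run;
```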
The Gender frequency table shows that one observation contains the value G and another contains a missing value. The
Country frequency table shows that six observations contain invalid data: three for lowercase au and three for
lowercase us.
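A step that produces frequency tables like the ones just described might look like the following sketch. It assumes the input data set is orion.nonsales2.

```sas
/* Each distinct value appears as a level, so invalid values such as G or au stand out. */
proc freq data=orion.nonsales2;
   tables Gender Country;
run;
```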
Now let's consider another variable, Employee_ID, which should have a different value for each observation. What
would you have to do to find duplicate values? To find any duplicates, you can look through the list of Employee_ID
frequencies to find values that are greater than 1. To make this easier, you can use the ORDER=FREQ option in the PROC
FREQ statement to display the results in descending frequency order.
Partial output
Employee_ID   Frequency
120108        2
120101        1
120104        1
120105        1
120106        1
120107        1
120110        1
120111        1
120112        1
120113        1
121146        1
121147        1
121148        1
Frequency Missing = 1
1. Copy and paste the following PROC FREQ step into the editor to validate the Employee_ID frequency.
2. Submit the code and then check the log. The code ran successfully.
3. Examine the results. Notice that the Employee_ID 120108 has a frequency of 2, and you can easily find it
because it's listed first.
Another option for validating Employee_ID is to use the NLEVELS option in the PROC FREQ statement.
4. In the PROC FREQ statement, remove the ORDER=FREQ option and add the NLEVELS option. Also, add the
variables Gender and Country to the TABLES statement.
5. Submit this code and view the results. When you specify NLEVELS, PROC FREQ displays a table of the distinct
values, or levels, for each variable in the TABLES statement. The Number of Variable Levels table appears before
the individual frequency tables. If you know the number of levels that should occur, the Number of Variable
Levels table can indicate whether a variable contains duplicate values.
For example, the Number of Variable Levels table shows that Employee_ID has 234 levels, one missing and the rest nonmissing.
Here's a question: Assuming that orion.nonsales2 contains 235 observations, how many duplicate values does
Employee_ID contain? Given the 234 levels and 235 observations, Employee_ID must contain one duplicate
value. Since the table indicates a missing value for Employee_ID, the duplicate value might be either missing or
nonmissing.
If you only want to see the Number of Variable Levels table and not the individual frequency tables, you can add
the NOPRINT option to the TABLES statement.
6. In the editor, add the NOPRINT option and submit the code.
7. View the results. As you can see, the frequency tables have been suppressed.
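The programs for these steps aren't shown in this transcript. Based on the description, they might look like the following sketch.

```sas
/* ORDER=FREQ lists levels in descending frequency order, */
/* so any duplicated Employee_ID appears first.           */
proc freq data=orion.nonsales2 order=freq;
   tables Employee_ID;
run;

/* NLEVELS adds the Number of Variable Levels table.  */
/* NOPRINT suppresses the individual frequency tables. */
proc freq data=orion.nonsales2 nlevels;
   tables Gender Country Employee_ID / noprint;
run;
```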
Activity
Copy and paste this program into the editor and then submit it.
124
Trainee
For example, here are some WHERE expressions to validate the data.
The first expression selects observations in which the value of Gender is not F or M. The second expression selects
observations in which the value of Job_Title is null. The third expression selects observations in which the value of
Country is not uppercase AU or US. The fourth expression selects observations in which the value of Salary is not
between 24000 and 500000. The last expression selects observations in which the value of Employee_ID is missing.
Is there a way to test whether an Employee_ID value is unique, which is one of our data requirements? Yes there is.
From our PROC FREQ reports we know that Employee_ID 120108 has a frequency of 2. So, in our final WHERE
expression, we can select the observations in which the value of Employee_ID is equal to 120108.
Consider this. What would happen if you used the AND operator instead of the OR operator between these expressions?
SAS would select an observation only if every one of these conditions exists. In this example, no observations would be
selected from orion.nonsales2. You must use the OR operator.
1. Copy and paste the following PROC PRINT step into the editor to print all of the invalid data.
2. Submit the code and then check the log. The log shows that SAS read 15 observations from orion.nonsales2.
3. View the report. As you can see, all of the invalid data, or data that doesn't meet your data requirements, is in
one report. This makes the task of cleaning the data a lot easier. You now know exactly which observations to
correct.
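The WHERE expressions aren't shown in this transcript. A sketch that follows the description above might look like this. It assumes that Employee_ID is a numeric variable and that the salary range includes both endpoints.

```sas
/* Each condition flags one kind of invalid data. OR selects an  */
/* observation when at least one condition is true.              */
/* Assumption: Employee_ID is numeric.                           */
proc print data=orion.nonsales2;
   where Gender not in ('F','M') or
         Job_Title is null or
         Country not in ('AU','US') or
         Salary not between 24000 and 500000 or
         Employee_ID is missing or
         Employee_ID=120108;
run;
```

Because a missing numeric value is lower than any number, the Salary condition also selects observations where Salary is missing.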
Using the MEANS and UNIVARIATE Procedures
Suppose you need to analyze the salaries of Orion Star sales employees. For example, the payroll manager has asked you
for a report of the average salary of all employees. To create the output required, you can use PROC MEANS.
It automatically displays output in a report, and you can also save the output in a SAS data set. By default, PROC MEANS
reports the number of nonmissing values, the mean, the standard deviation, the minimum value, and the maximum
value of every numeric variable in a data set. You can use the VAR statement to identify which variables to use in the
analysis and to specify the order in which they appear in the results. By using additional statements and options in a PROC
MEANS step, you can create a more complex PROC MEANS report.
For your task, you want to analyze the variable Salary. You start with the PROC MEANS statement and list the input data
set, orion.sales. You add a VAR statement to list the analysis variables, which are the numeric variables for which
statistics are to be computed. You list Salary in your VAR statement.
1. Copy and paste the following PROC MEANS step into the editor. This code creates a report that displays the
default PROC MEANS statistics for the analysis variable Salary.
2. Submit the code and then check the log. Everything looks good. SAS read 165 observations from orion.sales.
3. View the report. As you can see, the mean salary at Orion Star is 31160.12.
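Based on the description above, the step might look like this minimal sketch.

```sas
/* Default statistics (N, mean, standard deviation, minimum, maximum) for Salary. */
proc means data=orion.sales;
   var Salary;
run;
```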
Business Scenario
You found the average salary at Orion Star. But now you want to see these statistics grouped by Gender and by Country.
In other words, you want to be able to see what the mean salary is for male employees in Australia, or the mean salary
of female employees in the US. Using PROC MEANS, you can create statistics for groups of observations.
Classification variables are character or numeric, and they typically have few discrete values. Your data set does not
need to be sorted or indexed by the class variables. In this example, PROC MEANS reports the statistics for Salary for
each combination of values—or each level—of the class variables Gender and Country. In your PROC MEANS output, the
class variables appear in the order that you list them in the CLASS statement.
1. Copy and paste the following PROC MEANS step into the editor.
2. Submit the step and view the report. Notice that this report has three more columns and more rows than the
basic PROC MEANS report you saw earlier. The table now has a column for each of the class variables, Gender
and Country.
PROC MEANS reports statistics for each class level, so there's a row for each combination of class variable
values. Can you tell what the mean salary for male employees in Australia is? It's 32001.39.
When you use the CLASS statement, PROC MEANS displays the additional statistic N Obs in your report.
Remember that the N statistic counts all nonmissing values of the analysis variable. However, N Obs reports the
number of observations with each unique combination of class variables, whether or not there are missing
values. If a data set has missing values, the values of N and N Obs for some levels are different.
Based on the values of N and N Obs, do you think this data set has missing values of Salary? For each level, the
values of N and N Obs are identical, so this data set does not have missing values.
1. Copy and paste the following PROC MEANS step into the editor. This step analyzes the variable Salary. You want
to see only the number of observations with nonmissing values and the mean for Salary. You use the keywords
N and MEAN in the PROC MEANS statement.
2. Submit the step and view the report. As you can see, the table includes only these two statistics. Now suppose
you want to see the minimum, maximum, and sum statistics for grouped data.
3. In the editor, add the CLASS statement with the variables Gender and Country, and then change the keywords
to MIN, MAX, and SUM in the PROC MEANS statement.
4. Submit this code and view the report. The report shows the new statistics for each combination of class
variables. For example, you can see that the minimum salary of all male employees in the US is 22710. The
report also now has an N Obs column due to the CLASS statement. MIN, MAX, and SUM are just three of the
many statistic keywords that are available for use in PROC MEANS.
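The two programs in these steps aren't shown in this transcript. Following the description, they might look like the following sketch.

```sas
/* Only N and MEAN for Salary. */
proc means data=orion.sales n mean;
   var Salary;
run;

/* MIN, MAX, and SUM for each combination of Gender and Country. */
proc means data=orion.sales min max sum;
   var Salary;
   class Gender Country;
run;
```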
To control the number of decimal places in the numeric values that appear in output, you can specify the MAXDEC=
option in the PROC MEANS statement. In the MAXDEC= option, number is an integer that specifies the maximum
number of decimal places.
MAXDEC=number
In this example, MAXDEC=0 specifies that SAS write numeric values with no decimal places.
The PROC MEANS statement shown here does not specify any statistical keywords, so the report displays the default
statistics.
proc means data=orion.sales;
var Salary;
class Gender Country;
run;
Suppose you don't want the N Obs column to appear in your report. In this example, the values of N Obs and N are
identical for each class level, which indicates that Salary has no missing values. To suppress the N Obs column, you can
specify the NONOBS option in the PROC MEANS statement. The N Obs column will no longer appear in PROC MEANS
output.
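As a sketch, both options go in the PROC MEANS statement. Combining them in one step is a choice made for this illustration. You could also use either one alone.

```sas
/* MAXDEC=0 writes values with no decimal places. */
/* NONOBS suppresses the N Obs column.            */
proc means data=orion.sales maxdec=0 nonobs;
   var Salary;
   class Gender Country;
run;
```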
These tables show other PROC MEANS statistics that are available.
The first category shows the descriptive statistic keywords, followed by the quantile statistic keywords, and then the
hypothesis testing keywords. If you'd like to see descriptions of these statistics, click the Information button in the
course interface.
Question
Which PROC MEANS step creates this output?
a.
b.
c.
The correct answer is a. The PROC MEANS statement specifies the four statistics that appear in the table. The NONOBS
option suppresses the N Obs statistic. The MAXDEC= option specifies two decimal places for the numeric values.
Code Challenge
Complete this PROC MEANS statement to produce the mean and sum for numeric variables in the data set
orion.nonsales. Limit the number of decimal places to two.
proc means
;
run;
You specify the data set orion.nonsales, followed by the statistic keywords MEAN and SUM. To change the default
number of decimal places to two, you add MAXDEC=2 as an option.
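Put together, the completed step described in the answer might look like this sketch. With no VAR statement, PROC MEANS analyzes every numeric variable in the data set.

```sas
proc means data=orion.nonsales mean sum maxdec=2;
run;
```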
To start, you know that N represents the number of observations with nonmissing values. If you know how many
observations are in the data set, this number could be helpful, but otherwise, it's not. Next, look at the mean salary
value. This doesn't help you either. The same is true for the standard deviation.
But, the values for minimum and maximum can be helpful. Does the minimum value of 2401 fall within the required
salary range of "24000 - 500000"? No, it doesn't. And does the maximum value of 433800 fall within the required salary
range? Yes, it does. Based on this report, you know that at least one observation falls below the required range. But you
don't know if any of the observations have missing values. You can specify the NMISS option to display the number of
observations with missing values.
1. Copy and paste the following PROC MEANS step into the editor. This step specifies N, NMISS, MIN, and MAX as
the descriptive statistics to display.
2. Submit the step and then check the log. The log shows that the code ran successfully. You can see that
orion.nonsales2 contains 235 observations.
3. Examine the results. Now the value of N is more meaningful. It tells you that 234 out of the 235 observations
have nonmissing values. Also, the value of 1 for N Miss reinforces this information. There's one observation with
a missing value for Salary.
Consider this. Given the minimum value, can you tell how many values are too low? No. Based on this report,
you don't know how many values are too low.
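The step isn't shown in this transcript. Based on the description, it might look like this sketch.

```sas
/* N and NMISS count nonmissing and missing Salary values.  */
/* MIN and MAX show whether any value is outside the range. */
proc means data=orion.nonsales2 n nmiss min max;
   var Salary;
run;
```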
PROC UNIVARIATE displays extreme values, missing values, and other statistics for the variables named in the VAR
statement. If you omit the VAR statement, PROC UNIVARIATE analyzes all numeric variables in the data set.
2. Submit the step and then check the log. The procedure ran without error.
3. Examine the results. Notice that PROC UNIVARIATE creates quite a bit of output. For validating data, you're most
interested in the Extreme Observations table. This table shows the five lowest and the five highest values of
Salary, by default. The Obs values indicate the observation number, not the count of observations with that
value. From this table, you can see that there are two observations with Salary values that fall below 24000.
None of the salaries are higher than 500000.
Suppose you'd like to see fewer than the default five values in the Extreme Observations table. To specify the
number of extreme observations that PROC UNIVARIATE lists, you can use the NEXTROBS= option in the PROC
UNIVARIATE statement.
4. In the editor, add the NEXTROBS= option to the PROC UNIVARIATE statement and set the value to 3.
5. Submit the code and view the report. As you can see, SAS displays only the three lowest and the three highest
Salary values now. Suppose you'd like to see the employee IDs that correspond to these observation numbers.
In other words, you want to know which Employee_ID has a Salary value of 2401. You can add the ID statement
to your program.
7. Submit the step and view the report. SAS displays the Employee_ID column in the table now, and you can easily
determine which employees have salaries that are below the required salary range.
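The final program isn't shown in this transcript. Based on the steps above, it might look like the following sketch.

```sas
/* NEXTROBS=3 lists the three lowest and three highest values. */
/* The ID statement adds Employee_ID to the Extreme            */
/* Observations table.                                         */
proc univariate data=orion.nonsales2 nextrobs=3;
   var Salary;
   id Employee_ID;
run;
```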
Question
PROC UNIVARIATE identified two observations with Salary values less than 24000. What procedure can you use to
display the observations containing the invalid values?
a. PROC MEANS
b. PROC PRINT
c. PROC FREQ
The correct answer is b. You can use PROC PRINT with a WHERE statement to print only those observations
where Salary is less than 24000.
proc print data=orion.nonsales2;
where Salary<24000;
run;
1. Copy and paste the following PROC MEANS step into the editor.
2. To create a PDF file of the results that can be viewed in Adobe Acrobat Reader, you just need to add two ODS
statements to the program. The first statement specifies the ODS destination type, and the file in which to store
the PDF content. The path that you specify in the FILE= option needs to include the full path to a location in your
operating environment, and the filename needs to include the appropriate extension for the specified
destination. In this example, we're specifying a PDF destination, and writing to a file named salaries.pdf in the
output folder on the C: drive.
3. The second statement closes the PDF destination, allowing you to open the file outside of the SAS environment
with a PDF reader such as Adobe Acrobat.
4. You can create HTML and RTF files just as easily, using the appropriate ODS destination and file extension. You
can also create CSV, HTML, and XML files that can be viewed as worksheets in Microsoft Excel. Modify the program
to create a CSV file using the CSVALL destination.
5. That's all there is to it! Now you can use Microsoft Excel to open the file. Once opened, you can save it in xls or
xlsx format.
6. You can have multiple destinations open, and execute multiple procedures. All generated output will be sent to
every open destination. You might not be able to view the file, or the most updated file, outside of SAS until you
close the destination.
7. In some SAS environments, you can select an output destination in the user interface without using ODS
statements, and then you can save the resulting output in a file. The Output Delivery System can be used in any
operating environment, with any version of SAS, to generate different types of files that can be opened in third-
party software applications.
8. Click the Information button to learn more about using the SAS Output Delivery System.
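The programs for this demonstration aren't shown in this transcript. A minimal sketch of the pattern might look like the following. The PROC MEANS step and the c:\output folder are assumptions for illustration. Use a path that exists and is writable in your own environment.

```sas
/* Open the PDF destination, run the procedure, then close the destination. */
/* Assumption: c:\output exists and is writable.                            */
ods pdf file="c:\output\salaries.pdf";
proc means data=orion.sales;
   var Salary;
run;
ods pdf close;

/* The same pattern with the CSVALL destination and a .csv extension. */
ods csvall file="c:\output\salaries.csv";
proc means data=orion.sales;
   var Salary;
run;
ods csvall close;
```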
Topic Summaries
Frequency distributions work best with variables whose values are categorical and best summarized by counts instead of
averages. Variables that have continuous numeric values, such as dollar amounts and dates, or many discrete values, can
result in a lengthy and meaningless frequency table. To create a useful frequency report for these variables, you can
apply a SAS or user-defined format to group the values into categories.
You can list multiple variables in a TABLES statement, separated by spaces. This creates a one-way frequency table for
each variable. You can request a separate analysis for each group by including a BY statement. You can request a two-
way frequency table by separating the variables with an asterisk instead of a space. The resulting crosstabulation table
displays statistics for each distinct combination of values of the selected variables. You can use TABLES statement
options to suppress statistics, change the table format, and format the displayed values.
When you use the CLASS statement, the output includes N Obs, which reports the number of observations with each
unique combination of class variables. You can request specific statistics by listing them as options in the PROC MEANS
statement. Other options are available to control the output.
You can also use PROC MEANS to validate a data set. The MIN, MAX, and NMISS statistics can be used to validate
numeric data when you know the range of valid values. PROC UNIVARIATE can be more useful because it displays the
extreme observations, or outliers. By default, it displays the five highest and five lowest values of the analysis variable,
and the number of the observation with each extreme value. You can use the NEXTROBS= option to display a different
number of extreme observations.
Sample Programs
proc format;
value Tiers low-25000='Tier1'
25000<-50000='Tier2'
50000<-100000='Tier3'
100000<-high='Tier4';
run;