Notes For SAS Programming Fall2009
Why SAS?
Able to process large data set(s)
Regression results
Most government agencies and many private-sector firms use SAS
Roadmap
Thinking in SAS
Basic rules
Read in data
Data cleaning commands
Summary statistics
Combine two or more datasets
Hypothesis testing
Regression
Thinking in SAS
What is a program?
Algorithm, recipe, set of instructions
Creating a program
What is your problem? (take project 3 as an example)
How can you find a solution?
What steps need to be taken to find an answer?
Do I need to read in data?
What variables do I need?
Where is the data?
What format is the data in?
How do I need to clean the data?
Are there outliers?
Are there any unexpected values in the data?
How do I need to transform the data?
Are the variables in the form that I need?
Variable names
<=32 characters in current SAS releases (version 7 and later)
<=8 characters in SAS 6 and earlier
case insensitive
DATA newdata;
  set proj3rawdata;                       /* use the data set called proj3rawdata in the temporary library */
  fracuninsured = uninsured/total;        /* define new variables */
  percentuninsured = fracuninsured*100;
run;
Flow of the DATA step, repeated for each observation (obs 1, obs 2, ..., obs n):
input data proj3rawdata -> define fracuninsured -> define percentuninsured -> output data newdata
PROC PRINT data=newdata;
  var fracuninsured percentuninsured;
  title 'print out new data';
run;   /* signals the end of the PROC step. It may be omitted if a DATA or PROC step follows */
The following several slides won't be covered in class, but you are welcome to use them on your own.
Save data
* Save in SAS format;
libname mylib "M:\";
data mylib.proj3rawdata3;
  set proj3rawdata3;
run;
* Export data to Excel;
proc export data=proj3rawdata3
  outfile="M:\proj3data-fromsas.xls"      /* no ; here */
  dbms=excel replace;
run;
You can also export a sas data file into a comma delimited text file if you write dbms=csv.
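As a rough illustration of the dbms=csv variant, the export step might look like the sketch below. The output file name proj3data-fromsas.csv is an assumption made for this example; only the dbms= value and the file extension differ from the Excel export.

```sas
/* Sketch: export the same data set as a comma delimited text file. */
/* The output file name is hypothetical.                             */
proc export data=proj3rawdata3
  outfile="M:\proj3data-fromsas.csv"
  dbms=csv replace;
run;
```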
They are not required, but you may find them useful in the future. We skip them in the regular class.
proc sort
proc sort data=proj3rawdata3;
  by year state;
run;
proc sort data=proj3rawdata3 out=proj3rawdata3_sorted;
  by year descending fracuninsured;
run;
* note that a missing value is always counted as the smallest;
proc freq
* Remember we already generated a variable called newgrp to indicate categories of fraction uninsured and a variable called popgrp to indicate categories of population size;
proc freq data=proj3rawdata3;
  tables newgrp            /* one-dimensional frequency table */
         popgrp
         newgrp*popgrp;    /* two-dimensional frequency table */
run;
The following page may be useful in practice, but I am not going to cover it in class.
merged:
year  state  totalpop  ...  fracuninsured  newgrp  popgrp  avguninsure  avgfracuninsured
2009  MA     6420947        0.0548         1       high    7500000      0.073

appended:
year  state  totalpop  ...  fracuninsured  newgrp  popgrp  avguninsure  avgfracuninsured
2009  MA     6420947        0.0548         1       high
.     .      .         .    .              1       high    7500000      0.073
append
data appended;
  set proj3rawdata3 summary1;
run;
proc print data=appended;
run;
proc print data=merged;
run;
Task2: generate average fracuninsured per state and merge it back to the main data
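One possible way to carry out this task is sketched below, assuming the main data set is proj3rawdata3 with the variables state and fracuninsured. The names stateavg, avgfracuninsured_state and merged_state are made up for this illustration.

```sas
/* Sketch only: stateavg, avgfracuninsured_state and merged_state are hypothetical names. */
proc sort data=proj3rawdata3;
  by state;
run;

/* One output row per state holding the mean of fracuninsured */
proc means data=proj3rawdata3 noprint;
  by state;
  var fracuninsured;
  output out=stateavg mean=avgfracuninsured_state;
run;

/* Merge the state averages back to every observation of that state */
data merged_state;
  merge proj3rawdata3 stateavg(drop=_type_ _freq_);
  by state;
run;
```

Both data sets must be sorted by state before the merge. Here proc means with a BY statement already writes stateavg in state order.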
Source format of Proj3rawdata3 (long):
year  state  totalpop  fracuninsured  ...
2009  MA     6420947   0.0548         ...
2009  HI     1257622   0.078          ...
...
2008  MA     6339513   0.0536         ...
2008  HI     1267409   0.075          ...

Target format (wide):
state  totalpop2009  fracuninsured2009  ...  totalpop2008
MA     6420947       0.0548             ...  6339513
HI     1257622       0.078              ...  1267409
focus on 2009
data subsample2009;
  set proj3rawdata3;
  if year=2009;
run;
data subsample2009;
  set subsample2009;
  rename totalpop=totalpop2009;
  rename insured=insured2009;
  rename uninsured=uninsured2009;
  rename fracuninsured=fracuninsured2009;
  drop year;
run;
data subsample2008;
  set subsample2008;
  rename totalpop=totalpop2008;
  rename insured=insured2008;
  rename uninsured=uninsured2008;
  rename fracuninsured=fracuninsured2008;
  drop year;
run;
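To reach the wide target format, the two renamed subsamples still have to be combined by state. A minimal sketch follows, assuming subsample2008 was first created from proj3rawdata3 with if year=2008, in the same way as subsample2009. The output name wide is hypothetical.

```sas
/* Sketch: combine the two one-year subsamples into one row per state. */
/* Assumes subsample2008 and subsample2009 exist as described above.   */
proc sort data=subsample2009;
  by state;
run;
proc sort data=subsample2008;
  by state;
run;

data wide;   /* hypothetical name for the wide data set */
  merge subsample2009 subsample2008;
  by state;
run;
```

Any variable that keeps the same name in both subsamples (for example newgrp or popgrp) takes its value from the data set listed last in the merge statement, so such variables should be renamed or dropped first if both years are needed.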
H0: mean of fracuninsured2008 = mean of fracuninsured2009. H1: mean of fracuninsured2008 not equal to mean of fracuninsured2009.
Step 1: focus on the subsample that has 2008 and 2009 data only.
Step 2: create a binary variable dummy2009=1 if year=2009, 0 if year=2008.
Step 3: the regression depends on whether 2008 and 2009 are independent samples or matched pairs.
If independent samples, regress fracuninsured as:
fracuninsured = a + b*dummy2009 + error
If matched pairs, regress fracuninsured as:
fracuninsured = a + b*dummy2009 + c1*dummy_AL + c2*dummy_AK + ... + c51*dummy_WY + error
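The three steps could be coded roughly as follows. This is a sketch under the assumption that proj3rawdata3 holds year, state and fracuninsured. The data set name sub0809 is invented here, and the state dummies are generated through a class statement in proc glm instead of being typed out one by one.

```sas
/* Steps 1 and 2: keep 2008 and 2009 only and create the year dummy. */
/* sub0809 is a hypothetical data set name.                          */
data sub0809;
  set proj3rawdata3;
  if year=2008 or year=2009;
  dummy2009 = (year=2009);   /* 1 if year is 2009, 0 otherwise */
run;

/* Step 3, independent samples */
proc reg data=sub0809;
  model fracuninsured = dummy2009;
run;
quit;

/* Step 3, matched pairs: add state fixed effects */
proc glm data=sub0809;
  class state;
  model fracuninsured = dummy2009 state / solution;
run;
quit;
```

In both regressions the t-statistic on dummy2009 tests H0 that the two years have the same mean.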
2. Be careful about one-tail and two-tail tests. The t-statistic and p-value that SAS reports for each coefficient are based on a two-tail test of H0: coeff=0. They can be used for a one-tail test if we compare alpha with p-value/2 instead of with the p-value, provided the estimated coefficient has the sign predicted by H1.
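As an illustration, the one-tail p-value can be computed from the standard two-tail output. The sketch below assumes the alternative H1: coefficient of totalpop > 0 in a regression of fracuninsured on totalpop. That choice of H1 is only an example, and the data set names est and onetail are hypothetical.

```sas
/* Sketch: turn the two-tail p-value from proc reg into a one-tail p-value. */
/* Assumed alternative hypothesis: the coefficient of totalpop is positive. */
ods output ParameterEstimates=est;   /* save the coefficient table */
proc reg data=proj3rawdata3;
  model fracuninsured = totalpop;
run;
quit;

data onetail;
  set est;
  if upcase(Variable) = 'TOTALPOP';
  /* Halve the p-value only when the estimate has the sign stated in H1 */
  if tValue > 0 then p_onetail = Probt/2;
  else p_onetail = 1 - Probt/2;
run;

proc print data=onetail;
  var Variable Estimate tValue Probt p_onetail;
run;
```

H0 is rejected at level alpha when p_onetail is below alpha.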
3. Comparison across more than two groups (as independent samples)
H0: all groups have the same mean: F-test of whole regression
OR
H0: group x and group y have the same mean: waller or lsd statistics
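A sketch of how both tests might be requested in proc glm, assuming the groups are the years in proj3rawdata3. The choice of year as the grouping variable is made only for this example.

```sas
/* Sketch: compare mean fracuninsured across years treated as independent samples. */
proc glm data=proj3rawdata3;
  class year;
  model fracuninsured = year;   /* the overall F-test covers H0: all years have the same mean */
  means year / waller lsd;      /* pairwise comparisons between years */
run;
quit;
```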
4. Comparison across more than two groups (as matched pairs) requires specific tests on regression coefficients.
H0: 2003 = 2004: test the coefficient of dummy2004 = 0, because 2003 is set as the benchmark.
H0: 2004 = 2005: test the coefficient of dummy2004 = coefficient of dummy2005.
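These coefficient tests can be requested with the test statement of proc reg. The sketch below assumes the data cover 2003 to 2009 and that the year dummies are created by hand. The data set name withdummies is invented, and the state dummies needed for a true matched-pairs design are left out to keep the example short.

```sas
/* Sketch: year dummies with 2003 as the benchmark year.          */
/* State dummies are omitted here. Add them to the model statement */
/* for a matched-pairs comparison.                                 */
data withdummies;
  set proj3rawdata3;
  dummy2004 = (year=2004);
  dummy2005 = (year=2005);
  dummy2006 = (year=2006);
  dummy2007 = (year=2007);
  dummy2008 = (year=2008);
  dummy2009 = (year=2009);
run;

proc reg data=withdummies;
  model fracuninsured = dummy2004 dummy2005 dummy2006 dummy2007 dummy2008 dummy2009;
  test dummy2004 = 0;           /* H0: 2003 = 2004 */
  test dummy2004 = dummy2005;   /* H0: 2004 = 2005 */
run;
quit;
```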
regression in SAS
Question: how does fracuninsured vary with the total population of a state?
* model: fracuninsured=a+b*totalpop+error;
proc reg data=proj3rawdata3;
  model fracuninsured=totalpop;
run;
* Add year fixed effects;
* Model: fracuninsured=a+b*totalpop+c1*dummy2004+c2*dummy2005+ ... +c6*dummy2009+error;
proc glm data=proj3rawdata3;
  class year;
  model fracuninsured=totalpop year/solution;
run;
A comprehensive example
A review of 1. readin data 2. summary statistics 3. mean comparison 4. regression
reg-cityreg-simple.sas in N:\share\
Nov. 16-18, 1997: CBS 2 News airs "Behind the Kitchen Door"
January 16, 1998: LA county inspectors start issuing hygiene grade cards
A grade if score of 90 to 100
B grade if score of 80 to 89
C grade if score of 70 to 79
score below 70: actual score shown
Grade cards are displayed in restaurant windows
regulation (by county, by city) -> hygiene scores
Data complications
(blue font indicates our final choices)
Unit of analysis: individual restaurant? city? zipcode? census tract?
Unit of time: each inspection? per month? per quarter? per year?
Define information: county regulation? city regulation? the date of passing the regulation? days since passing the regulation? % of days under regulation?
Define quality: average hygiene score? the number of A restaurants? % of A restaurants?
real test
reg-cityreg-simple.sas in N:\share\
Questions
How many observations in the sample?
log of the first data step, or output from proc contents
How many variables in the sample? How many are numerical, how many are characters?
Output from proc contents
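For reference, a proc contents call looks like the sketch below. The data set name restaurants is a placeholder, since the actual name is defined in reg-cityreg-simple.sas.

```sas
/* Sketch: list the number of observations and the name and type of every variable. */
/* restaurants is a hypothetical data set name.                                     */
proc contents data=restaurants;
run;
```

The header of the output reports the number of observations and variables, and the variable list shows whether each one is Num or Char.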
Questions
What is the difference between cityreg and ctyreg? We know county regulation came earlier than city regulation. Is that reflected in our data?
Yes, cityreg<=ctyreg in every observation.
We can check this in proc means for cityreg and ctyreg, or add a proc print to eyeball each obs.
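Instead of eyeballing every observation, one could print only the observations that violate the inequality. This is a sketch; the data set names restaurants and violations are placeholders.

```sas
/* Sketch: keep only observations where cityreg exceeds ctyreg. */
/* restaurants and violations are hypothetical names.           */
data violations;
  set restaurants;
  if cityreg > ctyreg;
run;

proc print data=violations;
  var cityreg ctyreg;
run;
```

If the log reports that violations has 0 observations, the inequality holds throughout the sample.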
What is the difference between cityreg and citymper? What is the mean of cityreg? What is the mean of citymper? Are they consistent with their definitions?
The unit of cityreg is # of days, so it should be a non-negative integer.
The unit of citymper is % of days, so it should be a real number between 0 and 1.
To check this, we can add a proc means for cityreg and citymper.
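Such a check might look like the following sketch, where restaurants again stands in for the actual data set name.

```sas
/* Sketch: compare the range of each variable with its definition. */
/* restaurants is a hypothetical data set name.                    */
proc means data=restaurants n mean min max;
  var cityreg citymper;
run;
```

The min of cityreg should be at least 0, and both the min and max of citymper should lie between 0 and 1.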
Questions
Economic theory suggests that quality should be higher after the regulation if regulation gives consumers better information. Is that true?
The summary statistics reported in proc means (class citym_g or ctym_g) show the average percentage of A restaurants in different regulation environments.
Rigorous mean comparison tests are done in proc glm with waller or lsd options.
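The grouped summary statistics could be produced along these lines. The data set name restaurants and the variable name pct_a (percentage of A restaurants) are assumptions made for this sketch; citym_g and ctym_g are the grouping variables named above.

```sas
/* Sketch: average percentage of A restaurants by regulation group. */
/* restaurants and pct_a are hypothetical names.                    */
proc means data=restaurants n mean;
  class citym_g;
  var pct_a;
run;

proc means data=restaurants n mean;
  class ctym_g;
  var pct_a;
run;
```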
Questions
Summary statistics often reflect many economic factors, not only the one in our mind. That is why we need regressions.
Does more regulation lead to higher quality?
Is the coefficient of city regulation positive and significantly different from zero? (proc reg)
Is the coefficient of county regulation positive and significantly different from zero? (proc reg)
Do we omit other sensible explanations for quality changes? What are they? (proc glm, year, quarter, city)
course evaluation
University wide: www.CourseEvalUM.umd.edu
TTclass in particular (password plstt): www.surveyshare.com/survey/take/?sid=81087