Advanced Analytics Using SAS

The document summarizes various SAS procedures for fundamental and advanced analytics including PROC MEANS, PROC UNIVARIATE, PROC FREQ, PROC CORR, PROC REG, and PROC SQL. These procedures allow for descriptive statistics, frequency tables, correlation analysis, linear regression modeling, and querying SAS data using SQL syntax.

Uploaded by

Arjun Khosla

Fundamental & Advanced Analytics Using SAS

SAS Procedures for Fundamental & Advanced Analytics
› PROC MEANS
› PROC UNIVARIATE
› PROC FREQ
› PROC CORR
› PROC REG
› PROC SQL

PROC MEANS
› Prints descriptive statistics.
› Without any options, prints for all numeric variables in the data set:
– number of non-missing observations,
– mean,
– standard deviation,
– minimum and
– maximum.
› With options, presents only the requested measures.
› Computing statistics for each value of a BY variable:
– Prerequisite: the data set must be sorted on the BY variable.
› CLASS:
– Substitute for BY,
– Sorted data set not needed.

SYNTAX
PROC MEANS data=XYZ;
RUN;

PROC MEANS Options

Option      Description
N           Number of non-missing observations used to compute the statistics
NMISS       Number of missing observations
MEAN        The mean
STD         The standard deviation
CV          The coefficient of variation
CLM         The 95% confidence interval for the mean
STDERR      The standard error
MIN         The minimum value
MAX         The maximum value
MEDIAN      The median
MAXDEC=n    The maximum number of decimal places in all table values
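
As a small sketch of how several of these options combine, the step below requests a handful of statistics for one variable and limits the printed decimals. The choice of SASHELP.SHOES and the Sales variable is only for illustration.

```sas
/* Illustrative sketch: data set and variable are chosen for the example */
proc means data=sashelp.shoes n nmiss mean median cv clm maxdec=2;
    var sales;   /* statistics are computed for Sales only */
run;
```

Listing any statistic keyword replaces the default set, so only the requested measures appear in the output.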


Example Code

/* BY processing: the data set is sorted on the BY variable first */
PROC SORT DATA=SASHELP.SHOES OUT=Sorted_Shoes;
    BY REGION;
RUN;
PROC MEANS DATA=Sorted_Shoes n nmiss mean std;
    BY REGION;   /* must match the variable used in PROC SORT */
    VAR STORES SALES INVENTORY;
RUN;

/* CLASS processing: no sorting needed */
PROC MEANS DATA=SASHELP.SHOES n nmiss mean std;
    CLASS REGION;
    VAR STORES SALES INVENTORY;
RUN;

Including Multiple CLASS Variables
PROC MEANS DATA=SASHELP.SHOES n nmiss mean std;
    CLASS REGION PRODUCT;
    VAR STORES SALES INVENTORY;
RUN;

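PRINTALLTYPES
› Used with multiple CLASS variables.
› Outputs statistics for every combination of the CLASS variables: overall, by each CLASS variable on its own, and by each REGION and PRODUCT pair. Without it, only the full cross-classification is printed.

Sample Code:

PROC MEANS DATA=SASHELP.SHOES n nmiss mean std PRINTALLTYPES;
    CLASS REGION PRODUCT;
    VAR STORES SALES INVENTORY;
RUN;
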
PROC UNIVARIATE
› Similar to PROC MEANS.
› Also produces histograms & probability plots.
› Options:
› Histogram: generates a histogram of all variables on the VAR statement.
› Qqplot: produces a quantile-quantile plot to determine deviations from normality.
– Option NORMAL draws a straight line representing what a normal distribution would look like on the plot.
– mu (mean) and sigma (standard deviation) define the theoretical normal plot.
– The value est requests that these be estimated from the data.

Syntax:
PROC UNIVARIATE DATA=XYZ;
    ID Var;
    VAR v1 v2 v3;
    HISTOGRAM;
    QQPLOT / normal(mu=est sigma=est);
RUN;
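
The syntax above can be tried on a shipped sample data set. The following is a minimal sketch, assuming SASHELP.SHOES and its Sales variable as the example input; the NORMAL option on the HISTOGRAM statement is an addition not shown in the slide and overlays a fitted normal curve.

```sas
/* Illustrative sketch: SASHELP.SHOES and Sales are assumptions for the example */
proc univariate data=sashelp.shoes;
    var sales;
    histogram sales / normal;                 /* histogram with fitted normal density */
    qqplot sales / normal(mu=est sigma=est);  /* reference line from estimated mean and std dev */
run;
```

Points that bend away from the reference line in the Q-Q plot indicate departures from normality.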

PROC FREQ
Generates frequency tables:
• One-way,
• Two-way and
• Three-way

One way Frequency Tables
PROC FREQ data=SASHELP.SHOES;
    Tables Region product;
Run;

Option: NOCUM: Eliminates cumulative frequencies
PROC FREQ data=SASHELP.SHOES;
    Tables Region product / nocum;
Run;

Two way & Three Way Frequency Tables
PROC FREQ data=SASHELP.SHOES;
    Tables REGION * product;           /* Two way: Region as rows and Product as columns */
    Tables REGION * product * Sales;   /* Three way */
Run;

Option: Chisq: Chi square tables added in output
PROC FREQ data=SASHELP.SHOES;
    Tables REGION * product / chisq;
Run;

PROC CORR
› Correlation analysis of all numeric variables in the data set with each other.
› If variables are specified, then only their correlations with each other are presented.

Syntax
PROC CORR Data=XYZ;
RUN;

PROC CORR Data=XYZ;
    Var v1 v2 v3…;
RUN;
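
To make the second form concrete, here is a minimal sketch that restricts the analysis to three numeric variables. SASHELP.SHOES and the variables chosen are assumptions made for illustration.

```sas
/* Illustrative sketch: correlations among three numeric variables of SASHELP.SHOES */
proc corr data=sashelp.shoes;
    var stores sales inventory;
run;
```

Dropping the VAR statement would instead correlate every numeric variable in the data set with every other.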

PROC REG
› Models the relationship between a scalar dependent variable and one or more explanatory variables.
› Syntax 1 for Simple Linear Regression
› Syntax 2 for Multiple Linear Regression

Syntax 1:
PROC REG Data=XYZ;
    MODEL Var1=Var2;
RUN;

Syntax 2:
PROC REG Data=XYZ;
    MODEL Var1=Var2 Var3 Var4 …;
RUN;

Options: OUT, RESIDUAL, P
OUTPUT OUT=res RESIDUAL=resid P=pred;
› OUT: sends output to a new data set instead of the screen
› RESIDUAL: residual value
› P: predicted value

MODEL statement options:
› clm prints 95% confidence intervals for the mean of each observation
› cli prints 95% prediction intervals

PROC SQL
› SAS supports SQL extensively: SQL queries can be used inside SAS programs.
› Most of the ANSI SQL syntax is supported.
› PROC SQL is used to process the SQL statements.
› This procedure can
– give back the result of an SQL query,
– create SAS tables & variables.

Syntax
PROC SQL;
    SELECT Columns
    FROM Table
    WHERE Condition
    GROUP BY Columns
    ;
QUIT;

Running SQL Commands

CREATING TABLES
PROC SQL;
    CREATE TABLE EMPLOYEES AS
    SELECT * FROM TEMP;
QUIT;
PROC PRINT data = EMPLOYEES;
RUN;

READING DATA
PROC SQL;
    SELECT make, model, type, invoice, horsepower
    FROM SASHELP.CARS;
QUIT;

WHERE CLAUSE
PROC SQL;
    SELECT make, model, type, invoice, horsepower
    FROM SASHELP.CARS
    WHERE make = 'Audi' AND type = 'Sports';
QUIT;

UPDATING DATA
PROC SQL;
    UPDATE EMPLOYEES2 SET SALARY=SALARY*1.25;
QUIT;
PROC PRINT data = EMPLOYEES2; RUN;

DELETING DATA
PROC SQL;
    DELETE FROM EMPLOYEES2 WHERE SALARY > 900;
QUIT;
PROC PRINT data = EMPLOYEES2; RUN;

INTCK Function
› Counts the number of intervals between two dates or times.
› SYNTAX:
INTCK('Interval', From, To)

› INTERVAL may be
– 'YEAR', 'SEMIYEAR',
– 'MONTH', 'SEMIMONTH', 'QTR',
– 'DAY', 'WEEKDAY', 'TENDAY'.
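
A short sketch may help show what "number of intervals" means in practice. By default INTCK counts interval boundaries crossed rather than elapsed time, so two dates one day apart can still be one year apart. The dates and variable names below are made up for illustration.

```sas
/* Illustrative sketch: default INTCK counts boundaries crossed */
data _null_;
    d1 = '31DEC2020'd;
    d2 = '01JAN2021'd;
    years_a  = intck('YEAR',  d1, d2);   /* 1: one January 1 boundary is crossed */
    months_a = intck('MONTH', d1, d2);   /* 1: one month boundary is crossed */
    days_a   = intck('DAY',   d1, d2);   /* 1 */
    years_b  = intck('YEAR', '01JAN2020'd, '31DEC2020'd);   /* 0: same calendar year */
    put years_a= months_a= days_a= years_b=;
run;
```

The results are written to the SAS log by the PUT statement.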

PROC SURVEYSELECT
› Selects a sample from a population.
› Options:
› OUT=
– Output data set that contains the sample.
› METHOD=
– Sample selection method.
– Default is simple random sampling (METHOD=SRS) with no SIZE statement.
– With a SIZE statement, the default is probability proportional to size without replacement (METHOD=PPS).
› SAMPSIZE= number for sample size
› STRATA partitions the input data set into nonoverlapping groups
› ID lists variables from the input data set to be included in the output data set; otherwise all variables are included

Syntax:
PROC SURVEYSELECT Data=XYZ <options>;
    STRATA variables;
    CONTROL variables;
    SIZE variable;
    ID variables;
RUN;
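
The options above can be illustrated with two small sketches: a simple random sample and a stratified one. The input SASHELP.SHOES, the output data set names, the sample sizes and the SEED= value are all assumptions chosen for the example; SEED= is not covered in the slide and is used here only to make the draw repeatable.

```sas
/* Illustrative sketch 1: simple random sample of 50 rows */
proc surveyselect data=sashelp.shoes
                  out=work.shoes_sample
                  method=srs
                  sampsize=50
                  seed=12345;
run;

/* Illustrative sketch 2: 5 rows per Region; input is sorted by the STRATA variable first */
proc sort data=sashelp.shoes out=work.shoes_sorted;
    by region;
run;

proc surveyselect data=work.shoes_sorted
                  out=work.shoes_strat
                  method=srs
                  sampsize=5
                  seed=12345;
    strata region;
run;
```

With a STRATA statement, SAMPSIZE= applies to each stratum, so the second step returns five rows for every region.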
