SAS Notes Module 4 - Categorical Data Analysis: Testing Association Between Categorical Variables
1) We want to test the association between gender and survival (0 = Not survived, 1 = Survived).
Do males and females have different chances of survival?
Step 2: -
Click on Data in the left section. Select the variables (Survival and Gender) from the
“Variables to assign” section and transfer them to the analysis variables under the “Task Roles”
section, either by dragging or by clicking the right arrow button and selecting the “Table
Variables” option from the pop-up menu.
Step 3:-
Select the Tables option in the left section. Select the variable Gender from the “Variables
permitted in table” section and transfer it to the “Preview” section, either by dragging or by
clicking the right arrow button and selecting Column. This assigns Gender to the columns of
the table.
Now select the variable Survival from the “Variables permitted in table” section and
transfer it to the “Preview” section, either by dragging or by clicking the right arrow button and
selecting Row. This assigns Survival to the rows of the table.
After both variables have been assigned, the resulting table will look like the table given below:
Step 4:-
Select the Cell Statistics option in the left section. In addition to the default Column
percentages and Cell frequencies, you can select other options such as Missing value frequencies
and Expected cell frequencies by checking the boxes in front of them.
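The two cell statistics named above can be reproduced by hand. The following is a minimal Python sketch, not part of the Enterprise Guide task; it assumes the Titanic counts quoted in the odds ratio example later in these notes (339 of 466 females and 161 of 843 males survived), and the dictionary layout is purely illustrative.

```python
# Counts assumed from the worked example in these notes.
table = {
    "Not survived (0)": {"Female": 127, "Male": 682},
    "Survived (1)": {"Female": 339, "Male": 161},
}
columns = ["Female", "Male"]

col_totals = {c: sum(row[c] for row in table.values()) for c in columns}
row_totals = {r: sum(row.values()) for r, row in table.items()}
n = sum(row_totals.values())

# Column percentage: the cell count as a share of its column (gender) total.
column_pct = {
    r: {c: 100 * row[c] / col_totals[c] for c in columns}
    for r, row in table.items()
}

# Expected cell frequency if gender and survival were unrelated:
# (row total * column total) / grand total.
expected = {
    r: {c: row_totals[r] * col_totals[c] / n for c in columns}
    for r in table
}

for r in table:
    for c in columns:
        print(r, c, table[r][c], round(column_pct[r][c], 2), round(expected[r][c], 1))
```

A large gap between the observed and expected counts in a cell is what the chi-square test in the next step picks up.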
Step 5: -
Select Association under the Table Statistics group in the left section. Check the boxes for
Chi-square tests and for Relative risk for 2 x 2 tables (this option is to be used only for a
2 x 2 table), then click on Run.
Chi-Square Tests
Chi-square tests and the corresponding p-values
o determine whether an association exists
o do not measure the strength of an association
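Cramer’s V:
Cramer’s V indicates the strength of association. For a 2 x 2 table it takes a value between
-1 and +1; for tables larger than 2 x 2 it is always non-negative (between 0 and 1). The closer
its magnitude is to 1, the stronger the association.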
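To make the distinction between testing for an association and measuring its strength concrete, here is a hedged Python sketch that computes the Pearson chi-square statistic, its p-value and Cramer’s V by hand. It assumes the gender-by-survival counts used in the odds ratio example of these notes; the function names are illustrative, and the formula used for V is the non-negative form, so it does not reproduce the sign that can appear for 2 x 2 tables.

```python
import math

# Counts assumed from the worked example in these notes:
# 339 of 466 females and 161 of 843 males survived.
table = [
    [127, 682],  # Survival = 0 (female, male)
    [339, 161],  # Survival = 1 (female, male)
]

def chi_square(table):
    """Return the Pearson chi-square statistic and the total count."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, n

def cramers_v(table):
    """Non-negative Cramer's V: sqrt(chi2 / (n * min(rows - 1, cols - 1)))."""
    stat, n = chi_square(table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(stat / (n * k))

chi2, n = chi_square(table)
# With 1 degree of freedom (a 2 x 2 table), the upper-tail p-value
# of the chi-square distribution equals erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))
v = cramers_v(table)
print(round(chi2, 2), p_value, round(v, 3))
```

The p-value answers whether an association exists; V, which is scaled by the sample size, gives an indication of how strong it is.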
Odds Ratio: -
An odds ratio indicates how much more likely, with respect to odds, a certain event occurs
in one group relative to its occurrence in another group.
Example: How do the odds of males surviving compare to those of females?
Odds = p_event / (1 - p_event)
e.g. Out of 466 females travelling on the Titanic, 339 survived. The probability of a female
surviving is p(s) = 339/466 = 0.7275, or 72.75 percent. Therefore, the probability of a female not
surviving is 1 - p(s) = 1 - 0.7275 = 0.2725, or 27.25 percent. Hence, the odds of a female surviving =
p(s)/{1 - p(s)} = 0.7275/0.2725 = 2.67. Similarly, the odds of a male surviving (161 males
survived out of 843 who travelled) are 0.191/0.809 = 0.236. The odds ratio of females surviving to
males surviving indicates how much more likely, with respect to odds, females were to survive
compared with males. Thus the odds ratio of females surviving to males surviving =
2.67/0.236 = 11.31. The odds of a female surviving are 11.31 times the odds of a male
surviving.
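The arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation using the counts quoted in the example; the helper name odds is an illustrative choice.

```python
def odds(events, total):
    """Odds of an event: p / (1 - p), where p = events / total."""
    p = events / total
    return p / (1 - p)

female_odds = odds(339, 466)  # 339 of 466 females survived
male_odds = odds(161, 843)    # 161 of 843 males survived
odds_ratio = female_odds / male_odds

print(round(female_odds, 2), round(male_odds, 3), round(odds_ratio, 2))
# prints: 2.67 0.236 11.31
```

An odds ratio of 1 would mean that the odds of survival are the same in both groups.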
Logistic Regression
1. Logistic regression is used when the dependent variable is categorical (in our examples it
is binary, i.e. it can take only two values, such as survived/died, yes/no or up/down).
2. Unlike linear regression:
In logistic regression, the method of least squares is not used to fit the
regression line.
Regression line equation takes a different form. Select logistic regression as below.
o Logit(Y) = mx + c
o Logit is the log of the odds for Y = 1
o Y can take the value 1 or 0 (binary variable)
o If the probability of Y = 1 is P, then the odds (for Y = 1) are P/(1-P)
The method used to fit the line is called the maximum likelihood method. In this method, the
probability of obtaining the observed Y values for all the observations is maximized.
Each observation in a sample is independent of the others. Hence the combined
probability of all observations matching their observed values is equal to the
product of the probabilities of the individual observations. This is called the likelihood function.
The log of this value is called the log-likelihood function.
The log-likelihood plays a role similar to that of R square in linear regression.
What is actually reported is minus two times the log-likelihood (-2logL). Hence
o For R square, the higher the better
o For -2logL, the lower the better
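The likelihood idea can be illustrated with a short Python sketch. It assumes a single input x and the equation Logit(Y) = mx + c; the data and the two candidate (m, c) pairs are made up for illustration and are not taken from the Titanic dataset. Because the log of a product is the sum of the logs, the log-likelihood is accumulated as a sum over observations.

```python
import math

def predicted_probability(x, m, c):
    """Invert Logit(Y) = m*x + c to get the probability that Y = 1."""
    return 1.0 / (1.0 + math.exp(-(m * x + c)))

def minus_two_log_l(xs, ys, m, c):
    """-2 times the log-likelihood of the observed ys under the line (m, c)."""
    log_l = 0.0
    for x, y in zip(xs, ys):
        p = predicted_probability(x, m, c)
        # Probability the model assigns to the value that was actually observed.
        log_l += math.log(p if y == 1 else 1.0 - p)
    return -2.0 * log_l

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 1, 0, 1, 1]

flat = minus_two_log_l(xs, ys, m=0.0, c=0.0)     # every prediction is 0.5
sloped = minus_two_log_l(xs, ys, m=1.0, c=-3.5)  # probability rises with x
print(round(flat, 3), round(sloped, 3))
```

The sloped line gives the lower -2logL, so it is the better of the two candidates; maximum likelihood fitting searches for the m and c that make this value as small as possible.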
The more input variables there are:
o The higher the R square in the case of linear regression
o The lower the -2logL (lower because of the minus sign)
In linear regression, adjusted R square is used as an improved measure.
Adjusted R square is considered a better measure than R square because it
adjusts for the number of input variables.
Similarly, for logistic regression the Akaike Information Criterion (AIC) was
developed as an improved measure over -2logL.
The lower the AIC, the better the model fit.
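As a rough illustration of how AIC adjusts for the number of inputs, the sketch below uses the standard definition AIC = -2logL + 2k, where k is the number of estimated parameters (including the intercept). The -2logL values and parameter counts are hypothetical and chosen only to show the effect of the penalty.

```python
def aic(minus_two_log_l, n_params):
    """Akaike Information Criterion: -2logL plus a penalty of 2 per parameter."""
    return minus_two_log_l + 2 * n_params

# Hypothetical models: the larger one fits slightly better on -2logL alone.
small_model = aic(1200.0, n_params=3)
large_model = aic(1198.5, n_params=6)

print(small_model, large_model)
# prints: 1206.0 1210.5
```

Here the three extra parameters lower -2logL by only 1.5, which does not cover the added penalty of 6, so the smaller model has the lower AIC.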
Another, more intuitive set of measures relates to “pairs”.
A pair consists of one observation with an observed value of Y = 1 and a second
observation with an observed value of Y = 0. Let us call the first observation O+
and the second one O-.
Logistic model will lead to predicting probability of Y=1 for each observation. Let
us say the predicted value of probability for the first observation is P+ and for the
second one is P-.
o If P+ > P- then the pair is concordant.
o If P+ < P- then the pair is discordant.
o If P+ = P- then the pair is tied.
Enterprise Guide (EG) reports percent concordant. The higher the percent concordant, the
better the model.
EG also reports other statistics along similar lines, including the c statistic and
Somers’ D.
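The pair-based measures can be sketched in Python as follows. The observed values and predicted probabilities are hypothetical, and the sketch treats two probabilities as tied only when they are exactly equal; statistical software may instead treat very close probabilities as tied, so reported figures can differ slightly. The formulas used for the c statistic and Somers’ D are the usual pair-based definitions.

```python
def pair_statistics(y, p):
    """Compare every (Y=1, Y=0) pair of observations on predicted probability."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]     # the P+ values
    nonevents = [pi for yi, pi in zip(y, p) if yi == 0]  # the P- values
    concordant = discordant = tied = 0
    for p_plus in events:
        for p_minus in nonevents:
            if p_plus > p_minus:
                concordant += 1
            elif p_plus < p_minus:
                discordant += 1
            else:
                tied += 1
    pairs = len(events) * len(nonevents)
    return {
        "pairs": pairs,
        "percent_concordant": 100 * concordant / pairs,
        "percent_discordant": 100 * discordant / pairs,
        "percent_tied": 100 * tied / pairs,
        # c counts a tied pair as half concordant.
        "c": (concordant + 0.5 * tied) / pairs,
        "somers_d": (concordant - discordant) / pairs,
    }

# Hypothetical observed outcomes and model predictions.
y = [1, 1, 1, 0, 0]
p = [0.9, 0.6, 0.3, 0.4, 0.3]
stats = pair_statistics(y, p)
print(stats["pairs"], stats["c"], stats["somers_d"])
# prints: 6 0.75 0.5
```

With three observations where Y = 1 and two where Y = 0 there are six pairs: four concordant, one discordant and one tied.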
Logistic Regression Task: -
Problem: - The objective is to develop a model that classifies a person into either the survived
group or the not survived group, based on inputs such as the person’s age, the class in which
the person is travelling and the person’s gender.
Step 1:- Import “Titanic” from dataset_for_sas. In step 3 of the import wizard, ensure that
“Age” is converted from the String data type to the Number data type.
Step 4:- Select Fit Model to Level 1 in response.
Ensure all variables chosen as classification as well as quantitative are shown as effects. Note
that each time input variables are changed, effects have to be re-entered.
Step 5:- Click factorial to select all effects including combinations.
Step 7:- Uncheck Plots.
Output Screen 1: Note Number of observations and response profile
Output Screen 2 : Note Model Fit Statistics such as AIC, SC and -2logL. Also, note Likelihood ratio
Chi square for Testing Global Null Hypothesis. (This is similar to the F value testing the
significance of linear regression)
Note that the values under the column “Intercept and Covariates” are relevant.
Output Screen 3 :
Note the Wald chi-square for each effect. (This is similar to the t value for each parameter
estimate in linear regression.)
Output Screen 4 :
Note percent concordant, percent discordant, percent tied, number of pairs, the c statistic and
Somers’ D.