
Sas Notes Module 4-Categorical Data Analysis Testing Association Between Categorical Variables

1) The document describes steps to test the association between categorical variables using chi-square tests in SAS. It analyzes the relationship between gender and survival using Titanic passenger data. 2) Chi-square tests are conducted and odds ratios calculated to determine if there is a significant association between gender and survival. The results show that females had significantly higher odds of survival than males. 3) Logistic regression is then performed to predict survival based on gender, age, and class. Model fit statistics like AIC are examined to evaluate model performance, and concordance measures indicate how well the model distinguishes between survived and non-survived passengers.

Uploaded by

NISHITA MALPANI

SAS NOTES

Module 4- Categorical Data Analysis

Testing association between categorical variables: -

e.g. (import worksheet Titanic from the excel workbook dataset_for_sas)

1) We want to test the association between gender and survival (0 = Not survived, 1 = Survived).
Do males and females have different chances of survival?

Step 1:- Formulating hypothesis.

H0: There is no association between gender and survival (the two variables are independent)

H1: There is a significant association between gender and survival

Step 2:- Select Describe -> Table Analysis.
Click on Data in the left section. Select the variables (Survival and Gender) from the
“Variables to assign” section and transfer them to the “Task Roles” section by dragging,
or by clicking on the right arrow button and selecting the “Table Variables” option
from the pop-up menu.

Step 3:-
Select Tables option in the left section. Select the variable Gender from “Variables
permitted in table” section and transfer it to “Preview” section by dragging or by
clicking on right arrow button and selecting column. This assigns Gender to column of
the Table.

Now, Select the variable Survival from “Variables permitted in table” section and
transfer it to “Preview” section by dragging or by clicking on right arrow button and
selecting Row. This assigns Survival to row of the Table.

The resultant table after assigning both the variables will look like the table given below:

Step 4:-
Select the Cell Statistics option in the left section. In addition to the default Column
percentages and Cell frequencies, you can select other options such as Missing value
frequencies and Expected cell frequencies by checking the boxes in front of them.

Step 5: -
Select Association under the Table Statistics group in the left section. Check Chi-square
tests and Relative risk for 2 x 2 tables (the latter applies only to 2 x 2 tables), then
click on Run.

Step 6: - Interpreting the result


1) Check the p-value for the Chi-Square test, since we are testing association between two
nominal variables, gender and survival. Since this p-value is less than 0.05, we reject
the null hypothesis, i.e. we say that there is a significant association between gender
and survival. If we have to test association between ordinal variables, e.g. whether
there is a significant association between survival and class, then we interpret
the p-value for the Mantel-Haenszel chi-square instead.

Chi-Square Tests
Chi-square tests and the corresponding p-values
o determine whether an association exists
o do not measure the strength of an association
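
As a rough cross-check of what the Table Analysis task reports, the Pearson chi-square can be computed by hand. The sketch below is not SAS output; it assumes the 2 x 2 counts quoted in the odds-ratio example of these notes (339 of 466 females and 161 of 843 males survived) and uses only the Python standard library. The variable names are illustrative.

```python
import math

# Observed 2 x 2 counts, taken from the odds-ratio example in these notes:
# 339 of 466 females survived, 161 of 843 males survived.
observed = {
    ("female", "survived"): 339,
    ("female", "not survived"): 466 - 339,
    ("male", "survived"): 161,
    ("male", "not survived"): 843 - 161,
}

genders = ["female", "male"]
outcomes = ["survived", "not survived"]

n = sum(observed.values())
row_totals = {g: sum(observed[(g, o)] for o in outcomes) for g in genders}
col_totals = {o: sum(observed[(g, o)] for g in genders) for o in outcomes}

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / n under independence (H0).
chi_square = 0.0
for g in genders:
    for o in outcomes:
        expected = row_totals[g] * col_totals[o] / n
        chi_square += (observed[(g, o)] - expected) ** 2 / expected

# A 2 x 2 table has 1 degree of freedom; for 1 df the upper-tail p-value
# can be written with the complementary error function.
p_value = math.erfc(math.sqrt(chi_square / 2.0))

print(f"chi-square = {chi_square:.2f}, p-value = {p_value:.3g}")
```

A p-value this far below 0.05 leads to rejecting H0, but, as noted above, it says nothing about how strong the association is.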

Mantel-Haenszel Chi-Square Test


The Mantel-Haenszel chi-square test
o determines whether an ordinal association exists
o does not measure the strength of the ordinal association

Cramer’s V:
Cramer’s V indicates the strength of association. For 2 x 2 tables it takes values between
-1 and +1; for larger tables it is always non-negative (between 0 and 1). Values farther
from 0 indicate a stronger association.
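
To make the strength measure concrete, here is a minimal sketch for the gender by survival table, again assuming the counts quoted in the odds-ratio example of these notes. For a 2 x 2 table the signed Cramer's V coincides with the phi coefficient; its sign depends only on how rows and columns are ordered, so the layout chosen here (females first, survived first) is an assumption of the sketch.

```python
import math

# 2 x 2 counts from the odds-ratio example in these notes.
# Rows: female, male. Columns: survived, not survived.
a, b = 339, 127   # females: survived, not survived
c, d = 161, 682   # males:   survived, not survived

n = a + b + c + d

# Signed measure for a 2 x 2 table (the phi coefficient).
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# General (unsigned) form: V = sqrt(chi-square / (n * min(rows - 1, cols - 1))).
# For a 2 x 2 table, chi-square = n * phi^2, so V equals |phi|.
chi_square = n * phi ** 2
cramers_v_unsigned = math.sqrt(chi_square / (n * min(2 - 1, 2 - 1)))

print(f"phi = {phi:.4f}, |V| = {cramers_v_unsigned:.4f}")
```

Swapping the two columns (or the two rows) flips the sign of phi but leaves its magnitude unchanged.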

Odds Ratio: -
An odds ratio indicates how much more likely, with respect to odds, a certain event occurs
in one group relative to its occurrence in another group.

Example: How do the odds of females surviving compare to those of males?

$$\text{Odds} = \frac{p_{\text{event}}}{1 - p_{\text{event}}}$$

e.g. Out of 466 females travelling on the Titanic, 339 survived. The probability of a female
surviving is p(s) = 339/466 = 0.7275, or 72.75 percent. Therefore, the probability of a female
not surviving is 1 - p(s) = 1 - 0.7275 = 0.2725, or 27.25 percent. Hence, the odds of a female
surviving = p(s)/{1 - p(s)} = 0.7275/0.2725 = 2.67. Similarly, the odds of a male surviving
(161 males survived out of 843 travelling) are 0.191/0.809 = 0.236. The odds ratio of females
surviving to males surviving indicates how much more likely females are to survive, with
respect to odds, compared to males. Thus the odds ratio of female survival to male survival =
2.67/0.236 = 11.31, i.e. the odds of a female surviving are 11.31 times the odds of a male
surviving.
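
The arithmetic above can be reproduced in a few lines. This is only a sketch of the calculation in the example; the helper name `odds` is illustrative and not part of any SAS output.

```python
def odds(events, total):
    """Odds of an event = p / (1 - p), with p = events / total."""
    p = events / total
    return p / (1.0 - p)

# Counts quoted in the example: 339 of 466 females and 161 of 843 males survived.
odds_female = odds(339, 466)
odds_male = odds(161, 843)

# Odds ratio of females surviving relative to males surviving.
odds_ratio = odds_female / odds_male

print(f"odds (female) = {odds_female:.2f}")
print(f"odds (male)   = {odds_male:.3f}")
print(f"odds ratio    = {odds_ratio:.2f}")
```

Working from the unrounded probabilities gives the same values as the hand calculation to the precision shown there.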

Logistic Regression
1. Logistic regression is used when the dependent variable is categorical (in our examples it
is binary, i.e. it can take only two values, viz. survived/died, yes/no, up/down etc.)
2. Unlike linear regression
• In logistic regression, the method of least squares is not used to fit the
regression line.
• The regression line equation takes a different form.
o Logit (y) = mx + c
o Logit is the log of the odds for y = 1
o Y can take a value of 1 or 0 (binary variable)
o If the probability of Y = 1 is P, then the odds (for Y = 1) = P/(1-P)
• The method used to fit the line is called the maximum likelihood method. In this method,
the probability of finding the observed Y values for all the observations is maximized.
• Each observation in a sample is independent of the others. Hence the combined
probability for all observations to match their observed values is equal to the
product of the probabilities of the individual observations. This is called the likelihood
function. The log of this value is called the log-likelihood function.
• The log-likelihood plays a role similar to that of R square in linear regression.
• What is actually reported is minus two times the log-likelihood (-2logL). Hence
o For R square, higher the better
o For -2logL, lower the better
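
The maximum likelihood idea described above can be illustrated on a deliberately small model. The sketch below is an assumption-laden toy, not the procedure SAS runs: it fits Logit(y) = mx + c with gender as the only input (x = 1 for female, 0 for male, a coding chosen for this sketch), uses the survival counts from the odds-ratio example, and maximizes the log-likelihood with Newton-Raphson steps.

```python
import math

# Grouped data: (x, number survived, group size), from the odds-ratio example.
# x = 1 for female, x = 0 for male (coding assumed for this sketch).
groups = [(1, 339, 466), (0, 161, 843)]

def sigmoid(z):
    """Inverse of the logit: converts log-odds into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(c, m):
    """Sum of log-probabilities of the observed outcomes (log of the product)."""
    total = 0.0
    for x, survived, n in groups:
        p = sigmoid(c + m * x)
        total += survived * math.log(p) + (n - survived) * math.log(1.0 - p)
    return total

def fit(steps=25):
    """Newton-Raphson iterations that climb the log-likelihood surface."""
    c, m = 0.0, 0.0
    for _ in range(steps):
        g_c = g_m = h_cc = h_cm = h_mm = 0.0
        for x, survived, n in groups:
            p = sigmoid(c + m * x)
            residual = survived - n * p      # observed minus expected events
            weight = n * p * (1.0 - p)       # binomial variance of the group
            g_c += residual
            g_m += residual * x
            h_cc += weight
            h_cm += weight * x
            h_mm += weight * x * x
        det = h_cc * h_mm - h_cm * h_cm
        # Solve the 2 x 2 Newton system for the parameter update.
        c += (h_mm * g_c - h_cm * g_m) / det
        m += (h_cc * g_m - h_cm * g_c) / det
    return c, m

c_hat, m_hat = fit()
print(f"c = {c_hat:.4f}, m = {m_hat:.4f}")
print(f"-2logL at start = {-2 * log_likelihood(0.0, 0.0):.1f}")
print(f"-2logL at fit   = {-2 * log_likelihood(c_hat, m_hat):.1f}")
```

With a single binary input, the fitted intercept is the log of the male odds and the slope is the log of the odds ratio, which ties this model back to the odds calculation earlier in the notes.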

• The more input variables there are,
o the higher the R square in linear regression
o the lower the -2logL (lower because of the minus sign)
• In linear regression, adjusted R square is used as an improved measure.
Adjusted R square is thought to be a better measure than R square because it
adjusts for the number of input variables.
• Similarly, in logistic regression the Akaike Information Criterion (AIC) was
developed as an improved measure over -2logL. It adds a penalty for the number of
estimated parameters k: AIC = -2logL + 2k.
• The lower the AIC, the better the model fit.
• Another, more intuitive set of measures relates to “pairs”.
• A pair consists of one observation with observed Y = 1 and a second
observation with observed Y = 0. Let us call the first observation O+
and the second one O-.
• The logistic model predicts the probability of Y = 1 for each observation. Let
us say the predicted probability for the first observation is P+ and for the
second one is P-.
o If P+ > P- then the pair is concordant.
o If P+ < P- then the pair is discordant.
o If P+ = P- then the pair is tied.
• EG reports percent concordant. The higher the percent concordant, the better the
model.
• EG also reports other statistics along similar lines, including the c statistic and
Somers’ D.
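
The pair-based measures can be counted directly. The sketch below assumes a gender-only model in which every female is given a predicted probability of 339/466 and every male 161/843 (the observed survival rates from the odds-ratio example); the model built in these notes also uses age and class, so its figures will differ. The function name and the dictionary keys are illustrative.

```python
def pair_statistics(observations):
    """observations: list of (y, p) with observed y in {0, 1} and predicted P(Y=1)."""
    events = [p for y, p in observations if y == 1]       # the O+ observations
    non_events = [p for y, p in observations if y == 0]   # the O- observations
    concordant = discordant = tied = 0
    for p_plus in events:
        for p_minus in non_events:
            if p_plus > p_minus:
                concordant += 1
            elif p_plus < p_minus:
                discordant += 1
            else:
                tied += 1
    pairs = len(events) * len(non_events)
    return {
        "pairs": pairs,
        "concordant": concordant,
        "discordant": discordant,
        "tied": tied,
        # c statistic: concordant pairs plus half of the tied pairs.
        "c": (concordant + 0.5 * tied) / pairs,
        # Somers' D: concordant minus discordant pairs.
        "somers_d": (concordant - discordant) / pairs,
    }

# Predicted probabilities under the assumed gender-only model.
p_female, p_male = 339 / 466, 161 / 843

observations = (
    [(1, p_female)] * 339 + [(0, p_female)] * (466 - 339)
    + [(1, p_male)] * 161 + [(0, p_male)] * (843 - 161)
)

stats = pair_statistics(observations)
for key in ("concordant", "discordant", "tied"):
    print(f"percent {key}: {100 * stats[key] / stats['pairs']:.1f}")
print(f"c = {stats['c']:.3f}, Somers' D = {stats['somers_d']:.3f}")
```

Because this toy model produces only two distinct predicted probabilities, many pairs are tied; adding age and class would break most of those ties.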

Logistic Regression Task: -

Problem: - The objective is to develop a model to classify a person into either the survived
group or the not survived group, based on inputs such as the age of the person, the class in
which the person is travelling and the gender of the person.

Step 1:- Import “Titanic” from dataset_for_sas. In step 3 of the import wizard, ensure you
have converted “Age” from String to Number data type.

Step 2:-Select Logistic Regression as below.

Step 3:-Select variables as specified.


DO NOT MISS CHECKING REFERENCE FOR EVERY CLASSIFICATION VARIABLE SEPARATELY.

Step 4:- Under Response, select Fit model to level 1.

Ensure all variables chosen as classification as well as quantitative are shown as effects. Note
that each time the input variables are changed, the effects have to be re-entered.

Step 5:- Click factorial to select all effects including combinations.

Step 6:-Select Model selection method as specified.

Step 7:- Uncheck Plots.

Step 8:- Check Predictions

Step 9:- Click Run.

Output Screen 1: Note Number of observations and response profile

Output Screen 2: Note the Model Fit Statistics such as AIC, SC and -2logL. Also, note the Likelihood
Ratio chi-square under Testing Global Null Hypothesis. (This is similar to the F value testing the
significance of a linear regression.)
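
To see how these fit statistics relate to one another, the sketch below computes them for a toy comparison: an intercept-only model against a model with gender as the only input. This is an assumption made for illustration (the task in these notes also includes age and class), and the counts come from the odds-ratio example. The formulas used are AIC = -2logL + 2k and SC = -2logL + k ln(n), with k the number of estimated parameters and n the number of observations.

```python
import math

def minus_two_log_l(groups):
    """-2logL when each group (events, total) is fitted by its observed proportion,
    which is the maximum likelihood fit for an intercept-only or single binary input model."""
    total = 0.0
    for events, n in groups:
        p = events / n
        total += events * math.log(p) + (n - events) * math.log(1.0 - p)
    return -2.0 * total

def aic(m2ll, k):
    return m2ll + 2 * k

def sc(m2ll, k, n):
    return m2ll + k * math.log(n)

n_obs = 466 + 843

# Intercept only: one common survival probability for everyone (k = 1).
null_m2ll = minus_two_log_l([(339 + 161, n_obs)])

# Intercept and gender: separate probabilities for females and males (k = 2).
gender_m2ll = minus_two_log_l([(339, 466), (161, 843)])

# Likelihood ratio chi-square for the global null hypothesis (1 df here).
lr_chi_square = null_m2ll - gender_m2ll
lr_p_value = math.erfc(math.sqrt(lr_chi_square / 2.0))

print(f"Intercept only:       -2logL = {null_m2ll:.1f}, AIC = {aic(null_m2ll, 1):.1f}, "
      f"SC = {sc(null_m2ll, 1, n_obs):.1f}")
print(f"Intercept and gender: -2logL = {gender_m2ll:.1f}, AIC = {aic(gender_m2ll, 2):.1f}, "
      f"SC = {sc(gender_m2ll, 2, n_obs):.1f}")
print(f"Likelihood ratio chi-square = {lr_chi_square:.1f}, p-value = {lr_p_value:.3g}")
```

The drop in -2logL between the two rows is exactly the likelihood ratio chi-square, which is why a large drop signals that the inputs jointly matter.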

Note that the values under the column “Intercept and Covariates” are the relevant ones.

Output Screen 3: Note the Wald chi-square for each effect. (This is similar to the t value for
each parameter estimate in linear regression.)

Also note the values of parameter estimates.
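
For intuition about where a Wald chi-square comes from, the sketch below works it out for a gender-only model, which is an assumption of this example rather than the three-input model of the task. With a single binary input and reference coding (male as the reference level, assumed here), the parameter estimate is the log of the odds ratio and its standard error has a closed form in the four cell counts.

```python
import math

# 2 x 2 counts from the odds-ratio example in these notes.
a, b = 339, 127   # females: survived, not survived
c, d = 161, 682   # males:   survived, not survived

# Parameter estimate for gender (female vs male) = log of the odds ratio.
beta_gender = math.log((a / b) / (c / d))

# Standard error of a log odds ratio from a 2 x 2 table.
std_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

# Wald chi-square = (estimate / standard error)^2, with 1 degree of freedom.
wald_chi_square = (beta_gender / std_error) ** 2
p_value = math.erfc(math.sqrt(wald_chi_square / 2.0))

print(f"estimate = {beta_gender:.4f}, std error = {std_error:.4f}")
print(f"Wald chi-square = {wald_chi_square:.1f}, p-value = {p_value:.3g}")
```

Exponentiating the estimate recovers the odds ratio of about 11.31 from the earlier example.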

Output Screen 4: Note percent concordant, percent discordant, percent tied, number of pairs,
the c statistic and Somers’ D.

Output Screen 5: Note the predicted probabilities for dependent variable = 1 or 0.
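
A predicted probability is obtained by inverting the logit. The sketch below assumes a gender-only model, Logit(y) = mx + c with x = 1 for female and 0 for male, and takes the intercept and slope implied by the counts in the odds-ratio example; the function name is illustrative and the actual task output depends on age and class as well.

```python
import math

def predicted_probability(intercept, slope, x):
    """Invert Logit(y) = slope * x + intercept to get P(Y = 1)."""
    logit = intercept + slope * x
    return 1.0 / (1.0 + math.exp(-logit))

# Gender-only model implied by the counts (assumed for this sketch):
intercept = math.log(161 / 682)                  # log odds of survival for males
slope = math.log((339 / 127) / (161 / 682))      # log of the odds ratio

p_female = predicted_probability(intercept, slope, 1)
p_male = predicted_probability(intercept, slope, 0)

print(f"P(survived | female) = {p_female:.4f}")
print(f"P(survived | male)   = {p_male:.4f}")
```

The probability of Y = 0 for the same observation is simply one minus the value returned.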
