Sas Notes Module 4-Categorical Data Analysis Testing Association Between Categorical Variables

1) The document describes steps to test the association between categorical variables using chi-square tests in SAS. It analyzes the relationship between gender and survival using Titanic passenger data. 2) Chi-square tests are conducted and odds ratios calculated to determine if there is a significant association between gender and survival. The results show that females had significantly higher odds of survival than males. 3) Logistic regression is then performed to predict survival based on gender, age, and class. Model fit statistics like AIC are examined to evaluate model performance, and concordance measures indicate how well the model distinguishes between survived and non-survived passengers.

Uploaded by

NISHITA MALPANI


SAS NOTES

Module 4- Categorical Data Analysis

Testing association between categorical variables: -

e.g. (import worksheet Titanic from the excel workbook dataset_for_sas)

1) We want to test the association between gender and survival (0 - Not survived, 1 - Survived).
Do males and females have different chances of survival?

Step 1:- Formulating hypothesis.

H0: Gender and survival are independent (there is no association between gender and survival)

H1: Gender and survival are not independent (there is an association between gender and survival)

Step 2:- Select Describe -> Table Analysis.
Click on Data in the left section. Select the variables (Survival and Gender) from the
“Variables to assign” section and transfer them to the “Task Roles” section, either by
dragging or by clicking on the right arrow button and selecting the “Table Variables”
option from the pop-up menu.

Step 3:-
Select the Tables option in the left section. Select the variable Gender from the “Variables
permitted in table” section and transfer it to the “Preview” section by dragging or by
clicking on the right arrow button and selecting Column. This assigns Gender to the column
of the table.

Now, Select the variable Survival from “Variables permitted in table” section and
transfer it to “Preview” section by dragging or by clicking on right arrow button and
selecting Row. This assigns Survival to row of the Table.

The resultant table after assigning both the variables will look like the table given below:

Step 4:-
Select the Cell Statistics option in the left section. In addition to the default Column
percentages and Cell frequencies, you can select other options such as Missing value
frequencies and Expected cell frequencies by checking the boxes in front of them.
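
As a rough illustration of what the Expected cell frequencies option reports, the sketch below computes expected counts under independence (row total x column total / grand total). It assumes the gender-by-survival counts quoted in the odds example of these notes (339 of 466 females and 161 of 843 males survived); the function and variable names are made up for this illustration and are not part of SAS.

```python
# Observed counts, taken from the odds example in these notes.
observed = {
    "Female": {"Survived": 339, "Not survived": 466 - 339},
    "Male": {"Survived": 161, "Not survived": 843 - 161},
}


def expected_frequencies(table):
    """Expected count under independence = row total * column total / grand total."""
    row_totals = {row: sum(cols.values()) for row, cols in table.items()}
    col_totals = {}
    for cols in table.values():
        for col, count in cols.items():
            col_totals[col] = col_totals.get(col, 0) + count
    n = sum(row_totals.values())
    return {
        row: {col: row_totals[row] * col_totals[col] / n for col in col_totals}
        for row in table
    }


expected = expected_frequencies(observed)
for gender, cols in expected.items():
    for outcome, value in cols.items():
        obs = observed[gender][outcome]
        print(f"{gender:6s} {outcome:12s} observed={obs:4d} expected={value:7.2f}")
```

Large gaps between observed and expected counts are what drive the chi-square statistic upward.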

Step 5: -
Select Association under the Table Statistics group in the left section. Select Chi-Square
tests and Relative risk for 2 x 2 tables (this option is to be used only for a 2 x 2 table) by
checking the boxes in front of them and click on Run.

Step 6: - Interpreting the result

1) Check the p-value for the Chi-Square test, since we are testing the association between
two nominal variables, gender and survival. Since this p-value is less than 0.05, we reject
the null hypothesis, i.e. we conclude that there is a significant association between gender
and survival. To test the association between ordinal variables (e.g. whether there is a
significant association between survival and class), interpret the p-value for the
Mantel-Haenszel chi-square instead.

Chi-Square Tests
Chi-square tests and the corresponding p-values
o determine whether an association exists
o do not measure the strength of an association
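
To make the test concrete, here is a minimal sketch of the Pearson chi-square for the gender-by-survival table, assuming the counts quoted in the odds example of these notes. It uses the shortcut formula for a 2 x 2 table without continuity correction, and the fact that with 1 degree of freedom the upper-tail p-value equals erfc(sqrt(chi-square / 2)). The variable names are illustrative only.

```python
import math

# 2 x 2 counts from the odds example in these notes.
a, b = 339, 127  # females: survived, not survived
c, d = 161, 682  # males: survived, not survived
n = a + b + c + d

# Shortcut formula for the Pearson chi-square of a 2 x 2 table
# (no continuity correction).
chi_square = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# For 1 degree of freedom: P(chi-square > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.2f}, p-value = {p_value:.3g}")
print("Reject H0 at the 5% level" if p_value < 0.05 else "Fail to reject H0")
```

The p-value only says whether an association is detectable; it says nothing about how strong the association is.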

Mantel-Haenszel Chi-Square Test

The Mantel-Haenszel chi-square test
o determines whether an ordinal association exists
o does not measure the strength of the ordinal association

Cramer’s V:
Cramer’s V indicates the strength of association. For a 2 x 2 table it takes a value between
-1 and +1; for tables larger than 2 x 2 it is always non-negative (between 0 and 1). Values
farther from 0 indicate a stronger association.
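
As an illustration, the sketch below computes the signed Cramer's V for a 2 x 2 table, where it coincides with the phi coefficient. It assumes the gender-by-survival counts quoted in the odds example of these notes; the function name is hypothetical.

```python
import math


def cramers_v_2x2(a, b, c, d):
    """Signed Cramer's V for the 2 x 2 table [[a, b], [c, d]] (the phi coefficient)."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))


# Rows: female, male. Columns: survived, not survived.
v = cramers_v_2x2(339, 127, 161, 682)
# Swapping the two columns flips the sign but not the magnitude.
v_swapped = cramers_v_2x2(127, 339, 682, 161)

print(f"Cramer's V = {v:.3f}")
print(f"Cramer's V with columns swapped = {v_swapped:.3f}")
```

The sign in the 2 x 2 case depends only on how the rows and columns are ordered, so the magnitude is what conveys strength.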

Odds Ratio: -
An odds ratio indicates how much more likely, with respect to odds, a certain event occurs
in one group relative to its occurrence in another group.

Example: How do the odds of males surviving compare to those of females?

$$\text{Odds} = \frac{p_{\text{event}}}{1 - p_{\text{event}}}$$
e.g. Out of 466 females travelling on the Titanic, 339 survived. The probability of a female
surviving is p(s) = 339/466 = 0.7275 or 72.75 percent. Therefore, the probability of a female
not surviving = 1 - p(s) = 1 - 0.7275 = 0.2725 or 27.25 percent. Hence, the odds of a female
surviving = p(s)/{1 - p(s)} = 0.7275/0.2725 = 2.67. Similarly, the odds of a male surviving
(161 males survived out of 843 travelling) are 0.191/0.809 = 0.236. The odds ratio of females
surviving to males surviving indicates how much more likely females are to survive, with
respect to odds, compared to males. Thus the odds ratio of females surviving to males
surviving = 2.67/0.236 = 11.31. The odds of a female surviving are 11.31 times the odds of
a male surviving.
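
The arithmetic above can be checked with a short sketch. The counts are the ones quoted in the example; the relative risk (ratio of survival probabilities) is added only for comparison with the odds ratio and is not worked out in the notes. Function and variable names are illustrative.

```python
def odds(events, total):
    """Odds of the event: p / (1 - p), where p = events / total."""
    p = events / total
    return p / (1 - p)


female_odds = odds(339, 466)  # 339 of 466 females survived
male_odds = odds(161, 843)    # 161 of 843 males survived
odds_ratio = female_odds / male_odds

# Relative risk compares probabilities rather than odds.
relative_risk = (339 / 466) / (161 / 843)

print(f"odds (female) = {female_odds:.2f}")
print(f"odds (male)   = {male_odds:.3f}")
print(f"odds ratio    = {odds_ratio:.2f}")
print(f"relative risk = {relative_risk:.2f}")
```

Working from unrounded probabilities gives an odds ratio of about 11.31, in line with the hand calculation.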

Logistic Regression
1. Logistic regression is used when the dependent variable is categorical (in our examples, it
is binary, i.e. it can take only two values such as survived/died, yes/no, up/down etc.)
2. Unlike linear regression
• In case of logistic regression, the method of least squares is not used to fit the
regression line.
• The regression line equation takes a different form:
o Logit (y) = mx + c
o Logit is the log of Odds for y=1
o Y can take a value of 1 or 0 (binary variable)
o If the probability of Y=1 is P, then Odds (for Y=1) = P/(1-P)
• The method used to fit the line is called the maximum likelihood method. In this method,
the probability of finding the observed Y values for all the observations is maximized.
• Each observation in a sample is independent of the others. Hence the combined
probability for all observations to match their observed values is equal to the
product of the probabilities of these observations. This is called the likelihood function.
The log of this value is called the log-likelihood function.
• Log-likelihood plays a role analogous to R square in linear regression.
• What is actually reported is minus two times the log-likelihood (-2logL). Hence
o For R square, higher the better
o For -2logL, lower the better

• The more input variables there are,
o the higher the R square in case of linear regression
o the lower the -2logL (lower because of the minus sign)
• In case of linear regression, adjusted R square is used as an improved measure.
Adjusted R square is thought to be a better measure than R square because it
adjusts for the number of input variables.
• Similarly, in case of logistic regression, the Akaike Information Criterion (AIC) was
developed as an improved measure over -2logL; it adds a penalty for the number of
parameters in the model.
• The lower the AIC, the better the model fit.
• Another set of measures, more intuitive, relates to “pairs”.
• One pair means a pair of one observation with observed value of Y=1 and a second
observation with observed value of Y=0. Let us call the first observation O+
and the second one O-.
• The logistic model predicts the probability of Y=1 for each observation. Let
us say the predicted probability for the first observation is P+ and for the
second one is P-.
o If P+ > P- then the pair is concordant.
o If P+ < P- then the pair is discordant.
o If P+ = P- then the pair is tied.
• EG reports percent concordant. The higher the percent concordant, the better the
model.
• EG also reports other statistics along similar lines, including the c statistic and
Somers’ D.
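
The pair-counting idea can be sketched as follows. The observed outcomes and predicted probabilities below are hypothetical values invented for this illustration, not output from the Titanic model. The c statistic is taken here as (concordant + 0.5 x tied) / pairs and Somers' D as (concordant - discordant) / pairs.

```python
from itertools import product

# Hypothetical (observed Y, predicted probability of Y=1) values, for illustration only.
scored = [(1, 0.90), (1, 0.60), (1, 0.40), (0, 0.60), (0, 0.30), (0, 0.10)]


def concordance(scored):
    """Count concordant, discordant and tied pairs of (Y=1, Y=0) observations."""
    p_events = [p for y, p in scored if y == 1]
    p_non_events = [p for y, p in scored if y == 0]
    concordant = discordant = tied = 0
    for p_plus, p_minus in product(p_events, p_non_events):
        if p_plus > p_minus:
            concordant += 1
        elif p_plus < p_minus:
            discordant += 1
        else:
            tied += 1
    pairs = concordant + discordant + tied
    return {
        "pairs": pairs,
        "concordant": concordant,
        "discordant": discordant,
        "tied": tied,
        "c": (concordant + 0.5 * tied) / pairs,
        "somers_d": (concordant - discordant) / pairs,
    }


result = concordance(scored)
print(result)
```

With 3 events and 3 non-events there are 9 pairs; a model that ranked every event above every non-event would give c = 1 and Somers' D = 1.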

Logistic Regression Task: -

Problem: - The objective is to develop a model to classify the person into either survived
group or not survived group based on the inputs such as Age of a person, Class in which
the person is travelling and gender of the person.

Step 1:- Import “Titanic” from dataset_for_sas. In step 3 of the import wizard, ensure you
have converted “Age” from string to Number data type.

Step 2:-Select Logistic Regression as below.

Step 3:-Select variables as specified.

DO NOT MISS CHECKING REFERENCE FOR EVERY CLASSIFICATION VARIABLE SEPARATELY.

Step 4:- Select Fit Model to Level 1 in response.

Ensure all variables chosen as classification as well as quantitative are shown as effects. Note
that each time input variables are changed, effects have to be re-entered.

Step 5:- Click factorial to select all effects including combinations.

Step 6:-Select Model selection method as specified.

Step 7:- Uncheck Plots.

Step 8:- Check Predictions

Step 9:- Click Run.

Output Screen 1: Note Number of observations and response profile

Output Screen 2 : Note Model Fit Statistics such as AIC, SC and -2logL. Also, note Likelihood ratio
Chi square for Testing Global Null Hypothesis. (This is similar to the F value testing the
significance of linear regression)

Note that the values under the column “Intercept and Covariates” are relevant.
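
To show how these fit statistics relate to one another, the sketch below computes -2logL, AIC, SC and the likelihood ratio chi-square for a simplified model with gender as the only input. This is an assumption made for illustration: the task in these notes also uses age and class, so the numbers will not match the EG output. With a single binary input, the fitted probabilities equal the observed survival rates in each group, which is what the code relies on. It uses AIC = -2logL + 2k and SC = -2logL + k ln(n), where k is the number of parameters and n the number of observations.

```python
import math

# (survived, not survived) per gender, from the odds example in these notes.
groups = {"Female": (339, 127), "Male": (161, 682)}


def log_likelihood(cells):
    """Log-likelihood when each cell group is fitted with its own observed event rate."""
    total = 0.0
    for survived, died in cells:
        p = survived / (survived + died)
        total += survived * math.log(p) + died * math.log(1 - p)
    return total


n_obs = sum(s + d for s, d in groups.values())
n_survived = sum(s for s, _ in groups.values())

# Intercept-only model: one common survival probability (k = 1 parameter).
neg2logl_null = -2 * log_likelihood([(n_survived, n_obs - n_survived)])
# Gender-only model: intercept plus one gender effect (k = 2 parameters).
neg2logl_model = -2 * log_likelihood(groups.values())

aic_null = neg2logl_null + 2 * 1
aic_model = neg2logl_model + 2 * 2
sc_null = neg2logl_null + 1 * math.log(n_obs)
sc_model = neg2logl_model + 2 * math.log(n_obs)

# Likelihood ratio chi-square for the global null hypothesis.
lr_chi_square = neg2logl_null - neg2logl_model

print(f"-2logL: intercept only = {neg2logl_null:.1f}, with gender = {neg2logl_model:.1f}")
print(f"AIC:    intercept only = {aic_null:.1f}, with gender = {aic_model:.1f}")
print(f"SC:     intercept only = {sc_null:.1f}, with gender = {sc_model:.1f}")
print(f"Likelihood ratio chi-square = {lr_chi_square:.1f}")
```

A drop in AIC from the intercept-only column to the intercept-and-covariates column indicates that the inputs improve the fit even after the penalty for extra parameters.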

Output Screen 3 :

Note the Wald chi-square for each effect (this is similar to the t value for each parameter
estimate in case of linear regression).

Also note the values of parameter estimates.
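
As a hedged illustration of how a Wald chi-square relates to a parameter estimate, the sketch below again assumes a simplified gender-only model with male as the reference level. Under that assumption the gender estimate is the log of the odds ratio, its standard error is sqrt(1/a + 1/b + 1/c + 1/d) from the 2 x 2 counts, and the Wald chi-square is (estimate / standard error) squared. The values will differ from the EG output for the full model with age and class.

```python
import math

# 2 x 2 counts from the odds example in these notes.
a, b = 339, 127  # females: survived, not survived
c, d = 161, 682  # males: survived, not survived

# Gender-only model with male as reference (an assumption for this sketch):
intercept = math.log(c / d)                   # log odds of survival for males
estimate = math.log((a / b) / (c / d))        # log odds ratio, female vs male
std_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

wald_chi_square = (estimate / std_error) ** 2
# 1 degree of freedom: P(chi-square > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(wald_chi_square / 2))

print(f"intercept = {intercept:.4f}")
print(f"estimate (female) = {estimate:.4f}, std error = {std_error:.4f}")
print(f"Wald chi-square = {wald_chi_square:.1f}, p-value = {p_value:.3g}")
print(f"exp(estimate) = {math.exp(estimate):.2f}  (the odds ratio)")
```

Exponentiating a parameter estimate turns it back into an odds ratio, which is usually the easier quantity to interpret.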

Output Screen 4 :

Note percent concordant, percent discordant, percent tied, number of pairs, c statistic and
Somers’ D

Output screen 5 : Note predicted values of probability for dependent variable = 1 or 0
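
The predicted probabilities come from inverting the logit: P = 1 / (1 + exp(-logit)). The sketch below shows this for a simplified gender-only model with male as the reference level, which is an assumption for illustration since the task in these notes also includes age and class. The intercept and coefficient are derived from the counts in the odds example; the function name is hypothetical.

```python
import math

# Gender-only coefficients derived from the counts in the odds example
# (male is the reference level; this simplification is an assumption).
intercept = math.log(161 / 682)                  # log odds of survival for males
coef_female = math.log((339 / 127) / (161 / 682))  # log odds ratio, female vs male


def predicted_probability(is_female):
    """Convert Logit(y) = intercept + coef * x back into a probability of Y=1."""
    logit = intercept + coef_female * (1 if is_female else 0)
    return 1 / (1 + math.exp(-logit))


p_female = predicted_probability(True)
p_male = predicted_probability(False)

print(f"P(survived | female) = {p_female:.4f}")
print(f"P(survived | male)   = {p_male:.4f}")
```

For this one-input model the predicted probabilities reproduce the observed survival rates of 339/466 and 161/843.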
