Logistic Regression Analysis
Contact information
Email: bianh@ecu.edu
Phone: 328-5428
Location: 2307 Old Cafeteria Complex (east campus)
Website: http://core.ecu.edu/ofe/StatisticsResearch/
Logistic regression is used to predict the presence or absence of a characteristic or outcome based on the values of a set of predictor variables. It is similar to a linear regression model but is suited to models where the dependent variable is dichotomous.
It is used to determine the probability of a case belonging to category 1 of the dependent variable, that is, the probability of the event occurring (the event occurring is always coded as 1), for a given set of predictors.
Indicator coding: SPSS will automatically recode categorical variables for us.
Assumptions
Homogeneity of variance and normality of errors are NOT assumed, but logistic regression does require:
Absence of multicollinearity
No specification errors: all relevant predictors are included and irrelevant predictors are excluded
ln(p / (1 - p)) = a + b1X1 + b2X2 + ... + bnXn
p: probability of a case belonging to category 1
p / (1 - p): odds
a: constant
n: number of predictors
b1-bn: regression coefficients
On the logistic (S-shaped) curve, the Y-axis is P (probability), which indicates the proportion of cases coded 1 at any given value of X.
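For reference, the logit equation above can be rearranged algebraically to give this probability directly (no new quantities, only the symbols already defined):

```latex
\ln\!\left(\frac{p}{1-p}\right) = a + b_1 X_1 + \dots + b_n X_n
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(a + b_1 X_1 + \dots + b_n X_n)}}
```

As the logit moves from very negative to very positive values, p moves from near 0 to near 1, which is what produces the S-shaped curve.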
Coding of variables
Use 1 for the presence of the event (the outcome of interest in the study). Use 0 for the absence of the event (the reference category). SPSS automatically recodes the lower number of the category to 0 and the higher number to 1.
Coding of variables
Cases coded as 1 are referred to as the response group, comparison group, or target group. Cases coded as 0 are referred to as the reference group, base group, or control group.
Terms
             Male  Female  Total
Drug users    120      85    205
Non-users     102     106    208
Total         222     191    413
Terms
Odds: the probability of belonging to one group (or of the event occurring) divided by the probability of not belonging to that group (or of the event not occurring). The odds of a male using drugs are 120/102 = 1.18; the odds of a female using drugs are 85/106 = .80. For males, this means a male is 1.18 times as likely to use drugs as not to use them.
             Male  Female  Total
Drug users    120      85    205
Non-users     102     106    208
Total         222     191    413
Terms
Odds ratio: an important estimate in logistic regression, used to answer our research question. For the table below, the research question is whether there is a gender difference in drug use, that is, whether the probability of drug use is the same for males and females.
                 Male (1)  Female (0)  Total
Drug users (1)        120          85    205
Non-users (0)         102         106    208
Total                 222         191    413
Terms
Odds ratio
A ratio of the odds for each group: always the odds for the response group (males) divided by the odds for the referent group (females). The odds ratio is 1.18/.80 = 1.48.
Terms
Odds ratio
Males in this example had 1.48 times the odds of using drugs compared with females. An odds ratio > 1 indicates that the likelihood of the event occurring is higher for the response category than for the referent category of an independent variable. An odds ratio < 1 indicates that the likelihood of the event occurring is lower for the response category than for the referent category.
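As a quick arithmetic check of the odds and odds ratio above, here is a minimal Python sketch (mine, not part of the original slides) using the counts from the 2x2 table:

```python
# Drug use by gender: counts taken from the 2x2 table above
counts = {
    "male":   {"users": 120, "non_users": 102},
    "female": {"users": 85,  "non_users": 106},
}

def odds(group: str) -> float:
    """Odds of drug use = users / non-users within a group."""
    return counts[group]["users"] / counts[group]["non_users"]

odds_male = odds("male")              # 120/102 ~= 1.18
odds_female = odds("female")          # 85/106  ~= 0.80
odds_ratio = odds_male / odds_female  # ~= 1.47 (1.48 when the odds are rounded first)

print(f"odds(male)   = {odds_male:.2f}")
print(f"odds(female) = {odds_female:.2f}")
print(f"odds ratio   = {odds_ratio:.2f}")
```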
Terms
Regression coefficient (B): the weight of a predictor in the model. It indicates the contribution of a particular predictor when the other predictors are controlled.
Terms
Goodness of fit: the agreement between the observed responses in the dataset and the responses predicted by the model.
Example of logistic regression analysis
The research question is whether gender, self-control, and self-efficacy predict drug use status.
Three predictors:
Gender (a01: 1 = males, 0 = females)
Self-control (continuous variable)
Self-efficacy (a80r: 1 = somewhat/not sure, 0 = very sure)
One dependent variable:
Drug use status (1 = drug users, 0 = non-users)
Logistic regression
Null hypothesis
There is an equal chance of drug use or non-use; that is, the predictors have no effect on drug use status.
Click Categorical
Enter the two categorical variables (a01 and a80r) into the Categorical Covariates box.
Categorical
The default contrast is Indicator, and the default reference category is Last.
Categorical: contrast
Indicator. Contrasts indicate the presence or absence of category membership. The reference category is represented in the contrast matrix as a row of zeros.
Simple. Each category of the predictor variable (except the reference category) is compared to the reference category.
Difference. Each category of the predictor variable except the first category is compared to the average effect of previous categories. Also known as reverse Helmert contrasts.
Helmert. Each category of the predictor variable except the last category is compared to the average effect of subsequent categories.
Categorical: contrast
Repeated. Each category of the predictor variable except the first category is compared to the category that precedes it.
Polynomial. Orthogonal polynomial contrasts. Categories are assumed to be equally spaced. Polynomial contrasts are available for numeric variables only.
Deviation. Each category of the predictor variable except the reference category is compared to the overall effect.
Categorical
For a01 and a80r, we want category 0 as the reference category, so set Reference Category to First and click Change.
Save
For each case, if the predicted probability is greater than 0.5, the respondent is predicted to use drugs (coded as 1); if the predicted probability is less than 0.5, the respondent is predicted not to use drugs (coded as 0).
Options
SPSS output
Coding
SPSS output
Constant-only model: Block 0 (beginning block/step 0) means that only the constant is in the model; our predictors are not in the equation yet.
SPSS output
Block 1
Model fit statistics
Full model: Block 1 (Step 1) indicates that our predictors are entered into the model simultaneously; the method used is Enter.
Pseudo R-square: Nagelkerke R2 is preferred. The model accounts for almost 10% of the variance of the DV.
Hosmer and Lemeshow test: this test assesses whether the predicted probabilities match the observed probabilities. p > .05 means the set of IVs predicts the actual probabilities accurately (the model fits).
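To show where the pseudo R-square values come from, here is a small Python sketch (my illustration, not SPSS output). The full-model -2 Log Likelihood (541.153) is quoted later in this deck; the constant-only value is reconstructed from the model chi-square of 31.364 (541.153 + 31.364 = 572.517), with N = 413.

```python
import math

neg2ll_null = 572.517   # -2LL, constant-only model (541.153 + 31.364)
neg2ll_model = 541.153  # -2LL, model with the three predictors
n = 413                 # number of cases

# Cox & Snell R^2 = 1 - (L_null / L_model)^(2/n), written with -2LL values
r2_cox_snell = 1 - math.exp((neg2ll_model - neg2ll_null) / n)

# Nagelkerke R^2 rescales Cox & Snell so its maximum possible value is 1
r2_nagelkerke = r2_cox_snell / (1 - math.exp(-neg2ll_null / n))

print(round(r2_cox_snell, 3))   # ~0.073
print(round(r2_nagelkerke, 3))  # ~0.098, i.e., the "almost 10%" above
```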
SPSS output
In block 0, the probability of a correct prediction is 50.4%. In block 1, the overall predictive accuracy is 62.7%.
SPSS output
Classification table
64.9% is also known as the sensitivity of prediction: the percentage of the target group (drug users) correctly predicted by the model.
SPSS output
1. Wald test: it tests the effect of an individual predictor while controlling for the other predictors.
2. Exp(B): it is an odds ratio. For gender, males are 1.60 times more likely to use drugs than females. For self-control, the probability of drug use is contingent on the self-control level: the higher the self-control score, the less likely drug use. For a80r, the low self-efficacy group is 1.53 times more likely to use drugs than the high self-efficacy group.
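The link between the B column and Exp(B) is simple exponentiation, so the reported odds ratios pin down the coefficients up to rounding:

```latex
\mathrm{Exp}(B) = e^{B},
\qquad\text{e.g.}\qquad
\mathrm{Exp}(B_{\text{gender}}) = 1.60 \;\Rightarrow\; B_{\text{gender}} = \ln(1.60) \approx 0.47
```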
SPSS output
Equation
The equation takes the form ln(odds of drug use) = a + b1(Gender) + b2(Self-control) + b3(Self-efficacy), with a and the b coefficients taken from the B column of the Variables in the Equation table.
SPSS output
1. For each case, if the predicted probability is greater than 0.5, the respondent is predicted to use drugs (coded as 1).
2. If the predicted probability is less than 0.5, the respondent is predicted not to use drugs (coded as 0).
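For readers working outside SPSS, here is a hedged sketch of the same analysis in Python with statsmodels. The DataFrame `df` and its column names (drug_use, a01, self_control, a80r) are assumptions for illustration, not the actual data file.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def fit_drug_use_model(df: pd.DataFrame):
    # Enter the three predictors simultaneously (the SPSS "Enter" method)
    model = smf.logit("drug_use ~ a01 + self_control + a80r", data=df).fit()

    print(model.summary())       # B, standard errors, Wald-type z tests, p values
    print(np.exp(model.params))  # Exp(B): the odds ratios

    # Classify each case with the 0.5 cutoff described above
    predicted_prob = model.predict(df)
    predicted_class = (predicted_prob > 0.5).astype(int)
    accuracy = (predicted_class == df["drug_use"]).mean()
    print(f"Overall classification accuracy: {accuracy:.1%}")
    return model
```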
Results
A logistic regression was performed to test the effects of self-control, self-efficacy, and gender on drug use. Results indicated that the three-predictor model provided a statistically significant improvement over the constant-only model, χ2(3, N = 413) = 31.36, p < .001. The Nagelkerke R2 indicated that the model accounted for 9.8% of the total variance. The correct prediction rate was about 63.7%. The Wald tests showed that all three predictors significantly predicted drug use status.
Add a93a (a four-category item: it is okay for me to drink too much alcohol) into the model as an independent variable. Rerun the previous logistic regression. Use the Indicator method with the first level as the reference.
SPSS recoded a93a into three dummy variables, with the first level as the reference (represented in the contrast matrix as a row of zeros).
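The recoding SPSS performs here can be reproduced by hand; a minimal pandas sketch (the example values of a93a are made up, only the coding scheme matters):

```python
import pandas as pd

# A four-category item; the first level serves as the reference category
a93a = pd.Series([1, 2, 3, 4, 2, 1], name="a93a", dtype="category")

# Indicator (dummy) coding with the first level dropped: three 0/1 columns,
# and the reference level shows up as a row of zeros across them
dummies = pd.get_dummies(a93a, prefix="a93a", drop_first=True).astype(int)
print(dummies)
```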
Example: we want to know whether there is a significant interaction of self-efficacy and self-control on the probability of drug use. Add the interaction term (self-efficacy × self-control) to the model. We will have three models: the constant-only model; the model with the three predictors and the constant; and the model with the interaction term, the three predictors, and the constant.
Block 2
Highlight a80r, hold Control, and select self-control. Then click the a*b> button to enter a80r*self-control into Block 2.
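In the Python sketch shown earlier (same hypothetical column names), the Block 2 step corresponds to adding a product term to the model formula:

```python
import statsmodels.formula.api as smf

def fit_blocks(df):
    # Block 1: the three main effects
    block1 = smf.logit("drug_use ~ a01 + self_control + a80r", data=df).fit()
    # Block 2: adds the a80r-by-self-control product term; the ':' operator
    # adds just the interaction, mirroring the a80r*self-control term in SPSS
    block2 = smf.logit(
        "drug_use ~ a01 + self_control + a80r + a80r:self_control", data=df
    ).fit()
    return block1, block2
```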
SPSS output
The results for Block 0 and Block 1 are the same as those from the previous study.
1. Block 2 is not significant, which means the interaction term is not significant. "Model" means that with everything in the equation, the whole model is significant.
2. The difference in -2 Log Likelihood between Block 2 and Block 1 is 541.153 - 537.486 = 3.67. This is a chi-square statistic with df = 1, p > .05 (from the chi-square table, the critical value is 3.84 at p = .05, df = 1).
SPSS output
The chi-square change from Block 1 to Block 2 is 35.032 - 31.364 = 3.668, which is the chi-square for the interaction term. The R2 change indicates that 1% of the variance is explained by the interaction term. The improvement in prediction is not significant (p = .055).
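The arithmetic on this slide can be checked directly; a small sketch using the -2LL figures quoted above:

```python
from scipy.stats import chi2

# Change in -2 Log Likelihood from Block 1 to Block 2 (one added term -> df = 1)
chi_sq_change = 541.153 - 537.486       # ~= 3.67
p_value = chi2.sf(chi_sq_change, df=1)  # ~= .055, not significant at .05

print(chi_sq_change, p_value)
```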
SPSS output
The Wald test also shows that there is no significant interaction effect of self-efficacy and self-control on the DV. Equation of the model: ln(odds) = .021 + .476(Gender) - .063(Self-control) + 1.03(Self-efficacy) - .079(Self-efficacy × Self-control)
Multinomial logistic regression
It is used to classify subjects based on the values of a set of predictor variables. This type of regression is similar to (binary) logistic regression, but it is more general because the dependent variable is not restricted to two categories.
Variables
The dependent variable should be
categorical. Independent variables can be factors or covariates. In general, factors should be categorical variables and covariates should be continuous variables.
Example: a study of people's preference for each breakfast option.
Dependent variable: choice of breakfast (1 = Breakfast bar; 2 = Oatmeal; 3 = Cereal)
Independent variables: age, gender, lifestyle (all categorical variables)
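By analogy with the binary sketches earlier, here is a hedged sketch of this multinomial model in Python with statsmodels; the DataFrame and column names (breakfast, age_group, gender, active) are placeholders, not the actual data file.

```python
import statsmodels.formula.api as smf

def fit_breakfast_model(df):
    # breakfast: 1 = Breakfast bar, 2 = Oatmeal, 3 = Cereal (nominal outcome)
    # C(...) marks each predictor as categorical (an SPSS "factor")
    model = smf.mnlogit(
        "breakfast ~ C(age_group) + C(gender) + C(active)", data=df
    ).fit()
    print(model.summary())  # one set of coefficients per non-reference category
    return model
```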
Click Model
The default model is Main effects. We can customize the model and request both main effects and interaction effects.
Click Statistics
SPSS outputs
SPSS Output
Cells with zero frequencies can be a useful indicator of potential problems. Since there are few of these empty cells (only 4.2%), you can probably safely use the results of the goodness-of-fit tests.
SPSS Outputs
1. The likelihood ratio tests check the difference between the null model and the final model.
2. The chi-square in the first table is the change in -2 Log Likelihood from the intercept-only model to the final model.
3. The results show that the final model outperforms the null model.
4. The goodness-of-fit results show that the model adequately fits the data.
SPSS Output
The likelihood ratio tests check the contribution of each effect to the model. Age and Active make significant contributions to the model.
SPSS Output
SPSS Output: the Parameter Estimates table summarizes the effect of each predictor.
Parameters with significant negative
coefficients decrease the likelihood of that response category with respect to the reference category. Parameters with positive coefficients increase the likelihood of that response category.
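The coefficients in this table are on the log-odds scale relative to the reference category of the dependent variable; with J outcome categories and category J as the reference, the model for each of the other categories is:

```latex
\ln\!\left(\frac{P(Y = j)}{P(Y = J)}\right) = a_j + b_{j1} x_1 + \dots + b_{jn} x_n,
\qquad j = 1, \dots, J - 1
```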
SPSS Output
Cells on the diagonal are correct predictions and cells off the diagonal are incorrect predictions. Overall, 56.4% of the cases are classified correctly.