
L5 Logistic Regression (2011)

This document discusses logistic regression for analyzing the relationship between hypercholesterolemia and other variables. Logistic regression is appropriate when the outcome is binary or categorical. It transforms the probability of the outcome using the logit function to allow for linear regression on the log odds. Key steps include data exploration, simple logistic regression of each variable, variable selection using forward and backward stepwise methods, and interpreting the results including odds ratios. The document provides an example analysis of factors associated with hypercholesterolemia.

L5: Logistic Regression

by
Dr Lin Naing @ Mohd. Ayub Sadiq

1
Contents
Why Logistic Regression? (Theory)

Steps in Handling Multiple Logistic Regression Analysis

Data Presentation and Interpretation

2
Relationship between Hypercholesterolemia & Age
Hypercholesterolemia: Yes=1; No=0 (Dichotomous Categorical Variable)

Linear regression clearly does not fit here: the predicted values go beyond the 0-1 range (less than zero and more than 1).
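The problem can be reproduced with a small sketch. The ages and HC values below are hypothetical (they are not the lecture dataset); an ordinary least-squares line is fitted to the 0/1 outcome and then evaluated at a young and an old age.

```python
# Least-squares line fitted to a 0/1 outcome.
# Hypothetical data: age in years and HC status (1 = yes, 0 = no).
ages = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
hc = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(hc) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, hc))
sxx = sum((x - mean_x) ** 2 for x in ages)
slope = sxy / sxx
intercept = mean_y - slope * mean_x


def predict_linear(age):
    """Predicted 'probability' of HC from the straight line."""
    return intercept + slope * age


for age in (20, 47, 80):
    print(age, round(predict_linear(age), 2))
```

With these made-up values the line gives a negative prediction at age 20 and a prediction above 1 at age 80, neither of which can be a probability.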

3
Relationship between Hypercholesterolemia & Age

[Figure: probability (risk) of HC plotted against age in years. HC = Hypercholesterolaemia]
4
5
Probability of HC = P
Odds of HC = P / (1- P)

Log odds of HC = Log {P / (1- P)}

The log odds of HC is also called the “logit of HC”. We can use a linear model on the logit of HC. That is called “logit” or “logistic” regression.
6
More about Odds Ratio
What is odds?
• Risk: the chance of getting a disease (2 out of 10 means 0.2)
• Odds of having a disease = (chance of getting the disease) / (chance of not getting the disease)
• Odds of having a disease = 0.2 / 0.8 = 0.25
• The chance of getting a disease can be presented as both ‘risk’ and ‘odds’.
• In logistic regression, as the model is designed based on P/(1-P), it will give us ‘odds’.

Risk    Odds
0.10    0.11
0.20    0.25
0.30    0.43
0.40    0.67
0.50    1.00
0.75    3.00
0.90    9.00
0.95    19.00
0.99    99.00
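The risk-to-odds table can be recomputed directly from odds = P / (1 - P). The following is a minimal sketch; the function name `risk_to_odds` is only an illustrative choice.

```python
def risk_to_odds(risk):
    """Convert a risk (probability) into odds: P / (1 - P)."""
    return risk / (1 - risk)


risks = [0.10, 0.20, 0.30, 0.40, 0.50, 0.75, 0.90, 0.95, 0.99]
table = [(risk, round(risk_to_odds(risk), 2)) for risk in risks]

for risk, odds in table:
    print(f"{risk:.2f}  {odds:.2f}")
```

Note how risk and odds are close when the risk is small (0.10 vs 0.11) and diverge sharply as the risk approaches 1.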
7
Linear:   Y = β0 + β1X1 + β2X2 + … + βnXn

Logistic: Log {P/(1-P)} = β0 + β1X1 + β2X2 + … + βnXn

P/(1-P) = exp(β0 + β1X1 + β2X2 + … + βnXn)

P = exp(β0 + β1X1 + β2X2 + … + βnXn) / {1 + exp(β0 + β1X1 + β2X2 + … + βnXn)}

exp(β) = Odds Ratio
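As a rough illustration of the chain above, the sketch below turns a linear predictor into odds and then into a probability. The coefficients are hypothetical values chosen for the example, not estimates from the lecture data.

```python
import math


def predicted_probability(intercept, coefficients, values):
    """P = exp(b0 + b1*X1 + ...) / (1 + exp(b0 + b1*X1 + ...))."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, values))
    return math.exp(linear) / (1 + math.exp(linear))


# Hypothetical coefficients: intercept, age (per year), male sex.
b0 = -8.0
betas = [0.15, 2.0]

p = predicted_probability(b0, betas, [45, 1])  # a 45-year-old male
odds = p / (1 - p)
log_odds = math.log(odds)
print(round(p, 3), round(odds, 3), round(log_odds, 3))
```

Whatever values are plugged in, the result stays strictly between 0 and 1, and taking the log odds of P recovers the linear predictor.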

8
More about Odds Ratio
What is Odds Ratio?
• Risk Ratio (Relative Risk) = (risk of getting the disease among exposed) / (risk of getting the disease among non-exposed)
• Odds Ratio = (odds of getting the disease among exposed) / (odds of getting the disease among non-exposed)
• An odds ratio of 3 means “the odds of getting the disease among the exposed are 3 times the odds among the non-exposed”.
• An odds ratio of 3 for smoking with lung cancer means “the odds of getting lung cancer among smokers are 3 times the odds among non-smokers”.

9
exp(β) = Odds Ratio … WHY?

P/(1-P) = exp(β0 + β1X1), where X1 = smoking variable (1=smoker; 0=non-smoker)

Odds of having disease for smoker = exp(β0 + β1)

Odds of having disease for non-smoker = exp(β0)

How many times higher are the odds for smokers compared to non-smokers? (i.e. Odds Ratio)

Odds Ratio = exp(β0 + β1) / exp(β0) = {exp(β0) * exp(β1)} / exp(β0) = exp(β1)
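The cancellation of exp(β0) can be checked numerically. The coefficient values below are hypothetical and serve only to illustrate the identity.

```python
import math

# Hypothetical coefficients for a model with one 0/1 predictor (smoking).
beta0 = -2.0
beta1 = 1.1

odds_smoker = math.exp(beta0 + beta1)      # X1 = 1
odds_non_smoker = math.exp(beta0)          # X1 = 0
odds_ratio = odds_smoker / odds_non_smoker

print(round(odds_ratio, 4), round(math.exp(beta1), 4))
```

The two printed numbers agree, and changing `beta0` leaves the odds ratio unchanged, which is why the intercept does not appear in exp(β1).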

10
When to use Logistic Regression?

► When the outcome is categorical, regardless of the type of independent variables
► If the outcome is dichotomous, use binary logistic regression.
► If the outcome variable has more than 2 levels and is ordinal, we can use ordinal logistic regression.
► If the outcome variable has more than 2 levels and is nominal, we can use multinomial logistic regression.
11
Principles are the same as in multiple linear regression (MLR).
• The process of modeling is the same as in MLR.
• The theory of independent effects (dealing with confounding) and interactions is the same as in MLR.
• Why the same? In fact, it is a linear model, but the outcome (0,1) is modelled on the “log odds” scale.
• Some other models (e.g. Poisson regression for count data) also transform the outcome and model it as linear. These models are collectively called "Generalized Linear Models" (GLM). Cox regression for survival data uses a similar linear predictor, although it is not strictly a GLM.
12
Steps in Handling MLogR (Multiple Logistic Regression)
Step 1: Data exploration (Descriptive Statistics)
Step 2: Simple Logistic Regression
Step 3: Variable selection & checking “linearity in the logit”
  → Preliminary main-effect model
Step 4: Checking interaction & multicollinearity (a)
  → Preliminary final model
Step 5: Checking model assumptions (a) & outliers
  → Final model
Step 6 (b): Interpretation & data presentation

(a) Remedial measures are needed if problems are detected.
(b) External validation may be done before Step 6 (it is essential in developing a prediction model).

13
Step 2: Simple logistic regression
Use file name - hypercholesterol

Analyze – Regression – Binary logistic...

Enter the variables one at a time.

In Options, please click ’CI for exp(B)’.
14
It means female is ‘first’ and male is ‘last’ (‘zero’ comes before ‘1’).
Therefore, if we want ‘male’ as the reference, we do not change anything.
If we want ‘female’ as the reference, we click ‘first’ and then click the ‘change’ button (let's do this!).

15
The output gives the OR and the 95% CI of the OR.

Wald test: tests whether b=0 (equivalently, whether exp(b)=1).

LR test: tests whether the variable(s) in the model contribute significantly.

What does the LR test do? It compares the "likelihoods" of the model with only the intercept and the model with intercept + sex.
In our study, the 2 models are significantly different, which means that sex contributes significantly to the model. If the test is not significant, sex is a useless variable.
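For a single 0/1 predictor, the LR statistic can be worked out by hand, because the fitted probabilities of the model with sex are simply the observed proportions in each group. The sketch below uses hypothetical counts (not the lecture dataset) and the standard 5% critical value of 3.841 for a chi-square distribution with 1 df.

```python
import math


def log_likelihood(events, total):
    """Maximised log-likelihood when all subjects share p = events / total."""
    p = events / total
    return events * math.log(p) + (total - events) * math.log(1 - p)


# Hypothetical counts: (subjects with HC, group size).
female = (30, 200)
male = (120, 200)

# Model with intercept only: one common probability for everyone.
ll_null = log_likelihood(female[0] + male[0], female[1] + male[1])
# Model with intercept + sex: one probability per sex group.
ll_sex = log_likelihood(*female) + log_likelihood(*male)

lr_chi2 = -2 * (ll_null - ll_sex)   # LR statistic, 1 df
significant = lr_chi2 > 3.841       # 5% critical value, chi-square with 1 df

print(round(lr_chi2, 2), significant)
```

SPSS reports the same kind of comparison as a change in -2 log likelihood between the two models.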
16
The output gives the OR and the 95% CI of the OR.

Wald test: tests whether b=0 (equivalently, whether exp(b)=1).

LR test: the DM variable contributes significantly to the model (P=0.001).
17
Table 4. Factors associated with hypercholesterolemia (using simple logistic regression)

Variable          Crude OR (95% CI)      X2 stat. (df) a    P value a
Age (year)        1.34 (1.28, 1.40)      263.4 (1)          <0.001
Sex
  Male            13.86 (9.83, 19.55)    276.4 (1)          <0.001
  Female          1.00
DM                                       60.07 (2)          <0.001
  Uncontr.        4.15 (2.66, 6.46)      39.59 (1) b        <0.001 b
  Contr.          0.79 (0.57, 1.08)      2.18 (1) b         0.140 b
  No DM           1.00
HPT
  Yes             0.97 (0.73, 1.29)      0.04 (1)           0.843
  No              1.00
Exercise
  No              4.54 (3.36, 6.14)      103.9 (1)          <0.001
  Yes             1.00
a Likelihood Ratio (LR) test   b Wald test
All results are from simple logistic regression.
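The P values for the 1-df statistics in Table 4 can be cross-checked from the X2 values alone. This is a small sketch that relies on the identity P(X2 > x) = erfc(sqrt(x / 2)) for a chi-square distribution with 1 df; because the table statistics are rounded, the results agree with the table only approximately.

```python
import math


def p_value_df1(chi2):
    """Upper-tail P value of a chi-square statistic with 1 df."""
    return math.erfc(math.sqrt(chi2 / 2))


# X2 statistics with 1 df taken from Table 4.
statistics = {
    "Age": 263.4,
    "Sex": 276.4,
    "Controlled DM (Wald)": 2.18,
    "HPT": 0.04,
    "Exercise": 103.9,
}

for name, chi2 in statistics.items():
    print(f"{name}: {p_value_df1(chi2):.3g}")
```

The identity does not apply to the overall DM test, which has 2 df.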
18
Step 3.1: Variable selection
• Automatic / manual procedure
  – Forward procedure
  – Backward procedure
  – Stepwise (forward / backward) procedure
• Nowadays, as computers are faster, automatic procedures can be done easily.
• In SPSS, all are stepwise, either forward or backward, using different tests (the LR test is preferable).
• Therefore, we will use at least 2 methods: backward LR & forward LR.

19
Step 3.1: Variable selection

DM (No=0; Yes=1) – We want ‘No’ as reference.
The same for HPT.
But Exercise (No=0; Yes=1) – ‘Yes’ as reference.
20
The output gives the OR and the 95% CI of the OR.

The Wald test result for DM needs to be presented, because we cannot get an LR test for each level of a categorical variable that has more than 2 levels.

Here, forward LR and backward LR give different results.

The output also gives the X2 statistic and the LR test result for each variable in the model.
21
Step 3.1: Variable selection

Here the LR test compares the likelihoods of 2 models:
the model with age (sex + DM + exercise + age) versus
the model without age (sex + DM + exercise)

If the two models are significantly different in terms of likelihood, it means that age contributes significantly to the model.
If not significant, age does not contribute to the model.

22
Step 3.2: Linear in logit for numerical independent variables
Among our independent variables, only age is a numerical variable. We have to test whether age is linear in the logit (of the outcome). HOW?

Log {P/(1-P)} = β0 + β1X1 + β2X2 + ….+ βnXn

There is more than one method to check this assumption. The most practical method using SPSS is described here (i.e. a design variable based on quartiles).
23
Step 1: Categorize the variable in quartiles (4 levels)

Step 2: Calculate the midpoints of the quartile groups

Step 3: Fit the model with 4-level categorical variable

Step 4: Plot the midpoints and "b" to observe the linearity

Log {P/(1-P)} = β0 + β1X1 + β2X2 + ….+ βnXn


24
Step 1: Categorize the variable in quartiles (4 levels)

Transform >> Visual Binning...

25
Step 2: Calculate the midpoints of the quartile groups
Analyze >> Compare Means >> Means...

Calculate the midpoint of each quartile group: midpoint = min + (range/2)

Midpoints of the 4 quartile groups: 30.5, 38.0, 41.5, 48.0
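The midpoint formula is simple enough to sketch in a few lines. The ages listed below are hypothetical members of one quartile group, chosen only to show the arithmetic.

```python
def midpoint(values):
    """Midpoint of a group: min + (range / 2)."""
    lowest, highest = min(values), max(values)
    return lowest + (highest - lowest) / 2


# Hypothetical ages falling in the lowest quartile group.
lowest_quartile_ages = [27, 29, 31, 33, 34]
print(midpoint(lowest_quartile_ages))
```

In SPSS the minimum and maximum of each group come from the Means output, and the same formula is applied to each of the 4 groups.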
26
Step 3: Fit the model with 4-level categorical variable

Remove age from the model and add age4 (the 4-level categorical variable).

Age4       MidPt.   b
Age4(0)    30.5     0.00
Age4(1)    38.0     1.11
Age4(2)    41.5     1.28
Age4(3)    48.0     1.63

27
Step 4: Plot the midpoints and "b" to observe the linearity

Enter the midpoints and "b" values in SPSS:

Age4       MidPt.   b
Age4(0)    30.5     0.00
Age4(1)    38.0     1.11
Age4(2)    41.5     1.28
Age4(3)    48.0     1.63

Make a scatter plot of "b" against the midpoints. We call it the Linear Quartile Plot.

This plot will be considered 'linear', because it is very rare to get a perfectly linear pattern in practice.
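The lecture judges linearity by eye from the scatter plot. As an optional numerical companion (an addition here, not part of the lecture's method), the sketch below computes the Pearson correlation and least-squares slope between the midpoints and the "b" values from the table.

```python
import math

midpoints = [30.5, 38.0, 41.5, 48.0]
b_values = [0.00, 1.11, 1.28, 1.63]

n = len(midpoints)
mean_x = sum(midpoints) / n
mean_y = sum(b_values) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(midpoints, b_values))
sxx = sum((x - mean_x) ** 2 for x in midpoints)
syy = sum((y - mean_y) ** 2 for y in b_values)

r = sxy / math.sqrt(sxx * syy)   # Pearson correlation
slope = sxy / sxx                # change in b per year of age

print(round(r, 3), round(slope, 4))
```

A correlation close to 1 is consistent with the slide's conclusion that the pattern is acceptably linear, although with only 4 points it should not replace looking at the plot.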

28
Step 4: Plot the midpoints and "b" to observe the linearity

[Figure: example plots showing a linear pattern and nonlinear patterns]

If the plot is nonlinear, the assumption of "linear in logit" is not satisfied. Therefore, the numerical variable is not appropriate.

The solution to this problem is to use the categorized variable instead of the numerical variable.

29
Step 4.1: Checking Interactions
• All possible 2-way interactions are checked.
• Interaction terms are created.
• Each term is added to the preliminary main-effect model as an additional independent variable.
• Run the model using "enter".
• If an interaction term is significant (P<0.05), it means that there is an interaction between the 2 variables. The appropriate model is then the main-effect variables plus the significant interaction term.
• Check one interaction term at a time.
30
How to create an interaction term?
Select "age" first. Then select "dm" while pressing the "control" key, so that both variables are selected.
Run the model using "enter".
Check whether the interaction term is significant.
Do this process for one interaction term at a time (as in MLR).

31
Interaction terms

We have found that Sex and Exercise have a significant interaction.

It means that our model should include age, sex, DM, exercise and the sex*exercise interaction term.
32
Step 4.2: Checking Multicollinearity (MC)
• If the independent variables are highly correlated, the regression model is said to be “statistically not stable”.
– P values of the involved variables are considerably larger (than they should be).
– The 95% CIs of the regression coefficients are wider.
– Appropriate variables may be wrongly rejected.
– Therefore, statistically, it is said that ‘the model is not stable’.
• We have to check whether this kind of problem (MC) exists in the Preliminary Main Effect Model.
• SPSS has no facility to check MC under logistic regression. Therefore, we check this under linear regression.

33
Step 4.2: Checking Multicollinearity (MC)
• Just run the Preliminary main effect model (without interaction
terms) by using ‘enter’, and click ‘collinearity diagnostics’ in
the statistics window.

Notice: no interaction terms here, only age, sex, DM and exercise. Use "enter".

34
Step 4.2: Checking Multicollinearity (MC)
• Just run the Preliminary main effect model (without interaction
terms) by using ‘enter’, and click ‘collinearity diagnostics’ in
the statistics window.

Look at the VIF (variance inflation factor). VIF measures the extent of the multicollinearity problem. If VIF is more than 10, the problem needs remedial measures. Consult a statistician.
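To give a feel for what VIF measures: for one predictor, VIF = 1 / (1 - R2), where R2 comes from regressing that predictor on the other predictors. With only two predictors, R2 is just the squared correlation between them. The sketch below uses that special case with hypothetical data; it is not the SPSS procedure itself.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(xs, ys):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(xs, ys)
    return 1 / (1 - r ** 2)


# Hypothetical predictors: age in years and sex (1 = male, 0 = female).
age = [30, 35, 40, 45, 50, 55]
sex = [0, 1, 0, 1, 0, 1]

vif = vif_two_predictors(age, sex)
print(round(vif, 3), "needs remedial measures" if vif > 10 else "acceptable")
```

A VIF of 1 means no correlation with the other predictor; the value grows without bound as the correlation approaches 1.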

35
Step 5: Checking assumptions & outliers
• There is one main statistical assumption, i.e. overall model fitness.
• There are several measures of goodness-of-fit.
• We run the model (all significant variables and significant interaction term(s), if present), and check ..
– (1) Hosmer-Lemeshow goodness-of-fit test (click 'Hosmer-Lemeshow goodness-of-fit')

"Not significant" means ..
the model fits well; OR
the dataset fits well to the logistic model; OR
the dataset relationship pattern is not significantly different from the theoretical logistic model.


36
Step 5: Checking assumptions & outliers
• We run the model (with interaction if present), and check ..
– (2) Classification table (sensitivity, specificity of model's prediction)

The classification table shows the specificity and sensitivity of the model's predictions.

82.5% of cases are correctly predicted as having HC or not.
About 70% or above is expected for the model to be considered good.
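The three quantities read from a classification table can be sketched as follows. The counts are hypothetical and are not the counts behind the 82.5% reported above.

```python
# Hypothetical classification table (observed vs predicted HC).
true_positive = 80    # observed HC, predicted HC
false_negative = 20   # observed HC, predicted no HC
true_negative = 70    # observed no HC, predicted no HC
false_positive = 30   # observed no HC, predicted HC

sensitivity = true_positive / (true_positive + false_negative)
specificity = true_negative / (true_negative + false_positive)
total = true_positive + false_negative + true_negative + false_positive
overall_correct = (true_positive + true_negative) / total

print(sensitivity, specificity, overall_correct)
```

The overall percentage is the figure compared against the "about 70%" guideline on the slide.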

37
Step 5: Checking assumptions & outliers
– (3) Area under the ROC (Receiver Operating Characteristic) curve - shows the model's ability to discriminate between the 2 outcomes
1. Run the preliminary final model & save the predicted values (probabilities)
2. Go to Analyze > ROC Curve...

The AUC of the ROC curve is 91.2%.

0.5 = no discrimination
≥0.7 to <0.8 = acceptable discrimination
≥0.8 to <0.9 = excellent discrimination
≥0.9 = outstanding discrimination
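One way to understand the AUC is as the proportion of (HC, non-HC) pairs in which the subject with HC has the higher predicted probability, counting ties as half. The sketch below uses hypothetical saved probabilities and applies the cut-offs listed above; values between 0.5 and 0.7 are not labelled on the slide, so the function returns a neutral label for them.

```python
def auc(probs_with_outcome, probs_without_outcome):
    """AUC as the share of pairs ranked correctly (ties count 0.5)."""
    wins = 0.0
    for p in probs_with_outcome:
        for q in probs_without_outcome:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(probs_with_outcome) * len(probs_without_outcome))


def describe(auc_value):
    """Labels follow the cut-offs given on the slide."""
    if auc_value >= 0.9:
        return "outstanding discrimination"
    if auc_value >= 0.8:
        return "excellent discrimination"
    if auc_value >= 0.7:
        return "acceptable discrimination"
    return "below the acceptable range"


# Hypothetical predicted probabilities saved from a model.
hc_yes = [0.9, 0.8, 0.6, 0.4]
hc_no = [0.7, 0.3, 0.2, 0.1]

area = auc(hc_yes, hc_no)
print(area, describe(area))
```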
38
Step 5: Checking assumptions & outliers
To check for outliers, we have to save Cook's influence statistic.

The cut-off point for Cook's influence statistic is 1.0. A data point above 1.0 is considered an influential outlier.
In our data, none is more than 1.0. Therefore, there is no influential outlier.
39
Step 6: Interpretation & data presentation
Table 5. Factors associated with hypercholesterolemia (using multiple logistic regression)

Variable          Adj. OR (95% CI)       X2 stat. (df) a    P value a
Age (year)        1.16 (1.08, 1.24)      16.28 (1)          <0.001
Sex
  Male            7.99 (3.19, 20.01)     21.60 (1)          <0.001
  Female          1.00                   -
DM                                       54.51 (2)          <0.001
  Uncontr.        8.16 (4.30, 15.48)     41.31 (1) b        <0.001 b
  Contr.          0.95 (0.60, 1.51)      0.05 (1) b         0.827 b
  Normal          1.00                   -
Exercise
  No              6.48 (3.19, 13.16)     33.16 (1)          <0.001
  Yes             1.00                   -
Sex*Exercise      2.99 (1.09, 8.25)      4.47 (2)           0.034
Adj. OR = Adjusted odds ratio   a Likelihood Ratio (LR) test   b Wald test

Tip: To obtain the LR test result (table of 'Model if Term Removed'), we have to run forward or backward LR, and click 'at each step' in the Options window. Using 'enter' will not give the LR test result.

This table simply presents the final model.

We also have to report model fitness, outliers and pseudo R2, perhaps in the text.
Because of the interaction (IA), we can only interpret Age & DM directly, as they are not involved in the IA.
For Sex and Exercise, special models have to be run for the purpose of interpretation.

40
Step 6: Interpretation & data presentation
Sex*Exercise      2.99 (1.09, 8.25)      4.47 (2)           0.034

When there is an interaction, interpretation takes more work.
In our case, Age and DM are not involved in the interaction, so they can be interpreted as usual, but Sex and Exercise are involved in the interaction.

Exercise & Sex interaction means that the OR for Sex differs between exercise groups (OR1 ≠ OR2). We need to present two ORs:

Those who do exercise: effect of Sex (Male vs Female), OR1 = 6.96
Those who do not exercise: effect of Sex (Male vs Female), OR2 = 22.76
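The link between the interaction coefficient and the group-specific ORs can be sketched from the Table 5 values, assuming female and 'does exercise' are the reference categories and the interaction term is coded male × no exercise. Under that assumption, the OR for Sex among non-exercisers is exp(b_sex + b_interaction), i.e. the product of the two ORs. The lecture instead obtains 6.96 and 22.76 by running separate selected models, so the figures below are close to, but not identical with, those values.

```python
import math

# Adjusted ORs taken from Table 5 (model with the interaction term).
or_sex = 7.99          # male vs female, at the reference exercise level
or_exercise = 6.48     # no exercise vs exercise, at the reference sex
or_interaction = 2.99  # sex*exercise

b_sex = math.log(or_sex)
b_exercise = math.log(or_exercise)
b_interaction = math.log(or_interaction)

# Effect of Sex within each exercise group (assumed coding, see above).
or_sex_exercisers = math.exp(b_sex)
or_sex_non_exercisers = math.exp(b_sex + b_interaction)

# Effect of Exercise within each sex.
or_exercise_females = math.exp(b_exercise)
or_exercise_males = math.exp(b_exercise + b_interaction)

print(round(or_sex_exercisers, 2), round(or_sex_non_exercisers, 2))
print(round(or_exercise_females, 2), round(or_exercise_males, 2))
```

The interaction OR of 2.99 is thus the factor by which the Sex effect is multiplied when moving from exercisers to non-exercisers.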
41
Step 6: Interpretation & data presentation
Sex*Exercise      2.99 (1.09, 8.25)      4.47 (2)           0.034

When there is an interaction, interpretation takes more work.
In our case, Age and DM are not involved in the interaction, so they can be interpreted as usual, but Sex and Exercise are involved in the interaction.

Exercise & Sex interaction means that the OR for Exercise differs between males and females (OR3 ≠ OR4). We need to present two ORs:

Male: effect of Exercise (those who do not do exercise vs those who do exercise), OR3 = 18.78
Female: effect of Exercise (those who do not do exercise vs those who do exercise), OR4 = 6.43
42
Step 6: Interpretation & data presentation
Sex*Exercise      2.99 (1.09, 8.25)      4.47 (2)           0.034

When there is an interaction, interpretation takes more work.
In our case, Age and DM are not involved in the interaction, so they can be interpreted as usual, but Sex and Exercise are involved in the interaction.

No interaction between Exercise & DM means that the OR for DM is the same in both exercise groups (OR5 = OR6). We need only one OR:

Those who do exercise: effect of DM (Uncontrolled DM vs No DM), OR5
Those who do not exercise: effect of DM (Uncontrolled DM vs No DM), OR6

43
Step 6: Interpretation & data presentation
Why does an interaction cause these different ORs?

Hyperchol.   Gender   Exercise      Effect
1 unit       Male     Do exercise   Gender effect
1 unit       Female   No exercise   Exercise effect
2 units      Male     No exercise   Combined effect (No IA)
2.5 units    Male     No exercise   Combined effect (Syn. IA)
1.5 units    Male     No exercise   Combined effect (Ant. IA)

IA = Interaction; Syn. IA = Synergistic Interaction; Ant. IA = Antagonistic Interaction

44
Step 6: Interpretation & data presentation
Why does an interaction cause these different ORs?

Hyperchol.   Gender   Exercise      Effect
1 unit       Male     Do exercise   Gender effect
1 unit       Female   No exercise   Exercise effect
2.5 units    Male     No exercise   Combined effect (Syn. IA)

The Exercise effect among females is 1 unit (0 → 1):
0 unit       Female   Do exercise
1 unit       Female   No exercise

The Exercise effect among males is 1.5 units (1 → 2.5):
1 unit       Male     Do exercise
2.5 units    Male     No exercise

45
Step 6: Interpretation & data presentation
Why does an interaction cause these different ORs?

Hyperchol.   Gender   Exercise      Effect
1 unit       Male     Do exercise   Gender effect
1 unit       Female   No exercise   Exercise effect
2.5 units    Male     No exercise   Combined effect (Syn. IA)

The Gender effect among those who do exercise is 1 unit (0 → 1):
0 unit       Female   Do exercise
1 unit       Male     Do exercise

The Gender effect among those who do not exercise is 1.5 units (1 → 2.5):
1 unit       Female   No exercise
2.5 units    Male     No exercise

46
Step 6: Interpretation & data presentation
Table 5. Factors associated with hypercholesterolemia (using multiple logistic regression)

Variable          Adj. OR (95% CI)       X2 stat. (df) a    P value a
Age (year)        1.16 (1.08, 1.24)      16.28 (1)          <0.001

OR for age (1.16): Imagine comparing two groups, one of which is one (1) year older than the other. The older group has 1.16 times the odds of having HC of the younger group (P<0.001).
For example, 31-year-old people have 1.16 times the odds of having HC compared to 30-year-old people.
In other words, 31-year-old people have 16% higher odds of having HC compared to 30-year-old people.

But in practice, a one-year age difference is not that important. It is more practical to interpret the OR for a group that is 5 or 10 years older.
In this case, the odds ratio for 5 years older = 1.16^5 = 2.10

The group that is 5 years older has 2.1 times the odds of having HC compared to the younger age group.
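The rescaling works because the coefficient is additive on the log-odds scale: a k-year difference multiplies the odds by OR^k. The sketch below applies this to the age OR and, in the same way, to its 95% CI limits from Table 5.

```python
or_per_year = 1.16
ci_per_year = (1.08, 1.24)


def rescale(odds_ratio, years):
    """OR for a difference of `years` years: OR raised to that power."""
    return odds_ratio ** years


or_5_years = rescale(or_per_year, 5)
ci_5_years = tuple(rescale(limit, 5) for limit in ci_per_year)

print(round(or_5_years, 2), tuple(round(limit, 2) for limit in ci_5_years))
```

Because the table values are rounded to two decimals, results for larger differences (such as 10 years) are only approximate.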

47
Step 6: Interpretation & data presentation
Table 5. Factors associated with hypercholesterolemia (using multiple logistic regression)

Variable          Adj. OR (95% CI)       X2 stat. (df) a    P value a
DM                                       54.51 (2)          <0.001
  Uncontr.        8.16 (4.30, 15.48)     41.31 (1) b        <0.001 b
  Contr.          0.95 (0.60, 1.51)      0.05 (1) b         0.827 b
  Normal          1.00                   -

OR for uncontrolled DM (8.16): Those with uncontrolled DM have 8.16 times the odds of having HC of those without DM (P<0.001).
OR for controlled DM (0.95): The odds of having HC are not significantly different between controlled DM patients and those without DM (P=0.827).

48
Step 6: Interpretation & data presentation
Because of the interaction between Sex and Exercise, the effect of Sex should be interpreted for each level of exercise.
Similarly, the effect of exercise should be interpreted for each category of Sex (male and female). For this purpose, we have to run some specific (selected) models.
Here we want to obtain the effect of Sex for those who do not exercise.

This is the OR of Sex among those who do not do exercise. Males are 22.8 times of ... females.
49
Step 6: Interpretation & data presentation
Because of the interaction between Sex and Exercise, the effect of Sex should be interpreted for each level of exercise, and the effect of exercise for each category of Sex, using specific (selected) models.

Here we want to obtain the effect of Sex for those who do exercise.

This is the OR of Sex among those who do exercise. Males are 7 times of ... females.

50
n = 392

Those who do not do exercise.

n = 408
Those who do exercise.

For those who do not do exercise, the OR for sex (male as compared to
female) is 22.8 and for those who do exercise, it is 7.0.

It shows the effect of Sex (Male vs Female) is different between those who do
exercise and those who do not do exercise. That is an 'interaction'.

51
Step 6: Interpretation & data presentation
The effect of exercise also has to be interpreted in the same manner.
Here we want to obtain the effect of exercise among males.

n = 368

n = 432 (among females)

52
Step 6: Interpretation & data presentation
Table 5. Factors associated with hypercholesterolemia (using multiple logistic regression)

Variable          Adj. OR (95% CI)       X2 stat. (df) a    P value a
Age (year)        1.16 (1.08, 1.24)      16.28 (1)          <0.001
DM                                       54.51 (2)          <0.001
  Uncontr.        8.16 (4.30, 15.48)     41.31 (1) b        <0.001 b
  Contr.          0.95 (0.60, 1.51)      0.05 (1) b         0.827 b
  Normal          1.00                   -
Sex (Exercise=No)
  Male            22.76 (8.16, 63.50)    35.64 (1) b        <0.001 b
  Female          1.00                   -
Sex (Exercise=Yes)
  Male            6.96 (2.29, 21.13)     11.74 (1) b        0.001 b
  Female          1.00                   -
Exercise (Male)
  No              18.78 (9.00, 39.18)    61.13 (1) b        <0.001 b
  Yes             1.00                   -
Exercise (Female)
  No              6.43 (3.16, 13.07)     26.44 (1) b        <0.001 b
  Yes             1.00                   -
Adj. OR = Adjusted odds ratio   a Likelihood Ratio (LR) test   b Wald test

53
Step 6: Interpretation & data presentation
If the interaction involves a numerical variable, categorize the numerical variable into a dichotomous variable (cutpoint at the median), then analyse & interpret as described earlier.
When running selected models to interpret the interaction variables, the effects of other variables might change (e.g. the OR of age becomes non-significant). We do NOT have to be concerned about this.
If the effect of Sex in the 2 situations (those with exercise and without exercise) differs, but the difference is not considered important, we can decide to drop the interaction term at this stage.
When running selected models, sample sizes become smaller, so the 95% CI sometimes becomes extremely wide. Therefore, having many levels in a categorical variable is not desirable.
54
Questions and Answers

Reference:
Hosmer D.W., Lemeshow S. (2000). Applied logistic
regression. (2nd Ed.) New York: Wiley & Sons, Inc.

55
