Logistic Regression & Practice

Truong Phuoc Long, PhD

Logistic regression

• When studying linear regression, we tried to estimate a population regression equation.
• In linear regression, we fit a model of the form:

   y = β0 + β1x1 + ... + βqxq + ε

• The outcome y is a continuous variable, assumed to follow a normal distribution.
• In many situations, y is a binary variable (disease or not) with 2 values: 0 (no) and 1 (yes).
• The mean of y is the proportion of times that it takes the value 1: p = Pr(y = 1).
Categorical Response Variables
Examples:

• Whether or not a person smokes (binary response):
      Y = { Non-smoker, Smoker }
• Success of a medical treatment (binary response):
      Y = { Survives, Dies }
• Opinion poll responses (ordinal response):
      Y = { Agree, Neutral, Disagree }

Example:
Age and signs of coronary heart disease (CD)

Age  CD    Age  CD    Age  CD
22    0    40    0    54    0
23    0    41    1    55    1
24    0    46    0    58    1
27    0    47    0    60    1
28    0    48    0    60    0
30    0    49    1    62    1
30    0    49    0    65    1
32    0    50    1    67    1
33    0    51    0    71    1
35    1    51    1    77    1
38    0    52    0    81    1

How can we analyse these data?

• Compare the mean age of diseased and non-diseased women:
  - Non-diseased: 38.6 years
  - Diseased: 58.7 years (p < 0.0001)
• Linear regression?
• Can we apply an Ordinary Least Squares (OLS) regression?

The scatter plot

[Scatter plot: signs of coronary disease (No = 0, Yes = 1) plotted against AGE (years)]

The relationship between age and signs of coronary disease cannot be linear.
Linear regression for Binary variable

In the OLS regression:

   Y = β0 + β1X + ε, where Y ∈ {0, 1}

• The error terms are heteroskedastic.
• ε is not normally distributed because Y takes on only two values.
→ We cannot apply ordinary linear regression because the outcome is neither continuous nor normally distributed.

Let’s try another way
Table: Prevalence (%) of signs of CD according to age group

Age group   # in group   # diseased   % diseased
20-29            5             0             0
30-39            6             1            17
40-49            7             2            29
50-59            7             4            57
60-69            5             4            80
70-79            2             2           100
80-89            1             1           100

The scatter plot now
[Scatter plot: % diseased (0-100) plotted against age group]

It looks like the S-shape of a sigmoid curve, or a logistic curve.


Sigmoid curve and logistic function

• A sigmoid function is a mathematical function having a characteristic "S"-shaped curve, or sigmoid curve.
• A common example of a sigmoid function is the logistic function, defined by the formula:

   f(x) = 1 / (1 + e^(-x))

• Sigmoid functions most often return a value (y axis) in the range 0 to 1 (see the sketch below).
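As a quick numeric check, a minimal Python sketch of the logistic function:

```python
import numpy as np

def sigmoid(x):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(-5.0))  # ~0.007, far left of the S-curve
print(sigmoid(0.0))   # 0.5, the midpoint
print(sigmoid(5.0))   # ~0.993, far right of the S-curve
```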

The Logistic Function
[Plot: probability of disease rising from 0.0 to 1.0 as an S-shaped curve in x]

So we fit a model of the form:

   P(y | x) = e^(α + βx) / (1 + e^(α + βx))

This is called the logistic function.
Logistic regression

• The probability of having the disease:

   p = P(y | x) = e^(α + βx) / (1 + e^(α + βx))

• The odds of having the disease:

   p / (1 - p) = e^(α + βx)

• Taking the natural logarithm of each side:

   ln[p / (1 - p)] = α + βx

• We fit a linear regression model between x and the log of the odds of having the disease, assuming that the relationship between ln[p/(1-p)] and x is linear.
• This technique is known as logistic regression.
The Logistic Regression Model

• p is the probability that the event Y occurs, Pr(Y = 1).
• p/(1-p) is the "odds".
• ln[p/(1-p)] is the log odds (see the sketch below).
• The logistic model assumes a linear relationship between the predictors and log(odds).
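To make these three quantities concrete, a small Python sketch (the value p = 0.8 is just an illustration):

```python
import math

p = 0.8                    # illustrative probability, Pr(Y = 1)
odds = p / (1 - p)         # 4.0: the event is 4 times as likely as not
log_odds = math.log(odds)  # ~1.386: the quantity the model treats as linear in x
print(odds, log_odds)
```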
The Logistic Regression Model

β0 = log odds of disease in the unexposed

β1 = log odds ratio associated with being exposed

e^β1 = odds ratio

Fitting equation to the data

• Linear regression: least squares
• Logistic regression: maximum likelihood
• In statistics, maximum likelihood estimation is a method of estimating the parameters of an assumed probability distribution, given some observed data.
• This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable.
• Likelihood function
  - Estimates the parameters β0 and β1
  - Practically easier to work with the log-likelihood (see the sketch below)
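A minimal sketch of such a log-likelihood for the simple logistic model, assuming x and y are NumPy arrays holding the predictor values and the 0/1 outcomes:

```python
import numpy as np

def log_likelihood(beta0, beta1, x, y):
    """Bernoulli log-likelihood of (beta0, beta1) given predictors x and 0/1 outcomes y."""
    p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))  # P(y=1 | x) under the model
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```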
Maximum likelihood

• Iterative computing
  - Choose an arbitrary starting value for the coefficients (usually 0)
  - Compute the log-likelihood
  - Vary the coefficients' values
  - Reiterate until maximisation (a plateau is reached)
• Results (see the sketch below)
  - Maximum Likelihood Estimates (MLE) for β0 and β1
  - Estimates of P(y) for a given value of x
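In practice the iteration is usually delegated to a numerical optimiser. A sketch with scipy, assuming x and y are NumPy arrays of predictor values and 0/1 outcomes:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(b, x, y):
    """Negative Bernoulli log-likelihood; minimising it maximises the likelihood."""
    p = 1.0 / (1.0 + np.exp(-(b[0] + b[1] * x)))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Start from the arbitrary value (0, 0) and iterate until the plateau
result = minimize(neg_log_likelihood, x0=np.zeros(2), args=(x, y))
beta0_hat, beta1_hat = result.x  # the Maximum Likelihood Estimates
```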
Multiple logistic regression

• More than one independent variable
  - Predictor variables may be of any data level (categorical, ordinal, or continuous).

   ln[P / (1 - P)] = α + β1x1 + β2x2 + ... + βixi

• Interpretation of βi
  - The increase in log-odds for a one-unit increase in xi, with all the other x's held constant.
  - Measures the association between xi and the log-odds, adjusted for all other x's.
Statistical testing

• Question
  - Does a model including a given independent variable provide more information about the dependent variable than the model without this variable?
• Three tests
  - Likelihood ratio statistic (LRS)
  - Wald test
  - Score test
Likelihood ratio statistic

• Compares two nested models:

   log(odds) = α + β1x1 + β2x2 + β3x3   (model 1)
   log(odds) = α + β1x1 + β2x2          (model 2)

• Likelihood ratio statistic (LRS):

   -2 log(likelihood model 2 / likelihood model 1)
   = [-2 log likelihood (model 2)] - [-2 log likelihood (model 1)]

  The LR statistic is a χ² with DF = the number of extra parameters in the larger model.
  - The null hypothesis of the test states that the smaller model provides as good a fit for the data as the larger model.
  - If the null hypothesis is rejected, then under the alternative hypothesis the larger model provides a significant improvement over the smaller model (see the sketch below).
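A minimal sketch of the test, assuming the two -2 log-likelihood values have already been obtained from the fitted models:

```python
from scipy.stats import chi2

def lr_test(neg2ll_small, neg2ll_large, df):
    """Likelihood ratio test for nested models; df = number of extra parameters."""
    lrs = neg2ll_small - neg2ll_large  # -2 log(L_small / L_large)
    return lrs, chi2.sf(lrs, df)       # LR statistic and its p-value
```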
Interpretation: Example
• Using a child's birth weight to predict the likelihood that he or she will develop chronic lung disease, we fit the model:

   ln[p̂ / (1 - p̂)] = β̂0 + β̂1x

• From the sample, we obtain the estimated logistic regression equation.
• The coefficient of weight implies that for each one-gram increase in birth weight, the log odds that the infant develops the disease decrease by 0.0042 on average.
Interpretation

• If an infant weighs 750 grams at birth, what is the probability that he or she develops the disease?

Interpretation
• If an infant weighs 750 grams at birth, what is the probability that he or she develops the disease?

• The logit: ln[p̂ / (1 - p̂)] = β̂0 + β̂1(750)

Converting this logit back through the logistic function gives the predicted probability (see the sketch below).
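A sketch of the calculation in Python. The fitted intercept is not reproduced in this extract, so the value below is purely hypothetical; only the slope -0.0042 is quoted on the slides:

```python
import math

beta0_hat = 4.0      # hypothetical intercept, for illustration only
beta1_hat = -0.0042  # slope quoted on the slide
logit = beta0_hat + beta1_hat * 750          # estimated log odds at 750 grams
p = math.exp(logit) / (1 + math.exp(logit))  # back-transform to a probability
```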
Notes

• Inference for the coefficient: we test the null hypothesis

   H0: β = 0 (there is no relationship between p and x)

  against the alternative H1: β ≠ 0.
• We need to know the standard error of the estimator β̂. We can then calculate the z score and test statistic (see the sketch below):

   z = β̂ / se(β̂)

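A sketch of that z test, assuming the estimate and its standard error are available:

```python
from scipy.stats import norm

def wald_z_test(beta_hat, se):
    """z statistic and two-sided p-value for H0: beta = 0."""
    z = beta_hat / se
    return z, 2 * norm.sf(abs(z))
```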
Logistic regression example

• The research was done by Wuensch and Poteat and published in the Journal of Social Behavior and Personality, 1998, 13, 139-150.
• College students (N = 315) were asked to pretend that they were serving on a university research committee to decide whether or not to withdraw a faculty member's authorization to conduct animal research.
• Use the data file logistic.sav on Blackboard Data.

Logistic regression example
• Which variables are binary (dichotomous)?

Logistic regression example

• Let's explore how "gender" predicts the "decisions" on whether to stop or continue the research.
• What will be the dependent variable Y, and what will be the independent variable X (the predictor)?

Logistic regression example

• Our regression model will be predicting the logit, that is, the natural log of the odds, of having made one or the other decision.
• In statistics, the logit function, or log-odds, is the logarithm of the odds p/(1-p), where p is a probability.
• What are ŷ and 1 - ŷ?


Logistic regression example

• Our regression model will be predicting the logit, that is, the natural log of the odds, of having made one or the other decision.
• ŷ is the predicted probability of the event which is coded with 1 (continue the research).
• 1 - ŷ is the predicted probability of the other decision (stop the research).
Logistic regression example

• Predict that all subjects will decide to stop → correct 59.4% of the time.
• In the intercept-only model: ln(odds) = -0.379
  → the predicted odds of deciding to continue the research = Exp(B) = 0.684

• The Omnibus Test is used to check that the new model is an improvement over the baseline model.
• It uses chi-square tests to see if there is a significant difference between the log-likelihoods of the baseline model and the new model.
• If the new model has a significantly reduced -2LL compared to the baseline, it suggests that the new model is explaining more of the variance in the outcome and is an improvement!

The -2LL statistic measures how poorly the model predicts the decisions -- the smaller the statistic, the better the model. It is used to compare nested (reduced) models.

• Based on this result, what is the regression equation?

   ln(odds) = -0.847 + 1.217 × gender

• Use this model to predict the odds that a subject of a given gender will decide to continue the research:
  - What are the odds that a woman will decide to continue the research?
  - What is the probability that women will decide to continue the research?

• If the subject is a woman (gender = 0), then

   ODDS = e^(-0.847 + 1.217(0)) = e^(-0.847) = 0.429

  A woman is only 0.429 times as likely to decide to continue the research as to decide to stop it.
• Convert odds to probabilities:

   p = odds / (1 + odds) = 0.429 / 1.429 = 0.30

  Our model predicts that 30% of women will decide to continue the research (see the sketch below).
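The same arithmetic as a Python sketch, using the coefficients reported above:

```python
import math

b0, b1 = -0.847, 1.217             # intercept and gender coefficient from the model
gender = 0                         # 0 = woman, 1 = man
odds = math.exp(b0 + b1 * gender)  # 0.429 for a woman
p = odds / (1 + odds)              # ~0.30
print(odds, p)
```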

• What are the odds that a man will decide to continue the research?
• What is the probability that men will decide to continue the research?

• If the subject is a man (gender = 1), then

   ODDS = e^(-0.847 + 1.217(1)) = e^(0.370) = 1.448

  A man is 1.448 times as likely to decide to continue the research as to decide to stop it.
• Convert odds to probabilities:

   p = 1.448 / 2.448 = 0.59

  Our model predicts that 59% of men will decide to continue the research.
• The probability that men will decide to continue the research = 0.59
• The probability that women will decide to continue the research = 0.30
• With the cut value of 0.50:
  - If the probability ≥ 0.50, the subject is classified into "Continue the research"
    → all male subjects (115) are predicted to continue the research
  - If the probability < 0.50, the subject is classified into "Stop the research"
    → all female subjects (200) are predicted to stop the research
• This rule allows us to correctly classify 68/128 = 53.1% of the subjects where the predicted event (deciding to continue the research) was observed. This is the sensitivity of prediction: P(correct | event did occur) = 53.1%.
• This rule allows us to correctly classify 140/187 = 74.9% of the subjects where the predicted event was not observed. This is known as the specificity of prediction: P(correct | event did not occur) = 74.9%.
• Overall our predictions were correct 208 out of 315 times, for an overall success rate of 66% (see the sketch below).
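The classification-table arithmetic as a Python sketch, with the counts taken from the slide:

```python
# Counts from the classification table
true_pos, events = 68, 128       # correct "continue" predictions / all who continued
true_neg, non_events = 140, 187  # correct "stop" predictions / all who stopped

sensitivity = true_pos / events                           # ~0.531
specificity = true_neg / non_events                       # ~0.749
accuracy = (true_pos + true_neg) / (events + non_events)  # 208/315, ~0.66
```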

• Exp(B) is the odds ratio predicted by the model.
• The model predicts that the odds of deciding to continue the research are 3.376 times higher for men than they are for women.
• For the men, the odds are 1.448, and for the women they are 0.429. The odds ratio is

   OR = 1.448 / 0.429 = 3.376
95% CI for the predicted OR (odds ratio)

• Interpretation?
• We are 95% confident that the odds of deciding to continue the research are 2.09-5.45 times higher for men than they are for women in the population (see the sketch below).
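A sketch of how such an interval is obtained on the log-odds scale. The coefficient's standard error is not shown in this extract, so the value below is back-calculated from the reported interval, for illustration only:

```python
import math

b = 1.217   # gender coefficient from the fitted model
se = 0.244  # hypothetical standard error, back-calculated to match the reported CI

or_hat = math.exp(b)               # ~3.38
ci_low = math.exp(b - 1.96 * se)   # ~2.09
ci_high = math.exp(b + 1.96 * se)  # ~5.45
```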
Exercise
• Use the low birth weight data set (lowbwt.sav).

Exercise
• Let's consider the age of the mother as the independent variable to predict the low birth weight of her infant (low).
• Model 1: Perform a simple logistic regression to derive an equation to compute the probability of a low-birth-weight infant from the age of the mother (a Python sketch follows this list).
  - Is the effect of the mother's age on low birth weight significant?
  - What is the predicted odds ratio? Interpret it.
  - What is the 95% CI of the odds ratio? Interpret it.
  - What is the predicted probability of having a low-birth-weight infant for a woman at 35 years old?
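For readers working outside SPSS, a minimal sketch of Model 1 in Python with pandas and statsmodels, assuming lowbwt.sav contains the columns low (0/1) and age (reading .sav files requires the pyreadstat package):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_spss("lowbwt.sav")       # needs pyreadstat installed
X = sm.add_constant(df["age"])        # intercept plus mother's age
model = sm.Logit(df["low"], X).fit()  # maximum likelihood fit

print(model.summary())           # coefficients, standard errors, p-values
print(np.exp(model.params))      # odds ratios
print(np.exp(model.conf_int()))  # 95% CIs for the odds ratios
```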
Result

• Model 1:
• As the age of the mother increases by 1 year, the logit of having a low-birth-weight infant decreases by 0.051. The p-value of β̂ is 0.105 > 0.05 → the effect of age is not significant.
• The estimated OR = 0.95: as the age of the mother increases by 1 year, the odds of having a low-birth-weight infant are multiplied by 0.95.
• 95% CI of OR = 0.893-1.011: in the population, we are 95% sure that the OR of having an LBW infant as the mother's age increases by 1 is between 0.89 and 1.01. The 95% CI of the OR includes 1, which means the effect of the mother's age on LBW is not significant.
Exercise
• Model 2: Let's add smoking as another independent variable (a sketch follows this list).
  - Interpret the result of each predictor: age and smoking (significance, OR).
  - What is the predicted probability of having a low-birth-weight infant for a 35-year-old woman who smokes?
  - Compare model 2 with model 1 using a chi-square test for the difference in -2 log likelihood.
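Extending the Model 1 sketch above, and assuming lowbwt.sav also contains a 0/1 column smoke:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_spss("lowbwt.sav")
# Model 2: mother's age plus smoking status (column names are assumptions)
X2 = sm.add_constant(df[["age", "smoke"]])
model2 = sm.Logit(df["low"], X2).fit()
print(model2.summary())

# Predicted probability for a 35-year-old smoker: row is [const, age, smoke]
print(model2.predict([[1, 35, 1]]))
```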
Result
• Model 2:
• As the age of the mother increases by 1 year, the logit of having an LBW infant decreases by 0.05, holding smoking status constant. The p-value of β̂ is 0.119 > 0.05 → the effect of age is not significant.
• When the mother smokes, the logit of having an LBW infant increases by 0.692, holding the mother's age constant. The p-value of β̂ is 0.032 < 0.05 → the effect of smoking is significant.
• The estimated OR(smoke) = 1.997: when the mother smokes, the odds of having an LBW infant are 1.997 times higher, holding the mother's age constant.
• 95% CI of OR (smoke) = 1.063-3.753: in the population, we are 95% sure that the OR of having an LBW infant when the mother smokes is between 1.063 and 3.753. The 95% CI of the OR does not include 1, which means the effect of the mother's smoking on LBW is significant.
Chi-square test of -2LLs

• Model 1: -2LL = 231.912
• Model 2: -2LL = 227.276
• χ² = 231.912 - 227.276 = 4.636 > 3.84, the critical χ² at df = 1 (the difference in the number of predictors between the two models)
• p < 0.05 → reject the null and conclude that adding the smoking variable has significantly increased the ability to predict infant low birth weight (see the sketch below).
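The same comparison in Python with scipy:

```python
from scipy.stats import chi2

lrs = 231.912 - 227.276       # difference in -2LL between the nested models
p_value = chi2.sf(lrs, df=1)  # ~0.031 < 0.05
print(lrs, p_value)
```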

http://bis.net.vn/forums/t/484.aspx
