Bio2 Module 5 - Logistic Regression
Logistic regression
Getu Degu
November 6, 2008
Logistic regression
► The preceding section dealt with multiple regression with
a continuous dependent variable, extending the methods
of linear regression introduced in Biostat 2 (module 1).
Uses and selection of independent variables
developing cancer than working in an
asbestos mine).
The Model:
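In its standard binary form, writing p for the probability of the outcome and X1, ..., Xk for the independent variables, the model relates the log odds of the outcome to a linear combination of the predictors:

\[
\operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k
\]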
The model can be extended to the case where the dependent
variable has more than two categories; this is known as
multinomial logistic regression.
Backward stepwise regression appears to be the
preferred method for exploratory analyses, where the
analysis begins with a full or saturated model and
variables are eliminated from the model in an iterative
process. The fit of the model is tested after the elimination
of each variable to ensure that the model still adequately fits
the data. When no more variables can be eliminated from
the model, the analysis is complete.
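As a rough illustration, the sketch below (in Python, using the statsmodels library) drops the weakest remaining predictor one step at a time; the data frame, outcome name and candidate list are placeholders, and a fuller analysis would also check model fit (for example with a likelihood ratio test) after each elimination rather than relying on Wald p-values alone.

import statsmodels.api as sm

def backward_eliminate(df, outcome, candidates, alpha=0.05):
    """Crude backward elimination for a binary logistic model."""
    kept = list(candidates)
    while kept:
        X = sm.add_constant(df[kept])               # current model's design matrix
        fit = sm.Logit(df[outcome], X).fit(disp=0)  # refit after each elimination
        pvals = fit.pvalues.drop("const")           # ignore the intercept
        worst = pvals.idxmax()                      # least significant predictor
        if pvals[worst] <= alpha:                   # everything left still contributes
            return fit, kept
        kept.remove(worst)                          # eliminate it and iterate
    return None, kept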
♣ The transformation we use is called the logit
transformation, written as logit(p). Here p is the proportion
of individuals with the characteristic.
♣ The logit can take any value from minus infinity to plus
infinity.
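In symbols, with p the proportion having the characteristic:

\[
\operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right), \qquad
\operatorname{logit}(p) \to -\infty \ \text{as}\ p \to 0, \qquad
\operatorname{logit}(p) \to +\infty \ \text{as}\ p \to 1,
\]

and logit(0.5) = 0, so the transformation is centred on an even split.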
Because the logistic regression
equation predicts the log odds, the
coefficients represent the difference
between two log odds, a log odds
ratio.
Back-transforming the log odds gives the predicted probability:
P = exp(β0 + β1X1 + ... + βkXk) / [1 + exp(β0 + β1X1 + ... + βkXk)]
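For example, with a hypothetical coefficient β1 = 0.7 for an exposure X1, the odds ratio for exposed versus unexposed (holding the other variables fixed) is e^0.7 ≈ 2.0, i.e. the exposure roughly doubles the odds of the outcome.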
Significance tests
The process by which coefficients are tested for significance
for inclusion or elimination from the model involves several
different techniques.
I) Z-test
Deviance
Before proceeding to the likelihood
ratio test, we need to know about the
deviance, which is analogous to the
residual sum of squares from a linear
model.
The deviance of a model is -2 times
the log likelihood associated with
that model.
As a model’s ability to predict
outcomes improves, the deviance
falls. Poorly-fitting models have
higher deviance.
If a model perfectly predicts
outcomes, the deviance will be zero.
This is analogous to the situation in
linear regression, where the residual
sum of squares falls to 0 if the model
predicts the values of the dependent
variable perfectly.
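A small numerical sketch (Python with statsmodels, on made-up data) of how the deviance is obtained from the fitted log likelihood:

import numpy as np
import statsmodels.api as sm

x = np.array([0., 1., 2., 3., 4., 5., 6., 7.])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])            # made-up 0/1 outcomes
fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)

deviance = -2 * fit.llf          # model deviance: falls as prediction improves
null_deviance = -2 * fit.llnull  # deviance of the intercept-only (null) model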
Based on the deviance, it is possible to
construct an analogue of r² for logistic
regression, commonly referred to as the
pseudo r².
r² = 1 - (ln L1 / ln L0), where L1 is the likelihood of the fitted model and L0 the likelihood of the null model.
One might think of it as the proportion of
deviance explained.
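For instance, with a hypothetical null log likelihood ln L0 = -100 and a fitted log likelihood ln L1 = -60, the pseudo r² is 1 - (-60 / -100) = 0.40: the explanatory variables remove about 40% of the null deviance.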
► In its most basic form, the likelihood ratio test
can test the hypothesis that all the coefficients in a
model are equal to 0:
H0: β1 = β2 = . . . = βk = 0
►The test statistic has a chi-square
distribution, with k degrees of freedom.
► If we want to test whether a subset
consisting of q coefficients in a model are all
equal to zero, the test statistic is the same,
except that for L0 we use the likelihood from
the model without the coefficients, and L1 is
the likelihood from the model with them.
►This chi-square has q degrees of freedom.
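Reusing the hypothetical fit from the deviance sketch above, the overall test of H0: β1 = ... = βk = 0 compares the fitted and null log likelihoods; testing a subset of q coefficients works the same way, using the two nested fits instead.

from scipy.stats import chi2

lr_stat = -2 * (fit.llnull - fit.llf)        # -2(ln L0 - ln L1)
p_value = chi2.sf(lr_stat, df=fit.df_model)  # chi-square with k degrees of freedom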
Assumptions
► Logistic regression is popular in part because
it enables the researcher to overcome many of
the restrictive assumptions of OLS regression:
1. Meaningful coding. Logistic coefficients will
be difficult to interpret if the variables are not coded
meaningfully. The convention for binomial logistic regression is
to code the dependent class of greatest interest
as 1 and the other class as 0.
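A tiny hypothetical illustration of that convention in Python with pandas, coding the class of interest as 1 and the reference class as 0:

import pandas as pd

df = pd.DataFrame({"status": ["case", "control", "case", "control"]})
df["y"] = (df["status"] == "case").astype(int)   # case -> 1, control -> 0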
6. No multicollinearity: To the extent that one
independent variable is a linear function of another,
the problem of multicollinearity
will occur in logistic regression, as it does in
OLS regression. As the independent variables increase
in correlation with each other, the standard
errors of the logit (effect) coefficients
become inflated.
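One common way to screen for this is the variance inflation factor (VIF); the sketch below (Python with statsmodels, on a made-up design matrix) computes it for each predictor, with values much larger than about 10 often read as a warning sign.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = pd.DataFrame({"age":    [34, 45, 52, 61, 47, 55],
                  "bmi":    [22.1, 27.5, 30.2, 28.4, 25.0, 31.3],
                  "smoker": [0, 1, 1, 0, 0, 1]})
X = sm.add_constant(X)
vif = {col: variance_inflation_factor(X.values, i)
       for i, col in enumerate(X.columns) if col != "const"}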
Maximum likelihood estimation (MLE) relies on
large-sample asymptotic normality, which means that
the reliability of the estimates declines when there are
few cases for each observed combination of
independent variables.
That is, in small samples one may get
high standard errors. In the extreme, if
there are too few cases in relation to the
number of variables, it may be impossible
to converge on a solution.
Very high parameter estimates (logistic
coefficients) may signal inadequate
sample size.
Summary
♣ A likelihood is a probability, specifically the
probability that the values of the dependent variable
may be predicted from the values of the independent
variables. Like any probability, the likelihood varies
from 0 to 1.
♣ The likelihood ratio statistic measures the improvement in fit that the
explanatory variables make compared to the null
model.