Correlation_Linear_Logistic Regression

The document discusses linear correlation and regression, focusing on the relationship between dependent and independent variables through mathematical models. It covers concepts such as Pearson's correlation coefficient, rank correlation coefficient, and the method of least squares for regression analysis. Additionally, it highlights the importance of understanding the assumptions of linear regression and the interpretation of correlation coefficients.


Linear Correlation and Regression

Introduction
• Data are frequently given in pairs where one
variable is dependent on the other
• E.g.
– Weight and height
– Age and blood pressure
– Birth weight and gestational age
– House rent and income
– Plasma volume and weights
Introduction
• It is usually desirable to express their
relationship by finding an appropriate
mathematical equation/model.
• To form the equation/model, collect the data
on these two variables.
• Let the observations be denoted by (X1 ,Y1),
(X2 ,Y2), (X3 ,Y3) . . . (Xn ,Yn).
• We need at least two quantitative variables
• Goal: Test whether the level of the response variable is associated with (depends on) the level of the explanatory variable
• Goal: Measure the strength of the association between the two variables - Correlation
• Goal: Use the level of the explanatory variable to predict the level of the response variable - Regression
Introduction
• However, before trying to quantify this
relationship, plot the data and get an idea of
their nature.
• Plot these points on the XY plane and obtain
the scatter diagram.
Introduction
Relationship between heights of fathers and their oldest sons
[Scatter diagram: heights of fathers (inches, 62–74) on the horizontal axis against heights of their oldest sons (inches, 62–73) on the vertical axis.]
1. Simple Linear Correlation
(Karl Pearson's coefficient of linear correlation)

• Measures the degree of linear correlation between two variables (e.g., X and Y).
• The correlation coefficient is a pure number, independent of the units in which the variables are expressed.
• It also tells us the direction of the relationship: positive or negative.
Simple Linear Correlation
Karl Pearson's coefficient:

r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \, \sum_{i=1}^{n}(y_i - \bar{y})^2}}
Simple Linear Correlation
• Properties
– −1 ≤ r ≤ 1
– r is a pure number without any unit
– If r is close to +1 → a strong positive relationship
– If r is close to −1 → a strong negative relationship
– If r = 0 → no linear correlation
Simple Linear Correlation
• Example: Heights of 10 fathers (X) together with their oldest
sons (Y) are given below (in inches). Find the degree of
relationship between Y and X.
• Father (X) oldest son (Y)
• 63 65
• 64 67
• 70 69
• 72 70
• 65 64
• 67 68
• 68 71
• 66 63
• 70 70
• 71 72
Simple Linear Correlation
• Calculate the correlation coefficient for the above data!

• r = 0.7776 ≈ 0.78
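As a check, here is a minimal sketch (not part of the original slides; it assumes Python with numpy is available) that applies Pearson's formula to the father/son heights of this example:

```python
# Pearson's r for the father/son height example (sketch, assuming numpy).
import numpy as np

x = np.array([63, 64, 70, 72, 65, 67, 68, 66, 70, 71])  # fathers' heights (inches)
y = np.array([65, 67, 69, 70, 64, 68, 71, 63, 70, 72])  # oldest sons' heights (inches)

# Deviation-score form of the formula above
num = np.sum((x - x.mean()) * (y - y.mean()))
den = np.sqrt(np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))
r = num / den

print(round(r, 4))                         # ~0.7776
print(round(np.corrcoef(x, y)[0, 1], 4))   # same value from numpy's built-in
```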
Rank correlation coefficient

• Karl Pearson's coefficient of correlation cannot be used where direct quantitative measurement of the phenomenon under study is not possible, or where the assumption of normality is not fulfilled.
• In such situations we can use Spearman's rank correlation coefficient.
Rank correlation coefficient (Spearman's)
• Spearman's rank correlation coefficient, denoted by rs, measures the correlation between two paired samples of ranked data.
• This correlation coefficient is applied to the
ranks in two paired samples (not to the
original scores).
• The formula for computing rank correlation by
this method is given in the next slide
Rank correlation coefficient

r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n^3 - n}

where d_i is the difference between the two ranks associated with the ith pair.
Rank correlation coefficient

• E.g.: Six paintings were ranked by two judges. Calculate the rank correlation coefficient.

Painting   First judge (X)   Second judge (Y)   di   di²
A          2                 2                   0    0
B          1                 3                  -2    4
C          4                 4                   0    0
D          5                 6                  -1    1
E          6                 5                   1    1
F          3                 1                   2    4
Rank correlation coefficient

• Σdi² = 10, n = 6.

r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 10}{6(6^2 - 1)} = 1 - 0.29 = 0.71

• How do you interpret the above correlation coefficient?
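A minimal sketch of the same calculation in code (not from the slides; it assumes Python with scipy available), using the six paintings' ranks:

```python
# Spearman's rank correlation for the six paintings (sketch, assuming scipy).
from scipy.stats import spearmanr

judge1 = [2, 1, 4, 5, 6, 3]   # ranks from the first judge (X)
judge2 = [2, 3, 4, 6, 5, 1]   # ranks from the second judge (Y)

# Manual formula: r_s = 1 - 6*sum(d_i^2) / (n^3 - n)
d2 = sum((a - b) ** 2 for a, b in zip(judge1, judge2))   # Σdi² = 10
n = len(judge1)
rs_manual = 1 - 6 * d2 / (n ** 3 - n)
print(round(rs_manual, 2))                # 0.71

rs, p_value = spearmanr(judge1, judge2)   # library cross-check
print(round(rs, 2))
```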
Spurious correlation
• Several warnings pertain to the limitations in the interpretation of a correlation coefficient:
– The correlation coefficient applies only to a linear
relationship between X and Y
– Correlation does not mean causation
• Often one encounters what seem to be nonsense
or spurious correlations between two variables
that logically appear to be totally unrelated to
one another.
Spurious correlation
• What do you think about the correlation
coefficient (r) of 0.9 between the amount of
rainfall in Canada and the maize production in
Ethiopia from 1990 to 2000 ?
• Assume the yearly data of the amount of
rainfall and maize production for the years
1990 to 2000 are available.
2. Simple Linear Regression
Two variables, X and Y, are of interest, where
X = independent variable
Y = dependent variable

y = β0 + β1x + ε

ε = random error component
β0 (beta zero) = y-intercept of the line
β1 (beta one) = slope of the line: the amount of increase or decrease in the deterministic component of y for every 1-unit increase or decrease in x.
Simple Linear Regression
• In practice, the relationship between Y and X is not "perfect". Other sources of variation exist. We decompose Y into 2 components:
– Systematic relationship with X: β0 + β1X
– Random error: ε
• in which the ε are assumed independent and N(0, σ²).
• Random responses can be written as the sum of the systematic component (also thought of as the mean) and the random component: Y = β0 + β1X + ε
• The (conditional on X) mean response is:
E(Y|x) = μY|x = β0 + β1x
[Slide: a small table of (ID, X, Y) values and the corresponding scatter plot of Y against X.]
• When we look at the scatter plot, the linear model looks reasonable, but any function of x might be possible in practice, although we almost always begin with the linear model.
The question now is: given a set of data, what is the best-fitting straight line? Let's look at the scatter plot for n = 30 with a few possible guesses.

Which line, L1, L2 or L3, do you think fits best, and why?


• The obvious choice is L2 since it goes
through the middle of the data.

• What you have done with your “mind’s-


eye” is try to minimize the differences
between the y’s and the line, which we call
it a residual or error, i.e.,

DCH, AAU
Let L2 be the following graph:

[Diagram: the line Y = β0 + β1X with observed points P1–P4 at X1–X4 and the corresponding points Q1–Q4 on the line; the vertical distance from each observed point to the line is its random component.]

Each value of Y thus has a nonrandom component, β0 + β1X, and a random component, ε. The first observation has been decomposed into these two components.
A GENERAL SIMPLE REGRESSION MODEL

[Diagram: the fitted line Ŷ = b0 + b1X with fitted values R1–R4 and actual values P1–P4 at X1–X4; the vertical distances e = Y − Ŷ (e1–e4) are the residuals.]

The discrepancies between the actual and fitted values of Y are known as the residuals.
Simple Linear Regression
• Y on X means Y is the dependent variable and
X is the independent one.
• The purpose of a regression equation is to use
one variable to predict another.
The Method of Least Squares

• The difference between the given score Y and the predicted score Ŷ is known as the error of estimation.
• The regression line, or the line which best fits the given pairs of scores, is the line for which the sum of the squares of these errors of estimation (Σei²) is minimized.
The Method of Least Squares

• That is, of all the curves, the curve with the minimum Σei² is the least squares regression line, which best fits the given data.
• The least squares regression line for the set of observations (X1,Y1), (X2,Y2), (X3,Y3), ..., (Xn,Yn) has the equation Ŷ = b0 + b1xi.
The Method of Least Squares
• The values ‘b0’ and ‘b1’ in the equation are
constants, i.e., their values are fixed. The
constant ‘b0’ indicates the value of y when x=0.
• It is also called the y intercept.
• The value of ‘b1’ shows the slope of the
regression line and gives us a measure of the
change in y for a unit change in x.
• This slope (b1) is frequently termed as the
regression coefficient of Y on X. If we know the
values of ‘b0’ and ‘b1’, we can easily compute the
value of Y for any given value of X.
The Method of Least Squares

Slope: \hat{b}_1 = \frac{SS_{xy}}{SS_{xx}}

y-intercept: \hat{b}_0 = \bar{y} - \hat{b}_1 \bar{x}

where

SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}

SS_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n}

n = sample size
The Method of Least Squares

• If the correlation coefficient, r, has already been calculated, it can be multiplied by the ratio of the standard deviation of Y to the standard deviation of X to obtain b1. Thus

b1 = r · (Sy / Sx)

• Similarly, if the regression coefficient is known, r can be found by

r = b1 · (Sx / Sy)
Example
• Heights of 10 fathers (X) together with their
oldest sons (Y) are given below (in inches).
Find the regression of Y on X.
Father (X) oldest son (Y) product (XY) X²
63 65 4095 3969
64 67 4288 4096
70 69 4830 4900
72 70 5040 5184
65 64 4160 4225
67 68 4556 4489
68 71 4828 4624
66 63 4158 4356
70 70 4900 4900
71 72 5112 5041

Total 676 679 45967 45784


Example
• b1 = [10(45967) − (676)(679)] / [10(45784) − (676)²] = 0.77

• b0 = 679/10 − 0.77(676/10) = 67.9 − 52.05 = 15.85

• Therefore, Ŷ = 15.85 + 0.77X


Example
• The regression coefficient of Y on X (i.e., 0.77)
tells us the change in Y due to a unit change in
X.

• Estimate the height of the oldest son for a
father’s height of 70 inches.

• Ŷ = 15.85 + 0.77 (70) = 69.75 inches.
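A minimal sketch of the fit (not from the slides; it assumes Python with numpy) that reproduces the slope, the intercept and the prediction at X = 70:

```python
# Least-squares line of sons' heights (Y) on fathers' heights (X) -- sketch, assuming numpy.
import numpy as np

x = np.array([63, 64, 70, 72, 65, 67, 68, 66, 70, 71])
y = np.array([65, 67, 69, 70, 64, 68, 71, 63, 70, 72])
n = len(x)

ss_xy = np.sum(x * y) - x.sum() * y.sum() / n    # 45967 - (676)(679)/10
ss_xx = np.sum(x ** 2) - x.sum() ** 2 / n        # 45784 - (676)^2/10
b1 = ss_xy / ss_xx                               # ~0.77
b0 = y.mean() - b1 * x.mean()                    # ~15.8 (the slides round to 15.85)

print(round(b1, 2), round(b0, 2))
print(round(b0 + b1 * 70, 2))                    # predicted son's height at X = 70, ~69.75 in

# Cross-check of the earlier relation b1 = r * (Sy / Sx)
r = np.corrcoef(x, y)[0, 1]
print(round(r * y.std(ddof=1) / x.std(ddof=1), 2))
```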
Explained, unexplained (error), total
variations
• If all the points on the scatter diagram fall on
the regression line we could say that the
entire variance of Y is due to variations in X.

• Explained variation = Σ(Ŷ − Ȳ)²

• The measure of the scatter of points away
from the regression line gives an idea of the
variance in Y that is not explained with the
help of the regression equation.
Explained, unexplained (error), total
variations
• Unexplained variation = Σ(Y − Ŷ)²

• The variation of the Y's about their mean can also be computed. The quantity Σ(Y − Ȳ)² is called the total variation.

• Explained variation + unexplained variation = Total variation
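The decomposition can be verified numerically; a minimal sketch (assuming Python with numpy and the fit from the previous example, not part of the original slides):

```python
# Explained + unexplained = total variation for the father/son example (sketch, assuming numpy).
import numpy as np

x = np.array([63, 64, 70, 72, 65, 67, 68, 66, 70, 71])
y = np.array([65, 67, 69, 70, 64, 68, 71, 63, 70, 72])

n = len(x)
b1 = (np.sum(x * y) - x.sum() * y.sum() / n) / (np.sum(x ** 2) - x.sum() ** 2 / n)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

explained = np.sum((y_hat - y.mean()) ** 2)    # Σ(Ŷ - Ȳ)²
unexplained = np.sum((y - y_hat) ** 2)         # Σ(Y - Ŷ)²
total = np.sum((y - y.mean()) ** 2)            # Σ(Y - Ȳ)²

print(np.isclose(explained + unexplained, total))   # True
print(round(explained / total, 2))                  # r² ≈ 0.60 = 0.7776²
```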
ANOVA Table for linear regression

[ANOVA table: the regression (model) source has k degrees of freedom, the residual (error) source has n − k − 1 degrees of freedom, and the F test gives a p-value, P.]

H0: the linear model does not explain the variation in the data.

If P is small (e.g., < 0.05): reject H0 and accept Ha, i.e., the model does explain the variation observed in the data.
Explained, unexplained (error), total
variations
• The ratio of the explained variation to the
total variation measures how well the linear
regression line fits the given pairs of scores.
• It is called the coefficient of determination, and is denoted by r².

r² = explained variation / total variation
Explained, unexplained (error), total
variations
• The explained variation is never negative and is
never larger than the total variation. Therefore, r²
is always between 0 and 1. If the explained
variation equals 0, r² = 0.

• If r² is known, then r = ±√r². The sign of r is the same as the sign of b1 from the regression equation.

• Since r² is between 0 and 1, r is between −1 and +1.
Assumptions of linear regression
Assumption 1:
• For each value of the X variable, the Y variable is assumed to have a normal distribution.
Assumption 2:
• Linearity: the mean values of Y corresponding to various values of X fall on a straight line.
Assumption 3:
• The values of Y are assumed to be independent of each other.
Assumption 4:
• The observations are a random sample from the population of interest.
Assumption 5:
• The mean of the probability distribution of ε is 0. That is, the average of the values of ε over an infinitely long series of experiments is 0 for each setting of the independent variable x. This assumption implies that the mean value of y for a given value of x is E(y) = β0 + β1x.
Assumption 6:
• The variance of the probability distribution of ε is constant for all settings of the independent variable x. For our straight-line model, this assumption means that the variance of ε is equal to a constant, say σ², for all values of x.

Assumption 7:
• The probability distribution of ε is normal.

Assumption 8:
• The values of ε associated with any two observed values of y are independent. That is, the value of ε associated with one value of y has no effect on the values of ε associated with other y values.
Computer output for the above example
Potential misuse of model
Multivariate Analysis
• Multivariate analysis refers to the analysis of data
that takes into account a number of explanatory
variables and one outcome variable
simultaneously.
• It allows for the efficient estimation of measures
of association while controlling for a number of
confounding factors.
• All types of multivariate analyses involve the
construction of a mathematical model to describe
the association between independent and
dependent variables
Multivariate Analysis
• A large number of multivariate models have been
developed for specialized purposes, each with a
particular set of assumptions underlying its
applicability.
• The choice of the appropriate model is based on
the underlying design of the study, the nature of
the variables, as well as assumptions regarding
the inter‐relationship between the exposures and
outcomes under investigation.

Multiple Linear Regression

• Multiple linear regression (we often refer to this method as multiple regression) is an extension of the most fundamental model describing the linear relationship between two variables.
• Multiple regression is a statistical technique used to measure and describe the function relating two or more predictor (independent) variables to a single response (dependent) variable.
Multiple Linear Regression
• The general linear regression model:

y = β0 + β1x1 + β2x2 + ... + βkxk + ε

where
• y is the dependent variable
• x1, x2, ..., xk are the independent variables
• E(y) = β0 + β1x1 + β2x2 + ... + βkxk is the deterministic portion of the model
• ε is a random error component
• βi determines the contribution of the independent variable xi.
Analyzing a Multiple Regression Model

Step 1:
Hypothesize the deterministic component of the model. This component relates the mean, E(y), to the independent variables x1, x2, ..., xk. This involves the choice of the independent variables to be included in the model.

Step 2:
Use the sample data to estimate the unknown model parameters β0, β1, β2, ..., βk in the model.

Step 3:
Specify the probability distribution of the random error term, ε, and estimate the standard deviation of the distribution, σ.

Step 4:
Statistically evaluate the usefulness of the model.

Step 5:
When satisfied that the model is useful, use it for prediction, estimation and other purposes.
Fitting The Model: The Least Squares Approach
In this case we have p + 1 linear equations that are easily solved, but not in closed form. If we write the model in matrix form, we can express the solution in closed form.
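A minimal sketch of that closed-form solution, β̂ = (XᵀX)⁻¹Xᵀy (assuming Python with numpy; the two-predictor data below are made up for illustration and are not taken from the slides):

```python
# Closed-form least squares in matrix form (sketch, assuming numpy; hypothetical data).
import numpy as np

x1 = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
x2 = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
y = np.array([5.1, 9.8, 12.2, 18.0, 19.9])

X = np.column_stack([np.ones_like(x1), x1, x2])  # design matrix with an intercept column

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # solves (X'X) beta = X'y
print(beta_hat)                                  # [b0, b1, b2]

# np.linalg.lstsq gives the same estimates and is the numerically safer routine
print(np.linalg.lstsq(X, y, rcond=None)[0])
```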
Assumptions for the Random Error ε

1. For any given set of values of x1, x2, x3, ..., xk, the random error ε has a normal probability distribution with mean equal to 0 and variance equal to σ².
2. The random errors are independent (in the probabilistic sense).

Since the variance, σ², of the random error, ε, will rarely be known, we need to use the results of the regression analysis to estimate its value.
Choice of the Number of Variables
• Multiple regression is a seductive technique:
"plug in" as many predictor variables as you can
think of and usually at least a few of them will
come out significant.
• This is because one is capitalizing on chance
when simply including as many variables as one
can think of as predictors of some other variable
of interest. This problem is compounded when, in
addition, the number of observations is relatively
low.
Choice of the Number of Variables
• Intuitively, it is clear that one can hardly draw
conclusions from an analysis of 100
questionnaire items based on 10 respondents.
• Most authors recommend that one should
have at least 10 to 20 times as many
observations (cases, respondents) as one has
variables, otherwise the estimates of the
regression line will probably be unstable.
Choice of the Number of Variables
• Sometimes we know in advance which variables
we wish to include in a multiple regression
model. Here it is straightforward to fit a
regression model containing all of those
variables. Variables that are not significant can be
omitted and the analysis redone.
• There is no hard rule about this, however.
Sometimes it is desirable to keep a variable in a
model because past experience shows that it is
important.
Choice of the Number of Variables
• In large samples the omission of non‐
significant variables will have little effect on
the other regression coefficients.
• Usually it makes sense to omit variables that
do not contribute much to the model ( P >
.05).
Choice of the Number of Variables
• The statistical significance of each variable in
the multiple regression model is obtained
simply by calculating the ratio of the
regression coefficient to its standard error
and relating this value to the t distribution
with n‐k‐1 degrees of freedom, where n is the
sample size and k is the number of variables
in the model.
Stepwise regression
• Stepwise regression is a technique for
choosing predictor variables from a large set.
The stepwise approach can be used with
multiple linear, logistic and Cox regressions.
There are two basic strategies of applying this
technique known as forward and backward
stepwise regression.
Forward stepwise regression
• The first step in many analyses of multivariate
data is to examine the simple relation
between each potential explanatory variable
and the outcome variable of interest ignoring
all the other variables. Forward stepwise
regression analysis uses this analysis as its
starting point. Steps in applying this method
are:
Forward stepwise regression
• Find the single variable that has the strongest association
with the dependent variable and enter it into the model
(i.e., the variable with the smallest p‐value)

• Find the variable among those not in the model that, when added to the model so far obtained, explains the largest amount of the remaining variability.

• Repeat the previous step until the addition of an extra variable is not statistically significant at some chosen level, such as P = 0.05.

• N.B. You have to stop the process at some point, otherwise you will end up with all the variables in the model.
Backward stepwise regression

• As its name indicates, with the backward


stepwise method we approach the problem
from the other direction. The argument given
is that we have collected data on these
variables because we believe them to be
potentially important explanatory variables.
Backward stepwise regression
• Therefore, we should fit the full model,
including all of these variables, and then
remove unimportant variables one at a time
until all those remaining in the model
contribute significantly. We use the same
criterion, say P<.05, to determine significance.
At each step we remove the variable with the
smallest contribution to the model (or the
largest P‐value) as long as that P‐value is
greater than the chosen level.
Multicollinearity

• This is a common problem in many correlation


analyses. Imagine that you have two
predictors (X variables) of a person's height:
(1) weight in pounds and (2) weight in ounces.
• Obviously, our two predictors are completely
redundant; weight is one and the same
variable, regardless of whether it is measured
in pounds or ounces.
Multicollinearity
• Trying to decide which one of the two measures is a
better predictor of height would be rather silly;
however, this is exactly what one would try to do if one
were to perform a multiple regression analysis with
height as the dependent (Y) variable and the two
measures of weight as the independent (X) variables.
• When there are very many variables involved, it is
often not immediately apparent that this problem
exists, and it may only manifest itself after several
variables have already been entered into the
regression equation.
Multicollinearity

• Nevertheless, when this problem occurs it


means that at least one of the predictor
variables is (practically) completely redundant
with other predictors. There are many
statistical indicators of this type of redundancy
The Importance of Residual Analysis

• Even though most assumptions of multiple


regression cannot be tested explicitly, gross
violations can be detected and should be dealt
with appropriately.
• In particular, outliers (i.e., extreme cases) can
seriously bias the results by "pulling" or
"pushing" the regression line in a particular
direction (see the example below), thereby
leading to biased regression coefficients.
• Often, excluding just a single extreme case can
yield a completely different set of results.
Logistic Regression
Logistic regression
• In many studies the outcome variable of interest is the
presence or absence of some condition, whether or
not the subject has a particular characteristic such as a
symptom of a certain disease.
• We cannot use ordinary multiple linear regression for
such data, but instead we can use a similar approach
known as multiple linear logistic regression or just
logistic regression.
• Logistic regression is part of a category of statistical models
called generalized linear models. This broad class of models
includes ordinary regression, ANOVA, ANCOVA and loglinear
regression
Logistic regression
• In general, there are two main uses of logistic
regression.
• The first is the prediction (estimation) of the
probability that an individual will have (develop)
the characteristic. For example, logistic regression
is often used in epidemiological studies where
the result of the analysis is the probability of
developing cancer after controlling for other
associated risks.
Logistic regression
• Logistic regression also provides knowledge of
the relationships and strengths between an
outcome (dependent) variable with two values
and explanatory (independent) variables that can
be categorical or continuous (e.g., smoking 10
packs a day puts you at a higher risk for
developing cancer than working in an asbestos
mine).
• Logistic regression can be applied to case‐control,
follow‐up and cross‐sectional data.
Logistic regression
• The basic principle of logistic regression is much the
same as for ordinary multiple regression.
• The main difference is that instead of developing a
model that uses a combination of the values of a group
of explanatory variables to predict the value of a
dependent variable, we predict a transformation of the
dependent variable.
• The dependent variable in logistic regression is usually dichotomous, that is, the dependent variable can take the value 1 with a probability of success π, or the value 0 with a probability of failure 1 − π.
Logistic regression
• This type of variable is called a binomial (or
binary) variable.
• Applications of logistic regression have also been extended to cases where the dependent variable has more than two categories, known as multinomial logistic regression.
• When multiple classes of the dependent variable
can be ranked, then ordinal logistic regression is
preferred to multinomial logistic regression.
Logistic regression

• As mentioned previously, one of the goals of


logistic regression is to correctly predict the
category of outcome for individual cases using
the most parsimonious (condensed) model.
• To accomplish this goal, a model is created
that includes all predictor variables that are
useful in predicting the response variable.
Logistic regression
• Several different options are available during
model creation. Variables can be entered into
the model in the order specified by the
researcher or logistic regression can test the
fit of the model after each coefficient is added
or deleted, called stepwise regression.
Logistic regression

• Logistic regression is a powerful statistical tool for


estimating the magnitude of the association
between an exposure and a binary outcome after
adjusting simultaneously for a number of
potential confounding factors.
• If we have a binary variable and give the
categories numerical values of 0 and 1, usually
representing ‘No’ and ‘Yes’ respectively, then the
mean of these values in a sample of individuals is
the same as the proportion of individuals with
the characteristic.
Logistic regression

• We could expect, therefore, that the


appropriate regression model would estimate
the probability (proportion) that an individual
will have the characteristic.
• We cannot use an ordinary linear regression,
because this might predict proportions less
than zero or greater than one, which would be
meaningless.
Logistic regression

• In practice, a statistically preferable method is


to use a transformation of this proportion.
• The transformation we use is called the logit
transformation, written as logit (p). Here p is
the proportion of individuals with the
characteristic.
Logistic regression
• For example, if p is the probability of a subject having a myocardial infarction, then 1 − p is the probability that they do not have one. The ratio p / (1 − p) is called the odds, and thus

logit(p) = ln[ p / (1 − p) ]

is the log odds.

• The logit can take any value from minus infinity to plus infinity.
Logistic regression
• We can fit regression models to the logit which are very similar to the ordinary multiple regression models found for data from a normal distribution.
• We assume that relationships are linear on the logistic scale:

ln[ p / (1 − p) ] = a + b1X1 + b2X2 + ... + bnXn

where X1, ..., Xn are the predictor variables and p is the proportion to be predicted. The calculation is computer intensive.
Logistic regression

• Because the logistic regression equation


predicts the log odds, the coefficients
represent the difference between two log
odds, a log odds ratio.
• The antilog of the coefficients is thus an odds
ratio. Most programs print these odds ratios.
• These are often called adjusted odds ratios.
Logistic regression

• The above equation can be rewritten to represent the probability of disease as:

P = 1 / (1 + e^−(a + b1X1 + b2X2 + ... + bnXn))
  = e^(a + b1X1 + b2X2 + ... + bnXn) / (1 + e^(a + b1X1 + b2X2 + ... + bnXn))
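A minimal sketch of the logit and its inverse (assuming Python with numpy; not part of the original slides), showing how any linear predictor maps back to a probability strictly between 0 and 1:

```python
# Logit transformation and its inverse (sketch, assuming numpy).
import numpy as np

def logit(p):
    """Log odds: ln(p / (1 - p))."""
    return np.log(p / (1 - p))

def inv_logit(z):
    """P = 1 / (1 + e^(-z)) = e^z / (1 + e^z)."""
    return 1.0 / (1.0 + np.exp(-z))

p = 0.2
z = logit(p)                      # about -1.386
print(round(inv_logit(z), 3))     # back to 0.2

# A linear predictor a + b1*X1 + ... of any size still yields a valid probability:
print(np.round(inv_logit(np.array([-5.0, 0.0, 5.0])), 3))   # [0.007 0.5   0.993]
```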
Logistic regression
Table Age and signs of coronary heart disease (CD)

Age CD Age CD Age CD


22 0 40 0 54 0
23 0 41 1 55 1
24 0 46 0 58 1
27 0 47 0 60 1
28 0 48 0 60 0
30 0 49 1 62 1
30 0 49 0 65 1
32 0 50 1 67 1
33 0 51 0 71 1
35 1 51 1 77 1
38 0 52 0 81 1
How can we analyse these data?

• Compare mean age of diseased and non-diseased
– Non-diseased: 38.6 years
– Diseased: 58.7 years (p < 0.0001)

• Linear regression?
Dot-plot: Data from the Table
[Dot plot of signs of coronary disease (Yes/No) against age in years (0–100).]
Logistic regression (2)
Table 2: Prevalence (%) of signs of CD according to age group

Age group   # in group   # diseased   % diseased

20 - 29 5 0 0

30 - 39 6 1 17

40 - 49 7 2 29

50 - 59 7 4 57

60 - 69 5 4 80

70 - 79 2 2 100

80 - 89 1 1 100
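Table 2 can be reproduced directly from the raw ages and CD indicators in the earlier table; a minimal sketch (assuming plain Python, not part of the original slides):

```python
# Prevalence of CD by 10-year age group, reproducing Table 2 (sketch).
ages = [22, 23, 24, 27, 28, 30, 30, 32, 33, 35, 38,
        40, 41, 46, 47, 48, 49, 49, 50, 51, 51, 52,
        54, 55, 58, 60, 60, 62, 65, 67, 71, 77, 81]
cd = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
      0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0,
      0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]

for lo in range(20, 90, 10):
    group = [d for a, d in zip(ages, cd) if lo <= a < lo + 10]
    if group:
        pct = 100 * sum(group) / len(group)
        print(f"{lo}-{lo + 9}: n={len(group):2d}  diseased={sum(group)}  {pct:3.0f}%")
```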
Dot-plot: Data from Table 2
[Plot of the percentage diseased (0–100%) against age group, rising with age.]
Logistic function (1)

P(y|x) = e^(α + βx) / (1 + e^(α + βx))

[S-shaped curve: probability of disease (0.0 to 1.0) plotted against x.]
Logistic transformation

P(y|x) = e^(α + βx) / (1 + e^(α + βx))

ln[ P(y|x) / (1 − P(y|x)) ] = α + βx

The left-hand side is the logit of P(y|x).
Advantages of the Logit

• Properties of a linear regression model
• Logit ranges between −∞ and +∞
• Probability (P) constrained between 0 and 1

• Directly related to the notion of the odds of disease:

ln[P / (1 − P)] = α + βx   ⇔   P / (1 − P) = e^(α + βx)
Interpretation of the coefficient β

                    Exposure x
Disease y           yes                 no
yes                 P(y|x = 1)          P(y|x = 0)
no                  1 − P(y|x = 1)      1 − P(y|x = 0)

Since P / (1 − P) = e^(α + βx):
Odds of disease among the exposed:   e^(α + β)
Odds of disease among the unexposed: e^α
OR = e^(α + β) / e^α = e^β,  so ln(OR) = β
Interpretation of the coefficient β
• β = increase in the logarithm of the odds ratio for a one-unit increase in x
• Test of the hypothesis that β = 0 (Wald test):

χ² = β² / Var(β)   (1 df)

• Interval estimation: 95% CI = e^(β ± 1.96·SE(β))
Example
• Risk of developing coronary heart disease (CD) by age (<55 and 55+ years)

                 CD
            Present (1)   Absent (0)
55+ (1)         21            6
<55 (0)         22           51

Odds of disease among the exposed = 21/6
Odds of disease among the unexposed = 22/51
Odds ratio = (21/6) / (22/51) = 8.1
• Logistic Regression Model

ln[P / (1 − P)] = α + β1·Age = −0.841 + 2.094·Age

             Coefficient      SE      Coeff/SE
Age             2.094       0.529       3.96
Constant       −0.841       0.255      −3.30

OR = e^2.094 = 8.1
Wald test = 3.96² with 1 df (p < 0.05)
95% CI = e^(2.094 ± 1.96 × 0.529) = 2.9 to 22.9
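The same numbers can be obtained directly from the 2×2 table; a minimal sketch (assuming Python with numpy, not part of the original slides):

```python
# Odds ratio, Wald statistic and 95% CI for the 55+/<55 by CD table (sketch, assuming numpy).
import numpy as np

a, b = 21, 6    # age 55+: CD present, CD absent
c, d = 22, 51   # age <55: CD present, CD absent

or_hat = (a / b) / (c / d)              # about 8.1
beta = np.log(or_hat)                   # logistic coefficient for Age, about 2.094
se = np.sqrt(1/a + 1/b + 1/c + 1/d)     # about 0.529 (standard error of the log OR)

wald_z = beta / se                      # about 3.96
ci = np.exp([beta - 1.96 * se, beta + 1.96 * se])

print(round(or_hat, 1), round(beta, 3), round(se, 3), round(wald_z, 2))
print(np.round(ci, 1))                  # approximately [ 2.9 22.9]
```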
Logistic regression
• Significance tests
• The process by which coefficients are tested for significance for inclusion in or elimination from the model involves several different techniques.
I) Z-test
• The significance of each variable can be assessed by treating Z = b / se(b) as approximately standard normal.
• The corresponding P-values are easily computed (found from the table of the Z-distribution).
Logistic regression
• II) Likelihood‐Ratio Test:
• The likelihood‐ratio test uses the ratio of the
maximized value of the likelihood function for
the full model (L1) over the maximized value
of the likelihood function for the simpler
model (L0).
Logistic regression
• Deviance
– Before proceeding to the likelihood ratio test, we
need to know about the deviance which is
analogous to the residual sum of squares from a
linear model.
– The deviance of a model is ‐2 times the log
likelihood associated with each model.
– As a model’s ability to predict outcomes improves,
the deviance falls. Poorly‐fitting models have
higher deviance.
Logistic regression
• Deviance
– If a model perfectly predicts outcomes, the
deviance will be zero. This is analogous to the
situation in linear regression, where the residual
sum of squares falls to 0 if the model predicts the
values of the dependent variable perfectly.
– Based on the deviance, it is possible to construct
an analogous to r² for logistic regression,
commonly referred to as the Pseudo r².
Logistic regression
• Deviance
– If G1² is the deviance of a model with variables, and G0² is the deviance of the null model, the pseudo r² of the model is:

r² = 1 − G1²/G0² = 1 − (ln L1 / ln L0)

– One might think of it as the proportion of deviance explained.
– The likelihood ratio test, which makes use of the deviance, is analogous to the F-test from linear regression.
Logistic regression
• Deviance
– In its most basic form, it can test the hypothesis that
all the coefficients in a model are all equal to 0:
H0: ß1 = ß2 = . . . = ßk = 0
– The test statistic has a chi‐square distribution, with k
degrees of freedom.
– If we want to test whether a subset consisting of q
coefficients in a model are all equal to zero, the test
statistic is the same, except that for L0 we use the
likelihood from the model without the coefficients,
and L1 is the likelihood from the model with them.
– This chi‐square has q degrees of freedom.
Logistic regression
• Assumptions
• Logistic regression is popular in part because it enables
the researcher to overcome many of the restrictive
assumptions of OLS regression:
1. Logistic regression does not assume a linear relationship
between the dependents and the independents. It may
handle nonlinear effects even when exponential and
polynomial terms are not explicitly added as additional
independents because the logit link function on the left‐
hand side of the logistic regression equation is non‐
linear. However, it is also possible and permitted to add
explicit interaction and power terms as variables on the
right‐hand side of the logistic equation, as in OLS
regression.
Logistic regression
• Assumptions
2. The dependent variable need not be normally
distributed.
3. The dependent variable need not be
homoscedastic for each level of the
independents; that is, there is no homogeneity
of variance assumption.
Logistic regression
• However, other assumptions still apply:
– Meaningful coding. Logistic coefficients will be
difficult to interpret if not coded meaningfully. The
convention for binomial logistic regression is to
code the dependent class of greatest interest as 1
and the other class as 0.
– Inclusion of all relevant variables in the
regression model
– Exclusion of all irrelevant variables
Logistic regression
• However, other assumptions still apply:
– Error terms are assumed to be independent
(independent sampling). Violations of this
assumption can have serious effects. Violations are
apt to occur, for instance, in correlated samples and
repeated measures designs, such as before‐after or
matched‐pairs studies, cluster sampling, or time‐
series data. That is, subjects cannot provide multiple
observations at different time points. In some cases,
special methods are available to adapt logistic models
to handle non‐independent data.
Logistic regression
• However, other assumptions still apply:
– Linearity. Logistic regression does not require linear
relationships between the independents and the
dependent, as does OLS regression, but it does
assume a linear relationship between the logit of the
independents and the dependent.
– No multicollinearity: To the extent that one
independent is a linear function of another
independent, the problem of multicollinearity will
occur in logistic regression, as it does in OLS
regression. As the independents increase in
correlation with each other, the standard errors of the
logit (effect) coefficients will become inflated.
Logistic regression
• However, other assumptions still apply:
– No outliers. As in OLS regression, outliers can affect
results significantly. The researcher should analyze
standardized residuals for outliers and consider
removing them or modeling them separately.
Standardized residuals >2.58 are outliers at the .01
level, which is the customary level (standardized
residuals > 1.96 are outliers at the less‐used .05 level).
– Large samples. Unlike OLS regression, logistic
regression uses maximum likelihood estimation (MLE)
rather than ordinary least squares (OLS) to derive
parameters.
Logistic regression
• MLE relies on large-sample asymptotic normality, which means that the reliability of estimates declines when there are few cases for each observed combination of independent variables.
• That is, in small samples one may get high
standard errors. In the extreme, if there are too
few cases in relation to the number of variables,
it may be impossible to converge on a solution.
• Very high parameter estimates (logistic
coefficients) may signal inadequate sample size.
Multiple Logistic Regression
• More than one independent variable
– Dichotomous, ordinal, nominal, continuous, ...

ln[P / (1 − P)] = α + β1x1 + β2x2 + ... + βixi

• Interpretation of βi
– Increase in log-odds for a one-unit increase in xi with all the other xi's held constant
– Measures the association between xi and the log-odds, adjusted for all other xi
Effect modification
• Effect modification
– Can be modelled by including interaction terms:

ln[P / (1 − P)] = α + β1x1 + β2x2 + β3·x1x2
Statistical testing
• Question
– Does model including given independent variable
provide more information about dependent
variable than model without this variable?
• Three tests
– Likelihood ratio statistic (LRS)
– Wald test
– Score test
Likelihood ratio statistic

• Compares two nested models:

Log(odds) = α + β1x1 + β2x2 + β3x3 + β4x4   (model 1)
Log(odds) = α + β1x1 + β2x2                 (model 2)

• LR statistic:
−2 log(likelihood of model 2 / likelihood of model 1)
= −2 log(likelihood of model 2) minus −2 log(likelihood of model 1)

The LR statistic is a χ² with DF = the number of extra parameters in the model.
Example
P = probability of cardiac arrest
Exc: 1 = lack of exercise, 0 = exercise
Smk: 1 = smoker, 0 = non-smoker

ln[P / (1 − P)] = α + β1·Exc + β2·Smk
              = 0.7102 + 1.0047·Exc + 0.7005·Smk
                         (SE 0.2614)    (SE 0.2664)

OR for lack of exercise = e^1.0047 = 2.73 (adjusted for smoking)

95% CI = e^(1.0047 ± 1.96 × 0.2614) = 1.64 to 4.56
• Interaction between smoking and exercise?

ln[P / (1 − P)] = α + β1·Exc + β2·Smk + β3·Smk·Exc

• Product term β3 = −0.4604 (SE 0.5332)
Wald test = 0.75 (1 df)

−2 log(L) = 342.092 with the interaction term
          = 342.836 without the interaction term

⇒ LR statistic = 0.74 (1 df), p = 0.39
⇒ No evidence of any interaction
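A minimal sketch of that likelihood ratio test (assuming Python with scipy, not part of the original slides), using the two deviances reported above:

```python
# Likelihood ratio test for the interaction term from the reported deviances (sketch, assuming scipy).
from scipy.stats import chi2

dev_without = 342.836   # -2 log L for the model without the interaction term
dev_with = 342.092      # -2 log L for the model with the interaction term

lr_stat = dev_without - dev_with     # about 0.74
df = 1                               # one extra parameter (the product term)
p_value = chi2.sf(lr_stat, df)       # about 0.39

print(round(lr_stat, 2), round(p_value, 2))
```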
Logistic regression
Summary
• A likelihood is a probability, specifically the probability
that the values of the dependent variable may be
predicted from the values of the independent
variables. Like any probability, the likelihood varies
from 0 to 1.
• The log likelihood ratio test (sometimes called the model chi-square test) of a model tests the difference between −2LL for the full model and −2LL for the initial (null) model. That is, the model chi-square is computed as −2LL for the null (initial) model minus −2LL for the researcher's model.
Logistic regression
Summary
• The initial chi‐square is ‐2LL for the model which
accepts the null hypothesis that all the b
coefficients are zero.
• The log likelihood ratio test tests the null
hypothesis that all population logistic regression
coefficients except the constant are zero. It is an
overall model test which does not assure that
every independent is significant .
• It measures the improvement in fit that the
explanatory variables make compared to the null
model.
Logistic regression
Summary
• The method of analysis uses an iterative
procedure whereby the answer is obtained by
several repeated cycles of calculation using
the maximum likelihood approach.
• Because of this extra complexity, logistic
regression is only found in large statistical
packages or those primarily intended for the
analysis of epidemiological data.

You might also like