Week 9
MLR_Logistic.Docx
Definition
Dependent variable: the variable we wish to explain or predict.
Independent/exogenous/explanatory variable: the variable
we use to explain or predict the dependent variable.
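$$\hat{y} = a + bX, \quad \text{where } a = \bar{y} - b\bar{x} \text{ and } b = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}.$$
[The estimates a and b are obtained by minimizing the sum of squared errors (SSE),
$$\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}\bigl(y_i - (a + b x_i)\bigr)^2,$$
with respect to a and b, where $\hat{y}$ is an estimate of $E(y \mid X = x) = \alpha + \beta x$, the conditional
mean of y given X = x (fixed).]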
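The closed-form least-squares formulas for a and b can be checked numerically. The following is a minimal sketch, assuming a small made-up data set; the function name `fit_simple_linear_regression` and the data values are hypothetical and are not taken from the slides.

```python
def fit_simple_linear_regression(x, y):
    """Return (a, b) for y-hat = a + b*x using the closed-form least-squares formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope b
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical data, chosen only for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = fit_simple_linear_regression(x, y)
print(f"y-hat = {a:.2f} + {b:.2f} x")  # y-hat = 2.20 + 0.60 x
```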
The accuracy of the estimated model can be evaluated by the coefficient
of determination ($R^2$), which varies from 0 to 1, or by its adjusted
version (adjusted $R^2$).
(When should we use a multiple linear regression model?)
If the coefficient of determination is unsatisfactory (low), we
incorporate/add meaningful and relevant independent variables
into the model to create the Multiple Linear Regression Model
(MLRM).
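When deciding whether an added independent variable is worthwhile, the adjusted coefficient of determination is often compared across models, because it penalizes extra variables. The sketch below uses the standard formula, adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1); the R^2 values, n, and k are hypothetical numbers chosen only for illustration.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for a model with k independent variables fitted to n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Hypothetical comparison: adding a second variable raises R^2 only slightly
adj_one_var = adjusted_r_squared(0.60, n=30, k=1)
adj_two_var = adjusted_r_squared(0.61, n=30, k=2)
print(f"k=1: {adj_one_var:.4f}, k=2: {adj_two_var:.4f}")
```

In this made-up case the adjusted value falls when the second variable is added, which suggests the extra variable is not meaningful enough to keep.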
$$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
Total SS (TSS) = Regression SS (RSS) + Error SS (ESS)
The above expression is useful for constructing the ANOVA table and
for testing hypotheses.
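The decomposition TSS = RSS + ESS can be verified numerically for a least-squares fit. This is a minimal sketch on made-up data; the helper name `sums_of_squares` and the data values are hypothetical.

```python
def sums_of_squares(x, y):
    """Fit y-hat = a + b*x by least squares and return (TSS, RSS, ESS)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    y_hat = [a + b * xi for xi in x]
    tss = sum((yi - y_bar) ** 2 for yi in y)                 # total
    rss = sum((yh - y_bar) ** 2 for yh in y_hat)             # regression
    ess = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # error
    return tss, rss, ess

# Hypothetical data, chosen only for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
tss, rss, ess = sums_of_squares(x, y)
r_squared = rss / tss
print(f"TSS={tss:.1f}, RSS={rss:.1f}, ESS={ess:.1f}, R^2={r_squared:.2f}")
# TSS=6.0, RSS=3.6, ESS=2.4, R^2=0.60
```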
ANOVA table
SV          df         SS                                               MSS                    F
Regression  k          RSS = $\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$    MSSR = RSS/k           F = MSSR/MSSE
Error       n-(k+1)    ESS = $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$        MSSE = ESS/(n-k-1)
Total       n-1        TSS = $\sum_{i=1}^{n}(y_i - \bar{y})^2$
Equivalently, $F = \dfrac{RSS/k}{ESS/(n-k-1)}$.
Abbreviation: SV – Source of variation, df – degrees of freedom, SS – Sum
of Squares, MSS – Mean Sum of Squares, RSS – Regression Sum of
Squares, ESS – Error Sum of Squares, TSS – Total Sum of squares.
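The mean squares and the F ratio in the ANOVA table follow directly from the sums of squares and their degrees of freedom. A minimal sketch, assuming hypothetical values of RSS, ESS, n, and k (the function name `anova_f` is also an assumption):

```python
def anova_f(rss, ess, n, k):
    """Return (MSSR, MSSE, F) for a regression with k independent variables and n observations."""
    mssr = rss / k              # regression mean square, df = k
    msse = ess / (n - k - 1)    # error mean square, df = n - (k + 1)
    return mssr, msse, mssr / msse

# Hypothetical sums of squares, chosen only for illustration
mssr, msse, f_stat = anova_f(rss=3.6, ess=2.4, n=5, k=1)
print(f"MSSR={mssr:.2f}, MSSE={msse:.2f}, F={f_stat:.2f}")  # MSSR=3.60, MSSE=0.80, F=4.50
```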
Questions:
The majority of the above questions can be answered from the following
computer output:
where $X\beta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$.
When at least one of the assumptions of normality, independence, and
constant variance is violated, we use a generalized linear model (GLM)
(Agresti, 2019, Chapter 3). Simple linear regression and multiple linear
regression are particular cases of the GLM.
$$\ln\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k = X\beta.$$
Thus,
$$p = P(Y = 1 \mid x) = E(Y \mid x) = \frac{e^{X\beta}}{1 + e^{X\beta}} = \frac{1}{1 + e^{-X\beta}}.$$
Note that logistic regression is nonlinear in the regression coefficients.
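$$\text{Odds} = \frac{p}{1-p} = \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k) = \exp(\beta_0) \times \exp(\beta_1 x_1) \times \exp(\beta_2 x_2) \times \cdots \times \exp(\beta_k x_k),$$
where $p = \Pr(Y = 1 \mid x)$ and $(1 - p) = \Pr(Y = 0 \mid x)$.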
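Because the odds factor into a product of exponential terms, raising one predictor by one unit (holding the others fixed) multiplies the odds by exp of that predictor's coefficient. The sketch below illustrates this; the coefficient vector `beta` and the predictor values are hypothetical.

```python
import math

def odds(beta, x):
    """Odds p/(1-p) = exp(beta_0 + beta_1*x_1 + ... + beta_k*x_k)."""
    linear_predictor = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return math.exp(linear_predictor)

# Hypothetical coefficients (beta_0, beta_1, beta_2) and predictor values
beta = [-1.0, 0.5, -0.2]
x = [2.0, 3.0]
x_plus_one = [3.0, 3.0]   # x_1 increased by one unit, x_2 unchanged

# Product form: exp(beta_0) * exp(beta_1*x_1) * exp(beta_2*x_2)
product_form = math.exp(beta[0]) * math.exp(beta[1] * x[0]) * math.exp(beta[2] * x[1])

odds_ratio = odds(beta, x_plus_one) / odds(beta, x)
print(f"odds ratio for x_1: {odds_ratio:.4f}, exp(beta_1): {math.exp(beta[1]):.4f}")
```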
The sigmoid function $y = \dfrac{1}{1 + e^{-X\beta}}$ (so named because it looks
like an s) is also called the logistic function. It takes a real value
and maps it to the interval (0, 1). It is nearly linear around 0, but
outlier values get squashed toward 0 or 1.
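The behaviour of the sigmoid near 0 and in the tails can be seen by evaluating it at a few points. This is a minimal sketch; splitting on the sign of z is an implementation choice made here to avoid overflow in `math.exp` for large negative inputs, and the names `sigmoid` and `logit` are assumptions.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def logit(p):
    """Log odds ln(p / (1 - p)), the inverse of the sigmoid for 0 < p < 1."""
    return math.log(p / (1.0 - p))

for z in (-6, -1, 0, 1, 6):
    print(f"sigmoid({z:+d}) = {sigmoid(z):.4f}")
```

In floating-point arithmetic, very large positive or negative inputs round to exactly 1.0 or 0.0, even though the mathematical function never reaches those values.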
Note that the method of least squares is used in MLR and maximum
likelihood in logistic regression. We require a computer to estimate
each model listed in the GLM table.
In terms of log odds, logistic regression is like regular regression:
the log odds are a linear function of the regression coefficients.
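To illustrate why maximum likelihood needs a computer, the sketch below fits a one-predictor logistic model by Newton-Raphson iterations on the log-likelihood. It is an illustrative sketch only, not the algorithm of any particular package; the function name `fit_logistic_newton`, the fixed iteration count, and the data are assumptions made for this example.

```python
import math

def fit_logistic_newton(x, y, iterations=25):
    """Maximum-likelihood fit of ln(p/(1-p)) = b0 + b1*x by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0            # gradient (score) of the log-likelihood
        h00 = h01 = h11 = 0.0    # information matrix (negative Hessian)
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Solve the 2x2 system (information matrix) * step = gradient
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data (not linearly separable), chosen only for illustration
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic_newton(x, y)
print(f"log odds = {b0:.3f} + {b1:.3f} x")
```

Unlike least squares, there is no closed-form solution here: each iteration recomputes the fitted probabilities and updates the coefficients until the score equations are (numerically) zero.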
3. Confusion Matrix
True Negative Rate, $TNR = \dfrac{TN}{TN + FP}$. It indicates how many
negative values, out of all the negative values, have been correctly
predicted. It is also known as Specificity.
False Negative Rate, $FNR = \dfrac{FN}{FN + TP}$. It indicates how many
positive values, out of all the positive values, have been incorrectly
predicted.
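Both rates can be computed directly from the four cells of a confusion matrix. A minimal sketch, assuming hypothetical counts for TP, FN, FP, and TN:

```python
def true_negative_rate(tn, fp):
    """TNR (specificity): share of actual negatives predicted as negative."""
    return tn / (tn + fp)

def false_negative_rate(fn, tp):
    """FNR: share of actual positives predicted as negative."""
    return fn / (fn + tp)

# Hypothetical confusion-matrix counts, chosen only for illustration
tp, fn, fp, tn = 40, 10, 5, 45
tnr = true_negative_rate(tn, fp)
fnr = false_negative_rate(fn, tp)
print(f"TNR = {tnr:.2f}, FNR = {fnr:.2f}")  # TNR = 0.90, FNR = 0.20
```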
The following data (Table 11.4.3 in the text) are the ages of 185 women
discharged from a hospital in Australia who met eligibility criteria
involving discharge for myocardial infarction, artery bypass surgery,
angioplasty, or stent. Each age is followed by the indicator ATT
(1 = attended cardiac rehabilitation, 0 = did not attend).
44 0 63 0 67 0 85 0 59 1 70 1
53 0 63 0 55 0 84 0 73 1 70 1
45 0 72 0 71 0 39 0 73 1 63 1
79 0 64 0 80 0 52 0 65 1 63 1
46 0 72 0 75 0 67 0 67 1 65 1
62 0 79 0 69 0 82 0 60 1 67 1
58 0 75 0 80 0 84 0 69 1 68 1
70 0 70 0 79 0 79 0 61 1 84 1
60 0 73 0 71 0 81 0 79 1 69 1
67 0 66 0 69 0 74 0 66 1 78 1
64 0 75 0 78 0 85 0 68 1 69 1
62 0 73 0 75 0 92 0 61 1 79 1
50 0 71 0 71 0 69 0 63 1 83 1
61 0 72 0 69 0 83 0 70 1 67 1
69 0 69 0 77 0 82 0 68 1 47 1
74 0 76 0 81 0 85 0 59 1 57 1
65 0 60 0 78 0 82 0 64 1 66 1
80 0 79 0 76 0 80 0 62 1
69 0 78 0 84 0 74 1 74 1
77 0 62 0 74 0 50 1 61 1
61 0 73 0 59 0 55 1 69 1
72 0 46 0 81 0 66 1 76 1
67 0 57 0 74 0 49 1 71 1
73 0 53 0 77 0 55 1 61 1
75 0 40 0 59 0 73 1 46 1
(Data file: Logisticdata1.xlsx)
Partial SPSS output: Model Summary; Confusion/Classification Table
(observed ATT by predicted ATT, with percentage correct).
From the above SPSS output, we can write the estimated binary logistic
regression model as
$$\hat{y}_i = \ln\frac{\hat{p}_i}{1 - \hat{p}_i} = \hat{\alpha} + \hat{\beta} x_i = 1.875 - 0.038\,x_i.$$
The predicted probability of attending cardiac rehabilitation for a
woman aged $x_i$ is
$$\hat{p}_i = \frac{1}{1 + e^{-(1.875 - 0.038 x_i)}}.$$
For x = 57, $\hat{p} = \dfrac{1}{1 + e^{-(1.875 - 0.038 \times 57)}} = 0.427759$.
Since $\hat{p}_{57} = 0.427759 < 0.50$, a 57-year-old woman is predicted not to
participate in the program.
For x = 37, $\hat{p} = \dfrac{1}{1 + e^{-(1.875 - 0.038 \times 37)}} = 0.615147$.
Since $\hat{p}_{37} = 0.615147 > 0.50$, a 37-year-old woman is predicted to
participate in the program.
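The two predicted probabilities can be reproduced from the estimated coefficients 1.875 and -0.038 given in the text. In the sketch below, the function names and the 0.50 cutoff argument are assumptions made for illustration; the coefficients are taken from the fitted model above.

```python
import math

ALPHA_HAT = 1.875    # estimated intercept, from the text
BETA_HAT = -0.038    # estimated coefficient of age, from the text

def predicted_probability(age):
    """Predicted probability of attending, p-hat = 1 / (1 + exp(-(alpha + beta*age)))."""
    return 1.0 / (1.0 + math.exp(-(ALPHA_HAT + BETA_HAT * age)))

def predicted_class(age, cutoff=0.50):
    """Classify as 1 (attends) when p-hat exceeds the cutoff, else 0."""
    return 1 if predicted_probability(age) > cutoff else 0

for age in (57, 37):
    print(f"age {age}: p-hat = {predicted_probability(age):.6f}, "
          f"predicted ATT = {predicted_class(age)}")
# age 57: p-hat = 0.427759, predicted ATT = 0
# age 37: p-hat = 0.615147, predicted ATT = 1
```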
Test: We can check the adequacy of the logistic model by testing the null
hypothesis that the slope of the regression line/coefficient of age (x) is
zero. That is, we test the null hypothesis
$H_0: \beta = 0$ versus the two-sided alternative $H_a: \beta \neq 0$.
Under the null hypothesis, the test statistic is
$$W = \left(\frac{\hat{\beta}}{se(\hat{\beta})}\right)^2 \sim \chi^2_1$$
(distributed as Chi-square with 1 degree of freedom).
From the computer output, W = 6.710 with p-value 0.01 < 0.05.
Thus, we reject the null hypothesis at a 5% significance level.
The logistic regression coefficient is significant, and hence, the logistic
regression model is adequate. That is, the age of a woman influences her
participation in the program.
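The p-value for the Wald statistic can be checked without statistical tables, because a chi-square variable with 1 degree of freedom is the square of a standard normal variable, so P(chi-square > w) = erfc(sqrt(w/2)). The sketch below uses W = 6.710 from the text; the standard error 0.0147 is an assumption back-calculated from W and the coefficient, not a value read from the output.

```python
import math

def wald_statistic(beta_hat, se):
    """Wald statistic W = (beta-hat / se(beta-hat))^2."""
    return (beta_hat / se) ** 2

def chi2_1df_p_value(w):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(w / 2.0))

w_reported = 6.710                       # from the text
p_value = chi2_1df_p_value(w_reported)
print(f"W = {w_reported}, p-value = {p_value:.4f}")

# Assumed standard error (back-calculated, not from the output)
w_check = wald_statistic(-0.038, 0.0147)
print(f"W from assumed se: {w_check:.2f}")
```

The computed p-value is just under 0.01, in line with the reported value, so the null hypothesis is rejected at the 5% level.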
References:
Agresti, A. (2019). An Introduction to Categorical Data Analysis,
Chapter 3. Wiley.
Agresti, A. (2019). An Introduction to Categorical Data Analysis,
Chapter 4. Wiley.