Unit - 5
Call:
glm(formula = Gender ~ Hgt, family = binomial, data = Pulse)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.77443  -0.34870  -0.05375   0.32973   2.37928

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  64.1416     8.3694   7.664 1.81e-14 ***
Hgt          -0.9424     0.1227  -7.680 1.60e-14 ***
---
\hat{p} = \frac{e^{64.14 - 0.9424\,\mathrm{Hgt}}}{1 + e^{64.14 - 0.9424\,\mathrm{Hgt}}}, the proportion of females at that Hgt.
> plot(fitted(logitmodel)~Pulse$Hgt)
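A minimal sketch of the full fit in R, assuming the Pulse data frame codes Gender as 0/1 (model and variable names as in the output above):

logitmodel <- glm(Gender ~ Hgt, family = binomial, data = Pulse)
summary(logitmodel)                    # reproduces the coefficient table above
plot(fitted(logitmodel) ~ Pulse$Hgt)   # fitted proportion of females against height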
Example: Golf Putts
Length    3    4    5    6    7
Made     84   88   61   61   44
Missed   17   31   47   64   90
Total   101  119  108  125  134
Call:
glm(formula = Made ~ Length, family = binomial, data = Putts1)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8705 -1.1186 0.6181 1.0026 1.4882
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.25684 0.36893 8.828 <2e-16 ***
Length -0.56614 0.06747 -8.391 <2e-16 ***
---
\log\left(\frac{\hat{p}}{1-\hat{p}}\right) vs. Length

[Figure: logitPropMade plotted against PuttLength (3 to 7 feet), showing the linear part of the logistic fit]
Probability Form of Putting Model
\hat{p} = \frac{e^{3.257 - 0.566\,\mathrm{Length}}}{1 + e^{3.257 - 0.566\,\mathrm{Length}}}

[Figure: Probability Made against PuttLength (2 to 12 feet), showing the fitted S-shaped logistic curve]
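A small sketch of how this curve can be drawn in R from the fitted model (coefficients as in the output above):

putt.mod <- glm(Made ~ Length, family = binomial, data = Putts1)
b <- coef(putt.mod)                    # b[1] = 3.257, b[2] = -0.566
curve(exp(b[1] + b[2]*x) / (1 + exp(b[1] + b[2]*x)),
      from = 2, to = 12, xlab = "PuttLength", ylab = "Probability Made")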
Odds

Definition:

\frac{P(\mathrm{Yes})}{1 - P(\mathrm{Yes})} = \frac{P(\mathrm{Yes})}{P(\mathrm{No})} is the odds of Yes.

Solving for the probability: p = \frac{\mathrm{odds}}{1 + \mathrm{odds}}.
Logit form of the model:

\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X

⇒ The logistic model assumes a linear relationship between the predictors and log(odds).

\mathrm{odds} = \frac{p}{1-p} = e^{\beta_0 + \beta_1 X}
Odds Ratio
\mathrm{Odds\ Ratio\ } OR = \frac{\mathrm{Odds}_1}{\mathrm{Odds}_2}

When X is replaced by X + 1, odds = e^{\beta_0 + \beta_1 X} is replaced by odds = e^{\beta_0 + \beta_1 (X+1)}. So the ratio is

\frac{e^{\beta_0 + \beta_1 (X+1)}}{e^{\beta_0 + \beta_1 X}} = e^{(\beta_0 + \beta_1 (X+1)) - (\beta_0 + \beta_1 X)} = e^{\beta_1}
Example: TMS for Migraines
Transcranial Magnetic Stimulation vs. Placebo
Pain Free? TMS Placebo
YES 39 22
NO 61 78
Total 100 100
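As a worked application of the odds-ratio formula to this table (the arithmetic is not shown on the original slide):

\mathrm{Odds}_{TMS} = \frac{39}{61} \approx 0.639, \qquad \mathrm{Odds}_{Placebo} = \frac{22}{78} \approx 0.282, \qquad OR = \frac{0.639}{0.282} \approx 2.27

So the odds of being pain-free are about 2.3 times higher with TMS than with the placebo.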
• There are multiple ways to train a Logistic Regression model (fit the S-shaped line to our data). We can use an iterative optimization algorithm like Gradient Descent to calculate the parameters of the model (the weights), or we can use probabilistic methods like Maximum Likelihood.
• Once we have used one of these methods to train our model, we are ready to make some predictions.
• Let's see an example of how the process of training a Logistic Regression model and using it to make predictions would go:
• First, we would collect a dataset of patients who have and who have not been diagnosed as obese, along with their corresponding weights.
• After this, we would train our model to fit the S-shaped line to the data and obtain the parameters of the model, in this example by Maximum Likelihood; a sketch of this workflow is shown below.
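A minimal sketch of that workflow in R, assuming a hypothetical data frame patients with a 0/1 obese label and a weight column (names and data are illustrative, not from the original slides):

# hypothetical data: obese (0/1 outcome), weight (predictor)
model <- glm(obese ~ weight, family = binomial, data = patients)
coef(model)   # fitted parameters: intercept (beta0) and weight coefficient (beta1)

# predicted probability of obesity for a new patient weighing 85 kg
predict(model, newdata = data.frame(weight = 85), type = "response")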
• General form:

P(X_1, X_2, \ldots, X_N) = \prod_i P(X_i \mid \mathrm{parents}(X_i))
Examples of 3-way Bayesian Networks
• Absolute independence (no edges between A, B, C): p(A,B,C) = p(A) p(B) p(C)
• Conditionally independent effects (A → B, A → C): p(A,B,C) = p(B|A) p(C|A) p(A)
• Markov dependence (A → B → C): p(A,B,C) = p(C|B) p(B|A) p(A)
The Alarm Example

B      E      A      P(A|B,E)
false  false  false  0.999
false  false  true   0.001
false  true   false  0.71
false  true   true   0.29
true   false  false  0.06
true   false  true   0.94
true   true   false  0.05
true   true   true   0.95
A Directed Acyclic Graph

[Figure: DAG with Burglary → Alarm ← Earthquake]
A Set of Parameters

B      P(B)          E      P(E)
false  0.999         false  0.998
true   0.001         true   0.002

B      E      A      P(A|B,E)
false  false  false  0.999
false  false  true   0.001
false  true   false  0.71
false  true   true   0.29
true   false  false  0.06
true   false  true   0.94
true   true   false  0.05
true   true   true   0.95

Each node Xi has a conditional probability distribution P(Xi | Parents(Xi)) that quantifies the effect of the parents on the node. The parameters are the probabilities in these conditional probability distributions. Because we have discrete random variables, we have conditional probability tables (CPTs).
A Set of Parameters

The conditional probability distribution for Alarm (the CPT above) stores the probability distribution for Alarm given the values of Burglary and Earthquake. For a given combination of values of the parents (B and E in this example), the entries for P(A=true|B,E) and P(A=false|B,E) must add up to 1, e.g. P(A=true|B=false,E=false) + P(A=false|B=false,E=false) = 1.
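A small sketch in R of how the factorization P(B,E,A) = P(B) P(E) P(A|B,E) combines these CPT entries into a joint probability (values taken from the tables above):

# P(Burglary = true, Earthquake = false, Alarm = true)
p_B       <- 0.001   # P(B = true)
p_notE    <- 0.998   # P(E = false)
p_A_BnotE <- 0.94    # P(A = true | B = true, E = false)
p_B * p_notE * p_A_BnotE   # joint probability = 0.00093812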
where x_{i,j} = \begin{cases} 1 & \text{if flip } i \text{ of coin } j \text{ is heads} \\ 0 & \text{otherwise} \end{cases}

• No hidden variables – easy solution:

\hat{\theta}_j = \frac{1}{m_j} \sum_{i=1}^{m_j} x_{i,j} \quad \text{(the sample mean)}
Simplified MLE
Goal: determine coin parameters without knowing the identity of each
data set’s coin.
• What if you were given the same dataset of coin flip results, but no coin identities defining the datasets?
1) Expectation (E-step)

E[z_{i,j}] = \frac{p(x = x_i \mid \theta = \theta_j)}{\sum_{n=1}^{k} p(x = x_i \mid \theta = \theta_n)}

2) Maximization (M-step)

If z_{i,j} is known: \hat{\theta}_j = \frac{1}{m_j} \sum_{i=1}^{m_j} x_{i,j}

Using the expected values instead:

\hat{\theta}_j = \frac{\sum_{i=1}^{m} E[z_{i,j}]\, x_i}{\sum_{i=1}^{m} E[z_{i,j}]}
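A compact sketch of the full EM loop for the two-coin problem in R, assuming each data set is summarized by its number of heads out of n flips (function and variable names are illustrative; only set 1, 5 heads out of 10, is taken from the slides):

em_coins <- function(x, n, theta = c(A = 0.6, B = 0.5), iters = 20) {
  for (t in 1:iters) {
    # E-step: posterior responsibility of coin A for each set
    likA <- dbinom(x, n, theta["A"])
    likB <- dbinom(x, n, theta["B"])
    wA <- likA / (likA + likB)
    # M-step: responsibility-weighted proportion of heads
    theta["A"] <- sum(wA * x) / sum(wA * n)
    theta["B"] <- sum((1 - wA) * x) / sum((1 - wA) * n)
  }
  theta
}
em_coins(x = c(5, 9, 8, 4, 7), n = 10)   # set 1 matches the slide; other sets are made up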
EM- Coin Flip example
• Compute a probability
distribution of possible
completions of the data using
current parameters
EM- Coin Flip example
Set 1
• What is the probability that I observe 5 heads and 5 tails from coin A or coin B, given the initial parameters θA = 0.6, θB = 0.5?
• Compute the likelihood of set 1 coming from coin A or B using the binomial distribution with success probability θ on n trials with k successes
• Likelihood of “A”=0.00079
• Likelihood of “B”=0.00097
• Normalize to get probabilities A=0.45, B=0.55
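The numbers above can be checked directly in R (these are sequence probabilities, without the binomial coefficient, which cancels when normalizing):

likA <- 0.6^5 * 0.4^5    # 0.000796, "Likelihood of A"
likB <- 0.5^5 * 0.5^5    # 0.000977, "Likelihood of B"
likA / (likA + likB)     # 0.449 -> probability A = 0.45
likB / (likA + likB)     # 0.551 -> probability B = 0.55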
Hierarchical Clustering
• In hierarchical clustering the goal is to produce a hierarchical
series of nested clusters, ranging from clusters of individual
points at the bottom to an all-inclusive cluster at the top. A
diagram called a dendrogram graphically represents this
hierarchy.
Dendrogram

[Figure: dendrogram representing the nested cluster hierarchy]
Types of Hierarchical Clustering
• A hierarchical method can be classified as being either
• Agglomerative
• Divisive
Agglomerative Hierarchical Clustering
• Start with the points as individual clusters and, at each step, merge
the closest pair of clusters.
1. Linkage Method
• Single Linkage
• Complete Linkage
• Average Linkage
2. Search the distance matrix for the most similar pair of clusters and merge them. Under single linkage, the distance between two clusters is the minimum distance between any two points in the different clusters, i.e.

d(U, V) = \min_{i \in U,\, j \in V} d(i, j)

[Example with objects 1-5 omitted]
Complete Linkage
• For complete linkage, the distance between two clusters is the maximum distance between any two points in the different clusters, i.e.

d(U, V) = \max_{i \in U,\, j \in V} d(i, j)

[Example with objects 1-5 omitted]
Average Linkage
• Average linkage treats the distance between two clusters as the average distance between all pairs of items where one member of the pair belongs to each cluster, i.e.

d(U, V) = \frac{1}{n_U\, n_V} \sum_{i \in U} \sum_{j \in V} d(i, j)

[Example with objects 1-5 omitted]
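A minimal sketch of agglomerative clustering with all three linkage methods in R, using built-in functions on an illustrative 5-point data set (the data are made up, not from the slides):

# illustrative data: five points in two dimensions
pts <- matrix(c(1, 1,  1.5, 1.2,  5, 5,  5.5, 4.8,  9, 2), ncol = 2, byrow = TRUE)
d <- dist(pts)    # pairwise Euclidean distance matrix

hc_single   <- hclust(d, method = "single")    # min distance between clusters
hc_complete <- hclust(d, method = "complete")  # max distance between clusters
hc_average  <- hclust(d, method = "average")   # mean of all pairwise distances

plot(hc_average)   # dendrogram of the nested hierarchy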
Divisive Clustering