
Unit - 5


Logistic Regression

• Linear regression is the simplest and most extensively used statistical
technique for predictive modelling analysis.
• It is a way to explain the relationship between a dependent variable
(target) and one or more explanatory variables(predictors) using a
straight line.
• Linear Regression can be really useful when you are trying to predict a
continuous output value from a linear relationship.
• The output of Logistic Regression, by contrast, always lies between 0 and 1: it is a
probability. Hence, a continuous target whose values fall outside the range
0 to 1 cannot be modelled with Logistic Regression.
•  “How to draw the straight line that fits as closely to these (sample)
points as possible?”
• The most common method for fitting a regression line is the method
of Ordinary Least Squares used to minimize the sum of squared errors (SSE).
• Now suppose we have a classification problem: we want to predict the binary
output variable Y (2 values: either 1 or 0). For example, the case of
flipping a coin (Head/Tail). The response yi is binary: 1 if the coin is
Head, 0 if the coin is Tail.
• Linear regression models a continuous response, not a Bernoulli (0/1)
response.
• The problem with Linear Regression is that its predictions are not
sensible for classification: the true probability must fall between
0 and 1, but the fitted line can give values larger than 1 or smaller than 0.
• So… how can we make predictions for a classification problem?
• We can transform our linear regression into a logistic regression curve.
• The final output in logistic regression is a numbered classification,
but before the classification is given, the ACTUAL output is a
numerical probability in the range of 0 to 1.
• Based on the probability, a classification of 1 or 0 will be given. The
algorithm essentially rounds the value (a threshold of 0.5) to give a classification; 0 being
your negative class and 1 being your positive class.
Binary Logistic Regression Model
Y = Binary response        X = Quantitative predictor
π = proportion of 1’s (yes, success) at any X
Equivalent forms of the logistic regression model:
Logit form:
\[ \log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X \]
Probability form:
\[ \pi = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \]
What does this look like?
N.B.: This is natural log (aka “ln”)
Binary Logistic Regression via R
> logitmodel=glm(Gender~Hgt,family=binomial,
data=Pulse)
> summary(logitmodel)
Call:
glm(formula = Gender ~ Hgt, family = binomial)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.77443  -0.34870  -0.05375   0.32973   2.37928

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  64.1416     8.3694   7.664 1.81e-14 ***
Hgt          -0.9424     0.1227  -7.680 1.60e-14 ***
---

\[ \hat{\pi} = \frac{e^{64.14 - 0.9424\,\mathrm{Hgt}}}{1 + e^{64.14 - 0.9424\,\mathrm{Hgt}}} \]
where \(\hat{\pi}\) is the estimated proportion of females at that Hgt.
> plot(fitted(logitmodel)~Pulse$Hgt)
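As a rough check on the fitted model, the Python sketch below evaluates the probability form at a few heights using the coefficients from the summary. The function name `prob_female` and the chosen heights are illustrative assumptions, not part of the slides, and the heights are assumed to be in the same units as Hgt in the Pulse data.

```python
import math

# Coefficients copied from the glm summary above.
B0, B1 = 64.1416, -0.9424

def prob_female(hgt):
    """Probability form of the fitted logistic model at height hgt."""
    z = B0 + B1 * hgt
    return math.exp(z) / (1.0 + math.exp(z))

# Height at which the fitted probability is exactly 0.5 (logit = 0).
midpoint = -B0 / B1

for hgt in (62, 66, 70, 74):  # illustrative heights, not from the slides
    print(hgt, round(prob_female(hgt), 3))
print("50% point:", round(midpoint, 2))
```

Because the Hgt coefficient is negative, the fitted proportion of females falls as height increases.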
Example: Golf Putts
Length (ft)    3    4    5    6    7
Made          84   88   61   61   44
Missed        17   31   47   64   90
Total        101  119  108  125  134

Build a model to predict the proportion of
putts made (success) based on length (in feet).
Logistic Regression for Putting

Call:
glm(formula = Made ~ Length, family = binomial, data =
Putts1)

Deviance Residuals:
Min 1Q Median 3Q Max
-1.8705 -1.1186 0.6181 1.0026 1.4882

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.25684 0.36893 8.828 <2e-16 ***
Length -0.56614 0.06747 -8.391 <2e-16 ***
---
[Figure: \(\log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right)\) vs. Length, showing the linear part of the logistic fit. x-axis: PuttLength (3 to 7); y-axis: logitPropMade (-0.5 to 1.5).]
Probability Form of Putting Model

\[ \hat{\pi} = \frac{e^{3.257 - 0.566\,\mathrm{Length}}}{1 + e^{3.257 - 0.566\,\mathrm{Length}}} \]

[Figure: fitted probability curve. x-axis: PuttLength (2 to 12); y-axis: Probability Made (0.0 to 1.0).]
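To see how closely the fitted curve tracks the data, the sketch below compares the observed proportion of putts made at each length with the fitted probability. The counts and coefficients are taken from the table and summary above; the function name `fitted_prob` is an illustrative assumption.

```python
import math

lengths = [3, 4, 5, 6, 7]
made = [84, 88, 61, 61, 44]
missed = [17, 31, 47, 64, 90]

B0, B1 = 3.25684, -0.56614  # coefficients from the glm summary

def fitted_prob(length):
    """Probability form: e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    z = B0 + B1 * length
    return math.exp(z) / (1.0 + math.exp(z))

for x, m, k in zip(lengths, made, missed):
    p_obs = m / (m + k)                         # observed proportion made
    logit_obs = math.log(p_obs / (1 - p_obs))   # observed log(odds)
    print(x, round(p_obs, 3), round(logit_obs, 3), round(fitted_prob(x), 3))
```

The observed logits are the points in the "linear part" plot; the fitted probabilities trace the S-shaped curve.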
Odds
Definition:
\[ \frac{\pi}{1-\pi} = \frac{P(\text{Yes})}{P(\text{No})} \]
is the odds of Yes.
\[ \text{odds} = \frac{\pi}{1-\pi} \iff \pi = \frac{\text{odds}}{1+\text{odds}} \]
Logit form of the model:
\[ \log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X \]
⇒ The logistic model assumes a linear relationship between the predictors
and log(odds).
\[ \text{odds} = \frac{\pi}{1-\pi} = e^{\beta_0 + \beta_1 X} \]
Odds Ratio

A common way to compare two groups
is to look at the ratio of their odds:

\[ \text{Odds Ratio} = OR = \frac{\text{Odds}_1}{\text{Odds}_2} \]
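X is replaced by X + 1:
\[ \text{odds} = e^{\beta_0 + \beta_1 X} \]
is replaced by
\[ \text{odds} = e^{\beta_0 + \beta_1 (X+1)} \]
So the ratio is
\[ \frac{e^{\beta_0 + \beta_1 (X+1)}}{e^{\beta_0 + \beta_1 X}} = e^{\beta_0 + \beta_1 (X+1) - (\beta_0 + \beta_1 X)} = e^{\beta_1} \]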
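The result above can be checked numerically. The sketch below uses the putting-model coefficients from the earlier summary (the function name `odds` is an illustrative assumption) and shows that increasing Length by 1 multiplies the odds by the same factor at every starting length.

```python
import math

B0, B1 = 3.25684, -0.56614  # putting-model coefficients from the summary

def odds(x):
    """Odds of success at predictor value x: e^(b0 + b1*x)."""
    return math.exp(B0 + B1 * x)

for x in (3, 4, 5, 6):
    print(x, round(odds(x + 1) / odds(x), 4))

print("e^b1 =", round(math.exp(B1), 4))
```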
Example: TMS for Migraines
Transcranial Magnetic Stimulation vs. Placebo
Pain Free?   TMS   Placebo
YES           39        22
NO            61        78
Total        100       100

\[ \hat{\pi}_{TMS} = 0.39, \quad \text{odds}_{TMS} = \frac{39/100}{61/100} = \frac{39}{61} = 0.639, \quad \hat{\pi}_{TMS} = \frac{0.639}{1+0.639} = 0.39 \]
\[ \hat{\pi}_{Placebo} = 0.22, \quad \text{odds}_{Placebo} = \frac{22}{78} = 0.282 \]
\[ \text{Odds ratio} = \frac{0.639}{0.282} = 2.27 \]
The odds of getting relief are 2.27 times higher using TMS than placebo.
Logistic Regression for TMS data
> lmod=glm(cbind(Yes,No)~Group,family=binomial,data=TMS)
> summary(lmod)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.2657 0.2414 -5.243 1.58e-07 ***
GroupTMS 0.8184 0.3167 2.584 0.00977 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.8854 on 1 degrees of freedom
Residual deviance: 0.0000 on 0 degrees of freedom
AIC: 13.701

Note: \(e^{0.8184} = 2.27\) = odds ratio
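The link between the 2×2 table and the fitted coefficients can be verified directly. The sketch below assumes Placebo is the baseline group (which is what the `GroupTMS` coefficient name suggests), so the intercept is the log-odds for Placebo and the `GroupTMS` coefficient is the log of the odds ratio.

```python
import math

yes = {"TMS": 39, "Placebo": 22}
no = {"TMS": 61, "Placebo": 78}

odds = {g: yes[g] / no[g] for g in yes}      # odds of being pain free
odds_ratio = odds["TMS"] / odds["Placebo"]

# With Placebo as the baseline group, the coefficients are log-odds:
intercept = math.log(odds["Placebo"])        # compare with -1.2657
group_tms = math.log(odds_ratio)             # compare with 0.8184

print(round(odds["TMS"], 3), round(odds["Placebo"], 3), round(odds_ratio, 2))
print(round(intercept, 4), round(group_tms, 4))
```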


• The sigmoid function is the logistic function (also called the ‘inverse logit’).
• The formula for the sigmoid function is the following:
\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]
• If we wanted to predict if a person was obese or not given their
weight, we would first compute a weighted sum of their weight and
then input this into the sigmoid function:
• 1) Calculate weighted sum of inputs
• 2) Calculate the probability of Obese

• There are multiple ways to train a Logistic Regression model (fit the S
shaped line to our data). We can use an iterative optimization algorithm
like Gradient Descent to calculate the parameters of the model (the
weights) or we can use probabilistic methods like Maximum likelihood.
• Once we have used one of these methods to train our model, we are
ready to make some predictions.
• Let's see an example of how the process of training a Logistic
Regression model and using it to make predictions would go:
• First, we would collect a Dataset of patients who have and who have not
been diagnosed as obese, along with their corresponding weights.
• After this, we would train our model, to fit our S shape line to the data and
obtain the parameters of the model. After training using Maximum
Likelihood, we got the following parameters:

• Now, we are ready to make some predictions: imagine we got two
patients; one is 60 kg and one is 120 kg. Let's see what happens when we
plug these numbers into the model:
• The first patient (60 kg) has a very low probability of being obese,
whereas the second one (120 kg) has a very high one.

• Now, given the weight of any patient, we could calculate their
probability of being obese, and give our doctors a quick first round of
information!
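The workflow described above (collect data, train, predict) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data set is a made-up toy example, the fitted parameters are therefore not the ones from the slides, training uses plain batch gradient descent rather than a library's Maximum Likelihood routine, and the rescaling constants `MEAN` and `SCALE` are arbitrary choices that keep the optimization well behaved.

```python
import math

# Hypothetical toy data (NOT from the slides): weights in kg, 1 = obese.
weights = [55, 60, 65, 70, 80, 90, 100, 110, 120]
obese = [0, 0, 0, 0, 1, 0, 1, 1, 1]

MEAN, SCALE = 85.0, 25.0  # assumed constants used only to rescale the input

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weight, w0, w1):
    """1) weighted sum of the (rescaled) input, 2) sigmoid -> probability."""
    z = w0 + w1 * (weight - MEAN) / SCALE
    return sigmoid(z)

def train(xs, ys, lr=0.5, steps=2000):
    """Batch gradient descent on the mean negative log-likelihood."""
    w0, w1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = predict(x, w0, w1) - y      # gradient of the loss w.r.t. z
            g0 += err
            g1 += err * (x - MEAN) / SCALE
        w0 -= lr * g0 / n
        w1 -= lr * g1 / n
    return w0, w1

w0, w1 = train(weights, obese)
p60, p120 = predict(60, w0, w1), predict(120, w0, w1)
print(round(p60, 3), round(p120, 3))
```

On this toy data the 60 kg patient gets a probability below 0.5 and the 120 kg patient one above 0.5, mirroring the qualitative conclusion in the slides.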
Random Forest
• Random Forest is a popular machine learning algorithm that belongs to
the family of supervised learning techniques.
• It can be used for both Classification and Regression problems in ML. It
is based on the concept of ensemble learning, which is a process
of combining multiple models to solve a complex problem and to
improve the performance of the model.
• Random Forest is a model that builds a number of decision trees
on various subsets of the given dataset and combines their outputs to
improve predictive accuracy.
• Instead of relying on one decision tree, the random forest takes the
prediction from each tree and predicts the final output based on the
majority vote of those predictions.
• A greater number of trees in the forest generally leads to more stable,
more accurate predictions and reduces the risk of overfitting compared
with a single tree.
• The below diagram explains the working of the Random Forest
algorithm:
• Let’s say that you’re looking to buy a house, but you’re unable to
decide which one to buy. So, you consult a few agents and they give
you a list of parameters that you should consider before buying a
house. The list includes:
• Price of the house
• Locality
• Number of bedrooms
• Parking space
• Available facilities
• These parameters are known as predictor variables, which are used to
find the response variable. 
• A single decision tree is built on the entire data set, making use of all the
predictor variables.
• Let’s see how Random Forest would solve the same problem.
• Random forest is an ensemble of decision trees: it randomly selects a
set of parameters and creates a decision tree for each set of chosen
parameters.
• For example, suppose there are 3 decision trees and each decision tree
takes only 3 parameters from the entire data set.
• Each decision tree predicts the outcome based on the respective
predictor variables used in that tree, and the forest finally combines
the results from all the decision trees (the average, for regression).
• In simple words, after creating multiple decision trees using this
method, each tree votes for a class, and the class receiving
the most votes is the predicted class.
• To conclude, a decision tree is built on the entire data set using all
the predictor variables, whereas a Random Forest creates
multiple decision trees, such that each decision tree is built only on a
part of the data set.
• The random forest algorithm works by completing the following steps:
• Step 1: The algorithm selects random samples from the dataset provided.
• Step 2: The algorithm creates a decision tree for each sample selected.
Then it gets a prediction result from each decision tree created.
• Step 3: The predicted results are then aggregated. For a
classification problem, it uses the mode (majority vote), and for a regression problem, it
uses the mean.
• Step 4: Finally, the algorithm returns the aggregated result (the most voted
class, or the mean) as the final prediction.
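The four steps can be sketched in Python as follows. This is a deliberately small illustration, not a production random forest: the data set is hypothetical, each "tree" is only a depth-1 stump, each stump looks at a single randomly chosen feature, and all function names are made up for this example. For a regression problem the final `majority` call would be replaced by a mean.

```python
import random
from collections import Counter

# Hypothetical toy data (not from the slides): two features, two classes.
X = [(1, 2), (2, 1), (3, 3), (4, 5), (5, 4),
     (11, 12), (12, 11), (13, 13), (14, 15), (15, 14)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, feature):
    """Depth-1 tree: best threshold on one feature (fewest training errors)."""
    values = sorted({r[feature] for r in rows})
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [l for r, l in zip(rows, labels) if r[feature] <= t]
        right = [l for r, l in zip(rows, labels) if r[feature] > t]
        l_lab, r_lab = majority(left), majority(right)
        errors = sum(l != l_lab for l in left) + sum(l != r_lab for l in right)
        if best is None or errors < best[0]:
            best = (errors, t, l_lab, r_lab)
    if best is None:                      # all sampled values identical
        lab = majority(labels)
        return lambda row: lab
    _, t, l_lab, r_lab = best
    return lambda row: l_lab if row[feature] <= t else r_lab

def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # Step 1: bootstrap sample
        feature = rng.randrange(len(rows[0]))            # random feature for this tree
        trees.append(fit_stump([rows[i] for i in idx],   # Step 2: one tree per sample
                               [labels[i] for i in idx], feature))
    return trees

def predict(trees, row):
    votes = [tree(row) for tree in trees]   # one prediction per tree
    return majority(votes)                  # Steps 3-4: mode of the votes

forest = fit_forest(X, y)
print(predict(forest, (0, 0)), predict(forest, (20, 20)))
```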
Terminologies associated with Bayesian Belief Networks
• Random Variables:
• A Random Variable assigns a value to each possible outcome of a random experiment.
• The set of all possible outcomes of the experiment is the Sample Space.
• For Example:
• Tossing a coin: We could get Heads or Tails. Let Heads=0 and Tails=1, and let the
Random Variable X represent the outcome.
• X ∈ {0, 1}
• The probability of X taking the value x is denoted by P(X = x), or P(x) for short.
• The Probability Mass Function (PMF) is f(x) = P(X = x).
• Therefore, for a fair coin, f(0) = f(1) = 1/2
• The probability of an event X not happening is denoted by P(~X) and is equal to 1 -
P(X).
• Intersection
• The probability that Events X and Y both occur is the probability of the
intersection of X and Y, denoted by P(X ∩ Y).
• Joint Distribution:
• A joint probability distribution shows a probability distribution for two (or
more) random variables.
• For Example:
• Let's have 2 coin tosses represented by random variables X and Y.
• The joint probability distribution f(x, y) of X and Y defines probabilities for each
pair of outcomes. X = {0, 1} and Y = {0, 1}
• All possible outcomes are: (X=0,Y=0), (X=0,Y=1), (X=1,Y=0), (X=1,Y=1). Each of
these outcomes has a probability of 1/4.
• f(0, 0) = f(0, 1) = f(1, 0) = f(1, 1) = 1/4.
• This concept can be extended to more than 2 variables as well.
• Conditional Distribution:
• Sometimes, we know an event has happened already and we want to model what will
happen next.
• The conditional probability of Y given X is defined as follows:
P(Y|X) = P(X ∩ Y)/P(X)

For Example:
• Yahoo’s share price is low and Microsoft will buy it.
• It is cloudy and it might rain.
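The joint and conditional distributions above can be made concrete with the two-coin example. The sketch below stores the joint distribution f(x, y) as a dictionary and derives a marginal and a conditional probability from it; the helper names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

# Joint distribution f(x, y) of two fair coin tosses (Heads=0, Tails=1).
joint = {(x, y): Fraction(1, 4) for x, y in product((0, 1), repeat=2)}

def marginal_x(x):
    """P(X = x), obtained by summing the joint over y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def conditional(y, x):
    """P(Y = y | X = x) = P(X = x and Y = y) / P(X = x)."""
    return joint[(x, y)] / marginal_x(x)

print(sum(joint.values()))   # total probability
print(marginal_x(1))         # P(X = 1)
print(conditional(1, 1))     # P(Y = 1 | X = 1)
```

Here P(Y = 1 | X = 1) equals P(Y = 1), which is what it means for the two tosses to be independent.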
• Conditional Independence
• The concept of Conditional Independence is the backbone of Bayesian Networks. Two
events are said to be independent if the occurrence of one event doesn't
affect the probability of the other event; they are conditionally independent given a
third event if this holds once the third event is known.
• For example, let one event be the tossing of a coin and the second event be whether it is
raining outside or not.
• These events are independent, since whether it rains or not doesn't
affect the probability of getting heads or tails.
Bayesian Belief Networks
• Bayesian Belief Networks are a simple, graphical notation for
conditional independence assertions.
• Bayesian network models capture both conditionally dependent and
conditionally independent relationships between random variables.
• They also compactly specify the joint distributions.
• They provide a graphical model of causal relationships on which
learning can be performed.

• Two components to a Bayesian network
• The graph structure (conditional independence assumptions)
• The numerical probabilities (for each variable given its parents)
Bayesian Networks

• General form:

\[ P(X_1, X_2, \ldots, X_N) = \prod_i P(X_i \mid \mathrm{parents}(X_i)) \]

Left-hand side: the full joint distribution. Right-hand side: the graph-structured approximation.

Example of a simple Bayesian network
[Graph: A → C ← B]
\[ P(X_1, X_2, \ldots, X_N) = \prod_i P(X_i \mid \mathrm{parents}(X_i)) \]
\[ P(A, B, C) = P(C \mid A, B)\, P(A)\, P(B) \]

• Probability model has simple factored form
• Directed edges => direct dependence
• Absence of an edge => conditional independence

• Also known as belief networks, graphical models, causal networks
• Other formulations, e.g., undirected graphical models
Examples of 3-way Bayesian Networks

[Graph: A, B, C with no edges]
Absolute Independence:
p(A,B,C) = p(A) p(B) p(C)
Examples of 3-way Bayesian Networks
• Conditionally independent effects:
p(A,B,C) = p(B|A) p(C|A) p(A)
[Graph: B ← A → C]
• B and C are conditionally independent given A
• e.g., A is a disease, and we model
B and C as conditionally
independent symptoms given A
Examples of 3-way Bayesian Networks
• Independent causes:
p(A,B,C) = p(C|A,B) p(A) p(B)
[Graph: A → C ← B]
• “Explaining away” effect:
• A and B are independent but become
dependent once C is known!!
• (we’ll come back to this later)
Examples of 3-way Bayesian Networks

[Graph: A → B → C]
Markov dependence:
p(A,B,C) = p(C|B) p(B|A) p(A)
The Alarm Example

• You have a new burglar alarm installed
• It is reliable at detecting burglary, but also responds to minor earthquakes
• Two neighbors (John, Mary) promise to call you at work when they hear the alarm
• John always calls when he hears the alarm, but confuses the alarm with the phone ringing
(and calls then also)
• Mary likes loud music and sometimes misses the alarm!
• Given evidence about who has and hasn’t called, estimate the probability of a
burglary
The Alarm Example

• Represent problem using 5 binary variables:
• B = a burglary occurs at your house
• E = an earthquake occurs at your house
• A = the alarm goes off
• J = John calls to report the alarm
• M = Mary calls to report the alarm
A Bayesian Network
A Bayesian network is made up of two parts:
1. A directed acyclic graph
2. A set of parameters

[Graph: Burglary → Alarm ← Earthquake]

B      P(B)        E      P(E)
false  0.999       false  0.998
true   0.001       true   0.002

B      E      A      P(A|B,E)
false  false  false  0.999
false  false  true   0.001
false  true   false  0.71
false  true   true   0.29
true   false  false  0.06
true   false  true   0.94
true   true   false  0.05
true   true   true   0.95
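With only these three tables, the joint distribution over Burglary, Earthquake, and Alarm is P(B, E, A) = P(B) P(E) P(A | B, E), and any query can be answered by summing the joint. The sketch below does this by brute-force enumeration; the function names are illustrative assumptions, and the John and Mary nodes are left out because their tables are not given here.

```python
from itertools import product

# CPTs copied from the tables above (True = event occurs).
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A_TRUE = {  # P(A=true | B, E), keyed by (B, E)
    (False, False): 0.001, (False, True): 0.29,
    (True, False): 0.94, (True, True): 0.95,
}

def joint(b, e, a):
    """P(B=b, E=e, A=a) = P(b) * P(e) * P(a | b, e)."""
    p_a = P_A_TRUE[(b, e)] if a else 1.0 - P_A_TRUE[(b, e)]
    return P_B[b] * P_E[e] * p_a

def prob(query, evidence):
    """P(query | evidence) by summing the joint over all assignments."""
    num = den = 0.0
    for b, e, a in product((True, False), repeat=3):
        world = {"B": b, "E": e, "A": a}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(b, e, a)
            den += p
            if all(world[k] == v for k, v in query.items()):
                num += p
    return num / den

print(round(prob({"B": True}, {"A": True}), 4))             # burglary given alarm
print(round(prob({"B": True}, {"A": True, "E": True}), 4))  # ... and an earthquake
```

The second probability is much smaller than the first: once an earthquake is known to have occurred, it "explains away" the alarm, which is the effect described for the independent-causes network.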
A Directed Acyclic Graph

[Graph: Burglary → Alarm ← Earthquake]

1. A directed acyclic graph:
• The nodes are random variables (which can be discrete or
continuous)
• Arrows connect pairs of nodes (X is a parent of Y if there is an
arrow from node X to node Y).
• Intuitively, an arrow from node X to node Y means X has a direct
influence on Y (we can say X has a causal effect on Y)
• Easy for a domain expert to determine these relationships
• The absence/presence of arrows will be made more precise later
on
A Set of Parameters
[Graph: Burglary → Alarm ← Earthquake; CPTs for B, E, and A as in the tables above]

Each node Xi has a conditional probability
distribution P(Xi | Parents(Xi)) that quantifies the
effect of the parents on the node

The parameters are the probabilities in these
conditional probability distributions

Because we have discrete random variables, we
have conditional probability tables (CPTs)
A Set of Parameters
Conditional Probability Distribution for Alarm:
stores the probability distribution for Alarm given the values of
Burglary and Earthquake

B      E      A      P(A|B,E)
false  false  false  0.999
false  false  true   0.001
false  true   false  0.71
false  true   true   0.29
true   false  false  0.06
true   false  true   0.94
true   true   false  0.05
true   true   true   0.95

For a given combination of values of the
parents (B and E in this example), the
entries for P(A=true|B,E) and P(A=false|B,E) must add up to 1,
e.g. P(A=true|B=false,E=false) + P(A=false|B=false,E=false) = 1

If you have a Boolean variable with k Boolean parents, how big is
the conditional probability table?
How many entries are independently specifiable?
Bias Variance Tradeoff
• The goal of any supervised machine learning algorithm is to best
estimate the mapping function (f) for the output variable (Y) given the
input data (X).
• The mapping function is often called the target function because it is
the function that a given supervised machine learning algorithm aims
to approximate.
• The prediction error for any machine learning algorithm can be
broken down into three parts:
• Bias Error
• Variance Error
• Irreducible Error
• Irreducible error
• Irreducible error is error that cannot be reduced
regardless of which algorithm you use for the model.
• It is caused by noise and by unknown variables that have a direct influence on the
output variable.
• So in order to improve your model, we are left with the
reducible error (bias and variance), which is what we need to minimize.
• Bias
• The bias is the difference between the average prediction of the
ML model and the correct value.
• High bias gives a large error on training as well as testing
data.
• An algorithm should have low bias to
avoid the problem of underfitting.
• With high bias, the fitted model is too rigid (for example, a straight line) and thus does not
fit the data in the data set accurately. Such fitting is known
as Underfitting of Data. This happens when the hypothesis is too
simple or linear in nature.
• Variance
• The variance of the model is the variability of the model's prediction for a given data point
when the model is trained on different training sets.
• A model with high variance has a very complex fit to the training
data and thus is not able to predict accurately on data which it hasn’t
seen before.
• As a result, such models perform very well on training data but have
high error rates on test data.
• When a model has high variance, it is said to be Overfitting the
Data. Overfitting is fitting the training set accurately via a complex curve
and high-order hypothesis, but it is not a solution, as the error on
unseen data is high.
• While training a model, variance should be kept low.
• Let the variable that we are predicting be Y and the other
independent variables be X. Now let us assume there is a
relationship between the two such that:
• Y = f(X) + e
• In the above equation, e is the error term, with
mean 0. When we fit a model using algorithms like linear
regression, SVM, etc., the expected squared error at a point x will be:
• err(x) = Bias² + Variance + irreducible error
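Written out in full, the decomposition looks as follows. The notation is an assumption introduced here, not taken from the slides: \(\hat{f}\) is the fitted model, expectations are taken over training sets and noise, and \(\sigma^2\) is the variance of the error term e.

```latex
\begin{align*}
\mathrm{err}(x) &= E\big[(Y - \hat{f}(x))^2\big] \\
&= \underbrace{\big(E[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
 + \underbrace{E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big]}_{\text{Variance}}
 + \underbrace{\sigma^2}_{\text{irreducible error}}
\end{align*}
```

The first two terms depend on the model and can be traded against each other; the last term does not depend on the model at all.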
• Let us also understand how bias and variance affect a Machine
Learning model’s performance.
• We can put the relationship between bias and variance in four categories
listed below:
• High Variance-High Bias – The model is inconsistent and also
inaccurate on average
• Low Variance-High Bias – Models are consistent but inaccurate on average
• High Variance-Low Bias – Accurate on average but inconsistent
• Low Variance-Low Bias – It is the ideal scenario, the model is
consistent and accurate on average.
• Bias and variance can be detected by comparing training and validation error.
• A model with high variance will have a low training error and high
validation error.
• And in the case of high bias, the model will have a high training error,
and the validation error will be similar to the training error.
• Bias-Variance Trade-Off
• Finding the right balance between the bias and variance of the model
is called the Bias-Variance trade-off.
• It is basically a way to make sure the model is neither overfitted nor
underfitted in any case.
• If the model is too simple and has very few parameters, it will suffer
from high bias and low variance.
• On the other hand, if the model has a large number of parameters, it
will have high variance and low bias.
• The aim of the trade-off is a level of model complexity at which the
combined error from bias and variance is as small as possible.
• Ideally, low bias and low variance is the target for any Machine
Learning model.
Important Terminologies and Model Selection
• Training data
• This data is used to fit the machine learning model. The data
scientist feeds the algorithm input data, each paired with an expected
output.
• The model processes the data repeatedly to learn its
behavior and adjusts itself to serve its intended purpose.
• Validation data
• During training, validation data presents the model with data that it
hasn’t seen before.
• Validation data provides the first test against unseen data, allowing data
scientists to evaluate how well the model makes predictions based on the
new data.
• Not all data scientists use validation data, but it can provide helpful
information for tuning hyperparameters, which influence how the model
learns from data.
• Based on the accuracy of the predictions after the validation stage,
data scientists can adjust hyperparameters such as learning rate,
input features and hidden layers.
• These adjustments help prevent overfitting, in which the algorithm
makes excellent predictions on the training data but can't
make accurate predictions for additional data.
• The opposite problem, underfitting, occurs when the model isn’t
complex enough to make accurate predictions against either training
data or new data.
• Test Dataset
• The sample of data used to provide an unbiased evaluation of a final
model fit on the training dataset.
• Prediction error quantifies one of two things:
• In regression analysis, it’s a measure of how well the model predicts
the response variable.
• In classification (machine learning), it’s a measure of how well samples are
classified to the correct category.
• In regression, the terms “prediction error” and “residuals” are sometimes
used synonymously.
• Generalization error is a measure of how accurately an algorithm is able to
predict outcome values for previously unseen data. Because learning
algorithms are evaluated on finite samples, the evaluation of a learning
algorithm may be sensitive to sampling error.
• As a result, measurements of prediction error on the current data may not
provide much information about predictive ability on new data.
• Generalization error can be minimized by avoiding overfitting in the
learning algorithm.
• If a model has been fitted too closely to the training data, it will be unable
to generalize. It will make inaccurate predictions when given new
data, making the model useless even though it is able to make
accurate predictions for the training data. This is called Overfitting.
The opposite problem also exists.
• Underfitting happens when a model has not learned enough from
the data (or is too simple). An underfitted model is just as
useless: it is not capable of making accurate predictions, even
on the training data.
• The best approach to model selection requires “sufficient” data,
which may be nearly infinite depending on the complexity of the
problem.
• In this ideal situation, we would split the data into training, validation,
and test sets, then fit candidate models on the training set, evaluate
and select them on the validation set, and report the performance of
the final model on the test set.
• If we are in a data-rich situation, the best approach is to randomly
divide the dataset into three parts: a training set, a validation set, and
a test set. The training set is used to fit the models; the validation set
is used to estimate prediction error for model selection; the test set is
used for assessment of the generalization error of the final chosen
model.
• This is impractical on most predictive modeling problems given that we
rarely have sufficient data, or are able to even judge what would be
sufficient.
• In many applications, however, the supply of data for training and testing
will be limited, and in order to build good models, we wish to use as
much of the available data as possible for training.
• However, if the validation set is small, it will give a relatively noisy
estimate of predictive performance.
• There are two main classes of techniques to approximate the ideal case
of model selection; they are:
• Probabilistic Measures: Choose a model via in-sample error and
complexity.
• Resampling Methods: Choose a model via estimated out-of-sample
error.
• Probabilistic measures involve analytically scoring a candidate model
using both its performance on the training dataset and the complexity
of the model.
• Four commonly used probabilistic model selection measures include:
• Akaike Information Criterion (AIC).
• Bayesian Information Criterion (BIC).
• Minimum Description Length (MDL).
• Structural Risk Minimization (SRM).
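As an illustration of scoring a model by in-sample fit plus complexity, the sketch below implements the standard formulas AIC = 2k − 2 ln(L) and BIC = k ln(n) − 2 ln(L), where k is the number of parameters, n the number of observations, and L the maximized likelihood. The two candidate models, their log-likelihoods, and the sample size are hypothetical numbers chosen only for the example.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L); lower is better."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln(n) - 2 ln(L); lower is better."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical candidates: (name, maximized log-likelihood, number of parameters).
candidates = [("small", -120.0, 2), ("large", -118.5, 6)]
n_obs = 200  # assumed sample size

for name, ll, k in candidates:
    print(name, round(aic(ll, k), 2), round(bic(ll, k, n_obs), 2))

best = min(candidates, key=lambda c: aic(c[1], c[2]))[0]
print("selected by AIC:", best)
```

In this made-up case the larger model fits slightly better, but not by enough to pay for its four extra parameters, so both criteria prefer the smaller model.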
• Resampling methods seek to estimate the performance of a model (or
more precisely, the model development process) on out-of-sample data.
• This is achieved by splitting the training dataset into sub train and test
sets, fitting a model on the sub train set, and evaluating it on the test
set. This process may then be repeated multiple times and the mean
performance across each trial is reported.
• Three common resampling model selection methods include:
• Random train/test splits.
• Cross-Validation (k-fold, LOOCV, etc.).
• Bootstrap.
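A minimal sketch of k-fold cross-validation is shown below. It assumes the data have already been shuffled (the folds here are contiguous blocks), and the helper names `k_fold_indices` and `cross_val_score` as well as the toy "model" are illustrative assumptions rather than any particular library's API.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

def cross_val_score(fit, score, xs, ys, k=5):
    """Mean held-out score of a model-building procedure over k folds."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in test], [ys[i] for i in test]))
    return sum(scores) / len(scores)

# Toy usage: the "model" is just the training mean; the score is mean squared error.
xs = list(range(10))
ys = [2.0 * x for x in xs]
fit_mean = lambda _xs, _ys: sum(_ys) / len(_ys)
mse = lambda m, _xs, _ys: sum((y - m) ** 2 for y in _ys) / len(_ys)
print(cross_val_score(fit_mean, mse, xs, ys, k=5))
```

Every observation is used for testing exactly once and for training k − 1 times, which is why this makes better use of limited data than a single train/test split.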
Expectation Maximization Algorithm

• EM algorithm provides a general approach to learning in
presence of unobserved variables.

• In many practical learning settings, only a subset of
relevant features or variables might be observable.
– E.g.: Hidden Markov Models, Bayesian Belief Networks
Simple Example: Coin Flipping

• Suppose you have 2 coins, A and B, each with a certain bias of
landing heads, \(\theta_A\), \(\theta_B\).

• Given data sets \(X_A = \{x_{1,A}, \ldots, x_{m_A,A}\}\) and \(X_B = \{x_{1,B}, \ldots, x_{m_B,B}\}\),
where
\[ x_{i,j} = \begin{cases} 1 & \text{if heads} \\ 0 & \text{otherwise} \end{cases} \]

• No hidden variables – easy solution: the sample mean
\[ \hat{\theta}_j = \frac{1}{m_j} \sum_{i=1}^{m_j} x_{i,j} \]
Simplified MLE
Goal: determine coin parameters without knowing the identity of each
data set’s coin.

Solution: Expectation-maximization


Coin Flip With hidden variables

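• What if you were given the same dataset of coin flip results,
but no coin identities defining the datasets?

Here: \(X = \{x_1, \ldots, x_m\}\) is the observed variable, and
\[ Z = \begin{pmatrix} z_{1,1} & \cdots & z_{m,1} \\ \vdots & z_{i,j} & \vdots \\ z_{1,k} & \cdots & z_{m,k} \end{pmatrix}, \quad \text{where } z_{i,j} = \begin{cases} 1 & \text{if } x_i \text{ is from the } j\text{th coin} \\ 0 & \text{otherwise} \end{cases} \]
But Z is not known (i.e., it is a ‘hidden’ / ‘latent’ variable).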
EM Algorithm

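1) Initialize some arbitrary hypothesis of parameter values (θ):
\[ \theta = \{\theta_1, \ldots, \theta_k\} \]
coin flip example: \(\theta = \{\theta_A, \theta_B\} = \{0.6, 0.5\}\)

2) Expectation (E-step)
\[ E[z_{i,j}] = \frac{p(x = x_i \mid \theta = \theta_j)}{\sum_{n=1}^{k} p(x = x_i \mid \theta = \theta_n)} \]

3) Maximization (M-step)
\[ \hat{\theta}_j = \frac{\sum_{i=1}^{m} E[z_{i,j}]\, x_i}{\sum_{i=1}^{m} E[z_{i,j}]} \]
If \(z_{i,j}\) is known, this reduces to the sample mean:
\[ \hat{\theta}_j = \frac{\sum_{i=1}^{m_j} x_i}{m_j} \]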
EM- Coin Flip example

• Initialize θA and θB to chosen values
– Ex: θA=0.6, θB=0.5

• Compute a probability
distribution of possible
completions of the data using
current parameters
EM- Coin Flip example

Set 1

• What is the probability that I observe 5 heads and 5 tails in coin A and B
given the initializing parameters θA=0.6, θB= 0.5?
• Compute likelihood of set 1 coming from coin A or B using the binomial
distribution with mean probability θ on n trials with k successes

• Likelihood of “A”=0.00079
• Likelihood of “B”=0.00097
• Normalize to get probabilities A=0.45, B=0.55
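The E-step for Set 1, and one complete EM iteration, can be sketched as follows. The binomial coefficient is left out of the likelihood because it is identical for both coins and cancels when normalizing. Only the first head count (5 heads out of 10) comes from the slides; the other four counts are hypothetical values added so that the M-step has something to average over, and the function names are illustrative.

```python
def likelihood(theta, heads, n):
    """theta^heads * (1 - theta)^tails; the binomial coefficient is omitted
    because it is the same for both coins and cancels when normalizing."""
    return theta ** heads * (1 - theta) ** (n - heads)

def e_step(theta_a, theta_b, heads, n):
    """Probability that a set with `heads` heads out of n came from coin A."""
    la, lb = likelihood(theta_a, heads, n), likelihood(theta_b, heads, n)
    return la / (la + lb)

def em_iteration(theta_a, theta_b, head_counts, n):
    """One E-step over all sets followed by one M-step (weighted head fraction)."""
    heads_a = flips_a = heads_b = flips_b = 0.0
    for h in head_counts:
        r_a = e_step(theta_a, theta_b, h, n)   # responsibility of coin A
        heads_a += r_a * h
        flips_a += r_a * n
        heads_b += (1 - r_a) * h
        flips_b += (1 - r_a) * n
    return heads_a / flips_a, heads_b / flips_b

# Set 1 from the slide: 5 heads and 5 tails, starting from theta_A=0.6, theta_B=0.5.
p_a = e_step(0.6, 0.5, heads=5, n=10)
print(round(p_a, 2), round(1 - p_a, 2))

# Hypothetical head counts for five sets of 10 flips (only the first is from the slide).
theta_a, theta_b = em_iteration(0.6, 0.5, [5, 9, 8, 4, 7], n=10)
print(round(theta_a, 2), round(theta_b, 2))
```

Repeating `em_iteration` with the updated parameters until they stop changing gives the EM estimate.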
EM example

The M-step
Hierarchical Clustering
• In hierarchical clustering the goal is to produce a hierarchical
series of nested clusters, ranging from clusters of individual
points at the bottom to an all-inclusive cluster at the top. A
diagram called a dendrogram graphically represents this
hierarchy.

• One of the attractions of hierarchical techniques is that they
correspond to taxonomies that are very common in the biological
sciences, e.g., kingdom, phylum, genus, species.

Dendrogram

• A dendrogram is a diagram representing a tree.

• In hierarchical clustering, it illustrates the arrangement of the clusters
produced by the corresponding analysis.

Types of Hierarchical Clustering
• A hierarchical method can be classified as being either
• Agglomerative
• Divisive

Agglomerative Hierarchical Clustering
• Start with the points as individual clusters and, at each step, merge
the closest pair of clusters.

• Agglomerative hierarchical clustering is further classified into
1. Linkage Methods
• Single Linkage
• Complete Linkage
• Average Linkage
2. Variance Method, or Ward’s Method

Algorithmic Steps For Agglomerative
Hierarchical Clustering
1. Start with N clusters, each containing a single entity, and an N × N symmetric
matrix of distances or similarities.

2. Search the distance matrix for the most similar pair of clusters. Let the
distance between the “most similar” clusters U and V be d_UV.

3. Merge clusters U and V. Label the newly formed cluster (UV).

4. Update the entries in the distance matrix by

a) Deleting the rows and columns corresponding to clusters U and V.

b) Adding a row and column giving the distances
between cluster (UV) and the remaining clusters.

5. Repeat steps 2 to 4 a total of N-1 times. Record the
identity of the clusters that are merged and the levels at
which the mergers take place.
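The procedure above can be sketched in Python as follows. For brevity this version recomputes cluster distances from the raw points at every step instead of maintaining and updating a distance matrix, so it illustrates the logic rather than an efficient implementation. The points are hypothetical, and the `linkage` argument selects between single, complete, and average linkage.

```python
def agglomerative(points, linkage="single"):
    """Merge the closest pair of clusters until one remains.
    Returns the merge history as (cluster_a, cluster_b, distance) tuples."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    def cluster_distance(c1, c2):
        d = [dist(points[i], points[j]) for i in c1 for j in c2]
        if linkage == "single":
            return min(d)            # closest pair of members
        if linkage == "complete":
            return max(d)            # farthest pair of members
        if linkage == "average":
            return sum(d) / len(d)   # mean over all cross-cluster pairs
        raise ValueError(f"unknown linkage: {linkage}")

    clusters = [(i,) for i in range(len(points))]   # step 1: N singleton clusters
    history = []
    while len(clusters) > 1:
        # step 2: find the most similar pair of clusters
        d, a, b = min((cluster_distance(c1, c2), c1, c2)
                      for k, c1 in enumerate(clusters) for c2 in clusters[k + 1:])
        # steps 3-4: merge the pair and replace them in the cluster list
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        history.append((a, b, d))
    return history

# Hypothetical 2-D points (not the slide's example).
pts = [(0, 0), (0, 1), (5, 0), (5, 1), (10, 10)]
for a, b, d in agglomerative(pts, "single"):
    print(a, b, round(d, 3))
```

The recorded merge distances are the heights at which branches join in the dendrogram.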
Single Linkage
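• For single linkage, the distance between two clusters is the minimum
distance between any two points in the different clusters,
i.e. \( d(A, B) = \min_{a \in A,\, b \in B} d(a, b) \)

[Figure: example dendrogram; leaf order 1 2 3 4 5]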
Complete Linkage
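• For complete linkage, the distance between two clusters is the
maximum distance between any two points in the different
clusters,
i.e. \( d(A, B) = \max_{a \in A,\, b \in B} d(a, b) \)

[Figure: example dendrogram; leaf order 1 2 3 4 5]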
Average Linkage
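• Average linkage treats the distance between two clusters as the
average distance between all pairs of items where one member
of the pair belongs to each cluster,
i.e. \( d(A, B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a, b) \)

[Figure: example dendrogram; leaf order 1 2 4 5 3]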
Divisive Clustering

• The divisive clustering algorithm is a top-down
clustering approach: initially, all the points in the
dataset belong to one cluster, and splits are performed
recursively as one moves down the hierarchy.
• Steps of Divisive Clustering:
• Initially, all points in the dataset belong to one single cluster.
• Partition the cluster into the two least similar clusters.
• Proceed recursively to form new clusters until the desired number of
clusters is obtained.
• In the sample dataset, it is observed that there are 3 clusters that
are far separated from each other, so we stop after getting 3
clusters.
• Even if we continue separating into more clusters, below is the obtained
result.
How to choose which cluster to split?
• Check the sum of squared errors of each cluster and
choose the one with the largest value.
• In the below 2-dimensional dataset, the data
points are currently separated into 2 clusters. To
form the 3rd cluster, find the sum of
squared errors (SSE) over the points in the red
cluster and in the blue cluster.
• The cluster with the largest SSE value is separated into 2 clusters,
hence forming a new cluster. In the above image, it is observed that the red
cluster has the larger SSE, so it is separated into 2 clusters, forming 3 total
clusters.
How to split the above-chosen cluster?
• Once we have decided which cluster to split, the question arises
of how to split the chosen cluster into 2 clusters. One way is to
use Ward’s criterion to look for the largest reduction in the
SSE as a result of the split.
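Both decisions (which cluster to split, and how good a candidate split is) come down to computing SSE. The sketch below uses hypothetical 2-D points, since the figure's data are not available, and illustrative function names; `split_gain` measures the reduction in SSE that a proposed split achieves, which is the quantity Ward's criterion seeks to maximize.

```python
def centroid(cluster):
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))

def sse(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(cluster)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in cluster)

def cluster_to_split(clusters):
    """Name of the cluster with the largest SSE."""
    return max(clusters, key=lambda name: sse(clusters[name]))

def split_gain(cluster, part_a, part_b):
    """Reduction in SSE obtained by splitting `cluster` into two parts."""
    return sse(cluster) - (sse(part_a) + sse(part_b))

# Hypothetical 2-D data (the slide's figure is not reproduced here).
clusters = {
    "red": [(0, 0), (0, 2), (6, 0), (6, 2)],
    "blue": [(10, 10), (10, 11), (11, 10), (11, 11)],
}
chosen = cluster_to_split(clusters)
red = clusters["red"]
print(chosen, sse(red), split_gain(red, red[:2], red[2:]))
```

In practice one would evaluate `split_gain` for several candidate partitions of the chosen cluster and keep the one with the largest gain.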
Supervised Learning after clustering
• Clustering methods are used to find similarities between instances
and thus group instances.
• If such groups are found, these may be named (by application
experts) and their attributes be defined.
• One can choose the group mean as the representative prototype of
instances in the group, or the possible range of attributes can be
written.
• This allows a simpler description of the data.
• For example, if the customers of a company seem to fall in one
of k groups, called segments, customers being defined in
terms of their demographic attributes and transactions
with the company, then a better understanding of the customer base
will be provided that will allow the company to provide different
strategies for different types of customers;
this is part of customer relationship management (CRM).
• Likewise, the company will also be able to
develop strategies for those customers who do not fall in any large
group, and who may require attention, for example, churning
customers.
• Frequently, clustering is also used as a preprocessing stage. Just like
the dimensionality reduction methods which allowed us to make a
mapping to a new space, after clustering, we also map to a new k-
dimensional space where the dimensions are hi (or bi at the risk of
loss of information).
• In a supervised setting, we can then learn the discriminant or
regression function in this new space.
Choosing the Number of Clusters
• Like any learning method, clustering also has its knob to adjust complexity; it is k,
the number of clusters. Given any k, clustering will always find k centers, whether
they really are meaningful groups, or whether they are imposed by the method
we use. There are various ways we can use to fine-tune k:
• In some applications such as color quantization, k is defined by the application.
• Plotting the data in two dimensions using PCA may be used in uncovering the
structure of data and the number of clusters in the data.
• An incremental approach may also help: setting a maximum allowed distance is
equivalent to setting a maximum allowed reconstruction error per instance.
• In some applications, validation of the groups can be done manually by checking
whether clusters actually code meaningful groups of the data. For example, in a
data mining application, application experts may do this check. In color
quantization, we may inspect the image visually to check its quality (despite the
fact that our eyes and brain do not analyze an image pixel by pixel).
