
k-means Clustering

Data Clustering

Instructor: Sayan Bandyapadhyay

Portland State University
Outline

1 The k-means Algorithm

2 Quality Analysis of k-means

Real Points

Suppose the set of points X is from R^d

A natural center of a set of points S is the average point, or mean:

    µ = (1/|S|) · Σ_{x∈S} x

Here the sum is coordinate-wise:

    (1, 3) + (2, 5) = (3, 8)

This is the basis of the k-means algorithm
Proposed by Lloyd in 1957, published in 1982
Also by Max in 1960

Euclidean distance of x and y: ||x − y|| = √( Σ_{i=1}^{d} (x_i − y_i)² )
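As a quick illustration, here is a minimal NumPy sketch (not from the slides) of the coordinate-wise sum, the mean, and the Euclidean distance:

```python
import numpy as np

S = np.array([[1.0, 3.0], [2.0, 5.0]])  # two points in R^2

print(S.sum(axis=0))          # coordinate-wise sum: [3. 8.]
mu = S.mean(axis=0)           # the mean µ: [1.5 4. ]

x, y = S[0], S[1]
print(np.linalg.norm(x - y))  # Euclidean distance = sqrt(1^2 + 2^2) ≈ 2.236
```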
The k-means Algorithm (Lloyd-Max)

Algorithm k-means
Require: Set of points X
 1: Start with centers c1, ..., ck chosen arbitrarily from X
 2: repeat
 3:   for each point xi ∈ X do
 4:     Assign xi to the cluster Cj that minimizes ||xi − cj||
 5:   end for
 6:   for each cluster Cj do
 7:     cj ← (1/|Cj|) · Σ_{xi∈Cj} xi
 8:   end for
 9: until cluster centers do not change
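Below is a minimal Python/NumPy sketch of the algorithm above; the function name lloyd_kmeans, the max_iter safeguard, and the empty-cluster handling are our additions rather than part of the slides.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """A sketch of Lloyd's algorithm: alternate nearest-center
    assignment with recomputing each center as its cluster mean."""
    rng = np.random.default_rng(seed)
    # line 1: arbitrary initial centers, here k distinct points of X
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # lines 3-5: assign every point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # lines 6-8: move each center to the mean of its cluster
        # (an empty cluster keeps its old center -- our choice)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # line 9: stop once the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```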
Time Complexity of k-means

Again we need a “rate of cost decrease” type argument, as for k-median

What is a suitable cost function that the mean minimizes?
A Suitable Cost Function

For what function g(·, ·) does mean(S) minimize Σ_{x∈S} g(x, c) over all c?

Such a g is called a Bregman divergence, a family that encompasses many functions

One such g is the squared Euclidean distance:

    g(x, c) = ||x − c||²

This leads to our k-means clustering problem for real points with Euclidean distance
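As a quick numerical sanity check of this property (a sketch with arbitrary random data; all names are ours), the mean of S should never lose to a perturbed candidate center under the squared-Euclidean cost:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(100, 2))   # a sample point set S in R^2
mu = S.mean(axis=0)             # mean(S)

def sq_cost(c):
    # sum over x in S of g(x, c) = ||x - c||^2
    return ((S - c) ** 2).sum()

# the mean should achieve the minimum sum of squared distances
for _ in range(5):
    c = mu + rng.normal(scale=0.5, size=2)  # a perturbed candidate center
    assert sq_cost(mu) <= sq_cost(c)
print("cost at mean:", sq_cost(mu))
```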
k-means Clustering

Given a set X of n points in a metric space (U, d)

Find a set C of k points (cluster centers) in U that minimizes

    cost(C) = Σ_{p∈X} d(p, NearestCenter(p))²
Euclidean k-means Clustering

Given a set X of n points in R^d

Find a set C of k points (cluster centers) in R^d that minimizes

    cost(C) = Σ_{p∈X} ||p − NearestCenter(p)||²
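In code, the Euclidean objective is short (a NumPy sketch; kmeans_cost is our name, not an established API):

```python
import numpy as np

def kmeans_cost(X, centers):
    # cost(C): sum over all points of the squared Euclidean
    # distance to the nearest center
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return (dists.min(axis=1) ** 2).sum()
```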
Time Complexity of Lloyd’s Algorithm

Let M1, M2, ..., Mℓ be the sets of means computed over the ℓ iterations

To show: cost(Mℓ) < cost(Mℓ−1) < cost(Mℓ−2) < ... < cost(M1)

In every iteration, the means are picked as the centers of the clusters
A mean minimizes the sum-of-squares cost function, and reassigning each point to its nearest center can only reduce the cost further
So the k-means cost decreases for the new set of centers
Time Complexity

In every iteration, the cost decreases

The algorithm never cycles – the same set of centers never comes back
The number of iterations is bounded by the number of distinct sets of means
X has 2^n subsets, hence at most 2^n distinct means and at most (2^n)^k distinct sets of means

Lloyd’s algorithm always terminates
In practice, it is very fast
One can also terminate the algorithm after a fixed number of iterations
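To see the monotone cost decrease empirically, here is a short self-contained sketch (random data; variable names are ours) that records the assignment cost at each iteration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))   # sample data
k = 3
centers = X[rng.choice(len(X), size=k, replace=False)]

costs = []
for _ in range(20):
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    costs.append((d.min(axis=1) ** 2).sum())
    # update step: replace each center by its cluster mean
    centers = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
        for j in range(k)
    ])

# the recorded costs should never increase
assert all(a >= b - 1e-9 for a, b in zip(costs, costs[1:]))
print([round(c, 2) for c in costs[:5]])
```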
Outline

1 The k-means Algorithm

2 Quality Analysis of k-means
Analysis of Quality

Initialization/Seeding is the key


Initialization

Does random initialization help? No! We can still get nearby centers

We need well-separated centers – can we use the greedy (furthest-point) 2-approximation algorithm for k-center?
It is sensitive to outliers

We need something in between
We should pick far-away points only if there are many points in the vicinity
We should not pick an outlier as a center

This leads to a new seeding algorithm!
Non-uniformly Random Seeding

Start with a uniformly random center

The next center is chosen from a distribution biased towards far-away points

Cost of a point xi: cost(xi, C) = min_{c∈C} ||xi − c||²

Define the probability pi = cost(xi, C) / cost(C)
The k-means++ Algorithm

Algorithm k-means++
Require: Set of points X, parameter k
 1: Select c1 uniformly at random from X
 2: C ← {c1}
 3: while |C| ≠ k do
 4:   for each i = 1 to n do
 5:     cost(xi, C) ← min_{c∈C} ||xi − c||²
 6:   end for
 7:   cost(C) ← Σ_{i=1}^{n} cost(xi, C)
 8:   Sample a random point y ∈ X, selecting each xi w.p. pi = cost(xi, C)/cost(C)
 9:   C ← C ∪ {y}
10: end while
11: Invoke Lloyd’s algorithm with C as the seed
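Here is a minimal NumPy sketch of the seeding loop above (the function name kmeans_pp_seed is ours); the returned centers would then be passed to Lloyd’s algorithm as the seed:

```python
import numpy as np

def kmeans_pp_seed(X, k, seed=0):
    """k-means++ seeding: sample each new center with probability
    proportional to its squared distance to the nearest chosen center."""
    rng = np.random.default_rng(seed)
    n = len(X)
    centers = [X[rng.integers(n)]]   # first center: uniformly random
    while len(centers) < k:
        # cost(xi, C) = min over c in C of ||xi - c||^2
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        p = d2 / d2.sum()            # pi = cost(xi, C) / cost(C)
        centers.append(X[rng.choice(n, p=p)])
    return np.array(centers)
```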
Analysis of k-means++

Time complexity: O(nk) + Lloyd’s

Approximation factor: O(log k) in expectation

Compare this with p-swap local search:
O(n^{p+1} k^{p+1} log n) time, but a (9 + 1/p)-approximation
Works in general metric spaces (even for non-numerical data)
