
k-means Clustering

Data Clustering

Instructor: Sayan Bandyapadhyay

Portland State University
Outline

1 The k-means Algorithm

2 Quality Analysis of k-means

Real Points

Suppose the set of points X is from R^d

A natural center of a set of points S is the average point, or mean:

    µ = (1/|S|) · Σ_{x∈S} x

Here the sum is coordinate-wise:

    (1, 3) + (2, 5) = (3, 8)

This is the basis of the k-means algorithm
Proposed by Lloyd in 1957, published in 1982
Also by Max in 1960

Euclidean distance of x and y: ||x − y|| = √( Σ_{i=1}^{d} (x_i − y_i)² )
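As a quick illustration, here is a minimal NumPy sketch (not from the slides) of the coordinate-wise sum, the mean, and the Euclidean distance:

```python
import numpy as np

S = np.array([[1.0, 3.0], [2.0, 5.0]])  # two points in R^2

print(S.sum(axis=0))          # coordinate-wise sum: [3. 8.]
mu = S.mean(axis=0)           # the mean µ: [1.5 4. ]

x, y = S[0], S[1]
print(np.linalg.norm(x - y))  # Euclidean distance = sqrt(1^2 + 2^2) ≈ 2.236
```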
The k-means Algorithm (Lloyd-Max)

Algorithm k-means
Require: Set of points X
 1: Start with centers c1, ..., ck chosen arbitrarily from X
 2: repeat
 3:   for each point xi ∈ X do
 4:     Assign xi to the cluster Cj that minimizes ||xi − cj||
 5:   end for
 6:   for each cluster Cj do
 7:     cj ← (1/|Cj|) · Σ_{xi∈Cj} xi
 8:   end for
 9: until cluster centers do not change
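Below is a minimal Python/NumPy sketch of the algorithm above; the function name lloyd_kmeans, the max_iter safeguard, and the empty-cluster handling are our additions rather than part of the slides.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """A sketch of Lloyd's algorithm: alternate nearest-center
    assignment with recomputing each center as its cluster mean."""
    rng = np.random.default_rng(seed)
    # line 1: arbitrary initial centers, here k distinct points of X
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # lines 3-5: assign every point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # lines 6-8: move each center to the mean of its cluster
        # (an empty cluster keeps its old center -- our choice)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # line 9: stop once the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```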
Time Complexity of k-means

Again we need a “rate of cost decrease” type argument, as for k-median

What is a suitable cost function that the mean minimizes?
A Suitable Cost Function

For what function g(·, ·) does mean(S) minimize Σ_{x∈S} g(x, c) over all c?

Such a g is called a Bregman divergence, a family that encompasses many functions

One such g is the squared Euclidean distance:

    g(x, c) = ||x − c||²

This leads to our k-means clustering problem for real points with Euclidean distance
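As a quick numerical sanity check of this property (a sketch with arbitrary random data; all names are ours), the mean of S should never lose to a perturbed candidate center under the squared-Euclidean cost:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.normal(size=(100, 2))   # a sample point set S in R^2
mu = S.mean(axis=0)             # mean(S)

def sq_cost(c):
    # sum over x in S of g(x, c) = ||x - c||^2
    return ((S - c) ** 2).sum()

# the mean should achieve the minimum sum of squared distances
for _ in range(5):
    c = mu + rng.normal(scale=0.5, size=2)  # a perturbed candidate center
    assert sq_cost(mu) <= sq_cost(c)
print("cost at mean:", sq_cost(mu))
```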
k-means Clustering

Given a set X of n points in a metric space (U, d)

Find a set C of k points (cluster centers) in U that minimizes

    cost(C) = Σ_{p∈X} d(p, NearestCenter(p))²
Euclidean k-means Clustering

Given a set X of n points in R^d

Find a set C of k points (cluster centers) in R^d that minimizes

    cost(C) = Σ_{p∈X} ||p − NearestCenter(p)||²
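In code, the Euclidean objective is short (a NumPy sketch; kmeans_cost is our name, not an established API):

```python
import numpy as np

def kmeans_cost(X, centers):
    # cost(C): sum over all points of the squared Euclidean
    # distance to the nearest center
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return (dists.min(axis=1) ** 2).sum()
```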
Time Complexity of Lloyd’s Algorithm

Let M1, M2, ..., Mℓ be the sets of means computed over the ℓ iterations

To show: cost(Mℓ) < cost(Mℓ−1) < cost(Mℓ−2) < ... < cost(M1)

In every iteration, the means are picked as the centers of the clusters
A mean minimizes the sum-of-squares cost function, and reassigning each point to its nearest center can only reduce the cost further
So the k-means cost decreases for the new set of centers
Time Complexity

In every iteration, the cost decreases

The algorithm never cycles – the same set of centers never comes back
The number of iterations is bounded by the number of distinct sets of means
X has 2^n subsets, hence at most 2^n distinct means and at most (2^n)^k distinct sets of means

Lloyd’s algorithm always terminates
In practice, it is very fast
One can also terminate the algorithm after a fixed number of iterations
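To see the monotone cost decrease empirically, here is a short self-contained sketch (random data; variable names are ours) that records the assignment cost at each iteration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))   # sample data
k = 3
centers = X[rng.choice(len(X), size=k, replace=False)]

costs = []
for _ in range(20):
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    costs.append((d.min(axis=1) ** 2).sum())
    # update step: replace each center by its cluster mean
    centers = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
        for j in range(k)
    ])

# the recorded costs should never increase
assert all(a >= b - 1e-9 for a, b in zip(costs, costs[1:]))
print([round(c, 2) for c in costs[:5]])
```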
Outline

1 The k-means Algorithm

2 Quality Analysis of k-means
Analysis of Quality

Initialization/Seeding is the key


Initialization

Does random initialization help? No! We can still get nearby centers

We need well-separated centers – can we use the greedy (furthest-point) 2-approximation algorithm for k-center?
It is sensitive to outliers

We need something in between
We should pick far-away points only if there are many points in the vicinity
We should not pick an outlier as a center

This leads to a new seeding algorithm!
Non-uniformly Random Seeding

Start with a uniformly random center

The next center is chosen from a distribution biased towards far-away points

Cost of a point xi: cost(xi, C) = min_{c∈C} ||xi − c||²

Define the probability pi = cost(xi, C) / cost(C)
The k-means++ Algorithm

Algorithm k-means++
Require: Set of points X, parameter k
 1: Select c1 uniformly at random from X
 2: C ← {c1}
 3: while |C| ≠ k do
 4:   for each i = 1 to n do
 5:     cost(xi, C) ← min_{c∈C} ||xi − c||²
 6:   end for
 7:   cost(C) ← Σ_{i=1}^{n} cost(xi, C)
 8:   Sample a random point y ∈ X, selecting each xi w.p. pi = cost(xi, C)/cost(C)
 9:   C ← C ∪ {y}
10: end while
11: Invoke Lloyd’s algorithm with C as the seed
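Here is a minimal NumPy sketch of the seeding loop above (the function name kmeans_pp_seed is ours); the returned centers would then be passed to Lloyd’s algorithm as the seed:

```python
import numpy as np

def kmeans_pp_seed(X, k, seed=0):
    """k-means++ seeding: sample each new center with probability
    proportional to its squared distance to the nearest chosen center."""
    rng = np.random.default_rng(seed)
    n = len(X)
    centers = [X[rng.integers(n)]]   # first center: uniformly random
    while len(centers) < k:
        # cost(xi, C) = min over c in C of ||xi - c||^2
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        p = d2 / d2.sum()            # pi = cost(xi, C) / cost(C)
        centers.append(X[rng.choice(n, p=p)])
    return np.array(centers)
```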
Analysis of k-means++

Time complexity: O(nk) + Lloyd’s

Approximation factor: O(log k) in expectation

Compare this with p-swap local search:
O(n^{p+1} k^{p+1} log n) time, but a (9 + 1/p)-approximation
Works in general metric spaces (even for non-numerical data)
