Scalable k-means plus plus

Scalable K-means plus plus
Prabin Giri
Cpre 528

Outline
• Clustering definition
• K-means algorithm
• K-means++ algorithm
• K-means||
• Summary

What is clustering?
Task:
The process of partitioning a data set into meaningful subclasses, called clusters
Main goal:
Data analysis, i.e. extracting useful information or patterns from the data
Clustering is unsupervised classification: no class labels are given in advance

K-means clustering
A popular clustering algorithm
Basic intuition:
points are grouped by minimizing the sum of squared distances between each data point and the centroid of its cluster

K-means clustering
Objective:
Given a set of points $X = \{x_1, x_2, \dots, x_n\}$ in d-dimensional space,
we have to find the set of k centers $C = \{c_1, \dots, c_k\}$
that minimizes the following expression:
$\sum_{x \in X} \min_{i \in \{1, \dots, k\}} \|x - c_i\|^2$
Solving this problem exactly is NP-hard
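The objective can be made concrete with a few lines of plain Python. This is only an illustrative sketch: the helper names `squared_distance` and `kmeans_cost` and the four sample points are assumptions made for the example, not part of the original slides.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(squared_distance(x, c) for c in centers) for x in points)


# Two well-separated pairs of points (made-up data).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good_centers = [(0.0, 1.0), (10.0, 1.0)]   # one center per pair
bad_centers = [(5.0, 0.0), (5.0, 2.0)]     # centers sit between the pairs

print(kmeans_cost(points, good_centers))  # 4.0
print(kmeans_cost(points, bad_centers))   # 100.0
```

The second set of centers has a much higher cost on the same data, which is the quantity k-means tries to minimize.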

Basic understanding of k-means
Step 1: Initialization: start with k randomly chosen centers
Step 2: Assign each data point to its nearest center,
then recompute each center as the centroid of the points assigned to it
Step 3: Repeat step 2 until convergence, i.e. until the centers no longer change
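The three steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a production implementation: the function names, the `max_iter` cap, the `seed` parameter, and the rule of keeping an old center when its cluster becomes empty are all choices made for this example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(x, centers):
    """Index of the center closest to x."""
    return min(range(len(centers)), key=lambda i: squared_distance(x, centers[i]))


def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # Step 1: k random data points
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in points:                       # Step 2: assign to nearest center
            clusters[nearest(x, centers)].append(x)
        # Step 2 (continued): recompute centroids; an empty cluster keeps
        # its old center (an assumption of this sketch).
        new_centers = [centroid(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:             # Step 3: stop when nothing changes
            break
        centers = new_centers
    return centers


# Made-up one-dimensional data with two obvious groups.
data = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(sorted(kmeans(data, 2)))
```

On data this simple every random start reaches the same two centers; on harder data the result depends on the starting centers.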

K-means algorithm
The standard approach to this problem follows an Expectation-Maximization scheme:
alternate between assigning points to centers and updating the centers.
Advantages:
Simplicity
Flexible and suited to large datasets

Problems in K-means
Efficiency: the running time can be exponential in the worst case
Moreover, the final solution may be only locally optimal rather than globally optimal
It is very sensitive to initialization
For example:

Sensitive to initialization
• It can get stuck in a local optimum
• Figure: David Arthur

So, what can we do?
We have to focus on initialization
A better initialization might lead to drastic improvements.

So, what can we do?
• Spread out the initialization

K-means++
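An algorithm that tries to spread out the initial centers,
by Arthur and Vassilvitskii (2007)
The first center is selected uniformly at random from the data
Each subsequent center is then chosen from the data with probability
$p_x = \frac{d^2(x, C)}{\sum_{x' \in X} d^2(x', C)}$
where $d(x, C)$ is the distance from x to the nearest center already in C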

K-means++
$p_x \propto d^2(x, C)$
This probability depends on the previously selected centers:
points far from every center chosen so far are more likely to be picked next.
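The seeding rule described above can be sketched in plain Python. The function name `kmeans_pp_seed` and the `seed` parameter are assumptions made for this example; the sketch also assumes the data contains at least k distinct points, since otherwise all sampling weights become zero.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seed(points, k, seed=0):
    """Pick k initial centers by d^2-weighted sampling (k-means++ style)."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]             # first center: uniform at random
    # d2[i] holds d^2(x_i, C) for the current set of centers C.
    d2 = [squared_distance(x, centers[0]) for x in points]
    while len(centers) < k:
        # Draw one point with probability d2[i] / sum(d2).
        new_center = rng.choices(points, weights=d2, k=1)[0]
        centers.append(new_center)
        # One pass over the data to update the distances (sequential step).
        d2 = [min(old, squared_distance(x, new_center))
              for x, old in zip(points, d2)]
    return centers


data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(kmeans_pp_seed(data, 2))
```

A point that is already a center has weight zero, so it is not drawn again. Each new center needs a full pass over the data to update `d2`, and the next draw cannot start before that pass has finished.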

Pros and cons of k-means++
Pros:
Speeds up convergence
Tends to give a solution closer to the global optimum
Cons:
Needs k passes over the data
Infeasible for massive datasets (imagine k = 1000)
It does not scale

Pros and cons of k-means++
It is sequential in nature
That is,
the probability of choosing the i-th center depends
critically on the realization of the previous i-1 centers,
so the centers cannot be chosen in parallel

So what do we need?
An algorithm that makes fewer passes over the data
and that still comes with a theoretical guarantee

Scalable k-means++ (k-means||)
This algorithm tries to address the previous issues
K-means++ samples one point per iteration and then updates its distribution
K-means|| is a parallel version of initializing the centers
It oversamples: in each iteration every point is sampled independently, with a larger probability
Unlike k-means++, it uses an oversampling factor $l = \Omega(k)$

K-means||
First center C: sample a point uniformly at random
Initial cost:
$\phi = \sum_{x \in X} d^2(x, C)$
For $O(\log \phi)$ iterations do
  $C'$ = sample each point $x \in X$ independently with probability
  $p_x = \frac{l \cdot d^2(x, C)}{\sum_{x' \in X} d^2(x', C)}$
  $C \leftarrow C \cup C'$

K-means||
For each $x \in C$, let $w_x$ be the number of points belonging to center x (the points of X that are closest to x)
Recluster the weighted points in C into k clusters
How many points does C contain?
The oversampling factor is $l = \Theta(k)$
So the expected number of points in C is about $l \log \phi$
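The oversampling and weighting steps can be sketched in plain Python as below. Several things here are assumptions of the sketch rather than part of the slides: the function names, the fixed default of `rounds=5` in place of $O(\log \phi)$ iterations, capping the probability at 1, and leaving out the final reclustering, which could be done by running a weighted k-means++ on the returned candidates.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_to_set(x, centers):
    """d^2(x, C): squared distance from x to its nearest center in C."""
    return min(squared_distance(x, c) for c in centers)


def kmeans_parallel_candidates(points, l, rounds=5, seed=0):
    """Return the oversampled candidate set C and the weight w_x of each candidate."""
    rng = random.Random(seed)
    C = [rng.choice(points)]                   # first center: uniform at random
    for _ in range(rounds):
        d2 = [d2_to_set(x, C) for x in points]
        phi = sum(d2)                          # current cost
        if phi == 0:                           # every point is already a center
            break
        # Each point is sampled independently with probability l * d^2 / phi.
        C_new = [x for x, d in zip(points, d2)
                 if rng.random() < min(1.0, l * d / phi)]
        C = C + C_new
    # w_x = number of data points whose nearest candidate is x.
    weights = [0] * len(C)
    for x in points:
        closest = min(range(len(C)), key=lambda j: squared_distance(x, C[j]))
        weights[closest] += 1
    return C, weights


# Made-up data: 50 distinct points.
data = [(float(i), float(i % 7)) for i in range(50)]
C, weights = kmeans_parallel_candidates(data, l=4)
print(len(C), sum(weights))
```

Each round adds about l candidates in expectation, so C stays small compared with the data set, and the weights always sum to the number of data points.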

Understanding k-means||
• Let's use the previous example

• k = 4
• l = 3
Intermediate centers

Reclustering to the final centers

Picture: Arthur

Theoretical guarantee
Theorem: If an $\alpha$-approximation algorithm is used in the last step, then k-means|| obtains a solution that is an $O(\alpha)$-approximation to k-means.
Here we could use k-means++ and get an $O(\log k)$-approximation.

Analysis
Theorem: Let $\phi$ and $\phi'$ be the costs of the clustering at the beginning and at the end of an iteration, and let OPT be the cost of the optimal clustering. Then
$E[\phi'] \le O(\mathrm{OPT}) + \frac{k}{e l} \phi$
Let's consider a cluster A in the optimal clustering.
Order the data points of A in increasing order of their distance to the centroid:
$A = \{a_1, a_2, a_3, \dots, a_T\}$, and
Centroid $c_A = \frac{1}{T} \sum_{t=1}^{T} a_t$

Analysis
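Before the iteration, we have the set of centers C.
Let $\phi(C) = \sum_{x \in X} d^2(x, C)$;
then the contribution of cluster A is
$\phi_A(C) = \sum_{a \in A} d^2(a, C)$
Let $p_t$ be the probability of selecting $a_t$; then we have
$p_t = \frac{l \, d^2(a_t, C)}{\phi(C)}$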

Analysis
Let $q_t$ be the probability that $a_t$ is the first point in the ordering chosen by k-means||, and let $q_{T+1}$ be the probability that no point in A is selected.
For any $1 \le t \le T$,
$q_t = p_t \prod_{j=1}^{t-1} (1 - p_j)$
So either assign all the points of A to the newly selected $a_t$ or stick with the original assignment, that is
$s_t = \min\left\{ \phi_A(C), \sum_{a \in A} \|a - a_t\|^2 \right\}$
$q_{T+1} = 1 - \sum_{t=1}^{T} q_t$
Then we have,

Analysis
$E[\phi_A(C \cup C')] \le \sum_{t=1}^{T} q_t s_t + q_{T+1} \phi_A(C)$
When all points of A are far from C and tightly clustered, we can write $p_t = p$ for every t.
Then,
$q_t = p (1 - p)^{t-1}$
which is a decreasing sequence. And we can define
$s'_t = \sum_{a \in A} \|a - a_t\|^2$
where $s'_t$ is an increasing sequence.

Analysis
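Then, since $q_t$ is decreasing and $s'_t$ is increasing, we can write
$\sum_t q_t s_t \le \sum_t q_t s'_t \le \frac{1}{T} \sum_t q_t \sum_t s'_t = \sum_t q_t \cdot \frac{1}{T} \sum_t s'_t = \sum_t q_t \cdot 2 \phi_A^*$
where $\phi_A^*$ is the cost of A with respect to its own centroid.
Finally, we will have,
$E[\phi_A(C \cup C')] \le (1 - q_{T+1}) \, 2 \phi_A^* + q_{T+1} \phi_A(C)$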
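The factor 2 in the bound comes from a standard identity for squared distances. The short derivation below is a sketch added for clarity; it writes $c_A$ for the centroid of A and $\phi_A^* = \sum_{a \in A} \|a - c_A\|^2$ for the cost of A measured against its own centroid, which is notation assumed here.

```latex
\begin{align*}
\sum_{a \in A} \|a - a_t\|^2
  &= \sum_{a \in A} \left( \|a - c_A\|^2 + \|a_t - c_A\|^2
     - 2 \langle a - c_A,\, a_t - c_A \rangle \right) \\
  &= \phi_A^* + T \, \|a_t - c_A\|^2
     \qquad \text{since } \textstyle\sum_{a \in A} (a - c_A) = 0, \\
\frac{1}{T} \sum_{t=1}^{T} \sum_{a \in A} \|a - a_t\|^2
  &= \frac{1}{T} \left( T \phi_A^* + T \sum_{t=1}^{T} \|a_t - c_A\|^2 \right)
   = 2 \phi_A^* .
\end{align*}
```

So picking a uniformly random point of A as the center costs, on average, twice as much as using the centroid itself.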

Analysis
How is it parallel?
First center C: sample a point uniformly at random
Initial cost: $\phi = \sum_{x \in X} d^2(x, C)$ (a simple sum, so each machine can add up its own part)
For $O(\log \phi)$ iterations do
  $C'$ = sample each point $x \in X$ independently with probability
  $p_x = \frac{l \cdot d^2(x, C)}{\sum_{x' \in X} d^2(x', C)}$ (the sampling of each point is independent)
  $C \leftarrow C \cup C'$
For each $x \in C$, let $w_x$ be the number of points belonging to center x
Recluster the weighted points in C into k clusters
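One round of the loop can be written so that the work splits across data partitions. The sketch below only simulates this in a single process: the partitions are plain Python lists, and the names `partial_cost`, `sample_partition`, and `one_round` as well as the per-partition seeding are assumptions made for the example. In a real distributed setting each call on a partition would run on its own worker.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_to_set(x, centers):
    """d^2(x, C): squared distance from x to its nearest center in C."""
    return min(squared_distance(x, c) for c in centers)


def partial_cost(partition, C):
    """Per-partition step: cost contribution of the points in one partition."""
    return sum(d2_to_set(x, C) for x in partition)


def sample_partition(partition, C, l, phi, rng):
    """Per-partition step: sample each point independently with prob. l * d^2 / phi."""
    return [x for x in partition
            if rng.random() < min(1.0, l * d2_to_set(x, C) / phi)]


def one_round(partitions, C, l, seed=0):
    """One k-means|| round: combine partial costs, then sample per partition."""
    phi = sum(partial_cost(part, C) for part in partitions)   # simple sum
    if phi == 0:
        return list(C)
    C_new = []
    for i, part in enumerate(partitions):
        rng = random.Random(seed + i)          # each partition has its own generator
        C_new.extend(sample_partition(part, C, l, phi, rng))
    return list(C) + C_new


# Made-up data split into three partitions.
data = [(float(i), float(i % 5)) for i in range(40)]
parts = [data[0:10], data[10:25], data[25:40]]
C0 = [data[0]]
print(len(one_round(parts, C0, l=3)))
```

The only values shared between partitions are the current centers C and the total cost, so no partition has to wait for another one's samples within a round.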

Summary
Comparison
                  K-means                    K-means++                                K-means||
Seeding method    choose k centers randomly  choose k centers proportionally to d^2   choose subset C and get k centers from C
Seeding speed     faster                     slow                                     slower
Clustering speed  slow                       fast                                     faster

Summary
When to choose which?
                               K-means   K-means++   K-means||
Small-sized data               fast      very fast   slow
Moderate and large-sized data  slow      very slow   fast

Reference papers
D. Arthur and S. Vassilvitskii, "k-means++: The Advantages of Careful Seeding"
B. Bahmani, B. Moseley, A. Vattani, R. Kumar, and S. Vassilvitskii, "Scalable K-Means++"
