Scalable K-means++
Prabin Giri
Cpre 528
Outline
• Clustering definition
• K-means algorithm
• K-means++ algorithm
• K-means||
• Summary
What is clustering?
Task:
Partitioning a set of data points into meaningful subclasses, called clusters
Main goal:
Data analysis: extracting useful information or patterns from the data
Clustering is unsupervised classification
K-means clustering
A popular clustering algorithm
Basic intuition:
Points are grouped so as to minimize the sum of squared distances between each data point and the centroid of its cluster
K-means clustering
Objective:
Given a set of points $X = \{x_1, x_2, \dots, x_n\}$ in d-dimensional space,
find a set of k centers $C = \{c_1, \dots, c_k\}$
that minimizes the cost
$$\phi(C) = \sum_{x \in X} \min_{1 \le i \le k} \lVert x - c_i \rVert^2$$
This problem is NP-hard
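As a small illustration of the objective above, the cost of a candidate set of centers can be computed directly. This is a minimal pure-Python sketch: the names `sq_dist` and `kmeans_cost` and the toy points are assumptions made for the example and do not come from the slides.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(sq_dist(x, c) for c in centers) for x in points)


# Toy data (hypothetical): two pairs of points far apart on the x axis.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

print(kmeans_cost(points, [(0.0, 1.0), (10.0, 1.0)]))  # 4.0
print(kmeans_cost(points, [(5.0, 1.0)]))               # 104.0
```

With one center per pair every point is at squared distance 1 from its center, while a single center in the middle costs 26 per point.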
Basic understanding of k-means
Step 1: Initialization: start with k randomly chosen centers
Step 2: Assign each data point to its nearest center,
then recompute each center as the centroid of the points assigned to it
Step 3: Repeat step 2 until convergence, i.e. until the centers no longer change
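The three steps above can be sketched in a few lines of pure Python. This is an illustrative sketch, not the presenter's code: the function names, the `max_iter` cap, and the choice to leave the center of an empty cluster where it is are all assumptions.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def lloyd(points, k, rng, max_iter=100):
    # Step 1: start with k centers chosen at random from the data.
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 2a: assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda i: sq_dist(x, centers[i]))
            clusters[nearest].append(x)
        # Step 2b: recompute each center as the centroid of its cluster.
        # Assumption: an empty cluster keeps its previous center.
        new_centers = [centroid(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        # Step 3: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
print(lloyd(data, 2, random.Random(0)))  # seed fixed only for repeatability
```

The result depends on the random starting centers, which is exactly the sensitivity discussed in the following slides.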
K-means algorithm
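The standard approach to the problem is an Expectation-Maximization-style iteration:
alternate between assigning points to centers and updating the centers
Advantages:
Simplicity
Flexible and well suited to large datasets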
Problems in K-means
Efficiency: the running time can be exponential in the worst case
Moreover, the final solution may be only locally optimal rather than globally optimal
It is very sensitive to initialization, for example:
Sensitive to initialization
• A poor choice of initial centers can leave it stuck in a local optimum
• Figure: David Arthur
So, what can we do?
We have to focus on initialization
A better initialization can lead to drastic improvements
• Spread out the initial centers
K-means++
An algorithm that tries to spread out the initial centers,
by Arthur and Vassilvitskii (2007)
The first center is selected uniformly at random from the data
Each subsequent center is a data point x chosen with probability
$$p_x = \frac{d^2(x, C)}{\sum_{x' \in X} d^2(x', C)}$$
where $d(x, C)$ is the distance from x to its nearest center in C
The probability of selecting a point is therefore proportional to its squared distance from the centers selected so far
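The seeding rule described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the name `kmeans_pp_seed` is made up for the example, and the data is assumed to contain at least k distinct points (otherwise all weights become zero and `random.choices` raises an error).

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seed(points, k, rng):
    """Pick k initial centers with D^2 weighting (k-means++ style seeding)."""
    # First center: uniformly at random from the data.
    centers = [rng.choice(points)]
    # d2[i] = squared distance from points[i] to its nearest chosen center.
    d2 = [sq_dist(x, centers[0]) for x in points]
    while len(centers) < k:
        # Sample one index with probability d2[i] / sum(d2).
        idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        centers.append(points[idx])
        # One pass over the data to update the distances: this is the
        # "k passes" cost of the sequential algorithm.
        d2 = [min(d, sq_dist(x, points[idx])) for d, x in zip(d2, points)]
    return centers


pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (9.0, 9.0)]
print(kmeans_pp_seed(pts, 2, random.Random(0)))  # seed fixed for repeatability
```

A point that is already a center has weight zero, so it is never picked twice; points far from all current centers are the most likely to be picked next.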
How does it work?
[Figure: example data with the selected centers labeled c1, c2, c3, c4]
Pros and cons of k-means++
Pros:
Speeds up convergence
Tends to find solutions closer to the global optimum
Cons:
Needs k passes over the data
Infeasible for massive datasets (imagine k = 1000)
It does not scale
It is inherently sequential:
the probability of choosing the i-th center depends critically on the realization of the previous i-1 centers
So what do we need?
An algorithm that makes fewer passes over the data
and that still comes with a theoretical guarantee
Scalable k-means++ (k-means||)
This algorithm tries to address the previous issues
It is a parallel version of the center initialization
K-means++ samples one point per iteration and then updates its distribution
K-means|| instead oversamples: in each round it samples every point independently, with a larger probability
Unlike k-means++, it uses an oversampling factor $\ell = \Omega(k)$
K-means||
First center: C ← a point sampled uniformly at random from X
Initial cost: $\phi = \sum_{x \in X} d^2(x, C)$
Repeat $O(\log \phi)$ times:
  C' ← sample each point $x \in X$ independently with probability
  $$p_x = \frac{\ell \cdot d^2(x, C)}{\sum_{x' \in X} d^2(x', C)}$$
  C ← C ∪ C'
For each $x \in C$, let $w_x$ be the number of points in X that are closer to x than to any other point in C
Recluster the weighted points in C into k clusters
How many points are in C?
The oversampling factor is $\ell = \Theta(k)$,
so the expected number of points in C is about $\ell \log \phi$
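The procedure above can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the number of rounds defaults to ceil(ln φ), probabilities are capped at 1, and the final reclustering is approximated by weighted D² seeding over the candidates rather than a full weighted clustering.

```python
import math
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_parallel_candidates(points, ell, rng, rounds=None):
    """Oversampling phase: return the candidate set C and the weights w_x."""
    centers = [rng.choice(points)]            # first center, uniform
    d2 = [sq_dist(x, centers[0]) for x in points]
    phi = sum(d2)                             # initial cost
    if rounds is None:
        # Assumption: ceil(ln(phi)) rounds, and at least one round.
        rounds = max(1, math.ceil(math.log(phi))) if phi > 0 else 0
    for _ in range(rounds):
        phi = sum(d2)
        if phi == 0:
            break
        # Each point is sampled independently, so this comprehension could
        # be split across workers that only share the value of phi.
        new = [x for x, d in zip(points, d2)
               if rng.random() < min(1.0, ell * d / phi)]
        centers.extend(new)
        if new:
            d2 = [min(d, min(sq_dist(x, c) for c in new))
                  for x, d in zip(points, d2)]
    # w_x: number of data points whose nearest candidate is x.
    weights = [0] * len(centers)
    for x in points:
        nearest = min(range(len(centers)),
                      key=lambda i: sq_dist(x, centers[i]))
        weights[nearest] += 1
    return centers, weights


def recluster(candidates, weights, k, rng):
    """Reduce the weighted candidates to k centers by weighted D^2 seeding."""
    if len(candidates) <= k:
        return list(candidates)
    first = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
    chosen = [candidates[first]]
    d2 = [sq_dist(c, chosen[0]) for c in candidates]
    while len(chosen) < k:
        scores = [w * d for w, d in zip(weights, d2)]
        if sum(scores) == 0:                  # nothing left to separate
            break
        idx = rng.choices(range(len(candidates)), weights=scores, k=1)[0]
        chosen.append(candidates[idx])
        d2 = [min(d, sq_dist(c, candidates[idx]))
              for c, d in zip(candidates, d2)]
    return chosen


pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (9.0, 9.0)]
cands, w = kmeans_parallel_candidates(pts, 2, random.Random(0))
print(len(cands), sum(w))
print(recluster(cands, w, 2, random.Random(1)))
```

The weights always sum to the number of data points, and the candidate set is small enough that the last step can run on a single machine.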
Understanding k-means||
• Let's use the previous example
• k = 4, ℓ = 3
[Figure: intermediate centers]
[Figure: reclustering to the final centers]
Figure credit: Arthur
Theoretical guarantee
Theorem: If an $\alpha$-approximation algorithm is used in the last step, then k-means|| obtains a solution that is an $O(\alpha)$-approximation to k-means.
For example, using k-means++ in the last step gives an $O(\log k)$-approximation.
Analysis
Theorem: If $\phi$ and $\phi'$ are the costs of the clustering at the beginning and at the end of an iteration, and OPT is the cost of the optimal clustering, then
$$E[\phi'] \le O(\mathrm{OPT}) + \frac{k}{e\ell}\,\phi$$
Consider a cluster A in the optimal clustering.
Order the data points of A in increasing order of their distance to the centroid of A:
$A = \{a_1, a_2, a_3, \dots, a_T\}$, with centroid
$$C_A = \frac{1}{T} \sum_{t=1}^{T} a_t$$
Before the iteration we have the center set C. Let
$$\phi(C) = \sum_{x \in X} d^2(x, C), \qquad \phi_A(C) = \sum_{a \in A} d^2(a, C)$$
Let $p_t$ be the probability of selecting $a_t$; then
$$p_t = \frac{\ell\, d^2(a_t, C)}{\phi(C)}$$
Let $q_t$ be the probability that $a_t$ is the first point in the ordering chosen by k-means||, and let $q_{T+1}$ be the probability that no point of A is selected.
For any $1 \le t \le T$,
$$q_t = p_t \prod_{j=1}^{t-1} (1 - p_j), \qquad q_{T+1} = 1 - \sum_{t=1}^{T} q_t$$
If $a_t$ is selected, we can either assign all points of A to the new center $a_t$ or keep their original assignment, so the cost of A is at most
$$s_t = \min\Big\{ \phi_A(C),\ \sum_{a \in A} \lVert a - a_t \rVert^2 \Big\}$$
Then we have,
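$$E[\phi_A(C \cup C')] \le \sum_{t=1}^{T} q_t s_t + q_{T+1}\, \phi_A(C)$$
When all points of A are far from C and tightly clustered, we can write $p_t = p$ for all t.
Then
$$q_t = p (1 - p)^{t-1}$$
which is a decreasing sequence, and we can define
$$s'_t = \sum_{a \in A} \lVert a - a_t \rVert^2$$
where $s'_t$ is an increasing sequence.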
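Then we can write
$$\sum_t q_t s_t \le \sum_t q_t s'_t \le \frac{1}{T} \Big(\sum_t q_t\Big) \Big(\sum_t s'_t\Big) = \Big(\sum_t q_t\Big) \cdot \frac{1}{T} \sum_t s'_t = \Big(\sum_t q_t\Big) \cdot 2\phi'_A$$
The second inequality holds because $q_t$ is decreasing while $s'_t$ is increasing; $\phi'_A = \sum_{a \in A} \lVert a - C_A \rVert^2$ denotes the cost of A with respect to its own centroid.
Finally, since $\sum_t q_t = 1 - q_{T+1}$, we have
$$E[\phi_A(C \cup C')] \le (1 - q_{T+1})\, 2\phi'_A + q_{T+1}\, \phi_A(C)$$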
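The last equality in the chain relies on the identity $\frac{1}{T}\sum_t s'_t = 2\phi'_A$, reading $\phi'_A$ as the cost of A with respect to its own centroid (an assumption about the slide's notation). A quick numeric check in pure Python, on a made-up cluster and with hypothetical function names, is sketched below.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def identity_sides(A):
    """Return ((1/T) * sum_t s'_t, 2 * phi'_A) for a cluster A."""
    T = len(A)
    c_A = centroid(A)
    # Cost of A with respect to its own centroid.
    phi_A = sum(sq_dist(a, c_A) for a in A)
    # s'_t: cost of A if every point is assigned to a_t.
    s_prime = [sum(sq_dist(a, a_t) for a in A) for a_t in A]
    return sum(s_prime) / T, 2 * phi_A


# Made-up cluster, used only to exercise the identity.
A = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (1.0, 4.0)]
lhs, rhs = identity_sides(A)
print(abs(lhs - rhs) < 1e-9)  # True
```

For this cluster both sides equal 30.4 up to floating-point rounding.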
How is it parallel?
First center: C ← a point sampled uniformly at random from X
Initial cost: $\phi = \sum_{x \in X} d^2(x, C)$ (a simple sum over the points)
Repeat $O(\log \phi)$ times:
  C' ← sample each point $x \in X$ independently with probability
  $$p_x = \frac{\ell \cdot d^2(x, C)}{\sum_{x' \in X} d^2(x', C)}$$
  (each point is sampled independently of the others)
  C ← C ∪ C'
For each $x \in C$, let $w_x$ be the number of points in X that are closer to x than to any other point in C
Recluster the weighted points in C into k clusters
Figure credit: Arthur
Summary: analogy

                K-means              K-means++            K-means||
Seeding         choose k centers     choose k centers     choose a subset C and
                randomly             proportionally       get k centers from C
Seeding speed   faster               slow                 slower
Clustering      slow                 fast                 faster
Summary: when to choose which?

                              K-means   K-means++   K-means||
Small data                    fast      very fast   slow
Moderate and large data       slow      very slow   fast
Reference papers
D. Arthur and S. Vassilvitskii, "k-means++: The Advantages of Careful Seeding", SODA 2007
B. Bahmani, B. Moseley, A. Vattani, R. Kumar, and S. Vassilvitskii, "Scalable K-Means++", PVLDB 2012
Questions?
