Pyspark - Kmeans Clustering With Map Reduce in Spark - Stack Overflow

The document provides an explanation and code sample for implementing K-means clustering with MapReduce in Spark. It describes mapping points to their closest centroids, reducing to calculate new centroid positions, and iterating until centroids converge within a threshold. The key steps are:

1. Initialize centroids randomly and cache points.
2. Map points to closest centroids and sum points by centroid.
3. Reduce to calculate new centroids from sums.
4. Iterate until centroids change less than a threshold.

Uploaded by jefferyleclerc

3/9/24, 5:41 AM pyspark - Kmeans clustering with map reduce in spark - Stack Overflow

Kmeans clustering with map reduce in spark

Asked 2 years, 2 months ago Modified 2 years, 2 months ago Viewed 822 times

Hello, can someone help me do map-reduce with K-means using Spark? I can already run K-means with Spark, but I don't know how to map and reduce it. Thanks.
apache-spark pyspark mapreduce k-means

asked Dec 11, 2021 at 16:55 by Ibrahim Ahmed-Nour

2 Answers

Below is a proposed pseudo-code for your exercise:

centroids = k random sampled points from the dataset

do:

    Map:
        Given a point and the set of centroids
        Calculate the distance between the point and each centroid
        Emit the point and the closest centroid

    Reduce:
        Given the centroid and the points belonging to its cluster
        Calculate the new centroid as the arithmetic mean position of the points
        Emit the new centroid

    prev_centroids = centroids
    centroids = new_centroids

while prev_centroids - centroids > threshold
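
The loop above can be sketched in plain Python, without Spark, to make the data flow concrete. This is only an illustration under assumptions: points are tuples of floats, `kmeans_mapreduce` and its parameters are hypothetical names, and "prev_centroids - centroids" is read as the largest Euclidean shift of any centroid.

```python
import math
from collections import defaultdict

def kmeans_mapreduce(points, centroids, threshold=1e-6, max_iter=100):
    """Hypothetical helper: repeat map and reduce phases until centroids settle."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Map: emit (index of the closest centroid, point) for every point.
        emitted = [
            (min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i])), p)
            for p in points
        ]
        # Shuffle: group the emitted points by centroid index.
        clusters = defaultdict(list)
        for index, p in emitted:
            clusters[index].append(p)
        # Reduce: the new centroid is the arithmetic mean of its cluster.
        new_centroids = list(centroids)  # an empty cluster keeps its old centroid
        for index, members in clusters.items():
            new_centroids[index] = tuple(sum(c) / len(members) for c in zip(*members))
        # Stop once no centroid moved by more than the threshold.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= threshold:
            break
    return centroids

data = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
result = kmeans_mapreduce(data, centroids=[(1.0, 1.0), (9.0, 9.0)])
print(result)
```

In Spark the shuffle step is not written by hand: it happens inside the key-based transformation that feeds the reduce stage.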

The mapper class calculates the distance between the data point and each centroid. It then emits the index of the closest centroid together with the data point:

class MAPPER
    method MAP(file_offset, point)
        min_distance = POSITIVE_INFINITY
        closest_centroid = -1
        for all centroid in list_of_centroids
            distance = distance(centroid, point)
            if (distance < min_distance)
                closest_centroid = index_of(centroid)
                min_distance = distance
        EMIT(closest_centroid, point)
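
As a rough Python rendering of the MAP method, the sketch below returns the emitted pair instead of calling EMIT. The function name `map_point`, the tuple representation of points, and the use of Euclidean distance are assumptions made for the example. It also shows a detail of the pseudo-code that is easy to miss: because the comparison is strict, a point that is equally far from two centroids goes to the one with the lower index.

```python
import math

def map_point(point, centroids):
    """Return (index of the closest centroid, point), mirroring MAP above."""
    min_distance = math.inf
    closest_centroid = -1
    for index, centroid in enumerate(centroids):
        distance = math.dist(centroid, point)
        # Strict '<' means that on a tie the lowest index wins.
        if distance < min_distance:
            closest_centroid = index
            min_distance = distance
    return closest_centroid, point

centroids = [(0.0, 0.0), (4.0, 0.0)]
near_second = map_point((3.0, 1.0), centroids)  # closer to the second centroid
tie = map_point((2.0, 0.0), centroids)          # equidistant from both
print(near_second, tie)
```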

The reducer calculates the new approximation of the centroid and emits it.

class REDUCER
    method REDUCE(centroid_index, list_of_partial_sums)
        point_sum = 0
        number_of_points = 0
        for all partial_sum in list_of_partial_sums:
            point_sum += partial_sum
            number_of_points += partial_sum.number_of_points
        centroid_value = point_sum / number_of_points
        EMIT(centroid_index, centroid_value)
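
A small Python sketch of the reducer may help here. It assumes each partial sum is a pair (component sums, number of points), which is one possible representation rather than the one used by the answer; `reduce_centroid` is a hypothetical name.

```python
def reduce_centroid(centroid_index, partial_sums):
    """partial_sums: iterable of (component_sums, number_of_points) pairs."""
    total = None
    count = 0
    for components, number_of_points in partial_sums:
        if total is None:
            total = list(components)
        else:
            total = [t + c for t, c in zip(total, components)]
        count += number_of_points
    if count == 0:
        raise ValueError("a centroid needs at least one point")
    # The new centroid is the component-wise sum divided by the point count.
    return centroid_index, tuple(t / count for t in total)

# Two partial sums: one already aggregates two points, the other is a single point.
partials = [((2.0, 4.0), 2), ((4.0, 2.0), 1)]
index, centroid = reduce_centroid(0, partials)
print(index, centroid)
```

Carrying the point count next to the sum is what lets the mean be computed correctly even when some inputs are already aggregates of several points.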

The actual K-Means Spark implementation:

First, read the file with the points and generate the initial centroids by random sampling with takeSample(False, k): this function takes k random samples, without replacement, from the RDD, so the application generates the initial centroids in a distributed manner without moving all the data to the driver. Because the RDD is reused at every iteration of the algorithm, cache it in memory with cache() to avoid re-evaluating it every time an action is triggered:

import time

def init_centroids(dataset, k):
    # takeSample(False, k) returns k random elements, sampled without replacement
    start_time = time.time()
    initial_centroids = dataset.takeSample(False, k)
    print("init centroid execution:", len(initial_centroids), "in",
          (time.time() - start_time), "s")
    return initial_centroids

points = sc.textFile(INPUT_PATH).map(Point).cache()
initial_centroids = init_centroids(points, k=parameters["k"])
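
For intuition, sampling without replacement can be mimicked on an in-memory list with the standard library. The sketch below is a local analogue, not Spark code; `init_centroids_local` and the fixed seed are assumptions for the example. PySpark's takeSample also accepts an optional seed argument, which is useful when runs need to be reproducible.

```python
import random

def init_centroids_local(dataset, k, seed=None):
    """Pick k elements from an in-memory list, without replacement."""
    rng = random.Random(seed)
    return rng.sample(dataset, k)

points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (6.0, 6.0), (9.0, 9.0)]
initial = init_centroids_local(points, k=3, seed=42)
print(initial)
```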

After that, you iterate the mapper and reducer stages until the stopping criterion is met or the maximum number of iterations is reached.

# Assumed setup, not shown in the original answer: the iteration counter and the
# first broadcast of the centroids. distance_broadcast (the norm order used by
# Point.distance) is assumed to have been broadcast beforehand as well.
n = 0
centroids_broadcast = sc.broadcast(initial_centroids)

while True:
    print("--Iteration n. {itr:d}".format(itr=n+1), end="\r", flush=True)
    cluster_assignment_rdd = points.map(assign_centroids)
    sum_rdd = cluster_assignment_rdd.reduceByKey(lambda x, y: x.sum(y))
    centroids_rdd = sum_rdd.mapValues(lambda x: x.get_average_point()).sortByKey(ascending=True)

    new_centroids = [item[1] for item in centroids_rdd.collect()]

    stop = stopping_criterion(new_centroids, parameters["threshold"])

    n += 1
    if not stop and n < parameters["maxiteration"]:
        # Ship the updated centroids to the executors for the next iteration
        centroids_broadcast = sc.broadcast(new_centroids)
    else:
        break

The stopping condition is computed this way:

def stopping_criterion(new_centroids, threshold):
    old_centroids = centroids_broadcast.value
    for i in range(len(old_centroids)):
        check = old_centroids[i].distance(new_centroids[i],
                                          distance_broadcast.value) <= threshold
        if not check:
            return False
    return True
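
The same check can be tried out without Spark. The sketch below assumes centroids are plain tuples and uses Euclidean distance; `has_converged` is a hypothetical name. Note that old and new centroids are compared position by position, so both lists must list the clusters in the same order, which is what sorting the reduced results by key provides in the loop.

```python
import math

def has_converged(old_centroids, new_centroids, threshold):
    """True when every centroid moved by at most `threshold` (Euclidean)."""
    return all(
        math.dist(old, new) <= threshold
        for old, new in zip(old_centroids, new_centroids)
    )

old = [(0.0, 0.0), (5.0, 5.0)]
small_move = [(0.0, 0.1), (5.0, 5.0)]   # largest shift is 0.1
large_move = [(0.0, 0.1), (8.0, 9.0)]   # second centroid moved by 5.0
print(has_converged(old, small_move, 0.5), has_converged(old, large_move, 0.5))
```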

In order to represent the points, a class Point has been defined. It's characterized by the following fields:

a NumPy array of components

number of points: a point can be seen as the aggregation of many points, so this variable is used to track the number of points
that are represented by the object

It includes the following operations:

distance (it is possible to pass as parameter the type of distance)

sum

get_average_point: this method returns a point whose components are the accumulated components divided by the number of points represented by the object

import numpy as np
from numpy import linalg

class Point:
    def __init__(self, line):
        # Parse one comma-separated line of the input file
        values = line.split(",")
        self.components = np.array([round(float(k), 5) for k in values])
        self.number_of_points = 1

    def sum(self, p):
        # Accumulate another point (or partial sum) into this one
        self.components = np.add(self.components, p.components)
        self.number_of_points += p.number_of_points
        return self

    def distance(self, p, h):
        # h is the order of the norm; negative values fall back to 2 (Euclidean)
        if (h < 0):
            h = 2
        return linalg.norm(self.components - p.components, h)

    def get_average_point(self):
        self.components = np.around(np.divide(self.components,
                                              self.number_of_points), 5)
        return self
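
The h argument of distance is passed to linalg.norm as the order of the norm. For a positive, finite order applied to a vector, that norm is the Minkowski distance, which the dependency-free sketch below reproduces. The function name `minkowski_distance` is hypothetical, and only positive orders are covered here.

```python
def minkowski_distance(a, b, h):
    """h-norm of (a - b); a negative h falls back to 2, as in Point.distance."""
    if h < 0:
        h = 2
    # Assumes h > 0 at this point; h == 0 is not handled in this sketch.
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1.0 / h)

manhattan = minkowski_distance((0.0, 0.0), (3.0, 4.0), 1)   # |3| + |4|
euclidean = minkowski_distance((0.0, 0.0), (3.0, 4.0), 2)   # sqrt(9 + 16)
fallback = minkowski_distance((0.0, 0.0), (3.0, 4.0), -1)   # treated as h = 2
print(manhattan, euclidean, fallback)
```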

The mapper method is invoked at each iteration on the input file, which contains the points from the dataset:

cluster_assignment_rdd = points.map(assign_centroids)

For each point it is invoked on, the assign_centroids function assigns the closest centroid to that point. The centroids are taken from the broadcast variable. The function returns the result as a tuple (id of the centroid, point):

def assign_centroids(p):
    min_dist = float("inf")
    centroids = centroids_broadcast.value
    nearest_centroid = 0
    for i in range(len(centroids)):
        distance = p.distance(centroids[i], distance_broadcast.value)
        if(distance < min_dist):
            min_dist = distance
            nearest_centroid = i
    return (nearest_centroid, p)

The reduce stage is done using two Spark transformations:

reduceByKey: for each cluster, compute the sum of the points belonging to it. It requires a function that accepts two values and returns a single one; this function should be commutative and associative in the mathematical sense, because Spark may combine the values in any order.

sum_rdd = cluster_assignment_rdd.reduceByKey(lambda x, y: x.sum(y))

mapValues: it is used to calculate the average point for each cluster at the end of each stage. The points are already grouped by key. This transformation operates only on the value of each key-value pair. The results are sorted to make comparisons easier.

centroids_rdd = sum_rdd.mapValues(lambda x: x.get_average_point()).sortBy(lambda x: x[1].components[0])
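
To see why the function given to reduceByKey should be commutative and associative, the sketch below merges the same partial sums in every possible order and checks that the result never changes. It is a plain-Python illustration: `merge` is a hypothetical stand-in for Point.sum that returns a new pair instead of mutating its first argument, and the (component sums, count) tuples are an assumed representation.

```python
from functools import reduce
from itertools import permutations

def merge(a, b):
    """Combine two (component_sums, count) partial sums into a new pair."""
    sums_a, count_a = a
    sums_b, count_b = b
    return tuple(x + y for x, y in zip(sums_a, sums_b)), count_a + count_b

partials = [((1.0, 2.0), 1), ((3.0, 4.0), 1), ((5.0, 6.0), 2)]

# Merge the partial sums in all 6 possible orders and collect the distinct results.
results = {reduce(merge, order) for order in permutations(partials)}
print(results)
```

A single distinct result means the merge order does not matter for this data, which is the property that lets the sums be combined partition by partition before the final average is taken.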

The get_average_point() function returns the newly computed centroid.

def get_average_point(self):
    self.components = np.around(np.divide(self.components,
                                          self.number_of_points), 5)
    return self

edited Dec 24, 2021 at 15:14, answered Dec 24, 2021 at 14:41 by el_pazzu

You don't need to write map-reduce yourself. You can use the Spark DataFrame API together with the Spark ML library.

You can read more about it here: https://spark.apache.org/docs/latest/ml-clustering.html
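
As a rough idea of what this looks like, the sketch below wraps a Spark ML K-means run in a function. It assumes PySpark is installed and a local Spark session can be started; the function name `cluster_with_spark_ml`, the local master, the app name, and the default values of k and seed are illustrative choices, not part of the answer. The imports sit inside the function so that defining it does not require Spark.

```python
def cluster_with_spark_ml(rows, k=2, seed=1):
    """Fit Spark ML KMeans on a list of numeric rows and return the centers."""
    # Imported lazily: these lines need PySpark only when the function is called.
    from pyspark.sql import SparkSession
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.linalg import Vectors

    spark = (SparkSession.builder
             .master("local[*]")
             .appName("kmeans-ml-sketch")
             .getOrCreate())
    try:
        # Spark ML estimators expect a vector column, here named "features".
        df = spark.createDataFrame(
            [(Vectors.dense(r),) for r in rows], ["features"])
        model = KMeans(k=k, seed=seed).fit(df)
        return [center.tolist() for center in model.clusterCenters()]
    finally:
        spark.stop()

# Example call (requires a working Spark installation):
# centers = cluster_with_spark_ml([[1.0, 1.0], [1.0, 2.0], [9.0, 9.0], [10.0, 9.0]])
```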

edited Dec 11, 2021 at 18:51, answered Dec 11, 2021 at 18:45 by Rahul Kumar

https://stackoverflow.com/questions/70317122/kmeans-clustering-with-map-reduce-in-spark
