Pyspark - Kmeans Clustering With Map Reduce in Spark - Stack Overflow

The document provides an explanation and code sample for implementing K-means clustering with MapReduce in Spark. It describes mapping points to their closest centroids, reducing to calculate new centroid positions, and iterating until centroids converge within a threshold. The key steps are: 1. Initialize centroids randomly and cache points. 2. Map points to closest centroids and sum points by centroid. 3. Reduce to calculate new centroids from sums. 4. Iterate until centroids change less than a threshold.



Kmeans clustering with map reduce in spark


Asked 2 years, 2 months ago Modified 2 years, 2 months ago Viewed 822 times

Hello, can someone help me to do MapReduce with K-means using Spark? I can actually do K-means with Spark, but I don't know how to map
and reduce it. Thanks.
2
apache-spark pyspark mapreduce k-means

Share Improve this question Follow asked Dec 11, 2021 at 16:55
Ibrahim Ahmed-Nour
29 1

2 Answers Sorted by: Highest score (default)

Below is a proposed pseudo-code for your exercise:

centroids = k random sampled points from the dataset


Map:

Given a point and the set of centroids
Calculate the distance between the point and each centroid
Emit the point and the closest centroid
Reduce:
Given the centroid and the points belonging to its cluster
Calculate the new centroid as the arithmetic mean position of the points


Emit the new centroid

prev_centroids = centroids

centroids = new_centroids

while prev_centroids - centroids > threshold
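
To make the pseudo-code concrete before moving to Spark, here is a minimal plain-Python/NumPy sketch of one map/reduce iteration; the function and variable names are illustrative and not part of the answer's code:

import numpy as np

def kmeans_iteration(points, centroids):
    """One map/reduce step: map each point to its closest centroid,
    then reduce each cluster to the arithmetic mean of its points."""
    # Map: emit (index of the closest centroid, point)
    assignments = [(int(np.argmin([np.linalg.norm(p - c) for c in centroids])), p)
                   for p in points]
    # Reduce: average the points assigned to each centroid
    new_centroids = list(centroids)
    for idx in range(len(centroids)):
        cluster = [p for i, p in assignments if i == idx]
        if cluster:
            new_centroids[idx] = np.mean(cluster, axis=0)
    return new_centroids

points = [np.array(p) for p in ([0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [4.9, 5.0])]
centroids = [points[0], points[2]]      # k = 2, seeded from the data
for _ in range(10):                     # in practice, iterate until the shift is below a threshold
    centroids = kmeans_iteration(points, centroids)
print(centroids)                        # roughly [0.05, 0.1] and [4.95, 5.05]

The Spark code below follows the same structure: the map step corresponds to points.map(assign_centroids), and the reduce step to reduceByKey followed by mapValues.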

The mapper class calculates the distance between the data point and each centroid. Then emits the index of the closest centroid and
the data point:

class MAPPER
    method MAP(file_offset, point)
        min_distance = POSITIVE_INFINITY
        closest_centroid = -1
        for all centroid in list_of_centroids
            distance = distance(centroid, point)
            if (distance < min_distance)
                closest_centroid = index_of(centroid)
                min_distance = distance
        EMIT(closest_centroid, point)

The reducer calculates the new approximation of the centroid and emits it.

class REDUCER
    method REDUCE(centroid_index, list_of_partial_sums)
        point_sum = 0
        for all partial_sum in list_of_partial_sums:
            point_sum += partial_sum
            point_sum.number_of_points += partial_sum.number_of_points
        centroid_value = point_sum / point_sum.number_of_points
        EMIT(centroid_index, centroid_value)

The actual K-Means Spark implementation:

First, you read the file with the points and generate the initial centroids with a random sampling, using takeSample(False, k): this
function takes k random samples, without replacement, from the RDD, so the application generates the initial centroids in a
distributed manner, avoiding moving all the data to the driver. Since the RDD is reused in an iterative algorithm, cache it in
memory with cache() to avoid re-evaluating it every time an action is triggered:

points = sc.textFile(INPUT_PATH).map(Point).cache()
initial_centroids = init_centroids(points, k=parameters["k"])

def init_centroids(dataset, k):
    start_time = time.time()
    initial_centroids = dataset.takeSample(False, k)
    print("init centroid execution:", len(initial_centroids), "in",
          (time.time() - start_time), "s")
    return initial_centroids
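
As a quick illustration of the takeSample semantics used above, a hypothetical toy run (assumes an active SparkContext sc):

rdd = sc.parallelize(range(100))
# Three distinct elements, chosen without replacement; the seed is optional.
print(rdd.takeSample(False, 3, seed=42))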

After that, you iterate the mapper and the reducer stages until the stopping criterion is satisfied or the maximum number of
iterations is reached.

while True:
    print("--Iteration n. {itr:d}".format(itr=n+1), end="\r", flush=True)
    cluster_assignment_rdd = points.map(assign_centroids)
    sum_rdd = cluster_assignment_rdd.reduceByKey(lambda x, y: x.sum(y))
    centroids_rdd = sum_rdd.mapValues(lambda x:
        x.get_average_point()).sortByKey(ascending=True)

    new_centroids = [item[1] for item in centroids_rdd.collect()]

    stop = stopping_criterion(new_centroids, parameters["threshold"])

    n += 1
    if stop == False and n < parameters["maxiteration"]:
        centroids_broadcast = sc.broadcast(new_centroids)
    else:
        break
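
Note that the loop relies on centroids_broadcast, distance_broadcast, and the counter n already existing before the first iteration. The answer does not show that setup; a minimal sketch of what it presumably looks like (the value 2, i.e. the norm order consumed by Point.distance, is an assumption):

n = 0
# Broadcast the initial centroids and the norm order so every executor can read them.
centroids_broadcast = sc.broadcast(initial_centroids)
distance_broadcast = sc.broadcast(2)   # assumed: 2 selects the Euclidean norm in Point.distance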
The stopping condition is computed this way:

def stopping_criterion(new_centroids, threshold):
    old_centroids = centroids_broadcast.value
    for i in range(len(old_centroids)):
        check = old_centroids[i].distance(new_centroids[i],
                                          distance_broadcast.value) <= threshold
        if check == False:
            return False
    return True

In order to represent the points, a class Point has been defined. It's characterized by the following fields:

a numpy array of components

number of points: a point can be seen as the aggregation of many points, so this variable is used to track the number of points
that are represented by the object

It includes the following operations:

distance (it is possible to pass as parameter the type of distance)

sum

get_average_point: this method returns a point whose components are the current components divided by the number of points
represented by the object

class Point:
    def __init__(self, line):
        values = line.split(",")
        self.components = np.array([round(float(k), 5) for k in values])
        self.number_of_points = 1

def sum(self, p):
    self.components = np.add(self.components, p.components)
    self.number_of_points += p.number_of_points
    return self

def distance(self, p, h):
    if (h < 0):
        h = 2
    return linalg.norm(self.components - p.components, h)

def get_average_point(self):
    self.components = np.around(np.divide(self.components,
                                           self.number_of_points), 5)
    return self
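
As a quick usage example of these Point operations (toy values; assumes the imports the answer's code relies on, i.e. numpy as np and numpy's linalg):

import numpy as np
from numpy import linalg

p1 = Point("1.0,2.0")
p2 = Point("3.0,4.0")
print(p1.distance(p2, 2))                  # Euclidean distance: ~2.82843
p1.sum(p2)                                 # in-place sum; p1 now aggregates 2 points
print(p1.get_average_point().components)   # [2. 3.]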

At each iteration, the mapper is invoked on the RDD of points built from the input file:

cluster_assignment_rdd = points.map(assign_centroids)

The assign_centroids function assigns the closest centroid to each point on which it is invoked. The centroids are taken
from the broadcast variable. The function returns the result as a tuple (id of the centroid, point):

def assign_centroids(p):
    min_dist = float("inf")
    centroids = centroids_broadcast.value
    nearest_centroid = 0
    for i in range(len(centroids)):
        distance = p.distance(centroids[i], distance_broadcast.value)
        if distance < min_dist:
            min_dist = distance
            nearest_centroid = i
    return (nearest_centroid, p)

The reduce stage is done using two Spark transformations (a toy illustration of both follows the list):

reduceByKey: for each cluster, compute the sum of the points belonging to it. The function passed as a parameter must accept
two arguments and return a single element, and it should be commutative and associative so that it can be applied in parallel.

sum_rdd = cluster_assignment_rdd.reduceByKey(lambda x, y: x.sum(y))

mapValues: it is used to calculate the average point for each cluster at the end of each iteration. The points are already grouped by
key, and this transformation works only on the value of each key. The results are sorted to make comparisons easier.

centroids_rdd = sum_rdd.mapValues(lambda x: x.get_average_point()).sortBy(lambda x: x[1].components[0])
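
To see what the two transformations do in isolation, a hypothetical toy run with plain numbers instead of Point objects (assumes an active SparkContext sc):

pairs = sc.parallelize([(0, 1.0), (0, 3.0), (1, 10.0)])
sums = pairs.reduceByKey(lambda x, y: x + y)       # -> (0, 4.0), (1, 10.0)
# mapValues touches only the value; the keys (cluster ids) stay untouched.
print(sums.mapValues(lambda s: s / 2).collect())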


The get_average_point() function returns the newly computed centroid.

def get_average_point(self):
self.components = np.around(np.divide(self.components,
self.number_of_points), 5)
return self

Share Improve this answer Follow edited Dec 24, 2021 at 15:14 answered Dec 24, 2021 at 14:41

el_pazzu
85 3 12

You don't need to write MapReduce yourself. You can use the Spark DataFrame API and the Spark ML library.

You can read more about it here.


https://spark.apache.org/docs/latest/ml-clustering.html
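
For completeness, a minimal sketch of the DataFrame/Spark ML route (the input path, column names, and k are assumptions, not taken from the question):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("kmeans-ml").getOrCreate()

# Hypothetical input: a headerless CSV with two numeric columns.
df = spark.read.csv("points.csv", inferSchema=True).toDF("x", "y")

# Spark ML expects the features packed into a single vector column.
assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
features_df = assembler.transform(df)

kmeans = KMeans(k=4, seed=1, featuresCol="features", predictionCol="cluster")
model = kmeans.fit(features_df)

print(model.clusterCenters())              # list of NumPy arrays, one per cluster
model.transform(features_df).show(5)       # original columns plus the "cluster" assignment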

Share Improve this answer Follow edited Dec 11, 2021 at 18:51 answered Dec 11, 2021 at 18:45
Rahul Kumar
2,234 4 25 52

