Pyspark - Kmeans Clustering With Map Reduce in Spark - Stack Overflow
Hello, can someone help me do map-reduce with K-means using Spark? I can already run K-means with Spark, but I don't know how to
express it as map and reduce steps. Thanks.
apache-spark pyspark mapreduce k-means
asked Dec 11, 2021 at 16:55
Ibrahim Ahmed-Nour
Map:
Given a point and the set of centroids
Calculate the distance between the point and each centroid
Emit the point and the closest centroid
Reduce:
Given the centroid and the points belonging to its cluster
Calculate the new centroid as the arithmetic mean position of the points
prev_centroids = centroids
centroids = new_centroids
The mapper class calculates the distance between the data point and each centroid. It then emits the index of the closest centroid
together with the data point:
class MAPPER
    method MAP(file_offset, point)
        min_distance = POSITIVE_INFINITY
        closest_centroid = -1
        for all centroid in list_of_centroids
            distance = distance(centroid, point)
            if (distance < min_distance)
                closest_centroid = index_of(centroid)
                min_distance = distance
        EMIT(closest_centroid, point)
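The mapper pseudocode can be tried locally without Spark. The following is only a sketch in plain Python: it assumes Euclidean distance and points stored as tuples, and the name map_point is hypothetical rather than taken from the answer.

```python
import math

def map_point(point, centroids):
    """Return (index of the closest centroid, point), like EMIT in the mapper."""
    min_distance = math.inf
    closest_centroid = -1
    for index, centroid in enumerate(centroids):
        # Assumption: Euclidean distance; other metrics would plug in here.
        distance = math.dist(centroid, point)
        if distance < min_distance:
            closest_centroid = index
            min_distance = distance
    return closest_centroid, point

# Hypothetical toy data: two centroids and three points.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 2.0), (9.0, 8.0), (0.5, 0.5)]
assignments = [map_point(p, centroids) for p in points]
```

Each element of assignments is a (centroid index, point) pair, which is the key/value shape the reduce step expects.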
The reducer calculates the new approximation of the centroid and emits it.
class REDUCER
    method REDUCE(centroid_index, list_of_partial_sums)
        point_sum = 0
        number_of_points = 0
        for all partial_sum in list_of_partial_sums
            point_sum += partial_sum
            number_of_points += partial_sum.number_of_points
        centroid_value = point_sum / number_of_points
        EMIT(centroid_index, centroid_value)
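A runnable counterpart of the reducer might look like the sketch below. It assumes each partial sum is a (vector_sum, count) pair, which is a simplification of the Point objects used later in this answer; the name reduce_centroid is hypothetical.

```python
def reduce_centroid(centroid_index, partial_sums):
    """Merge (vector_sum, count) pairs and return (index, mean vector)."""
    partial_sums = list(partial_sums)
    dims = len(partial_sums[0][0])
    point_sum = [0.0] * dims
    number_of_points = 0
    for vector_sum, count in partial_sums:
        # Accumulate the component-wise sum and the number of points seen.
        point_sum = [a + b for a, b in zip(point_sum, vector_sum)]
        number_of_points += count
    # The new centroid is the arithmetic mean of the cluster's points.
    centroid_value = [s / number_of_points for s in point_sum]
    return centroid_index, centroid_value

# Hypothetical partial sums for cluster 0: four points in total.
partials = [([1.0, 2.0], 1), ([4.0, 6.0], 2), ([1.0, 1.0], 1)]
result = reduce_centroid(0, partials)
```

Carrying the count next to the sum is what lets partial results from different workers be merged before the final division.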
First, read the file with the points and generate the initial centroids by random sampling with takeSample(False, k): this
function takes k random samples, without replacement, from the RDD, so the application generates the initial centroids in a
distributed manner, without moving all the data to the driver. Because the RDD is reused in an iterative algorithm, cache it in
memory with cache() to avoid re-evaluating it every time an action is triggered:
points = sc.textFile(INPUT_PATH).map(Point).cache()
initial_centroids = init_centroids(points, k=parameters["k"])
After that, iterate the mapper and reducer stages until the stopping criterion is met or the maximum number of iterations is
reached.
while True:
    print("--Iteration n. {itr:d}".format(itr=n+1), end="\r", flush=True)
    cluster_assignment_rdd = points.map(assign_centroids)
    sum_rdd = cluster_assignment_rdd.reduceByKey(lambda x, y: x.sum(y))
    centroids_rdd = sum_rdd.mapValues(lambda x: x.get_average_point()).sortByKey(ascending=True)
    # new_centroids and stop are assumed to be computed here from centroids_rdd;
    # that code does not appear in this listing.
    n += 1
    if not stop and n < parameters["maxiteration"]:
        centroids_broadcast = sc.broadcast(new_centroids)
    else:
        break
The stopping condition is computed this way:
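As a minimal sketch of such a check, assuming convergence means that no centroid moved more than a chosen threshold between two iterations (the name has_converged, the tuple representation, and the threshold values are hypothetical, not taken from the answer):

```python
import math

def has_converged(prev_centroids, new_centroids, threshold):
    """True when every centroid moved at most `threshold` (Euclidean distance)."""
    return all(
        math.dist(old, new) <= threshold
        for old, new in zip(prev_centroids, new_centroids)
    )

# Hypothetical centroids from two consecutive iterations.
prev_centroids = [(0.0, 0.0), (10.0, 10.0)]
moved_a_lot = [(1.0, 1.0), (10.0, 10.0)]
barely_moved = [(0.001, 0.0), (10.0, 10.001)]
```

This comparison pairs centroids by position, so it relies on both lists being in the same order, which is one reason to sort the reducer output.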
To represent the points, a class Point has been defined. It has the following fields and methods:
a NumPy array of components
number of points: a point can be seen as the aggregation of many points, so this variable tracks the number of points
represented by the object
sum
get_average_point: this method returns a point whose components are the current components divided by the number of points
represented by the object
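import numpy as np

class Point:
    def __init__(self, line):
        # Parse one comma-separated input line into a vector of components.
        values = line.split(",")
        self.components = np.array([round(float(k), 5) for k in values])
        # A freshly parsed point represents exactly one data point.
        self.number_of_points = 1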
The mapper is invoked, at each iteration, on the input file that contains the points from the dataset:
cluster_assignment_rdd = points.map(assign_centroids)
The assign_centroids function assigns the closest centroid to each point it is invoked on. The centroids are taken from the
broadcast variable. The function returns the result as a tuple (id of the centroid, point):
def assign_centroids(p):
    min_dist = float("inf")
    centroids = centroids_broadcast.value
    nearest_centroid = 0
    for i in range(len(centroids)):
        distance = p.distance(centroids[i], distance_broadcast.value)
        if distance < min_dist:
            min_dist = distance
            nearest_centroid = i
    return (nearest_centroid, p)
reduceByKey: for each cluster, computes the sum of the points belonging to it. It requires a function as a parameter; that
function takes two arguments, returns a single element, and should be both commutative and associative in the mathematical sense.
mapValues: calculates the average point for each cluster at the end of each stage. The points are already grouped by key, and
this transformation operates only on the value of each key. The results are sorted to make comparisons easier.
def get_average_point(self):
    self.components = np.around(np.divide(self.components, self.number_of_points), 5)
    return self
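The requirement that the function passed to reduceByKey be commutative and associative can be illustrated without Spark. The sketch below uses functools.reduce and plain (vector_sum, count) pairs in place of Point objects; the merge function and the toy values are assumptions made for illustration.

```python
from functools import reduce

def merge(a, b):
    """Combine two (vector_sum, count) pairs; order and grouping do not matter."""
    (sum_a, n_a), (sum_b, n_b) = a, b
    return [x + y for x, y in zip(sum_a, sum_b)], n_a + n_b

# Hypothetical values that all share the same cluster key.
values = [([1.0, 2.0], 1), ([3.0, 4.0], 1), ([5.0, 6.0], 1), ([7.0, 8.0], 1)]

# One partition holding everything.
single = reduce(merge, values)

# Two partitions reduced independently, then merged in the opposite order.
left = reduce(merge, values[:1])
right = reduce(merge, values[1:])
split = merge(right, left)
```

Both groupings give the same sum and count, so the division in get_average_point can safely happen once, after the reduction. Averaging inside the reduce function instead would not be associative, because partitions of different sizes would be weighted equally.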
answered Dec 24, 2021 at 14:41, edited Dec 24, 2021 at 15:14
You don't need to write the map-reduce steps yourself. You can use the Spark DataFrame API together with the Spark ML library.
answered Dec 11, 2021 at 18:45, edited Dec 11, 2021 at 18:51
Rahul Kumar