Unit 2 Topic 4 Map Reduce
Resilience
Each slave node periodically sends a status update (heartbeat) to the master node.
If a slave node fails to send its heartbeat, the master node reassigns
that node's currently running tasks to other available nodes
in the cluster.
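The heartbeat-and-reassignment idea can be sketched as a toy simulation. This is an illustration only, not Hadoop's actual code; the timeout value, node names, and helper functions are assumptions.

```python
# Toy sketch of heartbeat-based failure detection and task reassignment
# (illustrative assumptions only, not Hadoop's real implementation).
HEARTBEAT_TIMEOUT = 10  # seconds of silence before a node is presumed dead

def find_failed_nodes(last_heartbeat, now):
    """Return nodes whose last heartbeat is older than the timeout."""
    return [node for node, ts in last_heartbeat.items()
            if now - ts > HEARTBEAT_TIMEOUT]

def reassign_tasks(tasks, failed, healthy):
    """Move tasks off failed nodes onto healthy ones, round-robin."""
    moved, i = {}, 0
    for task, node in tasks.items():
        if node in failed:
            moved[task] = healthy[i % len(healthy)]
            i += 1
        else:
            moved[task] = node
    return moved

last_heartbeat = {"node1": 100.0, "node2": 95.0, "node3": 82.0}
tasks = {"t1": "node1", "t2": "node3", "t3": "node2"}
failed = find_failed_nodes(last_heartbeat, now=100.0)  # node3 is 18 s stale
tasks = reassign_tasks(tasks, failed, healthy=["node1", "node2"])
```

Here `node3` misses its heartbeat window, so its task `t2` is handed to a healthy node while the other assignments are left untouched.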
Quick
Data processing is fast because MapReduce uses HDFS, a distributed storage
system that keeps data close to the compute nodes.
MapReduce can process terabytes of unstructured data in minutes.
Parallel Processing
In MapReduce, a job is divided among multiple nodes, and each node
works on its part of the job simultaneously.
MapReduce is based on the divide-and-conquer paradigm, which lets us
process the data using different machines.
Because the data is processed by multiple machines in parallel instead of
a single machine, the time taken to process the data is reduced by a
tremendous amount.
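The divide-and-conquer pattern can be sketched on a single machine: split the input into independent chunks, process each chunk in a separate worker, then combine the partial results. In this illustrative sketch, threads merely stand in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, n_chunks):
    """Divide the input into roughly equal, independent chunks."""
    size = (len(data) + n_chunks - 1) // n_chunks
    return [data[i:i + size] for i in range(0, len(data), size)]

def process_chunk(chunk):
    # Each "node" computes a partial result over its own slice of the data.
    return sum(x * x for x in chunk)

data = list(range(1, 101))
chunks = split(data, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_chunk, chunks))
total = sum(partials)  # combine the partial results
```

Each chunk is processed independently, so the workers never need to coordinate until the final combine step, which is what makes this pattern easy to scale across machines.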
Availability
Multiple replicas of the same data are stored on different nodes in the
network.
Thus, in case of any node failure, other copies are readily available for
processing without any data loss.
Scalability
Hadoop is a highly scalable platform.
Traditional RDBMS systems do not scale well as data volume increases.
MapReduce lets you run applications across a huge number of nodes,
processing terabytes and petabytes of data.
Map Reduce Framework
A MapReduce job usually splits the input data set into
independent chunks, which are processed by the map
tasks in a completely parallel manner.
Typically both the input and the output of the job are
stored in a file system.
How Map Reduce works
MapReduce can perform distributed and parallel
computations on large datasets across a large number
of nodes.
from mrjob.job import MRJob
from mrjob.step import MRStep

class RatingsBreak(MRJob):
    def steps(self):
        # Define a single map-reduce step for this job.
        return [
            MRStep(mapper=self.mapper_get_ratings,
                   reducer=self.reducer_count_ratings)
        ]

# MAPPER CODE