Lecture Notes Map Reduce
In the previous module, you learnt about HDFS and then proceeded to learn how to write programs that can be executed on a Hadoop
cluster. In this module, we discussed the MapReduce programming model, and you
also learnt the working principles of the MapReduce framework.
In this segment, you understood the MapReduce programming model through the well-known
example of counting the occurrences of words in a document. The problem was as follows: You
are given a huge list of documents and asked to list all the words along with their frequency.
For instance, the text “The course 2 of the big data program is the most important part of the
entire program” should return (“the”, 4) (“course”, 1) (“2”, 1) (“of”, 2) (“big”,1) (“data”, 1)
(“program”, 2) (“is”, 1) (“most”, 1) (“important”, 1) (“part”, 1) (“entire”, 1).
Since you are working in a distributed environment, different parts of the documents are stored
across multiple nodes that are connected through a communication network.
The mapreduce programming model utilizes the data-parallel model during the map phase and
the inverse tree-parallel model for aggregating results in the reduce phase.
Essentially, a mapreduce program executes as follows:
1. The master node initiates some number of map tasks depending on the number of file
chunks.
2. Each map task takes one or more file chunks as input and outputs a sequence of
intermediate key-value pairs.
3. The master node collects all the intermediate key-value pairs from the map tasks and
sorts them by key.
4. It then sends them to the reduce tasks in such a way that all key-value pairs with the
same key go to the same reduce task.
5. Each reduce task aggregates the entire set of values associated with a key, typically
using a combination of binary operators.
Assume that you have three separate nodes storing each file chunk. In the first step, you write a
map function that takes text as input and outputs key-value pairs of the form (word, 1) for each
word in the chunk. The master node sees three file chunks and hence initiates three map tasks.
These map tasks run in parallel on the three nodes. The first map task reads the input “The
course 2 of the big data” and produces the following key-value pairs: (The, 1) (course, 1) (2, 1)
(of, 1) (the, 1) (big, 1) (data, 1). The second map task reads the input “program is the most
important” and outputs (program, 1) (is, 1) (the, 1) (most, 1) (important, 1) and similarly for the
third map task.
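The map function described here can be written as a Hadoop Mapper. Below is a minimal sketch, assuming the new (org.apache.hadoop.mapreduce) API; the class name WordCountMapper is illustrative, and the call to toLowerCase() is an assumption so that “The” and “the” are counted together, as in the expected output.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key   = byte offset of the current line in the file chunk
        // value = the line of text itself
        StringTokenizer tokenizer = new StringTokenizer(value.toString().toLowerCase());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);   // emit (word, 1) for every word in the line
        }
    }
}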
In the second step, the intermediate key-value pairs are grouped by the key. The master node
allocates a certain number of reduce tasks as per the user’s instruction, let’s say it is r. It then
uses a hash function to distribute the key-value pairs among the reduce tasks.
In our example, if there are 2 reducers numbered 0 and 1, the master node uses a hash code
followed by a mod 2 operation to find the reducer corresponding to a key. The result of this
operation would be similar to what is shown here: All the key-value pairs with keys “the”,
“course”, “data”, “important”, “part”, “entire” should be sent to reduce task 0 and the rest
should be sent to reduce task 1.
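Conceptually, this key-to-reducer assignment works like Hadoop's default HashPartitioner. A minimal sketch of the idea for our word-count keys and values (the class name is illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of hash-based partitioning, in the spirit of Hadoop's default HashPartitioner.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Hash the key, drop the sign bit, and take mod r (here r = numReduceTasks = 2),
        // so that all pairs with the same key land on the same reduce task.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}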
Based on this distribution, each node creates separate partitions to store the intermediate
key-value pairs in its local disk. Node 1 creates two partitions. The first partition contains the
key-value pairs (“The”, 1) (“course”, 1) (“the”, 1) (“data”, 1), and the second partition contains
(“2”, 1) (“of”, 1) and (“big”, 1).
Similarly nodes 2 and 3 create two partitions each. The key-value pairs in each partition are
then sorted according to their keys. As you can see, this naturally results in the grouping of the
key-value pairs. Subsequently, the master node collects all the key-value pairs with a single key
from all the map tasks, and sends the merged list to the respective reduce task.
For example, the master node collects two key-value pairs (the,1) from the first map task, one
(the,1) from the second map task and another (the,1) from the third map task, and merges
them into a list of the form (the, [1,1,1,1]), and then sends it to the corresponding reduce task,
i.e. reduce task 0 in our example.
There is a special name for the application of the reduce function to a single key and its
associated values: it is called a reducer. So, each reduce task executes multiple reducers.
For instance, reduce task 0 first executes a reducer that adds the four 1’s associated with the
key “the” and outputs (“the”,4). For the next key “course”, there is nothing to be added since it
has only one value. Similarly, reduce task 1 executes a reducer that takes (2, [1]) as input and
outputs (2,1) and so on.
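A minimal sketch of the corresponding reducer, again assuming the new API and the illustrative class name WordCountReducer:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // e.g. key = "the", values = [1, 1, 1, 1]
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();       // the binary operator here is simple addition
        }
        context.write(key, new IntWritable(sum));   // emit ("the", 4)
    }
}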
In summary the following steps are involved in the execution of a MapReduce program:
1. The master node initiates a number of map tasks depending on the number of file
chunks. All inputs and outputs in MapReduce are in the form of key-value pairs. As you
will see later, the text document is also input in the form of key-value pairs. Each map
task can take multiple file chunks in the form of key-value pairs as inputs and outputs a
sequence of intermediate key-value pairs, which look like the following:
(K,V) → (K1,V1)(K2,V2)…
(K′,V′) → (K′1,V′1)(K′2,V′2)…
⋮
(K′′,V′′) → (K′′1,V′′1)(K′′2,V′′2)…
Note that it is the user who describes, in the form of a Map function, how these useful
intermediate key-value pairs are extracted from the input data. Typically, there would be several
common keys among these intermediate key-value pairs. In the next step, the intermediate
key-value pairs that have common keys are grouped together.
2. The map tasks partition and sort all the intermediate key-value pairs by their keys.
These are later merged by the Master node and sent to the reduce tasks in such a way
that all key-value pairs with the same key arrive at the same reduce task in the form of a
key and an iterable list of all its associated values.
Suppose that there are only three distinct keys, K1, K2, and K3, in the entire collection of
intermediate key-value pairs that you obtained in step 1. Then, this step may be represented in
the following manner:
(K1,V1)(K2,V2)…             (K1,V11)(K1,V12)…      (K1,[V11,V12,…])
(K′1,V′1)(K′2,V′2)…    →    (K2,V21)(K2,V22)…   →  (K2,[V21,V22,…])
(K′′1,V′′1)(K′′2,V′′2)…     (K3,V31)(K3,V32)…      (K3,[V31,V32,…])
If you assume that there are two reducers, then the three groups shown above are distributed
between these two reducers using a hash function on the keys. If hash(K1)=hash(K3)=0 and
hash(K2)=1, then the groups with the keys K1 and K3 go to reduce task 0, and the group with
the key K2 goes to reduce task 1.
3. Each reduce task aggregates the entire set of values associated with a key, typically, by
using a combination of binary operators. Again, it’s the user who describes this process
in the form of a Reduce function. By denoting the binary operator with an asterisk (*),
this step may be represented in the following manner:
Reduce task 0: (K1,[V11,V12,…]) → (K1, V11∗V12∗…)
               (K3,[V31,V32,…]) → (K3, V31∗V32∗…)
Reduce task 1: (K2,[V21,V22,…]) → (K2, V21∗V22∗…)
In the previous segment, you saw that all the intermediate key value pairs are transferred to
the reduce tasks through the master node. This doesn’t look daunting for a small example that
you have considered here. However, if you were to perform word count at the scale of millions
of documents, these intermediate key-value pairs may themselves be huge in number.
The MapReduce framework provides a way to reduce this data transfer. Instead of transferring the
entire set of intermediate key-value pairs, you can perform some part of the reduce function in
the map task itself.
Consider the first map task that produced the intermediate key-value pairs (“The”, 1) (“course”,
1) (“2”, 1) (“of”, 1) (“the”, 1) (“big”, 1) (“data”, 1).
Instead of passing two separate tuples (“the”,1) you can add their values and then transfer a
single tuple (“the”, 2) to the reduce task. If your document had a thousand occurrences of the
word “the”, then you end up saving a lot of time by performing this intermediate aggregation.
This step is typically referred to as the combine phase of the MapReduce framework.
However, the combiner works only if the reduce function is commutative and associative. The
conditions for using a combiner are as follows:
● The reduce function must be independent of the sequence in which it’s executed.
● If the reduce function consists of the averaging operator, then the combine phase fails
to work. This is because the averaging operator is not associative.
A combiner may be used even for the non-commutative and non-associative reduce functions
by writing a separate combiner function.
So in summary the Combiner, also known as the semi-reducer, is an optional class in the
MapReduce style of programming. It is particularly useful when the reduce operation is
commutative as well as associative.
In other words, we say that a reduce function is both commutative and associative if it
produces the same result irrespective of the sequence in which it is executed.
You saw that we could use a combiner for the WordCount program since the reduce function
would add up the values associated with each key, addition being both commutative and
associative. However, if the reduce function were to take an average of the values, then you
would need to write a separate combine function.
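For the WordCount program, since the reduce function is plain addition, the reducer class itself can be reused as the combiner. A minimal sketch of wiring this in the driver, assuming a Job object named job and the illustrative WordCountReducer class sketched earlier:

// Reuse the summing reducer as the combiner; this is valid only because
// addition is both commutative and associative.
job.setCombinerClass(WordCountReducer.class);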
In this session, you understood the execution details of a MapReduce program. In the previous
module, you looked at the HDFS file system. In particular, you saw that there are multiple data
nodes (slaves) and a single NameNode (master) that coordinates with and manages the data
nodes.
Given a mapreduce program, the master node usually assigns either a map task or a reduce
task to a worker node, but not both. The master node is responsible for creating a number of
map tasks and reduce tasks. Each of these numbers is specified by the user program.
The master node maintains a queue of map and reduce tasks along with their status, namely,
IDLE or PROCESSING AT NODE X or COMPLETED. Depending on the availability of worker nodes,
it assigns each idle task to a node.
Now suppose a worker executing a map task fails. The master not only resets all the map tasks
being processed by that worker to IDLE, but also resets the map tasks that were already completed
by that particular node. This needs to be done since the intermediate key-value pairs are stored in
the local disk of the worker node and hence, they are no longer available after its failure.
Therefore, all those map tasks are added to the master’s queue and are then assigned to other
nodes when they become available. Further, the master also informs the respective reduce task
about the updated node from where it has to receive its input.
In case a node executing a reduce task fails, it is much simpler for the master to handle.
It only has to reset the tasks that are being processed to IDLE since the results of the completed
reduce tasks are stored in the HDFS file system and hence, they are still available. These tasks
are later allotted to one of the healthy nodes when they become available. Typically, the
number of map tasks is kept higher than the number of worker nodes.
To be more precise, the number of map tasks is kept close to the number of distinct file chunks.
This helps in better processor utilisation, especially in case of node failures.
If the number of map tasks was equal to the number of worker nodes, then a failure would
require the master node to wait till one of the healthy nodes becomes available to assign the
failed task. This would result in unbalanced and non-optimum processor utilisation.
In summary, you learnt about the execution process in the context of MapReduce 2 in
Hadoop 2.0, which is built on Yet Another Resource Negotiator (YARN). Please note that the basic
idea remains unchanged. However, there are a few changes with regard to the nodes
performing the scheduling operations. The details of MapReduce 1 in Hadoop 1.0 may be
looked up here.
YARN also follows the Master-Slave architecture. The master is called ResourceManager, and
the slave is called NodeManager. Each cluster has exactly one ResourceManager and several
instances of the NodeManager. All instances of NodeManager are usually run on the same
servers running the HDFS DataNodes. However, this is not a mandatory requirement. Similar to
HDFS, the NodeManagers send heartbeat signals to the ResourceManager, thereby
communicating that they are alive. A significant difference between MR 1 and MR 2 is that the
ResourceManager performs only resource management and delegates any work related to
scheduling to an ApplicationMaster that runs on a slave machine. The ApplicationMaster too
sends regular heartbeat signals to the ResourceManager.
3) The ApplicationMaster registers itself with the ResourceManager and then requests for
resources to perform the MapReduce job.
4) The ResourceManager, through the heartbeat signals from NodeManagers and the
existing ApplicationMasters, obtains complete information about the available
resources. The resources such as memory, network bandwidth, and CPU cores are
abstracted in YARN in the form of containers. You can think of containers as workers
performing the tasks (map and reduce) that were explained by the professor. So, the
ResourceManager informs the ApplicationMaster about the completed containers,
newly allocated containers, and the current state of available resources.
5) The ApplicationMaster then sends signals to launch the application on the allocated
containers. In other words, it launches a number of map and reduce tasks (according to
the user inputs or by default) and sends them the required set of instructions contained
in the user program.
b) In case a node executing a reduce task fails, it is much simpler for the
ApplicationMaster to handle. It only has to reset the tasks that were being
processed by that node to become idle, since the results of the completed
reduce tasks are stored in the HDFS file system, and hence, they are still
available. These tasks are allotted to healthy nodes later when they become
available.
d) In case the master node hosting the ResourceManager fails, the entire cluster
becomes unusable until it is restarted, i.e. the ResourceManager is the Single
Point Of Failure (SPOF) in a Hadoop cluster.
7) On successful completion of the MapReduce job, the ApplicationMaster notifies the
ResourceManager. The ResourceManager keeps a log of the execution including the
location of the final output, which may be read by the client.
Typically, the number of map tasks is kept higher than the number of DataNodes. The default
number of map tasks equals the number of distinct file chunks. The user can change this default
by modifying the input split size. For instance, you may specify an input split size of
64 MB instead of the default block size of 128 MB, and the number of map tasks spawned
would then become double the default number. This helps in better processor utilisation,
especially in case of node failures.
If the number of map tasks was equal to the number of nodes, then a failure would require the
ApplicationMaster to wait until one of the healthy nodes becomes available to get assigned to
the failed task. This would result in an unbalanced and non-optimum processor utilisation. On
the other hand, if the number of map tasks was higher, then all the map tasks residing on a
failed node can be distributed among the available nodes, thereby maintaining a higher level of
resource utilisation.
However, you need to remember that the number of reduce tasks is always kept low. The
default is to have one reduce task. This is because the map tasks create intermediate files or
partitions on their local file system corresponding to the number of reduce tasks. A large
number of reduce tasks will result in a large number of such partitions, which is undesirable in
HDFS.
The data type of a value or variable is an attribute that tells what kind of data that value or
variable should have. There are 8 primitive data types in Java -
● int
● byte
● short
● long
● float
● double
● boolean
● char
Certain important object-oriented concepts, such as Collections, operate on 'objects' instead of
fundamental data types. Java supports wrapper classes around each basic data type, which are
used to instantiate basic data type values or variables as objects. The compiler takes care of
converting wrapper classes to data types, which is called unboxing, and data types to wrapper
classes, which is called autoboxing. Every data type has a corresponding wrapper class -
● ‘Integer’ for ‘int’
● ‘Byte’ for ‘byte’
● ‘Short’ for ‘short’
● ‘Long’ for ‘long’
● ‘Float’ for ‘float’
● ‘Double’ for ‘double’
● ‘Character’ for ‘char’
● ‘Boolean’ for ‘boolean’
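For instance, a small plain-Java illustration of autoboxing and unboxing:

Integer boxedCount = 42;    // autoboxing: the compiler wraps the int literal 42 in an Integer object
int rawCount = boxedCount;  // unboxing: the compiler extracts the primitive int value back out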
Recall that String is not a primitive data type in Java. This is because String was designed to be
immutable and, to achieve this, it was created as a class in the first place. Immutable
means that once we create an object of that class, we cannot change its content. Hence, a
String object, once created, cannot be changed. In other words, whenever we call
methods on a String, those methods always create a new String.
Hadoop deals in objects and collections. Hence, wrapper classes become a very important part
of Hadoop. However, we don’t directly use the wrapper classes as data types. Instead, we use
what we will call 'Boxed' classes, which are an extension of the wrapper classes to account for
more specialised Hadoop constructs such as serialisation. The standard box classes that are
actually used in our code include ‘IntWritable’ for ‘int’, ‘LongWritable’ for ‘long’, ‘FloatWritable’
for ‘float’, ‘DoubleWritable’ for ‘double’, ‘BooleanWritable’ for ‘boolean’, and ‘Text’ for ‘String’.
An important point to note here is that, unlike Java wrapper classes, for boxed classes the
programmer must perform boxing and unboxing explicitly using built-in methods.
For int to IntWritable conversion, we use the IntWritable() constructor that takes an int
variable as its argument.
For IntWritable to int conversion, we use the get() method.
Similarly, we can convert other data types to their corresponding box classes and vice
versa. The String data type presents an exception.
For String to Text conversion, we use the Text() constructor.
However, for Text to String conversion, we use the toString() method.
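A small sketch of these conversions using Hadoop's box classes (variable names are illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// int <-> IntWritable
IntWritable boxedInt = new IntWritable(5);   // boxing via the IntWritable() constructor
int plainInt = boxedInt.get();               // unboxing via get()

// String <-> Text
Text boxedText = new Text("big data");       // boxing via the Text() constructor
String plainString = boxedText.toString();   // unboxing via toString()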
Serialization is the act of converting an object into a byte stream. Deserialization implies
converting a byte stream to an object in memory. Serialization is done when transferring data
over a network or storing it on disk. Serializing an object to bytes is necessary as the network
infrastructure or disk only understands bits and bytes and not a java object.
The java serializer is not efficient enough when it comes to dealing with a lot of data. It is heavy
and increases the network or disk transfer overheads thereby slowing down large data
processing jobs. In contrast, Hadoop’s 'writable' interface is designed to reduce the data size
overhead and make the data transfer easier in the network or on disk.
To understand the differences between Java’s native serializer and Hadoop’s Writable interface
in more detail, refer to this link here.
Implementing the Writable interface significantly reduces the time to transfer data over the
network. Writable is a core interface in Hadoop which serializes the data and reduces the
data size, making data transfers more efficient.
The Writable interface contains two separate methods, write() and readFields(), to serialize an
object into a byte stream and to read it back from a byte stream, respectively.
The box classes implement not only the writable interface but also the comparable interface.
The in-built intermediate shuffle and sort phase of map-reduce requires a comparator for the
involved keys. Keys of type, say, IntWritable need to be compared with each other during the
shuffle-sort phase. This comparator requirement is realized through the comparable interface.
Any data type which is to be used as a key in the Hadoop Map-Reduce framework should
implement the comparable interface.
The WritableComparable interface is a subinterface that inherits from both the Writable and
Comparable interfaces.
When transferring data between nodes, these methods play a vital role.
You can also create your own custom writable and comparable types using these interfaces.
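For illustration, here is a minimal sketch of a custom key type implementing WritableComparable; the WordYearKey class and its fields are hypothetical and not part of the lecture:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: a (word, year) pair usable as a MapReduce key.
public class WordYearKey implements WritableComparable<WordYearKey> {
    private String word;
    private int year;

    public WordYearKey() { }                        // required no-arg constructor
    public WordYearKey(String word, int year) { this.word = word; this.year = year; }

    @Override
    public void write(DataOutput out) throws IOException {    // serialization
        out.writeUTF(word);
        out.writeInt(year);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialization
        word = in.readUTF();
        year = in.readInt();
    }

    @Override
    public int compareTo(WordYearKey other) {                 // used during shuffle-sort
        int cmp = word.compareTo(other.word);
        return (cmp != 0) ? cmp : Integer.compare(year, other.year);
    }
}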
Let’s now understand some details around map-reduce’s InputFormats. We will use the Java
code in the WordCount Driver class for this purpose.
The input to the WordCount program consists of a simple text file. However, Mappers and
Reducers only understand data as key-value pairs. Thus, data read from the text file must be
converted to Key-Value pairs before being fed into the mappers. Intuitively, there are 3 key
pieces of information required by the system to realize this conversion:
1. The system must be told that the input is contained in *files*
2. The mapper must know that the files contain *text* data
3. The mapper must be able to identify *records* within a block of text and extract keys and
values from a record
Hadoop’s set of predefined InputFormat classes help a programmer easily provide this
information and read data seamlessly from a variety of sources. There are several in-built
InputFormats.
FileInputFormat :
FileInputFormat is the base class for all file-based InputFormats, i.e. when the data is contained in
files on HDFS. We provide FileInputFormat with the path to the input file or the directory
containing the input files.
FileInputFormat divides the given files into Input Splits. Each input split is assigned to a single
mapper for processing. Hence, the number of map tasks is equal to the number of input splits. The
default split size is 64 MB in Apache Hadoop and 128 MB in Cloudera Hadoop. The split size can be
set through mapred.min.split.size and mapred.max.split.size parameters in mapred-site.xml. It
can also be controlled by overriding these parameters in the Job object used to submit a
particular MapReduce job. By setting the split size we can decide the number of mappers for
our job.
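For illustration, in the new API the split size bounds can also be set programmatically through FileInputFormat's helper methods on the Job object; a hedged sketch (the sizes shown are just examples):

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Assuming 'job' is the Job object used to submit this MapReduce job:
FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);    // lower bound: 64 MB
FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);   // upper bound: 128 MB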
Once the files are split and assigned to mappers, each map task must read data from the input
split it has been assigned. Here, an input split contains a block of text. Thus, the mapper uses
Hadoop’s TextInputFormat class to parse and read data from the given split. TextInputFormat is
also the default InputFormat for Hadoop and extends the FileInputFormat base class. The
InputFormat for a job can be programmatically set using -
job.setInputFormatClass(TextInputFormat.class);
To break a given input split or block of text into record-level key-value pairs, the programmer
must specify a RecordReader. The RecordReader takes the byte-oriented view of input
provided by the InputSplit, and presents a record-oriented view to the Mapper. It uses the data
within the boundaries that were created by the InputSplit and creates Key-value pairs that can
be understood and processed by the Map function.
The TextInputFormat class uses the LineRecordReader, which treats
each line of the input split as a separate record. Furthermore, it uses the byte-offset of the
beginning of the line in a split as the key and the complete line as the value. There are several
other InputFormats, such as KeyValueTextInputFormat and NLineInputFormat, for different
purposes and data.
KeyValueTextInputFormat : This format identifies each line as a record, but creates key-value
pairs differently. It breaks the input line at the tab character (‘\t’) and assigns everything up to the
first tab character as the key. The remaining part of the line is interpreted as the value.
NLineInputFormat : This format is the same as TextInputFormat, except that one can define the
exact number of lines or key-value pairs in an input split. By default, N=1.
There are several other InputFormats for different purposes and data. One can also create a
custom Input format.
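For illustration, switching the InputFormat in the driver might look like the following sketch, assuming a Job object named job:

import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

// Tab-separated records: everything before the first tab becomes the key.
job.setInputFormatClass(KeyValueTextInputFormat.class);

// Or: a fixed number of lines per input split (here, 100 lines per mapper).
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 100);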
After the reducers finish processing the data, the final key-value pairs must be written to a text
file so that they can be stored on HDFS. A RecordWriter from the TextOutputFormat class
writes these output key-value pairs as text to output files.
Just like Input Format, Output Format determines certain properties, such as type, of the final
output written by the RecordWriter to output files. Typically, the MapReduce framework writes
data only to new, non existent, output paths to avoid overwriting existing data. The
FileOutputFormat.setOutputPath() method is used to set the output directory. It checks the
existence of the given path to avoid overwriting. FileOutputFormat is the base class for all
file-based OutputFormat implementations. Every Reducer writes a separate file in this common
output directory.
An HDFS block is a physical representation of data on storage, while an Input Split is a logical
representation of the input data at runtime. InputSplit contains a reference to the data within a
certain range. FileInputFormat divides the given input files into logical chunks called Input Splits.
Hence, the number of input splits = (Total data size)/(split size). Each input split is assigned to a
single mapper for processing and hence, the number of map tasks = number of input splits.
Input Split size is a user defined value that can be tuned based on the overall size of data. By
setting the split size you can set the number of input splits and thereby the number of map tasks
for a job. By default, the input split size is equal to the block size.
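For example, a 1 GB (1024 MB) input with a 128 MB split size yields 1024/128 = 8 input splits and hence 8 map tasks; lowering the split size to 64 MB doubles this to 16 map tasks.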
In general, more map tasks imply greater parallelism, which incentivises lowering the split size.
However, the total time taken by a job depends on the processing time at mappers and
reducers, the setup time and the time taken to transfer the data over the network. Thus, we
need to carefully choose a split size such that the total time for job completion is minimized
taking into account processing, setup and networking factors. The default values of
64MB(Apache) and 128MB(Cloudera) are chosen based on typical workloads. They generally
give good performance for the most common Hadoop applications.
We can also set the number of reducers to zero. Such a job is called a map-only job and the
outputs of the map-tasks go directly to the FileSystem at the output path set by
FileOutputFormat.setOutputPath(Job, Path). The framework does not sort such map-only
outputs before writing them to the FileSystem.
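A minimal sketch of configuring such a map-only job, assuming a Job object named job and a hypothetical output path:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

job.setNumReduceTasks(0);   // zero reducers: map outputs go straight to the output path
FileOutputFormat.setOutputPath(job, new Path("/user/demo/wordcount-map-only"));  // hypothetical path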
Hadoop has a set of APIs which define the flow of a mapreduce program.
A job is the complete flow of execution of a mapreduce program from start to finish. A job
starts with reading input data from HDFS. It then processes the data according to the map and
the reduce functions. A job ends with writing output files on HDFS.
The Job class helps the user create a job, describe various parameters of the job, control its
execution and then submit the job. It also allows the user to monitor the progress of a submitted
job. In other words, the Hadoop framework executes a map-reduce application as described by
the corresponding Job class.
You can configure a job’s parameters using the Job class. Initially, we instantiate a new Job
object that takes the Hadoop configuration. A configuration contains information about
resources. Default resources are specified in core-default.xml and
core-site.xml. These files contain the read-only defaults and the site-specific configurations of the
Hadoop installation, respectively.
There are several methods defined in the job class which can be used to either configure a job,
or monitor it. Some of the important ones are -
1. setJobName() - You set the job name using the setJobName method.
2. setJarByClass() - Through the setJarByClass API, the user helps Hadoop identify the main
driver jar or class for a MapReduce job. Note that the jar files are put on HDFS so that all
nodes can access them.
3. setOutputKeyClass() - You specify the output key format using setOutputKeyClass.
4. setOutputValueClass() - You specify the output value format using setOutputValueClass.
5. setMapOutputKeyClass() - You can specifically set the map output key class separately.
If not, it defaults to the class that is passed earlier to setOutputKeyClass. This API must
be used if the mapper emitted keys differ in format from the final output.
6. setMapOutputValueClass() - Similarly, you can explicitly specify the map output value
class. If not set, it defaults to the class set as the overall output value class. This API
must be used if the mapper emitted values differ in format from the final output.
7. setMapperClass() - The setMapperClass method is provided to set the mapper for the
job. The mappers will process the data according to the class passed here as the
argument.
8. setCombinerClass() - This api sets the combiner for the job. This step is optional, but
recommended to perform local aggregation of the intermediate outputs, which helps to
cut down the amount of data transferred from the Mapper to the Reducer.
9. setReducerClass() - The setReducerClass method is used to set the reducer for the job.
The reducers will process the intermediate output from the mappers according to the
class passed here as the argument.
10. FileInputFormat.addInputPath(job,<path>) - You provide the path to the input file
through the addInputPath method of the FileInputFormat class. You can also specify this
on the command line along with the command to run this job.
11. FileOutputFormat.setOutputPath(job,<path>) - This api is used to provide the output
path. This may also be provided in the command line.
12. job.waitForCompletion(true) - The waitForCompletion method submits the job and then
polls for progress until the job is complete.
There are several other methods defined in the job class. We can use these methods to either
configure a job, or monitor it.
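Putting the methods listed above together, a minimal sketch of a WordCount driver class (it assumes the illustrative WordCountMapper and WordCountReducer classes sketched earlier, with the input and output paths supplied on the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");    // job name set at creation (or via setJobName)

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);        // mapper sketched earlier
        job.setCombinerClass(WordCountReducer.class);     // optional local aggregation
        job.setReducerClass(WordCountReducer.class);      // reducer sketched earlier
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));     // input path from the command line
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output path from the command line

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}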
If you encounter JobConf instead of the Job class, then you are looking at the older MapReduce
package and APIs. The newer package replaces JobConf with the Job class.
The driver class implements the Tool interface and extends the Configured class. Configured is
the base class required to configure a Hadoop job. The Configured class has two main methods -
getConf() and setConf(). getConf() returns the configuration used by the object, while setConf()
sets the configuration to be used by the object.
Tool is an interface that supports handling of generic command-line options. With the help of
Tool we can define job parameters from the command line. A job is submitted for execution
using the ToolRunner.run API.
It is not mandatory to use the Configured class and Tool interface. However, using them makes
your code more portable and cleaner as you do not need to hardcode specific configurations.
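A minimal sketch of a driver that extends Configured and implements Tool (the class name is illustrative, and the job setup details are elided):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountTool extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() returns the configuration populated by ToolRunner,
        // including any generic -D options passed on the command line.
        Job job = Job.getInstance(getConf(), "word count");
        // ... set the jar, mapper, reducer, output classes and paths as in the driver above ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new WordCountTool(), args);
        System.exit(exitCode);
    }
}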
MapReduce programming skills may be developed only through practice. You can refer here to
run your code. You may further refer to session 3 and 4 for some code demonstrations.