Lecture Notes Map Reduce
In the previous module, you learnt about HDFS and then proceeded to learn how to write programs that can be executed on a Hadoop
cluster. In this module, we discussed the MapReduce programming model, and you
also learnt the working principles of the MapReduce framework.
In this segment, you understood the MapReduce programming model through the well-known
example of counting the occurrences of words in a document. The problem was as follows: You
are given a huge list of documents and asked to list all the words along with their frequency.
For instance, the text “The course 2 of the big data program is the most important part of the
entire program” should return (“the”, 4) (“course”, 1) (“2”, 1) (“of”, 2) (“big”,1) (“data”, 1)
(“program”, 2) (“is”, 1) (“most”, 1) (“important”, 1) (“part”, 1) (“entire”, 1).
Since you are working in a distributed environment, different parts of the documents are stored
across multiple nodes that are connected through a communication network.
The mapreduce programming model utilizes the data-parallel model during the map phase and
the inverse tree-parallel model for aggregating results in the reduce phase.
Essentially, a mapreduce program executes as follows:
1. The master node initiates some number of map tasks depending on the number of file
chunks.
2. Each map task takes one or more file chunks as input and outputs a sequence of
intermediate key-value pairs.
3. The master node collects all the intermediate key-value pairs from the map tasks and
sorts them by key.
4. It then sends them to the reduce tasks in such a way that all key-value pairs with the
same key go to the same reduce task.
5. Each reduce task aggregates the entire set of values associated with a key, typically
using a combination of binary operators.
Assume that you have three separate nodes storing each file chunk. In the first step, you write a
map function that takes text as input and outputs key-value pairs of the form (word, 1) for each
word in the chunk. The master node sees three file chunks and hence initiates three map tasks.
These map tasks run in parallel on the three nodes. The first map task reads the input “The
course 2 of the big data” and produces the following key-value pairs: (The, 1) (course, 1) (2, 1)
(of, 1) (the, 1) (big, 1) (data, 1). The second map task reads the input “program is the most
important” and outputs (program, 1) (is, 1) (the, 1) (most, 1) (important, 1) and similarly for the
third map task.
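The map function described here can be written as a Hadoop Mapper. Below is a minimal sketch, assuming the new (org.apache.hadoop.mapreduce) API; the class name WordCountMapper is illustrative, and the call to toLowerCase() is an assumption so that “The” and “the” are counted together, as in the expected output.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key   = byte offset of the current line in the file chunk
        // value = the line of text itself
        StringTokenizer tokenizer = new StringTokenizer(value.toString().toLowerCase());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);   // emit (word, 1) for every word in the line
        }
    }
}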
In the second step, the intermediate key-value pairs are grouped by the key. The master node
allocates a certain number of reduce tasks as per the user’s instruction, let’s say it is r. It then
uses a hash function to distribute the key-value pairs among the reduce tasks.
In our example, if there are 2 reducers numbered 0 and 1, the master node uses a hash code
followed by a mod 2 operation to find the reducer corresponding to a key. The result of this
operation would be similar to what is shown here: All the key-value pairs with keys “the”,
“course”, “data”, “important”, “part”, “entire” should be sent to reduce task 0 and the rest
should be sent to reduce task 1.
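Conceptually, this key-to-reducer assignment works like Hadoop's default HashPartitioner. A minimal sketch of the idea for our word-count keys and values (the class name is illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of hash-based partitioning, in the spirit of Hadoop's default HashPartitioner.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Hash the key, drop the sign bit, and take mod r (here r = numReduceTasks = 2),
        // so that all pairs with the same key land on the same reduce task.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}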
Based on this distribution, each node creates separate partitions to store the intermediate
key-value pairs in its local disk. Node 1 creates two partitions. The first partition contains the
key-value pairs (“The”, 1) (“course”, 1) (“the”, 1) (“data”, 1), and the second partition contains
(“2”, 1) (“of”, 1) and (“big”, 1).
Similarly nodes 2 and 3 create two partitions each. The key-value pairs in each partition are
then sorted according to their keys. As you can see, this naturally results in the grouping of the
key-value pairs. Subsequently, the master node collects all the key-value pairs with a single key
from all the map tasks, and sends the merged list to the respective reduce task.
For example, the master node collects two key-value pairs (the,1) from the first map task, one
(the,1) from the second map task and another (the,1) from the third map task, and merges
them into a list of the form (the, [1,1,1,1]), and then sends it to the corresponding reduce task,
i.e. reduce task 0 in our example.
There is a special name for the application of the reduce function to a single key and its
associated values: it is called a reducer. So, each reduce task executes multiple reducers.
For instance, reduce task 0 first executes a reducer that adds the four 1’s associated with the
key “the” and outputs (“the”,4). For the next key “course”, there is nothing to be added since it
has only one value. Similarly, reduce task 1 executes a reducer that takes (2, [1]) as input and
outputs (2,1) and so on.
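A minimal sketch of the corresponding reducer, again assuming the new API and the illustrative class name WordCountReducer:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // e.g. key = "the", values = [1, 1, 1, 1]
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();       // the binary operator here is simple addition
        }
        context.write(key, new IntWritable(sum));   // emit ("the", 4)
    }
}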
In summary the following steps are involved in the execution of a MapReduce program:
1. The master node initiates a number of map tasks depending on the number of file
chunks. All inputs and outputs in MapReduce are in the form of key-value pairs. As you
will see later, the text document is also input in the form of key-value pairs. Each map
task can take multiple file chunks in the form of key-value pairs as inputs and outputs a
sequence of intermediate key-value pairs, which look like the following:
(K,V) → (K1,V1)(K2,V2)…
(K′,V′) → (K′1,V′1)(K′2,V′2)…
⋮
(K′′,V′′) → (K′′1,V′′1)(K′′2,V′′2)…
Note that it is the user who describes, in the form of a Map function, how these useful
intermediate key-value pairs are extracted from the input data. Typically, there would be several
common keys among these intermediate key-value pairs. In the next step, the intermediate
key-value pairs that have common keys are grouped together.
2. The map tasks partition and sort all the intermediate key-value pairs by their keys.
These are later merged by the Master node and sent to the reduce tasks in such a way
that all key-value pairs with the same key arrive at the same reduce task in the form of a
key and an iterable list of all its associated values.
Suppose that there are only three distinct keys, K1, K2, and K3, in the entire collection of
intermediate key-value pairs that you obtained in step 1. Then, this step may be represented in
the following manner:
(K1,V1)(K2,V2)…             (K1,V11)(K1,V12)…      (K1,[V11,V12,…])
(K′1,V′1)(K′2,V′2)…    →    (K2,V21)(K2,V22)…   →  (K2,[V21,V22,…])
(K′′1,V′′1)(K′′2,V′′2)…     (K3,V31)(K3,V32)…      (K3,[V31,V32,…])
If you assume that there are two reducers, then the three groups shown above are distributed
between these two reducers using a hash function on the keys. If hash(K1)=hash(K3)=0 and
hash(K2)=1, then the groups with the keys K1 and K3 go to reduce task 0, and the group with
the key K2 goes to reduce task 1.
3. Each reduce task aggregates the entire set of values associated with a key, typically, by
using a combination of binary operators. Again, it’s the user who describes this process
in the form of a Reduce function. By denoting the binary operator with an asterisk (*),
this step may be represented in the following manner:
Reduce task 0: (K1,[V11,V12,…]) → (K1, V11∗V12∗…)
               (K3,[V31,V32,…]) → (K3, V31∗V32∗…)
Reduce task 1: (K2,[V21,V22,…]) → (K2, V21∗V22∗…)
In the previous segment, you saw that all the intermediate key value pairs are transferred to
the reduce tasks through the master node. This doesn’t look daunting for a small example that
you have considered here. However, if you were to perform word count at the scale of millions
of documents, these intermediate key-value pairs may themselves be huge in number.
The MapReduce framework provides a way to reduce this data transfer. Instead of transferring the
entire set of intermediate key-value pairs, you can perform some part of the reduce function in
the map task itself.
Consider the first map task that produced the intermediate key-value pairs (“The”, 1) (“course”,
1) (“2”, 1) (“of”, 1) (“the”, 1) (“big”, 1) (“data”, 1).
Instead of passing two separate tuples (“the”,1) you can add their values and then transfer a
single tuple (“the”, 2) to the reduce task. If your document had a thousand occurrences of the
word “the”, then you end up saving a lot of time by performing this intermediate aggregation.
This step is typically referred to as the combine phase of the MapReduce framework.
However, the combiner works only if the reduce function is commutative and associative. The
conditions for using a combiner are as follows:
● The reduce function must be independent of the sequence in which it’s executed.
● If the reduce function consists of the averaging operator, then the combine phase fails
to work. This is because the averaging operator is not associative.
A combiner may be used even for the non-commutative and non-associative reduce functions
by writing a separate combiner function.
So in summary the Combiner, also known as the semi-reducer, is an optional class in the
MapReduce style of programming. It is particularly useful when the reduce operation is
commutative as well as associative.
In other words, we say that a reduce function is both commutative and associative if it
produces the same result irrespective of the sequence in which it is executed.
You saw that we could use a combiner for the WordCount program since the reduce function
would add up the values associated with each key, addition being both commutative and
associative. However, if the reduce function were to take an average of the values, then you
would need to write a separate combine function.
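For the WordCount program, since the reduce function is plain addition, the reducer class itself can be reused as the combiner. A minimal sketch of wiring this in the driver, assuming a Job object named job and the illustrative WordCountReducer class sketched earlier:

// Reuse the summing reducer as the combiner; this is valid only because
// addition is both commutative and associative.
job.setCombinerClass(WordCountReducer.class);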
In this session, you understood the execution details of a MapReduce program. In the previous
module, you looked at the HDFS file system. In particular, you saw that there are multiple data
nodes (slaves) and a single NameNode (master) that coordinates with and manages the data
nodes.
Given a mapreduce program, the master node usually assigns either a map task or a reduce
task to a worker node, but not both. The master node is responsible for creating a number of
map tasks and reduce tasks. Each of these numbers is specified by the user program.
The master node maintains a queue of map and reduce tasks along with their status, namely,
IDLE or PROCESSING AT NODE X or COMPLETED. Depending on the availability of worker nodes,
it assigns each idle task to a node.
Now suppose a worker executing a map task fails. The master not only resets all the map tasks
being processed by that worker to IDLE, but also resets the map tasks that were already completed
by that particular node. This needs to be done since the intermediate key-value pairs are stored in
the local disk of the worker node and hence, they are no longer available after its failure.
Therefore, all those map tasks are added to the master’s queue and are then assigned to other
nodes when they become available. Further, the master also informs the respective reduce task
about the updated node from where it has to receive its input.
In case a node executing a reduce task fails, it is much simpler for the master to handle.
It only has to reset the tasks that are being processed to IDLE since the results of the completed
reduce tasks are stored in the HDFS file system and hence, they are still available. These tasks
are later allotted to one of the healthy nodes when they become available. Typically, the
number of map tasks is kept higher than the number of worker nodes.
To be more precise, the number of map tasks is kept close to the number of distinct file chunks.
This helps in better processor utilisation, especially in case of node failures.
If the number of map tasks was equal to the number of worker nodes, then a failure would
require the master node to wait till one of the healthy nodes becomes available to assign the
failed task. This would result in unbalanced and non-optimum processor utilisation.
In summary, you learnt about the execution process in the context of MapReduce 2 in
Hadoop 2.0, which is built on Yet Another Resource Negotiator (YARN). Please note that the basic
idea remains unchanged. However, there are a few changes with regard to the nodes
performing the scheduling operations. The details of MapReduce 1 in Hadoop 1.0 may be
looked up here.
YARN also follows the Master-Slave architecture. The master is called ResourceManager, and
the slave is called NodeManager. Each cluster has exactly one ResourceManager and several
instances of the NodeManager. All instances of NodeManager are usually run on the same
servers running the HDFS DataNodes. However, this is not a mandatory requirement. Similar to
HDFS, the NodeManagers send heartbeat signals to the ResourceManager, thereby
communicating that they are alive. A significant difference between MR 1 and MR 2 is that the
ResourceManager performs only resource management and delegates any work related to
scheduling to an ApplicationMaster that runs on a slave machine. The ApplicationMaster too
sends regular heartbeat signals to the ResourceManager.
3) The ApplicationMaster registers itself with the ResourceManager and then requests for
resources to perform the MapReduce job.
4) The ResourceManager, through the heartbeat signals from NodeManagers and the
existing ApplicationMasters, obtains complete information about the available
resources. The resources such as memory, network bandwidth, and CPU cores are
abstracted in YARN in the form of containers. You can think of containers as workers
performing the tasks (map and reduce) that were explained by the professor. So, the
ResourceManager informs the ApplicationMaster about the completed containers,
newly allocated containers, and the current state of available resources.
5) The ApplicationMaster then sends signals to launch the application on the allocated
containers. In other words, it launches a number of map and reduce tasks (according to
the user inputs or by default) and sends them the required set of instructions contained
in the user program.
b) In case a node executing a reduce task fails, it is much simpler for the
ApplicationMaster to handle. It only has to reset the tasks that were being
processed by that node to become idle, since the results of the completed
reduce tasks are stored in the HDFS file system, and hence, they are still
available. These tasks are allotted to healthy nodes later when they become
available.
d) In case the master node hosting the ResourceManager fails, the entire cluster
becomes unusable until it is restarted, i.e. the ResourceManager is the Single
Point Of Failure (SPOF) in a Hadoop cluster.
7) On successful completion of the MapReduce job, the ApplicationMaster notifies the
ResourceManager. The ResourceManager keeps a log of the execution including the
location of the final output, which may be read by the client.
Typically, the number of map tasks is kept higher than the number of DataNodes. The default
number of map tasks equals the number of distinct file chunks. The user can change this default
by modifying the input split size. For instance, you may specify an input split size of
64 MB instead of the default block size of 128 MB, and the number of map tasks spawned
would then become double the default number. This helps in better processor utilisation,
especially in case of node failures.
If the number of map tasks was equal to the number of nodes, then a failure would require the
ApplicationMaster to wait until one of the healthy nodes becomes available to get assigned to
the failed task. This would result in an unbalanced and non-optimum processor utilisation. On
the other hand, if the number of map tasks was higher, then all the map tasks residing on a
failed node can be distributed among the available nodes, thereby maintaining a higher level of
resource utilisation.
However, you need to remember that the number of reduce tasks is always kept low. The
default is to have one reduce task. This is because the map tasks create intermediate files or
partitions on their local file system corresponding to the number of reduce tasks. A large
number of reduce tasks will result in a large number of such partitions, which is undesirable in
HDFS.
The data type of a value or variable is an attribute that tells what kind of data that value or
variable should have. There are 8 primitive data types in Java -
● int
● byte
● short
● long
● float
● double
● boolean
● char
Certain important object-oriented concepts, such as Collections, operate on 'objects' instead of
fundamental data types. Java supports wrapper classes around each basic data type, which are
used to instantiate basic data type values or variables as objects. The compiler takes care of
converting wrapper classes to data types, which is called unboxing, and data types to wrapper
classes, which is called autoboxing. Every data type has a corresponding wrapper class -
● ‘Integer’ for ‘int’
● ‘Byte’ for ‘byte’
● ‘Short’ for ‘short’
● ‘Long’ for ‘long’
● ‘Float’ for ‘float’
● ‘Double’ for ‘double’
● ‘Character’ for ‘char’
● ‘Boolean’ for ‘boolean’
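For instance, a small plain-Java illustration of autoboxing and unboxing:

Integer boxedCount = 42;    // autoboxing: the compiler wraps the int literal 42 in an Integer object
int rawCount = boxedCount;  // unboxing: the compiler extracts the primitive int value back out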
Recall that String is not a primitive data type in Java. This is because String was designed to be
immutable and, to achieve this, it was created as a class in the first place. Immutable
means that once we create an object of that class, we cannot change its content. Hence, a
String object, once created, cannot be changed. In other words, whenever we call
methods on a String, those methods always create a new String.
Hadoop deals in objects and collections. Hence, wrapper classes become a very important part
of Hadoop. However, we don’t directly use the wrapper classes as data types. Instead, we use
what we will call 'Boxed' classes, which are an extension of the wrapper classes to account for
more specialised Hadoop constructs such as serialisation. The standard box classes that are
actually used in our code include ‘IntWritable’ for ‘int’, ‘LongWritable’ for ‘long’, ‘FloatWritable’
for ‘float’, ‘DoubleWritable’ for ‘double’, ‘BooleanWritable’ for ‘boolean’, and ‘Text’ for ‘String’.
An important point to note here is that, unlike Java wrapper classes, for boxed classes the
programmer must perform boxing and unboxing explicitly using built-in methods.
For int to IntWritable conversion, we use the IntWritable() constructor that takes an int
variable as its argument.
For IntWritable to int conversion, we use the get() method.
Similarly, we can convert other data types to their corresponding box classes and vice
versa. The String data type presents an exception.
For String to Text conversion, we use the Text() constructor.
However, for Text to String conversion, we use the toString() method.
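A small sketch of these conversions using Hadoop's box classes (variable names are illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// int <-> IntWritable
IntWritable boxedInt = new IntWritable(5);   // boxing via the IntWritable() constructor
int plainInt = boxedInt.get();               // unboxing via get()

// String <-> Text
Text boxedText = new Text("big data");       // boxing via the Text() constructor
String plainString = boxedText.toString();   // unboxing via toString()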
Serialization is the act of converting an object into a byte stream. Deserialization implies
converting a byte stream to an object in memory. Serialization is done when transferring data
over a network or storing it on disk. Serializing an object to bytes is necessary as the network
infrastructure or disk only understands bits and bytes and not a java object.
The java serializer is not efficient enough when it comes to dealing with a lot of data. It is heavy
and increases the network or disk transfer overheads thereby slowing down large data
processing jobs. In contrast, Hadoop’s 'writable' interface is designed to reduce the data size
overhead and make the data transfer easier in the network or on disk.
To understand the differences between Java’s native serializer and Hadoop’s Writable interface
in more detail, refer to this link here.
Implementing the Writable interface significantly reduces the time to transfer data over the
network. Writable is a core interface in Hadoop which serializes the data and reduces the
data size, making data transfers more efficient.
The Writable interface contains two separate methods, write() and readFields(), to serialize an
object into a byte stream and to read it back from a byte stream, respectively.
The box classes implement not only the writable interface but also the comparable interface.
The in-built intermediate shuffle and sort phase of map-reduce requires a comparator for the
involved keys. Keys of type, say, IntWritable need to be compared with each other during the
shuffle-sort phase. This comparator requirement is realized through the comparable interface.
Any data type which is to be used as a key in the Hadoop Map-Reduce framework should
implement the comparable interface.
The WritableComparable interface is a subinterface that inherits from both the Writable and
Comparable interfaces.
When transferring data between nodes, these methods play a vital role.
You can also create your own custom writable and comparable types using these interfaces.
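For illustration, here is a minimal sketch of a custom key type implementing WritableComparable; the WordYearKey class and its fields are hypothetical and not part of the lecture:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: a (word, year) pair usable as a MapReduce key.
public class WordYearKey implements WritableComparable<WordYearKey> {
    private String word;
    private int year;

    public WordYearKey() { }                        // required no-arg constructor
    public WordYearKey(String word, int year) { this.word = word; this.year = year; }

    @Override
    public void write(DataOutput out) throws IOException {    // serialization
        out.writeUTF(word);
        out.writeInt(year);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialization
        word = in.readUTF();
        year = in.readInt();
    }

    @Override
    public int compareTo(WordYearKey other) {                 // used during shuffle-sort
        int cmp = word.compareTo(other.word);
        return (cmp != 0) ? cmp : Integer.compare(year, other.year);
    }
}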
Let’s now understand some details around map-reduce’s InputFormats. We will use the Java
code in the WordCount Driver class for this purpose.
The input to the WordCount program consists of a simple text file. However, Mappers and
Reducers only understand data as key-value pairs. Thus, data read from the text file must be
converted to Key-Value pairs before being fed into the mappers. Intuitively, there are 3 key
pieces of information required by the system to realize this conversion:
1. The system must be told that the input is contained in *files*
2. The mapper must know that the files contain *text* data
3. The mapper must be able to identify *records* within a block of text and extract keys and
values from a record
Hadoop’s set of predefined InputFormat classes help a programmer easily provide this
information and read data seamlessly from a variety of sources. There are several in-built
InputFormats.
FileInputFormat :
FileInputFormat is the base class for all file-based InputFormats, i.e. when the data is contained in
files on HDFS. We provide FileInputFormat with the path to the input file or the directory
containing the input files.
FileInputFormat divides the given files into Input Splits. Each input split is assigned to a single
mapper for processing. Hence, the number of map tasks is equal to the number of input splits. The
default split size is 64 MB in Apache Hadoop and 128 MB in Cloudera Hadoop. The split size can be
set through mapred.min.split.size and mapred.max.split.size parameters in mapred-site.xml. It
can also be controlled by overriding these parameters in the Job object used to submit a
particular MapReduce job. By setting the split size we can decide the number of mappers for
our job.
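For illustration, in the new API the split size bounds can also be set programmatically through FileInputFormat's helper methods on the Job object; a hedged sketch (the sizes shown are just examples):

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Assuming 'job' is the Job object used to submit this MapReduce job:
FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);    // lower bound: 64 MB
FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);   // upper bound: 128 MB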
Once the files are split and assigned to mappers, each map task must read data from the input
split it has been assigned. Here, an input split contains a block of text. Thus, the mapper uses
Hadoop’s TextInputFormat class to parse and read data from the given split. TextInputFormat is
also the default InputFormat for Hadoop and extends the FileInputFormat base class. The
InputFormat for a job can be programmatically set using -
job.setInputFormatClass(TextInputFormat.class);
To break a given input split or block of text into record-level key-value pairs, the programmer
must specify a RecordReader. The RecordReader takes the byte-oriented view of input
provided by the InputSplit, and presents a record-oriented view to the Mapper. It uses the data
within the boundaries that were created by the InputSplit and creates Key-value pairs that can
be understood and processed by the Map function.
The TextInputFormat class uses the LineRecordReader, which treats
each line of the input split as a separate record. Furthermore, it uses the byte-offset of the
beginning of the line in a split as the key and the complete line as the value. There are several
other InputFormats, such as KeyValueTextInputFormat and NLineInputFormat, for different
purposes and data.
KeyValueTextInputFormat : This format identifies each line as a record, but creates key-value
pairs differently. It breaks the input line at the tab character (‘\t’) and assigns everything up to the
first tab character as the key. The remaining part of the line is interpreted as the value.
NLineInputFormat : This format is the same as TextInputFormat, except that one can define the
exact number of lines or key-value pairs in an input split. By default, N=1.
There are several other InputFormats for different purposes and data. One can also create a
custom Input format.
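For illustration, switching the InputFormat in the driver might look like the following sketch, assuming a Job object named job:

import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

// Tab-separated records: everything before the first tab becomes the key.
job.setInputFormatClass(KeyValueTextInputFormat.class);

// Or: a fixed number of lines per input split (here, 100 lines per mapper).
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 100);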
After the reducers finish processing the data, the final key-value pairs must be written to a text
file so that they can be stored on HDFS. A RecordWriter from the TextOutputFormat class
writes these output key-value pairs as text to output files.
Just like Input Format, Output Format determines certain properties, such as type, of the final
output written by the RecordWriter to output files. Typically, the MapReduce framework writes
data only to new, non existent, output paths to avoid overwriting existing data. The
FileOutputFormat.setOutputPath() method is used to set the output directory. It checks the
existence of the given path to avoid overwriting. FileOutputFormat is the base class for all
file-based OutputFormat implementations. Every Reducer writes a separate file in this common
output directory.
An HDFS block is a physical representation of data on storage, while an Input Split is a logical
representation of the input data at runtime. InputSplit contains a reference to the data within a
certain range. FileInputFormat divides the given input files into logical chunks called Input Splits.
Hence, the number of input splits = (Total data size)/(split size). Each input split is assigned to a
single mapper for processing and hence, the number of map tasks = number of input splits.
Input Split size is a user defined value that can be tuned based on the overall size of data. By
setting the split size you can set the number of input splits and thereby the number of map tasks
for a job. By default, the input split size is equal to the block size.
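For example, a 1 GB (1024 MB) input with a 128 MB split size yields 1024/128 = 8 input splits and hence 8 map tasks; lowering the split size to 64 MB doubles this to 16 map tasks.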
In general, more map tasks imply greater parallelism, which incentivises lowering the split size.
However, the total time taken by a job depends on the processing time at mappers and
reducers, the setup time and the time taken to transfer the data over the network. Thus, we
need to carefully choose a split size such that the total time for job completion is minimized
taking into account processing, setup and networking factors. The default values of
64MB(Apache) and 128MB(Cloudera) are chosen based on typical workloads. They generally
give good performance for the most common Hadoop applications.
We can also set the number of reducers to zero. Such a job is called a map-only job and the
outputs of the map-tasks go directly to the FileSystem at the output path set by
FileOutputFormat.setOutputPath(Job, Path). The framework does not sort such map-only
outputs before writing them to the FileSystem.
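A minimal sketch of configuring such a map-only job, assuming a Job object named job and a hypothetical output path:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

job.setNumReduceTasks(0);   // zero reducers: map outputs go straight to the output path
FileOutputFormat.setOutputPath(job, new Path("/user/demo/wordcount-map-only"));  // hypothetical path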
Hadoop has a set of APIs which define the flow of a mapreduce program.
A job is the complete flow of execution of a mapreduce program from start to finish. A job
starts with reading input data from HDFS. It then processes the data according to the map and
the reduce functions. A job ends with writing output files on HDFS.
The Job class helps the user create a job, describe various parameters of the job, control its
execution and then submit the job. It also allows the user to monitor the progress of a submitted
job. In other words, the Hadoop framework executes a map-reduce application as described by
the corresponding Job class.
You can configure a job’s parameters using the Job class. Initially, we instantiate a new Job
object that takes the Hadoop configuration. A configuration contains information about
resources. Default resources are specified in core-default.xml and
core-site.xml. These files contain the read-only defaults and the site-specific configurations of the
Hadoop installation, respectively.
There are several methods defined in the job class which can be used to either configure a job,
or monitor it. Some of the important ones are -
1. setJobName() - You set the job name using the setJobName method.
2. setJarByClass() - Through the setJarByClass API, the user helps Hadoop identify the main
driver jar or class for a MapReduce job. Note that the jar files are put on HDFS so that all
nodes can access them.
3. setOutputKeyClass() - You specify the output key format using setOutputKeyClass.
4. setOutputValueClass() - You specify the output value format using setOutputValueClass.
5. setMapOutputKeyClass() - You can specifically set the map output key class separately.
If not, it defaults to the class that is passed earlier to setOutputKeyClass. This API must
be used if the mapper emitted keys differ in format from the final output.
6. setMapOutputValueClass() - Similarly, you can explicitly specify the map output value
class. If not set, it defaults to the class set as the overall output value class. This API
must be used if the mapper emitted values differ in format from the final output.
7. setMapperClass() - The setMapperClass method is provided to set the mapper for the
job. The mappers will process the data according to the class passed here as the
argument.
8. setCombinerClass() - This api sets the combiner for the job. This step is optional, but
recommended to perform local aggregation of the intermediate outputs, which helps to
cut down the amount of data transferred from the Mapper to the Reducer.
9. setReducerClass() - The setReducerClass method is used to set the reducer for the job.
The reducers will process the intermediate output from the mappers according to the
class passed here as the argument.
10. FileInputFormat.addInputPath(job,<path>) - You provide the path to the input file
through the addInputPath method of the FileInputFormat class. You can also specify this
on the command line along with the command to run this job.
11. FileOutputFormat.setOutputPath(job,<path>) - This api is used to provide the output
path. This may also be provided in the command line.
12. job.waitForCompletion(true) - The waitForCompletion method submits the job and then
polls for progress until the job is complete.
There are several other methods defined in the job class. We can use these methods to either
configure a job, or monitor it.
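Putting the methods listed above together, a minimal sketch of a WordCount driver class (it assumes the illustrative WordCountMapper and WordCountReducer classes sketched earlier, with the input and output paths supplied on the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");    // job name set at creation (or via setJobName)

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);        // mapper sketched earlier
        job.setCombinerClass(WordCountReducer.class);     // optional local aggregation
        job.setReducerClass(WordCountReducer.class);      // reducer sketched earlier
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));     // input path from the command line
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output path from the command line

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}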
If you encounter JobConf instead of the Job class, then you are looking at the older MapReduce
package and APIs. The newer package replaces JobConf with the Job class.
The driver class implements the Tool interface and extends the Configured class. Configured is
the base class required to configure a Hadoop job. The Configured class has two main methods -
getConf() and setConf(). getConf() returns the configuration used by the object, while setConf()
sets the configuration to be used by the object.
Tool is an interface that supports handling of generic command-line options. With the help of
Tool we can define job parameters from the command line. A job is submitted for execution
using the ToolRunner.run API.
It is not mandatory to use the Configured class and Tool interface. However, using them makes
your code more portable and cleaner as you do not need to hardcode specific configurations.
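A minimal sketch of a driver that extends Configured and implements Tool (the class name is illustrative, and the job setup details are elided):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountTool extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() returns the configuration populated by ToolRunner,
        // including any generic -D options passed on the command line.
        Job job = Job.getInstance(getConf(), "word count");
        // ... set the jar, mapper, reducer, output classes and paths as in the driver above ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new WordCountTool(), args);
        System.exit(exitCode);
    }
}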
MapReduce programming skills may be developed only through practice. You can refer here to
run your code. You may further refer to session 3 and 4 for some code demonstrations.