Bigdata Unit IV
UNIT IV
SYLLABUS:
What is HDFS
Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over several machines and replicated to ensure durability in the face of failures and high availability to parallel applications.
The Hadoop File System was developed using a distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed using low-cost hardware.
HDFS holds a very large amount of data and provides easy access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to rescue the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.
Extremely large files: Here we are talking about data in the range of petabytes (1000 TB).
Streaming Data Access Pattern: HDFS is designed on the principle of write-once and read-many-times. Once data is written, large portions of the dataset can be processed any number of times.
Commodity hardware: Hardware that is inexpensive and easily available in the market. This is one of the features that especially distinguishes HDFS from other file systems.
Features of HDFS
It is suitable for distributed storage and processing.
Hadoop provides a command interface to interact with HDFS; the file system can also be accessed programmatically (a sketch follows this list).
The built-in servers of the namenode and datanode help users easily check the status of the cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.
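As a small illustration of the programmatic route, the sketch below uses the Hadoop Java FileSystem API to list a directory and read a file from HDFS. It is a minimal sketch under stated assumptions: the namenode address and the /user/hadoop paths are placeholders, not values taken from this unit.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsListAndRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder namenode address; on a real cluster this normally comes from
    // core-site.xml (older releases call the property fs.default.name, newer ones fs.defaultFS)
    conf.set("fs.default.name", "hdfs://namenode-host:8020");
    FileSystem fs = FileSystem.get(conf);

    // List the contents of a (hypothetical) HDFS directory
    for (FileStatus status : fs.listStatus(new Path("/user/hadoop"))) {
      System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
    }

    // Stream a (hypothetical) file's contents to standard output
    InputStream in = fs.open(new Path("/user/hadoop/sample.txt"));
    try {
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}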
HDFS Architecture
Given below is the architecture of a Hadoop File System.
HDFS follows the master-slave architecture and it has the following elements.
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software. It is software that can be run on commodity hardware. The system having the namenode acts as the master server, and it does the following tasks: it manages the file system namespace, regulates clients' access to files, and also executes file system operations such as renaming, closing, and opening files and directories.
Datanode
The datanode is commodity hardware having the GNU/Linux operating system and datanode software. For every node (commodity hardware/system) in a cluster, there will be a datanode. These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication according
to the instructions of the namenode.
Block
Generally the user data is stored in the files of HDFS. A file in the file system is divided into one or more segments, which are stored on individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB, but it can be increased as needed by changing the HDFS configuration.
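As an illustration of how the block size can be changed, the hedged sketch below sets a larger default through the configuration and also chooses a block size for one file at creation time. The property name dfs.block.size is the one used by older Hadoop releases (newer releases spell it dfs.blocksize), and the file path is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Default block size for newly created files: 128 MB instead of 64 MB
    conf.setLong("dfs.block.size", 128L * 1024 * 1024);
    FileSystem fs = FileSystem.get(conf);

    // The block size can also be chosen per file at creation time:
    // create(path, overwrite, bufferSize, replication, blockSize)
    FSDataOutputStream out = fs.create(new Path("/user/hadoop/big-file.dat"),
        true, 4096, (short) 3, 128L * 1024 * 1024);
    out.writeBytes("example record\n");
    out.close();
  }
}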
Goals of HDFS
Fault detection and recovery
Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.
Huge datasets
HDFS should have hundreds of nodes per cluster to manage the applications having
huge datasets.
Hardware at data
A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.
Portability
To facilitate adoption, HDFS is designed to be portable across multiple hardware
platforms and to be compatible with a variety of underlying operating systems.
Since all the metadata is stored in the namenode, it is very important. If it fails, the file system cannot be used, as there would be no way of knowing how to reconstruct the files from the blocks present on the datanodes. To overcome this problem, the concept of a secondary namenode arises.
Features of HDFS
o Highly Scalable - HDFS is highly scalable as it can scale to hundreds of nodes in a single cluster.
o Replication - Due to some unfavorable conditions, the node containing the data may be lost. To overcome such problems, HDFS always maintains a copy of the data on a different machine.
o Fault tolerance - In HDFS, fault tolerance signifies the robustness of the system in the event of failure. HDFS is highly fault-tolerant in that if any machine fails, another machine containing a copy of that data automatically becomes active.
o Distributed data storage - This is one of the most important features of HDFS that makes Hadoop very powerful. Here, data is divided into multiple blocks and stored across nodes.
o Portable - HDFS is designed in such a way that it can easily be ported from one platform to another.
a. NameNode
It runs on the master machine. It saves the locations of all the files stored in the file system and tracks where the data resides across the cluster, i.e., it stores the metadata of the files. When client applications want to perform certain operations on the data, they interact with the NameNode. When the NameNode receives a request, it responds by returning a list of DataNode servers where the required data resides.
b. DataNode
This process runs on every slave machine. One of its functionalities is to store each HDFS data block in a separate file in its local file system. In other words, it contains the actual data in the form of blocks. It sends heartbeat signals periodically and waits for requests from the NameNode to access the data.
These records are then fed to the mappers for further processing of the data. MapReduce jobs primarily consist of three phases, namely the Map phase, the Shuffle phase, and the Reduce phase.
a. Map Phase
It is the first phase in the processing of the data. The main task in the map phase is to
process each input from the RecordReader and convert it into intermediate tuples (key-value
pairs). This intermediate output is stored in the local disk by the mappers.
The values of these key-value pairs can differ from the ones received as input from the
RecordReader. The map phase can also contain combiners which are also called as local
reducers. They perform aggregations on the data but only within the scope of one mapper.
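To make the map phase concrete, here is a minimal sketch of a map function in the older org.apache.hadoop.mapred Java API, written for the maximum-temperature example used later in this unit. The class name MaxTemperatureMapper matches the driver shown in the combiner section; the NCDC-style field offsets and the quality check are illustrative assumptions consistent with the Streaming scripts shown later.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    String year = line.substring(15, 19);                       // year field
    int airTemperature =
        Integer.parseInt(line.substring(87, 92).replace("+", "")); // signed temperature
    String quality = line.substring(92, 93);                    // quality code
    if (airTemperature != 9999 && quality.matches("[01459]")) { // 9999 marks a missing reading
      // Emit an intermediate (year, temperature) pair
      output.collect(new Text(year), new IntWritable(airTemperature));
    }
  }
}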
As the computations are performed across different data nodes, it is essential that all the values associated with the same key are brought together at one reducer. This task is performed by the partitioner. It applies a hash function to the keys of these key-value pairs so that all records for a given key land in the same partition.
It also helps ensure that the work is distributed evenly across the reducers. Partitioners generally come into the picture when we are working with more than one reducer; a sketch of such a partitioner follows.
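The fragment below is a hedged sketch of a hash-based partitioner in the old Java API; it mirrors what the default hash partitioner does. The class name and key/value types are illustrative assumptions for the max-temperature example.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Every record with the same key hashes to the same partition,
// so it reaches the same reducer.
public class YearPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {
    // No configuration needed for this simple example
  }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Mask off the sign bit so the result is non-negative
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}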
b. Shuffle and Sort Phase
This phase transfers the intermediate output obtained from the mappers to the reducers. This process is called shuffling. The output from the mappers is also sorted before being transferred to the reducers. The sorting is done on the basis of the keys in the key-value pairs. It helps the reducers perform computations on the data even before the entire data is received and eventually helps in reducing the time required for the computations.
As the keys are sorted, whenever the reducer gets a different key as input, it starts to perform the reduce tasks on the previously received data.
c. Reduce Phase
The output of the map phase serves as an input to the reduce phase. It takes these key-
value pairs and applies the reduce function on them to produce the desired result. The keys and
the values associated with the key are passed on to the reduce function to perform certain
operations.
We can filter the data or combine it to obtain an aggregated output. After the execution of the reduce function, zero or more key-value pairs are created. This result is written back to the Hadoop Distributed File System.
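To match the map-side sketch above, here is a hedged sketch of a reduce function in the same old Java API. The class name MaxTemperatureReducer corresponds to the reducer (and combiner) referenced by the driver in the combiner section; the rest is illustrative.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int maxValue = Integer.MIN_VALUE;
    // Walk the list of temperatures for this key and keep the maximum
    while (values.hasNext()) {
      maxValue = Math.max(maxValue, values.next().get());
    }
    output.collect(key, new IntWritable(maxValue));
  }
}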
SCALING OUT
Scale out is a growth architecture or method that focuses on horizontal growth, or the addition of new resources, instead of increasing the capacity of current resources (known as scaling up). In a system such as a cloud storage facility, following a scale-out growth model means that new storage hardware and controllers are added in order to increase capacity. This has two obvious advantages: storage capacity is increased, and traffic capacity is also increased because there is more hardware to share the load.
Scalable data platforms matter to organizations dealing with exploding datasets. A scalable data platform accommodates rapid changes in the growth of data, either in traffic or volume. These platforms utilize added hardware or software to increase output and storage of data. When a company has a scalable data platform, it is also prepared for the potential growth of its data needs.
Companies should implement scalability in their organization precisely when performance issues arise. These issues can negatively impact workflow, efficiency, and customer retention. There are three common key performance bottlenecks that often point the way toward a proper resolution with data scaling:
1. High CPU Usage is the most common bottleneck, and the most visible. Slowing and erratic
performance is a key indicator of high CPU usage, and can often be a harbinger of other issues.
User CPU means the CPU is doing productive work, but needs a server upgrade; system CPU
refers to usage consumed by the operating system, and is usually related to the software; and
I/O wait, which is the idling time caused by the CPU waiting for the I/O subsystem.
2. Low Memory is the next most common bottleneck. Servers without enough memory to handle an application load can slow the application completely. Low memory can require a RAM upgrade, but it can also be an indicator of a memory leak, which requires finding and fixing the leak.
3. High Disk Usage is another common bottleneck. This is often caused by maxed-out disks, and is a strong sign that it is time to scale the data platform.
Once a decision has been made for data scaling, the specific scaling approach must be chosen.
There are two commonly used types of data scaling, up and out:
1. Scaling up, or vertical scaling, involves obtaining a faster server with more powerful processors and more memory. This solution uses less network hardware and consumes less power; but ultimately, for many platforms it may only provide a short-term fix, especially if continued growth is expected.
2. Scaling out, or horizontal scaling, involves adding servers for parallel computing. The scale
out technique is a long-term solution, as more and more servers may be added when needed.
But going from one monolithic system to this type of cluster may be a difficult, although extremely effective, solution.
You have seen how MapReduce works for small inputs; now it is time to take a bird's-eye view of the system and look at the data flow for large inputs. For simplicity, the examples so far have used files on the local filesystem. However, to scale out, we need to store the data in a distributed filesystem, typically HDFS, to allow Hadoop to move the MapReduce computation to each machine hosting a part of the data. Let's see how this works.
Data Flow
First, some terminology. A MapReduce job is a unit of work that the client wants to be
performed: it consists of the input data, the MapReduce program, and configuration
information. Hadoop runs the job by dividing it into tasks, of which there are two types: map
tasks and reduce tasks.
There are two types of nodes that control the job execution process: a jobtracker and a number
of tasktrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks
to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker, which
keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.
Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just
splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.
Having many splits means the time taken to process each split is small compared to the time to
process the whole input. So if we are processing the splits in parallel, the processing is better
load-balanced if the splits are small, since a faster machine will be able to process
proportionally more splits over the course of the job than a slower machine. Even if the
machines are identical, failed processes or other jobs running concurrently make load balancing
desirable, and the quality of the load balancing increases as the splits become more fine-
grained.
On the other hand, if splits are too small, then the overhead of managing the splits and of map
task creation begins to dominate the total job execution time. For most jobs, a good split size
tends to be the size of an HDFS block, 64 MB by default, although this can be changed for the
cluster (for all newly created files), or specified when each file is created.
Hadoop does its best to run the map task on a node where the input data resides in HDFS. This
is called the data locality optimization. It should now be clear why the optimal split size is the
same as the block size: it is the largest size of input that can be guaranteed to be stored on a
single node. If the split spanned two blocks, it would be unlikely that any HDFS node stored
both blocks, so some of the split would have to be transferred across the network to the node
running the map task, which is clearly less efficient than running the whole map task using
local data.
Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is intermediate output: it is processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. So storing it in HDFS, with replication, would be overkill. If the node running the map task fails before the map output has been consumed by the reduce task, then Hadoop will automatically rerun the map task on another node to re-create the map output.
Reduce tasks do not have the advantage of data locality; the input to a single reduce task is normally the output from all mappers. The sorted map outputs have to be transferred across the network to the node where the reduce task is running, where they are merged and then passed to the user-defined reduce function. The output of the reduce is normally stored in HDFS for reliability: for each HDFS block of the reduce output, the first replica is stored on the local node, with other replicas being stored on off-rack nodes. Thus, writing the reduce output does consume network bandwidth, but only as much as a normal HDFS write pipeline consumes.
The whole data flow with a single reduce task is illustrated in the figure: the dotted boxes indicate nodes, the light arrows show data transfers on a node, and the heavy arrows show data transfers between nodes.
The number of reduce tasks is not governed by the size of the input but is specified independently.
When there are multiple reducers, the map tasks partition their output, each creating one
partition for each reduce task. There can be many keys (and their associated values) in each
partition, but the records for any given key are all in a single partition. The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner, which buckets keys using a hash function, works very well.
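The fragment below is a hedged sketch of how the number of reducers is chosen explicitly on the job configuration in the old Java API; the value 4 is an arbitrary illustration.

import org.apache.hadoop.mapred.JobConf;

public class ReducerCountSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf(ReducerCountSketch.class);

    // The number of reduce tasks is chosen by the user, not derived from the input size
    conf.setNumReduceTasks(4);

    // Setting it to zero gives a map-only job: map output is written straight to HDFS
    // and no shuffle takes place.
    // conf.setNumReduceTasks(0);
  }
}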
The data flow for the general case of multiple reduce tasks is illustrated in the figure. This diagram makes it clear why the data flow between map and reduce tasks is colloquially known as "the shuffle," as each reduce task is fed by many map tasks. The shuffle is more complicated than this diagram suggests, and tuning it can have a big impact on job execution time. Finally, it is also possible to have zero reduce tasks. This can be appropriate when you do not need the shuffle because the processing can be carried out entirely in parallel; in this case, the only off-node data transfer is when the map tasks write to HDFS.
Combiner Functions
Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer.
The contract for the combiner function constrains the type of function that may be used. This is best illustrated with an example. Suppose that for the maximum temperature example, readings for the year 1950 were processed by two maps (because they were in different splits). Imagine the first map produced the output:
(1950, 0)
(1950, 20)
(1950, 10)
and the second produced:
(1950, 25)
(1950, 15)
The reduce function would be called with a list of all the values:
(1950, [0, 20, 10, 25, 15])
with output:
(1950, 25)
since 25 is the maximum value in the list. We could use a combiner function that, just like the reduce function, finds the maximum temperature for each map output. The reduce would then be called with:
(1950, [20, 25])
and it would produce the same output as before. More succinctly, we may express the function calls on the temperature values in this case as follows:
max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
Not all functions possess this property. For example, if we were calculating mean temperatures, we could not use the mean as our combiner function, because mean(0, 20, 10, 25, 15) = 14, but mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15. The combiner function does not replace the reduce function (it cannot, since the reduce function is still needed to process records with the same key coming from different maps), but it can help cut down the amount of data shuffled between the mappers and the reducers, so it is always worth considering whether you can use a combiner function in your MapReduce job.
Example: Application to find the maximum temperature, using a combiner function for efficiency

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MaxTemperatureWithCombiner {
  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperatureWithCombiner <input path> " +
          "<output path>");
      System.exit(-1);
    }
    JobConf conf = new JobConf(MaxTemperatureWithCombiner.class);
    conf.setJobName("Max temperature");

    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // The same reducer class serves as the combiner, since max is commutative and associative
    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setCombinerClass(MaxTemperatureReducer.class);
    conf.setReducerClass(MaxTemperatureReducer.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}
Running a Distributed MapReduce Job
The same program will run, without alteration, on a full dataset. This is the point of MapReduce: it scales to the size of your data and the size of your hardware. On an EC2 cluster running High-CPU Extra Large Instances, the program took six minutes to run. This is a factor of seven faster than the serial run on one machine using awk; the main reason it was not proportionately faster is that the input data was not evenly partitioned. For convenience, the input files were gzipped by year, resulting in large files for later years in the dataset, when the number of weather records was much higher.
HADOOP STREAMING
Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.
Streaming is naturally suited for text processing (although, as of version 0.21.0, it can handle
binary streams, too), and when used in text mode, it has a line-oriented view of data. Map input
data is passed over standard input to your map function, which processes it line by line and
writes lines to standard output. A map output key-value pair is written as a single tab-delimited line. Input to the reduce function is in the same format: a tab-separated key-value pair passed over standard input. The reduce function reads lines from standard input, which the framework guarantees are sorted by key, and writes its results to standard output.
Ruby
The map function can be expressed in Ruby as shown below:
#!/usr/bin/env ruby
STDIN.each_line do |line|
  val = line
  year, temp, q = val[15,4], val[87,5], val[92,1]
  puts "#{year}\t#{temp}" if (temp != "+9999" && q =~ /[01459]/)
end
The program iterates over lines from standard input by executing a block for each line from STDIN (a global constant of type IO). The block pulls out the relevant fields from each input line and, if the temperature is valid, writes the year and the temperature separated by a tab character (\t) to standard output (using puts).
The Java API is geared toward processing your map function one record at a time. The framework calls the map() method on your Mapper for each record in the input, whereas with Streaming the map program can decide how to process the input; for example, it could easily read and process multiple lines at a time, since it is in control of the reading. The user's Java map implementation is pushed records, but it is still possible to consider multiple lines at a time by accumulating previous lines in an instance variable in the Mapper. In this case, you need to implement the close() method so that you know when the last record has been read, so you can finish processing the last group of lines.
Since the script just operates on standard input and output, it is trivial to test it without using Hadoop, simply by using Unix pipes:
cat input/ncdc/sample.txt | ch02/src/main/ruby/max_temperature_map.rb
1950 +0000
1950 +0022
1950 -0011
1949 +0111
1949 +0078
The reduce function, shown below, is a little more complex.
#!/usr/bin/env ruby
last_key, max_val = nil, -1000000
STDIN.each_line do |line|
  key, val = line.split("\t")
  if last_key && last_key != key
    puts "#{last_key}\t#{max_val}"
    last_key, max_val = key, val.to_i
  else
    last_key, max_val = key, [max_val, val.to_i].max
  end
end
puts "#{last_key}\t#{max_val}" if last_key
Again, the program iterates over lines from standard input, but this time we have to
store some state as we process each key group. In this case, the keys are weather station
identifiers, and we store the last key seen and the maximum temperature seen so far for that
key. The MapReduce framework ensures that the keys are ordered, so we know that if a key is
different from the previous one, we have moved into a new key group. In contrast to the Java
API, where you are provided an iterator over each key group, in Streaming you have to find
key group boundaries in your program.
For each line, we pull out the key and value; if we have just finished a group, we write out the key and the maximum temperature for that group before resetting the maximum for the new key. If we have not just finished a group, we just update the maximum temperature for the current key.
The last line of the program ensures that a line is written for the last key group in the
input.
We can now simulate the whole MapReduce pipeline with a Unix pipeline:
cat input/ncdc/sample.txt | ch02/src/main/ruby/max_temperature_map.rb | \
  sort | ch02/src/main/ruby/max_temperature_reduce.rb
1949 111
1950 22
The output is the same as that of the Java program, so the next step is to run it using Hadoop itself. The hadoop command does not have a Streaming option; instead, you specify the Streaming JAR file along with the jar option. Options to the Streaming program specify the input and output paths, and the map and reduce scripts. This is what it looks like:
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
  -input input/ncdc/sample.txt \
  -output output \
  -mapper ch02/src/main/ruby/max_temperature_map.rb \
  -reducer ch02/src/main/ruby/max_temperature_reduce.rb
When running on a large dataset on a cluster, we should set the combiner, using the -combiner option.
From release 0.21.0, the combiner can be any Streaming command. For earlier releases, the combiner had to be written in Java, so as a workaround it was common to do manual combining in the mapper, without having to resort to Java. In this case, we could change the mapper to be a pipeline:
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
  -input input/ncdc/all \
  -output output \
  -mapper "ch02/src/main/ruby/max_temperature_map.rb | sort | ch02/src/main/ruby/max_temperature_reduce.rb" \
  -reducer ch02/src/main/ruby/max_temperature_reduce.rb \
  -file ch02/src/main/ruby/max_temperature_map.rb \
  -file ch02/src/main/ruby/max_temperature_reduce.rb
Note also the use of -file, which we use when running Streaming programs on the cluster to
ship the scripts to the cluster.
Python
Streaming supports any programming language that can read from standard input and write to standard output, so for readers more familiar with Python, here is the same example again. The map script is shown first, followed by the reduce script.
#!/usr/bin/env python
import re
import sys

for line in sys.stdin:
    val = line.strip()
    (year, temp, q) = (val[15:19], val[87:92], val[92:93])
    if (temp != "+9999" and re.match("[01459]", q)):
        print "%s\t%s" % (year, temp)

#!/usr/bin/env python
import sys

(last_key, max_val) = (None, -sys.maxint)
for line in sys.stdin:
    (key, val) = line.strip().split("\t")
    if last_key and last_key != key:
        print "%s\t%s" % (last_key, max_val)
        (last_key, max_val) = (key, int(val))
    else:
        (last_key, max_val) = (key, max(max_val, int(val)))

if last_key:
    print "%s\t%s" % (last_key, max_val)
We can test the programs and run the job in the same way we did in Ruby. For example, to run a test:
cat input/ncdc/sample.txt | ch02/src/main/python/max_temperature_map.py | \
  sort | ch02/src/main/python/max_temperature_reduce.py
1949 111
1950 22
DESIGN OF HDFS
When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. Filesystems that manage the storage across a network of machines are called distributed filesystems.
Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop
Distributed Filesystem.
The Design of HDFS :
HDFS is a filesystem designed for storing very large files with streaming data access
patterns, running on clusters of commodity hardware.
Very large files:
"Very large" in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.
Streaming data access:
HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, and then various analyses are performed on that dataset over time.
Commodity hardware:
Hadoop does not require expensive, highly reliable hardware. It is designed to run on clusters of commodity hardware (commonly available hardware from multiple vendors) for which the chance of node failure across the cluster is high, at least for large clusters.
HDFS is not a good fit for every application. Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS. In addition, files in HDFS may be written to by only a single writer, and writes are always made at the end of the file; there is no support for multiple writers or for modifications at arbitrary offsets in the file.
The block abstraction brings several benefits to a distributed filesystem.
First:
A file can be larger than any single disk in the network. Nothing requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster.
Second:
Making the unit of abstraction a block rather than a file simplifies the storage
subsystem. The storage subsystem deals with blocks, simplifying storage management (since
blocks are a fixed size, it is easy to calculate how many can be stored on a given disk) and
eliminating metadata concerns.
Third:
Blocks fit well with replication for providing fault tolerance and availability. To insure
against corrupted blocks and disk and machine failure, each block is replicated to a small
number of physically separate machines (typically three).
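As a hedged sketch of how the replication factor is controlled, the fragment below sets the cluster-wide default through the dfs.replication property and changes it for one placeholder file through the FileSystem API; the path is an illustrative assumption.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Default number of replicas for newly created files (three is typical)
    conf.setInt("dfs.replication", 3);
    FileSystem fs = FileSystem.get(conf);

    // The replication factor can also be changed for an existing (placeholder) file
    fs.setReplication(new Path("/user/hadoop/big-file.dat"), (short) 3);
  }
}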
Why Is a Block in HDFS So Large?
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost
of seeks. By making a block large enough, the time to transfer the data from the disk can be
made to be significantly larger than the time to seek to the start of the block. Thus the time to
transfer a large file made of multiple blocks operates at the disk transfer rate.
A quick calculation shows that if the seek time is around 10 ms, and the transfer rate is
100 MB/s, then to make the seek time 1% of the transfer time, we need to make the block size
around 100 MB. The default is actually 64 MB, although many HDFS installations use 128
MB blocks. This figure will continue to be revised upward as transfer speeds grow with new
generations of disk drives.
Name nodes and Data nodes:
- The Master (NameNode) manages file system namespace operations like opening, closing, and renaming files and directories, determines the mapping of blocks to DataNodes, and regulates access to files by clients.
- Slaves (DataNodes) serve read and write requests from the file system's clients, and perform block creation, deletion, and replication upon instruction from the Master (NameNode).
Data nodes are the workhorses of the filesystem. They store and retrieve blocks when they are
told to (by clients or the namenode), and they report back to the namenode periodically with
lists of blocks that they are storing.
NameNode failure: if the machine running the namenode failed, all the files on the filesystem
would be lost since there would be no way of knowing how to reconstruct the files from the
blocks on the datanodes.