Bigdata Unit IV
UNIT IV
SYLLABUS:
What is HDFS
Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over several machines and replicated to ensure durability in the face of failures and high availability to parallel applications.
The Hadoop File System was developed using a distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed using low-cost hardware.
HDFS holds a very large amount of data and provides easy access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to rescue the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.
Extremely large files: Here we are talking about data in the range of petabytes (1000 TB).
Streaming Data Access Pattern: HDFS is designed on the principle of write-once and read-many-times. Once data is written, large portions of the dataset can be processed any number of times.
Commodity hardware: Hardware that is inexpensive and easily available in the market. This is one of the features that especially distinguishes HDFS from other file systems.
Features of HDFS
It is suitable for distributed storage and processing.
Hadoop provides a command interface to interact with HDFS; the file system can also be accessed programmatically (a sketch follows this list).
The built-in servers of the namenode and datanode help users easily check the status of the cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.
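As a small illustration of the programmatic route, the sketch below uses the Hadoop Java FileSystem API to list a directory and read a file from HDFS. It is a minimal sketch under stated assumptions: the namenode address and the /user/hadoop paths are placeholders, not values taken from this unit.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsListAndRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder namenode address; on a real cluster this normally comes from
    // core-site.xml (older releases call the property fs.default.name, newer ones fs.defaultFS)
    conf.set("fs.default.name", "hdfs://namenode-host:8020");
    FileSystem fs = FileSystem.get(conf);

    // List the contents of a (hypothetical) HDFS directory
    for (FileStatus status : fs.listStatus(new Path("/user/hadoop"))) {
      System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
    }

    // Stream a (hypothetical) file's contents to standard output
    InputStream in = fs.open(new Path("/user/hadoop/sample.txt"));
    try {
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}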
HDFS Architecture
Given below is the architecture of a Hadoop File System.
HDFS follows the master-slave architecture and it has the following elements.
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software. It is software that can be run on commodity hardware. The system having the namenode acts as the master server, and it does the following tasks: it manages the file system namespace, regulates clients' access to files, and also executes file system operations such as renaming, closing, and opening files and directories.
Datanode
The datanode is commodity hardware having the GNU/Linux operating system and datanode software. For every node (commodity hardware/system) in a cluster, there will be a datanode. These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication according
to the instructions of the namenode.
Block
Generally the user data is stored in the files of HDFS. A file in the file system is divided into one or more segments, which are stored on individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB, but it can be increased as needed by changing the HDFS configuration.
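As an illustration of how the block size can be changed, the hedged sketch below sets a larger default through the configuration and also chooses a block size for one file at creation time. The property name dfs.block.size is the one used by older Hadoop releases (newer releases spell it dfs.blocksize), and the file path is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Default block size for newly created files: 128 MB instead of 64 MB
    conf.setLong("dfs.block.size", 128L * 1024 * 1024);
    FileSystem fs = FileSystem.get(conf);

    // The block size can also be chosen per file at creation time:
    // create(path, overwrite, bufferSize, replication, blockSize)
    FSDataOutputStream out = fs.create(new Path("/user/hadoop/big-file.dat"),
        true, 4096, (short) 3, 128L * 1024 * 1024);
    out.writeBytes("example record\n");
    out.close();
  }
}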
Goals of HDFS
Fault detection and recovery
Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.
Huge datasets
HDFS should have hundreds of nodes per cluster to manage the applications having
huge datasets.
Hardware at data
A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.
Portability
To facilitate adoption, HDFS is designed to be portable across multiple hardware
platforms and to be compatible with a variety of underlying operating systems.
Since all the metadata is stored in the namenode, it is very important. If it fails, the file system cannot be used, as there would be no way of knowing how to reconstruct the files from the blocks present on the datanodes. To overcome this problem, the concept of a secondary namenode arises.
Features of HDFS
o Highly Scalable - HDFS is highly scalable as it can scale to hundreds of nodes in a single cluster.
o Replication - Due to some unfavorable conditions, the node containing the data may be lost. To overcome such problems, HDFS always maintains a copy of the data on a different machine.
o Fault tolerance - In HDFS, fault tolerance signifies the robustness of the system in the event of failure. HDFS is highly fault-tolerant in that if any machine fails, another machine containing a copy of that data automatically becomes active.
o Distributed data storage - This is one of the most important features of HDFS that makes Hadoop very powerful. Here, data is divided into multiple blocks and stored across nodes.
o Portable - HDFS is designed in such a way that it can easily be ported from one platform to another.
a. NameNode
It runs on the master machine. It saves the locations of all the files stored in the file system and tracks where the data resides across the cluster, i.e., it stores the metadata of the files. When client applications want to perform certain operations on the data, they interact with the NameNode. When the NameNode receives a request, it responds by returning a list of DataNode servers where the required data resides.
b. DataNode
This process runs on every slave machine. One of its functionalities is to store each HDFS data block in a separate file in its local file system. In other words, it contains the actual data in the form of blocks. It sends heartbeat signals periodically and waits for requests from the NameNode to access the data.
These records are then fed to the mappers for further processing of the data. MapReduce jobs primarily consist of three phases, namely the Map phase, the Shuffle phase, and the Reduce phase.
a. Map Phase
It is the first phase in the processing of the data. The main task in the map phase is to
process each input from the RecordReader and convert it into intermediate tuples (key-value
pairs). This intermediate output is stored in the local disk by the mappers.
The values of these key-value pairs can differ from the ones received as input from the
RecordReader. The map phase can also contain combiners which are also called as local
reducers. They perform aggregations on the data but only within the scope of one mapper.
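To make the map phase concrete, here is a minimal sketch of a map function in the older org.apache.hadoop.mapred Java API, written for the maximum-temperature example used later in this unit. The class name MaxTemperatureMapper matches the driver shown in the combiner section; the NCDC-style field offsets and the quality check are illustrative assumptions consistent with the Streaming scripts shown later.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    String year = line.substring(15, 19);                       // year field
    int airTemperature =
        Integer.parseInt(line.substring(87, 92).replace("+", "")); // signed temperature
    String quality = line.substring(92, 93);                    // quality code
    if (airTemperature != 9999 && quality.matches("[01459]")) { // 9999 marks a missing reading
      // Emit an intermediate (year, temperature) pair
      output.collect(new Text(year), new IntWritable(airTemperature));
    }
  }
}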
As the computations are performed across different data nodes, it is essential that all the values associated with the same key are brought together at one reducer. This task is performed by the partitioner. It applies a hash function to the keys of these key-value pairs so that all records for a given key land in the same partition.
It also helps ensure that the work is distributed evenly across the reducers. Partitioners generally come into the picture when we are working with more than one reducer; a sketch of such a partitioner follows.
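The fragment below is a hedged sketch of a hash-based partitioner in the old Java API; it mirrors what the default hash partitioner does. The class name and key/value types are illustrative assumptions for the max-temperature example.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Every record with the same key hashes to the same partition,
// so it reaches the same reducer.
public class YearPartitioner implements Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {
    // No configuration needed for this simple example
  }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Mask off the sign bit so the result is non-negative
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}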
b. Shuffle and Sort Phase
This phase transfers the intermediate output obtained from the mappers to the reducers. This process is called shuffling. The output from the mappers is also sorted before being transferred to the reducers. The sorting is done on the basis of the keys in the key-value pairs. It helps the reducers perform computations on the data even before the entire data is received and eventually helps in reducing the time required for the computations.
As the keys are sorted, whenever the reducer gets a different key as input, it starts to perform the reduce tasks on the previously received data.
c. Reduce Phase
The output of the map phase serves as an input to the reduce phase. It takes these key-
value pairs and applies the reduce function on them to produce the desired result. The keys and
the values associated with the key are passed on to the reduce function to perform certain
operations.
We can filter the data or combine it to obtain an aggregated output. After the execution of the reduce function, zero or more key-value pairs are created. This result is written back to the Hadoop Distributed File System.
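To match the map-side sketch above, here is a hedged sketch of a reduce function in the same old Java API. The class name MaxTemperatureReducer corresponds to the reducer (and combiner) referenced by the driver in the combiner section; the rest is illustrative.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int maxValue = Integer.MIN_VALUE;
    // Walk the list of temperatures for this key and keep the maximum
    while (values.hasNext()) {
      maxValue = Math.max(maxValue, values.next().get());
    }
    output.collect(key, new IntWritable(maxValue));
  }
}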
SCALING OUT
Scale out is a growth architecture or method that focuses on horizontal growth, or the addition of new resources, instead of increasing the capacity of current resources (known as scaling up). In a system such as a cloud storage facility, following a scale-out growth model means that new storage hardware and controllers are added in order to increase capacity. This has two obvious advantages: storage capacity is increased, and traffic capacity is also increased because there is more hardware to share the load.
Scalable data platforms matter to organizations dealing with exploding datasets. A scalable data platform accommodates rapid changes in the growth of data, either in traffic or volume. These platforms utilize added hardware or software to increase output and storage of data. When a company has a scalable data platform, it is also prepared for the potential growth of its data needs.
Companies should implement scalability in their organization precisely when performance issues arise. These issues can negatively impact workflow, efficiency, and customer retention. There are three common key performance bottlenecks that often point the way toward a proper resolution with data scaling:
1. High CPU Usage is the most common bottleneck, and the most visible. Slowing and erratic
performance is a key indicator of high CPU usage, and can often be a harbinger of other issues.
User CPU means the CPU is doing productive work, but needs a server upgrade; system CPU
refers to usage consumed by the operating system, and is usually related to the software; and
I/O wait, which is the idling time caused by the CPU waiting for the I/O subsystem.
2. Low Memory is the next most common bottleneck. Servers without enough memory to handle an application load can slow the application completely. Low memory can require a RAM upgrade, but it can also be an indicator of a memory leak, which requires finding and fixing the leak.
3. High Disk Usage is another common bottleneck. This is often caused by maxed-out disks, and is a strong sign that it is time to scale the data platform.
Once a decision has been made for data scaling, the specific scaling approach must be chosen.
There are two commonly used types of data scaling, up and out:
1. Scaling up, or vertical scaling, involves obtaining a faster server with more powerful processors and more memory. This solution uses less network hardware and consumes less power; but ultimately, for many platforms it may only provide a short-term fix, especially if continued growth is expected.
2. Scaling out, or horizontal scaling, involves adding servers for parallel computing. The scale
out technique is a long-term solution, as more and more servers may be added when needed.
But going from one monolithic system to this type of cluster may be a difficult, although extremely effective, solution.
You have seen how MapReduce works for small inputs; now it is time to take a bird's-eye view of the system and look at the data flow for large inputs. For simplicity, the examples so far have used files on the local filesystem. However, to scale out, we need to store the data in a distributed filesystem, typically HDFS, to allow Hadoop to move the MapReduce computation to each machine hosting a part of the data. Let's see how this works.
Data Flow
First, some terminology. A MapReduce job is a unit of work that the client wants to be
performed: it consists of the input data, the MapReduce program, and configuration
information. Hadoop runs the job by dividing it into tasks, of which there are two types: map
tasks and reduce tasks.
There are two types of nodes that control the job execution process: a jobtracker and a number
of tasktrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks
to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker, which
keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.
Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just
splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.
Having many splits means the time taken to process each split is small compared to the time to
process the whole input. So if we are processing the splits in parallel, the processing is better
load-balanced if the splits are small, since a faster machine will be able to process
proportionally more splits over the course of the job than a slower machine. Even if the
machines are identical, failed processes or other jobs running concurrently make load balancing
desirable, and the quality of the load balancing increases as the splits become more fine-
grained.
On the other hand, if splits are too small, then the overhead of managing the splits and of map
task creation begins to dominate the total job execution time. For most jobs, a good split size
tends to be the size of an HDFS block, 64 MB by default, although this can be changed for the
cluster (for all newly created files), or specified when each file is created.
Hadoop does its best to run the map task on a node where the input data resides in HDFS. This
is called the data locality optimization. It should now be clear why the optimal split size is the
same as the block size: it is the largest size of input that can be guaranteed to be stored on a
single node. If the split spanned two blocks, it would be unlikely that any HDFS node stored
both blocks, so some of the split would have to be transferred across the network to the node
running the map task, which is clearly less efficient than running the whole map task using
local data.
Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is intermediate output: it is processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. So storing it in HDFS, with replication, would be overkill. If the node running the map task fails before the map output has been consumed by the reduce task, then Hadoop will automatically rerun the map task on another node to re-create the map output.
Reduce tasks do not have the advantage of data locality; the input to a single reduce task is normally the output from all mappers. The sorted map outputs have to be transferred across the network to the node where the reduce task is running, where they are merged and then passed to the user-defined reduce function. The output of the reduce is normally stored in HDFS for reliability: for each HDFS block of the reduce output, the first replica is stored on the local node, with other replicas being stored on off-rack nodes. Thus, writing the reduce output does consume network bandwidth, but only as much as a normal HDFS write pipeline consumes.
The whole data flow with a single reduce task is illustrated in the figure: the dotted boxes indicate nodes, the light arrows show data transfers on a node, and the heavy arrows show data transfers between nodes.
The number of reduce tasks is not governed by the size of the input but is specified independently.
When there are multiple reducers, the map tasks partition their output, each creating one
partition for each reduce task. There can be many keys (and their associated values) in each
partition, but the records for any given key are all in a single partition. The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner, which buckets keys using a hash function, works very well.
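The fragment below is a hedged sketch of how the number of reducers is chosen explicitly on the job configuration in the old Java API; the value 4 is an arbitrary illustration.

import org.apache.hadoop.mapred.JobConf;

public class ReducerCountSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf(ReducerCountSketch.class);

    // The number of reduce tasks is chosen by the user, not derived from the input size
    conf.setNumReduceTasks(4);

    // Setting it to zero gives a map-only job: map output is written straight to HDFS
    // and no shuffle takes place.
    // conf.setNumReduceTasks(0);
  }
}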
The data flow for the general case of multiple reduce tasks is illustrated in the figure. This diagram makes it clear why the data flow between map and reduce tasks is colloquially known as "the shuffle," as each reduce task is fed by many map tasks. The shuffle is more complicated than this diagram suggests, and tuning it can have a big impact on job execution time. Finally, it is also possible to have zero reduce tasks. This can be appropriate when you do not need the shuffle because the processing can be carried out entirely in parallel; in this case, the only off-node data transfer is when the map tasks write to HDFS.
Combiner Functions
Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer.
The contract for the combiner function constrains the type of function that may be used. This is best illustrated with an example. Suppose that for the maximum temperature example, readings for the year 1950 were processed by two maps (because they were in different splits). Imagine the first map produced the output:
(1950, 0)
(1950, 20)
(1950, 10)
and the second produced:
(1950, 25)
(1950, 15)
The reduce function would be called with a list of all the values:
(1950, [0, 20, 10, 25, 15])
with output:
(1950, 25)
since 25 is the maximum value in the list. We could use a combiner function that, just like the reduce function, finds the maximum temperature for each map output. The reduce would then be called with:
(1950, [20, 25])
and it would produce the same output as before. More succinctly, we may express the function calls on the temperature values in this case as follows:
max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
Not all functions possess this property. For example, if we were calculating mean temperatures, we could not use the mean as our combiner function, because mean(0, 20, 10, 25, 15) = 14, but mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15. The combiner function does not replace the reduce function (it cannot, since the reduce function is still needed to process records with the same key coming from different maps), but it can help cut down the amount of data shuffled between the mappers and the reducers, so it is always worth considering whether you can use a combiner function in your MapReduce job.
Example: Application to find the maximum temperature, using a combiner function for efficiency

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MaxTemperatureWithCombiner {
  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperatureWithCombiner <input path> " +
          "<output path>");
      System.exit(-1);
    }
    JobConf conf = new JobConf(MaxTemperatureWithCombiner.class);
    conf.setJobName("Max temperature");

    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // The same reducer class serves as the combiner, since max is commutative and associative
    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setCombinerClass(MaxTemperatureReducer.class);
    conf.setReducerClass(MaxTemperatureReducer.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}
Running a Distributed MapReduce Job
The same program will run, without alteration, on a full dataset. This is the point of MapReduce: it scales to the size of your data and the size of your hardware. On an EC2 cluster running High-CPU Extra Large Instances, the program took six minutes to run. This is a factor of seven faster than the serial run on one machine using awk; the main reason it was not proportionately faster is that the input data was not evenly partitioned. For convenience, the input files were gzipped by year, resulting in large files for later years in the dataset, when the number of weather records was much higher.
HADOOP STREAMING
Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java. Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program.
Streaming is naturally suited for text processing (although, as of version 0.21.0, it can handle
binary streams, too), and when used in text mode, it has a line-oriented view of data. Map input
data is passed over standard input to your map function, which processes it line by line and
writes lines to standard output. A map output key-value pair is written as a single tab-delimited line. Input to the reduce function is in the same format: a tab-separated key-value pair passed over standard input. The reduce function reads lines from standard input, which the framework guarantees are sorted by key, and writes its results to standard output.
Ruby
The map function can be expressed in Ruby as shown below:
#!/usr/bin/env ruby
STDIN.each_line do |line|
  val = line
  year, temp, q = val[15,4], val[87,5], val[92,1]
  puts "#{year}\t#{temp}" if (temp != "+9999" && q =~ /[01459]/)
end
The program iterates over lines from standard input by executing a block for each line from STDIN (a global constant of type IO). The block pulls out the relevant fields from each input line and, if the temperature is valid, writes the year and the temperature separated by a tab character (\t) to standard output (using puts).
The Java API is geared toward processing your map function one record at a time. The framework calls the map() method on your Mapper for each record in the input, whereas with Streaming the map program can decide how to process the input; for example, it could easily read and process multiple lines at a time, since it is in control of the reading. The user's Java map implementation is pushed records, but it is still possible to consider multiple lines at a time by accumulating previous lines in an instance variable in the Mapper. In this case, you need to implement the close() method so that you know when the last record has been read, so you can finish processing the last group of lines.
Since the script just operates on standard input and output, it is trivial to test it without using Hadoop, simply by using Unix pipes:
cat input/ncdc/sample.txt | ch02/src/main/ruby/max_temperature_map.rb
1950 +0000
1950 +0022
1950 -0011
1949 +0111
1949 +0078
The reduce function, shown below, is a little more complex.
#!/usr/bin/env ruby
last_key, max_val = nil, -1000000
STDIN.each_line do |line|
  key, val = line.split("\t")
  if last_key && last_key != key
    puts "#{last_key}\t#{max_val}"
    last_key, max_val = key, val.to_i
  else
    last_key, max_val = key, [max_val, val.to_i].max
  end
end
puts "#{last_key}\t#{max_val}" if last_key
Again, the program iterates over lines from standard input, but this time we have to
store some state as we process each key group. In this case, the keys are weather station
identifiers, and we store the last key seen and the maximum temperature seen so far for that
key. The MapReduce framework ensures that the keys are ordered, so we know that if a key is
different from the previous one, we have moved into a new key group. In contrast to the Java
API, where you are provided an iterator over each key group, in Streaming you have to find
key group boundaries in your program.
For each line, we pull out the key and value; if we have just finished a group, we write out the key and the maximum temperature for that group before resetting the maximum for the new key. If we have not just finished a group, we just update the maximum temperature for the current key.
The last line of the program ensures that a line is written for the last key group in the
input.
We can now simulate the whole MapReduce pipeline with a Unix pipeline:
cat input/ncdc/sample.txt | ch02/src/main/ruby/max_temperature_map.rb | \
  sort | ch02/src/main/ruby/max_temperature_reduce.rb
1949 111
1950 22
The output is the same as that of the Java program, so the next step is to run it using Hadoop itself. The hadoop command does not have a Streaming option; instead, you specify the Streaming JAR file along with the jar option. Options to the Streaming program specify the input and output paths, and the map and reduce scripts. This is what it looks like:
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
  -input input/ncdc/sample.txt \
  -output output \
  -mapper ch02/src/main/ruby/max_temperature_map.rb \
  -reducer ch02/src/main/ruby/max_temperature_reduce.rb
When running on a large dataset on a cluster, we should set the combiner, using the -combiner option.
From release 0.21.0, the combiner can be any Streaming command. For earlier releases, the combiner had to be written in Java, so as a workaround it was common to do manual combining in the mapper, without having to resort to Java. In this case, we could change the mapper to be a pipeline:
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
  -input input/ncdc/all \
  -output output \
  -mapper "ch02/src/main/ruby/max_temperature_map.rb | sort | ch02/src/main/ruby/max_temperature_reduce.rb" \
  -reducer ch02/src/main/ruby/max_temperature_reduce.rb \
  -file ch02/src/main/ruby/max_temperature_map.rb \
  -file ch02/src/main/ruby/max_temperature_reduce.rb
Note also the use of -file, which we use when running Streaming programs on the cluster to
ship the scripts to the cluster.
Python
Streaming supports any programming language that can read from standard input and write to standard output, so for readers more familiar with Python, here is the same example again. The map script is shown first, followed by the reduce script.
#!/usr/bin/env python
import re
import sys

for line in sys.stdin:
    val = line.strip()
    (year, temp, q) = (val[15:19], val[87:92], val[92:93])
    if (temp != "+9999" and re.match("[01459]", q)):
        print "%s\t%s" % (year, temp)

#!/usr/bin/env python
import sys

(last_key, max_val) = (None, -sys.maxint)
for line in sys.stdin:
    (key, val) = line.strip().split("\t")
    if last_key and last_key != key:
        print "%s\t%s" % (last_key, max_val)
        (last_key, max_val) = (key, int(val))
    else:
        (last_key, max_val) = (key, max(max_val, int(val)))

if last_key:
    print "%s\t%s" % (last_key, max_val)
We can test the programs and run the job in the same way we did in Ruby. For example, to run a test:
cat input/ncdc/sample.txt | ch02/src/main/python/max_temperature_map.py | \
  sort | ch02/src/main/python/max_temperature_reduce.py
1949 111
1950 22
DESIGN OF HDFS
When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. Filesystems that manage the storage across a network of machines are called distributed filesystems.
Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop
Distributed Filesystem.
The Design of HDFS :
HDFS is a filesystem designed for storing very large files with streaming data access
patterns, running on clusters of commodity hardware.
Very large files:
"Very large" in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.
Streaming data access:
HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, and then various analyses are performed on that dataset over time.
Commodity hardware:
Hadoop does not require expensive, highly reliable hardware. It is designed to run on clusters of commodity hardware (commonly available hardware from multiple vendors) for which the chance of node failure across the cluster is high, at least for large clusters.
HDFS is not a good fit for every application. Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS. In addition, files in HDFS may be written to by only a single writer, and writes are always made at the end of the file; there is no support for multiple writers or for modifications at arbitrary offsets in the file.
The block abstraction brings several benefits to a distributed filesystem.
First:
A file can be larger than any single disk in the network. Nothing requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster.
Second:
Making the unit of abstraction a block rather than a file simplifies the storage
subsystem. The storage subsystem deals with blocks, simplifying storage management (since
blocks are a fixed size, it is easy to calculate how many can be stored on a given disk) and
eliminating metadata concerns.
Third:
Blocks fit well with replication for providing fault tolerance and availability. To insure
against corrupted blocks and disk and machine failure, each block is replicated to a small
number of physically separate machines (typically three).
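As a hedged sketch of how the replication factor is controlled, the fragment below sets the cluster-wide default through the dfs.replication property and changes it for one placeholder file through the FileSystem API; the path is an illustrative assumption.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Default number of replicas for newly created files (three is typical)
    conf.setInt("dfs.replication", 3);
    FileSystem fs = FileSystem.get(conf);

    // The replication factor can also be changed for an existing (placeholder) file
    fs.setReplication(new Path("/user/hadoop/big-file.dat"), (short) 3);
  }
}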
Why Is a Block in HDFS So Large?
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost
of seeks. By making a block large enough, the time to transfer the data from the disk can be
made to be significantly larger than the time to seek to the start of the block. Thus the time to
transfer a large file made of multiple blocks operates at the disk transfer rate.
A quick calculation shows that if the seek time is around 10 ms, and the transfer rate is
100 MB/s, then to make the seek time 1% of the transfer time, we need to make the block size
around 100 MB. The default is actually 64 MB, although many HDFS installations use 128
MB blocks. This figure will continue to be revised upward as transfer speeds grow with new
generations of disk drives.
Name nodes and Data nodes:
- The Master (NameNode) manages file system namespace operations like opening, closing, and renaming files and directories, determines the mapping of blocks to DataNodes, and regulates access to files by clients.
- Slaves (DataNodes) serve read and write requests from the file system's clients, and perform block creation, deletion, and replication upon instruction from the Master (NameNode).
Data nodes are the workhorses of the filesystem. They store and retrieve blocks when they are
told to (by clients or the namenode), and they report back to the namenode periodically with
lists of blocks that they are storing.
NameNode failure: if the machine running the namenode failed, all the files on the filesystem
would be lost since there would be no way of knowing how to reconstruct the files from the
blocks on the datanodes.