Basics of Hadoop
Data format – analyzing data with Hadoop – scaling out – Hadoop streaming – Hadoop pipes
– design of Hadoop distributed file system (HDFS) – HDFS concepts – Java interface – data
flow – Hadoop I/O – data integrity – compression – serialization – Avro – file-based data
structures - Cassandra – Hadoop integration.
Data Format
Q. What is the data format of Hadoop?
In the classic Hadoop weather-data example, the data is stored using a line-oriented ASCII format, in which each line is a record. The format supports a rich set of meteorological elements, many of which are optional or have variable data lengths.
The default output format provided by Hadoop is TextOutputFormat, which writes records as lines of text. If a file output format is not specified explicitly, then text files are created as output files. Output key-value pairs can be of any type, because TextOutputFormat converts them into strings by calling their toString() method.
HDFS data is stored in something called blocks. These blocks are the smallest unit of data
that the file system can store. Files are processed and broken down into these blocks, which
are then taken and distributed across the cluster and also replicated for safety.
Analyzing the Data with Hadoop
Hadoop supports parallel processing, and we take advantage of this by expressing a query as a MapReduce job. After some local, small-scale testing, we can run it on a cluster of machines.
MapReduce works by breaking the processing into two phases: The map phase and the
reduce phase.
Each phase has key-value pairs as input and output, the types of which may be chosen by the
programmer. The programmer also specifies two functions: The map function and the reduce
function.
MapReduce is a method for distributing a task across multiple nodes. Each node processes
the data stored on that node to the extent possible. A running MapReduce job consists of
various phases which is described in the following
Phases of Hadoop MapReduce
In the map phase, the input dataset is split into chunks and the map tasks process these chunks in parallel. The map outputs are then used as inputs to the reduce tasks. The reducers process the intermediate data from the maps into smaller tuples, producing the final output of the framework.
The advantages of using MapReduce, which runs over distributed infrastructure such as CPU and storage, are automatic parallelization and distribution of data in blocks across a distributed system; fault tolerance against failure of storage, computation and network infrastructure; deployment, monitoring and security capability; and a clear abstraction for programmers. Most MapReduce programs are written in Java, but they can also be written in any scripting language using Hadoop's Streaming API.
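As an illustration (not part of the original notes), the following is a minimal word-count MapReduce job written against the org.apache.hadoop.mapreduce API; the class names WordCountMapper and WordCountReducer are chosen here only for the example.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in an input line
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: sums the counts emitted for each word
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}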
Scaling Out
Q.Discuss data flow in MapReduce programming model.
To scale out, we need to store the data in a distributed file system, typically HDFS, to allow
Hadoop to move the MapReduce computation to each machine hosting a part of the data.
A MapReduce job is a unit of work that the client wants to be performed: It consists of the
input data, the MapReduce program and configuration information.
Hadoop runs the job by dividing it into tasks, of which there are two types: Map tasks and
reduce tasks.
There are two types of nodes that control the job execution process: A job tracker and a
number of task trackers.
Job tracker: This tracker plays the role of scheduling jobs and tracking all jobs assigned to the task trackers.
Task tracker: This tracker plays the role of tracking tasks and reporting the status of tasks to
the job tracker.
Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits.
Hadoop creates one map task for each split, which runs the user defined map function for
each record in the split.
That split information is used by the YARN Application Master to try to schedule map tasks on the same node where the split data resides, thus making the task data-local. If map tasks are spawned at random locations, then each map task has to copy the data it needs to process from the DataNode where that split data resides, consuming a lot of cluster bandwidth.
By trying to schedule map tasks on the same node where split data is residing, what Hadoop
framework does is to send computation to data rather than bringing data to computation,
saving cluster bandwidth. This is called data locality optimization.
Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is intermediate output: It is processed by the reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. So storing it in HDFS, with
replication, would be overkill. If the node running the map task fails before the map output
has been consumed by the reduce task, then Hadoop will automatically rerun the map task
on another node to re-create the map output.
The number of reduce tasks is not governed by the size of the input, but is specified independently.
When there are multiple reducers, the map tasks partition their output, each creating one
partition for each reduce task. There can be many keys in each partition, but the records for
any given key are all in a single partition.
MapReduce data flow with multiple reduce tasks
Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all.
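A short driver sketch (an illustration, not from the notes) shows where the combiner is plugged in; it reuses the hypothetical WordCountMapper and WordCountReducer from the sketch above, and reusing the reducer as the combiner is valid here because summing counts is commutative and associative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        // The reducer is reused as the combiner, so partial sums are computed
        // on the map side before the shuffle.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}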
Hadoop Streaming
Q. What is Hadoop streaming? Explain in details.
Definition:
Hadoop Streaming is an API that allows writing Mappers and Reducers in any language. It
uses UNIX standard streams as the interface between Hadoop and the user application.
Hadoop streaming is a utility that comes with the Hadoop distribution.
Streaming is naturally suited for text processing. The data view is line-oriented and is processed as key-value pairs separated by a tab character. The Reduce function reads lines from standard input, sorted by key, and writes its results to standard output.
It helps in real-time data analysis, which is much faster using MapReduce programming running on a multi-node cluster. Different technologies like Spark, Kafka and others help in real-time Hadoop streaming.
Features of Hadoop Streaming
1. Users can execute non-Java-programmed MapReduce jobs on Hadoop clusters. Supported languages include Python, Perl, and C++.
2. Hadoop Streaming monitors the progress of jobs and provides logs of a job's entire
execution for analysis.
3. Hadoop Streaming works on the MapReduce paradigm, so it supports scalability, flexibility, and security/authentication.
4. Hadoop Streaming jobs are quick to develop and don't require much programming.
The following shows how the streaming utility is invoked:
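A typical invocation looks like the one below (shown as a shell command; the JAR location and the Python mapper/reducer scripts are placeholders, not from the original notes):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/hduser/input \
    -output /user/hduser/output \
    -mapper mapper.py \
    -reducer reducer.py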
Where:
Input = Input location for Mapper from where it can read input
Output = location for the Reducer to store the output
Mapper = The executable file of the Mapper
Reducer = The executable file of the Reducer
Map and reduce functions read their input from STDIN and produce their output to STDOUT. The Mapper reads the input data from the input reader/format in the form of key-value pairs, maps them according to the logic written in the code, and then passes them to the Reduce stream, which performs data aggregation and releases the data to the output.
Hadoop Pipes
Q. Briefly explain about Hadoop pipes
Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Streaming, which uses standard input and output to communicate with the map and reduce code, Pipes uses sockets as the channel over which the task tracker communicates with the process running the C++ map or reduce function.
Design of Hadoop Distributed File System (HDFS)
HDFS stores file system metadata and application data separately. As in other distributed file systems, like GFS, HDFS stores metadata on a dedicated server, called the NameNode. Application data is stored on other servers, called DataNodes. All servers are fully connected and communicate with each other using TCP-based protocols.
Hadoop Distributed File System (HDFS) is a distributed file system that handles large data sets running on commodity hardware. It is used to scale an Apache Hadoop cluster to hundreds of nodes.
A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and this is configurable. When a file is saved in HDFS, the file is broken into smaller chunks or "blocks".
HDFS is a fault-tolerant and resilient system, meaning it prevents a failure in a node from affecting the overall system's health and allows for recovery from failure too. In order to achieve this, data stored in HDFS is automatically replicated across different nodes.
HDFS supports a traditional hierarchical file organization. A user or an application can
create directories and store files inside these directories. The file system namespace
hierarchy is similar to most other existing file systems; one can create and remove files,
move a file from one directory to another, or rename a file.
Hadoop distributed file system is a block-structured file system where each file is divided
into blocks of a pre-determined size. These blocks are stored across a cluster of one or
several machines.
Apache Hadoop HDFS follows a master/slave architecture, where a cluster comprises a single NameNode (master node) and all the other nodes are DataNodes (slave nodes).
HDFS can be deployed on a broad spectrum of machines that support Java. Though one can run several DataNodes on a single machine, in practice these DataNodes are spread across various machines.
Design issues of HDFS:
1. Commodity hardware: HDFS does not require expensive hardware for executing user tasks. It is designed to run on clusters of commodity hardware.
2. Streaming data access: HDFS is built around the idea that the most efficient data
processing pattern is a write-once, read-many-times pattern.
3. Multiple writers, arbitrary file modifications: Files in HDFS may be written to by a
single writer. Writes are always made at the end of the file. There is no support for
multiple writers, or for modifications at arbitrary offsets in the file.
4. Low-latency data access: HDFS is optimized for delivering high throughput, not for low-latency access to individual records.
5. Lots of small files: Since the namenode holds the file system metadata in memory, HDFS is not well suited to a very large number of small files.
6. Store very large files: HDFS is designed for files that are hundreds of megabytes, gigabytes or terabytes in size.
HDFS achieves the following goals:
1. Manage large datasets: Organizing and storing datasets can be a hard task to handle. HDFS is used to manage the applications that have to deal with huge datasets. To do this, an HDFS cluster can have hundreds of nodes.
2. Detecting faults: HDFS should have technology in place to scan and detect faults quickly and effectively, as it includes a large amount of commodity hardware where failure of components is a common issue.
3. Hardware efficiency: When large datasets are involved it can reduce the network traffic
and increase the processing speed.
HDFS Architecture
Hadoop architecture
DataNodes process and store data blocks, while NameNodes manage the many DataNodes,
maintain data block metadata and control client access.
NameNode and DataNode
The Namenode holds the metadata for HDFS, such as namespace information, block information, etc. When in use, all this information is stored in main memory, but it is also stored on disk for persistent storage.
Namenode manages the file system namespace. It keeps the directory tree of all files in the
file system and metadata about files and directories.
DataNode is a slave node in HDFS that stores the actual data as instructed by the
NameNode. In brief, NameNode controls and manages a single or multiple datanodes.
DataNode serves read and write requests. It also creates, deletes and replicates blocks on instructions from the NameNode.
Name node
The NameNode maintains two files on disk:
1. fsimage: A snapshot of the file system metadata at the time the namenode started.
2. Edit log: The sequence of changes made to the file system after the namenode started.
Only on a restart of the namenode are the edit logs applied to fsimage to get the latest snapshot of the file system.
But namenode restarts are rare in production clusters, which means edit logs can grow very large on clusters where the namenode runs for a long period of time.
We will encounter the following issues in this situation:
1. The edit log becomes very large, which makes it challenging to manage.
2. Namenode restart takes a long time because a lot of changes have to be merged.
3. In the case of a crash, we will lose a huge amount of metadata since the fsimage is very old.
So to overcome these issues, we need a mechanism which helps us keep the edit log at a manageable size and keep an up-to-date fsimage, so that the load on the namenode is reduced.
The Secondary Namenode helps to overcome these issues by taking over the responsibility of merging edit logs with fsimage from the namenode.
Secondary Namenode
Working of secondary Namenode:
1. It gets the edit logs from the Namenode at regular intervals and applies them to fsimage.
2. Once it has a new fsimage, it copies it back to the Namenode.
3. The Namenode will use this fsimage for the next restart, which will reduce the startup time.
The Secondary Namenode's whole purpose is to have a checkpoint in HDFS. It is just a helper node for the Namenode, which is why it is also known as the checkpoint node inside the community.
HDFS Block
Q. Write short note on HDFS block
HDFS is a block-structured file system. In general, user data is stored in HDFS in terms of blocks. The files in the file system are divided into one or more segments called blocks. The default size of an HDFS block was 64 MB in earlier Hadoop versions (128 MB from Hadoop 2 onward) and can be changed as per need.
HDFS is fault tolerant such that if a data node fails, then the current block write operation on that data node is re-replicated to some other node. The block size, number of replicas and replication factor are specified in the Hadoop configuration file. The synchronization between the name node and data nodes is done by heartbeat messages, which are periodically sent by the data nodes to the name node.
Apart from the above components, the job tracker and task trackers are used when MapReduce applications run over HDFS. Hadoop core consists of one master job tracker and several
task trackers. The job tracker runs on name nodes like a master while task trackers run on
data nodes like slaves.
The job tracker is responsible for taking the requests from a client and assigning task
trackers to it with tasks to be performed. The job tracker always tries to assign tasks to the
task tracker on the data nodes where the data is locally present.
If for some reason the node fails, the job tracker assigns the task to another task tracker where a replica of the data exists, since the data blocks are replicated across the data nodes. This ensures that the job does not fail even if a node fails within the cluster.
The Command Line Interface:
HDFS can also be manipulated using the command line. All the commands used for manipulating HDFS through the command line interface begin with the "hadoop fs" command.
Most of the Linux file commands are supported over HDFS and start with a "-" sign.
For example: The command for listing the files in a Hadoop directory is,
#hadoop fs -ls
The general syntax of HDFS command line manipulation is,
#hadoop fs -<command>
Java Interface
Q. Briefly explain about java interface in Hadoop filesystem.
Hadoop is written in Java, so most Hadoop filesystem interactions are mediated through the Java API. The filesystem shell, for example, is a Java application that uses the Java FileSystem class to provide file system operations. By exposing its filesystem interface as a Java API, Hadoop makes it awkward for non-Java applications to access HDFS.
1. Reading Data from a Hadoop URL :
The simplest way to read a file from a Hadoop file system is by using a java.net.URL object to open a stream to read the data. The syntax is as follows:
InputStream in = null;
try {
    in = new URL("hdfs://host/path").openStream();
    // process in
} finally {
    IOUtils.closeStream(in);
}
Java recognizes Hadoop's hdfs URL scheme by calling the setURLStreamHandlerFactory() method on URL with an instance of FsUrlStreamHandlerFactory.
setURLStreamHandlerFactory() is a method in the java.net.URL class that sets the URL stream handler factory for the Java Virtual Machine. This factory is responsible for creating URL stream handler instances that are used to retrieve the contents of a URL.
This method can only be called once per JVM, so it is typically executed in a static block.
Example: Displaying files from a Hadoop file system on standard output using a
URLStreamHandler.
import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
    static {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
Sometimes it is not possible to set a URLStreamHandlerFactory for our application, so in that case we use the FileSystem API for opening an input stream for a file. A file in a Hadoop filesystem is represented by a Hadoop Path object.
There are two static factory methods for getting a FileSystem instance:
public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
FileSystem is a generic abstract class used to interface with a file system. The FileSystem class also serves as a factory for concrete implementations; for example:
public static FileSystem get(Configuration conf): Uses information from the configuration, such as scheme and authority.
A Configuration object encapsulates a client or server's configuration, which is set using configuration files read from the classpath, such as conf/core-site.xml.
FSDataInputStream :
The open() method on FileSystem actually returns an FSDataInputStream rather than a standard java.io class.
This class, in the org.apache.hadoop.fs package, is a specialization of java.io.DataInputStream with support for random access, so we can read from any part of the stream.
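A minimal sketch of reading a file through the FileSystem API (modelled on the common FileSystemCat pattern; the command-line URI is assumed to be an HDFS path):

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        // Pick the FileSystem implementation for the URI scheme (e.g. hdfs://)
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            // open() returns an FSDataInputStream
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}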
Writing Data:
The FileSystem class has a number of methods for creating a file. The simplest is the method that takes a Path object for the file to be created and returns an output stream to write to:
public FSDataOutputStream create(Path f) throws IOException
FSDataOutputStream:
The create() method on FileSystem returns an FSDataOutputStream which, like FSDataInputStream, has a method for querying the current position in the file. It is also in the org.apache.hadoop.fs package.
We can append to an existing file using the append() method:
public FSDataOutputStream append(Path f) throws IOException
The append operation allows a single writer to modify an already written file by opening it and writing data from the final offset in the file. With this API, applications that produce unbounded files, such as log files, can write to an existing file after a restart. The append operation is optional and is not implemented by all Hadoop filesystems.
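As a sketch of the write path (the hdfs://host path is a placeholder, and append() is assumed to be available, which is true for HDFS but not for all filesystems):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileWriteExample {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://host/user/demo/app.log";   // placeholder path
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);

        // create() returns an FSDataOutputStream for the new file
        FSDataOutputStream out = fs.create(path);
        out.writeBytes("first record\n");
        out.close();

        // append() reopens the file and continues writing from the final offset
        FSDataOutputStream appendOut = fs.append(path);
        appendOut.writeBytes("second record\n");
        appendOut.close();
    }
}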
Data Flow
Q. Explain in detail the data flow and heartbeat mechanism of HDFS.
1. Anatomy of a File Read:
The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (DFS). DFS calls the namenode, using RPC, to determine the locations of the first few blocks in the file.
For each block, the namenode returns the addresses of the datanodes that have a copy of that block. Furthermore, the datanodes are sorted according to their proximity to the client. If the client is itself a datanode, then it will read from the local datanode, if it hosts a copy of the block.
The DFS returns an FSDataInputStream to the client for it to read data from.
FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and
namenode I/O.
As the client reads through the stream, DFSInputStream opens connections to the datanodes holding the blocks and asks the namenode for the datanode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream.
During reading, if the DFSInputStream encounters an error while communicating with a datanode, then it will try the next closest datanode for that block.
2. Anatomy of a File Write:
1. The client calls create() on DistributedFileSystem to create a file.
2. An RPC call to the namenode happens through the DFS to create a new file.
3. As the client writes data, the data is split into packets by DFSOutputStream, which writes them to an internal queue called the data queue. The DataStreamer consumes the data queue.
4. The DataStreamer streams the packets to the first DataNode in the pipeline, which stores each packet and forwards it to the second DataNode in the pipeline.
5. In addition to the data queue, DFSOutputStream also manages an "ack queue" of the packets that are waiting to be acknowledged by the DataNodes.
6. When the client finishes writing the file, it calls close() on the stream.
Heartbeat mechanism
The connectivity between the NameNode and a DataNode is managed by the persistent heartbeats that are sent by the DataNode every three seconds.
The heartbeat provides the NameNode confirmation about the availability of the blocks and
the replicas of the DataNode.
Additionally, heartbeats also carry information about total storage capacity, storage in use and the number of data transfers currently in progress. These statistics are used by the NameNode for managing space allocation and load balancing.
During normal operations, if the NameNode does not receive a heartbeat from a DataNode within ten minutes, it considers that DataNode to be out of service and the block replicas hosted on it to be unavailable.
The NameNode schedules the creation of new replicas of those blocks on other DataNodes.
Replies to heartbeats carry instructions from the NameNode, including commands to:
a) Replicate blocks to other nodes.
b) Remove local block replicas.
c) Re-register the node.
d) Shut down the node.
e) Send an immediate block report.
Role of Sorter, Shuffler and Combiner in the MapReduce Paradigm
A combiner, also known as a semi-reducer, is an optional class that operates by accepting the inputs from the Map class and thereafter passing the output key-value pairs to the Reducer class.
The main function of a combiner is to summarize the map output records with the same key. The output of the combiner will be sent over the network to the actual reducer task as input.
The process of transferring data from the mappers to the reducers is known as shuffling, i.e., the process by which the system performs the sort and transfers the map output to the reducer as input. The shuffle phase is therefore necessary for the reducers; otherwise, they would not have any input.
Shuffle phase in Hadoop transfers the map output from Mapper to a Reducer in MapReduce.
Sort phase in MapReduce covers the merging and sorting of map outputs.
Data from the mapper are grouped by the key, split among reducers and sorted by the key.
Every reducer obtains all values associated with the same key. Shuffle and sort phase in
Hadoop occur simultaneously and are done by the MapReduce framework.
Hadoop I/O
The Hadoop input/output system comes with a set of primitives. Since Hadoop deals with multi-terabyte datasets, a close look at these primitives gives an idea of how Hadoop handles data input and output.
Data Integrity
Q. Explain in detail data integrity and the Hadoop local file system in HDFS.
Data integrity means that data should remain accurate and consistent all across its storing,
processing and retrieval operations.
However, every I/O operation on the disk or network carries with it a small chance of introducing errors into the data that it is reading or writing. The usual way of detecting corrupted data is by computing a checksum for the data when it first enters the system and again whenever it is transmitted across a channel that is unreliable and hence capable of corrupting the data.
The commonly used error-detecting code is CRC-32, which computes a 32-bit integer checksum for input of any size.
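A small illustration (not HDFS code) of computing a CRC-32 checksum with java.util.zip.CRC32; HDFS applies the same idea to every io.bytes.per.checksum bytes of data:

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumDemo {
    public static void main(String[] args) {
        byte[] data = "example data chunk".getBytes(StandardCharsets.UTF_8);
        CRC32 crc = new CRC32();
        crc.update(data);                                  // feed the bytes to the checksum
        System.out.println("CRC-32 = " + crc.getValue());  // 32-bit value
    }
}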
HDFS transparently checksums all data written to it and by default verifies checksums when reading data. A separate checksum is created for every io.bytes.per.checksum bytes of data. The default is 512 bytes, and since a CRC-32 checksum is 4 bytes long, the storage overhead is not an issue.
All data that enters into the system is verified by the datanodes before being forwarded for
storage or further processing. Data sent to the datanode pipeline is verified through
checksums and any corruption found is immediately notified to the client with
ChecksumException.
The client read from the datanode also goes through the same drill. The datanodes maintain
a log of checksum verification to keep track of the verified block. The log is updated by the
datanode upon receiving a block verification success signal from the client. This type of
statistics helps in keeping the bad disks at bay.
Apart from this, a periodic verification of the block store is made with the help of the DataBlockScanner running along with the datanode thread in the background. This protects data from corruption in the physical storage media.
Since HDFS stores replicas of blocks, it can "heal" corrupted blocks by copying one of the good replicas to produce a new, uncorrupt replica.
If a client detects an error when reading a block, it reports the bad block and the datanode it was trying to read from to the namenode before throwing a ChecksumException. The namenode marks the block replica as corrupt, so it does not direct clients to it, or try to copy this replica to another datanode.
It then schedules a copy of the block to be replicated on another datanode, so its replication factor is back at the expected level. Once this has happened, the corrupt replica is deleted. It is possible to disable verification of checksums by passing false to the setVerifyChecksum() method on FileSystem, before using the open() method to read a file.
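A minimal sketch of disabling checksum verification before a read (the HDFS path is a placeholder):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadWithoutChecksum {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://host/user/demo/part-00000";   // placeholder path
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        // Turn off checksum verification before opening the file
        fs.setVerifyChecksum(false);
        FSDataInputStream in = fs.open(new Path(uri));
        IOUtils.copyBytes(in, System.out, 4096, true);     // true closes the stream
    }
}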
The Hadoop LocalFileSystem performs client-side checksumming: when a file is written, a hidden .crc file is created alongside it, and the checksums are verified when the file is read back. If an error is detected, the LocalFileSystem throws a ChecksumException.
Serialization
Serialization converts the state of an object into an easily transmittable form. In this serialized form, the data can be delivered to another data store, application, or some other destination.
Data serialization is the process of converting an object into a stream of bytes to more easily
save or transmit it.
The reverse process, constructing a data structure or object from a series of bytes, is deserialization. The deserialization process recreates the object, thus making the data easier to read and modify as a native structure in a programming language.
In Hadoop, interprocess communication between nodes is implemented using remote procedure calls (RPCs), and a serialization format used for RPC should be:
a. Compact: A compact format makes the best use of network bandwidth, which is the scarcest resource in a data center.
b. Fast: Inter process communication forms the backbone for a distributed system, so it
is essential that there is as little performance overhead as possible for the
serialization and deserialization process.
c. Extensible: Protocols change over time to meet new requirements, so it should be
straightforward to evolve the protocol in a controlled manner for clients and servers.
d. Interoperable: For some systems, it is desirable to be able to support clients that are
written in different languages to the server, so the format needs to be designed to
make this possible.
The Writable Interface
Q. Briefly explain about the writable interface of Hadoop.
Hadoop uses its own serialization format called Writable. It is written in Java and is fast as well as compact. The other serialization framework supported by Hadoop is Avro.
The Writable interface defines two methods: One for writing its state to a DataOutput binary
stream and one for reading its state from a DataInput binary stream.
When we write a key as IntWritable in the Mapper class and send it to the Reducer class, there is an intermediate phase between the Mapper and Reducer classes, i.e., shuffle and sort, where each key has to be compared with many other keys. If the keys are not comparable, then the shuffle and sort phase cannot be executed, or may be executed with a high amount of overhead.
If a key is taken as IntWritable, then by default it has a comparable feature because of the RawComparator acting on that type. It will compare the key taken with the other keys in the network. This cannot take place in the absence of Writable.
WritableComparator is a general-purpose implementation of RawComparator for
WritableComparable classes. It provides two main functions:
a. It provides a default implementation of the raw compare() method that deserializes the objects to be compared from the stream and invokes the objects' compareTo() method.
b. It acts as a factory for RawComparator instances.
To provide mechanisms for serialization and deserialization of data, Hadoop provides two important interfaces, Writable and WritableComparable. The Writable interface specification is as follows:
package org.apache.hadoop.io;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}
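As an illustration (not from the notes) of what write() and readFields() do, the following helper serializes an IntWritable to a byte array and reads it back:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class WritableDemo {
    // Serialize any Writable into a byte array using its write() method
    public static byte[] serialize(Writable w) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        w.write(dataOut);
        dataOut.close();
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        IntWritable value = new IntWritable(163);
        byte[] bytes = serialize(value);        // an IntWritable serializes to 4 bytes

        // Deserialize by reading the bytes back through readFields()
        IntWritable copy = new IntWritable();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        System.out.println(bytes.length + " bytes, value = " + copy.get());
    }
}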
Among the primitive Writable wrapper classes, VIntWritable and VLongWritable are used for variable-length integer and variable-length long types respectively.
Serialized sizes of the above primitive Writable data types are the same as the sizes of the actual Java data types. So, the size of IntWritable is 4 bytes and LongWritable is 8 bytes.
Text:
Text is a Writable for UTF-8 sequences. It can be thought of as the Writable equivalent of
java.lang.String.
The Text class uses an int to store the number of bytes in the string encoding, so the
maximum value is 2 GB.
BytesWritable:
BytesWritable is a wrapper for an array of binary data. Its serialized format is an integer field (4 bytes) that specifies the number of bytes to follow, followed by the bytes themselves.
BytesWritable is mutable and its value may be changed by calling its set() method.
NullWritable:
NullWritable is a special type of Writable, as it has a zero-length serialization. No bytes are written to, or read from, the stream. It is used as a placeholder.
ObjectWritable and GenericWritable
ObjectWritable is a general-purpose wrapper for the following: Java primitives, String, enum, Writable, null, or arrays of any of these types.
It is used in Hadoop RPC to marshal and unmarshal method arguments and return types.
There are four Writable collection types in the org.apache.hadoop.io package: ArrayWritable, TwoDArrayWritable, MapWritable, and SortedMapWritable.
ArrayWritable and TwoDArrayWritable are Writable implementations for arrays and two-
dimensional arrays of Writable instances. All the elements of an ArrayWritable or a
TwoDArrayWritable must be instances of the same class.
ArrayWritable and TwoDArrayWritable both have get() and set() methods, as well as a
toArray() method, which creates a shallow copy of the array.
MapWritable and SortedMapWritable are implementations of java.util.Map<Writable,
Writable> and java.util.SortedMap<WritableComparable, Writable>, respectively. The type
of each key and value field is a part of the serialization format for that field.
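A small illustrative snippet (an assumption for this example, not from the notes) showing a MapWritable holding Writable keys and values of different types:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;

public class MapWritableDemo {
    public static void main(String[] args) {
        MapWritable map = new MapWritable();
        // Keys and values may be different Writable types; the type of each
        // field is recorded as part of the serialization format.
        map.put(new Text("station"), new Text("011990-99999"));
        map.put(new IntWritable(1), new IntWritable(163));
        Text station = (Text) map.get(new Text("station"));
        System.out.println(station);
    }
}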
Q. Write short note on Avro, file-based structure and Cassandra Hadoop integration.
Avro
Data serialization is a technique of converting data into binary or text format. There are multiple systems available for this purpose, and Apache Avro is one of those data serialization systems.
Avro is a language-independent, schema-based data serialization library. It uses a schema to perform serialization and deserialization. Moreover, Avro uses a JSON format to specify the data structure, which makes it more powerful.
Avro creates a data file where it keeps data along with schema in its metadata section. Avro
files include markers that can be used to split large data sets into subsets suitable for Apache
MapReduce processing.
Avro has rich schema resolution capabilities. Within certain carefully defined constraints,
the schema used to read data need not be identical to the schema that was used to write the
data.
An Avro data file has a metadata section where the schema is stored, which makes the file self-describing. Avro data files support compression and are splittable, which is crucial for a MapReduce data input format.
Avro defines a small number of data types, which can be used to build application specific
data structures by writing schemas.
Avro supports two types of data :
a. Primitive types: Avro supports all the primitive types. We use the primitive type name to define the type of a given field. For example, a value which holds a string should be declared as {"type": "string"} in the schema.
b. Complex type: Avro supports six kinds of complex types: records, enums, arrays, maps,
unions and fixed.
Avro data files :
A data file has a header containing metadata, including the Avro schema and a sync marker, followed by a series of blocks containing the serialized Avro objects.
Blocks are separated by a sync marker that is unique to the file and that permits rapid resynchronization with a block boundary after seeking to an arbitrary point in the file, such as an HDFS block boundary. Thus, Avro data files are splittable, which makes them amenable to efficient MapReduce processing.
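A short sketch (the schema, file name and field values here are assumptions for the example) of writing an Avro data file with the generic Java API; the schema is stored in the file's metadata header:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema for a simple weather record, given as JSON
        String schemaJson = "{\"type\":\"record\",\"name\":\"WeatherRecord\","
            + "\"fields\":[{\"name\":\"station\",\"type\":\"string\"},"
            + "{\"name\":\"temperature\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("station", "011990-99999");
        rec.put("temperature", 22);

        // Write an Avro data file; the schema goes into the file header
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, new File("weather.avro"));
        writer.append(rec);
        writer.close();
    }
}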
File-based Data Structures
Apache Hadoop supports text files, which are quite commonly used for storing data. Besides text files, it also supports binary files, and one of these binary formats is called the sequence file.
Hadoop sequence file is a flat file structure which consists of serialized key-value pairs. This
is the same format in which the data is stored internally during the processing of the
MapReduce tasks.
Sequence files can also be compressed for space considerations, and based on the compression type, Hadoop sequence files can be of three kinds: uncompressed, record compressed and block compressed.
To create a SequenceFile, use one of its createWriter() static methods, which returns a
SequenceFile.Writer instance.
The keys and values stored in a SequenceFile do not necessarily need to be Writable. Any types that can be serialized and deserialized by a serialization framework may be used.
Reading sequence files from beginning to end is a matter of creating an instance of
SequenceFile.Reader and iterating over records by repeatedly invoking one of the next()
methods.
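A short sketch (the path and the IntWritable/Text key-value choice are assumptions) of writing and reading a SequenceFile using the older FileSystem-based factory methods:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        String uri = "numbers.seq";                 // placeholder path
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);

        IntWritable key = new IntWritable();
        Text value = new Text();

        // Write a few key-value records
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
        for (int i = 0; i < 5; i++) {
            key.set(i);
            value.set("record-" + i);
            writer.append(key, value);
        }
        writer.close();

        // Read the records back by repeatedly calling next()
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        while (reader.next(key, value)) {
            System.out.println(key + "\t" + value);
        }
        reader.close();
    }
}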
The SequenceFile format
Files and directories are represented on the NameNode by inodes, which record attributes like permissions, modification and access times, namespace and disk space quotas.
Q.3 What is data locality optimization?
Ans.: Running the map task on a node where the input data resides in HDFS, rather than moving the data to the computation, is called the data locality optimization.
Q.4 Why do map tasks write their output to the local disk, not to HDFS?
Ans.: Map output is intermediate output: It is processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. So, storing it in HDFS, with replication, would be overkill. If the node running the map task fails before the map output has been consumed by the reduce task, then Hadoop will automatically rerun the map task on another node to re-create the map output.
Q.5 Why is a block in HDFS so large ?
Ans.: HDFS blocks are large compared to disk blocks and the reason is to minimize the cost of
seeks. By making a block large enough, the time to transfer the data from the disk can be made to
be significantly larger than the time to seek to the start of the block. Thus the time to transfer a large
file made of multiple blocks operates at the disk transfer rate.
Q.6 How do HDFS services support big data ?
Ans.: Five core elements of big data organized by HDFS services :
Velocity - How fast data is generated, collated and analyzed.
Volume - The amount of data generated.
Variety - The type of data, this can be structured, unstructured, etc.
Veracity - The quality and accuracy of the data.
Value - How you can use this data to bring an insight into your business processes.
Q.7 What if writable were not there in Hadoop ?
Ans.: Serialization is important in Hadoop because it enables easy transfer of data. IfWritable is not
present in Hadoop, then it uses the serialization of Java which increases the data over-head in the
network.
Q.8 Define serialization.
Ans.: Serialization is the process of converting object data into byte stream data for transmission
over a network across different nodes in a cluster or for persistent data storage.
Q.9 What is writable? Explain its Importance in Hadoop.
Ans.: Writable is an interface in Hadoop. Writables in Hadoop act as wrapper classes for almost all the primitive data types of Java. That is how int of Java has become IntWritable in Hadoop and
String of Java has become Text in Hadoop. Writables are used for creating serialized data types in Hadoop.
Q.10 What happens if a client detects an error when reading a block in Hadoop ?
Ans. If a client detects an error when reading a block :
It reports the bad block and the datanode it was trying to read from to the namenode before throwing a ChecksumException.
The namenode marks the block replica as corrupt, so it does not direct clients to it, or
try to copy this replica to another datanode.
It then schedules a copy of the block to be replicated on another datanode,so its
replication factor is back at the expected level.
Once this has happened, the corrupt replica is deleted.
Q.11 What is a MapFile ?
Ans.: A MapFile is a sorted SequenceFile with an index to permit lookups by key. MapFile can be thought of as a persistent form of java.util.Map which is able to grow beyond the size of a map that is kept in memory.
Q.12 What are Hadoop pipes ?
Ans.: Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Streaming, which uses standard input and output to communicate with the map and reduce code, Pipes uses sockets as the channel over which the task tracker communicates with the process running the C++ map or reduce function.