Basics of Hadoop
Data format – analyzing data with Hadoop – scaling out – Hadoop streaming – Hadoop pipes
– design of Hadoop distributed file system (HDFS) – HDFS concepts – Java interface – data
flow – Hadoop I/O – data integrity – compression – serialization – Avro – file-based data
structures - Cassandra – Hadoop integration.
Data Format
Q. What is the data format of Hadoop?
In the classic Hadoop weather-data example, the data is stored using a line-oriented ASCII format, in which each line is a record. The format supports a rich set of meteorological elements, many of which are optional or have variable data lengths.
The default output format provided by Hadoop is TextOutputFormat, which writes records as lines of text. If a file output format is not specified explicitly, then text files are created as output files. Output key-value pairs can be of any type, because TextOutputFormat converts them into strings by calling their toString() method.
HDFS data is stored in something called blocks. These blocks are the smallest unit of data
that the file system can store. Files are processed and broken down into these blocks, which
are then taken and distributed across the cluster and also replicated for safety.
Analyzing the Data with Hadoop
Hadoop supports parallel processing, and we take advantage of this by expressing a query as a MapReduce job. After some local, small-scale testing, we can run it on a cluster of machines.
MapReduce works by breaking the processing into two phases: The map phase and the
reduce phase.
Each phase has key-value pairs as input and output, the types of which may be chosen by the
programmer. The programmer also specifies two functions: The map function and the reduce
function.
MapReduce is a method for distributing a task across multiple nodes. Each node processes
the data stored on that node to the extent possible. A running MapReduce job consists of
various phases which is described in the following
Phases of Hadoop MapReduce
In the map phase, the input dataset is split into chunks and the map tasks process these chunks in parallel. The map outputs are then used as inputs to the reduce tasks. The reducers process the intermediate data from the maps into smaller tuples, producing the final output of the framework.
The advantages of using MapReduce, which runs over distributed infrastructure such as CPU and storage, are automatic parallelization and distribution of data in blocks across a distributed system; fault tolerance against failure of storage, computation and network infrastructure; deployment, monitoring and security capability; and a clear abstraction for programmers. Most MapReduce programs are written in Java, but they can also be written in any scripting language using Hadoop's Streaming API.
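As an illustration (not part of the original notes), the following is a minimal word-count MapReduce job written against the org.apache.hadoop.mapreduce API; the class names WordCountMapper and WordCountReducer are chosen here only for the example.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in an input line
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: sums the counts emitted for each word
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}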
Scaling Out
Q.Discuss data flow in MapReduce programming model.
To scale out, we need to store the data in a distributed file system, typically HDFS, to allow
Hadoop to move the MapReduce computation to each machine hosting a part of the data.
A MapReduce job is a unit of work that the client wants to be performed: It consists of the
input data, the MapReduce program and configuration information.
Hadoop runs the job by dividing it into tasks, of which there are two types: Map tasks and
reduce tasks.
There are two types of nodes that control the job execution process: A job tracker and a
number of task trackers.
Job tracker: This tracker plays the role of scheduling jobs and tracking all jobs assigned to the task trackers.
Task tracker: This tracker plays the role of tracking tasks and reporting the status of tasks to
the job tracker.
Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits.
Hadoop creates one map task for each split, which runs the user defined map function for
each record in the split.
That split information is used by the YARN Application Master to try to schedule map tasks on the same node where the split data resides, thus making the task data-local. If map tasks are spawned at random locations, then each map task has to copy the data it needs to process from the DataNode where that split data resides, consuming a lot of cluster bandwidth.
By trying to schedule map tasks on the same node where split data is residing, what Hadoop
framework does is to send computation to data rather than bringing data to computation,
saving cluster bandwidth. This is called data locality optimization.
Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is intermediate output: It is processed by the reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. So storing it in HDFS, with
replication, would be overkill. If the node running the map task fails before the map output
has been consumed by the reduce task, then Hadoop will automatically rerun the map task
on another node to re-create the map output.
The number of reduce tasks is not governed by the size of the input, but is specified independently.
When there are multiple reducers, the map tasks partition their output, each creating one
partition for each reduce task. There can be many keys in each partition, but the records for
any given key are all in a single partition.
MapReduce data flow with multiple reduce tasks
Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all.
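A short driver sketch (an illustration, not from the notes) shows where the combiner is plugged in; it reuses the hypothetical WordCountMapper and WordCountReducer from the sketch above, and reusing the reducer as the combiner is valid here because summing counts is commutative and associative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        // The reducer is reused as the combiner, so partial sums are computed
        // on the map side before the shuffle.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}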
Hadoop Streaming
Q. What is Hadoop streaming? Explain in details.
Definition:
Hadoop Streaming is an API that allows writing Mappers and Reducers in any language. It
uses UNIX standard streams as the interface between Hadoop and the user application.
Hadoop streaming is a utility that comes with the Hadoop distribution.
Streaming is naturally suited for text processing. The data view is line-oriented and is processed as key-value pairs separated by a tab character. The Reduce function reads lines from standard input, sorted by key, and writes its results to standard output.
It helps in real-time data analysis, which is much faster using MapReduce programming running on a multi-node cluster. Different technologies like Spark, Kafka and others help in real-time Hadoop streaming.
Features of Hadoop Streaming
1. Users can execute non-Java-programmed MapReduce jobs on Hadoop clusters. Supported languages include Python, Perl, and C++.
2. Hadoop Streaming monitors the progress of jobs and provides logs of a job's entire
execution for analysis.
3. Hadoop Streaming works on the MapReduce paradigm, so it supports scalability, flexibility, and security/authentication.
4. Hadoop Streaming jobs are quick to develop and don't require much programming.
The following shows how the streaming utility is invoked:
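A typical invocation looks like the one below (shown as a shell command; the JAR location and the Python mapper/reducer scripts are placeholders, not from the original notes):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/hduser/input \
    -output /user/hduser/output \
    -mapper mapper.py \
    -reducer reducer.py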
Where:
Input = Input location for Mapper from where it can read input
Output = location for the Reducer to store the output
Mapper = The executable file of the Mapper
Reducer = The executable file of the Reducer
Map and reduce functions read their input from STDIN and produce their output to STDOUT. The Mapper reads the input data from the input reader/format in the form of key-value pairs, maps them according to the logic written in the code, and then passes them to the Reduce stream, which performs data aggregation and releases the data to the output.
Hadoop Pipes
Q. Briefly explain about Hadoop pipes
Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Streaming, which uses standard input and output to communicate with the map and reduce code, Pipes uses sockets as the channel over which the task tracker communicates with the process running the C++ map or reduce function.
Design of Hadoop Distributed File System (HDFS)
HDFS stores file system metadata and application data separately. As in other distributed file systems, like GFS, HDFS stores metadata on a dedicated server, called the NameNode. Application data is stored on other servers, called DataNodes. All servers are fully connected and communicate with each other using TCP-based protocols.
Hadoop Distributed File System (HDFS) is a distributed file system that handles large data sets running on commodity hardware. It is used to scale an Apache Hadoop cluster to hundreds of nodes.
A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and this is configurable. When a file is saved in HDFS, the file is broken into smaller chunks or "blocks".
HDFS is a fault-tolerant and resilient system, meaning it prevents a failure in a node from affecting the overall system's health and allows for recovery from failure too. In order to achieve this, data stored in HDFS is automatically replicated across different nodes.
HDFS supports a traditional hierarchical file organization. A user or an application can
create directories and store files inside these directories. The file system namespace
hierarchy is similar to most other existing file systems; one can create and remove files,
move a file from one directory to another, or rename a file.
Hadoop distributed file system is a block-structured file system where each file is divided
into blocks of a pre-determined size. These blocks are stored across a cluster of one or
several machines.
Apache Hadoop HDFS follows a master/slave architecture, where a cluster comprises a single NameNode (master node) and all the other nodes are DataNodes (slave nodes).
HDFS can be deployed on a broad spectrum of machines that support Java. Though one can run several DataNodes on a single machine, in practice these DataNodes are spread across various machines.
Design issues of HDFS:
1. Commodity hardware: HDFS does not require expensive hardware for executing user tasks. It is designed to run on clusters of commodity hardware.
2. Streaming data access: HDFS is built around the idea that the most efficient data
processing pattern is a write-once, read-many-times pattern.
3. Multiple writers, arbitrary file modifications: Files in HDFS may be written to by a
single writer. Writes are always made at the end of the file. There is no support for
multiple writers, or for modifications at arbitrary offsets in the file.
4. Low-latency data access: HDFS is optimized for delivering high throughput, not for low-latency access to individual records.
5. Lots of small files: Since the namenode holds the file system metadata in memory, HDFS is not well suited to a very large number of small files.
6. Store very large files: HDFS is designed for files that are hundreds of megabytes, gigabytes or terabytes in size.
HDFS achieves the following goals:
1. Manage large datasets: Organizing and storing datasets can be a hard task to handle. HDFS is used to manage the applications that have to deal with huge datasets. To do this, an HDFS cluster can have hundreds of nodes.
2. Detecting faults: HDFS should have technology in place to scan and detect faults quickly and effectively, as it includes a large amount of commodity hardware where failure of components is a common issue.
3. Hardware efficiency: When large datasets are involved it can reduce the network traffic
and increase the processing speed.
HDFS Architecture
Hadoop architecture
DataNodes process and store data blocks, while NameNodes manage the many DataNodes,
maintain data block metadata and control client access.
NameNode and DataNode
The Namenode holds the metadata for HDFS, such as namespace information, block information, etc. When in use, all this information is stored in main memory, but it is also stored on disk for persistent storage.
Namenode manages the file system namespace. It keeps the directory tree of all files in the
file system and metadata about files and directories.
DataNode is a slave node in HDFS that stores the actual data as instructed by the
NameNode. In brief, NameNode controls and manages a single or multiple datanodes.
DataNode serves read and write requests. It also creates, deletes and replicates blocks on instructions from the NameNode.
Name node
The NameNode maintains two files on disk:
1. fsimage: A snapshot of the file system metadata at the time the namenode started.
2. Edit log: The sequence of changes made to the file system after the namenode started.
Only on a restart of the namenode are the edit logs applied to fsimage to get the latest snapshot of the file system.
But namenode restarts are rare in production clusters, which means edit logs can grow very large on clusters where the namenode runs for a long period of time.
We will encounter the following issues in this situation:
1. The edit log becomes very large, which makes it challenging to manage.
2. Namenode restart takes a long time because a lot of changes have to be merged.
3. In the case of a crash, we will lose a huge amount of metadata since the fsimage is very old.
So to overcome these issues, we need a mechanism which helps us keep the edit log at a manageable size and keep an up-to-date fsimage, so that the load on the namenode is reduced.
The Secondary Namenode helps to overcome these issues by taking over the responsibility of merging edit logs with fsimage from the namenode.
Secondary Namenode
Working of secondary Namenode:
1. It gets the edit logs from the Namenode at regular intervals and applies them to fsimage.
2. Once it has a new fsimage, it copies it back to the Namenode.
3. The Namenode will use this fsimage for the next restart, which will reduce the startup time.
The Secondary Namenode's whole purpose is to have a checkpoint in HDFS. It is just a helper node for the Namenode, which is why it is also known as the checkpoint node inside the community.
HDFS Block
Q. Write short note on HDFS block
HDFS is a block-structured file system. In general, user data is stored in HDFS in terms of blocks. The files in the file system are divided into one or more segments called blocks. The default size of an HDFS block was 64 MB in earlier Hadoop versions (128 MB from Hadoop 2 onward) and can be changed as per need.
HDFS is fault tolerant such that if a data node fails, then the current block write operation on that data node is re-replicated to some other node. The block size, number of replicas and replication factor are specified in the Hadoop configuration file. The synchronization between the name node and data nodes is done by heartbeat messages, which are periodically sent by the data nodes to the name node.
Apart from the above components, the job tracker and task trackers are used when MapReduce applications run over HDFS. Hadoop core consists of one master job tracker and several
task trackers. The job tracker runs on name nodes like a master while task trackers run on
data nodes like slaves.
The job tracker is responsible for taking the requests from a client and assigning task
trackers to it with tasks to be performed. The job tracker always tries to assign tasks to the
task tracker on the data nodes where the data is locally present.
If for some reason the node fails, the job tracker assigns the task to another task tracker where a replica of the data exists, since the data blocks are replicated across the data nodes. This ensures that the job does not fail even if a node fails within the cluster.
The Command Line Interface:
HDFS can also be manipulated using the command line. All the commands used for manipulating HDFS through the command line interface begin with the "hadoop fs" command.
Most of the Linux file commands are supported over HDFS and start with a "-" sign.
For example: The command for listing the files in a Hadoop directory is,
#hadoop fs -ls
The general syntax of HDFS command line manipulation is,
#hadoop fs -<command>
Java Interface
Q. Briefly explain about java interface in Hadoop filesystem.
Hadoop is written in Java, so most Hadoop filesystem interactions are mediated through the Java API. The filesystem shell, for example, is a Java application that uses the Java FileSystem class to provide file system operations. By exposing its filesystem interface as a Java API, Hadoop makes it awkward for non-Java applications to access HDFS.
1. Reading Data from a Hadoop URL :
The simplest way to read a file from a Hadoop file system is by using a java.net.URL object to open a stream to read the data. The syntax is as follows:
InputStream in = null;
try {
    in = new URL("hdfs://host/path").openStream();
    // process in
} finally {
    IOUtils.closeStream(in);
}
Java recognizes Hadoop's hdfs URL scheme by calling the setURLStreamHandlerFactory() method on URL with an instance of FsUrlStreamHandlerFactory.
setURLStreamHandlerFactory() is a method in the java.net.URL class that sets the URL stream handler factory for the Java Virtual Machine. This factory is responsible for creating URL stream handler instances that are used to retrieve the contents of a URL.
This method can only be called once per JVM, so it is typically executed in a static block.
Example: Displaying files from a Hadoop file system on standard output using a
URLStreamHandler.
import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
    static {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
Sometimes it is not possible to set a URLStreamHandlerFactory for our application, so in that case we use the FileSystem API for opening an input stream for a file. A file in a Hadoop filesystem is represented by a Hadoop Path object.
There are two static factory methods for getting a FileSystem instance:
public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
FileSystem is a generic abstract class used to interface with a file system. The FileSystem class also serves as a factory for concrete implementations; for example:
public static FileSystem get(Configuration conf): Uses information from the configuration, such as scheme and authority.
A Configuration object encapsulates a client or server's configuration, which is set using configuration files read from the classpath, such as conf/core-site.xml.
FSDataInputStream :
The open() method on FileSystem actually returns an FSDataInputStream rather than a standard java.io class.
This class, in the org.apache.hadoop.fs package, is a specialization of java.io.DataInputStream with support for random access, so we can read from any part of the stream.
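A minimal sketch of reading a file through the FileSystem API (modelled on the common FileSystemCat pattern; the command-line URI is assumed to be an HDFS path):

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        // Pick the FileSystem implementation for the URI scheme (e.g. hdfs://)
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            // open() returns an FSDataInputStream
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}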
Writing Data:
The FileSystem class has a number of methods for creating a file. The simplest is the method that takes a Path object for the file to be created and returns an output stream to write to:
public FSDataOutputStream create(Path f) throws IOException
FSDataOutputStream:
The create() method on FileSystem returns an FSDataOutputStream which, like FSDataInputStream, has a method for querying the current position in the file. It is also in the org.apache.hadoop.fs package.
We can append to an existing file using the append() method:
public FSDataOutputStream append(Path f) throws IOException
The append operation allows a single writer to modify an already written file by opening it and writing data from the final offset in the file. With this API, applications that produce unbounded files, such as log files, can write to an existing file after a restart. The append operation is optional and is not implemented by all Hadoop filesystems.
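As a sketch of the write path (the hdfs://host path is a placeholder, and append() is assumed to be available, which is true for HDFS but not for all filesystems):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileWriteExample {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://host/user/demo/app.log";   // placeholder path
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);

        // create() returns an FSDataOutputStream for the new file
        FSDataOutputStream out = fs.create(path);
        out.writeBytes("first record\n");
        out.close();

        // append() reopens the file and continues writing from the final offset
        FSDataOutputStream appendOut = fs.append(path);
        appendOut.writeBytes("second record\n");
        appendOut.close();
    }
}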
Data Flow
Q. Explain in detail the data flow and heartbeat mechanism of HDFS.
1. Anatomy of a File Read:
The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (DFS). DFS calls the namenode, using RPC, to determine the locations of the first few blocks in the file.
For each block, the namenode returns the addresses of the datanodes that have a copy of that block. Furthermore, the datanodes are sorted according to their proximity to the client. If the client is itself a datanode, then it will read from the local datanode, if it hosts a copy of the block.
The DFS returns an FSDataInputStream to the client for it to read data from.
FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and
namenode I/O.
As the client reads through the stream, DFSInputStream opens connections to the datanodes holding the blocks and asks the namenode for the datanode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream.
During reading, if the DFSInputStream encounters an error while communicating with a datanode, then it will try the next closest datanode for that block.
2. Anatomy of a File Write:
1. The client calls create() on DistributedFileSystem to create a file.
2. An RPC call to the namenode happens through the DFS to create a new file.
3. As the client writes data, the data is split into packets by DFSOutputStream, which writes them to an internal queue called the data queue. The DataStreamer consumes the data queue.
4. The DataStreamer streams the packets to the first DataNode in the pipeline, which stores each packet and forwards it to the second DataNode in the pipeline.
5. In addition to the data queue, DFSOutputStream also manages an "ack queue" of the packets that are waiting to be acknowledged by the DataNodes.
6. When the client finishes writing the file, it calls close() on the stream.
Heartbeat mechanism
The connectivity between the NameNode and a DataNode is managed by the persistent heartbeats that are sent by the DataNode every three seconds.
The heartbeat provides the NameNode confirmation about the availability of the blocks and
the replicas of the DataNode.
Additionally, heartbeats also carry information about total storage capacity, storage in use and the number of data transfers currently in progress. These statistics are used by the NameNode for managing space allocation and load balancing.
During normal operations, if the NameNode does not receive a heartbeat from a DataNode within ten minutes, it considers that DataNode to be out of service and the block replicas hosted on it to be unavailable.
The NameNode schedules the creation of new replicas of those blocks on other DataNodes.
Replies to heartbeats carry instructions from the NameNode, including commands to:
a) Replicate blocks to other nodes.
b) Remove local block replicas.
c) Re-register the node.
d) Shut down the node.
e) Send an immediate block report.
Role of Sorter, Shuffler and Combiner in the MapReduce Paradigm
A combiner, also known as a semi-reducer, is an optional class that operates by accepting the inputs from the Map class and thereafter passing the output key-value pairs to the Reducer class.
The main function of a combiner is to summarize the map output records with the same key. The output of the combiner will be sent over the network to the actual reducer task as input.
The process of transferring data from the mappers to the reducers is known as shuffling, i.e., the process by which the system performs the sort and transfers the map output to the reducer as input. The shuffle phase is therefore necessary for the reducers; otherwise, they would not have any input.
Shuffle phase in Hadoop transfers the map output from Mapper to a Reducer in MapReduce.
Sort phase in MapReduce covers the merging and sorting of map outputs.
Data from the mapper are grouped by the key, split among reducers and sorted by the key.
Every reducer obtains all values associated with the same key. Shuffle and sort phase in
Hadoop occur simultaneously and are done by the MapReduce framework.
Hadoop I/O
The Hadoop input/output system comes with a set of primitives. Since Hadoop deals with multi-terabyte datasets, a close look at these primitives gives an idea of how Hadoop handles data input and output.
Data Integrity
Q. Explain in detail data integrity and the Hadoop local file system in HDFS.
Data integrity means that data should remain accurate and consistent all across its storing,
processing and retrieval operations.
However, every I/O operation on the disk or network carries with it a small chance of introducing errors into the data that it is reading or writing. The usual way of detecting corrupted data is by computing a checksum for the data when it first enters the system and again whenever it is transmitted across a channel that is unreliable and hence capable of corrupting the data.
The commonly used error-detecting code is CRC-32, which computes a 32-bit integer checksum for input of any size.
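A small illustration (not HDFS code) of computing a CRC-32 checksum with java.util.zip.CRC32; HDFS applies the same idea to every io.bytes.per.checksum bytes of data:

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumDemo {
    public static void main(String[] args) {
        byte[] data = "example data chunk".getBytes(StandardCharsets.UTF_8);
        CRC32 crc = new CRC32();
        crc.update(data);                                  // feed the bytes to the checksum
        System.out.println("CRC-32 = " + crc.getValue());  // 32-bit value
    }
}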
HDFS transparently checksums all data written to it and by default verifies checksums when reading data. A separate checksum is created for every io.bytes.per.checksum bytes of data. The default is 512 bytes, and since a CRC-32 checksum is 4 bytes long, the storage overhead is not an issue.
All data that enters into the system is verified by the datanodes before being forwarded for
storage or further processing. Data sent to the datanode pipeline is verified through
checksums and any corruption found is immediately notified to the client with
ChecksumException.
The client read from the datanode also goes through the same drill. The datanodes maintain
a log of checksum verification to keep track of the verified block. The log is updated by the
datanode upon receiving a block verification success signal from the client. This type of
statistics helps in keeping the bad disks at bay.
Apart from this, a periodic verification of the block store is made with the help of the DataBlockScanner running along with the datanode thread in the background. This protects data from corruption in the physical storage media.
Since HDFS stores replicas of blocks, it can "heal" corrupted blocks by copying one of the good replicas to produce a new, uncorrupt replica.
If a client detects an error when reading a block, it reports the bad block and the datanode it was trying to read from to the namenode before throwing a ChecksumException. The namenode marks the block replica as corrupt, so it does not direct clients to it, or try to copy this replica to another datanode.
It then schedules a copy of the block to be replicated on another datanode, so its replication factor is back at the expected level. Once this has happened, the corrupt replica is deleted. It is possible to disable verification of checksums by passing false to the setVerifyChecksum() method on FileSystem, before using the open() method to read a file.
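A minimal sketch of disabling checksum verification before a read (the HDFS path is a placeholder):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadWithoutChecksum {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://host/user/demo/part-00000";   // placeholder path
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        // Turn off checksum verification before opening the file
        fs.setVerifyChecksum(false);
        FSDataInputStream in = fs.open(new Path(uri));
        IOUtils.copyBytes(in, System.out, 4096, true);     // true closes the stream
    }
}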
The Hadoop LocalFileSystem performs client-side checksumming: when a file is written, a hidden .crc file is created alongside it, and the checksums are verified when the file is read back. If an error is detected, the LocalFileSystem throws a ChecksumException.
Serialization
Serialization converts the state of an object into an easily transmittable form. In this serialized form, the data can be delivered to another data store, application, or some other destination.
Data serialization is the process of converting an object into a stream of bytes to more easily
save or transmit it.
The reverse process, constructing a data structure or object from a series of bytes, is deserialization. The deserialization process recreates the object, thus making the data easier to read and modify as a native structure in a programming language.
In Hadoop, interprocess communication between nodes is implemented using remote procedure calls (RPCs), and a serialization format used for RPC should be:
a. Compact: A compact format makes the best use of network bandwidth, which is the scarcest resource in a data center.
b. Fast: Inter process communication forms the backbone for a distributed system, so it
is essential that there is as little performance overhead as possible for the
serialization and deserialization process.
c. Extensible: Protocols change over time to meet new requirements, so it should be
straightforward to evolve the protocol in a controlled manner for clients and servers.
d. Interoperable: For some systems, it is desirable to be able to support clients that are
written in different languages to the server, so the format needs to be designed to
make this possible.
The Writable Interface
Q. Briefly explain about the writable interface of Hadoop.
Hadoop uses its own serialization format called Writable. It is written in Java and is fast as well as compact. The other serialization framework supported by Hadoop is Avro.
The Writable interface defines two methods: One for writing its state to a DataOutput binary
stream and one for reading its state from a DataInput binary stream.
When we write a key as IntWritable in the Mapper class and send it to the Reducer class, there is an intermediate phase between the Mapper and Reducer classes, i.e., shuffle and sort, where each key has to be compared with many other keys. If the keys are not comparable, then the shuffle and sort phase cannot be executed, or may be executed with a high amount of overhead.
If a key is taken as IntWritable, then by default it has a comparable feature because of the RawComparator acting on that type. It will compare the key taken with the other keys in the network. This cannot take place in the absence of Writable.
WritableComparator is a general-purpose implementation of RawComparator for
WritableComparable classes. It provides two main functions:
a. It provides a default implementation of the raw compare() method that deserializes the objects to be compared from the stream and invokes the objects' compareTo() method.
b. It acts as a factory for RawComparator instances.
To provide mechanisms for serialization and deserialization of data, Hadoop provides two important interfaces, Writable and WritableComparable. The Writable interface specification is as follows:
package org.apache.hadoop.io;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}
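As an illustration (not from the notes) of what write() and readFields() do, the following helper serializes an IntWritable to a byte array and reads it back:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class WritableDemo {
    // Serialize any Writable into a byte array using its write() method
    public static byte[] serialize(Writable w) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        w.write(dataOut);
        dataOut.close();
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        IntWritable value = new IntWritable(163);
        byte[] bytes = serialize(value);        // an IntWritable serializes to 4 bytes

        // Deserialize by reading the bytes back through readFields()
        IntWritable copy = new IntWritable();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        System.out.println(bytes.length + " bytes, value = " + copy.get());
    }
}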
Among the primitive Writable wrapper classes, VIntWritable and VLongWritable are used for variable-length integer and variable-length long types respectively.
Serialized sizes of the above primitive Writable data types are the same as the sizes of the actual Java data types. So, the size of IntWritable is 4 bytes and LongWritable is 8 bytes.
Text:
Text is a Writable for UTF-8 sequences. It can be thought of as the Writable equivalent of
java.lang.String.
The Text class uses an int to store the number of bytes in the string encoding, so the
maximum value is 2 GB.
BytesWritable:
BytesWritable is a wrapper for an array of binary data. Its serialized format is an integer field (4 bytes) that specifies the number of bytes to follow, followed by the bytes themselves.
BytesWritable is mutable and its value may be changed by calling its set() method.
NullWritable:
NullWritable is a special type of Writable, as it has a zero-length serialization. No bytes are written to, or read from, the stream. It is used as a placeholder.
ObjectWritable and GenericWritable
ObjectWritable is a general-purpose wrapper for the following: Java primitives, String, enum, Writable, null, or arrays of any of these types.
It is used in Hadoop RPC to marshal and unmarshal method arguments and return types.
There are four Writable collection types in the org.apache.hadoop.io package: ArrayWritable, TwoDArrayWritable, MapWritable, and SortedMapWritable.
ArrayWritable and TwoDArrayWritable are Writable implementations for arrays and two-
dimensional arrays of Writable instances. All the elements of an ArrayWritable or a
TwoDArrayWritable must be instances of the same class.
ArrayWritable and TwoDArrayWritable both have get() and set() methods, as well as a
toArray() method, which creates a shallow copy of the array.
MapWritable and SortedMapWritable are implementations of java.util.Map<Writable,
Writable> and java.util.SortedMap<WritableComparable, Writable>, respectively. The type
of each key and value field is a part of the serialization format for that field.
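A small illustrative snippet (an assumption for this example, not from the notes) showing a MapWritable holding Writable keys and values of different types:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;

public class MapWritableDemo {
    public static void main(String[] args) {
        MapWritable map = new MapWritable();
        // Keys and values may be different Writable types; the type of each
        // field is recorded as part of the serialization format.
        map.put(new Text("station"), new Text("011990-99999"));
        map.put(new IntWritable(1), new IntWritable(163));
        Text station = (Text) map.get(new Text("station"));
        System.out.println(station);
    }
}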
Q. Write short note on Avro, file-based structure and Cassandra Hadoop integration.
Avro
Data serialization is a technique of converting data into binary or text format. There are multiple systems available for this purpose, and Apache Avro is one of those data serialization systems.
Avro is a language-independent, schema-based data serialization library. It uses a schema to perform serialization and deserialization. Moreover, Avro uses a JSON format to specify the data structure, which makes it more powerful.
Avro creates a data file where it keeps data along with schema in its metadata section. Avro
files include markers that can be used to split large data sets into subsets suitable for Apache
MapReduce processing.
Avro has rich schema resolution capabilities. Within certain carefully defined constraints,
the schema used to read data need not be identical to the schema that was used to write the
data.
An Avro data file has a metadata section where the schema is stored, which makes the file self-describing. Avro data files support compression and are splittable, which is crucial for a MapReduce data input format.
Avro defines a small number of data types, which can be used to build application specific
data structures by writing schemas.
Avro supports two types of data :
a. Primitive types: Avro supports all the primitive types. We use the primitive type name to define the type of a given field. For example, a value which holds a string should be declared as {"type": "string"} in the schema.
b. Complex type: Avro supports six kinds of complex types: records, enums, arrays, maps,
unions and fixed.
Avro data files :
A data file has a header containing metadata, including the Avro schema and a sync marker, followed by a series of blocks containing the serialized Avro objects.
Blocks are separated by a sync marker that is unique to the file and that permits rapid resynchronization with a block boundary after seeking to an arbitrary point in the file, such as an HDFS block boundary. Thus, Avro data files are splittable, which makes them amenable to efficient MapReduce processing.
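A short sketch (the schema, file name and field values here are assumptions for the example) of writing an Avro data file with the generic Java API; the schema is stored in the file's metadata header:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema for a simple weather record, given as JSON
        String schemaJson = "{\"type\":\"record\",\"name\":\"WeatherRecord\","
            + "\"fields\":[{\"name\":\"station\",\"type\":\"string\"},"
            + "{\"name\":\"temperature\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("station", "011990-99999");
        rec.put("temperature", 22);

        // Write an Avro data file; the schema goes into the file header
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, new File("weather.avro"));
        writer.append(rec);
        writer.close();
    }
}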
File-based Data Structures
Apache Hadoop supports text files, which are quite commonly used for storing data. Besides text files, it also supports binary files, and one of these binary formats is called the sequence file.
Hadoop sequence file is a flat file structure which consists of serialized key-value pairs. This
is the same format in which the data is stored internally during the processing of the
MapReduce tasks.
Sequence files can also be compressed for space considerations, and based on the compression type, Hadoop sequence files can be of three kinds: uncompressed, record compressed and block compressed.
To create a SequenceFile, use one of its createWriter() static methods, which returns a
SequenceFile.Writer instance.
The keys and values stored in a SequenceFile do not necessarily need to be Writable. Any types that can be serialized and deserialized by a serialization framework may be used.
Reading sequence files from beginning to end is a matter of creating an instance of
SequenceFile.Reader and iterating over records by repeatedly invoking one of the next()
methods.
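A short sketch (the path and the IntWritable/Text key-value choice are assumptions) of writing and reading a SequenceFile using the older FileSystem-based factory methods:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        String uri = "numbers.seq";                 // placeholder path
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);

        IntWritable key = new IntWritable();
        Text value = new Text();

        // Write a few key-value records
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
        for (int i = 0; i < 5; i++) {
            key.set(i);
            value.set("record-" + i);
            writer.append(key, value);
        }
        writer.close();

        // Read the records back by repeatedly calling next()
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        while (reader.next(key, value)) {
            System.out.println(key + "\t" + value);
        }
        reader.close();
    }
}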
The SequenceFile format
Files and directories are represented on the NameNode by inodes, which record attributes like permissions, modification and access times, namespace and disk space quotas.
Q.3 What is data locality optimization?
Ans.: Running the map task on a node where the input data resides in HDFS, rather than moving the data to the computation, is called the data locality optimization.
Q.4 Why do map tasks write their output to the local disk, not to HDFS?
Ans.: Map output is intermediate output: It is processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. So, storing it in HDFS, with replication, would be overkill. If the node running the map task fails before the map output has been consumed by the reduce task, then Hadoop will automatically rerun the map task on another node to re-create the map output.
Q.5 Why is a block in HDFS so large ?
Ans.: HDFS blocks are large compared to disk blocks and the reason is to minimize the cost of
seeks. By making a block large enough, the time to transfer the data from the disk can be made to
be significantly larger than the time to seek to the start of the block. Thus the time to transfer a large
file made of multiple blocks operates at the disk transfer rate.
Q.6 How do HDFS services support big data ?
Ans.: Five core elements of big data organized by HDFS services :
Velocity - How fast data is generated, collated and analyzed.
Volume - The amount of data generated.
Variety - The type of data, this can be structured, unstructured, etc.
Veracity - The quality and accuracy of the data.
Value - How you can use this data to bring an insight into your business processes.
Q.7 What if writable were not there in Hadoop ?
Ans.: Serialization is important in Hadoop because it enables easy transfer of data. IfWritable is not
present in Hadoop, then it uses the serialization of Java which increases the data over-head in the
network.
Q.8 Define serialization.
Ans.: Serialization is the process of converting object data into byte stream data for transmission
over a network across different nodes in a cluster or for persistent data storage.
Q.9 What is writable? Explain its Importance in Hadoop.
Ans.: Writable is an interface in Hadoop. Writables in Hadoop act as wrapper classes for almost all the primitive data types of Java. That is how int of Java has become IntWritable in Hadoop and
String of Java has become Text in Hadoop. Writables are used for creating serialized data types in Hadoop.
Q.10 What happens if a client detects an error when reading a block in Hadoop ?
Ans. If a client detects an error when reading a block :
It reports the bad block and the datanode it was trying to read from to the namenode before throwing a ChecksumException.
The namenode marks the block replica as corrupt, so it does not direct clients to it, or
try to copy this replica to another datanode.
It then schedules a copy of the block to be replicated on another datanode,so its
replication factor is back at the expected level.
Once this has happened, the corrupt replica is deleted.
Q.11 What is a MapFile ?
Ans.: A MapFile is a sorted SequenceFile with an index to permit lookups by key. MapFile can be thought of as a persistent form of java.util.Map which is able to grow beyond the size of a map that is kept in memory.
Q.12 What are Hadoop pipes ?
Ans.: Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Streaming, which uses standard input and output to communicate with the map and reduce code, Pipes uses sockets as the channel over which the task tracker communicates with the process running the C++ map or reduce function.