BFS U2
With growing data velocity, data easily outgrows the storage limit of a single machine. A solution is to store the data across a network of machines; such filesystems are called distributed filesystems. Since data is stored across a network, all the complications of a network come into play.
This is where Hadoop comes in. It provides one of the most reliable filesystems. HDFS (Hadoop Distributed File System) is a unique design that provides storage for extremely large files with a streaming data access pattern, and it runs on commodity hardware. Let's elaborate on these terms:
Extremely large files: Here we are talking about data in the range of petabytes (1 PB = 1000 TB).
Streaming data access pattern: HDFS is designed on the principle of write-once, read-many-times. Once data is written, large portions of the dataset can be processed any number of times.
Commodity hardware: Hardware that is inexpensive and easily available in the market. This is one of the features that especially distinguishes HDFS from other file systems.
1. NameNode (MasterNode):
Stores metadata (data about data) such as file paths, the number of blocks, and block IDs. The metadata is kept in RAM for fast retrieval, i.e., to reduce seek time, though a persistent copy of it is also kept on disk. The NameNode should be deployed on reliable, high-specification hardware, not on commodity hardware.
2. DataNode (SlaveNode):
Actual worker nodes, which do the actual work such as reading, writing, and processing. They also perform block creation, deletion, and replication upon instruction from the master.
Data storage in HDFS: Now let's see how data is stored in a distributed manner.
Assume a 100 TB file is inserted. The masternode (namenode) first divides the file into blocks (the default block size is 128 MB in Hadoop 2.x and above). These blocks are then stored across different datanodes (slavenodes). Datanodes replicate the blocks among themselves, and information about which blocks they contain is sent to the master. The default replication factor is 3, meaning 3 copies of each block exist in total (the original included). In hdfs-site.xml we can increase or decrease the replication factor by editing the dfs.replication property.
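As a back-of-the-envelope illustration (a Python sketch, not Hadoop code), the block and replica counts for the 100 TB example work out as follows:

```python
import math

# Sketch (not Hadoop code): how a file is carved into HDFS blocks
# and how many physical copies the cluster ends up storing.
BLOCK_SIZE = 128 * 1024 * 1024          # 128 MB, the Hadoop 2.x default
REPLICATION = 3                          # default dfs.replication

def block_count(file_size_bytes):
    """Number of block-sized chunks a file is split into."""
    return math.ceil(file_size_bytes / BLOCK_SIZE)

file_size = 100 * 1024**4               # a 100 TB file
blocks = block_count(file_size)
replicas = blocks * REPLICATION

print(blocks)    # 819200 blocks
print(replicas)  # 2457600 block copies across the datanodes
```

This is why the namenode's metadata load matters: even one big file turns into hundreds of thousands of block records.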
Note: The MasterNode has a record of everything; it knows the location and info of each and every datanode and the blocks they contain, i.e., nothing is done without the permission of the masternode.
Why divide the file into blocks? Assume we don't divide: it is very difficult to store a 100 TB file on a single machine. Even if we could store it, each read and write operation on that whole file would have a very high seek time. But if we have multiple blocks of size 128 MB, it becomes easy to perform various read and write operations on them compared to doing it on the whole file at once. So, we divide the file to get faster data access, i.e., to reduce seek time.
Why replicate the blocks? Assume we don't replicate, and only one copy of a given block is present on datanode D1. If D1 crashes, we lose that block, which makes the overall data inconsistent and faulty. So, we replicate the blocks to achieve fault tolerance.
Balancing: If a datanode crashes, the blocks present on it are gone too, and those blocks become under-replicated compared to the rest. The master node (namenode) then signals the datanodes containing replicas of the lost blocks to replicate them, so that the overall distribution of blocks is balanced again.
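The re-replication logic described above can be sketched as a toy simulation (illustrative only: the block and node names are made up, and the real namenode also considers rack placement):

```python
from collections import Counter

# Toy sketch of namenode re-replication: when a datanode dies, blocks
# whose live replica count drops below the target are copied onto
# other datanodes that do not already hold them.
TARGET = 3

datanodes = {
    "D1": {"b1", "b2"},
    "D2": {"b1", "b3"},
    "D3": {"b2", "b3"},
    "D4": {"b1", "b2", "b3"},
}

def rereplicate(nodes, dead_node, target=TARGET):
    survivors = {n: set(blks) for n, blks in nodes.items() if n != dead_node}
    counts = Counter(b for blks in survivors.values() for b in blks)
    for block, live in counts.items():
        for node, blks in survivors.items():
            if live >= target:
                break
            if block not in blks:          # never two replicas on one node
                blks.add(block)
                live += 1
    return survivors

after = rereplicate(datanodes, "D1")       # D1 crashes; b1 and b2 lose a copy
```

After the run, every block is back to 3 live replicas spread over the surviving nodes.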
Features:
The data is highly available as the same block is present at multiple datanodes.
Even if multiple datanodes are down, we can still do our work, making the system highly reliable.
Limitations: Though HDFS provides many features, there are some areas where it does not work well.
Low-latency data access: Applications that require low-latency access to data, i.e., in the range of milliseconds, will not work well with HDFS, because HDFS is designed for high throughput of data even at the cost of latency; it prioritizes reading the whole dataset over the time to fetch the first record.
Lots of small files: The name node holds the metadata of all files in memory, so a huge number of small files consumes a disproportionate amount of name node memory, which is not feasible. Retrieving many small files also means lots of seeks and lots of hops from one datanode to another, which is a very inefficient data access pattern.
Multiple writes: HDFS should not be used when we need multiple writers or modifications at arbitrary offsets in a file; it supports only sequential writes by a single writer.
HDFS Concepts
1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike an ordinary disk file system, if a file in HDFS is smaller than the block size, it does not occupy the full block's size: a 5 MB file stored in HDFS with a 128 MB block size takes 5 MB of space only. The HDFS block size is large in order to minimize the cost of seeks.
2. Name Node: HDFS works in a master-worker pattern where the name node acts as master. The Name Node is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS; the metadata includes file permissions, names, and the location of each block. The metadata is small, so it is stored in the memory of the name node, allowing faster access to it. Moreover, the HDFS cluster is accessed by multiple clients concurrently, so all this information must be handled by a single coordinating machine. File system operations like opening, closing, and renaming are executed by it.
3. Data Node: They store and retrieve blocks when they are told to, by the client or the name node. They report back to the name node periodically with the list of blocks that they are storing. The data nodes, being commodity hardware, also do the work of block creation, deletion, and replication as directed by the name node.
Secondary Name Node: It is a separate physical machine which acts as a helper to the name node. It performs periodic checkpoints: it communicates with the name node and takes snapshots of the metadata, which helps minimize downtime and loss of data in case the name node fails.
Starting HDFS
HDFS should be formatted initially and then started in distributed mode. Commands are given below.
To format: $ hdfs namenode -format
To start: $ start-dfs.sh
o First create a folder in HDFS where data can be put from the local file system:
$ hdfs dfs -mkdir /user/test
o Copy the file "data.txt" from the local folder /usr/home/Desktop to the HDFS folder /user/test:
$ hdfs dfs -copyFromLocal /usr/home/Desktop/data.txt /user/test
Recursive deleting:
Example: $ hdfs dfs -rm -r /user/test
"<localSrc>" and "<localDest>" are paths as above, but on the local file system
o put <localSrc><dest>
Copies the file or directory from the local file system identified by localSrc to dest within the DFS.
o copyFromLocal <localSrc><dest>
Identical to -put
o moveFromLocal <localSrc><dest>
Copies the file or directory from the local file system identified by localSrc to dest within HDFS, and
then deletes the local copy on success.
o get <src><localDest>
Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.
o cat <filename>
Displays the contents of filename on stdout.
o moveToLocal <src><localDest>
o touchz <path>
Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path,
unless the file is already size 0.
o stat [format] <path>
Prints information about path. Format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).
HDFS interface
FileSystem:
FileSystem is an abstract base class for a generic file system. It may be implemented as a distributed file system or a local one. The local implementation is LocalFileSystem and the distributed implementation is DistributedFileSystem.
All user code that may use the HDFS should use a FileSystem object.
FSDataInputStream:
int read(long position, byte[] buffer, int offset, int length) – Read bytes from the given position in the
stream to the given buffer. The return value is the number of bytes actually read.
void readFully(long position, byte[] buffer, int offset, int length) – Reads bytes from the given position in the stream into the given buffer, continuing until length bytes have been read. If the end of the stream is reached before that, an EOFException is thrown.
void readFully(long position, byte[] buffer) – Reads buffer.length bytes from the given position in the stream into the buffer.
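The positional-read contract described above can be modeled in a few lines of Python (a toy stand-in for the Java API, not Hadoop code):

```python
# Toy model of FSDataInputStream's positional-read contract:
# read() returns the number of bytes actually read, while
# read_fully() insists on `length` bytes or fails with EOFError
# (mirroring Java's EOFException).
class PositionalStream:
    def __init__(self, data: bytes):
        self.data = data

    def read(self, position, buffer, offset, length):
        chunk = self.data[position:position + length]
        buffer[offset:offset + len(chunk)] = chunk
        return len(chunk)                  # may be < length near EOF

    def read_fully(self, position, buffer, offset, length):
        if position + length > len(self.data):
            raise EOFError("end of stream reached before length bytes")
        self.read(position, buffer, offset, length)

s = PositionalStream(b"hadoop-hdfs")
buf = bytearray(6)
n = s.read(7, buf, 0, 6)                  # only 4 bytes remain after pos 7
```

Note how `read` quietly returns a short count near end of stream, while `read_fully` turns the same situation into an error, exactly the distinction drawn above.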
FSDataOutputStream:
Unlike FSDataInputStream, FSDataOutputStream does not support seeking. This is because HDFS allows only sequential writes to an open file, or appends to an existing file; in other words, there is no support for writing anywhere other than the end of the file.
void write(byte[] b, int off, int len) – Writes len bytes from the specified byte array, starting at offset off, to the underlying output stream. If no exception is thrown, the counter written is incremented by len.
For any file I/O operation in HDFS through the Java API, the first thing we need is a FileSystem instance. To get one, the FileSystem class provides three static factory methods: get(Configuration conf), get(URI uri, Configuration conf), and get(URI uri, Configuration conf, String user).
In order to read a file from HDFS, we need to open an input stream for it. We can do so by invoking the open() method on the FileSystem instance.
public FSDataOutputStream create(Path f)
The create() method creates any parent directories of the file to be written that don’t already
exist.
Map-Reduce is a processing framework used to process data over a large number of machines. Hadoop uses Map-Reduce to process the data distributed in a Hadoop cluster. Map-Reduce is not like other regular processing frameworks such as Hibernate, JDK, .NET, etc. Those frameworks are designed for traditional systems where the data is stored at a single location, such as a Network File System or an Oracle database. But when we process big data, the data is spread over multiple commodity machines by HDFS.
So when the data is stored on multiple nodes, we need a processing framework that can copy the program to the locations where the data is present, i.e., it copies the program to all the machines holding the data. This is where Map-Reduce comes into the picture for processing data on Hadoop over a distributed system. Hadoop otherwise has a major drawback of cross-switch network traffic due to the massive volume of data. Map-Reduce addresses this with a feature called Data Locality: the ability to move the computation closer to the actual data location on the machines, instead of moving the data to the computation.
Map-Reduce consists of a Map phase and a Reduce phase. The map is used for transformation, while the reducer is used for aggregation-style operations. The terminology of Map and Reduce is derived from functional programming languages like Lisp and Scala. A Map-Reduce program comes with 3 main components: the Driver code, the Mapper (for transformation), and the Reducer (for aggregation).
Let's take an example where you have a 10 TB file to process on Hadoop. The 10 TB of data is first distributed across multiple nodes on Hadoop with HDFS. Now we have to process it, and for that we have the Map-Reduce framework. To process this data with Map-Reduce we have a Driver code, which is called the Job. If we are using the Java programming language for processing the data on HDFS, we need to initiate this Driver class with a Job object. If your framework were a car, the Driver code would be the start button: we need to initiate the Driver code to utilize the advantages of the Map-Reduce framework.
There are also Mapper and Reducer classes provided by this framework, which are predefined and extended by developers as per the organization's requirements.
The Mapper is the initial piece of code that interacts with the input dataset. If we have 100 data blocks in the dataset we are analyzing, then there will be 100 Mapper programs or processes running in parallel on the machines (nodes), each producing its own output, known as intermediate output, which is stored on local disk, not on HDFS. The output of the mappers acts as input for the Reducer, which performs sorting and aggregation operations on the data and produces the final output.
The Reducer is the second part of the Map-Reduce programming model. The Mapper produces output in the form of key-value pairs, which serve as input for the Reducer. But before these intermediate key-value pairs are handed to the Reducer, a shuffle-and-sort step groups and orders them by key. The output generated by the Reducer is the final output, which is then stored on HDFS (Hadoop Distributed File System). The Reducer mainly performs computations such as addition, filtration, and aggregation.
Steps of Data-Flow:
A single input split is processed at a time. The Mapper is overridden by the developer according to the business logic, and these Mappers run in parallel on all the machines in our cluster.
The intermediate output generated by the Mappers is stored on local disk and shuffled to the Reducers.
Once the Mappers finish their task, the output is sorted and merged and provided to the Reducer.
The Reducer performs reducing tasks such as aggregation and other compositional operations, and the final output is stored on HDFS in a part-r-00000 file (created by default).
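The whole flow above can be sketched in-process with the classic word-count example (plain Python, not the Hadoop API; the function names simply mirror the roles described above):

```python
from collections import defaultdict

# Minimal in-process sketch of the map -> shuffle/sort -> reduce flow
# described above, using the classic word-count example.
def mapper(line):
    for word in line.split():
        yield word, 1                      # intermediate key-value pairs

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in sorted(pairs):       # sort and merge by key
        groups[key].append(value)
    return groups

def reducer(key, values):
    return key, sum(values)                # aggregation

splits = ["big data big cluster", "data node data"]   # two input splits
intermediate = [kv for split in splits for kv in mapper(split)]
result = dict(reducer(k, v) for k, v in shuffle(intermediate).items())
# result == {"big": 2, "cluster": 1, "data": 3, "node": 1}
```

In a real cluster, each `mapper(split)` call runs on the node holding that split, and the shuffle moves data across the network; here everything runs in one process for clarity.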
Apache Hadoop is synonymous with big data for its cost-effectiveness and its attribute of scalability
for processing petabytes of data. Data analysis using Hadoop is just half the battle won. Getting data
into the Hadoop cluster plays a critical role in any big data deployment.
Data ingestion is important in any big data project because the volume of data is generally in petabytes or exabytes. Sqoop and Flume are the two Hadoop tools used to gather data from different sources and load it into HDFS. Sqoop is mostly used to extract structured data from databases like Teradata, Oracle, etc., while Flume is used to source data held in various systems and deals mostly with unstructured data.
Apache Sqoop and Apache Flume are two popular open source tools for Hadoop that help
organizations overcome the challenges encountered in data ingestion.
While working on Hadoop, one question always comes up: if both Sqoop and Flume are used to gather data from different sources and load it into HDFS, why do we use both of them?
So, in this post, BigData Hadoop: Apache Sqoop vs Apache Flume, we will answer this question. First, we will go through a brief introduction of both tools. Afterward, we will compare Apache Flume vs Sqoop to understand each tool.
What is Apache Flume?
Basically, Apache Flume is a service designed for streaming logs into the Hadoop environment. It is a distributed and reliable service for collecting and aggregating huge amounts of log data.
Hadoop Archives
Hadoop is created to deal with large files of data, so small files are problematic and have to be handled efficiently. Since the name node holds the metadata of every file in memory, a huge number of small files means a huge number of metadata records, which makes the name node inefficient. To handle this problem, Hadoop Archives (HAR) were created, which pack HDFS files into archives that can be used directly as input to MR jobs. An archive always comes with the *.har extension.
HAR Syntax:
hadoop archive -archiveName NAME -p <parent path> <src>* <dest>
Example:
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo
If you have a Hadoop archive stored in HDFS at /user/zoo/foo.har, then to use this archive as MapReduce input, all you need to do is specify the input directory as har:///user/zoo/foo.har.
/data/myArch.har/_index
/data/myArch.har/_masterindex
/data/myArch.har/part-0
The part files are the original small files concatenated together into big files, and the index files are used to look up the small files within the big part files.
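The part/index idea can be illustrated with a toy archive (a simplification of the real HAR format, with made-up file names):

```python
# Toy illustration of the HAR layout described above: small files are
# concatenated into one "part" file, and an index records where each
# original file starts and how long it is.
def archive(files):
    part, index, offset = bytearray(), {}, 0
    for name, data in files.items():
        index[name] = (offset, len(data))   # (start offset, length)
        part += data
        offset += len(data)
    return bytes(part), index

def lookup(part, index, name):
    offset, length = index[name]
    return part[offset:offset + length]

part, index = archive({"a.txt": b"alpha", "b.txt": b"bravo"})
```

One big part file plus a tiny index replaces many small files, which is exactly what relieves the name node.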
1) Creating HAR files makes a copy of the original files, so we need as much disk space as the size of the original files we are archiving. We can delete the original files after creation of the archive to release disk space.
2) Once an archive is created, adding files to or removing files from the archive requires re-creating it.
3) Processing a HAR file can still require lots of map tasks, which is inefficient.
Like any I/O subsystem, Hadoop comes with a set of primitives. These primitive considerations, although generic in nature, go with the Hadoop I/O system as well, with some special connotations of course. Hadoop deals with multi-terabyte datasets; a special consideration of these primitives gives an idea of how Hadoop handles data input and output. This article quickly skims over these primitives to give a perspective on the Hadoop input/output system.
Data Integrity
Data integrity means that data should remain accurate and consistent all across its storing,
processing, and retrieval operations. To ensure that no data is lost or corrupted during persistence
and processing, Hadoop maintains stringent data integrity constraints. Every read/write operation, whether on disk or, even more so, across the network, is prone to errors. And the volume of data that Hadoop handles only aggravates the situation. The usual way to detect corrupt data is through checksums. A checksum is computed when data first enters the system and is sent across the channel during the retrieval process. The retrieving end computes the checksum again and matches it with the received one. If they match exactly, the data is deemed error-free; otherwise it contains an error. But the problem is: what if the checksum sent is itself corrupt? This is highly unlikely, because a checksum is small, but it is not an impossible scenario. Using the right kind of hardware, such as ECC memory, can alleviate the situation.
Checksums provide mere detection, not correction; recovery is handled through replicas, as described below. The checksum technique Hadoop uses is CRC (Cyclic Redundancy Check), specifically CRC-32.
Hadoop takes it further and creates a distinct checksum for every 512 (default) bytes of data.
Because CRC-32 is 4 bytes only, the storage overhead is not an issue. All data that enters into the
system is verified by the datanodes before being forwarded for storage or further processing. Data
sent to the datanode pipeline is verified through checksums and any corruption found is
immediately notified to the client with ChecksumException. The client read from the datanode also
goes through the same drill. The datanodes maintain a log of checksum verification to keep track of
the verified block. The log is updated by the datanode upon receiving a block verification success
signal from the client. This type of statistics helps in keeping the bad disks at bay.
Apart from this, a periodic verification on the block store is made with the help
of DataBlockScanner running along with the datanode thread in the background. This protects data
from corruption in the physical storage media.
Hadoop maintains replicas of data, and these are specifically used to recover data from corruption. When the client detects an error while reading a block, it reports the bad block to the namenode before throwing a ChecksumException. The namenode then marks it as a bad block and redirects any further reference to the block to one of its replicas. A fresh replica is then made from a healthy copy, and the marked bad block is removed from the system.
For every file created in the Hadoop LocalFileSystem, a hidden file named .<filename>.crc is created in the same directory. This file maintains the checksum of each chunk of data (512 bytes) in the file. This metadata helps the LocalFileSystem detect a read error before throwing a ChecksumException.
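The chunked-checksum scheme can be sketched as follows (Python with zlib's CRC-32; the 512-byte chunk size matches the default mentioned above, but the on-disk .crc layout is simplified away):

```python
import zlib

# Sketch of the per-chunk checksum scheme described above: one CRC-32
# (4 bytes) per 512-byte chunk, verified again on every read.
CHUNK = 512

def checksums(data):
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data, sums):
    if checksums(data) != sums:
        raise ValueError("ChecksumException: corrupt chunk detected")

data = bytes(1300)                # 3 chunks: 512 + 512 + 276 bytes
sums = checksums(data)            # computed when data enters the system
verify(data, sums)                # a clean read passes silently

# Flip one byte in the second chunk to simulate disk/network corruption.
corrupted = data[:600] + b"\x01" + data[601:]
```

Because the checksum is per chunk, corruption is localized: only the damaged 512-byte chunk fails verification, and a replica of just that block needs to be consulted.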
Compression
Keeping in mind the volume of data Hadoop deals with, compression is not a luxury but a
requirement. There are many obvious benefits of file compression rightly used by Hadoop. It
economizes storage requirements and is a must-have capability to speed up data transmission over
the network and disks. There are many tools, techniques, and algorithms commonly used by
Hadoop. Many of them are quite popular and have been used in file compression over the ages. For
example, gzip, bzip2, LZO, zip, and so forth are often used.
Serialization
The process that turns structured objects into a stream of bytes is called serialization. This is specifically required for data transmission over the network or for persisting raw data on disk. Deserialization is just the reverse process, where a stream of bytes is transformed back into a structured object; this is particularly required to reconstruct objects from raw bytes. Therefore, it is not surprising that distributed computing uses this in a couple of distinct areas: inter-process communication and data persistence.
Hadoop uses RPC (Remote Procedure Call) to enact inter-process communication between nodes.
Therefore, the RPC protocol uses the process of serialization and deserialization to render a message
to the stream of bytes and vice versa and sends it across the network. However, the process must be
compact enough to best use the network bandwidth, as well as fast, interoperable, and flexible to
accommodate protocol updates over time.
Hadoop has its own compact and fast serialization format, Writables, which MapReduce programs use for their key and value types.
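The flavor of a Writable-style format can be sketched in Python (an illustrative layout with a big-endian long key and a length-prefixed UTF-8 value; this is not the exact on-the-wire Hadoop encoding, which uses variable-length integers):

```python
import struct

# Python sketch of a Writable-style binary record: a LongWritable-like
# key is the 8-byte big-endian encoding of the long, and a Text-like
# value is a 2-byte length prefix followed by UTF-8 bytes.
def write_long(value):
    return struct.pack(">q", value)            # 8 bytes, big-endian

def write_text(s):
    raw = s.encode("utf-8")
    return struct.pack(">h", len(raw)) + raw   # length prefix + payload

def read_long(buf):
    return struct.unpack(">q", buf[:8])[0]

# A timestamp key paired with a logged-event value, as in the
# SequenceFile example above.
record = write_long(1694000000) + write_text("page-view")
```

The point is compactness and speed: fixed-width numbers and raw UTF-8 bytes, with no field names or self-description in the stream (that is what the schema, or the Java class, is for).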
There are a couple of high-level containers in Hadoop that provide specialized data structures to hold special types of data. For example, to maintain a binary log, the SequenceFile container provides a data structure that persists binary key-value pairs. We can then use a key such as a timestamp, represented by a LongWritable, and a Writable value that holds the logged quantity. The related MapFile container, a sorted SequenceFile with an index, is interoperable with it, and the two can be converted to and from each other.
AVRO - Overview
To transfer data over a network or for its persistent storage, you need to serialize the data. In addition to the serialization APIs provided by Java and Hadoop, we have a special utility called Avro, a schema-based serialization technique.
This tutorial teaches you how to serialize and deserialize data using Avro. Avro provides libraries for various programming languages; in this tutorial, we demonstrate the examples using the Java library.
What is Avro?
Apache Avro is a language-neutral data serialization system. It was developed by Doug Cutting, the
father of Hadoop. Since Hadoop writable classes lack language portability, Avro becomes quite
helpful, as it deals with data formats that can be processed by multiple languages. Avro is a
preferred tool to serialize data in Hadoop.
Avro has a schema-based system. A language-independent schema is associated with its read and write operations, and the schema travels with the serialized data. Avro serializes the data into a compact binary format, which can be deserialized by any application.
Avro uses JSON format to declare the data structures. Presently, it supports languages such as Java,
C, C++, C#, Python, and Ruby.
Avro Schemas
Avro depends heavily on its schema. Because the schema is always present when data is read, values can be written with no per-value overhead, making serialization fast and the resulting serialized data small. The schema is stored along with the Avro data in a file for any further processing.
In RPC, the client and the server exchange schemas during the connection handshake. This exchange helps resolve same-named fields, missing fields, extra fields, etc., between the two schemas.
Avro schemas are defined with JSON that simplifies its implementation in languages with JSON
libraries.
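Because an Avro schema is plain JSON, any JSON library can inspect it. Below is a minimal record schema (a hypothetical User record, not from any particular project), parsed here with Python's standard library:

```python
import json

# A minimal Avro record schema, declared as plain JSON. The record
# name "User" and its fields are illustrative, not from a real project.
schema_json = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age",  "type": "int"}
  ]
}
"""

schema = json.loads(schema_json)
field_names = [f["name"] for f in schema["fields"]]
# field_names == ["name", "age"]
```

This is the same schema shape the Avro parsers library would consume; the JSON-based declaration is what lets any language with a JSON library work with Avro schemas.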
Like Avro, there are other serialization mechanisms in Hadoop such as Sequence Files, Protocol
Buffers, and Thrift.
Thrift and Protocol Buffers are Avro's closest competitors. Avro differs from these frameworks in the following ways −
Avro supports both dynamic and static types as per the requirement. Protocol Buffers and Thrift use Interface Definition Languages (IDLs) to specify schemas and their types; these IDLs are used to generate code for serialization and deserialization.
Avro is built within the Hadoop ecosystem; Thrift and Protocol Buffers are not.
Unlike Thrift and Protocol Buffers, Avro's schema definition is in JSON and not in any proprietary IDL.
Features of Avro
It can be processed by many languages (currently C, C++, C#, Java, Python, and Ruby).
Avro creates a self-describing file named Avro Data File, in which it stores data along with its
schema in the metadata section.
Avro is also used in Remote Procedure Calls (RPCs). During RPC, client and server exchange
schemas in the connection handshake.
Step 1 − Create schemas. Here you need to design Avro schema according to your data.
Step 2 − Read the schemas into your program. It is done in two ways −
o By Generating a Class Corresponding to the Schema − You can compile the schema with Avro's tooling and generate a class corresponding to it.
o By Using Parsers Library − You can directly read the schema using the parsers library.
Step 3 − Serialize the data using the serialization API provided for Avro, which is found in
the package org.apache.avro.specific.
Step 4 − Deserialize the data using deserialization API provided for Avro, which is found in
the package org.apache.avro.specific.