
Introduction to Hadoop Distributed File System (HDFS)

With growing data velocity, the data size easily outgrows the storage limit of a single machine. A solution
is to store the data across a network of machines; such filesystems are called distributed
filesystems. Since data is stored across a network, all the complications of a network come into play.
This is where Hadoop comes in. It provides one of the most reliable filesystems. HDFS (Hadoop
Distributed File System) is a uniquely designed filesystem that provides storage for extremely large files
with a streaming data access pattern, and it runs on commodity hardware. Let's elaborate on these terms:

 Extremely large files: Here we are talking about data in the range of petabytes (1 PB = 1000 TB).

 Streaming Data Access Pattern: HDFS is designed on the principle of write-once, read-many-times.
Once data is written, large portions of the dataset can be processed any number of times.

 Commodity hardware: Hardware that is inexpensive and easily available in the market. This
is one of the features that particularly distinguishes HDFS from other filesystems.

Nodes: An HDFS cluster is typically formed of master and slave nodes.

1. NameNode(MasterNode): 

 Manages all the slave nodes and assigns work to them.

 It executes filesystem namespace operations like opening, closing, and renaming files and directories.

 It should be deployed on reliable, high-end hardware, not on commodity hardware.

2. DataNode(SlaveNode): 

 Actual worker nodes, which do the actual work like reading, writing, and processing data.

 They also perform block creation, deletion, and replication upon instruction from the
master.

 They can be deployed on commodity hardware.

HDFS daemons: Daemons are the processes running in the background.

 Namenodes: 

 Run on the master node.

 Store metadata (data about data) like file paths, the number of blocks, block IDs, etc.

 Require a large amount of RAM.

 Store metadata in RAM for fast retrieval, i.e., to reduce seek time, though a
persistent copy of it is kept on disk.

 DataNodes: 

 Run on slave nodes.

 Require a large amount of disk space, as the actual data is stored here.

Data storage in HDFS: Now let's see how the data is stored in a distributed manner.
Let's assume that a 100 TB file is inserted. The master node (namenode) will first divide the file into
blocks, say 10 TB each in this simplified example (the default block size is 128 MB in Hadoop 2.x and above).
These blocks are then stored across different datanodes (slave nodes). The datanodes replicate the blocks
among themselves, and the information about which blocks they contain is sent to the master. The default
replication factor is 3, which means three replicas of each block are kept (including the original). We can
increase or decrease the replication factor, i.e., edit this configuration, in hdfs-site.xml.
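As a minimal sketch, an hdfs-site.xml entry controlling these two settings might look like the following (dfs.replication and dfs.blocksize are standard Hadoop property names; the values shown are just illustrative):

<configuration>
  <property>
    <!-- number of replicas kept for each block -->
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <!-- block size in bytes; 134217728 bytes = 128 MB -->
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
</configuration>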

Note: The MasterNode has a record of everything; it knows the location and information of each and every
datanode and the blocks they contain, i.e., nothing is done without the permission of the
masternode.

Why divide the file into blocks? 

Answer: Let's assume that we don't divide the file; it is very difficult to store a 100 TB file on a single
machine. Even if we could store it, each read and write operation on that whole file would incur a very
high seek time. But if we have multiple blocks of size 128 MB, it becomes easy to perform
various read and write operations on them compared to doing it on the whole file at once. So, we divide
the file to have faster data access, i.e., to reduce seek time.

Why replicate the blocks in data nodes while storing? 

Answer: Let's assume we don't replicate, and only one copy of a given block is present on datanode D1. Now
if datanode D1 crashes, we will lose that block, which makes the overall data inconsistent
and faulty. So, we replicate the blocks to achieve fault tolerance.

Terms related to HDFS:  

 HeartBeat: It is the signal that a datanode continuously sends to the namenode. If the namenode
doesn't receive a heartbeat from a datanode, it will consider that datanode dead.

 Balancing: If a datanode crashes, the blocks present on it are gone too, and those blocks
will be under-replicated compared to the remaining blocks. The master node (namenode)
will signal the datanodes containing replicas of the lost blocks to replicate them, so that the
overall distribution of blocks is balanced.

 Replication: It is done by the datanodes, upon instruction from the namenode.


Note: No two replicas of the same block are present on the same datanode. 

Features:  

 Distributed data storage.

 Blocks reduce seek time.

 The data is highly available as the same block is present at multiple datanodes.

 Even if multiple datanodes are down we can still do our work, thus making it highly reliable.

 High fault tolerance.

Limitations: Though HDFS provides many features, there are some areas where it doesn't work well.

 Low-latency data access: Applications that require low-latency access to data, i.e., in the range
of milliseconds, will not work well with HDFS, because HDFS is designed for
high throughput of data, even at the cost of latency.

 Small file problem: Having lots of small files results in lots of seeks and lots of movement
from one datanode to another to retrieve each small file; this whole process is a
very inefficient data access pattern.

Where to use HDFS

o Very Large Files: Files should be of hundreds of megabytes, gigabytes or more.

o Streaming Data Access: The time to read the whole data set is more important than the latency in
reading the first record. HDFS is built on the write-once, read-many-times pattern.

o Commodity Hardware: It works on low-cost hardware.

Where not to use HDFS

o Low Latency Data Access: Applications that require very little time to access the first data
should not use HDFS, as it gives importance to the whole data set rather than the time to fetch the
first record.

o Lots of Small Files: The name node holds the metadata of files in memory, and if the files
are small in size, this metadata takes up a lot of the name node's memory, which is not feasible.

o Multiple Writes: It should not be used when we have to write to a file multiple times; HDFS supports only sequential writes and appends.

HDFS Concepts

1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128
MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks,
which are stored as independent units. Unlike in an ordinary file system, if a file in HDFS is smaller
than the block size, it does not occupy the full block, i.e., a 5 MB file stored in HDFS with a
block size of 128 MB takes 5 MB of space only. The HDFS block size is large in order to minimize the
cost of seeks.

2. Name Node: HDFS works in a master-worker pattern where the name node acts as the master.
The Name Node is the controller and manager of HDFS, as it knows the status and the metadata of all
the files in HDFS; the metadata being file permissions, names, and the location of
each block. The metadata is small, so it is stored in the memory of the name node, allowing
faster access to data. Moreover, the HDFS cluster is accessed by multiple clients
concurrently, so all this information is handled by a single machine. The file system
operations like opening, closing, renaming, etc. are executed by it.

3. Data Node: Data nodes store and retrieve blocks when they are told to, by the client or the name node.
They report back to the name node periodically with the list of blocks that they are storing. The
data node, being commodity hardware, also does the work of block creation, deletion, and
replication as instructed by the name node.

HDFS DataNode and NameNode Image:

HDFS Read Image:

HDFS Write Image:


Since all the metadata is stored in the name node, it is very important. If it fails, the file system cannot
be used, as there would be no way of knowing how to reconstruct the files from the blocks present in
the data nodes. To overcome this, the concept of the secondary name node arises.

Secondary Name Node: It is a separate physical machine which acts as a helper of the name node. It
performs periodic checkpoints. It communicates with the name node and takes snapshots of the
metadata, which helps minimize downtime and loss of data.

Starting HDFS

HDFS should be formatted initially and then started in distributed mode. The commands are
given below.

To Format $ hadoop namenode -format

To Start $ start-dfs.sh

HDFS Basic File Operations

1. Putting data to HDFS from local file system

o First create a folder in HDFS where data can be put from the local file system.

$ hadoop fs -mkdir /user/test

o Copy the file "data.txt" from the local folder /usr/home/Desktop to the HDFS
folder /user/test

$ hadoop fs -copyFromLocal /usr/home/Desktop/data.txt /user/test

o Display the content of HDFS folder

$ hadoop fs -ls /user/test

2. Copying data from HDFS to local file system


o $ hadoop fs -copyToLocal /user/test/data.txt /usr/bin/data_copy.txt

3. Compare the files and see that both are the same

o $ md5 /usr/bin/data_copy.txt /usr/home/Desktop/data.txt

Recursive deleting

o hadoop fs -rmr <arg>

Example:

o hadoop fs -rmr /user/sonoo/

HDFS Other commands

The following notation is used in the commands below:

"<path>" means any file or directory name.

"<path>..." means one or more file or directory names.

"<file>" means any filename.

"<src>" and "<dest>" are path names in a directed operation.

"<localSrc>" and "<localDest>" are paths as above, but on the local file system

o put <localSrc><dest>

Copies the file or directory from the local file system identified by localSrc to dest within the DFS.

o copyFromLocal <localSrc><dest>

Identical to -put


o moveFromLocal <localSrc><dest>

Copies the file or directory from the local file system identified by localSrc to dest within HDFS, and
then deletes the local copy on success.

o get [-crc] <src><localDest>

Copies the file or directory in HDFS identified by src to the local file system path identified by
localDest.

o cat <filename>

Displays the contents of filename on stdout.

o moveToLocal <src><localDest>

Works like -get, but deletes the HDFS copy on success.

o setrep [-R] [-w] rep <path>


Sets the target replication factor for files identified by path to rep. (The actual replication factor will
move toward the target over time)

o touchz <path>

Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path,
unless the file is already size 0.

o test -[ezd] <path>

Returns 1 if path exists, has zero length, or is a directory; 0 otherwise.

o stat [format] <path>

Prints information about path. Format is a string which accepts file size in blocks (%b), filename (%n),
block size (%o), replication (%r), and modification date (%y, %Y).
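As a quick, illustrative usage sketch (the path below is a placeholder, reusing the /user/test example from earlier):

$ hadoop fs -setrep -w 2 /user/test/data.txt     # set the target replication factor to 2 and wait
$ hadoop fs -stat %r /user/test/data.txt         # print the file's replication factor
$ hadoop fs -test -e /user/test/data.txt         # check whether the path exists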

HDFS Design Goals

 Hardware Failure – Detection of faults and quick, automatic recovery.
 Streaming Data Access – High throughput of data access (batch processing of data).
 Large Data Sets – Gigabytes to terabytes in size.
 Simple Coherency Model – Write-once, read-many access model for files.
 Moving computation is cheaper than moving data.

HDFS interface

FileSystem:

FileSystem is an abstract base class for a generic file system. It may be implemented as a distributed
system or a local system. The local implementation is LocalFileSystem and the distributed
implementation is DistributedFileSystem.

All these classes are present in the org.apache.hadoop.fs package.

All user code that uses HDFS should use a FileSystem object.

Similar to DataInputStream and DataOutputStream in Java file I/O for reading or writing primitive
data types, the corresponding stream classes in Hadoop are FSDataInputStream and FSDataOutputStream, respectively.

FSDataInputStream:

 The FSDataInputStream class is a specialization of java.io.DataInputStream with support for
random access, so we can read from any part of the stream. It is a utility that wraps
an FSInputStream in a DataInputStream and buffers input through a BufferedInputStream.

 The FSDataInputStream class implements the Seekable and PositionedReadable interfaces, so we can
have random access in the stream with the help of the methods below.

int read(long position, byte[] buffer, int offset, int length) – Read bytes from the given position in the
stream to the given buffer. The return value is the number of bytes actually read.

void readFully(long position, byte[] buffer, int offset, int length) – Read bytes from the given position
in the stream to the given buffer. Continues to read until length bytes have been read. If the end of
the stream is reached while reading, an EOFException is thrown.

void readFully(long position, byte[] buffer) – buffer.length bytes will be read from the given position in the stream.

void seek(long pos) – Seek to the given offset.

long getPos() – Get the current position in the input stream.
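Putting these together, here is a minimal sketch of random-access reading (the NameNode URI and the file path are illustrative placeholders, not taken from the text above):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SeekReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        FSDataInputStream in = fs.open(new Path("/user/test/data.txt"));
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);   // read the file from the start
            in.seek(0);                                       // random access: jump back to offset 0
            System.out.println("position after seek: " + in.getPos());
            IOUtils.copyBytes(in, System.out, 4096, false);   // read it again
        } finally {
            IOUtils.closeStream(in);
        }
    }
}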

FSDataOutputStream:

 The FSDataOutputStream class is the counterpart of FSDataInputStream, used to open a stream for
output. It is a utility that wraps an OutputStream in a DataOutputStream, buffers output
through a BufferedOutputStream, and creates a checksum file.

 Similar to FSDataInputStream, FSDataOutputStream also supports the getPos() method to get the
current position in the output stream, but the seek() method is not supported by
FSDataOutputStream.

This is because HDFS allows only sequential writes to an open file or appends to an existing file. In
other words, there is no support for writing anywhere other than the end of the file.

 We can invoke the write() method on an instance of FSDataOutputStream to write to an output stream.

public void write(byte[] b, int off, int len) throws IOException;

Writes len bytes from the specified byte array starting at offset off to the underlying output stream.
If no exception is thrown, the counter written is incremented by len.
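A minimal sketch of writing a file to HDFS with this API (the file path and contents are illustrative placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);                      // default (configured) file system
        FSDataOutputStream out = fs.create(new Path("/user/test/output.txt"));
        byte[] data = "hello hdfs\n".getBytes("UTF-8");
        out.write(data, 0, data.length);    // sequential write; seek() is not available on this stream
        out.close();
    }
}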

Below are some of the important methods of the FileSystem class.

Getting FileSystem Instance:

For any file I/O operation in HDFS through the Java API, the first thing we need is a FileSystem instance. To
get a file system instance, we have three static methods in the FileSystem class.

 static FileSystem get(Configuration conf) — Returns the configured file system implementation.

 static FileSystem get(URI uri, Configuration conf) — Returns the FileSystem for this URI.

 static FileSystem get(URI uri, Configuration conf, String user) — Get a file
system instance based on the uri, the passed configuration and the user.
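For instance (a fragment assumed to sit inside a method that declares throws Exception; the URI and user name are placeholders):

Configuration conf = new Configuration();
FileSystem defaultFs = FileSystem.get(conf);                                    // the configured default file system
FileSystem byUri = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);   // file system for this URI
FileSystem asUser = FileSystem.get(URI.create("hdfs://localhost:9000"), conf, "hadoopuser");  // same, as a given user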

Opening Existing File:

In order to read a file from HDFS, we need to open an input stream for it. We can do so by
invoking the open() method on a FileSystem instance.

 public FSDataInputStream open(Path f)

 public abstract FSDataInputStream open(Path f, int bufferSize)

The first method uses a default buffer size of 4 KB.
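For example (the path and buffer size below are illustrative):

FSDataInputStream in = fs.open(new Path("/user/test/data.txt"));               // default 4 KB buffer
FSDataInputStream withBuffer = fs.open(new Path("/user/test/data.txt"), 8192); // explicit 8 KB buffer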

Creating a new File:


There are several ways to create a file in HDFS through the FileSystem class, but one of the simplest
is to invoke the create() method, which takes a Path object for the file to be created and returns
an output stream to write to.

public FSDataOutputStream create(Path f)

There are overloaded versions of this method that allow you to specify whether to forcibly
overwrite existing files, the replication factor of the file, the buffer size to use when writing the file,
the block size for the file, and file permissions.

The create() method creates any parent directories of the file to be written that don’t already
exist.
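For example, one commonly used overload takes an overwrite flag (the path is illustrative):

FSDataOutputStream out = fs.create(new Path("/user/test/output.txt"), true);   // true = overwrite if the file already exists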

Data Flow in HDFS

Map-Reduce is a processing framework used to process data over a large number of machines.
Hadoop uses Map-Reduce to process the data distributed in a Hadoop cluster. Map-Reduce is not
like other regular processing frameworks such as Hibernate, JDK, .NET, etc. All those
frameworks are designed for use with a traditional system where the data is stored at a single
location, like a Network File System or an Oracle database. But when we are processing big data, the
data is located on multiple commodity machines with the help of HDFS.

So, when the data is stored on multiple nodes, we need a processing framework that can copy the
program to the locations where the data is present, meaning it copies the program to all the machines
where the data is present. This is where Map-Reduce comes into the picture for processing the data on
Hadoop over a distributed system. Hadoop has a major drawback of cross-switch network traffic,
which is due to the massive volume of data. Map-Reduce comes with a feature called Data Locality.
Data Locality is the ability to move the computation closer to where the actual data is located on the
machines.

Since Hadoop is designed to work on commodity hardware, it uses Map-Reduce as it is widely
accepted and provides an easy way to process data over multiple nodes. Map-Reduce is not the
only framework for parallel processing. Nowadays Spark is also a popular framework used for
distributed computing, like Map-Reduce. There are also HAMA, MPI, and other
distributed processing frameworks.
 

Let’s Understand Data-Flow in Map-Reduce

Map-Reduce is a terminology that comes with a Map phase and a Reduce phase. The Map is used for
transformation, while the Reducer is used for aggregation-type operations. The terminology for
Map and Reduce is derived from functional programming languages like Lisp, Scala, etc. A
Map-Reduce program comes with 3 main components, i.e., our Driver
code, the Mapper (for transformation), and the Reducer (for aggregation).

Let's take an example where you have a file of 10 TB in size to process on Hadoop. The 10 TB of data
is first distributed across multiple nodes on Hadoop with HDFS. Now we have to process it, and for that
we have the Map-Reduce framework. To process this data with Map-Reduce, we have Driver code,
which is called a Job. If we are using the Java programming language for processing the data on HDFS, then
we need to initiate this Driver class with the Job object. Suppose you have a car, which is your
framework; then the start button used to start the car is similar to this Driver code in the Map-Reduce
framework. We need to initiate the Driver code to utilize the advantages of the Map-Reduce
framework.

There are also Mapper and Reducer classes provided by this framework, which are predefined and
extended by developers as per the organization's requirements; a minimal sketch of each piece follows.
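As an illustration only (a classic word-count-style job; the class names WordCountMapper and WordCountReducer refer to the sketches in the next two subsections), a Driver might look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");      // the Job object that "starts the car"
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);          // transformation
        job.setReducerClass(WordCountReducer.class);        // aggregation
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}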

Brief Working of Mapper

The Mapper is the initial code that interacts with the input dataset. Suppose we have
100 data blocks of the dataset we are analyzing; in that case, there will be 100 Mapper
programs or processes that run in parallel on the machines (nodes) and produce their own output, known
as intermediate output, which is then stored on local disk, not on HDFS. The output of the mapper
acts as input for the Reducer, which performs some sorting and aggregation operations on the data and
produces the final output.
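A minimal Mapper sketch for the hypothetical word-count job above (it emits each word with a count of 1 as an intermediate key-value pair):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // one map() call per input record; the output is intermediate key-value pairs
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}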

Brief Working Of Reducer

The Reducer is the second part of the Map-Reduce programming model. The Mapper produces its
output in the form of key-value pairs, which work as input for the Reducer. But before these
intermediate key-value pairs are sent to the Reducer, some processing is done to shuffle and
sort the key-value pairs according to their keys. The output generated by the Reducer is the
final output, which is then stored on HDFS (Hadoop Distributed File System). The Reducer mainly performs
computations like addition, filtration, and aggregation.
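And the matching Reducer sketch (it sums the counts for each word; again purely illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // values for the same key arrive together after the shuffle and sort phase
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // final output, written to HDFS
    }
}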
 

Steps of Data-Flow:

 At a time, a single input split is processed. The Mapper is overridden by the developer according to
the business logic, and this Mapper runs in parallel on all the machines in our cluster.

 The intermediate output generated by the Mapper is stored on local disk and shuffled to the
Reducer for the reduce task.

 Once the Mappers finish their task, the output is sorted and merged and provided to the
Reducer.

 The Reducer performs reducing tasks like aggregation and other compositional operations,
and the final output is then stored on HDFS in part-r-00000 files (created by default).

Data Ingest with Flume and Sqoop

Apache Hadoop is synonymous with big data for its cost-effectiveness and its scalability
for processing petabytes of data. Data analysis using Hadoop is just half the battle won; getting data
into the Hadoop cluster plays a critical role in any big data deployment.

Data ingestion is important in any big data project because the volume of data is generally in
petabytes or exabytes. Hadoop Sqoop and Hadoop Flume are the two tools in Hadoop which are
used to gather data from different sources and load it into HDFS. Sqoop in Hadoop is mostly
used to extract structured data from databases like Teradata, Oracle, etc., while Flume in Hadoop is
used to ingest data stored in various sources and deals mostly with unstructured data.

Apache Sqoop and Apache Flume are two popular open source tools for Hadoop that help
organizations overcome the challenges encountered in data ingestion.

While working on Hadoop, one question always comes up: if both Sqoop and Flume are
used to gather data from different sources and load it into HDFS, why are we using both of
them?

So, in this section we will answer this question. At
first, we will give a brief introduction to both tools. Afterward, we will compare
Apache Flume and Sqoop to understand each tool.

What is Apache Sqoop?


Apache Sqoop is a lifesaver in moving data from a data warehouse into the Hadoop environment.
Interestingly, it is named Sqoop as a contraction of SQL-to-Hadoop. Basically, for importing data from RDBMSs like
MySQL, Oracle, etc. into HBase, Hive, or HDFS, Apache Sqoop is an effective Hadoop tool.
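As an illustrative sketch (the database host, credentials, table, and target directory are placeholders), a typical import might look like:

$ sqoop import --connect jdbc:mysql://dbhost/sales --username dbuser -P \
      --table customers --target-dir /user/test/customers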

What is Apache Flume?

Basically, Apache Flume is the service best designed for streaming logs into the Hadoop environment. Flume
is a distributed and reliable service for collecting and aggregating huge amounts of log data.

Difference Between Apache Sqoop vs Flume

Conclusion

As you have already learned above, Sqoop and Flume are primarily two data
ingestion tools used in the big data world. If you need to ingest textual log data into
Hadoop/HDFS, then Flume is the right choice for doing that. If your data is not regularly generated,
then Flume will still work, but it will be overkill for that situation. Similarly, Sqoop is not the best
fit for event-driven data handling.

Hadoop Archives

Hadoop is created to deal with large data files, so small files are problematic and have to be handled
efficiently.

As a large input file is split into a number of small input files and stored across all the data nodes, all
these huge numbers of records have to be stored in the name node, which makes the name node inefficient. To
handle this problem, Hadoop Archives (HAR) have been created, which pack the HDFS files into archives, and
we can directly use these files as input to MR jobs. An archive always comes with a *.har extension.

HAR Syntax:
hadoop archive -archiveName NAME -p <parent path> <src>* <dest>

Example:
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo

If you have a Hadoop archive stored in HDFS at /user/zoo/foo.har, then to use this archive as
MapReduce input, all you need to do is specify the input directory as har:///user/zoo/foo.har.

If we list the archive file:

$ hadoop fs -ls /data/myArch.har

/data/myArch.har/_index

/data/myArch.har/_masterindex

/data/myArch.har/part-0

The part files are the original files concatenated together into big files, and the index files are used to look up
the small files within the big part files.

Limitations of HAR Files:

1) Creation of HAR files creates a copy of the original files. So, we need as much disk space as the size
of the original files which we are archiving. We can delete the original files after creation of the archive to
release some disk space.
2) Once an archive is created, to add or remove files from the archive we need to re-create the
archive.
3) Processing a HAR file still requires lots of map tasks, which is inefficient.

Understanding the Hadoop Input Output System

Like any I/O subsystem, Hadoop comes with a set of primitives. These primitive
considerations, although generic in nature, go with the Hadoop I/O system as well, with some special
connotation to it, of course. Hadoop deals with multi-terabyte datasets; a special consideration
of these primitives will give an idea of how Hadoop handles data input and output. This section quickly
skims over these primitives to give a perspective on the Hadoop input/output system.

Data Integrity

Data integrity means that data should remain accurate and consistent across all of its storing,
processing, and retrieval operations. To ensure that no data is lost or corrupted during persistence
and processing, Hadoop maintains stringent data integrity constraints. Every read/write operation
on disk, and even more so over the network, is prone to errors, and the volume of data that
Hadoop handles only aggravates the situation. The usual way to detect corrupt data is through
checksums. A checksum is computed when data first enters the system and is sent across the
channel during the retrieval process. The retrieving end computes the checksum again and matches it
with the received one. If it matches exactly, the data is deemed to be error free; otherwise it contains an
error. But the problem is – what if the checksum sent is itself corrupt? This is highly unlikely because
it is a small piece of data, but not an impossible scenario. Using the right kind of hardware, such as ECC
memory, can help alleviate the situation.

This is mere detection. Therefore, to correct the error, another technique, called CRC (Cyclic
Redundancy Check), is used.

Hadoop takes it further and creates a distinct checksum for every 512 bytes (the default) of data.
Because a CRC-32 checksum is only 4 bytes, the storage overhead is not an issue. All data that enters the
system is verified by the datanodes before being forwarded for storage or further processing. Data
sent to the datanode pipeline is verified through checksums, and any corruption found is
immediately reported to the client with a ChecksumException. Client reads from the datanodes also
go through the same drill. The datanodes maintain a log of checksum verifications to keep track of
verified blocks. The log is updated by the datanode upon receiving a block verification success
signal from the client. This type of statistic helps in keeping the bad disks at bay.

Apart from this, a periodic verification on the block store is made with the help
of DataBlockScanner running along with the datanode thread in the background. This protects data
from corruption in the physical storage media.

Hadoop maintains copies, or replicas, of data. These are specifically used to recover data from
corruption. Once the client detects an error while reading a block, it reports the bad block and the
datanode it was reading from to the namenode before throwing a ChecksumException. The
namenode then marks it as a bad block and schedules any further reference to the block to go to its
replicas. The block is then re-replicated from the other replicas, and the marked bad block is
removed from the system.

For every file created in the Hadoop LocalFileSystem, a hidden file named .<filename>.crc is created in the
same directory. This file maintains the checksum of each chunk of data (512 bytes) in the file. This
metadata helps the LocalFileSystem detect a read error before it throws a ChecksumException.
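For reference, HDFS can also report a file's stored checksum from the command line (the path is a placeholder):

$ hadoop fs -checksum /user/test/data.txt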

Compression

Keeping in mind the volume of data Hadoop deals with, compression is not a luxury but a
requirement. There are many obvious benefits of file compression rightly used by Hadoop. It
economizes storage requirements and is a must-have capability to speed up data transmission over
the network and disks. There are many tools, techniques, and algorithms commonly used by
Hadoop. Many of them are quite popular and have been used in file compression over the ages. For
example, gzip, bzip2, LZO, zip, and so forth are often used.

Serialization

The process that turns structured objects into a stream of bytes is called serialization. This is specifically
required for data transmission over the network or for persisting raw data on disks. Deserialization is just
the reverse process, where a stream of bytes is transformed into a structured object. This is
particularly required for the object implementation of the raw bytes. Therefore, it is not surprising that
distributed computing uses this in a couple of distinct areas: inter-process communication and data
persistence.

Hadoop uses RPC (Remote Procedure Call) for inter-process communication between nodes.
The RPC protocol uses serialization and deserialization to render a message into a stream of bytes
and vice versa and sends it across the network. The process must be compact enough to make the best
use of the network bandwidth, as well as fast, interoperable, and flexible enough to
accommodate protocol updates over time.

Hadoop has its own compact and fast serialization format, Writables, which MapReduce programs use
for their key and value types.
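As a tiny, illustrative sketch of the Writable mechanism (serializing a single IntWritable to raw bytes by hand; in a real job the framework does this for you):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;

public class WritableDemo {
    public static void main(String[] args) throws Exception {
        IntWritable value = new IntWritable(163);
        ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(bytesOut);
        value.write(dataOut);                  // serialization: structured object -> stream of bytes
        dataOut.close();
        System.out.println("serialized size in bytes: " + bytesOut.toByteArray().length);  // 4 for an int
    }
}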

Data Structure of Files

There are a couple of high-level containers that provide specialized data structures in Hadoop
to hold special types of data. For example, to maintain a binary log, the SequenceFile container
provides the data structure to persist binary key-value pairs. We can then use a key such as a
timestamp, represented by a LongWritable, and a Writable value, which refers to the logged quantity.

There is another container, a sorted derivation of SequenceFile, called MapFile. It provides an index
for convenient lookups by key.

These two containers are interoperable and can be converted to and from each other.
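A minimal sketch of writing such a timestamped log entry to a SequenceFile (the output path and the value are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/user/test/events.seq");              // illustrative output path
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(LongWritable.class),   // key: a timestamp
                SequenceFile.Writer.valueClass(Text.class));        // value: the logged quantity
        try {
            writer.append(new LongWritable(System.currentTimeMillis()), new Text("login event"));
        } finally {
            writer.close();
        }
    }
}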

AVRO - Overview

To transfer data over a network or for its persistent storage, you need to serialize the data. Beyond
the serialization APIs provided by Java and Hadoop, we have a special utility called Avro, a schema-based
serialization technique.

This section shows how to serialize and deserialize data using Avro. Avro provides libraries
for various programming languages; here, we demonstrate the examples using the Java library.

What is Avro?

Apache Avro is a language-neutral data serialization system. It was developed by Doug Cutting, the
father of Hadoop. Since Hadoop writable classes lack language portability, Avro becomes quite
helpful, as it deals with data formats that can be processed by multiple languages. Avro is a
preferred tool to serialize data in Hadoop.

Avro has a schema-based system. A language-independent schema is associated with its read and
write operations. Avro serializes data together with its schema into a
compact binary format, which can be deserialized by any application.

Avro uses JSON format to declare the data structures. Presently, it supports languages such as Java,
C, C++, C#, Python, and Ruby.

Avro Schemas

Avro depends heavily on its schema. It allows all data to be written with no prior knowledge of
the schema. It serializes fast, and the resulting serialized data is smaller in size. The schema is stored along
with the Avro data in a file for any further processing.
In RPC, the client and the server exchange schemas during the connection handshake. This exchange helps in
the communication between same-named fields, missing fields, extra fields, etc.

Avro schemas are defined with JSON, which simplifies their implementation in languages with JSON
libraries.
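For example, a simple record schema (the record and field names here are purely illustrative) could be declared as:

{
  "type": "record",
  "name": "Employee",
  "namespace": "example.avro",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "id", "type": "int"},
    {"name": "salary", "type": ["null", "float"], "default": null}
  ]
}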

Like Avro, there are other serialization mechanisms in Hadoop such as Sequence Files, Protocol
Buffers, and Thrift.

Comparison with Thrift and Protocol Buffers

Thrift and Protocol Buffers are the libraries most comparable to Avro. Avro differs from these
frameworks in the following ways −

 Avro supports both dynamic and static types as per the requirement. Protocol Buffers and
Thrift use Interface Definition Languages (IDLs) to specify schemas and their types. These
IDLs are used to generate code for serialization and deserialization.

 Avro is built into the Hadoop ecosystem, while Thrift and Protocol Buffers are not.

Unlike Thrift and Protocol Buffer, Avro's schema definition is in JSON and not in any proprietary IDL.

Property                  Avro    Thrift & Protocol Buffers
Dynamic schema            Yes     No
Built into Hadoop         Yes     No
Schema in JSON            Yes     No
No need to compile        Yes     No
No need to declare IDs    Yes     No
Bleeding edge             Yes     No

Features of Avro

Listed below are some of the prominent features of Avro −

 Avro is a language-neutral data serialization system.

 It can be processed by many languages (currently C, C++, C#, Java, Python, and Ruby).

 Avro creates a binary structured format that is both compressible and splittable. Hence it can
be efficiently used as the input to Hadoop MapReduce jobs.

 Avro provides rich data structures. For example, you can create a record that contains an
array, an enumerated type, and a sub-record. These datatypes can be created in any
language, can be processed in Hadoop, and the results can be fed to a third language.

 Avro schemas, defined in JSON, facilitate implementation in the languages that already have
JSON libraries.

 Avro creates a self-describing file named Avro Data File, in which it stores data along with its
schema in the metadata section.

 Avro is also used in Remote Procedure Calls (RPCs). During RPC, client and server exchange
schemas in the connection handshake.

General Working of Avro

To use Avro, you need to follow the given workflow −

 Step 1 − Create schemas. Here you need to design Avro schema according to your data.

 Step 2 − Read the schemas into your program. This is done in two ways −

o By generating a class corresponding to the schema − Compile the schema using Avro.
This generates a class file corresponding to the schema.

o By using the parsers library − You can directly read the schema using the parsers library.

 Step 3 − Serialize the data using the serialization API provided for Avro, which is found in
the package org.apache.avro.specific.

 Step 4 − Deserialize the data using deserialization API provided for Avro, which is found in
the package org.apache.avro.specific.
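As an end-to-end sketch of steps 2 and 3 (this uses Avro's generic API with the parsers library rather than generated specific classes, and assumes the schema shown earlier is saved as employee.avsc; the file names and field values are illustrative):

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;

public class AvroSerializeDemo {
    public static void main(String[] args) throws Exception {
        // Step 2 (parsers library): read the schema shown earlier, saved here as employee.avsc
        Schema schema = new Schema.Parser().parse(new File("employee.avsc"));

        // Step 3: build a record and serialize it; the schema is stored in the resulting data file
        GenericRecord employee = new GenericData.Record(schema);
        employee.put("name", "Asha");
        employee.put("id", 1);
        employee.put("salary", null);

        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
        DataFileWriter<GenericRecord> fileWriter = new DataFileWriter<>(datumWriter);
        fileWriter.create(schema, new File("employees.avro"));
        fileWriter.append(employee);
        fileWriter.close();
    }
}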
