
UNIT - 2 INTRODUCTION TO HADOOP AND HADOOP ARCHITECTURE

STRUCTURE

2.0 Learning Objectives

2.1 Introduction

2.2 Hadoop

2.2.1 Definition of Hadoop

2.3 Big Data

2.3.1 Apache Hadoop

2.3.2 Hadoop Eco System

2.4 Moving Data in and out of Hadoop

2.5 Map Reduce

2.5.1 Understanding Inputs and Outputs of Map Reduce

2.6 Data Serialization

2.7 Architecture of Hadoop

2.8 Summary

2.9 Keywords

2.10 Learning Activity

2.11 Unit End Questions

2.12 References

2.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:

 Describe the concept of Hadoop.

 Define the Hadoop ecosystem.

 Explain Map Reduce.

 Elucidate data serialization.

 Describe the architecture of Hadoop.

2.1 INTRODUCTION

Information technology is an important part of most modern businesses. Business processes are mapped and integrated via IT; knowledge is gained from the available information and thus value is added. Business intelligence (BI) processes derive valuable information from internal and external sources of companies. Big data solutions take this a step further and process immense amounts of data with the help of highly complex and optimized algorithms, in order to draw the best possible economic conclusions.

People live in a digital world today; enormous amounts of data are generated. To date, companies have dealt with transaction data, that is, structured data. They have made use of these data in making decisions for the company. Structured data are data that can be specifically searched for or sorted according to individual or composite attributes. Due to the ever-increasing networking of organizations, companies, and researchers, for example through social media, web analysis applications, and scientific applications, additional terabytes and exabytes of data are created. Mobile devices and sensors also contribute to these data. A property of these data is that they are in an unstructured form. This means that the data are not stored in a predefined, structured table. They usually consist of numbers, texts, and blocks of facts and do not have a special format.

The Hadoop framework consists of two main layers: first, the Hadoop Distributed File System (HDFS), and second, Map Reduce. The newer 2.x version branch of Hadoop also includes YARN (Yet Another Resource Negotiator), which abstracts away from Map Reduce and allows the parallel execution of different jobs in a cluster.

YARN is the resource manager of Hadoop and is responsible for distributing the requested
resources (CPU, memory) of a Hadoop cluster to the various jobs. In this way, certain jobs
can be assigned more or fewer resources, which can be configured according to the
application and user.

HDFS is the first building block of a Hadoop cluster. It is a Java-based distributed file system that allows persistent and reliable storage and fast access to large amounts of data. It divides files into blocks and saves them redundantly on the cluster, in a way that is barely noticeable to the user: when files are stored in HDFS, it is generally not apparent that individual files are spread over several computers. This implicit distribution of the data makes the file system attractive because it reduces the administrative effort of data storage in a big data system.
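
As a brief illustration of how application code interacts with this storage layer, the following sketch uses the HDFS Java API to write and then read back a small file. It is only a minimal example: the fs.defaultFS address and the file path are placeholders that depend on your cluster configuration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            // Normally picked up from core-site.xml/hdfs-site.xml on the classpath;
            // the address below is an assumption for illustration only.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020");

            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/user/demo/hello.txt");   // hypothetical path

            // Write: HDFS splits the file into blocks and replicates them
            // transparently; the client only deals with paths and streams.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.writeUTF("Hello, HDFS");
            }

            // Read the same file back.
            try (FSDataInputStream in = fs.open(path)) {
                System.out.println(in.readUTF());
            }
            fs.close();
        }
    }

Note that the block splitting and replication described above happen entirely inside HDFS; the application never addresses individual blocks.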

Map Reduce is a way of breaking down each request into smaller requests that are sent to
many small servers to make the most scalable use of the CPU possible. So, scaling in smaller
steps is possible (scale-out). The model is based on two different stages for an application:

Map: an initial record-reading and transformation stage, in which individual input records can be processed in parallel.

Reduce: an aggregation or consolidation stage, in which all related records are processed by a single entity.

Two main advantages are associated with this model: input records can be processed in parallel in the map stage, and all related records are consolidated by a single entity in the reduce stage.

Map Task and Logical Blocks

The core concept of Map Reduce in Hadoop is that the input (input data) can be split into
logical blocks. Each block can be processed independently at the beginning by a map task.
The results from these individually working blocks can be physically divided into different
sets and then sorted. Each sorted block is then passed on to the reduce task.

Reduce Tasks

A map task can run on any compute node in the cluster, and multiple map tasks can run in parallel across the cluster. The map task is responsible for transforming the input records into key/value pairs. The output from all maps is partitioned, and each partition is sorted, but there is only one partition for each reduce task. The keys of each sorted partition, and the values associated with those keys, are then processed by the reduce task. Multiple reduce tasks can in turn run in parallel on the cluster.
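
To make these two stages concrete, here is a minimal word-count sketch written against the Hadoop Java MapReduce API: the map task emits a (word, 1) pair for every token it sees, and the reduce task consolidates all pairs that share the same key. The class and field names are illustrative only, not taken from any particular distribution.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map task: transforms each input line into (word, 1) key/value pairs.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce task: receives all values for one key (one sorted partition)
    // and consolidates them into a single count.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }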

Hadoop Ecosystem

The Hadoop Ecosystem is a platform or a suite which provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions. There are four major elements of Hadoop, i.e., HDFS, Map Reduce, YARN, and Hadoop Common. Most of the other tools or solutions are used to supplement or support these major elements. All these tools work collectively to provide services such as ingestion, analysis, storage, and maintenance of data.

Serialization

Serialization is the process of turning structured objects into a byte stream for transmission
over a network or for writing to persistent storage. Deserialization is the reverse process of
turning a byte stream back into a series of structured objects. Serialization appears in two quite distinct areas of distributed data processing: inter-process communication and persistent storage.

In Hadoop, inter-process communication between nodes in the system is implemented using remote procedure calls (RPCs). The RPC protocol uses serialization to render the message into a binary stream to be sent to the remote node, which then deserializes the binary stream into the original message.
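
Hadoop's own compact serialization format is built around the Writable interface, which the key and value types used in RPC messages and Map Reduce records implement. A minimal custom Writable might look like the sketch below; the record name and its fields are hypothetical.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // A simple record that Hadoop can serialize to a compact byte stream
    // (for RPC messages or intermediate map output) and deserialize again.
    public class PageViewWritable implements Writable {
        private String url;        // illustrative fields
        private long viewCount;

        public PageViewWritable() { }                 // no-arg constructor required

        public PageViewWritable(String url, long viewCount) {
            this.url = url;
            this.viewCount = viewCount;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            // Serialization: turn the structured object into a byte stream.
            out.writeUTF(url);
            out.writeLong(viewCount);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // Deserialization: rebuild the object from the byte stream,
            // reading fields in exactly the order they were written.
            url = in.readUTF();
            viewCount = in.readLong();
        }
    }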

Hadoop Architecture

Hadoop has a master-slave topology. In this topology, we have one master node and multiple slave nodes. The master node's function is to assign tasks to the various slave nodes and to manage resources. The slave nodes do the actual computing. Slave nodes store the real data, whereas the master holds metadata, meaning it stores data about data.

A good Hadoop architectural design requires various design considerations in terms of computing power, networking, and storage. The rest of this unit describes the Hadoop architecture and the factors to be considered when designing and building a Hadoop cluster for production success.

2.2 HADOOP

Hadoop is an open-source framework that is quite popular in the big data industry. Due to
Hadoop’s future scope, versatility, and functionality, it has become a must-have for every
data scientist.

In simple words, Hadoop is a collection of tools that lets you store big data in a readily
accessible and distributed environment. It enables you to process the data parallelly. The
Hadoop framework application works in an environment that provides distributed storage and
computation across clusters of computers. Hadoop is designed to scale up from a single
server to thousands of machines, each offering local computation and storage. Commodity
computers are cheap and widely available. These are mainly useful for achieving greater
computational power at a low cost.

Formally known as Apache Hadoop, the technology is developed as part of an open-source project within the Apache Software Foundation. Multiple vendors offer commercial Hadoop distributions, although the number of Hadoop vendors has declined because of an overcrowded market and competitive pressures driven by the increased deployment of big data systems in the cloud. The shift to the cloud also enables users to store data in lower-cost cloud object storage services instead of Hadoop's namesake file system; as a result, Hadoop's role is being reduced in some big data architectures.

Some Common Frameworks of Hadoop

Hive- It is a data warehouse tool used for querying, analysing, and summarizing large datasets on top of the Hadoop framework.

Drill- It supports user-defined functions and is used for data exploration.

Storm- It allows real-time processing and streaming of data.

Spark- It contains a Machine Learning Library (MLlib) for providing enhanced machine learning and is widely used for data processing. It also supports Java, Python, and Scala.

Pig- Pig is a high-level framework that allows us to work with either Apache Spark or Map Reduce to analyse the data. The language used to code for this framework is known as Pig Latin.

Tez- It reduces the complexities of Hive and Pig and helps their code run faster.

Sqoop: This framework is used for transferring data to Hadoop from relational databases. This application is based on a command-line interface.

Oozie: This is a scheduling system for workflow management, executing workflow routes for successful completion of tasks in Hadoop.

Zookeeper: An open-source centralized service which is used to provide coordination between distributed applications of Hadoop. It offers a high-level registry and synchronization service.

Hadoop Framework is Made Up of the Following Modules

Hadoop Map Reduce (Processing/Computation layer) – Map Reduce is a parallel programming model, devised at Google for efficient processing of large datasets on large clusters, and is mainly used for writing distributed data-processing applications.

Hadoop HDFS (Storage layer) – The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system that is specially designed to run on commodity hardware. It tolerates faults and errors and helps incorporate low-cost hardware. It gives high-throughput access to application data and is suitable for applications with large datasets.

Hadoop YARN –Hadoop YARN is a framework used for job scheduling and cluster resource
management.

Hadoop Common – This includes the Java libraries and utilities that are essential to start Hadoop.

Task Tracker – A node which accepts tasks such as map, reduce, and shuffle operations from the Job Tracker.

Job Tracker – A service which runs Map Reduce jobs on the cluster.

Name Node – The node where Hadoop stores all file location information (that is, where data is stored) in the Hadoop Distributed File System.

Data Node – The node that stores the actual data in the Hadoop Distributed File System.

Use Cases for Hadoop

Hadoop can be used in many industries. For this reason, some example scenarios are
provided in which possible problems can be solved using Hadoop. This is to help better
understand Hadoop.

 Customer analysis

 Challenge: Why does a company lose customers? Data on these factors come from a
variety of sources and are challenging to analyse.

 Solution with Hadoop: Quickly build a behaviour model from disparate data sources.

 Structuring and analysing with Hadoop: This includes traversing data, creating a
graph, and recognizing patterns using various information from customer data.

 Typical industries: telecommunications, financial services.

 Modelling True Risk

 Challenge: How much risk exposure does an organization really have with a
customer? Analysing multiple sources of data across multiple industries within a
company.

 Solution with Hadoop: Obtaining and accumulating disparate data sources, such as
call recordings, chat sessions, emails, and bank activities.

 Structure and analysis: sentiment analysis, developing a graph, typical pattern recognition.

 Typical industries: financial services (banks, insurance companies, etc.)

 Point of Sale (PoS) transaction analysis

 Challenge: Analysis of PoS data to target promotions and manage operations. The
sources are complex, and the volume of data grows across chains of stores and other
sources.

 Solution with Hadoop: A number of processing frameworks (HDFS, Map Reduce) allow parallel execution over large data sets.

 Pattern recognition: Optimization across multiple data sources, using the information to predict demand.

 Typical industries: retail.

 Analyse network data to predict failures

 Challenge: The analysis of data series in real-time, from a network of sensors. Over
time, calculating the average frequency has become quite tedious due to the need for
analysing terabytes of data.

 Solution with Hadoop: Calculate this data by expanding from simple queries to more
complex data mining. This gives you a better understanding of how the network reacts
to changes. Separate anomalies can be linked together.

 Typical industries: telecommunications, data centres, utilities.

 Analyse research data to support decisions and actions

 Challenge: The amount of data is increasing due to the enormous increase in internal
and external data sources, but the time that is available to use this data is getting
shorter and shorter.

 Solution with Hadoop: Isolated data sources and data redundancy can be uncovered with the help of data analysis. Data analysis is one of the most important tasks of data quality. However, employees also want to know whether the quality of their research data is sufficient for informed decisions.

 Typical industries: universities and research institutes, libraries.

2.2.1 Definition of Hadoop

Hadoop is an open-source distributed processing framework that manages data processing and storage for big data applications in scalable clusters of computer servers. It's at the centre
of an ecosystem of big data technologies that are primarily used to support advanced
analytics initiatives, including predictive analytics, data mining and machine learning.
Hadoop systems can handle various forms of structured and unstructured data, giving users
more flexibility for collecting, processing, analysing, and managing data than relational
databases and data warehouses provide.

Hadoop's ability to process and store different types of data makes it a particularly good fit
for big data environments. They typically involve not only large amounts of data but also a
mix of structured transaction data and semi structured and unstructured information, such as
internet clickstream records, web server and mobile application logs, social media posts,
customer emails and sensor data from the internet of things (IoT).

History of Hadoop

Hadoop was created by Doug Cutting, who was also the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open-source web search engine that was itself a part of the Lucene project.

Hadoop Challenges

The most significant challenge in implementing a Hadoop cluster is the significant learning curve associated with building, operating, and maintaining the cluster. A majority of enterprise data is structured in a manner for which storage and retrieval would be easier in an RDBMS, whereas doing the same thing in Hadoop is a challenge. Hadoop is highly configurable, which makes optimization for better performance a key challenge. Hadoop requires a highly specialized skill set to manage. Additionally, since it is an open-source project, there is no official support channel.

The Hadoop ecosystem is so expansive, and the dynamics are so large that the technology
changes seemingly every other week. The "go-live time" is generally long from the time of
cluster implementation due to a lot of manual code and development effort. Hadoop is
specifically designed for large-scale data storage and processing, whereas day-to-day business analysis and reporting require faster retrieval and ad-hoc query capabilities, which Hadoop is not built for.

Diyotta and Hadoop

Diyotta complements Hadoop by utilizing the enormous processing power Hadoop offers while providing an intuitive UI and user-friendly front end and requiring little to no coding expertise. More specifically, the complexity associated with a Hadoop ecosystem is intelligently handled internally by Diyotta, requiring the developer/analyst to focus only on the business logic without being hampered by the technical challenges associated with Hadoop.

Hadoop – Pros and Cons

Big Data has become necessary as industries grow; the goal is to gather information and find hidden facts behind the data. Data defines how industries can improve their activities and affairs. A large number of industries revolve around data, and a large amount of data is gathered and analysed through various processes with various tools.

Hadoop is one of the tools used to deal with this huge amount of data, as it can easily extract information from data. Hadoop has its advantages and disadvantages when we deal with Big Data.

Pros

Fig 2.1 Advantages of Hadoop

Cost

Hadoop is open-source and uses cost-effective commodity hardware, which provides a cost-efficient model, unlike traditional relational databases that require expensive hardware and high-end processors to deal with Big Data. The problem with traditional relational databases is that storing a massive volume of data is not cost-effective, so companies started to remove the raw data, which may not reflect the correct scenario of their business. Hadoop thus provides two main benefits with respect to cost: it is open source and therefore free to use, and it uses commodity hardware, which is also inexpensive.

Scalability

Hadoop is a highly scalable model. A large amount of data is divided across multiple inexpensive machines in a cluster and processed parallelly. The number of these machines or nodes can be increased or decreased as per the enterprise's requirements. In a traditional RDBMS (Relational Database Management System), the system cannot be scaled to approach large amounts of data in this way.

Flexibility

Hadoop is designed in such a way that it can deal with any kind of dataset, such as structured (MySQL data), semi-structured (XML, JSON), and unstructured (images and videos), very efficiently. This means it can easily process any kind of data independent of its structure, which makes it highly flexible. This is very useful for enterprises, as they can process large datasets easily, and businesses can use Hadoop to analyse valuable insights from sources like social media, email, etc. With this flexibility, Hadoop can be used for log processing, data warehousing, fraud detection, and so on.

Speed

Hadoop uses a distributed file system to manage its storage, i.e., HDFS (Hadoop Distributed File System). In a DFS (Distributed File System), a large file is broken into small file blocks and distributed among the nodes available in a Hadoop cluster. Because this massive number of file blocks is processed parallelly, Hadoop is faster and provides high-level performance compared to traditional database management systems. When you are dealing with a large amount of unstructured data, speed is an important factor; with Hadoop you can easily access TBs of data in just a few minutes.

Fault Tolerance

Hadoop uses commodity hardware (inexpensive systems) which can crash at any moment. In Hadoop, data is replicated on various Data Nodes in a Hadoop cluster, which ensures the availability of data if any of your systems crashes. If a machine holding the data faces a technical issue, the data can also be read from other nodes in the Hadoop cluster, because the data is copied or replicated by default. Hadoop makes three copies of each file block and stores them on different nodes.
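
The replication factor is configurable rather than fixed at three. As a small, hedged sketch using the standard HDFS client API, where the file path is hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Cluster-wide default (normally set in hdfs-site.xml): 3 copies per block.
            conf.setInt("dfs.replication", 3);

            FileSystem fs = FileSystem.get(conf);
            // Raise the replication factor of one important file to 5 copies.
            fs.setReplication(new Path("/data/critical/report.csv"), (short) 5);
            fs.close();
        }
    }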

High Throughput

Hadoop works on a distributed file system where various jobs are assigned to various Data Nodes in a cluster; the data is processed parallelly in the Hadoop cluster, which produces high throughput. Throughput is nothing but the task or job done per unit time.

Minimum Network Traffic

In Hadoop, each task is divided into various small sub-tasks, which are then assigned to the data nodes available in the Hadoop cluster. Each data node processes a small amount of data, which leads to low traffic in a Hadoop cluster.

Cons

Fig 2. 2 Disadvantages of Hadoop

Problem with Small files

Hadoop performs efficiently over a small number of files of large size. Hadoop stores files in the form of file blocks which range from 128 MB (by default) to 256 MB in size. Hadoop fails when it needs to access a large number of small files. Such a mass of small files overloads the Name Node and makes it difficult to work.

Vulnerability

Hadoop is a framework that is written in Java, and Java is one of the most commonly used programming languages, which makes Hadoop more exposed, as it can more easily be exploited by cyber-criminals.

Low Performance in Small Data Surrounding

Hadoop is mainly designed for dealing with large datasets, so it can be efficiently utilized by organizations that generate a massive volume of data. Its efficiency decreases in small-data environments.

Lack of Security

Data is everything for an organization, yet by default the security features in Hadoop are disabled. So, whoever manages the data needs to be careful with this security aspect and take appropriate action. Hadoop uses Kerberos for security, which is not easy to manage. Storage and network encryption are missing in Kerberos, which makes this an even greater concern.

High Up Processing

Read/write operations in Hadoop are expensive, since we are dealing with large data sizes in the range of TB or PB. In Hadoop, data is read from or written to disk, which makes it difficult to perform in-memory calculation and leads to processing overhead, or high processing cost.

Supports Only Batch Processing

A batch process is nothing but a process that runs in the background and does not have any kind of interaction with the user. The engines used for these processes inside the Hadoop core are not very efficient, and producing output with low latency is not possible.

2.3 BIG DATA

In a fast-paced and hyper-connected world where more and more data are being created,
Hadoop’s breakthrough advantages mean that businesses and organizations can now find
value in data that was considered useless.

Organizations are realizing that categorizing and analysing Big Data can help make major business predictions. Hadoop allows enterprises to store as much data as they want, in whatever form, simply by adding more servers to a Hadoop cluster. Each new server adds more storage and processing power to the cluster. This makes data storage with Hadoop less expensive than earlier data storage methods.

Big data is a term that refers to data sets, or collections of data sets, whose size (volume), complexity (variability), and rate of growth (velocity) make them difficult to capture, manage, process, or analyse with traditional technologies and tools, such as RDBMSs and desktop statistics or visualization packages, within the time required to make them useful. To give a sense of the scale of Big Data today, Facebook has 1,490 million active users and WhatsApp has 800 million active users.

Another example is Flickr, which offers unlimited photo uploads (50 MB per photo), unlimited video uploads, HD video playback, unlimited storage, and unlimited bandwidth. Flickr had a total of 87 million registered members and more than 3.5 million new images uploaded daily.

Fig 2.3 Big Data connecting services and industries

Big Data is a vague topic and there is no exact definition which is followed by everyone. Data that has extra-large volume, comes from a variety of sources in a variety of formats, and comes at us with great velocity is normally referred to as Big Data. Big data can be structured, unstructured, or semi-structured, and it cannot be processed by conventional data management methods. Data can be generated on the web in various forms such as text, images, videos, or social media posts. In order to process these large amounts of data in an inexpensive and efficient way, parallelism is used. There are four characteristics of big data: Volume, Velocity, Variety, and Veracity.

Challenges with Big Data

There are 800 million web pages on the Internet giving information about Big Data. Big Data is the next big thing after the Cloud. Big data brings many opportunities in health, education, earth sciences, and business, but dealing with data of such large volume using traditional models is very difficult. So, we need to look at big data challenges and design computing models for efficient analysis of data.

 Heterogeneity and Incompleteness

If we want to analyse data, it should ideally be structured, but when we deal with Big Data, the data may be structured or unstructured. Heterogeneity is a big challenge in data analysis, and analysts need to cope with it. Consider the example of a patient in a hospital: we make a record for each medical test and also a record for the hospital stay, and these records differ from patient to patient. This design is not well structured, so managing heterogeneous and incomplete data is required, and good data analysis must be applied to it.

 Scale

As the name says, Big Data involves data sets of very large size. Managing large data sets has been a big problem for decades. Earlier, this problem was solved by processors getting faster, but now data volumes are becoming huge while processor speeds are static. The world is moving towards cloud technology, and due to this shift data is generated at a very high rate. This high rate of increase in data is becoming a challenging problem for data analysts. Hard disks are used to store the data, but they have slower I/O performance. Hard disks are now being replaced by solid-state drives and other technologies, which do not suffer from the slow rates of hard disks, so new storage systems should be designed around them.
Timeliness: Another challenge with size is speed. The larger the data sets, the longer it takes to analyse them. Any system which deals effectively with the size is likely to perform well in terms of speed. There are cases when we need the analysis results immediately; for example, a potentially fraudulent transaction should be analysed before the transaction is completed. So, new systems should be designed to meet this challenge in data analysis.

 Privacy

Privacy of data is another big problem with big data. In some countries there are strict laws regarding data privacy; for example, in the USA there are strict laws for health records, but for other kinds of data the rules are less forceful. For example, in social media we cannot get the private posts of users for sentiment analysis.

 Human Collaborations

In spite of advanced computational models, there are many patterns that a computer cannot detect. A new method of harnessing human ingenuity to solve problems is crowdsourcing; Wikipedia is the best example. We rely on information given by strangers, and most of the time it is correct. But there can also be people with other motives, such as providing false information. We need technological models to cope with this. As humans, we can look at the reviews of a book, see that some are positive and some are negative, and come to a decision on whether to buy it or not. We need systems to be that intelligent to decide.

Opportunities to Big Data

This is the time of the data revolution. Big Data is giving many opportunities to business organizations to grow their business to higher profit levels. Not only in technology: big data is playing an important role in every field, such as health, economics, banking, and corporates, as well as in government.

 Technology

Almost every top organization, such as Facebook, IBM, and Yahoo, has adopted Big Data and is investing in it. Facebook handles 50 billion photos of users. Every month Google handles 100 billion searches. From these stats we can say that there are a lot of opportunities on the internet and social media.

 Government

Big data can be used to handle the problems faced by the government. The Obama government announced a big data research and development initiative in 2012. Big data analysis played an important role in the BJP winning the elections in 2014, and the Indian government is applying big data analysis to the Indian electorate.

 Healthcare:

According to IBM's Big Data for Healthcare, 80% of medical data is unstructured. Healthcare organizations are adopting big data technology to get complete information about a patient. To improve healthcare and lower costs, big data analysis is required and the relevant technology should be adopted.

 Science and Research:

Big data is the latest topic of research. Many researchers are working on big data, and many papers are being published on it. The NASA Center for Climate Simulation stores 32 petabytes of observations.

 Media:

Media companies are using big data for the promotion and selling of products by targeting the interests of users on the internet. For example, from social media posts, data analysts get the number of posts and then analyse the interests of users. It can also be done by examining the positive or negative reviews on social media.

2.3.1 Apache Hadoop

Apache Hadoop is an open-source framework that is suited for processing large data sets on
commodity hardware. Hadoop is an implementation of Map Reduce, an application
programming model developed by Google, which provides two fundamental operations for
data processing: map and reduce. The former transforms and synthesizes the input data
provided by the user; the latter aggregates the output obtained by the map operations. Hadoop
provides the runtime environment, and developers need only provide the input data and
specify the map and reduce functions that need to be executed. Yahoo!, the sponsor of the
Apache Hadoop project, has put considerable effort into transforming the project into an
enterprise-ready cloud computing platform for data processing. Hadoop is an integral part of
the Yahoo! cloud infrastructure and supports several business processes of the company.
Currently, Yahoo! manages the largest Hadoop cluster in the world, which is also available to
academic institutions.

Because of the newness and the associated complexity of Hadoop, there are several areas
wherein confusion reigns and restrains its full-fledged assimilation and adoption. The Apache
Hadoop product family includes the Hadoop Distributed File System (HDFS), Map Reduce,
Hive, HBase, Pig, Zookeeper, Flume, Sqoop, Oozie, Hue, and so on. HDFS and Map Reduce
together constitute the core of Hadoop. For applications in BI, data warehousing (DW), and
big data analytics, the core Hadoop is usually augmented with Hive and HBase and
sometimes with Pig.

The Hadoop file system excels with big data; it is file-based and holds multi-structured (structured, semi-structured, and unstructured) data. HDFS is a distributed file system
designed to run on clusters of commodity hardware. HDFS is highly fault tolerant because it
automatically replicates file blocks across multiple machine nodes and is designed to be
deployed on low-cost hardware. HDFS provides high-throughput access to application data
and is suitable for applications that have large data sets.

Because it is file-based, HDFS itself does not offer random access to data and has limited
metadata capabilities when compared to a DBMS. Likewise, HDFS is strongly batch-oriented
and hence has limited real-time data access functions. To overcome these challenges, you can
layer HBase over HDFS to gain some of the mainstream DBMS capabilities. HBase is
modelled after Google's big table, and hence HBase, like big table, excels with random and
real-time access to very large tables containing billions of rows and millions of columns.
Today HBase is limited to straightforward tables and records with little support for more
complex data structures. The Hive meta-store gives Hadoop some DBMS-like metadata
capabilities.

Hadoop is not just for new analytic applications, but it can revamp old ones too. For example,
analytics for risk and fraud is based on statistical analysis or data mining. This analytics
process benefits immensely from the much larger data samples that HDFS and Map Reduce can wring from diverse data sources. Further on, most of the 360-degree customer
views include hundreds of customer attributes. Hadoop has the inherent capability to include
thousands of attributes and hence is touted as the best-in-class approach for next-generation
precision-centric analytics. Hadoop is a promising and potential technology that allows large
data volumes to be organized and processed while keeping the data on the original data
storage cluster. For example, clickstreams and weblogs can be turned into browsing
behaviour (sessions) by running Map Reduce programs (Hadoop) on the compute cluster and
generating aggregated results on the same cluster. The attained results are then loaded into a
relational DBMS system to be queried using structured query languages.

HBase is the mainstream Apache Hadoop database. It is an open-source, non-relational (column-oriented), scalable, and distributed database management system that supports structured data storage. Apache HBase is the right approach when you need random and real-time read/write access to your big data.

This is for hosting of very large tables (billions of rows × millions of columns) on top of
clusters of commodity hardware. Just as Google Big table leverages the distributed data
storage provided by the Google File System, Apache HBase provides big table-like
capabilities on top of Hadoop and HDFS. HBase does support writing applications in Avro,
REST, and Thrift. There is a separate chapter on the Hadoop ecosystem covering all about its
origin, the widespread growth, and impacts besides some of the distinct use cases.

The term Hadoop is often used for both base modules and sub-modules and also the
ecosystem, or collection of additional software packages that can be installed on top of or
alongside Hadoop, such as

Apache Pig

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The
language for this platform is called Pig Latin. Pig Latin abstracts the programming from the
Java Map Reduce idiom into a notation which makes Map Reduce programming high level,
similar to that of SQL for relational database management systems. Pig Latin can be extended
using user-defined functions (UDFs) which the user can write in Java, Python, JavaScript,
Ruby or Groovy and then call directly from the language. Pig Latin allows users to specify an
implementation or aspects of an implementation to be used in executing a script in several
ways. In effect, Pig Latin programming is similar to specifying a query execution plan,
making it easier for programmers to explicitly control the flow of their data processing task.

Pig Latin script describes a directed acyclic graph (DAG) rather than a pipeline. Pig Latin's
ability to include user code at any point in the pipeline is useful for pipeline development. If
SQL is used, data must first be imported into the database, and then the cleansing and
transformation process can begin.

Apache Hive

Apache Hive is a data warehouse software project built on top of Apache Hadoop for
providing data query and analysis. Hive gives an SQL-like interface to query data stored in
various databases and file systems that integrate with Hadoop. Traditional SQL queries must
be implemented in the Map Reduce Java API to execute SQL applications and queries over
distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries
(HiveQL) into the underlying Java without the need to implement queries in the low-level
Java API. Since most data warehousing applications work with SQL-based querying
languages, Hive aids portability of SQL-based applications to Hadoop.
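
Because HiveServer2 exposes a JDBC interface, HiveQL can be issued from ordinary Java code. The sketch below is illustrative only; the host, port, credentials, and the clicks table are assumptions rather than values from the text.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
        public static void main(String[] args) throws Exception {
            // HiveServer2 JDBC driver; host, port, and database are hypothetical.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://hiveserver:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {

                // HiveQL looks like SQL but is compiled into jobs on the cluster.
                ResultSet rs = stmt.executeQuery(
                        "SELECT page, COUNT(*) AS views FROM clicks GROUP BY page");
                while (rs.next()) {
                    System.out.println(rs.getString("page") + " -> " + rs.getLong("views"));
                }
            }
        }
    }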

While initially developed by Facebook, Apache Hive is used and developed by other
companies such as Netflix and the Financial Industry Regulatory Authority (FINRA).
Amazon maintains a software fork of Apache Hive included in Amazon Elastic Map Reduce
on Amazon Web Services.

Apache HBase

HBase is an open-source non-relational distributed database modelled after Google's big table
and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop
project and runs on top of HDFS (Hadoop Distributed File System) or Alluxio, providing big
table-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large
quantities of sparse data. HBase runs on top of HDFS and is well-suited for faster read and
write operations on large datasets with high throughput and low input/output latency. HBase
is now serving several data-driven websites, although Facebook's Messaging Platform migrated from HBase to MyRocks in 2018. Unlike relational and traditional databases, HBase does not support SQL scripting; instead, the equivalent is written in Java, in a style similar to a Map Reduce application.
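
A short sketch of that Java client API is shown below; the table name, row key, and column names are hypothetical, and the code assumes an hbase-site.xml describing the cluster is on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {

                // Write one cell: row key "user42", column family "profile", qualifier "name".
                Put put = new Put(Bytes.toBytes("user42"));
                put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"),
                              Bytes.toBytes("Ada"));
                table.put(put);

                // Random, real-time read of the same row.
                Result result = table.get(new Get(Bytes.toBytes("user42")));
                byte[] name = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("name"));
                System.out.println(Bytes.toString(name));
            }
        }
    }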

Apache Phoenix

Apache Phoenix is an open source, massively parallel, relational database engine supporting
OLTP for Hadoop using Apache HBase as its backing store. Phoenix provides a JDBC driver
that hides the intricacies of the NoSQL store enabling users to create, delete, and alter SQL
tables, views, indexes, and sequences; insert and delete rows singly and in bulk; and query
data through SQL. Phoenix compiles queries and other statements into native NoSQL store
APIs rather than using Map Reduce enabling the building of low latency applications on top
of NoSQL stores.

Apache Spark

Apache Spark is an open-source unified analytics engine for large-scale data processing.
Spark provides an interface for programming entire clusters with implicit data parallelism and
fault tolerance. Apache Spark has its architectural foundation in the resilient distributed
dataset (RDD), a read-only multiset of data items distributed over a cluster of machines, that
is maintained in a fault-tolerant way. Spark facilitates the implementation of both iterative
algorithms, which visit their data set multiple times in a loop, and interactive/exploratory data
analysis, i.e., the repeated database-style querying of data. Apache Spark requires a cluster
manager and a distributed storage system.

For cluster management, Spark supports a standalone native Spark cluster, where you can launch a cluster either manually or by using the launch scripts provided by the install package, as well as external cluster managers such as Hadoop YARN.

Apache Zoo Keeper

Apache Zoo Keeper is an open-source server for highly reliable distributed coordination of
cloud applications. It is a project of the Apache Software Foundation. Zoo Keeper was a sub-
project of Hadoop but is now a top-level Apache project in its own right.

Zoo Keeper nodes store their data in a hierarchical name space, much like a file system or a
tree data structure. Clients can read from and write to the nodes and in this way have a shared
configuration service. Zoo Keeper can be viewed as an atomic broadcast system, through
which updates are totally ordered. The Zoo Keeper Atomic Broadcast (ZAB) protocol is the
core of the system.

Zoo Keeper is modelled after Google's Chubby lock service and was originally developed at
Yahoo! for streamlining the processes running on big-data clusters by storing the status in
local log files on the Zoo Keeper servers. These servers communicate with the client
machines to provide them the information. Zoo Keeper was developed in order to fix the bugs
that occurred while deploying distributed big-data applications.

Apache Flume

Apache Flume is a distributed, reliable, and available software for efficiently collecting,
aggregating, and moving large amounts of log data. It has a simple and flexible architecture
based on streaming data flows. It is robust and fault tolerant with tuneable reliability
mechanisms and many failover and recovery mechanisms. It uses a simple extensible data
model that allows for online analytic application.

Apache Sqoop

Sqoop is a command-line interface application for transferring data between relational databases and Hadoop. Sqoop supports incremental loads of a single table or a free-form SQL
query as well as saved jobs which can be run multiple times to import updates made to a
database since the last import. Imports can also be used to populate tables in Hive or HBase.
Exports can be used to put data from Hadoop into a relational database. Sqoop got the name
from "SQL-to-Hadoop". Sqoop became a top-level Apache project.

Apache Oozie

Apache Oozie is a server-based workflow scheduling system to manage Hadoop jobs.

Workflows in Oozie are defined as a collection of control flow and action nodes in a directed
acyclic graph. Control flow nodes define the beginning and the end of a workflow (start, end,
and failure nodes) as well as a mechanism to control the workflow execution path (decision,
fork, and join nodes). Action nodes are the mechanism by which a workflow triggers the
execution of a computation/processing task. Oozie provides support for different types of
actions including Hadoop Map Reduce, Hadoop distributed file system operations, Pig, SSH,
and email. Oozie can also be extended to support additional types of actions.

Oozie workflows can be parameterised using variables such as ${inputDir} within the
workflow definition. When submitting a workflow job, values for the parameters must be
provided. If properly parameterized (using different output directories), several identical
workflow jobs can run concurrently.

Oozie is implemented as a Java web application that runs in a Java servlet container and is
distributed under the Apache License 2.0.

Apache Storm

Apache Storm is a distributed stream processing computation framework written predominantly in the Clojure programming language. Originally created by Nathan Marz and his team at BackType, the project was open-sourced after being acquired by Twitter. It uses
custom created "spouts" and "bolts" to define information sources and manipulations to allow
batch, distributed processing of streaming data.

A Storm application is designed as a "topology" in the shape of a directed acyclic graph (DAG) with spouts and bolts acting as the graph vertices. Edges on the graph are named
streams and direct data from one node to another. Together, the topology acts as a data
transformation pipeline. At a superficial level the general topology structure is similar to a
Map Reduce job, with the main difference being that data is processed in real time as
opposed to in individual batches. Additionally, Storm topologies run indefinitely until killed,
while a Map Reduce job DAG must eventually end.

2.3.2 Hadoop Eco System

The Hadoop Ecosystem is neither a programming language nor a service; it is a platform or framework which solves big data problems. You can consider it as a suite which encompasses a number of services (ingesting, storing, analysing, and maintaining data) inside it. Let us discuss and get a brief idea of how the services work individually and in collaboration.

Overview

Apache Hadoop is an open-source framework intended to make interaction with big data easier. However, for those who are not acquainted with this technology, one question arises: what is big data? Big data is a term given to data sets which can't be processed in an efficient manner with the help of traditional methodologies such as RDBMS. Hadoop has made its place in the industries and companies that need to work on large data sets which are sensitive and need efficient handling. Hadoop is a framework that enables processing of large data sets which reside in the form of clusters. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.

Introduction

Hadoop, a de facto industry standard, has become the kernel of the distributed operating system for Big Data. Hadoop has gained its popularity due to its ability to store, analyse, and access large amounts of data quickly and cost-effectively through clusters of commodity hardware. But no one uses a kernel alone. Hadoop is taken to be a combination of HDFS and Map Reduce. To complement the Hadoop modules there are also a variety of other projects that provide specialized services and are broadly used to make Hadoop accessible to laymen and more usable, collectively known as the Hadoop Ecosystem. All the components of the Hadoop ecosystem exist as explicit entities to address particular needs. The recent Hadoop ecosystem consists of different layers, each layer performing different kinds of tasks, such as storing your data, processing stored data, allocating resources, and supporting different programming languages to develop various applications in the Hadoop ecosystem.

The Hadoop Ecosystem is a platform or a suite which provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions. There are four major elements of Hadoop, i.e., HDFS, Map Reduce, YARN, and Hadoop Common. Most of the other tools or solutions are used to supplement or support these major elements. All these tools work collectively to provide services such as ingestion, analysis, storage, and maintenance of data.

Following are the components that collectively form a Hadoop ecosystem:

Fig 2.4 The Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines. Hadoop and HDFS were derived from the Google File System (GFS) paper. Prior to Hadoop 2.0.0, the Name Node was a single point of failure (SPOF) in an HDFS cluster. With Zookeeper, the HDFS High Availability feature addresses this problem by providing the option of running two redundant Name Nodes in the same cluster in an Active/Passive configuration with a hot standby.

Map Reduce

Map Reduce is a programming model for processing large data sets with a parallel,
distributed algorithm on a cluster. Apache Map Reduce was derived from Google Map
Reduce: Simplified Data Processing on Large Clusters paper. The current Apache Map
Reduce version is built over Apache YARN Framework. YARN stands for “Yet-Another-
Resource-Negotiator”. It is a new framework that facilitates writing arbitrary distributed
processing frameworks and applications. YARN’s execution model is more generic than the
earlier Map Reduce implementation. YARN can run applications that do not follow the Map
Reduce model, unlike the original Apache Hadoop Map Reduce (also called MR1). Hadoop
YARN is an attempt to take Apache Hadoop beyond Map Reduce for data-processing.

Hadoop YARN

Hadoop YARN (Yet Another Resource Negotiator) is a Hadoop ecosystem component that provides resource management. YARN is also one of the most important components of the Hadoop ecosystem. YARN is called the operating system of Hadoop, as it is responsible for managing and monitoring workloads. It allows multiple data processing engines, such as real-time streaming and batch processing, to handle data stored on a single platform.

Thrift

Thrift is a software framework for scalable cross-language services development. Thrift is an interface definition language for RPC (Remote Procedure Call) communication. Hadoop does a lot of RPC calls, so there is a possibility of using the Hadoop ecosystem component Apache Thrift for performance or other reasons.

Apache Drill

The main purpose of this Hadoop ecosystem component is large-scale data processing, including structured and semi-structured data. Drill is a low-latency distributed query engine that is designed to scale to several thousands of nodes and query petabytes of data. Drill is the first distributed SQL query engine that has a schema-free model.

Apache Mahout

Apache Mahout is an open-source framework for creating scalable machine learning algorithms and a data mining library. Once data is stored in Hadoop HDFS, Mahout provides the data science tools to automatically find meaningful patterns in those big data sets.

Ambari

Ambari, another Hadoop ecosystem component, is a management platform for provisioning, managing, monitoring, and securing an Apache Hadoop cluster. Hadoop management gets simpler as Ambari provides a consistent, secure platform for operational control.

Zookeeper

Zookeeper is a centralized service and a Hadoop ecosystem component for maintaining configuration information, naming, providing distributed synchronization, and providing group services. Zookeeper manages and coordinates a large cluster of machines.

HCatalog

HCatalog is a table and storage management layer for Hadoop. HCatalog supports different
components available in Hadoop ecosystems like Map Reduce, Hive, and Pig to easily read
and write data from the cluster. HCatalog is a key component of Hive that enables the user to
store their data in any format and structure.

By default, HCatalog supports RCFile, CSV, JSON, sequence File and ORC file formats.

Pig

Apache Pig is a high-level language platform for analysing and querying huge datasets that are stored in HDFS. Pig, as a component of the Hadoop ecosystem, uses the Pig Latin language, which is very similar to SQL. It loads the data, applies the required filters, and dumps the data in the required format. For program execution, Pig requires a Java runtime environment.

Hive

The Hadoop ecosystem component, Apache Hive, is an open-source data warehouse system
for querying and analysing large datasets stored in Hadoop files. Hive does three main
functions: data summarization, query, and analysis.

HBase

Apache HBase is a Hadoop ecosystem component which is a distributed database designed to store structured data in tables that can have billions of rows and millions of columns. HBase is a scalable, distributed NoSQL database that is built on top of HDFS. HBase provides real-time access to read or write data in HDFS.

Apache Sqoop

Sqoop imports data from external sources into related Hadoop ecosystem components like HDFS, HBase, or Hive. It also exports data from Hadoop to other external sources. Sqoop works with relational databases such as Teradata, Netezza, Oracle, and MySQL.

Apache Flume

Flume efficiently collects, aggregates, and moves large amounts of data from their origin and sends them to HDFS. It is a fault-tolerant and reliable mechanism. This Hadoop ecosystem component allows data to flow from the source into the Hadoop environment. It uses a simple extensible data model that allows for online analytic applications. Using Flume, we can get data from multiple servers into Hadoop immediately.

2.4 MOVING DATA IN AND OUT OF HADOOP

In this chapter we refer to data ingress and egress: the process by which data is transported from an external system into an internal system, and vice versa. Hadoop supports ingress and egress at a low level in HDFS and Map Reduce. Files can be moved in and out of HDFS, and data can be pulled from external data sources and pushed to external data sinks using Map Reduce. This chapter surveys some of Hadoop's ingress and egress mechanisms.

Data movement is one of those things that you aren’t likely to think too much about until
you’re fully committed to using Hadoop on a project, at which point it becomes this big scary
unknown that has to be tackled. How do you get your log data sitting across thousands of
hosts into Hadoop? What’s the most efficient way to get your data out of your relational and
No/NewSQL systems and into Hadoop? How do you get Lucene indexes generated in
Hadoop out to your servers? And how can these processes be automated?

The goal is to answer these questions and set you on your path to worry-free data movement.
In this chapter you’ll first see how data across a broad spectrum of locations and formats can
be moved into Hadoop, and then you’ll see how data can be moved out of Hadoop.

This chapter starts by highlighting key data-movement properties, so that as you go through
the rest of this chapter you can evaluate the fit of the various tools. It goes on to look at low-
level and high-level tools that can be used to move your data. We’ll start with some simple
techniques, such as using the command line and Java for ingress, but we'll quickly move
on to more advanced techniques like using NFS and DistCp.

Ingress and egress refer to data movement into and out of a system, respectively.

Further, how do you automate your data ingress and egress process so that your data is
moved at regular intervals? Automation is a critical part of the process, along with
monitoring and data integrity responsibilities to ensure correct and safe transportation of data.

Once the low-level tooling is out of the way, we’ll survey higher-level tools that have
simplified the process of ferrying data into Hadoop. We’ll look at how you can automate the
movement of log files with Flume, and how Sqoop can be used to move relational data. So as
not to ignore some of the emerging data systems, you’ll also be introduced to methods that
can be employed to move data from HBase and Kafka into Hadoop.

We’ll cover a lot of ground in this chapter, and it’s likely that you’ll have specific types of
data you need to work with. If this is the case, feel free to jump directly to the section that
provides the details you need.

Let’s start things off with a look at key ingress and egress system considerations.


Key Elements of Data Movement

Moving large quantities of data in and out of Hadoop offers logistical challenges that include
consistency guarantees and resource impacts on data sources and destinations. Before we
dive into the techniques, however, we need to discuss the design elements you should be
aware of when working with data movement.

Idempotence - An idempotent operation produces the same result no matter how many times
it’s executed. In a relational database, the inserts typically aren’t idempotent, because
executing them multiple times doesn’t produce the same resulting database state.
Alternatively, updates often are idempotent because they’ll produce the same end result.

Any time data is being written, idempotence should be a consideration, and data ingress and
egress in Hadoop are no different. How well do distributed log collection frameworks deal
with data retransmissions? How do you ensure idempotent behaviour in a Map Reduce job
where multiple tasks are inserting into a database in parallel? We’ll examine and answer
these questions in this chapter.
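
One common way to approach the database case is to key every write on a unique identifier and use an upsert, so that a retried or duplicated task simply overwrites the same row. The sketch below uses plain JDBC with MySQL-style ON DUPLICATE KEY syntax; the connection details, table, and columns are assumptions for illustration, not part of any Hadoop API.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class IdempotentWrite {
        public static void main(String[] args) throws Exception {
            // Connection details are placeholders.
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:mysql://dbhost:3306/analytics", "user", "password")) {
                // Re-running this statement with the same event_id leaves the
                // database in the same state, which makes retries safe.
                String sql = "INSERT INTO events (event_id, payload) VALUES (?, ?) "
                           + "ON DUPLICATE KEY UPDATE payload = VALUES(payload)";
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    ps.setString(1, "evt-0001");
                    ps.setString(2, "{\"clicks\": 7}");
                    ps.executeUpdate();
                }
            }
        }
    }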

Aggregation - The data aggregation process combines multiple data elements. In the context
of data ingress, this can be useful because moving large quantities of small files into HDFS
potentially translates into Name Node memory woes, as well as slow Map Reduce execution
times. Having the ability to aggregate files or data together mitigates this problem and is a
feature to consider.

Data format transformation - The data format transformation process converts one data
format into another. Often your source data isn’t in a format that’s ideal for processing in
tools such as Map-Reduce. If your source data is in multiline XML or JSON form, for
example, you may want to consider a pre-processing step. This would convert the data into a
form that can be split, such as one JSON or XML element per line, or convert it into a format
such as Avro. Chapter 3 contains more details on these data formats.
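
As a concrete illustration of such a pre-processing step, the sketch below flattens a multiline JSON array into one compact record per line so the resulting file splits cleanly. It assumes the Jackson library is available on the classpath, and the input and output file names are placeholders.

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class JsonToLineDelimited {
        public static void main(String[] args) throws IOException {
            ObjectMapper mapper = new ObjectMapper();
            // Assumed input: a JSON array of records, possibly pretty-printed
            // over many lines.
            JsonNode records = mapper.readTree(Paths.get("input.json").toFile());

            try (BufferedWriter out = Files.newBufferedWriter(Paths.get("records.jsonl"))) {
                for (JsonNode rec : records) {
                    // One compact JSON object per line: splittable by Map Reduce.
                    out.write(mapper.writeValueAsString(rec));
                    out.newLine();
                }
            }
        }
    }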

Compression - Compression not only helps by reducing the footprint of data at rest, but also
has I/O advantages when reading and writing data.

Availability and Recoverability - Recoverability allows an ingress or egress tool to retry in the event of a failed operation. Because it's unlikely that any data source, sink, or Hadoop
itself can be 100% available, it’s important that an ingress or egress action be retried in the
event of failure.

Reliable Data Transfer and Data Validation - In the context of data transportation,
checking for correctness is how you verify that no data corruption occurred as the data was in
transit. When you work with heterogeneous systems such as Hadoop data ingress and egress,
the fact that data is being transported across different hosts, networks, and protocols only
increases the potential for problems during data transfer. A common method for checking the
correctness of raw data, such as storage devices, is Cyclic Redundancy Checks (CRCs),
which are what HDFS uses internally to maintain block-level integrity.

In addition, it’s possible that there are problems in the source data itself due to bugs in the
software generating the data. Performing these checks at ingress time allows you to do a one-
time check, instead of dealing with all the downstream consumers of the data that would have
to be updated to handle errors in the data.

Resource Consumption and Performance - Resource consumption and performance are measures of system resource utilization and system efficiency, respectively. Ingress and egress tools don't typically impose significant load (resource consumption) on a system unless you have appreciable data volumes.

For performance, the questions to ask include whether the tool performs ingress and egress
activities in parallel, and if so, what mechanisms it provides to tune the amount of
parallelism. For example, if your data source is a production database and you’re using Map
Reduce to ingest that data, don’t use a large number of concurrent map tasks to import data.

Monitoring - Monitoring ensures that functions are performing as expected in automated systems. For data ingress and egress, monitoring breaks down into two elements: ensuring that the processes involved in ingress and egress are alive, and validating that source and destination data are being produced as expected. Monitoring should also include verifying that the data volumes being moved are at expected levels; unexpected drops or spikes in your data will alert you to potential system issues or bugs in your software.

Speculative Execution - Map Reduce has a feature called speculative execution that
launches duplicate tasks near the end of a job for tasks that are still executing. This helps
prevent slow hardware from impacting job execution times. But if you’re using a map task to
perform inserts into a relational database, for example, you should be aware that you could
have two parallel processes inserting the same data.

Map-side and reduce-side speculative execution can be disabled via the mapreduce.map.speculative and mapreduce.reduce.speculative configuration properties in Hadoop.
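
As a minimal sketch (assuming the Hadoop 2.x property names above; the class and job names are illustrative), speculative execution can be switched off on the job configuration before the job is submitted:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class NonSpeculativeJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Prevent duplicate task attempts so that a map task inserting into an
            // external database is never run twice in parallel.
            conf.setBoolean("mapreduce.map.speculative", false);
            conf.setBoolean("mapreduce.reduce.speculative", false);
            Job job = Job.getInstance(conf, "db-export-without-speculation");
            // ... configure the mapper, reducer, input and output paths here ...
        }
    }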

On to the techniques. Let’s start with how you can leverage Hadoop’s built-in ingress
mechanisms.

Moving Data into Hadoop

The first step in working with data in Hadoop is to make it available to Hadoop. There are
two primary methods that can be used to move data into Hadoop: writing external data at the
HDFS level (a data push) or reading external data at the Map Reduce level (more like a pull).
Reading data in Map Reduce has advantages in the ease with which the operation can be
parallelized and made fault tolerant. Not all data is accessible from Map Reduce, however,
such as in the case of log files, which is where other systems need to be relied on for
transportation, including HDFS for the final data hop.

In this section we’ll look at methods for moving source data into Hadoop. I’ll use the design
considerations in the previous section as the criteria for examining and understanding the
different tools.

We’ll get things started with a look at some low-level methods you can use to move data into
Hadoop.

Roll your own ingest: Hadoop comes bundled with a number of methods to get your data into
HDFS. This section will examine various ways that these built-in tools can be used for your
data movement needs. The first and potentially easiest tool you can use is the HDFS
command line.

Picking the right ingest tool for the job: The low-level tools in this section work well for one-
off file movement activities, or when working with legacy data sources and destinations that
are file-based. But moving data in this way is quickly being made obsolete by the availability of tools such as Flume and Kafka (covered later in this chapter), which offer automated data movement pipelines.

Kafka is a much better platform for getting data from A to B (and B can be a Hadoop cluster)
than the old-school “let’s copy files around!” With Kafka, you only need to pump your data
into Kafka, and you have the ability to consume the data in real time (such as via Storm) or in
offline/batch jobs (such as via Camus).

File-based ingestion flows are, to me at least, a relic of the past (everybody knows how scp works), and they primarily exist for legacy reasons: the upstream data sources may have existing tools to create file snapshots (such as dump tools for the database), and there's no infrastructure to migrate or move the data into a real-time messaging system such as Kafka.

Technique 33 Using the CLI to Load Files

If you have a manual activity that you need to perform, such as moving the examples bundled
with this book into HDFS, then the HDFS command-line interface (CLI) is the tool for you.
It’ll allow you to perform most of the operations that you’re used to performing on a regular
Linux file system. In this section we’ll focus on copying data from a local file system into
HDFS.

Problem

You want to copy files into HDFS using the shell.

Solution

The HDFS command-line interface can be used for one-off moves, or it can be incorporated
into scripts for a series of moves.
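
For example (a sketch only; the paths shown are placeholders), a local file can be copied into HDFS and then verified from the shell:

    # create a target directory in HDFS and copy a local file into it
    hdfs dfs -mkdir -p /user/hadoop/examples
    hdfs dfs -put /tmp/sample.log /user/hadoop/examples/
    # confirm the copy and inspect its contents
    hdfs dfs -ls /user/hadoop/examples
    hdfs dfs -cat /user/hadoop/examples/sample.log

The older hadoop fs form of the same commands also works, and -copyFromLocal behaves like -put when the source is a local file.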

2.5 MAP REDUCE

The Map Reduce model is supported by Hadoop and is also Java-based. It was introduced by
Google as a method of solving a class of petabyte/terabyte magnitude problems with large
clusters of inexpensive machines. New, alternative algorithms, frameworks, and database management systems have been developed to handle rapidly growing data volumes and their processing.

Fig 2.5 Map Reduce framework

The Map Reduce framework is used for the distributed and parallel processing of large
amounts of structured and unstructured data, which Hadoop typically stores in HDFS,
clustered across large computers. Map Reduce is a programming model to express a
distributed computation on a massive scale. Map Reduce is a way of breaking down each
request into smaller requests that are sent to many small servers to make the most scalable
use of the CPU possible.

In today’s data-driven market, algorithms and applications are collecting data 24/7 about people, processes, systems, and organizations, resulting in huge volumes of data. The challenge, though, is how to process this massive amount of data with speed and efficiency, and without sacrificing meaningful insights.

This is where the Map Reduce programming model comes to the rescue. Initially used by Google for analysing its search results, Map Reduce gained massive popularity due to its ability to split and process terabytes of data in parallel, achieving quicker results.

For example, a Hadoop cluster with 20,000 inexpensive commodity servers, each holding a 256 MB block of data, can process around 5 TB of data at the same time. This reduces the processing time compared to the sequential processing of such a large data set.

There are two phases in the Map Reduce program, Map and Reduce.

The Map

The Map task splits and maps the data: it takes a dataset and converts it into another set of data, where the individual elements are broken down into tuples, i.e., key/value pairs. The Map task therefore consists of two steps, Splits and Mapping.

 Splits - An input in the Map Reduce model is divided into small fixed-size parts called
input splits. This part of the input is consumed by a single map. The input data is
generally a file or directory stored in the HDFS.

 Mapping - This is the first phase in the map-reduce program execution where the data in
each split is passed line by line, to a mapper function to process it and produce the output
values.

The Reduce

The Reduce task shuffles and reduces the data, which means it combines the data tuples based on the key and modifies the value of the key accordingly. The Reduce task consists of two steps, Shuffling and Reducing.

 Shuffling - This phase consolidates the relevant records from the Mapping output. It consists of merging and sorting: all the key-value pairs that share the same key are combined, and the merged pairs are then sorted by key before being handed to the reducer.

 Reduce - All the values from the shuffling phase are combined for each key and a single output value is returned, thus summarizing the entire dataset (see the sketch after this list).
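
The following is a minimal sketch of the two phases in Java, using the classic word-count example; the class names are illustrative and the driver/job configuration is omitted.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: each input line is split into words, emitting a (word, 1) pair per word.
    class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // key/value tuple handed to the shuffle
                }
            }
        }
    }

    // Reduce phase: all counts for the same word arrive together and are summed.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum)); // one output value per key
        }
    }

During the shuffle, every (word, 1) pair emitted by the mappers is grouped by word, so each reduce() call sees all of the counts for a single word.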

The execution of these tasks is controlled by two entities, the Job Tracker and multiple Task Trackers.

With every job that gets submitted for execution, there is a Job Tracker that resides on the
Name Node and multiple task trackers that reside on the Data Node. A job gets divided into
multiple tasks that run on multiple data nodes in the cluster. The Job Tracker coordinates
the activity by scheduling tasks to run on various data nodes.

The task tracker looks after the execution of individual tasks. It also sends the progress report
to the Job Tracker. Periodically, it sends a signal to the Job Tracker to notify the current state
of the system. When there is a task failure, the Job Tracker reschedules it on a different task
tracker.

Map Reduce was once the only method through which the data stored in the HDFS could be
retrieved, but that is no longer the case. Today, there are other query-based systems such as
Hive and Pig that are used to retrieve data from the HDFS using SQL-like statements.
However, these usually run along with jobs that are written using the Map Reduce model.
That’s because Map Reduce has unique advantages.

There are a number of advantages for applications which use this model. These are:

 Big data can be handled easily.

 Datasets can be processed in parallel.

 All types of data, whether structured, unstructured, or semi-structured, can be processed easily, and high scalability is provided.

 Applications such as counting the occurrences of words are easy to express and can work over massive data collections.

 Large samples of respondents can be accessed quickly.

 Generic tools for searching and analysing data can be built on top of the model.

 Load balancing is provided in large clusters.

 Contexts such as user locations and situations can be extracted easily.

 Good generalization performance and convergence are provided to these applications.

Usage of Map Reduce

 It can be used in various applications such as document clustering, distributed sorting, and web link-graph reversal.

 It can be used for distributed pattern-based searching.

 We can also use Map Reduce in machine learning.

 It was used by Google to regenerate Google's index of the World Wide Web.

 It can be used in multiple computing environments such as multi-cluster, multi-core, and mobile environments.

2.5.1 Understanding Inputs and Outputs of Map Reduce

The Map Reduce model operates on <key, value> pairs. It views the input to the jobs as a set
of <key, value> pairs and produces a different set of <key, value> pairs as the output of the
jobs. Data input is supported by two classes in this framework, namely Input Format and
Record Reader.

The first is consulted to determine how the input data should be partitioned for the map tasks,
while the latter reads the data from the inputs. For the data output also there are two classes,
Output Format and Record Writer. The first class performs a basic validation of the data sink
properties, and the second class is used to write each reducer output to the data sink.

With Map Reduce, rather than sending data to where the application or logic resides, the
logic is executed on the server where the data already resides, to expedite processing. Data
access and storage is disk-based—the input is usually stored as files containing structured,
semi-structured, or unstructured data, and the output is also stored in files.

Fig 2.6 Inputs and outputs of Map Reduce

Input Formats

Hadoop can process many different types of data formats, from flat text files to databases. In
this section, we explore the different formats available.

Controlling the maximum line length

If you are using one of the text input formats discussed here, you can set a maximum
expected line length to safeguard against corrupted files. Corruption in a file can manifest
itself as a very long line, which can cause out of memory errors and then task failure. By
setting mapreduce.input.linerecordreader.line.maxlength to a value in bytes that fits in
memory (and is comfortably greater than the length of lines in your input data), you ensure
that the record reader will skip the (long) corrupt lines without the task failing.
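
As a small sketch, the property named above can be set on the job configuration before the job is submitted; the 10 MB limit used here is only an example value:

    import org.apache.hadoop.conf.Configuration;

    public class LineLengthConfig {
        // Cap record length at 10 MB so a corrupt, very long line is skipped by the
        // record reader instead of being loaded into memory and failing the task.
        public static void configure(Configuration conf) {
            conf.setInt("mapreduce.input.linerecordreader.line.maxlength", 10 * 1024 * 1024);
        }
    }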

TextInputFormat

TextInputFormat is the default Input Format. Each record is a line of input. The key, a LongWritable, is the byte offset within the file of the beginning of the line. The value is the contents of the line, excluding any line terminators.

KeyValueTextInputFormat

TextInputFormat’s keys, being simply the offsets within the file, are not normally very
useful. It is common for each line in a file to be a key-value pair, separated by a delimiter
such as a tab character.

For example, this is the kind of output produced by Text Output Format, Hadoop’s default
Output Format. To interpret such files correctly, KeyValueTextInputFormat is appropriate.

NLineInputFormat

With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input. The number depends on the size of the split and the length of the lines. If you want your mappers to receive a fixed number of lines of input, then NLineInputFormat is the Input Format to use. As with TextInputFormat, the keys are the byte offsets within the file and the values are the lines themselves. N refers to the number of lines of input that each mapper receives. With N set to 1 (the default), each mapper receives exactly one line of input. The mapreduce.input.lineinputformat.linespermap property controls the value of N.
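
A short sketch of using this input format (the figure of 1,000 lines per mapper is only an example, and the job is assumed to have been created elsewhere):

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

    public class NLineConfig {
        // Give every mapper exactly 1,000 lines of input; equivalent to setting
        // mapreduce.input.lineinputformat.linespermap to 1000.
        public static void configure(Job job) {
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.setNumLinesPerSplit(job, 1000);
        }
    }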

StreamInputFormat

Hadoop comes with an Input Format for Streaming which can also be used outside Streaming for processing XML documents. You can use it by setting your input format to StreamInputFormat and setting the stream.recordreader.class property to org.apache.hadoop.streaming.mapreduce.StreamXmlRecordReader. The reader is configured by setting job configuration properties to tell it about the patterns for the start and end tags.

SequenceFileInputFormat

Hadoop Map Reduce is not restricted to processing textual data. It has support for binary
formats, too. Hadoop’s sequence file format stores sequences of binary key-value pairs.
Sequence files are well suited as a format for Map Reduce data because they are splittable
(they have sync points so that readers can synchronize with record boundaries from an
arbitrary point in the file, such as the start of a split), they support compression as a part of
the format, and they can store arbitrary types using a variety of serialization frameworks.

SequenceFileAsTextInputFormat

SequenceFileAsTextInputFormat is a variant of SequenceFileInputFormat that converts the sequence file’s keys and values to Text objects. The conversion is performed by calling toString() on the keys and values. This format makes sequence files suitable input for Streaming.

SequenceFileAsBinaryInputFormat

SequenceFileAsBinaryInputFormat is a variant of SequenceFileInputFormat that retrieves the sequence file’s keys and values as opaque binary objects. They are encapsulated as BytesWritable objects, and the application is free to interpret the underlying byte array as it pleases.

FixedLengthInputFormat

FixedLengthInputFormat is for reading fixed-width binary records from a file when the records are not separated by delimiters. The record size must be set via fixedlengthinputformat.record.length.

Multiple Inputs

Although the input to a Map Reduce job may consist of multiple input files (constructed by a
combination of file globs, filters, and plain paths), all of the input is interpreted by a single
Input Format and a single Mapper. What often happens, however, is that the data format
evolves over time, so you have to write your mapper to cope with all of your legacy formats.
Or you may have data sources that provide the same type of data but in different formats.
This arises in the case of performing joins of different datasets. For instance, one might be
tab-separated plain text, and the other a binary sequence file. Even if they are in the same
format, they may have different representations, and therefore need to be parsed differently.
These cases are handled elegantly by using the Multiple Inputs class, which allows you to
specify which Input Format and Mapper to use on a per-path basis.
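
A brief sketch of joining two differently formatted inputs in one job (the paths are illustrative, and the two mapper classes are assumed to be supplied by the caller):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class JoinInputsConfig {
        // Each path gets its own InputFormat and its own Mapper; both mappers must
        // emit the same intermediate key/value types so the reducer can join them.
        public static void configure(Job job,
                                     Class<? extends Mapper> textMapper,
                                     Class<? extends Mapper> sequenceMapper) {
            MultipleInputs.addInputPath(job, new Path("/data/plain-text"),
                    TextInputFormat.class, textMapper);
            MultipleInputs.addInputPath(job, new Path("/data/sequence-files"),
                    SequenceFileInputFormat.class, sequenceMapper);
        }
    }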

Database Input

DBInputFormat is an input format for reading data from a relational database, using JDBC.
Because it doesn’t have any sharding capabilities, you need to be careful not to overwhelm
the database from which you are reading by running too many mappers.

For this reason, it is best used for loading relatively small datasets, perhaps for joining with
larger datasets from HDFS using Multiple Inputs. The corresponding output format is
DBOutputFormat, which is useful for dumping job outputs (of modest size) into a database.

Output Formats

Hadoop has output data formats that correspond to the input formats.

Text Output

The default output format, TextOutputFormat, writes records as lines of text. Its keys and
values may be of any type, since TextOutputFormat turns them to strings by calling toString()
on them. Each key-value pair is separated by a tab character, although that may be changed
using the mapreduce.output.textoutputformat.separator property.
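
For instance, a minimal sketch of switching the delimiter from a tab to a comma (the choice of comma is only an example):

    import org.apache.hadoop.conf.Configuration;

    public class SeparatorConfig {
        // Emit "key,value" lines instead of the default tab-separated output.
        public static void configure(Configuration conf) {
            conf.set("mapreduce.output.textoutputformat.separator", ",");
        }
    }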

SequenceFileOutputFormat

It’s for writing binary Output. As the name indicates, SequenceFileOutputFormat writes
sequence files for its output. This is a good choice of output if it forms the input to a further
Map Reduce job, since it is compact and is readily compressed.

SequenceFileAsBinaryOutputFormat

SequenceFileAsBinaryOutputFormat is the counterpart to SequenceFileAsBinaryInputFormat; it writes keys and values in raw binary format into a sequence file container.

MapFileOutputFormat

MapFileOutputFormat writes map files as output. The keys in a Map File must be added in
order, so you need to ensure that your reducers emit keys in sorted order.

Multiple Outputs

Sometimes there is a need to have more control over the naming of the files or to produce
multiple files per reducer. Map Reduce comes with the Multiple Outputs class to help you do
this.

Lazy Output

FileOutputFormat subclasses will create output (part-r-nnnnn) files, even if they are empty. Some applications prefer that empty files not be created, which is where LazyOutputFormat helps. It is a wrapper output format that ensures that the output file is created only when the first record is emitted for a given partition. To use it, call its setOutputFormatClass() method with the JobConf and the underlying output format.
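
A short sketch using the newer MapReduce API, where the wrapper is configured with a Job object rather than a JobConf:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class LazyOutputConfig {
        // Wrap TextOutputFormat so that empty part-r-nnnnn files are never created.
        public static void configure(Job job) {
            LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
        }
    }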

Database Output

There are also output formats for writing to relational databases (such as DBOutputFormat, mentioned earlier) and to HBase.

2.6 DATA SERIALIZATION

Programmers routinely work with data objects in memory, but sometimes the objects need to
be sent over a network or written to persistent storage (typically a file) to save some parts of
the state of the program. Serialization is a technique that allows you to package your data
objects in a robust and consistent form for storage or transmission, and then later restore them to their in-memory form, either on the original machine or a different one. While simple data
objects may reliably be represented by the same set of bytes running on similar hardware
architectures with compatible software, in general the actual byte representation of objects is
not guaranteed for various reasons. As a result, it is inadvisable to store and later reload the
same byte contents and expect to get the same object state in the general case. Serialization
provides a stable byte representation of the value of software objects that can be sent over a
network that potentially will continue to work correctly even in future implementations using
different hardware and/or software.

Serialization fundamentally works in similar ways across most languages and implementations, although the specifics vary greatly depending on the style and nuances of the particular language. Class implementers can explicitly declare when objects should be serializable, in which case a standard library handles everything, and this works fine but only for objects that contain simple values. Objects with complex internal state often need to provide a custom implementation for serialization, typically by overriding a method of the standard library. The trick is understanding what the standard implementation does, its limitations, and when and how to handle serialization when appropriate.

Perhaps the easiest security mistake to make with serialization is to inappropriately trust the provider of the serialized data. The act of deserialization converts the data to the internal representation used by your programming language, with few if any checks as to whether the encoded data was corrupted or intentionally designed to be malicious; it is easy to assume the standard library will do the right thing when in fact it may not, or may inadvertently expose protected information. When custom code handles serialization, it needs to avoid all of the usual security pitfalls while expressing and reconstructing properly initialized objects in order for serialization to work properly. Objects that contain or reference other objects need special care in determining which of those objects also need to be serialized, and in understanding how those objects in turn behave under serialization. When the source of serialized data is potentially untrustworthy, there is often no way to defensively check for validity.

Serialization is a valuable and safe mechanism when you have full control of the data you
receive for deserialization. There are a couple of general scenarios where serialization makes
sense.

The first scenario is when you want to save a complex object for later use. Once you produce
the serialized version of the object, you can write it safely to a file or database, making sure
that the protections are set correctly to prevent possible tampering. At some later time, your
program could read the object and deserialize it, knowing that it originated from a safe
source.

A second scenario is where you want to send a complex object from one protected server to
another. In this case, you control both the sender (the program that does the serialization) and
the receiver (the program that does the deserialization). Of course, you need to make sure that
you send the serialized data over an encrypted tamper-proof channel, using a secure protocol
such as TLS.

Serialization is an abstract concept potentially applicable to all kinds of software objects, so let’s look at a concrete example in Java. Host A has an object that it wants to communicate to a different Host B (that may be a completely different implementation) over a common network. Using Java’s standard serialization library, a sequence of 25 bytes is generated that contains sufficient information to express the object metadata and value. It is easy to send these bytes to its peer, which then uses the complementary deserializing library to decode the data, determine the correct object type to instantiate, and then initialize it to have an identical value to the original.
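
A minimal Java sketch of this round trip (the Message class is illustrative, and here the bytes stay in memory; in practice they would travel over the network from Host A to Host B):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    // A simple value object that the standard library knows how to serialize.
    class Message implements Serializable {
        private static final long serialVersionUID = 1L;
        final String text;
        Message(String text) { this.text = text; }
    }

    public class SerializationRoundTrip {
        public static void main(String[] args) throws Exception {
            // "Host A": turn the object into a stable byte representation.
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(new Message("hello"));
            }
            byte[] wire = bytes.toByteArray(); // these bytes would be sent over the network

            // "Host B": reconstruct an identical object from the received bytes.
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(wire))) {
                Message copy = (Message) in.readObject();
                System.out.println(copy.text); // prints "hello"
            }
        }
    }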

Serialization works in several popular languages.

Python uses the standard pickle library to handle serialization: dumps to serialize and loads to
deserialize.

Ruby serialization is handled by the Marshal module in a similar way to Python.

C++ Boost serialization uses text archive objects. Serialization writes into an output archive object operating as an output data stream. The << output operator, when invoked for class data types, calls the class’s serialize function. Each serialize function uses the & operator (or << and >>) to recursively serialize nested objects, saving or loading its data members.

Microsoft Foundation Class (MFC) Library in C++ Visual Studio: Serialization is implemented by classes derived from CObject that override the Serialize method. Serialize has a CArchive argument that is used to read and write the object data. The CArchive object has a member function, IsStoring, which indicates whether Serialize is storing (writing data) or loading (reading data).

Serialized data tampering

Now that we understand how serialization works at a low level, let’s look at what an attacker
can do with serialized data.

While the serialized form at first looks like gibberish, if an attacker is able to see the
serialized form, they can learn the value the object held at the time of serialization by
analysing the byte stream. Protect the serialized form of sensitive data as you would protect
any form using access controls or encryption.

Python example

Next is an example of how tampering with serialized data gives an attacker more power than simply modifying the value of the data. The scenario is one where a server receives serialized data that it mistakenly trusts, either sent from a malicious client or possibly modified in transit. Perhaps the context is that the server receives some seemingly innocuous data from the client, and the programmer figured that even if the data itself was incorrect, it was still harmless. As we shall see, the consequences of deserializing harmful data can be quite bad.

The serialization (pickling) in Python works in an interesting, flexible, and somewhat complicated way. The pickled object is actually a pair of fields. The first field is a callable object (basically a method name) and the second is a tuple of the parameters to be passed to that method. This object is sent to the recipient, who deserializes (unpickles) it by calling the method, passing it the parameters in the tuple. The result of this call is the deserialized object.

Using JSON securely

JSON (JavaScript Object Notation) is widely used as a universal data format on the web that
works much like serialization, with the additional advantages of being supported across a
wide range of languages and being human-readable as text. JSON is a subset of JavaScript
syntax that expresses data as name/value pairs and arrays of values. This subset is designed to
be easier to parse and check the validity of the data.

In JavaScript, always use JSON.stringify () and JSON.parse () functions to serialize and
deserialize text as JSON, respectively. While it may be tempting to use eval () on JSON text,
this should never be done since it potentially executes arbitrary code possibly embedded in
the untrusted input in the form of expressions such as function invocations within the JSON.
The JSON specification does not allow function invocations, but nevertheless eval () executes
them.

Serialization Risk Mitigations

Serialization is a useful tool, but as we have seen it must be used with care as the mechanisms
behind the implementation can be fragile and hence easily become a source of security
problems. Should an attacker manage to tamper with serialized data, the chances of additional
problems arising when deserializing spurious data are high.

Serialized data at a glance appears opaque but it can be easily reverse engineered exposing all
the contained information to an eavesdropper, as we showed in the “How serialization works”
example above. In the case of complex objects, there is sensitive internal state that appears in
the serialized form that is otherwise private. Serialization formats often include metadata or
other additional information besides the actual values within an object that may be sensitive.

Unless there is certainty that data integrity can be assured, avoiding serialization is the only sure-fire way of eluding these potential issues. With a solid understanding of the risks, if you do want to use serialization, consider applying as many of the following mitigations as applicable.

 When possible, write a class-specific serialization method that explicitly does not expose
sensitive fields or any internal state to the serialization stream. In some cases, it may not
be possible to omit sensitive data and still have the object work properly.

 Ensure that deserialization (including super classes) and object instantiation does not
have side effects.

 Never deserialize untrusted data. In general, it is difficult, if not impossible, to guarantee that deserialization behaves safely when given arbitrarily tampered data.

 Serialized data should be stored securely, protected by access control, or signed and
encrypted. One useful pattern is for the server to provide a signed and encrypted
serialization blob to a client; later the client can return this intact to the server where it is
only processed after signature checking.

 Sometimes it helps to sanitize deserialized data in a temporary object. For example,
deserialize an object first, instantiating and populating it with values, but before actually
using the object, ensure that all fields are reasonable and consistent, or force an error
response and destroy any object that appears faulty.

Always keep in mind that serialized data is just like any other data: it can leak information if
exposed, and from an untrusted source where it is subject to tampering, it needs to be handled
with care. Unless you understand what each byte in the serialized data means and exactly
how deserialization will treat the bytes, you should never assume that it will all “just work” in
the face of analysis and tampering by a clever attacker.

Serialization is a powerful tool with significant benefits, but it also carries inherent security risks. It can be convenient, but the machinery of serialization and deserialization adds overhead and complexity that can introduce security vulnerabilities if used improperly. Given the reality that attacks do occur, any use of serialized data incurs some additional risk if tampering ever happens. While there is no magic bullet, the use of multiple mitigations as outlined above can greatly reduce the risks. The more mitigation and defensive coding you do, the more secure you are.

2.7 ARCHITECTURE OF HADOOP

Many organizations that venture into enterprise adoption of Hadoop, whether by business users or by an analytics group within the company, do not have any knowledge of how a good Hadoop architecture should be designed or how a Hadoop cluster actually works in production. This lack of knowledge leads to the design of a Hadoop cluster that is more complex than necessary for a particular big data application, making the implementation needlessly expensive. Apache Hadoop was developed with the purpose of having a low-cost, redundant data store that would allow organizations to leverage big data analytics at economical cost and maximize profitability of the business.

A good Hadoop architectural design requires various design considerations in terms of computing power, networking, and storage. This section gives an in-depth explanation of the Hadoop architecture and the factors to be considered when designing and building a Hadoop cluster for production success. But before we dive into the architecture of Hadoop, let us have a look at what Hadoop is and the reasons behind its popularity.

“Data, Data everywhere; not a disk to save it.” This quote feels so relatable to most of us
today as we usually run out of space either on our laptops or on our mobile phones. Hard
disks are now becoming more popular as a solution to this problem. But, what about
companies like Facebook and Google, where their users are constantly generating new data in
the form of pictures, posts, etc., every millisecond? How do these companies handle their
data so smoothly that every time we log in to our accounts on these websites, we can access
all our chats, emails, etc., without any difficulties? Well, the answer to this is that these tech
giants are using frameworks like Apache Hadoop. These frameworks make the task of
accessing information from large datasets easy using simple programming models. These
frameworks achieve this feat through distributed processing of the datasets across clusters of
computers. To understand this better, consider the case where we want to transfer 100 GB of
data in 10 minutes. One way to achieve this is to have one colossal computer that can transmit at a lightning speed of 10 GB per minute. Another method is to store the data across 10 computers and let each computer transfer its share at 1 GB/min. The latter approach of parallel processing is more straightforward than the former, and this is the kind of task that Apache Hadoop achieves.

The Hadoop architecture is a package of the file system, Map Reduce engine and the HDFS
(Hadoop Distributed File System). The Map Reduce engine can be Map Reduce/MR1 or
YARN/MR2.

A Hadoop cluster consists of a single master and multiple slave nodes. The master node
includes Job Tracker, Task Tracker, Name Node, and Data Node whereas the slave node
includes Data Node and Task Tracker.

Fig 2.7 Hadoop architecture

Hadoop Architecture Overview

Hadoop is a framework permitting the storage of large volumes of data on node systems. The
Hadoop architecture allows parallel processing of data using several components:

 Hadoop Common– It contains libraries and utilities that other Hadoop modules require.

 Hadoop Distributed File System (HDFS)– A distributed file system that stores data on
commodity machines, providing very high aggregate bandwidth across the cluster;
patterned after the UNIX file system and provides POSIX-like semantics.

 Hadoop YARN – A platform that is in charge of managing computing resources in clusters and using them to schedule users' applications.

 Hadoop Map Reduce – This is an application of the Map Reduce programming model
for large-scale data processing. Both Hadoop's Map Reduce and HDFS were inspired by
Google's papers on Map Reduce and Google File System.

 Hadoop Ozone – This is a scalable, redundant, and distributed object store for Hadoop.
This is a new addition to the Hadoop family and unlike HDFS, it can handle both small
and large files alike.

Hadoop Architecture Explained

Working with Hadoop requires thorough knowledge of every layer in the Hadoop stack, from understanding the various components in the Hadoop architecture and designing a Hadoop cluster to tuning its performance and setting up the top chain responsible for data processing.

However, the more devices are used, the higher the chance of hardware failure. The Hadoop Distributed File System (HDFS) addresses this problem by storing multiple copies of the data, so that another copy can be accessed in case of failure.
Another problem attached to the use of various machines is that there has to be one concise
way of combining the data using a standard programming model. Hadoop’s Map Reduce
solves this problem. It allows reading and writing data from the machines (or disks) by
performing computations over sets of keys and values. In fact, both HDFS and Map Reduce
form the core of Hadoop architecture’s functionalities.

Hadoop follows a master slave architecture design for data storage and distributed data
processing using HDFS and Map Reduce respectively. The master node for data storage is
Hadoop HDFS is the Name Node and the master node for parallel processing of data using
Hadoop Map Reduce is the Job Tracker. The slave nodes in the Hadoop architecture are the
other machines in the Hadoop cluster which store data and perform complex computations.
Every slave node has a Task Tracker daemon and a Data Node that synchronizes the
processes with the Job Tracker and Name Node respectively. In a Hadoop architectural implementation, the master and slave systems can be set up in the cloud or on premises.

Hadoop Architecture Implementation

Another name for Hadoop common is Hadoop Stack. Hadoop Common forms the base of the
Hadoop framework. The critical responsibilities of Hadoop Common are:

 Supplying source code and documents, and a contribution section.

 Performing basic tasks- abstraction of the file system, generalization of the operating
system, etc.

 Supporting the Hadoop Framework by keeping Java Archive files (JARs) and scripts
needed to initiate Hadoop.

Application Architecture Implementation

Hadoop Distributed File System (HDFS) stores the application data and file system metadata
separately on dedicated servers. Name Node and Data Node are the two critical components
of the Hadoop HDFS architecture. Application data is stored on servers referred to as Data
Nodes and file system metadata is stored on servers referred to as Name Node. HDFS
replicates the file content on multiple Data Nodes based on the replication factor to ensure
reliability of data. The Name Node and Data Node communicate with each other using TCP
based protocols.

Fig 2.8 HDFS architecture

Name Node

All the files and directories in the HDFS namespace are represented on the Name Node by
Inodes that contain various attributes like permissions, modification timestamp, disk space
quota, namespace quota and access times. Name Node maps the entire file system structure
into memory. Two files fsimage and edits are used for persistence during restarts.

 Fsimage file contains the Inodes and the list of blocks which define the metadata. It has a
complete snapshot of the file systems metadata at any given point of time.

 The edits file contains any modifications that have been performed on the content of the
fsimage file. Incremental changes like renaming or appending data to the file are stored
in the edit log to ensure durability instead of creating a new fsimage snapshot every time
the namespace is being altered.

Data Node

Data Node manages the state of an HDFS node and interacts with the blocks. A Data Node can perform CPU-intensive jobs like semantic and language analysis, statistics, and machine learning tasks, and I/O-intensive jobs like clustering, data import, data export, search, decompression, and indexing. A Data Node needs a lot of I/O for data processing and transfer.

On start-up every Data Node connects to the Name Node and performs a handshake to verify
the namespace ID and the software version of the Data Node. If either of them does not
match, then the Data Node shuts down automatically. A Data Node verifies the block replicas
in its ownership by sending a block report to the Name Node. As soon as the Data Node
registers, the first block report is sent. Data Node sends heartbeat to the Name Node every 3
seconds to confirm that the Data Node is operating and the block replicas it hosts are
available.

2.8 SUMMARY

 Business intelligence (BI) processes derive valuable information from internal and
external sources of companies.

 Structured data are data that can be specifically searched for or sorted according to
individual or composite attributes.

 Data are not stored in a predefined, structured table. They usually consist of numbers,
texts, and fact blocks and do not have a special format.

 The Hadoop framework consists of two main layers. First the Hadoop Distributed File
System (HDFS), and second the Map Reduce.

 HDFS is a Java-based distributed file system that allows persistent and reliable storage and fast access to large amounts of data. It divides files into blocks and saves them redundantly on the cluster, largely transparently to the user.

 Map Reduce is a way of breaking down each request into smaller requests that are
sent to many small servers to make the most scalable use of the CPU possible.

 A map task can run on any computed nodes on the cluster, and multiple map tasks can
run in parallel on the cluster.

 Hadoop Ecosystem is a platform or a suite which provides various services to solve the big data problems. It includes Apache projects and various commercial tools and solutions.

 Serialization is the process of turning structured objects into a byte stream for
transmission over a network or for writing to persistent storage.

 In Hadoop, inter-process communication between nodes in the system is implemented using remote procedure calls (RPCs).

 The shift to the cloud also enables users to store data in lower-cost cloud object
storage services instead of Hadoop's namesake file system; as a result, Hadoop's role
is being reduced in some big data architectures.

 Hadoop Distributed File System or HDFS is based on the Google File System (GFS)
which provides a distributed file system that is specially designed to run on
commodity hardware.

 Hadoop allows enterprises to store as much data, in whatever form, simply by adding
more servers to a Hadoop cluster. Each new server adds more storage and processing
power to the cluster.

 Big data is a term that refers to data sets or collections of data sets whose size (volume), complexity (variability), and rate of growth (velocity) make them difficult to capture, manage, and process with traditional tools.

 Apache Hadoop is an open-source framework that is suited for processing large data
sets on commodity hardware.

2.9 KEYWORDS

 YARN – YARN, or Yet Another Resource Negotiator, is the resource management and job scheduling technology in the open-source Hadoop distributed processing framework. One of Apache Hadoop's core components, YARN is responsible for allocating system resources to the various applications running in a Hadoop cluster and scheduling tasks to be executed on different cluster nodes.

 Apache – The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using simple
programming models. It is designed to scale up from single servers to thousands of
machines, each offering local computation and storage. Rather than rely on hardware
to deliver high availability, the library itself is designed to detect and handle failures
at the application layer, so delivering a highly available service on top of a cluster of
computers, each of which may be prone to failures.

 RPC - A Remote Procedure Call (RPC) is a protocol that allows one program to request a service from a program running on another computer in a network, without needing to understand the details of the network. In Hadoop, inter-process communication between nodes, for example between clients, Name Nodes, and Data Nodes, is implemented using RPCs.

 Hive – Apache Hive is a data warehouse software project built on top of Apache
Hadoop for providing data query and analysis. Hive gives an SQL-like interface to
query data stored in various databases and file systems that integrate with Hadoop.
Traditional SQL queries must be implemented in the Map Reduce Java API to
execute SQL applications and queries over distributed data. Hive provides the
necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying
Java without the need to implement queries in the low-level Java API.

 GFS - Google File System (GFS) means, a proprietary distributed file system
developed by Google to provide efficient, reliable access to data using large clusters
of commodity hardware. GFS is enhanced for Google's core data storage and usage
needs (primarily the search engine), which can generate enormous amounts of data
that must be retained. It is also designed and optimized to run on Google's computing
clusters, dense nodes which consist of cheap "commodity" computers, which means
precautions must be taken against the high failure rate of individual nodes and the
subsequent data loss. Other design decisions select for high data throughputs, even
when it comes at the cost of latency.

2.10 LEARNING ACTIVITY

1. Carry out steps to upload new data into HDFS.

___________________________________________________________________________
___________________________________________________________________________

2. Collect certain case studies on how to view and manipulate files copied into HDFS.

___________________________________________________________________________
___________________________________________________________________________

2.11 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions

1. What is Hadoop? Explain briefly.

2. Describe briefly the two main layers of Hadoop Framework.

3. Explain Map Reduce?

4. What are the common frameworks of Hadoop?

5. Mention briefly the challenge faced to implement Hadoop cluster?

Long Questions

1. Explain how HDFS is a first building block of a Hadoop cluster.

2. What are the advantages of Map Reduce associated with consolidation stage?

3. Explain the different Use Cases of Hadoop.

4. Mention the Pros and Cons of Hadoop.

5. Mention the challenges faced with Big Data.

B. Multiple Choice Questions

1. Which of the following serves as the master and has one Name Node per cluster?

a. Data Node

b. Name Node

c. Data block

d. Replication

2. Which of the following is correct statement?

a. Data Node is the slave/worker node and holds the user data in the form of
Data Blocks

b. Each incoming file is broken into 32 MB by default

c. Data blocks are replicated across different nodes in the cluster to ensure a low
degree of fault tolerance

d. None of these

3. Which of the following Name Node is used when the Primary Name Node goes
down?

a. Rack

b. Data

c. Secondary

d. None of these

4. What is the command line interface that HDFS provides to interact with HDFS?

a. “HDFS Shell”

b. “FS Shell”

c. “DFS Shell”

d. None of these

5. Which of the following is correct statement?

a. Map Reduce tries to place the data and the compute as close as possible

b. Map Task in Map Reduce is performed using the Mapper() function

c. Reduce Task in Map Reduce is performed using the Map() function

d. All of these

Answers

1-b, 2-a, 3-c, 4-b, 5-a

2.12 REFERENCES

References

 Shabbir, M.Q., Gardezi, S.B.W. Application of Big Data analytics and organizational
performance: the mediating role of knowledge management practices. J Big Data 7,
47 (2020).

 Nazari, E., Shahriari, M. H., & Tabesh, H. (2019). BigData Analysis in Healthcare:
Apache Hadoop, Apache spark and Apache Flink. Frontiers in Health Informatics,
8(1), 14.

 Kim, H. G. (2017). SQL-to-Map Reduce Translation for Efficient OLAP Query Processing with Map Reduce. International Journal of Database Theory and Application, 10(6), 61–70.

Textbooks

 Liu, W. (2021). Exploring Hadoop Ecosystem (Volume 1): Batch Processing. Lulu.com.

 Balusamy, B., R, A. N., Kadry, S., & Gandomi, A. H. (2021). Big Data: Concepts,
Technology, and Architecture (1st ed.). Wiley.

 De Mauro, A., Greco, M., & Grimaldi, M. (2016). A formal definition of Big Data based on its essential features. Library Review, 65(3), 122–135.

Websites

 https://www.tutorialspoint.com/hadoop/hadoop_introduction.htm

 https://searchdatamanagement.techtarget.com/definition/Hadoop

 https://www.mdpi.com/2504-2289/5/1/12/htm
