BDA Module 2 Chapter 1
MODULE 02
CHAPTER 1
Introduction to Hadoop
Hadoop is an open-source software framework.
Data stored in the file system consists of data blocks (physical divisions of large data).
Data blocks are replicated across DataNodes so that processing continues even if any one of them fails.
In the Big Data programming model, application jobs and tasks are scheduled on the same servers that store the data to be processed.
Key Terms
Cluster Computing
Refers to computing, storing and analyzing huge amounts of structured or unstructured data in a distributed computing environment.
Each cluster consists of loosely or tightly connected computing nodes that work together.
Data Flow
Data consistency
Data availability
At least one copy of the data should be available even if a partition becomes inactive.
Resources
Resource Management
Horizontal Scalability
Means scaling out: increasing the number of systems and distributing the workload across nodes working in parallel. Example: MPPs (massively parallel processing platforms).
Vertical Scalability
Means scaling up: increasing the number of tasks in the system, such as reporting, Business Processing (BP) and Business Intelligence (BI) tasks.
Ecosystem
Most of the tools or solutions supplement or support the core elements of Hadoop. All these tools work collectively to provide services such as ingestion (absorption), analysis, storage and maintenance of data.
1. Hadoop Common
The Common module contains the libraries and utilities that are required by the other modules of Hadoop.
Hadoop Common provides various components and interfaces for the distributed file system and general input/output, including serialization and file-based data structures.
2. Hadoop Distributed File System (HDFS)
A Java-based distributed file system that can store all kinds of data on the disks of the cluster.
3. MapReduce v1
The software programming model in Hadoop 1, using Mapper and Reducer. v1 processes large data sets in parallel and in batches (a word-count sketch is given after this list).
4. YARN
A resource-management platform. User application tasks or sub-tasks run in parallel on Hadoop; YARN does the scheduling and handles the requests for resources in the distributed running of the tasks.
5. MapReduce v2
The Hadoop 2 programming model; the Mapper/Reducer model is retained, but the jobs run on YARN, which separates resource management from the processing.
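A minimal word-count sketch of the Mapper/Reducer programming model (class names and paths are illustrative; it uses the newer org.apache.hadoop.mapreduce API, but the v1 org.apache.hadoop.mapred API follows the same pattern):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: emits (word, 1) for every word in its input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts received for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}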
Spark
An open-source framework.
A cluster-computing framework.
Provides in-memory analytics (a minimal sketch follows this list).
Enables OLAP and real-time processing.
Adopted by companies such as Amazon, eBay and Yahoo.
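A minimal Spark sketch in Java illustrating in-memory analytics; the application name and HDFS path are placeholders, and local mode is used only for illustration:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkCacheExample {
  public static void main(String[] args) {
    // Local mode for illustration; on a cluster the master is set by spark-submit.
    SparkConf conf = new SparkConf().setAppName("InMemoryExample").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);
    // Load a text file (placeholder path) and keep it cached in memory.
    JavaRDD<String> lines = sc.textFile("hdfs:///data/events.log").cache();
    // Repeated queries reuse the cached, in-memory dataset instead of re-reading the disk.
    long total = lines.count();
    long errors = lines.filter(l -> l.contains("ERROR")).count();
    System.out.println(total + " lines, " + errors + " error lines");
    sc.stop();
  }
}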
Features of Hadoop
Fault-efficient, scalable, flexible and modular design
Hadoop uses a simple and modular programming model.
The system provides servers at high scalability. The system is scalable by adding new nodes to handle larger data.
Hadoop proves very helpful in storing, managing, processing and analyzing Big Data.
Modular functions make the system flexible.
One can add or replace components with ease.
Robust design of HDFS
Execution of Big Data applications continues even when an individual server or cluster fails, because Hadoop provides backup and recovery mechanisms.
Hadoop processes Big Data at high speed, as the application tasks and sub-tasks are submitted to the DataNodes.
One can achieve more computing power by increasing the number of computing nodes.
The processing splits across multiple DataNodes, which gives fast processing and aggregated results.
Hardware fault-tolerant
If a node goes down, the other nodes take over the remaining work.
This is possible because multiple copies of all data blocks replicate automatically.
Open-source access and cloud services enable large data stores.
Hadoop's base is Linux, but it has its own set of shell commands.
Hadoop provides various components and interfaces for the distributed file system and general input/output.
YARN provides a platform for many different modes of data processing, from traditional batch processing to applications such as interactive queries, text analytics and streaming analytics.
The Hadoop ecosystem consists of its own family of applications which tie together with Hadoop.
The system components support the storage, processing, access, analysis, governance, security and operations for Big Data.
The system enables the applications which run on Big Data and deploy HDFS.
The data store system consists of clusters, racks, DataNodes and blocks.
Hadoop deploys application programming models, such as MapReduce and HBase. YARN manages the resources and schedules the sub-tasks of the application.
The figure below shows the Hadoop core components HDFS, MapReduce and YARN along with the ecosystem. The ecosystem includes the application support layer and application layer components.
The components are Avro, ZooKeeper, Pig, Hive, Sqoop, Ambari, Mahout, Spark, Flink and Flume.
The layers in the figure include the distributed storage layer (HDFS), the resource-manager layer (YARN), the processing-framework layer, consisting of Mapper and Reducer for the MapReduce process flow, and the APIs at the application support layer.
Hadoop Streaming
The HDFS with MapReduce and YARN-based system enables parallel processing of large datasets.
Hadoop Streaming interfaces with the Mapper and Reducer through standard input and output streams, so external programs and scripts can perform the map and reduce steps; engines such as Spark and Flink can also be used for stream processing.
Flink improves overall performance, as it provides a single run-time for streaming as well as batch processing.
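An illustrative Hadoop Streaming invocation, assuming the streaming jar shipped with the installation (the jar path and the input/output directories are placeholders); here the standard Unix programs cat and wc act as the Mapper and Reducer:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/student/input \
    -output /user/student/output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc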
Hadoop Pipes
This is another way of interfacing (connecting) between Hadoop and the Mapper and Reducer; Pipes provides a C++ interface to Hadoop MapReduce.
Hadoop Distributed File System (HDFS)
HDFS is designed to run on clusters of computers and servers at cloud-based utility services.
Each cluster has a number of data stores, called racks. Each rack stores a number of DataNodes.
The racks distribute across the cluster. The nodes have storing and processing capabilities.
The data blocks replicate by default on at least three DataNodes, on the same or remote racks.
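The replication factor is set in hdfs-site.xml; a minimal sketch (3 is the usual default value):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>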
Features of HDFS
Provides create, append, delete, rename and attribute-modification functions.
The content of an individual file cannot be modified or replaced; new data can only be appended at the end of the file.
Files are written once but read many times during usage and processing.
The average file size can be more than 500 MB.
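For example, new data is appended to an existing file with the HDFS shell rather than by editing the file in place (both paths are placeholders):

hdfs dfs -appendToFile morelogs.txt /user/student/logs.txt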
Hadoop Physical Organization
The conventional file system uses directories. A directory consists of folders, and a folder consists of files.
When data is processed, the data sources are identified by pointers to the resources.
A data dictionary stores the resource pointers. Master tables at the dictionary are stored at a central location.
The centrally stored tables make administration easier when the data sources change during processing.
The files, DataNodes and blocks need identification during processing at HDFS. HDFS uses the NameNode and DataNodes.
A few nodes in a Hadoop cluster act as NameNodes. These nodes are termed Master Nodes or simply Masters.
The majority of the nodes in a Hadoop cluster act as DataNodes and TaskTrackers. These nodes are referred to as slave nodes or slaves.
The slaves have lots of disk storage and moderate amounts of processing capability.
The slaves are responsible for storing the data and processing the computation tasks submitted by the clients.
A single master node provides HDFS, MapReduce and HBase using threads in small to medium-sized clusters.
When the cluster size is large, multiple servers are used, for example to balance the load.
The secondary NameNode provides NameNode management services, and ZooKeeper is used by HBase for metadata storage.
The master node receives client connections, maintains the description of the global file system namespace and the allocation of file blocks, and monitors the state of the system in order to detect any failures.
The NameNode stores all the file-system-related information, such as which blocks of a file are stored in which part of the cluster, the last access time of the files, and the user permissions on the files.
Masters, slaves and the Hadoop client (node) load the data into the cluster, submit the processing job and then retrieve the data to see the response after the job completes.
Hadoop 2
Hadoop 2 allows multiple NameNodes. Each NameNode has:
An associated (standby) NameNode.
Associated JournalNodes (JN). The JNs keep the records of the state, resources assigned and intermediate results, so the system takes care of failures: when an active NameNode fails, its standby NameNode takes over using the shared JournalNode records.
HDFS Commands
HDFS commands are common to the other modules of Hadoop. The HDFS shell is not compliant with the POSIX shell.
Commands for interacting with the files in HDFS take the form /bin/hdfs dfs <args>, where <args> stands for the command arguments.
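A few illustrative commands (the directory and file names are placeholders):

hdfs dfs -mkdir /user/student            # create a directory in HDFS
hdfs dfs -put data.txt /user/student     # copy a local file into HDFS
hdfs dfs -ls /user/student               # list the directory contents
hdfs dfs -cat /user/student/data.txt     # print the file contents
hdfs dfs -rm /user/student/data.txt      # delete the file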