BDA Module 2 Chapter 1
MODULE 02
CHAPTER 1
Introduction to Hadoop
Hadoop is an open-source software framework.
Data stored in the file system consists of data blocks (physical divisions of large data).
Data blocks are replicated across DataNodes so that processing continues even if any one of them fails.
In the Big Data programming model, application jobs and tasks are scheduled on the same servers that store the data to be processed.
Key Terms
Cluster Computing
Refers to computing, storing and analyzing huge amounts of structured or unstructured data in a distributed computing environment.
Each cluster consists of loosely or tightly connected computing nodes that work together.
Data Flow
Data consistency
Data availability
At least one copy of the data should be available even if a partition becomes inactive.
Resources
Resource Management
Horizontal Scalability
Means scaling out: increasing the number of systems and distributing the workload across nodes working in parallel. Example: MPPs (massively parallel processing platforms).
Vertical Scalability
Means scaling up: increasing the number of tasks in the system, such as reporting, Business Processing (BP) and Business Intelligence (BI) tasks.
Ecosystem
Most of the tools or solutions supplement or support the core elements of Hadoop. All these tools work collectively to provide services such as ingestion (absorption), analysis, storage and maintenance of data.
1. Hadoop Common
The Common module contains the libraries and utilities that are required by the other modules of Hadoop.
Hadoop Common provides various components and interfaces for the distributed file system and general input/output, including serialization and file-based data structures.
2. Hadoop Distributed File System (HDFS)
A Java-based distributed file system that can store all kinds of data on the disks of the cluster.
3. MapReduce v1
The software programming model in Hadoop 1, using Mapper and Reducer. v1 processes large data sets in parallel and in batches (a word-count sketch is given after this list).
4. YARN
A resource-management platform. User application tasks or sub-tasks run in parallel on Hadoop; YARN does the scheduling and handles the requests for resources in the distributed running of the tasks.
5. MapReduce v2
The Hadoop 2 programming model; the Mapper/Reducer model is retained, but the jobs run on YARN, which separates resource management from the processing.
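A minimal word-count sketch of the Mapper/Reducer programming model (class names and paths are illustrative; it uses the newer org.apache.hadoop.mapreduce API, but the v1 org.apache.hadoop.mapred API follows the same pattern):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: emits (word, 1) for every word in its input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts received for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}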
Spark
An open-source framework.
A cluster-computing framework.
Provides in-memory analytics (a minimal sketch follows this list).
Enables OLAP and real-time processing.
Adopted by companies such as Amazon, eBay and Yahoo.
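A minimal Spark sketch in Java illustrating in-memory analytics; the application name and HDFS path are placeholders, and local mode is used only for illustration:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkCacheExample {
  public static void main(String[] args) {
    // Local mode for illustration; on a cluster the master is set by spark-submit.
    SparkConf conf = new SparkConf().setAppName("InMemoryExample").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);
    // Load a text file (placeholder path) and keep it cached in memory.
    JavaRDD<String> lines = sc.textFile("hdfs:///data/events.log").cache();
    // Repeated queries reuse the cached, in-memory dataset instead of re-reading the disk.
    long total = lines.count();
    long errors = lines.filter(l -> l.contains("ERROR")).count();
    System.out.println(total + " lines, " + errors + " error lines");
    sc.stop();
  }
}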
Features of Hadoop
Fault-efficient, scalable, flexible and modular design
Hadoop uses a simple and modular programming model.
The system provides servers at high scalability. The system is scalable by adding new nodes to handle larger data.
Hadoop proves very helpful in storing, managing, processing and analyzing Big Data.
Modular functions make the system flexible.
One can add or replace components with ease.
Robust design of HDFS
Execution of Big Data applications continues even when an individual server or cluster fails, because Hadoop provides backup and recovery mechanisms.
Hadoop processes Big Data at high speed, as the application tasks and sub-tasks are submitted to the DataNodes.
One can achieve more computing power by increasing the number of computing nodes.
The processing splits across multiple DataNodes, which gives fast processing and aggregated results.
Hardware fault-tolerant
If a node goes down, the other nodes take over the remaining work.
This is possible because multiple copies of all data blocks replicate automatically.
Open-source access and cloud services enable large data stores.
Hadoop's base is Linux, but it has its own set of shell commands.
Hadoop provides various components and interfaces for the distributed file system and general input/output.
YARN provides a platform for many different modes of data processing, from traditional batch processing to applications such as interactive queries, text analytics and streaming analytics.
The Hadoop ecosystem consists of its own family of applications which tie together with Hadoop.
The system components support the storage, processing, access, analysis, governance, security and operations for Big Data.
The system enables the applications which run on Big Data and deploy HDFS.
The data store system consists of clusters, racks, DataNodes and blocks.
Hadoop deploys application programming models, such as MapReduce and HBase. YARN manages the resources and schedules the sub-tasks of the application.
The figure below shows the Hadoop core components HDFS, MapReduce and YARN along with the ecosystem. The ecosystem includes the application support layer and application layer components.
The components are Avro, ZooKeeper, Pig, Hive, Sqoop, Ambari, Mahout, Spark, Flink and Flume.
The layers in the figure include the distributed storage layer (HDFS), the resource-manager layer (YARN), the processing-framework layer, consisting of Mapper and Reducer for the MapReduce process flow, and the APIs at the application support layer.
Hadoop Streaming
The HDFS with MapReduce and YARN-based system enables parallel processing of large datasets.
Hadoop Streaming interfaces with the Mapper and Reducer through standard input and output streams, so external programs and scripts can perform the map and reduce steps; engines such as Spark and Flink can also be used for stream processing.
Flink improves overall performance, as it provides a single run-time for streaming as well as batch processing.
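An illustrative Hadoop Streaming invocation, assuming the streaming jar shipped with the installation (the jar path and the input/output directories are placeholders); here the standard Unix programs cat and wc act as the Mapper and Reducer:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /user/student/input \
    -output /user/student/output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc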
Hadoop Pipes
This is another way of interfacing (connecting) between Hadoop and the Mapper and Reducer; Pipes provides a C++ interface to Hadoop MapReduce.
Hadoop Distributed File System (HDFS)
HDFS is designed to run on clusters of computers and servers at cloud-based utility services.
Each cluster has a number of data stores, called racks. Each rack stores a number of DataNodes.
The racks distribute across the cluster. The nodes have storing and processing capabilities.
The data blocks replicate by default on at least three DataNodes, on the same or remote racks.
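The replication factor is set in hdfs-site.xml; a minimal sketch (3 is the usual default value):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>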
Features of HDFS
Provides create, append, delete, rename and attribute-modification functions.
The content of an individual file cannot be modified or replaced; new data can only be appended at the end of the file.
Files are written once but read many times during usage and processing.
The average file size can be more than 500 MB.
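For example, new data is appended to an existing file with the HDFS shell rather than by editing the file in place (both paths are placeholders):

hdfs dfs -appendToFile morelogs.txt /user/student/logs.txt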
Hadoop Physical Organization
The conventional file system uses directories. A directory consists of folders, and a folder consists of files.
When data is processed, the data sources are identified by pointers to the resources.
A data dictionary stores the resource pointers. Master tables at the dictionary are stored at a central location.
The centrally stored tables make administration easier when the data sources change during processing.
The files, DataNodes and blocks need identification during processing at HDFS. HDFS uses the NameNode and DataNodes.
A few nodes in a Hadoop cluster act as NameNodes. These nodes are termed Master Nodes or simply Masters.
The majority of the nodes in a Hadoop cluster act as DataNodes and TaskTrackers. These nodes are referred to as slave nodes or slaves.
The slaves have lots of disk storage and moderate amounts of processing capability.
The slaves are responsible for storing the data and processing the computation tasks submitted by the clients.
A single master node provides HDFS, MapReduce and HBase using threads in small to medium-sized clusters.
When the cluster size is large, multiple servers are used, for example to balance the load.
The secondary NameNode provides NameNode management services, and ZooKeeper is used by HBase for metadata storage.
The master node receives client connections, maintains the description of the global file system namespace and the allocation of file blocks, and monitors the state of the system in order to detect any failures.
The NameNode stores all the file-system-related information, such as which blocks of a file are stored in which part of the cluster, the last access time of the files, and the user permissions on the files.
Masters, slaves and the Hadoop client (node) load the data into the cluster, submit the processing job and then retrieve the data to see the response after the job completes.
Hadoop 2
Hadoop 2 allows multiple NameNodes. Each NameNode has:
An associated (standby) NameNode.
Associated JournalNodes (JN). The JNs keep the records of the state, resources assigned and intermediate results, so the system takes care of failures: when an active NameNode fails, its standby NameNode takes over using the shared JournalNode records.
HDFS Commands
HDFS commands are common to the other modules of Hadoop. The HDFS shell is not compliant with the POSIX shell.
Commands for interacting with the files in HDFS take the form /bin/hdfs dfs <args>, where <args> stands for the command arguments.
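A few illustrative commands (the directory and file names are placeholders):

hdfs dfs -mkdir /user/student            # create a directory in HDFS
hdfs dfs -put data.txt /user/student     # copy a local file into HDFS
hdfs dfs -ls /user/student               # list the directory contents
hdfs dfs -cat /user/student/data.txt     # print the file contents
hdfs dfs -rm /user/student/data.txt      # delete the file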