MODULE 2
INTRODUCTION TO HADOOP
One programming model is centralized computing, in which data is transferred
from multiple distributed data sources to a central server.
Analyzing, reporting, visualizing and business-intelligence tasks then compute centrally
on the data that is input to the central server.
Another programming model is distributed computing, which uses datasets at multiple
computing nodes, with data shared between the nodes during computation.
Distributed computing requires sharing the data in a transparent manner, so that each user
within the system can access data in all the databases as though it were a single database.
Access is location independent, so the result is independent of where the data resides.
Transparent data sharing between local or remote nodes may not fulfil Big Data needs when
distributed computing takes place, for the following reasons.
Distributed storage systems do not use the concept of joins.
Data needs to be fault tolerant, and the data store should take network failure into
account.
When data is partitioned into blocks, it is written at one set of nodes and then replicated
to multiple nodes; hence the system takes care of network failures.
Big Data follows the CAP theorem (Consistency, Availability and Partition tolerance): only
two of the three properties can be fully guaranteed at the same time in applications, services and processes.
1. Big Data store model.
Data is stored in files consisting of data blocks.
The data blocks are distributed across multiple nodes.
The data nodes are arranged in clusters and racks, which are scalable. A rack has multiple
data nodes, and each cluster is arranged in a number of racks. Hadoop uses this data store
model.
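As a rough illustration (assuming the Hadoop 2 default block size of 128 MB and the default
replication factor of three, both of which are configurable): a 500 MB file is split into four
blocks (three of 128 MB and one of 116 MB), and with three replicas per block the cluster stores
twelve block copies, spread across data nodes on different racks.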
2. Big Data programming model.
Application jobs and tasks are scheduled on the same servers that store the data to be
processed.
A job means running an assignment, i.e. a set of instructions for processing.
Ex. Processing the queries in an application and sending the results back to the
application.
Job scheduling means assigning a job for processing according to a schedule, either in a
specific sequence or at a specific period.
The Hadoop system uses this programming model, where jobs or tasks are assigned and
scheduled on the same servers that hold the data. It gives a cost-effective method to
build search indexes and to process Facebook, Twitter and LinkedIn messages.
Hadoop clusters the data across nodes, hence it provides cost-effective and
improved node accessibility compared to a single computing node.
Each node of the computational cluster is set to perform the same tasks and sub-tasks,
such as MapReduce, which the software controls and schedules.
Hadoop consists of two main parts: data stored in blocks across the clusters, and
computations at each individual cluster running in parallel with one another.
Hadoop components are written in Java, with a part in C code. The command-line utilities
are written in shell scripts.
The Hadoop platform provides a low-cost Big Data platform which is open source and can use
cloud services. Its core components are:
1. Hadoop Common libraries and utilities – the common libraries and utilities
required by the other Hadoop modules.
Ex: Hadoop Common provides the various components and interfaces for the
distributed file system. It includes serialization, Java RPC and file-based data
structures.
2. Hadoop Distributed File System (HDFS) – a Java-based distributed file system which
can store all kinds of data on the disks across the clusters.
3. MapReduce v1 – the software programming model used in Hadoop 1, based on Mapper and
Reducer functions. It processes large sets of data in parallel and in batches (a minimal
sketch of the model appears after this list).
4. YARN – software for managing the resources for computing. The user runs
application tasks in parallel on Hadoop, and YARN handles the resource requests for the
distributed running of the tasks.
5. MapReduce v2 – the Hadoop 2 YARN-based system for parallel processing of large
datasets and distributed processing of the application tasks.
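As an illustration of the Mapper and Reducer model, the following is a minimal word-count
sketch in Java against the standard org.apache.hadoop.mapreduce API; the class and variable
names are only illustrative, not part of Hadoop itself.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: for every input line, emit a (word, 1) pair per word.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducer: sum the counts for each word; reduce tasks run only after the map tasks finish.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}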
FEATURES OF HADOOP
1. Fault-efficient, scalable, flexible and modular design which uses a simple programming
model. It provides high server scalability: new nodes are added to handle larger data.
2. Robust design of HDFS: it provides a data recovery mechanism through the provision of
backups, and hence high reliability.
3. Store and process Big Data: it processes Big Data with 3V (volume, velocity, variety) characteristics.
4. Distributed cluster computing model with data locality: the processing is split across
multiple servers, enabling fast processing and aggregation of results.
5. Hardware fault tolerance: a hardware fault does not affect the data or the application
processing, because the data is replicated automatically. The default is three copies of each data block.
6. Open-source framework: it uses a cluster of multiple inexpensive servers or the cloud.
7. Java and Linux based: it uses Java interfaces. The Hadoop base is Linux, but it has its own set of
shell commands for support.
8. Hadoop provides various components and interfaces for the distributed file system. It
is designed for batch processing.
9. The data access required is faster than the latency at the DataNodes of HDFS.
The Hadoop ecosystem enables the applications which run on Big Data and deploy HDFS. Hadoop deploys
application programming models such as MapReduce and HBase. YARN manages resources
and schedules the sub-tasks of the application. Figure 2.2 shows the Hadoop ecosystem, which
includes the following four layers.
1. The distributed storage layer, where HDFS stores the data blocks.
2. The resource-manager layer (YARN), which schedules jobs and application sub-tasks.
3. The processing-framework layer (MapReduce), which processes the data in parallel.
4. The APIs at the application support layer; the application codes communicate and run using the
YARN or MapReduce programming-framework layer.
5. AVRO enables serialization between the layers.
6. Zookeeper enables the coordination among the layer components.
HADOOP STREAMING
Spark provides in-memory processing of the data, which improves the processing speed. Spark
and Flink technologies enable in-stream processing.
Spark includes security features, and Flink is emerging as a powerful tool. Flink improves
overall performance as it provides a single runtime for streaming as well as batch processing.
HADOOP PIPES
Hadoop Pipes is the C++ interface to MapReduce; Java interfaces are not used in Pipes.
Apache Hadoop provides an adapter layer, which processes the data in pipes.
A pipe means data streaming into the system at the Mapper input and aggregated results flowing out
at the outputs.
The adapter layer enables the running of application tasks as C++ coded MapReduce programs;
applications that require fast numerical computations achieve higher throughput using C++
through the pipes than with Java.
Pipes do not use standard I/O when communicating with the Mapper and Reducer codes. Cloudera's
Distribution including Hadoop (CDH) runs the pipes, and IBM PowerLinux systems enable working
with Hadoop Pipes and the system libraries.
HADOOP PHYSICAL ORGANIZATION
A conventional file system uses directories, which consist of folders; a folder consists of
files. When data is processed, the sources are identified by pointers to the resources.
The data dictionary stores the resource pointers, and master tables at the dictionary are stored at a central
location.
Similarly, files, DataNodes and blocks need identification during processing at HDFS. HDFS uses
DataNodes and a NameNode. The NameNode stores the files' metadata; the metadata gives
information about the user application's data, but the NameNode does not participate in the computations. The
DataNodes store the actual data files in the data blocks.
In a Hadoop cluster a few nodes act as NameNodes. They are termed MasterNodes or simply
Masters. The Masters have a different configuration, supporting high DRAM and processing power,
but less local storage. The majority of the nodes in a Hadoop cluster act as DataNodes and
TaskTrackers; these nodes are referred to as slave nodes or slaves. The slaves have a lot of disk
storage and moderate amounts of processing capability and DRAM. Slaves are responsible for
storing the data and processing the computation tasks submitted by the clients.
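To illustrate the split between metadata at the NameNode and data at the DataNodes, the
following is a minimal sketch using the standard org.apache.hadoop.fs client API; the path
/user/data/sample.txt is only a hypothetical example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsMetadataExample {
    public static void main(String[] args) throws Exception {
        // The client first contacts the NameNode, which serves only metadata.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/data/sample.txt");   // hypothetical path
        FileStatus status = fs.getFileStatus(file);

        // Metadata held by the NameNode: file length, block size, replication factor.
        System.out.println("Length (bytes): " + status.getLen());
        System.out.println("Block size    : " + status.getBlockSize());
        System.out.println("Replication   : " + status.getReplication());

        // The actual block contents would be read from the DataNodes,
        // e.g. via fs.open(file), which streams the data directly from them.
        fs.close();
    }
}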
Figure 2.4 shows the client, Master NameNode, primary and secondary MasterNodes and slave
nodes in the physical architecture.
Clients, as users, run the applications with the help of Hadoop ecosystem projects. For example,
Hive, Mahout and Pig are Hadoop ecosystem projects. They are not required to be present
in the Hadoop cluster itself. A single MasterNode provides HDFS, MapReduce and HBase using
threads in small to medium-sized clusters. When the cluster size is large, multiple servers are
used to balance the load.
The secondary NameNode provides NameNode management services and Zookeeper is used by
HBase for metadata storage.
HADOOP 2
A single NameNode failure in Hadoop 1 is an operational limitation, and scaling was restricted to
a few thousand DataNodes and a limited number of clusters. Hence Hadoop 2 provides
multiple NameNodes, which enables higher resource availability.
HDFS COMMANDS
The HDFS shell is not compliant with POSIX. Thus, the shell cannot be used exactly like a UNIX
or Linux shell. Instead it requires bin/hdfs dfs <args>, where args gives the arguments. The full set of
Hadoop file system commands can be found on the Apache Software Foundation website. copyToLocal is the
command used to copy a file from HDFS to the local file system, and -cat writes a file to standard output. All Hadoop
commands are invoked by the bin/hadoop script, for example % hadoop fsck / -files -blocks. Table 2.1
gives examples of Hadoop command usage.
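As a further illustration (the paths and file names here are only hypothetical examples),
typical file system commands look like the following:
% hdfs dfs -mkdir /user/student
% hdfs dfs -put results.csv /user/student/
% hdfs dfs -ls /user/student
% hdfs dfs -cat /user/student/results.csv
% hdfs dfs -copyToLocal /user/student/results.csv ./results_copy.csv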
HADOOP MAPREDUCE
An aggregation function combines the values of multiple rows together to produce a
single value with a more significant meaning or measurement.
Ex: functions such as count, MAX, MIN and standard deviation.
A querying function finds the desired values. Ex: the best performance of a student in an
examination.
1. The distribution of a job, based on the client application task, to various nodes within a cluster
(a minimal job-submission sketch in Java follows this list).
2. Organizing and reducing the results from each node into a cohesive response to the
application to answer the query.
4. Map and Reduce tasks run in isolation from one another. The Reduce job always runs
after the Map job.
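The following is a minimal sketch of how a client might submit such a job using the standard
org.apache.hadoop.mapreduce API, reusing the WordCountMapper and WordCountReducer classes
sketched earlier; the input and output paths are only hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical HDFS paths; the output directory must not already exist.
        FileInputFormat.addInputPath(job, new Path("/user/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/data/output"));

        // Submits the job to the cluster and waits for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}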
HADOOP YARN
1. YARN is a resource management platform. It manages the computing resources.
2. YARN manages the schedules for running the sub-tasks. Each sub-task uses the
resources for an allotted time interval.
3. YARN separates the resource management and processing components.
4. YARN stands for Yet Another Resource Negotiator; it manages and allocates
resources for the application sub-tasks and submits the resources for them in the Hadoop
system.
Figure 2.5 shows the YARN-based execution model. It consists of the Client, Resource
Manager (RM), Node Managers (NM), Application Masters (AM) and containers.
1. A master node has two components: (a) the Job History Server and (b) the Resource Manager.
2. The client node submits an application request to the RM. One RM exists per
cluster, and it keeps information about all the slave NMs.
3. Multiple NMs exist in a cluster. An NM creates an AM instance (AMI) and starts it up.
Multiple AMIs can be created for the AM.
4. The AMI performs the role of the Application Master, which estimates the resource
requirements for running the application program.
5. The NM is the slave of the infrastructure. It signals whenever it initializes, and all NMs send a
control signal periodically to the RM to indicate their presence. Each NM includes
several containers for usage.
6. Each NM assigns a container for each AMI.
7. The RM allots the resources to the AM, and the NMs run the containers in parallel to provide the output.
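As a small usage note, the standard YARN command-line client shipped with Hadoop can be used
to inspect this model on a running cluster, for example:
% yarn application -list
% yarn node -list
The first command lists the applications known to the RM, and the second lists the NMs that have
registered with it.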