
Business Intelligence & Big Data Analytics-CSE3124Y


INTRODUCTION TO BIG DATA ECOSYSTEM

LECTURE 3
Learning Outcomes
Describe the Hadoop Ecosystem and its characteristics
List the Hadoop core components
Explain the Hadoop core components
Describe the importance of the Hadoop Core Components
Differentiate between NameNode and DataNode
List the main functionalities of the NameNode and DataNode
Apache Hadoop (1)
•Hadoop is a project of the Apache Software Foundation.
•Open-source software framework for processing and querying vast
amounts of data on large clusters of commodity hardware.
•Hadoop is written in Java.
Consists of 3 sub-projects:
− MapReduce
− Hadoop Distributed File System, a.k.a. HDFS
− Hadoop Common
Apache Hadoop (2)
•Processes huge volumes of structured and unstructured data
•It is an open-source implementation of Google's MapReduce and is
based on a simple programming model called MapReduce.
•It provides reliability through replication
•The Apache Hadoop ecosystem is composed of the Hadoop
Kernel, MapReduce, HDFS and several other components
such as Apache Hive, HBase and ZooKeeper (Bhosale and
Gadekar, 2014).
Characteristics of Hadoop
The characteristics of Hadoop are described as follows:
Scalable – New nodes can be added without disruption and without any
change to the format of the data.
Cost effective – Hadoop distributes the computation in parallel across
commodity servers. This lower cost makes it affordable to
process massive amounts of data.
Flexible – Hadoop can process any type of data from various
sources, and deep analysis can be performed on it.
Fault tolerant – When a node fails, the system redirects its work to
another location so that processing continues without losing any data.
Activity 1
State the types of work where it is not
appropriate to use Hadoop
Hadoop is not for all types of work
•Not suited to processing transactions (random access)
•Not good when work cannot be parallelized
•Not good for low latency data access
•Not good for processing lots of small files
•Not good for intensive calculations with little data
APACHE HADOOP CORE COMPONENTS
•HDFS (Hadoop Distributed File System)
•MapReduce
Two Key Aspects of Hadoop
•MapReduce framework
How Hadoop understands and assigns work to the nodes
(machines)
•Hadoop Distributed File System = HDFS
–Where Hadoop stores data
– A file system that spans all the nodes in a Hadoop cluster
– It links together the file systems on many local nodes to
make them into one big file system
What is the Hadoop Distributed File System?
•HDFS stores data across multiple nodes
•HDFS assumes nodes will fail, so it achieves reliability
by replicating data across multiple nodes
•The file system is built from a cluster of data nodes,
each of which serves up blocks of data over the
network using a block protocol specific to HDFS
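To make this concrete, here is a minimal sketch of a client reading a file from HDFS through the Java FileSystem API. The cluster URI hdfs://namenode:8020 and the path /data/input.txt are hypothetical placeholders; the client asks the NameNode only for metadata and then streams the blocks from the DataNodes.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Connect via the NameNode; the client never needs to know
            // which DataNodes actually hold the blocks.
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://namenode:8020"), conf);
            try (FSDataInputStream in = fs.open(new Path("/data/input.txt"));
                 BufferedReader reader =
                         new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }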
MapReduce Framework
A software framework that enables you to write applications that will
process large amounts of data, in-parallel, on large clusters of commodity
hardware, in a reliable and fault-tolerant manner
•Integrates with HDFS and provides the same benefits for parallel data
processing
•Sends computations to where the data is stored
•The framework:
–Schedules and monitors tasks, and re-executes failed tasks
–Hides the complexity of distributed computing from the developer
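As an illustration of the programming model, below is a sketch of the classic word-count Mapper and Reducer written against the org.apache.hadoop.mapreduce API. The class names are arbitrary, and the job driver (input/output paths, job submission) is omitted for brevity.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: emits (word, 1) for every word in its input split.
    public class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

The framework runs one map task per input split (ideally on the node holding that split's block), shuffles the intermediate (word, 1) pairs by key, and feeds each key's values to a reduce task.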
Benefits of MapReduce
MapReduce provides:
•Automatic parallelization and distribution of large data sets
that are stored and distributed across the slave nodes of a
Hadoop cluster
•Fault-tolerance
•I/O scheduling
•Status and monitoring
Hadoop Distributed File System (HDFS)
Distributed, scalable, fault tolerant, high throughput
• Data access through MapReduce
• Files split into blocks
• 3 replicas for each piece of data by default
• Can create, delete, copy, but NOT update
• Designed for streaming reads, not random access
• Data locality: processing data on or near the physical storage to
decrease transmission of data
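Data locality can be observed from the client side. Below is a sketch, again assuming a hypothetical cluster URI and file path, that asks the NameNode which DataNodes hold each block of a file; the MapReduce scheduler uses this same block-location information to run tasks near the data.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://namenode:8020"), new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/input.txt"));
            // One BlockLocation per block, listing the DataNodes
            // that hold a replica of it.
            for (BlockLocation loc :
                    fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println(loc);
            }
        }
    }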
Terms
Term – Description
Cluster – A group of servers (nodes) on a network that are configured to work together. A server is either a master node or a slave (worker) node.
Hadoop – A batch processing infrastructure that stores files and distributes work across a group of servers (nodes).
Hadoop Cluster – A collection of racks containing master and slave nodes.
Blocks – HDFS breaks a data file down into blocks or "chunks" and stores the data blocks on different slave DataNodes in the Hadoop cluster.
Replication Factor – HDFS makes three copies of each data block and stores them on different DataNodes/racks in the Hadoop cluster.
NameNode (NN) – A service (daemon) that maintains a directory of all files in HDFS and tracks where data is stored in the HDFS cluster.
Secondary NameNode – Performs internal NameNode transaction log checkpointing.
DataNode (DN) – Stores the blocks ("chunks") of data for a set of files.
HDFS – Architecture
Master / Slave architecture
Master: NameNode
•manages the file system namespace and metadata
◦ FsImage
◦ EditLog
•regulates client access to files
Slave: DataNode
•many per cluster
•manages storage attached to the nodes
•periodically reports status to NameNode
Master/Slave Architecture
HDFS – Blocks
HDFS is designed to support very large files
•Each file is split into blocks
– Hadoop default: 64MB
– BigInsights default: 128MB
•Blocks reside on different physical DataNodes
•Behind the scenes, 1 HDFS block is supported by multiple operating system blocks
•If a file or a chunk of the file is smaller than the block size, only the
space needed is used. E.g.: a 210MB file is split as
64MB + 64MB + 64MB + 18MB
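The block size can be chosen per file at creation time. Here is a sketch using the FileSystem.create overload that takes an explicit block size; the URI, path, and buffer size are illustrative values, not ones mandated by Hadoop.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://namenode:8020"), new Configuration());
            long blockSize = 64L * 1024 * 1024;  // 64 MB, the classic default
            try (FSDataOutputStream out = fs.create(
                    new Path("/data/large-file.bin"),
                    true,        // overwrite if it exists
                    4096,        // I/O buffer size
                    (short) 3,   // replication factor
                    blockSize)) {
                // A 210 MB file written here occupies 4 blocks:
                // 64 MB + 64 MB + 64 MB + 18 MB, where the last block
                // uses only the space it actually needs.
            }
        }
    }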
HDFS – Replication
Blocks of data are replicated to multiple nodes
– Behavior is controlled by replication factor, configurable per file
– Default is 3 replicas
Common case:
•one replica on one node in the local rack
•another replica on a different node in the local rack
•and the last on a different node in a different rack
This cuts inter-rack network bandwidth, which improves write
performance
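The per-file replication factor can also be changed after a file has been written. A one-call sketch, assuming a hypothetical path and cluster URI:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://namenode:8020"), new Configuration());
            // Raise this file's replication factor from the default 3 to 5;
            // the NameNode schedules the extra copies in the background.
            fs.setReplication(new Path("/data/important.log"), (short) 5);
        }
    }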
Namenode Startup
1. NameNode reads the fsimage into memory
2. NameNode applies the editlog changes
3. NameNode waits for block reports from the DataNodes
•The NameNode does not persist block locations; the DataNodes
report them
•The NameNode exits safe mode when 99.9% of blocks have
at least one copy accounted for
Adding a File
The file is added to the NameNode's in-memory namespace and
persisted in the editlog
Data is written in blocks to the DataNodes
•The first DataNode starts a chained copy to two other
DataNodes
•If at least one write for each block succeeds, the write is
successful
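A minimal client-side write sketch, assuming the same hypothetical cluster URI: the client sees only a simple output stream, while HDFS performs the chained block copies described above.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(
                    URI.create("hdfs://namenode:8020"), new Configuration());
            // The client streams each block to the first DataNode, which
            // chains the copy to two more; the NameNode records only the
            // file's metadata in its namespace and edit log.
            try (FSDataOutputStream out = fs.create(new Path("/data/events.txt"))) {
                out.writeBytes("first record\n");
                out.writeBytes("second record\n");
            }
        }
    }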
Managing Cluster
•Adding a DataNode
– Start the new DataNode (pointing to the NameNode)
– If required, run the balancer (hadoop balancer) to rebalance blocks
•Removing a DataNode
– Simply remove the DataNode
– Better: add the node to the exclude file and wait until all its blocks
have been moved
•Checking filesystem health
– Run hadoop fsck
Secondary NameNode
During operation, the primary NameNode cannot merge the
fsimage and editlog itself
•This is done on the Secondary NameNode
– Every few minutes, the Secondary NameNode copies the new
edit log from the primary NN
– Merges the editlog into the fsimage
– Copies the newly merged fsimage back to the primary
NameNode
Functions of the NameNode
•Acts as the repository for all HDFS metadata
•Maintains the file system namespace
•Executes the directives for opening, closing, and renaming files and directories
•Stores the HDFS state in an image file (fsimage)
•Stores file system modifications in an edit log file (edits)
•On startup, merges the fsimage and edits files, and then empties edits
•Places replicas of blocks on multiple racks for fault tolerance
•Records the number of replicas (replication factor) of a file specified by an
application
Functions of DataNodes
DataNodes perform the following functions:
•Serving read and write requests from the file system
clients
•Performing block creation, deletion, and replication based
on instructions from the NameNode
•Providing simultaneous send/receive operations to
DataNodes during replication (“replication pipelining”)
Activity 2
Differentiate between NameNode and DataNode.