Hadoop Distributed File System
CIS 612
Sunnie Chung
Introduction
• What is Big Data?
  – Bulk amounts of data
  – Often unstructured
Hadoop Distributed File System (HDFS)
Why should I use Hadoop?
HDFS: Key Features
• Highly fault tolerant: automatic failure recovery
• High aggregate throughput for streaming large files
• Supports replication and data locality
• Designed for very large files (terabytes in size) that are few in number
• Provides streaming access to file system data; well suited to
  write-once, read-many workloads (for example, log files)
Hadoop Distributed File System (HDFS)
What features does Hadoop offer?
When should you choose Hadoop?
When should you avoid Hadoop?
Hadoop Examples
Hadoop Distributed File System (HDFS)
How HDFS works: Split Data
How HDFS works: Replication
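Splitting and replication are driven by configuration. A minimal sketch, assuming the Hadoop 1.x property names dfs.block.size and dfs.replication and a hypothetical output path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockConfigExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setLong("dfs.block.size", 64 * 1024 * 1024);  // split files into 64 MB blocks (the 1.x default)
    conf.setInt("dfs.replication", 3);                 // keep 3 copies of each block (the default)
    FileSystem fs = FileSystem.get(conf);
    // Files created through this FileSystem now use these settings.
    FSDataOutputStream out = fs.create(new Path("/user/demo/data.txt"));
    out.writeUTF("example");
    out.close();
    fs.close();
  }
}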
Hadoop Architecture Overview
[Diagram: a Client and a Job Tracker coordinate with the Name Node, which manages multiple Data Nodes]
Hadoop Components: Job Tracker
[Diagram: the Client submits jobs to the Job Tracker; Task Trackers execute tasks alongside the Data Nodes managed by the Name Node]

Hadoop Components: Name Node
One active Name Node per cluster
Manages the file system namespace and metadata
Single point of failure: a good place to spend money on hardware
Name Node
• Master of HDFS
• Maintains and manages the metadata for the data stored on the Data Nodes
• Runs on a highly reliable machine (can even use RAID)
• Expensive hardware
• Stores NO data; just holds metadata!
• Secondary Name Node:
  – Periodically reads the Name Node's in-memory state and persists it
    to hard disk.
• Active & passive Name Nodes are available from Gen2 (Hadoop 2.x)
Hadoop Components: Task Tracker
[Diagram: Client, Job Tracker, Task Trackers, Name Node, and Data Nodes]
There are typically many Task Trackers
Responsible for executing map and reduce tasks
Read blocks of data from the Data Nodes
Hadoop Components: Data Node
[Diagram: Client, Job Tracker, Task Trackers, Name Node, and Data Nodes]
There are typically many Data Nodes
Data Nodes manage data blocks and serve them to clients
Data is replicated, so the failure of a node is not a problem
Data Nodes
• Slaves in HDFS
• Provide data storage
• Deployed on independent machines
• Responsible for serving read/write requests from the Client
• Data processing is done on the Data Nodes
HDFS Architecture
Hadoop Modes of Operation
(standalone, pseudo-distributed, and fully distributed)
HDFS Operation
• The Client makes a write request to the Name Node.
• The Name Node responds with information about the available Data Nodes
  and where the data is to be written.
• The Client writes the data to the addressed Data Node.
• Replicas of all blocks are created automatically by the data pipeline.
• If a write fails, the Data Node notifies the Client, which gets a new
  location to write.
• If the write completes successfully, an acknowledgement is given to
  the Client.
• Writes in Hadoop are non-posted: the Client waits for the acknowledgement.
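To make the write path concrete, here is a minimal sketch using the org.apache.hadoop.fs.FileSystem API; the destination path is hypothetical, and the configuration is assumed to come from the cluster's core-site.xml/hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();     // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);         // connects to the Name Node
    Path dst = new Path("/user/demo/hello.txt");  // hypothetical destination path
    FSDataOutputStream out = fs.create(dst);      // Name Node allocates blocks; Data Nodes form a pipeline
    out.writeUTF("Hello, HDFS!");
    out.close();                                  // close waits for the pipeline acknowledgement
    fs.close();
  }
}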
HDFS: File Write
HDFS: File Read
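The read path can be sketched the same way: the Name Node returns block locations, and the bytes stream directly from the Data Nodes. The path below is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // The Name Node returns block locations; data streams from the Data Nodes.
    FSDataInputStream in = fs.open(new Path("/user/demo/hello.txt"));
    System.out.println(in.readUTF());
    in.close();
    fs.close();
  }
}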
Hadoop Stack
• Hadoop: a development platform
  – User-written code runs on the system
  – The system appears to the user as a single entity
  – The user does not need to worry about the distributed system
  – Many systems can run on top of Hadoop
    • Allows further abstraction from the system
Hive and HBase
Hive and HBase are layers on top of Hadoop
HBase & Hive are applications
They provide an interface to data on the HDFS
Other programs or applications may use Hive or HBase as an intermediate layer
[Diagram: Hive and HBase stacked on top of Hadoop, with ZooKeeper alongside HBase]
Hadoop: Hive
• Hive
  – Data warehousing application
  – SQL-like commands (HiveQL); see the sketch after this list
  – Not a traditional relational database
  – Scales horizontally with ease
  – Supports massive amounts of data*
* Facebook has more than 15 PB of information stored in Hive and imports 60 TB each day (as of 2010)
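As a minimal illustration of HiveQL from Java, the sketch below issues a query over Hive's JDBC driver from the Hadoop 1.x era; the server location and table name are assumptions for the example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Hive 0.x JDBC driver (matches the Hadoop 1.x era covered here)
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con =
        DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();
    // HiveQL looks like SQL but compiles down to MapReduce jobs
    ResultSet rs = stmt.executeQuery(
        "SELECT word, COUNT(*) FROM word_log GROUP BY word");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    con.close();
  }
}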
Hadoop: HBase
• HBase
  – No SQL-like language
    • Uses a custom Java API for working with data; see the sketch after this list
  – Modeled after Google's BigTable
  – Random read/write operations allowed
  – Multiple concurrent read/write operations allowed
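Below is a minimal sketch of that Java API (HBase 0.9x era), performing one random write and one random read; the table name, column family, and values are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
    HTable table = new HTable(conf, "demo_table");     // hypothetical table with family "cf"
    Put put = new Put(Bytes.toBytes("row1"));          // random write to one row
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
    table.put(put);
    Get get = new Get(Bytes.toBytes("row1"));          // random read of the same row
    Result result = table.get(get);
    System.out.println(
        Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))));
    table.close();
  }
}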
Hadoop MapReduce
Hadoop has its own implementation of MapReduce
Hadoop 1.0.4
API: http://hadoop.apache.org/docs/r1.0.4/api/
Tutorial: http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
Custom serialization data types (Writable/Comparable); see the sketch after this list:
Text vs String
LongWritable vs long
IntWritable vs int
DoubleWritable vs double
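The sketch below shows how a custom data type plugs into this serialization scheme by implementing Writable; PointWritable is a hypothetical type invented for the example. Writable types stream their fields directly, avoiding Java's heavier built-in object serialization.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PointWritable implements Writable {
  private int x;
  private int y;

  public PointWritable() {}  // Hadoop requires a no-arg constructor for deserialization

  public PointWritable(int x, int y) { this.x = x; this.y = y; }

  // Serialize the fields in a fixed order...
  public void write(DataOutput out) throws IOException {
    out.writeInt(x);
    out.writeInt(y);
  }

  // ...and read them back in the same order.
  public void readFields(DataInput in) throws IOException {
    x = in.readInt();
    y = in.readInt();
  }
}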
Structure of a Hadoop Mapper (WordCount)
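The mapper below follows the WordCount example in the r1.0.4 tutorial linked above (old org.apache.hadoop.mapred API): it tokenizes each input line and emits a (word, 1) pair per token.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Called once per input line: key is the byte offset, value is the line.
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    StringTokenizer tokenizer = new StringTokenizer(value.toString());
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);  // emit (word, 1)
    }
  }
}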
Structure of a Hadoop Reducer (WordCount)
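The matching reducer, again following the r1.0.4 tutorial: after the shuffle it receives each word together with all of its counts and emits the total.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  // Called once per key with all of that word's counts after the shuffle.
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();  // add up the 1s emitted by the mappers
    }
    output.collect(key, new IntWritable(sum));  // emit (word, total)
  }
}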
Hadoop MapReduce
Working with Hadoop
http://hadoop.apache.org/docs/r1.0.4/commands_manual.html
A quick overview of Hadoop commands:
bin/start-all.sh
bin/stop-all.sh
bin/hadoop fs -put localSourcePath hdfsDestinationPath
bin/hadoop fs -get hdfsSourcePath localDestinationPath
bin/hadoop fs -rmr folderToDelete
bin/hadoop job -kill job_id
Running a Hadoop MR program:
bin/hadoop jar jarFileName.jar programToRun parm1 parm2…
How MapReduce Works in Hadoop
Lifecycle of a MapReduce Job
[Diagram: data flows through the Map function into the Reduce function]
• Map parameters:
  io.sort.mb
• Shuffle/Reduce parameters (a tuning sketch follows this list):
  io.sort.factor
  mapred.inmem.merge.threshold
  mapred.job.shuffle.merge.percent
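As a minimal sketch, these knobs can be set per job through the old JobConf API; the values below are illustrative, not recommendations.

import org.apache.hadoop.mapred.JobConf;

public class TuningExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    conf.setInt("io.sort.mb", 200);                    // map-side sort buffer size, in MB
    conf.setInt("io.sort.factor", 100);                // number of streams merged at once while sorting
    conf.setInt("mapred.inmem.merge.threshold", 1000); // map outputs held in memory before merging
    conf.setFloat("mapred.job.shuffle.merge.percent", 0.66f); // shuffle-buffer fill ratio that triggers a merge
  }
}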
Components in a Hadoop MR Workflow
Fault tolerance is of high priority in the MapReduce framework.
HDFS Architecture
Lifecycle of a MapReduce Job
[Figure: job timeline over time; 1D projection for io.sort.factor = 500]
Automatic Optimization? (Not yet in Hadoop)
[Figure: map tasks run in three waves (Wave 1-3), followed by the shuffle and two waves of reduce tasks (Wave 1-2). What if the number of reduces increased to 9?]