HDFS allows storing large amounts of data across multiple machines by splitting files into blocks and replicating those blocks for reliability. It addresses the big data challenges of volume, velocity, and variety by providing a distributed storage solution that scales horizontally. Traditional systems are limited by network bandwidth, the storage capacity of individual machines, and single points of failure. HDFS addresses these issues with a scalable master/slave architecture, a NameNode holding metadata and DataNodes storing the data blocks, combined with data distribution and fault tolerance.
2. Big Data Concepts
• Volume
– No longer GBs of data
– TB, PB, EB, ZB
• Velocity
– High-frequency data, e.g., stock ticks
• Variety
– Structured and unstructured data
3. Challenges In Big Data
• Complexity
– No proper understanding of the underlying data
• Storage
– How to accommodate a large amount of data on a single physical machine
• Performance
– How to process a large amount of data efficiently and effectively to improve performance
4. Challenges in Traditional Applications
• Network
– Limited bandwidth
• Data
– Growth of data cannot be controlled
• Efficiency & Performance
– How fast data can be read
• Processing capacity of the machine
– Processor and RAM are bottlenecks
5. Statistics
Assuming network bandwidth of 10 MB/s:
Application size | Data size       | Total round-trip time
10 MB            | 10 MB           | 1 + 1 = 2 s
10 MB            | 100 MB          | 10 + 10 = 20 s
10 MB            | 1000 MB (1 GB)  | 100 + 100 = 200 s (≈3.3 min)
10 MB            | 1000 GB (1 TB)  | 100,000 + 100,000 = 200,000 s (≈55 hours)
• Calculation is done under ideal conditions
• No processing time is taken into consideration
• How is data read?
– Line-by-line reading
– Depends on seek rate and disk latency
• Average data transfer rate = 75 MB/s
– Total time to read 100 GB ≈ 22 min
– Total time to read 1 TB ≈ 3.7 hours
• How much time would it take to sort 1 TB of data?
– Enough time to watch a movie while the data is being read
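The figures above are simple arithmetic; the sketch below (plain Java, using only the 10 MB/s network bandwidth and 75 MB/s disk transfer rate quoted on this slide) reproduces them.

    public class TransferTimes {
        // Values taken from the slide: 10 MB/s network, 75 MB/s average disk transfer rate.
        static final double NETWORK_MB_PER_SEC = 10.0;
        static final double DISK_MB_PER_SEC = 75.0;

        // Data shipped to the application node and results shipped back: one trip each way.
        static double roundTripSeconds(double dataMb) {
            return 2 * dataMb / NETWORK_MB_PER_SEC;
        }

        static double diskReadSeconds(double dataMb) {
            return dataMb / DISK_MB_PER_SEC;
        }

        public static void main(String[] args) {
            System.out.printf("1 GB round trip  : %.0f s%n", roundTripSeconds(1_000));
            System.out.printf("1 TB round trip  : %.1f h%n", roundTripSeconds(1_000_000) / 3600);
            System.out.printf("100 GB disk read : %.1f min%n", diskReadSeconds(100_000) / 60);
            System.out.printf("1 TB disk read   : %.1f h%n", diskReadSeconds(1_000_000) / 3600);
        }
    }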
6. Statistics (Contd.)
Observation
• A large amount of data takes a long time to read
• Data is moved back and forth over the network to the node where the application is running
– 90% of the time is consumed in data transfer
• Application size is constant
Conclusion
• Achieve data localization
– Move the application close to the data, or
– Move the data close to the application
7. Summary
• Storage is a problem
– Cannot store a large amount of data
– Upgrading the hard disk will not solve the problem (hardware limitation)
• Performance degradation
– Upgrading RAM will not solve the problem (hardware limitation)
• Reading
– Larger data requires more time to read
8. Solution Approach
• Distributed framework
– Store the data across several machines
– Perform computation in parallel across several machines
• Should support
– Partial failures
– Recoverability
– Data availability
– Consistency
– Data reliability
– Upgrading
10. What makes Hadoop special?
• No high-end or expensive systems are required
• Can run on Linux, Mac OS X, Windows, Solaris
• Fault-tolerant system
– Execution of the job continues even if nodes fail
• Highly reliable and efficient storage system
• Built-in intelligence to speed up the application
– Speculative execution
• A good fit for many applications:
– Web log processing
– Page indexing, page ranking
– Complex event processing
11. Features of Hadoop
• Partitions, replicates, and distributes the data
– Data availability, consistency
• Performs computation closer to the data
– Data localization
• Performs computation across several hosts
– MapReduce framework
12. Hadoop Components
• Hadoop is bundled with two independent components
– HDFS (Hadoop Distributed File System)
• Designed for scaling in terms of storage and I/O bandwidth
– MR framework (MapReduce)
• Designed for scaling in terms of performance (see the word-count sketch below)
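To give a feel for the MapReduce side, here is the classic word-count job written against the standard Hadoop MapReduce Java API; the class name and the input/output paths passed on the command line are illustrative.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map phase: runs on the nodes holding the input blocks (data localization).
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            public void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    ctx.write(word, ONE);          // emit (word, 1)
                }
            }
        }

        // Reduce phase: sums the counts for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context ctx) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                result.set(sum);
                ctx.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));     // e.g. an HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory must not exist yet
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }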
13. Understanding file structure
• A 1 GB file is split into blocks
• Each block is typically 64 MB
• Each block is stored as two files on disk – one holding the data, the second holding metadata (checksum)
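A small sketch with the Hadoop FileSystem Java API that lists the blocks a file was split into and the DataNodes holding each replica; the file path is a placeholder and the configuration is assumed to point at your cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/data/sample-1gb.bin");  // placeholder path

            FileStatus status = fs.getFileStatus(file);
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

            // One entry per block: its offset, length and the DataNodes holding replicas.
            for (BlockLocation b : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
            }
            fs.close();
        }
    }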
15. NameNode
• Single point of contact
• HDFS master
• Holds meta information
– List of files and directories
– Location of blocks
• Single node per cluster
– A cluster can have thousands of DataNodes and tens of thousands of HDFS clients
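Since the NameNode holds the file and directory metadata, a directory listing is answered entirely from it; a minimal sketch with the Hadoop FileSystem Java API (the path is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListDirectory {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Every entry below comes from NameNode metadata; no DataNode is contacted.
            for (FileStatus s : fs.listStatus(new Path("/user/demo"))) {   // placeholder path
                System.out.printf("%s\t%d bytes\treplication=%d%n",
                        s.getPath(), s.getLen(), s.getReplication());
            }
            fs.close();
        }
    }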
16. DataNode
• Can execute multiple tasks concurrently
• Holds actual data blocks, checksums and generation stamps
• If a block is half full, it needs only half the space of a full block
• At start-up, connects to the NameNode and performs a handshake
• Not bound to an IP address or port; identified by a Storage ID (e.g., XYZ001)
• Sends heartbeats to the NameNode
17. Communication
• Every 3 seconds each DataNode (identified by its Storage ID, e.g. XYZ001) sends a heartbeat to the NameNode – "I am alive" – reporting:
– Total storage capacity
– Fraction of storage in use
– Number of data transfers currently in progress
• In its reply the NameNode instructs the DataNode to:
– Replicate a block to another node
– Remove a local block replica
– Send an immediate block report
– Shut down the node
• If no heartbeat is received for 10 minutes, the NameNode considers the DataNode dead
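A toy sketch of the timing rules above – this is not Hadoop code, just plain Java illustrating the 3-second heartbeat interval and the 10-minute dead-node timeout; the class and method names are invented for illustration.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative only: tracks the last heartbeat per Storage ID and applies the
    // 3-second / 10-minute rules described on the slide.
    public class HeartbeatMonitor {
        static final Duration HEARTBEAT_INTERVAL = Duration.ofSeconds(3);
        static final Duration DEAD_NODE_TIMEOUT  = Duration.ofMinutes(10);

        private final Map<String, Instant> lastHeartbeat = new ConcurrentHashMap<>();

        // Called whenever a DataNode (identified by its Storage ID) reports in.
        void onHeartbeat(String storageId) {
            lastHeartbeat.put(storageId, Instant.now());
        }

        // A node with no heartbeat for 10 minutes is treated as dead.
        boolean isDead(String storageId) {
            Instant last = lastHeartbeat.get(storageId);
            return last == null
                || Duration.between(last, Instant.now()).compareTo(DEAD_NODE_TIMEOUT) > 0;
        }

        public static void main(String[] args) {
            HeartbeatMonitor monitor = new HeartbeatMonitor();
            monitor.onHeartbeat("XYZ001");
            System.out.println("XYZ001 dead? " + monitor.isDead("XYZ001")); // false, just reported
            System.out.println("XYZ002 dead? " + monitor.isDead("XYZ002")); // true, never reported
        }
    }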