Hadoop Distributed File System
(HDFS)
Big Data Concepts
• Volume
– No longer GBs of data: TB, PB, EB, ZB
• Velocity
– High-frequency data, such as stock-market feeds
• Variety
– Structured and unstructured data
Challenges In Big Data
• Complexity
– No proper understanding of the underlying data
• Storage
– How to accommodate a large amount of data on a single physical machine
• Performance
– How to process a large amount of data efficiently and effectively, so as to increase performance
Challenges in Traditional Applications
• Network
– Limited bandwidth
• Data
– Growth of data cannot be controlled
• Efficiency & Performance
– How fast can the data be read?
• Processing capacity of the machine
– Processor and RAM become the bottleneck
Statistics
Application Size (MB)   Data Size          Total Round-Trip Time (sec)
10                      10 MB              1 + 1 = 2
10                      100 MB             10 + 10 = 20
10                      1000 MB = 1 GB     100 + 100 = 200 (~3.3 min)
10                      1000 GB = 1 TB     100000 + 100000 = 200000 (~55 hours)
• Calculation is done under ideal conditions
• No processing time is taken into consideration
• Assumes a network bandwidth of 10 MB/s
• How is data read?
– Line-by-line reading
– Depends on seek rate and disk latency
• Average data transfer rate = 75 MB/s
– Total time to read 100 GB ≈ 22 min
– Total time to read 1 TB ≈ 3.7 hours
How long would it take to sort 1 TB of data?
Enough time to watch a movie while the data is merely being read.
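The figures above are plain size-over-bandwidth arithmetic. A small sketch reproducing them, assuming the stated 10 MB/s network and 75 MB/s disk transfer rates (ideal conditions, no processing time):

// Back-of-the-envelope check of the statistics above.
public class TransferTimes {
    static final double NETWORK_MB_PER_SEC = 10.0; // assumed network bandwidth
    static final double DISK_MB_PER_SEC = 75.0;    // assumed disk transfer rate

    // Round trip: move the data to the application and back.
    static double roundTripSeconds(double dataMb) {
        return 2 * (dataMb / NETWORK_MB_PER_SEC);
    }

    public static void main(String[] args) {
        System.out.printf("1 GB round trip: %.0f sec%n", roundTripSeconds(1_000));              // 200 sec
        System.out.printf("1 TB round trip: %.1f hours%n", roundTripSeconds(1_000_000) / 3600); // ~55.6 hours
        System.out.printf("100 GB disk read: %.0f min%n", 100_000 / DISK_MB_PER_SEC / 60);      // ~22 min
        System.out.printf("1 TB disk read: %.1f hours%n", 1_000_000 / DISK_MB_PER_SEC / 3600);  // ~3.7 hours
    }
}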
Statistics (Contd.)
Observation
• A large amount of data takes a long time to read
• Data is moved back and forth over a low-bandwidth network to where the application is running
– ~90% of the time is consumed by data transfer
• Application size is constant
Conclusion
• Achieve data localization:
– Move the application close to the data, or
– Move the data close to the application
Summary
• Storage is a problem
– A single machine cannot store a large amount of data
– Upgrading the hard disk will not solve the problem (hardware limitation)
• Performance degradation
– Upgrading RAM will not solve the problem (hardware limitation)
• Reading
– Larger data takes longer to read
Solution Approach
• Distributed framework
– Store the data across several machines
– Perform computation in parallel across several machines
• Should support
– Partial failures
– Recoverability
– Data availability
– Consistency
– Data reliability
– Upgrades
Introducing Hadoop
A distributed framework that provides scaling in:
• Storage
• Performance
• IO Bandwidth
What makes Hadoop special?
• No high-end or expensive systems are required
• Can run on Linux, Mac OS X, Windows, and Solaris
• Fault-tolerant system
– Execution of a job continues even while nodes are failing
• Highly reliable and efficient storage system
• Built-in intelligence to speed up applications
– Speculative execution
• A fit for many applications:
– Web log processing
– Page indexing, page ranking
– Complex event processing
Features of Hadoop
• Partitions, replicates, and distributes the data
– Data availability, consistency
• Performs computation close to the data
– Data localization
• Performs computation across several hosts
– MapReduce framework (a minimal mapper sketch follows below)
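To make the MapReduce point concrete, here is a minimal sketch of the canonical word-count mapper using Hadoop's org.apache.hadoop.mapreduce API; the class and variable names are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each host runs this mapper over the blocks stored locally,
// which is exactly the "computation close to the data" idea.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }
}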
Hadoop Components
• Hadoop is bundled with two independent components
– HDFS (Hadoop Distributed File System)
• Designed for scaling in terms of storage and IO bandwidth
– MR framework (MapReduce)
• Designed for scaling in terms of performance
Understanding file structure
• A file (e.g. 1 GB) is split into blocks
• Each block is typically 64 MB
• Each block is stored as two files: one holding the data, the second holding metadata (checksum)
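The split arithmetic is straightforward; a small sketch, assuming the 64 MB default block size mentioned above:

// How a file maps to HDFS blocks.
public class BlockMath {
    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;     // 64 MB
        long fileSize = 1024L * 1024 * 1024;    // 1 GB
        long fullBlocks = fileSize / blockSize; // 16 full blocks
        long remainder = fileSize % blockSize;  // 0 here
        long totalBlocks = fullBlocks + (remainder > 0 ? 1 : 0);
        System.out.println("1 GB file -> " + totalBlocks + " blocks of 64 MB");
        // A 100 MB file would take 2 blocks: one full 64 MB block plus a
        // 36 MB block that occupies only 36 MB on disk, not a full 64 MB.
    }
}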
Hadoop Processes
• Processes running on Hadoop
– NameNode
– DataNode
– Secondary NameNode
– Task Tracker
– Job Tracker
NameNode
• Single point of contact
• HDFS master
• Holds meta-information
– List of files and directories
– Location of blocks
• Single node per cluster
– A cluster can have thousands of DataNodes and tens of thousands of HDFS clients
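For intuition only, a toy model of the kind of mappings the NameNode keeps in memory; the real data structures (INodes, block maps) are considerably more involved:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the actual NameNode implementation.
public class NameNodeMetadataSketch {
    // File path -> ordered list of block IDs making up the file.
    Map<String, List<Long>> fileToBlocks = new HashMap<>();
    // Block ID -> Storage IDs of the DataNodes holding a replica.
    Map<Long, List<String>> blockToLocations = new HashMap<>();
}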
DataNode
• Can execute multiple tasks concurrently
• Holds the actual data blocks, checksums, and generation stamps
• If a block is half full, it needs only half the space of a full block
• At start-up, connects to the NameNode and performs a handshake
• Not bound to an IP address or port; identified by a Storage ID (e.g. XYZ001)
• Sends heartbeats to the NameNode
Communication
• Every 3 seconds, each DataNode sends a heartbeat ("I am alive") to the NameNode, reporting:
– Total storage capacity
– Fraction of storage in use
– Number of data transfers currently in progress
• In its reply, the NameNode can instruct the DataNode to:
– Replicate a block to another node
– Remove a local block replica
– Send an immediate block report
– Shut down the node
• A DataNode that sends no heartbeat for 10 minutes is considered out of service
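An illustrative sketch of the timing rules above; this is not the real DataNode/NameNode wire protocol, and every name here is hypothetical:

// Hypothetical model of the heartbeat exchange (requires Java 16+ for records).
public class HeartbeatSketch {
    static final long HEARTBEAT_INTERVAL_MS = 3_000;       // "I am alive" every 3 seconds
    static final long DEAD_NODE_TIMEOUT_MS = 10 * 60_000;  // 10 minutes of silence

    // What a DataNode reports in each heartbeat, per the slide above.
    record Heartbeat(String storageId, long capacityBytes,
                     long bytesInUse, int transfersInProgress) {}

    // NameNode side: a DataNode silent for 10 minutes is considered out of service.
    static boolean isOutOfService(long lastHeartbeatMs, long nowMs) {
        return nowMs - lastHeartbeatMs > DEAD_NODE_TIMEOUT_MS;
    }
}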
Overview of HDFS
[Diagram: HDFS client interacting with the cluster; the figure is not captured in this transcript]
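Since the diagram itself is not captured here, a minimal sketch of an HDFS client in action, using Hadoop's FileSystem API; the file path is illustrative, and fs.defaultFS is assumed to point at the cluster's NameNode:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Write a small file to HDFS, then read it back.
public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml
        FileSystem fs = FileSystem.get(conf);       // metadata calls go to the NameNode

        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello HDFS");             // data is streamed to DataNodes
        }
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());       // blocks are read back from DataNodes
        }
        fs.close();
    }
}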