© Vigen Sahakyan 2016
Hadoop Tutorial
HDFS (Hadoop
Distributed File System)
© Vigen Sahakyan 2016
Agenda
● What is HDFS?
● Anatomy of HDFS
● HDFS Federation
● HDFS clients
● HDFS workflow tools
© Vigen Sahakyan 2016
What is HDFS?
● Distributed File System
● Distributed because it partitions data and stores it across a cluster of machines.
● It was inspired by GFS (Google File System).
● One of the main components of Hadoop.
● Has a master/slave architecture.
● Runs on commodity hardware.
● HDFS is highly fault tolerant.
● Supports huge data sets.
● It scales horizontally.
© Vigen Sahakyan 2016
Anatomy of HDFS
HDFS has two main concepts: blocks and replication factor.
● Blocks - as in a regular file system, files in HDFS are broken into blocks (128 MB by
default, configurable), also called chunks, which are physically stored on independent
machines for reliability.
● Replication factor - the number of copies kept of each block (3 by default), with each
copy stored on a separate machine. You can also configure it in hdfs-site.xml, as
sketched in the code below.
Rack awareness - for large HDFS clusters that span multiple racks, it is important to
ensure that replicas of each block exist on more than one rack. That way the loss of a
switch doesn't make part of the data unavailable because all of its replicas sat beneath it.
HDFS also supports a High Availability feature.
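As a rough sketch (not part of the original slides), the same block-size and replication
settings that live in hdfs-site.xml can also be supplied through the Java client API when
writing a file. The NameNode host and the file path below are placeholders.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockAndReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same knobs as the hdfs-site.xml properties dfs.blocksize and dfs.replication.
        conf.set("dfs.blocksize", "134217728"); // 128 MB blocks
        conf.set("dfs.replication", "3");       // three copies of every block

        // "namenode-host" is a placeholder; use your cluster's NameNode address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/sample.txt"))) {
            out.writeUTF("HDFS splits this file into blocks and replicates each block.");
        }
        fs.close();
    }
}
```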
© Vigen Sahakyan 2016
Anatomy of HDFS
HDFS provides its core services via three types of long-running daemons (the read
sketch after this list shows how they cooperate):
● NameNode (one per cluster) - the master process of HDFS. It maintains the filesystem
tree and the metadata for all files and directories. All block-location metadata is held in
RAM, which makes the NameNode the cluster's scalability bottleneck.
● DataNode (one per node) - the slave process that physically stores and retrieves block
data on local disks when told to by clients or the NameNode. It also periodically sends
heartbeats and block reports to the NameNode.
● SecondaryNameNode (one per cluster) - despite its name, it is not a hot standby. It
periodically merges the NameNode's edit log into the filesystem image (checkpointing),
which keeps the edit log small and speeds up NameNode restarts; automatic failover
requires the separate HDFS High Availability setup with a standby NameNode.
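A minimal read sketch (the host and path are assumptions) illustrating the division of
labour above: the client asks the NameNode where a file's blocks live, then streams the
bytes directly from the DataNodes.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromHdfsSketch {
    public static void main(String[] args) throws Exception {
        // The NameNode only hands out metadata (which blocks, on which DataNodes);
        // the bytes themselves are streamed from the DataNodes.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"),
                                       new Configuration());
        try (FSDataInputStream in = fs.open(new Path("/user/demo/sample.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```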
© Vigen Sahakyan 2016
Anatomy of HDFS
© Vigen Sahakyan 2016
HDFS Federation
HDFS Federation allows a cluster to scale by adding NameNodes, each of which
manages a portion of the filesystem namespace (e.g. /user, /share, etc.). Clients must
send each request to the NameNode that owns the relevant part of the namespace (see
the sketch below). Clusters with a single NameNode are limited by how much metadata
fits in that NameNode's RAM.
● NameNodes don't know about each other.
● All NameNodes share the same pool of DataNodes.
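A sketch of what "send the request to the right NameNode" can look like from client
code. The two NameNode hosts and the /user vs. /share split are hypothetical, and real
deployments typically hide this routing behind a ViewFS mount table.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FederationClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Hypothetical split: nn1 manages /user, nn2 manages /share.
        FileSystem userFs  = FileSystem.get(URI.create("hdfs://nn1-host:8020"), conf);
        FileSystem shareFs = FileSystem.get(URI.create("hdfs://nn2-host:8020"), conf);

        // Each request goes to the NameNode that owns that part of the namespace.
        for (FileStatus status : userFs.listStatus(new Path("/user"))) {
            System.out.println("nn1: " + status.getPath());
        }
        for (FileStatus status : shareFs.listStatus(new Path("/share"))) {
            System.out.println("nn2: " + status.getPath());
        }
    }
}
```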
© Vigen Sahakyan 2016
HDFS clients
● CLI - HDFS provides a command-line interface for reads, writes, and other operations,
similar to a Linux shell.
● Native API - a Java API (with libraries for Python and C++) for working with the
filesystem programmatically.
● DistCp - a batch copy command implemented as a MapReduce application. It is very
fast and can be used to copy data between two clusters.
● WebHDFS - a REST service that exposes HDFS over HTTP with an operation set
similar to the shell's. It is especially useful from languages that have no native HDFS
client (see the sketch after this list).
● HttpFS - offers the same REST API as WebHDFS but acts as a gateway, so it works
across firewalls where direct WebHDFS access to the cluster does not.
● NFS - you can mount HDFS into your local file system and use it like a regular
filesystem.
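For languages and tools without a native HDFS client, WebHDFS is often the easiest
route. The sketch below lists a directory using only the JDK's HTTP client; the NameNode
host is a placeholder, and the HTTP port is typically 50070 on Hadoop 2 or 9870 on
Hadoop 3.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsListStatusSketch {
    public static void main(String[] args) throws Exception {
        // WebHDFS URL pattern: http://<namenode>:<http-port>/webhdfs/v1/<path>?op=...
        URL url = new URL("http://namenode-host:50070/webhdfs/v1/user/demo?op=LISTSTATUS");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        // The response body is JSON describing the directory entries.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        conn.disconnect();
    }
}
```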
© Vigen Sahakyan 2016
HDFS workflow tools
● Flume - built to automate copying log files into HDFS; it doesn't work well with
binary files.
● HDFS File Slurper - created by Alex Holmes; it can copy files of any format into and
out of HDFS.
● Kafka - fills a similar role to Flume but supports more formats and is more reliable; it
can also be used for streaming data.
● Sqoop - imports and exports data between HDFS and relational databases; supports
batch import from MySQL and PostgreSQL.
● Oozie - a workflow automation tool that helps with copying files and running jobs. It
can ingest data using Sqoop, Flume, or Kafka and then run a MapReduce job on that
data.
© Vigen Sahakyan 2016
References
1. Hadoop in Practice 2nd Edition by Alex Holmes
http://www.amazon.com/Hadoop-Practice-Alex-Holmes/dp/1617292222
2. Hadoop: The Definitive Guide by Tom White
http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White-ebook/dp/B00V7B1IZC
Thanks!
© Vigen Sahakyan 2016
