© Vigen Sahakyan 2016
Hadoop Tutorial
HDFS (Hadoop
Distributed File System)
© Vigen Sahakyan 2016
Agenda
● What is HDFS?
● Anatomy of HDFS
● HDFS Federation
● HDFS clients
● HDFS workflow tools
© Vigen Sahakyan 2016
What is HDFS?
● Distributed File System
● Distributed because it partitions data and stores it across a cluster of machines.
● It was inspired by GFS (Google File System).
● One of the main components of Hadoop.
● Has a master/slave architecture.
● Runs on commodity hardware.
● HDFS is highly fault tolerant.
● Supports huge data sets.
● It scales horizontally.
© Vigen Sahakyan 2016
Anatomy of HDFS
HDFS has two main concepts: blocks and replication factor.
● Blocks - as in a regular file system, files in HDFS are broken into blocks (128 MB by
default, configurable), also called chunks, which are physically stored on independent
machines for reliability.
● Replication factor - the number of copies kept of each block (3 by default), with each
copy stored on a separate machine. You can also configure it in hdfs-site.xml, as
sketched in the code below.
Rack awareness - for large HDFS clusters that span multiple racks, it is important to
ensure that replicas of each block exist on more than one rack. That way the loss of a
switch doesn't make part of the data unavailable because all of its replicas sat beneath it.
HDFS also supports a High Availability feature.
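As a rough sketch (not part of the original slides), the same block-size and replication
settings that live in hdfs-site.xml can also be supplied through the Java client API when
writing a file. The NameNode host and the file path below are placeholders.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockAndReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same knobs as the hdfs-site.xml properties dfs.blocksize and dfs.replication.
        conf.set("dfs.blocksize", "134217728"); // 128 MB blocks
        conf.set("dfs.replication", "3");       // three copies of every block

        // "namenode-host" is a placeholder; use your cluster's NameNode address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"), conf);
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/sample.txt"))) {
            out.writeUTF("HDFS splits this file into blocks and replicates each block.");
        }
        fs.close();
    }
}
```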
© Vigen Sahakyan 2016
Anatomy of HDFS
HDFS provides its core services via three types of long-running daemons (the read
sketch after this list shows how they cooperate):
● NameNode (one per cluster) - the master process of HDFS. It maintains the filesystem
tree and the metadata for all files and directories. All block-location metadata is held in
RAM, which makes the NameNode the cluster's scalability bottleneck.
● DataNode (one per node) - the slave process that physically stores and retrieves block
data on local disks when told to by clients or the NameNode. It also periodically sends
heartbeats and block reports to the NameNode.
● SecondaryNameNode (one per cluster) - despite its name, it is not a hot standby. It
periodically merges the NameNode's edit log into the filesystem image (checkpointing),
which keeps the edit log small and speeds up NameNode restarts; automatic failover
requires the separate HDFS High Availability setup with a standby NameNode.
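A minimal read sketch (the host and path are assumptions) illustrating the division of
labour above: the client asks the NameNode where a file's blocks live, then streams the
bytes directly from the DataNodes.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromHdfsSketch {
    public static void main(String[] args) throws Exception {
        // The NameNode only hands out metadata (which blocks, on which DataNodes);
        // the bytes themselves are streamed from the DataNodes.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:8020"),
                                       new Configuration());
        try (FSDataInputStream in = fs.open(new Path("/user/demo/sample.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```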
© Vigen Sahakyan 2016
Anatomy of HDFS
© Vigen Sahakyan 2016
HDFS Federation
HDFS Federation allows a cluster to scale by adding NameNodes, each of which
manages a portion of the filesystem namespace (e.g. /user, /share, etc.). Clients must
send each request to the NameNode that owns the relevant part of the namespace (see
the sketch below). Clusters with a single NameNode are limited by how much metadata
fits in that NameNode's RAM.
● NameNodes don't know about each other.
● All NameNodes share the same pool of DataNodes.
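A sketch of what "send the request to the right NameNode" can look like from client
code. The two NameNode hosts and the /user vs. /share split are hypothetical, and real
deployments typically hide this routing behind a ViewFS mount table.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FederationClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Hypothetical split: nn1 manages /user, nn2 manages /share.
        FileSystem userFs  = FileSystem.get(URI.create("hdfs://nn1-host:8020"), conf);
        FileSystem shareFs = FileSystem.get(URI.create("hdfs://nn2-host:8020"), conf);

        // Each request goes to the NameNode that owns that part of the namespace.
        for (FileStatus status : userFs.listStatus(new Path("/user"))) {
            System.out.println("nn1: " + status.getPath());
        }
        for (FileStatus status : shareFs.listStatus(new Path("/share"))) {
            System.out.println("nn2: " + status.getPath());
        }
    }
}
```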
© Vigen Sahakyan 2016
HDFS clients
● CLI - HDFS provides a command-line interface for reads, writes, and other operations,
similar to a Linux shell.
● Native API - a Java API (with libraries for Python and C++) for working with the
filesystem programmatically.
● DistCp - a batch copy command implemented as a MapReduce application. It is very
fast and can be used to copy data between two clusters.
● WebHDFS - a REST service that exposes HDFS over HTTP with an operation set
similar to the shell's. It is especially useful from languages that have no native HDFS
client (see the sketch after this list).
● HttpFS - offers the same REST API as WebHDFS but acts as a gateway, so it works
across firewalls where direct WebHDFS access to the cluster does not.
● NFS - you can mount HDFS into your local file system and use it like a regular
filesystem.
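For languages and tools without a native HDFS client, WebHDFS is often the easiest
route. The sketch below lists a directory using only the JDK's HTTP client; the NameNode
host is a placeholder, and the HTTP port is typically 50070 on Hadoop 2 or 9870 on
Hadoop 3.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsListStatusSketch {
    public static void main(String[] args) throws Exception {
        // WebHDFS URL pattern: http://<namenode>:<http-port>/webhdfs/v1/<path>?op=...
        URL url = new URL("http://namenode-host:50070/webhdfs/v1/user/demo?op=LISTSTATUS");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        // The response body is JSON describing the directory entries.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        conn.disconnect();
    }
}
```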
© Vigen Sahakyan 2016
HDFS workflow tools
● Flume - built to automate copying log files into HDFS; it doesn't work well with
binary files.
● HDFS File Slurper - created by Alex Holmes; it can copy files of any format into and
out of HDFS.
● Kafka - fills a similar role to Flume but supports more formats and is more reliable; it
can also be used for streaming data.
● Sqoop - imports and exports data between HDFS and relational databases; supports
batch import from MySQL and PostgreSQL.
● Oozie - a workflow automation tool that helps with copying files and running jobs. It
can ingest data using Sqoop, Flume, or Kafka and then run a MapReduce job on that
data.
© Vigen Sahakyan 2016
References
1. Hadoop in Practice 2nd Edition by Alex Holmes
http://www.amazon.com/Hadoop-Practice-Alex-Holmes/dp/1617292222
2. Hadoop: The Definitive Guide by Tom White
http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White-ebook/dp/B00V7B1IZC
Thanks!
© Vigen Sahakyan 2016
