Introduction to HDFS and MapReduce

HDFS and Mapreduce
-by Uday Vakalapudi

Agenda
 Core elements of Hadoop
 Basic Hadoop Storage Hierarchy
 HDFS Default Storage Style
 Anatomy of a File Read
 Anatomy of a File Write
 Blocks & Block Caching
 HDFS Basic Filesystem Operations
 Copy with distcp for data backups
 Commissioning and Decommissioning Nodes
 MapReduce inspiration
 Brief Mapreduce flow
 Job, Task, and Task Attempt IDs
 Mapreduce example code(with Compression codec)

Core elements of Hadoop
HDFS Namenode Datanode
Mapreduce JobTracker TaskTracker

Name Node
R1N1 R1N2 R1N3 R1N4
Rack R1
Rack R2
R2N1 R2N2 R2N3 R2N4
Data center D1
Basic Hadoop Storage Hierarchy

HDFS Default Storage Style
Name Node
R1N1 R1N2 R1N3 R1N4
Rack R1
Rack R2
R2N1 R2N2 R2N3 R2N4
Data.csv
B1 B2 B3
B3 B1 B1 B3 B2 B2
B1
B2
B3
R2N1 R2N2R1N1
R2N3 R2N4R1N1
R2N3 R2N1R1N1
Meta
data

Blocks & Block Caching
 Block size is the minimum amount of data that it can read or write
 Filesystem blocks are typically a few kilobytes in size, whereas disk blocks are normally 512 bytes
 HDFS, too, has the concept of a block, but it is a much larger unit—128 MB by default.
 Like in a filesystem for a single disk, files in HDFS are broken into block-sized chunks, which are
stored as independent units.
 hdfs fsck /user/file.txt -files –blocks
 for frequently-accessed files the blocks may be explicitly cached in the datanode’s memory, in an
off-heap block cache.
 By default a block is cached in only one datanode’s memory
 dfs.datanode.max.locked.memory property used to set max lock memory
 Usinf hdfs cacheadmin option we add cachepool, add directory, and also we can give TTL(time –
to-live)
 hdfs cacheadmin -addDirective -path <path> -pool <pool-name> [-force] [-replication
<replication>] [-ttl <time-to-live>]

HDFS Basic Filesystem Operations
 hadoop fs –ls
 hadoop fs –lsr
 hadoop fs –put localdir hdfsdir <-copyFromLocal>
 hadoop fs –get hdfsdir localdir
 hadoop fs –rmr hdfsdir <rmdir>
 hadoop fs -getmerge <src> <localdst> [addnl]

Copy with distcp for data backups
 hadoop distcp file1 file2
 hadoop distcp dir1 dir2
 hadoop distcp -update dir1 dir2 <If you are unsure of the effect of a distcp operation>
 hadoop distcp -update -delete -p hdfs://namenode1/foo hdfs://namenode2/foo
 The -delete flag causes distcp to delete any files or directories from the destination that
are not present in the source, and -p means that file status attributes like permissions,
block size and replication are preserved.

Commissioning and Decommissioning
Nodes Commissioning new nodes
 Add the network addresses of the new nodes to the include file.
 Update the namenode with the new set of permitted datanodes using thiscommand:
 % hdfs dfsadmin -refreshNodes
 Update the resource manager with the new set of permitted node managers using:
 % yarn rmadmin -refreshNodes
 Update the slaves file with the new nodes, so that they are included in future operations performed by the Hadoop control
scripts.
 Start the new datanodes and node managers.
 Check that the new datanodes and node managers appear in the web UI.
 Decommissioning old nodes
 HDFS is set by the dfs.hosts.exclude property and for YARN by the yarn.resourcemanager.nodes.exclude-path property.
 Update the namenode with the new set of permitted datanodes, using this command:
 % hdfs dfsadmin -refreshNodes
 Update the resource manager with the new set of permitted node managers using:
 % yarn rmadmin -refreshNodes

MapReduce inspiration
 The name MapReduce comes from functional programming
- Map is the name of a higher-order function that applies a given function
to each element of a list. Sample in Scala:
val numbers = List(1,2,3,4,5)
numbers.map(x => x * x) == List(1,4,9,16,25)
- Reduce is the name of a higher-order function that analyze a recursive
data structure and recombine through use of a given combining
operation the results of recursively processing its constituent parts,
building up a return value. Sample in Scala:
val numbers = List(1,2,3,4,5)
numbers.reduce(_ + _) == 15
Note: MapReduce takes an input, splits it into smaller parts, execute the code of
the mapper on every part, then gives all the results to one or more reducers
that merge all the results into one.

Job,Task, andTask Attempt IDs
 application_1410450250506_0003
 job_1410450250506_0003
 task_1410450250506_0003_m_000003
 attempt_1410450250506_0003_m_000003_0

Introduction to HDFS and MapReduce

Related slideshows

Recommended for you

Recommended for you

Recommended for you

More Related Content

What's hot

What's hot (20)

Similar to Introduction to HDFS and MapReduce

Similar to Introduction to HDFS and MapReduce (20)

More from Uday Vakalapudi

More from Uday Vakalapudi (12)

Recently uploaded

Recently uploaded (20)

Introduction to HDFS and MapReduce