Acropolis Institute of Technology and Research, Indore
www.acropolis.in
Data Analytics
By: Mr. Ronak Jain
Table of Contents
UNIT-IV:
HADOOP MAPREDUCE: Employing Hadoop MapReduce, Creating the components of Hadoop MapReduce jobs,
Distributing data processing across server farms, Executing Hadoop MapReduce jobs, Monitoring the progress of job
flows, The Building Blocks of Hadoop MapReduce, Distinguishing Hadoop daemons, Investigating the Hadoop Distributed
File System, Selecting appropriate execution modes: local, pseudo-distributed, fully distributed.
Efficiency from:
Streaming through data, reducing seeks
Pipelining
A good fit for a lot of applications:
Log processing
Web index building
Hadoop MapReduce
MapReduce programming model
Framework for distributed processing of large data sets
Pluggable user code runs in generic framework
Common design pattern in data processing
cat * | grep | sort | uniq -c | cat > file
input | map | shuffle | reduce | output
MapReduce Usage
Log processing
Web search indexing
Ad-hoc queries
Closer Look
MapReduce Component
JobClient
JobTracker
TaskTracker
Child
Job Creation/Execution Process
MapReduce Process (org.apache.hadoop.mapred)
JobClient
Submit job
JobTracker
Manage and schedule job, split job into tasks
TaskTracker
Start and monitor the task execution
Child
The process that actually executes the task
Inter Process Communication
IPC/RPC (org.apache.hadoop.ipc)
Protocol
JobSubmissionProtocol
JobClient <-------------> JobTracker
InterTrackerProtocol
TaskTracker <------------> JobTracker
TaskUmbilicalProtocol
TaskTracker <-------------> Child
JobTracker implements both protocols and acts as the server
in both IPC channels
TaskTracker implements the TaskUmbilicalProtocol; the Child
gets task information and reports task status through it.
JobClient.submitJob - 1
Check input and output, e.g. check whether the output directory already
exists
job.getInputFormat().validateInput(job);
job.getOutputFormat().checkOutputSpecs(fs, job);
Get InputSplits, sort, and write output to HDFS
InputSplit[] splits = job.getInputFormat().getSplits(job, job.getNumMapTasks());
writeSplitsFile(splits, out); // out is $SYSTEMDIR/$JOBID/job.split
JobClient.submitJob - 2
The jar file and the configuration file are uploaded to the HDFS system
directory
job.write(out); // out is $SYSTEMDIR/$JOBID/job.xml
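From the application side, these internals are reached through JobClient. A minimal sketch (assumed code, not from the notes; exception handling omitted) of submitting a job and polling its progress with the classic org.apache.hadoop.mapred API:
JobClient client = new JobClient(conf);        // conf is the job's JobConf
RunningJob running = client.submitJob(conf);   // triggers the submitJob steps above
while (!running.isComplete()) {                // poll the JobTracker for progress
  System.out.printf("map %.0f%%  reduce %.0f%%%n",
      running.mapProgress() * 100, running.reduceProgress() * 100);
  Thread.sleep(5000);                          // check every 5 seconds
}
System.out.println(running.isSuccessful() ? "Job succeeded" : "Job failed");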
What is MapReduce?
Easy as 1, 2, 3!
Step 1 (Map): extract what you care about from each input chunk.
Step 2 (Sort / Group by): sort and shuffle.
Step 3 (Reduce): aggregate, summarize.
(Leskovec et al., 2014; http://www.mmds.org/)
Word Count Example
Mapper
Input: value: lines of text of input
Output: key: word, value: 1
Reducer
Input: key: word, value: set of counts
Output: key: word, value: sum
Launching program
Defines this job
Submits job to cluster
Word Count Dataflow
Word Count Mapper
// needs java.io.IOException, java.util.StringTokenizer, org.apache.hadoop.io.*, org.apache.hadoop.mapred.*
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    StringTokenizer tokenizer = new StringTokenizer(value.toString()); // tokenize the line
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);               // emit (word, 1) for every token
    }
  }
}
Word Count Driver
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
JobClient.runJob(conf);
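The Reduce class referenced by the driver is not shown in these notes; a minimal sketch of it, assuming the same classic WordCount API (needs java.util.Iterator in addition to the imports above):
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;                               // add up all counts emitted for this word
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum)); // emit (word, total count)
  }
}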
Input and Output Formats
Maps
Usually as many as the number of HDFS blocks being processed; this is the default
Else the number of maps can be specified as a hint
The number of maps can also be controlled by specifying the minimum split size
The actual sizes of the map inputs are computed by:
max(min(block_size, data / #maps), min_split_size)
Reduces
Unless the amount of data being processed is small:
0.95 * num_nodes * mapred.tasktracker.tasks.maximum
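These knobs are exposed on the JobConf. A small driver sketch with assumed values (the cluster figures are placeholders, not from the notes):
int numNodes = 10, tasksPerNode = 2;           // assumed cluster size
conf.setNumMapTasks(100);                      // only a hint; the split computation above decides
conf.setNumReduceTasks((int) (0.95 * numNodes * tasksPerNode));
conf.set("mapred.min.split.size", String.valueOf(128L * 1024 * 1024));  // raise minimum split to 128 MB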
Some handy tools
Partitioners
Combiners
Compression
Counters
Speculation
Zero Reduces
Distributed File Cache
Tool
Partitioners
Partitioners are application code that defines how keys are assigned to reduces
Default partitioning spreads keys evenly, but randomly
Uses key.hashCode() % num_reduces
Custom partitioning is often required, for example, to
produce a total order in the output
Should implement Partitioner interface
Set by calling conf.setPartitionerClass(MyPart.class)
To get a total order, sample the map output keys and pick values to
divide the keys into roughly equal buckets and use that in your
partitioner
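As an illustration only (the class name and bucketing scheme are made up, not from the notes), a custom Partitioner in the classic API might route keys by their first letter so the reduce outputs come out in rough alphabetical order:
public static class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
  public void configure(JobConf job) { }       // no configuration needed
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    String s = key.toString().toLowerCase();
    int bucket = s.isEmpty() ? 0 : Math.min(Math.max(s.charAt(0) - 'a', 0), 25);
    return bucket * numReduceTasks / 26;       // spread 26 letter buckets over the reduces
  }
}
Registered in the driver with conf.setPartitionerClass(FirstLetterPartitioner.class);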
Combiners
When maps produce many repeated keys:
It is often useful to do a local aggregation following the map
Done by specifying a Combiner
Goal is to decrease the size of the transient data
Combiners have the same interface as Reduces, and often are the same class
Combiners must not have side effects, because they run an indeterminate number of times
In WordCount, conf.setCombinerClass(Reduce.class);
Compression
Compressing the outputs and intermediate data will often yield huge performance gains
Can be specified via a configuration file or set programmatically
Set mapred.output.compress to true to compress job output
Set mapred.compress.map.output to true to compress map outputs
Compression Types (mapred(.map)?.output.compression.type)
“block” - Group of keys and values are compressed together
“record” - Each value is compressed individually
Block compression is almost always best
Compression Codecs (mapred(.map)?.output.compression.codec)
Default (zlib) - slower, but more compression
LZO - faster, but less compression
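For instance, these options can be set programmatically on the JobConf from the driver (a sketch; property names taken from the list above, codec class is the standard zlib/gzip one):
conf.setBoolean("mapred.output.compress", true);        // compress the job output
conf.setBoolean("mapred.compress.map.output", true);    // compress the map outputs
conf.set("mapred.output.compression.type", "BLOCK");    // block compression
conf.set("mapred.output.compression.codec",
    "org.apache.hadoop.io.compress.GzipCodec");         // zlib codec; an LZO codec would be faster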
Counters
Often Map/Reduce applications have countable events
For example, the framework counts records into and out of
the Mapper and Reducer
To define user counters:
static enum Counter {EVENT1, EVENT2};
reporter.incrCounter(Counter.EVENT1, 1);
Define nice names in a MyClass_Counter.properties file
CounterGroupName=MyCounters
EVENT1.name=Event 1
EVENT2.name=Event 2
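Put together, the counter is bumped from a task through the Reporter and read back from the finished job. A short sketch (running is the RunningJob from the earlier monitoring sketch; this is assumed code, not from the notes):
// inside map() or reduce(), using the Reporter argument:
reporter.incrCounter(Counter.EVENT1, 1);
// in the driver, once the job has finished:
Counters counters = running.getCounters();
System.out.println("EVENT1 = " + counters.getCounter(Counter.EVENT1));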
Speculative execution