
Acropolis Institute of Technology &

Research, Indore
www.acropolis.in
Data Analytics
By: Mr. Ronak Jain
Table of Contents
UNIT-IV:
HADOOP MAPREDUCE: Employing Hadoop MapReduce, creating the components of Hadoop MapReduce jobs,
distributing data processing across server farms, executing Hadoop MapReduce jobs, monitoring the progress of job
flows, the building blocks of Hadoop MapReduce, distinguishing Hadoop daemons, investigating the Hadoop Distributed
File System, and selecting appropriate execution modes: local, pseudo-distributed, fully distributed.



MapReduce
Challenges for I/O cluster computing, and how MapReduce addresses them:
1. Nodes fail: roughly 1 in 1000 nodes fails on any given day
   → Duplicate the data (distributed file system on a cluster)
2. The network is a bottleneck: typically 1-10 Gb/s throughput
   → Bring computation to the nodes, rather than data to the nodes (sort and shuffle)
3. Traditional distributed programming is often ad-hoc and complicated
   → Stipulate a programming model that can easily be distributed (simply define a map and a reduce)
MapReduce - What?
MapReduce is a programming model for efficient distributed computing.
It works like a Unix pipeline:
 cat input | grep | sort | uniq -c | cat > output
 Input | Map | Shuffle & Sort | Reduce | Output
Efficiency comes from:
 Streaming through data, reducing seeks
 Pipelining
A good fit for a lot of applications:
 Log processing
 Web index building
Hadoop MapReduce
MapReduce programming model
Framework for distributed processing of large data sets
Pluggable user code runs in generic framework
Common design pattern in data processing
cat * | grep | sort | uniq -c | cat > file
input | map | shuffle | reduce | output
MapReduce Usage
Log processing
Web search indexing
Ad-hoc queries
Closer Look
MapReduce Components
JobClient
JobTracker
TaskTracker
Child
Job Creation/Execution Process
MapReduce Process (org.apache.hadoop.mapred)

JobClient
 Submit job
JobTracker
 Manage and schedule job, split job into tasks
TaskTracker
 Start and monitor the task execution
Child
 The process that actually executes the task
Inter Process Communication
IPC/RPC (org.apache.hadoop.ipc)

Protocol
JobSubmissionProtocol
 JobClient <-------------> JobTracker
InterTrackerProtocol
 TaskTracker <------------> JobTracker
TaskUmbilicalProtocol
 TaskTracker <-------------> Child
The JobTracker implements both protocols and acts as the server in both IPC channels.
The TaskTracker implements the TaskUmbilicalProtocol; the Child gets task information and
reports task status through it.
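As a minimal sketch of how one of these RPC channels is wired up with the org.apache.hadoop.ipc API used on the following slides; the protocol interface and its method here are hypothetical (the real ones are JobSubmissionProtocol, InterTrackerProtocol, and TaskUmbilicalProtocol):

import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;
import org.apache.hadoop.ipc.VersionedProtocol;

// Hypothetical protocol for illustration only; Hadoop's own daemons use the
// three protocol interfaces named on this slide.
interface MyTrackerProtocol extends VersionedProtocol {
  long versionID = 1L;
  String heartbeat(String trackerName);
}

public class IpcClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    InetSocketAddress addr = new InetSocketAddress("jobtracker-host", 9001);
    // Client side: the proxy turns local method calls into RPC invocations,
    // which is the same pattern JobClient and the Child use below.
    MyTrackerProtocol proxy = (MyTrackerProtocol) RPC.getProxy(
        MyTrackerProtocol.class, MyTrackerProtocol.versionID, addr, conf);
    System.out.println(proxy.heartbeat("tracker_1"));
    RPC.stopProxy(proxy);
  }
}

The server side of such a channel is created with RPC.getServer(...), as shown on the Start TaskTracker slides later in these notes.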
JobClient.submitJob - 1
Check the input and output, e.g. whether the output directory already exists:
job.getInputFormat().validateInput(job);
job.getOutputFormat().checkOutputSpecs(fs, job);
Get the InputSplits, sort them, and write them to HDFS:
InputSplit[] splits = job.getInputFormat().getSplits(job, job.getNumMapTasks());
writeSplitsFile(splits, out); // out is $SYSTEMDIR/$JOBID/job.split
JobClient.submitJob - 2
The jar file and configuration file are uploaded to the HDFS system directory:
job.write(out); // out is $SYSTEMDIR/$JOBID/job.xml
JobStatus status = jobSubmitClient.submitJob(jobId);
This is an RPC invocation; jobSubmitClient is a proxy created during initialization.
Job initialization on JobTracker - 1
JobTracker.submitJob(jobID) <-- receive RPC invocation request
JobInProgress job = new JobInProgress(jobId, this, this.conf)
Add the job to the job queues:
jobs.put(job.getProfile().getJobId(), job);
jobsByPriority.add(job);
jobInitQueue.add(job);
Job initialization on JobTracker - 2
Sort by priority
resortPriority();
compare the JobPriority first, then compare the JobSubmissionTime
Wake the JobInitThread
jobInitQueue.notifyAll();
job = jobInitQueue.remove(0);
job.initTasks();
JobInProgress - 1
JobInProgress(String jobid, JobTracker jobtracker, JobConf default_conf);
JobInProgress.initTasks()
DataInputStream splitFile = fs.open(new Path(conf.get("mapred.job.split.file")));
// mapred.job.split.file --> $SYSTEMDIR/$JOBID/job.split
JobInProgress - 2
splits = JobClient.readSplitFile(splitFile);
numMapTasks = splits.length;
maps[i] = new TaskInProgress(jobId, jobFile, splits[i], jobtracker, conf, this, i);
reduces[i] = new TaskInProgress(jobId, jobFile, splits[i], jobtracker, conf, this, i);
JobStatus --> JobStatus.RUNNING
JobTracker Task Scheduling - 1
Task getNewTaskForTaskTracker(String taskTracker)
Compute the maximum number of map tasks that can run on taskTracker (worked example below):
int maxCurrentMapTasks = tts.getMaxMapTasks();
int maxMapLoad = Math.min(maxCurrentMapTasks,
    (int) Math.ceil((double) remainingMapLoad / numTaskTrackers));
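For instance, with hypothetical numbers (not from the slides), the formula spreads the remaining maps across the trackers but caps the result at the tracker's slot count:

// Worked example of the per-tracker map-load cap above; all numbers are hypothetical.
public class MapLoadExample {
  public static void main(String[] args) {
    int maxCurrentMapTasks = 2;     // map slots on this TaskTracker
    int remainingMapLoad   = 37;    // map tasks still to schedule for the job
    int numTaskTrackers    = 10;    // trackers in the cluster
    int maxMapLoad = Math.min(maxCurrentMapTasks,
        (int) Math.ceil((double) remainingMapLoad / numTaskTrackers));
    System.out.println(maxMapLoad); // prints 2: ceil(37/10) = 4, capped at the 2 slots
  }
}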
JobTracker Task Scheduling - 2
int numMaps = tts.countMapTasks(); // running tasks number
If numMaps < maxMapLoad, more tasks can be allocated: based on priority, pick the first job
from the jobsByPriority queue, create a task, and return it to the TaskTracker
Task t = job.obtainNewMapTask(tts, numTaskTrackers);
Start TaskTracker - 1
initialize()
Remove the original local directory
RPC initialization
 TaskReportServer = RPC.getServer(this, bindAddress, tmpPort, max, false, this, fConf);
 InterTrackerProtocol jobClient = (InterTrackerProtocol)
   RPC.waitForProxy(InterTrackerProtocol.class, InterTrackerProtocol.versionID, jobTrackAddr, this.fConf);
Start TaskTracker - 2
run();
offerService();
The TaskTracker talks to the JobTracker periodically with a heartbeat message:
HeartbeatResponse heartbeatResponse = transmitHeartBeat();
TaskTracker.localizeJob(TaskInProgress tip);
launchTasksForJob(tip, new JobConf(rjob.jobFile));
Run Task on TaskTracker - 1
tip.launchTask(); // TaskTracker.TaskInProgress
tip.localizeTask(task); // create folder, symbolic link
runner = task.createRunner(TaskTracker.this);
runner.start(); // start TaskRunner thread
Run Task on TaskTracker - 2
TaskRunner.run();
Configure the child process' JVM parameters, i.e. classpath, taskid, and the
taskReportServer's address & port
Start the child process
 runChild(wrappedCommand, workDir, taskid);
Child.main()
Create RPC Proxy, and execute RPC invocation
TaskUmbilicalProtocol umbilical = (TaskUmbilicalProtocol)
RPC.getProxy(TaskUmbilicalProtocol.class, TaskUmbilicalProtocol.versionID,
address, defaultConf);
Task task = umbilical.getTask(taskid);
task.run(); // mapTask / reduceTask.run
Finish Job - 1
Child
task.done(umbilical);
 RPC call: umbilical.done(taskId, shouldBePromoted)
TaskTracker
done(taskId, shouldPromote)
 TaskInProgress tip = tasks.get(taskid);
 tip.reportDone(shouldPromote);
o taskStatus.setRunState(TaskStatus.State.SUCCEEDED)
Finish Job - 2
JobTracker
TaskStatus report: status.getTaskReports();
TaskInProgress tip = taskidToTIPMap.get(taskId);
JobInProgress update JobStatus
 tip.getJob().updateTaskStatus(tip, report, myMetrics);
o One task of current job is finished
o completedTask(tip, taskStatus, metrics);
o if (this.status.getRunState() == JobStatus.RUNNING && allDone)
  { this.status.setRunState(JobStatus.SUCCEEDED); }
MapReduce - Features
Fine grained Map and Reduce tasks
 Improved load balancing
 Faster recovery from failed tasks
Automatic re-execution on failure
 In a large cluster, some nodes are always slow or flaky
 Framework re-executes failed tasks
Locality optimizations
 With large data, bandwidth to data is a problem
 Map-Reduce + HDFS is a very effective solution
 Map-Reduce queries HDFS for locations of input data
 Map tasks are scheduled close to the inputs when possible
What is MapReduce
Map: extract what you care about. Sort and shuffle. Reduce: aggregate, summarize.
What is MapReduce
Easy as 1, 2, 3!
Step 1: Map
Step 2: Sort / Group by key
Step 3: Reduce
(Leskovec et al., 2014; http://www.mmds.org/)
Data Flow
MapReduce - Dataflow
Example: Word Count
 tokenize(document) | sort | uniq -c
The input is split into chunks; Map extracts what you care about, the framework sorts and
shuffles, and Reduce aggregates and summarizes.
(Leskovec et al., 2014; http://www.mmds.org/)
Word Count Example
Mapper
Input: value: lines of input text
Output: key: word, value: 1
Reducer
Input: key: word, value: set of counts
Output: key: word, value: sum
Launching program
Defines this job
Submits job to cluster
Word Count Dataflow
Word Count Mapper
public static class Map extends MapReduceBase implements
    Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
Word Count Reducer
public static class Reduce extends MapReduceBase implements
    Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Word Count Example
Jobs are controlled by configuring JobConfs
JobConfs are maps from attribute names to string values
The framework defines attributes to control how the job is executed
 conf.set("mapred.job.name", "MyApp");
Applications can add arbitrary values to the JobConf
 conf.set("my.string", "foo");
 conf.setInt("my.integer", 12);
The JobConf is available to all tasks
Putting it all together
Create a launching program for your application
The launching program configures:
 The Mapper and Reducer to use
 The output key and value types (input types are inferred from the InputFormat)
 The locations for your input and output
The launching program then submits the job and typically waits for it to complete
Putting it all together

JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);

conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));

JobClient.runJob(conf);
Input and Output Formats
A Map/Reduce job may specify how its input is to be read by specifying an InputFormat to be used
A Map/Reduce job may specify how its output is to be written by specifying an OutputFormat to be used
These default to TextInputFormat and TextOutputFormat, which process line-based text data
Another common choice is SequenceFileInputFormat and SequenceFileOutputFormat for binary data
These are file-based, but they are not required to be
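For instance, switching the WordCount driver above to binary SequenceFile I/O is a one-line change per direction; a minimal sketch, assuming the same conf object and old-style org.apache.hadoop.mapred classes used elsewhere in these notes:

// Hedged sketch: use SequenceFile formats instead of the text defaults.
conf.setInputFormat(SequenceFileInputFormat.class);
conf.setOutputFormat(SequenceFileOutputFormat.class);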
How many Maps and Reduces

Maps
 Usually as many as the number of HDFS blocks being processed; this is the default
 Otherwise the number of maps can be specified as a hint
 The number of maps can also be controlled by specifying the minimum split size
 The actual sizes of the map inputs are computed by (worked example below):
  max(min(block_size, data_size / #maps), min_split_size)
Reduces
 Unless the amount of data being processed is small:
  0.95 * num_nodes * mapred.tasktracker.tasks.maximum
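As a worked illustration of the split-size formula above (all sizes are hypothetical, not from the slides):

// Hypothetical numbers plugged into max(min(block_size, data/#maps), min_split_size).
public class SplitSizeExample {
  public static void main(String[] args) {
    long blockSize = 128L << 20;   // 128 MB HDFS block
    long dataSize  = 10L  << 30;   // 10 GB of input
    long minSplit  = 1L   << 20;   // 1 MB minimum split size
    int  mapsHint  = 100;          // requested number of maps
    long splitSize = Math.max(Math.min(blockSize, dataSize / mapsHint), minSplit);
    System.out.println(splitSize); // 107374182 bytes, roughly 102 MB per map
  }
}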
Some handy tools

Partitioners
Combiners
Compression
Counters
Speculation
Zero Reduces
Distributed File Cache
Tool
Partitioners
Partitioners are application code that defines how keys are assigned to reduces
Default partitioning spreads keys evenly, but randomly
 Uses key.hashCode() % num_reduces
Custom partitioning is often required, for example, to produce a total order in the output
 Should implement the Partitioner interface
 Set by calling conf.setPartitionerClass(MyPart.class)
 To get a total order, sample the map output keys and pick values to divide the keys into
 roughly equal buckets, and use that in your partitioner (see the sketch below)
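A minimal sketch of a custom partitioner, assuming the same old mapred API as the WordCount code; the class name and first-letter scheme are illustrative only (not a sampled total-order partitioner):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Routes each word to a reducer based on its first letter, so reducer output
// files are roughly range-ordered.
public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
  public void configure(JobConf job) { }   // no per-job setup needed

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String s = key.toString();
    char c = s.isEmpty() ? 'a' : Character.toLowerCase(s.charAt(0));
    return (c % numPartitions + numPartitions) % numPartitions;  // always non-negative
  }
}
// Registered in the driver with: conf.setPartitionerClass(FirstLetterPartitioner.class);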
Combiners
When maps produce many repeated keys, it is often useful to do a local aggregation following the map
 Done by specifying a Combiner
 The goal is to decrease the size of the transient data
 Combiners have the same interface as Reducers, and often are the same class
 Combiners must not have side effects, because they run an indeterminate number of times
 In WordCount: conf.setCombinerClass(Reduce.class);
Compression
Compressing the outputs and intermediate data will often yield huge performance gains
 Can be specified via a configuration file or set programmatically (see the sketch below)
 Set mapred.output.compress to true to compress job output
 Set mapred.compress.map.output to true to compress map outputs
Compression Types (mapred(.map)?.output.compression.type)
 "block" - Groups of keys and values are compressed together
 "record" - Each value is compressed individually
 Block compression is almost always best
Compression Codecs (mapred(.map)?.output.compression.codec)
 Default (zlib) - slower, but more compression
 LZO - faster, but less compression
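A hedged sketch of setting these programmatically on the WordCount JobConf from earlier; the property names are the old-style ones quoted on this slide, and the codec choice is just an example:

// Compress the job output and the intermediate map output.
conf.setBoolean("mapred.output.compress", true);
conf.setBoolean("mapred.compress.map.output", true);
conf.set("mapred.output.compression.type", "BLOCK");
conf.set("mapred.output.compression.codec",
         "org.apache.hadoop.io.compress.DefaultCodec");  // zlib-based default codec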
Counters
Often Map/Reduce applications have countable events
For example, the framework counts the records into and out of the Mapper and Reducer
To define user counters:
 static enum Counter {EVENT1, EVENT2};
 reporter.incrCounter(Counter.EVENT1, 1);
Define nice names in a MyClass_Counter.properties file:
 CounterGroupName=MyCounters
 EVENT1.name=Event 1
 EVENT2.name=Event 2
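The launching program can also read a user counter back once the job completes; a minimal sketch, assuming the WordCount-style driver above and a hypothetical enclosing class MyClass that declares the Counter enum:

// Run the job and read back a user counter (old mapred API).
RunningJob job = JobClient.runJob(conf);
long event1 = job.getCounters().getCounter(MyClass.Counter.EVENT1);  // MyClass is hypothetical
System.out.println("EVENT1 = " + event1);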
Speculative execution
The framework can run multiple instances of slow tasks
 Output from the instance that finishes first is used
 Controlled by the configuration variable mapred.speculative.execution
 Can dramatically bring in long tails on jobs
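Programmatically, speculation can be toggled on the JobConf from the earlier driver; a minimal sketch using the configuration variable named above:

// Disable speculative execution for this job.
conf.setBoolean("mapred.speculative.execution", false);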
Zero Reduces
Frequently, we only need to run a filter on the input data
 No sorting or shuffling is required by the job
 Set the number of reduces to 0 (see below)
 Output from the maps will go directly to the OutputFormat and disk
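On the JobConf from the WordCount driver, this is a single call:

// Map-only job: no shuffle, no sort; map output goes straight to the OutputFormat.
conf.setNumReduceTasks(0);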
Tool
Handles "standard" Hadoop command line options:
 -conf file - load a configuration file named file
 -D prop=value - define a single configuration property prop
The class looks like:
public class MyApp extends Configured implements Tool {
  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyApp(), args));
  }
  public int run(String[] args) throws Exception {
    …. getConf() ….
  }
}
Distributed File Cache
Sometimes tasks need read-only copies of data on the local computer
 Downloading 1 GB of data for each Mapper is expensive
Define the list of files you need to download in the JobConf
Files are downloaded once per computer
Add to the launching program:
 DistributedCache.addCacheFile(new URI("hdfs://nn:8020/foo"), conf);
Add to the task:
 Path[] files = DistributedCache.getLocalCacheFiles(conf);
Finding the Shortest Path
A common graph search application is finding the shortest path from a start node to one
or more target nodes
Commonly done on a single machine with Dijkstra's Algorithm
Can we use BFS to find the shortest path via MapReduce?
Finding the Shortest Path: Intuition
We can define the solution to this problem inductively:
 DistanceTo(startNode) = 0
 For all nodes n directly reachable from startNode, DistanceTo(n) = 1
 For all nodes n reachable from some other set of nodes S,
  DistanceTo(n) = 1 + min{DistanceTo(m) : m ∈ S}
From Intuition to Algorithm
A map task receives a node n as a key, and (D, points-to) as its value
 D is the distance to the node from the start
 points-to is a list of nodes reachable from n
 For each p ∈ points-to, emit (p, D+1)
The reduce task gathers the possible distances to a given p and selects the minimum one
(a sketch of one such iteration follows below)
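A minimal sketch of one such iteration, in the same old mapred API as the WordCount code; the tab-separated record format ("distance<TAB>comma-separated points-to list") and the class names are assumptions for illustration, not part of the slides:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ShortestPathStep {
  // Map: re-emit the node's own record, and propose D+1 to every neighbour.
  public static class Map extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {
    public void map(Text node, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      String[] parts = value.toString().split("\t", 2);   // "D <TAB> p1,p2,..."
      int d = Integer.parseInt(parts[0]);
      String pointsTo = parts.length > 1 ? parts[1] : "";
      output.collect(node, value);                         // carry the points-to list forward
      for (String p : pointsTo.split(",")) {
        if (!p.isEmpty()) {
          output.collect(new Text(p), new Text(String.valueOf(d + 1)));
        }
      }
    }
  }

  // Reduce: keep the minimum proposed distance and recover the points-to list.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text node, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      int best = Integer.MAX_VALUE;
      String pointsTo = "";
      while (values.hasNext()) {
        String[] parts = values.next().toString().split("\t", 2);
        best = Math.min(best, Integer.parseInt(parts[0]));
        if (parts.length > 1) {
          pointsTo = parts[1];
        }
      }
      output.collect(node, new Text(best + "\t" + pointsTo));
    }
  }
}

The reducer writes records in the same format it reads, so the output of one iteration can be fed straight back in as the next iteration's input, as the next slide describes.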
What This Gives Us
This MapReduce task can advance the known frontier by one hop
To perform the whole BFS, a non-MapReduce component then feeds the output of this step
back into the MapReduce task for another iteration
Problem: Where'd the points-to list go?
Solution: The Mapper emits (n, points-to) as well
Blow-up and Termination
This algorithm starts from one node
Subsequent iterations include many more nodes of the graph as the frontier advances
Does this ever terminate?
 Yes! Eventually, routes between nodes will stop being discovered and no better distances
 will be found. When the distances stop changing, we stop.
 The Mapper should emit (n, D) to ensure that the "current distance" is carried into the reducer
