
MAP-REDUCE: SOME PRINCIPLES AND PATTERNS
GENOVEVA VARGAS SOLAR
FRENCH COUNCIL OF SCIENTIFIC RESEARCH, LIG-LAFMIA, FRANCE
Genoveva.Vargas@imag.fr
http://mapreducefest.wordpress.com/
http://vargas-solar.imag.fr

MAP-REDUCE

¡ Programming model for expressing distributed computations on massive amounts of data
¡ Execution framework for large-scale data processing on clusters of commodity servers
¡ Market: any organization built around gathering, analyzing, monitoring, filtering, searching, or organizing content must tackle large-data problems
¡ Data-intensive processing is beyond the capability of any individual machine and requires clusters
¡ Large-data problems are fundamentally about organizing computations on dozens, hundreds, or even thousands of machines

« Data represent the rising tide that lifts all boats—more data lead to better algorithms and systems for solving real-world problems »
DATA PROCESSING

¡ Process the data to produce other data: analysis tools, business intelligence tools, ...
¡ This means:
  • Handling large volumes of data
  • Managing thousands of processors
  • Parallelizing and distributing processing
  • Scheduling I/O
  • Managing fault tolerance
  • Monitoring/controlling processes

MapReduce makes all of this easy!
MOTIVATION

¡ The only feasible approach to tackling large-data problems is to divide and conquer
¡ To the extent that the sub-problems are independent, they can be tackled in parallel by different workers (threads in a processor core, cores in a multi-core processor, multiple processors in a machine, or many machines in a cluster)
¡ Intermediate results from each individual worker are then combined to yield the final output (see the sketch after this list)
¡ Aspects to consider
¡ How do we decompose the problem so that the smaller tasks can be executed in parallel?
¡ How do we assign tasks to workers distributed across a potentially large number of machines? (some workers are better suited to
running some tasks than others, e.g., due to available resources, locality constraints, etc.)
¡ How do we ensure that the workers get the data they need?
¡ How do we coordinate synchronization among the different workers?
¡ How do we share partial results produced by one worker that are needed by another?
¡ How do we accomplish all of the above in the face of software errors and hardware faults?
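
To make the divide-and-conquer shape concrete before introducing MapReduce itself, here is a toy single-machine sketch of the decompose/compute/combine pattern (the chunking scheme, worker count, and summation task are illustrative choices, not part of any framework):

```python
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk: list[int]) -> int:
    """Worker: solve one independent sub-problem."""
    return sum(chunk)

def parallel_sum(data: list[int], n_workers: int = 4) -> int:
    # Decompose: split the input into independent chunks.
    size = (len(data) + n_workers - 1) // n_workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Compute: each worker processes one chunk in parallel.
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    # Combine: merge the intermediate results into the final output.
    return sum(partials)

if __name__ == "__main__":
    print(parallel_sum(list(range(1_000_000))))  # 499999500000
```

MapReduce generalizes exactly this shape: the map phase plays the role of the independent workers and the reduce phase plays the role of the combination step, except that both run across many machines rather than local processes.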

MOTIVATION

¡ OpenMP for shared-memory parallelism, or libraries implementing the Message Passing Interface (MPI) for cluster-level parallelism, provide logical abstractions that hide details of operating-system synchronization and communication primitives
→ developers must still keep track of how resources are made available to workers

¡ Map-Reduce provides an abstraction that hides many system-level details from the programmer
→ developers focus on what computations need to be performed, as opposed to how those computations are actually carried out or how to get the data to the processes (see the sketch below)

¡ Yet, organizing and coordinating large amounts of computation is only part of the challenge
¡ Large-data processing requires bringing data and code together for computation to occur — no small feat for datasets that are terabytes and perhaps petabytes in size!
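
Concretely, the programmer's entire contract with the framework is two functions with the classic signatures map: (k1, v1) → list(k2, v2) and reduce: (k2, list(v2)) → list(v2). A minimal sketch of those types in Python notation (the alias names are illustrative, not a specific framework's API):

```python
from typing import Callable, Iterable, Iterator, TypeVar

K1 = TypeVar("K1")  # input key type (e.g., a file offset)
V1 = TypeVar("V1")  # input value type (e.g., a line of text)
K2 = TypeVar("K2")  # intermediate key type
V2 = TypeVar("V2")  # intermediate/output value type

# The developer supplies only these two functions; partitioning,
# scheduling, data movement, and fault tolerance are the framework's job.
Mapper = Callable[[K1, V1], Iterable[tuple[K2, V2]]]   # (k1, v1) -> list of (k2, v2)
Reducer = Callable[[K2, Iterable[V2]], Iterator[V2]]   # (k2, [v2, ...]) -> list of v2
```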

APPROACH
Centralized computing with distributed data storage: run the program at the client and fetch data from the distributed storage system
Downsides: heavy data flows, no use of the cluster's computing resources
→ instead, “push the program near the data”

¡ Instead of moving large amounts of data around, it is far more efficient, if possible, to move the code to the data
¡ The complex task of managing storage in such a processing environment is typically handled by a distributed file system that sits underneath MapReduce
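
A back-of-the-envelope calculation shows why shipping code beats shipping data; the job size, link speed, and code size below are assumed purely for illustration:

```python
# Moving the data vs. moving the code, over a single network link.
# Assumed for illustration: 1 TB of input, a 1 Gbit/s link, ~10 MB of code.
DATA_BYTES = 10**12        # 1 TB of input data
CODE_BYTES = 10 * 10**6    # ~10 MB program package
LINK_BITS_PER_SEC = 10**9  # 1 Gbit/s

def transfer_seconds(n_bytes: int, bits_per_sec: float) -> float:
    """Time to push n_bytes through a link of the given speed."""
    return n_bytes * 8 / bits_per_sec

print(f"ship the data: {transfer_seconds(DATA_BYTES, LINK_BITS_PER_SEC) / 3600:.1f} h")
print(f"ship the code: {transfer_seconds(CODE_BYTES, LINK_BITS_PER_SEC):.2f} s")
# ship the data: 2.2 h
# ship the code: 0.08 s
```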
MAP-REDUCE PRINCIPLE

¡ Stage 1: Apply a user-specified computation over all input records in a dataset
¡ These operations occur in parallel and yield intermediate output (key-value pairs)

¡ Stage 2: Aggregate the intermediate output with another user-specified computation
¡ The aggregation function is applied to each intermediate key together with the list of all values sharing that key, folding them into the final result (see the word-count sketch below)
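
The canonical illustration of the two stages is word counting. Below is a minimal single-process sketch: map_fn and reduce_fn are the two user-specified computations, and the grouping in the middle stands in for the shuffle that a real framework performs across machines (the function names and driver are illustrative, not a specific framework's API):

```python
from collections import defaultdict
from typing import Iterable, Iterator

# Stage 1: the user-specified map function.
# For word count: emit one (word, 1) pair per occurrence.
def map_fn(doc_id: str, text: str) -> Iterator[tuple[str, int]]:
    for word in text.split():
        yield (word, 1)

# Stage 2: the user-specified reduce function.
# Folds the list of counts for one word into a total.
def reduce_fn(word: str, counts: Iterable[int]) -> int:
    return sum(counts)

def mapreduce(docs: dict[str, str]) -> dict[str, int]:
    # Map phase: in a real framework this runs in parallel, one task per input split.
    intermediate: defaultdict[str, list[int]] = defaultdict(list)
    for doc_id, text in docs.items():
        for key, value in map_fn(doc_id, text):
            intermediate[key].append(value)  # shuffle: group values by key
    # Reduce phase: one reduce call per distinct intermediate key.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

print(mapreduce({"d1": "the quick fox", "d2": "the lazy dog"}))
# => {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```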
