2 MapReduce Model Principles and Patterns
GENOVEVA VARGAS SOLAR
FRENCH COUNCIL OF SCIENTIFIC RESEARCH, LIG-LAFMIA, FRANCE
Genoveva.Vargas@imag.fr
http://mapreducefest.wordpress.com/
http://vargas-solar.imag.fr
MAP-REDUCE
« Data represent the rising tide that lifts all boats—more data lead to better
algorithms and systems for solving real-world problems »
DATA PROCESSING
• Process the data to produce other data: analysis tools, business intelligence tools, ...
• This means
  • Handling large volumes of data
  • Managing thousands of processors
  • Parallelizing and distributing the processing
    • Scheduling I/O
    • Managing fault tolerance
    • Monitoring and controlling processes
MOTIVATION
• The only feasible approach to tackling large-data problems is to divide and conquer
• To the extent that the sub-problems are independent, they can be tackled in parallel by different workers (threads in a processor core, cores in a multi-core processor, multiple processors in a machine, or many machines in a cluster)
• Intermediate results from each individual worker are then combined to yield the final output
• Aspects to consider
  • How do we decompose the problem so that the smaller tasks can be executed in parallel?
  • How do we assign tasks to workers distributed across a potentially large number of machines? (Some workers are better suited to running some tasks than others, e.g., due to available resources, locality constraints, etc.)
  • How do we ensure that the workers get the data they need?
  • How do we coordinate synchronization among the different workers?
  • How do we share partial results from one worker that are needed by another?
  • How do we accomplish all of the above in the face of software errors and hardware faults?
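The divide-and-conquer scheme described on this slide (decompose, process independent sub-problems in parallel, combine partial results) can be sketched as follows. This is a minimal illustration, not a MapReduce implementation: the function names `split`, `work`, and `divide_and_conquer` are hypothetical, the workers are threads on a single machine, and the sub-problem is a plain sum. It does not address task assignment, data placement, or fault handling.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_chunks):
    """Decompose the problem: cut the input into at most n_chunks slices."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def work(chunk):
    """Independent sub-problem: each worker sums its own chunk."""
    return sum(chunk)


def divide_and_conquer(data, n_workers=4):
    chunks = split(data, n_workers)
    # Workers here are threads; cores, processors, or machines in a
    # cluster would play the same role at larger scale.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(work, chunks))
    # Combine the intermediate results into the final output.
    return sum(partials)


if __name__ == "__main__":
    print(divide_and_conquer(list(range(1, 101))))  # prints 5050
```

Because the chunks share no state, the workers need no synchronization. The questions listed under "Aspects to consider" arise as soon as that assumption is relaxed or the workers sit on different machines.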
MOTIVATION
• OpenMP for shared-memory parallelism, or libraries implementing the Message Passing Interface (MPI) for cluster-level parallelism, provide logical abstractions that hide details of operating system synchronization and communication primitives
  → developers must still keep track of how resources are made available to workers
• Map-Reduce provides an abstraction that hides many system-level details from the programmer
  → developers focus on what computations need to be performed, as opposed to how those computations are actually carried out or how to get the data to the processes
  • Yet, organizing and coordinating large amounts of computation is only part of the challenge
  • Large-data processing requires bringing data and code together for computation to occur, which is no small feat for datasets that are terabytes and perhaps petabytes in size!
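To make the "what, not how" point concrete, here is a small sketch in which the developer writes only two functions and hands them to a runtime. The runner `run_job` is a hypothetical, single-process stand-in written for this illustration. It is not the API of Hadoop or any other framework, and a real runtime would also distribute the work, move data, and handle failures. The word-count task and the names `map_words` and `reduce_counts` are likewise assumptions.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Hypothetical in-memory stand-in for a Map-Reduce runtime."""
    groups = defaultdict(list)
    for record in records:
        # Apply the user's map function and group emitted values by key.
        for key, value in map_fn(record):
            groups[key].append(value)
    # Apply the user's reduce function once per key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# What the developer writes: only the computation itself.
def map_words(line):
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    return sum(counts)


if __name__ == "__main__":
    lines = ["move the code to the data", "the data stay put"]
    print(run_job(lines, map_words, reduce_counts))
```

Nothing in `map_words` or `reduce_counts` mentions threads, machines, or data transfer; those concerns live entirely inside the runner.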
APPROACH
Centralized computing with distributed data storage
Run the program at the client, get data from the distributed system
Downsides: heavy data flows, no use of the cluster's computing resources
“Push the program near the data”
• Instead of moving large amounts of data around, it is far more efficient, if possible, to move the code to the data
• The complex task of managing storage in such a processing environment is typically handled by a distributed file system that sits underneath MapReduce
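The difference between the two approaches can be illustrated with a toy comparison. The sketch below is an assumption-laden model, not a measurement: each inner list stands for the data held by one storage node, the function names are hypothetical, and the number of values crossing the network is used as a rough proxy for data flow. The cost of shipping the code itself is ignored.

```python
def centralized(partitions):
    """Client pulls every record over the network, then computes locally."""
    transferred = sum(len(p) for p in partitions)
    total = sum(x for p in partitions for x in p)
    return total, transferred


def code_to_data(partitions):
    """Each node computes on its own partition; only partial sums travel."""
    partials = [sum(p) for p in partitions]
    transferred = len(partials)
    return sum(partials), transferred


if __name__ == "__main__":
    # Four hypothetical nodes, each holding 1000 values.
    partitions = [list(range(1000)) for _ in range(4)]
    print(centralized(partitions))   # prints (1998000, 4000)
    print(code_to_data(partitions))  # prints (1998000, 4)
```

Both functions return the same total, but the second moves one value per node instead of every record, and the per-node sums could run on the cluster's own processors.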
MAP-REDUCE PRINCIPLE