2 Mapreduce Model Principles

MAP-REDUCE: SOME PRINCIPLES AND PATTERNS
GENOVEVA VARGAS SOLAR
FRENCH COUNCIL OF SCIENTIFIC RESEARCH, LIG-LAFMIA, FRANCE
Genoveva.Vargas@imag.fr
http://mapreducefest.wordpress.com/
http://vargas-solar.imag.fr
MAP-REDUCE
• Programming model for expressing distributed computations on massive amounts of data
• Execution framework for large-scale data processing on clusters of commodity servers
• Market: any organization built around gathering, analyzing, monitoring, filtering, searching, or organizing content must tackle large-data problems
• Data-intensive processing is beyond the capability of any individual machine and requires clusters
• Large-data problems are fundamentally about organizing computations on dozens, hundreds, or even thousands of machines
« Data represent the rising tide that lifts all boats—more data lead to better
algorithms and systems for solving real-world problems »
DATA PROCESSING
• Process the data to produce other data: analysis tools, business intelligence tools, ...
• This means
  • Handle large volumes of data
  • Manage thousands of processors
  • Parallelize and distribute the processing
  • Schedule I/O
  • Manage fault tolerance
  • Monitor and control processes
MapReduce provides all of this easily!
MOTIVATION
• The only feasible approach to tackling large-data problems is to divide and conquer
• To the extent that the sub-problems are independent, they can be tackled in parallel by different workers (threads in a processor core, cores in a multi-core processor, multiple processors in a machine, or many machines in a cluster)
• Intermediate results from each individual worker are then combined to yield the final output
• Aspects to consider
  • How do we decompose the problem so that the smaller tasks can be executed in parallel?
  • How do we assign tasks to workers distributed across a potentially large number of machines? (Some workers are better suited to running some tasks than others, e.g., due to available resources, locality constraints, etc.)
  • How do we ensure that the workers get the data they need?
  • How do we coordinate synchronization among the different workers?
  • How do we share partial results from one worker that are needed by another?
  • How do we accomplish all of the above in the face of software errors and hardware faults?
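The divide-and-conquer idea above can be sketched on a single machine. This is a minimal illustration, not how a MapReduce framework is implemented: the helper names (`split`, `partial_sum`, `parallel_sum`) are hypothetical, the "workers" are threads from Python's standard `concurrent.futures` module, and none of the harder questions (task placement, data movement, fault tolerance) are addressed.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_chunks):
    """Decompose the input into at most n_chunks roughly equal slices."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_sum(chunk):
    """Independent sub-problem: each worker sums only its own slice."""
    return sum(chunk)


def parallel_sum(data, n_workers=4):
    chunks = split(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each chunk is handed to a worker; results come back in chunk order.
        partials = list(pool.map(partial_sum, chunks))
    # Combine the intermediate results into the final output.
    return sum(partials)


print(parallel_sum(list(range(1, 101))))  # prints 5050
```

In standard CPython builds, threads show the structure of the computation rather than a real speed-up for CPU-bound work; the point is only the decompose, process independently, combine pattern.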
MOTIVATION
• OpenMP for shared-memory parallelism, or libraries implementing the Message Passing Interface (MPI) for cluster-level parallelism, provide logical abstractions that hide details of operating system synchronization and communication primitives
  → developers must still keep track of how resources are made available to workers
• Map-Reduce provides an abstraction that hides many system-level details from the programmer
  → developers focus on what computations need to be performed, as opposed to how those computations are actually carried out or how to get the data to the processes
• Yet organizing and coordinating large amounts of computation is only part of the challenge
• Large-data processing requires bringing data and code together for computation to occur, which is no small feat for datasets that are terabytes and perhaps petabytes in size!
APPROACH
Centralized computing with distributed data storage
Run the program at the client, get data from the distributed system
Downsides: large data flows, no use of the cluster's computing resources
“Push the program near the data”
• Instead of moving large amounts of data around, it is far more efficient, if possible, to move the code to the data
• The complex task of managing storage in such a processing environment is typically handled by a distributed file system that sits underneath MapReduce
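The difference between the two approaches can be illustrated with a toy model. Everything here is an assumption made for illustration: the node names, the `partitions` dictionary standing in for a distributed file system, and the idea of counting "values sent over the network" as a proxy for data flow.

```python
# Hypothetical cluster: each node stores one partition of the dataset locally.
partitions = {
    "node-1": [3, 8, 1],
    "node-2": [7, 2],
    "node-3": [9, 4, 6, 5],
}


def centralized_max(partitions):
    """Pull every record to the client, then compute there."""
    pulled = [x for records in partitions.values() for x in records]
    return max(pulled), len(pulled)  # (result, values sent over the network)


def data_local_max(partitions):
    """Run the computation next to each partition; ship only partial results."""
    partials = [max(records) for records in partitions.values()]
    return max(partials), len(partials)  # (result, values sent over the network)


print(centralized_max(partitions))  # prints (9, 9)
print(data_local_max(partitions))   # prints (9, 3)
```

Both approaches return the same answer, but the data-local version transfers one value per node instead of one value per record, and the per-partition work runs on the cluster rather than on the client.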
MAP-REDUCE PRINCIPLE
• Stage 1 (map): apply a user-specified computation over all input records in a dataset
  • These operations occur in parallel and yield intermediate output (key-value pairs)
• Stage 2 (reduce): aggregate the intermediate output with another user-specified computation
  • Recursively applies a function to pairs of values in the list until a single result remains
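The two stages can be sketched as a word count that runs in a single process. This is a minimal sketch under stated assumptions: `map_fn`, `reduce_fn`, and `map_reduce` are hypothetical names, not the API of any MapReduce framework, and the grouping step in the middle is written out by hand here although a real framework performs it between the two stages.

```python
from collections import defaultdict
from functools import reduce


def map_fn(record):
    """Stage 1: emit one (word, 1) key-value pair per word in an input record."""
    return [(word.lower(), 1) for word in record.split()]


def reduce_fn(values):
    """Stage 2: fold the list of values for one key into a single value."""
    return reduce(lambda acc, v: acc + v, values, 0)


def map_reduce(records):
    # Stage 1: apply map_fn to every record (records are independent,
    # so a real system could process them in parallel).
    intermediate = [pair for record in records for pair in map_fn(record)]
    # Group the intermediate pairs by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Stage 2: aggregate the values collected for each key.
    return {key: reduce_fn(values) for key, values in groups.items()}


counts = map_reduce(["the quick fox", "the lazy dog", "The fox"])
print(counts)
```

The programmer supplies only `map_fn` and `reduce_fn`; everything inside `map_reduce` stands in for what the execution framework would handle.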
