• What is MapReduce?
• What are MapReduce implementations?
Faced with these questions, I did some personal research and put together a synthesis, which helped me clarify some ideas. The attached presentation is not meant to be exhaustive, but it may offer some useful insights.
2. What is MapReduce?
Restricted parallel programming model meant for large clusters
– User implements Map() and Reduce() functions
Parallel computing framework
– Libraries take care of EVERYTHING else
• Parallelization
• Fault Tolerance
• Data Distribution
• Load Balancing
Useful model for many practical tasks
www.decideo.fr/bruley
3. Map and Reduce
The idea of Map and Reduce is 40+ years old
– Present in all Functional Programming Languages.
– See, e.g., APL, Lisp and ML
Alternate names for Map: Apply-All
Higher Order Functions
– take function definitions as arguments, or
– return a function as output
Map and Reduce are higher-order functions.
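As a minimal illustration (Python is chosen here only for brevity; the slides do not prescribe a language), the built-in map and functools.reduce both take a function as an argument, which is what makes them higher-order. The helper names square and add are made up for this sketch.

```python
from functools import reduce

def square(x):
    return x * x

def add(a, b):
    return a + b

numbers = [1, 2, 3, 4]

# map applies the function passed to it to every element
squares = list(map(square, numbers))

# reduce folds the elements into a single value using the function passed to it
total = reduce(add, squares, 0)

print(squares)
print(total)
```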
4. Map and Reduce Functions
Functions borrowed from functional programming languages (e.g. Lisp)
Map()
– Process a key/value pair to generate intermediate key/value pairs
Reduce()
– Merge all intermediate values associated with the same key
5. Example: Counting Words
Map()
– Input <filename, file text>
– Parses file and emits <word, count> pairs
• e.g. <"hello", 1>
Reduce()
– Sums all values for the same key and emits <word, TotalCount>
• e.g. <"hello", (3 5 2 7)> => <"hello", 17>
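The word-count example can be sketched in a few lines of Python. This is a single-process illustration, not a distributed implementation: the function names and the run_word_count driver are hypothetical, and the grouping of values by key that a real framework does between Map() and Reduce() is simulated here with a dictionary.

```python
from collections import defaultdict

def map_word_count(filename, text):
    """Map(): emit a <word, 1> pair for every word in the file text."""
    for word in text.lower().split():
        yield (word, 1)

def reduce_word_count(word, counts):
    """Reduce(): sum all counts emitted for the same word."""
    return (word, sum(counts))

def run_word_count(files):
    """Hypothetical driver: group intermediate pairs by key, then reduce."""
    grouped = defaultdict(list)
    for filename, text in files.items():
        for word, count in map_word_count(filename, text):
            grouped[word].append(count)
    return dict(reduce_word_count(w, c) for w, c in grouped.items())

files = {"a.txt": "hello world hello", "b.txt": "hello mapreduce"}
result = run_word_count(files)
print(result)
```

The user only writes the two small functions; everything in run_word_count stands in for what the framework would do.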
6. Execution on Clusters
1. Input files split (M splits)
2. Assign Master & Workers
3. Map tasks
4. Writing intermediate data to disk (R regions)
5. Intermediate data read & sort
6. Reduce tasks
7. Return
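The steps above can be sketched as a small single-process simulation in Python. Several things are assumptions made for illustration: one process plays the master and every worker (so step 2 is implicit), "disk" is an in-memory list, the splits are cut round-robin, and the region is chosen with a CRC32 hash of the key. The names split_input, region_of and run_job are hypothetical.

```python
import zlib
from collections import defaultdict

def split_input(records, m):
    """Step 1: cut the input into M splits (round-robin, an assumption)."""
    return [records[i::m] for i in range(m)]

def region_of(key, r):
    """Choose one of the R regions for a key (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % r

def run_job(records, map_fn, reduce_fn, m, r):
    # Steps 3-4: one map task per split; each task writes its
    # intermediate pairs into R regions.
    intermediate = []
    for split in split_input(records, m):
        regions = [[] for _ in range(r)]
        for record in split:
            for key, value in map_fn(record):
                regions[region_of(key, r)].append((key, value))
        intermediate.append(regions)

    # Steps 5-6: each reduce task reads its region from every map task,
    # sorts the pairs by key, groups them, and applies reduce_fn.
    output = {}
    for region in range(r):
        pairs = sorted(p for regions in intermediate for p in regions[region])
        grouped = defaultdict(list)
        for key, value in pairs:
            grouped[key].append(value)
        for key, values in grouped.items():
            output[key] = reduce_fn(key, values)

    # Step 7: return the combined result to the caller.
    return output

lines = ["a b a", "b c", "a"]
counts = run_job(
    lines,
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=lambda key, values: sum(values),
    m=2,
    r=3,
)
print(counts)
```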
7. Map/Reduce Cluster Implementation
[Diagram: input files (split 0 to split 4) → M map tasks → intermediate files → R reduce tasks → output files (Output 0, Output 1)]
– Several map or reduce tasks can run on a single computer
– Each intermediate file is divided into R partitions by the partitioning function
– Each reduce task corresponds to one partition
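A partitioning function can be as simple as a stable hash of the key taken modulo R. The sketch below assumes that scheme (the slides do not say which function is used) and uses hypothetical names partition and divide. The point it illustrates is that every pair with the same key lands in the same partition, so a single reduce task sees all values for that key.

```python
import zlib

def partition(key, r):
    """Assumed partitioning function: stable hash of the key, modulo R."""
    return zlib.crc32(key.encode("utf-8")) % r

def divide(intermediate_pairs, r):
    """Divide one map task's intermediate pairs into R partitions."""
    partitions = [[] for _ in range(r)]
    for key, value in intermediate_pairs:
        partitions[partition(key, r)].append((key, value))
    return partitions

pairs = [("hello", 1), ("world", 1), ("hello", 1)]
parts = divide(pairs, r=2)
print(parts)
```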
8. Map Reduce vs. Parallel Databases
Map Reduce is widely used for parallel processing
– Google, Yahoo, and hundreds of other companies
– Example uses: compute PageRank, build keyword indices, analyze web click logs, ...
Database people say:
– But parallel databases have been doing this for decades
Map Reduce people say:
– We operate at scales of thousands of machines
– We handle failures seamlessly
– We allow procedural code in map and reduce, and data of any type
10. Map Reduce Implementations
Google
– Not available outside Google
Hadoop
– An open-source implementation in Java
– Uses HDFS for stable storage
– Download: http://lucene.apache.org/hadoop/
Teradata Aster
– Cluster-optimized SQL database that also implements MapReduce
• IITB alumnus among founders
And several others, such as Cassandra at Facebook
12. Solutions Stack for Teradata Aster
[Diagram: three stacked layers]
– Aster Data Ecosystem: Data Integration / ETL, Business Intelligence Tools, Query Tools, Analytics Specialists
– Aster Data Platform: Systems Management, Aster Data nCluster, Security
– Infrastructure: Operating System, Servers, Cloud Infrastructure, Storage
13. Teradata Aster Platform Infrastructure
For physical infrastructure (non-cloud) deployments
– Aster Data Analytic Platform (nCluster): Aster Data nCluster packaged software
– Operating System: Certified Linux operating system
– Server Hardware: Certified commodity (x86) server hardware with internal storage
14. Teradata Aster Infrastructure
For cloud deployments
– Aster Data Analytic Platform (nCluster): Aster Data nCluster packaged software
– Operating System: Linux operating system
– Compute Instance (CC, xLarge): Compute instance from cloud provider (e.g. Amazon Web Services EC2)
– Storage (EBS, Ephemeral): Storage connected to cloud computing capacity
15. Teradata Aster Architecture for Analytics
Your Analytics & Advanced Reporting Applications
• Support for in-database processing of custom applications written in a broad variety of languages
• Integration with third-party packaged software via ODBC/JDBC or in-database integration
Aster Data nCluster
Analytic Functions and Frameworks
• Rich libraries of MapReduce analytics from Aster Data and partners
• Visual development environment: develop in hours
Unified Interface (SQL, SQL-MapReduce)
• Standard SQL interface
• MapReduce processing integrated with SQL via SQL-MapReduce interface
Analytics Processing Engines (SQL, MapReduce) and Massively Parallel Data Stores
• Optimized SQL engine
• Fully-integrated in-database MapReduce
• Hybrid row/column DBMS
• Linear, incremental scalability
• Commodity hardware
16. Teradata Aster Ecosystem
Partner | Product | Product release | Platform for Certification
MicroStrategy | Intelligence Server | 9.2.1 32-bit | Windows 7, Enterprise Edition SP1, 32-bit, 64-bit
SAP | Business Objects | XI 3.1 | Windows 2008, 32-bit
Informatica | Powercenter | 9.0.1 | Client: Windows 2003/2008 Server 32 bit. Server: Windows 2003/2008 Server 32 bit and 64 bit
IBM | Cognos | 10.1FP1 | n/a
Tableau | Tableau Server | 6 | Windows (SS: TBU)
Microsoft | SSLS, SSAS, SSFS, SSIS | SQL Server 2008 | .NET Framework 2.0; Windows Server, 2008 64-bit; Windows 2003, 32-bit
*Oracle BIEE certification currently in process