The business case for Hadoop can be made on the tremendous operational cost savings it affords. But why stop there? The integration of R-powered analytics in Hadoop presents an entirely new value proposition. Organizations can write R code and deploy it natively in Hadoop, without moving data and without writing their own MapReduce. Bringing R-powered predictive analytics into Hadoop will accelerate Hadoop’s value to organizations by allowing them to break through performance and scalability challenges and solve new analytic problems. Use all the data in Hadoop to discover more, grow more quickly, and operate more efficiently. Ask bigger questions. Ask new questions. Get better, faster results and share them.
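The page itself contains no code, but a minimal sketch of the pattern it describes, using Revolution R Enterprise's RevoScaleR functions (the HDFS path, dataset, and formula are illustrative assumptions), might look like this:

    # Sketch: run an R model natively in Hadoop with RevoScaleR
    # (Revolution R Enterprise). Paths, columns, and formula are illustrative.
    library(RevoScaleR)

    # Point the compute context at the Hadoop cluster; the rx* calls below
    # then execute as MapReduce jobs on the cluster rather than locally.
    rxSetComputeContext(RxHadoopMR())

    # Reference a CSV already sitting in HDFS -- no data movement to the client.
    hdfs    <- RxHdfsFileSystem()
    flights <- RxTextData("/data/flights.csv", fileSystem = hdfs)

    # Fit a linear model across the cluster without writing any MapReduce.
    fit <- rxLinMod(ArrDelay ~ DayOfWeek + Distance, data = flights)
    summary(fit)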
Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha... (Seeling Cheung)
Citizens Bank was implementing a BigInsights Hadoop Data Lake with PureData System for Analytics to support all internal data initiatives and improve the customer experience. Testing BigInsights on the ViON Hadoop Appliance yielded the productivity, maintenance, and performance Citizens was looking for. Citizens Bank moved some analytics processing from Teradata to Netezza for better cost and performance, implemented BigInsights Hadoop for a data lake, and avoided large capital expenditures for additional Teradata capacity.
The document discusses big data testing and provides examples of big data projects. It defines big data as large volumes of data that are analyzed to make better decisions. Big data has three characteristics - volume, velocity, and variety. Traditional testing approaches are not suitable for big data, which requires new testing strategies and tools to handle the scale and complexity. Automating testing and understanding the data and processes are important for big data testing. The document outlines challenges and provides examples of batch and real-time systems as well as automation tools like Talend Open Studio.
Empower Splunk and other SIEMs with the Databricks Lakehouse for Cybersecurity (Databricks)
Cloud, Cost, Complexity, and threat Coverage are top of mind for every security leader. The Lakehouse architecture has emerged in recent years to help address these concerns with a single unified architecture for all your threat data, analytics and AI in the cloud. In this talk, we will show how Lakehouse is essential for effective Cybersecurity and popular security use-cases. We will also share how Databricks empowers the security data scientist and analyst of the future and how this technology allows cyber data sets to be used to solve business problems.
MongoDB San Francisco 2013: Geo Searches for Healthcare Pricing Data presented... (MongoDB)
This talk covers the MongoDB deployment architecture used at Castlight Health to support very low latency spatial searches against our database of hundreds of millions of healthcare prices. The Geo haystack index in MongoDB and SSDs turned out to be the perfect solution for our problem. A strategy of replica set flipping also enables Castlight to swap in very large changes to the pricing data with no impact to the running application.
Hortonworks Hybrid Cloud - Putting you back in control of your data (Scott Clinton)
The document discusses Hortonworks' solutions for managing data across hybrid cloud environments. It proposes getting all data under management, combating growing cloud data silos, and consistently securing and governing data across locations. Hortonworks offers the Hortonworks Data Platform, Hortonworks Dataflow, and Hortonworks DataPlane to provide a modern hybrid data architecture with cloud-native capabilities, security and governance, and the ability to extend to edge locations. The document also highlights Hortonworks' professional services and open source community initiatives around hybrid cloud data.
RCG proposes a Big Data Proof of Concept (PoC) to demonstrate the business value of analyzing a client's data using Big Data technologies. The PoC involves:
1) Defining a business problem and objectives in a workshop with client.
2) The client collecting and anonymizing relevant data.
3) RCG loading the data into their Big Data lab and analyzing it using Big Data technologies.
4) RCG producing results, insights, and recommendations for applying Big Data and taking business actions.
The PoC requires no investment from the client and provides an opportunity to explore Big Data analytics without committing resources.
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl... (Cloudera, Inc.)
Across all industries, organizations are embracing the promise of Apache Hadoop to store and analyze data of all types, at larger volumes than ever before possible. But to tap into the true value of this data, organizations need to manage this data and its subsequent metadata to understand its context, see how it’s changing, and take actions on it.
Cloudera Navigator is the only integrated data management and governance solution for Hadoop and is designed to do exactly this. With Cloudera 5.7, we have further expanded the capabilities of Cloudera Navigator to make it even easier to understand your data and maintain metadata consistency as it moves through Hadoop.
Necessity of Data Lakes in the Financial Services Sector (DataWorks Summit)
With the emergence of regulations such as the European Union's General Data Protection Regulation (effective May 2018), with fines of up to 20m Euro, data lakes are emerging as the data architecture of choice among financial institutions. Banks are embarking on a journey to enable data scientists to unlock the value of the data siloed in many disparate data systems. By enabling self-service data access and merging multiple streams of data using data clustering, entity extraction, identity resolution, and other techniques, we will show how banks have used analytics to uncover business value without falling into the abyss of data swamps. The build-out of the data lake requires the ingestion of data from multiple operational systems. By leveraging an automated data cataloging service delivered on the FICO Analytics Cloud, organizations are able to search, profile, discover, tag, track lineage, and capture tribal knowledge, enabling data scientists to build innovative models, make automated decisions, track fraudulent usage, run intelligent marketing campaigns, and improve the top and bottom lines for the financial institution.
Speaker:
Rohit Valia, Product Management and Strategy, Fico
How can a quality engineering and assurance consultancy keep you ahead of others (greyaudrina)
This document discusses how a quality engineering and assurance consultancy can help organizations stay ahead. It recommends leveraging technologies like AI, automation, and DevOps to improve software quality, testing, and speed, and suggests using AI and customer feedback to enhance the customer experience. Adopting business processes that provide transparency and actionable information can streamline operations and support efficient decision-making. With the help of a consultancy, organizations can optimize costs, improve returns on investment, and ensure business goals are achieved through customized solutions and a holistic approach.
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent (Seeling Cheung)
Nicholas Berg presented on Seagate's use of big data analytics to manage the large amount of manufacturing data generated from its hard drive production. Seagate collects terabytes of data per day from testing its drives, which it analyzes using Hadoop to improve quality, predict failures, and gain other insights. It faces challenges in integrating this emerging platform due to the rapid evolution of Hadoop and lack of tools to fully leverage large datasets. Seagate is developing its data lake and data science capabilities on Hadoop to better optimize manufacturing and drive design.
Continuously improving factory operations is of critical importance to manufacturers. Consider the facts: the total cost of poor quality amounts to a staggering 20% of sales (American Society of Quality), and unplanned downtime costs plants approximately $50 billion per year (Deloitte).
The most pressing questions are: which process variables affect quality and yield, and which process variables predict equipment failure? Getting to those answers gives forward-thinking manufacturers a leg up over competitors.
The speakers address the data management challenges facing today's manufacturers, including proprietary systems and siloed data sources, as well as an inability to make sensor-based data usable.
Integrating enterprise data from ERP, MES, maintenance systems, and other sources with real-time operations data from sensors, PLCs, SCADA systems, and historians represents a major first step. But how to get started? What is the value of a data lake? How are AI/ML being applied to enable real time action?
Join us for this educational session, which includes a view into a roadmap for an open source industrial IoT data management platform.
Key Takeaways:
• Understand key use cases commonly undertaken by manufacturing enterprises
• Understand the value of using multivariate manufacturing data sources, as opposed to a single sensor on a piece of equipment
• Understand advances in big data management and streaming analytics that are paving the way to next-generation factory performance
Threat Detection and Response at Scale with Dominique Brezinski (Databricks)
Security monitoring and threat response has diverse processing demands on large volumes of log and telemetry data. Processing requirements span from low-latency stream processing to interactive queries over months of data. To make things more challenging, we must keep the data accessible for a retention window measured in years. Having tackled this problem before in a massive-scale environment using Apache Spark, when it came time to do it again, there were a few things I knew worked and a few wrongs I wanted to right.
We approached Databricks with a set of challenges to collaborate on: provide a stable and optimized platform for Unified Analytics that allows our team to focus on value delivery using streaming, SQL, graph, and ML; leverage decoupled storage and compute while delivering high performance over a broad set of workloads; use S3 notifications instead of list operations; remove Hive Metastore from the write path; and approach indexed response times for our more common search cases, without hard-to-scale index maintenance, over our entire retention window. This is about the fruit of that collaboration.
Data summit connect fall 2020 - rise of data ops (Ryan Gross)
Data governance teams attempt to apply manual control at various points for consistency and quality of the data. By thinking of our machine learning data pipelines as compilers that convert data into executable functions and leveraging data version control, data governance and engineering teams can engineer the data together, filing bugs against data versions, applying quality control checks to the data compilers, and other activities. This talk illustrates how innovations are poised to drive process and cultural changes to data governance, leading to order-of-magnitude improvements.
Data Analytics in your IoT Solution – Fukiat Julnual, Technical Evangelist, Mic... (BAINIDA)
Data Analytics in your IoT Solution, by Fukiat Julnual, Technical Evangelist, Microsoft (Thailand) Limited, presented at THE FIRST NIDA BUSINESS ANALYTICS AND DATA SCIENCES CONTEST/CONFERENCE, organized by the School of Applied Statistics and DATA SCIENCES THAILAND
The document discusses Southwest Power Pool's initial steps towards creating a data lake. It describes:
- Storing historical and real-time data that exceeded initial expectations, with around 50% being less frequently used
- Conducting a proof-of-concept evaluation of three vendors to offload less frequently used data and allow SQL query access with minimal changes to existing queries
- Choosing BigInsights based on its ability to do this along with supporting existing Netezza functions and allowing federated queries between Netezza and BigInsights
- The multi-phase vision to eventually incorporate more data types and workloads while improving performance, security, and governance
Freddie Mac makes homeownership and rental housing more accessible and affordable. Operating in the secondary mortgage market, we keep mortgage capital flowing by purchasing mortgage loans from lenders so they in turn can provide more loans to qualified borrowers. Our mission to provide liquidity, stability, and affordability to the U.S. housing market in all economic conditions extends to all communities from coast to coast.
We're using big data and advanced analytics to create powerful enhancements to better meet our customers’ needs: automated collateral evaluation, automated assessments for borrowers without credit scores, immediate certainty for collateral rep and warranty relief, and, coming soon, automated asset and income validation.
We’re building tools to help our customers cut costs and give them rep and warranty relief sooner in the loan manufacturing process.
We’ve designed Loan Advisor Suite with lenders to give our customers greater certainty, usability, reliability and efficiency. It's a simpler, better way to do business.
More Tools - Access powerful solutions for every stage of the loan production process.
More Loans - Increase output with automated data management and user-friendly controls.
Less Risk - Get alerted to loan issues and take action the moment they occur.
Hear the story of how ACE helped Freddie Mac reimagine the mortgage process and how HDP helped make it possible.
Speaker
Dennis Tally, Freddie Mac, Director
Modern Data Management for Federal Modernization (Denodo)
Watch full webinar here: https://bit.ly/2QaVfE7
Faster, more agile data management is at the heart of government modernization. However, traditional data delivery systems are limited in their ability to realize a modernized, future-proof data architecture.
This webinar will address how data virtualization can modernize existing systems and enable new data strategies. Join this session to learn how government agencies can use data virtualization to:
- Enable governed, inter-agency data sharing
- Simplify data acquisition, search and tagging
- Streamline data delivery for transition to cloud, data science initiatives, and more
Testing the Data Warehouse—Big Data, Big Problems (TechWell)
Data warehouses are critical systems for collecting, organizing, and making information readily available for strategic decision making. The ability to review historical trends and monitor near real-time operational data is a key competitive advantage for many organizations. Yet the methods for assuring the quality of these valuable assets are quite different from those of transactional systems. Ensuring that appropriate testing is performed is a major challenge for many enterprises. Geoff Horne has led numerous data warehouse testing projects in both the telecommunications and ERP sectors. Join Geoff as he shares his approaches and experiences, focusing on the key “uniques” of data warehouse testing: methods for assuring data completeness, monitoring data transformations, measuring quality, and more. Geoff explores the opportunities for test automation as part of the data warehouse process, describing how you can harness automation tools to streamline the work and minimize overhead.
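Several of the checks Geoff describes, data completeness in particular, lend themselves to scripted reconciliation. Here is a hedged R sketch (connections, table, and column names are invented; RSQLite stands in for the real source and warehouse databases) comparing row counts and a column checksum between source and target:

    # Sketch: automated completeness check between a source system and the
    # warehouse. Connection details and names are illustrative only.
    library(DBI)

    src <- dbConnect(RSQLite::SQLite(), "source.db")     # stand-in for the OLTP source
    dwh <- dbConnect(RSQLite::SQLite(), "warehouse.db")  # stand-in for the warehouse

    reconcile <- function(con_a, con_b, table, measure) {
      q <- sprintf("SELECT COUNT(*) AS n, SUM(%s) AS total FROM %s", measure, table)
      a <- dbGetQuery(con_a, q)
      b <- dbGetQuery(con_b, q)
      data.frame(table       = table,
                 rows_match  = a$n == b$n,
                 total_match = isTRUE(all.equal(a$total, b$total)))
    }

    # Flag any table where counts or checksums drifted during loading.
    reconcile(src, dwh, "orders", "order_amount")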
When Agilysys, a leading provider of software solutions for the hospitality industry, decided to architect the next-generation platform for their software to unify lines of business, it chose MongoDB as the data store at the heart of the system. The Agilysys suite of hospitality products spans lodging and food & beverage, providing solutions for point of sale, property management, workforce management, and inventory and procurement, a perfect fit for the flexible JSON document model in MongoDB. Agilysys solutions now use a single store with related data from otherwise disconnected applications, allowing each application and line of business to operate independently. Simultaneously, the Agilysys rGuest suite of hospitality products has the capability to correlate and analyze patterns and behavior of guests across systems to improve their experience when staying on property.
January 2015 HUG: Apache Flink: Fast and reliable large-scale data processing (Yahoo Developer Network)
Apache Flink (incubating) is one of the latest additions to the Apache family of data processing engines. In short, Flink’s design aims to be as fast as in-memory engines, while providing the reliability of Hadoop. Flink contains (1) APIs in Java and Scala for both batch-processing and data streaming applications, (2) a translation stack for transforming these programs to parallel data flows and (3) a runtime that supports both proper streaming and batch processing for executing these data flows in large compute clusters.
Flink’s batch APIs build on functional primitives (map, reduce, join, cogroup, etc.), and augment those with dedicated operators for iterative algorithms and support for logical, SQL-like key attribute referencing (e.g., groupBy(“WordCount.word”)). The Flink streaming API extends the primitives from the batch API with flexible window semantics.
Internally, Flink transforms the user programs into distributed data stream programs. In the course of the transformation, Flink analyzes functions and data types (using Scala macros and reflection), and picks physical execution strategies using a cost-based optimizer. Flink’s runtime is a true streaming engine, supporting both batching and streaming. Flink operates on a serialized data representation with memory-adaptive out-of-core algorithms for sorting and hashing. This makes Flink match the performance of in-memory engines on memory-resident datasets, while scaling robustly to larger disk-resident datasets.
Finally, Flink is compatible with the Hadoop ecosystem. Flink runs on YARN, reads data from HDFS and HBase, and supports mixing existing Hadoop Map and Reduce functions into Flink programs. Ongoing work is adding Apache Tez as an additional runtime backend.
This talk presents Flink from a user perspective. We introduce the APIs and highlight the most interesting design points behind Flink, discussing how they contribute to the goals of performance, robustness, and flexibility. We finally give an outlook on Flink’s development roadmap.
Apache Spark has emerged over the past year as the imminent successor to Hadoop MapReduce. Spark can process data in memory at very high speed, while still being able to spill to disk if required. Spark’s powerful yet flexible API allows users to write complex applications very easily without worrying about the internal workings and how the data gets processed on the cluster.
Spark comes with an extremely powerful Streaming API to process data as it is ingested. Spark Streaming integrates with popular data ingest systems like Apache Flume, Apache Kafka, and Amazon Kinesis, allowing users to process data as it comes in.
In this talk, Hari will discuss the basics of Spark Streaming, its API, and its integration with Flume, Kafka, and Kinesis. Hari will also discuss a real-world example of a Spark Streaming application, and how code can be shared between a Spark application and a Spark Streaming application. Each stage of the application’s execution will be presented, to help the audience understand good practices for writing such an application. Hari will finally discuss how to write a custom application and a custom receiver to receive data from other systems.
This talk uses a case study to demonstrate core data science capabilities in Big Data, infrastructure requirements, and talent profiles that translate to early success. Using the challenge of classifying events in a consumer-oriented website, the discussion is for a wide audience:
- Practitioners will learn two key techniques for early success
- Technologists will learn how teams rely on key infrastructure and where engineers play a valuable role in data sciences
- Hiring managers will expand their knowledge of the skills required to bring business value with data
From the Predictive Analytics Innovation Summit
Video here: https://www.youtube.com/watch?v=PdKUt0zK0UY
With the avalanche of data about operations, customers, and products, leading companies are utilizing Big Analytics to better understand historical patterns and predict what may come next to create sustained competitive advantage. Dan Mallinger, who leads Think Big Analytics' data science team, will focus on practical examples of where companies are implementing new analytics approaches over big data. Dan will discuss how these efforts differ from traditional analytic approaches, the organizational and business impact, and how our clients are creating new value in areas such as marketing, services, sales and product development.
More Than Websites: PHP And The Firehose @DataSift (2013) (Stuart Herbert)
PHP is the world's #1 programming language for creating websites. But it's capable of so much more. How about processing the social firehose in real time? :)
Apachecon Europe 2012: Operating HBase - Things you need to know (Christian Gügi)
This document provides an overview of important concepts for operating HBase, including:
- HBase stores data in column families stored as files on disk, and writes to memory before flushing to disk.
- Manual and automatic splitting of regions is covered, as well as challenges of improper splitting.
- Tools for monitoring, debugging, and visualizing HBase operations are discussed.
- Key lessons focus on proper data modeling, extensive monitoring, and understanding the whole Hadoop ecosystem.
On June 11, Thomas Dinsmore gave a nice outline of the tools and technologies out there for handling analytics in Hadoop. It is a must-watch for anyone wondering what advanced analytics Hadoop can deliver.
Please find video and slides below.
Synopsis
What is the state of play for advanced analytics in Hadoop? A year ago, options included "roll your own" and little else; today there are a number of serious open source and commercial options available, with new capabilities announced daily.
In this presentation, we begin with a brief overview of use cases for advanced analytics and a discussion of what types of analytics must run in Hadoop. We continue with an overview of available architectures. The presentation concludes with a hype-free survey of available open source and commercial software for advanced analytics in Hadoop.
Bio
Thomas W. Dinsmore is Director of Product Management for Revolution Analytics, a company that provides commercial support and services for open source R. In this role, Mr. Dinsmore closely tracks the market for commercial and open source software on all platforms, including Hadoop. Prior to joining Revolution Analytics, Mr. Dinsmore served as an Analytics Solution Architect for IBM Big Data, and as a Principal Consultant for Razorfish and SAS.
Mr. Dinsmore has hands-on experience with leading commercial and open source tools for advanced analytics, including SAS, SPSS, R, and Oracle Data Mining, across a range of platforms including Hadoop, Netezza, Teradata and Oracle. He is certified in SAS 9.
In his career, Mr. Dinsmore has worked with more than 500 enterprises in the United States, Canada, Mexico, Venezuela, Chile, Brazil, the United Kingdom, Belgium, Italy, Turkey, Israel, Malaysia and Singapore.
The document discusses how to use the R programming language and Amazon's Elastic MapReduce service to quickly create a Hadoop cluster on Amazon Web Services in only 15 minutes. It demonstrates running a stochastic simulation to estimate pi by distributing 1,000 simulations across the Hadoop cluster and combining the results. The total cost of running the 15 minute cluster was only $0.15, showing how inexpensive it can be to leverage Hadoop's capabilities.
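The cluster setup itself is not reproduced here, but the core simulation pattern, 1,000 independent runs combined at the end, can be sketched in plain R. In this sketch, parallel::mclapply (Unix-alikes) stands in for the Hadoop cluster used in the document:

    # Sketch: estimate pi by distributing 1,000 stochastic simulations and
    # combining the results; on EMR the lapply would fan out over the cluster.
    library(parallel)

    one_sim <- function(i, n = 1e5) {
      x <- runif(n); y <- runif(n)
      mean(x^2 + y^2 <= 1)   # fraction of points inside the quarter circle
    }

    fractions <- mclapply(1:1000, one_sim, mc.cores = detectCores())
    4 * mean(unlist(fractions))   # ~3.1416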
27 Aug 2013 Webinar: High Performance Predictive Analytics in Hadoop and R, presented by Mario E. Inchiosa, PhD, US Data Scientist, and Kathleen Rohrecker, Director of Product Marketing
This is a reduced PDF version of the hardcover book available at http://www.lulu.com/shop/jeffrey-strickland/predictive-analytics-using-r/hardcover/product-22000910.html, at a 40% discount. It will soon be available on Amazon.
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics (Revolution Analytics)
Revolution R Enterprise is a big data analytics platform based on the open source statistical programming language R. It allows for high performance, scalable analytics on large datasets across enterprise platforms. The presentation discusses Revolution R Enterprise and how it addresses challenges with big data and accelerating analytics, including data volume, complex computation, enterprise readiness, and production efficiency. It also highlights how Revolution R Enterprise integrates with Teradata to enable in-database analytics for further performance improvements.
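As a rough illustration of the in-database pattern the presentation describes, here is a hedged sketch using RevoScaleR's Teradata constructs; the connection string, paths, table, and columns are all invented placeholders:

    # Sketch: push model fitting into Teradata so the data never leaves the
    # warehouse. All connection details and names are illustrative.
    library(RevoScaleR)

    tdConn <- "DRIVER=Teradata;DBCNAME=tdhost;UID=user;PWD=pass"

    # Compute context: rx* functions now execute inside the Teradata appliance.
    rxSetComputeContext(RxInTeradata(
      connectionString = tdConn,
      shareDir         = "/tmp/revoShare",   # local scratch dir (placeholder)
      remoteShareDir   = "/tmp/revoShare",   # scratch dir on the appliance
      revoPath         = "/usr/lib64/Revo/R/lib64/R"  # R location on the nodes
    ))

    loans <- RxTeradata(table = "loan_history", connectionString = tdConn)

    # Logistic regression fitted in-database, in parallel across the AMPs.
    model <- rxLogit(default ~ credit_score + ltv + dti, data = loans)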
Introducing Trillium DQ for Big Data: Powerful Profiling and Data Quality for... (Precisely)
The advanced analytics and AI that run today’s businesses rely on a larger volume, and greater variety, of data. This data needs to be of the highest quality to ensure the best possible outcomes, but traditional data quality tools weren’t designed for today’s modern data environments.
That’s why we’ve developed Trillium DQ for Big Data -- an integrated product that delivers industry-leading data profiling and data quality at scale, in the cloud or on premises.
In this on-demand webcast, you will learn how Trillium DQ:
• Empowers data analysts to easily profile large, diverse data sources to discover new insights, uncover issues, and report on their findings – all without involving IT.
• Delivers best-in-class entity resolution to support mission-critical applications such as Customer 360, fraud detection, AML, and predictive analytics.
• Supports Cloud and hybrid architectures by providing consistent high-performance processing within critical time windows on all platforms.
• Keeps enterprise data lakes validated, clean, and trusted with the highest quality data – without technical expertise in big data or distributed architectures.
• Enables data quality monitoring based on targeted business rules for data governance and business insight.
In this slidedeck, Infochimps Director of Product, Tim Gasper, discusses how Infochimps tackles business problems for customers by deploying a comprehensive Big Data infrastructure in days; sometimes in just hours. Tim unlocks how Infochimps is now taking that same aggressive approach to deliver faster time to value by helping customers develop analytic applications with impeccable speed.
Challenges of Operationalising Data Science in Production (iguazio)
The presentation topic for this meet-up was covered in two sections, without any breaks in between.
Section 1: Business Aspects (20 mins)
Speaker: Rasmi Mohapatra, Product Owner, Experian
https://www.linkedin.com/in/rasmi-m-428b3a46/
Once your data science application is in production, there are many typical data science operational challenges experienced today, across business domains; we will cover a few of these challenges with example scenarios.
Section 2: Tech Aspects (40 mins, slides & demo, Q&A )
Speaker: Santanu Dey, Solution Architect, Iguazio
https://www.linkedin.com/in/santanu/
In this part of the talk, we will cover how these operational challenges can be overcome, e.g. automating data collection & preparation, making ML models portable and deploying them in production, monitoring and scaling, etc., with relevant demos.
M Chambers and RapidMiner Overview for Babson class (mcAnalytics99)
RapidMiner is a modern analytics platform that enables anyone to leverage big data and accelerate time-to-value. Unlike traditional analytics providers, RapidMiner allows users of any skill level to make the most of all data in all environments. It provides a code-free interface that is built by data scientists for data scientists, business analysts, and developers to simplify analytics. RapidMiner also utilizes a knowledge base of analytic best practices and machine learning to empower users to become data science heroes.
The document discusses optimizing a data warehouse by offloading some workloads and data to Hadoop. It identifies common challenges with data warehouses like slow transformations and queries. Hadoop can help by handling large-scale data processing, analytics, and long-term storage more cost effectively. The document provides examples of how customers benefited from offloading workloads to Hadoop. It then outlines a process for assessing an organization's data warehouse ecosystem, prioritizing workloads for migration, and developing an optimization plan.
Architect’s Open-Source Guide for a Data Mesh Architecture (Databricks)
Data Mesh is an innovative concept addressing many data challenges from an architectural, cultural, and organizational perspective. But is the world ready to implement Data Mesh?
In this session, we will review the importance of core Data Mesh principles, what they can offer, and when it is a good idea to try a Data Mesh architecture. We will discuss common challenges with the implementation of Data Mesh systems and focus on the role of open-source projects in it. Projects like Apache Spark can play a key part in implementing a standardized infrastructure platform for Data Mesh. We will examine the landscape of useful data engineering open-source projects to utilize in several areas of a Data Mesh system in practice, along with an architectural example. We will touch on what work (culture, tools, mindset) needs to be done to ensure Data Mesh is more accessible for engineers in the industry.
The audience will leave with a good understanding of the benefits of Data Mesh architecture, common challenges, and the role of Apache Spark and other open-source projects for its implementation in real systems.
This session is targeted for architects, decision-makers, data-engineers, and system designers.
Building a Modern Analytic Database with Cloudera 5.8 (Cloudera, Inc.)
This document discusses building a modern analytic database with Cloudera. It outlines Marketing Associates' evaluation of solutions to address challenges around managing massive and diverse data volumes. They selected Cloudera Enterprise to enable self-service BI and real-time analytics at lower costs than traditional databases. The solution has provided scalability, cost savings of over 90%, and improved security and compliance. Future roadmaps for Cloudera's analytic database include faster SQL, improved multitenancy, and deeper BI tool integration.
Watch this webinar in full here: https://buff.ly/2MVTKqL
Self-Service BI promises to remove the bottleneck that exists between IT and business users. The truth is, if data is handed over to a wide range of data consumers without proper guardrails in place, it can result in data anarchy.
Attend this session to learn why data virtualization:
• Is a must for implementing the right self-service BI
• Makes self-service BI useful for every business user
• Accelerates any self-service BI initiative
Productionizing Hadoop: 7 Architectural Best Practices (MapR Technologies)
The document discusses 7 architectural best practices for productionizing Hadoop: experience, availability, performance, scalability, adaptability, security, and economy. It defines each quality and provides examples to illustrate how to achieve each one. The key message is that while big data is about innovation, productionizing analytics is critical to realize business value from all the data. Architectural best practices can help systems meet expectations for usefulness, uptime, speed, flexibility, security, and cost-efficiency as big data implementations scale up.
Customer value analysis of big data products (Vikas Sardana)
Business value analysis through a Customer Value Model for software technology choices, with a case study from the mobile advertising industry for a Big Data use case.
The document provides an overview of IBM's BigInsights product. It discusses how BigInsights can help businesses gain insights from large, complex datasets through features like built-in text analytics, SQL support, spreadsheet-style analysis, and accelerators for domain-specific analytics like social media. The document also summarizes capabilities of BigInsights like Big SQL, Big Sheets, Big R, and its text analytics engine that allow businesses to explore, analyze, and model large datasets.
It is a fascinating, explosive time for enterprise analytics.
It is from the position of analytics leadership that the mission will be executed and company leadership will emerge. The data professional is absolutely sitting on the performance of the company in this information economy and has an obligation to demonstrate the possibilities and originate the architecture, data, and projects that will deliver analytics. After all, no matter what business you’re in, you’re in the business of analytics.
The coming years will be full of big changes in enterprise analytics and Data Architecture. William will kick off the fourth year of the Advanced Analytics series with a discussion of the trends winning organizations should build into their plans, expectations, vision, and awareness now.
Bridging the Gap: from Data Science to Production (Florian Wilhelm)
A recent but quite common observation in industry is that although there is overall high adoption of data science, many companies struggle to get it into production. Huge teams of well-paid data scientists often present one fancy model after another to their managers, but their proofs of concept never manifest into something business-relevant. The frustration grows on both sides, managers and data scientists alike.
In my talk I elaborate on the many reasons why data science to production is such a hard nut to crack. I start with a taxonomy of data use cases in order to more easily assess technical requirements. Based thereon, my focus lies on overcoming the two-language problem: Python/R, loved by data scientists, vs. the enterprise-established Java/Scala. From my project experience I present three different solutions, namely 1) migrating to a single language, 2) reimplementation and 3) usage of a framework. The advantages and disadvantages of each approach are presented, and general advice based on the introduced taxonomy is given.
Additionally, my talk addresses organisational problems as well as problems in quality assurance and deployment. Best practices and further references are presented at a high level in order to cover all facets of data science to production.
With my talk I hope to convey the message that breakdowns on the road from data science to production are the rule rather than the exception, so you are not alone. At the end of my talk, you will have a better understanding of why your team and you are struggling and what to do about it.
Why Your Data Science Architecture Should Include a Data Virtualization Tool ... (Denodo)
Watch full webinar here: https://bit.ly/35FUn32
Presented at CDAO New Zealand
Advanced data science techniques, like machine learning, have proven to be extremely useful tools for deriving valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python, and Scala, put advanced techniques at the fingertips of data scientists.
However, most architectures laid out to enable data scientists miss two key challenges:
- Data scientists spend most of their time looking for the right data and massaging it into a usable format
- Results and algorithms created by data scientists often stay out of the reach of regular data analysts and business users
Watch this session on-demand to understand how data virtualization offers an alternative that addresses these issues and can accelerate data acquisition and massaging, along with a customer story on the use of Machine Learning with data virtualization.
Presented to eRum (Budapest), May 2018
There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and cross-validation of models are just a few examples. In this talk I'll describe the doAzureParallel package, a backend to the "foreach" package that automates the process of spawning a cluster of virtual machines in the Azure cloud to process iterations in parallel. This will include an example of optimizing hyperparameters for a predictive model using the "caret" package.
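Based on the package's documented workflow, the pattern looks roughly like this (file names and the iteration body are illustrative; the JSON files are generated templates you edit with your Azure account details):

    # Sketch of the doAzureParallel workflow: a foreach backend that runs
    # %dopar% iterations on a pool of Azure VMs.
    library(foreach)
    library(doAzureParallel)

    setCredentials("credentials.json")      # Azure Batch + storage keys
    generateClusterConfig("cluster.json")   # template: VM size, node count, ...
    cluster <- makeCluster("cluster.json")  # provisions the VM pool in Azure
    registerDoAzureParallel(cluster)        # %dopar% now fans out to Azure

    # 100 independent iterations are distributed across the cluster's workers.
    results <- foreach(i = 1:100, .combine = c) %dopar% {
      mean(rnorm(1e6))   # toy stand-in for a simulation or CV fold
    }

    stopCluster(cluster)

Because it is just a foreach backend, existing %dopar% code (including caret's internal parallelism) can switch from a local cluster to Azure by changing only the registration step.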
By David Smith. Presented at Microsoft Build (Seattle), May 7 2018.
Your data scientists have created predictive models using open-source tools, proprietary software, or some combination of both, and now you are interested in lifting and shifting those models to the cloud. In this talk, I'll describe how data scientists can transition their existing workflows — while using mostly the same tools and processes — to train and deploy machine learning models based on open source frameworks to Azure. I'll provide guidance on keeping connections to data sources up-to-date, evaluating and monitoring models, and deploying applications that make use of those models.
Presentation delivered by David Smith to NY R Conference https://www.rstats.nyc/, April 2018:
Minecraft is an open-world creativity game, and a hit with kids. To get kids interested in learning to program with R, we created the "miner" package. This package is a collection of simple functions that allow you to connect with a Minecraft instance, manipulate the world within by creating blocks and controlling the player, and to detect events within the world and react accordingly.
The miner package is intended mainly for kids, to inspire them to learn R while playing Minecraft. But the development of the package also provides some useful insights into how to build an R package to interface with a persistent API, and how to instruct others on its use. In this talk I'll describe how to set up your own Minecraft server, and how to use and extend the package. I'll also provide a few examples of the package in action in a live Minecraft session.
While Python is a widely-used tool for AI development, in this talk I'll make the case for considering R as a platform for developing models for intelligent applications. Firstly, R provides a first-class experience for working with deep learning frameworks through its keras integration. Equally importantly, it provides the most comprehensive suite of statistical data analysis tools, which are extremely useful for many intelligent applications such as transfer learning. I'll give a few high-level examples in this talk, and we'll go into further detail in the accompanying interactive code lab.
There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and cross-validation of models are just a few examples. In this talk I'll describe several techniques available in R to speed up workloads like these, by running multiple iterations simultaneously, in parallel.
Many of these techniques require the use of a cluster of machines running R, and I'll provide examples of using cloud-based services to provision clusters for parallel computations. In particular, I will describe how you can use the SparklyR package to distribute data manipulations using the dplyr syntax, on a cluster of servers provisioned in the Azure cloud.
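A minimal sketch of that sparklyr pattern (the master URL and data are illustrative; a real workload would read from cluster storage rather than copying a local data frame):

    # Sketch: distributed dplyr via sparklyr on a provisioned cluster.
    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "yarn-client")   # or "local" for testing

    # Ship a small local data frame to the cluster for demonstration.
    mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

    # Ordinary dplyr syntax, translated to Spark SQL and run on the cluster;
    # collect() brings only the small result back to the R session.
    mtcars_tbl %>%
      group_by(cyl) %>%
      summarise(n = n(), avg_mpg = mean(mpg)) %>%
      collect()

    spark_disconnect(sc)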
Presented by David Smith at Data Day Texas in Austin, January 27 2018.
The R Ecosystem consists of the R Foundation, which oversees the R programming language and its core development; the R Core Group, which maintains the R software; and CRAN, which distributes R packages. A large contributor and user community provides documentation, blogs, user groups, and additional software and services that support the widespread use of R.
A look at the changing perceptions of R, from the early days of the R project to today. Microsoft sponsor talk, presented by David Smith to the useR!2017 conference in Brussels, July 5 2017.
Predicting Loan Delinquency at One Million Transactions per Second (Revolution Analytics)
Real-time applications of predictive models must be able to generate predictions at the rate that transactions are generated. Previously, such applications of models trained using R needed to be converted to other languages like C++ or Java to achieve the required throughput. In this talk, I’ll describe how to use the in-database R processing capabilities of Microsoft R Server to detect fraud in a SQL Server database of loan records at a rate exceeding one million transactions per second. I will also show the process of training the underlying gradient-boosted tree model on a large training set using the out-of-memory algorithms of Microsoft R.
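A hedged sketch of that pattern with RevoScaleR (table, column, and connection names are invented, and this is not the actual benchmark code): train a gradient-boosted tree model out-of-memory, then move the compute context into SQL Server so scoring runs next to the data.

    # Sketch: train with RevoScaleR's out-of-memory boosted trees, then
    # score in-database. All names and connection details are illustrative.
    library(RevoScaleR)

    connStr  <- "Driver=SQL Server;Server=.;Database=loans;Trusted_Connection=yes"
    train    <- RxSqlServerData(table = "loan_train",  connectionString = connStr)
    scoreIn  <- RxSqlServerData(table = "loan_stream", connectionString = connStr)
    scoreOut <- RxSqlServerData(table = "loan_scores", connectionString = connStr)

    # Gradient-boosted trees over the full training table.
    model <- rxBTrees(delinquent ~ fico + ltv + dti + rate, data = train,
                      lossFunction = "bernoulli")

    # Switch the compute context so predictions execute inside SQL Server.
    rxSetComputeContext(RxInSqlServer(connectionString = connStr))
    rxPredict(model, data = scoreIn, outData = scoreOut)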
Presented by David Smith at The Data Science Summit, Chicago, April 20 2017.
The ability to independently reproduce results is a critical issue within the scientific community today, and is equally important for collaboration and compliance in business. In this talk, I'll introduce several features available in R that help you make reproducibility a standard part of your data science workflow. The talk will include tips on working with data and files, combining code and output, and managing R's changing package ecosystem.
Presented by David Smith, R Community Lead (Microsoft), at Monktoberfest October 2016.
The value of open source isn’t just in the software itself. The communities that form around open source software provide just as much value and sometimes even more: in ongoing development, in documentation, in support, in marketing, and as a supply of ready-trained employees. Companies who build on open source tend to focus on the software, but neglect communities at their peril.
In this talk, I share some of my experiences in building community for an open-source software company, Revolution Analytics, and perspectives since the acquisition by Microsoft in 2015.
R is more than just a language. Many of the reasons why R has become such a popular tool for data science come from the ecosystem surrounding the R project. R users benefit from the many resources and packages created by the community, while commercial companies (including Microsoft) provide tools to extend and support R, and services to help people use R.
In this talk, I will give an overview of the R Ecosystem and describe how it has been a critical component of R’s success, and include several examples of Microsoft’s contributions to the ecosystem.
(Presented to EARL London, September 2016)
(Presented by David Smith at useR!2016, June 2016. Recording: https://channel9.msdn.com/Events/useR-international-R-User-conference/useR2016/R-at-Microsoft )
Since the acquisition of Revolution Analytics in April 2015, Microsoft has embarked upon a project to build R technology into many Microsoft products, so that developers and data scientists can use the R language and R packages to analyze data in their data centers and in cloud environments.
In this talk I will give an overview (and a demo or two) of how R has been integrated into various Microsoft products. Microsoft data scientists are also big users of R, and I'll describe a couple of examples of R being used to analyze operational data at Microsoft. I'll also share some of my experiences in working with open source projects at Microsoft, and my thoughts on how Microsoft works with open source communities including the R Project.
Hadoop is famously scalable. Cloud Computing is famously scalable. R – the thriving and extensible open source Data Science software – not so much. But what if we seamlessly combined Hadoop, Cloud Computing, and R to create a scalable Data Science platform? Imagine exploring, transforming, modeling, and scoring data at any scale from the comfort of your favorite R environment. Now, imagine calling a simple R function to operationalize your predictive model as a scalable, cloud-based Web Service. Learn how to leverage the magic of Hadoop on-premises or in the cloud to run your R code, thousands of open source R extension packages, and distributed implementations of the most popular machine learning algorithms at scale.
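One way the "simple R function" step works in Microsoft R Server is the mrsdeploy package; a hedged sketch follows (endpoint, credentials, and the toy model are placeholders, not the talk's actual demo):

    # Sketch: operationalize an R model as a web service with mrsdeploy.
    library(mrsdeploy)

    remoteLogin("http://localhost:12800", username = "admin", password = "***")

    model <- lm(dist ~ speed, data = cars)   # toy model standing in for yours

    scoreFn <- function(speed) {
      predict(model, data.frame(speed = speed))
    }

    # One function call turns the model plus scoring function into a
    # versioned, callable web service.
    api <- publishService(
      name    = "stopping_distance",
      code    = scoreFn,
      model   = model,
      inputs  = list(speed = "numeric"),
      outputs = list(answer = "numeric"),
      v       = "v1.0.0"
    )

    api$scoreFn(21)   # invoke the deployed service from R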
The document discusses Revolution Analytics, a company that provides analytics software and services based on the open source R language. It was acquired by Microsoft to help customers use advanced analytics within Microsoft data platforms. The document provides overviews of R, data science in the cloud using Azure, connecting R to SQL, solving scalability issues with Revolution R Enterprise (RRE), using R in SQL Server, and moving analytics workflows to the cloud.
The document compares the CRAN and BioConductor package dependency networks in terms of their topological properties. It finds that CRAN has more nodes and edges than BioConductor, but BioConductor is more densely connected. A statistical test shows the networks have significantly different degree distributions, though both approximately follow power laws.
This document discusses using graph analysis and PageRank algorithms to analyze the network structure and popularity of R packages over time. It contains the steps to create a dependency graph from package metadata, plot and export the graph, and shows the top 10 most popular packages according to PageRank in 2012 and 2015. Further analysis of the network structure is suggested.
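A compact sketch of those steps with igraph (run against today's CRAN, so the rankings will differ from the document's 2012 and 2015 snapshots):

    # Sketch: build a package dependency graph from CRAN metadata and rank
    # packages with PageRank. Requires a network connection to a CRAN mirror.
    library(igraph)

    pkgs <- available.packages()
    deps <- tools::package_dependencies(rownames(pkgs), db = pkgs,
                                        which = c("Depends", "Imports"))

    # Edge "A -> B" means package A depends on package B, so heavily
    # depended-upon packages accumulate PageRank.
    edges <- do.call(rbind, lapply(names(deps), function(p) {
      if (length(deps[[p]])) cbind(from = p, to = deps[[p]])
    }))

    g  <- graph_from_data_frame(as.data.frame(edges))
    pr <- page_rank(g)$vector

    head(sort(pr, decreasing = TRUE), 10)   # top 10 packages by PageRank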
Checkpoint provides a simple way to ensure reproducible results in R. It works by adding two lines to an R script that install the checkpoint package and specify a date. Checkpoint will then install all packages used in the script to the versions that were available on that date. This allows sharing of R code and ensuring others can reproduce results even if package versions change later. Checkpoint manages dependencies and installation of correct package versions for reproducibility across systems and time.
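The two-line pattern the summary describes, with an illustrative script body:

    # Sketch: the date pins every package used in this script to its CRAN
    # snapshot from that day, installing matching versions if needed.
    library(checkpoint)
    checkpoint("2015-04-26")

    # From here on, library() calls resolve to the 2015-04-26 versions, so
    # the script reproduces the same results on other machines and dates.
    library(ggplot2)
    ggplot(cars, aes(speed, dist)) + geom_point()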
This document summarizes R at Microsoft, including its acquisition of Revolution Analytics. Key points include:
- R is a widely used open-source statistical programming language. Microsoft acquired Revolution Analytics to help customers use advanced analytics within its data platforms.
- Microsoft products like SQL Server will integrate Revolution R Open, an enhanced open-source R distribution, to allow running R scripts directly from SQL queries.
- Microsoft aims to make R and advanced analytics more accessible and scalable through products like the Machine Learning marketplace and by running R on servers to handle large datasets within SQL Server and Azure.
How Netflix Builds High Performance Applications at Global Scale (ScyllaDB)
We all want to build applications that are blazingly fast. We also want to scale them to users all over the world. Can the two happen together? Can users in the slowest of environments also get a fast experience? Learn how we do this at Netflix: how we understand every user's needs and preferences and build high performance applications that work for every user, every time.
How to Avoid Learning the Linux-Kernel Memory Model (ScyllaDB)
The Linux-kernel memory model (LKMM) is a powerful tool for developing highly concurrent Linux-kernel code, but it also has a steep learning curve. Wouldn't it be great to get most of LKMM's benefits without the learning curve?
This talk will describe how to do exactly that by using the standard Linux-kernel APIs (locking, reference counting, RCU) along with simple rules of thumb, thus gaining most of LKMM's power with less learning. And the full LKMM is always there when you need it!
Coordinate Systems in FME 101 - Webinar Slides (Safe Software)
If you’ve ever had to analyze a map or GPS data, chances are you’ve encountered and even worked with coordinate systems. As historical data continually updates through GPS, understanding coordinate systems is increasingly crucial. However, not everyone knows why they exist or how to effectively use them for data-driven insights.
During this webinar, you’ll learn exactly what coordinate systems are and how you can use FME to maintain and transform your data’s coordinate systems in an easy-to-digest way, accurately representing the geographical space that it exists within. You will have the chance to:
- Enhance Your Understanding: Gain a clear overview of what coordinate systems are and their value
- Learn Practical Applications: Why we need datums and projections, plus how units differ between coordinate systems
- Maximize with FME: Understand how FME handles coordinate systems, including a brief summary of the 3 main reprojectors
- Custom Coordinate Systems: Learn how to work with FME and coordinate systems beyond what is natively supported
- Look Ahead: Gain insights into where FME is headed with coordinate systems in the future
Don’t miss the opportunity to improve the value you receive from your coordinate system data, ultimately allowing you to streamline your data analysis and maximize your time. See you there!
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em... (Erasmo Purificato)
Slides of the tutorial entitled "Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Emerging Trends", held at UMAP'24: 32nd ACM Conference on User Modeling, Adaptation and Personalization (July 1, 2024 | Cagliari, Italy)
Navigating Post-Quantum Blockchain: Resilient Cryptography in Quantum Threats (anupriti)
In the rapidly evolving landscape of blockchain technology, the advent of quantum computing poses unprecedented challenges to traditional cryptographic methods. As quantum computing capabilities advance, the vulnerabilities of current cryptographic standards become increasingly apparent.
This presentation, "Navigating Post-Quantum Blockchain: Resilient Cryptography in Quantum Threats," explores the intersection of blockchain technology and quantum computing. It delves into the urgent need for resilient cryptographic solutions that can withstand the computational power of quantum adversaries.
Key topics covered include:
An overview of quantum computing and its implications for blockchain security.
Current cryptographic standards and their vulnerabilities in the face of quantum threats.
Emerging post-quantum cryptographic algorithms and their applicability to blockchain systems.
Case studies and real-world implications of quantum-resistant blockchain implementations.
Strategies for integrating post-quantum cryptography into existing blockchain frameworks.
Join us as we navigate the complexities of securing blockchain networks in a quantum-enabled future. Gain insights into the latest advancements and best practices for safeguarding data integrity and privacy in the era of quantum threats.
Scaling Connections in PostgreSQL - Postgres Bangalore (PGBLR) Meetup-2 (Mydbops)
This presentation, delivered at the Postgres Bangalore (PGBLR) Meetup-2 on June 29th, 2024, dives deep into connection pooling for PostgreSQL databases. Aakash M, a PostgreSQL Tech Lead at Mydbops, explores the challenges of managing numerous connections and explains how connection pooling optimizes performance and resource utilization.
Key Takeaways:
* Understand why connection pooling is essential for high-traffic applications
* Explore various connection poolers available for PostgreSQL, including pgbouncer
* Learn the configuration options and functionalities of pgbouncer
* Discover best practices for monitoring and troubleshooting connection pooling setups
* Gain insights into real-world use cases and considerations for production environments
This presentation is ideal for:
* Database administrators (DBAs)
* Developers working with PostgreSQL
* DevOps engineers
* Anyone interested in optimizing PostgreSQL performance
Contact info@mydbops.com for PostgreSQL Managed, Consulting and Remote DBA Services
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo... (Chris Swan)
Have you noticed the OpenSSF Scorecard badges on the official Dart and Flutter repos? It's Google's way of showing that they care about security. Practices such as pinning dependencies, branch protection, required reviews, continuous integration tests etc. are measured to provide a score and accompanying badge.
You can do the same for your projects, and this presentation will show you how, with an emphasis on the unique challenges that come up when working with Dart and Flutter.
The session will provide a walkthrough of the steps involved in securing a first repository, and then what it takes to repeat that process across an organization with multiple repos. It will also look at the ongoing maintenance involved once scorecards have been implemented, and how aspects of that maintenance can be better automated to minimize toil.
Video traffic on the Internet is constantly growing; networked multimedia applications consume a predominant share of the available Internet bandwidth. A major technical breakthrough and enabler in multimedia systems research and of industrial networked multimedia services certainly was the HTTP Adaptive Streaming (HAS) technique. This resulted in the standardization of MPEG Dynamic Adaptive Streaming over HTTP (MPEG-DASH) which, together with HTTP Live Streaming (HLS), is widely used for multimedia delivery in today’s networks. Existing challenges in multimedia systems research deal with the trade-off between (i) the ever-increasing content complexity, (ii) various requirements with respect to time (most importantly, latency), and (iii) quality of experience (QoE). Optimizing towards one aspect usually negatively impacts at least one of the other two aspects if not both. This situation sets the stage for our research work in the ATHENA Christian Doppler (CD) Laboratory (Adaptive Streaming over HTTP and Emerging Networked Multimedia Services; https://athena.itec.aau.at/), jointly funded by public sources and industry. In this talk, we will present selected novel approaches and research results of the first year of the ATHENA CD Lab’s operation. We will highlight HAS-related research on (i) multimedia content provisioning (machine learning for video encoding); (ii) multimedia content delivery (support of edge processing and virtualized network functions for video networking); (iii) multimedia content consumption and end-to-end aspects (player-triggered segment retransmissions to improve video playout quality); and (iv) novel QoE investigations (adaptive point cloud streaming). We will also put the work into the context of international multimedia systems research.
Performance Budgets for the Real World by Tammy Everts (ScyllaDB)
Performance budgets have been around for more than ten years. Over those years, we’ve learned a lot about what works, what doesn’t, and what we need to improve. In this session, Tammy revisits old assumptions about performance budgets and offers some new best practices. Topics include:
• Understanding performance budgets vs. performance goals
• Aligning budgets with user experience
• Pros and cons of Core Web Vitals
• How to stay on top of your budgets to fight regressions
Sustainability requires ingenuity and stewardship. Did you know Pigging Solutions pigging systems help you achieve your sustainable manufacturing goals AND provide rapid return on investment?
How? Our systems recover over 99% of product in transfer piping. Recovering trapped product from transfer lines that would otherwise become flush waste means you can increase batch yields and eliminate flush waste. From raw materials to finished product, if you can pump it, we can pig it.
AC Atlassian Coimbatore Session Slides (22/06/2024) (apoorva2579)
These are the combined sessions from the ACE Atlassian Coimbatore event held on 22nd June 2024.
The session order is as follows:
1. AI and future of help desk by Rajesh Shanmugam
2. Harnessing the power of GenAI for your business by Siddharth
3. Fallacies of GenAI by Raju Kandaswamy
How RPA Help in the Transportation and Logistics Industry.pptx (SynapseIndia)
Revolutionize your transportation processes with our cutting-edge RPA software. Automate repetitive tasks, reduce costs, and enhance efficiency in the logistics sector with our advanced solutions.
Are you interested in dipping your toes in the cloud native observability waters, but as an engineer you are not sure where to get started with tracing problems through your microservices and application landscapes on Kubernetes? Then this is the session for you, where we take you on your first steps in an active open-source project that offers a buffet of languages, challenges, and opportunities for getting started with telemetry data.
The project is called OpenTelemetry, but before diving into the specifics, we'll start by de-mystifying key concepts and terms such as observability, telemetry, instrumentation, cardinality, and percentiles to lay a foundation. After understanding the nuts and bolts of observability and distributed traces, we'll explore the OpenTelemetry community: its Special Interest Groups (SIGs), its repositories, and how to become not only an end user but possibly a contributor. We will wrap up with an overview of the components in this project, such as the Collector, the OpenTelemetry protocol (OTLP), its APIs, and its SDKs.
Attendees will leave with an understanding of key observability concepts, become grounded in distributed tracing terminology, be aware of the components of OpenTelemetry, and know how to take their first steps toward an open-source contribution!
Key Takeaways: Open source, vendor-neutral instrumentation is an exciting new reality as the industry standardizes on OpenTelemetry for observability. OpenTelemetry is on a mission to enable effective observability by making high-quality, portable telemetry ubiquitous. The world of observability and monitoring today has a steep learning curve, and in order to achieve ubiquity, the project would benefit from growing its contributor community.
An invited talk given by Mark Billinghurst on Research Directions for Cross Reality Interfaces. This was given on July 2nd 2024 as part of the 2024 Summer School on Cross Reality in Hagenberg, Austria (July 1st - 7th)
3. Today's Challenge: Accelerating Business Cadence
Changing Business Environment
• Fact-Based Decisions Require More Data
• Need to Understand Tradeoffs and Best Course of Action
• Predictive Models Need to Continually Deliver Lift
• Reduced Shelf Life for Predictive Models
Faster Time to Value
• Reduce Analytic Cycle Time
• Build & Deploy Models Faster
• Eliminate Time-Consuming Data Movements
Rapid Customer-Facing Decisions
• Score More Frequently
• Need to Make the Best Decision in Real Time
5. Typical Technology Challenges Our Customers Face
Big Data
• New Data Sources
• Data Variety & Velocity
• Fine-Grain Control
• Data Movement, Memory Limits
Complex Computation
• Experimentation
• Many Small Models
• Ensemble Models
• Simulation
Enterprise Readiness
• Heterogeneous Landscape
• Write Once, Deploy Anywhere
• Skill Shortage
• Production Support
Production Efficiency
• Shorter Model Shelf Life
• Volume of Models
• Long End-to-End Cycle Time
• Accelerated Pace of Decisions
12. Big Data Big Analytics Use Cases
One Big Model
• Build predictive models with (very) large datasets
• More rows/observations and/or more columns/features
• Tend to use dimension reduction, machine learning, and/or ensemble techniques
Big Data Scoring
• Score and predict with (very) large datasets using a previously built model
• Score in batch or on individual transactions
• The previously built model may be exported from the model-build to the model-deployment environment
Many Small Models (see the sketch after this list)
• Model factories build predictive models in quantity
• Automated building of individualized models and/or parallel individualized model execution
Scoring Many Models
• Score and predict with many individualized models
• Production model factories require model management
Computationally Intensive Analytics
• Analytic models that are mathematically intense
• May not use large data sets but generate a lot of interim calculations
• May include vectorization, simulation, optimization
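Since "many small models" is essentially split-apply-combine over segments, a toy sketch in plain R illustrates the pattern (the mtcars example and all names here are mine, not from the deck; a production model factory would add the model management noted above):

```r
# Toy "model factory": one model per segment, then per-segment scoring.
# Illustrative only; a real factory adds management, versioning, parallelism.
segments <- split(mtcars, mtcars$cyl)     # one training set per segment
models <- lapply(segments, function(d) lm(mpg ~ wt + hp, data = d))
scores <- Map(predict, models, segments)  # score each segment with its own model
str(scores, max.level = 1)
```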
13. Big Data Big Analytics: Specialized Use Cases
Time Series Analytics
• Build forecasts with time-sequenced data
• For Big Data, these tend to be many small models, especially with machine data
• Due to typical Big Data volumes, requires model management
Text and Document Analytics
• Use of unstructured, free text
• For Big Data, typically used to enhance structured predictive analytics
• Minimally requires text-processing tools and may also require natural language processing
Mining Data Streams
• Analyzing continuous, high-speed data flows for patterns and acting upon the patterns in real time
• Requires specialized sampling and filtering techniques
• Uses distinct discovery analytics methods such as frequent itemsets or clustering
Zero Latency
• No separation of model building and model scoring
• As real-time data becomes more widely available, this emerging category reduces time-to-insight
14. Analytic Reference Architecture
[Diagram: a four-layer stack: Decision (Analytic Applications); Integration (Middleware); Analytics (Analytics Development Tools & Platforms); Data (Hadoop, Data Warehouse, Other Data Sources).]
15. Architectural Approaches to Analytics
[Diagram: two side-by-side stacks. In the Beside Architecture, the Decision (Analytic Applications), Integration (Middleware), and Analytics (Analytics Development Tools & Platforms with a Local Data Mart) layers sit apart from the Data layer (Data Sources). In the Inside Architecture, the Analytics Development Tools & Platforms run within a combined Data+Analytics layer on the data sources themselves.]
16. Pros & Cons of Architectural Approaches
Beside Architecture
• Analytic workflow tasks are performed in a separate analytics environment outside of the source database
• Pros: Segregates the analytic workload
• Cons: Doesn't leverage the production platform for transformations; introduces scoring latencies
Inside Architecture
• Analytics workflow tasks are performed inside the source database with embedded analytics
• Pros: Eliminates data movement, reduces model latency, allows exploration of all data
• Cons: IT governance on production; potentially new skills required
Hybrid Architecture
• Some analytic workflow tasks are performed inside the source database and others in a separate analytics environment
• Pros: Leverages the strengths of each architecture
• Cons: Multiple environments to maintain
17. Building & Deploying Analytic Models
[Diagram: the Beside, Inside, and Hybrid architectures annotated with numbered model-lifecycle steps. Legend: Data Prep / Marshaling; Model Build; Model Deploy; Model Recode / PMML; Update Data. The original figure maps these stages onto each architecture's components (Analytics Development Tools & Platforms, Local Data Mart, Data+Analytics, Data Sources).]
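To make the "Model Recode / PMML" step concrete: a model built in a beside environment can be exported as PMML for deployment elsewhere. A minimal sketch using the open source pmml package (my example, not from the deck; assumes the pmml and XML packages are installed):

```r
# Sketch: recode an R model as PMML so a separate scoring environment
# can deploy it without re-implementing the model. Illustrative only.
library(pmml)                                # also attaches the XML package it builds on
fit <- lm(mpg ~ wt + hp, data = mtcars)      # model built "beside"
saveXML(pmml(fit), file = "mpg_model.pmml")  # hand off to the deployment environment
```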
24. What is the R Language?
A Platform…
• A Procedural Language for Stats, Math and Data Science
• A Complete Data Visualization Framework
• Provided as Open Source
A Community…
• 2M+ Users with the Skill to Tackle Big Data Statistical and Numerical Analysis and Machine Learning Projects
• Active User Groups Across the World
An Ecosystem…
• CRAN: 4500+ Freely Available Algorithms, Test Data and Evaluations
25. Revolution R Enterprise
Revolution R Enterprise is the only enterprise big data big analytics platform based on the open source R statistical computing language.
• Portable Across Enterprise Platforms
• High Performance, Scalable Analytics
• Easier to Build & Deploy
26. R is open source and drives analytic innovation, but it has some limitations for Enterprises
• Big Data: open source R is memory-bound; Revolution R Enterprise adds disk-based scalability
• Speed of Analysis: open source R is single-threaded; Revolution R Enterprise adds parallel threading
• Enterprise Readiness: open source R relies on community support; Revolution R Enterprise adds commercial support
• Analytic Breadth & Depth: open source R offers 4500+ innovative analytic packages; Revolution R Enterprise leverages those open source packages plus Big Data-ready packages
• Commercial Viability: deploying open source carries risk; Revolution R Enterprise is available under a commercial license
27. Introducing Revolution R Enterprise: The Big Data Big Analytics Platform
[Diagram: the Revolution R Enterprise component stack. R+CRAN and RevoR form the language interpreter and standard R algorithm suites; DevelopR and DeployR provide the development & deployment tooling; DistributedR, ConnectR, and ScaleR form the Big Data distributed execution platform.]
28. Big Data Speed @ Scale with Revolution R Enterprise
• Fast Math Libraries
• Parallelized Algorithms
• In-Database Execution
• Multi-Threaded Execution
• Multi-Core Processing
• In-Hadoop Execution
• Memory Management
• Parallelized User Code
First, we enhance and accelerate the open source R interpreter.
29. Open Source R Performance: Multi-Threaded Math
Computation (4-core laptop)       | Open Source R | Revolution R Enterprise | Speedup
Linear Algebra¹                   |               |                         |
  Matrix Multiply                 | 176 sec       | 9.3 sec                 | 18x
  Cholesky Factorization          | 25.5 sec      | 1.3 sec                 | 19x
  Linear Discriminant Analysis    | 189 sec       | 74 sec                  | 3x
General R Benchmarks²             |               |                         |
  R Benchmarks (Matrix Functions) | 22 sec        | 3.5 sec                 | 5x
  R Benchmarks (Program Control)  | 5.6 sec       | 5.4 sec                 | Not appreciable
1. http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php
2. http://r.research.att.com/benchmarks/
Customers report 3-50x performance improvements compared to open source R, without changing any code.
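The linear-algebra gains come largely from RevoR linking R against multi-threaded math libraries. For context, a matrix-multiply timing of this kind can be reproduced on any R build (the matrix size below is my choice, not the one used in the cited benchmark):

```r
# Time a dense matrix multiply; elapsed time depends heavily on whether
# R is linked against a multi-threaded BLAS (as RevoR is) or reference BLAS.
set.seed(42)
n <- 2000
a <- matrix(rnorm(n * n), n, n)
b <- matrix(rnorm(n * n), n, n)
system.time(a %*% b)["elapsed"]
```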
30. Big Data Speed @ Scale with Revolution R Enterprise (continued)
[The same capability list as slide 28.]
Second, we built a platform for hosting R with Big Data on a variety of massively parallel platforms.
31. Revolution R Enterprise DistributedR
Innovative memory management, multi-threaded execution, and multi-core processing:
• A Revolution R Enterprise ScaleR analytic is given a data source as input.
• The analytic loops over the data, reading one block at a time.
• Blocks of data are read by a separate worker thread (Thread 0).
• Worker threads (Threads 1..n) process the data block from the previous iteration of the loop and update intermediate results objects in memory.
• When all of the data has been processed, the intermediate results objects are combined into a master results object.
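The same update-and-combine idea can be sketched in plain open source R. This is illustrative only, not the actual ScaleR or DistributedR internals (all names here are mine); it computes a mean over a file too large to load at once:

```r
# Block-wise mean: read a block, update intermediate results, repeat,
# then combine into the final (master) result. Single-threaded sketch;
# DistributedR additionally spreads block processing across threads/cores.
block_mean <- function(path, block_size = 1e6) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0; n <- 0                 # intermediate results object
  repeat {
    block <- scan(con, what = numeric(), n = block_size, quiet = TRUE)
    if (length(block) == 0) break    # end of data
    total <- total + sum(block)      # update step
    n <- n + length(block)
  }
  total / n                          # combine into master result
}
```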
33. SAS HPA Benchmarking Comparison*
Logistic Regression | SAS HPA      | Revolution R Enterprise
Rows of data        | 1 billion    | 1 billion
Parameters          | "just a few" | 7
Time                | 80 seconds   | 44 seconds
Data location       | In memory    | On disk
Nodes               | 32           | 5
Cores               | 384          | 20
RAM                 | 1,536 GB     | 80 GB
Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM. Revolution R Enterprise delivers this performance at roughly 2% of the cost.
*As published by SAS in HPC Wire, April 21, 2011
34. Revolution R Enterprise ScaleR: High Performance Big Data Analytics
Data Prep, Distillation & Descriptive Analytics
R Data Step
• Data import: delimited, fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)
Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation, Variance
• Correlation, Covariance
• Sum of Squares (cross-product matrix for set variables)
• Pairwise Cross Tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
Statistical Tests
• Chi-Square Test
• Kendall Rank Correlation
• Fisher's Exact Test
• Student's t-Test
Sampling
• Subsample (observations & variables)
• Random Sampling
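As a flavor of how these data-step and descriptive functions are used, here is a hedged sketch based on RevoScaleR's documented rx* pattern (file paths and variable names are placeholders; treat the exact signatures as assumptions):

```r
# Sketch: import a delimited file into the XDF format, then compute
# descriptive statistics, all block-wise rather than fully in memory.
library(RevoScaleR)                      # ships with Revolution R Enterprise
rxImport(inData = "transactions.csv",    # placeholder path
         outFile = "transactions.xdf",
         overwrite = TRUE)
rxSummary(~ amount + region,             # placeholder variables
          data = "transactions.xdf")     # means, std dev, min/max, counts
```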
35. Revolution R Enterprise ScaleR: High Performance Big Data Analytics (continued)
Statistical Modeling
• Sum of Squares (cross-product matrix for set variables)
• Multiple Linear Regression
• Logistic Regression
• Generalized Linear Models (GLM): all exponential-family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions including cauchit, identity, log, logit, probit; user-defined distributions & link functions
• Covariance & Correlation Matrices
Predictive Models
• Predictions/scoring for models
• Residuals for all models
Decision Trees (Classification & Machine Learning)
• Classification & Regression Trees
Cluster Analysis
• K-Means
Data Visualization
• Histogram, Line Plot, Scatter Plot
• Lorenz Curve
• ROC Curves (actual data and predicted values)
Simulation
• Monte Carlo
Variable Selection
• Stepwise Regression (for linear regression)
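A hedged sketch of the modeling-and-scoring flow with these functions, again following RevoScaleR's documented rx* naming (the data sets and variables are placeholders of my own):

```r
# Sketch: fit a logistic regression on an XDF file, then score new data.
library(RevoScaleR)
model <- rxLogit(default ~ income + age,   # placeholder formula
                 data = "loans.xdf")       # block-wise, out-of-core fit
rxPredict(model,
          data = "new_loans.xdf",          # placeholder scoring set
          outData = "scored_loans.xdf")    # writes predicted probabilities
```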
36. Unparalleled Big Data Big Analytics Scale, Performance & Innovation
[Diagram: "1 + 1 = 1000's". Performance Enhanced R (the R language plus open source R analytic packages) combined with Big Data distributed & parallel processing and analytic packages yields Revolution R Enterprise, with value rising alongside performance.]
37. Leveraging CRAN with DistributedR & ScaleR
Big Data Distillation
• An R programmer uses RRE ScaleR to reduce dimensionality first, then feeds the reduced data set into open source packages: the computationally intensive portion is accelerated by ScaleR while any of the plethora of open source packages can still be leveraged.
Big Data Threading
• An R programmer uses RRE ScaleR to execute algorithms designed for SMP environments in parallel via DistributedR (e.g., Monte Carlo simulation); see the sketch after this list.
Supercharge an Open Source Package with RRE
• An R programmer re-engineers a CRAN routine by replacing an open source function inside an R-based algorithm with the equivalent ScaleR function(s).
High Performance Custom Algorithms
• An R programmer uses the RRE high-throughput extreme data format (XDF) to apply any combination of open source functions and logic while chunking through an XDF file, overcoming open source R's memory limitations.
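DistributedR's own API is not shown in the deck, so as a stand-in, here is the same "thread an SMP algorithm" idea expressed with base R's parallel package (a sketch under that assumption, not the DistributedR interface):

```r
# Sketch: parallel Monte Carlo estimation of pi across local cores.
library(parallel)
n_draws <- 1e6
cl <- makeCluster(detectCores())
hits <- parSapply(cl, rep(n_draws / 4, 4), function(n) {
  x <- runif(n); y <- runif(n)
  sum(x^2 + y^2 <= 1)          # draws landing inside the quarter circle
})
stopCluster(cl)
4 * sum(hits) / n_draws        # Monte Carlo estimate of pi
```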
39. Big Analytics on Big Data in Hadoop
100% R on Hadoop
• Full skill transfer: no Java needed
• Use 4500+ CRAN packages
• Blend and combine R with other tools and methods
100% Portability
• Build once, deploy many
• Track the evolution of Hadoop
• Protect against platform uncertainty; avoid platform lock-in
Hadoop Performance & Scale
• Leverage Hadoop parallelism easily
• Analyze data without moving it
[Diagram: Data, Analytics, and Applications layered on Hadoop, combining scalable compute with parallel storage (HDFS, HBase, Hive).]
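In practice, "leverage Hadoop parallelism easily" meant switching the ScaleR compute context to Hadoop. A hedged sketch following RevoScaleR's documented compute-context pattern (host names, paths, and variables are placeholders):

```r
# Sketch: point ScaleR at a Hadoop cluster; subsequent rx* calls run there
# as MapReduce jobs, with no Java or hand-written MapReduce required.
library(RevoScaleR)
rxSetComputeContext(RxHadoopMR(sshUsername = "analyst",      # placeholder
                               sshHostname = "hadoop-edge")) # placeholder
hdfs <- RxHdfsFileSystem()
flights <- RxTextData("/data/flights.csv", fileSystem = hdfs) # placeholder path
model <- rxLinMod(ArrDelay ~ DayOfWeek, data = flights)       # runs in Hadoop
```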
40. Revolution R Enterprise + Cloudera Propels Enterprises into the Future
[Diagram: the analytic reference architecture with the layers filled in: Decision (Analytic Applications); Integration (Middleware); Analytics (Revolution R Enterprise, the Big Data Big Analytics platform); Data (Cloudera, the data management platform).]
41. Revolution R Enterprise Powers Write Once, Deploy Anywhere
[Diagram: the Beside, Inside, and Hybrid architectures from slide 17, now with Revolution R Enterprise as the analytics layer and Cloudera as the data platform, linked by direct connectors; the same model build/deploy legend applies.]
Bottom Line: Save Time, Save Money, Get Insights Faster
• Direct connectors access data without data movement
• Push-down execution analyzes data without movement
• Use the same R script on any platform without recoding
• Use the right architecture for the job!
42. Revolution R Enterprise Inside Cloudera
[Diagram: Revolution R Enterprise (R+CRAN, RevoR, DistributedR, ConnectR, ScaleR, DeployR) running inside Cloudera for Big Data Big Analytics: data transformation, model building & scoring. Data sources feeding in: machine data, new data sources, data suppliers, traditional sources, and IBM mainframe. Consumption tiers: business analysts (Alteryx, Tableau, QlikView, Cognos, MicroStrategy, Datameer, etc.), power analysts (RStudio, DevelopR, etc.), and line-of-business users (analytic apps, rules engines, etc.).]
43. QuickStart Programs Deliver Value Quickly
• Offered by both Cloudera and Revolution Analytics
• Combine software, services, and training
• Cloudera can help you get started with Hadoop in a few ways
• Revolution Analytics helps you realize value from R + Hadoop
44. Summary
Revolution R Enterprise and Cloudera Hadoop bring together best-of-breed technologies to deliver:
• Highly scalable, high performance machine learning on data residing in Hadoop
• A familiar R programming environment that makes analytics at scale accessible and easy for R users
• Full-lifecycle analytics, from ad-hoc analysis to production analytics, in one managed environment, with the ability to integrate disparate data sources in one repository
• Deep integration between Revolution R Enterprise and Cloudera for a seamless operational experience managing both products