The document discusses machine learning and Hadoop. It begins by outlining machine learning truths for industrial applications, then describes the current state of machine learning on Hadoop, which relies heavily on Apache Mahout. However, Mahout has limitations. The document concludes that the future lies in moving beyond MapReduce to platforms like Spark, GraphLab, and AllReduce that can better support machine learning workloads at scale.
Machine Learning and Hadoop: Present and Future
1. Machine Learning and Hadoop: Present and Future
Josh Wills
Cloudera Data Science Team
September 6th, 2012
2. About Me
Copyright 2012 Cloudera Inc. All rights reserved
3. Outline
• Part 1: Industrial Machine Learning
• Part 2: ML and Hadoop: The State of the World
• Part 3: ML and Hadoop: Where Things are Headed
4. (Academic) ML vs. (Academic) Statistics
“Machine learning is statistics minus any checking of models and assumptions.”
-- Brian Ripley, UseR! 2004 (provocatively paraphrased)
5. Industrial Machine Learning: Truth #1
The thing that we are trying to predict is rarely the thing that we are trying to optimize.
6. Industrial Machine Learning: Truth #2
Systems precede algorithms.
8. Implication
Data science requires prediction-oriented machine learning models AND classical, rigorous statistical analysis.
9. Outline
• Part 1: Industrial Machine Learning
• Part 2: ML and Hadoop: The State of the World
• Part 3: ML and Hadoop: Where Things are Headed
10. “Hadoop. It’s Where The Data Is.”
11. Hadoop Platform: Substrate
• Commodity servers
• Open source operating system
• “” Configuration Management
• “” Coordination Service
• “” File System API
• “” Efficient and Extensible File Formats
• “” Efficient and Extensible RPC Libraries
13. ML and Hadoop: The State of the World
14. MapReduce
• Great for:
  • Data Preparation
  • Feature Engineering
  • Model Validation/Evaluation
• Works Well For Certain Model Fitting Problems
  • Collaborative Filtering Algorithms
  • Expectation Maximization
  • Decision Trees (PLANET; Gradient Boosted Decision Trees)
• Not A Practical Option for Many Kinds of Problems
• Way More Detail in the KDD 2011 Talk
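To make the "great for feature engineering" point concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps. It is not Hadoop code: `run_mapreduce`, `click_mapper`, `feature_reducer`, and the toy click log are hypothetical names and data invented for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for a MapReduce job (hypothetical helper)."""
    groups = defaultdict(list)
    # Map + shuffle: each record emits (key, value) pairs, grouped by key.
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce: collapse each key's values into one output.
    return {key: reducer(key, values) for key, values in groups.items()}


def click_mapper(record):
    """Emit (user_id, dwell_seconds) for one log record (assumed layout)."""
    user_id, dwell_seconds = record
    yield user_id, dwell_seconds


def feature_reducer(user_id, dwell_times):
    """Build per-user features from all of that user's dwell times."""
    return {
        "clicks": len(dwell_times),
        "mean_dwell": sum(dwell_times) / len(dwell_times),
    }


# Toy click log: (user_id, dwell_seconds). Illustrative data only.
log = [("u1", 10.0), ("u2", 4.0), ("u1", 20.0)]
features = run_mapreduce(log, click_mapper, feature_reducer)
```

Per-key aggregation like this needs only one pass over the data, which is why it fits MapReduce well. Iterative model fitting, which re-reads the data on every pass, is where the approach gets awkward.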
15. Apache Mahout
• The starting place for MapReduce-based machine learning algorithms
• Not machine-learning-in-a-box
  • Custom tweaks/modifications are the rule
• A disparate collection of algorithms for:
  • Recommendations
  • Clustering
  • Classification
  • Frequent Itemset Mining
16. Apache Mahout (cont.)
• Best Library: Taste Recommender
  • Oldest project, most widely-deployed in production
  • SVD implementation is particularly active
• Good Libraries: Online SGD
  • Does not use MapReduce
  • Vowpal Wabbit is faster, has L-BFGS option
• Roll Your Own Instead: Naïve Bayes
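For readers unfamiliar with online SGD, the following is a minimal sketch of the general technique: logistic regression updated one example at a time. It is not Mahout's or Vowpal Wabbit's implementation. The function names, learning rate, and toy data are assumptions made for illustration.

```python
import math


def predict_proba(weights, features):
    """Logistic prediction; the score is clamped to avoid overflow in exp."""
    z = sum(w * x for w, x in zip(weights, features))
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def sgd_logistic(stream, n_features, learning_rate=0.1):
    """Online SGD: update the weights after every single example."""
    weights = [0.0] * n_features
    for features, label in stream:
        error = label - predict_proba(weights, features)
        for i, x in enumerate(features):
            weights[i] += learning_rate * error * x
    return weights


# Toy stream: features are [bias, x]; the label is 1 when x > 0.
examples = [([1.0, 2.0], 1), ([1.0, -2.0], 0),
            ([1.0, 1.0], 1), ([1.0, -1.0], 0)] * 50
weights = sgd_logistic(examples, n_features=2)
```

Each update touches only one example, so the data can be streamed from disk and never has to fit in memory. That property is also why the approach does not need MapReduce.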
22. The Contenders
23. AllReduce
• Developed at Yahoo! Research
• Defines the allreduce operation
  • N machines each have a number => each machine has the sum of the numbers
• At the heart of Vowpal Wabbit’s performance
• Implemented in C++
• Can be patched into Apache Hadoop and used today
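The allreduce operation described above can be sketched as a single-process simulation: each list slot stands for one machine's number, and every slot ends up holding the sum. The binary-tree communication pattern is an assumption chosen for illustration, and `allreduce_sum` is a hypothetical name, not the C++ implementation the slide refers to.

```python
def allreduce_sum(values):
    """Simulate allreduce: return the value each machine holds afterward."""
    n = len(values)
    held = list(values)

    # Reduce phase: partial sums flow up a binary tree toward machine 0.
    stride = 1
    while stride < n:
        for i in range(0, n, 2 * stride):
            j = i + stride
            if j < n:
                held[i] += held[j]  # machine j sends its partial sum to i
        stride *= 2

    # Broadcast phase: the total flows back down the same tree.
    while stride > 1:
        stride //= 2
        for i in range(0, n, 2 * stride):
            j = i + stride
            if j < n:
                held[j] = held[i]  # machine i sends the total to j
    return held


per_machine = allreduce_sum([3, 1, 4, 1, 5])
```

In a learning setting the "number" would typically be a gradient or weight vector, so every machine finishes each pass with the same combined model.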
24. Spark
• Developed at Berkeley’s AMP Lab
• Defines operations on distributed in-memory collections
• Written in Scala
• Supports reading from and writing to HDFS
25. GraphLab
• Developed at CMU
• Lower-level primitives
  • (but higher than MPI)
• Map/Reduce => Update/Sort
• Flexible, allows for asynchronous computations
• Reads from HDFS
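As a rough illustration of asynchronous, update-function-style computation, the sketch below runs PageRank by updating one vertex at a time and rescheduling its neighbors only when its value changes. This is a single-threaded toy, not GraphLab's API. The function name, the unnormalized PageRank formula, and the graph are assumptions made for illustration.

```python
from collections import deque


def async_pagerank(out_links, damping=0.85, tol=1e-9, max_updates=100_000):
    """Vertex-at-a-time PageRank driven by a work queue (toy scheduler).

    out_links maps every vertex to its list of targets; each target must
    also appear as a key.
    """
    in_links = {v: [] for v in out_links}
    for u, targets in out_links.items():
        for v in targets:
            in_links[v].append(u)

    rank = {v: 1.0 for v in out_links}
    queue = deque(out_links)
    queued = set(out_links)
    updates = 0
    while queue and updates < max_updates:
        v = queue.popleft()
        queued.discard(v)
        # Update function: recompute v from its in-neighbors' current ranks.
        incoming = sum(rank[u] / len(out_links[u]) for u in in_links[v])
        new_rank = (1.0 - damping) + damping * incoming
        changed = abs(new_rank - rank[v]) > tol
        rank[v] = new_rank
        updates += 1
        # Reschedule only the vertices that depend on v, and only on change.
        if changed:
            for w in out_links[v]:
                if w not in queued:
                    queue.append(w)
                    queued.add(w)
    return rank


graph = {"a": ["b"], "b": ["a"], "c": ["a"]}
ranks = async_pagerank(graph)
```

There is no global barrier between iterations: vertices that have converged stop being scheduled, which is the kind of flexibility that a fixed map-then-reduce pass does not offer.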