Machine Learning Spark ML
ML in real life
Supervised and Unsupervised Learning
• Unsupervised Learning
• There is no predefined, known set of outcomes
• Looks for hidden patterns and relationships in the data
• A typical example: Clustering
[Figure: scatter plot of Petal.Width against Petal.Length, with points colored by irisCluster$cluster]
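To make the clustering example more concrete, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. The petal measurements are invented for illustration and are not the iris data behind the plot, and `kmeans` is a hypothetical helper, not a Spark ML API.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: assign each point to its nearest centroid, then recompute centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

# Illustrative (petal length, petal width) pairs; values invented for this sketch.
petals = [(1.4, 0.2), (1.5, 0.3), (1.3, 0.2), (4.5, 1.5), (4.7, 1.4), (5.9, 2.1), (6.1, 2.3)]
labels, centers = kmeans(petals, centroids=[petals[0], petals[3], petals[5]])
```

No outcome column is passed in: the groups come only from the distances between points, which is what makes the task unsupervised.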
Supervised and Unsupervised Learning
• Supervised Learning
• For every example in the data there is always a predefined
outcome
• Models the relations between a set of descriptive features and
a target (Fits data to a function)
• 2 groups of problems:
• Classification
• Regression
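As a small illustration of "fits data to a function", the sketch below fits a straight line to labeled examples with ordinary least squares. The data and the helper names `fit_line` and `predict` are assumptions made for this example only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented training data: one descriptive feature x and a known target y.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)

def predict(x):
    """Apply the fitted function to a new example."""
    return slope * x + intercept
```

Every training example carries its predefined outcome `y`, and the fitted function is then used to predict the outcome of unseen examples.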
Supervised Learning
• Classification
• Predicts which class a given sample of data (sample of descriptive
features) is part of (discrete value).
• Regression
• Predicts continuous values.
[Figure: confusion matrix for the iris species, in percent]
Predicted \ Actual   setosa   versicolor   virginica
virginica               0.0          4.0        96.0
versicolor              0.0         96.0         4.0
setosa                100.0          0.0         0.0
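A confusion matrix like the one in the figure can be computed as in the sketch below. It assumes the percentages are normalised per actual class. The labels are invented, and `confusion_percent` is a hypothetical helper rather than a library function.

```python
from collections import Counter

def confusion_percent(actual, predicted, classes):
    """Return {predicted: {actual: percent}}, normalised per actual class."""
    counts = Counter(zip(predicted, actual))
    totals = Counter(actual)
    return {
        p: {a: (100.0 * counts[(p, a)] / totals[a]) if totals[a] else 0.0 for a in classes}
        for p in classes
    }

classes = ["setosa", "versicolor", "virginica"]
# Invented labels for illustration; not the results shown in the figure.
actual = ["setosa"] * 2 + ["versicolor"] * 4 + ["virginica"] * 4
predicted = [
    "setosa", "setosa",
    "versicolor", "versicolor", "versicolor", "virginica",
    "virginica", "virginica", "virginica", "versicolor",
]
matrix = confusion_percent(actual, predicted, classes)
```

Here `matrix["virginica"]["versicolor"]` is the percentage of actual versicolor samples that were predicted as virginica.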
Machine Learning as a Process
• Define Objectives: define measurable and quantifiable goals; use this stage to learn about the problem
• Data Preparation: normalization, transformation, missing values, outliers
• Model Deployment
ML as a Process: Data Preparation
• Needed for several reasons
• Some models have strict data requirements
• Scale of the data, data point intervals, etc.
• Some characteristics of the data may dramatically affect
model performance
• The time spent on data preparation should not be underestimated
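The preparation steps named earlier (missing values, outliers, normalization) can be sketched in plain Python as follows. Median imputation, the 1.5 × IQR rule, and min-max scaling are common choices assumed here for illustration, not methods prescribed by these slides. The data and function names are hypothetical.

```python
from statistics import median, quantiles

def impute_missing(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v < low or v > high for v in values]

def min_max_scale(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

# Invented raw column with one missing value and one extreme value.
raw = [2.0, None, 4.0, 6.0, 8.0, 100.0]
filled = impute_missing(raw)
flags = iqr_outliers(filled)
kept = [v for v, is_outlier in zip(filled, flags) if not is_outlier]
scaled = min_max_scale(kept)
```

The order matters: scaling before removing the extreme value would compress all the other values into a narrow band near zero.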
ML as a Process: Feature engineering
• Determining which predictors (features) to use is one of the most critical
questions
• Sometimes we need to add predictors
• Reducing the number of predictors:
• Fewer predictors give a more interpretable and less costly model
• Most models are affected by high dimensionality, especially when many predictors are non-informative
• Wrappers: algorithms that use models as input and performance as output
• Multiple models, adding and removing parameters
• Genetic algorithms
• Filters: evaluate the relevance of the predictor
• Normally based on correlations
• Binning predictors
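A filter of the kind described above can be sketched as a correlation threshold: each predictor is scored on its own against the target, with no model involved. The 0.5 threshold, the data, and the helper names are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / (sd_x * sd_y)

def filter_features(features, target, threshold=0.5):
    """Keep predictors whose absolute correlation with the target meets the threshold."""
    scores = {name: pearson(column, target) for name, column in features.items()}
    kept = [name for name, r in scores.items() if abs(r) >= threshold]
    return kept, scores

# Invented predictors: one informative, one non-informative.
features = {
    "petal_length": [1.4, 1.5, 4.5, 4.7, 5.9, 6.1],
    "noise": [0.3, 0.9, 0.2, 0.8, 0.4, 0.7],
}
target = [0.2, 0.3, 1.5, 1.4, 2.1, 2.3]
kept, scores = filter_features(features, target)
```

A wrapper would instead retrain a model on different subsets of predictors and compare the resulting performance, which is more expensive but accounts for interactions between predictors.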
ML as a Process: Model Building
• Data Splitting
• Allocate data to different tasks
• model training
• performance evaluation
• Define Training, Validation and Test sets
• Feature Selection (Review the decision made previously)
• Estimating Performance
• Visualization of results: discover interesting areas of the problem space
• Statistics and performance measures
• Evaluation and Model selection
• The ‘no free lunch’ theorem: no a priori assumptions can be made about which model will perform best
• Avoid relying on favorite models unless they are actually needed
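The data splitting step listed above can be sketched as a shuffled three-way split. The 60/20/20 proportions, the fixed seed, and the function name are assumptions chosen for this example.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle rows and split them into training, validation and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Stand-in dataset: 100 row identifiers.
rows = list(range(100))
train, val, test = train_val_test_split(rows)
```

The training set fits the model, the validation set supports model selection and tuning, and the test set is held back for the final performance estimate.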