Machine Learning Spark ML

Machine Learning with Spark MLlib covers the machine learning process, including supervised and unsupervised learning. Supervised learning models the relationship between features and a target through classification or regression. Unsupervised learning finds patterns without predefined outcomes, as in clustering. The machine learning process involves data preparation, model building and evaluation, and deployment. Key steps are data splitting, feature engineering, and model selection based on performance.

Machine Learning with Spark MLlib

Manuel Martín Márquez

Antonio Romero Marin
Joeri Hermans
Hadoop Tutorials
Machine Learning (ML)
• ML is a branch of artificial intelligence:
• Uses computing-based systems to make sense of data
• Extracting patterns, fitting data to functions, classifying data, etc.
• ML systems can learn and improve
• With historical data, time, and experience
• Bridges theoretical computer science and real, noisy data.

ML in real-life

Supervised and Unsupervised Learning
• Unsupervised Learning
• There is no predefined, known set of outcomes
• Looks for hidden patterns and relations in the data
• A typical example: Clustering

[Figure: scatter plot of the iris data colored by cluster (irisCluster$cluster); x-axis Petal.Length, y-axis Petal.Width]
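
The clustering idea can be sketched in a few lines of plain Python. This is a minimal k-means (Lloyd's algorithm) illustration, not the Spark MLlib API; the function names `kmeans` and `nearest` and the sample measurements are made up for this example.

```python
import random

def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: (point[0] - centroids[c][0]) ** 2 + (point[1] - centroids[c][1]) ** 2,
    )

def kmeans(points, k, iterations=20, seed=0):
    """Cluster 2-D points into k groups with Lloyd's algorithm."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, assignments

# Hypothetical (petal length, petal width) pairs forming two loose groups.
points = [(1.4, 0.2), (1.3, 0.2), (1.5, 0.3), (4.7, 1.4), (4.5, 1.5), (5.1, 1.8)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

No outcome labels are supplied: the grouping comes only from the distances between points, which is what makes the task unsupervised.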

Supervised and Unsupervised Learning
• Supervised Learning
• For every example in the data there is always a predefined outcome
• Models the relations between a set of descriptive features and a target (fits data to a function)
• Two groups of problems:
• Classification
• Regression
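
"Fits data to a function" can be made concrete with the simplest case, a straight line fitted by ordinary least squares. The sketch below is plain Python under that assumption; `fit_line` and the sample data are illustrative names and values, not part of the original slides.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Feature values with their known outcomes, generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Every example carries its outcome (`ys`), and the fitted function can then predict the target for unseen feature values.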

Supervised Learning
• Classification
• Predicts which class a given sample of data (a sample of descriptive features) belongs to (a discrete value).
• Regression
• Predicts continuous values.

[Figure: confusion matrix of predicted vs. actual iris species, in percent]
Predicted \ Actual   setosa   versicolor   virginica
virginica               0.0          4.0        96.0
versicolor              0.0         96.0         4.0
setosa                100.0          0.0         0.0
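
A percentage confusion matrix like the one in the figure can be computed from two label lists. The following is a sketch in plain Python; `confusion_percent` is a hypothetical helper and the labels are invented, so the numbers do not reproduce the figure.

```python
from collections import Counter

def confusion_percent(actual, predicted, classes):
    """Rows are predicted classes, columns are actual classes.

    Each cell is the percentage of samples of the actual class
    that received the given prediction, so every column sums to 100.
    """
    counts = Counter(zip(predicted, actual))
    totals = Counter(actual)
    return {
        p: {
            a: (100.0 * counts[(p, a)] / totals[a]) if totals[a] else 0.0
            for a in classes
        }
        for p in classes
    }

classes = ["setosa", "versicolor", "virginica"]
# Made-up labels: one versicolor sample is misclassified as virginica.
actual = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
predicted = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

matrix = confusion_percent(actual, predicted, classes)
for p in classes:
    print(p, [matrix[p][a] for a in classes])
```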

Machine Learning as a Process
Define Objectives
- Define measurable and quantifiable goals
- Use this stage to learn about the problem

Data Preparation
- Normalization
- Transformation
- Missing Values
- Outliers

Model Building
- Data Splitting
- Feature Engineering
- Estimating Performance
- Evaluation and Model Selection

Model Evaluation
- Study model accuracy
- Work better than the naïve approach or previous system
- Do the results make sense in the context of the problem

Model Deployment

ML as a Process: Data Preparation
• Needed for several reasons
• Some models have strict data requirements
• Scale of the data, data point intervals, etc.
• Some characteristics of the data may dramatically affect model performance
• Time spent on data preparation should not be underestimated

Raw Data
• Missing Values
• Error Values
• Different Scales
• Dimensionality
• Type Problems
• Many others

Data Transformation
• Scaling
• Centering
• Skewness
• Outliers
• Missing Values
• Errors

Flow: Raw Data → Data Transformation → Data Ready → Modeling phase
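
Two of the transformations listed above, handling missing values and centering and scaling, can be sketched with the Python standard library. Mean imputation and standardization to zero mean and unit variance are only one possible choice, assumed here for illustration; the helper names are hypothetical.

```python
import statistics

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    present = [v for v in values if v is not None]
    mean = statistics.fmean(present)
    return [mean if v is None else v for v in values]

def standardize(values):
    """Center to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no scale information; only center it.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

raw = [2.0, None, 4.0, 6.0]   # one missing measurement
imputed = impute_mean(raw)    # the gap is filled with the mean of 2, 4, 6
scaled = standardize(imputed)
print(imputed)
print(scaled)
```

In practice the mean and standard deviation would be estimated on the training data only and then reused on validation and test data.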

ML as a Process: Feature engineering
• Determining the predictors (features) to be used is one of the most critical questions
• Sometimes we need to add predictors
• Reduce Number:
• Fewer predictors mean a more interpretable and less costly model
• Most models are affected by high dimensionality, especially by non-informative predictors
• Wrappers
• Multiple models adding and removing parameters
• Algorithms that use models as input and performance as output
• Genetic Algorithms
• Filters
• Evaluate the relevance of the predictor
• Normally based on correlations

• Binning predictors
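
A correlation-based filter and a simple binning step can be sketched as follows. This is plain Python for illustration; the threshold of 0.5, the bin edges, the feature names, and the helper functions are all assumptions, not values from the slides.

```python
import bisect
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def filter_features(features, target, threshold=0.5):
    """Keep the predictors whose absolute correlation with the target reaches the threshold."""
    return [
        name for name, column in features.items()
        if abs(pearson(column, target)) >= threshold
    ]

def bin_value(value, edges):
    """Return the index of the bin a value falls into, given sorted bin edges."""
    return bisect.bisect_right(edges, value)

target = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "informative": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with the target
    "noise": [3.0, 1.0, 3.0, 1.0, 3.0],         # uncorrelated with the target
}
selected = filter_features(features, target)
bins = [bin_value(v, edges=[2.0, 4.0]) for v in [1.4, 3.0, 5.1]]
print(selected, bins)
```

A filter scores each predictor on its own, which is cheap but ignores interactions; a wrapper would instead retrain a model for each candidate subset and compare performance.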

ML as a Process: Model Building
• Data Splitting
• Allocate data to different tasks
• model training
• performance evaluation
• Define Training, Validation and Test sets
• Feature Selection (Review the decision made previously)
• Estimating Performance
• Visualization of results: discover interesting areas of the problem space
• Statistics and performance measures
• Evaluation and Model selection
• The ‘no free lunch’ theorem: no a priori assumptions can be made
• Avoid relying on favorite models unless they are actually needed
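
The data splitting step can be sketched as a shuffled three-way split. The 60/20/20 proportions, the seed, and the `split_data` helper are assumptions for illustration; on Spark DataFrames, `randomSplit` serves a similar purpose.

```python
import random

def split_data(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows and split them into training, validation, and test sets.

    Whatever remains after the training and validation shares goes to the test set.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train)
    n_validation = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validation],
        shuffled[n_train + n_validation:],
    )

rows = list(range(10))  # stand-in for ten data records
train_set, validation_set, test_set = split_data(rows)
print(len(train_set), len(validation_set), len(test_set))
```

The training set fits the model, the validation set supports model selection and tuning, and the test set is held back for the final performance estimate.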
