http://tinyurl.com/ycs5zqe8
Machine Learning
for
Data Science
https://www.facebook.com/analyticsindeep
http://www.aekanun.com
Cross-Validation & Hyperparameters Tuning
Source: mapr.com
3 aekanun@imcinstitute.com
Cross-Validation & Hyperparameters Tuning
Source: databricks.com
4 aekanun@imcinstitute.com
Topic: Lecture & LAB
Tools for Large Scale Machine Learning: Spark Core, Spark SQL, Spark MLlib
Recommendation
5 aekanun@imcinstitute.com
What is Data Science ?
Images: https://www.slideshare.net/KaiWaehner/r-spark-tensorflow-spark-applied-to-streaming-analytics
Images: https://www.slideshare.net/KaiWaehner/r-spark-tensorflow-spark-applied-to-streaming-analytics
Data Science Methodology: CRISP-DM
Source: Shearer C., The CRISP-DM model: the new blueprint for data mining, J Data Warehousing (2000); 5:1322.
9 aekanun@imcinstitute.com
Step 1: Business Understanding
- Define the problem, project objectives, and solution requirements from a business perspective.
- Examples:
  - Alerting after an event occurs: not predictive, but descriptive mining.
  - Recommendation system: focus not on user profiles but on user preferences (Like, Love, etc.)
Adapted from Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale, Addison-Wesley Professional, 2016
10 aekanun@imcinstitute.com
Step 2: Data Understanding
- Identify data requirements for the analytic approach
  - Data sources: flat files, relational DBs, HDFS
  - Data characteristics: volume, variety, velocity
  - Data format & presentation: structured/unstructured, categorical/numerical
  - These directly affect the selection of data ingestion tools and the next steps of the methodology.
- Data collection
- Exploratory data analysis (EDA): an approach to analyzing data sets to summarize their main characteristics, often with visual methods.
- Assess data quality: bias, inconsistency, duplication, missing values
(Figures: histogram, scatterplot, boxplots)
Adapted from Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale, Addison-Wesley Professional, 2016
11 aekanun@imcinstitute.com
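The EDA step above can be sketched in a few lines of plain Python: summarize a numeric column and count its missing values before any modeling. The ages list here is a made-up example.

```python
# Minimal EDA sketch: basic summary statistics plus a missing-value
# count for one numeric column (hypothetical ages).
import statistics

ages = [23, 29, 31, None, 41, 38, None, 27, 33, 30]

present = [a for a in ages if a is not None]
missing = len(ages) - len(present)

summary = {
    "count": len(present),
    "missing": missing,
    "mean": statistics.mean(present),
    "median": statistics.median(present),
    "stdev": statistics.stdev(present),
    "min": min(present),
    "max": max(present),
}
print(summary)
```

In practice the same summaries would be drawn as histograms, scatterplots and boxplots, as the figures on this slide suggest.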
Hidden Biases of Big Data
If you analyzed tweets immediately before and after Hurricane Sandy, you would think that most people were supermarket shopping pre-Sandy and partying post-Sandy.
Source: Cathy O'Neil et.al, Doing Data Science, O'Reilly Media, 2013
Image: wikipedia.org
12 aekanun@imcinstitute.com
Step 3: Data Preparation
Source: D. Chantal et.al, Data Mining and Predictive Analytics, 2nd Edition, John Wiley & Sons, 2015
13 aekanun@imcinstitute.com
GIGO: Garbage in, Garbage out
Missing Values
Anomalous values
Inconsistency: format,
presentation, unit of
measurement
Errors
Source: D. Chantal et.al, Data Mining and Predictive Analytics, 2nd Edition, John Wiley & Sons, 2015
14 aekanun@imcinstitute.com
GIGO: Garbage in, Garbage out
Single vs Separated
Missing Values
Anomalous values
Inconsistency: format,
presentation, unit of
measurement
Errors
Source: D. Chantal et.al, Data Mining and Predictive Analytics, 2nd Edition, John Wiley & Sons, 2015
18 aekanun@imcinstitute.com
GIGO: Garbage in, Garbage out
Unit of measurement ?
Missing Values
Anomalous values
Inconsistency: format,
presentation, unit of
measurement
Errors
Source: D. Chantal et.al, Data Mining and Predictive Analytics, 2nd Edition, John Wiley & Sons, 2015
19 aekanun@imcinstitute.com
Error Values
20 aekanun@imcinstitute.com
Missing Values
21 aekanun@imcinstitute.com
Handling Missing Data
Source: D. Chantal et.al, Data Mining and Predictive Analytics, 2nd Edition, John Wiley & Sons, 2015
22 aekanun@imcinstitute.com
Transforming Numerical Var. into Categorical Var.
(Discretization)
(a)
(b)
Source: D. Chantal et.al, Data Mining and Predictive Analytics, 2nd Edition, John Wiley & Sons, 2015
23 aekanun@imcinstitute.com
Transforming: Scaling Difference of Values
using Min-Max Normalization
Source: D. Chantal et.al, Data Mining and Predictive Analytics, 2nd Edition, John Wiley & Sons, 2015
24 aekanun@imcinstitute.com
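Min-Max normalization rescales each value with x' = (x - min) / (max - min), mapping a column onto [0, 1]. A minimal sketch, with hypothetical income values:

```python
# Min-max normalization: rescale values to the [0, 1] range.
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20000, 35000, 50000, 80000, 100000]
print(min_max_normalize(incomes))  # [0.0, 0.1875, 0.375, 0.75, 1.0]
```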
Step 4: Modeling
- Machine Learning Algorithm
- Decision Trees
- Recommendation engines
- k-Nearest Neighbours
- Naive Bayes
- Logistic Regression
- K-Means
- Fuzzy C-Means
- Genetic algorithms
- etc.
Image:machinelearningmastery.com
25 aekanun@imcinstitute.com
Categories of Machine Learning
- Supervised Learning
- A set of example observations is provided as a training set.
- Goal: to learn an association between the inputs (features) and output
(target variable) by using the examples provided.
- Unsupervised Learning
- The input data is a feature matrix of observations without a target
variable.
- Scenarios:
- Used for exploratory analysis.
- A step before supervised learning.
Source: Ofer Mendelevitch et.al, Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale, Addison-Wesley Professional, 2016
27 aekanun@imcinstitute.com
28 aekanun@imcinstitute.com
Spark ML Pipeline
Source: stats.stackexchange.com
29 aekanun@imcinstitute.com
Source: stats.stackexchange.com
30 aekanun@imcinstitute.com
Tasks of Machine Learning
- Classification/Regression
Source: Ofer Mendelevitch et.al, Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale, Addison-Wesley Professional, 2016
Classification vs. Regression
Classification
Regression
Source: Ofer Mendelevitch et.al, Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale, Addison-Wesley Professional, 2016
Classification vs. Regression
Source: www.edureka.in/data-science
Tasks of Machine Learning
- Clustering
- Identify natural groupings or clusters of observations
that are similar to each other.
- Recommender Systems
- Predict user preference for a product or item given historical preference data from other users.
Source: Ofer Mendelevitch et.al, Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale, Addison-Wesley Professional, 2016
Algorithm of Supervised Learning
- Decision Trees
- Mining Task: Classification/Regression (Regression Trees)
Source: spark.apache.org
Spark.ML: Decision Trees
Source: spark.apache.org
Algorithm of Supervised Learning
- Random Forest
- Features:
- Correct for decision trees' habit of overfitting to their training set.
- Classification: supports binary and multiclass classification
- Regression: uses both continuous and categorical features
- Process:
- A set of decision trees is constructed, each using a random subset of the training examples and features.
- A vote based on the outputs of all the trees is used to make the final decision.
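The construct-then-vote process above can be illustrated with a deliberately simplified sketch: each "tree" here is just a one-feature threshold rule, and the forest takes a majority vote over all of them. A real random forest grows full decision trees on bootstrapped samples; this only shows the voting mechanics. All data is hypothetical.

```python
# Toy random-forest sketch: random one-feature threshold "stumps"
# plus majority voting over their predictions.
import random
from collections import Counter

def train_stump(rows, labels, rng):
    # Pick a random feature; split at its mean value.
    feat = rng.randrange(len(rows[0]))
    thresh = sum(r[feat] for r in rows) / len(rows)
    # Each side predicts the majority label of its training rows.
    left = [l for r, l in zip(rows, labels) if r[feat] <= thresh]
    right = [l for r, l in zip(rows, labels) if r[feat] > thresh]
    left_lbl = Counter(left).most_common(1)[0][0] if left else labels[0]
    right_lbl = Counter(right).most_common(1)[0][0] if right else labels[0]
    return lambda x: left_lbl if x[feat] <= thresh else right_lbl

def forest_predict(stumps, x):
    # Majority vote over all stump outputs.
    votes = Counter(s(x) for s in stumps)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
rows = [[1, 10], [2, 12], [8, 30], [9, 28]]
labels = ["A", "A", "B", "B"]
stumps = [train_stump(rows, labels, rng) for _ in range(5)]
print(forest_predict(stumps, [1.5, 11]))  # "A"
```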
Spark.ML: Random Forest
Source: spark.apache.org
Algorithm of Supervised Learning
- k-Nearest Neighbours
- Mining Task: Classification/Regression
- Process:
- Define a distance metric between observations
and use this metric for classification or
regression.
- Determine the k-nearest neighbors of any given
observation.
- For classification, an unseen observation is then
classified to the most popular class amongst its
k-nearest neighbors. Similarly, in regression, the
mean of the k-nearest neighbors is used.
Adapted from: Ofer Mendelevitch et.al, Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale, Addison-Wesley Professional, 2016
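The three k-NN steps above (distance metric, find the k nearest neighbours, majority vote) can be sketched directly, with hypothetical 2-D points:

```python
# Minimal k-NN classification: Euclidean distance, take the k nearest
# training observations, majority vote on their labels.
import math
from collections import Counter

def knn_classify(train, query, k=3):
    # train: list of (features, label) pairs
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
         ((8, 8), "blue"), ((8, 9), "blue"), ((9, 8), "blue")]
print(knn_classify(train, (2, 2)))  # "red"
```

For regression, the vote would be replaced with the mean of the neighbours' target values, as the slide notes.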
Algorithm of Supervised Learning
- Naive Bayes
- Mining Task: Classification
- Features:
- Bayes' theorem with strong (naive)
independence assumptions between the
features.
- Use Cases:
- Sex Classification
- Document Classification: Spam/Not
Algorithm of Supervised Learning
- Logistic Regression
- Mining Task:
- A regression model where the dependent
variable (DV) is categorical.
- Features: Binary dependent variable
Algorithm of Unsupervised Learning
- Clustering
- k-Means, MinHash, Hierarchical Clustering
- Hidden Markov Models
- Feature Extraction methods
- Neural Networks
Features Encoding Pipeline
Source: Ofer Mendelevitch et.al, Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale, Addison-Wesley Professional, 2016
Step 5: Evaluation
- Evaluation: Model Performance
- Categorical Data: Confusion Matrix
- Numerical Data: Root Mean Squared Error
Testing
Data
Adapted from Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale, Addison-Wesley Professional, 2016
45 aekanun@imcinstitute.com
Evaluation - Confusion Matrix
                  Predicted: Positive   Predicted: Negative
Actual: Positive  TP                    FN                   -> Sensitivity / Recall
Actual: Negative  FP                    TN                   -> Specificity (True Negative Rate)
                  Positive Predictive   Negative Predictive
                  Value (Precision)     Value
Source: analyticsvidhya.com
46 aekanun@imcinstitute.com
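The metrics in the confusion matrix above follow directly from the four counts. The counts below are hypothetical, for a churn classifier:

```python
# Confusion-matrix metrics from raw counts (hypothetical churn model).
TP, FN, FP, TN = 80, 20, 10, 90

sensitivity = TP / (TP + FN)   # recall / true positive rate
specificity = TN / (TN + FP)   # true negative rate
precision   = TP / (TP + FP)   # positive predictive value
accuracy    = (TP + TN) / (TP + FN + FP + TN)

print(sensitivity, specificity, precision, accuracy)
```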
- We focus on Churn.
- Sensitivity or True Positive Rate means:
  - the proportion of all cases who are Churn that were detected.
(Confusion-matrix fragment repeated: Actual Positive row with TP, FN; Sensitivity/Recall; Precision.)
47 aekanun@imcinstitute.com
Belief or Not?
- We focus on Churn.
- Positive Predictive Value (Precision) means:
  - the proportion of cases predicted as Churn that actually are Churn.
(Confusion-matrix fragment repeated: TP, FN; Sensitivity/Recall; Precision.)
48 aekanun@imcinstitute.com
Step 6: Deployment
Source: machinelearningmastery.com
49 aekanun@imcinstitute.com
Deployment
- Feedback
- Updates & Tuning
Source: bbi-consultancy.com
50 aekanun@imcinstitute.com
Data Science Methodology: CRISP-DM
51 aekanun@imcinstitute.com
Speaker
Problem:
- Largest retailer in the world
- 20,000 stores in 28 countries
- Has had a Big Data and analytics department since 2004
- The world's largest private data cloud
- Processes 2.5 PB every hour

Solving:
- Real-time analytics: product recommendation (right place, right time, right customer)
- Monitors public social media conversations, and attempts to predict what products people will buy
54
Source: Big Data in Practice, Bernard Marr, 2016
WALMART: Retail Industry
Data:
- Data Café uses a database consisting of 200 billion rows of transactional data
- 200 other sources, including meteorological data, economic data, telecoms data, social media data, gas prices

Technology:
- 40 petabytes of data
- Hadoop (since 2011)
- Spark
- Cassandra
- R
- SAS
55
Source: Big Data in Practice, Bernard Marr, 2016
WALMART: Retail Industry
Results
The Data Café system has led to a reduction in the time it takes from a problem being spotted in the numbers to a solution being proposed, from an average of two to three weeks down to around 20 minutes.
56
Source: Big Data in Practice, Bernard Marr, 2016
57
Netflix: Entertainment
Problem:
- Streaming movie and TV service
- 65 million members in over 50 countries
- One-third of peak-time Internet traffic in the US

Solving:
- To understand customer viewing habits
- Improve the number of hours customers spend watching
- They launched the Netflix Prize
58
Source: Big Data in Practice, Bernard Marr, 2016
Netflix: Entertainment
Data:
- Customer ID, movie ID, rating and the date the movie was watched
- Streaming data

Technology:
- 3 petabytes of data
- Amazon Web Services
- Hadoop, Hive and Pig
- Originally used Oracle databases, but they switched to NoSQL and Cassandra
59
Source: Big Data in Practice, Bernard Marr, 2016
60
Netflix: Entertainment
Results
They added 4.9 million new subscribers in Q1 2015,
compared to four million in the same period in 2014.
In Q1 2015 alone, Netflix members streamed 10 billion hours of content.
61
Source: Big Data in Practice, Bernard Marr, 2016
Uber: Transportation
Problem:
- A smartphone app-based taxi booking service
- Now valued at $41 billion
- Firmly in Big Data, and leveraging this data in a more effective way than traditional taxi firms

Solving:
- Big Data principle of crowdsourcing
- Store and monitor data on every journey to determine demand, allocate resources and set fares
- Big Data-informed pricing, which they call surge pricing
62
Source: Big Data in Practice, Bernard Marr, 2016
Uber: Transportation
Data:
- Mixture of internal and external data
- GPS, traffic data
- Public transport routes

Technology:
- Hadoop data lake
- Apache Spark
63
Source: Big Data in Practice, Bernard Marr, 2016
Uber: Transportation
Results
This case is less about short-term results and more about the long-term development of a data-driven business model. But it's fair to say that without their clever use of data, the company wouldn't have grown into the phenomenon they are.
64
Source: Big Data in Practice, Bernard Marr, 2016
Amazon
Problem:
- One of the world's largest retailers of physical goods, virtual goods such as ebooks and streaming video, and more recently Web services

Solving:
- Recommendation engine technology based on collaborative filtering
- A 360-degree view of you as an individual customer
- Monitor, track and secure the 1.5 billion items in its retail store
65
Source: Big Data in Practice, Bernard Marr, 2016
Amazon
Data:
- Data from users as they browse the site
- Location data and information about other apps you use on your phone
- External datasets such as census information
- Streaming data

Technology:
- 187 million unique monthly website visitors
- Hewlett-Packard servers running Oracle on Linux
- 5 TB of data
66
Source: Big Data in Practice, Bernard Marr, 2016
Amazon
Results
Amazon have grown to become the largest online retailer in
the US based on their customer-focused approach to
recommendation technology. Last year, they took in nearly
$90 billion from worldwide sales.
67
Source: Big Data in Practice, Bernard Marr, 2016
Big Data
Source: http://www.datasciencecentral.com/ 68
Source: IBM 69
Source: IBM 70
Source: IBM 71
Source: IBM 72
Source: Bernard Marr 73
74
Source: A digital age vision of insurance services, CRIF Reference
75
76
Source: William EL KAIM, Enterprise Architecture and Technology Innovation
77
Source: Domo
Big Data : Why Now?
78
Source: William EL KAIM, Enterprise Architecture and Technology Innovation
Use Cases
80
81
What is Data Science?
82
83
Data Scientist Lifecycle
84
Source Big Data: Understanding How Data Powers Big Business
What is Machine Learning?
85
86
Source: http://www.datamation.com/
What is Deep Learning?
87
Deep Learning
88
Source: http://www.information-management.com/
What is Data Mining?
Source: http://www.unc.edu/~xluan/258/datamining.html 90
What is Business intelligence?
Gartner
91
92
93
Source Big Data: Understanding How Data Powers Big Business
94
Source Big Data: Understanding How Data Powers Big Business
Data Analytics Lifecycle
95
Source Big Data: Understanding How Data Powers Big Business
96
Source: Data Science and Critical Thinking, A.Scroll
97
Types of Data Scientist
Data Businesspeople
Data Creatives
Data Developers
Data Researchers
Source: www.edureka.in/data-science 98
Source: www.edureka.in/data-science 99
Big Data is changing the world
100
Big Data Challenges
103
Big Data Analytics Reference Architectures
104
Source: SoftServe
Relational Reference Architectures
105
Source: SoftServe
Non-Relational Reference Architectures
106
Source: SoftServe
Data Discovery: Non-Relational Architecture
107
Source: SoftServe
Business Reporting: Hybrid Architecture
108
Source: SoftServe
109
110
Source: Xiaoxiao Shi
What is Hadoop?
111
Hadoop Environment
112
Source: Hadoop in Practice; Alex Holmes
113
Source: HDInsight Essentials - Second Edition
Hadoop Platform
114
Source: Octo Technology
Hadoop Ecosystem
115
Source: Apache Hadoop Operations for Production Systems, Cloudera
116
Source: The evolution and future of Hadoop storage: Cloudera
117
Source: The evolution and future of Hadoop storage: Cloudera
Hadoop Cluster
118
Source: HDInsight Essentials - Second Edition
When to use Hadoop?
119
Hadoop for Big Data Analytics
120
Source: Microsoft
What is Mahout?
121
Mahout in Apache Software
122
Why Mahout?
Apache License
Good Community
Good Documentation
Scalable
Extensible
Command Line Interface
Java Library
123
Mahout Architecture
124
Use Cases
125
Data Science: Core Components
126
Source www.edureka.in/data-science
Data Science Implementation
127
Source www.edureka.in/data-science
Social Media Use Case
128
Source www.edureka.in/data-science
Social Media Use Case: Classification
129
Source www.edureka.in/data-science
What is Spark?
130
Spark Framework
131
132
Source: http://www.informationweek.com/big-data
Source: Jump start into Apache Spark and Databricks 133
What is Spark?
Framework for distributed processing.
In-memory, fault tolerant data structures
Flexible APIs in Scala, Java, Python, SQL, R
Open source
134
Spark History
Started at the AMPLab, UC Berkeley
Created by Matei Zaharia (PhD thesis)
Maintained by Apache Software Foundation
Commercial support by Databricks
135
Why Spark?
Handle Petabytes of data
Significant faster than MapReduce
Simple and intuitive APIs
General framework
Runs anywhere
Handles (most) any I/O
Interoperable libraries for specific use-cases
136
Spark Platform
137
138
Hadoop Processing Paradigms
141
Source: William EL KAIM, Enterprise Architecture and Technology Innovation
Introduction to
Machine Learning
Decision Trees
Recommendation engines
k-Nearest Neighbours
Naive Bayes
Logistic Regression
K-Means
Fuzzy C-Means
Genetic algorithms
etc.
Categories
Supervised Learning
Unsupervised Learning
Semi-Supervised Learning
Reinforcement Learning
Supervised Learning
The correct classes of the training data are known
Supervised Learning
Handwriting Recognition
Spam Detection
Information Retrieval
Personalisation based on ranks
Speech Recognition
Supervised Learning Algorithm
Decision Trees
k-Nearest Neighbours
Random Forest
Naive Bayes
Logistic Regression
Neural Networks
Unsupervised Learning
The correct classes of the training data are not
known
Unsupervised Learning
Source www.edureka.in/data-science
Unsupervised Learning
analytics
Applications
Pattern Recognition
Groupings based on a distance measure
Group of People, Objects, ...
Unsupervised Learning Algorithm
Clustering
Neural Networks
Types of Algorithms
Classification
Recommend friends/dates/products
Use Cases
Sentiment Analysis
Clustering
Source: Mahout in Action
Sample Data
Source www.edureka.in/data-science
Distance Measures
Source: www.edureka.in/data-science
Example of K-Means Clustering
http://stanford.edu/class/ee103/visualizations/kmeans/kmeans.html
K-Means with different
distance measures
Source: www.edureka.in/data-science
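The effect of the distance measure on k-means can be seen in just the assignment step: each point goes to its nearest centroid under the chosen metric. Points and centroids below are hypothetical.

```python
# One assignment step of k-means with a pluggable distance measure.
import math

def euclidean(a, b):
    return math.dist(a, b)

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def assign(points, centroids, dist):
    # Return, for each point, the index of its nearest centroid.
    return [min(range(len(centroids)),
                key=lambda i: dist(p, centroids[i]))
            for p in points]

points = [(1, 1), (2, 1), (9, 9), (8, 8)]
centroids = [(0, 0), (10, 10)]
print(assign(points, centroids, euclidean))   # [0, 0, 1, 1]
print(assign(points, centroids, manhattan))   # [0, 0, 1, 1]
```

A full k-means iteration would then recompute each centroid as the mean of its assigned points and repeat until assignments stop changing.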
Step 1.1: Term Frequency (TF)
Source: www.edureka.in/data-science
Step 1.2: Normalized Term Frequency
Source: www.edureka.in/data-science
Step 2: Inversed Document Frequency (IDF)
Source: www.edureka.in/data-science
Step 3: TF*IDF
Source: www.edureka.in/data-science
Step 4: Cosine similarity
Source: www.edureka.in/data-science
Step 4: Cosine similarity
Source: www.edureka.in/data-science
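Steps 1-4 above (normalized term frequency, inverse document frequency, TF*IDF weights, cosine similarity) can be worked end-to-end on two tiny hypothetical documents:

```python
# TF-IDF + cosine similarity, following the slides' four steps.
import math
from collections import Counter

docs = [["data", "science", "spark"],
        ["data", "data", "hadoop"]]

vocab = sorted({w for d in docs for w in d})

def tf(doc):
    # Step 1: term frequency normalized by document length.
    counts = Counter(doc)
    return {w: counts[w] / len(doc) for w in vocab}

def idf(word):
    # Step 2: log(N / document frequency).
    df = sum(1 for d in docs if word in d)
    return math.log(len(docs) / df)

# Step 3: TF * IDF vectors.
vectors = [[tf(d)[w] * idf(w) for w in vocab] for d in docs]

def cosine(a, b):
    # Step 4: cosine similarity between two weight vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

print(cosine(vectors[0], vectors[1]))
```

Note that "data" occurs in both documents, so its IDF is log(2/2) = 0 and it contributes nothing; with no other shared terms, the cosine similarity here is exactly 0.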
Classification
Classification Process
Source: www.edureka.in/data-science
Keywords of Classification
Model
Training data
Test data
Training
Predictor variable
Target variable
[1] Isabelle Guyon, An Introduction to Variable and Feature Selection, Journal of Machine Learning Research, 2003
Classification Workflow
Step 1: Training the model
Decision Trees
k-Nearest Neighbours
Forest Trees
Naive Bayes
Logistic Regression
Regression
Source: www.edureka.in/data-science
Supervised Learning
Source www.edureka.in/data-science
Decision Trees
Examples:
Main loop:
Source: www.edureka.in/data-science
Attribute Selection Example
Source: www.edureka.in/data-science
Source: cs.upc.edu
Source: cs.upc.edu
Source: cs.upc.edu
Missing Values - some Solutions
Source: www.edureka.in/data-science
Naïve Bayes Classifier
Bayes Classifiers
Source: http://www.cs.ucr.edu/~eamonn/CE/Bayesian%20Classification%20withInsect_examples.pdf
Bayes Classifiers : Example
Source: http://www.cs.ucr.edu/~eamonn/CE/Bayesian%20Classification%20withInsect_examples.pdf
Bayes Classifiers (Cont.)
Source: http://www.cs.ucr.edu/~eamonn/CE/Bayesian%20Classification%20withInsect_examples.pdf
Probability model of Bayes Classifiers
Source: http://www.cs.ucr.edu/~eamonn/CE/Bayesian%20Classification%20withInsect_examples.pdf
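The probability model above amounts to multiplying the class prior by per-feature likelihoods (the "naive" independence assumption) and picking the class with the larger score. The numbers below are made up purely for illustration:

```python
# Naive Bayes scoring sketch: prior * product of per-feature
# likelihoods, under the naive independence assumption.
def nb_score(prior, likelihoods):
    score = prior
    for p in likelihoods:
        score *= p
    return score

# Hypothetical values: P(class) and P(feature_i | class).
score_spam = nb_score(0.4, [0.8, 0.6])   # P(spam) * P(w1|spam) * P(w2|spam)
score_ham  = nb_score(0.6, [0.2, 0.3])   # P(ham)  * P(w1|ham)  * P(w2|ham)
print("spam" if score_spam > score_ham else "ham")  # "spam"
```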
Recommendation
Differences
Clustering algorithms
Decide on their own which distinctions appear to
be important
Classification algorithms
Learn to mimic examples of correct decisions
Make a single decision with a very limited set of
possible outcomes
Recommendation algorithms
Select and rank the best of many possible
alternatives
Collaborative Filtering frameworks provide recommendations:
User-based: Recommend items by finding similar users.
This is often harder to scale because of the dynamic nature
of users.
Item-based: Calculate similarity between items and make
recommendations. Items usually don't change much, so
this often can be computed offline.
Slope-One: A very fast and simple item-based
recommendation approach applicable when users have
given ratings (and not just boolean preferences).
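The item-based approach above can be sketched with a tiny, hypothetical rating matrix: compare items by the cosine similarity of their rating vectors across users.

```python
# Item-based collaborative filtering sketch: cosine similarity
# between item rating vectors (users u1-u3, items A-C, made-up data).
import math

ratings = {
    "u1": {"A": 5, "B": 4, "C": 1},
    "u2": {"A": 4, "B": 5, "C": 2},
    "u3": {"A": 1, "B": 2, "C": 5},
}

def item_vector(item):
    # One rating per user, in a fixed user order.
    return [ratings[u].get(item, 0) for u in sorted(ratings)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

sim_AB = cosine(item_vector("A"), item_vector("B"))
sim_AC = cosine(item_vector("A"), item_vector("C"))
print(sim_AB > sim_AC)   # A is rated more like B than like C
```

A recommender would then score a user's unseen items by similarity-weighted averages of the items they have already rated, and these item-item similarities can be precomputed offline, as noted above.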
Recommendation: Example
Recommendation: Example
Source: www.edureka.in/data-science
Item-based Recommendation
lambdaarch-batch
lambdaarch-batch
Connect via SSH in browser window
lambdaarch-batch
Connect to the instance
lambdaarch-batch:~$
Config Firewall: Select the instance
lambdaarch-batch
Config Firewall: Select the Network
Config Firewall: Select Add Firewall rules
Create a firewall rule with the following configuration
Come back and so forth
lambdaarch-batch:~$
Hands-On: Install a Docker Engine
Update OS (Ubuntu)
$ sudo su -
# apt-get update
Docker Installation
# apt-get install docker.io
Install Cloudera Quickstart on
Docker Container
Pull Cloudera Quickstart
# docker pull cloudera/quickstart:latest
Verify the image was successfully pulled
# docker images
Run Cloudera quickstart
lambdaarch-batch
Login to Hue: http://external-ip-address:8080
December 2016
Mr.Aekanun Thongtae
aekanun@imcinstitute.com
255 aekanun@imcinstitute.com
ISSUE: Python 2.6.x can NOT run many necessary Python libraries.
256 aekanun@imcinstitute.com
What is Jupyter ?
- Server-Client application: allows you to edit and run your notebooks via a
web browser.
- Two main components:
- A kernel is a program that runs and introspects the user's code.
- The dashboard of the application: can also be used to manage the
kernels
- IPython is the original; Fernando Pérez started developing it in late 2001.
- Later, in 2014, Project Jupyter started as a spin-off project from IPython.
- IPython is now the name of the Python backend, which is also known
as the kernel.
source: www.datacamp.com
257 aekanun@imcinstitute.com
Why Jupyter ?
- IDEs
- Can be saved and easily shared in .ipynb JSON format.
- Statistical data visualization, such as Seaborn
Source: Jonathan W., Jupyter Notebook for Data Science Teams, Infinite Skills, 2016
258 aekanun@imcinstitute.com
1. Getting Started (this page may be skipped if the
container has been already run.)
- Delete the existing containers
259 aekanun@imcinstitute.com
- Install necessary applications
260 aekanun@imcinstitute.com
2. Install the Anaconda, Jupyter and some Modules
[root@quickstart /]# wget https://repo.continuum.io/archive/Anaconda2-4.2.0-Linux-x86_64.sh
261 aekanun@imcinstitute.com
[root@quickstart /]# export PYSPARK_DRIVER_PYTHON=/root/anaconda2/bin/jupyter
[root@quickstart /]# export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=False --NotebookApp.ip='*' --NotebookApp.port=8880"
262 aekanun@imcinstitute.com
- Run Jupyter notebook with Pyspark
263 aekanun@imcinstitute.com
Stop all Hadoops services
#! /usr/bin/env bash
/etc/init.d/zookeeper-server stop
/etc/init.d/hadoop-hdfs-datanode stop
/etc/init.d/hadoop-hdfs-journalnode stop
/etc/init.d/hadoop-hdfs-namenode stop
/etc/init.d/hadoop-hdfs-secondarynamenode stop
/etc/init.d/hadoop-httpfs stop
/etc/init.d/hadoop-mapreduce-historyserver stop
/etc/init.d/hadoop-yarn-nodemanager stop
/etc/init.d/hadoop-yarn-resourcemanager stop
/etc/init.d/hbase-master stop
/etc/init.d/hbase-rest stop
/etc/init.d/hbase-thrift stop
/etc/init.d/hive-metastore stop
/etc/init.d/hive-server2 stop
/etc/init.d/sqoop2-server stop
/etc/init.d/spark-history-server stop
/etc/init.d/hbase-regionserver stop
/etc/init.d/hue stop
/etc/init.d/impala-state-store stop
/etc/init.d/oozie stop
/etc/init.d/solr-server stop
/etc/init.d/impala-catalog stop
/etc/init.d/impala-server stop
Difficult to
SCALE
Image:integralnet.co.uk
265 aekanun@imcinstitute.com
266 aekanun@imcinstitute.com
Image: linkedin.com
267 aekanun@imcinstitute.com
Which topic would we like to focus on ?
Source: dailykos.com
268 aekanun@imcinstitute.com
269
Image: linkedin.com/pulse/dealing-data-structured-unstructured-way-ronald-baan
270 aekanun@imcinstitute.com
Image: rodneyrohrmann.blogspot.com
271 aekanun@imcinstitute.com
Semi-structured Data
Image: Thomas Eri et.al, Big Data Fundamentals: Concepts, Drivers & Techniques, Prentice Hall, 2016
272 aekanun@imcinstitute.com
horizontal
273 aekanun@imcinstitute.com
Hadoop 2
Source: Tomcy John and Pankaj Misra, Data Lake for Enterprises, Packt Publishing, 2017
274
Aekanun Thongtae, aekanun@imcinstitute.com Apr 2017
MapReduce-Architecture View
Source: Tomcy John and Pankaj Misra, Data Lake for Enterprises, Packt Publishing, 2017
275
Aekanun Thongtae, aekanun@imcinstitute.com Apr 2017
Hadoop Environment
276
Source: Hadoop in Practice; Alex Holmes
277
278
Source: Thomas Eri et.al, Big Data Fundamentals: Concepts, Drivers & Techniques, Prentice Hall, 2016
Hadoop Cluster
279
Source: HDInsight Essentials - Second Edition
280
Source: hadoop.apache.org
281
282
MapReduce Tutorial: A Word Count Example of MapReduce
Dear, Bear, River, Car, Car, River, Deer, Car and Bear
Source: edureka.co
283
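The word-count flow above can be simulated in plain Python to make the three phases concrete: map emits (word, 1) pairs, shuffle groups them by key, and reduce sums each group.

```python
# Word-count MapReduce simulation on the slide's example sentence.
from collections import defaultdict

line = "Dear Bear River Car Car River Deer Car Bear"

# Map phase: emit (word, 1) for every word.
mapped = [(word, 1) for word in line.split()]

# Shuffle phase: group the emitted values by key.
grouped = defaultdict(list)
for word, one in mapped:
    grouped[word].append(one)

# Reduce phase: sum the grouped counts per word.
counts = {word: sum(ones) for word, ones in grouped.items()}
print(counts)
```

In real MapReduce the map and reduce phases run in parallel across the cluster, and the shuffle moves data between nodes; here everything runs in one process purely to show the data flow.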
Conclusion: MapReduce
- Parallel Processing
- The time taken to process the data gets reduced by a tremendous
amount
- Data Locality
- Instead of moving data to the processing unit, we are moving
processing unit to the data
Source: edureka.co
284
284
Introduction
A Petabyte Scale Data Warehouse Using Hadoop
285
Big data Architecture and Analytics Platform
Architecture Overview
(Architecture diagram: Hive CLI, Web UI and management interfaces; MetaStore and SerDe (Thrift, Jute, JSON, ...); Hive compiles queries to MapReduce over HDFS.)
Hive.apache.org 286
Big Data Hadoop Workshop Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015
Introduction to
Apache Spark
Mr.Aekanun Thongtae
Big Data Consultant
IMC Institute
288
Issues
2. Interactive querying of large datasets, so that a data scientist may run ad-hoc queries on a data set.
Source: dzone.com
Challenges with big data analytics
- Many tasks result in too much time spent starting/stopping JVMs and too many small files.
Adapted from Srinivas Duvvuri; Bikramaditya, Singhal,Spark for Data Science, Packt Publishing, 2016
Evolution of big data analytics
- Hadoop's MapReduce model did not fit well with machine learning algorithms that are iterative in nature.
Source: Srinivas Duvvuri; Bikramaditya, SinghalSpark for Data Science, Packt Publishing, 2016
Evolution of big data analytics
Source: Srinivas Duvvuri; Bikramaditya, SinghalSpark for Data Science, Packt Publishing, 2016
292
Source: dzone.com
293
Source: dzone.com
A fast and general engine for large scale data processing
294
Big Data Ecosystem
295
Spark: History
296
297
What is Spark?
Framework for distributed processing.
Open source
Source: dzone.com
298
Source: databricks.com
299
Spark Platform
301
Source: dzone.com
Apache Spark
302 aekanun@imcinstitute.com
303
The Driver sends Tasks to the empty slots on the Executors when work has to be done:
Source: databricks.com
Apache Spark
304 aekanun@imcinstitute.com
RDD & Partition
- Though Spark sets the number
of partitions automatically
based on the cluster, we have
the liberty to set it manually by
passing it as a second
argument to the parallelize
function (for example,
sc.parallelize(data, 3)).
- A diagrammatic representation
of an RDD which is created
with a dataset with, say, 14
records (or tuples) and is
partitioned into 3, distributed
across 3 nodes:
The Spark engine
Source: Jump start into Apache Spark and Databricks
Big data Architecture and Analytics Platform
What is a RDD?
Fault tolerance
Immutable
Three methods for creating RDD:
Parallelizing an existing collection
Referencing a dataset
Transformation from an existing RDD
Types of files supported:
Text files
Sequence Files
Hadoop InputFormat
Source: dzone.com
319
DEMO: Transformation & Action
RDD#1
RDD#2-6
block4_rdd
320
RDD Creation
hdfsData = sc.textFile("hdfs://data.txt")
RDD#1
RDD#2-6
block4_rdd
323
Platform: Cloudera/Dataproc
Tools: Jupyter
(LAB II)
http://stat-computing.org/dataexpo/2009/the-data.html
Header: Attributes
Data/Fact
text ***
text ***
(LAB III)
Jan 2017
344
Introduction to DataFrame
DataFrame is an immutable distributed collection of data.
Unlike an RDD, data is organized into named columns,
like a table in a relational database.
345
Architecture: Spark SQL & DataFrame
346
Source: Srinivas Duvvuri and Bikramaditya Singhal,Spark for Data Science, Packt Publishing, 2016
RDDs vs. DataFrames: Similarities
Both are fault-tolerant, partitioned data abstractions in
Spark
Adapted from Srinivas Duvvuri and Bikramaditya Singhal,Spark for Data Science, Packt Publishing, 2016
347
Source: Jump start into Apache Spark and Databricks
348
RDDs vs. DataFrames: Differences
DataFrames are a higher-level abstraction than RDDs.
Adapted from Srinivas Duvvuri and Bikramaditya Singhal,Spark for Data Science, Packt Publishing, 2016
349
When to use RDDs?
You want low-level transformations, actions, and control of your dataset;
your data is unstructured, such as media streams or streams of text;
you want to manipulate your data with functional programming constructs rather than domain-specific expressions;
you don't care about imposing a schema, such as a columnar format, while processing or accessing data attributes by name or column; and
you can forgo some of the optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.
Adapted from databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
350
Get Access to the SparkSQL
Entry point to the DataFrame API: SQLContext or HiveContext
Adapted from Srinivas Duvvuri and Bikramaditya Singhal,Spark for Data Science, Packt Publishing, 2016
351
Creating DataFrames: RDDs
Adapted from Srinivas Duvvuri and Bikramaditya Singhal,Spark for Data Science, Packt Publishing, 2016
352
Creating DataFrames: JSON
HDFS (Hadoop Ecosystem)
Adapted from Srinivas Duvvuri and Bikramaditya Singhal,Spark for Data Science, Packt Publishing, 2016
353
Creating DataFrames: JDBC
Adapted from Srinivas Duvvuri and Bikramaditya Singhal,Spark for Data Science, Packt Publishing, 2016
354
SparkSQL can leverage the Hive
metastore
Hive Metastore can also be leveraged by a wide
array of applications
Spark
Hive
Impala
Available from HiveContext
355
SparkSQL: HiveContext
>>> color_df
Adapted from Srinivas Duvvuri and Bikramaditya Singhal,Spark for Data Science, Packt Publishing, 2016
360
DataFrames Operations
Check the schema.
>>> color_df.dtypes
>>> color_df.count()
>>> color_df.show(2)
Adapted from Srinivas Duvvuri and Bikramaditya Singhal,Spark for Data Science, Packt Publishing, 2016
361
DataFrames Operations
List out column names.
>>> color_df.columns
>>> color_df.drop('length').show()
>>> color_df.toJSON().first()
Adapted from Srinivas Duvvuri and Bikramaditya Singhal,Spark for Data Science, Packt Publishing, 2016
362
DataFrames Operations
>>> color_df.filter(color_df.length.between(4,5)).select(color_df.color.alias("mid_length")).show()
Adapted from Srinivas Duvvuri and Bikramaditya Singhal, Spark for Data Science, Packt Publishing, 2016
363
DataFrames Operations
>>> color_df.drop('length').show()
Adapted from Srinivas Duvvuri and Bikramaditya Singhal, Spark for Data Science, Packt Publishing, 2016
364
DataFrames Operations
First filter colors of length 4 or more, then sort on multiple columns.
With ascending=False, the filtered rows are sorted first on the column length and then, for rows with the same length, on color, both in descending order.
>>> color_df.filter(color_df['length'] >= 4).sort('length', 'color', ascending=False).show()
Adapted from Srinivas Duvvuri and Bikramaditya Singhal, Spark for Data Science, Packt Publishing, 2016
365
DataFrames Operations
You can use orderBy instead, which is an alias for sort.
>>> color_df.orderBy('length','color').take(4)
Adapted from Srinivas Duvvuri and Bikramaditya Singhal, Spark for Data Science, Packt Publishing, 2016
366
DataFrames Operations
GroupBy
>>> color_df.groupBy('length').count().show()
Adapted from Srinivas Duvvuri and Bikramaditya Singhal, Spark for Data Science, Packt Publishing, 2016
367
Large-Scale Machine Learning
using
Apache Spark MLlib & ML Pipeline
Mr. Aekanun Thongtae
IMC Institute
aekanun@imcinstitute.com
Sep 2017
368 aekanun@imcinstitute.com
Apache Spark
369 aekanun@imcinstitute.com
Apache Spark
370 aekanun@imcinstitute.com
What is MLlib?
372 aekanun@imcinstitute.com
MLlib Algorithms
373 aekanun@imcinstitute.com
MLlib Algorithms
375 aekanun@imcinstitute.com
ML Pipeline
- MLlib's goal is to make practical machine learning (ML) scalable and easy.
Source: databricks.com
376 aekanun@imcinstitute.com
ML Pipeline
- Leverages Spark SQL
Source: databricks.com
377 aekanun@imcinstitute.com
Hands-On: Basic Predictive Analytics
with MLlib and ML pipeline
(LAB I)
Focus: Process of the pipeline
Spark ML Pipeline
2. Combine selected columns into a single vector column.
3. Define an algorithm.
4. Build the Pipeline.
379
Define training set
0.0 : apple
1.0 : pineapple
2.0 : grape
Lending Club is a marketplace for personal loans that matches borrowers who are
seeking a loan with investors looking to lend money and make a return.
[Loan status timeline: Current → In Grace Period (15 days) → Late (16-30 days) → Late (31-120 days) → Default → Charged Off]
Source: lendingclub.com
- Lending Club evaluates each borrower's credit score using past historical data and assigns an interest rate to the borrower.
- A higher interest rate means the borrower is riskier and less likely to pay back the loan.
- A lower interest rate means the borrower has a good credit history and is more likely to pay back the loan.
- If the borrower accepts the interest rate, the loan is listed on the Lending Club marketplace.
Source: lendingclub.com
- Approved loans are listed on the Lending Club website, where qualified
investors can browse recently approved loans, the borrower's credit score,
the purpose for the loan, and other information from the application.
- Once they're ready to back a loan, they select the amount of money they
want to fund.
- Once a loan's requested amount is fully funded, the borrower receives the
money they requested minus the origination fee that Lending Club charges.
Source: lendingclub.com
Late payment (Past Due) / Not ? How much principal has been paid so far ?
[Loan status timeline: Current → In Grace Period (15 days) → Late (16-30 days) → Late (31-120 days) → Default → Charged Off]
Source: rstudio-pubs-static.s3.amazonaws.com
Image: weiminwang.blog
Sources: investopedia.com
Images: nrilifeinsurance.com, lendingmemo.com
Images: Funnelholic
Source: rstudio-pubs-static.s3.amazonaws.com
Columns / Field Names
Data / Fact
Images: Funnelholic
[Flow: rawweb_df and rawkaggle_df are loaded into DataFrames and unioned; the output is raw_df (Spark Core).]
398 aekanun@imcinstitute.com
Architecture and Flow of Data Processing
[Flow: df is registered as a temporary table; df_no_missing keeps ONLY the columns related to prediction, cleaned of missing values (Spark SQL).]
399 aekanun@imcinstitute.com
Architecture and Flow of Data Processing
[Flow: from the input df, a month-only extraction is applied to the date columns (earliest_cr_line and last_credit_pull_d keep only the month); the result is written to the DW (Parquet format) as Hive table personal_loan (Spark SQL).]
400 aekanun@imcinstitute.com
Architecture and Flow of Data Processing
[Flow: connect to the DW (Parquet format), table personal_loan, and compute basic statistics such as a crosstab of frequencies (Spark SQL).]
401 aekanun@imcinstitute.com
Architecture and Flow of Data Processing
[Flow: the DW (Parquet format) table personal_loan is registered as a temporary table; annual_inc and loan_amnt are normalized, turning raw_df into crunched_data (Spark SQL).]
402 aekanun@imcinstitute.com
Features
403 aekanun@imcinstitute.com
Data Cleansing & Transformation
- Missing Values:
- Remove all tuples that contain them.
- Label Values:
- Keep only the values charged off and fully paid; remove all others.
Source: stanford.edu
Training Data
252,531 Records
Testing Data
Records
Source: lendingclub.com
[Flow: raw_df is split into a training set and a testing set; the training set is transformed (numerical values, feature vectors) and fed to the algorithm: RandomForestClassifier.]
407 aekanun@imcinstitute.com
Training Data/Testing Data
Training Data
322,640 Records
Testing Data
3383 Records
Vector Space
DenseVector([0.0, 1.0, 0.0, 0.0, 4.0, 40.0])
Features: 18 columns
409 aekanun@imcinstitute.com
Architecture and Flow of Data Processing
[Flow: the RandomForestClassifier outputs a model; the testing set has not yet been transformed.]
410 aekanun@imcinstitute.com
Code for Training Model
Source: lendingclub.com
Source: stanford.edu
Images: dimensionless.in
Images: classeval.wordpress.com
                    Predicted
                 Positive    Negative
Actual Positive    TP          FN      -> Sensitivity / Recall
Actual Negative    FP          TN      -> Specificity (True Negative Rate)
                 Positive    Negative
                 Predictive  Predictive
                 Value       Value
                 (Precision)
Source: analyticsvidhya.com
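The metrics attached to the matrix can be computed directly from the four cell counts (the counts below are made up for illustration):

```python
# Cell counts of a hypothetical confusion matrix.
TP, FN, FP, TN = 80, 20, 10, 90

recall      = TP / (TP + FN)                   # sensitivity: share of actual positives detected
specificity = TN / (TN + FP)                   # true negative rate
precision   = TP / (TP + FP)                   # positive predictive value
accuracy    = (TP + TN) / (TP + FN + FP + TN)

print(recall, specificity, round(precision, 3), accuracy)
```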
414 aekanun@imcinstitute.com
Our Evaluation from Testing Data
- indexedLabel is the label/target collected from the observations.
322,640 Records!
Accuracy:
0.69
416 aekanun@imcinstitute.com
Belief or Not?
0.89
417 aekanun@imcinstitute.com
Proportion of all focused cases that were detected
0.71
418 aekanun@imcinstitute.com
Model Tuning
Images: http://machinelearningmastery.com/
421 aekanun@imcinstitute.com
Hands-on
Clustering on Network Interaction
! cd /mnt
! rm -rf /mnt/kddcup.data*
! wget
http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data.gz
! gunzip kddcup.data.gz
! ls /mnt/kddcup.data* -lh
import sys
import os

try:
    from numpy import array   # used by parse_interaction below
    from pyspark import SparkContext, SparkConf
    from pyspark.mllib.clustering import KMeans
    from pyspark.mllib.feature import StandardScaler
    print("Successfully imported Spark Modules")
except ImportError as e:
    print("Can not import Spark Modules", e)
    sys.exit(1)
max_k = 20
data_file = '/user/root/kddcup.data'
def parse_interaction(line):
    """
    Parses a network data interaction.
    """
    line_split = line.split(",")
    clean_line_split = [line_split[0]] + line_split[4:-1]
    return (line_split[-1], array([float(x) for x in clean_line_split]))
raw_data = sc.textFile(data_file)
parsed_data = raw_data.map(parse_interaction)
parsed_data_values = parsed_data.values().cache()

standardizer = StandardScaler(True, True)   # withMean, withStd
standardizer_model = standardizer.fit(parsed_data_values)
standardized_data_values = standardizer_model.transform(parsed_data_values)

scores = list(map(lambda k: clustering_score(standardized_data_values, k),
                  range(10, max_k + 1, 10)))  # call predefined functions
Adapted from: Suresh Kumar Gorakala, Building Recommendation Engines, Packt Publishing, 2016
Apache Spark in Action Aekanun Thongtae, aekanun@imcinstitute.com Mar 2017
Preparing Large Dataset
http://grouplens.org/datasets/movielens/
data = sc.textFile("./u.data")
# u.data is tab-separated: user id, item id, rating, timestamp
ratings = data.map(lambda l: l.split('\t')) \
              .map(lambda p: (int(p[0]), int(p[1]), float(p[2])))
df = ratings.toDF(['user', 'product', 'rating'])
import numpy as np
import matplotlib.pyplot as plt
n_groups = 5
x = df.groupBy("rating").count().select('count')
xx = x.rdd.flatMap(lambda x: x).collect()
fig, ax = plt.subplots()
index = np.arange(n_groups)
bar_width = 1
opacity = 0.4
rects1 = plt.bar(index, xx, bar_width,
alpha=opacity,
color='b',
label='ratings')
plt.xlabel('ratings')
plt.ylabel('Counts')
plt.title('Distribution of ratings')
plt.xticks(index + bar_width, ('1.0', '2.0', '3.0', '4.0', '5.0'))
plt.legend()
plt.tight_layout()
plt.show()
df.stat.crosstab("user", "rating").show()
df.groupBy('user').agg({'rating': 'mean'}).show(5)
from pyspark.mllib.recommendation import ALS

training, test = ratings.randomSplit([0.8, 0.2])  # hold out part of the data
rank = 10
numIterations = 10
model = ALS.train(training, rank, numIterations)
pred_ind = model.predict(1, 5)
testdata = test.map(lambda r: (r[0], r[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
recommedItemsToUsers = model.recommendProductsForUsers(10)
recommedItemsToUsers.take(2)
Hadoop Workshop using Cloudera on Amazon EC2 Thanachart Numnonda, thanachart@imcinstitute.com December 2016
Launch Dataproc using the gcloud command with Jupyter Notebook
I) Launch Cloud Shell
Hive.apache.org
Launch Dataproc using gcloud command
with Jupyter Notebook
II) Type the following command.
(Change the name to your own.)
Launch the Jupyter notebook
http://<<public ip>>:8123
Course Supplement
452 aekanun@imcinstitute.com
www.facebook.com/imcinstitute
453
Thank you
thanachart@imcinstitute.com aekanun@imcinstitute.com
www.facebook.com/imcinstitute www.facebook.com/analyticsindeep
www.aekanun.com
www.slideshare.net/imcinstitute
www.thanachart.org
454
A Machine Learning Issue
Source: scikit-learn.org
Model selection (a.k.a. hyperparameter tuning)
460 aekanun@imcinstitute.com
Python overtakes R, becomes the leader in Data
Science, Machine Learning platforms
Source: kdnuggets.com
461 aekanun@imcinstitute.com
Python overtakes R, becomes the leader in Data
Science, Machine Learning platforms
Source: kdnuggets.com
462 aekanun@imcinstitute.com
Statistics
Source: mph.ufl.edu
463 aekanun@imcinstitute.com