1. Machine Learning - Introduction
Introduction
Disclaimer: This material is protected under copyright by AnalytixLabs ©, 2011-2016. Unauthorized use and/or duplication of this material or any part of it, including data, in any form without explicit written permission from AnalytixLabs is strictly prohibited. Any violation of this copyright will attract legal action.
Introduction to Machine Learning
Application Information:
• Age: 23 years
• Gender: Male
• Annual salary: USD 60,000
• Years in residence: 1 year
• Years in job: 0.5 year
• Current debt: USD 5,000
[Diagram: data feeds machine learning, which improves the performance measure]
Machine Learning
• Supervised Learning: statistical modeling, recommender systems
• Unsupervised Learning: clustering, association rules, dimension reduction, self-organizing maps
• Reinforced Learning
Types of Learning
• Supervised (Inductive) Learning
• Training data includes desired outputs
• Labelled input data.
• Creating classifiers to predict unseen inputs.
• Unsupervised Learning
• Training data does not include desired outputs
• Unlabelled input data.
• Creating a function that captures the relations and structure in the data
• Semi-supervised Learning
• Training data includes a few desired outputs
• Combines Supervised and Unsupervised Learning methodology
• Reinforcement Learning (e.g., Markov Decision Processes, Q-Learning, etc.)
• No label is provided; the feedback only indicates whether a chosen label/action is correct or not
• Rewards arrive from a sequence of actions
• A reward-punishment based agent (see the sketch below)
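To make the reward-punishment idea concrete, here is a minimal Q-learning sketch on a hypothetical 1-D chain world; the environment, states, and reward values are illustrative assumptions, not from the course material:

```python
import random

# A minimal Q-learning sketch on a hypothetical 1-D chain world
# (states, actions and rewards are illustrative assumptions).
n_states, n_actions = 5, 2          # states 0..4; actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(state, action):
    """Move left/right along the chain; reward 1 only for reaching the last state."""
    nxt = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
    return nxt, (1.0 if nxt == n_states - 1 else 0.0)

for episode in range(200):
    s = 0
    while s != n_states - 1:
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore.
        a = random.randrange(n_actions) if random.random() < epsilon \
            else max(range(n_actions), key=lambda i: Q[s][i])
        s2, r = step(s, a)
        # Reward-punishment update: nudge Q(s, a) toward r + gamma * max_a' Q(s', a').
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print(Q)  # the "go right" action should end up with the higher value in each state
```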
Supervised Learning
1. Supervised (inductive) Learning: learn a target function (a classifier) from a labeled data-set.

Training Data (labeled data-set):
Patient Age | Tumor Size | Clump Thickness | … | Malignant?
55          | 5          | 3               | … | TRUE
70          | 4          | 7               | … | TRUE
85          | 4          | 6               | … | FALSE
35          | 2          | 1               | … | FALSE
…           | …          | …               | … | …

Training Data → LEARNING → Model

Test Data:
Patient Age | Tumor Size | Clump Thickness | … | Malignant?
72          | 3          | 3               | … | ?
66          | 4          | 4               | … | ?

• Training data includes both predictors (Xi) and response (Yi)
• Labelled input data
• Creating classifiers to predict unseen inputs

Applications:
• Spam detection
• Information retrieval
• Personalisation based on ranks
• Speech recognition
Supervised Learning - Algorithms
• Linear Regression
• Logistic Regression
• Decision Trees (CHAID, CART & Random Forest)
• k-Nearest Neighbours (KNN)
• Naive Bayes (Bayesian Learning)
• Discriminant Analysis (LDA/QDA) – classification using linear (LDA) or quadratic (QDA) decision boundaries
• Neural Networks
• SVM and Kernel estimation
• Perceptron and Multi-layer Perceptrons (ANN – Deep Learning)
• Ensemble Models
• …
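As a minimal illustration of one of these algorithms, the sketch below fits a logistic regression classifier to the tumor table from the previous slide, assuming scikit-learn is available (the library choice is an assumption, not prescribed by the course):

```python
# A minimal supervised-learning sketch, assuming scikit-learn is available.
from sklearn.linear_model import LogisticRegression

# Labeled training data from the slide: age, tumor size, clump thickness -> malignant?
X_train = [[55, 5, 3], [70, 4, 7], [85, 4, 6], [35, 2, 1]]
y_train = [1, 1, 0, 0]  # TRUE, TRUE, FALSE, FALSE

clf = LogisticRegression().fit(X_train, y_train)

# Predict labels for the unseen test rows from the slide.
print(clf.predict([[72, 3, 3], [66, 4, 4]]))
```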
Un-supervised Learning
2. Un-supervised learning detects natural patterns.

Input data (no label):
Age | State | Annual Income | Marital Status
25  | CA    | $80,000       | M
45  | NY    | $150,000      | D
55  | WA    | $100,500      | M
18  | TX    | $85,000       | S
…   | …     | …             | …

→ Naturally occurring (hidden) structure

• Un-labelled input data
• In this situation only the Xi's are observed
• We need to use the Xi's to guess what Yi would have been, and build a model from there
• Finding hidden structure in data
• Creating a function that captures the relations and structure in the data

Applications:
• Market segmentation – divide potential customers into groups based on their characteristics
• Pattern recognition
• Groupings based on a distance measure
• Groups of people, objects, …
Unsupervised Learning - Algorithms
• Clustering
• k-Means, Hierarchical Clustering
• Hidden Markov Models (HMM)
• Dimension Reduction (Factor Analysis, PCA)
• Feature Extraction methods
• Self-organizing Maps (Neural Nets)
• …
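A minimal clustering sketch, assuming scikit-learn and using the numeric age/income columns from the customer table above (the cluster count of 2 is an arbitrary illustrative choice):

```python
# A minimal clustering sketch, assuming scikit-learn.
from sklearn.cluster import KMeans

# Numeric columns (age, annual income) from the customer table above; no labels.
X = [[25, 80_000], [45, 150_000], [55, 100_500], [18, 85_000]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # group assignments discovered from the data's structure alone
```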
Semi-Supervised Learning
• Training data includes a few desired outputs
• Combines Supervised & Unsupervised Learning methodologies
Application:
• Webpage classification
Reinforced Learning

Regression: Predicting a Quantity
Some techniques:
- Linear Regression / GLM
- Decision Trees
- Support Vector Regression
- Ensembles
- Etc…
Classification: Predicting a Category
Binary Problems:
• credit approve/disapprove
• email spam/non-spam
• patient sick/not sick
• ad profitable/not profitable
• answer correct/incorrect

Multi-Class Problems:
• written digits ⇒ 0, 1, · · · , 9
• pictures ⇒ apple, orange, strawberry
• emails ⇒ spam, primary, social, promotion, update (Google)
Some techniques:
- Naïve Bayes
- Decision Tree
- Logistic Regression/GLM
- Support Vector Machines
- Neural Network
- Ensembles
- LDA, QDA
- Etc…
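As a sketch of a multi-class problem, the code below trains one of the listed techniques (a decision tree) on scikit-learn's bundled written-digits data (0..9); the library and dataset are assumptions for illustration:

```python
# A multi-class sketch on scikit-learn's bundled written-digits data (0..9).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print(model.score(X_te, y_te))  # accuracy over all ten digit classes
```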
Product Affinity & Recommendation
A. Product-to-Product Affinity
B. Identifying frequent item sets
C. Customer-to-Product Propensity

Some techniques:
- Market Basket Analysis
- FP-Growth
- A-priori Algorithm
- Collaborative Filtering
- Etc…

Example transaction grid (Y = item present in the transaction):
      Item 1  Item 2  Item 3  Item 4  Item 5
Tx 1  Y       N       N       Y       N
Tx 2  Y       N       N       Y       N
Tx 3  Y       Y       N       Y       N
Tx 4  N       N       Y       Y       Y
Tx 5  …       …       …       …       …
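A minimal frequent-item-set sketch over the transaction grid above, counting item pairs by brute force (a simplified stand-in for the A-priori algorithm; the minimum-support threshold of 2 is an arbitrary choice):

```python
from itertools import combinations

# Frequent item-pair counting over the transaction grid from the slide
# (Y = item present in the transaction).
transactions = [
    {"Item 1", "Item 4"},            # Tx 1
    {"Item 1", "Item 4"},            # Tx 2
    {"Item 1", "Item 2", "Item 4"},  # Tx 3
    {"Item 3", "Item 4", "Item 5"},  # Tx 4
]
min_support = 2

counts = {}
for tx in transactions:
    for pair in combinations(sorted(tx), 2):
        counts[pair] = counts.get(pair, 0) + 1

print({p: c for p, c in counts.items() if c >= min_support})
# {('Item 1', 'Item 4'): 3} — the pair most often bought together
```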
B. Building a Machine Learning Application
1. Around 80% of data analysis time is spent on the process of cleaning & preparing the data.
2. Data cleaning starts with data tidying, where each variable is a column, each observation is a row, and each type of observational unit is a table.
3. Fixed variables should come first, followed by measured variables, each ordered so that related variables are contiguous. Rows can then be ordered by the first variable, breaking ties with the second and subsequent (fixed) variables.
[Diagram: messy Table-1, Table-2 and Table-3 reorganized into tidy tables]
Common Errors…
1. Column headers are values, not variable names.

Messy (time-series data with dates as columns):
Product  1-Jan-15  2-Jan-15  3-Jan-15  4-Jan-15  5-Jan-15  6-Jan-15
XAZ256   27        34        60        81        76        137
XAZ256   12        27        37        52        35        70
XAZ256   27        21        30        34        33        58
XAZ256   418       617       732       670       638       1116
XAZ256   1         9         7         9         11        34
XAZ256   20        27        24        24        21        30
XAZ256   19        19        25        25        30        95

Tidy (modeling & analysis becomes easy with this structure):
Product  DATE      Qty
XAZ256   1/1/2015  27
XAZ256   1/2/2015  34
XAZ256   1/3/2015  60
XAZ256   1/4/2015  81
XAZ256   1/5/2015  76
XAZ256   1/6/2015  137
XAZ256   1/1/2015  12
XAZ256   1/2/2015  27
…
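Assuming pandas, the wide-to-tidy reshaping above can be done with a single melt; the column names below mirror the tidy table shown:

```python
# A tidying sketch, assuming pandas: turn the date columns into rows.
import pandas as pd

wide = pd.DataFrame({
    "Product": ["XAZ256"],
    "1-Jan-15": [27], "2-Jan-15": [34], "3-Jan-15": [60],
})

tidy = wide.melt(id_vars="Product", var_name="DATE", value_name="Qty")
print(tidy)  # one (Product, DATE, Qty) row per observation
```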
Common Errors…
2. Multiple variables stored in one column.

Messy (WHO data on some disease, as extracted):
Country  Year  Candidate  Cases
AD       2000  m014       0
AD       2000  m1524      0
AD       2000  m2534      1
AD       2000  m3544      0
AD       2000  m4554      0
AD       2000  m5564      0
AD       2000  m65        0
AE       2000  m014       2
AE       2000  m1524      4
AE       2000  m2534      4
AE       2000  m3544      6
AE       2000  m4554      5
AE       2000  m5564      12
AE       2000  m65        10
AE       2000  f014       3

Tidy (modeling & analysis becomes easy with this structure):
Country  Year  Gender  Age    Cases
AD       2000  m       0-14   0
AD       2000  m       15-24  0
AD       2000  m       25-34  1
AD       2000  m       35-44  0
AD       2000  m       45-54  0
AD       2000  m       55-64  0
AD       2000  m       65+    0
AE       2000  m       0-14   2
AE       2000  m       15-24  4
AE       2000  m       25-34  4
AE       2000  m       35-44  6
AE       2000  m       45-54  5
AE       2000  m       55-64  12
AE       2000  m       65+    10
AE       2000  f       0-14   3
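Assuming pandas again, a sketch of splitting the combined gender/age-band column into separate variables (the age-band mapping is abbreviated to a few of the codes shown above):

```python
# A sketch, assuming pandas: split the combined gender/age-band column apart.
import pandas as pd

who = pd.DataFrame({"Country": ["AD", "AE"], "Year": [2000, 2000],
                    "Candidate": ["m014", "f014"], "Cases": [0, 3]})

who["Gender"] = who["Candidate"].str[0]                     # leading letter: m / f
who["Age"] = who["Candidate"].str[1:].map({"014": "0-14",   # abbreviated mapping
                                           "1524": "15-24", "65": "65+"})
print(who.drop(columns="Candidate"))
```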
Common Errors…
3. Variables are stored in both rows and columns. Modeling & analysis becomes easy with the tidy structure.

Sampling:
• A random sampling is done to split the data into training & test data
• Time-series data should not be sampled randomly; the split should be contiguous in time
• One time period might depend on all previous time periods
• Use a well-distributed, balanced sample
• Over- and under-sampling need to be carried out for imbalanced data! (See the sketch below.)
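A minimal sketch of the time-series caveat: split chronologically rather than randomly (plain Python, toy data):

```python
# A sketch of the time-series caveat: split chronologically, never randomly.
data = list(range(100))          # observations already ordered by time
cut = int(len(data) * 0.8)

train, test = data[:cut], data[cut:]  # the test chunk is strictly later in time
```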
Building a ML Model
y = f(x)
where y is the output, f the prediction function, and x the input features (e.g., image features)
• Training: given a training set of labeled examples {(x1,y1), …, (xN,yN)}, estimate the prediction function f by minimizing the prediction error on the training set
• Testing: apply f to a never-before-seen test example x and output the predicted value y = f(x)
• Parametric: Algorithms that assume a known form for the function are called parametric machine learning algorithms.
• The algorithms involve two steps:
• Select a form for the function.
• Learn the coefficients for the function from the training data.
• E.g., Linear and Logistic Regression.
• Non-Parametric: Algorithms that do not make strong assumptions about the form of the mapping
function are called nonparametric machine learning algorithms.
• By not making assumptions, they are free to learn any functional form from the training data.
• Non-parametric methods are often more flexible and can achieve better accuracy, but they require a lot more data and training time.
• E.g., Support Vector Machines, Neural Networks and Decision Trees.
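A small sketch contrasting the two families, assuming scikit-learn; the quadratic toy target is an illustrative assumption:

```python
# Parametric vs non-parametric sketch, assuming scikit-learn.
from sklearn.linear_model import LinearRegression  # parametric: assumes a linear form
from sklearn.tree import DecisionTreeRegressor     # non-parametric: form learned from data

X = [[x] for x in range(10)]
y = [x ** 2 for x in range(10)]                    # a non-linear target

print(LinearRegression().fit(X, y).score(X, y))    # limited by its assumed form
print(DecisionTreeRegressor().fit(X, y).score(X, y))  # flexible: fits training data perfectly
```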
Building a ML Model
In-sample error: the error resulting from applying your prediction algorithm to the dataset you built it with
• Also known as resubstitution error
• Often optimistic (lower than on a new sample), as the model may be tuned to the error of the sample
Out-of-sample error: the error resulting from applying your prediction algorithm to a new data set
• Also known as generalization error
• Out-of-sample error matters most, as it better evaluates how the model will perform on new data
$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$

where ŷi is the prediction our method gives for the i-th observation in our training data.
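The same formula in plain Python (the numbers are made up for illustration):

```python
# Computing the training MSE in plain Python (illustrative numbers).
y = [3.0, 5.0, 7.0]        # observed responses
y_hat = [2.5, 5.5, 6.0]    # our method's predictions

mse = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat)) / len(y)
print(mse)  # 0.5
```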
The Problem
The training method has been designed to make MSE small on the
training data we are looking at (e.g. with linear regression we choose the
line such that MSE is minimized.)
What we really care about is how well the method works on new data.
We call this new data “Test Data”.
There is no guarantee that the method with the smallest training MSE
will have the smallest test (i.e. new data) MSE.
Training vs. Test MSE’s
In general, the more flexible a method is, the lower its training MSE will be; i.e., it will "fit" or explain the training data very well. However, the test MSE may in fact be higher for a more flexible method than for a simple approach like linear regression.
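A sketch of this effect, assuming scikit-learn and NumPy: polynomial regressions of increasing degree are fit to noisy sine data (an illustrative assumption), and the training MSE keeps falling while the test MSE typically turns back up:

```python
# Training vs test MSE as flexibility grows, assuming scikit-learn and NumPy.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, 60)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.3, 60)      # noisy sine: an assumed toy target
X_tr, y_tr, X_te, y_te = X[::2], y[::2], X[1::2], y[1::2]  # interleaved split

for degree in (1, 3, 12):                           # increasing flexibility
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),  # training MSE: keeps falling
          mean_squared_error(y_te, model.predict(X_te)))  # test MSE: falls, then rises
```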
The Trade-off
It can be shown that, for any given X = x0, the expected test MSE for a new Y at x0 will be equal to

$E\left( y_0 - \hat{f}(x_0) \right)^2 = \mathrm{Var}\left( \hat{f}(x_0) \right) + \left[ \mathrm{Bias}\left( \hat{f}(x_0) \right) \right]^2 + \mathrm{Var}(\varepsilon)$

where Var(ε) is the irreducible error.
As a model gets more complex, the bias will decrease and the variance will increase, but the expected test MSE may go up or down!
A Fundamental Picture
Variance refers to how much your estimate for
f would change by if you had a different
training data set.
Generally, the more flexible a method is the
more variance it has.
In general training errors will always decline if
model complexity increases
However, test errors will decline at first (as
reductions in bias dominate) but will then start
to increase again (as increases in variance
dominate).
We must always keep this picture in mind when choosing a learning method.
More flexible/complicated is not always better!
Bias/ Variance Tradeoff
The previous graphs of test versus training MSEs illustrate a very important tradeoff that governs the choice of statistical learning methods.
To avoid overfitting
1) Regularization
2) Cross Validation
Image source: pingax.com
Test MSE, Bias and Variance
Bias – Variance Trade Off – Few Tips
• High number of features and few examples (observations)
• Reduce the number of features (but that loses information)
• Regularization
• If your predictions show large errors:
• Get more training data
• Try a smaller set of features
• Try getting additional features
• Add polynomial features
• Build your own, new, better features based on your knowledge of the problem
– This can be risky if you accidentally over-fit your data by creating new features which are inherently specific/relevant to your training data
Regularization
Constrain the weights: impose a penalty for complexity.

$\hat{M} = \arg\min_{M} \sum_{i} L\left( y_i, \hat{y}_i(M) \right) + \lambda R(M)$

With a combined penalty R weighted by λ1 (L1 term) and λ2 (L2 term): if λ2 is 0, the regularization is called LASSO (advantage: it does feature selection too); if λ1 is 0, it is called ridge. λ1 + λ2 is always 1, and if both are present in the objective function, it is called elastic-net regularization.
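Assuming scikit-learn, the three penalties map onto ready-made estimators; the alpha and l1_ratio values below are arbitrary illustrative choices:

```python
# Regularization sketch, assuming scikit-learn (alpha/l1_ratio are illustrative).
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X = [[1, 0], [0, 1], [1, 1], [2, 1]]
y = [1, 2, 3, 4]

print(Lasso(alpha=0.1).fit(X, y).coef_)                     # L1: can zero out features
print(Ridge(alpha=0.1).fit(X, y).coef_)                     # L2: shrinks all weights
print(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)  # a mix of L1 and L2
```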
Cross Validation
Procedure: split the training set into sub-training/sub-test sets; build the model on the sub-training set; evaluate it on the sub-test set; repeat and average the estimated errors.
Result:
• We are able to fit/test various different models, with different variables included, to find the best one on the cross-validated test sets
• We are able to try out different types of prediction algorithms and pick the best performing one
• We are able to choose the parameters in the prediction function and estimate their values
• Note: the original test set is left completely untouched, so when the final prediction algorithm is applied to it, the result is an unbiased measurement of the out-of-sample accuracy of the model
Approaches:
• Random subsampling/ Holdout Method
• K-fold
• Leave one out
Considerations:
• Time-series data must be used in "chunks"
– one time period might depend on all previous time periods (do not take random samples)
Cross-Validation Approach-1: Train, Test, Validate Datasets
[Diagram: data split into training, test and validation sets]
Sample design guidelines for a prediction study (see the split sketch after this list):
• For large sample sizes: 60% training, 20% test, 20% validation
• For medium sample sizes: 60% training, 40% test; no validation set to refine the model (to ensure the test set is of sufficient size)
• For small sample sizes:
• Carefully consider whether there are enough samples to build a prediction algorithm
• Report the caveat of the small sample size and highlight the fact that the prediction algorithm has never been tested for out-of-sample error
There should always be a test/validation set that is held aside and NOT looked at when building the model.
• When complete, apply the model to the held-out set only one time
Randomly sample training and test sets.
• For data collected over time, build the training set in chunks of time
Datasets must reflect the structure of the problem.
• If prediction evolves with time, split train/test sets in time chunks (known as back-testing in finance)
Subsets of data should reflect as much diversity as possible.
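A sketch of the 60/20/20 guideline using scikit-learn's train_test_split (an assumed utility; toy data):

```python
# A 60/20/20 split sketch, assuming scikit-learn (toy data).
from sklearn.model_selection import train_test_split

X, y = [[i] for i in range(100)], list(range(100))

# 60% training first, then split the remaining 40% evenly into test and validation.
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_te, X_val, y_te, y_val = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_tr), len(X_te), len(X_val))  # 60 20 20
```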
Cross-Validation Approach-2: K-fold validation
• Break the training set into K subsets
• For each subset, build the model/predictor on the remaining training data and apply it to that test subset
• Rebuild the model K times with these training and test subsets
• Average the findings
Considerations:
• Larger K = less bias, more variance
• Smaller K = more bias, less variance
source: wikipedia
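A K-fold sketch, assuming scikit-learn and its bundled iris data (K = 5 is an illustrative choice):

```python
# A K-fold sketch, assuming scikit-learn and its bundled iris data.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)  # K = 5 folds

print(scores.mean())  # average the findings across the K rebuilds
```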
Cross-Validation Approach-3: Leave One Out (LOO)
• Leave out exactly one sample and build the predictor on the rest of the training data
• Predict the value for the left-out sample
• Repeat for each sample
source: wikipedia
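The same evaluation with leave-one-out, again assuming scikit-learn; LOO is effectively K-fold with K equal to the number of samples:

```python
# A leave-one-out sketch, assuming scikit-learn and its bundled iris data.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())

print(scores.mean())  # one prediction per left-out sample, averaged
```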
Let’s Check Our Understanding…
Quiz-1
A. no pattern
B. programmable definition
C. pattern: customer behavior; definition: not easily programmable; data: history of bank operation
D. arguably no (or not enough) data yet
Answer: C
Quiz-2
Answer: C
Note: While data mining and machine learning do share a huge overlap, they are arguably not equivalent because of the difference in focus.
Quiz-3
Of the following examples, which would you address using an unsupervised learning algorithm?
Problem 1: You have a large inventory of identical items. You want to predict how many of these items will sell over the next 3 months.
Problem 2: You'd like software to examine individual customer accounts, and for each account decide if it has been hacked/compromised.
The Answer: C
Quiz-5
The entrance system of the school gym, which does automatic face recognition based on machine learning, is built to charge four different groups of users differently: staff, student, professor, other. What type of learning problem best fits the need of the system?
A. Binary Classification
B. Multi-Class Classification
C. Regression
D. None of the above
The Answer: B
Puzzle
A huntsman can hit a target with a probability of 0.2.
Join us on:
Twitter - http://twitter.com/#!/AnalytixLabs
Facebook - http://www.facebook.com/analytixlabs
LinkedIn - http://www.linkedin.com/in/analytixlabs
Blog - http://www.analytixlabs.co.in/category/blog/