Machine Learning
• Basket analysis:
  P(Y | X): the probability that somebody who buys X also buys Y, where X and Y are products/services.
• Market-basket transactions
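As a small hedged illustration (the transactions and product names below are invented for the example, not taken from the slide), P(Y | X) can be estimated from market-basket transactions by simple counting:

```python
# Minimal sketch: estimate P(Y | X) from market-basket transactions by counting.
# The baskets and product names are illustrative assumptions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "cola"},
]

def conditional_prob(transactions, x, y):
    """P(Y | X): fraction of baskets containing x that also contain y."""
    with_x = [t for t in transactions if x in t]
    if not with_x:
        return 0.0
    return sum(1 for t in with_x if y in t) / len(with_x)

print(conditional_prob(transactions, "diapers", "beer"))  # 1.0 in this toy data
```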
[Figure: the universal set (unobserved), with the portions covered by data acquisition and by practical usage]
[Figure: testing pipeline, test image → image features → learned model → prediction]
Prediction: Regression
• Example: price of a used car
  y = w x + w0
• x: car attributes
  y: price
• y = g(x | θ)
  g(·): the model, θ: its parameters
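A minimal sketch of fitting y = w x + w0 by least squares with NumPy; the car-price numbers below are made up for illustration:

```python
import numpy as np

# Toy data: x = a car attribute (e.g., age in years), y = price in thousands. Invented values.
x = np.array([1.0, 2.0, 3.0, 5.0, 8.0])
y = np.array([18.0, 16.5, 14.8, 12.0, 8.5])

# Fit y = w*x + w0 by least squares: design matrix with a bias column of ones.
A = np.column_stack([x, np.ones_like(x)])
(w, w0), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"learned model: y = {w:.2f} * x + {w0:.2f}")
print("prediction for x = 4:", w * 4 + w0)
```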
Regression Applications
• Navigating a car: Angle of the steering
wheel (CMU NavLab)
• Kinematics of a robot arm: given a target position (x, y), learn the joint angles
  α1 = g1(x, y)
  α2 = g2(x, y)
Inductive Learning
• Given examples of a function (X, F(X))
• Predict the value F(X) for new examples X
– Discrete F(X): Classification
– Continuous F(X): Regression
– F(X) = Probability(X): Probability estimation
Supervised Learning: Uses
Example: decision trees are tools that create such rules (a small sketch follows the list below)
• Prediction of future cases: Use the rule to
predict the output for future inputs
• Knowledge extraction: The rule is easy to
understand
• Compression: The rule is simpler than the
data it explains
• Outlier detection: Exceptions that are not
covered by the rule, e.g., fraud
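As a sketch of the prediction and knowledge-extraction uses, a decision tree learned with scikit-learn can be printed as human-readable rules; the tiny dataset and feature names are assumptions for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy used-car data: [age_years, mileage_10k_km]; label 1 = "good buy". Invented values.
X = [[1, 2], [2, 3], [3, 5], [6, 9], [8, 12], [10, 15]]
y = [1, 1, 1, 0, 0, 0]

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Knowledge extraction: the learned rule is easy to read.
print(export_text(clf, feature_names=["age_years", "mileage_10k_km"]))

# Prediction of future cases: apply the rule to a new input.
print(clf.predict([[4, 6]]))
```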
Algorithms
• The success of a machine learning system also depends on the algorithms.
• Supervised learning, unsupervised learning, semi-supervised learning
What are we seeking?
• Supervised: low out-of-sample error (E_out), or maximization of a probabilistic objective
• Dissimilarity measurement (a small distance sketch follows this list)
  – Distance: Euclidean (L2), Manhattan (L1), …
  – Angle: inner product, …
  – Non-metric: rank, intensity, …
• Types of Clustering
– Hierarchical
• Agglomerative or divisive
– Partitioning
• K-means, VQ, MDS, …
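A minimal NumPy sketch of the dissimilarity measures named above; the two example vectors are arbitrary:

```python
import numpy as np

# Two arbitrary feature vectors for illustration.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 4.0])

euclidean = np.linalg.norm(a - b)                              # L2 distance
manhattan = np.sum(np.abs(a - b))                              # L1 distance
cosine_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # angle-based (inner product)

print(euclidean, manhattan, cosine_sim)
```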
K-means clustering (figure: MATLAB help page)
• Find K partitions such that the total intra-cluster variance is minimized
• Iterative method
  – Initialization: randomize the centroids yi
  – Assignment of each x to its nearest centroid (yi fixed)
  – Update of the centroids yi (assignments of x fixed)
• Problem? The iteration can get trapped in local minima (MacKay, 2003)
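A minimal NumPy sketch of this iterative method (random initialization, assignment step, update step); the data generation and the choice K = 3 are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D data from three blobs (invented for illustration).
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in [(0, 0), (3, 0), (0, 3)]])

def kmeans(X, k, n_iter=20):
    # Initialization: randomize the centroids y_i by picking k data points.
    y = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment of each x to its nearest centroid (y_i fixed).
        d = np.linalg.norm(X[:, None, :] - y[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update of the centroids y_i (assignments of x fixed); keep a centroid if its cluster is empty.
        y = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else y[j]
                      for j in range(k)])
    return y, labels

centroids, labels = kmeans(X, k=3)
print(centroids)  # the result depends on the initialization: local minima are possible
```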
Deterministic annealing
• Deterministically avoid local minima
  – No stochastic process (random walk)
  – Trace the global solution by changing the level of randomness
• Statistical mechanics: the Gibbs distribution (Maxima and Minima, Wikipedia)
• Analogy to the physical annealing process
  – Control the energy (randomness) via the temperature (high → low)
• Start with a high temperature (T = 1)
  – Soft (fuzzy) association probabilities
  – Smooth cost function with one global minimum
• Lower the temperature (T → 0)
  – Hard association
  – The full complexity is revealed and the clusters emerge
• Iterate the association and update steps as the temperature is lowered.
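A brief sketch of the Gibbs (soft-association) probability this view relies on; the squared-distance cost below is my assumption about the exact form used on the slide:

```latex
% Soft association of point x to centroid y_j at temperature T (Gibbs distribution):
p(y_j \mid x) \;=\; \frac{\exp\!\bigl(-\lVert x - y_j \rVert^2 / T\bigr)}
                         {\sum_{k} \exp\!\bigl(-\lVert x - y_k \rVert^2 / T\bigr)}
% Large T: nearly uniform (soft/fuzzy) association, smooth cost surface.
% T -> 0: hard assignment to the nearest centroid (the K-means limit).
```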
Dimensionality reduction
• Definition: the process of transforming high-dimensional data into a low-dimensional representation, to improve accuracy or understanding, or to remove noise.
• Curse of dimensionality: complexity grows exponentially in volume as extra dimensions are added.
• Types (Koppen, 2000)
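As a hedged example of one common dimensionality-reduction technique (PCA is my choice here, not named on the slide), a minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy high-dimensional data (100 samples, 10 features); invented for illustration.
X = rng.normal(size=(100, 10))

def pca(X, n_components):
    """Project X onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)                 # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T         # low-dimensional representation

Z = pca(X, n_components=2)
print(Z.shape)  # (100, 2): same samples, far fewer dimensions
```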
Machine Learning in a
Nutshell
• Tens of thousands of machine learning algorithms exist
• Hundreds of new ones appear every year
• Every machine learning algorithm has
three components:
– Representation
– Evaluation
– Optimization
Generative vs. Discriminative Classifiers
• Generative models
  – Represent both the data and the labels
  – Often make use of conditional independence and priors
  – Examples: Naïve Bayes classifier, Bayesian network
  – Models of the data may apply to future prediction problems
• Discriminative models
  – Learn to directly predict the labels from the data
  – Often assume a simple boundary (e.g., linear)
  – Examples: logistic regression, SVM, boosted decision trees
  – Often easier to predict a label from the data than to model the data
[Figure: class-conditional densities of two features (x1 = pitch of voice, x2) for the classes male and female]
• Generative: model the class-conditional densities P(x1, x2 | y) for each class
• Discriminative: model the decision boundary directly, e.g., a linear log-likelihood ratio
  log [ P(x1, x2 | y = 1) / P(x1, x2 | y = -1) ] = w^T x,
  so that P(y = 1 | x1, x2) = 1 / (1 + exp(-w^T x))
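A minimal sketch of the discriminative (logistic) form reconstructed above; the weights and input are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary learned weights w (last entry acts as the bias) and one input x.
w = np.array([1.5, -0.8, 0.2])
x = np.array([2.0, 1.0, 1.0])   # last entry = 1 so the bias term is included

p = sigmoid(w @ x)              # P(y = 1 | x) = 1 / (1 + exp(-w^T x))
print(p)
```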
Classifiers: Nearest neighbor
[Figure: a test example among training examples from class 1 and class 2; the test example is assigned the label of the closest training example. Source: D. Lowe]
K-nearest neighbor
• It can be used for both classification and regression problems.
• However, it is more widely used for classification problems in industry.
• K-nearest neighbours is a simple algorithm that
  – stores all available cases and
  – classifies new cases by a majority vote of their k nearest neighbours.
  – The case is assigned to the class that is most common amongst its K
    nearest neighbours, as measured by a distance function.
  – These distance functions can be the Euclidean, Manhattan, Minkowski,
    and Hamming distances.
    • The first three are used for continuous variables and
    • the fourth (Hamming) for categorical variables.
  – If K = 1, the case is simply assigned to the class of its nearest
    neighbour.
  – At times, choosing K turns out to be a challenge when performing KNN
    modelling.
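A minimal pure-NumPy sketch of K-nearest-neighbour classification by majority vote using Euclidean distance; the toy data are invented:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by a majority vote of its k nearest training neighbours."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance to every case
    nearest = np.argsort(dists)[:k]                   # indices of the k closest cases
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D training data: two classes (invented values).
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))  # -> 0
print(knn_predict(X_train, y_train, np.array([6, 5]), k=1))  # K = 1: nearest neighbour only
```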
Naïve Bayes
• Bayes' theorem provides a way to calculate the
  – posterior probability P(c|x) from P(c), P(x), and P(x|c).
• The equation is:
  P(c|x) = P(x|c) P(c) / P(x)
  – Here,
  – P(c|x) is the posterior probability of the class (target)
    given the predictor (attribute).
  – P(c) is the prior probability of the class.
  – P(x|c) is the likelihood, i.e., the probability of the predictor given the class.
  – P(x) is the prior probability of the predictor.
Naïve Bayes Example
• Let’s understand it using an example.
  – We have a training data set of weather conditions and the corresponding target variable
    ‘Play’.
  – We need to classify whether players will play or not based on the
    weather condition.
  – Let’s follow the steps below.
• Step 1: Convert the data set to a frequency table.
• Step 2: Create a likelihood table by computing probabilities such as
  – the probability of Overcast being 0.29 and
  – the probability of playing being 0.64.
Naïve Bayes
– Step 3: Now, use the Naive Bayes equation to calculate the posterior
  probability of each class.
• The class with the highest posterior probability is the outcome of the prediction.
• Problem:
  – Players will play if the weather is sunny. Is this statement correct?
  – We can solve it using the method discussed above:
    • P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
    • Here we have P(Sunny | Yes) = 3/9 = 0.33,
      P(Sunny) = 5/14 = 0.36,
      P(Yes) = 9/14 = 0.64
    • So P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60,
    • which is the higher posterior probability.
• Naive Bayes uses a similar method to
  – predict the probability of each class based on various attributes.
  – The algorithm is mostly used in text classification and
  – in problems with more than two classes.
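A small sketch that re-does the arithmetic above in code; the counts (3/9, 5/14, 9/14) are the ones given on the slide for the weather/‘Play’ data:

```python
# Posterior P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny), using the slide's counts.
p_sunny_given_yes = 3 / 9    # 3 of the 9 "Yes" days were sunny
p_yes = 9 / 14               # 9 of the 14 days were "Yes"
p_sunny = 5 / 14             # 5 of the 14 days were sunny

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))   # 0.6 -> "play" is the more probable class
```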
EM algorithm
• Problems in ML estimation
  – The observation X is often incomplete
  – A latent (hidden) variable Z exists
  – It is hard to explore the whole parameter space
• Expectation-Maximization algorithm
  – Objective: find the maximum-likelihood estimate of θ over the latent distribution P(Z | X, θ)
  – Steps
    0. Init – choose a random θ_old
    1. E-step – compute the expectation over P(Z | X, θ_old)
    2. M-step – find the θ_new which maximizes the expected likelihood
    3. Go to step 1 after updating θ_old → θ_new
Problem
Estimate the hidden parameters θ = {μ, Σ}
from data drawn from
k Gaussian distributions
[Formulas: Gaussian distribution and maximum likelihood (Mitchell, 1997)]
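A hedged sketch of EM for a 1-D mixture of k Gaussians, estimating the means and variances; the synthetic data, k = 2, and the fixed equal mixing weights are my assumptions, not from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data drawn from two Gaussians (assumed example data).
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 200)])

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm(x, k=2, n_iter=50):
    # 0. Init: random means from the data, unit variances, equal (fixed) mixing weights.
    mu = rng.choice(x, size=k, replace=False)
    var = np.ones(k)
    for _ in range(n_iter):
        # 1. E-step: responsibilities P(Z = j | x, theta_old).
        resp = np.stack([gaussian_pdf(x, mu[j], var[j]) for j in range(k)], axis=1)
        resp /= resp.sum(axis=1, keepdims=True)
        # 2. M-step: theta_new that maximizes the expected log-likelihood.
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        # 3. Repeat with theta_old <- theta_new.
    return mu, var

print(em_gmm(x))  # means/variances should approach roughly (-2, 1) and (3, 0.25)
```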