
Machine Learning


Learning Associations

• Basket analysis:
P(Y | X): the probability that somebody who buys X also
buys Y, where X and Y are products/services.

Example: P(chips | beer) = 0.7

Market-Basket transactions


TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
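As a quick check on the table above, P(Beer | Diaper) can be estimated directly from counts: Diaper appears in 4 of the 5 transactions, and Beer appears together with Diaper in 3 of them, so P(Beer | Diaper) = 3/4 = 0.75. A minimal Python sketch of this count-based estimate (the transactions are copied from the table; the function name is illustrative):

```python
# Count-based estimate of P(Y | X) from the market-basket table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def conditional_prob(y, x, baskets):
    """P(y | x): fraction of baskets containing x that also contain y."""
    with_x = [b for b in baskets if x in b]
    with_both = [b for b in with_x if y in b]
    return len(with_both) / len(with_x)

print(conditional_prob("Beer", "Diaper", transactions))  # 3/4 = 0.75
```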
Training and testing

[Diagram: data acquisition draws an observed training set from the unobserved universal set; practical usage is evaluated on an unobserved testing set.]
Training and testing
• Training is the process of making the system able to learn.

• No free lunch rule:
– Training and testing sets are assumed to come from the same distribution
– We need to make some assumptions or introduce a bias
Performance
• There are several factors affecting the performance:
– Types of training provided
– The form and extent of any initial background
knowledge
– The type of feedback provided
– The learning algorithms used

• Two important factors:
– Modeling
– Optimization
Classification: Applications
• Face recognition: Pose, lighting, occlusion (glasses,
beard), make-up, hair style
• Character recognition: Different handwriting styles.
• Speech recognition: Temporal dependency.
– Use of a dictionary or the syntax of the language.
– Sensor fusion: Combine multiple modalities; e.g.,
visual (lip image) and acoustic for speech
• Medical diagnosis: From symptoms to illnesses
• Web Advertising: Predict if a user clicks on an ad on the
Internet.
Steps
[Diagram: Training — training images pass through image feature extraction and, together with the training labels, train a learned model. Testing — a test image passes through the same feature extraction and the learned model outputs a prediction.]
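A minimal sketch of this train/test flow in Python; the feature extractor here is a stand-in (simple flattening), the images are random toy data, and scikit-learn's LogisticRegression is used purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def extract_features(images):
    """Stand-in feature extractor: flatten each image into a vector."""
    return np.array([img.ravel() for img in images])

# Training: images + labels -> learned model
train_images = [np.random.rand(8, 8) for _ in range(20)]   # toy data
train_labels = np.random.randint(0, 2, size=20)
model = LogisticRegression().fit(extract_features(train_images), train_labels)

# Testing: image -> features -> learned model -> prediction
test_image = np.random.rand(8, 8)
prediction = model.predict(extract_features([test_image]))
print(prediction)
```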
Prediction: Regression

• Example: Price of a used car
  y = w x + w0
• x: car attributes
  y: price
• y = g(x | θ)
  g(·): model, θ: parameters
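A minimal sketch of fitting the linear model y = w x + w0 by least squares using numpy; the car-age/price numbers are made up for illustration:

```python
import numpy as np

# Toy data: x = car age in years, y = price in $1000s (illustrative numbers only)
x = np.array([1.0, 2.0, 4.0, 6.0, 8.0])
y = np.array([18.0, 15.5, 11.0, 8.0, 5.5])

# Least-squares fit of y = w*x + w0 via the normal equations
A = np.column_stack([x, np.ones_like(x)])
w, w0 = np.linalg.lstsq(A, y, rcond=None)[0]

print(f"g(x | theta): y = {w:.2f} * x + {w0:.2f}")
print("predicted price of a 5-year-old car:", w * 5 + w0)
```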
Regression Applications
• Navigating a car: Angle of the steering
wheel (CMU NavLab)
• Kinematics of a robot arm
– Given a target end-effector position (x, y), predict the joint angles:
α1 = g1(x, y)
α2 = g2(x, y)
Inductive Learning
• Given examples of a function (X, F(X))
• Predict function F(X) for new examples X
– Discrete F(X): Classification
– Continuous F(X): Regression
– F(X) = Probability(X): Probability estimation
Supervised Learning: Uses
Example: decision-tree tools that create rules
• Prediction of future cases: Use the rule to
predict the output for future inputs
• Knowledge extraction: The rule is easy to
understand
• Compression: The rule is simpler than the
data it explains
• Outlier detection: Exceptions that are not
covered by the rule, e.g., fraud
Algorithms
• The success of a machine learning system also depends
on the algorithms.

• The algorithms control the search to find and build the
knowledge structures.

• The learning algorithms should extract useful information
from training examples.
Algorithms
• Supervised learning
– Prediction
– Classification (discrete labels), Regression (real values)
• Unsupervised learning
– Clustering
– Probability distribution estimation
– Finding association (in features)
– Dimension reduction
• Semi-supervised learning
• Reinforcement learning
– Decision making (robot, chess machine)
Algorithms

[Diagram: supervised, unsupervised, and semi-supervised learning]
What are we seeking?
• Supervised: Low E-out, or maximize probabilistic terms
– E-in: error on the training set
– E-out: error on the testing set

• Unsupervised: Minimum quantization error, minimum
distance, MAP, MLE (maximum likelihood estimation)
Machine learning structure
• Supervised learning
Unsupervised Learning
• Learning “what normally happens”
• No output
• Clustering: Grouping similar instances
• Other applications: Summarization,
Association Analysis
• Example applications
– Customer segmentation in CRM
– Image compression: Color quantization
– Bioinformatics: Learning motifs
Machine learning structure
• Unsupervised learning
Clustering Analysis
• Definition
Grouping unlabeled data into clusters in order to infer
hidden structure or information

• Dissimilarity measurement
– Distance : Euclidean(L2), Manhattan(L1), …
– Angle : Inner product, …
– Non-metric : Rank, Intensity, …
• Types of Clustering
– Hierarchical
• Agglomerative or divisive
– Partitioning
• K-means, VQ, MDS, …

K-means clustering
• Find K partitions with the total
intra-cluster variance minimized
(figure: Matlab help page)

• Iterative method
– Initialization: randomized yi
– Assignment of x (yi fixed)
– Update of yi (x fixed)

• Problem?
– Trapped in local minima
(figure: MacKay, 2003)
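A minimal numpy sketch of the two alternating steps above (assign each x to its nearest center yi, then update each yi as the mean of its assigned points); the random initialization and toy data are illustrative:

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    y = X[rng.choice(len(X), K, replace=False)]           # init: random centers yi
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - y[None, :, :], axis=2)
        assign = d.argmin(axis=1)                          # assignment step (yi fixed)
        for i in range(K):                                 # update step (x fixed)
            if np.any(assign == i):
                y[i] = X[assign == i].mean(axis=0)
    return y, assign

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = kmeans(X, K=2)
print(centers)
```

Different random initializations can converge to different local minima, which is exactly the problem the deterministic annealing approach on the next slide tries to avoid.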
Deterministic Annealing (DA)
• Deterministically avoid local minima
– No stochastic process (random walk)
– Trace the global solution by changing the
level of randomness
• Statistical Mechanics (figure: Maxima and Minima, Wikipedia)

• Gibbs distribution
P(E_x) = exp(−E_x / T) / Z

• Helmholtz free energy F = D − TS
– Average energy D = ⟨E_x⟩
– Entropy S = −Σ P(E_x) ln P(E_x)
– F = −T ln Z

• In DA, we minimize F
• Analogy to the physical annealing process
– Control energy (randomness) via the temperature (high → low)
– Start with a high temperature (T = 1)
  ▪ Soft (or fuzzy) association probabilities
  ▪ Smooth cost function with one global minimum
– Lower the temperature (T → 0)
  ▪ Hard association
  ▪ The full complexity is revealed and clusters emerge

• Minimization of F, using E(x, yj) = ||x − yj||²

Iteratively, the association probabilities and the centers yj are updated at each temperature (see the sketch below).
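A minimal sketch of the per-temperature updates, assuming the standard Gibbs (soft) assignment p(yj | x) ∝ exp(−||x − yj||² / T) and a weighted-mean center update; this is an illustrative reading of the slide, not the authors' exact algorithm:

```python
import numpy as np

def da_step(X, Y, T):
    """One soft-assignment update at temperature T.

    X: (n, d) data points, Y: (K, d) cluster centers.
    """
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)   # ||x - yj||^2
    logits = -d2 / T
    logits -= logits.max(axis=1, keepdims=True)               # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)                         # soft associations p(yj | x)
    Y_new = (P.T @ X) / P.sum(axis=0)[:, None]                # weighted-mean center update
    return Y_new, P

# Anneal: start hot (soft assignments), cool toward hard assignments
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4])
Y = X[np.random.choice(len(X), 2, replace=False)].astype(float)
for T in [1.0, 0.5, 0.1, 0.01]:
    for _ in range(20):
        Y, P = da_step(X, Y, T)
print(Y)
```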
Dimension Reduction
• Definition
The process of transforming high-dimensional data into a low-
dimensional representation, to improve accuracy, aid understanding, or
remove noise.

• Curse of dimensionality
– Complexity grows exponentially
in volume as extra dimensions
are added
• Types (Koppen, 2000)
– Feature selection: choose representatives (e.g., filter, …)
– Feature extraction: map to a lower dimension (e.g., PCA, MDS, …)
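As an illustration of feature extraction, a minimal PCA sketch using only numpy (project the data onto the top-k eigenvectors of its covariance matrix); the data are random and purely illustrative:

```python
import numpy as np

def pca(X, k):
    """Project X (n samples x d features) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                      # center the data
    cov = np.cov(Xc, rowvar=False)               # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return Xc @ top                              # n x k low-dimensional representation

X = np.random.randn(100, 10)                     # 10-dimensional toy data
X_low = pca(X, k=2)
print(X_low.shape)                               # (100, 2)
```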
Machine Learning in a
Nutshell
• Tens of thousands of machine learning
algorithms
• Hundreds of new ones every year
• Every machine learning algorithm has
three components:
– Representation
– Evaluation
– Optimization
Generative vs. Discriminative
Classifiers
Generative Models
• Represent both the data and the labels
• Often make use of conditional independence and priors
• Examples
– Naïve Bayes classifier
– Bayesian network
• Models of the data may apply to future prediction problems

Discriminative Models
• Learn to directly predict the labels from the data
• Often assume a simple boundary (e.g., linear)
• Examples
– Logistic regression
– SVM
– Boosted decision trees
• Often easier to predict a label from the data than to model the data

Slide credit: D. Hoiem


Classifiers: Logistic Regression
Maximize the likelihood of the
label given the data, assuming
a log-linear model

[Figure: two classes (male, female) plotted against x1 (pitch of voice) and x2, separated by a linear boundary]

log [ P(x1, x2 | y = 1) / P(x1, x2 | y = −1) ] = wᵀx

P(y = 1 | x1, x2) = 1 / (1 + exp(−wᵀx))
Classifiers: Nearest neighbor

[Figure: a test example surrounded by training examples from class 1 and training examples from class 2]

f(x) = label of the training example nearest to x

• All we need is a distance function for our inputs
• No training required!

Slide credit: L. Lazebnik


Nearest Neighbor Classifier
• Assign the label of the nearest training data point to each test data point

[Figure: partitioning of the feature space for two-category 2D and 3D data]

Source: D. Lowe
K-nearest neighbor
• It can be used for both classification and regression problems.
• However, it is more widely used in classification problems in the
industry.
• K-nearest neighbours is a simple algorithm (see the sketch below) that
– stores all available cases, and
– classifies a new case by a majority vote of its K nearest neighbours:
the case is assigned to the class most common amongst its K
nearest neighbours, as measured by a distance function.
– The distance function can be Euclidean, Manhattan, Minkowski,
or Hamming distance.
• The first three are used for continuous variables and
• the fourth (Hamming) for categorical variables.
– If K = 1, the case is simply assigned to the class of its nearest
neighbour.
– Choosing K can be a challenge when performing KNN modelling.
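A minimal KNN classifier sketch (Euclidean distance, majority vote); the training points are toy data for illustration:

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    """Classify x by majority vote among its k nearest training cases."""
    dists = np.linalg.norm(X_train - x, axis=1)          # Euclidean distances to all cases
    nearest = np.argsort(dists)[:k]                      # indices of the k nearest neighbours
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(np.array([1.1, 1.0]), X_train, y_train, k=3))   # -> 0
```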
Naïve Bayes
• Bayes' theorem provides a way of calculating
– the posterior probability P(c|x) from P(c), P(x) and P(x|c).
• Look at the equation below:

P(c|x) = P(x|c) P(c) / P(x)

– Here,
– P(c|x) is the posterior probability of the class (target)
given the predictor (attribute).
– P(c) is the prior probability of the class.
– P(x|c) is the likelihood, i.e., the probability of the predictor given the class.
– P(x) is the prior probability of the predictor.
Naïve Bayes Example
• Let’s understand it using an example.
– We have a training data set of weather and the corresponding target variable
‘Play’.
– Now we need to classify whether players will play or not based on the
weather condition.
– Let’s follow the steps below.
• Step 1: Convert the data set to a frequency table
• Step 2: Create a likelihood table by finding the probabilities, e.g.,
– Overcast probability = 0.29 and
– probability of playing = 0.64.
Naïve Bayes
– Step 3: Now use the naive Bayes equation to calculate the posterior
probability for each class.
• The class with the highest posterior probability is the outcome of the prediction.

• Problem:
– “Players will play if the weather is sunny.” Is this statement correct?
– We can solve it using the method discussed above (see the sketch below):
• P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
• Here we have P(Sunny | Yes) = 3/9 = 0.33,
P(Sunny) = 5/14 = 0.36,
P(Yes) = 9/14 = 0.64
• So P(Yes | Sunny) = 0.33 * 0.64 / 0.36 ≈ 0.60,
• which is the higher probability.
• Naive Bayes uses a similar method to
– predict the probability of different classes based on various attributes.
– The algorithm is mostly used in text classification and
– in problems with multiple classes.
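A minimal sketch that reproduces the arithmetic above from the counts implied by the slide (14 days, 9 "Yes", 5 sunny days, 3 of which are sunny and "Yes"); with exact fractions the result is 3/5 = 0.6:

```python
# Counts implied by the slide's probabilities.
n_days, n_yes, n_sunny, n_sunny_yes = 14, 9, 5, 3

p_sunny_given_yes = n_sunny_yes / n_yes        # 3/9  ~= 0.33
p_yes = n_yes / n_days                         # 9/14 ~= 0.64
p_sunny = n_sunny / n_days                     # 5/14 ~= 0.36

# Bayes' rule: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))             # 0.6 -> predict "play"
```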
EM algorithm
• Problems in ML estimation
– The observation X is often not complete
– A latent (hidden) variable Z exists
– It is hard to explore the whole parameter space

• Expectation-Maximization algorithm
– Objective: to find the ML estimate of θ, over the latent distribution P(Z | X, θ)
– Steps
0. Init – choose a random θ_old
1. E-step – compute the expectation over P(Z | X, θ_old)
2. M-step – find θ_new which maximizes the (expected) likelihood
3. Go to step 1 after updating θ_old ← θ_new
EM example: Mixture of Gaussians
• Problem
Estimate the hidden parameters θ from data
drawn from k Gaussian distributions
• Gaussian distribution (Mitchell, 1997)

• Maximum Likelihood
θ_ML = argmax_θ P(X | θ)

With a Gaussian (P = N), solve either by brute force or by a numerical method
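A minimal EM sketch for a 1-D mixture of k Gaussians with a fixed, shared variance (only the means and mixing weights are estimated), as an illustration of the E-step/M-step loop above; the data and initialization are illustrative:

```python
import numpy as np

def em_gmm_1d(x, k, sigma=1.0, iters=50, seed=0):
    """EM for a 1-D Gaussian mixture with fixed, shared variance sigma^2."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, k, replace=False).astype(float)    # init: random means (theta_old)
    pi = np.full(k, 1.0 / k)                              # mixing weights
    for _ in range(iters):
        # E-step: responsibilities P(Z = j | x, theta_old)
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma ** 2))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: theta_new maximizing the expected likelihood
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
        pi = resp.mean(axis=0)
    return mu, pi

x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(5, 1, 200)])
print(em_gmm_1d(x, k=2))
```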

You might also like