
Statistical Machine Learning:
The Basic Approach and Current Research Challenges


Shai Ben-David
CS497
February, 2007

A High Level Agenda

"The purpose of science is to find meaningful simplicity
in the midst of disorderly complexity."
- Herbert Simon

Representative learning tasks

Medical research.
Detection of fraudulent activity
(credit card transactions, intrusion detection, stock market manipulation).
Analysis of genome functionality.
Email spam detection.
Spatial prediction of landslide hazards.

Common to all such tasks

We wish to develop algorithms that detect meaningful regularities in large complex data sets.
We focus on data that is too complex for humans to figure out its meaningful regularities.
We consider the task of finding such regularities from random samples of the data population.
We should derive conclusions in a timely manner; computational efficiency is essential.

Different types of learning tasks

Classification prediction:
we wish to classify data points into categories, and we are given already classified samples as our training input.

For example:
Training a spam filter.
Medical diagnosis (patient info → high/low risk).
Stock market prediction (predict tomorrow's market trend from companies' performance data).

Other Learning Tasks

Clustering:
the grouping of data into representative collections
- a fundamental tool for data analysis.

Examples:

Clustering customers for targeted marketing.
Clustering pixels to detect objects in images.
Clustering web pages by content similarity.

Differences from Classical Statistics

We are interested in hypothesis generation rather than hypothesis testing.
We wish to make no prior assumptions about the structure of our data.
We develop algorithms for the automated generation of hypotheses.
We are concerned with computational efficiency.

Learning Theory: The Fundamental Dilemma

Tradeoff between accuracy and simplicity.

[Figure: data points and a fitted curve y = f(x) in the X-Y plane; good models should enable prediction of new data.]

A Fundamental Dilemma of Science:
Model Complexity vs. Prediction Accuracy

[Figure: the tradeoff between accuracy and the complexity of the possible models/representations, given limited data.]

Problem Outline

We are interested in (automated) hypothesis generation, rather than traditional hypothesis testing.

First obstacle: the danger of overfitting.

First solution: consider only a limited set of candidate hypotheses.

The Empirical Risk Minimization Paradigm

Choose a hypothesis class H of subsets of X.
For an input sample S, find some h in H that fits S well.
For a new point x, predict a label according to its membership in h.
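As a toy illustration of the ERM paradigm above (not from the slides), here is a minimal Python sketch: H is taken to be a finite class of threshold classifiers on the real line, and the sample, the candidate thresholds, and the noise level are made-up example values.

```python
# ERM sketch: H = { h_t : h_t(x) = 1 iff x >= t } for a finite grid of thresholds t.
# We pick the h in H with the fewest mistakes on the training sample S.
import numpy as np

def erm_threshold(X, y, thresholds):
    """Return the threshold in `thresholds` with minimal training error."""
    best_t, best_err = None, float("inf")
    for t in thresholds:
        predictions = (X >= t).astype(int)
        err = np.mean(predictions != y)
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

# Toy sample S: points above ~0.6 tend to be labeled 1, with a little label noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=50)
y = (X > 0.6).astype(int)
y[rng.choice(50, size=5, replace=False)] ^= 1

t_hat, train_err = erm_threshold(X, y, thresholds=np.linspace(0, 1, 101))
print(f"chosen threshold {t_hat:.2f}, training error {train_err:.2f}")

# Prediction for a new point x: label it by membership in h = {x : x >= t_hat}.
x_new = 0.7
print("predicted label:", int(x_new >= t_hat))
```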

The Mathematical Justification

Assume both the training sample S and the test point (x, l) are generated i.i.d. by the same distribution over X × {0,1}. Then, if H is not too rich (in some formal sense), for every h in H, the training error of h on the sample S is a good estimate of its probability of success on the new x.
In other words: there is no overfitting.

The Mathematical Justification - Formally

If S is sampled i.i.d. by some probability distribution D over X × {0,1}, then with probability > 1-δ, for all h in H:

$$\underbrace{\Pr_{(x,y)\sim D}\big(h(x)\neq y\big)}_{\text{expected test error}}\;\le\;\underbrace{\frac{\bigl|\{(x,y)\in S:\,h(x)\neq y\}\bigr|}{|S|}}_{\text{training error}}\;+\;\underbrace{c\,\sqrt{\frac{\mathrm{VCdim}(H)+\ln(1/\delta)}{|S|}}}_{\text{complexity term}}$$
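For illustration only, the following sketch plugs hypothetical numbers into the bound above to show how the complexity term shrinks as the sample grows; the constant c, the VC-dimension, the training error, and δ are all assumed example values, not quantities fixed by the slides.

```python
# Evaluate training error + complexity term for a few sample sizes m = |S|.
import numpy as np

def vc_bound(train_err, vc_dim, m, delta, c=1.0):
    """Upper bound on the expected test error (illustrative constants)."""
    return train_err + c * np.sqrt((vc_dim + np.log(1.0 / delta)) / m)

for m in (100, 1_000, 10_000, 100_000):
    print(m, round(vc_bound(train_err=0.05, vc_dim=10, m=m, delta=0.05), 3))
```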

The Types of Errors to be Considered

[Figure: the class H, containing the training error minimizer and the best h (in H) for P, with the best regressor for P outside H; Total error = Approximation error + Estimation error.]

The Model Selection Problem

Expanding H will lower the approximation error,
BUT it will increase the estimation error
(lower statistical soundness).

Yet another problem - Computational Complexity

Once we have a large enough training sample, how much computation is required to search for a good hypothesis?
(That is, an empirically good one.)

The Computational Problem

Given a class H of subsets of R^n:

Input: a finite set S of {0,1}-labeled points in R^n.
Output: some hypothesis function h in H that maximizes the number of correctly labeled points of S.

Hardness-of-Approximation Results

For each of the following classes, approximating the best agreement rate for h in H (on a given input sample S) up to some constant ratio is NP-hard:

Monomials
Constant-width Monotone Monomials
Half-spaces
Balls [BD-Eiron-Long]
Axis-aligned Rectangles
Threshold NNs [Bartlett-BD]

The Types of Errors to be Considered

[Figure: the class H, containing arg min{Er_S(h) : h in H}, the output of the learning algorithm, and arg min{Er(h) : h in H}, with the best regressor for D outside H; Total error = Approximation error + Estimation error + Computational error.]

Our hypothesis set should balance several requirements:

Expressiveness - being able to capture the structure of our learning task.

Statistical compactness - having low combinatorial complexity.

Computational manageability - existence of efficient ERM algorithms.

Concrete learning paradigm - linear separators

The predictor h:
$h(x) = \mathrm{sign}\left(\sum_i w_i x_i + b\right)$
(where w is the weight vector of the hyperplane h, and x = (x_1, ..., x_n) is the example to classify)
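A minimal sketch of this predictor in Python; the weight vector, bias, and input point below are arbitrary example values, not anything prescribed by the slides.

```python
# Direct rendering of h(x) = sign(<w, x> + b).
import numpy as np

def predict(w, b, x):
    return int(np.sign(np.dot(w, x) + b))

w = np.array([2.0, -1.0])   # weight vector of the hyperplane h (example values)
b = 0.5                     # bias term (example value)
x = np.array([1.0, 3.0])    # example to classify
print(predict(w, b, x))     # -1: x falls on the negative side of h
```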

Potential problem: the data may not be linearly separable.

The SVM Paradigm

Choose an embedding of the domain X into some high-dimensional Euclidean space, so that the data sample becomes (almost) linearly separable.

Find a large-margin data-separating hyperplane in this image space, and use it for prediction.

Important gain: when the data is separable, finding such a hyperplane is computationally feasible.

The SVM Idea: an Example

$x \mapsto (x, x^2)$
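A small sketch of this embedding, under the assumed toy labeling rule that a one-dimensional point gets label 1 exactly when |x| > 1: no single threshold on the line separates such a sample, but after x ↦ (x, x²) one hyperplane does.

```python
# After the embedding x -> (x, x^2), the hyperplane x2 - 1 = 0 separates the sample.
import numpy as np

X = np.array([-2.0, -1.5, -0.5, 0.0, 0.5, 1.5, 2.0])
y = (np.abs(X) > 1).astype(int)              # toy labels: 1 iff |x| > 1

embedded = np.column_stack([X, X ** 2])      # x -> (x, x^2)
w, b = np.array([0.0, 1.0]), -1.0            # hyperplane x^2 - 1 = 0 in the image space
predictions = (embedded @ w + b > 0).astype(int)
print(np.array_equal(predictions, y))        # True: linearly separable after embedding
```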

Controlling Computational Complexity

Potentially the embeddings may require a very high Euclidean dimension. How can we search for hyperplanes efficiently?

The Kernel Trick: use algorithms that depend only on the inner products of sample points.

Kernel-Based Algorithms

Rather than define the embedding explicitly, define just the matrix of the inner products in the range space:

$$K = \begin{pmatrix} K(x_1, x_1) & K(x_1, x_2) & \cdots & K(x_1, x_m) \\ \vdots & \ddots & & \vdots \\ K(x_m, x_1) & \cdots & & K(x_m, x_m) \end{pmatrix}, \qquad K_{ij} = K(x_i, x_j)$$

Mercer's Theorem: if the matrix is symmetric and positive semi-definite, then it is the inner product matrix with respect to some embedding.
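A sketch of building such a kernel matrix without ever computing the embedding, using a quadratic polynomial kernel as an assumed example, and checking Mercer's two conditions numerically; the data points are arbitrary.

```python
# Build K[i, j] = K(x_i, x_j) from a kernel function alone, then verify that
# K is symmetric and positive semi-definite (Mercer's conditions).
import numpy as np

def quadratic_kernel(a, b):
    return (np.dot(a, b) + 1) ** 2            # implicit embedding into a higher-dim space

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [1.5, -0.5]])
m = len(X)
K = np.array([[quadratic_kernel(X[i], X[j]) for j in range(m)] for i in range(m)])

print(np.allclose(K, K.T))                     # symmetric
print(np.all(np.linalg.eigvalsh(K) >= -1e-9))  # eigenvalues >= 0 (up to rounding)
```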

Support Vector Machines (SVMs)

Input: a sample (x1, y1), ..., (xm, ym) and a kernel matrix K.
Output: a good separating hyperplane.

A Potential Problem: Generalization

VC-dimension bounds: the VC-dimension of the class of half-spaces in R^n is n+1.
Can we guarantee a low dimension of the embedding's range?

Margin bounds: regardless of the Euclidean dimension, generalization can be bounded as a function of the margins of the hypothesis hyperplane.
Can one guarantee the existence of a large-margin separation?

The Margins of a Sample

$$\max_{\text{separating } h}\ \min_{x_i \in S}\ \bigl|\langle w_h, x_i \rangle\bigr|$$

(where w_h is the weight vector of the hyperplane h)
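A sketch of the inner minimum above for a single candidate hyperplane: with the weight vector normalized to unit length, the margin attained on the sample is the distance of the closest point to the hyperplane. The sample points and the candidate direction are made-up example values.

```python
# Margin attained by one candidate hyperplane (through the origin) on a sample.
import numpy as np

def margin(w, X):
    w = w / np.linalg.norm(w)              # normalize so |<w, x>| is a distance
    return np.min(np.abs(X @ w))

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.5, -2.0], [-2.0, 1.0]])
w = np.array([1.0, 1.0])                   # candidate separating direction
print(round(margin(w, X), 3))
```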

Summary of SVM learning

1. The user chooses a kernel matrix - a measure of similarity between input points.
2. Upon viewing the training data, the algorithm finds a linear separator that maximizes the margins (in the high-dimensional feature space).
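For concreteness, a minimal usage sketch of these two steps with scikit-learn's SVC (assuming that library is available); the data set and the choice of a Gaussian/RBF kernel are arbitrary examples, not part of the slides.

```python
# Step 1: choose the kernel; Step 2: let the solver find a large-margin separator.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # not linearly separable in R^2

clf = SVC(kernel="rbf", C=1.0)   # step 1: kernel choice (example: Gaussian/RBF)
clf.fit(X, y)                    # step 2: maximum-margin separator in feature space
print(clf.score(X, y))           # training accuracy
```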

How are the basic requirements met?

Expressiveness - by allowing all types of kernels there is (potentially) high expressive power.

Statistical compactness - only if we are lucky, and the algorithm found a large-margin good separator.

Computational manageability - it turns out that the search for a large-margin classifier can be done in time polynomial in the input size.
