Algorithms Notes

The document introduces logistic regression, a classification algorithm used to predict a discrete value such as spam/not spam or malignant/benign. It uses the logistic (sigmoid) function to generate a probability between 0 and 1 that an example belongs to the positive class. The goal is to fit the parameters θ to minimize a cost function over the training data. Because the squared-error cost becomes non-convex when combined with the sigmoid hypothesis, logistic regression uses a log-based cost function that is convex, allowing gradient descent to find the global minimum. Non-linear decision boundaries can be achieved by adding higher-order terms to the hypothesis function.


LOGISTIC REGRESSION

 Where y is a discrete value


o Develop the logistic regression algorithm to determine what class
a new input should fall into
 Classification problems
o Email -> spam/not spam?
o Online transactions -> fraudulent?
o Tumor -> Malignant/benign
 Variable in these problems is Y
o Y is either 0 or 1
 0 = negative class (absence of something)
 1 = positive class (presence of something)
 Start with binary class problems
o Later look at multiclass classification problem, although this is
just an extension of binary classification
 How do we develop a classification algorithm?

o Tumour size vs malignancy (0 or 1)
o We could use linear regression
o
 Then threshold the classifier output (i.e. anything over some
value is yes, else no)
 In our example below linear regression with thresholding
seems to work

 We can see above this does a reasonable job of stratifying the data points into one of two classes
o But what if we had an additional Yes example with a very large tumour, far to the right
o The refitted line (and hence its 0.5 threshold) would shift, leading to classifying some of the existing yeses as nos
 Another issue with linear regression
o We know Y is 0 or 1
o Hypothesis can give values larger than 1 or less than 0
 So, logistic regression generates a hypothesis value which is always between 0 and 1

o Logistic regression is a classification algorithm - don't be
confused

Hypothesis representation
 What function is used to represent our hypothesis in classification
 We want our classifier to output values between 0 and 1

o When using linear regression we did hθ(x) = (θT x)
o For classification hypothesis representation we do hθ(x) =
g((θT x))
 Where we define g(z)
 z is a real number
 g(z) = 1/(1 + e^-z)
 This is the sigmoid function, or the logistic function
 If we combine these equations we can write out the
hypothesis as

 What does the sigmoid function look like


 Crosses 0.5 at z = 0, then flattens out
o Asymptotes at 0 and 1
 Given this we need to fit θ to our data
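A minimal Octave sketch of this hypothesis (the parameter and feature values below are made up purely for illustration):

sigmoid = @(z) 1 ./ (1 + exp(-z));   % logistic function g(z), works elementwise
theta = [-3; 1; 1];                  % hypothetical parameter vector (for x0, x1, x2)
x     = [1; 2; 2];                   % one example: x0 = 1, x1 = 2, x2 = 2
h     = sigmoid(theta' * x)          % h_theta(x) = g(theta' * x), the estimated P(y = 1 | x; theta)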

Interpreting hypothesis output

 When our hypothesis (hθ(x)) outputs a number, we treat that value as the
estimated probability that y=1 on input x
o Example
 If X is a feature vector with x0 = 1 (as always)
and x1 = tumourSize
 hθ(x) = 0.7
 Tells a patient they have a 70% chance of a tumor
being malignant
o We can write this using the following notation
 hθ(x) = P(y=1|x ; θ)
o What does this mean?
 Probability that y=1, given x, parameterized by θ
 Since this is a binary classification task we know y = 0 or 1
o So the following must be true
 P(y=1|x ; θ) + P(y=0|x ; θ) = 1
 P(y=0|x ; θ) = 1 - P(y=1|x ; θ)

Decision boundary
 Gives a better sense of what the hypothesis function is computing
 Better understanding of what the hypothesis function looks like
o One way of using the sigmoid function is;
 When the probability of y being 1 is greater than 0.5 then we
can predict y = 1
 Else we predict y = 0
o When is it exactly that hθ(x) is greater than 0.5?
 Look at sigmoid function
 g(z) is greater than or equal to 0.5 when z is greater
than or equal to 0

 So if z is positive, g(z) is greater than 0.5


 z = (θT x)
 So when
 θT x >= 0
 Then hθ(x) >= 0.5
 So what we've shown is that the hypothesis predicts y = 1 when θT x >=
0
o The corollary is that when θT x < 0 the hypothesis predicts y = 0
o Let's use this to better understand how the hypothesis makes its
predictions

Decision boundary

 hθ(x) = g(θ0 + θ1x1 + θ2x2)

 So, for example


o θ0 = -3
o θ1 = 1
o θ2 = 1
 So our parameter vector is a column vector with the above values
o So, θT is a row vector = [-3,1,1]
 What does this mean?

o The z here becomes θT x
o We predict "y = 1" if
 -3x0 + 1x1 + 1x2 >= 0
 -3 + x1 + x2 >= 0
 We can also re-write this as
o If (x1 + x2 >= 3) then we predict y = 1
o If we plot
 x1 + x2 = 3 we graphically plot our decision boundary

 Means we have these two regions on the graph



o Blue = false
o Magenta = true
o Line = decision boundary
o
 Concretely, the straight line is the set of points where hθ(x)
= 0.5 exactly
o The decision boundary is a property of the hypothesis
 Means we can create the boundary with the hypothesis and
parameters without any data
 Later, we use the data to determine the parameter
values
 e.g. for a hypothesis hθ(x) = g(5 - x1), we predict y = 1 if
 5 - x1 > 0
 i.e. x1 < 5
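A small Octave sketch of making predictions with the θ = [-3, 1, 1] boundary above (a sketch only):

sigmoid = @(z) 1 ./ (1 + exp(-z));
theta   = [-3; 1; 1];                                        % theta from the example above
predict = @(x1, x2) sigmoid(theta' * [1; x1; x2]) >= 0.5;    % predict y = 1 exactly when theta' * x >= 0

predict(1, 1)   % x1 + x2 = 2 < 3, so predicts 0
predict(2, 2)   % x1 + x2 = 4 >= 3, so predicts 1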

Non-linear decision boundaries

 Get logistic regression to fit a complex non-linear data set



o Like polynomial regression, add higher order terms
o So say we have
 hθ(x) = g(θ0 + θ1x1 + θ2x2 + θ3x1^2 + θ4x2^2)
 We take the transpose of the θ vector times the input vector

 Say θT was [-1,0,0,1,1] then we say;
 Predict that "y = 1" if
 -1 + x1^2 + x2^2 >= 0
 or
 x1^2 + x2^2 >= 1
 If we plot x1^2 + x2^2 = 1
 This gives us a circle with radius 1 centred on the origin

 Means we can build more complex decision boundaries by fitting complex parameters to this (relatively) simple hypothesis
 More complex decision boundaries?

o By using higher order polynomial terms, we can get even more
complex decision boundaries
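A sketch of the circular boundary above, assuming the feature vector is [1, x1, x2, x1^2, x2^2]:

sigmoid = @(z) 1 ./ (1 + exp(-z));
theta   = [-1; 0; 0; 1; 1];                                   % theta' = [-1, 0, 0, 1, 1]
feats   = @(x1, x2) [1; x1; x2; x1^2; x2^2];                  % add the higher-order terms
predict = @(x1, x2) sigmoid(theta' * feats(x1, x2)) >= 0.5;

predict(0.5, 0.5)   % inside the unit circle  -> 0
predict(1.5, 0)     % outside the unit circle -> 1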

Cost function for logistic regression


 Fit θ parameters
 Define the optimization objective for the cost function we use to fit the parameters

o Training set of m training examples
 Each example is an n+1 dimensional column vector

 This is the situation


o Set of m training examples
o Each example is a feature vector which is n+1 dimensional
o x0 = 1
o y ∈ {0,1}
o Hypothesis is based on parameters (θ)
 Given the training set how do we choose/fit θ?
 Linear regression uses the following function to determine θ

 Instead of writing the squared error term, we can write



o If we define "cost()" as;
 cost(hθ(x(i)), y(i)) = 1/2 (hθ(x(i)) - y(i))^2
 Which evaluates to the cost for an individual example using the same measure as used in linear regression
o We can redefine J(θ) as

 Which, appropriately, is the sum of all the individual costs


over the training data (i.e. the same as linear regression)
 To further simplify it we can get rid of the superscripts
o So

 What does this actually mean?


o This is the cost you want the learning algorithm to pay if the
outcome is hθ(x) and the actual outcome is y
o If we use this function for logistic regression this is a non-convex
function for parameter optimization
 Could work....
 What do we mean by non convex?

o We have some function - J(θ) - for determining the parameters
o Our hypothesis function has a non-linearity (sigmoid function
of hθ(x) )
 This is a complicated non-linear function
o If you take hθ(x) and plug it into the Cost() function, and then plug the Cost() function into J(θ) and plot J(θ), we find many local optima -> non-convex function
o Why is this a problem
 Lots of local minima mean gradient descent may not find the global optimum - it may get stuck in a local minimum
o We would like a convex function so if you run gradient descent
you converge to a global minimum

A convex logistic regression cost function

 To get around this we need a different, convex Cost() function which


means we can apply gradient descent

 This is our logistic regression cost function
 cost(hθ(x), y) = -log( hθ(x) ) if y = 1
 cost(hθ(x), y) = -log( 1 - hθ(x) ) if y = 0
o This is the penalty the algorithm pays
o Plot the function for y = 1
 So the cost evaluates as -log(hθ(x))

 So when we're right, cost function is 0


o Else it slowly increases cost function as we become "more" wrong
o X axis is what we predict
o Y axis is the cost associated with that prediction
 This cost function has some interesting properties
o If y = 1 and hθ(x) = 1
 If hypothesis predicts exactly 1 and thats exactly correct
then that corresponds to 0 (exactly, not nearly 0)
o As hθ(x) goes to 0
 Cost goes to infinity
 This captures the intuition that if hθ(x) = 0
(predict P (y=1|x; θ) = 0) but y = 1 this will penalize the
learning algorithm with a massive cost
 What about if y = 0
 then cost is evaluated as -log(1- hθ( x ))

o Just get inverse of the other function

 Now it goes to plus infinity as hθ(x) goes to 1


 With this particular cost function J(θ) is going to be convex and avoid local minima

Simplified cost function and gradient descent
 Define a simpler way to write the cost function and apply gradient
descent to the logistic regression
o By the end should be able to implement a fully functional logistic
regression function
 Logistic regression cost function is as follows
 This is the cost for a single example

o For binary classification problems y is always 0 or 1
o
 Because of this, we can have a simpler way to write the cost
function

 Rather than writing cost function on two lines/two
cases
 Can compress them into one equation - more efficient
o Can write the cost function as
 cost(hθ(x), y) = -y log( hθ(x) ) - (1-y) log( 1- hθ(x) )
 This equation is a more compact form of the two cases above
o We know that there are only two possible cases
o
 y=1
 Then our equation simplifies to
 -log(hθ(x)) - (0)log(1 - hθ(x))
 -log(hθ(x))
 Which is what we had before when y = 1
 y=0
 Then our equation simplifies to

 -(0)log(hθ(x)) - (1)log(1 - hθ(x))


 = -log(1- hθ(x))
 Which is what we had before when y = 0
Clever!
 So, in summary, our cost function for the θ parameters can be defined as
 J(θ) = -(1/m) Σ_{i=1..m} [ y(i) log( hθ(x(i)) ) + (1 - y(i)) log( 1 - hθ(x(i)) ) ]
 Why do we chose this function when other cost functions exist?
o This cost function can be derived from statistics using the principle of maximum likelihood estimation
 It corresponds to modelling y given x as a Bernoulli random variable with probability hθ(x)
o Also has the nice property that it's convex
 To fit parameters θ:
o Find parameters θ which minimize J(θ)
o This means we have a set of parameters to use in our model
for future predictions
 Then, if we're given some new example with a set of features x, we can take the θ which we generated, and output our prediction using
 hθ(x) = 1 / (1 + e^(-θT x))

o This result is
o
 p(y=1 | x ; θ)

 Probability y = 1, given x, parameterized by θ

How to minimize the logistic regression cost function

 Now we need to figure out how to minimize J(θ)



o Use gradient descent as before
o Repeatedly update each parameter using a learning rate
 If you had n features, you would have an (n+1)-dimensional column vector for θ
 This equation is the same as the linear regression rule
o The only difference is that our definition for the hypothesis has
changed
 Previously, we spoke about how to monitor gradient descent to check it's
working
o Can do the same thing here for logistic regression
 When implementing logistic regression with gradient descent, we have
to update all the θ values (θ0 to θn) simultaneously
o Could use a for loop
o Better would be a vectorized implementation
 Feature scaling, which we used to speed up gradient descent for linear regression, also applies here
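A vectorized Octave sketch of this loop (assuming X is the [m x (n+1)] design matrix with x0 = 1, y is an [m x 1] vector of 0/1 labels, and alpha and numIterations are values you choose):

m = length(y);
J = zeros(numIterations, 1);                       % cost history, to monitor convergence
for iter = 1:numIterations
  h = 1 ./ (1 + exp(-(X * theta)));                % h_theta(x) for every example at once
  grad = (1 / m) * (X' * (h - y));                 % partial derivatives of J(theta)
  theta = theta - alpha * grad;                    % simultaneous update of every theta_j
  J(iter) = (1 / m) * (-y' * log(h) - (1 - y)' * log(1 - h));
end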

Advanced optimization
 Previously we looked at gradient descent for minimizing the
cost function
 Here look at advanced concepts for minimizing the cost function for
logistic regression
o Good for large machine learning problems (e.g. huge feature set)
 What is gradient descent actually doing?

o We have some cost function J(θ), and we want to minimize it
o We need to write code which can take θ as input and compute the
following
o
 J(θ)
 Partial derivative of J(θ) with respect to θj (where j = 0 to n)

 Given code that can do these two things


o Gradient descent repeatedly does the following update
 θj := θj - α * ∂J(θ)/∂θj
 So update each θj (all of them simultaneously, as before)
 So, we must;
o Supply code to compute J(θ) and the derivatives
o Then plug these values into gradient descent
 Alternatively, instead of gradient descent to minimize the cost function
we could use
o Conjugate gradient
o BFGS (Broyden-Fletcher-Goldfarb-Shanno)
o L-BFGS (Limited memory - BFGS)
 These are more optimized algorithms which take that same input and
minimize the cost function
 These are very complicated algorithms
 Some properties

o Advantages
 No need to manually pick alpha (learning rate)
 Have a clever inner loop (line search algorithm) which
tries a bunch of alpha values and picks a good one
 Often faster than gradient descent
 Do more than just pick a good learning rate
 Can be used successfully without understanding their
complexity
o Disadvantages
o
 Could make debugging more difficult
 Should not be implemented yourself - use a well-tested library
 Different libraries may use different implementations - may hit performance

Using advanced cost minimization algorithms

 How to use algorithms



o Say we have the following example
 Example above
o θ1 and θ2 (two parameters)
o Cost function here is J(θ) = (θ1 - 5)^2 + (θ2 - 5)^2
o The derivative of J(θ) with respect to each θi turns out to be 2(θi - 5)
 First we need to define our cost function, which should have the
following signature

function [jval, gradient] = costFunction(THETA)

 Input for the cost function is THETA, which is a vector of the θ parameters
 Two return values from costFunction are
o jval
 How we compute the cost function J(θ) itself
 In this case = (θ1 - 5)^2 + (θ2 - 5)^2
o gradient
 2 by 1 vector
 2 elements are the two partial derivative terms
 i.e. in general this is an n-dimensional vector
 Each indexed value gives the partial derivative of J(θ) with respect to θi
 Where i is the index position in the gradient vector
 With the cost function implemented, we can call the advanced algorithm
using

options = optimset('GradObj', 'on', 'MaxIter', '100');    % define the options data structure
initialTheta = zeros(2,1);                                % initialize the theta values
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);   % run the algorithm

 Here

o options is a data structure giving options for the algorithm
o fminunc
 function to minimize the cost function (find the minimum of an unconstrained multivariable function)
o @costFunction is a pointer to the costFunction function to be
used
 For the octave implementation
o initialTheta must be a vector with at least two elements (fminunc doesn't work on a 1-dimensional θ)
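Putting the pieces together, a sketch of costFunction for this toy example might look like the following (in Octave it would normally live in its own costFunction.m file):

% Sketch of costFunction for the toy objective J(theta) = (theta1 - 5)^2 + (theta2 - 5)^2
function [jval, gradient] = costFunction(theta)
  jval        = (theta(1) - 5)^2 + (theta(2) - 5)^2;   % the cost itself
  gradient    = zeros(2, 1);                           % one partial derivative per parameter
  gradient(1) = 2 * (theta(1) - 5);
  gradient(2) = 2 * (theta(2) - 5);
end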

 How do we apply this to logistic regression?



o Here we have a vector

 Here
o theta is an n+1 dimensional column vector
o Octave indexes from 1, not 0
 Write a cost function which captures the cost function for logistic regression (a sketch follows below)
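A sketch of such a cost function, in the [jval, gradient] form fminunc expects (X and y are assumed to be the design matrix and label vector; lrCostFunction is just a name chosen here):

% Logistic regression cost and gradient (sketch)
function [jval, gradient] = lrCostFunction(theta, X, y)
  m        = length(y);
  h        = 1 ./ (1 + exp(-(X * theta)));                     % sigmoid hypothesis for all examples
  jval     = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));   % J(theta)
  gradient = (1/m) * (X' * (h - y));                           % (n+1) x 1 vector of partials
end

% Usage sketch: wrap it so fminunc only sees theta
options      = optimset('GradObj', 'on', 'MaxIter', 400);
initialTheta = zeros(size(X, 2), 1);
[optTheta, Jmin] = fminunc(@(t) lrCostFunction(t, X, y), initialTheta, options);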

Multiclass classification problems


 Getting logistic regression for multiclass classification using one vs. all
 Multiclass - more than yes or no (1 or 0)

o Classification with multiple classes for assignment

 Given a dataset with three classes, how do we get a learning algorithm to


work?

o Use one vs. all classification to make binary classification work for multiclass classification
 One vs. all classification

o Split the training set into three separate binary classification
problems
 i.e. create a new fake training set for each class
 Triangle (1) vs crosses and squares (0): hθ^(1)(x) = P(y=1 | x; θ)
 Crosses (1) vs triangles and squares (0): hθ^(2)(x) = P(y=2 | x; θ)
 Squares (1) vs crosses and triangles (0): hθ^(3)(x) = P(y=3 | x; θ)

 Overall
o Train a logistic regression classifier hθ^(i)(x) for each class i to predict the probability that y = i
o On a new input x, to make a prediction pick the class i that maximizes hθ^(i)(x)
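A sketch of the prediction step (assuming Theta is a [K x (n+1)] matrix holding one trained parameter row per class, and x is an [(n+1) x 1] example with x0 = 1):

sigmoid = @(z) 1 ./ (1 + exp(-z));
probs = sigmoid(Theta * x);        % [K x 1]: h_theta^(i)(x) for every class i
[~, prediction] = max(probs);      % pick the class whose classifier is most confident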
Motivation 1: Data compression - PCA
 Start talking about a second type of unsupervised learning problem
- dimensionality reduction
o Why should we look at dimensionality reduction?

Compression

 Speeds up algorithms
 Reduces the space needed to store the data
 What is dimensionality reduction?
o So you've collected many features - maybe more than you need
 Can you "simplify" your data set in a rational and useful way?
o Example
 Redundant data set - different units for same attribute
 Reduce data to 1D (2D->1D)

 Example above isn't a perfect straight line because of


round-off error
o Data redundancy can happen when different teams are
working independently
 Often generates redundant data (especially if you don't control
data collection)
o Another example
 Helicopter flying - do a survey of pilots (x1 = skill, x2 = pilot
enjoyment)
 These features may be highly correlated
 This correlation can be combined into a single attribute
called aptitude (for example)

 What does dimensionality reduction mean?


o In our example we plot a line
o Take exact example and record position on that line

o So before, x(1) was a 2D feature vector (X and Y dimensions)
o Now we can represent x(1) as a 1D number (Z dimension)
 So we approximate the original examples
o Allows us to halve the amount of storage
o Gives lossy compression, but an acceptable loss (probably)
 The loss above comes from the rounding error in the
measurement, however
 Another example 3D -> 2D
o So here's our data

o Maybe all the data lies in one plane


 This is sort of hard to explain in 2D graphics, but that plane may be aligned with one of the axes
 Or it may not...
 Either way, the plane is a small, constant region of the 3D space
 In the diagram below, imagine all our data points are sitting "inside" the blue tray (which has a dark blue exterior face and a light blue inside)
 Because they're all in this relatively shallow area, we can basically ignore one of the dimensions, so we draw two new lines (z1 and z2) along the x and y planes of the box, and plot the locations in that box
 i.e. we lose the data in the z-dimension of our "shallow box" (NB "z-dimension" here refers to the dimension relative to the box (i.e. its depth) and NOT the z dimension of the axes we've got drawn above), but because the box is shallow it's OK to lose this. Probably....
o Plot values along those projections

o So we've now reduced our 3D vector to a 2D vector


 In reality we'd normally try and do 1000D -> 100D

Motivation 2: Visualization
 It's hard to visualize highly dimensional data
o Dimensionality reduction can improve how we display information in a
tractable manner for human consumption
o Why do we care?
 Often helps to develop algorithms if we can understand our data better
 Dimensionality reduction helps us do this - see the data in a helpful way
 Good for explaining something to someone if you can "show" it in the data
 Example;
o Collect a large data set of many facts about each country around the world
 So
 x1 = GDP
 ...
 x6 = mean household income
 Say we have 50 features per country
 How can we understand this data better?
 Very hard to plot 50 dimensional data
o Using dimensionality reduction, instead of each country being
represented by a 50-dimensional feature vector
 Come up with a different feature representation (z values) which
summarize these features

o This gives us a 2-dimensional vector


 Reduce 50D -> 2D
 Plot as a 2D plot
o Typically you don't generally ascribe meaning to the new features (so we
have to determine what these summary values mean)
 e.g. may find horizontal axis corresponds to overall country
size/economic activity
 and y axis may be the per-person well being/economic activity
o So despite having 50 features, there may be two "dimensions" of
information, with features associated with each of those dimensions
 It's up to you to assess which of the features can be grouped to form summary features, and how best to do that (feature scaling is probably important)
o Helps show the two main dimensions of variation in a way that's easy to
understand

Principal Component Analysis (PCA): Problem Formulation
 For the problem of dimensionality reduction the most commonly used
algorithm is PCA
o Here, we'll start talking about how we formulate precisely what we want
PCA to do
 So
o Say we have a 2D data set which we wish to reduce to 1D

o In other words, find a single line onto which to project this data
 How do we determine this line?
 The distance between each point and the projected version
should be small (blue lines below are short)
 PCA tries to find a lower dimensional surface so the sum of
squares onto that surface is minimized
 The blue lines are sometimes called the projection error
 PCA tries to find the surface (a straight line in this
case) which has the minimum projection error

 As an aside, you should normally do mean normalization and feature scaling on your data before PCA
 A more formal description is
o For 2D-1D, we must find a vector u(1), which is of some dimensionality
o Onto which you can project the data so as to minimize the projection
error

o u(1) can be positive or negative (-u(1)) which makes no difference


 Each of the vectors define the same red line
 In the more general case
o To reduce from nD to kD we
 Find k vectors (u(1), u(2), ... u(k)) onto which to project the data to
minimize the projection error
 So lots of vectors onto which we project the data
 Find a set of vectors which we project the data onto the linear
subspace spanned by that set of vectors
 We can define a point in a plane with k vectors
o e.g. 3D->2D
 Find pair of vectors which define a 2D plane (surface) onto which
you're going to project your data
 Much like the "shallow box" example in compression, we're
trying to create the shallowest box possible (by defining two of it's
three dimensions, so the box' depth is minimized)

 How does PCA relate to linear regression?


o PCA is not linear regression
 Despite cosmetic similarities, very different
o For linear regression, we fit a straight line to minimize the squared error between the line and each point
 NB - the VERTICAL distance between point and line
o For PCA minimizing the magnitude of the shortest orthogonal distance
 Gives very different effects
o More generally
 With linear regression we're trying to predict "y"
 With PCA there is no "y" - instead we have a list of features and
all features are treated equally
 If we have 3-dimensional data and reduce 3D->2D
 We have 3 features treated symmetrically

PCA Algorithm
 Before applying PCA must do data preprocessing
o Given a set of m unlabeled examples we must do
 Mean normalization
 Replace each xj(i) with xj(i) - μj
 In other words, determine the mean of each feature, and then for each feature subtract the mean from the value, so we re-scale the mean to be 0
 Feature scaling (depending on data)
 If features have very different scales then scale so they all have a comparable range of values
 e.g. xj(i) is set to (xj(i) - μj) / sj
 Where sj is some measure of the range, so
could be
 Biggest - smallest
 Standard deviation (more commonly)
 With preprocessing done, PCA finds the lower dimensional sub-space which minimizes the sum of squared projection errors
o In summary, for 2D->1D we'd be doing something like this;

o Need to compute two things;


 Compute the u vectors
 The new planes
 Need to compute the z vectors
 z vectors are the new, lower dimensionality feature vectors
 A mathematical derivation for the u vectors is very complicated
o But once you've done it, the procedure to find each u vector is not that
hard

Algorithm description

 Reducing data from n-dimensional to k-dimensional


o Compute the covariance matrix
 Σ = (1/m) * sum over i = 1..m of x(i) (x(i))^T
 This is commonly denoted as Σ (greek upper case sigma) - NOT the summation symbol
 This is an [n x n] matrix
 Remember that x(i) is an [n x 1] vector
 In MATLAB or octave we can implement this as follows;
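A one-line sketch (assuming X is the [m x n] matrix with one mean-normalized example per row):

Sigma = (1 / m) * (X' * X);   % [n x n] covariance matrix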

o Compute eigenvectors of matrix Σ


o
 [U,S,V] = svd(sigma)
 svd = singular value decomposition
 More numerically stable than eig
 eig = also gives eigenvector
o U,S and V are matrices
 U matrix is also an [n x n] matrix
 Turns out the columns of U are the u vectors we want!
 So to reduce a system from n-dimensions to k-dimensions
 Just take the first k-vectors from U (first k columns)

 Next we need to find some way to change x (which is n dimensional) to z


(which is k dimensional)
o (reduce the dimensionality)
o Take first k columns of the u matrix and stack in columns
 n x k matrix - call this Ureduce
o We calculate z as follows
 z = (Ureduce)T * x
 So [k x n] * [n x 1]
 Generates a matrix which is
 k*1
 If that's not witchcraft I don't know what is!

 Exactly the same as with supervised learning except we're now doing it
with unlabeled data
 So in summary
o Preprocessing
o Calculate sigma (covariance matrix)
o Calculate eigenvectors with svd
o Take k vectors from U (Ureduce= U(:,1:k);)
o Calculate z (z =Ureduce' * x;)
 No mathematical derivation

o Very complicated
o But it works

Reconstruction from Compressed Representation


 Earlier spoke about PCA as a compression algorithm
o If this is the case, is there a way to decompress the data from low
dimensionality back to a higher dimensionality format?
 Reconstruction

o Say we have an example as follows

o We have our examples (x1, x2 etc.)


o Project onto z-surface
o Given a point z1, how can we go back to the 2D space?
 Considering
o z (vector) = (Ureduce)T * x
 To go in the opposite direction we must do
o xapprox = Ureduce * z
 To consider dimensions (and prove this really works)
 Ureduce = [n x k]
 z [k * 1]
 So
 xapprox = [n x 1]
 So this creates the following representation

 We lose some of the information (i.e. everything is now perfectly on that line)
but it is now projected into 2D space

Choosing the number of Principle Components


 How do we choose k?
o k = number of principal components
o Guidelines about how to choose k for PCA
 To choose k think about how PCA works
o PCA tries to minimize the averaged squared projection error: (1/m) Σ_i || x(i) - xapprox(i) ||^2
o Total variation in the data can be defined as the average squared distance of the training examples from the origin: (1/m) Σ_i || x(i) ||^2

 When we're choosing k it is typical to use something like this
o Ratio between the averaged squared projection error and the total variation in the data
 Want the ratio to be small (e.g. <= 0.01) - this means we retain 99% of the variance
o If it's small (close to 0) then this is because the numerator is small
 The numerator is small when x(i) ≈ xapprox(i)
 i.e. we lose very little information in the dimensionality reduction, so when we decompress we regenerate much the same data
 So we choose k in terms of this ratio
 Often can significantly reduce data dimensionality while retaining most of the variance
 How do you do this in practice? (one common approach is sketched below)
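One common way is to use the diagonal matrix S returned by svd(Sigma): the fraction of variance retained by the first k components is the sum of its first k diagonal entries divided by the sum of all of them. A sketch:

s = diag(S);                            % singular values of Sigma, from [U, S, V] = svd(Sigma)
for k = 1:length(s)
  if sum(s(1:k)) / sum(s) >= 0.99       % fraction of variance retained by the first k components
    break
  end
end
k                                        % smallest k with >= 99% variance retained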

Advice for Applying PCA


 Can use PCA to speed up algorithm running time
o Explain how
o And give general advice

Speeding up supervised learning algorithms

 Say you have a supervised learning problem


o Input x and y
 x is a 10 000 dimensional feature vector
 e.g. 100 x 100 images = 10 000 pixels
 Such a huge feature vector will make the algorithm slow
o With PCA we can reduce the dimensionality and make it tractable
o How
 1) Extract xs
 So we now have an unlabeled training set
 2) Apply PCA to x vectors
 So we now have a reduced dimensional feature vector
z
 3) This gives you a new training set
 Each vector can be re-associated with the label
 4) Take the reduced dimensionality data set and feed to a
learning algorithm
 Use y as labels and z as feature vector
 5) If you have a new example map from higher
dimensionality vector to lower dimensionality vector, then
feed into learning algorithm
 PCA maps one vector to a lower dimensionality vector
o x -> z
o Defined by PCA only on the training set
o The mapping computes a set of parameters
 Feature scaling values
 Ureduce
 Parameter learned by PCA
 Should be obtained only by determining PCA on your
training set
o So we use those learned parameters for our
 Cross validation data
 Test set
 Typically you can reduce data dimensionality by 5-10x without a major hit to algorithm performance

Applications of PCA
 Compression
o Why
 Reduce memory/disk needed to store data
 Speed up learning algorithm
o How do we chose k?
 % of variance retained
 Visualization
o Typically chose k =2 or k = 3
o Because we can plot these values!
 One thing often done wrong regarding PCA
o A bad use of PCA: Use it to prevent over-fitting
 Reasoning
 If x(i) has n features, z(i) has k < n features
 If we only have k features then maybe we're less likely to overfit...
 This doesn't work
 BAD APPLICATION
 Might work OK, but not a good way to address over
fitting
 Better to use regularization
 PCA throws away some data without knowing what values it's losing
 Probably OK if you're keeping most of the data
 But if you're throwing away some crucial data bad
 So you have to go to like 95-99% variance retained
 So here regularization will give you AT LEAST
as good a way to solve over fitting
 A second PCA myth
o Used for compression or visualization - good
o Sometimes used
 Design ML system with PCA from the outset
 But, what if you did the whole thing without PCA
 See how a system performs without PCA
 ONLY if you have a reason to believe PCA will help
should you then add PCA
 PCA is easy enough to add on as a processing step
 Try without first!

Unsupervised learning - introduction


 Talk about clustering
o Learning from unlabeled data
 Unsupervised learning

o Useful to contrast with supervised learning
 Compare and contrast
o Supervised learning
 Given a set of labels, fit a hypothesis to it
o Unsupervised learning
 Try to determine structure in the data
 Clustering algorithm groups data together based on data
features
 What is clustering good for

o Market segmentation - group customers into different market
segments
o Social network analysis - Facebook "smartlists"
o Organizing computer clusters and data centers for network
layout and location
o Astronomical data analysis - Understanding galaxy formation

K-means algorithm
 Want an algorithm to automatically group the data into coherent
clusters
 K-means is by far the most widely used clustering algorithm

Overview

 Take unlabeled data and group into two clusters

 Algorithm overview
o 1) Randomly allocate two points as the cluster centroids
 Have as many cluster centroids as clusters you want to do
(K cluster centroids, in fact)
 In our example we just have two clusters
o 2) Cluster assignment step
 Go through each example and depending on if it's closer to
the red or blue centroid assign each point to one of the two
clusters
 To demonstrate this, we've gone through the data and
"colour" each point red or blue

o 3) Move centroid step


 Take each centroid and move to the average of the
correspondingly assigned data-points

 Repeat 2) and 3) until convergence


 More formal definition
o Input:
 K (number of clusters in the data)
 Training set {x(1), x(2), x(3), ..., x(m)}
o Algorithm:
 Randomly initialize K cluster centroids as {μ1, μ2, μ3 ... μK}


 Loop 1
 This inner loop repeatedly sets the c(i) variable to be the index of the cluster centroid closest to x(i)
 i.e. take the ith example, measure the squared distance to each cluster centroid, and assign c(i) to the closest cluster

 Loop 2
 Loop over each centroid and set it to the mean of all the points assigned to it via c(i)
 What if there's a centroid with no data

 Remove that centroid, so end up with K-1 classes
 Or, randomly reinitialize it

 Not sure when though...
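A sketch of one iteration of the two steps in Octave (assuming X is [m x n] with one example per row and centroids is [K x n]):

% Cluster assignment step: c(i) = index of the centroid closest to x(i)
c = zeros(m, 1);
for i = 1:m
  dists = sum((centroids - X(i, :)).^2, 2);   % squared distance from x(i) to every centroid
  [~, c(i)] = min(dists);
end
% Move centroid step (assumes every cluster got at least one point; see the note above)
for k = 1:K
  centroids(k, :) = mean(X(c == k, :), 1);
end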

K-means for non-separated clusters

 So far looking at K-means where we have well defined clusters


 But often K-means is applied to datasets where there aren't well defined
clusters
o e.g. T-shirt sizing

 Not obvious discrete groups


 Say you want to have three sizes (S,M,L) how big do you make these?
o One way would be to run K-means on this data
o May do the following

o So creates three clusters, even though they aren't really there


o Look at first population of people
 Try and design a small T-shirt which fits the 1st population
 And so on for the other two
o This is an example of market segmentation
o
 Build products which suit the needs of your subpopulations

K means optimization objective


 Supervised learning algorithms have an optimization objective (cost
function)
o K-means does too
 K-means has an optimization objective like the supervised learning
functions we've seen
o Why is this good?

o Knowing this is useful because it helps for debugging
o Helps find better clusters
 While K-means is running we keep track of two sets of variables
o c(i) is the index of the cluster {1, 2, ..., K} to which x(i) is currently assigned
 i.e. there are m c(i) values, as each example has a c(i) value, and that value is one of the clusters (i.e. can only be one of K different values)
o μk is the location of cluster centroid k
 So there are K of them
 These centroids live in the same space as the training data
o μ_c(i) is the cluster centroid of the cluster to which example x(i) has been assigned
 This is more for convenience than anything else
 You could look up that example i is assigned to cluster j (using the c vector), where j is between 1 and K
 Then look up the value associated with cluster j in the μ vector (i.e. what are the features associated with μj)
 But instead, for easy description, we have this variable which gets exactly the same value
 Let's say x(i) has been assigned to cluster 5
 Means that
 c(i) = 5
 μ_c(i) = μ5
 Using this notation we can write the optimization objective;
 J(c(1), ..., c(m), μ1, ..., μK) = (1/m) Σ_{i=1..m} || x(i) - μ_c(i) ||^2
o i.e. the average squared distance between each training example x(i) and the cluster centroid to which it has been assigned
 This is just what we've been doing, as the visual description
below shows;

 The red line here shows the distances between the


example xi and the cluster to which that example has been
assigned
 Means that when the example is very close to
the cluster, this value is small
 When the cluster is very far away from the example,
the value is large
o This is sometimes called the distortion (or distortion cost
function)
o So we are finding the values of c and μ which minimize this function;
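A one-line Octave sketch of this distortion (assuming X is [m x n], c is the [m x 1] assignment vector and centroids is [K x n]):

J = mean(sum((X - centroids(c, :)).^2, 2));   % mean squared distance to each example's assigned centroid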

 If we consider the k-means algorithm


o The cluster assignment step is minimizing J(...) with respect to c(1), c(2), ..., c(m)
 i.e. find the centroid closest to each example
 Doesn't change the centroids themselves
o The move centroid step
 We can show this step is choosing the values of μ which minimize J(...) with respect to μ
o So, we're partitioning the algorithm into two parts
 First part minimizes J with respect to the c variables
 Second part minimizes J with respect to the μ variables
 We can use this knowledge to help debug our K-means algorithm

Random initialization

 How we initialize K-means


o And how avoid local optimum
 Consider clustering algorithm
o Never spoke about how we initialize the centroids
 A few ways - one method is most recommended
 Have the number of centroids set to less than the number of examples (K < m) (if K > m we have a problem)
o Randomly pick K training examples
o Set μ1 up to μK to these example's values
 K means can converge to different solutions depending on
the initialization setup
o Risk of local optimum

o The local optimum are valid convergence, but local optimum not
global ones
 If this is a concern
o We can do multiple random initializations
 See if we get the same result - many same results are likely
to indicate a global optimum
 Algorithmically we can do this as follows;
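A sketch of this procedure (runKMeans is a hypothetical helper that runs the assignment/move steps to convergence and returns the distortion J):

bestJ = Inf;
for t = 1:100
  idx = randperm(m);
  initialCentroids = X(idx(1:K), :);                     % K randomly chosen training examples
  [c, centroids, J] = runKMeans(X, initialCentroids);    % hypothetical helper: run K-means to convergence
  if J < bestJ
    bestJ = J;  bestC = c;  bestCentroids = centroids;   % keep the lowest-distortion clustering
  end
end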

o A typical number of times to initialize K-means is 50-1000


o Randomly initialize K-means
 For each of the (say) 100 random initializations run K-means
 Then compute the distortion on the set of cluster assignments and centroids at convergence
 End with 100 ways of clustering the data
 Pick the clustering which gave the lowest distortion
 If you're running K-means with 2-10 clusters this can help find a better global optimum
o If K is larger than 10, then multiple random initializations are less likely to be necessary
o The first solution is probably good enough (with many clusters the granularity is already fine)

How do we choose the number of clusters?


 Choosing K?
o Not a great way to do this automatically
o Normally use visualizations to do it manually
 What are the intuitions regarding the data?
 Why is this hard
o Sometimes very ambiguous
 e.g. two clusters or four clusters
 Not necessarily a correct answer
o This is why doing it automatic this is hard

Elbow method

 Vary K and compute cost function at a range of K values


 As K increases the minimum value of J(...) should decrease (i.e. you increase the granularity, so centroids can better fit the data)
o Plot this (K vs J())
 Look for the "elbow" on the graph

 Chose the "elbow" number of clusters


 If you get a nice plot this is a reasonable way of choosing K
 Risks
o Normally you don't get a nice line -> no clear elbow on the curve
o Not really that helpful

Another method for choosing K

 Using K-means for market segmentation


 Running K-means for a later/downstream purpose
o See how well different numbers of clusters serve your later needs
 e.g.
o T-shirt size example
 If you have three sizes (S,M,L)
 Or five sizes (XS, S, M, L, XL)
 Run K means where K = 3 and K = 5
o How does this look

o This gives a way to chose the number of clusters


 Could consider the cost of making extra sizes vs. how well
distributed the products are
 How important are those sizes though? (e.g. more sizes
might make the customers happier)
 So applied problem may help guide the number of clusters
Decision Trees
 We now turn our attention to decision trees, a simple yet flexible class of algorithms. We will first consider the non-linear, region-based nature of decision trees, continue on to define and contrast region-based loss functions, and close off with an investigation of some of the specific advantages and disadvantages of such methods. Once finished with their nuts and bolts, we will move on to investigating different ensembling methods through the lens of decision trees, due to their suitability for such techniques.

Non-linearity
 Importantly, decision trees are one of the first inherently non-linear machine learning techniques we will cover, as compared to methods such as vanilla SVMs or GLMs. Formally, a method is linear if for an input x ∈ R^n (with intercept term x0 = 1) it only produces hypothesis functions h of the form:
h(x) = θ^T x
o where θ ∈ R^n. Hypothesis functions that cannot be reduced to the form above are called non-linear, and if a method can produce non-linear hypothesis functions then it is also non-linear. We have already seen that kernelization of a linear method is one way by which we can achieve non-linear hypothesis functions, via a feature mapping ϕ(x).
 Decision trees, on the other hand, can directly produce non-linear hypothesis functions without the need for first coming up with an appropriate feature mapping. As a motivating (and very Canadian) example, let us say we want to build a classifier that, given a time and a location, can predict whether or not it would be possible to ski nearby. To keep things simple, the time is represented as month of the year and the location is represented as a latitude (how far North or South we are, with −90°, 0°, and 90° being the South Pole, Equator, and North Pole, respectively).
 A representative dataset is shown above left. There is no linear boundary that would correctly split this dataset. However, we can recognize that there are different areas of positive and negative space we wish to isolate, one such division being shown above right. We accomplish this by partitioning the input space X into disjoint subsets (or regions) R_i:
X = ⋃_{i=0}^{n} R_i s.t. R_i ∩ R_j = ∅ for i ≠ j
o where n ∈ Z+

Selecting Regions
 In general, selecting optimal regions is intractable. Decision trees generate an approximate solution via greedy, top-down, recursive partitioning. The method is top-down because we start with the original input space X and split it into two child regions by thresholding on a single feature. We then take one of these child regions and can partition via a new threshold. We continue the training of our model in a recursive manner, always selecting a leaf node, a feature, and a threshold to form a new split. Formally, given a parent region R_p, a feature index j, and a threshold t ∈ R, we obtain two child regions R_1 and R_2 as follows:
R_1 = {X | X_j < t, X ∈ R_p}
R_2 = {X | X_j ≥ t, X ∈ R_p}
 The beginning of one such process is shown below applied to the skiing dataset. In step a, we split the input space X by the location feature, with a threshold of 15, creating child regions R_1 and R_2. In step b, we then recursively select one of these child regions (in this case R_2) and select a feature (time) and threshold (3), generating two more child regions (R_21 and R_22). In step c, we select any one of the remaining leaf nodes (R_1, R_21, R_22). We can continue in such a manner until we meet a given stop criterion (more on this later), and then predict the majority class at each leaf node.
Defining a Loss Function
 A natural question to ask at this point is how to choose our splits. To do so, it is first useful to define our loss L as a set function on a region R. Given a split of a parent R_p into two child regions R_1 and R_2, we can compute the loss of the parent L(R_p) as well as the cardinality-weighted loss of the children (|R_1| L(R_1) + |R_2| L(R_2)) / (|R_1| + |R_2|). Within our greedy partitioning framework, we want to select the leaf region, feature, and threshold that will maximize our decrease in loss:
L(R_p) − (|R_1| L(R_1) + |R_2| L(R_2)) / (|R_1| + |R_2|)
 For a classification problem, we are interested in the misclassification loss L_misclass. For a region R let p̂_c be the proportion of examples in R that are of class c. Misclassification loss on R can be written as:
L_misclass(R) = 1 − max_c(p̂_c)
 We can understand this as the fraction of examples that would be misclassified if we predicted the majority class for region R (which is exactly what we do). While misclassification loss is the final value we are interested in, it is not very sensitive to changes in class probabilities. As a representative example, we show a binary classification case below. We explicitly depict the parent region R_p as well as the positive and negative counts in each region.

 The first split is isolating out more of the positives, but we note that:
L(R_p) = (|R_1| L(R_1) + |R_2| L(R_2)) / (|R_1| + |R_2|) = (|R_1'| L(R_1') + |R_2'| L(R_2')) / (|R_1'| + |R_2'|) = 100
 Thus, not only are the losses of the two splits identical, but neither of the splits decreases the loss over that of the parent.
 We therefore are interested in defining a more sensitive loss. While several have been proposed, we will focus here on the cross-entropy loss L_cross:
L_cross(R) = −Σ_c p̂_c log2 p̂_c
 with p̂ log2 p̂ ≡ 0 if p̂ = 0. From an information-theoretic perspective, cross-entropy measures the number of bits needed to specify the outcome (or class) given that the distribution is known. Furthermore, the reduction in loss from parent to child is known as the information gain.
 To understand the relative sensitivity of cross-entropy loss with respect to misclassification loss, let us look at plots of both loss functions for the binary classification case. For these cases, we can simplify our loss functions to depend on just the proportion of positive examples p̂_i in a region R_i:
L_misclass(R) = L_misclass(p̂) = 1 − max(p̂, 1 − p̂)
L_cross(R) = L_cross(p̂) = −p̂ log p̂ − (1 − p̂) log(1 − p̂)
 In the figure above on the left, we see the cross-entropy loss plotted over p̂. We take the regions (R_p, R_1, R_2) from the previous example's first split, and plot their losses as well. As cross-entropy loss is strictly concave, it can be seen from the plot (and easily proven) that as long as p̂_1 ≠ p̂_2 and both child regions are non-empty, the weighted sum of the children losses will always be less than that of the parent.
 Misclassification loss, on the other hand, is not strictly concave, and therefore there is no guarantee that the weighted sum of the children will be less than that of the parent, as shown above right, with the same partition. Due to this added sensitivity, cross-entropy loss (or the closely related Gini loss) is used when growing decision trees for classification.
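A small Octave sketch of the two losses and the split criterion for the binary case (the counts and proportions below are invented purely to illustrate the insensitivity of misclassification loss):

plog2  = @(p) p .* log2(p + (p == 0));                  % p * log2(p), with 0 * log2(0) treated as 0
Lmis   = @(p) 1 - max(p, 1 - p);                        % misclassification loss
Lcross = @(p) -plog2(p) - plog2(1 - p);                 % cross-entropy loss

% Decrease in loss for a candidate split: parent proportion pp, child sizes n1, n2, child proportions p1, p2
splitGain = @(L, pp, n1, p1, n2, p2) L(pp) - (n1 * L(p1) + n2 * L(p2)) / (n1 + n2);

splitGain(Lcross, 0.25, 400, 0.10, 400, 0.40)   % cross-entropy strictly decreases (> 0)
splitGain(Lmis,   0.25, 400, 0.10, 400, 0.40)   % misclassification loss shows no improvement (= 0)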
 Before fully moving away from loss functions, we briefly cover the regression setting for decision trees. For each data point x_i we now instead have an associated value y_i ∈ R we wish to predict. Much of the tree growth process remains the same, with the difference being that the final prediction for a region R is the mean of all the values:
ŷ = (Σ_{i∈R} y_i) / |R|
 And in this case we can directly use the squared loss to select our splits:
L_squared(R) = (Σ_{i∈R} (y_i − ŷ)^2) / |R|

Other Considerations
 The popularity of decision trees can in large part be attributed to the ease by which they
are explained and understood, as well as the high degree of interpretability they exhibit:
we can look at the generated set of thresholds to understand why a model made specific
predictions. However, that is not the full picture - we will now cover some additional
salient points.

Categorical Variables
 Another advantage of decision trees is that they can easily deal with categorical variables. As an example, our location in the skiing dataset could instead be represented as a categorical variable (one of Northern Hemisphere, Southern Hemisphere, or Equator, i.e. loc ∈ {N, S, E}). Rather than use a one-hot encoding or similar preprocessing step to transform the data into a quantitative feature, as would be necessary for the other algorithms we have seen, we can directly probe subset membership. The final tree in Section 2 can be re-written accordingly.
 A caveat to the above is that we must take care to not allow a variable to have too many categories. For a set of categories S, our set of possible questions is the power set P(S), of cardinality 2^|S|. Thus, a large number of categories makes question selection computationally intractable. Optimizations are possible for the binary classification case, though even then serious consideration should be given to whether the feature can be re-formulated as a quantitative one instead, as the large number of possible thresholds lends itself to a high degree of overfitting.

Regularization
 In Section 2 we alluded to various stopping criteria we could use to determine when to halt the growth of a tree. The simplest criterion involves "fully" growing the tree: we continue until each leaf region contains exactly one training data point. This technique however leads to a high variance and low bias model, and we therefore turn to various stopping heuristics for regularization. Some common ones include:
o Minimum Leaf Size - Do not split R if its cardinality falls below a fixed threshold.
o Maximum Depth - Do not split R if more than a fixed threshold of splits were already taken to reach R.
o Maximum Number of Nodes - Stop if a tree has more than a fixed threshold of leaf nodes.
 A tempting heuristic to use would be to enforce a minimum decrease in loss after splits. This is a problematic approach as the greedy, single-feature-at-a-time approach of decision trees could mean missing higher order interactions. If we require thresholding on multiple features to achieve a good split, we might be unable to achieve a good decrease in loss on the initial splits and therefore prematurely terminate. A better approach involves fully growing out the tree, and then pruning away nodes that minimally decrease misclassification or squared error, as measured on a validation set.

Runtime
 We briefly turn to considering the runtime of decision trees. For ease of analysis, we will consider binary classification with n examples, f features, and a tree of depth d. At test time, for a data point we traverse the tree until we reach a leaf node and then output its prediction, for a runtime of O(d). Note that if our tree is balanced then d = O(log n), and thus test time performance is generally quite fast.
 At training time, we note that each data point can only appear in at most O(d) nodes. Through sorting and intelligent caching of intermediate values, we can achieve an amortized runtime of O(1) at each node for a single data point for a single feature. Thus, overall runtime is O(nfd) - a fairly fast runtime, as the data matrix alone is of size nf.

Lack of Additive Structure


 One important downside to consider is that decision trees cannot easily capture additive structure. For example, as seen below on the left, a simple decision boundary of the form x1 + x2 could only be approximately modeled through the use of many splits, as each split can only consider one of x1 or x2 at a time. A linear model on the other hand could directly derive this boundary, as shown below right.
 While there has been some work in allowing for decision boundaries that factor in many features at once, they have the downside of further increasing variance and reducing interpretability.

Ensemble Methods Overview


 We now cover methods by which we can aggregate the output of trained models. We
will use Bias-Variance analysis as well as the example of decision trees to probe some
of the trade-offs of each of these methods.
 To understand why we can derive benefit from ensembling, let us first recall some basic probability theory. Say we have n independent, identically distributed (i.i.d.) random variables X_i for 0 ≤ i < n. Assume Var(X_i) = σ^2 for all X_i. Then we have that the variance of the mean is:
Var(X̄) = Var((1/n) Σ_i X_i) = σ^2 / n
 Now, if we drop the independence assumption (so the variables are only i.d.), and instead say that the X_i's are correlated by a factor ρ, we can show that:
Var(X̄) = Var((1/n) Σ_i X_i) = (1/n^2) Σ_{i,j} Cov(X_i, X_j) = n σ^2 / n^2 + n(n−1) ρ σ^2 / n^2 = ρ σ^2 + ((1 − ρ)/n) σ^2
 where in the third step we use the definition of the Pearson correlation coefficient, ρ_{X,Y} = Cov(X, Y) / (σ_x σ_y), and that Cov(X, X) = Var(X)
 Now, if we consider each random variable to be the error of a given model, we can see that both
increasing the number of models used (causing the second term to vanish) as well as decreasing
the correlation between models (causing the first term to vanish and returning us to the i.i.d.
definition) leads to an overall decrease in variance of the error of the ensemble.
 There are several ways by which we can generate de-correlated models, including:
o Using different algorithms
o Using different training sets
o Bagging
o Boosting
 While the first two are fairly straightforward, they involve large amounts of additional work. In the following sections, we will cover the latter two techniques, bagging and boosting, as well as their specific uses in the context of decision trees.

Bagging

Bootstrap
 Bagging stands for "Bootstrap Aggregation" and is a variance reduction ensembling method. Bootstrap is a method from statistics traditionally used to measure the uncertainty of some estimator (e.g. the mean).
 Say we have a true population P that we wish to compute an estimator for, as well as a training set S sampled from P (S ∼ P). While we can find an approximation by computing the estimator on S, we cannot know what the error is with respect to the true value. To do so we would need multiple independent training sets S_1, S_2, … all sampled from P.
 However, if we make the assumption that S = P, we can generate a new bootstrap set Z sampled with replacement from S (Z ∼ S, |Z| = |S|). In fact we can generate many such samples Z_1, Z_2, …, Z_M. We can then look at the variability of our estimate across these bootstrap sets to obtain a measure of error.
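A sketch of drawing one bootstrap set in Octave (assuming the training set S is stored as a matrix with one example per row):

m   = size(S, 1);
idx = randi(m, m, 1);      % m indices sampled uniformly with replacement
Z   = S(idx, :);           % one bootstrap set Z, with |Z| = |S|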

Aggregation
 Now, returning to ensembling, we can take each Z_m and train a machine learning model G_m on each, and define a new aggregate predictor:
G(x) = (Σ_m G_m(x)) / M
 This process is called bagging. Referring back to the correlated-variance result above, we have that the variance of M correlated predictors is:
Var(X̄) = ρ σ^2 + ((1 − ρ)/M) σ^2
 Bagging creates less correlated predictors than if they were all simply trained on S, thereby decreasing ρ. While the bias of each individual predictor increases due to each bootstrap set not having the full training set available, in practice it has been found that the decrease in variance outweighs the increase in bias. Also note that increasing the number of predictors M can't lead to additional overfitting, as ρ is insensitive to M and therefore the overall variance can only decrease.
 An additional advantage of bagging is called out-of-bag estimation. It can be shown that each bootstrapped sample only contains approximately 2/3 of S, and thus we can use the other 1/3 as an estimate of error, called out-of-bag error. In the limit, as M → ∞, out-of-bag error gives an equivalent result to leave-one-out cross-validation.

Bagging + Decision Trees


 Recall that fully-grown decision trees are high variance, low bias models, and therefore the variance-reducing effects of bagging work well in conjunction with them. Bagging also allows for handling of missing features: if a feature is missing, exclude trees in the ensemble that use that feature in any of their splits. Though if certain features are particularly powerful predictors they may still be included in most if not all trees.
 A downside to bagged trees is that we lose the interpretability inherent in the single decision
tree. One method by which to re-gain some amount of insight is through a technique called
variable importance measure. For each feature, find each split that uses it in the ensemble and
average the decrease in loss across all such splits. Note that this is not the same as measuring
how much performance would degrade if we did not have this feature, as other features might
be correlated and could substitute.
 A final but important aspect of bagged decision trees to cover is the method of random forests.
If our dataset contained one very strong predictor, then our bagged trees would always use that
feature in their splits and end up correlated. With random forests, we instead only allow a subset
of features to be used at each split. By doing so, we achieve a decrease in correlation ρ which
leads to a decrease in variance. Again, there is also an increase in bias due to the restriction of
the feature space, but as with vanilla bagged decision trees this proves to not often be an issue.
 Finally, even powerful predictors will no longer be present in every tree (assuming sufficient
number of trees and sufficient restriction of features at each split), allowing for more graceful
handling of missing predictors.

Key Takeaways
 To summarize, some of the primary benefits of bagging, in the context of decision trees, are:
o Decrease in variance (even more so for random forests)
o Better accuracy
o Free validation set
o Support for missing values
 While some of the disadvantages include:
o Increase in bias (even more so for random forests)
o Harder to interpret
o Still not additive
o More expensive

Boosting

Intuition
 Bagging is a variance-reducing technique, whereas boosting is used for bias reduction. We
therefore want high bias, low variance models, also known as weak learners. Continuing our
exploration via the use of decision trees, we can make them into weak learners by allowing each
tree to only make one decision before making a prediction; these are known as decision stumps.
 We explore the intuition behind boosting via the example above. We start with a dataset on the
left, and allow a single decision stump to be trained, as seen in the middle panel. The key idea
is that we then track which examples the classifier got wrong, and increase their relative weight
compared to the correctly classified examples. We then train a new decision stump which will
be more incentivized to correctly classify these “hard negatives.” We continue as such,
incrementally re-weighting examples at each step, and at the end we output a combination of
these weak learners as an ensemble classifier.

Adaboost
 Having covered the intuition, let us look at one of the most popular boosting algorithms,
Adaboost, reproduced below:
o Algorithm 0: Adaboost
o Input: Labeled training data (x_1, y_1), (x_2, y_2), …, (x_N, y_N)
o Output: Ensemble classifier f(x)
1. w_i ← 1/N for i = 1, 2, …, N
2. for m = 0 to M do
3. Fit weak classifier G_m to the training data weighted by w_i
4. Compute weighted error err_m = (Σ_i w_i 1(y_i ≠ G_m(x_i))) / (Σ_i w_i)
5. Compute weight α_m = log((1 − err_m) / err_m)
6. w_i ← w_i * exp(α_m 1(y_i ≠ G_m(x_i)))
7. end
8. f(x) = sign(Σ_m α_m G_m(x))
 The weightings for each example start out even, with misclassified examples being further up-weighted at each step, in a cumulative fashion. The final aggregate classifier is a summation of all the weak learners, each weighted by the negative log-odds of its weighted error.
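A sketch of the re-weighting inside one round (assuming y and the weak learner's predictions preds are [N x 1] vectors of ±1 labels, and w is the current weight vector):

miss    = (preds ~= y);                   % indicator 1(y_i != Gm(x_i))
err_m   = sum(w .* miss) / sum(w);        % weighted error of this weak learner
alpha_m = log((1 - err_m) / err_m);       % weight alpha_m given to this weak learner
w       = w .* exp(alpha_m * miss);       % up-weight only the misclassified examples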
 We can also see that due to the final summation, this ensembling method allows for modeling of additive terms, increasing the overall modeling capability (and variance) of the final model. Each new weak learner is no longer independent of the previous models in the sequence, meaning that increasing M leads to an increase in the risk of overfitting.
 The exact weightings used for Adaboost appear to be somewhat arbitrary at first glance, but
can be shown to be well justified. We shall approach this in the next section through a more
general framework of which Adaboost is a special case.

Forward Stagewise Additive Modeling


 The Forward Stagewise Additive Modeling algorithm reproduced below is a framework
for ensembling:
o Algorithm 1: Forward Stagewise Additive Modeling
o Input: Labeled training data (x_1, y_1), (x_2, y_2), …, (x_N, y_N)
o Output: Ensemble classifier f(x)
1. Initialize f_0(x) = 0
2. for m = 0 to M do
3. Compute (β_m, γ_m) = argmin_{β,γ} Σ_{i=1..N} L(y_i, f_{m−1}(x_i) + β G(x_i; γ))
4. Set f_m(x) = f_{m−1}(x) + β_m G(x; γ_m)
5. end
6. f(x) = f_M(x)
 Close inspection reveals that few assumptions are made about the learning problem at hand, the only major ones being the additive nature of the ensembling as well as the fixing of all previous weightings and parameters after a given step. We again have weak classifiers G(x), though this time we explicitly parameterize them by their parameters γ. At each step we are trying to find the next weak learner's parameters and weighting so as to best match the remaining error of the current ensemble.
 As a concrete implementation of this algorithm, using a squared loss would be the same as fitting individual classifiers to the residual y_i − f_{m−1}(x_i). Furthermore, it can be shown that Adaboost is a special case of this formulation, specifically for 2-class classification and the exponential loss:
L(y, ŷ) = exp(−y ŷ)
 For further details regarding the connection between Adaboost and Forward Stagewise Additive
Modeling, the interested reader is referred to chapter 10.4 Exponential Loss and AdaBoost
in Elements of Statistical Learning.

Gradient Boosting
 In general, it is not always easy to write out a closed-form solution to the minimization problem presented in Forward Stagewise Additive Modeling. High-performing methods such as XGBoost resolve this issue by turning to numerical optimization.
 One of the most obvious things to do in this case would be to take the derivative of the loss and perform gradient descent. However, the complication is that we are restricted to taking steps in our model class - we can only add in parameterized weak learners G(x; γ), not make arbitrary moves in the input space.
 In gradient boosting, we instead compute the gradient of the loss at each training point with respect to the current predictor's output (the weak learner is typically a decision stump):
g_i = ∂L(y_i, f(x_i)) / ∂f(x_i)
 We then train a new regression predictor to match this gradient and use it as the gradient step. In Forward Stagewise Additive Modeling, this works out to:
γ_i = argmin_γ Σ_{i=1..N} (g_i − G(x_i; γ))^2

Key Takeaways
 To summarize, some of the primary benefits of boosting are:
o Decrease in bias
o Better accuracy
o Additive modeling
 While some of the disadvantages include:
o Increase in variance
o Prone to overfitting
