Algorithms Notes
We can see above that this does a reasonable job of stratifying the data
points into one of two classes
o But what if we had a single Yes with a very small tumour
o This would lead to classifying all the existing yeses as nos
Another issue with linear regression
o We know y is 0 or 1
o The hypothesis can give values larger than 1 or less than 0
So, logistic regression generates a value that is always between 0 and 1
o Logistic regression is a classification algorithm - don't be
confused
Hypothesis representation
What function is used to represent our hypothesis in classification
We want our classifier to output values between 0 and 1
o When using linear regression we did hθ(x) = θᵀx
o For classification hypothesis representation we do hθ(x) = g(θᵀx)
o Where we define g(z)
        z is a real number
        g(z) = 1/(1 + e^(-z))
                This is the sigmoid function, or the logistic function
        If we combine these equations we can write out the hypothesis as
                hθ(x) = 1/(1 + e^(-θᵀx))
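As a rough Octave sketch of this hypothesis (the function names below are illustrative, not taken from the course code):

function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));    % works element-wise on scalars, vectors and matrices
end

function p = hypothesis(theta, X)
  % theta is an (n+1) x 1 parameter vector; X is m x (n+1) with x0 = 1 in the first column
  p = sigmoid(X * theta);    % each entry is the estimated P(y = 1 | x ; theta)
end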
When our hypothesis (hθ(x)) outputs a number, we treat that value as the
estimated probability that y=1 on input x
o Example
If X is a feature vector with x0 = 1 (as always)
and x1 = tumourSize
hθ(x) = 0.7
Tells a patient they have a 70% chance of a tumor
being malignant
o We can write this using the following notation
hθ(x) = P(y=1|x ; θ)
o What does this mean?
Probability that y=1, given x, parameterized by θ
Since this is a binary classification task we know y = 0 or 1
o So the following must be true
P(y=1|x ; θ) + P(y=0|x ; θ) = 1
P(y=0|x ; θ) = 1 - P(y=1|x ; θ)
Decision boundary
Gives a better sense of what the hypothesis function is computing
Helps us better understand what the hypothesis function looks like
o One way of using the sigmoid function is:
When the probability of y being 1 is greater than 0.5 then we
can predict y = 1
Else we predict y = 0
o When is it exactly that hθ(x) is greater than 0.5?
Look at sigmoid function
g(z) is greater than or equal to 0.5 when z is greater
than or equal to 0
So hθ(x) ≥ 0.5, and we predict y = 1, whenever θᵀx ≥ 0 (and we predict y = 0 whenever θᵀx < 0)
Decision boundary
o This is where θᵀx = 0 - it separates the region where we predict y = 1 from the region where we predict y = 0
o Remember hθ(x) = p(y=1 | x ; θ)
        Probability y = 1, given x, parameterized by θ
Advanced optimization
Previously we looked at gradient descent for minimizing the
cost function
Here we look at advanced optimization methods for minimizing the cost function for
logistic regression
o Good for large machine learning problems (e.g. huge feature set)
What is gradient descent actually doing?
o We have some cost function J(θ), and we want to minimize it
o We need to write code which can take θ as input and compute the following
        J(θ)
        The partial derivative of J(θ) with respect to θj (for j = 0 to n)
Input for the cost function is THETA, which is a vector of the θ parameters
Two return values from costFunction are
o jval
How we compute the cost function J(θ) (the underived cost function)
In this case = (θ1 - 5)² + (θ2 - 5)²
o gradient
2 by 1 vector
2 elements are the two partial derivative terms
i.e. this is an n-dimensional vector
Each indexed value gives the partial derivative of J(θ) with respect to θi
        Where i is the index position in the gradient vector
With the cost function implemented, we can call the advanced algorithm
using
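A sketch of the Octave call being referred to, using the quadratic example cost described above (treat it as illustrative rather than the exact course listing):

function [jVal, gradient] = costFunction(theta)
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;    % the (underived) cost function
  gradient = zeros(2, 1);                        % 2 by 1 vector of partial derivatives
  gradient(1) = 2 * (theta(1) - 5);              % d jVal / d theta1
  gradient(2) = 2 * (theta(2) - 5);              % d jVal / d theta2
end

options = optimset('GradObj', 'on', 'MaxIter', 100);   % we supply the gradient ourselves
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);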
Here
o options is a data structure giving options for the algorithm
o fminunc
        a function that minimizes the cost function (finds the minimum
        of an unconstrained multivariable function)
o @costFunction is a pointer to the costFunction function to be
used
For the octave implementation
o initialTheta must be a matrix of at least two dimensions
Here
o theta is an (n+1)-dimensional column vector
o Octave indexes from 1, not 0
Write a cost function which captures the cost function for logistic
regression
Overall
o Train a logistic regression classifier hθ^(i)(x) for each class i to
predict the probability that y = i
o On a new input x, to make a prediction, pick the class i that
maximizes hθ^(i)(x), the estimated probability that y = i
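A minimal Octave sketch of this prediction step (allTheta is an assumed K x (n+1) matrix with one row of learned parameters per class; sigmoid is as defined earlier):

probs = sigmoid(allTheta * x);         % x is an (n+1) x 1 input with x0 = 1; probs(i) estimates P(y = i | x)
[maxProb, prediction] = max(probs);    % pick the class i with the highest probability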
Motivation 1: Data compression - PCA
Start talking about a second type of unsupervised learning problem
- dimensionality reduction
o Why should we look at dimensionality reduction?
Compression
Speeds up algorithms
Reduces the space needed to store the data
What is dimensionality reduction?
o So you've collected many features - maybe more than you need
Can you "simplify" your data set in a rational and useful way?
o Example
Redundant data set - different units for same attribute
Reduce data to 1D (2D->1D)
Motivation 2: Visualization
It's hard to visualize highly dimensional data
o Dimensionality reduction can improve how we display information in a
tractable manner for human consumption
o Why do we care?
Often helps to develop algorithms if we can understand our data
better
Dimensionality reduction helps us do this, letting us see the data in a helpful way
Good for explaining something to someone if you can "show" it in
the data
Example;
o Collect a large data set about many facts of countries around the world
So
x1 = GDP
...
x6 = mean household
Say we have 50 features per country
How can we understand this data better?
Very hard to plot 50 dimensional data
o Using dimensionality reduction, instead of each country being
represented by a 50-dimensional feature vector
Come up with a different feature representation (z values) which
summarize these features
o In other words, find a single line onto which to project this data
How do we determine this line?
The distance between each point and the projected version
should be small (blue lines below are short)
PCA tries to find a lower-dimensional surface so that the sum of
squared projection distances onto that surface is minimized
The blue lines are sometimes called the projection error
PCA tries to find the surface (a straight line in this
case) which has the minimum projection error
PCA Algorithm
Before applying PCA must do data preprocessing
o Given a set of m unlabeled examples we must do
Mean normalization
Replace each xj^(i) with xj^(i) - μj
        In other words, determine the mean of each feature,
        and then for each feature subtract the mean from
        the value, so we re-scale the mean to be 0
Feature scaling (depending on data)
If features have very different scales then scale so they all
have a comparable range of values
e.g. xj^(i) is set to (xj^(i) - μj) / sj
Where sj is some measure of the range, so
could be
Biggest - smallest
Standard deviation (more commonly)
With preprocessing done, PCA finds the lower dimensional sub-space which
minimizes the sum of the squared projection errors
o In summary, for 2D->1D we'd be doing something like this;
Algorithm description
Exactly the same as with supervised learning except we're now doing it
with unlabeled data
So in summary
o Preprocessing
o Calculate sigma (covariance matrix)
o Calculate eigenvectors with svd
o Take k vectors from U (Ureduce = U(:, 1:k);)
o Calculate z (z = Ureduce' * x;)
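Putting those steps together, a minimal Octave sketch (X is the m x n data matrix, already mean-normalized and, if needed, feature-scaled):

Sigma = (1 / m) * (X' * X);    % n x n covariance matrix
[U, S, V] = svd(Sigma);        % columns of U are the eigenvectors
Ureduce = U(:, 1:k);           % take the first k vectors
Z = X * Ureduce;               % m x k projected data (z = Ureduce' * x for each example)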
No mathematical derivation
o Very complicated
o But it works
We lose some of the information (i.e. everything now lies perfectly on that line),
but the data is projected back into 2D space
o Total variation in the data can be defined as the average squared distance of the
training examples from the origin: (1/m) Σ ||x^(i)||²
Applications of PCA
Compression
o Why
Reduce memory/disk needed to store data
Speed up learning algorithm
o How do we choose k?
        Based on the % of variance retained
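A small Octave sketch of choosing k this way, using the singular values on the diagonal of S from [U, S, V] = svd(Sigma) (the 99% threshold is just an example):

s = diag(S);
for k = 1:length(s)
  if sum(s(1:k)) / sum(s) >= 0.99    % fraction of variance retained
    break;
  end
end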
Visualization
o Typically chose k =2 or k = 3
o Because we can plot these values!
One thing often done wrong regarding PCA
o A bad use of PCA: Use it to prevent over-fitting
Reasoning
If x^(i) has n features, z^(i) has k features, where k can be lower
        If we only have k features then maybe we're
        less likely to over fit...
This doesn't work
BAD APPLICATION
Might work OK, but not a good way to address over
fitting
Better to use regularization
PCA throws away some information without knowing which
values it's losing
        Probably OK if you're keeping most of the data
        But if you're throwing away some crucial data, that's bad
So you have to go to like 95-99% variance retained
So here regularization will give you AT LEAST
as good a way to solve over fitting
A second PCA myth
o Used for compression or visualization - good
o Sometimes used
Design ML system with PCA from the outset
But, what if you did the whole thing without PCA
See how a system performs without PCA
ONLY if you have a reason to believe PCA will help
should you then add PCA
PCA is easy enough to add on as a processing step
Try without first!
K-means algorithm
Want an algorithm to automatically group the data into coherent
clusters
K-means is by far the most widely used clustering algorithm
Overview
Algorithm overview
o 1) Randomly allocate two points as the cluster centroids
Have as many cluster centroids as clusters you want to do
(K cluster centroids, in fact)
In our example we just have two clusters
o 2) Cluster assignment step
Go through each example and depending on if it's closer to
the red or blue centroid assign each point to one of the two
clusters
To demonstrate this, we've gone through the data and
"coloured" each point red or blue
Loop 1
This inner loop repeatedly sets the c(i) variable to be
the index of the cluster centroid closest to x^(i)
        i.e. take the ith example, measure the squared
        distance to each cluster centroid, and
        assign c(i) to the closest cluster
Loop 2
Loops over each centroid and calculates the
mean of all the points associated with that
centroid via c(i)
What if there's a centroid with no data
Remove that centroid, so end up with K-1 classes
Or, randomly reinitialize it
Not sure when though...
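Putting the two loops together, a minimal Octave sketch of one K-means run (maxIters is an assumed iteration budget; X is m x n):

centroids = X(randperm(m, K), :);            % initialize centroids to K randomly chosen examples
c = zeros(m, 1);
for iter = 1:maxIters
  % Loop 1 (cluster assignment): c(i) = index of the centroid closest to x(i)
  for i = 1:m
    d = sum((centroids - X(i, :)).^2, 2);    % squared distance to each centroid
    [minDist, c(i)] = min(d);
  end
  % Loop 2 (move centroids): set each centroid to the mean of the points assigned to it
  for j = 1:K
    if any(c == j)
      centroids(j, :) = mean(X(c == j, :), 1);
    end                                      % an empty cluster could instead be removed or re-initialized
  end
end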
μc(i) is the cluster centroid of the cluster to which example x^(i) has been assigned
This is more for convenience than anything else
You could look up that example i is indexed to cluster
j (using the c vector), where j is between 1 and K
Then look up the value associated with cluster j in
the μ vector (i.e. what are the features associated
with μj)
But instead, for easy description, we have this variable
which gets exactly the same value
Let's say x^(i) has been assigned to cluster 5
Means that
c(i) = 5
μc(i) = μ5
Using this notation we can write the optimization objective:
        J(c, μ) = (1/m) Σ_{i=1..m} ||x^(i) - μc(i)||²
Random initialization
o The points K-means converges to are valid local optima, but they are not
necessarily the global optimum
If this is a concern
o We can do multiple random initializations
See if we get the same result - if many runs give the same result, it is likely
to be a global optimum
Algorithmically we can do this as follows;
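A hedged Octave sketch of this (kMeansRun is a hypothetical helper that performs one full random initialization plus the two loops and returns the centroids and assignments):

bestCost = Inf;
for t = 1:100
  [centroids, c] = kMeansRun(X, K);                  % one complete K-means run
  J = sum(sum((X - centroids(c, :)).^2)) / m;        % distortion J(c, mu) for this run
  if J < bestCost
    bestCost = J;
    bestCentroids = centroids;
    bestC = c;
  end
end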
Elbow method
o Run K-means for a range of values of K and plot the cost (distortion) J against K
o Choose the K at the "elbow", where the rate of decrease in J slows sharply
        Often the curve is ambiguous and there is no clear elbow, though
Non-linearity
Importantly, decision trees are one of the first inherently non-linear machine learning
techniques we will cover, as compared to methods such as vanilla SVMs or GLMs.
Formally, a method is linear if for an input x ∈ Rⁿ (with intercept
term x0 = 1) it only produces hypothesis functions h of the form:
        h(x) = θᵀx
o where θ ∈ Rⁿ. Hypothesis functions that cannot be reduced to the form
above are called non-linear, and if a method can produce non-linear
hypothesis functions then it is also non-linear. We have already seen that
kernelization of a linear method is one such method by which we can achieve
non-linear hypothesis functions, via a feature mapping ϕ(x).
Decision trees, on the other hand, can directly produce non-linear hypothesis functions
without the need for first coming up with an appropriate feature mapping. As a
motivating (and very Canadian) example, let us say we want to build a classifier that,
given a time and a location, can predict whether or not it would be possible to ski
nearby. To keep things simple, the time is represented as month of the year and the
location is represented as a latitude (how far North or South we are,
with −90°, 0°, and 90° being the South Pole, Equator, and North Pole,
respectively).
A representative dataset is shown above left. There is no linear boundary that would
correctly split this dataset. However, we can recognize that there are different areas of
positive and negative space we wish to isolate, one such division being shown above
right. We accomplish this by partitioning the input space X into disjoint subsets (or
regions) Ri:
        X = ⋃_{i=0..n} Ri   s.t.   Ri ∩ Rj = ∅ for i ≠ j
o where n ∈ Z+
Selecting Regions
In general, selecting optimal regions is intractable. Decision trees generate an
approximate solution via greedy, top-down, recursive partitioning. The method is
top-down because we start with the original input space X and split it into two child
regions by thresholding on a single feature. We then take one of these child regions and
can partition via a new threshold. We continue the training of our model in a recursive
manner, always selecting a leaf node, a feature, and a threshold to form a new split.
Formally, given a parent region Rp, a feature index j, and a threshold t ∈ R, we
obtain two child regions R1 and R2 as follows:
        R1 = {X | Xj < t, X ∈ Rp}        R2 = {X | Xj ≥ t, X ∈ Rp}
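As a small Octave sketch of this split (Xp and yp are assumed to hold the parent region's examples and labels, j the chosen feature index, and t the threshold):

inR1 = Xp(:, j) < t;              % R1 = {x in Rp : x_j < t}
inR2 = ~inR1;                     % R2 = {x in Rp : x_j >= t}
X1 = Xp(inR1, :);  y1 = yp(inR1);
X2 = Xp(inR2, :);  y2 = yp(inR2);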
The beginning of one such process is shown below applied to the skiing dataset. In step
a, we split the input space X by the location feature, with a threshold of 15, creating
child regions R1 and R2. In step b, we then recursively select one of these child
regions (in this case R2) and select a feature (time) and threshold (3), generating two
more child regions (R21) and (R22). In step c, we select any one of the
remaining leaf nodes (R1, R21, R22). We can continue in such a manner
until we meet a given stopping criterion (more on this later), and then predict the majority
class at each leaf node.
Defining a Loss Function
A natural question to ask at this point is how to choose our splits. To do so, it is first
useful to define our loss L as a set function on a region R. Given a split of a
parent Rp into two child regions R1 and R2, we can compute the loss of the
parent L(Rp) as well as the cardinality-weighted loss of the children,
(|R1| L(R1) + |R2| L(R2)) / (|R1| + |R2|). Within our greedy
partitioning framework, we want to select the leaf region, feature, and threshold that
will maximize our decrease in loss:
        L(Rp) − (|R1| L(R1) + |R2| L(R2)) / (|R1| + |R2|)
For a classification problem, we are interested in the misclassification
loss Lmisclass. For a region R let p̂c be the proportion of examples
in R that are of class c. Misclassification loss on R can be written as:
        Lmisclass(R) = 1 − max_c(p̂c)
We can understand this as the fraction of examples that would be misclassified if
we predicted the majority class for region R (which is exactly what we do). While
misclassification loss is the final value we are interested in, it is not very sensitive to
changes in class probabilities. As a representative example, we show a binary
classification case below. We explicitly depict the parent region Rp as well as the
positive and negative counts in each region.
The first split is isolating out more of the positives, but we note that:
        L(Rp) = (|R1| L(R1) + |R2| L(R2)) / (|R1| + |R2|) = (|R1′| L(R1′) + |R2′| L(R2′)) / (|R1′| + |R2′|) = 100
Thus, not only are the losses of the two splits identical, but neither of
the splits decreases the loss over that of the parent.
We therefore are interested in defining a more sensitive loss. While several have been
proposed, we will focus here on the cross-entropy loss Lcross:
        Lcross(R) = −Σ_c p̂c log2 p̂c
with p̂ log2 p̂ ≡ 0 if p̂ = 0. From an information-theoretic
perspective, cross-entropy measures the number of bits needed to specify the outcome
(or class) given that the distribution is known. Furthermore, the reduction in loss from
parent to child is known as information gain.
To understand the relative sensitivity of cross-entropy loss with respect to
misclassification loss, let us look at plots of both loss functions for the binary
classification case. For these cases, we can simplify our loss functions to depend on just
the proportion of positive examples p̂i in a region Ri:
        Lmisclass(R) = Lmisclass(p̂) = 1 − max(p̂, 1 − p̂)
        Lcross(R) = Lcross(p̂) = −p̂ log p̂ − (1 − p̂) log(1 − p̂)
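A small Octave sketch of these two binary losses as functions of the proportion of positives (pHat); the function names are illustrative:

function l = misclassLoss(pHat)
  l = 1 - max(pHat, 1 - pHat);
end

function l = crossEntropyLoss(pHat)
  terms = [pHat, 1 - pHat];
  terms = terms(terms > 0);          % use the convention 0 * log(0) = 0
  l = -sum(terms .* log2(terms));    % base-2 logs, matching the definition above
end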
In the figure above on the left, we see the cross-entropy loss plotted over p̂. We take the
regions (Rp, R1, R2) from the previous page's example's first split, and plot
their losses as well. As cross-entropy loss is strictly concave, it can be seen from the
plot (and easily proven) that as long as p̂1 ≠ p̂2 and both child regions are
non-empty, then the weighted sum of the children's losses will always be less than that
of the parent.
Misclassification loss, on the other hand, is not strictly concave, and therefore there is
no guarantee that the weighted sum of the children will be less than that of the parent,
as shown above right, with the same partition. Due to this added sensitivity, cross-
entropy loss (or the closely related Gini loss) is used when growing decision trees for
classification.
Before fully moving away from loss functions, we briefly cover the regression setting
for decision trees. For each data point xi we now instead have an associated
value yi ∈ R we wish to predict. Much of the tree growth process remains the same,
with the differences being that the final prediction for a region R is the mean of all
the values:
        ŷ = (Σ_{i∈R} yi) / |R|
And in this case we can directly use the squared loss to select our splits:
        Lsquared(R) = (Σ_{i∈R} (yi − ŷ)²) / |R|
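A tiny Octave sketch of this loss for a region's label vector y (illustrative helper name):

function l = squaredLoss(y)
  l = mean((y - mean(y)).^2);    % the region predicts its mean; the loss is the average squared error
end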
Other Considerations
The popularity of decision trees can in large part be attributed to the ease by which they
are explained and understood, as well as the high degree of interpretability they exhibit:
we can look at the generated set of thresholds to understand why a model made specific
predictions. However, that is not the full picture - we will now cover some additional
salient points.
Categorical Variables
Another advantage of decision trees is that they can easily deal with categorical
variables. As an example, our location in the skiing dataset could instead be represented
as a categorical variable (one of Northern Hemisphere, Southern Hemisphere, or
Equator, i.e. loc ∈ {N, S, E}). Rather than use a one-hot encoding or similar
preprocessing step to transform the data into a quantitative feature, as would be
necessary for the other algorithms we have seen, we can directly probe subset
membership. The final tree in Section 2 can be re-written as:
A caveat to the above is that we must take care to not allow a variable to have too many
categories. For a set of categories S, our set of possible questions is the power
set P(S), of cardinality 2^|S|. Thus, a large number of categories makes
question selection computationally intractable. Optimizations are possible for the
binary classification case, though even here serious consideration should be given to
whether the feature can be re-formulated as a quantitative one instead, as the large
number of possible thresholds lends itself to a high degree of overfitting.
Regularization
In Section 2 we alluded to various stopping criteria we could use to determine when to
halt the growth of a tree. The simplest criterion involves "fully" growing the tree: we
continue until each leaf region contains exactly one training data point. This technique
however leads to a high variance and low bias model, and we therefore turn to various
stopping heuristics for regularization. Some common ones include:
o Minimum Leaf Size - Do not split R if its cardinality falls below a fixed
threshold.
o Maximum Depth - Do not split R if more than a fixed threshold of splits
were already taken to reach R.
o Maximum Number of Nodes - Stop if a tree has more than a fixed threshold
of leaf nodes.
A tempting heuristic to use would be to enforce a minimum decrease in loss after splits.
This is a problematic approach as the greedy, single-feature-at-a-time approach of
decision trees could mean missing higher order interactions. If we require thresholding
on multiple features to achieve a good split, we might be unable to achieve a good
decrease in loss on the initial splits and therefore prematurely terminate. A better
approach involves fully growing out the tree, and then pruning away nodes that
minimally decrease misclassification or squared error, as measured on a validation set.
Runtime
We briefly turn to considering the runtime of decision trees. For ease of analysis, we
will consider binary classification with n examples, f features, and a tree of
depth d. At test time, for a data point we traverse the tree until we reach a leaf node
and then output its prediction, for a runtime of O(d). Note that if our tree is
balanced then d = O(log n), and thus test time performance is generally
quite fast.
At training time, we note that each data point can only appear in at
most O(d) nodes. Through sorting and intelligent caching of intermediate values,
we can achieve an amortized runtime of O(1) at each node for a single data point
for a single feature. Thus, overall runtime is O(nfd), a fairly fast runtime as
the data matrix alone is of size nf.
Bagging
Bootstrap
Bagging stands for "Bootstrap Aggregation" and is a variance reduction ensembling method.
Bootstrap is a method from statistics traditionally used to measure the uncertainty of some
estimator (e.g. the mean).
Say we have a true population P that we wish to compute an estimator for, as well as a training
set S sampled from P (S ∼ P). While we can find an approximation by computing the
estimator on S, we cannot know what the error is with respect to the true value. To do so we
would need multiple independent training sets S1, S2, … all sampled from P.
However, if we make the assumption that S = P, we can generate a new bootstrap
set Z sampled with replacement from S (Z ∼ S, |Z| = |S|). In fact we can
generate many such samples Z1, Z2, …, ZM. We can then look at the variability
of our estimate across these bootstrap sets to obtain a measure of error.
Aggregation
Now, returning to ensembling, we can take each Zm and train a machine learning
model Gm on each, and define a new aggregate predictor:
        G(x) = (Σ_m Gm(x)) / M
This process is called bagging. Referring back to equation (4), we have that the variance
of M correlated predictors is:
        Var(X̄) = ρσ² + ((1 − ρ)/M) σ²
Bagging creates less correlated predictors than if they were all simply trained on S, thereby
decreasing ρ. While the bias of each individual predictor increases due to each bootstrap set
not having the full training set available, in practice it has been found that the decrease in
variance outweighs the increase in bias. Also note that increasing the number of
predictors M can't lead to additional overfitting, as ρ is insensitive to M and therefore
overall variance can only decrease.
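A minimal Octave sketch of this bagging procedure, assuming hypothetical trainTree/predictTree helpers for the base learner (X is N x n, y is N x 1):

for m = 1:M
  idx = randi(N, N, 1);                        % bootstrap sample: N draws with replacement
  models{m} = trainTree(X(idx, :), y(idx));    % train G_m on the bootstrap set Z_m
end
% aggregate predictor G(x) for a new input x: average the M individual predictions
preds = zeros(M, 1);
for m = 1:M
  preds(m) = predictTree(models{m}, x);
end
Gx = mean(preds);                              % for classification, a majority vote could be used instead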
An additional advantage of bagging is called out-of-bag estimation. It can be shown that each
bootstrapped sample only contains approximately 2/3 of S, and thus we can use the
other 1/3 as an estimate of error, called out-of-bag error. In the limit, as M → ∞, out-
of-bag error gives an equivalent result to leave-one-out cross-validation.
Key Takeaways
To summarize, some of the primary benefits of bagging, in the context of decision trees, are:
o Decrease in variance (even more so for random forests)
o Better accuracy
o Free validation set
o Support for missing values
While some of the disadvantages include:
o Increase in bias (even more so for random forests)
o Harder to interpret
o Still not additive
o More expensive
Boosting
Intuition
Bagging is a variance-reducing technique, whereas boosting is used for bias reduction. We
therefore want high bias, low variance models, also known as weak learners. Continuing our
exploration via the use of decision trees, we can make them into weak learners by allowing each
tree to only make one decision before making a prediction; these are known as decision stumps.
We explore the intuition behind boosting via the example above. We start with a dataset on the
left, and allow a single decision stump to be trained, as seen in the middle panel. The key idea
is that we then track which examples the classifier got wrong, and increase their relative weight
compared to the correctly classified examples. We then train a new decision stump which will
be more incentivized to correctly classify these “hard negatives.” We continue as such,
incrementally re-weighting examples at each step, and at the end we output a combination of
these weak learners as an ensemble classifier.
Adaboost
Having covered the intuition, let us look at one of the most popular boosting algorithms,
Adaboost, reproduced below:
o Algorithm: Adaboost
o Input: Labeled training data (x1, y1), (x2, y2), …, (xN, yN)
o Output: Ensemble classifier f(x)
1. wi ← 1/N for i = 1, 2, …, N
2. for m = 0 to M do
3.      Fit weak classifier Gm to the training data weighted by wi
4.      Compute weighted error errm = (Σi wi 1{yi ≠ Gm(xi)}) / (Σi wi)
5.      Compute weight αm = log((1 − errm) / errm)
6.      wi ← wi · exp(αm 1{yi ≠ Gm(xi)})
7. end
8. f(x) = sign(Σm αm Gm(x))
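A hedged Octave sketch of this loop, assuming labels y in {-1, +1} and hypothetical trainStump/predictStump helpers for the weighted weak learner:

w = ones(N, 1) / N;                    % step 1: start with uniform weights
alpha = zeros(M, 1);
stumps = cell(M, 1);
for m = 1:M
  stumps{m} = trainStump(X, y, w);     % step 3: fit weak classifier G_m to the weighted data
  pred = predictStump(stumps{m}, X);   % predictions in {-1, +1}
  miss = (pred ~= y);
  err = sum(w .* miss) / sum(w);       % step 4: weighted error
  alpha(m) = log((1 - err) / err);     % step 5: classifier weight
  w = w .* exp(alpha(m) * miss);       % step 6: up-weight the misclassified examples
end
% step 8: final classifier f(x) = sign(sum over m of alpha(m) * G_m(x))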
The weightings for each example begin out even, with misclassified examples being further up-
weighted at each step, in a cumulative fashion. The final aggregate classifier is a summation of
all the weak learners, weighted by the negative log-odds of the weighted error.
We can also see that due to the final summation, this ensembling method allows for modeling
of additive terms, increasing the overall modeling capability (and variance) of the final model.
Each new weak learner is no longer independent of the previous models in the sequence,
meaning that increasing MM leads to an increase in the risk of overfitting.
The exact weightings used for Adaboost appear to be somewhat arbitrary at first glance, but
can be shown to be well justified. We shall approach this in the next section through a more
general framework of which Adaboost is a special case.
Gradient Boosting
In general, it is not always easy to write out a closed-form solution to the minimization problem
presented in Forward Stagewise Additive Modeling. High-performing methods such as XGBoost
resolve this issue by turning to numerical optimization.
One of the most obvious things to do in this case would be to take the derivative of the loss and
perform gradient descent. However, the complication is that we are restricted to taking steps in
our model class - we can only add in parameterized weak learners G(x, γ), not make
arbitrary moves in the input space.
In gradient boosting, we instead compute the gradient at each training point with respect to the
current predictor (typically a decision stump):
gi = ∂L(yi, f(xi)) / ∂f(xi)
We then train a new regression predictor to match this gradient and use it as the gradient step.
In Forward Stagewise Additive Modeling, this works out to:
γi = argmin_γ Σ_{i=1..N} (gi − G(xi; γ))²
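A hedged Octave sketch of this idea for squared loss, where the pseudo-residual at each point is simply yi − f(xi); fitRegressionStump, predictStump and the shrinkage rate nu are assumptions, not part of the notes:

f = zeros(N, 1);                             % current ensemble predictions f(x_i)
nu = 0.1;                                    % assumed shrinkage / learning rate
for m = 1:M
  g = y - f;                                 % residuals: the (negative) gradient of squared loss at each point
  stump = fitRegressionStump(X, g);          % regression weak learner fit to match the gradient
  f = f + nu * predictStump(stump, X);       % take a (shrunk) step in "model space"
end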
Key Takeaways
To summarize, some of the primary benefits of boosting are:
o Decrease in bias
o Better accuracy
o Additive modeling
While some of the disadvantages include:
o Increase in variance
o Prone to overfitting