Machine Learning
For example, one kind of algorithm is a classification algorithm. It can put data into different groups.
The classification algorithm used to detect handwritten alphabets could also be used to classify emails
into spam and not-spam.
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P
if its performance at tasks in T, as measured by P, improves with experience E.” – Tom M. Mitchell
For example, for a checkers-playing program: E = the experience of playing many games, T = the task of playing checkers, and
P = the probability that the program will win the next game.
There are many examples of machine learning. Here are a few examples of classification problems
where the goal is to categorize objects into a fixed set of categories.
Weather prediction: Predict, for instance, whether or not it will rain tomorrow.
Machine Learning is a field that grew out of Artificial Intelligence (AI). With AI we wanted to build better,
more intelligent machines, but apart from a few simple tasks, such as finding the shortest path between
points A and B, we were unable to program solutions to more complex and constantly evolving problems.
The realisation came that the only way to achieve this was to let the machine learn on its own, much like a
child learns from its own experience. Machine learning was therefore developed as a new capability for
computers, and it is now present in so many segments of technology that we often do not even realise we
are using it.
Finding patterns in data used to be something only human brains could do, but as data sets grow very large,
the time needed to analyse them grows too. This is where machine learning comes into action: it helps
people extract insight from large amounts of data in far less time.
While big data and cloud computing are gaining importance for their own contributions, machine learning as
a technology helps analyse those large chunks of data in an automated process, easing the task of data
scientists, and it is gaining equal importance and recognition.
The techniques we use for data mining have been around for many years, but in the past they were not
effective because we lacked the computing power to run the algorithms at scale. Run the same techniques,
such as deep learning, with access to more compute and better data, and the results can lead to dramatic
breakthroughs; this is what modern machine learning makes possible. Machine learning approaches are
commonly divided into three categories:
a. Supervised Learning
b. Unsupervised Learning
c. Reinforcement Learning
Supervised Learning
In supervised learning, the system tries to learn from previously given examples. (In unsupervised
learning, by contrast, the system attempts to find patterns directly from the examples given.)
Speaking mathematically, supervised learning is where you have both input variables (X) and output
variables (Y) and use an algorithm to learn the mapping function from the input to the output, Y = f(X).
Example: predicting the price of a house from features such as its size and location, given past sales data.
Supervised learning problems can be further divided into two parts, namely classification, and
regression.
Classification: A classification problem is when the output variable is a category or a group, such as
“black” or “white” or “spam” and “no spam”.
Regression: A regression problem is when the output variable is a real value, such as “Rupees” or
“height.”
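To make the distinction concrete, here is a small sketch (my own illustration with made-up data, not taken from the source): a toy classifier that outputs a category and a toy regression that outputs a real value.
import numpy as np

# Classification: predict a category ("spam" / "not spam") with a trivial 1-nearest-neighbour rule.
emails_x = np.array([0.1, 0.2, 0.8, 0.9])            # e.g. fraction of suspicious words (made up)
emails_y = np.array(["not spam", "not spam", "spam", "spam"])
new_email = 0.75
nearest = np.argmin(np.abs(emails_x - new_email))
print("Classification output:", emails_y[nearest])    # a category: "spam"

# Regression: predict a real value (height in cm) from age in years with a straight-line fit.
ages = np.array([10, 12, 14, 16])
heights = np.array([140.0, 150.0, 160.0, 170.0])
slope, intercept = np.polyfit(ages, heights, 1)
print("Regression output:", slope * 15 + intercept)   # a real value, roughly 165.0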
Unsupervised Learning
In unsupervised learning, the algorithms are left to themselves to discover interesting structures in the
data.
Mathematically, unsupervised learning is when you only have input data (X) and no corresponding
output variables.
This is called unsupervised learning because unlike supervised learning above, there are no given
correct answers and the machine itself finds the answers.
Unsupervised learning problems can be further divided into association and clustering problems.
Association: An association rule learning problem is where you want to discover rules that describe
large portions of your data, such as “people that buy X also tend to buy Y”.
Clustering: A clustering problem is where you want to discover the inherent groupings in the data,
such as grouping customers by purchasing behaviour.
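As a small illustration of clustering (my own sketch, with made-up customer-spend numbers), a two-cluster k-means loop groups customers by purchasing behaviour:
import numpy as np

# Hypothetical yearly spend of eight customers (purely illustrative).
spend = np.array([120.0, 150.0, 130.0, 900.0, 950.0, 980.0, 140.0, 920.0])

# Tiny 2-means loop: assign each point to the nearest centre, then move each
# centre to the mean of its assigned points.
centres = np.array([spend.min(), spend.max()])
for _ in range(10):
    labels = np.argmin(np.abs(spend[:, None] - centres[None, :]), axis=1)
    centres = np.array([spend[labels == k].mean() for k in range(2)])

print("cluster centres:", centres)   # roughly [135, 937] -> low spenders vs high spenders
print("cluster labels :", labels)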
Reinforcement Learning
A computer program will interact with a dynamic environment in which it must perform a particular
goal (such as playing a game with an opponent or driving a car). The program is provided feedback in
terms of rewards and punishments as it navigates its problem space.
Using this approach, the machine is trained to make specific decisions. It works this way: the machine
is exposed to an environment where it continuously trains itself by trial and error.
Example:
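The original example content appears to be missing here, so as an illustrative stand-in (my own sketch, not from the source) here is a tiny trial-and-error learner: a two-armed bandit agent that discovers from reward feedback which action pays off more.
import numpy as np

# Hypothetical two-armed bandit: the agent repeatedly picks an action, receives a
# reward, and learns by trial and error which action has the higher payoff.
rng = np.random.default_rng(0)
true_win_prob = [0.3, 0.7]          # unknown to the agent
value_estimate = [0.0, 0.0]         # the agent's running reward estimates
pulls = [0, 0]
epsilon = 0.1                       # exploration rate

for step in range(2000):
    if rng.random() < epsilon:                  # explore occasionally
        action = rng.integers(2)
    else:                                       # otherwise exploit the best estimate so far
        action = int(np.argmax(value_estimate))
    reward = 1.0 if rng.random() < true_win_prob[action] else 0.0   # environment feedback
    pulls[action] += 1
    # incremental average update of the estimated value of the chosen action
    value_estimate[action] += (reward - value_estimate[action]) / pulls[action]

print("estimated values:", value_estimate)      # should approach [0.3, 0.7]
print("times each action was chosen:", pulls)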
Machine Learning theory is a field at the intersection of statistics, probability, computer science and
algorithms, concerned with learning iteratively from data in order to build intelligent applications.
There are various reasons why the mathematics of Machine Learning is necessary, and I will highlight
some of them below:
Selecting the appropriate algorithm for the problem involves considerations of accuracy, training time,
model complexity, the number of parameters and the number of features.
The foremost question when trying to understand a field such as Machine Learning is the amount of
maths necessary and the complexity of maths required to understand these systems.
The answer to this question is multidimensional and depends on the level and interest of the individual.
Here is the minimum level of mathematics that is needed for Machine Learning Engineers / Data
Scientists.
2. Probability Theory and Statistics (Probability Rules & Axioms, Bayes’ Theorem, Random
Variables, Variance and Expectation, Conditional and Joint Distributions, Standard Distributions.)
Closing Notes
Thanks for reading! Hopefully, you’re now able to understand what Machine Learning is and its
applications.
Source: https://towardsdatascience.com/introduction-to-machine-learning-db7c668822c4
Artificial Intelligence v/s Machine Learning
Source: https://www.quora.com/What-are-the-main-differences-between-artificial-intelligence-and-machine-learning-Is-machine-learning-a-part-of-artificial-intelligence
Source: https://www.geeksforgeeks.org/difference-between-machine-learning-and-artificial-intelligence/
Overview
Artificial Intelligence: The term Artificial Intelligence is made up of two words, “Artificial” and
“Intelligence”. Artificial refers to something made by humans rather than occurring naturally, and
intelligence means the ability to understand or think. A common misconception is that Artificial
Intelligence is a system; it is not a system, it is implemented in a system. There are many possible
definitions of AI; one is “the study of how to train computers so that computers can do things which,
at present, humans do better.” It is therefore intelligence in which we want to give machines all the
capabilities that humans possess.
Machine Learning: Machine Learning is learning in which a machine learns on its own without being
explicitly programmed. It is an application of AI that provides a system with the ability to automatically
learn and improve from experience. Here we can generate a program by integrating the input and output of
that program. One simple definition of Machine Learning is: “A machine is said to learn from experience E
with respect to some class of tasks T and a performance measure P if the learner's performance at the
tasks in the class, as measured by P, improves with experience E.”
Artificial Intelligence: The aim is to increase the chance of success, not accuracy.
Machine Learning: The aim is to increase accuracy; it does not care about success.
Reference: www.techrepublic.com/article/understanding-the-differences-between-ai-machine-learning-and-deep-learning
Source: https://www.geeksforgeeks.org/difference-between-machine-learning-and-artificial-intelligence/
Example:
A value y is a function of x:
y = f(x)
f(x) = θ0 + θ1x
where
θ0 : Y Intercept of the line (where it crosses Y axis when x=0) and
θ1 : Slope of the line.
Goodness of Fit:
The difference between the fitted value and the observed value is called the residual, or error. To measure the error we
use a cost function. The objective is to find the best-fitting line through the data, i.e. the line that minimizes the error.
In the case of linear regression the Sum of Squared Errors (SSE) is used as the cost function. The reasons for preferring
SSE as the cost function over the simple sum of errors or the sum of absolute errors are:
o A simple sum of errors does not reflect the true size of the error, because for some points the error is negative and
for others it is positive, so the terms cancel out. Hence we have to use either squared or absolute values.
o The absolute error function is not differentiable everywhere, and it is hard to work with the derivative of a
function involving absolute values. Most machine learning algorithms use gradient descent, which requires a
differentiable function.
(Figure: an example of a non-differentiable function. The blue line represents the function and the pink line
its derivative; at x = −2 and x = 3 the rate of change takes multiple values, hence the function is not
differentiable at those points.)
o Squaring makes the cost function convex, which ensures faster convergence and a global optimum in the
case of gradient descent.
o The least-squares approach penalizes large errors heavily, so the fit is strongly influenced by outliers.
Coefficient of Determination
The coefficient of determination tells us how well the estimated regression equation fits our data. The Total Sum of
Squares (SST) is partitioned into the Sum of Squared Errors (SSE) and the Sum of Squares due to Regression (SSR).
The coefficient of determination expresses this ratio as a percentage:
R² = SSR / SST = 1 − SSE / SST
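As a minimal numeric illustration of these quantities (my own sketch with made-up data, not from the source), the following computes the hypothesis f(x) = θ0 + θ1x, its SSE and the resulting R²:
import numpy as np

# Made-up data and illustrative parameter values (assumptions, not from the source).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
theta0, theta1 = 0.1, 2.0              # intercept and slope of the candidate line

predictions = theta0 + theta1 * x      # f(x) = theta0 + theta1 * x
residuals = y - predictions            # observed minus fitted values

sse = np.sum(residuals ** 2)           # Sum of Squared Errors
sst = np.sum((y - y.mean()) ** 2)      # Total Sum of Squares
r_squared = 1 - sse / sst              # coefficient of determination

print("SSE =", sse, "R^2 =", r_squared)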
Steps (gradient descent for minimizing a function f(x)):
1. Set i = 1.
2. Pick a starting value, say x1 = 3.
3. Take the derivative f′(x) of f(x):
f′(x) = 2x − 2
4. Evaluate the derivative at the current value of x:
f′(3) = 2·3 − 2 = 4
[
We know from calculus that the derivative at a minimum is 0 (the global minimum, for a convex function).
If the derivative is positive, the function value is increasing in that direction.
By studying the derivative we know whether we are getting closer to or further away from the minimum.
]
5. Set xi+1 = xi − α f′(xi)
Here α is the learning rate. Assume it is 0.02:
x2 = 3 − 0.02 × 4 = 2.92
6. Go back to step 4 and repeat until the decrease in f(x) is less than some threshold value, i.e. it has converged to the minimum.
GRADIENT ASCENT: If instead the problem is for maximization use following equation in step number 5:
xi+1 = xi + α f’(xi)
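The worked example above can be run directly; here is a short sketch (my own, following the listed steps) that iterates x ← x − α·f′(x) for f′(x) = 2x − 2, starting from x = 3:
# Gradient descent on a function with f'(x) = 2x - 2, following the steps above.
def f_prime(x):
    return 2 * x - 2

x = 3.0            # step 2: starting value
alpha = 0.02       # learning rate from the example
for i in range(1000):
    step = alpha * f_prime(x)      # step 5: x_{i+1} = x_i - alpha * f'(x_i)
    x = x - step
    if abs(step) < 1e-6:           # step 6: stop once the updates become tiny
        break

print("converged to x =", round(x, 4))   # the analytic minimum is at x = 1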
If finding minima using calculus is so easy, then why is gradient descent preferred?
Because in practice the function may be very complicated, for example a cost function of many parameters
where x is a vector. In such cases finding the minimum or maximum analytically is very hard.
This is why gradient descent is useful.
Another reason is that it is fast.
Learning Rate α:
It controls the size of each step.
If the learning rate is too high, the solution may overshoot and fail to converge.
If the learning rate is too low, it will take a very large number of iterations to converge.
Cost Function
We can measure the accuracy of our hypothesis function by using a cost function. This takes an average difference
(actually a fancier version of an average) of all the results of the hypothesis with inputs from x's and the actual output
y's.
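For reference, the cost function being described here is the standard squared-error cost from the course (the original formula appears to be missing from this copy):
J(θ0, θ1) = (1/2m) · Σ from i=1 to m of (hθ(x(i)) − y(i))²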
This function is otherwise called the "Squared error function", or "Mean squared error". The mean is halved (a factor of
½) as a convenience for the computation of gradient descent, because the derivative of the square term cancels out
the ½ term.
So we have our hypothesis function and we have a way of measuring how well it fits into the data. Now we need to
estimate the parameters in the hypothesis function. That's where gradient descent comes in.
Imagine that we graph our hypothesis function based on its parameters θ0 and θ1 (actually we are graphing the cost
function as a function of the parameter estimates). We are not graphing x and y themselves, but the parameter range of
our hypothesis function and the cost resulting from selecting a particular set of parameters.
We put θ0 on the x axis and θ1 on the y axis, with the cost function on the vertical z axis. The points on our graph will be
the result of the cost function using our hypothesis with those specific theta parameters. The graph below depicts
such a setup.
We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, i.e. when
its value is the minimum. The red arrows show the minimum points in the graph.
The way we do this is by taking the derivative (the tangential line to a function) of our cost function. The slope of the
tangent is the derivative at that point and it will give us a direction to move towards. We make steps down the cost
function in the direction with the steepest descent. The size of each step is determined by the parameter α, which is
called the learning rate.
For example, the distance between each 'star' in the graph above represents a step determined by our parameter α. A
smaller α would result in a smaller step and a larger α results in a larger step. The direction in which the step is taken
is determined by the partial derivative of J(θ0,θ1). Depending on where one starts on the graph, one could end up at
different points. The image above shows us two different starting points that end up in two different places.
The general form of the update is θj := θj − α · ∂/∂θj J(θ0, θ1), where α is the learning rate and the
partial derivative determines the direction of steepest descent.
When specifically applied to the case of linear regression, a new form of the gradient descent equation can
be derived. We can substitute our actual cost function and our actual hypothesis function and modify the
equation to:
θ0 := θ0 − α · (1/m) · Σ from i=1 to m of (hθ(x(i)) − y(i))
θ1 := θ1 − α · (1/m) · Σ from i=1 to m of (hθ(x(i)) − y(i)) · x(i)
where m is the size of the training set, θ0 is a constant that is updated simultaneously with θ1, and x(i), y(i)
are the values of the given training set (data).
The point of all this is that if we start with a guess for our hypothesis and then repeatedly apply these
gradient descent equations, our hypothesis will become more and more accurate.
So, this is simply gradient descent on the original cost function J. This method looks at every example in the
entire training set on every step, and is called batch gradient descent. Note that, while gradient descent
can be susceptible to local minima in general, the optimization problem we have posed here for linear
regression has only one global, and no other local, optima; thus gradient descent always converges
(assuming the learning rate α is not too large) to the global minimum. Indeed, J is a convex quadratic
function. Here is an example of gradient descent as it is run to minimize a quadratic function.
Linear Regression with Multiple Variables
Logistic Regression
Source: Machine Learning by Andrew NG
Make sure that if you are developing machine learning systems, you know how to choose one of
the most promising avenues to spend your time pursuing. Concretely, what we will focus on is this
problem: suppose you are developing a machine learning system, or trying to improve the
performance of one; how do you go about deciding which of the possible avenues is worth trying?
The question is: what should you try next in order to improve the learning algorithm?
1. One thing they could try, is to get more training examples. And concretely, you can imagine,
maybe, you know, setting up phone surveys, going door to door, to try to get more data on how
much different houses sell for.
o And the sad thing is a lot of people spend a lot of time collecting more training
examples, thinking oh, if we have twice as much or ten times as much training data, that
is certainly going to help, right?
o But sometimes getting more training data doesn't actually help and in the next few
videos we will see why, and we will see how you can avoid spending a lot of time
collecting more training data in settings where it is just not going to help.
2. Try a smaller set of features. So if you have some set of features such as x1, x2, x3 and so on,
maybe a large number of features. Maybe you want to spend time carefully selecting some small
subset of them to prevent overfitting.
3. Or maybe you need to get additional features. Maybe the current set of features aren't
informative enough and you want to collect more data in the sense of getting more features.
4. Try adding polynomial features, things like x². We can still spend quite a lot of time thinking
about that, and we can also try other things like decreasing λ, the regularization parameter,
or increasing λ.
We are given a menu of options like these, some of which can easily scale up to six-month or longer
projects.
Unfortunately, the most common way people pick one of these options is by gut feeling: many people will
more or less randomly pick one of these options and say, "Oh, let's go and get more training data," and
easily spend six months collecting more training data; or someone else might say, "Well, let's go collect a
lot more features on these houses in our data set." Many people spend, literally, six months pursuing one of
these avenues chosen more or less at random, only to discover six months later that it really wasn't a
promising avenue to pursue.
Fortunately, there is a pretty simple technique that can let you very quickly rule out half of the things on
this list as being potentially promising things to pursue. And there is a very simple technique, that if
you run, can easily rule out many of these options, and potentially save you a lot of time pursuing
something that's just is not going to work.
In the next section we discuss machine learning diagnostics and learn how to improve the performance of a
machine learning algorithm.
Suppose you need to decide what degree of polynomial to fit to a data set, that is, which features to
include in your learning algorithm; or suppose you need to choose the regularization parameter λ for a
learning algorithm. How do you do that? This is called the model selection process. In our discussion of
how to do this, we'll talk about not just how to split your data into training and test sets, but how to split
your data into what are called the training, validation, and test sets, and we'll see how to use them to do
model selection. We've already seen many times the problem of overfitting, in
which just because a learning algorithm fits a training set well, that doesn't mean it's a good
hypothesis. More generally, this is why the training set's error is not a good predictor for how well the
hypothesis will do on new examples. Concretely, if you fit some set of parameters θ0, θ1, θ2, and so on, to
your training set, then the fact that your hypothesis does well on the training set doesn't mean much in
terms of predicting how well it will generalize to new examples not seen in the training set. The more
general principle is that once your parameters have been fit to some set of data (maybe the training set,
maybe something else), the error of your hypothesis measured on that same data set, such as the training
error, is unlikely to be a good estimate of your actual generalization error, that is, of how well the
hypothesis will generalize to new examples.
Now, one thing one could do then is, in order to select one of these models, we could then see which
model has the lowest test set error. And let's just say for this example that ended up choosing the fifth
order polynomial. So, this seems reasonable so far. But now let's say we want to take my fifth
hypothesis, fifth order model, and how well does this model generalize? One thing we could do is look
at how well my fifth order polynomial hypothesis had done on my test set. But the problem is this will
not be a fair estimate of how well my hypothesis generalizes. The reason is that we have fit this extra
parameter d, the degree of the polynomial, using the test set; namely, we chose the value of d that gave us
the best possible performance on the test set.
And so, the performance of my parameter vector θ5, on the test set, that's likely to be an overly
optimistic estimate of generalization error.
So, because we have fit this parameter d to the test set, it is no longer fair to evaluate the hypothesis
on this test set: we chose the degree d of the polynomial using the test set, and so the hypothesis is likely
to do better on this test set than it would on new examples it hasn't seen before, and that is what we
really care about. So just to reiterate, on the
previous slide, we saw that if we fit some set of parameters, you know, say θ1, θ2, and so on, to some
training set, then the performance of the fitted model on the training set is not predictive of how well
the hypothesis will generalize to new examples. It is because these parameters were fit to the training
set, so they're likely to do well on the training set, even if the parameters don't do well on other
examples. And, in the procedure I just described on this line, we just did the same thing. And
specifically, what we did was, we fit this parameter d to the test set. And by having fit the parameter to
the test set, this means that the performance of the hypothesis on that test set may not be a fair estimate
of how well the hypothesis is, is likely to do on examples we haven't seen before.
To address this problem, in a model selection setting, if we want to evaluate a hypothesis, this is
what we usually do instead. Given the data set, instead of just splitting into a training test set, what
we're going to do is then split it into three pieces:
1. Training Set
2. Validation Set (also called as Cross Validation Set )
3. Testing Set
And the first piece is going to be called the training set as usual. So let me call this first part the
training set. And the second piece of this data I'm going to call the cross validation set; sometimes it's
also called just the validation set, instead of the cross validation set.
And the last piece will be called the usual test set. The typical ratio at which to split these is to send
60% of your data to your training set, maybe 20% to your cross validation set, and 20% to your test set.
These numbers can vary a little bit, but this split is pretty typical. And so our training set will now be only
maybe 60% of the data, and our cross-validation set, or our validation set, will have some number of examples.
Notations:
And finally we also have a test set over here with our mtest being the number of test examples. So, now
that we've defined the training validation or cross validation and test sets. We can also define the
training error, cross validation error, and test error.
So when faced with a model selection problem like this, what we're going to do is, instead of using the
test set to select the model, we're instead going to use the validation set, or the cross validation set, to
select the model. Concretely, we're going to first take our first hypothesis, take this first model, and
say, minimize the cost function, and this would give me some parameter vector θ1 for the first (linear) model.
We do the same thing for the quadratic model. Get some parameter vector θ2 so on, down to θ10 for the
polynomial. And what we are going to do is, instead of testing these hypotheses on the test set, we
instead going to test them on the cross validation set. And measure Jcv, to see how well each of these
hypotheses do on my cross validation set. And pick the hypothesis with the lowest cross validation
error. So for this example, let's say for the sake of argument, that it was my 4th order polynomial, that
had the lowest cross validation error. So in that case I'm going to pick this fourth order polynomial
model. And finally, what this means is that the parameter d (remember, d is the degree of the
polynomial, so d equals two, d equals three, all the way up to d equals 10) has been fit using the
cross-validation set: we fit that parameter d and chose d equals four. And so this degree of polynomial,
this parameter, has not been fit to the test set; we have saved the test set aside, and we can use the test set
to measure, or to estimate, the generalization error of the model that was selected.
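As a concrete sketch of this procedure (my own illustration, not code from the course; the data and helper choices are assumptions), the following splits a data set 60/20/20, fits polynomial models of several degrees on the training set, picks the degree with the lowest cross-validation error, and only then reports the test error of the chosen model:
import numpy as np

rng = np.random.default_rng(0)

# Made-up 1-D data set: a noisy cubic-ish curve (purely illustrative).
x = np.linspace(-3, 3, 200)
y = 0.5 * x**3 - x + rng.normal(scale=2.0, size=x.size)

# 60% / 20% / 20% split into training, cross-validation and test sets.
idx = rng.permutation(x.size)
n_train, n_cv = int(0.6 * x.size), int(0.2 * x.size)
tr, cv, te = idx[:n_train], idx[n_train:n_train + n_cv], idx[n_train + n_cv:]

def mse(theta, xs, ys):
    # squared-error cost of a polynomial fit
    return np.mean((np.polyval(theta, xs) - ys) ** 2)

# Fit one model per candidate degree d on the TRAINING set only.
candidates = {}
for d in range(1, 11):
    candidates[d] = np.polyfit(x[tr], y[tr], d)

# Select d using the CROSS-VALIDATION error, not the test error.
best_d = min(candidates, key=lambda d: mse(candidates[d], x[cv], y[cv]))

# Only the final, selected model touches the test set.
print("selected degree:", best_d)
print("test error of selected model:", mse(candidates[best_d], x[te], y[te]))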
Overfitting and Underfitting With Machine
Learning Algorithms
The cause of poor performance in machine learning is either overfitting or underfitting the data.
Approximate a Target Function in Machine Learning
Supervised machine learning is best understood as approximating a target function (f) that maps input
variables (X) to an output variable (Y).
Y = f(X)
This characterization describes the range of classification and prediction problems and the machine learning
algorithms that can be used to address them. An important consideration in learning the target function from the
training data is how well the model generalizes to new data. Generalization is important because the data we collect
is only a sample; it is incomplete and noisy.
In machine learning we describe the learning of the target function from training data as inductive
learning.
Induction refers to learning general concepts from specific examples which is exactly the problem that
supervised machine learning problems aim to solve. This is different from deduction that is the other
way around and seeks to learn specific concepts from general rules.
Generalization refers to how well the concepts learned by a machine learning model apply to specific
examples not seen by the model when it was learning.
The goal of a good machine learning model is to generalize well from the training data to any data from
the problem domain. This allows us to make predictions in the future on data the model has never seen.
There is a terminology used in machine learning when we talk about how well a machine learning
model learns and generalizes to new data, namely overfitting and underfitting.
Overfitting and underfitting are the two biggest causes for poor performance of machine learning
algorithms.
Statistical Fit
This is good terminology to use in machine learning, because supervised machine learning algorithms
seek to approximate the unknown underlying mapping function for the output variables given the input
variables.
Statistics often describe the goodness of fit which refers to measures used to estimate how well the
approximation of the function matches the target function.
Some of these methods are useful in machine learning (e.g. calculating the residual errors), but some of
these techniques assume we know the form of the target function we are approximating, which is not
the case in machine learning.
If we knew the form of the target function, we would use it directly to make predictions, rather than
trying to learn an approximation from samples of noisy training data.
Overfitting refers to a model that models the training data too well.
Overfitting happens when a model learns the detail and noise in the training data to the extent that it
negatively impacts the performance of the model on new data. This means that the noise or random
fluctuations in the training data are picked up and learned as concepts by the model. The problem is that
these concepts do not apply to new data and negatively impact the model's ability to generalize.
Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when
learning a target function. As such, many nonparametric machine learning algorithms also include
parameters or techniques to limit and constrain how much detail the model learns.
For example, decision trees are a nonparametric machine learning algorithm that is very flexible and is
subject to overfitting training data. This problem can be addressed by pruning a tree after it has learned
in order to remove some of the detail it has picked up.
Underfitting refers to a model that can neither model the training data nor generalize to new data.
An underfit machine learning model is not a suitable model and will be obvious as it will have poor
performance on the training data.
Underfitting is often not discussed as it is easy to detect given a good performance metric. The remedy
is to move on and try alternate machine learning algorithms. Nevertheless, it does provide a good
contrast to the problem of overfitting.
Ideally, you want to select a model at the sweet spot between underfitting and overfitting.
To understand this goal, we can look at the performance of a machine learning algorithm over time as it
is learning a training data. We can plot both the skill on the training data and the skill on a test dataset
we have held back from the training process.
Over time, as the algorithm learns, the error for the model on the training data goes down and so does
the error on the test dataset. If we train for too long, the performance on the training dataset may
continue to decrease because the model is overfitting and learning the irrelevant detail and noise in the
training dataset. At the same time the error for the test set starts to rise again as the model’s ability to
generalize decreases.
The following figure depicts an example of high bias. In other words, the model is underfitting. The
data points obviously follow some sort of curve, but our predictor isn’t complex enough to capture that
information. Our model is biased in that it assumes that the data will behave in a certain fashion
(linear, quadratic, etc.) even though that assumption may not be true. A key point is that there’s
nothing wrong with our training—this is the best possible fit that a linear model can achieve. There is,
however, something wrong with the model itself in that it’s not complex enough to model our data.
High Bias Error (case of underfitting): When Training error is high and cross validation/test
error is high
High Variance (case of overfitting): When Training error is low and cross validation/test error
is high
Learning Curve
Learning curves are often a very useful thing to plot, either if you want to check that your algorithm is
working correctly, or if you want to improve its performance.
A learning curve is a tool that is used very often to diagnose whether a particular learning algorithm may
be suffering from a bias problem, a variance problem, or a bit of both.
To plot a learning curve, we usually plot Jtrain(θ) and Jcv(θ) as a function of m, that is, as a function of
the number of training examples we have. To plot the graph we vary the number of training examples
(10, 20, 30, … up to m) and plot the training error and the cross validation error for each of these smaller
training set sizes.
Suppose we have only one training example, like that shown in the first example, and we fit a quadratic
function: we can fit it perfectly and get zero error on that one training example. As m increases, the
training error starts increasing (while using the same quadratic function).
However, a model trained on a very small training set does not generalize well, hence the cross validation
error is high when m is small and starts decreasing as m increases.
After some threshold value of m the error curves flatten out, so increasing m beyond this point does not
help to improve the performance.
Learning curve in case of high Bias
Suppose the hypothesis has high bias, for example data that can't really be fit well by a straight line, so
we end up with a hypothesis that is a poor straight-line fit.
Now let's think about what would happen if we increase the training set size. If instead of five examples
we have a lot more training examples and fit a straight line to them, we end up with pretty much the
same straight line.
Hence in this case increasing the size of the training set will not improve the performance (the straight
line isn't going to change that much).
The learning curves for the training and validation errors then look as follows:
But by the time you have reached a certain number of training examples, you have almost fit the best
possible straight line, and even if you end up with a much larger training set size, a much larger value
of m, you know, you're basically getting the same straight line, and so, the cross-validation error or test
set error flatten out pretty soon, once you reached beyond a certain the number of training examples.
Well, the training error will again be small. And what you find in the high bias case is that the training
error will end up close to the cross validation error, because you have so few parameters and so much
data, at least when m is large.
The performance on the training set and the cross validation set will be very similar. And so, this is
what your learning curves will look like, if you have an algorithm that has high bias.
Finally, the problem with high bias is reflected in the fact that both the cross validation error and the
training error are high.
This also implies something very interesting: if a learning algorithm has high bias, then as we get more
and more training examples the cross validation error doesn't go down much, it basically flattens out, so
getting more training data will not help much if the learning algorithm is really suffering from high bias.
So knowing if your learning algorithm is suffering from high bias seems like a useful thing to know
because this can prevent you from wasting a lot of time collecting more training data where it might
just not end up being helpful.
Now let us look at the high variance case. Suppose you have a very small training set, say five training
examples as shown in the figure on the right, and suppose we fit a very high order polynomial (a
hundredth degree polynomial, just for example): this is a case of overfitting.
And as this training set size increases a bit, we may still be overfitting this data a little bit but it also
becomes slightly harder to fit this data set perfectly. As the training set size increases, we'll find that
training error increases, because it is just a little harder to fit the training set perfectly when we have
more examples, but the training set error will still be pretty low.
Now, how about the cross validation error?
Well, in the high variance setting, the hypothesis is overfitting, and so the cross validation error will
remain high even as we get a moderate number of training examples. The indicative diagnostic that we
have a high variance problem is the fact that there is a large gap between the training error and the cross
validation error.
If we think about adding more training data, that is, taking this figure and extrapolating to the right, the
two curves (training and validation error) converge towards each other. So if we were to extrapolate this
figure to the right, it seems likely that the training error will keep on going up and the cross-validation
error will keep on coming down.
The thing we really care about is the cross-validation error or the test set error. In this sort of scenario,
we can tell that if we keep on adding training examples and extrapolate to the right, well our cross
validation error will keep on coming down. So,in the high variance setting, getting more training data
is, indeed, likely to help.
Now, on the previous slide and this slide, I've drawn fairly clean fairly idealized curves. If you plot
these curves for an actual learning algorithm, sometimes you will actually see, you know, pretty much
curves, like what I've drawn here. Although, sometimes you see curves that are a little bit noisier and a
little bit messier than this. But plotting learning curves like these can often tell you, can often help you
figure out if your learning algorithm is suffering from bias, or variance or even a little bit of both.
For an implementation of Linear Regression, refer to the Jupyter notebook “ML_ Learning Linear
Regression” and the walkthrough below.
Source: https://utkuufuk.github.io/2018/04/21/linear-regression/
Hey everyone, welcome to my first blog post! This is going to be a walkthrough on training a simple
linear regression model in Python. I’ll show you how to do it from scratch, without using any machine
learning tools or libraries. We’ll only use NumPy and Matplotlib for matrix operations and data
visualization.
We’ll look at a regression problem from a very popular machine learning course taught by Andrew Ng.
Our objective in this problem will be to train a model that accurately predicts the profits of a food truck.
The first column in our dataset file contains city populations and the second column contains food truck profits
in each city, both in units of 10,000 (tens of thousands of people and tens of thousands of dollars).
food_truck_data.txt
6.1101,17.592
5.5277,9.1302
8.5186,13.662
7.0032,11.854
5.8598,6.8233
...
We’re going to use this dataset as a training sample to build our model. Let’s begin by loading it:
import numpy as np
import matplotlib.pyplot as plt
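The line that actually reads the file appears to be missing from this copy; a minimal sketch, assuming a comma-separated file with population in the first column and profit in the second:
data = np.loadtxt("food_truck_data.txt", delimiter=",")
x = data[:, 0]   # city population in 10,000s
y = data[:, 1]   # food truck profit in 10,000s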
Both x and y are one-dimensional arrays, because we have one feature (population) and one target variable (profit) in
this problem. Therefore we can conveniently visualize our dataset using a scatter plot:
fig, ax = plt.subplots()
ax.scatter(x, y, marker="x", c="red")
plt.title("Food Truck Dataset", fontsize=16)
plt.xlabel("City Population in 10,000s", fontsize=14)
plt.ylabel("Food Truck Profit in 10,000s", fontsize=14)
plt.axis([4, 25, -5, 25])
plt.show()
Hypothesis Function
Now we need to come up with a straight line which accurately represents the relationship between
population and profit. This is called the hypothesis function and it’s formulated as:
hθ(x) = θᵀx = θ0 + θ1x1 + θ2x2 + … + θnxn
where x is the feature vector and θ is the vector of model parameters. In our problem there is a single feature, so the
hypothesis reduces to:
hθ(x) = θ0 + θ1x1
As you may have noticed, the number of model parameters is equal to the number of features plus 1. That's because
each feature is weighted by a parameter to control its impact on the hypothesis hθ(x). There is also an independent
parameter θ0 called the intercept term, which defines the point where the hypothesis function intercepts the y axis.
The predictions of a hypothesis function can easily be evaluated in Python by computing the matrix product of the
feature matrix and the parameter vector, but we don't have our model parameters yet. So let's create those as well and
initialize them with zeros:
theta = np.zeros(2)
Before multiplying, we have to make sure that x and θ are compatible for the matrix product. Currently x has 1 column
but θ has 2 rows; the dimensions don't match because of the additional intercept term θ0.
To fix this, we add a new first column to x and set it to all ones. This is essentially equivalent to creating a new feature
x0 = 1. This extra column won't affect the hypothesis whatsoever, because θ0 is simply going to be multiplied by 1:
X = np.ones(shape=(len(x), 2))
X[:, 1] = x
predictions = X @ theta
Of course the predictions are currently all zeros because we haven’t trained our model yet.
Cost Function
The objective in training a linear regression model is to minimize a cost function, which measures the difference
between the actual y values in the training sample and the predictions made by the hypothesis function hθ(x):
J(θ) = (1/2m) · Σ from i=1 to m of (hθ(x(i)) − y(i))²
where m is the number of training examples.
Now let’s take a look at the cost of our initial untrained model:
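The code for this step appears to be missing from this copy; here is a minimal sketch consistent with the formula above (the function name cost matches how it is used later in the walkthrough):
def cost(theta, X, y):
    # squared-error cost: J(theta) = (1/2m) * sum((X @ theta - y)^2)
    errors = X @ theta - y
    return (errors @ errors) / (2 * len(y))

print(cost(theta, X, y))   # with theta = zeros, this should come out to roughly 32 for this dataset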
Gradient Descent
Since our parameters are currently all zeros, we must somehow adjust them to minimize our cost function J(θ). This is
where the gradient descent algorithm comes into play. It's an optimization algorithm which can be used to minimize
differentiable functions, and luckily our cost function J(θ) happens to be a differentiable one.
In each iteration, it takes a small step in the direction opposite to the gradient of J(θ), so that the parameters gradually
come closer to their optimal values. This process is repeated until eventually the minimum cost is achieved.
More formally, gradient descent performs the following update in each iteration:
θj := θj − α · (1/m) · Σ from i=1 to m of (hθ(x(i)) − y(i)) · xj(i)
The α term here is called the learning rate. It allows us to control the size of the step used to update θ in each
iteration. Choosing too large a learning rate may prevent us from converging to the minimum cost, whereas choosing
too small a learning rate may significantly slow down the algorithm.
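The gradient descent implementation itself appears to have been omitted from this copy. Here is a minimal sketch consistent with the update rule above and with how the function is called later in the walkthrough (it returns the learned parameters together with the per-iteration cost history):
def gradient_descent(X, y, learning_rate, num_iters):
    theta = np.zeros(X.shape[1])
    cost_history = np.zeros(num_iters)
    m = len(y)
    for i in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m       # (1/m) * sum of errors weighted by the features
        theta = theta - learning_rate * gradient   # simultaneous update of all parameters
        cost_history[i] = cost(theta, X, y)        # record the cost after each iteration
    return theta, cost_history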
Now let’s use this function to train our model and plot the hypothesis function:
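A sketch of that step (again my own reconstruction; the learning rate and iteration count are assumptions consistent with the values used later in this walkthrough):
theta, _ = gradient_descent(X, y, learning_rate=0.02, num_iters=1200)

fig, ax = plt.subplots()
ax.scatter(x, y, marker="x", c="red", label="training data")
ax.plot(x, X @ theta, linewidth=2, label="hypothesis")
plt.xlabel("City Population in 10,000s", fontsize=14)
plt.ylabel("Food Truck Profit in 10,000s", fontsize=14)
plt.legend()
plt.show()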
Our linear fit looks pretty good, right? The algorithm must have successfully optimized our model.
Well, to be honest, it’s been fairly easy to visualize the hypothesis because there’s only one feature in
this problem.
But what if we had multiple features? Then it wouldn’t be possible to simply plot the hypothesis to see
whether the algorithm has worked as intended or not.
Fortunately, there’s a simple way to debug the gradient descent algorithm irrespective of the number of
features:
1. Modify the gradient descent function to make it record the cost at the end of each iteration.
2. Plot the cost history after the gradient descent has finished.
3. Pat yourself on the back if you see that the cost has monotonically decreased over time.
Let's run gradient descent with a few different learning rates, 0.01, 0.015 and 0.02,
and plot the cost history for each one:
plt.figure()
num_iters = 1200
learning_rates = [0.01, 0.015, 0.02]
for lr in learning_rates:
    _, cost_history = gradient_descent(X, y, lr, num_iters)
    plt.plot(cost_history, linewidth=2)
plt.title("Gradient descent with different learning rates", fontsize=16)
plt.xlabel("number of iterations", fontsize=14)
plt.ylabel("cost", fontsize=14)
plt.legend(list(map(str, learning_rates)))
plt.axis([0, num_iters, 4, 6])
plt.grid()
plt.show()
It appears that the gradient descent algorithm worked correctly for these particular learning rates.
Notice that it takes more iterations to minimize the cost as the learning rate decreases.
Now let’s try a larger learning rate and see what happens:
learning_rate = 0.025
num_iters = 50
_, cost_history = gradient_descent(X, y, learning_rate, num_iters)
plt.plot(cost_history, linewidth=2)
plt.title("Gradient descent with learning rate = " + str(learning_rate),
fontsize=16)
plt.xlabel("number of iterations", fontsize=14)
plt.ylabel("cost", fontsize=14)
plt.axis([0, num_iters, 0, 6000])
plt.grid()
plt.show()
Doesn’t look good… That’s what happens when the learning rate is too large. Even though the gradient
descent algorithm takes steps in the correct direction, these steps are so huge that it’s going to overshoot
the target and the cost diverges from the minimum value instead of converging to it.
Of the learning rates we tried, 0.02 looks like a good choice, because it allows us to minimize the cost and it requires relatively few iterations to converge.
Prediction
Now that we’ve learned how to train our model, we can finally predict the food truck profit for a
particular city:
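A sketch of the final prediction step (my own; the example population of 70,000 is an assumption):
population = 7.0                               # city population in units of 10,000s (i.e. 70,000 people)
profit = np.array([1.0, population]) @ theta   # don't forget the intercept feature x0 = 1
print("predicted profit: $%.2f" % (profit * 10000))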
Source:https://utkuufuk.github.io/2018/05/04/learning-curves/
Learning Curves in Linear & Polynomial
Regression
Learning curves are very useful for analyzing the bias-variance characteristics of a machine learning
model. In this post, I’m going to talk about how to make use of them in a case study of a regression
problem. We’re going to start with a simple linear regression model and improve it as much as we can
by taking advantage of learning curves.
In a nutshell, learning curves show how the training and validation errors change with respect to the
number of training examples used while training a machine learning model.
If a model is balanced, both errors converge to small values as the training sample size
increases.
If a model has high bias, it ends up underfitting the data. As a result, both errors fail to
decrease no matter how many examples there are in the training set.
If a model has high variance, it ends up overfitting the training data. In that case, increasing
the training sample size decreases the training error but it fails to decrease the validation error.
After this incredibly brief introduction, let me introduce you today’s problem where we’ll get to see
learning curves in action. It’s another problem from Andrew Ng’s machine learning course, in which
the objective is to predict the amount of water flowing out of a dam, given the change of water level in
a reservoir.
The dataset file we’re about to read contains historical records on the change in water level and the
amount of water flowing out of the dam. The reason that it’s a .mat file is because this problem is
originally a MATLAB assignment. Fortunately it’s pretty easy to load .mat files in Python using the
loadmat function from SciPy. We’ll also need NumPy and Matplotlib for matrix operations and data
visualization:
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as opt # we'll need this later
import scipy.io as sio
dataset = sio.loadmat("water.mat")
x_train = dataset["X"]
x_val = dataset["Xval"]
x_test = dataset["Xtest"]
Notice that we have to explicitly convert the target variables (y_train, y_val and y_test) to one
dimensional vectors, because they are stored as matrices inside the .mat file.
fig, ax = plt.subplots()
ax.scatter(x_train, y_train, marker="x", s=40, c='red')
plt.xlabel("change in water level", fontsize=14)
plt.ylabel("water flowing out of the dam", fontsize=14)
plt.title("Training sample", fontsize=16)
plt.show()
Alright, it's time to come up with a strategy. First of all, it's clear that there's a nonlinear relationship between x and y.
Normally we would rule out any linear model because of that. However, we are going to begin by training a linear
regression model so that we can see what the learning curves of a model with high bias look like.
Then we’ll train a polynomial regression model which is going to be much more flexible than linear
regression. This will let us see the learning curves of a model with high variance.
Finally we'll add regularization to the existing polynomial regression model and see what a balanced
model's learning curves look like.
Linear Regression
I’ve already shown you in the previous post how to train a linear regression model using gradient
descent. Before proceeding any further, I strongly encourage you to take a look at it if you don’t have at
least a basic understanding of linear regression.
Here I’ll show you an easier way to train a linear regression model using an optimization function
called fmin_cg from scipy.optimize. You can check out the detailed documentation here. The cool
thing about this function is that it’s faster than gradient descent and also you don’t have to select a
learning rate by trial and error.
fmin_cg needs a function that returns the cost and another one that returns the gradient of the cost for a
given hypothesis. We have to pass those to fmin_cg as function arguments. Fortunately we can reuse
some code from the previous post:
We can completely reuse the cost function because it’s independent from the optimization method
that we use.
From the gradient_descent function, we can borrow the part where the gradient of the cost
function is evaluated.
If you look at our cost function, we evaluate the matrix product of the feature matrix X and the parameter vector θ.
Remember, this is only possible if the matrix dimensions match. Therefore we also need a tiny utility function to insert
an additional first column of all ones into a raw feature matrix such as x_train.
def insert_ones(x):
    X = np.ones(shape=(x.shape[0], x.shape[1] + 1))
    X[:, 1:] = x
    return X
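The cost, gradient and training functions themselves appear to be missing from this copy. Below is a minimal sketch of what they might look like, consistent with how cost and train_linear_regression are used later in this walkthrough:
def cost(theta, X, y):
    # squared-error cost, reused from the previous post
    errors = X @ theta - y
    return (errors @ errors) / (2 * len(y))

def gradient(theta, X, y):
    # gradient of the cost, borrowed from the gradient descent implementation
    return X.T @ (X @ theta - y) / len(y)

def train_linear_regression(X, y):
    # fmin_cg minimizes cost(theta) starting from zeros, using the analytic gradient
    initial_theta = np.zeros(X.shape[1])
    return opt.fmin_cg(cost, initial_theta, fprime=gradient, args=(X, y),
                       maxiter=200, disp=False)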
Now let’s train a linear regression model and plot the linear fit on top of the training sample:
X_train = insert_ones(x_train)
theta = train_linear_regression(X_train, y_train)
hypothesis = X_train @ theta
ax.plot(X_train[:, 1], hypothesis, linewidth=2)
fig
The above plot clearly shows that linear regression is not suitable for this task. Let’s also look at its
learning curves and see if we can draw the same conclusion.
To plot the learning curves, we'll start with just a few training examples and increase them one by one. In each
iteration, we'll train a model and evaluate the training error on the current training subset and the validation error on
the whole validation sample:
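The learning_curves function itself appears to be missing from this copy; here is a sketch consistent with how it is called below (starting from two examples, since fitting fewer points than parameters is not very meaningful):
def learning_curves(X_train, y_train, X_val, y_val):
    train_errors, val_errors = [], []
    sizes = range(2, len(y_train) + 1)
    for m in sizes:
        theta = train_linear_regression(X_train[:m], y_train[:m])
        train_errors.append(cost(theta, X_train[:m], y_train[:m]))  # error on the subset used for training
        val_errors.append(cost(theta, X_val, y_val))                # error on the full validation sample
    plt.plot(sizes, train_errors, label="training error")
    plt.plot(sizes, val_errors, label="validation error")
    plt.xlabel("number of training examples", fontsize=14)
    plt.ylabel("error", fontsize=14)
    plt.legend()
    plt.grid()
    plt.show()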
In order to use this function, we have to resize x_val just like we did x_train:
X_val = insert_ones(x_val)
plt.title("Learning Curves for Linear Regression", fontsize=16)
learning_curves(X_train, y_train, X_val, y_val)
As expected, we were unable to sufficiently decrease either the training or the validation error.
Polynomial Regression
Feature Mapping
In order to train a polynomial regression model, the existing feature(s) have to be mapped to artificially
generated polynomial features. Then the rest is pretty much the same drill.
In our case we only have a single feature x1, the change in water level. Therefore we can simply compute the
first several powers of x1 to artificially obtain new polynomial features. Let’s create a simple function for this:
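The poly_features helper is not shown in this copy; a minimal sketch consistent with how it is used below:
def poly_features(x, degree):
    # column k (1-indexed) holds x1 raised to the power k
    X_poly = np.zeros(shape=(len(x), degree))
    for k in range(1, degree + 1):
        X_poly[:, k - 1] = x.flatten() ** k
    return X_poly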
Now let’s generate new feature matrices for training, validation and test samples with 8 polynomial
features in each:
x_train_poly = poly_features(x_train, 8)
x_val_poly = poly_features(x_val, 8)
x_test_poly = poly_features(x_test, 8)
Feature Normalization
Ok we have our polynomial features but we also have a tiny little problem. If you take a closer look at
one of the new matrices, you’ll see that the polynomial features are very imbalanced at the moment. For
instance let’s look at the first few rows of the x_train_poly matrix:
print(x_train_poly[:4, :])
[[ -1.59367581e+01 2.53980260e+02 -4.04762197e+03 6.45059724e+04
-1.02801608e+06 1.63832436e+07 -2.61095791e+08 4.16102047e+09]
[ -2.91529792e+01 8.49896197e+02 -2.47770062e+04 7.22323546e+05
-2.10578833e+07 6.13900035e+08 -1.78970150e+10 5.21751305e+11]
[ 3.61895486e+01 1.30968343e+03 4.73968522e+04 1.71527069e+06
6.20748719e+07 2.24646160e+09 8.12984311e+10 2.94215353e+12]
[ 3.74921873e+01 1.40566411e+03 5.27014222e+04 1.97589159e+06
7.40804977e+07 2.77743990e+09 1.04132297e+11 3.90414759e+12]]
As the polynomial degree increases, the values in the corresponding columns exponentially grow to the
point where they differ by orders of magnitude.
The thing is, the cost function will generally converge much more slowly when the features are
imbalanced like this. So we need to make sure that our features are on a similar scale before we begin
to train our model. We’re going to do this in two steps:
1. Subtract the mean value of each column from itself and make the new mean 0
2. Divide the values in each column by their standard deviation and make the new standard deviation 1
It’s important that we use the mean and standard deviation values from the training sample while
normalizing the validation and test samples.
train_means = x_train_poly.mean(axis=0)
train_stdevs = np.std(x_train_poly, axis=0, ddof=1)
# apply the normalization to all three samples, using the training-sample statistics as described above
x_train_poly = (x_train_poly - train_means) / train_stdevs
x_val_poly = (x_val_poly - train_means) / train_stdevs
x_test_poly = (x_test_poly - train_means) / train_stdevs
X_train_poly = insert_ones(x_train_poly)
X_val_poly = insert_ones(x_val_poly)
X_test_poly = insert_ones(x_test_poly)
Finally we can train our polynomial regression model by using our train_linear_regression
function and plot the polynomial fit. Note that when the polynomial features are simply treated as
independent features, training a polynomial regression model is no different than training a multivariate
linear regression model:
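A sketch of that training and plotting step (my own reconstruction):
theta_poly = train_linear_regression(X_train_poly, y_train)

# plot a smooth polynomial curve over the range of the training data
x_grid = np.linspace(x_train.min() - 10, x_train.max() + 10, 200)
x_grid_poly = (poly_features(x_grid, 8) - train_means) / train_stdevs
curve = insert_ones(x_grid_poly) @ theta_poly

fig, ax = plt.subplots()
ax.scatter(x_train, y_train, marker="x", s=40, c="red")
ax.plot(x_grid, curve, linewidth=2)
plt.xlabel("change in water level", fontsize=14)
plt.ylabel("water flowing out of the dam", fontsize=14)
plt.show()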
What do you think, seems pretty accurate right? Let’s take a look at the learning curves.
Now that’s overfitting written all over it. Even though the training error is very low, the validation error
miserably fails to converge.
It appears that we need something in between in terms of flexibility. Although we can't make linear
regression more flexible, we can decrease the flexibility of polynomial regression using regularization.
Before going further with the example, let's discuss the basics of regularization:
Ridge Regression
It performs ‘L2 regularization’, i.e. it adds a penalty equivalent to the square of the magnitude of the
coefficients. Thus, it optimises the following objective:
Objective = RSS + λ · (sum of squares of coefficients)
Here, λ is the tuning parameter which balances the amount of emphasis given to minimising the RSS vs
minimising the sum of squares of the coefficients. It can take various values:
λ = 0: the objective becomes the same as simple linear regression, and we get the same coefficients.
λ = ∞: the coefficients are shrunk towards zero, because any non-zero coefficient makes the penalty infinite.
0 < λ < ∞: the magnitude of λ decides the weightage given to the different parts of the objective, and the
coefficients will be somewhere between 0 and the ones obtained for simple linear regression.
Lasso Regression
LASSO stands for Least Absolute Shrinkage and Selection Operator. I know it doesn’t give much of
an idea but there are 2 key words here - absolute and selection.
Lasso regression performs L1 regularization, i.e. it adds a penalty equal to the sum of the absolute values of
the coefficients to the optimisation objective:
Objective = RSS + λ · (sum of absolute values of coefficients)
Here, λ works similarly to that of ridge: it can take various values and provides a trade-off between
balancing the RSS and the magnitude of the coefficients.
Selection of λ
λ can be adjusted to help you find a good fit for your model.
It's up to the user to find the optimal value. Cross validation using different values of λ can help you
identify the λ that produces the lowest out-of-sample error.
Key differences between Ridge and Lasso Regression
Ridge: It includes all (or none) of the features in the model. Thus, the major advantage of ridge
regression is coefficient shrinkage and reducing model complexity.
Lasso: Along with shrinking coefficients, lasso performs feature selection as well. (Remember the
‘selection‘ in the lasso full-form?) As we observed earlier, some of the coefficients become exactly
zero, which is equivalent to the particular feature being excluded from the model.
But why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly
equal to zero? Lets explain it in detail in the next section.
Regularized Polynomial Regression
Regularization lets us come up with simpler hypothesis functions that are less prone to overfitting. This is
achieved by penalizing large θ values during the training stage.
Of course we’ll need to reflect these changes to the corresponding Python implementations by
introducing a regularization parameter lamb:
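The regularized implementations are not shown in this copy; a sketch of what they might look like, extending the earlier cost and gradient functions with a regularization parameter lamb (the intercept term θ0 is conventionally not regularized):
def cost_reg(theta, X, y, lamb):
    penalty = lamb * (theta[1:] @ theta[1:]) / (2 * len(y))   # do not penalize theta_0
    return cost(theta, X, y) + penalty

def gradient_reg(theta, X, y, lamb):
    grad = gradient(theta, X, y)
    grad[1:] += lamb * theta[1:] / len(y)
    return grad

def train_regularized_linear_regression(X, y, lamb):
    initial_theta = np.zeros(X.shape[1])
    return opt.fmin_cg(cost_reg, initial_theta, fprime=gradient_reg,
                       args=(X, y, lamb), maxiter=200, disp=False)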
Alright, we're now ready to train a regularized polynomial regression model. Let's set λ = 1 to start with.
Although λ = 1 has significantly improved on the unregularized model, we can do even better by optimizing
λ as well. Here's how we're going to do it:
1. Select a set of candidate λ values.
2. Train a model and compute the validation error for each λ in the set.
3. Find the λ that produces the lowest validation error.
Looks like we've achieved the lowest validation error where λ = 3.
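A sketch of that search (my own; the candidate values are an assumption, chosen to include the λ = 3 mentioned above):
lambdas = [0, 0.01, 0.03, 0.1, 0.3, 1, 3, 10]
val_errors = []
for lamb in lambdas:
    theta = train_regularized_linear_regression(X_train_poly, y_train, lamb)
    # evaluate with the plain (unregularized) cost, since the penalty is only a training device
    val_errors.append(cost(theta, X_val_poly, y_val))

best_lamb = lambdas[int(np.argmin(val_errors))]
print("best lambda:", best_lamb)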
It’s good practice to evaluate an optimized model’s accuracy on a separate test sample other than the
training and validation samples. So let’s train our models once again and compare test errors:
X_test = insert_ones(x_test)
theta = train_linear_regression(X_train, y_train)
test_error = cost(theta, X_test, y_test)
print("Test Error =", test_error, "| Linear Regression")
Bayesian Classifiers
Bayesian classifiers work based on the concept of probability; the next section describes
the basics of probability:
Probability
Probability is a value that quantifies the chance of an event happening. When we toss a fair coin the
outcome is random, and the probability that the coin lands heads is 50%.
A probability is always between 0 and 1, and the probabilities of all the events in a sample space always
sum to 1.
In the field of probability theory there are two schools of thought: frequentist and Bayesian.
Let the random variables be denoted as:
X: gender of the voter
Y: whether the voter voted for a female president (outcome: yes or no)
Marginal Probability
P(X=male) = P(X=male ∩ Y= yes) + P(X=male ∩ Y= no)
Hence
P(X=male) = P(X=male | Y= yes) P(Y=yes) + P(X=male | Y= no) P(Y=no)
Moreover, the multiplication rule generalizes to more than two events provided they are all independent of one
another; for example, the joint probability of three independent events is P(ABC) = P(A) · P(B) · P(C).
Bayes’ Theorem Example #2
You might be interested in finding out a patient’s probability of having liver disease if they are an
alcoholic. “Being an alcoholic” is the test (kind of like a litmus test) for liver disease.
A could mean the event “Patient has liver disease.” Past data tells you that 10% of patients entering
your clinic have liver disease. P(A) = 0.10.
B could mean the litmus test that “Patient is an alcoholic.” Five percent of the clinic’s patients are
alcoholics. P(B) = 0.05.
You might also know that among those patients diagnosed with liver disease, 7% are alcoholics. This is
your P(B|A): the probability that a patient is alcoholic, given that they have liver disease, is 7%. Bayes'
theorem then gives P(A|B) = P(B|A) · P(A) / P(B) = (0.07 × 0.10) / 0.05 = 0.14; in other words, if a patient
is an alcoholic, their chance of having liver disease is 0.14, or 14%.
Another way to look at the theorem is to say that one event follows another. Above I said “tests” and
“events”, but it’s also legitimate to think of it as the “first event” that leads to the “second event.”
There’s no one right way to do this: use the terminology that makes most sense to you.
In a particular pain clinic, 10% of patients are prescribed narcotic pain killers. Overall, five percent of
the clinic’s patients are addicted to narcotics (including pain killers and illegal substances). Out of all
the people prescribed pain pills, 8% are addicts. If a patient is an addict, what is the probability that
they will be prescribed pain pills?
Step 1: Figure out what your event “A” is from the question. That information is in the italicized
part of this particular question. The event that happens first (A) is being prescribed pain pills. That’s
given as 10%.
Step 2: Figure out what your event “B” is from the question. That information is also in the
italicized part of this particular question. Event B is being an addict. That’s given as 5%.
Step 3: Figure out the probability of event B (Step 2) given event A (Step 1). In other words,
find P(B|A). We want to know "Given that people are prescribed pain pills, what's the
probability they are an addict?" That is given in the question as 8%, or 0.08.
Step 4: Insert your answers from Steps 1, 2 and 3 into the formula and solve.
P(A|B) = P(B|A) * P(A) / P(B) = (0.08 * 0.1)/0.05 = 0.16
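Both worked examples reduce to the same one-line computation; a tiny sketch:
def bayes(p_b_given_a, p_a, p_b):
    # Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
    return p_b_given_a * p_a / p_b

print(bayes(0.07, 0.10, 0.05))   # liver disease example -> 0.14
print(bayes(0.08, 0.10, 0.05))   # pain pills example    -> 0.16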
Likelihood: this is the probability of the observed data. Suppose we are given a dataset
D = {h, h, h, h, t, t, t, t, t, t}; the likelihood-based estimate of getting a head is 4/10. The likelihood of the
data given a class can be written as p(X|c).
Posterior: this is the estimate of the probability after considering both the prior and the observed data. It
is proportional to the product of the likelihood and the prior.
Need for a Prior
Suppose D = {t,t,t,t}.
In this case, if we only consider the likelihood, the estimated probability of getting a head is 0, which is
unrealistic after just four tosses. Hence we use the prior information to compute the posterior.
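One simple way to bring in a prior, shown here only as a sketch, is Laplace (add-one) smoothing, which behaves like a uniform Beta prior on the head probability; the pseudo-counts of one head and one tail below are an illustrative assumption, not something prescribed by the text:

D = ["t", "t", "t", "t"]              # observed data: four tails, no heads
heads, tails = D.count("h"), D.count("t")

p_mle = heads / (heads + tails)       # likelihood-only estimate: 0.0, unrealistic

# Posterior-style estimate with prior pseudo-counts of 1 head and 1 tail
# (equivalent to the mean of a Beta(heads + 1, tails + 1) posterior)
p_posterior = (heads + 1) / (heads + tails + 2)
print(p_mle, p_posterior)             # 0.0 0.166...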
Classification accuracy can also easily be turned into a misclassification rate or error rate by inverting the value,
such as: error rate = (1 − (correct predictions / total predictions)) × 100.
Classification accuracy is a great place to start, but often encounters problems in practice.
The main problem with classification accuracy is that it hides the detail you need to better understand
the performance of your classification model. There are two examples where you are most likely to
encounter this problem:
1. When your data has more than 2 classes. With 3 or more classes you may get a classification
accuracy of 80%, but you don't know if that is because all classes are being predicted equally
well or whether one or two classes are being neglected by the model.
2. When your data does not have an even distribution of classes. You may achieve accuracy of 90% or
more, but this is not a good score if 90 records for every 100 belong to one class, because you can
achieve this score by always predicting the most common class value.
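The second pitfall is easy to reproduce: on a 90/10 split, a "model" that always predicts the majority class reaches 90% accuracy while learning nothing (labels below are made up):

y_true = ["no"] * 90 + ["yes"] * 10   # imbalanced data: 90 "no", 10 "yes"
y_pred = ["no"] * 100                 # always predict the most common class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                       # 0.9, despite never detecting a single "yes"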
Classification accuracy can hide the detail you need to diagnose the performance of your model. But
thankfully we can tease apart this detail by using a confusion matrix.
Confusion Matrix
The number of correct and incorrect predictions are summarized with count values and broken down by
each class. This is the key to the confusion matrix.
The confusion matrix shows the ways in which your classification model
is confused when it makes predictions.
It gives you insight not only into the errors being made by your classifier but more importantly the
types of errors that are being made.
It is this breakdown that overcomes the limitation of using classification accuracy alone.
1. You need a test dataset or a validation dataset with expected outcome values.
2. Make a prediction for each row in your test dataset.
3. From the expected outcomes and predictions count:
1. The number of correct predictions for each class.
2. The number of incorrect predictions for each class, organized by the class that was predicted.
Predicted down the side: each row of the matrix corresponds to a predicted class.
Actual (expected) across the top: each column of the matrix corresponds to an actual class.
The counts of correct and incorrect classifications are then filled into the table.
The total number of correct predictions for a class goes into the cell where the predicted row and the
actual column for that class value meet; these cells form the diagonal of the matrix.
In the same way, each incorrect prediction goes into the row of the class that was predicted and the
column of the class that was actually expected.
“In practice, a binary classifier such as this one can make two types of errors: it can incorrectly assign
an individual who defaults to the no default category, or it can incorrectly assign an individual who
does not default to the default category. It is often of interest to determine which of these two types of
errors are being made. A confusion matrix […] is a convenient way to display this information.” —
Page 145, An Introduction to Statistical Learning: with Applications in R, 2014
This matrix can be used for 2-class problems where it is very easy to understand, but can easily be
applied to problems with 3 or more class values, by adding more rows and columns to the confusion
matrix.
Let’s make this explanation of creating a confusion matrix concrete with an example.
Let’s pretend we have a two-class classification problem of predicting whether a photograph contains a
man or a woman.
We have a test dataset of 10 records with expected outcomes and a set of predictions from our
classification algorithm.
Expected, Predicted
man, woman
man, man
woman, woman
man, man
woman, man
woman, woman
woman, woman
man, man
man, woman
woman, woman
Let’s start off and calculate the classification accuracy for this set of predictions. The algorithm made 7
of the 10 predictions correct with an accuracy of 70%.
Let’s turn our results into a confusion matrix. First, we must calculate the number of correct predictions
for each class.
Now, we can calculate the number of incorrect predictions for each class, organized by the predicted
value.
We can now arrange these values into the 2-class confusion matrix:
                    men (actual)   women (actual)
men (predicted)     3              1
women (predicted)   2              4
Here the column labels represent the actual values, whereas the row labels represent the predicted values.
The total number of actual men in the dataset is the sum of the values in the men column (3 + 2).
The total number of actual women in the dataset is the sum of the values in the women column (1 + 4).
The correct values are organized in a diagonal line from top left to bottom-right of the matrix (3
+ 4).
More errors were made by predicting men as women than predicting women as men.
In a two-class problem, we are often looking to discriminate between observations with a specific
outcome, from normal observations. Such as a disease state or event from no disease state or no event.
In this way, we can assign the event class as "positive" and the no-event class as "negative". A prediction
is then labelled "true" when it matches the actual class and "false" when it does not, which gives the
familiar true/false positive/negative counts.
Example:
Consider the case where there are two classes. […] The top row of the table corresponds to samples
predicted to be events. Some are predicted correctly (the true positives, or TP) while others are
inaccurately classified (false positives or FP). Similarly, the second row contains the predicted
negatives with true negatives (TN) and false negatives (FN).
Now that we have worked through a simple 2-class confusion matrix case study, let’s see how we might
calculate a confusion matrix in modern machine learning tools.
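For example, scikit-learn's confusion_matrix function reproduces the counts for the man/woman example above; note that scikit-learn puts the actual class on the rows and the predicted class on the columns, i.e. the transpose of the layout used above:

from sklearn.metrics import confusion_matrix

expected  = ["man", "man", "woman", "man", "woman",
             "woman", "woman", "man", "man", "woman"]
predicted = ["woman", "man", "woman", "man", "man",
             "woman", "woman", "man", "woman", "woman"]

# Rows are the actual classes, columns the predicted classes
print(confusion_matrix(expected, predicted, labels=["man", "woman"]))
# [[3 2]
#  [1 4]]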
TP + FN = P (the total number of actual positives)
TN + FP = N (the total number of actual negatives)
Precision = TP/(TP+FP)
It is also called the Positive Predictive Value (PPV). Precision can be thought of as a measure of a
classifier's exactness. A low precision indicates a large number of False Positives.
If precision is 0.8, it means that out of every 10 positive predictions the model makes, 8 are actually correct.
Recall (R)/ Sensitivity/ True Positive Rate: Capability of the model to correctly identify positives
from total given positives.
R= TP/(TP+FN)
Recall can be thought of as a measure of a classifier's completeness. A low recall indicates many False
Negatives.
Suppose there are actually 100 spam emails, but during classification only 70 of them have been marked
as spam. Then:
True Positives TP = 70, False Negatives FN = 30
R = 70/(70+30) = 0.7
F1 Score: It is also called the F Score or the F Measure. The F1 score conveys the balance between
precision and recall.
F1= 2*P*R/(P+R)
The higher the F-Measure is, the better. F1 Score is needed when you want to seek a balance between
Precision and Recall. F1 is usually more useful than accuracy, especially if you have an uneven class
distribution. Accuracy works best if false positives and false negatives have similar cost. If the cost of
false positives and false negatives are very different, it’s better to look at both Precision and Recall.
Specificity (True Negative Rate): the capability of the model to correctly identify negatives out of the
total actual negatives.
Among the actual nos, what fraction was predicted as no? It is equivalent to 1 − false positive rate:
Specificity = TN/(TN+FP)
1 − Specificity (False Positive Rate): the fraction of actual negatives that the model incorrectly identifies
as positive.
Among the actual nos, what fraction was predicted as yes? It is equivalent to 1 − specificity:
1 − Specificity = FP/(FP+TN)
Suppose there are actually 100 non-spam emails, but during classification only 70 of them have been
marked as non-spam. Then:
True Negatives TN = 70, False Positives FP = 30
FPR = 30/(30+70) = 0.3
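Putting the email numbers together, all of the metrics above can be computed directly from the four confusion-matrix counts; the values below simply reuse the counts from the two small spam examples:

TP, FN = 70, 30   # 100 actual spam emails, 70 caught
TN, FP = 70, 30   # 100 actual non-spam emails, 70 kept

precision   = TP / (TP + FP)                                  # 0.7
recall      = TP / (TP + FN)                                  # 0.7 (sensitivity, TPR)
f1          = 2 * precision * recall / (precision + recall)   # 0.7
specificity = TN / (TN + FP)                                  # 0.7 (true negative rate)
fpr         = FP / (FP + TN)                                  # 0.3 (1 - specificity)
print(precision, recall, f1, specificity, fpr)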
Area Under the Curve (AUC) and ROC: the Receiver Operating Characteristic curve plots the true
positive rate against the false positive rate, and is also known as a sensitivity versus (1 − specificity) graph.
The image below depicts a typical binary situation - where any data you receive, will fall into one of
two distributions. The two distributions are the two bell curves below. To continue the spam
classification example, let's consider the curve on the left to be emails that are "not spam", and the
curve on the right to be emails that are "spam".
ROC curves – what are they and how are they
used?
by Suzanne Ekelund
ROC curves are frequently used to show in a graphical way the connection/trade-off between clinical
sensitivity and specificity for every possible cut-off for a test or a combination of tests. In addition the
area under the ROC curve gives an idea about the benefit of using the test(s) in question.
ROC curves are used in clinical biochemistry to choose the most appropriate cut-off for a test. The best
cut-off has the highest true positive rate together with the lowest false positive rate.
As the area under an ROC curve is a measure of the usefulness of a test in general, where a greater area
means a more useful test, the areas under ROC curves are used to compare the usefulness of tests.
ROC curves were first employed in the study of discriminator systems for the detection of radio signals
in the presence of noise in the 1940s, following the attack on Pearl Harbor.
The initial research was motivated by the desire to determine how the US RADAR "receiver operators"
had missed the Japanese aircraft.
Now ROC curves are frequently used to show the connection between clinical sensitivity and
specificity for every possible cut-off for a test or a combination of tests. In addition, the area under the
ROC curve gives an idea about the benefit of using the test(s) in question.
To make an ROC curve you have to be familiar with the concepts of true positive, true negative, false
positive and false negative. These concepts are used when you compare the results of a test with the
clinical truth, which is established by the use of diagnostic procedures not involving the test in question.
Before you make a table like TABLE I you have to decide your cut-off for distinguishing healthy from
sick.
The cut-off determines the clinical sensitivity (fraction of true positives to all with disease) and
specificity (fraction of true negatives to all without disease).
When you change the cut-off, you will get other values for true positives and negatives and false
positives and negatives, but the number of all with disease is the same and so is the number of all
without disease.
Thus you will get an increase in sensitivity or specificity at the expense of lowering the other parameter
when you change the cut-off [1].
FIG. I and FIG. II demonstrate the trade-off between sensitivity and specificity. When 400 µg/L is
chosen as the analyte concentration cut-off, the sensitivity is 100 % and the specificity is 54 %. When
the cut-off is increased to 500 µg/L, the sensitivity decreases to 92 % and the specificity increases to 79
%.
An ROC curve shows the relationship between clinical sensitivity and specificity for every possible cut-
off. The ROC curve is a graph with the true positive rate (sensitivity) on the y-axis and the false positive
rate (1 – specificity) on the x-axis.
Thus every point on the ROC curve represents a chosen cut-off even though you cannot see this cut-off.
What you can see is the true positive fraction and the false positive fraction that you will get when you
choose this cut-off.
To make an ROC curve from your data you start by ranking all the values and linking each value to the
diagnosis – sick or healthy.
TABLE II : Ranked data with diagnosis (Yes/No)
In the example in TABLE II 159 healthy people and 81 sick people are tested. The results and the
diagnosis (sick Y or N) are listed and ranked based on parameter concentration.
For each and every concentration it is calculated what the clinical sensitivity (true positive rate) and the
(1 – specificity) (false positive rate) of the assay will be if a result identical to this value or above is
considered positive.
TABLE III: Ranked data with calculated true positive and false positive rates for a scenario
where the specific value is used as cut-off
Now the curve is constructed by plotting the data pairs for sensitivity and (1 – specificity):
The area under the ROC curve (AUROC) of a test can be used as a criterion to measure the test's
discriminative ability, i.e. how good is the test in a given clinical situation.
Various computer programs can automatically calculate the area under the ROC curve. Several methods
can be used. An easy way to calculate the AUROC is to use the trapezoid method. To explain it simply,
the sum of all the areas between the x-axis and a line connecting two adjacent data points is calculated:
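As a sketch of the trapezoid method, the area can be accumulated with numpy once the (false positive rate, true positive rate) points are ordered along the x-axis; the points below are made up for illustration:

import numpy as np

# ROC points ordered by increasing false positive rate, with the (0,0) and (1,1) ends included
fpr = np.array([0.0, 0.10, 0.30, 0.60, 1.0])   # x-axis: false positive rate (1 - specificity)
tpr = np.array([0.0, 0.55, 0.80, 0.95, 1.0])   # y-axis: true positive rate (sensitivity)

auc = np.trapz(tpr, fpr)   # sum of the trapezoid areas between adjacent points
print(auc)                 # 0.815 for these points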
A perfect test is able to discriminate between the healthy and sick with 100 % sensitivity and 100 %
specificity.
It will have an ROC curve that passes through the upper left corner (~100 % sensitivity and 100 %
specificity). The area under the ROC curve of the perfect test is 1.
FIG. X: ROC curve for a test with no overlap between healthy and sick
When we have a complete overlap between the results from the healthy and the results from the sick
population, we have a worthless test. A worthless test has a discriminating ability equal to flipping a
coin.
FIG. XII: ROC curve for a test with complete overlap between healthy and sick
As mentioned above, the area under the ROC curve of a test can be used as a criterion to measure the
test's discriminative ability, i.e. how good is the test in a given clinical situation. Generally, tests are
categorized based on the area under the ROC curve.
The closer an ROC curve is to the upper left corner, the more efficient is the test.
In FIG. XIII test A is superior to test B because at all cut-offs the true positive rate is higher and the
false positive rate is lower than for test B. The area under the curve for test A is larger than the area
under the curve for test B.
As a rule of thumb the categorizations in TABLE IV can be used to describe an ROC curve.
Another Example:
Consider a diagnostic test for hypothyroidism applied to 32 persons who actually have hypothyroidism
and 93 who do not. Here TP + FN is the total number of actually positive cases, i.e. the 32 persons
actually having hypothyroidism, and TN + FP is the 93 persons without it.
If we take 5 as the cutoff, the sensitivity = true positive rate = TP/(TP+FN)
= 18/32 = .56
and Specificity = TN/(TN+FP)
= 92/93 = .99, so 1 − Specificity = 1/93 = .01
If we take 7 as the cutoff, the sensitivity = TP/(TP+FN)
= (18+7)/32 = .78
and Specificity = TN/(TN+FP)
= 75/93 = .81, so 1 − Specificity = (17+1)/93 = .19
If we take 9 as the cutoff, the sensitivity = TP/(TP+FN)
= (18+7+4)/32 = .91
and Specificity = TN/(TN+FP)
= 39/93 = .42, so 1 − Specificity = (17+1+36)/93 = .58
If we take 10 as the cutoff (every result is treated as positive), the sensitivity = TP/(TP+FN)
= (18+7+4+3)/32 = 1
and Specificity = TN/(TN+FP)
= 0/93 = 0, so 1 − Specificity = (17+1+36+39)/93 = 1

Cutpoint   Sensitivity   1-Specificity   Specificity
5          0.56          0.01            0.99
7          0.78          0.19            0.81
9          0.91          0.58            0.42
10         1.00          1.00            0.00
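Given the raw test values and true diagnoses, scikit-learn can generate all such cut-off points and the area under the curve at once; the tiny arrays below are invented for illustration (low test values indicate disease here, so the scores are negated before ranking):

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

scores = np.array([3.1, 4.0, 4.8, 5.5, 6.2, 7.1, 8.0, 8.8, 9.5, 10.4])  # test values
truth  = np.array([1,   1,   1,   0,   1,   0,   0,   1,   0,   0])     # 1 = sick, 0 = healthy

fpr, tpr, thresholds = roc_curve(truth, -scores)   # negate: lower value = more likely sick
print(list(zip(fpr, tpr)))                         # the (1 - specificity, sensitivity) pairs
print(roc_auc_score(truth, -scores))               # area under the ROC curve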
Distance Based Methods
Any form of learning is based on generalizing from training data to unseen data by exploiting the
similarities between the two. Similarity based on a distance measure is one such form.
Distance
D is a distance measure if it is a function from pairs of points to real numbers such that:
1. d(x,y)>=0
2. d(x,y) = 0 iff x=y
3. d(x,y) = d(y,x)
4. d(x,y) <= d(x,z)+d(z,y) {triangle inequality: one side of triangle is always less than
sum of two other sides of triangle.}
Euclidean Distance
Non Euclidean Distance
Euclidean Distance: a Euclidean space has some number of real-valued dimensions and dense points, and
a Euclidean distance is based on the locations of points in such a space.
The most common Euclidean distance is the L2 norm: the square root of the sum of the squares of the
differences between x and y in each dimension.
Non-Euclidean Distance: on the other hand, distance measures for non-Euclidean spaces are based on
properties of the points rather than their location in a space. Some examples of non-Euclidean distances are:
Edit Distance: the edit distance of two strings is the number of inserts and deletes of characters needed to
turn one into the other.
LCS (Longest Common Subsequence): the longest string that can be obtained by deleting characters from both x and y.
x = abcde, y = bcduve
x can be turned into y by deleting 'a' and then inserting 'u' and 'v', so three edits are needed. The LCS here is
bcde (length 4), and in general
d(x,y) = |x| + |y| − 2|LCS(x,y)| = 5 + 6 − 2·4 = 3
and d(x,y) >= 0.
Hamming Distance : This Non Euclidean distance defines the number of positions in which two bit
vectors differ.
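Small sketches of these distance measures in Python; the edit distance here is the insert/delete-only variant defined above, computed through the longest common subsequence:

import math

def euclidean(x, y):
    # L2 norm: square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def hamming(x, y):
    # number of positions in which two equal-length strings/bit vectors differ
    return sum(a != b for a, b in zip(x, y))

def lcs_length(x, y):
    # longest common subsequence length via dynamic programming
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x[i - 1] == y[j - 1] else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def edit_distance(x, y):
    # insert/delete-only edit distance: |x| + |y| - 2 |LCS(x, y)|
    return len(x) + len(y) - 2 * lcs_length(x, y)

print(euclidean([1, 2], [4, 6]))          # 5.0
print(hamming("10110", "10011"))          # 2
print(edit_distance("abcde", "bcduve"))   # 3 (delete 'a', insert 'u' and 'v')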
Given N labeled training examples {(x_n, y_n)}, n = 1, ..., N, from two classes, positive and negative,
the simplest distance-based approach is to compute the mean (µ) of each class and assign a test point to
the class with the closer mean.
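A minimal sketch of this nearest-class-mean rule, assuming numpy arrays and class labels +1 and -1 (the toy points are invented):

import numpy as np

def nearest_mean_classifier(X_train, y_train):
    # compute the mean of each class once, at training time
    mu_pos = X_train[y_train == 1].mean(axis=0)
    mu_neg = X_train[y_train == -1].mean(axis=0)

    def predict(x):
        # assign the class whose mean is closer in Euclidean distance
        return 1 if np.linalg.norm(x - mu_pos) <= np.linalg.norm(x - mu_neg) else -1

    return predict

X = np.array([[1.0, 1.0], [1.5, 2.0], [5.0, 7.0], [4.5, 5.0]])
y = np.array([-1, -1, 1, 1])
predict = nearest_mean_classifier(X, y)
print(predict(np.array([2.0, 2.0])))   # -1: closer to the negative-class mean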
A hyperplane is a subspace whose dimension is one less than that of its ambient space. If a space is
3-dimensional, its hyperplanes are 2-dimensional.
1980’s
Decision trees and NNs allowed efficient learning of non-linear decision surfaces
Little theoretical basis and all suffer from local minima
1990’s
Efficient learning algorithms for non-linear functions based on computational learning theory
developed
Nice theoretical properties
y= mx+c
ax+by+c=0
y= -c/b-(a/b)x
slope= -a/b
intercept=-c/b
Plane: in 3D, a plane is required to divide the space into two parts, and it is represented as
w0 + w1x1 + w2x2 + w3x3 = 0
Hyperplane: In geometry, a hyperplane is a subspace whose dimension is one less than that of its
ambient space. If a space is 3-dimensional then its hyperplanes are the 2-dimensional planes, while if
the space is 2-dimensional, its hyperplanes are the 1-dimensional lines. It is represented by the equation
w · x + b = 0. All the points that are on the hyperplane satisfy this equation.
Understanding Support Vector Machine
algorithm from examples
Sunil Ray,
https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/
Introduction
Mastering machine learning algorithms isn’t a myth at all. Most of the beginners start by learning
regression. It is simple to learn and use, but does that solve our purpose? Of course not! Because, you
can do so much more than just Regression!
By now, I hope you’ve now mastered Random Forest, Naive Bayes Algorithm and Ensemble
Modeling. If not, I’d suggest you to take out few minutes and read about them as well. In this article, I
shall guide you through the basics to advanced knowledge of a crucial machine learning algorithm,
support vector machines.
Table of Contents
“Support Vector Machine” (SVM) is a supervised machine learning algorithm which can be used for
both classification or regression challenges. However, it is mostly used in classification problems. In
this algorithm, we plot each data item as a point in n-dimensional space (where n is number of features
you have) with the value of each feature being the value of a particular coordinate. Then, we perform
classification by finding the hyper-plane that differentiates the two classes very well (look at the below
snapshot).
Support Vectors are simply the co-ordinates of individual observation. Support Vector Machine is a
frontier which best segregates the two classes (hyper-plane/ line).
You can look at definition of support vectors and a few examples of its working here.
Above, we got accustomed to the process of segregating the two classes with a hyper-plane. Now the
burning question is “How can we identify the right hyper-plane?”. Don’t worry, it’s not as hard as you
think!
Let’s understand:
Identify the right hyper-plane (Scenario-1): Here, we have three hyper-planes (A, B and C). Now,
identify the right hyper-plane to classify star and circle.
Above, you can see that the margin for hyper-plane C is high compared to both A and B.
Hence, we name the right hyper-plane as C. Another strong reason for selecting the hyper-
plane with the higher margin is robustness: if we select a hyper-plane having a low margin, there
is a high chance of misclassification.
Identify the right hyper-plane (Scenario-3):Hint: Use the rules as discussed in previous section to
identify the right hyper-plane
Some of you may have selected the hyper-
plane B as it has higher margin compared to A. But, here is the catch, SVM selects the hyper-plane
which classifies the classes accurately prior to maximizing margin. Here, hyper-plane B has a
classification error and A has classified all correctly. Therefore, the right hyper-plane is A.
Can we classify two classes (Scenario-4)?: Below, I am unable to segregate the two classes using a
straight line, as one of the stars lies in the territory of the other (circle) class as an outlier.
Find the hyper-plane to segregate to classes (Scenario-5): In the scenario below, we can’t have linear
hyper-plane between the two classes, so how does SVM classify these two classes? Till now, we have
only looked at the linear hyper-plane.
In SVM, it is easy to have a linear hyper-plane between these two classes. But another burning
question which arises is: do we need to add this feature manually to have a hyper-plane?
No, SVM has a technique called the kernel trick. Kernels are functions which take a low-
dimensional input space and transform it into a higher-dimensional space, i.e. they convert a non-
separable problem into a separable problem. This is mostly useful in non-linear separation
problems. Simply put, the kernel does some extremely complex data transformations, then finds
out the process to separate the data based on the labels or outputs you've defined.
When we look at the hyper-plane in original input space it looks like a circle:
Now, let’s look at the methods to apply SVM algorithm in a data science challenge.
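For instance, here is a minimal scikit-learn sketch; the toy points are invented, and the kernel parameter switches between a plain linear hyper-plane and the kernel trick described above:

import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two classes that a straight line can separate
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# kernel="linear" fits a linear hyper-plane; kernel="rbf" would apply the kernel trick
model = SVC(kernel="linear", C=1.0)
model.fit(X, y)

print(model.support_vectors_)            # the support vectors found
print(model.predict([[3, 4], [7, 7]]))   # predicted classes for new points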
Pros:
o It works really well with a clear margin of separation.
o It is effective in high dimensional spaces.
o It is effective in cases where number of dimensions is greater than the number of samples.
o It uses a subset of training points in the decision function (called support vectors), so it is also
memory efficient.
Cons:
o It doesn't perform well when we have a large data set, because the required training time is
higher.
o It also doesn't perform very well when the data set has more noise, i.e. the target classes are
overlapping.
o SVM doesn't directly provide probability estimates; these are calculated using an expensive
five-fold cross-validation (the probability option of the SVC method in the Python scikit-learn library).
Vectors
In Support Vector Machine, there is the word vector. It is important to know some basics about
vectors in order to understand SVMs and how to use them.
When we do calculations, we denote a vector with the coordinates of its endpoint (the point
where the tip of the arrow is). In Figure 1, the point A has the coordinates (4, 3), so we can write
the vector's coordinates as (4, 3). If we want to, we can also give the vector a name of its own.
From this point, one might be tempted to think that a vector is defined by its coordinates.
However, if I give you a sheet of paper with only a horizontal line and ask you to trace the
same vector as the one in Figure 1, you can still do it.
The direction is the second component of a vector. By definition, it is a new vector for which the
coordinates are the initial coordinates of our vector divided by its norm.
Where does it come from? Geometry. Figure 3 shows us a vector and its angles with respect
to the horizontal and vertical axes. There is an angle θ (theta) between the vector and the horizontal
axis, and an angle α (alpha) between the vector and the vertical axis.
It makes sense, as the sole objective of this vector is to describe the direction of other
vectors: by having a norm of 1, it stays as simple as possible. As a result, such a direction
vector is often referred to as a unit vector.
Dimensions of a vector
Note that the order in which the numbers are written is important. As a result, we say that an
n-dimensional vector is a tuple of n real-valued numbers.
People often have trouble with the dot product because it seems to come out of nowhere. What
is important is that it is an operation performed on two vectors and that its result gives us some
insights into how the two vectors relate to each other. There are two ways to think about the dot
product: geometrically and algebraically.
By looking at this formula, we can see that the dot product is strongly influenced by the angle θ between the two vectors.
Figure 5: Using these three angles will allow us to simplify the dot product
In Figure 5, we can see the relationship between the three angles θ (theta), β (beta), and α (alpha): θ = β − α.
Expanding cos θ = cos(β − α) = cos β cos α + sin β sin α and substituting the coordinate ratios gives the
algebraic form of the dot product, x · y = x1y1 + x2y2.
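A quick numerical check that the geometric and algebraic definitions of the dot product agree, using two arbitrary example vectors:

import numpy as np

x = np.array([3.0, 4.0])
y = np.array([5.0, 1.0])

algebraic = x[0] * y[0] + x[1] * y[1]                    # sum of coordinate products

theta = np.arctan2(x[1], x[0]) - np.arctan2(y[1], y[0])  # angle between the two vectors
geometric = np.linalg.norm(x) * np.linalg.norm(y) * np.cos(theta)

print(algebraic, geometric)   # both evaluate to 19.0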
In Figure 6, you can clearly see that the expensive wine contains less alcohol than the cheap
one. In fact, you can find a point that separates the data into two groups. This data is said to be
linearly separable. For now, you decide to measure the alcohol concentration of your wine
automatically before filling an expensive bottle. If it is greater than 13 percent, the production
chain stops and one of your employees must make an inspection. This improvement
dramatically reduces complaints, and your business is flourishing again.
This example is too easy—in reality, data seldom works like that. In fact, some scientists really
measured alcohol concentration of wine, and the plot they obtained is shown in Figure 7. This
is an example of non-linearly separable data. Even if most of the time data will not be linearly
separable, it is fundamental that you understand linear separability well. In most cases, we will
start from the linearly separable case (because it is the simpler) and then derive the non-
separable case.
Similarly, in most problems, we will not work with only one dimension, as in Figure 6. Real-life
problems are more challenging than toy examples, and some of them can have thousands of
dimensions, which makes working with them more abstract. However, its abstractness does
not make it more complex. Most examples in this book will be two-dimensional examples.
They are simple enough to be easily visualized, and we can do some basic geometry on them,
which will allow you to understand the fundamentals of SVMs.
Figure 7: Plotting alcohol by volume from a real dataset
In our example of Figure 6, there is only one dimension: that is, each data point is represented
by a single number. When there are more dimensions, we will use vectors to represent each
data point. Every time we add a dimension, the object we use to separate the data changes.
Indeed, while we can separate the data with a single point in Figure 6, as soon as we go into
two dimensions we need a line (a set of points), and in three dimensions we need a plane
(which is also a set of points).
• In one dimension, you can find a point separating the data (Figure 6).
• In two dimensions, you can find a line separating the data (Figure 8).
• In three dimensions, you can find a plane separating the data (Figure 9).
Similarly, when data is non-linearly separable, we cannot find a separating point, line, or plane.
Figure 10 and Figure 11 show examples of non-linearly separable data in two and three
dimensions.
Figure 10: Non-linearly separable data in 2D Figure 11: Non-linearly separable data in 3D
Hyperplanes
What do we use to separate the data when there are more than three dimensions? We
use what is called a hyperplane.
What is a hyperplane?
In geometry, a hyperplane is a subspace of one dimension less than its ambient space.
This definition, albeit true, is not very intuitive. Instead of using it, we will try to understand
what a hyperplane is by first studying what a line is.
If you recall mathematics from school, you probably learned that a line has an equation of the
form y = ax + b, that the constant a is known as the slope, and that b is where the line intercepts
the y-axis. There are several pairs (x, y) for which this formula is true, and we say that the set
of the solutions is a line.
What is often confusing is that if you study the function f(x) = ax + b in a calculus course,
you will be studying a function with one variable.
However, the same line can be written as ax − y + b = 0, and, with the two-dimensional vectors
w = (a, −1) and x = (x, y), this is equivalent to w · x + b = 0.
What is nice with this last equation is that it uses vectors. Even if we derived it by using two-
dimensional vectors, it works for vectors of any dimensions. It is, in fact, the equation of a
hyperplane.
From this equation, we can have another insight into what a hyperplane is: it is the set of points x
satisfying w · x + b = 0.
In two dimensions, with w = (w0, w1) and x = (x, y), we can isolate y to get y = −(w0/w1)x − b/w1,
which is again a line whose slope is −w0/w1 and whose intercept is −b/w1.
We see that the bias b of the hyperplane equation is equal to the intercept of that line only when
w1 = −1. So you should not be surprised if b is not the intersection with the vertical axis when
you see a plot for a hyperplane (this will be the case in our next example).
Given the linearly separable data of Figure 12, we can use a hyperplane to perform
binary classification.
For instance, with a particular choice of the vector w and the bias b we get the hyperplane shown in Figure 13.
We associate each vector x with a label y, which can have the value +1 or −1 (respectively
the triangles and the stars in Figure 13).
The hypothesis function h uses the position of x with respect to the hyperplane to predict a value for the label y.
Every data point on one side of the hyperplane will be assigned a label, and every data point
on the other side will be assigned the other label.
Because it uses the equation of the hyperplane, which produces a linear combination of the
input values, the function h is called a linear classifier.
With one more trick, we can make the formula of h even simpler by removing the b constant: we add an
extra component equal to 1 to every vector x and a corresponding component equal to b to w, so the
hyperplane equation becomes a single dot product.
If we have a hyperplane that separates the data set like the one in Figure 13, then by using
the hypothesis function h we are able to predict the label of every point perfectly. The
main question is: how do we find such a hyperplane?
Summary
After introducing vectors and linear separability, we learned what a hyperplane is and how
we can use it to classify data. We then saw that the goal of a learning algorithm trying to
learn a linear classifier is to find a hyperplane separating the data. Eventually, we discovered
that finding a hyperplane is equivalent to finding a vector w.
We will now examine which approaches learning algorithms use to find a hyperplane that
separates the data. Before looking at how SVMs do this, we will first look at one of the
simplest learning models: the Perceptron.
The Perceptron
The Perceptron is an algorithm invented in 1957 by Frank Rosenblatt, a few years before the
first SVM. It is widely known because it is the building block of a simple neural network: the
multilayer perceptron. The goal of the Perceptron is to find a hyperplane that can separate a
linearly separable data set. Once the hyperplane is found, it is used to perform binary
classification.
What is important to understand here is that the only unknown value is w. It means that the
goal of the algorithm is to find a value for w. You find w; you have a hyperplane. There is an
infinite number of hyperplanes (you can give any value to w), so there is an infinity of
hypothesis functions.
1. Start with a random hyperplane (defined by a vector w) and use it to classify the data.
2. Pick a misclassified example and select another hyperplane by updating the value of
w, hoping it will work better at classifying this example (this is called the update rule).
• If the predicted label is 1, the angle between w and x is smaller than 90°, and we want
to increase it.
• If the predicted label is -1, the angle between w and x is bigger than 90°, and we want
to decrease it.
Let's see what happens with two vectors, w and x, having an angle θ between them (Figure 15).
On the one hand, adding them creates a new vector w + x, and the angle between x and w + x
is smaller than θ (Figure 16). On the other hand, subtracting them creates a vector w − x, and the
angle between x and w − x is bigger than θ.
• If the predicted label is 1, the angle is smaller than 90°. We want to increase the angle,
so we set w = w − x.
• If the predicted label is -1, the angle is bigger than 90°. We want to decrease the angle,
so we set w = w + x.
As we are doing this only on misclassified examples, when the predicted label has one value,
the expected label is the opposite. This means we can rewrite the previous statement in terms of
the expected label y: we set w = w + yx, i.e. w = w + x when the expected label is +1 and
w = w − x when it is −1.
Note that the update rule does not necessarily change the sign of the hypothesis for the
example the first time. Sometimes it is necessary to apply the update rule several times before
it happens. This is not a problem, as we are looping across misclassified examples, so we will
continue to use the update rule until the example is correctly classified. What matters here is
that each time we use the update rule, we change the value of the angle in the right direction
(increasing it or decreasing it).
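A compact sketch of the Perceptron learning algorithm described above, using the earlier trick of absorbing b into w by appending a constant 1 to every example; the toy data is invented for illustration:

import numpy as np

def perceptron(X, y, max_iter=1000):
    # augment each example with a constant 1 so the bias b is folded into w
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(X_aug.shape[1])               # start from some hyperplane
    for _ in range(max_iter):
        misclassified = [(x, label) for x, label in zip(X_aug, y)
                         if np.sign(w @ x) != label]
        if not misclassified:                  # every example correctly classified: done
            return w
        x, label = misclassified[0]            # pick a misclassified example
        w = w + label * x                      # update rule: w + x if expected +1, w - x if -1
    return w

X = np.array([[1.0, 1.0], [2.0, 1.5], [6.0, 5.0], [7.0, 6.5]])
y = np.array([-1, -1, 1, 1])
w = perceptron(X, y)
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
print(w, np.sign(X_aug @ w))   # all four training points end up correctly classified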
Also note that sometimes updating the value of w for a particular example changes the hyperplane
in such a way that another example previously correctly classified becomes misclassified. So, the
hypothesis might become worse at classifying after being updated. This is illustrated in Figure 18,
which shows us the number of misclassified examples at each iteration step. One way to avoid this
problem is to keep a record of the value of w before making the update and use the updated w only if
it reduces the number of misclassified examples. This modification of the PLA is known as the Pocket
algorithm (because we keep the best w found so far in our pocket).
At first, this might not seem like a problem. After all, the four hyperplanes perfectly
classify the data, so they might be equally good, right? However, when using a
machine learning algorithm such as the PLA, our goal is not to find a way to classify
perfectly the data we have right now. Our goal is to find a way to correctly classify new
data we will receive in the future.
Let us introduce some terminology to be clear about this. To train a model, we pick a
sample of existing data and call it the training set. We train the model, and it comes up
with a hypothesis (a hyperplane in our case). We can measure how well the
hypothesis performs on the training set: we call this the in-sample error (also called
training error). Once we are satisfied with the hypothesis, we decide to use it on unseen
data (the test set) to see if it indeed learned something. We measure how well the
hypothesis performs on the test set, and we call this the out-of-sample error (also
called the generalization error).
In the case of the PLA, all hypotheses in Figure 19 perfectly classify the data: their in-
sample error is zero. But we are really concerned about their out-of-sample error. We
can use a test set such as the one in Figure 20 to check their out-of-sample errors.
Now we better understand why it is problematic. When using the Perceptron with a linearly
separable dataset, we have the guarantee of finding a hypothesis with zero in-sample error, but
we have no guarantee about how well it will generalize to unseen data (if an algorithm
generalizes well, its out-of-sample error will be close to its in-sample error). How can we choose
a hyperplane that generalizes well? As we will see in the next chapter, this is one of the goals of
SVMs.
Summary
In this chapter, we have learned what a Perceptron is. We then saw in detail how the
Perceptron Learning Algorithm works and what the motivation behind the update rule is. After
learning that the PLA is guaranteed to converge, we saw that not all hypotheses are equal,
and that some of them will generalize better than others. Eventually, we saw that the
Perceptron is unable to select which hypothesis will have the smallest out-of-sample error and
instead just picks one hypothesis having the lowest in-sample error at random.
Introduction to K-means Clustering
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled
data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups
in the data, with the number of groups represented by the variable K. The algorithm works
iteratively to assign each data point to one of K groups based on the features that are provided.
Data points are clustered based on feature similarity. The results of the K-means clustering
algorithm are:
1. The centroids of the K clusters, which can be used to label new data
2. Labels for the training data (each data point is assigned to a single cluster)
Rather than defining groups before looking at the data, clustering allows you to find and analyze
the groups that have formed organically. The "Choosing K" section below describes how the
number of groups can be determined.
Each centroid of a cluster is a collection of feature values which define the resulting groups.
Examining the centroid feature weights can be used to qualitatively interpret what kind of group
each cluster represents.
Business Uses
The K-means clustering algorithm is used to find groups which have not been explicitly labeled
in the data. This can be used to confirm business assumptions about what types of groups exist or
to identify unknown groups in complex data sets. Once the algorithm has been run and the
groups are defined, any new data can be easily assigned to the correct group.
This is a versatile algorithm that can be used for any type of grouping. Some examples of use
cases are:
Behavioral segmentation:
o Segment by purchase history
o Segment by activities on application, website, or platform
o Define personas based on interests
o Create profiles based on activity monitoring
Inventory categorization:
o Group inventory by sales activity
o Group inventory by manufacturing metrics
Sorting sensor measurements:
o Detect activity types in motion sensors
o Group images
o Separate audio
o Identify groups in health monitoring
Detecting bots or anomalies:
o Separate valid activity groups from bots
o Group valid activity to clean up outlier detection
In addition, monitoring if a tracked data point switches between groups over time can be used to
detect meaningful changes in the data.
Algorithm
The Κ-means clustering algorithm uses iterative refinement to produce a final result. The
algorithm inputs are the number of clusters Κ and the data set. The data set is a collection of
features for each data point. The algorithm starts with initial estimates for the Κ centroids,
which can either be randomly generated or randomly selected from the data set. The algorithm
then iterates between two steps:
Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest
centroid, based on the squared Euclidean distance. More formally, if ci is the ith centroid in the set of
centroids C, then each data point x is assigned to the cluster whose centroid minimizes dist(ci, x)², i.e. to
argmin over ci in C of dist(ci, x)², where dist( · ) is the standard (L2) Euclidean distance. Let the set of
data point assignments for each ith cluster centroid be Si.
In this step, the centroids are recomputed. This is done by taking the mean of all data points
assigned to that centroid's cluster: ci = (1/|Si|) Σ xi, summing over all xi in Si.
The algorithm iterates between steps one and two until a stopping criterion is met (i.e., no data
points change clusters, the sum of the distances is minimized, or some maximum number of
iterations is reached).
This algorithm is guaranteed to converge to a result. The result may be a local optimum (i.e. not
necessarily the best possible outcome), meaning that assessing more than one run of the
algorithm with randomized starting centroids may give a better outcome.
Choosing K
The algorithm described above finds the clusters and data set labels for a particular pre-chosen K.
To find the number of clusters in the data, the user needs to run the K-means clustering algorithm
for a range of K values and compare the results. In general, there is no method for determining the
exact value of K, but an accurate estimate can be obtained using the following techniques.
One of the metrics that is commonly used to compare results across different values of K is the
mean distance between data points and their cluster centroid. Since increasing the number of
clusters will always reduce the distance to data points, increasing K will always decrease this
metric, to the extreme of reaching zero when K is the same as the number of data points. Thus,
this metric cannot be used as the sole target. Instead, mean distance to the centroid as a function
of K is plotted and the "elbow point," where the rate of decrease sharply shifts, can be used to
roughly determine K.
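A sketch of this elbow approach with scikit-learn; the data here is random stand-in data, and inertia_ (the sum of squared distances from each point to its closest centroid) plays the role of the distance metric described above:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))          # replace with your own feature matrix

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)              # plot these values and look for the "elbow"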
Subject A B
1 1.0 1.0
2 1.5 2.0
3 3.0 4.0
4 5.0 7.0
5 3.5 5.0
6 4.5 5.0
7 3.5 4.5
This data set is to be grouped into two clusters. As a first step in finding a sensible initial
partition, let the A & B values of the two individuals furthest apart (using the Euclidean distance
measure) define the initial cluster means, giving:
        Individual   Mean Vector (centroid)
Group 1 1            (1.0, 1.0)
Group 2 4            (5.0, 7.0)
The remaining individuals are now examined in sequence and allocated to the cluster to which
they are closest, in terms of Euclidean distance to the cluster mean. The mean vector is
recalculated each time a new member is added. This leads to the following series of steps:
        Cluster 1                               Cluster 2
Step    Individual   Mean Vector (centroid)     Individual   Mean Vector (centroid)
1       1            (1.0, 1.0)                 4            (5.0, 7.0)
2       1, 2         (1.2, 1.5)                 4            (5.0, 7.0)
3       1, 2, 3      (1.8, 2.3)                 4            (5.0, 7.0)
4       1, 2, 3      (1.8, 2.3)                 4, 5         (4.2, 6.0)
5       1, 2, 3      (1.8, 2.3)                 4, 5, 6      (4.3, 5.7)
6       1, 2, 3      (1.8, 2.3)                 4, 5, 6, 7   (4.1, 5.4)
Now the initial partition has changed, and the two clusters at this stage have the following
characteristics:
          Individual    Mean Vector (centroid)
Cluster 1 1, 2, 3       (1.8, 2.3)
Cluster 2 4, 5, 6, 7    (4.1, 5.4)
But we cannot yet be sure that each individual has been assigned to the right cluster. So, we
compare each individual’s distance to its own cluster mean and to
that of the opposite cluster. And we find:
Individual   Distance to mean (centroid) of Cluster 1   Distance to mean (centroid) of Cluster 2
1            1.5                                        5.4
2            0.4                                        4.3
3            2.1                                        1.8
4            5.7                                        1.8
5            3.2                                        0.7
6            3.8                                        0.6
7            2.8                                        1.1
Only individual 3 is nearer to the mean of the opposite cluster (Cluster 2) than to its own (Cluster
1). In other words, each individual's distance to its own cluster mean should be smaller than the
distance to the other cluster's mean (which is not the case with individual 3). Thus, individual 3
is relocated to Cluster 2, resulting in the new partition:
          Individual      Mean Vector (centroid)
Cluster 1 1, 2            (1.3, 1.5)
Cluster 2 3, 4, 5, 6, 7   (3.9, 5.1)
The iterative relocation would now continue from this new partition until no more relocations
occur. However, in this example each individual is now nearer its own cluster mean than that of
the other cluster and the iteration stops, choosing the latest partitioning as the final cluster
solution.
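The same final partition can be checked with scikit-learn, seeding the centroids with individuals 1 and 4 as above. Note that standard k-means reassigns all points before recomputing the centroids, rather than updating the mean after each new member as in the hand-worked illustration, but for this data it reaches the same final clusters:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])

init = np.array([[1.0, 1.0], [5.0, 7.0]])   # start from individuals 1 and 4
km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)

print(km.labels_)            # [0 0 1 1 1 1 1]: individuals 1, 2 versus 3-7
print(km.cluster_centers_)   # approximately (1.25, 1.5) and (3.9, 5.1)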
Also, it is possible that the k-means algorithm won't find a final solution. In this case it would be
a good idea to consider stopping the algorithm after a pre-chosen maximum of iterations.
Introduction to Decision Tree Algorithm
Decision Trees are a type of Supervised Machine Learning (that is you explain what the input is
and what the corresponding output is in the training data) where the data is continuously split
according to a certain parameter. The tree can be explained by two entities, namely decision
nodes and leaves. The leaves are the decisions or the final outcomes. And the decision nodes are
where the data is split.
An example of a decision tree can be explained using the binary tree above. Let's say you want to
predict whether a person is fit given information like their age, eating habits, and physical
activity. The decision nodes here are questions like 'What's the age?', 'Does he exercise?',
'Does he eat a lot of pizzas?', and the leaves are outcomes like either 'fit' or 'unfit'. In
this case this was a binary classification problem (a yes/no type problem).
What we've seen above is an example of a classification tree, where the outcome is a categorical
variable like 'fit' or 'unfit'.
In a regression tree, the decision or outcome variable is continuous, e.g. a number like 123.
Working
Now that we know what a decision tree is, we'll see how it works internally. There are many
algorithms out there which construct decision trees, but one of the best known is the ID3
algorithm. ID3 stands for Iterative Dichotomiser 3.
Before discussing the ID3 algorithm, we’ll go through few definitions.
Entropy
Entropy, also called Shannon entropy and denoted by H(S) for a finite set S, is a measure of
the amount of uncertainty or randomness in data:
H(S) = − Σ p(x) log2 p(x), where the sum runs over the possible outcomes x and p(x) is the proportion of S with outcome x.
Intuitively, it tells us about the predictability of a certain event. For example, consider a coin toss
whose probability of heads is 0.5 and probability of tails is 0.5. Here the entropy is the highest
possible, since there is no way of determining what the outcome might be. Alternatively, consider
a coin which has heads on both sides: the outcome of a toss can be predicted perfectly
since we know beforehand that it will always be heads. This event has no
randomness, hence its entropy is zero.
In general, lower values imply less uncertainty while higher values imply more uncertainty.
Let’s understand this with the help of another example
Consider a piece of data collected over the course of 14 days where the features are Outlook,
Temperature, Humidity, Wind and the outcome variable is whether Golf was played on the day.
Now, our job is to build a predictive model which takes in above 4 parameters and predicts
whether Golf will be played on the day. We’ll build a decision tree to do that using ID3
algorithm.
Now we’ll go ahead and grow the decision tree. The initial step is to calculate H(S), the Entropy
of the current state. In the above example, we can see in total there are 5 No’s and 9 Yes’s.
Yes No Total
9 5 14
Remember that the entropy is 0 if all members belong to the same class, and 1 when half of them
belong to one class and the other half to the other class, which is perfect randomness. Here it is
0.94, which means the distribution is fairly random.
Now the next step is to choose the attribute that gives us the highest possible information gain,
which we will choose as the root node. The information gain for an attribute A is
IG(S, A) = H(S) − Σ (|Sx|/|S|) · H(Sx),
where x ranges over the possible values of the attribute and Sx is the subset of examples with A = x.
Here, the attribute 'Wind' takes two possible values in the sample data, hence x = {Weak, Strong}.
Amongst all the 14 examples we have 8 places where the wind is weak and 6 where the wind
is Strong.
Wind = Weak   Wind = Strong   Total
8             6               14
Now out of the 8 Weak examples, 6 of them were ‘Yes’ for Play Golf and 2 of them were ‘No’
for ‘Play Golf’. So, we have,
Similarly, out of 6 Strong examples, we have 3 examples where the outcome was ‘Yes’ for
Play Golf and 3 where we had ‘No’ for Play Golf.
Remember, here half the items belong to one class while the other half belong to the other,
hence we have perfect randomness and the entropy is 1.
Now we have all the pieces required to calculate the information gain for 'Wind':
IG(S, Wind) = H(S) − (8/14)·H(SWeak) − (6/14)·H(SStrong) = 0.94 − (8/14)·0.811 − (6/14)·1.0 ≈ 0.048.
So considering 'Wind' as the splitting feature gives us an information gain of 0.048. Now we must
similarly calculate the information gain for all the features.
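A short sketch of these entropy and information-gain calculations for the Wind attribute, using the counts from the tables above:

from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

H_S = entropy([9, 5])          # whole data set: 9 Yes, 5 No  -> about 0.940
H_weak = entropy([6, 2])       # Wind = Weak:   6 Yes, 2 No  -> about 0.811
H_strong = entropy([3, 3])     # Wind = Strong: 3 Yes, 3 No  -> 1.0

# information gain: H(S) minus the weighted average entropy of the splits
IG_wind = H_S - (8 / 14) * H_weak - (6 / 14) * H_strong
print(round(H_S, 3), round(IG_wind, 3))   # 0.94 0.048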
We can clearly see that IG(S, Outlook) has the highest information gain of 0.246, hence we
chose Outlook attribute as the root node. At this point, the decision tree looks like.
Here we observe that whenever the outlook is Overcast, Play Golf is always 'Yes'. This is no
coincidence; the simple tree results from the fact that the highest information gain is given by the
attribute Outlook.
Now how do we proceed from this point? We can simply apply recursion, you might want to
look at the algorithm steps described earlier.
Now that we've used Outlook, we've got three attributes remaining: Humidity, Temperature, and
Wind. And we had three possible values of Outlook: Sunny, Overcast, Rain. The Overcast
node already ended up as a leaf node 'Yes', so we're left with two subtrees to compute:
Sunny and Rain.
As we can see, for the Sunny subtree the highest information gain is given by Humidity. Proceeding in
the same way with the Rain subtree gives us Wind as the attribute with the highest information gain.
The final decision tree looks something like this.
Overfitting in case of decision tree classification
Overfitting is a significant practical difficulty for decision tree models and many other predictive
models. Overfitting happens when the learning algorithm continues to develop hypotheses that
reduce training set error at the cost of an increased test set error. There are two common
approaches to avoid overfitting in decision trees:
Pre-pruning, which stops growing the tree earlier, before it perfectly classifies the training
set.
Post-pruning, which allows the tree to perfectly classify the training set, and then prunes
the tree.
Practically, the second approach of post-pruning overfit trees is more successful because it
is not easy to precisely estimate when to stop growing the tree.
The important step of tree pruning is to define a criterion to be used to determine the correct
final tree size, using one of the following methods:
1. Use a distinct dataset from the training set (called validation set), to evaluate the effect
of post-pruning nodes from the tree.
2. Build the tree by using the training set, then apply a statistical test to estimate whether
pruning or expanding a particular node is likely to produce an improvement beyond the
training set.
o Error estimation
o Significance testing (e.g., Chi-square test)
3. Minimum Description Length principle : Use an explicit measure of the complexity for
encoding the training set and the decision tree, stopping growth of the tree when this
encoding size (size(tree) + size(misclassifications(tree)) is minimized.
The first method is the most common approach. In this approach, the available data are
separated into two sets of examples: a training set, which is used to build the decision tree,
and a validation set, which is used to evaluate the impact of pruning the tree. The second
method is also a common approach. Here, we explain the error estimation and Chi2 test.
Dimensionality Reduction
Let’s say that you want to predict what the gross domestic product (GDP) of the United States
will be for 2017. You have lots of information available: the U.S. GDP for the first quarter of
2017, the U.S. GDP for the entirety of 2016, 2015, and so on. You have any publicly-available
economic indicator, like the unemployment rate, inflation rate, and so on. You have U.S. Census
data from 2010 estimating how many Americans work in each industry and American
Community Survey data updating those estimates in between each census. You know how many
members of the House and Senate belong to each political party. You could gather stock price
data, the number of IPOs occurring in a year, and how many CEOs seem to be mounting a bid
for public office. Despite being an overwhelming number of variables to consider, this just
scratches the surface. You have a lot of variables to consider.
If you’ve worked with a lot of variables before, you know this can present problems. Do you
understand the relationships between each variable? Do you have so many variables that you are
in danger of overfitting your model to your data or that you might be violating assumptions of
whichever modeling tactic you’re using?
You might ask the question, “How do I take all of the variables I’ve collected and focus on only
a few of them?” In technical terms, you want to “reduce the dimension of your feature space.”
By reducing the dimension of your feature space, you have fewer relationships between variables
to consider and you are less likely to overfit your model. (Note: This doesn’t immediately mean
that overfitting, etc. are no longer concerns — but we’re moving in the right direction!)
Somewhat unsurprisingly, reducing the dimension of the feature space is called “dimensionality
reduction.” There are many ways to achieve dimensionality reduction, but most of these
techniques fall into one of two classes:
Feature Elimination
Feature Extraction
Feature elimination is what it sounds like: we reduce the feature space by eliminating features.
In the GDP example above, instead of considering every single variable, we might drop all
variables except the three we think will best predict what the U.S.’s gross domestic product will
look like. Advantages of feature elimination methods include simplicity and maintaining
interpretability of your variables.
As a disadvantage, though, you gain no information from those variables you’ve dropped. If we
only use last year’s GDP, the proportion of the population in manufacturing jobs per the most
recent American Community Survey numbers, and unemployment rate to predict this year’s
GDP, we’re missing out on whatever the dropped variables could contribute to our model. By
eliminating features, we’ve also entirely eliminated any benefits those dropped variables would
bring.
Feature extraction, however, doesn’t run into this problem. Say we have ten independent
variables. In feature extraction, we create ten “new” independent variables, where each “new”
independent variable is a combination of each of the ten “old” independent variables. However,
we create these new independent variables in a specific way and order these new variables by
how well they predict our dependent variable.
You might say, “Where does the dimensionality reduction come into play?” Well, we keep as
many of the new independent variables as we want, but we drop the “least important ones.”
Because we ordered the new variables by how well they predict our dependent variable, we
know which variable is the most important and least important. But — and here’s the kicker —
because these new independent variables are combinations of our old ones, we’re still keeping
the most valuable parts of our old variables, even when we drop one or more of these “new”
variables!
What is PCA?
Principal component analysis is a technique for feature extraction — so it combines our input
variables in a specific way, then we can drop the “least important” variables while still retaining
the most valuable parts of all of the variables! As an added benefit, each of the “new” variables
after PCA are all independent of one another. This is a benefit because the assumptions of a
linear model require our independent variables to be independent of one another. If we decide to
fit a linear regression model with these “new” variables (see “principal component regression”
below), this assumption will necessarily be satisfied.
1. Do you want to reduce the number of variables, but aren’t able to identify variables to
completely remove from consideration?
2. Do you want to ensure your variables are independent of one another?
3. Are you comfortable making your independent variables less interpretable?
If you answered “yes” to all three questions, then PCA is a good method to use. If you answered
“no” to question 3, you should not use PCA.
The section after this discusses why PCA works, but providing a brief summary before jumping
into the algorithm may be helpful for context:
We are going to calculate a matrix that summarizes how our variables all relate to one another.
We’ll then break this matrix down into two separate components: direction and magnitude. We
can then understand the “directions” of our data and its “magnitude” (or how “important” each
direction is). The screenshot below, from the setosa.io applet, displays the two main directions
in this data: the “red direction” and the “green direction.” In this case, the “red direction” is the
more important one. We’ll get into why this is the case later, but given how the dots are
arranged, can you see why the “red direction” looks more important than the “green direction?”
(Hint: What would fitting a line of best fit to this data look like?)
Our original data in the xy-plane.
We will transform our original data to align with these important directions (which are
combinations of our original variables). The screenshot below (again from setosa.io) is the same
exact data as above, but transformed so that the x- and y-axes are now the “red direction” and
“green direction.” What would the line of best fit look like here?
While the visual example here is two-dimensional (and thus we have two “directions”), think
about a case where our data has more dimensions. By identifying which “directions” are most
“important,” we can compress or project our data into a smaller space by dropping the
“directions” that are the “least important.” By projecting our data into a smaller space, we’re
reducing the dimensionality of our feature space… but because we’ve transformed our data in
these different “directions,” we’ve made sure to keep all original variables in our model!
The two charts show the exact same data, but the right graph reflects the original data
transformed so that our axes are now the principal components.
In both graphs, the principal components are perpendicular to one another. In fact, every
principal component will ALWAYS be orthogonal (a.k.a. official math term for perpendicular) to
every other principal component. (Don’t believe me? Try to break the applet!)
While PCA is a very technical method relying on in-depth linear algebra algorithms, it’s a
relatively intuitive method when you think about it.
First, the covariance matrix ZᵀZ is a matrix that contains estimates of how every variable in Z
relates to every other variable in Z. Understanding how one variable is associated with another
is quite powerful.
Second, eigenvalues and eigenvectors are important. Eigenvectors represent directions. Think of
plotting your data on a multidimensional scatterplot. Then one can think of an individual
eigenvector as a particular “direction” in your scatterplot of data. Eigenvalues represent
magnitude, or importance. Bigger eigenvalues correlate with more important directions.
Finally, we make an assumption that more variability in a particular direction correlates with
explaining the behavior of the dependent variable. Lots of variability usually indicates signal,
whereas little variability usually indicates noise. Thus, the more variability there is in a particular
direction is, theoretically, indicative of something important we want to detect.
1. A measure of how each variable is associated with one another. (Covariance matrix.)
2. The directions in which our data are dispersed. (Eigenvectors.)
3. The relative importance of these different directions. (Eigenvalues.)
PCA combines our predictors and allows us to drop the eigenvectors that are relatively
unimportant.
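To make these three ingredients concrete, here is a minimal NumPy sketch (the data matrix Z and the number of retained components k are invented for illustration) that builds the covariance matrix ZᵀZ, pulls out its eigenvectors and eigenvalues, and keeps only the most important directions:

import numpy as np

# Z: standardized data matrix (n observations x p variables), illustrative example
rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 5))
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)          # centre and scale each column

cov = (Z.T @ Z) / (Z.shape[0] - 1)                # covariance matrix: how variables co-vary

eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvectors = directions, eigenvalues = importance
order = np.argsort(eigvals)[::-1]                 # sort directions from most to least important
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                             # keep only the k most important directions
Z_star = Z @ eigvecs[:, :k]                       # transformed data ("principal components")

print(eigvals / eigvals.sum())                    # share of variance explained by each direction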
Are there variations of PCA? Yes, more than I can address here in a reasonable amount of space. The one I've most frequently seen is principal component regression, where we take our untransformed Y and regress it on the subset of Z* that we didn't drop. (This is where the independence of the columns of Z* comes in; by regressing Y on Z*, we know that the required independence of the independent variables will necessarily be satisfied. However, we will still need to check our other assumptions.)
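A minimal principal component regression sketch along these lines, assuming scikit-learn is available; the toy data, the choice of two retained components, and the pipeline itself are illustrative only, and the other regression assumptions would still need checking:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))                       # stands in for correlated predictors
y = X @ rng.normal(size=6) + rng.normal(size=200)   # illustrative response

# Standardize, project onto the retained principal components (Z*), then regress y on them.
pcr = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
pcr.fit(X, y)
print("R^2 on training data:", pcr.score(X, y))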
Introduction to Ensemble learning
Ensemble modeling is a powerful way to improve the performance of your model. It usually pays off to apply ensemble learning on top of the various models you might be building. Ensemble learning is a broad topic and is confined only by your own imagination. For the purpose of this article, I will cover the basic concepts and ideas of ensemble modeling, which should be enough for you to start building ensembles on your own. As usual, we have tried to keep things as simple as possible.
Let's quickly start with an example to understand the basics of ensemble learning. This example will show how we use ensemble models every day without realizing it.
Example: I want to invest in a company XYZ, but I am not sure about its performance. So I look for advice on whether the stock price will increase by more than 6% per annum or not, and I decide to approach various experts with diverse domain experience:
1. Employee of Company XYZ: This person knows the internal workings of the company and has insider information about how the firm operates. But he lacks a broader perspective on how competitors are innovating, how the technology is evolving, and what impact this evolution will have on Company XYZ's product. In the past, he has been right 70% of the time.
2. Financial Advisor of Company XYZ: This person has a broader perspective on how the company's strategy will fare in this competitive environment. However, he lacks a view on how the company's internal policies are faring. In the past, he has been right 75% of the time.
3. Stock Market Trader: This person has observed the company's stock price over the past 3 years. He knows the seasonality trends and how the overall market is performing, and he has developed a strong intuition about how stocks might vary over time. In the past, he has been right 70% of the time.
4. Employee of a competitor: This person knows the internal workings of the competitor firms and is aware of certain changes that are yet to be introduced. However, he lacks insight into the company in focus and into the external factors that relate the competitor's growth to that of the company in question. In the past, he has been right 60% of the time.
5. Market Research team in the same segment: This team analyzes the customer preference for company XYZ's product over others and how this is changing with time. Because it deals with the customer side, it is unaware of the changes company XYZ will make to align with its own goals. In the past, the team has been right 75% of the time.
6. Social Media Expert: This person can help us understand how company XYZ has positioned its products in the market and how customer sentiment towards the company is changing over time. He is unaware of any details beyond digital marketing. In the past, he has been right 65% of the time.
Given the broad spectrum of access we have, we can probably combine all the information and
make an informed decision.
In a scenario where all 6 experts/teams confirm that it's a good decision (assuming all the predictions are independent of each other), we will get a combined accuracy rate, that is, the probability that not all six of them are wrong at the same time, of
1 - (30% * 25% * 30% * 40% * 25% * 35%)
= 1 - 0.0007875 = 99.92125%
Assumption: The assumption that all the predictions are completely independent is slightly extreme, as they are expected to be correlated. Nevertheless, this shows how confident we can become by combining various predictions together.
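The arithmetic above can be checked in a few lines of Python; the accuracies are the six figures quoted for the experts, and the same (strong) independence assumption is made:

accuracies = [0.70, 0.75, 0.70, 0.60, 0.75, 0.65]   # the six experts' historical accuracy

# Probability that *all* of them are wrong at the same time (assuming independence):
p_all_wrong = 1.0
for a in accuracies:
    p_all_wrong *= (1 - a)

print(p_all_wrong)          # 0.0007875
print(1 - p_all_wrong)      # 0.9992125, i.e. ~99.92%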
Let us now change the scenario slightly. This time we have 6 experts, all of whom are employees of company XYZ working in the same division, and each has a 70% propensity to advise correctly.
What if we combine all of this advice together, can we still raise our confidence to >99%?
Obviously not, as all the predictions are based on a very similar set of information. They are certain to be influenced by the same information, and the only variation in their advice would be due to their personal opinions and the facts they have individually collected about the firm.
The three most popular methods for combining the predictions from different models are:
Bagging. Building multiple models (typically of the same type) from different subsamples of the
training dataset.
Boosting. Building multiple models (typically of the same type) each of which learns to fix the
prediction errors of a prior model in the chain.
Voting. Building multiple models (typically of differing types) and combining their predictions with simple statistics (like taking the mean or the majority vote); see the sketch below.
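As a concrete illustration of the voting approach, here is a minimal scikit-learn sketch; the three model types and the toy dataset are assumptions chosen only for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Three models of differing types, combined by majority ("hard") vote.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("tree", DecisionTreeClassifier(max_depth=4)),
        ("nb", GaussianNB()),
    ],
    voting="hard",
)
print("cross-validated accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())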
Bagging (Bootstrap Aggregation) is used when our goal is to reduce the variance of a decision tree. The idea here is to create several subsets of data from the training sample, chosen randomly with replacement. Each subset of data is then used to train its own decision tree. As a result, we end up with an ensemble of different models, and the average of all the predictions from the different trees is used, which is more robust than a single decision tree.
Bootstrapping is the process of creating random samples with replacement for estimating sample statistics.
One way to create bootstrap samples is to select n items with replacement from an original sample, N. A bootstrap sample may have a few duplicate observations or records, as the sampling is done with replacement.
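A short NumPy sketch of drawing one bootstrap sample, selecting n items with replacement from an original sample of n items; the values in the array are invented for illustration:

import numpy as np

rng = np.random.default_rng(42)
original_sample = np.array([2.3, 4.1, 5.0, 3.3, 6.7, 4.8, 5.9, 2.9])

# Draw n items with replacement; duplicates are expected and allowed.
indices = rng.integers(0, len(original_sample), size=len(original_sample))
bootstrap_sample = original_sample[indices]

print(bootstrap_sample)         # some values may repeat, some may be missing
print(bootstrap_sample.mean())  # resample statistic (will differ from the original mean)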
Bootstrapping Sampling Example
The samples are referred to as resamples. This allows the model or algorithm to get a better understanding of the various biases, variances and features that exist in the resample. Taking a sample of the data allows the resample to contain different characteristics than the data might have contained as a whole. Each sample population has different pieces, and none are identical, which affects the overall mean, standard deviation and other descriptive metrics of the data set. In turn, this can lead to more robust models.
Why do we create bootstrap samples? Bootstrap samples are created to estimate and validate models, for improved accuracy, reduced variance and bias, and improved stability of the model.
Once the bootstrap samples are created, a classifier is trained on each of them and the final prediction is based on a popularity vote. In a classification model, the label with the maximum votes is assigned to each observation; the average value is used in the case of a regression model.
Bagging: Overview
Bagging is an ensembling process in which a model is trained on each of the bootstrap samples and the final model is an aggregation of all the sample models. For a numeric target variable (regression problems) the predicted outcome is the average of all the models, while in classification problems the predicted class is decided by plurality.
What bagging does is help reduce variance from models that might be very accurate, but only on the data they were trained on. This is also known as overfitting.
Overfitting is when a function fits the data too well. Typically this is because the fitted equation is much too complicated, trying to account for each individual data point and outlier.
173
Another example of an algorithm that can overfit easily is a decision tree. Models developed using decision trees rely on very simple heuristics: a decision tree is composed of a set of if-else statements applied in a specific order. Thus, if the data set is changed to a new one with some bias or a different spread of the underlying features compared to the previous set, the model will fail to be as accurate, because the new data will not fit the overfitted model as well.
Now, we will show an example of bagging for both regression (numerical outcome) and classification scenarios.
In R, the "adabag" and "ipred" packages allow us to develop bagging-based models for both classification and regression scenarios. The tree-building process used in bagging is based on the CART algorithm, and both adabag and ipred use the rpart implementation.
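The packages above are R implementations; a roughly equivalent sketch in Python/scikit-learn, with CART-style decision trees as the base learners and toy datasets assumed for illustration, looks like this:

from sklearn.datasets import load_diabetes, load_iris
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: bagged trees, final label decided by majority vote over 100 bootstrap samples.
Xc, yc = load_iris(return_X_y=True)
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
print("classification accuracy:", cross_val_score(bag_clf, Xc, yc, cv=5).mean())

# Regression: predictions are averaged across the bagged trees.
Xr, yr = load_diabetes(return_X_y=True)
bag_reg = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100, random_state=0)
print("regression R^2:", cross_val_score(bag_reg, Xr, yr, cv=5).mean())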
Random Forest
Random Forest is an extension of bagging. It takes one extra step: in addition to taking a random subset of the data, it also takes a random selection of features, rather than using all features, to grow each tree. When you have many such random trees, it is called a Random Forest.
1. Suppose there are N observations and M features in the training data set. First, a sample from the training data set is taken randomly with replacement.
2. A subset of the M features is selected randomly, and whichever feature gives the best split is used to split the node; this is repeated iteratively.
3. Each tree is grown to its largest possible extent.
4. The above steps are repeated, and the prediction is given based on the aggregation of predictions from the n trees.
Since the final prediction is based on the mean of the predictions from the subset trees, it won't give precise continuous values for a regression model.
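A minimal Random Forest sketch in scikit-learn, with a toy dataset assumed for illustration; the max_features setting is the extra random feature selection step that distinguishes it from plain bagging:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# n_estimators trees, each grown on a bootstrap sample and restricted to a random
# subset of the features ("sqrt" of the total) at every split.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())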
What is Boosting?
The term ‘Boosting’ refers to a family of algorithms which convert weak learners into strong learners.
Let's understand this definition in detail by solving a problem of spam email identification:
How would you classify an email as SPAM or not? Like everyone else, our initial approach would be to identify ‘spam’ and ‘not spam’ emails using criteria such as the following. If:
1. The email has only one image file (a promotional image), it's SPAM
2. The email has only link(s), it's SPAM
3. The email body consists of a sentence like “You won a prize money of $ xxxxxx”, it's SPAM
4. The email is from our official domain “ABC.com”, it's not SPAM
5. The email is from a known source, it's not SPAM
Above, we've defined multiple rules to classify an email as ‘spam’ or ‘not spam’. But do you think these rules individually are strong enough to successfully classify an email? No. Individually, these rules are not powerful enough to classify an email as ‘spam’ or ‘not spam’. Therefore, these rules are called weak learners.
To convert weak learners into a strong learner, we combine the prediction of each weak learner using methods like:
• Using an average / weighted average
• Considering the prediction with the higher vote
For example: above, we have defined 5 weak learners. Out of these 5, suppose 3 vote ‘SPAM’ and 2 vote ‘Not SPAM’. In this case, by default, we'll consider the email as SPAM because we have the higher vote count (3) for ‘SPAM’.
How do Boosting Algorithms work?
Now we know that boosting combines weak learners, a.k.a. base learners, to form a strong rule. An immediate question which should pop into your mind is, ‘How does boosting identify weak rules?’
To find a weak rule, we apply a base learning (ML) algorithm with a different distribution each time. Each time the base learning algorithm is applied, it generates a new weak prediction rule. This is an iterative process, and after many iterations the boosting algorithm combines these weak rules into a single strong prediction rule.
Here's another question which might haunt you: ‘How do we choose a different distribution for each round?’
For choosing the right distribution, here are the steps:
Step 1: The base learner takes all the distributions and assigns equal weight, or attention, to each observation.
Step 2: If there are prediction errors caused by the first base learning algorithm, then we pay higher attention to the observations with prediction errors. Then we apply the next base learning algorithm.
Step 3: Iterate Step 2 until the limit on the number of base learning algorithms is reached or a high enough accuracy is achieved.
Finally, the algorithm combines the outputs from the weak learners and creates a strong learner, which eventually improves the prediction power of the model. Boosting pays more attention to examples which are mis-classified or have higher errors according to the preceding weak rules.
The underlying engine used for boosting algorithms can be anything, such as a decision stump or a margin-maximizing classification algorithm. There are many boosting algorithms which use different engines; two of the most common, AdaBoost and Gradient Boosting, are discussed below.
Boosting Algorithm: AdaBoost
Box 1: You can see that we have assigned equal weights to each data point and applied a decision stump to classify them as + (plus) or – (minus). The decision stump (D1) has generated a vertical line on the left side to classify the data points. We see that this vertical line has incorrectly predicted three + (plus) as – (minus). In such a case, we assign higher weights to these three + (plus) and apply another decision stump.
Box 2: Here, you can see that the size of the three incorrectly predicted + (plus) is bigger compared to the rest of the data points. In this case, the second decision stump (D2) will try to predict them correctly. Now, a vertical line (D2) on the right side of this box has classified the three mis-classified + (plus) correctly. But again, it has caused new mis-classification errors, this time on three – (minus). Again, we assign higher weights to the three – (minus) and apply another decision stump.
Box 3: Here, the three – (minus) are given higher weights. A decision stump (D3) is applied to predict these mis-classified observations correctly. This time a horizontal line is generated to classify + (plus) and – (minus), based on the higher weights of the mis-classified observations.
Box 4: Here, we have combined D1, D2 and D3 to form a strong prediction with a more complex rule than any of the individual weak learners. You can see that this algorithm has classified the observations quite well compared to any individual weak learner.
Combining Result
If W1*X1 + W2*X2 + … + WN*XN > 0, the final predicted class is 1; otherwise it is 0.
Mostly, we use decision stumps with AdaBoost, but we can use any machine learning algorithm as the base learner as long as it accepts weights on the training data set. We can use AdaBoost algorithms for both classification and regression problems.
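A minimal scikit-learn sketch of AdaBoost with decision stumps (depth-1 trees) as the base learner; the dataset and parameter values are assumptions for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Decision stump (depth-1 tree) as the weak/base learner; AdaBoost reweights the
# training examples after each stump so later stumps focus on earlier mistakes.
stump = DecisionTreeClassifier(max_depth=1)
ada = AdaBoostClassifier(stump, n_estimators=100, learning_rate=0.5)
print(cross_val_score(ada, X, y, cv=5).mean())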
Gradient boosting trains many models sequentially. Each new model gradually minimizes the loss function (y = ax + b + e, where the error term e needs special attention) of the whole system using the Gradient Descent method. The learning procedure consecutively fits new models to provide a more accurate estimate of the response variable.
The principal idea behind this algorithm is to construct new base learners which are maximally correlated with the negative gradient of the loss function associated with the whole ensemble. You can refer to the article “Learn Gradient Boosting Algorithm” to understand this concept through an example.
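A minimal gradient boosting sketch for a regression problem in scikit-learn; the dataset and parameters are assumptions, and with squared-error loss the negative gradient each new tree fits is simply the residual of the current ensemble:

from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Shallow trees added sequentially; each one fits the residual errors of the
# ensemble built so far, scaled by the learning rate.
gbm = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=3)
gbm.fit(X_train, y_train)
print("test R^2:", gbm.score(X_test, y_test))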
AdaBoost Tutorial
My education in the fundamentals of machine learning has mainly come from Andrew Ng’s
excellent Coursera course on the topic. One thing that wasn’t covered in that course, though, was
the topic of “boosting” which I’ve come across in a number of different contexts now.
Fortunately, it’s a relatively straightforward topic if you’re already familiar with machine
learning classification.
Whenever I’ve read about something that uses boosting, it’s always been with the “AdaBoost”
algorithm, so that’s what this post covers.
AdaBoost is a popular boosting technique which helps you combine multiple “weak classifiers”
into a single “strong classifier”. A weak classifier is simply a classifier that performs poorly, but
performs better than random guessing. A simple example might be classifying a person as male
or female based on their height. You could say anyone over 5’ 9” is a male and anyone under
that is a female. You’ll misclassify a lot of people that way, but your accuracy will still be
greater than 50%.
AdaBoost can be applied to any classification algorithm, so it’s really a technique that builds on
top of other classifiers as opposed to being a classifier itself.
You could just train a bunch of weak classifiers on your own and combine the results, so what
does AdaBoost do for you? There are really two things it figures out for you:
1. It helps you choose the training set for each new classifier that you train based on the results of
the previous classifier.
2. It determines how much weight should be given to each classifier’s proposed answer when
combining the results.
Each weak classifier should be trained on a random subset of the total training set. The subsets
can overlap–it’s not the same as, for example, dividing the training set into ten portions.
AdaBoost assigns a “weight” to each training example, which determines the probability that
each example should appear in the training set. Examples with higher weights are more likely to
be included in the training set, and vice versa. After training a classifier, AdaBoost increases the
weight on the misclassified examples so that these examples will make up a larger part of the
next classifier's training set, and hopefully the next classifier trained will perform better on them.
The equation for this weight update step is detailed later on.
Classifier Output Weights
After each classifier is trained, the classifier’s weight is calculated based on its accuracy. More
accurate classifiers are given more weight. A classifier with 50% accuracy is given a weight of
zero, and a classifier with less than 50% accuracy (kind of a funny concept) is given negative
weight.
Formal Definition
To learn about AdaBoost, I read through a tutorial written by one of the original authors of the
algorithm, Robert Schapire. The tutorial is available here.
Below, I’ve tried to offer some intuition into the relevant equations.
The final classifier consists of ‘T’ weak classifiers. h_t(x) is the output of weak classifier ‘t’ (in this paper, the outputs are limited to -1 or +1). alpha_t is the weight applied to classifier ‘t’ as determined by AdaBoost. So the final output is just a linear combination of all of the weak classifiers, H(x) = sign(alpha_1*h_1(x) + alpha_2*h_2(x) + … + alpha_T*h_T(x)), and we make our final decision simply by looking at the sign of this sum.
The classifiers are trained one at a time. After each classifier is trained, we update the
probabilities of each of the training examples appearing in the training set for the next classifier.
The first classifier (t = 1) is trained with equal probability given to all training examples. After
it’s trained, we compute the output weight (alpha) for that classifier.
The output weight, alpha_t, is fairly straightforward. It's based on the classifier's error rate ‘e_t’, which is just the number of misclassifications over the training set divided by the training set size; in the standard AdaBoost formulation, alpha_t = 0.5 * ln((1 - e_t) / e_t).
Here’s a plot of what alpha_t will look like for classifiers with different error rates.
There are three bits of intuition to take from this graph:
1. The classifier weight grows exponentially as the error approaches 0. Better classifiers are
given exponentially more weight.
2. The classifier weight is zero if the error rate is 0.5. A classifier with 50% accuracy is no
better than random guessing, so we ignore it.
3. The classifier weight grows exponentially negative as the error approaches 1. We give a
negative weight to classifiers with worse than 50% accuracy. “Whatever that classifier
says, do the opposite!”.
After computing the alpha for the first classifier, we update the training example weights using the following formula (the standard AdaBoost update): D_{t+1}(i) = D_t(i) * exp(-alpha_t * y_i * h_t(x_i)) / Z_t, where Z_t is a normalization factor chosen so that the updated weights again sum to one.
The variable D_t is a vector of weights, with one weight for each training example in the training
set. ‘i’ is the training example number. This equation shows you how to update the weight for the
ith training example.
The paper describes D_t as a distribution. This just means that each weight D(i) represents the
probability that training example i will be selected as part of the training set.
This vector is updated for each new weak classifier that’s trained. D_t refers to the weight vector
used when training classifier ‘t’.
This equation needs to be evaluated for each of the training samples ‘i’ (x_i, y_i). Each weight
from the previous training round is going to be scaled up or down by this exponential term.
To understand how this exponential term behaves, let’s look first at how exp(x) behaves.
The function exp(x) will return a fraction for negative values of x, and a value greater than one
for positive values of x. So the weight for training sample i will be either increased or decreased
depending on the final sign of the term “-alpha * y * h(x)”. For binary classifiers whose output is
constrained to either -1 or +1, the terms y and h(x) only contribute to the sign and not the
magnitude.
y_i is the correct output for training example ‘i’, and h_t(x_i) is the predicted output by classifier
t on this training example. If the predicted and actual output agree, y * h(x) will always be +1
(either 1 * 1 or -1 * -1). If they disagree, y * h(x) will be negative.
Ultimately, misclassifications by a classifier with a positive alpha will cause this training
example to be given a larger weight. And vice versa.
Note that by including alpha in this term, we are also incorporating the classifier’s effectiveness
into consideration when updating the weights. If a weak classifier misclassifies an input, we
don’t take that as seriously as a strong classifier’s mistake.
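To make the weight formulas concrete, here is a small NumPy sketch of a single AdaBoost round, using the standard classifier weight alpha_t = 0.5 * ln((1 - e_t) / e_t) and the example-weight update described above; the toy labels and predictions are invented for illustration:

import numpy as np

# Toy setup: 8 training examples, labels y in {-1, +1}, and the predictions h(x)
# of one weak classifier on those examples (both invented for illustration).
y = np.array([ 1,  1, -1, -1,  1, -1,  1, -1])
h = np.array([ 1, -1, -1, -1,  1, -1, -1, -1])   # misclassifies the examples at indices 1 and 6

D = np.full(len(y), 1 / len(y))                   # start with equal weights

# Weighted error rate and the classifier's output weight (alpha).
err = D[h != y].sum()
alpha = 0.5 * np.log((1 - err) / err)

# Scale each example's weight up (if misclassified) or down (if correct), then
# renormalize so the weights again form a distribution.
D = D * np.exp(-alpha * y * h)
D = D / D.sum()

print("error:", err, "alpha:", alpha)
print("updated weights:", D)       # misclassified examples now carry more weight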
Neural Networks:
Neural networks are the state-of-the-art technique for many different machine learning problems. So why do we need yet another learning algorithm? We already have linear regression and we have logistic regression, so why do we need neural networks? In order to motivate the discussion of neural networks, let us start by discussing a few examples of machine learning problems where we need to learn complex non-linear hypotheses. Consider a supervised learning classification problem where you have a training set like the one shown:
If you want to apply logistic regression to this problem, one thing you could do is apply logistic regression with a lot of nonlinear features. So here, g as usual is the sigmoid function, and we can include lots of polynomial terms. If you include enough polynomial terms then maybe you can get a hypothesis that separates the positive and negative examples. If you were to include all the quadratic terms, that is, all the second-order polynomial terms, there would be a lot of them. There would be terms like x1*x1, x1*x2, x1*x3, up to x1*x100, and then you have x2*x2, x2*x3 and so on. If you include just the second-order terms, that is, the terms that are a product of two of these features, then for the case of n equals 100 you end up with about 5000 features. Asymptotically, the number of quadratic features grows roughly as order n², where n is the number of original features (like x1 through x100 here); it's actually closer to n²/2. So including all the quadratic features doesn't seem like a good idea, because that is a lot of features: you might end up overfitting the training set, and it can also be computationally expensive to work with that many features.
One thing you could do is include only a subset of these. If you include only the features x1*x1, x2*x2, x3*x3, up to x100 squared, then the number of features is much smaller: you have only 100 such quadratic features. But this is not enough features, and it certainly won't let you fit a data set like the one on the upper left. In fact, if you include only these squared features together with the original features x1 up to x100, then you can fit some interesting hypotheses, such as axis-aligned ellipses, but you certainly cannot fit a more complex data set like the one shown here.
So 5000 features already seems like a lot, and if you were to include the cubic, or third-order, terms as well (products of three features such as x1*x2*x3, x1 squared times x2, or x10*x11*x17), you can imagine there are going to be a lot of them. In fact, there are on the order of n³ such features, and for n equal to 100 you end up with on the order of about 170,000 cubic features. So including these higher-order polynomial features when your original feature set size n is large really dramatically blows up your feature space, and this doesn't seem like a good way to come up with additional features with which to build non-linear classifiers when n is large. For many machine learning problems, n will be pretty large.
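The feature counts quoted in this passage are easy to verify; the snippet below (variable names are illustrative) counts the distinct second- and third-order terms for n = 100 original features:

from math import comb

n = 100

# Quadratic terms x_i * x_j (i <= j): roughly n^2 / 2 of them.
num_quadratic = comb(n, 2) + n          # 4950 mixed pairs + 100 squares = 5050
# Cubic terms x_i * x_j * x_k (i <= j <= k): on the order of n^3 / 6.
num_cubic = comb(n + 2, 3)              # 171,700, i.e. roughly the 170,000 quoted above

print(num_quadratic, num_cubic)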
What is an Activation Function?
It's just a thing (node) that you add to the output end of any neural network. It is also known as a Transfer Function. It can also be attached between two neural networks.
It is used to determine the output of a neural network, like yes or no. It maps the resulting values into a range such as 0 to 1 or -1 to 1 (depending upon the function).
As you can see, a linear activation function is simply a line. Therefore, the output of the function is not confined to any range.
Equation: f(x) = x
It doesn't help with the complexity or the various parameters of the usual data that is fed to neural networks.
The nonlinear activation functions are the most used activation functions. Nonlinearity makes the graph of the function a curve rather than a straight line.
This makes it easy for the model to generalize or adapt to a variety of data and to differentiate between the outputs.
Derivative or differential: the change along the y-axis with respect to the change along the x-axis. It is also known as the slope.
The nonlinear activation functions are mainly divided on the basis of their range or curve:
The sigmoid function is differentiable, which means we can find the slope of the sigmoid curve at any point.
The logistic sigmoid function can cause a neural network to get stuck during training.
The softmax function is a more generalized logistic activation function which is used for multiclass classification.
tanh is also like the logistic sigmoid, but better. The range of the tanh function is (-1, 1), and tanh is also sigmoidal (s-shaped).
The advantage is that negative inputs will be mapped strongly negative and zero inputs will be mapped near zero on the tanh graph.
Both the tanh and logistic sigmoid activation functions are used in feed-forward nets.
The ReLU is the most used activation function in the world right now, since it is used in almost all convolutional neural networks and deep learning models.
Fig: ReLU v/s Logistic Sigmoid
As you can see, the ReLU is half rectified (from the bottom): f(z) is zero when z is less than zero, and f(z) is equal to z when z is greater than or equal to zero.
Range: [0, infinity)
But the issue is that all negative values become zero immediately, which decreases the ability of the model to fit or train from the data properly. Any negative input given to the ReLU activation function turns into zero immediately, which in turn affects the resulting graph by not mapping the negative values appropriately.
4. Leaky ReLU
The leak helps to increase the range of the ReLU function: instead of outputting zero for negative inputs, it outputs a small multiple a*z of the input. Usually, the value of a is 0.01 or so.
Both the Leaky and Randomized ReLU functions are monotonic in nature, and their derivatives are also monotonic.
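For reference, here are minimal NumPy versions of the activation functions discussed above; the leak coefficient default of 0.01 matches the value mentioned for Leaky ReLU:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # squashes values into (0, 1)

def tanh(x):
    return np.tanh(x)                      # squashes values into (-1, 1)

def relu(x):
    return np.maximum(0.0, x)              # 0 for negative inputs, identity otherwise

def leaky_relu(x, a=0.01):
    return np.where(x > 0, x, a * x)       # small slope "a" keeps negative inputs alive

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, f(x))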
Backpropagation in Python
You can play around with a Python script that I wrote that implements the backpropagation
algorithm in this Github repo.
Backpropagation Visualization
For an interactive visualization showing a neural network as it learns, check out my Neural
Network visualization.
Additional Resources
If you find this tutorial useful and want to continue learning about neural networks, machine
learning, and deep learning, I highly recommend checking out Adrian Rosebrock’s new book,
Deep Learning for Computer Vision with Python. I really enjoyed the book and will have a full
review up soon.
Overview
For this tutorial, we're going to use a neural network with two inputs, two hidden neurons, and two output neurons. Additionally, the hidden and output neurons will each include a bias.
In order to have some numbers to work with, here are the initial weights, the biases, and training
inputs/outputs:
The goal of backpropagation is to optimize the weights so that the neural network can learn how
to correctly map arbitrary inputs to outputs.
For the rest of this tutorial we’re going to work with a single training set: given inputs 0.05 and
0.10, we want the neural network to output 0.01 and 0.99.
To begin, lets see what the neural network currently predicts given the weights and biases above
and inputs of 0.05 and 0.10. To do this we'll feed those inputs forward through the network.
We figure out the total net input to each hidden layer neuron, squash the total net input using an
activation function (here we use the logistic function), then repeat the process with the output
layer neurons.
Total net input is also referred to as just net input by some sources.
We repeat this process for the output layer neurons, using the output from the hidden layer
neurons as inputs.
And carrying out the same process for the second output neuron, o2, we get:
We can now calculate the error for each output neuron using the squared error function, E = ½ (target - output)², and sum them to get the total error:
Some sources refer to the target as the ideal and the output as the actual.
The ½ is included so that the exponent is cancelled when we differentiate later on. The result is eventually multiplied by a learning rate anyway, so it doesn't matter that we introduce a constant here [1].
For example, the target output for o1 is 0.01 but the neural network outputs 0.75136507, therefore its error is:
Repeating this process for o2 (remembering that the target is 0.99) we get:
The total error for the neural network is the sum of these errors:
Our goal with backpropagation is to update each of the weights in the network so that they cause
the actual output to be closer the target output, thereby minimizing the error for each output
neuron and the network as a whole.
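Because the figure with the initial weights and biases is not reproduced above, the sketch below fills them in with assumed values (w1 through w8 and biases b1, b2) chosen to be consistent with the worked numbers quoted in this section, for example an output of roughly 0.7514 for o1 and a total error of roughly 0.2984:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Inputs and targets from the text.
i1, i2 = 0.05, 0.10
target_o1, target_o2 = 0.01, 0.99

# Initial weights and biases: assumed values (the original figure is not shown here),
# chosen to be consistent with the worked numbers that follow.
w1, w2, w3, w4, b1 = 0.15, 0.20, 0.25, 0.30, 0.35
w5, w6, w7, w8, b2 = 0.40, 0.45, 0.50, 0.55, 0.60

# Hidden layer: total net input, then squash with the logistic function.
out_h1 = sigmoid(w1 * i1 + w2 * i2 + b1)
out_h2 = sigmoid(w3 * i1 + w4 * i2 + b1)

# Output layer: same process, using the hidden outputs as inputs.
out_o1 = sigmoid(w5 * out_h1 + w6 * out_h2 + b2)
out_o2 = sigmoid(w7 * out_h1 + w8 * out_h2 + b2)

# Squared error per output neuron, summed to get the total error.
E_total = 0.5 * (target_o1 - out_o1) ** 2 + 0.5 * (target_o2 - out_o2) ** 2
print(out_o1, out_o2, E_total)   # ~0.7514, ~0.7729, ~0.2984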
Output Layer
Consider one of the weights connecting the hidden layer to output neuron o1; call it w5 (a label used here for illustration and in the sketches in this section). We want to know how much a change in w5 affects the total error, i.e. ∂E_total/∂w5.
∂E_total/∂w5 is read as “the partial derivative of E_total with respect to w5”. You can also say “the gradient with respect to w5”.
First, how much does the total error change with respect to the output out_o1?
∂E_total/∂out_o1 is sometimes expressed as ∂E_o1/∂out_o1.
When we take the partial derivative of the total error with respect to out_o1, the quantity E_o2 becomes zero, because out_o1 does not affect it, which means we're taking the derivative of a constant, which is zero.
Next, how much does the output of o1 change with respect to its total net input?
The partial derivative of the logistic function is the output multiplied by 1 minus the output: ∂out_o1/∂net_o1 = out_o1 * (1 - out_o1).
Finally, how much does the total net input of o1 change with respect to w5? Since w5 multiplies out_h1 in the net input, this partial derivative is just out_h1.
You'll often see this calculation combined in the form of the delta rule:
∂E_total/∂w5 = (out_o1 - target_o1) * out_o1 * (1 - out_o1) * out_h1
Alternatively, the first two factors, ∂E_total/∂out_o1 and ∂out_o1/∂net_o1, can be written together as δ_o1 (the Greek letter delta), a.k.a. the node delta. We can use this to rewrite the calculation above:
∂E_total/∂w5 = δ_o1 * out_h1
Some sources extract the negative sign from δ, so the same expression is written with (target_o1 - out_o1) instead.
To decrease the error, we then subtract this value from the current weight (optionally multiplied by some learning rate, eta, which we'll set to 0.5): w5_new = w5 - eta * ∂E_total/∂w5.
Some sources use alpha to represent the learning rate, others use eta, and others even use epsilon.
We perform the actual updates in the neural network after we have the new weights leading into
the hidden layer neurons (ie, we use the original weights, not the updated weights, when we
continue the backpropagation algorithm below).
Hidden Layer
Next, we'll continue the backwards pass by calculating new values for the hidden-layer weights w1, w2, w3, and w4.
Visually:
We're going to use a similar process as we did for the output layer, but slightly different to account for the fact that the output of each hidden layer neuron contributes to the output (and therefore the error) of multiple output neurons. We know that out_h1 affects both out_o1 and out_o2, therefore ∂E_total/∂out_h1 needs to take into consideration its effect on both output neurons:
∂E_total/∂out_h1 = ∂E_o1/∂out_h1 + ∂E_o2/∂out_h1
Starting with ∂E_o1/∂out_h1, we can break it into ∂E_o1/∂net_o1 * ∂net_o1/∂out_h1, and ∂net_o1/∂out_h1 is equal to w5. Therefore ∂E_o1/∂out_h1 = δ_o1 * w5, and likewise for the second output neuron.
Now that we have ∂E_total/∂out_h1, we need to figure out ∂out_h1/∂net_h1 and then the partial derivative with respect to each weight:
We calculate the partial derivative of the total net input to h1 with respect to w1 the same way as we did for the output neuron: it is just the input i1 that w1 multiplies.
Finally, we've updated all of our weights! When we fed forward the 0.05 and 0.1 inputs originally, the error on the network was 0.298371109. After this first round of backpropagation, the total error is now down to 0.291027924. It might not seem like much, but after repeating this process 10,000 times, for example, the error plummets to 0.0000351085. At this point, when we feed forward 0.05 and 0.1, the two output neurons generate 0.015912196 (vs the 0.01 target) and 0.984065734 (vs the 0.99 target).
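Pulling the whole backward pass together, here is a compact NumPy sketch of the update loop, reusing the assumed weights from the forward-pass sketch above; the matrix form of the node deltas mirrors the per-weight derivations in this section, and repeating the loop drives the total error from roughly 0.2984 down toward the very small values quoted above (for simplicity, the biases are left unchanged in this sketch):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

i = np.array([0.05, 0.10])                 # inputs
t = np.array([0.01, 0.99])                 # targets
W_h = np.array([[0.15, 0.20],              # w1, w2  (into h1)
                [0.25, 0.30]])             # w3, w4  (into h2)
W_o = np.array([[0.40, 0.45],              # w5, w6  (into o1)
                [0.50, 0.55]])             # w7, w8  (into o2)
b1, b2, eta = 0.35, 0.60, 0.5              # assumed biases and the learning rate from the text

for step in range(10000):
    # Forward pass.
    out_h = sigmoid(W_h @ i + b1)
    out_o = sigmoid(W_o @ out_h + b2)

    # Output-layer node deltas: dE/dout * dout/dnet.
    delta_o = (out_o - t) * out_o * (1 - out_o)
    # Hidden-layer node deltas: back-propagate through the (still original) output weights.
    delta_h = (W_o.T @ delta_o) * out_h * (1 - out_h)

    # Gradient for each weight is (node delta) x (input feeding that weight);
    # as noted in the text, the original weights are used for the whole backward
    # pass before any update is applied.
    W_o -= eta * np.outer(delta_o, out_h)
    W_h -= eta * np.outer(delta_h, i)

    if step in (0, 9999):
        E_total = 0.5 * np.sum((t - out_o) ** 2)
        print(step, E_total)   # ~0.2984 at step 0, then a very small error after ~10,000 steps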