Lecture 4
As a concrete example, x superscript 2, in parentheses, will be the vector of the features for the second training example,
so it will be equal to 1416, 3, 2 and 40. Technically, I'm writing these numbers in a row, so sometimes this is called a
row vector rather than a column vector.
To refer to a specific feature in the ith training example, I will write x superscript (i), subscript j. So, for example, x
superscript (2), subscript 3 will be the value of the third feature, that is, the number of floors in the second training example,
and so that's going to be equal to 2.
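As a minimal sketch of how this indexing looks in code, assuming the training features are stored row by row in a 2-D NumPy array called X_train (the name and all values except the second row are just illustrative):

```python
import numpy as np

# Hypothetical training set: each row is one training example, with columns
# (size in square feet, number of bedrooms, number of floors, age in years).
X_train = np.array([
    [2104, 5, 1, 45],
    [1416, 3, 2, 40],
    [852,  2, 1, 35],
])

x_2 = X_train[1]        # x^(2): the feature vector of the second training example
x_2_3 = X_train[1, 2]   # x_3^(2): its third feature, the number of floors

print(x_2)    # prints the second row: 1416, 3, 2, 40
print(x_2_3)  # prints 2
```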
Sometimes, in order to emphasize that this x^(2) is not a number but is actually a list of numbers, that is, a vector, we'll draw
an arrow on top of it just to visually show that it is a vector, and over here as well, but you don't have to draw this arrow in
your notation. You can think of the arrow as an optional signifier. It's sometimes used just to emphasize that this is a
vector and not a number.
Let's think a bit about how you might interpret these parameters. If the model is trying to predict the price of the house
in thousands of dollars, you can think of this b equals 80 as saying that the base price of a house starts off at maybe
$80,000, assuming it has no size, no bedrooms, no floors, and no age. You can think of this 0.1 as saying that maybe for
every additional square foot, the price will increase by 0.1 times $1,000, or by $100, because we're saying that for each square
foot, the price increases by 0.1 times $1,000, which is $100. Maybe for each additional bedroom, the price increases by
$4,000, for each additional floor the price may increase by $10,000, and for each additional year of the house's age,
the price may decrease by $2,000, because the parameter is negative 2.
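As a quick sanity check of what these parameters mean together (this worked example is mine, applying the parameter values above to the second training example from earlier):

$f_{\vec{w},b}(\vec{x}^{(2)}) = 0.1 \times 1416 + 4 \times 3 + 10 \times 2 + (-2) \times 40 + 80 = 141.6 + 12 + 20 - 80 + 80 = 173.6$

so the model would predict a price of about $173,600 for that house.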
In general, if you have n features, then the model will look like this.
Let me also write x as a list or a vector, again a row vector, that lists all of the features x_1, x_2, x_3 up to x_n. This is
again a vector, so I'm going to add a little arrow up on top to signify that. In the notation up on top, we can also add little
arrows here and here to signify that that w and that x are actually these lists of numbers, that they're actually these
vectors.
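Written out, the model with n features that the lecture is pointing to is the standard multiple linear regression form (the dot-product shorthand on the right is what the vectorized code later on will compute):

$f_{\vec{w},b}(\vec{x}) = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b = \vec{w} \cdot \vec{x} + b$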
When you're implementing a learning algorithm, using
vectorization will both make your code shorter and also make it
run much more efficiently. Learning how to write vectorized
code will allow you to also take advantage of modern numerical
linear algebra libraries, as well as maybe even GPU hardware
that stands for graphics processing unit. This is hardware
originally designed to speed up computer graphics in your
computer, but it turns out it can be used, when you write
vectorized code, to also help you execute your code much more quickly.
I'm actually using a numerical linear algebra library in Python called NumPy, which is by far the most widely
used numerical linear algebra library in Python and in machine learning.
I want to emphasize that vectorization actually has two distinct benefits. First, it makes your code shorter; it is now just one line of
code. Isn't that cool? Second, it also results in your code running much faster than either of the two previous
implementations that did not use vectorization.
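Here is a minimal sketch of the contrast being described, assuming w and x are NumPy arrays of the same length and b is a scalar (the variable names and example values are just illustrative):

```python
import numpy as np

w = np.array([0.1, 4.0, 10.0, -2.0])   # example parameter values from earlier
b = 80.0
x = np.array([1416, 3, 2, 40])         # features of one training example

# Without vectorization: loop over the n features and accumulate the sum.
f = 0.0
for j in range(w.shape[0]):
    f = f + w[j] * x[j]
f = f + b

# With vectorization: the whole prediction is one call to the NumPy dot product.
f_vec = np.dot(w, x) + b
```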
The reason that the vectorized implementation is much faster is that, behind the scenes, the NumPy dot function is able to use
parallel hardware in your computer. This is true whether you're running this on a normal computer, that is, on a normal
computer's CPU, or if you are using a GPU, a graphics processing unit, that's often used to accelerate machine learning jobs.
The ability of the NumPy dot function to use parallel hardware makes it much more efficient than the for loop or the
sequential calculation that we saw previously. Now, this version is much more practical when n is large.
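If you want to see the difference for yourself, here is a rough timing sketch (the vector length and the use of time.time() are my own choices, not something from the lecture; the exact numbers will vary from machine to machine):

```python
import time
import numpy as np

n = 1_000_000
w = np.random.rand(n)
x = np.random.rand(n)

# Vectorized dot product.
start = time.time()
np.dot(w, x)
print("np.dot:  ", time.time() - start, "seconds")

# Explicit Python for loop over the same data.
start = time.time()
total = 0.0
for j in range(n):
    total += w[j] * x[j]
print("for loop:", time.time() - start, "seconds")
```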
When the possible range of values of a feature is large, like the size in square feet, which goes all the way up to 2,000, it's more
likely that a good model will learn to choose a relatively small parameter value, like 0.1. Likewise, when the possible values of
the feature are small, like the number of bedrooms, then a reasonable value for its parameter will be relatively large, like 50.
If you plot the training data, you notice that the horizontal axis is on a much larger scale or much larger range of values
compared to the vertical axis.
Next let's look at how the cost function might look in a contour plot. You might see a contour plot where the
horizontal axis has a much narrower range, say between zero and one, whereas the vertical axis takes on much
larger values, say between 10 and 100.
So the contours form ovals or ellipses, and they're short on one side and longer on the other. This is because a very
small change to w1 can have a very large impact on the estimated price, and that is, a very large impact on the cost J,
because w1 tends to be multiplied by a very large number, the size in square feet. In contrast, it takes a much larger
change in w2 in order to change the predictions much, and thus small changes to w2 don't change the cost function
nearly as much.
This is what might end up happening if you were to run gradient descent, if you were to use your training data as is.
Because the contours are so tall and skinny, gradient descent may end up bouncing back and forth for a long time
before it can finally find its way to the global minimum.
In situations like this, a useful thing to do is to scale the features. This
means performing some transformation of your training data so that x1
say might now range from 0 to 1 and x2 might also range from 0 to 1. So
the data points now look more like this and you might notice that the
scale of the plot on the bottom is now quite different than the one on
top. The key point is that the rescaled x1 and x2 are both now taking on
comparable ranges of values to each other.
When you run gradient descent on a cost function defined on this rescaled x1 and x2, using this transformed data, then
the contours will look more like circles and less tall and skinny, and gradient descent can find a much
more direct path to the global minimum. So when you have different features that take on very different ranges of values,
it can cause gradient descent to run slowly, but rescaling the different features so they all take on comparable ranges of
values can speed up gradient descent significantly.
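As a minimal sketch of the simplest version of this, dividing each feature by its maximum value (the arrays and their values are just illustrative):

```python
import numpy as np

x1 = np.array([2104.0, 1416.0, 852.0])  # size in square feet
x2 = np.array([5.0, 3.0, 2.0])          # number of bedrooms

# Divide each feature by its maximum value, so both now lie between 0 and 1.
x1_scaled = x1 / x1.max()
x2_scaled = x2 / x2.max()
```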
How to carry out Feature Scaling?
In addition to dividing by the maximum, you can also do what's
called mean normalization.
What this looks like is, you start with the original features and
then you re-scale them so that both of them are centered
around zero.
Whereas before they only had values greater than zero, now
they have both negative and positive values, maybe
usually between negative one and plus one.
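A minimal sketch of mean normalization, assuming the feature is stored in a NumPy array (the values are illustrative, and the formula used is the usual one for this method: subtract the mean, divide by the range):

```python
import numpy as np

x1 = np.array([2104.0, 1416.0, 852.0])  # size in square feet (illustrative values)

# Mean normalization: subtract the mean, then divide by the range (max - min),
# so the rescaled feature is centered around zero.
x1_norm = (x1 - x1.mean()) / (x1.max() - x1.min())
```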
To implement Z-score normalization, you need to calculate something called the standard deviation of each feature. You may
have seen the normal distribution, or the bell-shaped curve, sometimes also called the Gaussian distribution; the standard
deviation describes the width of that bell-shaped curve.
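A minimal sketch of Z-score normalization, again on an illustrative NumPy array:

```python
import numpy as np

x1 = np.array([2104.0, 1416.0, 852.0])  # size in square feet (illustrative values)

# Z-score normalization: subtract the mean, then divide by the standard deviation.
mu = x1.mean()
sigma = x1.std()
x1_zscore = (x1 - mu) / sigma
```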
As a rule of thumb, when performing feature scaling, you might want to aim for getting the features to
range from maybe anywhere around negative one to somewhere around plus one for each feature x.
These values, negative one and plus one can be a little bit loose. If the features range from
negative three to plus three or negative 0.3 to plus 0.3, all of these are completely okay.
The job of gradient descent is to find parameters w and b that hopefully minimize the cost function J.
Plot the cost function J, which is calculated on the training set,
at each iteration of gradient descent. Remember that each
iteration means after each simultaneous update of the
parameters w and b.
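A minimal sketch of such a plot, assuming you have recorded the cost after each iteration in a list called J_history (the name and the values here are hypothetical):

```python
import matplotlib.pyplot as plt

# Hypothetical values of the cost J on the training set, recorded after each
# simultaneous update of the parameters w and b.
J_history = [5000.0, 3200.0, 2100.0, 1500.0, 1200.0, 1050.0, 980.0]

plt.plot(range(len(J_history)), J_history)
plt.xlabel("iteration of gradient descent")
plt.ylabel("cost J (training set)")
plt.show()
```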