Lecture 4
As a concrete example, x superscript 2, in parentheses, will be the vector of the features for the second training example,
so it will be equal to 1416, 3, 2 and 40. Technically, I'm writing these numbers in a row, so sometimes this is called a
row vector rather than a column vector.
To refer to a specific feature in the ith training example, I will write x superscript (i), subscript j. So, for example, x
superscript (2), subscript 3 will be the value of the third feature, that is, the number of floors in the second training example,
and so that's going to be equal to 2.
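As a minimal sketch of how this indexing looks in code, assuming the training features are stored row by row in a 2-D NumPy array called X_train (the name and all values except the second row are just illustrative):

```python
import numpy as np

# Hypothetical training set: each row is one training example, with columns
# (size in square feet, number of bedrooms, number of floors, age in years).
X_train = np.array([
    [2104, 5, 1, 45],
    [1416, 3, 2, 40],
    [852,  2, 1, 35],
])

x_2 = X_train[1]        # x^(2): the feature vector of the second training example
x_2_3 = X_train[1, 2]   # x_3^(2): its third feature, the number of floors

print(x_2)    # prints the second row: 1416, 3, 2, 40
print(x_2_3)  # prints 2
```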
Sometimes, in order to emphasize that this x^(2) is not a number but is actually a list of numbers, that is, a vector, we'll draw
an arrow on top of it just to visually show that it is a vector, and over here as well, but you don't have to draw this arrow in
your notation. You can think of the arrow as an optional signifier. It's sometimes used just to emphasize that this is a
vector and not a number.
Let's think a bit about how you might interpret these parameters. If the model is trying to predict the price of the house
in thousands of dollars, you can think of this b equals 80 as saying that the base price of a house starts off at maybe
$80,000, assuming it has no size, no bedrooms, no floors, and no age. You can think of this 0.1 as saying that maybe for
every additional square foot, the price will increase by 0.1 times $1,000, or by $100, because we're saying that for each square
foot, the price increases by 0.1 times $1,000, which is $100. Maybe for each additional bedroom, the price increases by
$4,000, for each additional floor the price may increase by $10,000, and for each additional year of the house's age,
the price may decrease by $2,000, because the parameter is negative 2.
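As a quick sanity check of what these parameters mean together (this worked example is mine, applying the parameter values above to the second training example from earlier):

$f_{\vec{w},b}(\vec{x}^{(2)}) = 0.1 \times 1416 + 4 \times 3 + 10 \times 2 + (-2) \times 40 + 80 = 141.6 + 12 + 20 - 80 + 80 = 173.6$

so the model would predict a price of about $173,600 for that house.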
In general, if you have n features, then the model will look like this.
Let me also write x as a list or a vector, again a row vector, that lists all of the features x_1, x_2, x_3 up to x_n. This is
again a vector, so I'm going to add a little arrow up on top to signify that. In the notation up on top, we can also add little
arrows here and here to signify that that w and that x are actually these lists of numbers, that they're actually these
vectors.
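Written out, the model with n features that the lecture is pointing to is the standard multiple linear regression form (the dot-product shorthand on the right is what the vectorized code later on will compute):

$f_{\vec{w},b}(\vec{x}) = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b = \vec{w} \cdot \vec{x} + b$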
When you're implementing a learning algorithm, using
vectorization will both make your code shorter and also make it
run much more efficiently. Learning how to write vectorized
code will allow you to also take advantage of modern numerical
linear algebra libraries, as well as maybe even GPU hardware
that stands for graphics processing unit. This is hardware
originally designed to speed up computer graphics in your
computer, but it turns out it can be used, when you write
vectorized code, to also help you execute your code much more quickly.
I'm actually using a numerical linear algebra library in Python called NumPy, which is by far the most widely
used numerical linear algebra library in Python and in machine learning.
I want to emphasize that vectorization actually has two distinct benefits. First, it makes your code shorter; it is now just one line of
code. Isn't that cool? Second, it also results in your code running much faster than either of the two previous
implementations that did not use vectorization.
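Here is a minimal sketch of the contrast being described, assuming w and x are NumPy arrays of the same length and b is a scalar (the variable names and example values are just illustrative):

```python
import numpy as np

w = np.array([0.1, 4.0, 10.0, -2.0])   # example parameter values from earlier
b = 80.0
x = np.array([1416, 3, 2, 40])         # features of one training example

# Without vectorization: loop over the n features and accumulate the sum.
f = 0.0
for j in range(w.shape[0]):
    f = f + w[j] * x[j]
f = f + b

# With vectorization: the whole prediction is one call to the NumPy dot product.
f_vec = np.dot(w, x) + b
```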
The reason that the vectorized implementation is much faster is that, behind the scenes, the NumPy dot function is able to use
parallel hardware in your computer. This is true whether you're running this on a normal computer, that is, on a normal
computer's CPU, or if you are using a GPU, a graphics processing unit, that's often used to accelerate machine learning jobs.
The ability of the NumPy dot function to use parallel hardware makes it much more efficient than the for loop or the
sequential calculation that we saw previously. Now, this version is much more practical when n is large.
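If you want to see the difference for yourself, here is a rough timing sketch (the vector length and the use of time.time() are my own choices, not something from the lecture; the exact numbers will vary from machine to machine):

```python
import time
import numpy as np

n = 1_000_000
w = np.random.rand(n)
x = np.random.rand(n)

# Vectorized dot product.
start = time.time()
np.dot(w, x)
print("np.dot:  ", time.time() - start, "seconds")

# Explicit Python for loop over the same data.
start = time.time()
total = 0.0
for j in range(n):
    total += w[j] * x[j]
print("for loop:", time.time() - start, "seconds")
```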
When the possible range of values of a feature is large, like the size in square feet, which goes all the way up to 2,000, it's more
likely that a good model will learn to choose a relatively small parameter value, like 0.1. Likewise, when the possible values of
the feature are small, like the number of bedrooms, then a reasonable value for its parameter will be relatively large, like 50.
If you plot the training data, you notice that the horizontal axis is on a much larger scale or much larger range of values
compared to the vertical axis.
Next let's look at how the cost function might look in a contour plot. You might see a contour plot where the
horizontal axis has a much narrower range, say between zero and one, whereas the vertical axis takes on much
larger values, say between 10 and 100.
So the contours form ovals or ellipses, and they're short on one side and longer on the other. This is because a very
small change to w1 can have a very large impact on the estimated price, and that is, a very large impact on the cost J,
because w1 tends to be multiplied by a very large number, the size in square feet. In contrast, it takes a much larger
change in w2 in order to change the predictions much, and thus small changes to w2 don't change the cost function
nearly as much.
This is what might end up happening if you were to run gradient descent, if you were to use your training data as is.
Because the contours are so tall and skinny, gradient descent may end up bouncing back and forth for a long time
before it can finally find its way to the global minimum.
In situations like this, a useful thing to do is to scale the features. This
means performing some transformation of your training data so that x1
say might now range from 0 to 1 and x2 might also range from 0 to 1. So
the data points now look more like this and you might notice that the
scale of the plot on the bottom is now quite different than the one on
top. The key point is that the rescaled x1 and x2 are both now taking on
comparable ranges of values to each other.
When you run gradient descent on a cost function defined on this rescaled x1 and x2, using this transformed data, then
the contours will look more like circles and less tall and skinny, and gradient descent can find a much
more direct path to the global minimum. So when you have different features that take on very different ranges of values,
it can cause gradient descent to run slowly, but rescaling the different features so they all take on comparable ranges of
values can speed up gradient descent significantly.
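As a minimal sketch of the simplest version of this, dividing each feature by its maximum value (the arrays and their values are just illustrative):

```python
import numpy as np

x1 = np.array([2104.0, 1416.0, 852.0])  # size in square feet
x2 = np.array([5.0, 3.0, 2.0])          # number of bedrooms

# Divide each feature by its maximum value, so both now lie between 0 and 1.
x1_scaled = x1 / x1.max()
x2_scaled = x2 / x2.max()
```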
How to carry out Feature Scaling?
In addition to dividing by the maximum, you can also do what's
called mean normalization.
What this looks like is, you start with the original features and
then you re-scale them so that both of them are centered
around zero.
Whereas before they only had values greater than zero, now
they have both negative and positive values, maybe
usually between negative one and plus one.
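A minimal sketch of mean normalization, assuming the feature is stored in a NumPy array (the values are illustrative, and the formula used is the usual one for this method: subtract the mean, divide by the range):

```python
import numpy as np

x1 = np.array([2104.0, 1416.0, 852.0])  # size in square feet (illustrative values)

# Mean normalization: subtract the mean, then divide by the range (max - min),
# so the rescaled feature is centered around zero.
x1_norm = (x1 - x1.mean()) / (x1.max() - x1.min())
```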
To implement Z-score normalization, you need to calculate something called the standard deviation of each feature. You may
have seen the normal distribution, or the bell-shaped curve, sometimes also called the Gaussian distribution; the standard
deviation describes the width of that bell-shaped curve.
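A minimal sketch of Z-score normalization, again on an illustrative NumPy array:

```python
import numpy as np

x1 = np.array([2104.0, 1416.0, 852.0])  # size in square feet (illustrative values)

# Z-score normalization: subtract the mean, then divide by the standard deviation.
mu = x1.mean()
sigma = x1.std()
x1_zscore = (x1 - mu) / sigma
```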
As a rule of thumb, when performing feature scaling, you might want to aim for getting the features to
range from maybe anywhere around negative one to somewhere around plus one for each feature x.
These values, negative one and plus one can be a little bit loose. If the features range from
negative three to plus three or negative 0.3 to plus 0.3, all of these are completely okay.
The job of gradient descent is to find parameters w and b that hopefully minimize the cost function J.
Plot the cost function J, which is calculated on the training set,
at each iteration of gradient descent. Remember that each
iteration means after each simultaneous update of the
parameters w and b.
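A minimal sketch of such a plot, assuming you have recorded the cost after each iteration in a list called J_history (the name and the values here are hypothetical):

```python
import matplotlib.pyplot as plt

# Hypothetical values of the cost J on the training set, recorded after each
# simultaneous update of the parameters w and b.
J_history = [5000.0, 3200.0, 2100.0, 1500.0, 1200.0, 1050.0, 980.0]

plt.plot(range(len(J_history)), J_history)
plt.xlabel("iteration of gradient descent")
plt.ylabel("cost J (training set)")
plt.show()
```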