Course 1: Linear Regression
Variables
Learning Rate α
Prof. Abdelouahab MOUSSAOUI
Focus on the learning rate (α)
Topics
o Update rule
o Debugging
o How to choose α
o Number of iterations needed varies a lot
30 iterations
3,000 iterations
3,000,000 iterations
Very hard to tell in advance how many iterations will be needed
Can often make a guess from a plot of J(θ) against iteration number after the first 100 or so iterations
o Automatic convergence tests
Check if J(θ) changes by a small threshold or less
Choosing this threshold is hard
So it's often easier to check the plot of J(θ) for a flat (straight) line
Why? Because we're judging the flatness in the context of the whole run of the algorithm
o If J(θ) increases instead of decreasing, you're overshooting, so reduce the learning rate so you actually reach the minimum
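A minimal Octave sketch of this debugging loop (the function name gradientDescentDebug and the 1e-9 threshold are illustrative, not from the course; X is the design matrix, y the targets, theta the initial parameters):

% Gradient descent that records J(theta) at every iteration so it can be
% plotted against the iteration number to debug the learning rate alpha.
function [theta, J_history] = gradientDescentDebug(X, y, theta, alpha, num_iters)
  m = length(y);
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    theta = theta - (alpha / m) * (X' * (X * theta - y));         % update rule
    J_history(iter) = (1 / (2 * m)) * sum((X * theta - y) .^ 2);  % cost J(theta)
    % Automatic convergence test: stop once J(theta) changes by less than
    % a small threshold (choosing that threshold is the hard part).
    if iter > 1 && abs(J_history(iter - 1) - J_history(iter)) < 1e-9
      J_history = J_history(1:iter);
      break;
    end
  end
end
% plot(J_history) should decrease on every iteration; if it rises, reduce alpha.

In practice, a common approach is to try a small range of values for α (e.g. 0.001, 0.01, 0.1, 1) and keep the largest one for which the plotted J(θ) still decreases on every iteration.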
Example
o House price prediction
Two features
Frontage - width of the plot of land along the road (x1)
Depth - depth of the plot away from the road (x2)
o You don't have to use just two features
Can create new features
o Might decide that an important feature is the land area
So, create a new feature = frontage * depth (x3)
h(x) = θ0 + θ1x3
Area is a better indicator
o Often, by defining new features you may get a better model
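As a small Octave sketch of that feature-creation step (frontage and depth are assumed to be existing column vectors holding the two raw features):

area = frontage .* depth;            % new feature x3 = x1 * x2 (land area)
X = [ones(length(area), 1), area];   % design matrix for h(x) = theta0 + theta1*x3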
Polynomial regression
o May fit the data better
o θ0 + θ1x + θ2x^2 e.g. here we have a quadratic function
o For housing data could use a quadratic function
But it may not fit the data so well - the quadratic eventually turns back down, which would mean housing prices decrease when size gets really big
So instead could use a cubic function, e.g. θ0 + θ1x + θ2x^2 + θ3x^3
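A minimal Octave sketch of setting this up as ordinary linear regression over derived features (size_x is an assumed name for a column vector of house sizes):

% Cubic model h(x) = theta0 + theta1*x + theta2*x^2 + theta3*x^3, written as
% linear regression over the derived features [x, x^2, x^3].
X = [ones(length(size_x), 1), size_x, size_x .^ 2, size_x .^ 3];
% Feature scaling matters here, because x, x^2 and x^3 have very different ranges.
mu = mean(X(:, 2:end));
sigma = std(X(:, 2:end));
X(:, 2:end) = (X(:, 2:end) - mu) ./ sigma;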
Normal equation
For some linear regression problems the normal equation provides a better solution
So far we've been using gradient descent
o Iterative algorithm which takes steps to converge
Normal equation solves θ analytically
o Solve for the optimum value of theta
Has some advantages and disadvantages
Intuition: if θ were a single real number, take the derivative of J(θ) with respect to θ
Set that derivative equal to 0
This allows you to solve for the value of θ which minimizes J(θ)
In our more complex problems:
o Here θ is an n+1 dimensional vector of real numbers
o Cost function is a function of the vector value
How do we minimize this function?
Take the partial derivative of J(θ) with respect to θj and set it to 0 for every j
Do that and solve for θ0 to θn
This would give the values of θ which minimize J(θ)
o If you work through the calculus and the solution, the derivation is pretty complex
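A compact sketch of where that derivation ends up, in matrix form (using the design matrix X and target vector y introduced below, and the usual squared-error cost):

J(\theta) = \frac{1}{2m} (X\theta - y)^{T} (X\theta - y)
\nabla_{\theta} J(\theta) = \frac{1}{m} X^{T} (X\theta - y) = 0
\Rightarrow X^{T} X \theta = X^{T} y
\Rightarrow \theta = (X^{T} X)^{-1} X^{T} y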
o Example: here m = 4 (training examples) and n = 4 (features)
To implement the normal equation
o Take examples
o Add an extra column (x0 feature)
o Construct a matrix (X - the design matrix) which contains all the training data
features in an [m x n+1] matrix
o Do something similar for y
Construct a column vector y (an [m x 1] matrix)
o Then compute θ = (X^T X)^-1 X^T y, i.e. (X transpose times X) inverse, times X transpose, times y
If you compute this, you get the value of θ which minimizes the cost function
General case
o Vector y is constructed by taking all the y values and stacking them into a column vector
theta = pinv(X' * X) * X' * y
X' is the notation for X transpose
pinv is a function that computes the (pseudo-)inverse of a matrix
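Putting the whole recipe together as an Octave sketch (the raw feature matrix "features" and target vector "prices" are assumed variable names, not from the course):

m = length(prices);             % number of training examples
X = [ones(m, 1), features];     % design matrix: prepend the x0 = 1 column, size [m x (n+1)]
y = prices;                     % column vector of targets, size [m x 1]
theta = pinv(X' * X) * X' * y;  % normal equation: theta = (X^T X)^-1 X^T y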
In a previous lecture we discussed feature scaling
o If you're using the normal equation then no need for feature scaling
When should you use gradient descent and when should you use the normal equation?
o Gradient descent
Need to choose a learning rate
Needs many iterations - could make it slower
Works well even when n is massive (millions)
Better suited to big data
What is a big n, though?