Slide 4 - Linear Regression With Multiple Variables
Linear Regression with multiple variables
Multiple features
Machine Learning
Multiple features (variables).
Size (feet²)    Price ($1000)
2104            460
1416            232
1534            315
852             178
…               …
Multiple features (variables).
Size (feet²)   Number of bedrooms   Number of floors   Age of home (years)   Price ($1000)
2104           5                    1                  45                    460
1416           3                    2                  40                    232
1534           3                    2                  30                    315
852            2                    1                  36                    178
…              …                    …                  …                     …
Notation:
n = number of features
x^{(i)} = input (features) of the i-th training example
x_j^{(i)} = value of feature j in the i-th training example
For example, from the table above, x^{(2)} = [1416; 3; 2; 40] and x_3^{(2)} = 2.
Hypothesis:
Previously (one feature): h_\theta(x) = \theta_0 + \theta_1 x
Now (multiple features): h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n
For convenience of notation, define x_0 = 1 (so that x_0^{(i)} = 1 for every example).
Then x = [x_0, x_1, \ldots, x_n]^T and \theta = [\theta_0, \theta_1, \ldots, \theta_n]^T are both (n+1)-dimensional vectors, and the hypothesis becomes
h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n = \theta^T x,
where \theta^T is a 1-by-(n+1) matrix (a row vector).
This is multivariate linear regression.
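As a concrete illustration, here is a minimal NumPy sketch of this vectorized hypothesis, built from the four training examples in the table above; the variable names and the all-zero initial theta are illustrative choices, not something from the slides.

```python
import numpy as np

# Features from the table: size, number of bedrooms, number of floors, age of home.
X = np.array([
    [2104, 5, 1, 45],
    [1416, 3, 2, 40],
    [1534, 3, 2, 30],
    [ 852, 2, 1, 36],
], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)   # price in $1000s

m, n = X.shape                        # m = 4 training examples, n = 4 features
X = np.hstack([np.ones((m, 1)), X])   # prepend the x_0 = 1 column -> shape (m, n+1)

theta = np.zeros(n + 1)               # (n+1)-dimensional parameter vector
h = X @ theta                         # h_theta(x^(i)) = theta^T x^(i) for every example at once
print(h)                              # all zeros until theta is learned
```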
Linear Regression with multiple variables
Gradient descent for multiple variables
Hypothesis: h_\theta(x) = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n, with x_0 = 1.
Parameters: \theta = [\theta_0, \theta_1, \ldots, \theta_n]^T, an (n+1)-dimensional vector.
Cost function: J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
Gradient descent:
Repeat until convergence {
    \theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
}
(simultaneously update \theta_j for every j = 0, 1, \ldots, n)
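A compact sketch of this update rule in NumPy; it assumes the design matrix X already contains the x_0 = 1 column, and the default learning rate and iteration count are illustrative rather than values from the slides.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=400):
    """Batch gradient descent for multivariate linear regression.

    X : (m, n+1) design matrix whose first column is all ones (x_0 = 1)
    y : (m,) vector of target values
    """
    m = len(y)
    theta = np.zeros(X.shape[1])
    J_history = []
    for _ in range(num_iters):
        errors = X @ theta - y                         # h_theta(x^(i)) - y^(i)
        J_history.append((errors @ errors) / (2 * m))  # cost J(theta) before this update
        gradient = (X.T @ errors) / m                  # partial derivatives for every j
        theta = theta - alpha * gradient               # simultaneous update of all theta_j
    return theta, J_history
```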
Linear Regression with multiple variables
Gradient descent in practice I: Feature scaling
Feature scaling (a practical trick for making gradient descent work well)
Idea: make sure features are on a similar scale.
E.g. x_1 = size (0-2000 feet²), x_2 = number of bedrooms (1-5).
The problem with these two features is their very different ranges: the contours of J(\theta) become tall and skinny, and gradient descent can take a long time to reach the minimum.
Scaling the features, e.g. x_1 = size (feet²) / 2000 and x_2 = (number of bedrooms) / 5, makes the contours much rounder and gradient descent converges faster.
Mean normalization: replace x_i with (x_i - \mu_i) / s_i, where \mu_i is the average value of feature i over the training set and s_i is its range (max - min) or its standard deviation. (Do not apply this to x_0 = 1.)
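A small NumPy sketch of mean normalization, assuming X holds the raw feature columns without the x_0 = 1 column; dividing by the range rather than the standard deviation, and the function name itself, are illustrative choices.

```python
import numpy as np

def feature_normalize(X):
    """Mean-normalize every feature column: (x - average value) / (max - min)."""
    mu = X.mean(axis=0)                    # average value of each feature
    s = X.max(axis=0) - X.min(axis=0)      # range; X.std(axis=0) would also work
    return (X - mu) / s, mu, s             # keep mu and s to normalize new inputs the same way

# Size and number-of-bedrooms features from the table above.
X = np.array([[2104, 5], [1416, 3], [1534, 3], [852, 2]], dtype=float)
X_norm, mu, s = feature_normalize(X)
print(X_norm.round(2))
```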
Linear Regression with multiple variables
Gradient descent in practice II: Learning rate
Gradient descent
- "Debugging": how to make sure gradient descent is working correctly.
- How to choose the learning rate \alpha.
Making sure gradient descent is working correctly.
Plot J(\theta) against the number of iterations: if gradient descent is working, J(\theta) should decrease after every iteration.
[Figure: J(\theta) versus no. of iterations (0-400), with the curve flattening out as it converges.]
Example automatic convergence test: declare convergence if J(\theta) decreases by less than 10^{-3} in one iteration. Deciding this threshold can be hard, and the number of iterations gradient descent takes to converge depends on the application, so inspecting the plot is usually more informative.
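A self-contained sketch of this diagnostic, using the mean-normalized size feature from the earlier table; the learning rate, iteration cap, and 10^{-3} threshold are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

sizes = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 232.0, 315.0, 178.0])
x1 = (sizes - sizes.mean()) / (sizes.max() - sizes.min())  # mean-normalized size
X = np.column_stack([np.ones_like(x1), x1])                # add the x_0 = 1 column

alpha, epsilon = 0.3, 1e-3
theta = np.zeros(2)
J_history = []
for it in range(400):
    errors = X @ theta - y
    J_history.append((errors @ errors) / (2 * len(y)))
    theta -= alpha * (X.T @ errors) / len(y)
    # Automatic convergence test: stop once J decreases by less than epsilon in one iteration.
    if it > 0 and J_history[-2] - J_history[-1] < epsilon:
        break

plt.plot(J_history)
plt.xlabel("No. of iterations")
plt.ylabel("J(theta)")
plt.show()
```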
[Figure: plots of J(\theta) versus no. of iterations in which J(\theta) keeps rising or oscillates up and down.]
If J(\theta) is increasing or repeatedly going up and down as the iterations proceed, gradient descent is not working: use a smaller \alpha.
- For sufficiently small \alpha, J(\theta) should decrease on every iteration.
- But if \alpha is too small, gradient descent can be slow to converge.
Summary:
- If \alpha is too small: slow convergence.
- If \alpha is too large: J(\theta) may not decrease on every iteration and may not converge (slow convergence is also possible).
To choose \alpha, try a range of values spaced roughly 3x apart, e.g. ..., 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ...
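A quick sketch of that search over learning rates, reusing the normalized size feature from above; running a fixed 100 iterations per value and comparing the resulting costs is an illustrative way to do it, not a prescription from the slides.

```python
import numpy as np

sizes = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 232.0, 315.0, 178.0])
x1 = (sizes - sizes.mean()) / (sizes.max() - sizes.min())
X = np.column_stack([np.ones_like(x1), x1])

# Learning rates spaced roughly 3x apart.
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]:
    theta = np.zeros(2)
    for _ in range(100):
        theta -= alpha * (X.T @ (X @ theta - y)) / len(y)
    J = ((X @ theta - y) ** 2).sum() / (2 * len(y))
    print(f"alpha = {alpha:<5}  J(theta) after 100 iterations = {J:.2f}")
```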
Linear Regression with multiple variables
Features and polynomial regression
Housing prices prediction
With two features, frontage and depth, the hypothesis would be h_\theta(x) = \theta_0 + \theta_1 \times \text{frontage} + \theta_2 \times \text{depth}.
But we don't have to use those features directly: we can define a new feature x = frontage \times depth, the land area, and use h_\theta(x) = \theta_0 + \theta_1 x. Sometimes, by defining new features, you might actually get a better model.
Polynomial regression
[Figure: housing data, price (y) versus size (x); a straight line doesn't seem to fit this data very well.]
We could fit a quadratic model, h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2, but a quadratic eventually curves back down, which doesn't make sense for housing prices. A cubic function, h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3, can be a better fit.
To fit such a model with the multivariate linear regression machinery, define x_1 = (size), x_2 = (size)^2, x_3 = (size)^3. Feature scaling becomes especially important here, since these features take on very different ranges of values.
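A minimal sketch of this trick, feeding the engineered features x, x², x³ (after mean normalization) into plain batch gradient descent; the data points, learning rate, and iteration count are illustrative.

```python
import numpy as np

sizes = np.array([2104.0, 1416.0, 1534.0, 852.0])
prices = np.array([460.0, 232.0, 315.0, 178.0])

# Engineered features: x1 = size, x2 = size^2, x3 = size^3.
X = np.column_stack([sizes, sizes**2, sizes**3])

# Without scaling these columns span roughly 10^3 to 10^10, so mean-normalize each one.
X = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X = np.column_stack([np.ones(len(sizes)), X])     # add the x_0 = 1 column

theta = np.zeros(X.shape[1])
for _ in range(2000):                             # plain batch gradient descent
    theta -= 0.1 * (X.T @ (X @ theta - prices)) / len(prices)
print(theta)
```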
Choice of features
[Figure: price (y) versus size (x).]
Polynomial terms are not the only option. For example, h_\theta(x) = \theta_0 + \theta_1 (\text{size}) + \theta_2 \sqrt{\text{size}} gives a curve that keeps increasing but gradually flattens out, which can be a more natural shape for housing prices.
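The same data can be fitted with that square-root feature; solving the tiny least-squares problem directly with np.linalg.lstsq here is just an illustrative shortcut (gradient descent on the scaled features would work equally well).

```python
import numpy as np

sizes = np.array([2104.0, 1416.0, 1534.0, 852.0])
prices = np.array([460.0, 232.0, 315.0, 178.0])

# h_theta(x) = theta_0 + theta_1 * size + theta_2 * sqrt(size)
X = np.column_stack([np.ones_like(sizes), sizes, np.sqrt(sizes)])
theta, *_ = np.linalg.lstsq(X, prices, rcond=None)
print(theta)
```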
Linear Regression with multiple variables
Normal equation
Gradient descent is an iterative algorithm: it takes many steps, needing multiple iterations to converge to the global minimum.
Normal equation: a method to solve for \theta analytically, in one step. For some linear regression problems, the normal equation gives us a much better way to solve for the optimal value of the parameters \theta.
Intuition: if \theta is 1D (just a scalar value), then J(\theta) = a\theta^2 + b\theta + c is a quadratic function, and we can minimize it by setting the derivative \frac{d}{d\theta} J(\theta) to zero and solving for \theta.
For \theta \in \mathbb{R}^{n+1}, J(\theta_0, \theta_1, \ldots, \theta_n) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2; set the partial derivative \frac{\partial}{\partial \theta_j} J(\theta) = 0 for every j, and solve for \theta_0, \theta_1, \ldots, \theta_n.
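For concreteness, the 1D case works out as follows (a standard calculus step filled in here, not taken verbatim from the slide):

J(\theta) = a\theta^2 + b\theta + c, \quad a > 0
\frac{d}{d\theta} J(\theta) = 2a\theta + b = 0 \;\Longrightarrow\; \theta = -\frac{b}{2a}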
Example: m = 4 training examples, n = 4 features. Add an extra column for x_0 = 1:

x_0   Size (feet²)   Number of bedrooms   Number of floors   Age of home (years)   Price ($1000)
1     2104           5                    1                  45                    460
1     1416           3                    2                  40                    232
1     1534           3                    2                  30                    315
1     852            2                    1                  36                    178

Put the features (including the x_0 column) into the m-by-(n+1) design matrix X, one row per training example, and the prices into the m-dimensional vector y. Then
\theta = (X^T X)^{-1} X^T y
gives the value of \theta that minimizes J(\theta).
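A minimal NumPy sketch of this computation using the table above; np.linalg.pinv is used instead of a plain matrix inverse so the sketch still runs if X^T X happens to be non-invertible (as it is here, with only 4 examples and 5 parameters).

```python
import numpy as np

# Design matrix with the extra x_0 = 1 column, and the price vector, from the table above.
X = np.array([
    [1, 2104, 5, 1, 45],
    [1, 1416, 3, 2, 40],
    [1, 1534, 3, 2, 30],
    [1,  852, 2, 1, 36],
], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)

# theta = (X^T X)^(-1) X^T y; pinv gives the pseudoinverse when X^T X is singular.
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)
```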
With m training examples and n features:

Gradient descent:
- Need to choose \alpha.
- Needs many iterations.
- Works well even when n is large.

Normal equation:
- No need to choose \alpha.
- No need to do feature scaling.
- Don't need to iterate.
- Need to compute (X^T X)^{-1}, which costs roughly O(n^3).
- Slow if n is very large.
To summarize: so long as the number of features is not too large, the normal equation gives us a great alternative method for solving for the parameter \theta. Concretely, so long as the number of features is less than 1000, the normal equation method can be used rather than gradient descent.
As we get to more complex learning algorithms, for example classification algorithms like logistic regression, the normal equation method actually does not work, and we will have to resort to gradient descent for those algorithms.
So gradient descent is a very useful algorithm to know, both for linear regression with a large number of features and for some of the other algorithms, because for them the normal equation method just doesn't apply. But for this specific model of linear regression, the normal equation can give you an alternative that can be much faster than gradient descent.
So, depending on the details of the problem and how many features you have, both of these algorithms are well worth knowing about.