Linear Regression With Multiple Variables
• Notation
– n = number of features
– m = number of training examples
– x^(i) = input (features) of the i-th training example
– x_j^(i) = value of feature j in the i-th training example
E.g.:
hθ(x) = 80 + 0.1x1 + 0.01x2 + 3x3 - 2x4
x = [x0, x1, x2, …, xn]ᵀ ϵ ℝ^(n+1)    θ = [θ0, θ1, θ2, …, θn]ᵀ ϵ ℝ^(n+1)    (with x0 = 1)
hθ(x) = θᵀx
Cost function: J(θ) = (1/(2m)) Σᵢ (hθ(x^(i)) − y^(i))²,  i = 1, …, m
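As a concrete sketch, the hypothesis and cost function above can be written in a few lines of NumPy. The names (hypothesis, cost, theta, X, y) are illustrative, and X is assumed to already include the x0 = 1 column:

```python
import numpy as np

def hypothesis(theta, X):
    # h_theta(x) = theta^T x, computed for every row of X at once;
    # X is assumed to already contain the x0 = 1 column
    return X @ theta

def cost(theta, X, y):
    # J(theta) = (1/(2m)) * sum_i (h_theta(x^(i)) - y^(i))^2
    m = len(y)
    errors = hypothesis(theta, X) - y
    return (errors @ errors) / (2 * m)
```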
Gradient descent:
Repeat {
  θj := θj − α ∂J(θ)/∂θj    (simultaneously update θj for j = 0, …, n)
}
Written out per parameter, the update is:
Repeat {
  θ0 := θ0 − α (1/m) Σᵢ (hθ(x^(i)) − y^(i)) x0^(i)
  θ1 := θ1 − α (1/m) Σᵢ (hθ(x^(i)) − y^(i)) x1^(i)
  θ2 := θ2 − α (1/m) Σᵢ (hθ(x^(i)) − y^(i)) x2^(i)
  …
}
Gradient Descent Algorithm
• We do this for each j (from 0 to n) as a simultaneous update (just as when n = 1)
• So, we update θj to
– θj minus the learning rate (α) times the partial derivative of the cost function J(θ) with respect to θj
– In non-calculus words, this means we take
• The learning rate
• Times 1/m (which makes the math easier)
• Times the sum, over the training examples, of
– The hypothesis applied to the feature vector, minus the actual value, times the j-th feature value of that example (sketched in code below)
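A minimal NumPy sketch of this update, reusing the cost helper from earlier; the vectorized gradient computes all the θj updates at once, which is exactly the simultaneous update described above (the function name and iteration count are illustrative):

```python
def gradient_descent(theta, X, y, alpha, num_iters=1000):
    # Repeat: theta_j := theta_j - alpha * (1/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i)
    m = len(y)
    J_history = []
    for _ in range(num_iters):
        gradient = X.T @ (X @ theta - y) / m  # partial derivatives for all j at once
        theta = theta - alpha * gradient      # simultaneous update of every theta_j
        J_history.append(cost(theta, X, y))
    return theta, J_history
```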
Feature Scaling
• Idea: make sure features are on a similar scale, e.g.
0 ≤ x1 ≤ 1
0 ≤ x2 ≤ 1
• You can rescale the values of x1 and x2 by dividing each by the max of its feature
• The contours of J(θ) then become more like circles (once the features are scaled between 0 and 1), so gradient descent finds the minimum faster
[Figure: contours of J(θ) over (θ1, θ2) before and after scaling]
• Example feature ranges:
x0 = 1
0 ≤ x1 ≤ 3
−2 ≤ x2 ≤ 0.5
−100 ≤ x3 ≤ 100
−0.0001 ≤ x4 ≤ 0.0001
– x1 and x2 are already near the ±1 scale; x3 is far too large and x4 far too small, so both should be rescaled
Mean Normalization
• Take a feature xi
– Replace it by (xi − μi)/max, where μi is the mean of that feature
– So the values of each feature have an average of about 0, e.g.
−0.5 ≤ x2 ≤ 0.5
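Both rescalings as a NumPy sketch; X_raw is assumed to be an m × n matrix of raw feature values (without the x0 column), with one feature per column:

```python
def scale_by_max(X_raw):
    # Feature scaling: divide each feature (column) by its max,
    # so values land roughly in [0, 1]
    return X_raw / X_raw.max(axis=0)

def mean_normalize(X_raw):
    # Mean normalization: (x_i - mean) / max,
    # so each feature averages about 0
    return (X_raw - X_raw.mean(axis=0)) / X_raw.max(axis=0)
```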
Making Sure Gradient Descent Works
• Plot J(θ) against the number of iterations: if gradient descent is working, J(θ) should decrease after every iteration
[Figure: J(θ) vs. no. of iterations for converging and non-converging runs]
• Automatic convergence test
– Declare convergence if J(θ) decreases by less than 10⁻³ in one iteration
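The automatic test can be wrapped around the gradient step as a stopping condition; the 10⁻³ threshold is the one quoted above, while the function name and iteration cap are illustrative:

```python
def run_until_converged(theta, X, y, alpha, tol=1e-3, max_iters=100000):
    # Declare convergence when J(theta) decreases by less than tol in one iteration
    m = len(y)
    prev_J = cost(theta, X, y)
    for _ in range(max_iters):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / m
        J = cost(theta, X, y)
        if prev_J - J < tol:
            break
        prev_J = J
    return theta
```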
Learning Rate α
• For sufficiently small α, J(θ) should decrease on every
iteration
• But if α is too small, gradient descent can be slow to converge
• So
– If α is too small: slow convergence
– If α is too large: J(θ) may not decrease on every iteration;
may not converge
• To choose α, try
…, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
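One way to choose among these candidates is to run a short burst of gradient descent at each value and compare the resulting costs. This sketch assumes the gradient_descent helper from earlier, a prepared X and y, and an initial θ of zeros:

```python
theta0 = np.zeros(X.shape[1])
for alpha in (0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0):
    _, J_history = gradient_descent(theta0, X, y, alpha, num_iters=100)
    print(f"alpha = {alpha}: J(theta) after 100 iterations = {J_history[-1]:.6f}")
```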
Choosing Features
• Example with two features
– Frontage: width of the plot of land along the road (x1)
– Depth: depth of the plot away from the road (x2)
• You need not use the raw features as given; you can define a new feature instead, e.g. x = frontage × depth (the area of the plot)
Normal Equation
• Idea: minimize J(θ) analytically, without iterating:
∂J(θ)/∂θj = … = 0  (for every j)
• Take the derivative of J(θ) with respect to θ
• Set that derivative equal to 0
• This allows us to solve for the value of θ which minimizes J(θ)
• Here
– n = 4
– m = 4
– X = [the m × (n+1) design matrix], y = [the m-dimensional target vector]
• θ = (XᵀX)⁻¹Xᵀy
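A sketch of the normal equation in NumPy; solving the linear system with np.linalg.solve avoids forming the inverse explicitly but yields the same θ (assuming XᵀX is invertible):

```python
def normal_equation(X, y):
    # theta = (X^T X)^(-1) X^T y, computed by solving (X^T X) theta = X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)
```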
General Form
• m training examples and n features
• The design matrix (X)
– Each training example is an (n+1)-dimensional feature column vector
– X is constructed by taking each training example, transposing it (column → row), and using the result as a row of the design matrix X
– This creates an m × (n+1) matrix
x^(i) ϵ ℝ^(n+1)
Design matrix:
X = [ (x^(1))ᵀ ; (x^(2))ᵀ ; … ; (x^(m))ᵀ ]    (m × (n+1))
E.g. with a single feature (n = 1), each x^(i) = [1, x1^(i)]ᵀ and X is m × 2.
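Constructing the design matrix as described above takes one line in NumPy; the single-feature values in the example call are made up purely for illustration:

```python
def design_matrix(X_raw):
    # Prepend the x0 = 1 column to the m x n raw feature matrix,
    # giving the m x (n+1) design matrix
    m = X_raw.shape[0]
    return np.column_stack([np.ones(m), X_raw])

# E.g. one feature (n = 1): the design matrix is m x 2
X = design_matrix(np.array([[2104.0], [1416.0], [1534.0], [852.0]]))
```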