1 Lecture 3: Optimization and Linear Regression
At a high level, a supervised machine learning problem has the following structure: a training dataset and a learning algorithm are combined to produce a predictive model. The predictive model is chosen to model the relationship between inputs and targets; for instance, it can predict future targets.
4 Optimizer: Notation
At a high level, an optimizer takes
* an objective J (also called a loss function), and
* a model class M,
and finds a model f ∈ M with the smallest value of the objective J:

$$ \min_{f \in \mathcal{M}} J(f). $$
Intuitively, this is the function that best “fits” the data on the training dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)}) \mid i = 1, 2, \ldots, n\}$.
We will use a quadratic function as our running example for an objective J.
[2]: import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [8, 4]

# A simple quadratic objective, its derivative, and a grid of values (illustrative choices).
def quadratic_function(theta):
    return 0.5 * (2 * theta - 1) ** 2

def quadratic_derivative(theta):
    return 2 * (2 * theta - 1)

thetas = np.linspace(-2, 2, 100)
f_vals = quadratic_function(thetas)

plt.plot(thetas, f_vals)
plt.xlabel('Theta')
plt.ylabel('Objective value')
plt.title('Simple quadratic function')
5 Calculus Review: Derivatives
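A brief reminder before the code: the derivative of a univariate function f at a point θ0 is the slope of the tangent line to its graph at θ0, defined as the limit of difference quotients:

$$ f'(\theta_0) = \frac{\partial f(\theta_0)}{\partial \theta} = \lim_{\Delta \to 0} \frac{f(\theta_0 + \Delta) - f(\theta_0)}{\Delta}. $$

The cells below plot this tangent line for the running quadratic example.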
f0, df0 = quadratic_function(0.), quadratic_derivative(0.)
line_length = 0.2  # half-length of the plotted tangent segment (illustrative value)

plt.plot(thetas, f_vals)
plt.annotate('', xytext=(0 - line_length, f0 - line_length * df0),
             xy=(0 + line_length, f0 + line_length * df0),
             arrowprops={'arrowstyle': '-', 'color': 'red'})
plt.xlabel('Theta')
plt.ylabel('Objective value')
plt.title('Simple quadratic function')
[6]: pts = np.array([[0, 0.5, 0.8]]).reshape((3, 1))
df0s = quadratic_derivative(pts)
f0s = quadratic_function(pts)

plt.plot(thetas, f_vals)
for pt, f0, df0 in zip(pts.flatten(), f0s.flatten(), df0s.flatten()):
    plt.annotate('', xytext=(pt - line_length, f0 - line_length * df0),
                 xy=(pt + line_length, f0 + line_length * df0),
                 arrowprops={'arrowstyle': '-', 'color': 'red'})
7 Calculus Review: The Gradient
$$ \nabla_\theta f(\theta_0) = \begin{bmatrix} \frac{\partial f(\theta_0)}{\partial \theta_1} \\ \frac{\partial f(\theta_0)}{\partial \theta_2} \\ \vdots \\ \frac{\partial f(\theta_0)}{\partial \theta_d} \end{bmatrix} $$

The j-th entry of the vector $\nabla_\theta f(\theta_0)$ is the partial derivative $\frac{\partial f(\theta_0)}{\partial \theta_j}$ of f with respect to the j-th component of θ.
We will use a quadratic function as a running example.
[7]: def quadratic_function2d(theta0, theta1):
    """Quadratic objective function, J(theta0, theta1).

    Parameters:
    theta0 (np.array): 2d array of first parameter theta0
    theta1 (np.array): 2d array of second parameter theta1

    Returns:
    fvals (np.array): 2d array of objective function values.
        fvals is the same dimension as theta0 and theta1.
        fvals[i,j] is the value at theta0[i,j] and theta1[i,j].
    """
    theta0 = np.atleast_2d(np.asarray(theta0))
    theta1 = np.atleast_2d(np.asarray(theta1))
    return 0.5*((2*theta1-2)**2 + (theta0-3)**2)
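For this quadratic, the gradient can be worked out in closed form, which is what the derivative code further below implements:

$$ \nabla_\theta J(\theta_0, \theta_1) = \begin{bmatrix} \theta_0 - 3 \\ 4\theta_1 - 4 \end{bmatrix}. $$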
theta0_grid = np.linspace(-4, 7, 101)   # grid over theta0 (illustrative range)
theta1_grid = np.linspace(-1, 4, 101)   # grid over theta1 (illustrative range)
X, Y = np.meshgrid(theta0_grid, theta1_grid)
J_grid = quadratic_function2d(X, Y)

contours = plt.contour(X, Y, J_grid, 10)
plt.clabel(contours)
plt.axis('equal')
def quadratic_derivative2d(theta0, theta1):
    """Derivative of the quadratic objective function, J(theta0, theta1).

    Parameters:
    theta0 (np.array): 1d array of first parameter theta0
    theta1 (np.array): 1d array of second parameter theta1

    Returns:
    grads (np.array): 2d array of partial derivatives.
        grads is of the same size as theta0 and theta1
        along the first dimension and of size
        two along the second dimension.
        grads[i,j] is the j-th partial derivative
        at input theta0[i], theta1[i].
    """
    # this is the gradient of 0.5*((2*theta1-2)**2 + (theta0-3)**2)
    grads = np.stack([theta0-3, (2*theta1-2)*2], axis=1)
    grads = grads.reshape([len(theta0), 2])
    return grads
We can visualize the gradient at several points.
[10]: theta0_pts, theta1_pts = np.array([2.3, -1.35, -2.3]), np.array([2.4, -0.15, 2.75])
theta0_grads, theta1_grads = quadratic_derivative2d(theta0_pts, theta1_pts).T

contours = plt.contour(X, Y, J_grid, 10)
plt.scatter(theta0_pts, theta1_pts)
# gradient arrows at each point (plt.quiver is an assumed choice for this plot)
plt.quiver(theta0_pts, theta1_pts, theta0_grads, theta1_grads)
plt.clabel(contours)
plt.xlabel('Theta0')
plt.ylabel('Theta1')
plt.title('Gradients of the quadratic function')
plt.axis('equal')
9 Gradient Descent: Intuition
More formally, if we want to optimize J(θ), we start with an initial guess θ0 for the parameters and repeat the following update until θ is no longer changing:

$$ \theta_i := \theta_{i-1} - \alpha \cdot \nabla_\theta J(\theta_{i-1}), $$

where α > 0 is the step size (also called the learning rate).
[24]: convergence_threshold = 2e-1
step_size = 2e-1
theta, theta_prev = np.array([[-2], [3]]), np.array([[0], [0]])
opt_pts = [theta.flatten()]
opt_grads = []

# run gradient descent on the quadratic until theta stops changing (stopping criterion assumed)
while np.linalg.norm(theta - theta_prev) > convergence_threshold:
    theta_prev = theta
    gradient = quadratic_derivative2d(theta[0], theta[1]).T
    theta = theta_prev - step_size * gradient
    opt_pts += [theta.flatten()]
    opt_grads += [gradient.flatten()]

plt.contour(X, Y, J_grid, 10)
plt.plot(*np.array(opt_pts).T, '-o')  # optimization path
plt.axis('equal')
# Part 2: Gradient Descent in Linear Models
Let’s now use gradient descent to derive a supervised learning algorithm for linear models.
If we want to optimize J(θ), we start with an initial guess θ0 for the parameters and repeat the
following update:
$$ \theta_i := \theta_{i-1} - \alpha \cdot \nabla_\theta J(\theta_{i-1}). $$
Let’s define our model in Python.
[26]: def f(X, theta):
    """The linear model we are trying to fit.

    Parameters:
    theta (np.array): d-dimensional vector of parameters
    X (np.array): (n,d)-dimensional data matrix

    Returns:
    y_pred (np.array): n-dimensional vector of predicted targets
    """
    return X.dot(theta)
We pick θ to minimize the mean squared error (MSE). Slight variants of this objective are also
known as the residual sum of squares (RSS) or the sum of squared residuals (SSR).
$$ J(\theta) = \frac{1}{2n} \sum_{i=1}^n \left( y^{(i)} - \theta^\top x^{(i)} \right)^2 $$
In other words, we are looking for the best compromise in θ over all the data points.
Let’s implement mean squared error.
[27]: def mean_squared_error(theta, X, y):
    """The cost function, J(theta), describing the goodness of fit.

    Parameters:
    theta (np.array): d-dimensional vector of parameters
    X (np.array): (n,d)-dimensional design matrix
    y (np.array): n-dimensional vector of targets
    """
    return 0.5*np.mean((y-f(X, theta))**2)
Let’s work out the partial derivatives of the MSE loss for a linear model.
$$
\begin{aligned}
\frac{\partial J(\theta)}{\partial \theta_j} & = \frac{\partial}{\partial \theta_j} \frac{1}{2} \left( f_\theta(x) - y \right)^2 \\
& = \left( f_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( f_\theta(x) - y \right) \\
& = \left( f_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^d \theta_k \cdot x_k - y \right) \\
& = \left( f_\theta(x) - y \right) \cdot x_j
\end{aligned}
$$
We can use this derivation to obtain an expression for the gradient of the MSE for a linear model:

$$ \nabla_\theta J(\theta) = \begin{bmatrix} \frac{\partial J(\theta)}{\partial \theta_1} \\ \frac{\partial J(\theta)}{\partial \theta_2} \\ \vdots \\ \frac{\partial J(\theta)}{\partial \theta_d} \end{bmatrix} = \begin{bmatrix} (f_\theta(x) - y) \cdot x_1 \\ (f_\theta(x) - y) \cdot x_2 \\ \vdots \\ (f_\theta(x) - y) \cdot x_d \end{bmatrix} = (f_\theta(x) - y) \cdot x. $$
def mse_gradient(theta, X, y):
    """The gradient of the mean squared error cost function.

    Parameters:
    theta (np.array): d-dimensional vector of parameters
    X (np.array): (n,d)-dimensional design matrix
    y (np.array): n-dimensional vector of targets

    Returns:
    grad (np.array): d-dimensional gradient of the MSE
    """
    return np.mean((f(X, theta) - y) * X.T, axis=1)
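As a quick sanity check on this implementation and on the derivation above, we can compare the analytic gradient to a finite-difference approximation on small synthetic arrays (a minimal sketch; the synthetic data below is not part of the diabetes example).

import numpy as np

# Small synthetic problem: 5 examples, 3 features (arbitrary values).
rng = np.random.default_rng(0)
X_check = rng.normal(size=(5, 3))
y_check = rng.normal(size=5)
theta_check = rng.normal(size=3)

analytic = mse_gradient(theta_check, X_check, y_check)

# Central finite differences of mean_squared_error, one coordinate at a time.
eps = 1e-6
numeric = np.zeros(3)
for j in range(3):
    e = np.zeros(3)
    e[j] = eps
    numeric[j] = (mean_squared_error(theta_check + e, X_check, y_check)
                  - mean_squared_error(theta_check - e, X_check, y_check)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))  # expected to print True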
In this section, we are going to again use the UCI Diabetes Dataset.
* For each patient we have access to a measurement of their body mass index (BMI) and a quantitative diabetes risk score (from 0-300).
* We are interested in understanding how BMI affects an individual’s diabetes risk.
import numpy as np
import pandas as pd
from sklearn import datasets
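For concreteness, here is one way to load the dataset and construct the training arrays used below; this is a minimal sketch in which the added constant column one (used for the intercept) and the held-out split are illustrative choices.

# Load the diabetes dataset as pandas objects.
X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
X['one'] = 1  # constant feature for the intercept (assumed preprocessing)

# Use most of the data for training; the exact split is an illustrative choice.
X_train, y_train = X.iloc[:-20], y.iloc[:-20]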
Putting this together with the gradient descent algorithm, we obtain a learning method for training
linear models.
theta, theta_prev = random_initialization()
while abs(J(theta) - J(theta_prev)) > conv_threshold:
    theta_prev = theta
    theta = theta_prev - step_size * (f(x, theta)-y) * x
This update rule is also known as the Least Mean Squares (LMS) or Widrow-Hoff learning rule.
threshold = 1e-3     # convergence threshold on the change in MSE (illustrative value)
step_size = 4e-1     # learning rate (illustrative value)
theta, theta_prev = np.zeros(X_train.shape[1]), np.ones(X_train.shape[1])
opt_pts, opt_grads, iter = [], [], 0

while np.abs(mean_squared_error(theta, X_train, y_train)
             - mean_squared_error(theta_prev, X_train, y_train)) > threshold:
    theta_prev = theta
    gradient = mse_gradient(theta, X_train, y_train)
    theta = theta_prev - step_size * gradient
    opt_pts += [theta]
    opt_grads += [gradient]
    iter += 1
# Part 3: Ordinary Least Squares
In practice, there is a more effective way than gradient descent to find linear model parameters.
We will see this method here, which will lead to our first non-toy algorithm: Ordinary Least
Squares.
Recall the definition of the gradient of a function f(θ):

$$ \nabla_\theta f(\theta_0) = \begin{bmatrix} \frac{\partial f(\theta_0)}{\partial \theta_1} \\ \frac{\partial f(\theta_0)}{\partial \theta_2} \\ \vdots \\ \frac{\partial f(\theta_0)}{\partial \theta_d} \end{bmatrix} $$

In other words, the j-th entry of the vector $\nabla_\theta f(\theta_0)$ is the partial derivative $\frac{\partial f(\theta_0)}{\partial \theta_j}$ of f with respect to the j-th component of θ.
In this section, we are going to again use the UCI Diabetes Dataset.
* For each patient we have access to a measurement of their body mass index (BMI) and a quantitative diabetes risk score (from 0-300).
* We are interested in understanding how BMI affects an individual’s diabetes risk.
[36]: %matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [8, 4]
import numpy as np
import pandas as pd
from sklearn import datasets
20 Notation: Design Matrix
Machine learning algorithms are most easily defined in the language of linear algebra. Therefore, it will be useful to represent the entire dataset as one matrix $X \in \mathbb{R}^{n \times d}$, of the form:

$$ X = \begin{bmatrix} x_1^{(1)} & x_2^{(1)} & \ldots & x_d^{(1)} \\ x_1^{(2)} & x_2^{(2)} & \ldots & x_d^{(2)} \\ & \vdots & \\ x_1^{(n)} & x_2^{(n)} & \ldots & x_d^{(n)} \end{bmatrix} = \begin{bmatrix} - & (x^{(1)})^\top & - \\ - & (x^{(2)})^\top & - \\ & \vdots & \\ - & (x^{(n)})^\top & - \end{bmatrix}. $$
s4 s5 s6 one
422 -0.002592 0.040672 -0.009362 1
423 0.108111 0.015567 -0.046641 1
424 0.034309 0.024053 0.023775 1
425 -0.076395 -0.020289 -0.050783 1
426 0.057557 0.035462 0.085907 1
Similarly, we can vectorize the target variables into a vector $y \in \mathbb{R}^n$ of the form

$$ y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix}. $$
Recall that we may fit a linear model by choosing θ that minimizes the squared error:
$$ J(\theta) = \frac{1}{2} \sum_{i=1}^n \left( y^{(i)} - \theta^\top x^{(i)} \right)^2 $$
In other words, we are looking for the best compromise in θ over all the data points.
We can write this sum in matrix-vector form as:
$$ J(\theta) = \frac{1}{2} (y - X\theta)^\top (y - X\theta) = \frac{1}{2} \| y - X\theta \|^2, $$

where X is the design matrix and $\| \cdot \|$ denotes the Euclidean norm.
Let’s take the gradient of this expression:

$$
\begin{aligned}
\nabla_\theta J(\theta) & = \nabla_\theta \frac{1}{2} (X\theta - y)^\top (X\theta - y) \\
& = \frac{1}{2} \nabla_\theta \left( (X\theta)^\top (X\theta) - (X\theta)^\top y - y^\top (X\theta) + y^\top y \right) \\
& = \frac{1}{2} \nabla_\theta \left( \theta^\top (X^\top X) \theta - 2(X\theta)^\top y \right) \\
& = \frac{1}{2} \left( 2(X^\top X)\theta - 2X^\top y \right) \\
& = (X^\top X)\theta - X^\top y
\end{aligned}
$$
We used the facts that $a^\top b = b^\top a$ (line 3), that $\nabla_x b^\top x = b$ (line 4), and that $\nabla_x x^\top A x = 2Ax$ for a symmetric matrix $A$ (line 4).
24 Normal Equations
Setting the gradient to zero, we obtain the normal equations:

$$ (X^\top X)\theta = X^\top y. $$

Solving for θ, we obtain the closed-form solution

$$ \theta^* = (X^\top X)^{-1} X^\top y. $$

Note that we assumed that the matrix $(X^\top X)$ is invertible; if this is not the case, there are easy ways of addressing this issue.
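As a side note, explicitly forming the inverse is rarely necessary; a numerically friendlier option is to solve the linear system $(X^\top X)\theta = X^\top y$ directly, for example with np.linalg.solve (a minimal sketch, equivalent to the closed-form solution when $X^\top X$ is invertible):

import numpy as np

def ols_fit(X, y):
    """Solve the normal equations (X^T X) theta = X^T y without forming the inverse."""
    return np.linalg.solve(X.T.dot(X), X.T.dot(y))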
Let’s apply the normal equations.
[21]: import numpy as np
theta_best = np.linalg.inv(X_train.T.dot(X_train)).dot(X_train.T).dot(y_train)
theta_best_df = pd.DataFrame(data=theta_best[np.newaxis, :], columns=X.columns)
theta_best_df
[21]: age sex bmi bp s1 s2 \
0 -3.888868 204.648785 -64.289163 -262.796691 14003.726808 -11798.307781
s3 s4 s5 s6 one
0 -5892.15807 -1136.947646 -2736.597108 -393.879743 155.698998
We can now use our estimate of theta to compute predictions for 3 new data points.
[22]: # Collect 3 data points for testing
X_test = X.iloc[:3]
y_test = y.iloc[:3]

# Compute predictions from the linear model with the estimated parameters
y_pred = X_test.dot(theta_best)
25 Algorithm: Ordinary Least Squares
plt.figure(figsize=(16,4))
x_vars = np.linspace(-2, 2)

plt.subplot(131)
plt.title('Quadratic Function')
plt.plot(x_vars, x_vars**2)
plt.legend(["$x^2$"])

plt.subplot(132)
plt.title('Cubic Function')
plt.plot(x_vars, x_vars**3)
plt.legend(["$x^3$"])

plt.subplot(133)
plt.title('Third Degree Polynomial')
plt.plot(x_vars, x_vars**3 + 2*x_vars**2 + x_vars + 1)
plt.legend(["$x^3 + 2 x^2 + x + 1$"])
27 Modeling Non-Linear Relationships With Polynomial Regression
In this section, we are going to again use the UCI Diabetes Dataset.
* For each patient we have access to a measurement of their body mass index (BMI) and a quantitative diabetes risk score (from 0-300).
* We are interested in understanding how BMI affects an individual’s diabetes risk.
import numpy as np
import pandas as pd
from sklearn import datasets
# Featurize BMI with its square and cube.
# X_bmi is assumed here to be the BMI column of the training data.
X_bmi = X_train[['bmi']]
X_bmi_p3 = pd.concat([X_bmi, X_bmi**2, X_bmi**3], axis=1)
X_bmi_p3.columns = ['bmi', 'bmi2', 'bmi3']
X_bmi_p3['one'] = 1
X_bmi_p3.head()
By training a linear model on this featurization of the diabetes set, we can obtain a polynomial model of diabetes risk as a function of BMI.
[27]: # Fit a linear regression
theta = np.linalg.inv(X_bmi_p3.T.dot(X_bmi_p3)).dot(X_bmi_p3.T).dot(y_train)
31 Multivariate Polynomial Regression
We can also take this approach to construct non-linear functions of multiple variables by using multivariate polynomials.

For example, a polynomial of degree 2 over two variables x1, x2 is a function of the form

$$ f_\theta(x_1, x_2) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 + \theta_4 x_1^2 + \theta_5 x_2^2. $$

The same approach holds for polynomials of any degree and any number of variables.
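As a small sketch of how such a featurization can be built in code (the column names x1 and x2 below are placeholders, not columns of the diabetes data):

import pandas as pd

def degree2_features(df, c1='x1', c2='x2'):
    """Degree-2 polynomial featurization of two columns of a DataFrame."""
    feats = pd.DataFrame(index=df.index)
    feats['one'] = 1.0
    feats[c1] = df[c1]
    feats[c2] = df[c2]
    feats[c1 + '*' + c2] = df[c1] * df[c2]
    feats[c1 + '^2'] = df[c1] ** 2
    feats[c2 + '^2'] = df[c2] ** 2
    return feats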
32 Towards General Non-Linear Features
Any non-linear feature map $\phi(x) : \mathbb{R} \to \mathbb{R}^p$ can be used in this way to obtain general models of the form

$$ f_\theta(x) := \theta^\top \phi(x) $$

that are highly non-linear in x but linear in θ.
For example, here is a way of modeling complex periodic functions via a sum of sines and cosines.
[28]: import warnings
warnings.filterwarnings("ignore")
plt.figure(figsize=(16,4))
x_vars = np.linspace(-5, 5)
plt.subplot('131')
plt.title('Cosine Function')
plt.plot(x_vars, np.cos(x_vars))
plt.legend(["$cos(x)$"])
plt.subplot('132')
plt.title('Sine Function')
plt.plot(x_vars, np.sin(2*x_vars))
plt.legend(["$x^3$"])
plt.subplot('133')
plt.title('Combination of Sines and Cosines')
plt.plot(x_vars, np.cos(x_vars) + np.sin(2*x_vars) + np.cos(4*x_vars))
plt.legend(["$cos(x) + sin(2x) + cos(4x)$"])
33 Algorithm: Non-Linear Least Squares