lecture3-linear-regression

September 15, 2020

1 Lecture 3: Optimization and Linear Regression

1.0.1 Applied Machine Learning

Volodymyr Kuleshov, Cornell Tech

2 Part 1: Optimization and Calculus Background

In the previous lecture, we learned what a supervised machine learning problem is.


Before we turn our attention to Linear Regression, we will first dive deeper into the question of
optimization.

3 Review: Components of A Supervised Machine Learning Problem

At a high level, a supervised machine learning problem has the following structure:

$$\text{Dataset} + \underbrace{\text{Learning Algorithm}}_{\text{Model Class} + \text{Objective} + \text{Optimizer}} \to \text{Predictive Model}$$

The predictive model is chosen to model the relationship between inputs and targets. For instance,
it can predict future targets.

4 Optimizer: Notation

At a high level, an optimizer takes
• an objective J (also called a loss function), and
• a model class M,

and finds a model f ∈ M with the smallest value of the objective J:

$$\min_{f \in M} J(f)$$

Intuitively, this is the function that best “fits” the data in the training dataset $D = \{(x^{(i)}, y^{(i)}) \mid i = 1, 2, \ldots, n\}$.
We will use a quadratic function as our running example for an objective J.
[2]: import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [8, 4]

[3]: def quadratic_function(theta):
    """The cost function, J(theta)."""
    return 0.5*(2*theta-1)**2

We can visualize it.


[4]: # First construct a grid of theta values and their corresponding
# cost function values.
thetas = np.linspace(-0.2,1,10)
f_vals = quadratic_function(thetas[:,np.newaxis])

plt.plot(thetas, f_vals)
plt.xlabel('Theta')
plt.ylabel('Objective value')
plt.title('Simple quadratic function')

[4]: Text(0.5, 1.0, 'Simple quadratic function')
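As a quick illustration of what the optimizer is asked to do (an addition, not part of the lecture code), we can approximate the minimizer of this objective by evaluating it on a dense grid and taking the argmin; for $J(\theta) = \frac{1}{2}(2\theta - 1)^2$ the minimum lies at θ = 0.5.

import numpy as np

def quadratic_function(theta):
    """The cost function, J(theta)."""
    return 0.5*(2*theta - 1)**2

# Evaluate J on a dense grid of candidate parameters and keep the best one.
theta_grid = np.linspace(-0.2, 1, 1001)
best_theta = theta_grid[np.argmin(quadratic_function(theta_grid))]
print(best_theta)  # close to 0.5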

5 Calculus Review: Derivatives

Recall that the derivative
$$\frac{d f(\theta_0)}{d \theta}$$
of a univariate function f : R → R is the instantaneous rate of change of the function f(θ) with respect to its parameter θ at the point θ0.
[5]: def quadratic_derivative(theta):
    return (2*theta-1)*2

df0 = quadratic_derivative(np.array([[0]])) # derivative at zero
f0 = quadratic_function(np.array([[0]]))
line_length = 0.2

plt.plot(thetas, f_vals)
plt.annotate('', xytext=(0-line_length, f0-line_length*df0),
             xy=(0+line_length, f0+line_length*df0),
             arrowprops={'arrowstyle': '-', 'lw': 1.5}, va='center', ha='center')
plt.xlabel('Theta')
plt.ylabel('Objective value')
plt.title('Simple quadratic function')

[5]: Text(0.5, 1.0, 'Simple quadratic function')
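As a sketch (not in the original notebook), we can check the analytic derivative against a centered finite-difference approximation, which follows directly from the definition of the derivative as an instantaneous rate of change; the functions are restated so the snippet runs on its own.

import numpy as np

def quadratic_function(theta):
    """The cost function, J(theta)."""
    return 0.5*(2*theta - 1)**2

def quadratic_derivative(theta):
    """Analytic derivative of J(theta)."""
    return (2*theta - 1)*2

# Centered finite differences: (J(theta + eps) - J(theta - eps)) / (2*eps).
eps = 1e-6
for theta in [0.0, 0.5, 0.8]:
    numeric = (quadratic_function(theta + eps) - quadratic_function(theta - eps)) / (2*eps)
    print(theta, quadratic_derivative(theta), numeric)  # analytic and numeric values agree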

[6]: pts = np.array([[0, 0.5, 0.8]]).reshape((3,1))
df0s = quadratic_derivative(pts)
f0s = quadratic_function(pts)

plt.plot(thetas, f_vals)
for pt, f0, df0 in zip(pts.flatten(), f0s.flatten(), df0s.flatten()):
    plt.annotate('', xytext=(pt-line_length, f0-line_length*df0),
                 xy=(pt+line_length, f0+line_length*df0),
                 arrowprops={'arrowstyle': '-', 'lw': 1}, va='center', ha='center')

plt.xlabel('Theta')
plt.ylabel('Objective value')
plt.title('Simple quadratic function')

[6]: Text(0.5, 1.0, 'Simple quadratic function')

6 Calculus Review: Partial Derivatives

The partial derivative
$$\frac{\partial f(\theta_0)}{\partial \theta_j}$$
of a multivariate function f : R^d → R is the derivative of f with respect to θj while all the other inputs θk for k ≠ j are held fixed.

7 Calculus Review: The Gradient

The gradient ∇θ f further extends the derivative to multivariate functions f : R^d → R, and is defined at a point θ0 as

$$\nabla_\theta f(\theta_0) = \begin{bmatrix} \frac{\partial f(\theta_0)}{\partial \theta_1} \\ \frac{\partial f(\theta_0)}{\partial \theta_2} \\ \vdots \\ \frac{\partial f(\theta_0)}{\partial \theta_d} \end{bmatrix}.$$

The j-th entry of the vector ∇θ f(θ0) is the partial derivative $\frac{\partial f(\theta_0)}{\partial \theta_j}$ of f with respect to the j-th component of θ.
We will use a quadratic function as a running example.
[7]: def quadratic_function2d(theta0, theta1):
    """Quadratic objective function, J(theta0, theta1).

    The inputs theta0, theta1 are 2d arrays and we evaluate
    the objective at each value theta0[i,j], theta1[i,j].
    We implement it this way so it's easier to plot the
    level curves of the function in 2d.

    Parameters:
    theta0 (np.array): 2d array of first parameter theta0
    theta1 (np.array): 2d array of second parameter theta1

    Returns:
    fvals (np.array): 2d array of objective function values
        fvals is the same dimension as theta0 and theta1.
        fvals[i,j] is the value at theta0[i,j] and theta1[i,j].
    """
    theta0 = np.atleast_2d(np.asarray(theta0))
    theta1 = np.atleast_2d(np.asarray(theta1))
    return 0.5*((2*theta1-2)**2 + (theta0-3)**2)

Let’s visualize this function.


[8]: theta0_grid = np.linspace(-4,7,101)
theta1_grid = np.linspace(-1,4,101)
theta_grid = theta0_grid[np.newaxis,:], theta1_grid[:,np.newaxis]
J_grid = quadratic_function2d(theta0_grid[np.newaxis,:], theta1_grid[:,np.newaxis])

X, Y = np.meshgrid(theta0_grid, theta1_grid)
contours = plt.contour(X, Y, J_grid, 10)
plt.clabel(contours)
plt.axis('equal')

[8]: (-4.0, 7.0, -1.0, 4.0)

Let’s write down the derivative of the quadratic function.


[9]: def quadratic_derivative2d(theta0, theta1):
    """Derivative of the quadratic objective function.

    The inputs theta0, theta1 are 1d arrays and we evaluate
    the derivative at each value theta0[i], theta1[i].

    Parameters:
    theta0 (np.array): 1d array of first parameter theta0
    theta1 (np.array): 1d array of second parameter theta1

    Returns:
    grads (np.array): 2d array of partial derivatives
        grads is of the same size as theta0 and theta1
        along the first dimension and of size two
        along the second dimension.
        grads[i,j] is the j-th partial derivative
        at input theta0[i], theta1[i].
    """
    # this is the gradient of 0.5*((2*theta1-2)**2 + (theta0-3)**2)
    grads = np.stack([theta0-3, (2*theta1-2)*2], axis=1)
    grads = grads.reshape([len(theta0), 2])
    return grads

We can visualize the derivative.
[10]: theta0_pts, theta1_pts = np.array([2.3, -1.35, -2.3]), np.array([2.4, -0.15, 2.75])
dfs = quadratic_derivative2d(theta0_pts, theta1_pts)

line_length = 0.2

contours = plt.contour(X, Y, J_grid, 10)
for theta0_pt, theta1_pt, df0 in zip(theta0_pts, theta1_pts, dfs):
    plt.annotate('', xytext=(theta0_pt, theta1_pt),
                 xy=(theta0_pt-line_length*df0[0], theta1_pt-line_length*df0[1]),
                 arrowprops={'arrowstyle': '->', 'lw': 2}, va='center', ha='center')

plt.scatter(theta0_pts, theta1_pts)
plt.clabel(contours)
plt.xlabel('Theta0')
plt.ylabel('Theta1')
plt.title('Gradients of the quadratic function')
plt.axis('equal')

[10]: (-4.0, 7.0, -1.0, 4.0)

# Part 1b: Gradient Descent


Next, we will use gradients to define an important algorithm called gradient descent.

8 Calculus Review: The Gradient

The gradient ∇θ f further extends the derivative to multivariate functions f : R^d → R, and is defined at a point θ0 as

$$\nabla_\theta f(\theta_0) = \begin{bmatrix} \frac{\partial f(\theta_0)}{\partial \theta_1} \\ \frac{\partial f(\theta_0)}{\partial \theta_2} \\ \vdots \\ \frac{\partial f(\theta_0)}{\partial \theta_d} \end{bmatrix}.$$

The j-th entry of the vector ∇θ f(θ0) is the partial derivative $\frac{\partial f(\theta_0)}{\partial \theta_j}$ of f with respect to the j-th component of θ.
[11]: theta0_pts, theta1_pts = np.array([2.3, -1.35, -2.3]), np.array([2.4, -0.15, 2.75])
dfs = quadratic_derivative2d(theta0_pts, theta1_pts)

line_length = 0.2

contours = plt.contour(X, Y, J_grid, 10)
for theta0_pt, theta1_pt, df0 in zip(theta0_pts, theta1_pts, dfs):
    plt.annotate('', xytext=(theta0_pt, theta1_pt),
                 xy=(theta0_pt-line_length*df0[0], theta1_pt-line_length*df0[1]),
                 arrowprops={'arrowstyle': '->', 'lw': 2}, va='center', ha='center')

plt.scatter(theta0_pts, theta1_pts)
plt.clabel(contours)
plt.xlabel('Theta0')
plt.ylabel('Theta1')
plt.title('Gradients of the quadratic function')
plt.axis('equal')

[11]: (-4.0, 7.0, -1.0, 4.0)

9 Gradient Descent: Intuition

Gradient descent is a very common optimization algorithm used in machine learning.


The intuition behind gradient descent is to repeatedly obtain the gradient to determine the direction
in which the function decreases most steeply and take a step in that direction.

10 Gradient Descent: Notation

More formally, if we want to optimize J(θ), we start with an initial guess θ0 for the parameters
and repeat the following update until θ is no longer changing:

$$\theta_i := \theta_{i-1} - \alpha \cdot \nabla_\theta J(\theta_{i-1}).$$

As code, this method may look as follows:


theta, theta_prev = random_initialization()
while norm(theta - theta_prev) > convergence_threshold:
    theta_prev = theta
    theta = theta_prev - step_size * gradient(theta_prev)

In the above algorithm, we stop when $\|\theta_i - \theta_{i-1}\|$ is small.
It’s easy to implement this function in numpy.

[24]: convergence_threshold = 2e-1
step_size = 2e-1
theta, theta_prev = np.array([[-2], [3]]), np.array([[0], [0]])
opt_pts = [theta.flatten()]
opt_grads = []

while np.linalg.norm(theta - theta_prev) > convergence_threshold:
    # repeat while the parameters change by more than the threshold
    theta_prev = theta
    gradient = quadratic_derivative2d(*theta).reshape([2,1])
    theta = theta_prev - step_size * gradient
    opt_pts += [theta.flatten()]
    opt_grads += [gradient.flatten()]

We can now visualize gradient descent.


[25]: opt_pts = np.array(opt_pts)
opt_grads = np.array(opt_grads)

contours = plt.contour(X, Y, J_grid, 10)
plt.clabel(contours)
plt.scatter(opt_pts[:,0], opt_pts[:,1])

for opt_pt, opt_grad in zip(opt_pts, opt_grads):
    plt.annotate('', xytext=(opt_pt[0], opt_pt[1]),
                 xy=(opt_pt[0]-0.8*step_size*opt_grad[0], opt_pt[1]-0.8*step_size*opt_grad[1]),
                 arrowprops={'arrowstyle': '->', 'lw': 2}, va='center', ha='center')

plt.axis('equal')

[25]: (-4.0, 7.0, -1.0, 4.0)

# Part 2: Gradient Descent in Linear Models
Let’s now use gradient descent to derive a supervised learning algorithm for linear models.

11 Review: Gradient Descent

If we want to optimize J(θ), we start with an initial guess θ0 for the parameters and repeat the
following update:
$$\theta_i := \theta_{i-1} - \alpha \cdot \nabla_\theta J(\theta_{i-1}).$$

As code, this method may look as follows:


theta, theta_prev = random_initialization()
while norm(theta - theta_prev) > convergence_threshold:
    theta_prev = theta
    theta = theta_prev - step_size * gradient(theta_prev)

12 Review: Linear Model Family

Recall that a linear model has the form


$$y = \theta_0 + \theta_1 \cdot x_1 + \theta_2 \cdot x_2 + \ldots + \theta_d \cdot x_d$$

where x ∈ R^d is a vector of features and y is the target. The θj are the parameters of the model.
By using the notation x_0 = 1, we can represent the model in a vectorized form

$$f_\theta(x) = \sum_{j=0}^d \theta_j \cdot x_j = \theta^\top x.$$

Let’s define our model in Python.
[26]: def f(X, theta):
    """The linear model we are trying to fit.

    Parameters:
    theta (np.array): d-dimensional vector of parameters
    X (np.array): (n,d)-dimensional data matrix

    Returns:
    y_pred (np.array): n-dimensional vector of predicted targets
    """
    return X.dot(theta)

13 An Objective: Mean Squared Error

We pick θ to minimize the mean squared error (MSE). Slight variants of this objective are also
known as the residual sum of squares (RSS) or the sum of squared residuals (SSR).

$$J(\theta) = \frac{1}{2n} \sum_{i=1}^n \left( y^{(i)} - \theta^\top x^{(i)} \right)^2$$

In other words, we are looking for the best compromise in θ over all the data points.
Let’s implement mean squared error.
[27]: def mean_squared_error(theta, X, y):
    """The cost function, J(theta), describing the goodness of fit.

    Parameters:
    theta (np.array): d-dimensional vector of parameters
    X (np.array): (n,d)-dimensional design matrix
    y (np.array): n-dimensional vector of targets
    """
    return 0.5*np.mean((y-f(X, theta))**2)

14 Mean Squared Error: Partial Derivatives

Let’s work out what a partial derivative is for the MSE error loss for a linear model.

$$\begin{aligned}
\frac{\partial J(\theta)}{\partial \theta_j} &= \frac{\partial}{\partial \theta_j} \frac{1}{2} \left( f_\theta(x) - y \right)^2 \\
&= \left( f_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( f_\theta(x) - y \right) \\
&= \left( f_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( \sum_{k=0}^d \theta_k \cdot x_k - y \right) \\
&= \left( f_\theta(x) - y \right) \cdot x_j
\end{aligned}$$

15 Mean Squared Error: The Gradient

We can use this derivation to obtain an expression for the gradient of the MSE for a linear model
$$\nabla_\theta J(\theta) = \begin{bmatrix} \frac{\partial J(\theta)}{\partial \theta_1} \\ \frac{\partial J(\theta)}{\partial \theta_2} \\ \vdots \\ \frac{\partial J(\theta)}{\partial \theta_d} \end{bmatrix} = \begin{bmatrix} (f_\theta(x) - y) \cdot x_1 \\ (f_\theta(x) - y) \cdot x_2 \\ \vdots \\ (f_\theta(x) - y) \cdot x_d \end{bmatrix} = (f_\theta(x) - y) \cdot x.$$

Let’s implement the gradient.


[28]: def mse_gradient(theta, X, y):
    """The gradient of the cost function, J(theta), with respect to theta.

    Parameters:
    theta (np.array): d-dimensional vector of parameters
    X (np.array): (n,d)-dimensional design matrix
    y (np.array): n-dimensional vector of targets

    Returns:
    grad (np.array): d-dimensional gradient of the MSE
    """
    return np.mean((f(X, theta) - y) * X.T, axis=1)
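As a quick check (an addition, not from the lecture), we can compare mse_gradient to a centered finite-difference approximation of mean_squared_error on a small synthetic dataset; the two should agree up to numerical error. The functions are restated here so the snippet runs on its own.

import numpy as np

def f(X, theta):
    return X.dot(theta)

def mean_squared_error(theta, X, y):
    return 0.5*np.mean((y - f(X, theta))**2)

def mse_gradient(theta, X, y):
    return np.mean((f(X, theta) - y) * X.T, axis=1)

# Small synthetic problem: n=5 points, d=2 features (the second is a column of ones).
rng = np.random.RandomState(0)
X_toy = np.column_stack([rng.randn(5), np.ones(5)])
y_toy = rng.randn(5)
theta_toy = np.array([0.3, -0.1])

analytic = mse_gradient(theta_toy, X_toy, y_toy)
numeric = np.zeros_like(theta_toy)
eps = 1e-6
for j in range(len(theta_toy)):
    e = np.zeros_like(theta_toy)
    e[j] = eps
    numeric[j] = (mean_squared_error(theta_toy + e, X_toy, y_toy)
                  - mean_squared_error(theta_toy - e, X_toy, y_toy)) / (2*eps)
print(analytic, numeric)  # the two gradients should match closely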

16 The UCI Diabetes Dataset

In this section, we are going to again use the UCI Diabetes Dataset.
• For each patient we have access to a measurement of their body mass index (BMI) and a quantitative diabetes risk score (from 0-300).
• We are interested in understanding how BMI affects an individual’s diabetes risk.

[29]: %matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [8, 4]

import numpy as np
import pandas as pd
from sklearn import datasets

# Load the diabetes dataset
X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)

# add an extra column of ones
X['one'] = 1

# Collect 20 data points and only use bmi dimension
X_train = X.iloc[-20:].loc[:, ['bmi', 'one']]
y_train = y.iloc[-20:] / 300

plt.scatter(X_train.loc[:,['bmi']], y_train, color='black')
plt.xlabel('Body Mass Index (BMI)')
plt.ylabel('Diabetes Risk')

[29]: Text(0, 0.5, 'Diabetes Risk')

17 Gradient Descent for Linear Regression

Putting this together with the gradient descent algorithm, we obtain a learning method for training
linear models.
theta, theta_prev = random_initialization()
while abs(J(theta) - J(theta_prev)) > conv_threshold:
    theta_prev = theta
    theta = theta_prev - step_size * (f(x, theta)-y) * x
This update rule is also known as the Least Mean Squares (LMS) or Widrow-Hoff learning rule.

[34]: threshold = 1e-3
step_size = 4e-1
theta, theta_prev = np.array([2,1]), np.ones(2,)
opt_pts = [theta]
opt_grads = []
iter = 0

while np.linalg.norm(theta - theta_prev) > threshold:
    if iter % 100 == 0:
        print('Iteration %d. MSE: %.6f' % (iter, mean_squared_error(theta, X_train, y_train)))
    theta_prev = theta
    gradient = mse_gradient(theta, X_train, y_train)
    theta = theta_prev - step_size * gradient
    opt_pts += [theta]
    opt_grads += [gradient]
    iter += 1

Iteration 0. MSE: 0.171729


Iteration 100. MSE: 0.014765
Iteration 200. MSE: 0.014349
Iteration 300. MSE: 0.013997
Iteration 400. MSE: 0.013701

[35]: x_line = np.stack([np.linspace(-0.1, 0.1, 10), np.ones(10,)])
y_line = opt_pts[-1].dot(x_line)

plt.scatter(X_train.loc[:,['bmi']], y_train, color='black')
plt.plot(x_line[0], y_line)
plt.xlabel('Body Mass Index (BMI)')
plt.ylabel('Diabetes Risk')

[35]: Text(0, 0.5, 'Diabetes Risk')

# Part 3: Ordinary Least Squares
In practice, there is a more effective way than gradient descent to find linear model parameters.
We will see this method here, which will lead to our first non-toy algorithm: Ordinary Least
Squares.

18 Review: The Gradient

The gradient ∇θ f further extends the derivative to multivariate functions f : R^d → R, and is defined at a point θ0 as

$$\nabla_\theta f(\theta_0) = \begin{bmatrix} \frac{\partial f(\theta_0)}{\partial \theta_1} \\ \frac{\partial f(\theta_0)}{\partial \theta_2} \\ \vdots \\ \frac{\partial f(\theta_0)}{\partial \theta_d} \end{bmatrix}.$$

In other words, the j-th entry of the vector ∇θ f(θ0) is the partial derivative $\frac{\partial f(\theta_0)}{\partial \theta_j}$ of f with respect to the j-th component of θ.

19 The UCI Diabetes Dataset

In this section, we are going to again use the UCI Diabetes Dataset.
• For each patient we have access to a measurement of their body mass index (BMI) and a quantitative diabetes risk score (from 0-300).
• We are interested in understanding how BMI affects an individual’s diabetes risk.

[36]: %matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [8, 4]

import numpy as np
import pandas as pd
from sklearn import datasets

# Load the diabetes dataset
X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)

# add an extra column of ones
X['one'] = 1

# Collect 20 data points
X_train = X.iloc[-20:]
y_train = y.iloc[-20:]

plt.scatter(X_train.loc[:,['bmi']], y_train, color='black')
plt.xlabel('Body Mass Index (BMI)')
plt.ylabel('Diabetes Risk')

[36]: Text(0, 0.5, 'Diabetes Risk')

20 Notation: Design Matrix

Machine learning algorithms are most easily defined in the language of linear algebra. Therefore,
it will be useful to represent the entire dataset as one matrix X ∈ Rn×d , of the form:
 (1) (1) (1)
  
x1 x2 . . . xd − (x(1) )⊤ −
 (2) (2) 
. . . xd   − (x(2) )⊤ −
(2)
 x1
= 
x2
X= .  . .
 .
 ..   . 
x1
(n)
x2
(n)
. . . xd
(n) − (x(n) )⊤ −

We can view the design matrix for the diabetes dataset.


[37]: X_train.head()

[37]: age sex bmi bp s1 s2 s3 \


422 -0.078165 0.050680 0.077863 0.052858 0.078236 0.064447 0.026550
423 0.009016 0.050680 -0.039618 0.028758 0.038334 0.073529 -0.072854
424 0.001751 0.050680 0.011039 -0.019442 -0.016704 -0.003819 -0.047082
425 -0.078165 -0.044642 -0.040696 -0.081414 -0.100638 -0.112795 0.022869
426 0.030811 0.050680 -0.034229 0.043677 0.057597 0.068831 -0.032356

s4 s5 s6 one
422 -0.002592 0.040672 -0.009362 1
423 0.108111 0.015567 -0.046641 1
424 0.034309 0.024053 0.023775 1
425 -0.076395 -0.020289 -0.050783 1
426 0.057557 0.035462 0.085907 1

21 Notation: Design Matrix

Similarly, we can vectorize the target variables into a vector y ∈ Rn of the form
 (1) 
y
 y (2) 
 
y =  . .
 . .
y (n)

22 Squared Error in Matrix Form

Recall that we may fit a linear model by choosing θ that minimizes the squared error:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^n \left( y^{(i)} - \theta^\top x^{(i)} \right)^2$$

In other words, we are looking for the best compromise in θ over all the data points.
We can write this sum in matrix-vector form as:
$$J(\theta) = \frac{1}{2} (y - X\theta)^\top (y - X\theta) = \frac{1}{2} \| y - X\theta \|^2,$$
where X is the design matrix and ∥·∥ denotes the Euclidean norm.
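As a sketch (an addition to the lecture notes), the matrix-vector form is a one-liner in numpy; note that with these conventions it equals n times the mean_squared_error defined earlier, since that version also averages over the n data points.

import numpy as np

def squared_error_matrix_form(theta, X, y):
    """J(theta) = 0.5 * ||y - X theta||^2, written with matrix operations."""
    residual = y - X.dot(theta)
    return 0.5 * residual.dot(residual)

# Tiny synthetic example illustrating the n-times-the-mean relationship.
rng = np.random.RandomState(0)
X_toy = rng.randn(5, 2)
y_toy = rng.randn(5)
theta_toy = np.array([0.3, -0.1])
n = X_toy.shape[0]
print(squared_error_matrix_form(theta_toy, X_toy, y_toy))
print(n * 0.5 * np.mean((y_toy - X_toy.dot(theta_toy))**2))  # same value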

23 The Gradient of the Squared Error

We can obtain the gradient of the mean squared error as follows.

$$\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \frac{1}{2} (X\theta - y)^\top (X\theta - y) \\
&= \frac{1}{2} \nabla_\theta \left( (X\theta)^\top (X\theta) - (X\theta)^\top y - y^\top (X\theta) + y^\top y \right) \\
&= \frac{1}{2} \nabla_\theta \left( \theta^\top (X^\top X) \theta - 2(X\theta)^\top y \right) \\
&= \frac{1}{2} \left( 2(X^\top X)\theta - 2X^\top y \right) \\
&= (X^\top X)\theta - X^\top y
\end{aligned}$$

We used the facts that $a^\top b = b^\top a$ (line 3), that $\nabla_x b^\top x = b$ (line 4), and that $\nabla_x x^\top A x = 2Ax$ for a symmetric matrix A (line 4).
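This closed form is equally short in code. Below is a sketch (an addition to the notes) that implements $(X^\top X)\theta - X^\top y$ on a small synthetic problem and sanity-checks it by confirming that a small step against the gradient decreases the objective, which is what gradient descent relies on.

import numpy as np

def squared_error(theta, X, y):
    """J(theta) = 0.5 * ||y - X theta||^2."""
    residual = y - X.dot(theta)
    return 0.5 * residual.dot(residual)

def squared_error_gradient(theta, X, y):
    """Gradient of J(theta): (X^T X) theta - X^T y."""
    return X.T.dot(X).dot(theta) - X.T.dot(y)

# Small synthetic problem (10 points, 3 features).
rng = np.random.RandomState(0)
X_toy = rng.randn(10, 3)
y_toy = rng.randn(10)
theta_toy = np.zeros(3)

grad = squared_error_gradient(theta_toy, X_toy, y_toy)
print(squared_error(theta_toy, X_toy, y_toy))
print(squared_error(theta_toy - 1e-2 * grad, X_toy, y_toy))  # smaller than the value above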

24 Normal Equations

Setting the above derivative to zero, we obtain the normal equations:

$$(X^\top X)\theta = X^\top y.$$

Hence, the value θ* that minimizes this objective is given by:

$$\theta^* = (X^\top X)^{-1} X^\top y.$$

Note that we assumed that the matrix $(X^\top X)$ is invertible; if this is not the case, there are easy ways of addressing this issue.
Let’s apply the normal equations.
[21]: import numpy as np

theta_best = np.linalg.inv(X_train.T.dot(X_train)).dot(X_train.T).dot(y_train)
theta_best_df = pd.DataFrame(data=theta_best[np.newaxis, :], columns=X.columns)
theta_best_df

[21]: age sex bmi bp s1 s2 \
0 -3.888868 204.648785 -64.289163 -262.796691 14003.726808 -11798.307781

s3 s4 s5 s6 one
0 -5892.15807 -1136.947646 -2736.597108 -393.879743 155.698998
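A side note (not from the lecture): explicitly inverting $X^\top X$ works here, but it can be numerically fragile when features are nearly collinear. A common alternative is to solve the normal equations with np.linalg.solve, or to use np.linalg.lstsq, which also handles rank-deficient X. A minimal sketch, reusing X_train and y_train from above:

import numpy as np

# Solve (X^T X) theta = X^T y directly, without forming an explicit inverse.
A = X_train.T.dot(X_train)
b = X_train.T.dot(y_train)
theta_solve = np.linalg.solve(A, b)

# Or minimize ||y - X theta|| with a least-squares solver.
theta_lstsq, _, _, _ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Both should closely match theta_best computed above.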

We can now use our estimate of theta to compute predictions for 3 new data points.
[22]: # Collect 3 data points for testing
X_test = X.iloc[:3]
y_test = y.iloc[:3]

# generate predictions on the new patients


y_test_pred = X_test.dot(theta_best)

Let’s visualize these predictions.


[23]: # visualize the results
plt.xlabel('Body Mass Index (BMI)')
plt.ylabel('Diabetes Risk')
plt.scatter(X_train.loc[:, ['bmi']], y_train)
plt.scatter(X_test.loc[:, ['bmi']], y_test, color='red', marker='o')
plt.plot(X_test.loc[:, ['bmi']], y_test_pred, 'x', color='red', mew=3, markersize=8)
plt.legend(['Model', 'Prediction', 'Initial patients', 'New patients'])

[23]: <matplotlib.legend.Legend at 0x128d89668>

25 Algorithm: Ordinary Least Squares

• Type: Supervised learning (regression)


• Model family: Linear models
• Objective function: Mean squared error
• Optimizer: Normal equations
# Part 4: Non-Linear Least Squares
So far, we have learned about a very simple linear model. These can capture only simple linear relationships in the data. How can we use what we learned so far to model more complex relationships?
We will now see a simple approach to modeling complex non-linear relationships, called non-linear least squares.

26 Review: Polynomial Functions

Recall that a polynomial of degree p is a function of the form

$$a_p x^p + a_{p-1} x^{p-1} + \ldots + a_1 x + a_0.$$

Below are some examples of polynomial functions.


[24]: import warnings
warnings.filterwarnings("ignore")

plt.figure(figsize=(16,4))
x_vars = np.linspace(-2, 2)

plt.subplot('131')
plt.title('Quadratic Function')
plt.plot(x_vars, x_vars**2)
plt.legend(["$x^2$"])

plt.subplot('132')
plt.title('Cubic Function')
plt.plot(x_vars, x_vars**3)
plt.legend(["$x^3$"])

plt.subplot('133')
plt.title('Third Degree Polynomial')
plt.plot(x_vars, x_vars**3 + 2*x_vars**2 + x_vars + 1)
plt.legend(["$x^3 + 2 x^2 + x + 1$"])

[24]: <matplotlib.legend.Legend at 0x128ed2ac8>

27 Modeling Non-Linear Relationships With Polynomial Regression

Specifically, given a one-dimensional continuous variable x, we can define a feature function ϕ : R → R^{p+1} as

$$\phi(x) = \begin{bmatrix} 1 \\ x \\ x^2 \\ \vdots \\ x^p \end{bmatrix}.$$

The class of models of the form

$$f_\theta(x) := \sum_{j=0}^p \theta_j x^j = \theta^\top \phi(x)$$

with parameters θ and polynomial features ϕ is the set of p-degree polynomials.


• This model is non-linear in the input variable x, meaning that we can model complex data relationships.
• It is a linear model as a function of the parameters θ, meaning that we can use our familiar ordinary least squares algorithm to fit this model (a minimal sketch of building such features follows below).
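As a sketch of how such features can be built in code (an addition to the notes, not part of the lecture), numpy's np.vander constructs exactly this kind of polynomial feature matrix, and the usual least-squares machinery then applies unchanged; the toy inputs and targets below are illustrative.

import numpy as np

# Hypothetical 1d inputs; increasing=True orders the columns as [1, x, x^2, ..., x^p],
# i.e. each row is the feature vector phi(x) from above.
x = np.linspace(-1, 1, 20)
p = 3
Phi = np.vander(x, N=p + 1, increasing=True)   # shape (20, p+1)

# Fit theta by least squares on toy targets y = x^3 - x + noise.
rng = np.random.RandomState(0)
y = x**3 - x + 0.05 * rng.randn(20)
theta, _, _, _ = np.linalg.lstsq(Phi, y, rcond=None)
print(theta)  # approximately [0, -1, 0, 1]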

28 The UCI Diabetes Dataset

In this section, we are going to again use the UCI Diabetes Dataset.
• For each patient we have access to a measurement of their body mass index (BMI) and a quantitative diabetes risk score (from 0-300).
• We are interested in understanding how BMI affects an individual’s diabetes risk.

[25]: %matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [8, 4]

import numpy as np
import pandas as pd
from sklearn import datasets

# Load the diabetes dataset
X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)

# add an extra column of ones
X['one'] = 1

# Collect 20 data points
X_train = X.iloc[-20:]
y_train = y.iloc[-20:]

plt.scatter(X_train.loc[:,['bmi']], y_train, color='black')
plt.xlabel('Body Mass Index (BMI)')
plt.ylabel('Diabetes Risk')

[25]: Text(0, 0.5, 'Diabetes Risk')

29 Diabetes Dataset: A Non-Linear Featurization

Let’s now obtain non-linear (polynomial) features for this dataset.


[26]: X_bmi = X_train.loc[:, ['bmi']]
X_bmi_p3 = pd.concat([X_bmi, X_bmi**2, X_bmi**3], axis=1)
X_bmi_p3.columns = ['bmi', 'bmi2', 'bmi3']
X_bmi_p3['one'] = 1
X_bmi_p3.head()

[26]: bmi bmi2 bmi3 one


422 0.077863 0.006063 0.000472 1
423 -0.039618 0.001570 -0.000062 1
424 0.011039 0.000122 0.000001 1
425 -0.040696 0.001656 -0.000067 1
426 -0.034229 0.001172 -0.000040 1

30 Diabetes Dataset: A Polynomial Model

By training a linear model on this featurization of the diabetes set, we can obtain a polynomial model of diabetes risk as a function of BMI.
[27]: # Fit a linear regression
theta = np.linalg.inv(X_bmi_p3.T.dot(X_bmi_p3)).dot(X_bmi_p3.T).dot(y_train)

# Show the learned polynomial curve
x_line = np.linspace(-0.1, 0.1, 10)
x_line_p3 = np.stack([x_line, x_line**2, x_line**3, np.ones(10,)], axis=1)
y_train_pred = x_line_p3.dot(theta)

plt.xlabel('Body Mass Index (BMI)')
plt.ylabel('Diabetes Risk')
plt.scatter(X_bmi, y_train)
plt.plot(x_line, y_train_pred)

[27]: [<matplotlib.lines.Line2D at 0x1292c99e8>]

31 Multivariate Polynomial Regression

We can also take this approach to construct non-linear functions of multiple variables by using multivariate polynomials.
For example, a polynomial of degree 2 over two variables x1, x2 is a function of the form

$$a_{20} x_1^2 + a_{10} x_1 + a_{02} x_2^2 + a_{01} x_2 + a_{11} x_1 x_2 + a_{00}.$$

In general, a polynomial of degree p over two variables x1, x2 is a function of the form

$$f(x_1, x_2) = \sum_{i,j \geq 0 \,:\, i + j \leq p} a_{ij} x_1^i x_2^j.$$

In our two-dimensional example, this corresponds to a feature function ϕ : R^2 → R^6 of the form

$$\phi(x) = \begin{bmatrix} 1 \\ x_1 \\ x_1^2 \\ x_2 \\ x_2^2 \\ x_1 x_2 \end{bmatrix}.$$

The same approach holds for polynomials of any degree and any number of variables.
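A minimal sketch of this construction (an addition, not part of the original notebook), using scikit-learn's PolynomialFeatures to expand two toy input variables into all monomials of total degree at most 2:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Five toy data points with two input variables x1, x2.
X_raw = np.array([[0.0, 1.0],
                  [1.0, 2.0],
                  [2.0, 0.5],
                  [0.5, 0.5],
                  [1.5, 1.0]])

# Expand into the bias term 1, the linear terms, and all degree-2 monomials.
poly = PolynomialFeatures(degree=2, include_bias=True)
Phi = poly.fit_transform(X_raw)
print(Phi.shape)  # (5, 6), matching the feature map phi : R^2 -> R^6 above
print(Phi[0])     # the six monomials evaluated at (x1, x2) = (0, 1)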

32 Towards General Non-Linear Features

Any non-linear feature map ϕ : R → R^p can be used in this way to obtain general models of the form

$$f_\theta(x) := \theta^\top \phi(x)$$

that are highly non-linear in x but linear in θ.
For example, here is a way of modeling complex periodic functions via a sum of sines and cosines.
[28]: import warnings
warnings.filterwarnings("ignore")

plt.figure(figsize=(16,4))
x_vars = np.linspace(-5, 5)

plt.subplot('131')
plt.title('Cosine Function')
plt.plot(x_vars, np.cos(x_vars))
plt.legend(["$cos(x)$"])

plt.subplot('132')
plt.title('Sine Function')
plt.plot(x_vars, np.sin(2*x_vars))
plt.legend(["$sin(2x)$"])

plt.subplot('133')
plt.title('Combination of Sines and Cosines')
plt.plot(x_vars, np.cos(x_vars) + np.sin(2*x_vars) + np.cos(4*x_vars))
plt.legend(["$cos(x) + sin(2x) + cos(4x)$"])

[28]: <matplotlib.legend.Legend at 0x129571160>
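To illustrate (a sketch that goes beyond the notebook's plotting code), the same least-squares recipe fits such a periodic model once we define a sine/cosine feature map; the fourier_features helper, the frequencies, and the toy data below are illustrative choices, not anything from the lecture.

import numpy as np

def fourier_features(x):
    """Feature map phi(x) = [1, cos(x), sin(2x), cos(4x)] for an array of scalar inputs x."""
    return np.stack([np.ones_like(x), np.cos(x), np.sin(2*x), np.cos(4*x)], axis=1)

# Toy data generated from the plotted target function plus a little noise.
rng = np.random.RandomState(0)
x = np.linspace(-5, 5, 100)
y = np.cos(x) + np.sin(2*x) + np.cos(4*x) + 0.1 * rng.randn(100)

# Ordinary least squares on the non-linear features.
Phi = fourier_features(x)
theta, _, _, _ = np.linalg.lstsq(Phi, y, rcond=None)
print(theta)  # approximately [0, 1, 1, 1]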

33 Algorithm: Non-Linear Least Squares

• Type: Supervised learning (regression)


• Model family: Linear in the weights; non-linear with respect to raw inputs.
• Features: Non-linear functions of the attributes
• Objective function: Mean squared error
• Optimizer: Normal equations
