0% found this document useful (0 votes)

93 views

1 Lecture 3: Optimization and Linear Regression

This document summarizes a lecture on optimization and linear regression. It begins with a review of the components of a supervised machine learning problem, including the dataset, learning algorithm, model class, objective function, and optimizer. It then provides a calculus review of derivatives, partial derivatives, and gradients to define optimization concepts. Gradient descent is introduced as an important algorithm that uses gradients to minimize an objective function. Visual examples are given of a quadratic objective function and its gradient to illustrate these optimization concepts.

Uploaded by

Jeremy Wang

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

93 views

1 Lecture 3: Optimization and Linear Regression

Uploaded by

Jeremy Wang

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 27

lecture3-linear-regression

September 15, 2020

1 Lecture 3: Optimization and Linear Regression

1.0.1 Applied Machine Learning

Volodymyr KuleshovCornell Tech

2 Part 1: Optimization and Calculus Background

In the previous lecture, we learned what is a supervised machine learning problem.

Before we turn our attention to Linear Regression, we will first dive deeper into the question of
optimization.

3 Review: Components of A Supervised Machine Learning Prob-

lem

At a high level, a supervised machine learning problem has the following structure:

Dataset + Learning Algorithm → Predictive Model

| {z }
Model Class + Objective + Optimizer

The predictive model is chosen to model the relationship between inputs and targets. For instance,
it can predict future targets.

4 Optimizer: Notation

At a high-level an optimizer takes * an objective J (also called a loss function) and * a model class
M and finds a model f ∈ M with the smallest value of the objective J.

min J(f )
f ∈M

1
Intuitively, this is the function that bests “fits” the data on the training dataset D = {(x(i) , y (i) ) |
i = 1, 2, ..., n}.
We will use the a quadratic function as our running example for an objective J.
[2]: import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [8, 4]

[3]: def quadratic_function(theta):

"""The cost function, J(theta)."""
return 0.5*(2*theta-1)**2

We can visualize it.

[4]: # First construct a grid of theta1 parameter pairs and their corresponding
# cost function values.
thetas = np.linspace(-0.2,1,10)
f_vals = quadratic_function(thetas[:,np.newaxis])

plt.plot(thetas, f_vals)
plt.xlabel('Theta')
plt.ylabel('Objective value')
plt.title('Simple quadratic function')

[4]: Text(0.5, 1.0, 'Simple quadratic function')

2
5 Calculus Review: Derivatives

Recall that the derivative

df (θ0 )
dθ
of a univariate function f : R → R is the instantaneous rate of change of the function f (θ) with
respect to its parameter θ at the point θ0 .
[5]: def quadratic_derivative(theta):
return (2*theta-1)*2

df0 = quadratic_derivative(np.array([[0]])) # derivative at zero

f0 = quadratic_function(np.array([[0]]))
line_length = 0.2

plt.plot(thetas, f_vals)
plt.annotate('', xytext=(0-line_length, f0-line_length*df0), xy=(0+line_length,␣
,→f0+line_length*df0),

arrowprops={'arrowstyle': '-', 'lw': 1.5}, va='center',␣

,→ha='center')

plt.xlabel('Theta')
plt.ylabel('Objective value')
plt.title('Simple quadratic function')

[5]: Text(0.5, 1.0, 'Simple quadratic function')

3
[6]: pts = np.array([[0, 0.5, 0.8]]).reshape((3,1))
df0s = quadratic_derivative(pts)
f0s = quadratic_function(pts)

plt.plot(thetas, f_vals)
for pt, f0, df0 in zip(pts.flatten(), f0s.flatten(), df0s.flatten()):
plt.annotate('', xytext=(pt-line_length, f0-line_length*df0),␣
,→xy=(pt+line_length, f0+line_length*df0),

arrowprops={'arrowstyle': '-', 'lw': 1}, va='center', ha='center')

plt.xlabel('Theta')
plt.ylabel('Objective value')
plt.title('Simple quadratic function')

[6]: Text(0.5, 1.0, 'Simple quadratic function')

6 Calculus Review: Partial Derivatives

The partial derivative

∂f (θ0 )
∂θj
of a multivariate function f : Rd → R is the derivative of f with respect to θj while all othe other
inputs θk for k ̸= j are fixed.

4
7 Calculus Review: The Gradient

The gradient ∇θ f further extends the derivative to multivariate functions f : Rd → R, and is

defined at a point θ0 as

 ∂f (θ0 ) 
∂θ
 ∂f (θ10 ) 
 ∂θ2 
∇θ f (θ0 ) = 
 ..  .

 . 
∂f (θ0 )
∂θd

∂f (θ0 )
The j-th entry of the vector ∇θ f (θ0 ) is the partial derivative ∂θj of f with respect to the j-th
component of θ.
We will use a quadratic function as a running example.
[7]: def quadratic_function2d(theta0, theta1):
"""Quadratic objective function, J(theta0, theta1).

The inputs theta0, theta1 are 2d arrays and we evaluate

the objective at each value theta0[i,j], theta1[i,j].
We implement it this way so it's easier to plot the
level curves of the function in 2d.

Parameters:
theta0 (np.array): 2d array of first parameter theta0
theta1 (np.array): 2d array of second parameter theta1

Returns:
fvals (np.array): 2d array of objective function values
fvals is the same dimension as theta0 and theta1.
fvals[i,j] is the value at theta0[i,j] and theta1[i,j].
"""
theta0 = np.atleast_2d(np.asarray(theta0))
theta1 = np.atleast_2d(np.asarray(theta1))
return 0.5*((2*theta1-2)**2 + (theta0-3)**2)

Let’s visualize this function.

[8]: theta0_grid = np.linspace(-4,7,101)
theta1_grid = np.linspace(-1,4,101)
theta_grid = theta0_grid[np.newaxis,:], theta1_grid[:,np.newaxis]
J_grid = quadratic_function2d(theta0_grid[np.newaxis,:], theta1_grid[:,np.
,→newaxis])

X, Y = np.meshgrid(theta0_grid, theta1_grid)
contours = plt.contour(X, Y, J_grid, 10)
plt.clabel(contours)

5
plt.axis('equal')

[8]: (-4.0, 7.0, -1.0, 4.0)

Let’s write down the derivative of the quadratic function.

[9]: def quadratic_derivative2d(theta0, theta1):
"""Derivative of quadratic objective function.

The inputs theta0, theta1 are 1d arrays and we evaluate

the derivative at each value theta0[i], theta1[i].

Parameters:
theta0 (np.array): 1d array of first parameter theta0
theta1 (np.array): 1d array of second parameter theta1

Returns:
grads (np.array): 2d array of partial derivatives
grads is of the same size as theta0 and theta1
along first dimension and of size
two along the second dimension.
grads[i,j] is the j-th partial derivative
at input theta0[i], theta1[i].
"""
# this is the gradient of 0.5*((2*theta1-2)**2 + (theta0-3)**2)
grads = np.stack([theta0-3, (2*theta1-2)*2], axis=1)
grads = grads.reshape([len(theta0), 2])
return grads

6
We can visualize the derivative.
[10]: theta0_pts, theta1_pts = np.array([2.3, -1.35, -2.3]), np.array([2.4, -0.15, 2.
,→75])

dfs = quadratic_derivative2d(theta0_pts, theta1_pts)

line_length = 0.2

contours = plt.contour(X, Y, J_grid, 10)

for theta0_pt, theta1_pt, df0 in zip(theta0_pts, theta1_pts, dfs):
plt.annotate('', xytext=(theta0_pt, theta1_pt),
xy=(theta0_pt-line_length*df0[0],␣
,→theta1_pt-line_length*df0[1]),

arrowprops={'arrowstyle': '->', 'lw': 2}, va='center',␣

,→ha='center')

plt.scatter(theta0_pts, theta1_pts)
plt.clabel(contours)
plt.xlabel('Theta0')
plt.ylabel('Theta1')
plt.title('Gradients of the quadratic function')
plt.axis('equal')

[10]: (-4.0, 7.0, -1.0, 4.0)

# Part 1b: Gradient Descent

Next, we will use gradients to define an important algorithm called gradient descent.

7
8 Calculus Review: The Gradient

The gradient ∇θ f further extends the derivative to multivariate functions f : Rd → R, and is

defined at a point θ0 as

 ∂f (θ0 ) 
∂θ
 ∂f (θ10 ) 
 ∂θ2 
∇θ f (θ0 ) = 
 ..  .

 . 
∂f (θ0 )
∂θd

∂f (θ0 )
The j-th entry of the vector ∇θ f (θ0 ) is the partial derivative ∂θj of f with respect to the j-th
component of θ.
[11]: theta0_pts, theta1_pts = np.array([2.3, -1.35, -2.3]), np.array([2.4, -0.15, 2.
,→75])

dfs = quadratic_derivative2d(theta0_pts, theta1_pts)

line_length = 0.2

contours = plt.contour(X, Y, J_grid, 10)

for theta0_pt, theta1_pt, df0 in zip(theta0_pts, theta1_pts, dfs):
plt.annotate('', xytext=(theta0_pt, theta1_pt),
xy=(theta0_pt-line_length*df0[0],␣
,→theta1_pt-line_length*df0[1]),

arrowprops={'arrowstyle': '->', 'lw': 2}, va='center',␣

,→ha='center')

plt.scatter(theta0_pts, theta1_pts)
plt.clabel(contours)
plt.xlabel('Theta0')
plt.ylabel('Theta1')
plt.title('Gradients of the quadratic function')
plt.axis('equal')

[11]: (-4.0, 7.0, -1.0, 4.0)

8
9 Gradient Descent: Intuition

Gradient descent is a very common optimization algorithm used in machine learning.

The intuition behind gradient descent is to repeatedly obtain the gradient to determine the direction
in which the function decreases most steeply and take a step in that direction.

10 Gradient Descent: Notation

More formally, if we want to optimize J(θ), we start with an initial guess θ0 for the parameters
and repeat the following update until θ is no longer changing:

θi := θi−1 − α · ∇θ J(θi−1 ).

As code, this method may look as follows:

theta, theta_prev = random_initialization()
while norm(theta - theta_prev) > convergence_threshold:
theta_prev = theta
theta = theta_prev - step_size * gradient(theta_prev)
In the above algorithm, we stop when ||θi − θi−1 || is small.
It’s easy to implement this function in numpy.

9
[24]: convergence_threshold = 2e-1
step_size = 2e-1
theta, theta_prev = np.array([[-2], [3]]), np.array([[0], [0]])
opt_pts = [theta.flatten()]
opt_grads = []

while np.linalg.norm(theta - theta_prev) > convergence_threshold:

# we repeat this while the value of the function is decreasing
theta_prev = theta
gradient = quadratic_derivative2d(*theta).reshape([2,1])
theta = theta_prev - step_size * gradient
opt_pts += [theta.flatten()]
opt_grads += [gradient.flatten()]

We can now visualize gradient descent.

[25]: opt_pts = np.array(opt_pts)
opt_grads = np.array(opt_grads)

contours = plt.contour(X, Y, J_grid, 10)

plt.clabel(contours)
plt.scatter(opt_pts[:,0], opt_pts[:,1])

for opt_pt, opt_grad in zip(opt_pts, opt_grads):

plt.annotate('', xytext=(opt_pt[0], opt_pt[1]),
xy=(opt_pt[0]-0.8*step_size*opt_grad[0], opt_pt[1]-0.
,→8*step_size*opt_grad[1]),

arrowprops={'arrowstyle': '->', 'lw': 2}, va='center',␣

,→ha='center')

plt.axis('equal')

[25]: (-4.0, 7.0, -1.0, 4.0)

10
# Part 2: Gradient Descent in Linear Models
Let’s now use gradient descent to derive a supervised learning algorithm for linear models.

11 Review: Gradient Descent

If we want to optimize J(θ), we start with an initial guess θ0 for the parameters and repeat the
following update:
θi := θi−1 − α · ∇θ J(θi−1 ).

As code, this method may look as follows:

theta, theta_prev = random_initialization()
while norm(theta - theta_prev) > convergence_threshold:
theta_prev = theta
theta = theta_prev - step_size * gradient(theta_prev)

12 Review: Linear Model Family

Recall that a linear model has the form

y = θ0 + θ1 · x1 + θ2 · x2 + ... + θd · xd
where x ∈ Rd is a vector of features and y is the target. The θj are the parameters of the model.
By using the notation x0 = 1, we can represent the model in a vectorized form
∑
d
fθ (x) = θj · xj = θ⊤ x.
j=0

11
Let’s define our model in Python.
[26]: def f(X, theta):
"""The linear model we are trying to fit.

Parameters:
theta (np.array): d-dimensional vector of parameters
X (np.array): (n,d)-dimensional data matrix

Returns:
y_pred (np.array): n-dimensional vector of predicted targets
"""
return X.dot(theta)

13 An Objective: Mean Squared Error

We pick θ to minimize the mean squared error (MSE). Slight variants of this objective are also
known as the residual sum of squares (RSS) or the sum of squared residuals (SSR).

1 ∑ (i)
n
J(θ) = (y − θ⊤ x(i) )2
2n
i=1

In other words, we are looking for the best compromise in θ over all the data points.
Let’s implement mean squared error.
[27]: def mean_squared_error(theta, X, y):
"""The cost function, J(theta0, theta1) describing the goodness of fit.

Parameters:
theta (np.array): d-dimensional vector of parameters
X (np.array): (n,d)-dimensional design matrix
y (np.array): n-dimensional vector of targets
"""
return 0.5*np.mean((y-f(X, theta))**2)

14 Mean Squared Error: Partial Derivatives

Let’s work out what a partial derivative is for the MSE error loss for a linear model.

12
∂J(θ) ∂ 1
= (fθ (x) − y)2
∂θj ∂θj 2
∂
= (fθ (x) − y) · (fθ (x) − y)
∂θj
( d )
∂ ∑
= (fθ (x) − y) · θk · x k − y
∂θj
k=0
= (fθ (x) − y) · xj

15 Mean Squared Error: The Gradient

We can use this derivation to obtain an expression for the gradient of the MSE for a linear model
 ∂f (θ)   
∂θ1 (fθ (x) − y) · x1
 ∂f (θ)  
 ∂θ2  (fθ (x) − y) · x2  
∇θ J(θ) =  
 ..  =  ..  = (fθ (x) − y) · x.
 .   . 
∂f (θ) (fθ (x) − y) · xd
∂θd

Let’s implement the gradient.

[28]: def mse_gradient(theta, X, y):
"""The cost function, J(theta0, theta1) describing the goodness of fit.

Parameters:
theta (np.array): d-dimensional vector of parameters
X (np.array): (n,d)-dimensional design matrix
y (np.array): n-dimensional vector of targets

Returns:
grad (np.array): d-dimensional gradient of the MSE
"""
return np.mean((f(X, theta) - y) * X.T, axis=1)

16 The UCI Diabetes Dataset

In this section, we are going to again use the UCI Diabetes Dataset. * For each patient we have
a access to a measurement of their body mass index (BMI) and a quantiative diabetes risk score
(from 0-300). * We are interested in understanding how BMI affects an individual’s diabetes risk.

[29]: %matplotlib inline

import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [8, 4]

13
import numpy as np
import pandas as pd
from sklearn import datasets

# Load the diabetes dataset

X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)

# add an extra column of onens

X['one'] = 1

# Collect 20 data points and only use bmi dimension

X_train = X.iloc[-20:].loc[:, ['bmi', 'one']]
y_train = y.iloc[-20:] / 300

plt.scatter(X_train.loc[:,['bmi']], y_train, color='black')

plt.xlabel('Body Mass Index (BMI)')
plt.ylabel('Diabetes Risk')

[29]: Text(0, 0.5, 'Diabetes Risk')

17 Gradient Descent for Linear Regression

Putting this together with the gradient descent algorithm, we obtain a learning method for training
linear models.
theta, theta_prev = random_initialization()
while abs(J(theta) - J(theta_prev)) > conv_threshold:

14
theta_prev = theta
theta = theta_prev - step_size * (f(x, theta)-y) * x
This update rule is also known as the Least Mean Squares (LMS) or Widrow-Hoff learning rule.

[34]: threshold = 1e-3

step_size = 4e-1
theta, theta_prev = np.array([2,1]), np.ones(2,)
opt_pts = [theta]
opt_grads = []
iter = 0

while np.linalg.norm(theta - theta_prev) > threshold:

if iter % 100 == 0:
print('Iteration %d. MSE: %.6f' % (iter, mean_squared_error(theta,␣
,→X_train, y_train)))

theta_prev = theta
gradient = mse_gradient(theta, X_train, y_train)
theta = theta_prev - step_size * gradient
opt_pts += [theta]
opt_grads += [gradient]
iter += 1

Iteration 0. MSE: 0.171729

Iteration 100. MSE: 0.014765
Iteration 200. MSE: 0.014349
Iteration 300. MSE: 0.013997
Iteration 400. MSE: 0.013701

[35]: x_line = np.stack([np.linspace(-0.1, 0.1, 10), np.ones(10,)])

y_line = opt_pts[-1].dot(x_line)

plt.scatter(X_train.loc[:,['bmi']], y_train, color='black')

plt.plot(x_line[0], y_line)
plt.xlabel('Body Mass Index (BMI)')
plt.ylabel('Diabetes Risk')

[35]: Text(0, 0.5, 'Diabetes Risk')

15
# Part 3: Ordinary Least Squares
In practice, there is a more effective way than gradient descent to find linear model parameters.
We will see this method here, which will lead to our first non-toy algorithm: Ordinary Least
Squares.

18 Review: The Gradient

The gradient ∇θ f further extends the derivative to multivariate functions f : Rd → R, and is

defined at a point θ0 as

 ∂f (θ0 ) 
∂θ
 ∂f (θ10 ) 
 ∂θ2 
∇θ f (θ0 ) = 
 ..  .

 . 
∂f (θ0 )
∂θd

∂f (θ0 )
In other words, the j-th entry of the vector ∇θ f (θ0 ) is the partial derivative ∂θj of f with respect
to the j-th component of θ.

19 The UCI Diabetes Dataset

16
[36]: %matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [8, 4]

import numpy as np
import pandas as pd
from sklearn import datasets

# Load the diabetes dataset

X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)

# add an extra column of onens

X['one'] = 1

# Collect 20 data points

X_train = X.iloc[-20:]
y_train = y.iloc[-20:]

plt.scatter(X_train.loc[:,['bmi']], y_train, color='black')

plt.xlabel('Body Mass Index (BMI)')
plt.ylabel('Diabetes Risk')

[36]: Text(0, 0.5, 'Diabetes Risk')

17
20 Notation: Design Matrix

Machine learning algorithms are most easily defined in the language of linear algebra. Therefore,
it will be useful to represent the entire dataset as one matrix X ∈ Rn×d , of the form:
 (1) (1) (1)
  
x1 x2 . . . xd − (x(1) )⊤ −
 (2) (2) 
. . . xd   − (x(2) )⊤ −
(2)
 x1
= 
x2
X= .  . .
 .
 ..   . 
x1
(n)
x2
(n)
. . . xd
(n) − (x(n) )⊤ −

We can view the design matrix for the diabetes dataset.

[37]: X_train.head()

[37]: age sex bmi bp s1 s2 s3 \

422 -0.078165 0.050680 0.077863 0.052858 0.078236 0.064447 0.026550
423 0.009016 0.050680 -0.039618 0.028758 0.038334 0.073529 -0.072854
424 0.001751 0.050680 0.011039 -0.019442 -0.016704 -0.003819 -0.047082
425 -0.078165 -0.044642 -0.040696 -0.081414 -0.100638 -0.112795 0.022869
426 0.030811 0.050680 -0.034229 0.043677 0.057597 0.068831 -0.032356

s4 s5 s6 one
422 -0.002592 0.040672 -0.009362 1
423 0.108111 0.015567 -0.046641 1
424 0.034309 0.024053 0.023775 1
425 -0.076395 -0.020289 -0.050783 1
426 0.057557 0.035462 0.085907 1

21 Notation: Design Matrix

Similarly, we can vectorize the target variables into a vector y ∈ Rn of the form
 (1) 
y
 y (2) 
 
y =  . .
 . .
y (n)

22 Squared Error in Matrix Form

Recall that we may fit a linear model by choosing θ that minimizes the squared error:

1 ∑ (i)
n
J(θ) = (y − θ⊤ x(i) )2
2
i=1

18
In other words, we are looking for the best compromise in β over all the data points.
We can write this sum in matrix-vector form as:
1 1
J(θ) = (y − Xθ)⊤ (y − Xθ) = ∥y − Xθ∥2 ,
2 2
where X is the design matrix and ∥ · ∥ denotes the Euclidean norm.

23 The Gradient of the Squared Error

We can a gradient for the mean squared error as follows.

1
∇θ J(θ) = ∇θ (Xθ − y)⊤ (Xθ − y)
2
1 ( )
= ∇θ (Xθ)⊤ (Xθ) − (Xθ)⊤ y − y ⊤ (Xθ) + y ⊤ y
2
1 ( )
= ∇θ θ⊤ (X⊤X)θ − 2(Xθ)⊤ y
2
1( )
= 2(X ⊤ X)θ − 2X ⊤ y
2
= (X ⊤ X)θ − X ⊤ y

We used the facts that a⊤ b = b⊤ a (line 3), that ∇x b⊤ x = b (line 4), and that ∇x x⊤ Ax = 2Ax for
a symmetric matrix A (line 4).

24 Normal Equations

Setting the above derivative to zero, we obtain the normal equations:

(X ⊤ X)θ = X ⊤ y.

Hence, the value θ∗ that minimizes this objective is given by:

θ∗ = (X ⊤ X)−1 X ⊤ y.

Note that we assumed that the matrix (X ⊤ X) is invertible; if this is not the case, there are easy
ways of addressing this issue.
Let’s apply the normal equations.
[21]: import numpy as np

theta_best = np.linalg.inv(X_train.T.dot(X_train)).dot(X_train.T).dot(y_train)
theta_best_df = pd.DataFrame(data=theta_best[np.newaxis, :], columns=X.columns)
theta_best_df

19
[21]: age sex bmi bp s1 s2 \
0 -3.888868 204.648785 -64.289163 -262.796691 14003.726808 -11798.307781

s3 s4 s5 s6 one
0 -5892.15807 -1136.947646 -2736.597108 -393.879743 155.698998

We can now use our estimate of theta to compute predictions for 3 new data points.
[22]: # Collect 3 data points for testing
X_test = X.iloc[:3]
y_test = y.iloc[:3]

# generate predictions on the new patients

y_test_pred = X_test.dot(theta_best)

Let’s visualize these predictions.

[23]: # visualize the results
plt.xlabel('Body Mass Index (BMI)')
plt.ylabel('Diabetes Risk')
plt.scatter(X_train.loc[:, ['bmi']], y_train)
plt.scatter(X_test.loc[:, ['bmi']], y_test, color='red', marker='o')
plt.plot(X_test.loc[:, ['bmi']], y_test_pred, 'x', color='red', mew=3,␣
,→markersize=8)

plt.legend(['Model', 'Prediction', 'Initial patients', 'New patients'])

[23]: <matplotlib.legend.Legend at 0x128d89668>

20
25 Algorithm: Ordinary Least Squares

• Type: Supervised learning (regression)

• Model family: Linear models
• Objective function: Mean squared error
• Optimizer: Normal equations
# Part 4: Non-Linear Least Squares
So far, we have learned about a very simple linear model. These can capture only simple lin-
ear relationships in the data. How can we use what we learned so far to model more complex
relationships?
We will now see a simple approach to model complex non-linear relationships called least squares.

26 Review: Polynomial Functions

Recall that a polynomial of degree p is a function of the form

ap xp + ap−1 xp−1 + ... + a1 x + a0 .

Below are some examples of polynomial functions.

[24]: import warnings
warnings.filterwarnings("ignore")

plt.figure(figsize=(16,4))
x_vars = np.linspace(-2, 2)

plt.subplot('131')
plt.title('Quadratic Function')
plt.plot(x_vars, x_vars**2)
plt.legend(["$x^2$"])

plt.subplot('132')
plt.title('Cubic Function')
plt.plot(x_vars, x_vars**3)
plt.legend(["$x^3$"])

plt.subplot('133')
plt.title('Third Degree Polynomial')
plt.plot(x_vars, x_vars**3 + 2*x_vars**2 + x_vars + 1)
plt.legend(["$x^3 + 2 x^2 + x + 1$"])

[24]: <matplotlib.legend.Legend at 0x128ed2ac8>

21
27 Modeling Non-Linear Relationships With Polynomial Regres-
sion

Specifically, given a one-dimensional continuous variable x, we can defining a feature function

F : R → Rp+1 as  
1
x
 
 2
ϕ(x) = x  .
 .. 
.
xp

The class of models of the form

∑
p
fθ (x) := θp xp = θ⊤ ϕ(x)
j=0

with parameters θ and polynomial features ϕ is the set of p-degree polynomials.

• This model is non-linear in the input variable x, meaning that we can model complex data
relationships.
• It is a linear model as a function of the parameters θ, meaning that we can use our familiar
ordinary least squares algorithm to learn these features.

28 The UCI Diabetes Dataset

[25]: %matplotlib inline

import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [8, 4]

22
import numpy as np
import pandas as pd
from sklearn import datasets

# Load the diabetes dataset

X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)

# add an extra column of onens

X['one'] = 1

# Collect 20 data points

X_train = X.iloc[-20:]
y_train = y.iloc[-20:]

plt.scatter(X_train.loc[:,['bmi']], y_train, color='black')

plt.xlabel('Body Mass Index (BMI)')
plt.ylabel('Diabetes Risk')

[25]: Text(0, 0.5, 'Diabetes Risk')

29 Diabetes Dataset: A Non-Linear Featurization

Let’s now obtain linear features for this dataset.

[26]: X_bmi = X_train.loc[:, ['bmi']]

23
X_bmi_p3 = pd.concat([X_bmi, X_bmi**2, X_bmi**3], axis=1)
X_bmi_p3.columns = ['bmi', 'bmi2', 'bmi3']
X_bmi_p3['one'] = 1
X_bmi_p3.head()

[26]: bmi bmi2 bmi3 one

422 0.077863 0.006063 0.000472 1
423 -0.039618 0.001570 -0.000062 1
424 0.011039 0.000122 0.000001 1
425 -0.040696 0.001656 -0.000067 1
426 -0.034229 0.001172 -0.000040 1

30 Diabetes Dataset: A Polynomial Model

By training a linear model on this featurization of the diabetes set, we can obtain a polynomial
model of diabetest risk as a function of BMI.
[27]: # Fit a linear regression
theta = np.linalg.inv(X_bmi_p3.T.dot(X_bmi_p3)).dot(X_bmi_p3.T).dot(y_train)

# Show the learned polynomial curve

x_line = np.linspace(-0.1, 0.1, 10)
x_line_p3 = np.stack([x_line, x_line**2, x_line**3, np.ones(10,)], axis=1)
y_train_pred = x_line_p3.dot(theta)

plt.xlabel('Body Mass Index (BMI)')

plt.ylabel('Diabetes Risk')
plt.scatter(X_bmi, y_train)
plt.plot(x_line, y_train_pred)

[27]: [<matplotlib.lines.Line2D at 0x1292c99e8>]

24
31 Multivariate Polynomial Regression

We can also take this approach to construct non-linear function of multiples variable by using
multivariate polynomials.
For example, a polynomial of degree 2 over two variables x1 , x2 is a function of the form

a20 x21 + a10 x1 + a02 x22 + a01 x2 + a11 x1 x2 + a00 .

In general, a polynomial of degree p over two variables x1 , x2 is a function of the form

∑
f (x1 , x2 ) = aij xi1 xj2 .
i,j≥0:i+j≤p

In our two-dimensional example, this corresponds to a feature function ϕ : R2 → R6 of the form

 
1
 x1 
 2 
 x1 
ϕ(x) =  
 x2  .
 
 x2 
2
x1 x2

The same approach holds for polynomials of an degree and any number of variables.

25
32 Towards General Non-Linear Features

Any non-linear feature map ϕ(x) : R → Rp can be used in this way to obtain general models of the
form
fθ (x) := θ⊤ ϕ(x)
that are highly non-linear in x but linear in θ.
For example, here is a way of modeling complex periodic functions via a sum of sines and cosines.
[28]: import warnings
warnings.filterwarnings("ignore")

plt.figure(figsize=(16,4))
x_vars = np.linspace(-5, 5)

plt.subplot('131')
plt.title('Cosine Function')
plt.plot(x_vars, np.cos(x_vars))
plt.legend(["$cos(x)$"])

plt.subplot('132')
plt.title('Sine Function')
plt.plot(x_vars, np.sin(2*x_vars))
plt.legend(["$x^3$"])

plt.subplot('133')
plt.title('Combination of Sines and Cosines')
plt.plot(x_vars, np.cos(x_vars) + np.sin(2*x_vars) + np.cos(4*x_vars))
plt.legend(["$cos(x) + sin(2x) + cos(4x)$"])

[28]: <matplotlib.legend.Legend at 0x129571160>

26
33 Algorithm: Non-Linear Least Squares

• Type: Supervised learning (regression)

• Model family: Linear in the weights; non-linear with respect to raw inputs.
• Features: Non-linear functions of the attributes
• Objective function: Mean squared error
• Optimizer: Normal equations
[ ]:

CLO 3DMarvelous Designer Manual
75% (16)
CLO 3DMarvelous Designer Manual
405 pages
200-301 Exam - Free Actual Q&as, Page 1 _ ExamTopics
No ratings yet
200-301 Exam - Free Actual Q&as, Page 1 _ ExamTopics
579 pages
Machine Learning - Exploring The Model - Resp
No ratings yet
Machine Learning - Exploring The Model - Resp
18 pages
Swipe To Unlock A Primer On Technology and Business Strategy PDF Ebook by Neel Mehta
0% (2)
Swipe To Unlock A Primer On Technology and Business Strategy PDF Ebook by Neel Mehta
3 pages
Gradient Descent Vizcs229 PDF
No ratings yet
Gradient Descent Vizcs229 PDF
7 pages
Linear Regression With Gradient Descent
100% (1)
Linear Regression With Gradient Descent
8 pages
Linear Regression
No ratings yet
Linear Regression
14 pages
Homework2 Advanced Ml
No ratings yet
Homework2 Advanced Ml
4 pages
07_Gradient_Descent_For_Linear_Regression_10_min
No ratings yet
07_Gradient_Descent_For_Linear_Regression_10_min
5 pages
Gradient Descent and SGD
No ratings yet
Gradient Descent and SGD
8 pages
1.1 ID5059 1.2 Tom Kelsey - Jan 2021: February 15, 2021
No ratings yet
1.1 ID5059 1.2 Tom Kelsey - Jan 2021: February 15, 2021
43 pages
H2_AndresAlcivar
No ratings yet
H2_AndresAlcivar
12 pages
Deep Learning Assignment2 Solutions PDF
No ratings yet
Deep Learning Assignment2 Solutions PDF
16 pages
Math YHPLinear Regression
No ratings yet
Math YHPLinear Regression
13 pages
Optimization Based On Gradient Descent
No ratings yet
Optimization Based On Gradient Descent
24 pages
Gradient Based Optimization
No ratings yet
Gradient Based Optimization
24 pages
11 Gradient Descent
No ratings yet
11 Gradient Descent
58 pages
CS6910 Tutorial1
No ratings yet
CS6910 Tutorial1
10 pages
Lecture 2-Linear-Regression-Part1
No ratings yet
Lecture 2-Linear-Regression-Part1
80 pages
Gradient Descent - Xiaowei Huang
No ratings yet
Gradient Descent - Xiaowei Huang
53 pages
FAI 4 Mathematical Concepts II
No ratings yet
FAI 4 Mathematical Concepts II
39 pages
Representer Function
No ratings yet
Representer Function
12 pages
Đ Xuân Trư ng-IEIESB21003-HW2.2
No ratings yet
Đ Xuân Trư ng-IEIESB21003-HW2.2
11 pages
Linear Regression
No ratings yet
Linear Regression
9 pages
Regression Analysis
No ratings yet
Regression Analysis
54 pages
Linear+regression+with+one+variable
No ratings yet
Linear+regression+with+one+variable
48 pages
Linear Regression With One Variable
No ratings yet
Linear Regression With One Variable
12 pages
Gradient Based Optimization
No ratings yet
Gradient Based Optimization
8 pages
Derivatives
No ratings yet
Derivatives
2 pages
Backpropagation: Loading Data
No ratings yet
Backpropagation: Loading Data
12 pages
(MLP) Lecture Notes
No ratings yet
(MLP) Lecture Notes
22 pages
Gradient Descent
No ratings yet
Gradient Descent
5 pages
Perceptrons
No ratings yet
Perceptrons
12 pages
Workshop 8: Numerical Differentiation and Integration
No ratings yet
Workshop 8: Numerical Differentiation and Integration
9 pages
ML Notes
No ratings yet
ML Notes
14 pages
Least Square Vs Gradient Descent
No ratings yet
Least Square Vs Gradient Descent
52 pages
CS229 Lecture 2 PDF
100% (1)
CS229 Lecture 2 PDF
48 pages
Gradient Descent - Problem of Hiking Down A Mountain: Derivatives
No ratings yet
Gradient Descent - Problem of Hiking Down A Mountain: Derivatives
8 pages
Vertopal.com C1 W1 Lab04 Gradient Descent Soln
No ratings yet
Vertopal.com C1 W1 Lab04 Gradient Descent Soln
11 pages
Gradient Descent Algorithm
No ratings yet
Gradient Descent Algorithm
6 pages
CSE488_Lab6_Optimization
No ratings yet
CSE488_Lab6_Optimization
20 pages
Lecture3_upload
No ratings yet
Lecture3_upload
28 pages
Lecture 1, Part 1: Linear Regression: Roger Grosse
No ratings yet
Lecture 1, Part 1: Linear Regression: Roger Grosse
9 pages
L02 Linear Regression
No ratings yet
L02 Linear Regression
9 pages
Autodiff
No ratings yet
Autodiff
12 pages
Computing For Data Sciences: Introduction To Regression Analysis
No ratings yet
Computing For Data Sciences: Introduction To Regression Analysis
9 pages
hw07 Neural Soln PDF
No ratings yet
hw07 Neural Soln PDF
6 pages
ML Labs
No ratings yet
ML Labs
46 pages
micromax-1
No ratings yet
micromax-1
2 pages
Linear Regression With Multiple Features
No ratings yet
Linear Regression With Multiple Features
7 pages
Slide 3 - Linear Regression One Variable
No ratings yet
Slide 3 - Linear Regression One Variable
60 pages
MMA Assignment-Python
No ratings yet
MMA Assignment-Python
19 pages
MTC-243 Python Programing Language II Slips Semester IV-1
No ratings yet
MTC-243 Python Programing Language II Slips Semester IV-1
188 pages
ML02
No ratings yet
ML02
25 pages
Slides-4 Optimization Extra Gradient Descent
No ratings yet
Slides-4 Optimization Extra Gradient Descent
67 pages
Recognition Patterns: Jean Carlo Grandas Franco March 2020
No ratings yet
Recognition Patterns: Jean Carlo Grandas Franco March 2020
9 pages
Newton Gauss Method
No ratings yet
Newton Gauss Method
37 pages
ML Lecture 2 2023
No ratings yet
ML Lecture 2 2023
59 pages
L3 Linear Regression and Gradient Descent
No ratings yet
L3 Linear Regression and Gradient Descent
46 pages
micro
No ratings yet
micro
4 pages
Profound Python Data Science
From Everand
Profound Python Data Science
Onder Teker
No ratings yet
A Short Course in Discrete Mathematics
From Everand
A Short Course in Discrete Mathematics
Edward A. Bender
3/5 (1)
Laplace Transforms Essentials
From Everand
Laplace Transforms Essentials
Morteza Shafii-Mousavi
3.5/5 (3)
Lecture4 Foundations Supervised Learning
No ratings yet
Lecture4 Foundations Supervised Learning
22 pages
1 Lecture 1: Introduction To Machine Learning
No ratings yet
1 Lecture 1: Introduction To Machine Learning
12 pages
1 Lecture 2: Supervised Machine Learning
No ratings yet
1 Lecture 2: Supervised Machine Learning
20 pages
1 Lecture 5b: Probabilistic Perspectives On ML Algorithms
No ratings yet
1 Lecture 5b: Probabilistic Perspectives On ML Algorithms
6 pages
Exam JN0-664 Topic 1 Question 5 Discussion - ExamTopics
No ratings yet
Exam JN0-664 Topic 1 Question 5 Discussion - ExamTopics
2 pages
Fluid Mechanics Mech Gate Ies Notes
50% (2)
Fluid Mechanics Mech Gate Ies Notes
21 pages
DCC-17
No ratings yet
DCC-17
11 pages
Page 190-281 (Country Wise Data
No ratings yet
Page 190-281 (Country Wise Data
46 pages
Alcatel 1660 SM
No ratings yet
Alcatel 1660 SM
8 pages
Research Report
No ratings yet
Research Report
23 pages
macOS All New Features Sept 2024
No ratings yet
macOS All New Features Sept 2024
16 pages
Introduction To Discrete Event Systems
No ratings yet
Introduction To Discrete Event Systems
10 pages
Ricoh Error Codes
50% (2)
Ricoh Error Codes
6 pages
Notes - EDA-Unit1 (2)
No ratings yet
Notes - EDA-Unit1 (2)
34 pages
ocr-h446-2-2-mark-scheme
No ratings yet
ocr-h446-2-2-mark-scheme
16 pages
Vmware Migration Scenarios 986062 (1)
No ratings yet
Vmware Migration Scenarios 986062 (1)
13 pages
Agilent - Keysight E4412-90013 Power Sensor
No ratings yet
Agilent - Keysight E4412-90013 Power Sensor
51 pages
Difference Between Java Bean and EJB
100% (2)
Difference Between Java Bean and EJB
2 pages
exp-4-libre-writer
No ratings yet
exp-4-libre-writer
11 pages
AIA Final Exam Notes
100% (1)
AIA Final Exam Notes
4 pages
Revised Part 10 - Grade 10 Edumate - Playing Minor Chords in A Song
No ratings yet
Revised Part 10 - Grade 10 Edumate - Playing Minor Chords in A Song
2 pages
Elixir Language
100% (1)
Elixir Language
97 pages
PA 1 Computer
No ratings yet
PA 1 Computer
2 pages
Xlr8 Ddr4 3200Mhz: Desktop Memory
No ratings yet
Xlr8 Ddr4 3200Mhz: Desktop Memory
1 page
SLC To Compactlogix Programming Migration: Application Profile
No ratings yet
SLC To Compactlogix Programming Migration: Application Profile
86 pages
COAG NMAT For October 2021-B
No ratings yet
COAG NMAT For October 2021-B
41 pages
ST Seminar topics
No ratings yet
ST Seminar topics
2 pages
Data Sharing and Collaborationwith Delta Sharing
No ratings yet
Data Sharing and Collaborationwith Delta Sharing
127 pages
الملحق 3 ...
100% (1)
الملحق 3 ...
199 pages
GNS 101 TIME MANAGEMENT SUMMARY [futaelibrary]
No ratings yet
GNS 101 TIME MANAGEMENT SUMMARY [futaelibrary]
4 pages
TenantDirectory Eng 11-05-2024
No ratings yet
TenantDirectory Eng 11-05-2024
18 pages