
Lecture09

February 15, 2021

1 Lecture 09
1.1 ID5059
1.2 Tom Kelsey - Jan 2021
As before, we take the Jupyter notebook associated with the course textbook - annotate - explain
Chapter 4 – Training Linear Models
This notebook contains all the sample code and solutions to the exercises in chapter 4.
Run in Google Colab

2 Setup
First, let’s import a few common modules, ensure MatplotLib plots figures inline and prepare a
function to save the figures. We also check that Python 3.5 or later is installed (although Python
2.x may work, it is deprecated so we strongly recommend you use Python 3 instead), as well as
Scikit-Learn ≥ 0.20.
[1]: # Python ≥ 3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥ 0.20 is required


import sklearn
assert sklearn.__version__ >= "0.20"

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs


np.random.seed(42)

# To plot pretty figures


%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures


PROJECT_ROOT_DIR = "."
CHAPTER_ID = "training_linear_models"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):


path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
print("Saving figure", fig_id)
if tight_layout:
plt.tight_layout()
plt.savefig(path, format=fig_extension, dpi=resolution)

# Ignore useless warnings (see SciPy issue #5998)


import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

3 Linear regression using the Normal Equation


3.0.1 Linear regression
• Models of the form
ŷ = β0 + β1 x1 + β2 x2 + . . . + βp xp
– β0 is an intercept term
– needed to get solutions that have slope & intercept as in y = mx + c

3.0.2 Normal equation


Recall from Lecture 1 that our standard problem

y = Xβ + e

has an analytic solution:


$$\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y, \qquad \hat{y} = X\hat{\beta}$$

where analytic means “can be solved exactly” - i.e. it gives the coefficients that minimise the RMSE -
as long as the matrix XᵀX is invertible - in technical terms, it has a non-zero determinant - and
this solution can be computed relatively efficiently - inversion by Gaussian elimination is O(n³)
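As a one-line sketch of where the normal equation comes from (added here for completeness): set the gradient of the squared error to zero,

$$\nabla_{\beta}\,\lVert y - X\beta\rVert^{2} = -2X^{\top}(y - X\beta) = 0 \;\Rightarrow\; X^{\top}X\,\beta = X^{\top}y \;\Rightarrow\; \hat{\beta} = (X^{\top}X)^{-1}X^{\top}y$$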

3.0.3 Known signal, known noise


• Note that the examples in this lecture follow a useful pattern for analysis of properties of
models
• We take a known signal

– y = 3X + 4 in the first example
• And add a known (and repeatable) amount of noise
– standard Gaussian values from np.random.randn (repeatable because the random seed is fixed)
• This provides an empirical framework for the comparison of approaches
– since we know exactly what “error” means
[2]: import numpy as np

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

[3]: plt.plot(X, y, "b.")


plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([0, 2, 0, 15])
save_fig("generated_data_plot")
plt.show()

Saving figure generated_data_plot

3.0.4 Intercept terms


• We know that the signal is a straight line with a slope

• So we need to add the β0 term as described above
• We could set it to anything, but using 1 means that the β value returned needs no adjustment
[4]: X_b = np.c_[np.ones((100, 1)), X] # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

3.0.5 Solve using linear algebra tools from numpy


• Implement the normal equation and solve for β̂
– note that the book uses θ - this is just a symbol choice
• We get the best coefficients given the noise that was added
– close to 4 and 3
• Make a series of predictions using these coefficients
– and plot observed vs predicted
[5]: theta_best

[5]: array([[4.21509616],
[2.77011339]])

[6]: X_new = np.array([[0], [2]])


X_new_b = np.c_[np.ones((2, 1)), X_new] # add x0 = 1 to each instance
y_predict = X_new_b.dot(theta_best)
y_predict

[6]: array([[4.21509616],
[9.75532293]])

[7]: plt.plot(X_new, y_predict, "r-")


plt.plot(X, y, "b.")
plt.axis([0, 2, 0, 15])
plt.show()

The figure in the book actually corresponds to the following code, with a legend and axis labels:
[8]: plt.plot(X_new, y_predict, "r-", linewidth=2, label="Predictions")
plt.plot(X, y, "b.")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.legend(loc="upper left", fontsize=14)
plt.axis([0, 2, 0, 15])
save_fig("linear_model_predictions_plot")
plt.show()

Saving figure linear_model_predictions_plot

• Instead of writing the normal equation explicitly, we can call a library function that implements it
• And additionally gives a solution when the matrix is not invertible
[9]: from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.intercept_, lin_reg.coef_

[9]: (array([4.21509616]), array([[2.77011339]]))

[10]: lin_reg.predict(X_new)

[10]: array([[4.21509616],
[9.75532293]])

The LinearRegression class is based on the scipy.linalg.lstsq() function (the name stands
for “least squares”), which you could call directly:

[11]: theta_best_svd, residuals, rank, s = np.linalg.lstsq(X_b, y, rcond=1e-6)


theta_best_svd

[11]: array([[4.21509616],
[2.77011339]])

This function computes X+ y, where X+ is the pseudoinverse of X (specifically the Moore-Penrose


inverse). You can use np.linalg.pinv() to compute the pseudoinverse directly:

[12]: np.linalg.pinv(X_b).dot(y)

[12]: array([[4.21509616],
[2.77011339]])

3.0.6 Summary
• For simple problems we seek simple solutions
• The normal equation gives optimal results which are cheap to compute
– if the number of attributes - columns in our data - is small
• But can be expensive when we have many instances - rows in our data
• An alternative approach is to iteratively lower the error until a minimum is reached

4 Gradient descent
• Back in high school you would have seen functions of the form y = ax^2 + bx + c
– the standard quadratic polynomial
• To get the gradient (i.e. slope) of the function, we differentiate to get y′ = 2ax + b
• The function has a minimum (or maximum) where the gradient is zero - the short sketch below illustrates this numerically
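As a quick numerical illustration (a sketch added here, not part of the lecture code; all values are arbitrary), gradient descent walks down such a quadratic to the point where the derivative is zero:

[ ]: # Sketch: gradient descent on y = a*x**2 + b*x + c (arbitrary coefficients)
     a, b, c = 1.0, -4.0, 3.0      # minimum at x = -b/(2a) = 2
     x = 10.0                      # arbitrary starting point
     eta = 0.1                     # learning rate (step size)
     for _ in range(100):
         gradient = 2 * a * x + b  # derivative y' = 2ax + b
         x = x - eta * gradient    # step against the gradient
     print(x)                      # ends very close to 2.0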

4.1 Apply this to MSE


$$\mathrm{MSE} = \frac{1}{n}\sum_{j=1}^{n}\left(\hat{y}_j - y_j\right)^2$$

where
ŷ = β0 + β1 x1 + β2 x2 + . . . + βp xp

To make things easier to visualise, take p = 1, forget β0 , and consider one value of j

$$\mathrm{Error} = (\beta_1 x_1 - y)^2 = (\beta_1 x_1 - y)(\beta_1 x_1 - y) = \beta_1^2 x_1^2 - 2\beta_1 x_1 y + y^2$$

So the derivative with respect to β1 is

$$2\beta_1 x_1^2 - 2 x_1 y = 2x_1(\beta_1 x_1 - y)$$

It should be clear that the error (and hence the MSE) will be zero when the derivative is zero and
β1 x1 = y (i.e. when the prediction matches the observed data).
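In matrix form (a small connecting step, added for clarity) the gradient of the MSE with respect to the full coefficient vector is

$$\nabla_{\beta}\,\mathrm{MSE}(\beta) = \frac{2}{n}\,X^{\top}(X\beta - y)$$

which is exactly what the line gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y) computes in the batch gradient descent code further below.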
The idea is to start from a point where the error value could be anything. Then:
1. Measure the gradient of the error at that point
2. Move to somewhere where the error is less
3. Repeat until (hopefully) the error is close to zero
Source
To make it work we need three things:
1. A good starting point (but we don’t have any insights, so just choose at random)
2. A loss function that measures the difference, or error, between actual y and predicted y at the current position, and provides feedback to the process so that it can adjust the parameters to minimize the error
3. A learning rate - or step size - that is evaluated and updated based on the behavior of the loss function - too small leads to slow convergence - too large leads to oscillation around the minimum
Source
Note that - this all works as we’ve chosen our error to have a quadratic form, hence a unique
global minimum - for deep learning, we hope that the error is approximately quadratic near an
error minimum - everything remains true in p dimensions, but is hard to plot - we just have to
understand the concept - the code does all the work for us

5 Linear regression using batch gradient descent


• Batch gradient descent sums the error for each point in a training set
– updating the model only after all training examples have been evaluated
• This process is referred to as a training epoch.
• While this batching provides computational efficiency, it can still have a long processing time
for large training datasets
– needs to store all of the data into memory
• Batch gradient descent usually produces a stable error gradient and convergence
– but sometimes that convergence point is not ideal, finding a local minimum versus the
global one
In the code below we:
1. Fix a learning rate, the number of rows in the data (100), and a number of iterations before giving up - the number of iterations chosen is also an educated guess
2. Start at random coordinates in parameter space
3. Calculate the gradients for every instance and collect them into single terms
4. Update our coefficient values by the product of the step size and the gradients
[13]: eta = 0.1 # learning rate
n_iterations = 1000
m = 100

theta = np.random.randn(2,1) # random initialization

for iteration in range(n_iterations):


gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
theta = theta - eta * gradients

We get β values and predictions close to those expected

[14]: theta

[14]: array([[4.21509616],
[2.77011339]])

[15]: X_new_b.dot(theta)

[15]: array([[4.21509616],
[9.75532293]])

[16]: theta_path_bgd = []

def plot_gradient_descent(theta, eta, theta_path=None):


m = len(X_b)
plt.plot(X, y, "b.")
n_iterations = 1000
for iteration in range(n_iterations):
if iteration < 10:
y_predict = X_new_b.dot(theta)
style = "b-" if iteration > 0 else "r--"
plt.plot(X_new, y_predict, style)
gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
theta = theta - eta * gradients
if theta_path is not None:
theta_path.append(theta)
plt.xlabel("$x_1$", fontsize=18)
plt.axis([0, 2, 0, 15])
plt.title(r"$\eta = {}$".format(eta), fontsize=16)

5.0.1 Vary the learning rate


• Too slow
• Too many jumps
• Just right
[17]: np.random.seed(42)
theta = np.random.randn(2,1) # random initialization

plt.figure(figsize=(10,4))
plt.subplot(131); plot_gradient_descent(theta, eta=0.02)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.subplot(132); plot_gradient_descent(theta, eta=0.1, theta_path=theta_path_bgd)

plt.subplot(133); plot_gradient_descent(theta, eta=0.5)

save_fig("gradient_descent_plot")
plt.show()

Saving figure gradient_descent_plot

6 Stochastic Gradient Descent


• Batch GD suffers when the number of instances is very large
– since you need all the gradients in memory at each step
• Reduce memory needed by taking an instance at random at each iteration
• Also gradually reduce the learning rate
– we hope that the first instances got us “close” to the minimum
– and that the later ones help zero in
[18]: theta_path_sgd = []
m = len(X_b)
np.random.seed(42)

[19]: n_epochs = 50
t0, t1 = 5, 50 # learning schedule hyperparameters

def learning_schedule(t):
return t0 / (t + t1)

theta = np.random.randn(2,1) # random initialization

for epoch in range(n_epochs):


for i in range(m):
if epoch == 0 and i < 20: # not shown in the book
y_predict = X_new_b.dot(theta) # not shown
style = "b-" if i > 0 else "r--" # not shown
plt.plot(X_new, y_predict, style) # not shown
random_index = np.random.randint(m)
xi = X_b[random_index:random_index+1]
yi = y[random_index:random_index+1]

gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
eta = learning_schedule(epoch * m + i)
theta = theta - eta * gradients
theta_path_sgd.append(theta) # not shown

plt.plot(X, y, "b.") # not shown


plt.xlabel("$x_1$", fontsize=18) # not shown
plt.ylabel("$y$", rotation=0, fontsize=18) # not shown
plt.axis([0, 2, 0, 15]) # not shown
save_fig("sgd_plot") # not shown
plt.show() # not shown

Saving figure sgd_plot

[20]: theta

[20]: array([[4.21076011],
[2.74856079]])

• As before, we get good results in terms of coefficients and predictions


– we see irregular convergence over the first 20 steps of the first epoch (plotted above) due to the random choices
• The code above is designed to show how everything works
• There is also a library that lets you choose parameters and then run out of the box

Note that: - the training instances must be independent and identically distributed - otherwise we
might optimise on one variable, then another, … - and not find the global optimum - we can shuffle
at each epoch to take care of this
[21]: from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, penalty=None, eta0=0.1, random_state=42)

sgd_reg.fit(X, y.ravel())

[21]: SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,


eta0=0.1, fit_intercept=True, l1_ratio=0.15,
learning_rate='invscaling', loss='squared_loss', max_iter=1000,
n_iter_no_change=5, penalty=None, power_t=0.25, random_state=42,
shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0,
warm_start=False)

[22]: sgd_reg.intercept_, sgd_reg.coef_

[22]: (array([4.24365286]), array([2.8250878]))

7 Mini-batch gradient descent


• Between batch and stochastic
• Take less than all, but more than one instance at each epoch
• Should be the best of both worlds
– and can use GPUs to speed up the matrix operations
• Choosing the mini-batch size is now another hyperparameter that can be tuned
[23]: theta_path_mgd = []

n_iterations = 50
minibatch_size = 20

np.random.seed(42)
theta = np.random.randn(2,1) # random initialization

t0, t1 = 200, 1000


def learning_schedule(t):
return t0 / (t + t1)

t = 0
for epoch in range(n_iterations):
shuffled_indices = np.random.permutation(m)
X_b_shuffled = X_b[shuffled_indices]
y_shuffled = y[shuffled_indices]
for i in range(0, m, minibatch_size):

t += 1
xi = X_b_shuffled[i:i+minibatch_size]
yi = y_shuffled[i:i+minibatch_size]
gradients = 2/minibatch_size * xi.T.dot(xi.dot(theta) - yi)
eta = learning_schedule(t)
theta = theta - eta * gradients
theta_path_mgd.append(theta)

[24]: theta

[24]: array([[4.25214635],
[2.7896408 ]])

7.0.1 Compare the three GD methods for two attributes


1. Batch converges in an organised way
• but might be too expensive
2. Stochastic takes many iterations
• but each iteration uses very little memory
3. Mini-batch shows randomness, but converges faster than stochastic
Note that: - This is the theory - other data might give different results - All three methods require
scaled data - Numerical optimisation is a large and important area of research - there is much we
still don’t understand - full coverage of the issues would form another complete module
[25]: theta_path_bgd = np.array(theta_path_bgd)
theta_path_sgd = np.array(theta_path_sgd)
theta_path_mgd = np.array(theta_path_mgd)

[26]: plt.figure(figsize=(7,4))
plt.plot(theta_path_sgd[:, 0], theta_path_sgd[:, 1], "r-s", linewidth=1, label="Stochastic")
plt.plot(theta_path_mgd[:, 0], theta_path_mgd[:, 1], "g-+", linewidth=2, label="Mini-batch")
plt.plot(theta_path_bgd[:, 0], theta_path_bgd[:, 1], "b-o", linewidth=3, label="Batch")

plt.legend(loc="upper left", fontsize=16)


plt.xlabel(r"$\theta_0$", fontsize=20)
plt.ylabel(r"$\theta_1$ ", fontsize=20, rotation=0)
plt.axis([2.5, 4.5, 2.3, 3.9])
save_fig("gradient_descent_paths_plot")
plt.show()

Saving figure gradient_descent_paths_plot

8 Summary
• We’ve covered gradient descent techniques for linear regression with a convex (quadratic)
error surface
• For more complex error surfaces we would consider local minima and saddle points
– places where the gradient of the loss is at or close to zero, so the model stops learning
– but is not giving us the error reduction that we want
Source
• We’ll cover this later, but for now the randomness in stochastic variants can help us “jump
out” of a local minimum or saddle point

9 Polynomial regression
• We have our X data
• Think of this as X^1 i.e. raised to the power one
• We added intercept terms: a matrix the same size as X but with all entries one
– Think of this as X^0 i.e. raised to the power zero
• The process can be extended to get X^2, X^3, X^4, . . .
• So the function in our main equation becomes a polynomial
y = β0 X^0 + β1 X^1 + β2 X^2 + β3 X^3 + e
As before we set up a known signal with known noise
[27]: import numpy as np
import numpy.random as rnd

np.random.seed(42)

[28]: m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)

[29]: plt.plot(X, y, "b.")


plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([-3, 3, 0, 10])
save_fig("quadratic_data_plot")
plt.show()

Saving figure quadratic_data_plot

9.0.1 Transform the data


• Use the PolynomialFeatures tool to get X^2 as well as X
• This tool will also help find relationships between variables
• For variables a and b and degree 3, it returns a^2, a^3, b^2, b^3 as expected and also ab, a^2 b and a b^2
• So using it with a large number of variables p and a large degree should be done with care - the small sketch below shows the terms generated
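As a small check (a sketch, not part of the lecture code; the names and values are illustrative), the exponent combinations that PolynomialFeatures generates can be inspected via its powers_ attribute:

[ ]: from sklearn.preprocessing import PolynomialFeatures
     import numpy as np

     demo = PolynomialFeatures(degree=3, include_bias=False)
     demo.fit(np.array([[1.0, 2.0]]))  # two features, call them a and b
     print(demo.powers_)               # each row gives (power of a, power of b)
     # the rows correspond to: a, b, a^2, ab, b^2, a^3, a^2 b, a b^2, b^3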

[30]: from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
X[0]

[30]: array([-0.75275929])

[31]: X_poly[0]

[31]: array([-0.75275929, 0.56664654])

9.0.2 Derive a linear regression model


• The curve is not going to be a straight line
• So not linear in one sense
• Here we are applying the same linear techniques to data that has been transformed
– the algorithm doesn’t know that some of the attributes are related
• So this is still linear regression
We get

y = 0.56x^2 + 0.93x + 1.78

which is close to the original

y = 0.5x^2 + x + 2

[32]: lin_reg = LinearRegression()


lin_reg.fit(X_poly, y)
lin_reg.intercept_, lin_reg.coef_

[32]: (array([1.78134581]), array([[0.93366893, 0.56456263]]))

[33]: X_new=np.linspace(-3, 3, 100).reshape(100, 1)


X_new_poly = poly_features.transform(X_new)
y_new = lin_reg.predict(X_new_poly)
plt.plot(X, y, "b.")
plt.plot(X_new, y_new, "r-", linewidth=2, label="Predictions")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.legend(loc="upper left", fontsize=14)
plt.axis([-3, 3, 0, 10])
save_fig("quadratic_predictions_plot")
plt.show()

Saving figure quadratic_predictions_plot

9.0.3 Overfit and underfit
• We know that the correct type of polynomial for these data is quadratic
• We can fit a degree 1 polynomial (i.e. straight line)
• And a degree 300 polynomial
• One should underfit, the other overfit
[34]: from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

for style, width, degree in (("g-", 1, 300), ("b--", 2, 2), ("r-+", 2, 1)):
polybig_features = PolynomialFeatures(degree=degree, include_bias=False)
std_scaler = StandardScaler()
lin_reg = LinearRegression()
polynomial_regression = Pipeline([
("poly_features", polybig_features),
("std_scaler", std_scaler),
("lin_reg", lin_reg),
])
polynomial_regression.fit(X, y)
y_newbig = polynomial_regression.predict(X_new)
plt.plot(X_new, y_newbig, style, label=str(degree), linewidth=width)

plt.plot(X, y, "b.", linewidth=3)

plt.legend(loc="upper left")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([-3, 3, 0, 10])
save_fig("high_degree_polynomials_plot")
plt.show()

Saving figure high_degree_polynomials_plot

[35]: from sklearn.metrics import mean_squared_error


from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):


X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=10)

train_errors, val_errors = [], []


for m in range(1, len(X_train)):
model.fit(X_train[:m], y_train[:m])
y_train_predict = model.predict(X_train[:m])
y_val_predict = model.predict(X_val)
train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
val_errors.append(mean_squared_error(y_val, y_val_predict))

plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
plt.legend(loc="upper right", fontsize=14) # not shown in the book
plt.xlabel("Training set size", fontsize=14) # not shown
plt.ylabel("RMSE", fontsize=14) # not shown

10 Learning curves
• We’ve used cross validation as the standard technique for estimating generalisation errors
– good performance on training data but bad cross validation error indicates overfit
– poor performance on training data and cross validation indicates underfit
• We want to visualise the overfit/underfit tradeoff as derivation proceeds
1. Start with a small subset of the training data
2. Learn a regression model
3. Calculate training & validation errors
4. Increase the size of the training set and repeat

10.0.1 Underfit
• This is the degree 1 model trying to predict quadratic data
• For one and two points, training error is low (we have a straight line)
– but validation error is large (we have the wrong straight line)
• Training error increases as we increase the training data, then plateaus
– failure to capture the quadratic signal
• Validation error decrease to a similar plateau
• We conclude that adding data will not reduce error
– which is underfit
[36]: lin_reg = LinearRegression()
plot_learning_curves(lin_reg, X, y)
plt.axis([0, 80, 0, 3]) # not shown in the book
save_fig("underfitting_learning_curves_plot") # not shown
plt.show() # not shown

Saving figure underfitting_learning_curves_plot

10.0.2 Overfit
• This is a degree 10 model trying to predict quadratic data
– again no training error up to 10 points as these determine a degree-10 polynomial
– again huge validation error as they determine the wrong degree-10 polynomial
• Again we see convergence to a plateau
• Two important differences
1. the plateaus indicate lower error
2. they are further apart
• These are the classic signs of overfit
[37]: from sklearn.pipeline import Pipeline

polynomial_regression = Pipeline([
("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
("lin_reg", LinearRegression()),
])

plot_learning_curves(polynomial_regression, X, y)
plt.axis([0, 80, 0, 3]) # not shown
save_fig("learning_curves_plot") # not shown
plt.show() # not shown

Saving figure learning_curves_plot

10.1 Terminology
• I refer to overfit and underfit
• Other sources (including the textbook) use different terminology
• Generalisation error consists of three things
1. Bias due to wrong assumptions about model complexity, often leading to underfit
2. Variance due to excessive sensitivity to small changes in training data, often leading to
overfit
3. Irreducible error due to the noise in the data (outliers, unreliable sensor readings,
unreliable human data entry, etc.)
• Increasing model complexity will typically increase variance and decrease bias
• Decreasing model complexity will typically decrease variance and increase bias
Caveat: the intercept terms we saw earlier are also called bias terms. For neural nets, they are
always called bias terms. This use of the word bias is not the same as the above

11 Regularized models
• Polynomial regression gives a clear way to increase model complexity
– and then decrease it to reduce overfitting
– this process is called regularisation
• The idea can be extended (much) further:
ax^2 + bx + c
dx^3 + ex^2 + fx + g

• The whole field of curve-fitting is about finding a model with the correct mathematical type
– mixed basis function linear
– Even order and half order polynomials
– Chebyshev polynomials
– Fourier-series polynomials
– standard, ln x, sqrt even, y-transformed, even order, and half order rationals
– Chebyshev rationals
– Fourier-series rationals
– nonlinear peak equations
– nonlinear transition equations
– nonlinear kinetic equations
• This approach only works for small numbers of variables (i.e. columns of X)
– so another approach to regularisation would be useful
• Regularise by constraining the weights (i.e. coefficients) that are allowed
– works for the normal equation and gradient descent methods
• Can be applied to the standard linear model (i.e. degree one)
– and to polynomial models
• We look at three commonly used techniques
• First produce a known signal (y = 0.5x + 1) and add noise

[38]: np.random.seed(42)
m = 20
X = 3 * np.random.rand(m, 1)
y = 1 + 0.5 * X + np.random.randn(m, 1) / 1.5
X_new = np.linspace(0, 3, 100).reshape(100, 1)

11.0.1 Ridge regression


$$J(\beta) = \mathrm{MSE}(\beta) + \alpha\,\frac{1}{2}\sum_{i=1}^{n}\beta_i^2$$

- The loss function is MSE plus the sum of the squared individual weights, multiplied by a hyper-
parameter α - If α is zero we just have linear regression - If α is large then the weights will be small
so we get a flat line through the data’s mean - note that we’re not using an intercept term here
Note that: - The loss function used for training is not the same as performance measure used for
testing - This is not uncommon: one is chosen for efficiency, the other performance - In classification
we often train using the log-loss function, but evaluate using precision/recall - We call specialist
solvers to find optimal weight since the underlying maths is now harder
Normal equation - two methods for getting an analytic solution for our data - both make rea-
sonable predictions
[39]: from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=1, solver="cholesky", random_state=42)
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])

[39]: array([[1.55071465]])

[40]: ridge_reg = Ridge(alpha=1, solver="sag", random_state=42)
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])

[40]: array([[1.5507201]])

Stochastic gradient descent - another reasonable - but different - prediction for the expected
value at x = 1.5
[41]: sgd_reg = SGDRegressor(penalty="l2", max_iter=1000, tol=1e-3, random_state=42)
sgd_reg.fit(X, y.ravel())
sgd_reg.predict([[1.5]])

[41]: array([1.47012588])

11.0.2 Compare choice of α


• Left panel goes from the mean at α = 100, to a regularised model, to an unregularised model
• Right panel starts with a degree 10 model
– increased α gives flatter and less extreme models
[42]: from sklearn.linear_model import Ridge

def plot_model(model_class, polynomial, alphas, **model_kargs):


for alpha, style in zip(alphas, ("b-", "g--", "r:")):
model = model_class(alpha, **model_kargs) if alpha > 0 else LinearRegression()

if polynomial:
model = Pipeline([
("poly_features", PolynomialFeatures(degree=10,␣
,→include_bias=False)),

("std_scaler", StandardScaler()),
("regul_reg", model),
])
model.fit(X, y)
y_new_regul = model.predict(X_new)
lw = 2 if alpha > 0 else 1
plt.plot(X_new, y_new_regul, style, linewidth=lw, label=r"$\alpha = {}$".format(alpha))

plt.plot(X, y, "b.", linewidth=3)


plt.legend(loc="upper left", fontsize=15)
plt.xlabel("$x_1$", fontsize=18)
plt.axis([0, 3, 0, 4])

plt.figure(figsize=(8,4))
plt.subplot(121)
plot_model(Ridge, polynomial=False, alphas=(0, 10, 100), random_state=42)
plt.ylabel("$y$", rotation=0, fontsize=18)

plt.subplot(122)
plot_model(Ridge, polynomial=True, alphas=(0, 10**-5, 1), random_state=42)

save_fig("ridge_regression_plot")
plt.show()

Saving figure ridge_regression_plot

11.0.3 Lasso regression


$$J(\beta) = \mathrm{MSE}(\beta) + \alpha\,\frac{1}{2}\sum_{i=1}^{n}\lvert\beta_i\rvert$$

- The loss function is MSE plus the sum of the weights’ absolute values, multiplied by a hyperpa-
rameter α - for mathematicians, using the ℓ1 norm instead of ridge’s ℓ2 norm - Similar performance
- But Lasso sets the weights of unimportant variables (close) to zero - and so automatically performs
variable selection - only using variables that contribute to good predictions - And returns a sparse
model (i.e. a coefficient vector containing many zeros) which aids efficiency
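As a small self-contained illustration (a sketch, not from the lecture; the data are synthetic and the names are made up), Lasso drives the weights of features that carry no signal to zero:

[ ]: import numpy as np
     from sklearn.linear_model import Lasso

     rng = np.random.RandomState(42)
     X_demo = rng.randn(200, 5)                  # five candidate features
     y_demo = 3 * X_demo[:, 0] + rng.randn(200)  # only the first feature matters

     lasso_demo = Lasso(alpha=0.1)
     lasso_demo.fit(X_demo, y_demo)
     print(lasso_demo.coef_)  # roughly [3, 0, 0, 0, 0]: unimportant weights are zeroed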
Note: to be future-proof, we set max_iter=1000 and tol=1e-3 because these will be the default
values in Scikit-Learn 0.21.
[43]: from sklearn.linear_model import Lasso

plt.figure(figsize=(8,4))
plt.subplot(121)
plot_model(Lasso, polynomial=False, alphas=(0, 0.1, 1), random_state=42)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.subplot(122)
plot_model(Lasso, polynomial=True, alphas=(0, 10**-7, 1), random_state=42)

save_fig("lasso_regression_plot")
plt.show()

/usr/local/lib/python3.8/site-
packages/sklearn/linear_model/_coordinate_descent.py:474: ConvergenceWarning:
Objective did not converge. You might want to increase the number of iterations.
Duality gap: 2.802867703827423, tolerance: 0.0009294783355207351
model = cd_fast.enet_coordinate_descent(
Saving figure lasso_regression_plot

[44]: from sklearn.linear_model import Lasso


lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
lasso_reg.predict([[1.5]])

[44]: array([1.53788174])

11.0.4 Elastic net regression


$$J(\beta) = \mathrm{MSE}(\beta) + r\alpha\,\frac{1}{2}\sum_{i=1}^{n}\lvert\beta_i\rvert + \alpha\,\frac{1-r}{2}\sum_{i=1}^{n}\beta_i^2$$

- The loss function is MSE plus a combination of ridge and lasso terms - r is a mix factor - in the
code below it’s set to 1/2 (l1_ratio=0.5)

[45]: from sklearn.linear_model import ElasticNet


elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=42)
elastic_net.fit(X, y)

elastic_net.predict([[1.5]])

[45]: array([1.54333232])

11.0.5 Summary
• Regularisation is the only real option when modelling with many features
• Choice of technique is data-dependent
– if there are many unimportant features use Lasso or Elastic Net
– but you don’t usually know this at the start
• The data should always be scaled when using these techniques

12 Early stopping
• All the regression techniques we’ve looked at are aimed at reducing errors to close to zero
• With regularisation designed to work back from low training error (overfit) to good
generalisation error
• You can think of this as wasted effort
– why not regularise by stopping as soon as optimal validation error is reached?
• This can’t be done for normal equation methods
– there is nothing to stop
• But for gradient descent - and other iterative techniques - this can be a big win
• For the example, add noise to y = 0.5X^2 + X + 2
– then train a degree 90 polynomial model
– that should massively overfit if left to run
[55]: np.random.seed(42)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 2 + X + 0.5 * X**2 + np.random.randn(m, 1)

X_train, X_val, y_train, y_val = train_test_split(X[:50], y[:50].ravel(), test_size=0.5, random_state=10)

Early stopping example:


[56]: from copy import deepcopy

poly_scaler = Pipeline([
("poly_features", PolynomialFeatures(degree=90, include_bias=False)),
("std_scaler", StandardScaler())
])

X_train_poly_scaled = poly_scaler.fit_transform(X_train)
X_val_poly_scaled = poly_scaler.transform(X_val)

sgd_reg = SGDRegressor(max_iter=1, tol=-np.infty, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005,
                       random_state=42)

minimum_val_error = float("inf")
best_epoch = None
best_model = None
for epoch in range(1000):
sgd_reg.fit(X_train_poly_scaled, y_train) # continues where it left off
y_val_predict = sgd_reg.predict(X_val_poly_scaled)
val_error = mean_squared_error(y_val, y_val_predict)
if val_error < minimum_val_error:
minimum_val_error = val_error
best_epoch = epoch
best_model = deepcopy(sgd_reg)

Create the graph:


[57]: sgd_reg = SGDRegressor(max_iter=1, tol=-np.infty, warm_start=True,
                             penalty=None, learning_rate="constant", eta0=0.0005,
                             random_state=42)

n_epochs = 500
train_errors, val_errors = [], []
for epoch in range(n_epochs):
sgd_reg.fit(X_train_poly_scaled, y_train)
y_train_predict = sgd_reg.predict(X_train_poly_scaled)
y_val_predict = sgd_reg.predict(X_val_poly_scaled)
train_errors.append(mean_squared_error(y_train, y_train_predict))
val_errors.append(mean_squared_error(y_val, y_val_predict))

best_epoch = np.argmin(val_errors)
best_val_rmse = np.sqrt(val_errors[best_epoch])

plt.annotate('Best model',
xy=(best_epoch, best_val_rmse),
xytext=(best_epoch, best_val_rmse + 1),
ha="center",
arrowprops=dict(facecolor='black', shrink=0.05),
fontsize=16,
)

best_val_rmse -= 0.03 # just to make the graph look better


plt.plot([0, n_epochs], [best_val_rmse, best_val_rmse], "k:", linewidth=2)
plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="Validation set")
plt.plot(np.sqrt(train_errors), "r--", linewidth=2, label="Training set")
plt.legend(loc="upper right", fontsize=14)
plt.xlabel("Epoch", fontsize=14)

plt.ylabel("RMSE", fontsize=14)
save_fig("early_stopping_plot")
plt.show()

Saving figure early_stopping_plot

12.0.1 Early stopping example


• Training error steadily approaches zero
– a degree 90 polynomial will get very close to some training values
• Validation error steadily reduces at first
– the models so far are underfitting
• Validation error then starts increasing - the models are now overfitting
• So we stop at the point that validation error is minimal
– this gives a complex model (i.e. it includes terms up to x^90, since we used PolynomialFeatures
with degree 90)
– but has weights optimised for the overfit/underfit tradeoff
Note that: - These plots don’t always show smooth decline then rise in real life - so judgement is
needed - Writing down (and interpreting) the model is hard - so this is black-box machine learning
- When it works, you are confident that generalisation error will be minimal
[58]: best_epoch, best_model

[58]: (239,
SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,
eta0=0.0005, fit_intercept=True, l1_ratio=0.15,
learning_rate='constant', loss='squared_loss', max_iter=1,
n_iter_no_change=5, penalty=None, power_t=0.25, random_state=42,
shuffle=True, tol=-inf, validation_fraction=0.1, verbose=0,
warm_start=True))

13 Logistic regression
• Standard linear regression, but with the output squashed to lie between 0 and 1 by the sigmoid function σ

$$\hat{p} = \sigma(x^{\top}\beta)$$

• Hence:
– returning a probability
– allowing classification
– confusion matrices & ROC curves
– weights/coefficients give log-odds which can be converted to odds ratios
– (i.e. quantifies the strength of the association between two events)
• Trained using the log loss function (given below)
– deriving and explaining this is out of scope for this module
• Log loss properties:
– no analytic solution is known, so the equivalent of the normal equation does not exist
– it is convex, so gradient descent (and related) algorithms will work
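For reference (this is the standard log loss, stated here without derivation), the cost over m training instances is

$$J(\beta) = -\frac{1}{m}\sum_{i=1}^{m}\left[\,y^{(i)}\log\left(\hat{p}^{(i)}\right) + \left(1-y^{(i)}\right)\log\left(1-\hat{p}^{(i)}\right)\right]$$

where $\hat{p}^{(i)} = \sigma\left(x^{(i)\top}\beta\right)$ is the predicted probability for instance i.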

[64]: t = np.linspace(-10, 10, 100)


sig = 1 / (1 + np.exp(-t))
plt.figure(figsize=(9, 3))
plt.plot([-10, 10], [0, 0], "k-")
plt.plot([-10, 10], [0.5, 0.5], "k:")
plt.plot([-10, 10], [1, 1], "k:")
plt.plot([0, 0], [-1.1, 1.1], "k-")
plt.plot(t, sig, "b-", linewidth=2, label=r"$\sigma(t) = \frac{1}{1 + e^{-t}}$")
plt.xlabel("t")
plt.legend(loc="upper left", fontsize=20)
plt.axis([-10, 10, -0.1, 1.1])
save_fig("logistic_function_plot")
plt.show()

Saving figure logistic_function_plot

[65]: from sklearn import datasets
iris = datasets.load_iris()
list(iris.keys())

[65]: ['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename']

[66]: print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset


--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)


:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica

:Summary Statistics:

============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================

:Missing Attribute Values: None


:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the


pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

- Fisher, R.A. "The use of multiple measurements in taxonomic problems"


Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more …

[67]: X = iris["data"][:, 3:] # petal width


y = (iris["target"] == 2).astype(np.int) # 1 if Iris virginica, else 0

Note 1: To be future-proof we set solver="lbfgs" since this will be the default value in Scikit-
Learn 0.22.
Note 2: This is the limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm, and Roger
Fletcher FRS was my MSc. dissertation supervisor

[72]: from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(solver="lbfgs", random_state=42)
log_reg.fit(X, y)

[72]: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,


intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=42, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)

14 Decision boundaries
• We’re now classifying again, based on logistic regression probabilities
• The model takes one variable (petal width) and returns a probability
• Above 2.0cm and below 1.0cm the classifier is confident
• Within these values the classifier is less sure
• We have a decision boundary at about 1.6cm
– both predictions are 50%
[73]: X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)

plt.plot(X_new, y_proba[:, 1], "g-", linewidth=2, label="Iris virginica")


plt.plot(X_new, y_proba[:, 0], "b--", linewidth=2, label="Not Iris virginica")
legend = plt.legend(loc='center left', shadow=True, fontsize='large')

The figure in the book is actually a bit fancier:
[74]: X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)
decision_boundary = X_new[y_proba[:, 1] >= 0.5][0]

plt.figure(figsize=(8, 3))
plt.plot(X[y==0], y[y==0], "bs")
plt.plot(X[y==1], y[y==1], "g^")
plt.plot([decision_boundary, decision_boundary], [-1, 2], "k:", linewidth=2)
plt.plot(X_new, y_proba[:, 1], "g-", linewidth=2, label="Iris virginica")
plt.plot(X_new, y_proba[:, 0], "b--", linewidth=2, label="Not Iris virginica")
plt.text(decision_boundary+0.02, 0.15, "Decision boundary", fontsize=14, color="k", ha="center")
plt.arrow(decision_boundary, 0.08, -0.3, 0, head_width=0.05, head_length=0.1, fc='b', ec='b')
plt.arrow(decision_boundary, 0.92, 0.3, 0, head_width=0.05, head_length=0.1, fc='g', ec='g')

plt.xlabel("Petal width (cm)", fontsize=14)


plt.ylabel("Probability", fontsize=14)
plt.legend(loc="center left", fontsize=14)
plt.axis([0, 3, -0.02, 1.02])
save_fig("logistic_regression_plot")
plt.show()

Saving figure logistic_regression_plot

[75]: decision_boundary

[75]: array([1.66066066])

[76]: log_reg.predict([[1.7], [1.5]])

[76]: array([1, 0])

• We can get the exact decision boundary
– and check predictions either side
• Now add another feature/variable - petal length - and repeat
• The hyperparameter in the sklearn LogisticRegression implementation is C = 1/α
– so high C means low α
[77]: from sklearn.linear_model import LogisticRegression

X = iris["data"][:, (2, 3)] # petal length, petal width


y = (iris["target"] == 2).astype(np.int)

log_reg = LogisticRegression(solver="lbfgs", C=10**10, random_state=42)


log_reg.fit(X, y)

x0, x1 = np.meshgrid(
np.linspace(2.9, 7, 500).reshape(-1, 1),
np.linspace(0.8, 2.7, 200).reshape(-1, 1),
)
X_new = np.c_[x0.ravel(), x1.ravel()]

y_proba = log_reg.predict_proba(X_new)

plt.figure(figsize=(10, 4))
plt.plot(X[y==0, 0], X[y==0, 1], "bs")
plt.plot(X[y==1, 0], X[y==1, 1], "g^")

zz = y_proba[:, 1].reshape(x0.shape)
contour = plt.contour(x0, x1, zz, cmap=plt.cm.brg)

left_right = np.array([2.9, 7])


boundary = -(log_reg.coef_[0][0] * left_right + log_reg.intercept_[0]) / log_reg.coef_[0][1]

plt.clabel(contour, inline=1, fontsize=12)


plt.plot(left_right, boundary, "k--", linewidth=3)
plt.text(3.5, 1.5, "Not Iris virginica", fontsize=14, color="b", ha="center")
plt.text(6.5, 2.3, "Iris virginica", fontsize=14, color="g", ha="center")
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.axis([2.9, 7, 0.8, 2.7])
save_fig("logistic_regression_contour_plot")
plt.show()

Saving figure logistic_regression_contour_plot

• Another (linear) decision boundary
• It is the straight line given by the set of points x such that

β0 + β1 x1 + β2 x2 = 0

• The other lines denote equal probability
– so the green dashed line is where the model is 90% sure
– the rearrangement used to draw the boundary in the code above is shown below
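To connect this to the plotting code above (a small added step; β0, β1 and β2 correspond to log_reg.intercept_ and the two entries of log_reg.coef_), solving the boundary equation for x2 gives

$$x_2 = -\,\frac{\beta_0 + \beta_1 x_1}{\beta_2}$$

which is exactly what the line boundary = -(log_reg.coef_[0][0] * left_right + log_reg.intercept_[0]) / log_reg.coef_[0][1] computes.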

15 Multiclass logistic regression


• Also known as softmax regression
• Uses the softmax score for class k:

$$s_k(x) = x^{\top}\beta^{(k)}$$

• So each class has its own set of weights/biases/parameters β^(k)


– these are stored in a parameter matrix
• The output is a vector of k probabilities (computed via the softmax function, shown below)
– we take the largest one as our classification, as before
• Include all three varieties in our model
• Note that multiclass means “one of k distinct classes”
– so not multioutput
– can’t be used to recognise k people in one picture
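For completeness (the same function reappears in the exercise solution at the end of this notebook), the scores are turned into class probabilities by the softmax function:

$$\hat{p}_k = \sigma\left(s(x)\right)_k = \frac{\exp\left(s_k(x)\right)}{\sum_{j=1}^{K}\exp\left(s_j(x)\right)}$$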
[78]: X = iris["data"][:, (2, 3)] # petal length, petal width
y = iris["target"]

softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10, random_state=42)

softmax_reg.fit(X, y)

[78]: LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,


intercept_scaling=1, l1_ratio=None, max_iter=100,

multi_class='multinomial', n_jobs=None, penalty='l2',
random_state=42, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)

[79]: x0, x1 = np.meshgrid(


np.linspace(0, 8, 500).reshape(-1, 1),
np.linspace(0, 3.5, 200).reshape(-1, 1),
)
X_new = np.c_[x0.ravel(), x1.ravel()]

y_proba = softmax_reg.predict_proba(X_new)
y_predict = softmax_reg.predict(X_new)

zz1 = y_proba[:, 1].reshape(x0.shape)


zz = y_predict.reshape(x0.shape)

plt.figure(figsize=(10, 4))
plt.plot(X[y==2, 0], X[y==2, 1], "g^", label="Iris virginica")
plt.plot(X[y==1, 0], X[y==1, 1], "bs", label="Iris versicolor")
plt.plot(X[y==0, 0], X[y==0, 1], "yo", label="Iris setosa")

from matplotlib.colors import ListedColormap


custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])

plt.contourf(x0, x1, zz, cmap=custom_cmap)


contour = plt.contour(x0, x1, zz1, cmap=plt.cm.brg)
plt.clabel(contour, inline=1, fontsize=12)
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="center left", fontsize=14)
plt.axis([0, 7, 0, 3.5])
save_fig("softmax_regression_contour_plot")
plt.show()

Saving figure softmax_regression_contour_plot

[80]: softmax_reg.predict([[5, 2]])

[80]: array([2])

[81]: softmax_reg.predict_proba([[5, 2]])

[81]: array([[6.38014896e-07, 5.74929995e-02, 9.42506362e-01]])

16 Summary
• Chapter 4 contains plenty of important material
– Regression using the normal equation
– Regression using gradient descent techniques
– Regularisation using polynomial models
– Regularisation using weight limits
– Regularisation by early stopping
– Logistic regression
– Multiclass logistic regression
• For the exam I can’t ask you to
– invert a matrix
– perform gradient descent
– derive loss functions
– classify flowers (using these methods)
– etc.
• I can ask you to
– explain and compare the concepts
– interpret charts and/or vectors returned by the methods
– draw a straight line defined by (at least) two points
– etc.

17 Exercise solutions
17.1 1. to 11.
See appendix A.

17.2 12. Batch Gradient Descent with early stopping for Softmax Regression
(without using Scikit-Learn)
Let’s start by loading the data. We will just reuse the Iris dataset we loaded earlier.
[ ]: X = iris["data"][:, (2, 3)] # petal length, petal width
y = iris["target"]

We need to add the bias term for every instance (x0 = 1):

[ ]: X_with_bias = np.c_[np.ones([len(X), 1]), X]

And let’s set the random seed so the output of this exercise solution is reproducible:
[ ]: np.random.seed(2042)

The easiest option to split the dataset into a training set, a validation set and a test set would be to
use Scikit-Learn’s train_test_split() function, but the point of this exercise is to try to understand
the algorithms by implementing them manually. So here is one possible implementation:
[ ]: test_ratio = 0.2
validation_ratio = 0.2
total_size = len(X_with_bias)

test_size = int(total_size * test_ratio)


validation_size = int(total_size * validation_ratio)
train_size = total_size - test_size - validation_size

rnd_indices = np.random.permutation(total_size)

X_train = X_with_bias[rnd_indices[:train_size]]
y_train = y[rnd_indices[:train_size]]
X_valid = X_with_bias[rnd_indices[train_size:-test_size]]
y_valid = y[rnd_indices[train_size:-test_size]]
X_test = X_with_bias[rnd_indices[-test_size:]]
y_test = y[rnd_indices[-test_size:]]

The targets are currently class indices (0, 1 or 2), but we need target class probabilities to train the
Softmax Regression model. Each instance will have target class probabilities equal to 0.0 for all
classes except for the target class which will have a probability of 1.0 (in other words, the vector of
class probabilities for any given instance is a one-hot vector). Let’s write a small function to convert
the vector of class indices into a matrix containing a one-hot vector for each instance:

[ ]: def to_one_hot(y):
n_classes = y.max() + 1
m = len(y)
Y_one_hot = np.zeros((m, n_classes))
Y_one_hot[np.arange(m), y] = 1
return Y_one_hot

Let’s test this function on the first 10 instances:


[ ]: y_train[:10]

[ ]: to_one_hot(y_train[:10])

Looks good, so let’s create the target class probabilities matrix for the training set and the test set:
[ ]: Y_train_one_hot = to_one_hot(y_train)
Y_valid_one_hot = to_one_hot(y_valid)
Y_test_one_hot = to_one_hot(y_test)

Now let’s implement the Softmax function. Recall that it is defined by the following equation:
$$\sigma\left(s(x)\right)_k = \frac{\exp\left(s_k(x)\right)}{\sum_{j=1}^{K}\exp\left(s_j(x)\right)}$$

[ ]: def softmax(logits):
exps = np.exp(logits)
exp_sums = np.sum(exps, axis=1, keepdims=True)
return exps / exp_sums

We are almost ready to start training. Let’s define the number of inputs and outputs:
[ ]: n_inputs = X_train.shape[1] # == 3 (2 features plus the bias term)
n_outputs = len(np.unique(y_train)) # == 3 (3 iris classes)

Now here comes the hardest part: training! Theoretically, it’s simple: it’s just a matter of trans-
lating the math equations into Python code. But in practice, it can be quite tricky: in particular,
it’s easy to mix up the order of the terms, or the indices. You can even end up with code that
looks like it’s working but is actually not computing exactly the right thing. When unsure, you
should write down the shape of each term in the equation and make sure the corresponding terms
in your code match closely. It can also help to evaluate each term independently and print them
out. The good news is that you won’t have to do this every day, since all this is well implemented
by Scikit-Learn, but it will help you understand what’s going on under the hood.
So the equations we will need are the cost function:

$$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)} \log\left(\hat{p}_k^{(i)}\right)$$

And the equation for the gradients:

$$\nabla_{\theta^{(k)}} J(\Theta) = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{p}_k^{(i)} - y_k^{(i)}\right) x^{(i)}$$

Note that $\log\left(\hat{p}_k^{(i)}\right)$ may not be computable if $\hat{p}_k^{(i)} = 0$. So we will add a tiny value ϵ inside the log
to avoid getting nan values.
[ ]: eta = 0.01
n_iterations = 5001
m = len(X_train)
epsilon = 1e-7

Theta = np.random.randn(n_inputs, n_outputs)

for iteration in range(n_iterations):


logits = X_train.dot(Theta)
Y_proba = softmax(logits)
loss = -np.mean(np.sum(Y_train_one_hot * np.log(Y_proba + epsilon), axis=1))
error = Y_proba - Y_train_one_hot
if iteration % 500 == 0:
print(iteration, loss)
gradients = 1/m * X_train.T.dot(error)
Theta = Theta - eta * gradients

And that’s it! The Softmax model is trained. Let’s look at the model parameters:
[ ]: Theta

Let’s make predictions for the validation set and check the accuracy score:
[ ]: logits = X_valid.dot(Theta)
Y_proba = softmax(logits)
y_predict = np.argmax(Y_proba, axis=1)

accuracy_score = np.mean(y_predict == y_valid)


accuracy_score

Well, this model looks pretty good. For the sake of the exercise, let’s add a bit of ℓ2 regularization.
The following training code is similar to the one above, but the loss now has an additional ℓ2
penalty, and the gradients have the proper additional term (note that we don’t regularize the first
element of Theta since this corresponds to the bias term). Also, let’s try increasing the learning
rate eta.
[ ]: eta = 0.1
n_iterations = 5001
m = len(X_train)
epsilon = 1e-7
alpha = 0.1 # regularization hyperparameter

Theta = np.random.randn(n_inputs, n_outputs)

for iteration in range(n_iterations):
logits = X_train.dot(Theta)
Y_proba = softmax(logits)
xentropy_loss = -np.mean(np.sum(Y_train_one_hot * np.log(Y_proba + epsilon), axis=1))

l2_loss = 1/2 * np.sum(np.square(Theta[1:]))


loss = xentropy_loss + alpha * l2_loss
error = Y_proba - Y_train_one_hot
if iteration % 500 == 0:
print(iteration, loss)
gradients = 1/m * X_train.T.dot(error) + np.r_[np.zeros([1, n_outputs]), alpha * Theta[1:]]

Theta = Theta - eta * gradients

Because of the additional ℓ2 penalty, the loss seems greater than earlier, but perhaps this model
will perform better? Let’s find out:
[ ]: logits = X_valid.dot(Theta)
Y_proba = softmax(logits)
y_predict = np.argmax(Y_proba, axis=1)

accuracy_score = np.mean(y_predict == y_valid)


accuracy_score

Cool, perfect accuracy! We probably just got lucky with this validation set, but still, it’s pleasant.
Now let’s add early stopping. For this we just need to measure the loss on the validation set at
every iteration and stop when the error starts growing.
[ ]: eta = 0.1
n_iterations = 5001
m = len(X_train)
epsilon = 1e-7
alpha = 0.1 # regularization hyperparameter
best_loss = np.infty

Theta = np.random.randn(n_inputs, n_outputs)

for iteration in range(n_iterations):


logits = X_train.dot(Theta)
Y_proba = softmax(logits)
xentropy_loss = -np.mean(np.sum(Y_train_one_hot * np.log(Y_proba + epsilon), axis=1))

l2_loss = 1/2 * np.sum(np.square(Theta[1:]))


loss = xentropy_loss + alpha * l2_loss
error = Y_proba - Y_train_one_hot

gradients = 1/m * X_train.T.dot(error) + np.r_[np.zeros([1, n_outputs]), alpha * Theta[1:]]
Theta = Theta - eta * gradients

logits = X_valid.dot(Theta)
Y_proba = softmax(logits)
xentropy_loss = -np.mean(np.sum(Y_valid_one_hot * np.log(Y_proba +␣
,→epsilon), axis=1))

l2_loss = 1/2 * np.sum(np.square(Theta[1:]))


loss = xentropy_loss + alpha * l2_loss
if iteration % 500 == 0:
print(iteration, loss)
if loss < best_loss:
best_loss = loss
else:
print(iteration - 1, best_loss)
print(iteration, loss, "early stopping!")
break

[ ]: logits = X_valid.dot(Theta)
Y_proba = softmax(logits)
y_predict = np.argmax(Y_proba, axis=1)

accuracy_score = np.mean(y_predict == y_valid)


accuracy_score

Still perfect, but faster.


Now let’s plot the model’s predictions on the whole dataset:
[ ]: x0, x1 = np.meshgrid(
np.linspace(0, 8, 500).reshape(-1, 1),
np.linspace(0, 3.5, 200).reshape(-1, 1),
)
X_new = np.c_[x0.ravel(), x1.ravel()]
X_new_with_bias = np.c_[np.ones([len(X_new), 1]), X_new]

logits = X_new_with_bias.dot(Theta)
Y_proba = softmax(logits)
y_predict = np.argmax(Y_proba, axis=1)

zz1 = Y_proba[:, 1].reshape(x0.shape)


zz = y_predict.reshape(x0.shape)

plt.figure(figsize=(10, 4))
plt.plot(X[y==2, 0], X[y==2, 1], "g^", label="Iris virginica")
plt.plot(X[y==1, 0], X[y==1, 1], "bs", label="Iris versicolor")
plt.plot(X[y==0, 0], X[y==0, 1], "yo", label="Iris setosa")

from matplotlib.colors import ListedColormap
custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])

plt.contourf(x0, x1, zz, cmap=custom_cmap)


contour = plt.contour(x0, x1, zz1, cmap=plt.cm.brg)
plt.clabel(contour, inline=1, fontsize=12)
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="upper left", fontsize=14)
plt.axis([0, 7, 0, 3.5])
plt.show()

And now let’s measure the final model’s accuracy on the test set:
[ ]: logits = X_test.dot(Theta)
Y_proba = softmax(logits)
y_predict = np.argmax(Y_proba, axis=1)

accuracy_score = np.mean(y_predict == y_test)


accuracy_score

Our perfect model turns out to have slight imperfections. This variability is likely due to the very
small size of the dataset: depending on how you sample the training set, validation set and the test
set, you can get quite different results. Try changing the random seed and running the code again
a few times; you will see that the results vary.
[ ]:
