
Lecture09

February 15, 2021

1 Lecture 09
1.1 ID5059
1.2 Tom Kelsey - Jan 2021
As before, we take the Jupyter notebook associated with the course textbook - annotate - explain
Chapter 4 – Training Linear Models
This notebook contains all the sample code and solutions to the exercises in chapter 4.
Run in Google Colab

2 Setup
First, let’s import a few common modules, ensure MatplotLib plots figures inline and prepare a
function to save the figures. We also check that Python 3.5 or later is installed (although Python
2.x may work, it is deprecated so we strongly recommend you use Python 3 instead), as well as
Scikit-Learn ≥ 0.20.
[1]: # Python ≥ 3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥ 0.20 is required


import sklearn
assert sklearn.__version__ >= "0.20"

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs


np.random.seed(42)

# To plot pretty figures


%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures


PROJECT_ROOT_DIR = "."
CHAPTER_ID = "training_linear_models"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):


path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
print("Saving figure", fig_id)
if tight_layout:
plt.tight_layout()
plt.savefig(path, format=fig_extension, dpi=resolution)

# Ignore useless warnings (see SciPy issue #5998)


import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

3 Linear regression using the Normal Equation


3.0.1 Linear regression
• Models of the form
ŷ = β0 + β1 x1 + β2 x2 + . . . + βp xp
– β0 is an intercept term
– needed to get solutions that have slope & intercept as in y = mx + c

3.0.2 Normal equation


Recall from Lecture 1 that our standard problem

y = Xβ + e

has an analytic solution:


$$\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y, \qquad \hat{y} = X\hat{\beta}$$

where analytic means “can be solved exactly” - i.e. it gives the coefficients that minimise the RMSE -
as long as the matrix XᵀX is invertible - in technical terms, it has a non-zero determinant - and
this solution can be computed relatively efficiently - inversion by Gaussian elimination is O(n³)
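As a one-line sketch of where the normal equation comes from (added here for completeness): set the gradient of the squared error to zero,

$$\nabla_{\beta}\,\lVert y - X\beta\rVert^{2} = -2X^{\top}(y - X\beta) = 0 \;\Rightarrow\; X^{\top}X\,\beta = X^{\top}y \;\Rightarrow\; \hat{\beta} = (X^{\top}X)^{-1}X^{\top}y$$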

3.0.3 Known signal, known noise


• Note that the examples in this lecture follow a useful pattern for analysis of properties of
models
• We take a known signal

– y = 3X + 4 in the first example
• And add a known (and repeatable) amount of noise
– standard Gaussian values from np.random.randn (repeatable because the random seed is fixed)
• This provides an empirical framework for the comparison of approaches
– since we know exactly what “error” means
[2]: import numpy as np

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

[3]: plt.plot(X, y, "b.")


plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([0, 2, 0, 15])
save_fig("generated_data_plot")
plt.show()

Saving figure generated_data_plot

3.0.4 Intercept terms


• We know that the signal is a straight line with a slope

• So we need to add the β0 term as described above
• We could set it to anything, but using 1 means that the β value returned needs no adjustment
[4]: X_b = np.c_[np.ones((100, 1)), X] # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

3.0.5 Solve using linear algebra tools from numpy


• Implement the normal equation and solve for β̂
– note that the book uses θ - this is just a symbol choice
• We get the best coefficients given the noise that was added
– close to 4 and 3
• Make a series of predictions using these coefficients
– and plot observed vs predicted
[5]: theta_best

[5]: array([[4.21509616],
[2.77011339]])

[6]: X_new = np.array([[0], [2]])


X_new_b = np.c_[np.ones((2, 1)), X_new] # add x0 = 1 to each instance
y_predict = X_new_b.dot(theta_best)
y_predict

[6]: array([[4.21509616],
[9.75532293]])

[7]: plt.plot(X_new, y_predict, "r-")


plt.plot(X, y, "b.")
plt.axis([0, 2, 0, 15])
plt.show()

The figure in the book actually corresponds to the following code, with a legend and axis labels:
[8]: plt.plot(X_new, y_predict, "r-", linewidth=2, label="Predictions")
plt.plot(X, y, "b.")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.legend(loc="upper left", fontsize=14)
plt.axis([0, 2, 0, 15])
save_fig("linear_model_predictions_plot")
plt.show()

Saving figure linear_model_predictions_plot

• Instead of writing the normal equation explicitly, we can call a library function that implements it
• And additionally gives a solution when the matrix is not invertible
[9]: from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.intercept_, lin_reg.coef_

[9]: (array([4.21509616]), array([[2.77011339]]))

[10]: lin_reg.predict(X_new)

[10]: array([[4.21509616],
[9.75532293]])

The LinearRegression class is based on the scipy.linalg.lstsq() function (the name stands
for “least squares”), which you could call directly:

[11]: theta_best_svd, residuals, rank, s = np.linalg.lstsq(X_b, y, rcond=1e-6)


theta_best_svd

[11]: array([[4.21509616],
[2.77011339]])

This function computes X+ y, where X+ is the pseudoinverse of X (specifically the Moore-Penrose


inverse). You can use np.linalg.pinv() to compute the pseudoinverse directly:

[12]: np.linalg.pinv(X_b).dot(y)

[12]: array([[4.21509616],
[2.77011339]])

3.0.6 Summary
• For simple problems we seek simple solutions
• The normal equation gives optimal results which are cheap to compute
– if the number of attributes - columns in our data - is small
• But can be expensive when we have many instances - rows in our data
• An alternative approach is to iteratively lower the error until a minimum is reached

4 Gradient descent
• Back in high school you would have seen functions of the form y = ax^2 + bx + c
– the standard quadratic polynomial
• To get the gradient (i.e. slope) of the function, we differentiate to get y′ = 2ax + b
• The function has a minimum (or maximum) where the gradient is zero - the short sketch below illustrates this numerically
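As a quick numerical illustration (a sketch added here, not part of the lecture code; all values are arbitrary), gradient descent walks down such a quadratic to the point where the derivative is zero:

[ ]: # Sketch: gradient descent on y = a*x**2 + b*x + c (arbitrary coefficients)
     a, b, c = 1.0, -4.0, 3.0      # minimum at x = -b/(2a) = 2
     x = 10.0                      # arbitrary starting point
     eta = 0.1                     # learning rate (step size)
     for _ in range(100):
         gradient = 2 * a * x + b  # derivative y' = 2ax + b
         x = x - eta * gradient    # step against the gradient
     print(x)                      # ends very close to 2.0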

4.1 Apply this to MSE


$$\mathrm{MSE} = \frac{1}{n}\sum_{j=1}^{n}\left(\hat{y}_j - y_j\right)^2$$

where
ŷ = β0 + β1 x1 + β2 x2 + . . . + βp xp

To make things easier to visualise, take p = 1, forget β0 , and consider one value of j

$$\mathrm{Error} = (\beta_1 x_1 - y)^2 = (\beta_1 x_1 - y)(\beta_1 x_1 - y) = \beta_1^2 x_1^2 - 2\beta_1 x_1 y + y^2$$

So the derivative with respect to β1 is

$$2\beta_1 x_1^2 - 2 x_1 y = 2x_1(\beta_1 x_1 - y)$$

It should be clear that the error (and hence the MSE) will be zero when the derivative is zero and
β1 x1 = y (i.e. when the prediction matches the observed data).
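In matrix form (a small connecting step, added for clarity) the gradient of the MSE with respect to the full coefficient vector is

$$\nabla_{\beta}\,\mathrm{MSE}(\beta) = \frac{2}{n}\,X^{\top}(X\beta - y)$$

which is exactly what the line gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y) computes in the batch gradient descent code further below.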
The idea is to start from a point where the error value could be anything. Then:
1. Measure the gradient of the error at that point
2. Move to somewhere where the error is less
3. Repeat until (hopefully) the error is close to zero
Source
To make it work we need three things:
1. A good starting point (but we don’t have any insights, so just choose at random)
2. A loss function that measures the difference, or error, between actual y and predicted y at the current position, and provides feedback to the process so that it can adjust the parameters to minimize the error
3. A learning rate - or step size - that is evaluated and updated based on the behavior of the loss function - too small leads to slow convergence - too large leads to oscillation around the minimum
Source
Note that - this all works as we’ve chosen our error to have a quadratic form, hence a unique
global minimum - for deep learning, we hope that the error is approximately quadratic near an
error minimum - everything remains true in p dimensions, but is hard to plot - we just have to
understand the concept - the code does all the work for us

5 Linear regression using batch gradient descent


• Batch gradient descent sums the error for each point in a training set
– updating the model only after all training examples have been evaluated
• This process is referred to as a training epoch.
• While this batching provides computational efficiency, it can still have a long processing time
for large training datasets
– needs to store all of the data into memory
• Batch gradient descent usually produces a stable error gradient and convergence
– but sometimes that convergence point is not ideal, finding a local minimum versus the
global one
In the code below we:
1. Fix a learning rate, the number of rows in the data (100), and a number of iterations before giving up - the number of iterations chosen is also an educated guess
2. Start at random coordinates in parameter space
3. Calculate the gradients for every instance and collect them into single terms
4. Update our coefficient values by the product of the step size and the gradients
[13]: eta = 0.1 # learning rate
n_iterations = 1000
m = 100

theta = np.random.randn(2,1) # random initialization

for iteration in range(n_iterations):


gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
theta = theta - eta * gradients

We get β values and predictions close to those expected

[14]: theta

[14]: array([[4.21509616],
[2.77011339]])

[15]: X_new_b.dot(theta)

[15]: array([[4.21509616],
[9.75532293]])

[16]: theta_path_bgd = []

def plot_gradient_descent(theta, eta, theta_path=None):


m = len(X_b)
plt.plot(X, y, "b.")
n_iterations = 1000
for iteration in range(n_iterations):
if iteration < 10:
y_predict = X_new_b.dot(theta)
style = "b-" if iteration > 0 else "r--"
plt.plot(X_new, y_predict, style)
gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
theta = theta - eta * gradients
if theta_path is not None:
theta_path.append(theta)
plt.xlabel("$x_1$", fontsize=18)
plt.axis([0, 2, 0, 15])
plt.title(r"$\eta = {}$".format(eta), fontsize=16)

5.0.1 Vary the learning rate


• Too slow
• Too many jumps
• Just right
[17]: np.random.seed(42)
theta = np.random.randn(2,1) # random initialization

plt.figure(figsize=(10,4))
plt.subplot(131); plot_gradient_descent(theta, eta=0.02)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.subplot(132); plot_gradient_descent(theta, eta=0.1, theta_path=theta_path_bgd)

plt.subplot(133); plot_gradient_descent(theta, eta=0.5)

save_fig("gradient_descent_plot")
plt.show()

Saving figure gradient_descent_plot

6 Stochastic Gradient Descent


• Batch GD suffers when the number of instances is very large
– since you need all the gradients in memory at each step
• Reduce memory needed by taking an instance at random at each iteration
• Also gradually reduce the learning rate
– we hope that the first instances got us “close” to the minimum
– and that the later ones help zero in
[18]: theta_path_sgd = []
m = len(X_b)
np.random.seed(42)

[19]: n_epochs = 50
t0, t1 = 5, 50 # learning schedule hyperparameters

def learning_schedule(t):
return t0 / (t + t1)

theta = np.random.randn(2,1) # random initialization

for epoch in range(n_epochs):


for i in range(m):
if epoch == 0 and i < 20: # not shown in the book
y_predict = X_new_b.dot(theta) # not shown
style = "b-" if i > 0 else "r--" # not shown
plt.plot(X_new, y_predict, style) # not shown
random_index = np.random.randint(m)
xi = X_b[random_index:random_index+1]
yi = y[random_index:random_index+1]

gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
eta = learning_schedule(epoch * m + i)
theta = theta - eta * gradients
theta_path_sgd.append(theta) # not shown

plt.plot(X, y, "b.") # not shown


plt.xlabel("$x_1$", fontsize=18) # not shown
plt.ylabel("$y$", rotation=0, fontsize=18) # not shown
plt.axis([0, 2, 0, 15]) # not shown
save_fig("sgd_plot") # not shown
plt.show() # not shown

Saving figure sgd_plot

[20]: theta

[20]: array([[4.21076011],
[2.74856079]])

• As before, we get good results in terms of coefficients and predictions


– we see irregular convergence over the first 20 steps of the first epoch (plotted above) due to the random choices
• The code above is designed to show how everything works
• There is also a library that lets you choose parameters and then run out of the box

Note that: - the training instances must be independent and identically distributed - otherwise we
might optimise on one variable, then another, … - and not find the global optimum - we can shuffle
at each epoch to take care of this
[21]: from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, penalty=None, eta0=0.1, random_state=42)

sgd_reg.fit(X, y.ravel())

[21]: SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,


eta0=0.1, fit_intercept=True, l1_ratio=0.15,
learning_rate='invscaling', loss='squared_loss', max_iter=1000,
n_iter_no_change=5, penalty=None, power_t=0.25, random_state=42,
shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0,
warm_start=False)

[22]: sgd_reg.intercept_, sgd_reg.coef_

[22]: (array([4.24365286]), array([2.8250878]))

7 Mini-batch gradient descent


• Between batch and stochastic
• Take less than all, but more than one instance at each epoch
• Should be the best of both worlds
– and can use GPUs to speed up the matrix operations
• Choosing the mini-batch size is now another hyperparameter that can be tuned
[23]: theta_path_mgd = []

n_iterations = 50
minibatch_size = 20

np.random.seed(42)
theta = np.random.randn(2,1) # random initialization

t0, t1 = 200, 1000


def learning_schedule(t):
return t0 / (t + t1)

t = 0
for epoch in range(n_iterations):
shuffled_indices = np.random.permutation(m)
X_b_shuffled = X_b[shuffled_indices]
y_shuffled = y[shuffled_indices]
for i in range(0, m, minibatch_size):

t += 1
xi = X_b_shuffled[i:i+minibatch_size]
yi = y_shuffled[i:i+minibatch_size]
gradients = 2/minibatch_size * xi.T.dot(xi.dot(theta) - yi)
eta = learning_schedule(t)
theta = theta - eta * gradients
theta_path_mgd.append(theta)

[24]: theta

[24]: array([[4.25214635],
[2.7896408 ]])

7.0.1 Compare the three GD methods for two attributes


1. Batch converges in an organised way
• but might be too expensive
2. Stochastic takes many iterations
• but each iteration uses very little memory
3. Mini-batch shows randomness, but converges faster than stochastic
Note that: - This is the theory - other data might give different results - All three methods require
scaled data - Numerical optimisation is a large and important area of research - there is much we
still don’t understand - full coverage of the issues would form another complete module
[25]: theta_path_bgd = np.array(theta_path_bgd)
theta_path_sgd = np.array(theta_path_sgd)
theta_path_mgd = np.array(theta_path_mgd)

[26]: plt.figure(figsize=(7,4))
plt.plot(theta_path_sgd[:, 0], theta_path_sgd[:, 1], "r-s", linewidth=1, label="Stochastic")
plt.plot(theta_path_mgd[:, 0], theta_path_mgd[:, 1], "g-+", linewidth=2, label="Mini-batch")
plt.plot(theta_path_bgd[:, 0], theta_path_bgd[:, 1], "b-o", linewidth=3, label="Batch")

plt.legend(loc="upper left", fontsize=16)


plt.xlabel(r"$\theta_0$", fontsize=20)
plt.ylabel(r"$\theta_1$ ", fontsize=20, rotation=0)
plt.axis([2.5, 4.5, 2.3, 3.9])
save_fig("gradient_descent_paths_plot")
plt.show()

Saving figure gradient_descent_paths_plot

8 Summary
• We’ve covered gradient descent techniques for linear regression with a convex (quadratic)
error surface
• For more complex error surfaces we would consider local minima and saddle points
– places where the gradient of the loss is at or close to zero, so the model stops learning
– but is not giving us the error reduction that we want
Source
• We’ll cover this later, but for now the randomness in stochastic variants can help us “jump
out” of a local minimum or saddle point

9 Polynomial regression
• We have our X data
• Think of this as X^1 i.e. raised to the power one
• We added intercept terms: a matrix the same size as X but with all entries one
– Think of this as X^0 i.e. raised to the power zero
• The process can be extended to get X^2, X^3, X^4, . . .
• So the function in our main equation becomes a polynomial
y = β0 X^0 + β1 X^1 + β2 X^2 + β3 X^3 + e
As before we set up a known signal with known noise
[27]: import numpy as np
import numpy.random as rnd

np.random.seed(42)

[28]: m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)

[29]: plt.plot(X, y, "b.")


plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([-3, 3, 0, 10])
save_fig("quadratic_data_plot")
plt.show()

Saving figure quadratic_data_plot

9.0.1 Transform the data


• Use the PolynomialFeatures tool to get X^2 as well as X
• This tool will also help find relationships between variables
• For variables a and b and degree 3, it returns a^2, a^3, b^2, b^3 as expected and also ab, a^2 b and a b^2
• So using it with a large number of variables p and a large degree should be done with care - the small sketch below shows the terms generated
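As a small check (a sketch, not part of the lecture code; the names and values are illustrative), the exponent combinations that PolynomialFeatures generates can be inspected via its powers_ attribute:

[ ]: from sklearn.preprocessing import PolynomialFeatures
     import numpy as np

     demo = PolynomialFeatures(degree=3, include_bias=False)
     demo.fit(np.array([[1.0, 2.0]]))  # two features, call them a and b
     print(demo.powers_)               # each row gives (power of a, power of b)
     # the rows correspond to: a, b, a^2, ab, b^2, a^3, a^2 b, a b^2, b^3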

[30]: from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
X[0]

[30]: array([-0.75275929])

[31]: X_poly[0]

[31]: array([-0.75275929, 0.56664654])

9.0.2 Derive a linear regression model


• The curve is not going to be a straight line
• So not linear in one sense
• Here we are applying the same linear techniques to data that has been transformed
– the algorithm doesn’t know that some of the attributes are related
• So this is still linear regression
We get

y = 0.56x^2 + 0.93x + 1.78

which is close to the original

y = 0.5x^2 + x + 2

[32]: lin_reg = LinearRegression()


lin_reg.fit(X_poly, y)
lin_reg.intercept_, lin_reg.coef_

[32]: (array([1.78134581]), array([[0.93366893, 0.56456263]]))

[33]: X_new=np.linspace(-3, 3, 100).reshape(100, 1)


X_new_poly = poly_features.transform(X_new)
y_new = lin_reg.predict(X_new_poly)
plt.plot(X, y, "b.")
plt.plot(X_new, y_new, "r-", linewidth=2, label="Predictions")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.legend(loc="upper left", fontsize=14)
plt.axis([-3, 3, 0, 10])
save_fig("quadratic_predictions_plot")
plt.show()

Saving figure quadratic_predictions_plot

9.0.3 Overfit and underfit
• We know that the correct type of polynomial for these data is quadratic
• We can fit a degree 1 polynomial (i.e. straight line)
• And a degree 300 polynomial
• One should underfit, the other overfit
[34]: from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

for style, width, degree in (("g-", 1, 300), ("b--", 2, 2), ("r-+", 2, 1)):
polybig_features = PolynomialFeatures(degree=degree, include_bias=False)
std_scaler = StandardScaler()
lin_reg = LinearRegression()
polynomial_regression = Pipeline([
("poly_features", polybig_features),
("std_scaler", std_scaler),
("lin_reg", lin_reg),
])
polynomial_regression.fit(X, y)
y_newbig = polynomial_regression.predict(X_new)
plt.plot(X_new, y_newbig, style, label=str(degree), linewidth=width)

plt.plot(X, y, "b.", linewidth=3)

plt.legend(loc="upper left")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([-3, 3, 0, 10])
save_fig("high_degree_polynomials_plot")
plt.show()

Saving figure high_degree_polynomials_plot

[35]: from sklearn.metrics import mean_squared_error


from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):


X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=10)

train_errors, val_errors = [], []


for m in range(1, len(X_train)):
model.fit(X_train[:m], y_train[:m])
y_train_predict = model.predict(X_train[:m])
y_val_predict = model.predict(X_val)
train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
val_errors.append(mean_squared_error(y_val, y_val_predict))

plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
plt.legend(loc="upper right", fontsize=14) # not shown in the book
plt.xlabel("Training set size", fontsize=14) # not shown
plt.ylabel("RMSE", fontsize=14) # not shown

10 Learning curves
• We’ve used cross validation as the standard technique for estimating generalisation errors
– good performance on training data but bad cross validation error indicates overfit
– poor performance on training data and cross validation indicates underfit
• We want to visualise the overfit/underfit tradeoff as derivation proceeds
1. Start with a small subset of the training data
2. Learn a regression model
3. Calculate training & validation errors
4. Increase the size of the training set and repeat

10.0.1 Underfit
• This is the degree 1 model trying to predict quadratic data
• For one and two points, training error is low (we have a straight line)
– but validation error is large (we have the wrong straight line)
• Training error increases as we increase the training data, then plateaus
– failure to capture the quadratic signal
• Validation error decrease to a similar plateau
• We conclude that adding data will not reduce error
– which is underfit
[36]: lin_reg = LinearRegression()
plot_learning_curves(lin_reg, X, y)
plt.axis([0, 80, 0, 3]) # not shown in the book
save_fig("underfitting_learning_curves_plot") # not shown
plt.show() # not shown

Saving figure underfitting_learning_curves_plot

10.0.2 Overfit
• This is a degree 10 model trying to predict quadratic data
– again no training error up to 10 points as these determine a degree-10 polynomial
– again huge validation error as they determine the wrong degree-10 polynomial
• Again we see convergence to a plateau
• Two important differences
1. the plateaus indicate lower error
2. they are further apart
• These are the classic signs of overfit
[37]: from sklearn.pipeline import Pipeline

polynomial_regression = Pipeline([
("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
("lin_reg", LinearRegression()),
])

plot_learning_curves(polynomial_regression, X, y)
plt.axis([0, 80, 0, 3]) # not shown
save_fig("learning_curves_plot") # not shown
plt.show() # not shown

Saving figure learning_curves_plot

10.1 Terminology
• I refer to overfit and underfit
• Other sources (including the textbook) use different terminology
• Generalisation error consists of three things
1. Bias due to wrong assumptions about model complexity, often leading to underfit
2. Variance due to excessive sensitivity to small changes in training data, often leading to
overfit
3. Irreducible error due to the noise in the data (outliers, unreliable sensor readings,
unreliable human data entry, etc.)
• Increasing model complexity will typically increase variance and decrease bias
• Decreasing model complexity will typically decrease variance and increase bias
Caveat: the intercept terms we saw earlier are also called bias terms. For neural nets, they are
always called bias terms. This use of the word bias is not the same as the above

11 Regularized models
• Polynomial regression gives a clear way to increase model complexity
– and then decrease it to reduce overfitting
– this process is called regularisation
• The idea can be extended (much) further:
ax^2 + bx + c
dx^3 + ex^2 + fx + g

• The whole field of curve-fitting is about finding a model with the correct mathematical type
– mixed basis function linear
– Even order and half order polynomials
– Chebyshev polynomials
– Fourier-series polynomials
– standard, ln x, sqrt even, y-transformed, even order, and half order rationals
– Chebyshev rationals
– Fourier-series rationals
– nonlinear peak equations
– nonlinear transition equations
– nonlinear kinetic equations
• This approach only works for small numbers of variables (i.e. columns of X)
– so another approach to regularisation would be useful
• Regularise by constraining the weights (i.e. coefficients) that are allowed
– works for the normal equation and gradient descent methods
• Can be applied to the standard linear model (i.e. degree one)
– and to polynomial models
• We look at three commonly used techniques
• First produce a known signal (y = 0.5x + 1) and add noise

[38]: np.random.seed(42)
m = 20
X = 3 * np.random.rand(m, 1)
y = 1 + 0.5 * X + np.random.randn(m, 1) / 1.5
X_new = np.linspace(0, 3, 100).reshape(100, 1)

11.0.1 Ridge regression


$$J(\beta) = \mathrm{MSE}(\beta) + \alpha\,\frac{1}{2}\sum_{i=1}^{n}\beta_i^2$$

- The loss function is MSE plus the sum of the squared individual weights, multiplied by a hyper-
parameter α - If α is zero we just have linear regression - If α is large then the weights will be small
so we get a flat line through the data’s mean - note that we’re not using an intercept term here
Note that: - The loss function used for training is not the same as performance measure used for
testing - This is not uncommon: one is chosen for efficiency, the other performance - In classification
we often train using the log-loss function, but evaluate using precision/recall - We call specialist
solvers to find optimal weight since the underlying maths is now harder
Normal equation - two methods for getting an analytic solution for our data - both make rea-
sonable predictions
[39]: from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=1, solver="cholesky", random_state=42)
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])

[39]: array([[1.55071465]])

[40]: ridge_reg = Ridge(alpha=1, solver="sag", random_state=42)
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])

[40]: array([[1.5507201]])

Stochastic gradient descent - another reasonable - but different - prediction for the expected
value at x = 1.5
[41]: sgd_reg = SGDRegressor(penalty="l2", max_iter=1000, tol=1e-3, random_state=42)
sgd_reg.fit(X, y.ravel())
sgd_reg.predict([[1.5]])

[41]: array([1.47012588])

11.0.2 Compare choice of α


• Left panel goes from the mean at α = 100, to a regularised model, to an unregularised model
• Right panel starts with a degree 10 model
– increased α gives flatter and less extreme models
[42]: from sklearn.linear_model import Ridge

def plot_model(model_class, polynomial, alphas, **model_kargs):


for alpha, style in zip(alphas, ("b-", "g--", "r:")):
model = model_class(alpha, **model_kargs) if alpha > 0 else LinearRegression()

if polynomial:
model = Pipeline([
("poly_features", PolynomialFeatures(degree=10,␣
,→include_bias=False)),

("std_scaler", StandardScaler()),
("regul_reg", model),
])
model.fit(X, y)
y_new_regul = model.predict(X_new)
lw = 2 if alpha > 0 else 1
plt.plot(X_new, y_new_regul, style, linewidth=lw, label=r"$\alpha = {}$".format(alpha))

plt.plot(X, y, "b.", linewidth=3)


plt.legend(loc="upper left", fontsize=15)
plt.xlabel("$x_1$", fontsize=18)
plt.axis([0, 3, 0, 4])

plt.figure(figsize=(8,4))
plt.subplot(121)
plot_model(Ridge, polynomial=False, alphas=(0, 10, 100), random_state=42)
plt.ylabel("$y$", rotation=0, fontsize=18)

plt.subplot(122)
plot_model(Ridge, polynomial=True, alphas=(0, 10**-5, 1), random_state=42)

save_fig("ridge_regression_plot")
plt.show()

Saving figure ridge_regression_plot

11.0.3 Lasso regression


$$J(\beta) = \mathrm{MSE}(\beta) + \alpha\,\frac{1}{2}\sum_{i=1}^{n}\lvert\beta_i\rvert$$

- The loss function is MSE plus the sum of the weights’ absolute values, multiplied by a hyperpa-
rameter α - for mathematicians, using the ℓ1 norm instead of ridge’s ℓ2 norm - Similar performance
- But Lasso sets the weights of unimportant variables (close) to zero - and so automatically performs
variable selection - only using variables that contribute to good predictions - And returns a sparse
model (i.e. a coefficient vector containing many zeros) which aids efficiency
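As a small self-contained illustration (a sketch, not from the lecture; the data are synthetic and the names are made up), Lasso drives the weights of features that carry no signal to zero:

[ ]: import numpy as np
     from sklearn.linear_model import Lasso

     rng = np.random.RandomState(42)
     X_demo = rng.randn(200, 5)                  # five candidate features
     y_demo = 3 * X_demo[:, 0] + rng.randn(200)  # only the first feature matters

     lasso_demo = Lasso(alpha=0.1)
     lasso_demo.fit(X_demo, y_demo)
     print(lasso_demo.coef_)  # roughly [3, 0, 0, 0, 0]: unimportant weights are zeroed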
Note: to be future-proof, we set max_iter=1000 and tol=1e-3 because these will be the default
values in Scikit-Learn 0.21.
[43]: from sklearn.linear_model import Lasso

plt.figure(figsize=(8,4))
plt.subplot(121)
plot_model(Lasso, polynomial=False, alphas=(0, 0.1, 1), random_state=42)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.subplot(122)
plot_model(Lasso, polynomial=True, alphas=(0, 10**-7, 1), random_state=42)

save_fig("lasso_regression_plot")
plt.show()

/usr/local/lib/python3.8/site-
packages/sklearn/linear_model/_coordinate_descent.py:474: ConvergenceWarning:
Objective did not converge. You might want to increase the number of iterations.
Duality gap: 2.802867703827423, tolerance: 0.0009294783355207351
model = cd_fast.enet_coordinate_descent(
Saving figure lasso_regression_plot

[44]: from sklearn.linear_model import Lasso


lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
lasso_reg.predict([[1.5]])

[44]: array([1.53788174])

11.0.4 Elastic net regression


$$J(\beta) = \mathrm{MSE}(\beta) + r\alpha\,\frac{1}{2}\sum_{i=1}^{n}\lvert\beta_i\rvert + \alpha\,\frac{1-r}{2}\sum_{i=1}^{n}\beta_i^2$$

- The loss function is MSE plus a combination of ridge and lasso terms - r is a mix factor - in the
code below it’s set to 1/2 (l1_ratio=0.5)

[45]: from sklearn.linear_model import ElasticNet


elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=42)
elastic_net.fit(X, y)

elastic_net.predict([[1.5]])

[45]: array([1.54333232])

11.0.5 Summary
• Regularisation is the only real option when modelling with many features
• Choice of technique is data-dependent
– if there are many unimportant features use Lasso or Elastic Net
– but you don’t usually know this at the start
• The data should always be scaled when using these techniques

12 Early stopping
• All the regression techniques we’ve looked at are aimed at reducing errors to close to zero
• With regularisation designed to work back from low training error (overfit) to good
generalisation error
• You can think of this as wasted effort
– why not regularise by stopping as soon as optimal validation error is reached?
• This can’t be done for normal equation methods
– there is nothing to stop
• But for gradient descent - and other iterative techniques - this can be a big win
• For the example, add noise to y = 0.5X^2 + X + 2
– then train a degree 90 polynomial model
– that should massively overfit if left to run
[55]: np.random.seed(42)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 2 + X + 0.5 * X**2 + np.random.randn(m, 1)

X_train, X_val, y_train, y_val = train_test_split(X[:50], y[:50].ravel(), test_size=0.5, random_state=10)

Early stopping example:


[56]: from copy import deepcopy

poly_scaler = Pipeline([
("poly_features", PolynomialFeatures(degree=90, include_bias=False)),
("std_scaler", StandardScaler())
])

X_train_poly_scaled = poly_scaler.fit_transform(X_train)
X_val_poly_scaled = poly_scaler.transform(X_val)

sgd_reg = SGDRegressor(max_iter=1, tol=-np.infty, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005,
                       random_state=42)

minimum_val_error = float("inf")
best_epoch = None
best_model = None
for epoch in range(1000):
sgd_reg.fit(X_train_poly_scaled, y_train) # continues where it left off
y_val_predict = sgd_reg.predict(X_val_poly_scaled)
val_error = mean_squared_error(y_val, y_val_predict)
if val_error < minimum_val_error:
minimum_val_error = val_error
best_epoch = epoch
best_model = deepcopy(sgd_reg)

Create the graph:


[57]: sgd_reg = SGDRegressor(max_iter=1, tol=-np.infty, warm_start=True,
                             penalty=None, learning_rate="constant", eta0=0.0005,
                             random_state=42)

n_epochs = 500
train_errors, val_errors = [], []
for epoch in range(n_epochs):
sgd_reg.fit(X_train_poly_scaled, y_train)
y_train_predict = sgd_reg.predict(X_train_poly_scaled)
y_val_predict = sgd_reg.predict(X_val_poly_scaled)
train_errors.append(mean_squared_error(y_train, y_train_predict))
val_errors.append(mean_squared_error(y_val, y_val_predict))

best_epoch = np.argmin(val_errors)
best_val_rmse = np.sqrt(val_errors[best_epoch])

plt.annotate('Best model',
xy=(best_epoch, best_val_rmse),
xytext=(best_epoch, best_val_rmse + 1),
ha="center",
arrowprops=dict(facecolor='black', shrink=0.05),
fontsize=16,
)

best_val_rmse -= 0.03 # just to make the graph look better


plt.plot([0, n_epochs], [best_val_rmse, best_val_rmse], "k:", linewidth=2)
plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="Validation set")
plt.plot(np.sqrt(train_errors), "r--", linewidth=2, label="Training set")
plt.legend(loc="upper right", fontsize=14)
plt.xlabel("Epoch", fontsize=14)

plt.ylabel("RMSE", fontsize=14)
save_fig("early_stopping_plot")
plt.show()

Saving figure early_stopping_plot

12.0.1 Early stopping example


• Training error steadily approaches zero
– a degree 90 polynomial will get very close to some training values
• Validation error steadily reduces at first
– the models so far are underfitting
• Validation error then starts increasing - the models are now overfitting
• So we stop at the point that validation error is minimal
– this gives a complex model (i.e. it includes terms up to x^90, since we used PolynomialFeatures
with degree 90)
– but has weights optimised for the overfit/underfit tradeoff
Note that: - These plots don’t always show smooth decline then rise in real life - so judgement is
needed - Writing down (and interpreting) the model is hard - so this is black-box machine learning
- When it works, you are confident that generalisation error will be minimal
[58]: best_epoch, best_model

[58]: (239,
SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,
eta0=0.0005, fit_intercept=True, l1_ratio=0.15,
learning_rate='constant', loss='squared_loss', max_iter=1,
n_iter_no_change=5, penalty=None, power_t=0.25, random_state=42,
shuffle=True, tol=-inf, validation_fraction=0.1, verbose=0,
warm_start=True))

13 Logistic regression
• Standard linear regression, but with the output squashed to lie between 0 and 1 by the sigmoid function σ

$$\hat{p} = \sigma(x^{\top}\beta)$$

• Hence:
– returning a probability
– allowing classification
– confusion matrices & ROC curves
– weights/coefficients give log-odds which can be converted to odds ratios
– (i.e. quantifies the strength of the association between two events)
• Trained using the log loss function (given below)
– deriving and explaining this is out of scope for this module
• Log loss properties:
– no analytic solution is known, so the equivalent of the normal equation does not exist
– it is convex, so gradient descent (and related) algorithms will work
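For reference (this is the standard log loss, stated here without derivation), the cost over m training instances is

$$J(\beta) = -\frac{1}{m}\sum_{i=1}^{m}\left[\,y^{(i)}\log\left(\hat{p}^{(i)}\right) + \left(1-y^{(i)}\right)\log\left(1-\hat{p}^{(i)}\right)\right]$$

where $\hat{p}^{(i)} = \sigma\left(x^{(i)\top}\beta\right)$ is the predicted probability for instance i.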

[64]: t = np.linspace(-10, 10, 100)


sig = 1 / (1 + np.exp(-t))
plt.figure(figsize=(9, 3))
plt.plot([-10, 10], [0, 0], "k-")
plt.plot([-10, 10], [0.5, 0.5], "k:")
plt.plot([-10, 10], [1, 1], "k:")
plt.plot([0, 0], [-1.1, 1.1], "k-")
plt.plot(t, sig, "b-", linewidth=2, label=r"$\sigma(t) = \frac{1}{1 + e^{-t}}$")
plt.xlabel("t")
plt.legend(loc="upper left", fontsize=20)
plt.axis([-10, 10, -0.1, 1.1])
save_fig("logistic_function_plot")
plt.show()

Saving figure logistic_function_plot

[65]: from sklearn import datasets
iris = datasets.load_iris()
list(iris.keys())

[65]: ['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename']

[66]: print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset


--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)


:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica

:Summary Statistics:

============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================

:Missing Attribute Values: None


:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the


pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

- Fisher, R.A. "The use of multiple measurements in taxonomic problems"


Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more …

[67]: X = iris["data"][:, 3:] # petal width


y = (iris["target"] == 2).astype(np.int) # 1 if Iris virginica, else 0

Note 1: To be future-proof we set solver="lbfgs" since this will be the default value in Scikit-
Learn 0.22.
Note 2: This is the limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm, and Roger
Fletcher FRS was my MSc. dissertation supervisor

[72]: from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(solver="lbfgs", random_state=42)
log_reg.fit(X, y)

[72]: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,


intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=42, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)

14 Decision boundaries
• We’re now classifying again, based on logistic regression probabilities
• The model takes one variable (petal width) and returns a probability
• Above 2.0cm and below 1.0cm the classifier is confident
• Within these values the classifier is less sure
• We have a decision boundary at about 1.6cm
– both predictions are 50%
[73]: X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)

plt.plot(X_new, y_proba[:, 1], "g-", linewidth=2, label="Iris virginica")


plt.plot(X_new, y_proba[:, 0], "b--", linewidth=2, label="Not Iris virginica")
legend = plt.legend(loc='center left', shadow=True, fontsize='large')

The figure in the book is actually a bit fancier:
[74]: X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)
decision_boundary = X_new[y_proba[:, 1] >= 0.5][0]

plt.figure(figsize=(8, 3))
plt.plot(X[y==0], y[y==0], "bs")
plt.plot(X[y==1], y[y==1], "g^")
plt.plot([decision_boundary, decision_boundary], [-1, 2], "k:", linewidth=2)
plt.plot(X_new, y_proba[:, 1], "g-", linewidth=2, label="Iris virginica")
plt.plot(X_new, y_proba[:, 0], "b--", linewidth=2, label="Not Iris virginica")
plt.text(decision_boundary+0.02, 0.15, "Decision boundary", fontsize=14, color="k", ha="center")
plt.arrow(decision_boundary, 0.08, -0.3, 0, head_width=0.05, head_length=0.1, fc='b', ec='b')
plt.arrow(decision_boundary, 0.92, 0.3, 0, head_width=0.05, head_length=0.1, fc='g', ec='g')

plt.xlabel("Petal width (cm)", fontsize=14)


plt.ylabel("Probability", fontsize=14)
plt.legend(loc="center left", fontsize=14)
plt.axis([0, 3, -0.02, 1.02])
save_fig("logistic_regression_plot")
plt.show()

Saving figure logistic_regression_plot

[75]: decision_boundary

[75]: array([1.66066066])

[76]: log_reg.predict([[1.7], [1.5]])

[76]: array([1, 0])

• We can get the exact decision boundary
– and check predictions either side
• Now add another feature/variable - petal length - and repeat
• The hyperparameter in the sklearn LogisticRegression implementation is C = 1/α
– so high C means low α
[77]: from sklearn.linear_model import LogisticRegression

X = iris["data"][:, (2, 3)] # petal length, petal width


y = (iris["target"] == 2).astype(np.int)

log_reg = LogisticRegression(solver="lbfgs", C=10**10, random_state=42)


log_reg.fit(X, y)

x0, x1 = np.meshgrid(
np.linspace(2.9, 7, 500).reshape(-1, 1),
np.linspace(0.8, 2.7, 200).reshape(-1, 1),
)
X_new = np.c_[x0.ravel(), x1.ravel()]

y_proba = log_reg.predict_proba(X_new)

plt.figure(figsize=(10, 4))
plt.plot(X[y==0, 0], X[y==0, 1], "bs")
plt.plot(X[y==1, 0], X[y==1, 1], "g^")

zz = y_proba[:, 1].reshape(x0.shape)
contour = plt.contour(x0, x1, zz, cmap=plt.cm.brg)

left_right = np.array([2.9, 7])


boundary = -(log_reg.coef_[0][0] * left_right + log_reg.intercept_[0]) / log_reg.coef_[0][1]

plt.clabel(contour, inline=1, fontsize=12)


plt.plot(left_right, boundary, "k--", linewidth=3)
plt.text(3.5, 1.5, "Not Iris virginica", fontsize=14, color="b", ha="center")
plt.text(6.5, 2.3, "Iris virginica", fontsize=14, color="g", ha="center")
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.axis([2.9, 7, 0.8, 2.7])
save_fig("logistic_regression_contour_plot")
plt.show()

Saving figure logistic_regression_contour_plot

• Another (linear) decision boundary
• It is the straight line given by the set of points x such that

β0 + β1 x1 + β2 x2 = 0

• The other lines denote equal probability
– so the green dashed line is where the model is 90% sure
– the rearrangement used to draw the boundary in the code above is shown below
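To connect this to the plotting code above (a small added step; β0, β1 and β2 correspond to log_reg.intercept_ and the two entries of log_reg.coef_), solving the boundary equation for x2 gives

$$x_2 = -\,\frac{\beta_0 + \beta_1 x_1}{\beta_2}$$

which is exactly what the line boundary = -(log_reg.coef_[0][0] * left_right + log_reg.intercept_[0]) / log_reg.coef_[0][1] computes.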

15 Multiclass logistic regression


• Also known as softmax regression
• Uses the softmax score for class k:

$$s_k(x) = x^{\top}\beta^{(k)}$$

• So each class has its own set of weights/biases/parameters β^(k)


– these are stored in a parameter matrix
• The output is a vector of k probabilities (computed via the softmax function, shown below)
– we take the largest one as our classification, as before
• Include all three varieties in our model
• Note that multiclass means “one of k distinct classes”
– so not multioutput
– can’t be used to recognise k people in one picture
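For completeness (the same function reappears in the exercise solution at the end of this notebook), the scores are turned into class probabilities by the softmax function:

$$\hat{p}_k = \sigma\left(s(x)\right)_k = \frac{\exp\left(s_k(x)\right)}{\sum_{j=1}^{K}\exp\left(s_j(x)\right)}$$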
[78]: X = iris["data"][:, (2, 3)] # petal length, petal width
y = iris["target"]

softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10, random_state=42)

softmax_reg.fit(X, y)

[78]: LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,


intercept_scaling=1, l1_ratio=None, max_iter=100,

multi_class='multinomial', n_jobs=None, penalty='l2',
random_state=42, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)

[79]: x0, x1 = np.meshgrid(


np.linspace(0, 8, 500).reshape(-1, 1),
np.linspace(0, 3.5, 200).reshape(-1, 1),
)
X_new = np.c_[x0.ravel(), x1.ravel()]

y_proba = softmax_reg.predict_proba(X_new)
y_predict = softmax_reg.predict(X_new)

zz1 = y_proba[:, 1].reshape(x0.shape)


zz = y_predict.reshape(x0.shape)

plt.figure(figsize=(10, 4))
plt.plot(X[y==2, 0], X[y==2, 1], "g^", label="Iris virginica")
plt.plot(X[y==1, 0], X[y==1, 1], "bs", label="Iris versicolor")
plt.plot(X[y==0, 0], X[y==0, 1], "yo", label="Iris setosa")

from matplotlib.colors import ListedColormap


custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])

plt.contourf(x0, x1, zz, cmap=custom_cmap)


contour = plt.contour(x0, x1, zz1, cmap=plt.cm.brg)
plt.clabel(contour, inline=1, fontsize=12)
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="center left", fontsize=14)
plt.axis([0, 7, 0, 3.5])
save_fig("softmax_regression_contour_plot")
plt.show()

Saving figure softmax_regression_contour_plot

[80]: softmax_reg.predict([[5, 2]])

[80]: array([2])

[81]: softmax_reg.predict_proba([[5, 2]])

[81]: array([[6.38014896e-07, 5.74929995e-02, 9.42506362e-01]])

16 Summary
• Chapter 4 contains plenty of important material
– Regression using the normal equation
– Regression using gradient descent techniques
– Regularisation using polynomial models
– Regularisation using weight limits
– Regularisation by early stopping
– Logistic regression
– Multiclass logistic regression
• For the exam I can’t ask you to
– invert a matrix
– perform gradient descent
– derive loss functions
– classify flowers (using these methods)
– etc.
• I can ask you to
– explain and compare the concepts
– interpret charts and/or vectors returned by the methods
– draw a straight line defined by (at least) two points
– etc.

17 Exercise solutions
17.1 1. to 11.
See appendix A.

17.2 12. Batch Gradient Descent with early stopping for Softmax Regression
(without using Scikit-Learn)
Let’s start by loading the data. We will just reuse the Iris dataset we loaded earlier.
[ ]: X = iris["data"][:, (2, 3)] # petal length, petal width
y = iris["target"]

We need to add the bias term for every instance (x0 = 1):

[ ]: X_with_bias = np.c_[np.ones([len(X), 1]), X]

And let’s set the random seed so the output of this exercise solution is reproducible:
[ ]: np.random.seed(2042)

The easiest option to split the dataset into a training set, a validation set and a test set would be to
use Scikit-Learn’s train_test_split() function, but the point of this exercise is to try to understand
the algorithms by implementing them manually. So here is one possible implementation:
[ ]: test_ratio = 0.2
validation_ratio = 0.2
total_size = len(X_with_bias)

test_size = int(total_size * test_ratio)


validation_size = int(total_size * validation_ratio)
train_size = total_size - test_size - validation_size

rnd_indices = np.random.permutation(total_size)

X_train = X_with_bias[rnd_indices[:train_size]]
y_train = y[rnd_indices[:train_size]]
X_valid = X_with_bias[rnd_indices[train_size:-test_size]]
y_valid = y[rnd_indices[train_size:-test_size]]
X_test = X_with_bias[rnd_indices[-test_size:]]
y_test = y[rnd_indices[-test_size:]]

The targets are currently class indices (0, 1 or 2), but we need target class probabilities to train the
Softmax Regression model. Each instance will have target class probabilities equal to 0.0 for all
classes except for the target class which will have a probability of 1.0 (in other words, the vector of
class probabilities for any given instance is a one-hot vector). Let’s write a small function to convert
the vector of class indices into a matrix containing a one-hot vector for each instance:

[ ]: def to_one_hot(y):
n_classes = y.max() + 1
m = len(y)
Y_one_hot = np.zeros((m, n_classes))
Y_one_hot[np.arange(m), y] = 1
return Y_one_hot

Let’s test this function on the first 10 instances:


[ ]: y_train[:10]

[ ]: to_one_hot(y_train[:10])

Looks good, so let’s create the target class probabilities matrix for the training set and the test set:
[ ]: Y_train_one_hot = to_one_hot(y_train)
Y_valid_one_hot = to_one_hot(y_valid)
Y_test_one_hot = to_one_hot(y_test)

Now let’s implement the Softmax function. Recall that it is defined by the following equation:
$$\sigma\left(s(x)\right)_k = \frac{\exp\left(s_k(x)\right)}{\sum_{j=1}^{K}\exp\left(s_j(x)\right)}$$

[ ]: def softmax(logits):
exps = np.exp(logits)
exp_sums = np.sum(exps, axis=1, keepdims=True)
return exps / exp_sums

We are almost ready to start training. Let’s define the number of inputs and outputs:
[ ]: n_inputs = X_train.shape[1] # == 3 (2 features plus the bias term)
n_outputs = len(np.unique(y_train)) # == 3 (3 iris classes)

Now here comes the hardest part: training! Theoretically, it’s simple: it’s just a matter of trans-
lating the math equations into Python code. But in practice, it can be quite tricky: in particular,
it’s easy to mix up the order of the terms, or the indices. You can even end up with code that
looks like it’s working but is actually not computing exactly the right thing. When unsure, you
should write down the shape of each term in the equation and make sure the corresponding terms
in your code match closely. It can also help to evaluate each term independently and print them
out. The good news is that you won’t have to do this every day, since all this is well implemented
by Scikit-Learn, but it will help you understand what’s going on under the hood.
So the equations we will need are the cost function:

$$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)} \log\left(\hat{p}_k^{(i)}\right)$$

And the equation for the gradients:

$$\nabla_{\theta^{(k)}} J(\Theta) = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{p}_k^{(i)} - y_k^{(i)}\right) x^{(i)}$$

Note that $\log\left(\hat{p}_k^{(i)}\right)$ may not be computable if $\hat{p}_k^{(i)} = 0$. So we will add a tiny value ϵ inside the log
to avoid getting nan values.
[ ]: eta = 0.01
n_iterations = 5001
m = len(X_train)
epsilon = 1e-7

Theta = np.random.randn(n_inputs, n_outputs)

for iteration in range(n_iterations):


logits = X_train.dot(Theta)
Y_proba = softmax(logits)
loss = -np.mean(np.sum(Y_train_one_hot * np.log(Y_proba + epsilon), axis=1))
error = Y_proba - Y_train_one_hot
if iteration % 500 == 0:
print(iteration, loss)
gradients = 1/m * X_train.T.dot(error)
Theta = Theta - eta * gradients

And that’s it! The Softmax model is trained. Let’s look at the model parameters:
[ ]: Theta

Let’s make predictions for the validation set and check the accuracy score:
[ ]: logits = X_valid.dot(Theta)
Y_proba = softmax(logits)
y_predict = np.argmax(Y_proba, axis=1)

accuracy_score = np.mean(y_predict == y_valid)


accuracy_score

Well, this model looks pretty good. For the sake of the exercise, let’s add a bit of ℓ2 regularization.
The following training code is similar to the one above, but the loss now has an additional ℓ2
penalty, and the gradients have the proper additional term (note that we don’t regularize the first
element of Theta since this corresponds to the bias term). Also, let’s try increasing the learning
rate eta.
[ ]: eta = 0.1
n_iterations = 5001
m = len(X_train)
epsilon = 1e-7
alpha = 0.1 # regularization hyperparameter

Theta = np.random.randn(n_inputs, n_outputs)

for iteration in range(n_iterations):
logits = X_train.dot(Theta)
Y_proba = softmax(logits)
xentropy_loss = -np.mean(np.sum(Y_train_one_hot * np.log(Y_proba + epsilon), axis=1))

l2_loss = 1/2 * np.sum(np.square(Theta[1:]))


loss = xentropy_loss + alpha * l2_loss
error = Y_proba - Y_train_one_hot
if iteration % 500 == 0:
print(iteration, loss)
gradients = 1/m * X_train.T.dot(error) + np.r_[np.zeros([1, n_outputs]), alpha * Theta[1:]]

Theta = Theta - eta * gradients

Because of the additional ℓ2 penalty, the loss seems greater than earlier, but perhaps this model
will perform better? Let’s find out:
[ ]: logits = X_valid.dot(Theta)
Y_proba = softmax(logits)
y_predict = np.argmax(Y_proba, axis=1)

accuracy_score = np.mean(y_predict == y_valid)


accuracy_score

Cool, perfect accuracy! We probably just got lucky with this validation set, but still, it’s pleasant.
Now let’s add early stopping. For this we just need to measure the loss on the validation set at
every iteration and stop when the error starts growing.
[ ]: eta = 0.1
n_iterations = 5001
m = len(X_train)
epsilon = 1e-7
alpha = 0.1 # regularization hyperparameter
best_loss = np.infty

Theta = np.random.randn(n_inputs, n_outputs)

for iteration in range(n_iterations):


logits = X_train.dot(Theta)
Y_proba = softmax(logits)
xentropy_loss = -np.mean(np.sum(Y_train_one_hot * np.log(Y_proba + epsilon), axis=1))

l2_loss = 1/2 * np.sum(np.square(Theta[1:]))


loss = xentropy_loss + alpha * l2_loss
error = Y_proba - Y_train_one_hot

gradients = 1/m * X_train.T.dot(error) + np.r_[np.zeros([1, n_outputs]), alpha * Theta[1:]]
Theta = Theta - eta * gradients

logits = X_valid.dot(Theta)
Y_proba = softmax(logits)
xentropy_loss = -np.mean(np.sum(Y_valid_one_hot * np.log(Y_proba +␣
,→epsilon), axis=1))

l2_loss = 1/2 * np.sum(np.square(Theta[1:]))


loss = xentropy_loss + alpha * l2_loss
if iteration % 500 == 0:
print(iteration, loss)
if loss < best_loss:
best_loss = loss
else:
print(iteration - 1, best_loss)
print(iteration, loss, "early stopping!")
break

[ ]: logits = X_valid.dot(Theta)
Y_proba = softmax(logits)
y_predict = np.argmax(Y_proba, axis=1)

accuracy_score = np.mean(y_predict == y_valid)


accuracy_score

Still perfect, but faster.


Now let’s plot the model’s predictions on the whole dataset:
[ ]: x0, x1 = np.meshgrid(
np.linspace(0, 8, 500).reshape(-1, 1),
np.linspace(0, 3.5, 200).reshape(-1, 1),
)
X_new = np.c_[x0.ravel(), x1.ravel()]
X_new_with_bias = np.c_[np.ones([len(X_new), 1]), X_new]

logits = X_new_with_bias.dot(Theta)
Y_proba = softmax(logits)
y_predict = np.argmax(Y_proba, axis=1)

zz1 = Y_proba[:, 1].reshape(x0.shape)


zz = y_predict.reshape(x0.shape)

plt.figure(figsize=(10, 4))
plt.plot(X[y==2, 0], X[y==2, 1], "g^", label="Iris virginica")
plt.plot(X[y==1, 0], X[y==1, 1], "bs", label="Iris versicolor")
plt.plot(X[y==0, 0], X[y==0, 1], "yo", label="Iris setosa")

from matplotlib.colors import ListedColormap
custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])

plt.contourf(x0, x1, zz, cmap=custom_cmap)


contour = plt.contour(x0, x1, zz1, cmap=plt.cm.brg)
plt.clabel(contour, inline=1, fontsize=12)
plt.xlabel("Petal length", fontsize=14)
plt.ylabel("Petal width", fontsize=14)
plt.legend(loc="upper left", fontsize=14)
plt.axis([0, 7, 0, 3.5])
plt.show()

And now let’s measure the final model’s accuracy on the test set:
[ ]: logits = X_test.dot(Theta)
Y_proba = softmax(logits)
y_predict = np.argmax(Y_proba, axis=1)

accuracy_score = np.mean(y_predict == y_test)


accuracy_score

Our perfect model turns out to have slight imperfections. This variability is likely due to the very
small size of the dataset: depending on how you sample the training set, validation set and the test
set, you can get quite different results. Try changing the random seed and running the code again
a few times; you will see that the results vary.
[ ]:
