1 Lecture 09
1.1 ID5059
1.2 Tom Kelsey - Jan 2021
As before, we take the Jupyter notebook associated with the course textbook, annotate it, and explain the key ideas.
Chapter 4 – Training Linear Models
This notebook contains all the sample code and solutions to the exercises in chapter 4.
2 Setup
First, let’s import a few common modules, ensure MatplotLib plots figures inline and prepare a
function to save the figures. We also check that Python 3.5 or later is installed (although Python
2.x may work, it is deprecated so we strongly recommend you use Python 3 instead), as well as
Scikit-Learn ≥ 0.20.
[1]: # Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)
# Common imports
import numpy as np
import os
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
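Later cells call a save_fig helper that was defined in this setup cell but is not shown in this extract. A minimal sketch, with the output directory name assumed:

IMAGES_PATH = os.path.join(".", "images")   # assumed output directory
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    # write the current Matplotlib figure to disk so it can be reused in the notes
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)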
y = Xβ + e
• We first create data from a known linear signal
– y = 3X + 4 in the first example
• And add a known (and repeatable) amount of noise
– random values drawn from a standard normal distribution (np.random.randn)
• This provides an empirical framework for the comparison of approaches
– since we know exactly what “error” means
[2]: import numpy as np
np.random.seed(42)  # fix the seed so the "noise" is repeatable
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
• So we need to add the β0 (intercept) term as described above
• We could set the added column to anything, but using 1 means that the β values returned need no adjustment
• The normal equation β̂ = (XᵀX)⁻¹Xᵀy then gives the least-squares estimates directly, as in the code below
[4]: X_b = np.c_[np.ones((100, 1)), X] # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
[5]: array([[4.21509616],
[2.77011339]])
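The next output comes from a cell that is not shown here: it evaluates the fitted model at the two ends of the x range (0 and 2), presumably along these lines:

X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new]  # add x0 = 1 to each instance
y_predict = X_new_b.dot(theta_best)
y_predict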
[6]: array([[4.21509616],
[9.75532293]])
The figure in the book actually corresponds to the following code, with a legend and axis labels:
[8]: plt.plot(X_new, y_predict, "r-", linewidth=2, label="Predictions")
plt.plot(X, y, "b.")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.legend(loc="upper left", fontsize=14)
plt.axis([0, 2, 0, 15])
save_fig("linear_model_predictions_plot")
plt.show()
• Instead of writing the normal equation explicitly, we can call a library function that implements it
• And which additionally gives a solution when the matrix is not invertible
[9]: from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.intercept_, lin_reg.coef_
[10]: lin_reg.predict(X_new)
[10]: array([[4.21509616],
[9.75532293]])
The LinearRegression class is based on the scipy.linalg.lstsq() function (the name stands
for “least squares”), which you could call directly:
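The call itself is not shown in this extract; it would look something like this, returning the same θ̂ as the normal equation (plus the residuals, the rank of X_b and its singular values):

from scipy.linalg import lstsq
theta_best_svd, residuals, rank, s = lstsq(X_b, y)
theta_best_svd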
[11]: array([[4.21509616],
[2.77011339]])
[12]: np.linalg.pinv(X_b).dot(y)
[12]: array([[4.21509616],
[2.77011339]])
3.0.6 Summary
• For simple problems we seek simple solutions
• The normal equation gives optimal results which are cheap to compute
– if the number of attributes - columns in our data - is small
• But can be expensive when we have many instances - rows in our data
• An alternative approach is to iteratively lower the error until a minimum is reached
4 Gradient descent
• Back in high school you would have seen functions of the form y = ax² + bx + c
– the standard quadratic polynomial
• To get the gradient (i.e. slope) of the function, we differentiate to get y′ = 2ax + b
• The function has a minimum (or maximum) where the gradient is zero
Our error measure is the mean squared error

MSE(β) = (1/m) ∑_{i=1}^{m} (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)²

where

ŷ = β0 + β1x1 + β2x2 + … + βpxp

To make things easier to visualise, take p = 1, forget β0, and consider a single observation (x, y). The squared error is then (β1x − y)², so the derivative with respect to β1 is

2β1x² − 2xy

or, after dividing by the constant 2x,

β1x − y
It should be clear that the MSE will be zero when the derivative is zero and β1x = y (i.e. when the predictions match the observed data).
The idea is to start somewhere where the error value could be anything. Then:
1. Measure the gradient of the error at that point
2. Move to somewhere where the error is less
3. Repeat until (hopefully) the error is close to zero
To make it work we need three things:
1. A good starting point (but we don't have any insights, so just choose at random)
2. A loss function that measures the difference, or error, between actual y and predicted y at the current position, and provides feedback to the process so that it can adjust the parameters to minimise the error
3. A learning rate, or step size, that is evaluated and updated based on the behaviour of the loss function - too small leads to slow convergence, too large leads to oscillation around the minimum
Note that:
• this all works as we've chosen our error to have a quadratic form, hence a unique global minimum
• for deep learning, we hope that the error is approximately quadratic near an error minimum
• everything remains true in p dimensions, but is hard to plot - we just have to understand the concept - the code does all the work for us
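The theta shown below comes from a batch gradient descent cell that is not included in this extract. A minimal sketch of such a loop (learning rate and iteration count assumed); the plot_gradient_descent helper used in the next cell presumably wraps the same loop for a given eta and plots the first few fitted lines:

eta = 0.1                      # learning rate (assumed)
n_iterations = 1000            # number of full-batch gradient steps (assumed)
m = 100                        # number of training instances

np.random.seed(42)
theta = np.random.randn(2, 1)  # random initialisation

for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)  # gradient of the MSE at the current theta
    theta = theta - eta * gradients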
[14]: theta
[14]: array([[4.21509616],
[2.77011339]])
[15]: X_new_b.dot(theta)
[15]: array([[4.21509616],
[9.75532293]])
[16]: theta_path_bgd = []
plt.figure(figsize=(10,4))
plt.subplot(131); plot_gradient_descent(theta, eta=0.02)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.subplot(132); plot_gradient_descent(theta, eta=0.1, theta_path=theta_path_bgd)
save_fig("gradient_descent_plot")
plt.show()
Saving figure gradient_descent_plot
[19]: n_epochs = 50
t0, t1 = 5, 50 # learning schedule hyperparameters
def learning_schedule(t):
return t0 / (t + t1)
theta_path_sgd = []
m = len(X_b)
np.random.seed(42)
theta = np.random.randn(2, 1)  # random initialisation

for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)  # pick one instance at random
        xi = X_b[random_index:random_index+1]
        yi = y[random_index:random_index+1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients
        theta_path_sgd.append(theta) # not shown
[20]: theta
[20]: array([[4.21076011],
[2.74856079]])
Note that:
• the training instances must be independent and identically distributed
– otherwise we might optimise on one variable, then another, … and not find the global optimum
• we can shuffle at each epoch to take care of this
[21]: from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, penalty=None, eta0=0.1, random_state=42)  # hyperparameter values assumed
sgd_reg.fit(X, y.ravel())
# Mini-batch gradient descent: compute each step on a small random batch of instances
theta_path_mgd = []
n_iterations = 50
minibatch_size = 20

np.random.seed(42)
theta = np.random.randn(2,1) # random initialization
t = 0
for epoch in range(n_iterations):
    shuffled_indices = np.random.permutation(m)
    X_b_shuffled = X_b[shuffled_indices]
    y_shuffled = y[shuffled_indices]
    for i in range(0, m, minibatch_size):
        t += 1
        xi = X_b_shuffled[i:i+minibatch_size]
        yi = y_shuffled[i:i+minibatch_size]
        gradients = 2/minibatch_size * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(t)
        theta = theta - eta * gradients
        theta_path_mgd.append(theta)
[24]: theta
[24]: array([[4.25214635],
[2.7896408 ]])
[26]: theta_path_bgd = np.array(theta_path_bgd)  # convert the recorded steps to arrays for slicing
theta_path_sgd = np.array(theta_path_sgd)
theta_path_mgd = np.array(theta_path_mgd)

plt.figure(figsize=(7,4))
plt.plot(theta_path_sgd[:, 0], theta_path_sgd[:, 1], "r-s", linewidth=1, label="Stochastic")
8 Summary
• We’ve covered gradient descent techniques for linear regression with a convex (quadratic)
error surface
• For more complex error surfaces we would also need to consider local minima and saddle points
– places where the gradient is at or close to zero, so the model stops learning
– but which do not give us the error reduction that we want
• We’ll cover this later, but for now the randomness in stochastic variants can help us “jump
out” of a local minimum or saddle point
9 Polynomial regression
• We have our X data
• Think of this as X¹, i.e. raised to the power one
• We added intercept terms: a matrix the same size as X but with all entries one
– Think of this as X⁰, i.e. raised to the power zero
• The process can be extended to get X², X³, X⁴, . . .
• So the function in our main equation becomes a polynomial
y = β0X⁰ + β1X¹ + β2X² + β3X³ + e
As before we set up a known signal with known noise
[27]: import numpy as np
import numpy.random as rnd
np.random.seed(42)
[28]: m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)
[30]: from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
X[0]
[30]: array([-0.75275929])
[31]: X_poly[0]
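X_poly[0] now contains the original feature value and its square. The next step in the notebook (not shown in this extract) fits a plain LinearRegression to the extended features; a sketch:

lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
lin_reg.intercept_, lin_reg.coef_   # should roughly recover intercept 2 and coefficients [1, 0.5]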
9.0.3 Overfit and underfit
• We know that the correct type of polynomial for these data is quadratic
• We can fit a degree 1 polynomial (i.e. straight line)
• And a degree 300 polynomial
• One should underfit, the other overfit
[34]: from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

X_new = np.linspace(-3, 3, 100).reshape(100, 1)  # plotting grid (defined in an earlier, unshown cell)

for style, width, degree in (("g-", 1, 300), ("b--", 2, 2), ("r-+", 2, 1)):
    polybig_features = PolynomialFeatures(degree=degree, include_bias=False)
    std_scaler = StandardScaler()
    lin_reg = LinearRegression()
    polynomial_regression = Pipeline([
        ("poly_features", polybig_features),
        ("std_scaler", std_scaler),
        ("lin_reg", lin_reg),
    ])
    polynomial_regression.fit(X, y)
    y_newbig = polynomial_regression.predict(X_new)
    plt.plot(X_new, y_newbig, style, label=str(degree), linewidth=width)
plt.legend(loc="upper left")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.axis([-3, 3, 0, 10])
save_fig("high_degree_polynomials_plot")
plt.show()
plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
plt.legend(loc="upper right", fontsize=14) # not shown in the book
plt.xlabel("Training set size", fontsize=14) # not shown
plt.ylabel("RMSE", fontsize=14) # not shown
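These plotting lines are the tail of a plot_learning_curves helper whose definition is not shown in this extract. A sketch of the whole helper, assuming it follows the four steps listed in the next section (the split ratio and random state are assumptions):

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    # hold out a validation set, then train on ever larger subsets of the rest
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=10)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
    plt.legend(loc="upper right", fontsize=14)
    plt.xlabel("Training set size", fontsize=14)
    plt.ylabel("RMSE", fontsize=14)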
10 Learning curves
• We’ve used cross validation as the standard technique for estimating generalisation errors
– good performance on training data but bad cross validation error indicates overfit
– poor performance on training data and cross validation indicates underfit
• We want to visualise the overfit/underfit tradeoff as model training proceeds
1. Start with a small subset of the training data
2. Learn a regression model
3. Calculate training & validation errors
4. Increase the size of the training set and repeat
10.0.1 Underfit
• This is the degree 1 model trying to predict quadratic data
• For one and two points, training error is low (we can fit a straight line exactly)
– but validation error is large (we have the wrong straight line)
• Training error increases as we increase the training data, then plateaus
– failure to capture the quadratic signal
• Validation error decreases to a similar plateau
• We conclude that adding data will not reduce error
– which is underfit
[36]: lin_reg = LinearRegression()
plot_learning_curves(lin_reg, X, y)
plt.axis([0, 80, 0, 3]) # not shown in the book
save_fig("underfitting_learning_curves_plot") # not shown
plt.show() # not shown
10.0.2 Overfit
• This is a degree 10 model trying to predict quadratic data
– again training error is (close to) zero for the first few points, since a degree-10 polynomial can interpolate them exactly
– again huge validation error, as it is the wrong degree-10 polynomial
• Again we see convergence to a plateau
• Two important differences
1. the plateaus indicate lower error
2. they are further apart
• These are the classic signs of overfit
[37]: from sklearn.pipeline import Pipeline
polynomial_regression = Pipeline([
("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
("lin_reg", LinearRegression()),
])
plot_learning_curves(polynomial_regression, X, y)
plt.axis([0, 80, 0, 3]) # not shown
save_fig("learning_curves_plot") # not shown
plt.show() # not shown
10.1 Terminology
• I refer to overfit and underfit
• Other sources (including the textbook) use different terminology
• Generalisation error consists of three things
1. Bias due to wrong assumptions about model complexity, often leading to underfit
2. Variance due to excessive sensitivity to small changes in training data, often leading to
overfit
3. Irreducible error due to the noise in the data (outliers, unreliable sensor readings,
unreliable human data entry, etc.)
• Increasing model complexity will typically increase variance and decrease bias
• Decreasing model complexity will typically decrease variance and increase bias
Caveat: the intercept terms we saw earlier are also called bias terms. For neural nets, they are
always called bias terms. This use of the word bias is not the same as the above
11 Regularized models
• Polynomial regression gives a clear way to increase model complexity
– and then decrease it to reduce overfitting
– this process is called regularisation
• The idea can be extended (much) further:
ax² + bx + c
dx³ + ex² + fx + g
• The whole field of curve-fitting is about finding a model with the correct mathematical type
– mixed basis function linear
– Even order and half order polynomials
– Chebyshev polynomials
– Fourier-series polynomials
– standard, ln x, sqrt even, y-transformed, even order, and half order rationals
– Chebyshev rationals
– Fourier-series rationals
– nonlinear peak equations
– nonlinear transition equations
– nonlinear kinetic equations
• This approach only works for small numbers of variables (i.e. columns of X)
– so another approach to regularisation would be useful
• Regularise by constraining the weights (i.e. coefficients) that are allowed
– works for the normal equation and gradient descent methods
• Can be applied to the standard linear model (i.e. degree one)
– and to polynomial models
• We look at three commonly used techniques
• First produce a known signal (y = ½x + 1) and add noise
[38]: np.random.seed(42)
m = 20
X = 3 * np.random.rand(m, 1)
y = 1 + 0.5 * X + np.random.randn(m, 1) / 1.5
X_new = np.linspace(0, 3, 100).reshape(100, 1)
Ridge regression:
• The loss function is MSE plus the sum of the squared individual weights, multiplied by a hyperparameter α
• If α is zero we just have linear regression
• If α is large then the weights will be small, so we get a flat line through the data's mean
• Note that the intercept term is not included in the penalty (it is not regularised)
Note that:
• The loss function used for training is not the same as the performance measure used for testing
• This is not uncommon: one is chosen for computational convenience, the other for how we want to judge performance
• In classification we often train using the log-loss function, but evaluate using precision/recall
• We call specialist solvers to find the optimal weights since the underlying maths is now harder
Normal equation - the "cholesky" solver computes a closed-form solution for our data (a variant of the normal equation), while the "sag" solver reaches a similar answer iteratively - both make reasonable predictions
[39]: from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=1, solver="cholesky", random_state=42)
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])
[39]: array([[1.55071465]])
[40]: ridge_reg = Ridge(alpha=1, solver="sag", random_state=42)
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])
[40]: array([[1.5507201]])
Stochastic gradient descent - another reasonable - but different - prediction for the expected
value at x = 1.5
[41]: sgd_reg = SGDRegressor(penalty="l2", max_iter=1000, tol=1e-3, random_state=42)
sgd_reg.fit(X, y.ravel())
sgd_reg.predict([[1.5]])
[41]: array([1.47012588])
def plot_model(model_class, polynomial, alphas, **model_kargs):
    # header and loop reconstructed - the first lines of this cell are not shown in the extract
    for alpha, style in zip(alphas, ("b-", "g--", "r:")):
        model = model_class(alpha, **model_kargs) if alpha > 0 else LinearRegression()
        if polynomial:
            model = Pipeline([
                ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
                ("std_scaler", StandardScaler()),
                ("regul_reg", model),
            ])
        model.fit(X, y)
        y_new_regul = model.predict(X_new)
        lw = 2 if alpha > 0 else 1
        plt.plot(X_new, y_new_regul, style, linewidth=lw, label=r"$\alpha = {}$".format(alpha))
plt.figure(figsize=(8,4))
plt.subplot(121)
plot_model(Ridge, polynomial=False, alphas=(0, 10, 100), random_state=42)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.subplot(122)
plot_model(Ridge, polynomial=True, alphas=(0, 10**-5, 1), random_state=42)
save_fig("ridge_regression_plot")
plt.show()
Lasso regression:
• The loss function is MSE plus the sum of the weights' absolute values, multiplied by a hyperparameter α
– for mathematicians, using the ℓ1 norm instead of ridge's ℓ2 norm
• Similar performance
• But Lasso sets the weights of unimportant variables (close) to zero
– and so automatically performs variable selection - only using variables that contribute to good predictions
• And returns a sparse model (i.e. a coefficient vector containing many zeros), which aids efficiency
Note: to be future-proof, we set max_iter=1000 and tol=1e-3 because these will be the default
values in Scikit-Learn 0.21.
[43]: from sklearn.linear_model import Lasso
plt.figure(figsize=(8,4))
plt.subplot(121)
plot_model(Lasso, polynomial=False, alphas=(0, 0.1, 1), random_state=42)
plt.ylabel("$y$", rotation=0, fontsize=18)
plt.subplot(122)
plot_model(Lasso, polynomial=True, alphas=(0, 10**-7, 1), random_state=42)
save_fig("lasso_regression_plot")
plt.show()
/usr/local/lib/python3.8/site-
packages/sklearn/linear_model/_coordinate_descent.py:474: ConvergenceWarning:
Objective did not converge. You might want to increase the number of iterations.
Duality gap: 2.802867703827423, tolerance: 0.0009294783355207351
model = cd_fast.enet_coordinate_descent(
Saving figure lasso_regression_plot
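The prediction below is produced by a Lasso cell that is not shown in this extract; presumably something along these lines (the alpha value is an assumption):

lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
lasso_reg.predict([[1.5]])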
[44]: array([1.53788174])
Elastic Net:
• The loss function is MSE plus a combination of the ridge and lasso penalty terms
• r is a mix factor - in the code below it's set to ½
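The cell producing the prediction below begins with the model construction, which is missing from this extract; presumably along these lines (the alpha value is an assumption, l1_ratio is the mix factor r = ½ mentioned above):

from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=42)
elastic_net.fit(X, y)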
elastic_net.predict([[1.5]])
[45]: array([1.54333232])
11.0.5 Summary
• Regularisation is the only real option when modelling with many features
• Choice of technique is data-dependent
– if there are many unimportant features use Lasso or Elastic Net
– but you don’t usually know this at the start
• The data should always be scaled when using these techniques
12 Early stopping
• All the regression techniques we’ve looked at are aimed at reducing errors to close to zero
• With regularisation designed to work back from low training error (overfit) towards good generalisation error
• You can think of this as wasted effort
– why not regularise by stopping as soon as optimal validation error is reached?
• This can’t be done for normal equation methods
– there is nothing to stop
• But for gradient descent - and other iterative techniques - this can be a big win
• For the example, add noise to y = ½X² + X + 2
– then train a degree 90 polynomial model
– that should massively overfit if left to run
[55]: np.random.seed(42)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 2 + X + 0.5 * X**2 + np.random.randn(m, 1)

from sklearn.model_selection import train_test_split
# split into training and validation sets (the exact split used in the original notebook is not shown)
X_train, X_val, y_train, y_val = train_test_split(X[:50], y[:50].ravel(), test_size=0.5, random_state=10)

poly_scaler = Pipeline([
    ("poly_features", PolynomialFeatures(degree=90, include_bias=False)),
    ("std_scaler", StandardScaler())
])

X_train_poly_scaled = poly_scaler.fit_transform(X_train)
X_val_poly_scaled = poly_scaler.transform(X_val)
from copy import deepcopy
from sklearn.metrics import mean_squared_error

# warm_start=True means each call to fit() continues training from where it left off
sgd_reg = SGDRegressor(max_iter=1, tol=-np.infty, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005,
                       random_state=42)
minimum_val_error = float("inf")
best_epoch = None
best_model = None
for epoch in range(1000):
sgd_reg.fit(X_train_poly_scaled, y_train) # continues where it left off
y_val_predict = sgd_reg.predict(X_val_poly_scaled)
val_error = mean_squared_error(y_val, y_val_predict)
if val_error < minimum_val_error:
minimum_val_error = val_error
best_epoch = epoch
best_model = deepcopy(sgd_reg)
n_epochs = 500
train_errors, val_errors = [], []
for epoch in range(n_epochs):
sgd_reg.fit(X_train_poly_scaled, y_train)
y_train_predict = sgd_reg.predict(X_train_poly_scaled)
y_val_predict = sgd_reg.predict(X_val_poly_scaled)
train_errors.append(mean_squared_error(y_train, y_train_predict))
val_errors.append(mean_squared_error(y_val, y_val_predict))
best_epoch = np.argmin(val_errors)
best_val_rmse = np.sqrt(val_errors[best_epoch])
plt.annotate('Best model',
xy=(best_epoch, best_val_rmse),
xytext=(best_epoch, best_val_rmse + 1),
ha="center",
arrowprops=dict(facecolor='black', shrink=0.05),
fontsize=16,
)
plt.ylabel("RMSE", fontsize=14)
save_fig("early_stopping_plot")
plt.show()
[58]: (239,
SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,
eta0=0.0005, fit_intercept=True, l1_ratio=0.15,
learning_rate='constant', loss='squared_loss', max_iter=1,
n_iter_no_change=5, penalty=None, power_t=0.25, random_state=42,
shuffle=True, tol=-inf, validation_fraction=0.1, verbose=0,
warm_start=True))
13 Logistic regression
• Standard linear regression, but with the linear predictor squashed to lie between 0 and 1:
ŷ = σ(xᵀβ)
• Hence:
– returning a probability
– allowing classification
– confusion matrices & ROC curves
– weights/coefficients give log-odds which can be converted to odds ratios
– (i.e. quantifies the strength of the association between two events)
• Trained using the log loss function
– deriving and explaining this is out of scope for this module
• Log loss properties:
– no analytic solution is known, so the equivalent of the normal equation does not exist
– it is convex, so gradient descent (and related) algorithms will work
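Here σ is the logistic (sigmoid) function, which squashes any real number into the interval (0, 1); a one-line NumPy version:

def sigmoid(t):
    return 1 / (1 + np.exp(-t))   # maps t in (-inf, inf) to a value in (0, 1)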
[65]: from sklearn import datasets
iris = datasets.load_iris()
list(iris.keys())
[66]: print(iris.DESCR)
.. _iris_dataset:
:Summary Statistics:
               Min  Max  Mean   SD    Class Correlation
petal length:  1.0  6.9  3.76   1.76  0.9490 (high!)
petal width:   0.1  2.5  1.20   0.76  0.9565 (high!)
============== ==== ==== ======= ===== ====================
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
.. topic:: References
Note 1: To be future-proof we set solver="lbfgs" since this will be the default value in Scikit-
Learn 0.22.
Note 2: This is the limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm, and Roger
Fletcher FRS was my MSc. dissertation supervisor
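The feature matrix and labels used in cell [72] were set up in cells that are not shown here; given the discussion below (a single petal-width feature, predicting Iris virginica), they were presumably along these lines:

X = iris["data"][:, 3:]                  # petal width, in cm
y = (iris["target"] == 2).astype(int)    # 1 if Iris virginica, else 0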
[72]: from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(solver="lbfgs", random_state=42)
log_reg.fit(X, y)
14 Decision boundaries
• We’re now classifying again, based on logistic regression probabilities
• The model takes one variable (petal width) and returns a probability
• Above 2.0cm and below 1.0cm the classifier is confident
• Within these values the classifier is less sure
• We have a decision boundary at about 1.6cm
– where both class probabilities are 50%
[73]: X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)
The figure in the book is actually a bit fancier:
[74]: X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)
decision_boundary = X_new[y_proba[:, 1] >= 0.5][0]
plt.figure(figsize=(8, 3))
plt.plot(X[y==0], y[y==0], "bs")
plt.plot(X[y==1], y[y==1], "g^")
plt.plot([decision_boundary, decision_boundary], [-1, 2], "k:", linewidth=2)
plt.plot(X_new, y_proba[:, 1], "g-", linewidth=2, label="Iris virginica")
plt.plot(X_new, y_proba[:, 0], "b--", linewidth=2, label="Not Iris virginica")
plt.text(decision_boundary+0.02, 0.15, "Decision boundary", fontsize=14, color="k", ha="center")
[75]: decision_boundary
[75]: array([1.66066066])
[76]: array([1, 0])
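The contour plot below uses two features (petal length and petal width), so the model is presumably refit on both before this cell; a sketch (the regularisation strength C is an assumption):

X = iris["data"][:, (2, 3)]              # petal length, petal width
y = (iris["target"] == 2).astype(int)
log_reg = LogisticRegression(solver="lbfgs", C=10**10, random_state=42)
log_reg.fit(X, y)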
x0, x1 = np.meshgrid(
np.linspace(2.9, 7, 500).reshape(-1, 1),
np.linspace(0.8, 2.7, 200).reshape(-1, 1),
)
X_new = np.c_[x0.ravel(), x1.ravel()]
y_proba = log_reg.predict_proba(X_new)
plt.figure(figsize=(10, 4))
plt.plot(X[y==0, 0], X[y==0, 1], "bs")
plt.plot(X[y==1, 0], X[y==1, 1], "g^")
zz = y_proba[:, 1].reshape(x0.shape)
contour = plt.contour(x0, x1, zz, cmap=plt.cm.brg)
• Another (linear) decision boundary
• It is the straight line given by the set of points x such that
β0 + β1 x1 + β2 x2 = 0
15 Softmax regression
# the setup cell is not shown in this extract; presumably X, y were reset to the two
# petal features and the full three-class target before fitting:
X = iris["data"][:, (2, 3)]   # petal length, petal width
y = iris["target"]

softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10, random_state=42)
softmax_reg.fit(X, y)
multi_class='multinomial', n_jobs=None, penalty='l2',
random_state=42, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
y_proba = softmax_reg.predict_proba(X_new)
y_predict = softmax_reg.predict(X_new)
plt.figure(figsize=(10, 4))
plt.plot(X[y==2, 0], X[y==2, 1], "g^", label="Iris virginica")
plt.plot(X[y==1, 0], X[y==1, 1], "bs", label="Iris versicolor")
plt.plot(X[y==0, 0], X[y==0, 1], "yo", label="Iris setosa")
[80]: softmax_reg.predict([[5, 2]])
[80]: array([2])
16 Summary
• Chapter 4 contains plenty of important material
– Regression using the normal equation
– Regression using gradient descent techniques
– Regularisation using polynomial models
– Regularisation using weight limits
– Regularisation by early stopping
– Logistic regression
– Multiclass logistic regression
• For the exam I can’t ask you to
– invert a matrix
– perform gradient descent
– derive loss functions
– classify flowers (using these methods)
– etc.
• I can ask you to
– explain and compare the concepts
– interpret charts and/or vectors returned by the methods
– draw a straight line defined by (at least) two points
– etc.
17 Exercise solutions
17.1 1. to 11.
See appendix A.
17.2 12. Batch Gradient Descent with early stopping for Softmax Regression
(without using Scikit-Learn)
Let’s start by loading the data. We will just reuse the Iris dataset we loaded earlier.
[ ]: X = iris["data"][:, (2, 3)] # petal length, petal width
y = iris["target"]
We need to add the bias term for every instance (x0 = 1):
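The corresponding cell is not shown here; mirroring the earlier X_b construction, it is presumably:

X_with_bias = np.c_[np.ones([len(X), 1]), X]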
And let’s set the random seed so the output of this exercise solution is reproducible:
[ ]: np.random.seed(2042)
The easiest option to split the dataset into a training set, a validation set and a test set would be to
use Scikit-Learn’s train_test_split() function, but the point of this exercise is to try understand
the algorithms by implementing them manually. So here is one possible implementation:
[ ]: test_ratio = 0.2
validation_ratio = 0.2
total_size = len(X_with_bias)

test_size = int(total_size * test_ratio)
validation_size = int(total_size * validation_ratio)
train_size = total_size - test_size - validation_size

rnd_indices = np.random.permutation(total_size)
X_train = X_with_bias[rnd_indices[:train_size]]
y_train = y[rnd_indices[:train_size]]
X_valid = X_with_bias[rnd_indices[train_size:-test_size]]
y_valid = y[rnd_indices[train_size:-test_size]]
X_test = X_with_bias[rnd_indices[-test_size:]]
y_test = y[rnd_indices[-test_size:]]
The targets are currently class indices (0, 1 or 2), but we need target class probabilities to train the
Softmax Regression model. Each instance will have target class probabilities equal to 0.0 for all
classes except for the target class which will have a probability of 1.0 (in other words, the vector of
class probabilities for any given instance is a one-hot vector). Let's write a small function to convert
the vector of class indices into a matrix containing a one-hot vector for each instance:
[ ]: def to_one_hot(y):
n_classes = y.max() + 1
m = len(y)
Y_one_hot = np.zeros((m, n_classes))
Y_one_hot[np.arange(m), y] = 1
return Y_one_hot
[ ]: to_one_hot(y_train[:10])
Looks good, so let’s create the target class probabilities matrix for the training set and the test set:
[ ]: Y_train_one_hot = to_one_hot(y_train)
Y_valid_one_hot = to_one_hot(y_valid)
Y_test_one_hot = to_one_hot(y_test)
Now let’s implement the Softmax function. Recall that it is defined by the following equation:
σ(s(x))_k = exp(s_k(x)) / ∑_{j=1}^{K} exp(s_j(x))
[ ]: def softmax(logits):
exps = np.exp(logits)
exp_sums = np.sum(exps, axis=1, keepdims=True)
return exps / exp_sums
We are almost ready to start training. Let’s define the number of inputs and outputs:
[ ]: n_inputs = X_train.shape[1] # == 3 (2 features plus the bias term)
n_outputs = len(np.unique(y_train)) # == 3 (3 iris classes)
Now here comes the hardest part: training! Theoretically, it’s simple: it’s just a matter of trans-
lating the math equations into Python code. But in practice, it can be quite tricky: in particular,
it’s easy to mix up the order of the terms, or the indices. You can even end up with code that
looks like it’s working but is actually not computing exactly the right thing. When unsure, you
should write down the shape of each term in the equation and make sure the corresponding terms
in your code match closely. It can also help to evaluate each term independently and print them
out. The good news is that you won't have to do this every day, since all this is well implemented
by Scikit-Learn, but it will help you understand what’s going on under the hood.
So the equations we will need are the cost function:
J(Θ) = −(1/m) ∑_{i=1}^{m} ∑_{k=1}^{K} y_k^{(i)} log(p̂_k^{(i)})
And the equation for the gradients:
∇_{θ^{(k)}} J(Θ) = (1/m) ∑_{i=1}^{m} (p̂_k^{(i)} − y_k^{(i)}) x^{(i)}
Note that log(p̂_k^{(i)}) may not be computable if p̂_k^{(i)} = 0, so we will add a tiny value ϵ inside the logarithm to avoid getting nan values.
[ ]: eta = 0.01
n_iterations = 5001
m = len(X_train)
epsilon = 1e-7
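The training cell itself is missing from this extract. A sketch that follows directly from the two equations above (random initialisation of Theta, then repeated batch gradient steps):

Theta = np.random.randn(n_inputs, n_outputs)   # random initialisation

for iteration in range(n_iterations):
    logits = X_train.dot(Theta)
    Y_proba = softmax(logits)
    loss = -np.mean(np.sum(Y_train_one_hot * np.log(Y_proba + epsilon), axis=1))  # cross entropy
    error = Y_proba - Y_train_one_hot
    gradients = 1/m * X_train.T.dot(error)
    Theta = Theta - eta * gradients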
And that’s it! The Softmax model is trained. Let’s look at the model parameters:
[ ]: Theta
Let’s make predictions for the validation set and check the accuracy score:
[ ]: logits = X_valid.dot(Theta)
Y_proba = softmax(logits)
y_predict = np.argmax(Y_proba, axis=1)

accuracy_score = np.mean(y_predict == y_valid)  # fraction of validation instances classified correctly
accuracy_score
Well, this model looks pretty good. For the sake of the exercise, let’s add a bit of ℓ2 regularization.
The following training code is similar to the one above, but the loss now has an additional ℓ2
penalty, and the gradients have the proper additional term (note that we don’t regularize the first
element of Theta since this corresponds to the bias term). Also, let’s try increasing the learning
rate eta.
[ ]: eta = 0.1
n_iterations = 5001
m = len(X_train)
epsilon = 1e-7
alpha = 0.1 # regularization hyperparameter
Theta = np.random.randn(n_inputs, n_outputs)  # re-initialise before the regularised run (assumed)

for iteration in range(n_iterations):
    logits = X_train.dot(Theta)
    Y_proba = softmax(logits)
    xentropy_loss = -np.mean(np.sum(Y_train_one_hot * np.log(Y_proba + epsilon), axis=1))
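    # (the rest of this cell is not shown in the extract; presumably it adds the l2
    #  penalty to the loss and the matching term to the gradients, e.g.:)
    l2_loss = 1/2 * np.sum(np.square(Theta[1:]))
    loss = xentropy_loss + alpha * l2_loss
    error = Y_proba - Y_train_one_hot
    gradients = 1/m * X_train.T.dot(error) + np.r_[np.zeros([1, n_outputs]), alpha * Theta[1:]]
    Theta = Theta - eta * gradients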
Because of the additional ℓ2 penalty, the loss seems greater than earlier, but perhaps this model
will perform better? Let’s find out:
[ ]: logits = X_valid.dot(Theta)
Y_proba = softmax(logits)
y_predict = np.argmax(Y_proba, axis=1)
Cool, perfect accuracy! We probably just got lucky with this validation set, but still, it’s pleasant.
Now let’s add early stopping. For this we just need to measure the loss on the validation set at
every iteration and stop when the error starts growing.
[ ]: eta = 0.1
n_iterations = 5001
m = len(X_train)
epsilon = 1e-7
alpha = 0.1 # regularization hyperparameter
best_loss = np.infty
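# (the top of this training loop is not shown in the extract; presumably it re-initialises
#  Theta and computes the regularised training loss exactly as before:)
Theta = np.random.randn(n_inputs, n_outputs)

for iteration in range(n_iterations):
    logits = X_train.dot(Theta)
    Y_proba = softmax(logits)
    xentropy_loss = -np.mean(np.sum(Y_train_one_hot * np.log(Y_proba + epsilon), axis=1))
    l2_loss = 1/2 * np.sum(np.square(Theta[1:]))
    loss = xentropy_loss + alpha * l2_loss
    error = Y_proba - Y_train_one_hot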
    gradients = 1/m * X_train.T.dot(error) + np.r_[np.zeros([1, n_outputs]), alpha * Theta[1:]]
    Theta = Theta - eta * gradients

    logits = X_valid.dot(Theta)
    Y_proba = softmax(logits)
    xentropy_loss = -np.mean(np.sum(Y_valid_one_hot * np.log(Y_proba + epsilon), axis=1))
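    # (remainder of the loop not shown; presumably the regularised validation loss is
    #  compared with the best seen so far, and training stops once it starts to grow:)
    l2_loss = 1/2 * np.sum(np.square(Theta[1:]))
    loss = xentropy_loss + alpha * l2_loss
    if loss < best_loss:
        best_loss = loss
    else:
        print(iteration - 1, best_loss)
        print(iteration, loss, "early stopping!")
        break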
[ ]: logits = X_valid.dot(Theta)
Y_proba = softmax(logits)
y_predict = np.argmax(Y_proba, axis=1)
logits = X_new_with_bias.dot(Theta)
Y_proba = softmax(logits)
y_predict = np.argmax(Y_proba, axis=1)
plt.figure(figsize=(10, 4))
plt.plot(X[y==2, 0], X[y==2, 1], "g^", label="Iris virginica")
plt.plot(X[y==1, 0], X[y==1, 1], "bs", label="Iris versicolor")
plt.plot(X[y==0, 0], X[y==0, 1], "yo", label="Iris setosa")
from matplotlib.colors import ListedColormap
custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
And now let’s measure the final model’s accuracy on the test set:
[ ]: logits = X_test.dot(Theta)
Y_proba = softmax(logits)
y_predict = np.argmax(Y_proba, axis=1)
Our perfect model turns out to have slight imperfections. This variability is likely due to the very
small size of the dataset: depending on how you sample the training set, validation set and the test
set, you can get quite different results. Try changing the random seed and running the code again a few times; you will see that the results vary.