LinearModels Slides
LinearModels Slides
Linear Models
Linear Regression
Linear Regression
⎡ x1,p ⎤
⎢
⎢
x2,p ⎥
⎥
xp = ⎢ ⎥
⎢
⎢ ⋮ ⎥ ⎥
⎣ xN,p ⎦
Linear Regression
Linear Regression
⎡ 1 ⎤
⎢ x1,p ⎥
⎢ ⎥
⎢ ⎥
x̊p = ⎢ x2,p ⎥ , p = 1, . . . , P
⎢ ⎥
⎢ ⎥
⎢ ⋮ ⎥
⎣x ⎦
N,p
⎡ w0 ⎤
⎢ w1 ⎥
⎢
⎢ ⎥
⎥
w = ⎢ w2 ⎥
⎢
⎢ ⎥
⎥
⎢ ⋮ ⎥
⎣w ⎦
N
x̊Tp w ≈ yp p = 1, . . . , P .
For a given set of parameters w this cost function computes the total
squared error between the associated hyperplane and the data, giving
a good measure of how well the particular linear model fits the
dataset.
The best fitting hyperplane is the one whose parameters minimize this
error.
We want to find a weight vector w so that for any data point (xp , yp ):
x̊Tp w ≈ yp
By squaring the error x̊Tp w − yp (so that both negative and positive
errors of the same magnitude are treated equally), we can define:
2
gp (w) = (x̊Tp w − yp )
as a point-wise cost function that measures the error of a model (in this
case a linear model) on the individual point (xp , yp ).
We want all P such values to be small. We can take their average over
the entire dataset, forming the Least Squares cost function for linear
regression:
1 P 1 P 2
g (w) = ∑ gp (w) = ∑ (x̊Tp w − yp )
P p=1 P p=1
1 P 2
minimize ∑ (x̊Tp w − yp )
w P p=1
Example: Visually verifying the convexity of the cost function for a toy dataset
Consider a toy dataset of 50 random points selected off the line y = x with
a small amount of Gaussian noise added to each point.
In [7]:
demo = linear_regression_visualizer(data)
demo.plot_data()
In [8]:
# compute linear combination of input points
def model(x,w):
a = w[0] + np.dot(x.T,w[1:])
return a.T
The contour plot and the surface generated by the Least Squares cost
function:
In [9]:
static_visualizer().two_input_surface_contour_plot(least_squares,[],view = [10,70],xmin = -4.5, xmax = 4.5, ymin = -4.5, ymax = 4.5,num_contours = 30)
The upward bending shape of the cost function's surface and the elliptical
shape of its contour lines show that the Least Squares cost function is
indeed convex for linear regression models.
The Least Squares cost function for linear regression is always convex
regardless of the input dataset.
Important issues with first and second order methods still need to be
considered:
1. How should we choose a steplength / learning rate for gradient
descent?
2. Newton's method can only be applied when N is of moderate value
(e.g., in the thousands).
The gradient of the Least Squares cost function can be computed as:
2 P
∇g (w) = ∑ x̊ (x̊T w − yp )
P p=1 p p
p=1 p=1
We use gradient descent to minimize the Least Squares cost over the
previous toy dataset.
We employ a fixed steplength value α = 0.5 for all 75 steps until
reaching the minimum of the function.
In [12]:
demo.animate_it_2d(video_path_1,weight_history,num_contours = 30,fps=20)
In [13]:
show_video(video_path_1, width=800)
Out[13]:
For N = 1 (left), the bottom step is the region of the space containing
class 0, i.e., yp = 0. The top step contains class 1, i. e. , yp = +1.
where
⎡ w0 ⎤ ⎡ 1 ⎤
⎢
⎢ 1
w ⎥
⎥ ⎢
⎢ 1
x ⎥
⎥
⎢
w = ⎢ w2 ⎥ ⎢ ⎥.
⎢ ⎥
⎥
and x̊ = ⎢ x2
⎢ ⎥
⎥
⎢
⎢ ⋮ ⎥
⎥ ⎢
⎢ ⋮ ⎥
⎥
⎣w ⎦ ⎣x ⎦
N N
step(t) = {
1 if t ≥ 0
.
0 if t < 0
step (x̊ w )
T
with a linear decision boundary between its lower and upper steps,
defined by all points x̊ where x̊ w = 0.
T
We want the point (xp , yp ) to lie on the correct side of the optimal
decision boundary, i.e., the output yp to lie on the proper step:
step (x̊Tp w ) = yp .
1 P 2
g(w) = ∑ (step (x̊Tp w ) − yp )
P p=1
1 P 2
g(w) = ∑ (step (x̊Tp w ) − yp )
P p=1
The left figure shows that the Least Squares surface consists of
discrete steps at many different levels, and each step is completely flat.
Because of this, local optimization methods cannot be used to
effectively minimize it.
σ (x̊Tp w ) = yp p = 1, . . , P
1 P 2
g (w) = ∑ (σ (x̊Tp w ) − yp )
P p=1
In [24]:
demo.plot_costs(viewmax = 25, view = [21,121])
⎧
⎪ −log (σ (x̊Tp w )) if yp = 1
gp (w) = ⎨
⎩ −log (1 − σ (x̊Tp w )) if yp = 0.
⎪
Since our label values yp ∈ {0, 1} we can write the log error equivalently
in a single line:
1 P 1 P
g (w) = ∑ gp (w) = − ∑ yp log (σ (x̊Tp w )) + (1 − yp ) log (1 − σ (x̊Tp w ))
P p=1 P p=1
This log error cost penalizes violations of our desired equalities much
more harshly than a squared error does:
In [28]:
fig = plt.figure(figsize = (9,3))
y = 1
alpha = np.linspace(0.5,.999,100)
least_squares = (y-alpha)**2
plt.plot(alpha, least_squares, 'b--')
log_error = -np.log(alpha)
plt.plot(alpha, log_error, 'r--')
y = 0
alpha = np.linspace(0.001,.5,100)
least_squares = (y-alpha)**2
plt.plot(alpha, least_squares, color='b', label='squared error')
log_error = -np.log(1-alpha)
plt.plot(alpha, log_error, color='r', label='log error')
plt.legend()
plt.xlabel('sigma',fontsize=14)
plt.ylabel('g_p', fontsize=14, rotation=0, labelpad=30)
plt.show()
In [30]:
# define sigmoid function
def sigmoid(t):
return 1/(1 + np.exp(-t))
# compute cross-entropy
return cost/y.size
1 P
minimize − ∑ yp log (σ (x̊Tp w )) + (1 − yp ) log (1 − σ (x̊Tp w ))
w P p=1
To minimize the Cross Entropy cost, we can use any local optimization
method.
1 P
∇g (w) = − ∑ (yp − σ (x̊Tp w )) x̊p
P p=1
1 P
∇2 g (w) = ∑ σ (x̊Tp w ) (1 − σ (x̊Tp w )) x̊p x̊Tp .
P p=1
Out[33]:
The plotted surface of the Cross Entropy cost function looks convex.
Indeed, unlike the Least Squares, the Cross Entropy cost is always
convex regardless of the dataset used.
For this reason, the Cross Entropy cost is often used to perform logistic
regression.
Instead of our data sitting on a step function with lower and upper
steps taking on the values 0 and 1, respectively, they take on values -1
and +1.
sign(x) = {
+1 if x ≥ 0
.
−1 if x < 0
sign (x̊ w )
T
with a linear decision boundary between its two steps defined by all points
x̊ where x̊ w = 0.
T T
sign (x̊Tp w ) ≈ yp
The sigmoid function σ(⋅) ranges smoothly between 0 and 1. The tanh(⋅)
ranges smoothly between -1 and +1.
1 P 2
g(w) = ∑ (tanh (x̊Tp w ) − yp )
P p=1
tanh(x) + 1
σ (x) = .
2
⎧
⎪ −log (σ (x̊Tp w )) if yp = +1
gp (w) = ⎨
⎩ −log (1 − σ (x̊Tp w )) if yp = −1.
⎪
We have:
1 1 + e−x 1 e−x 1
1 − σ (x) = 1 − = − = = = σ(−x)
1+e −x 1+e −x 1+e −x 1+e −x 1 + ex
Then:
⎧
⎪ −log (σ (x̊Tp w )) if yp = +1
gp (w) = ⎨
⎩ −log (σ (−x̊Tp w )) if yp = −1.
⎪
Because we are using the label values ±1 we can move the label value in
each case inside the inner most paraenthesis, and we can write both
cases in a single line as
−yp x̊Tp w
1+e
Taking the average of this point-wise cost over all P points we have the
Softmax cost for logistic regression:
1 P 1 P
∑ gp (w) = ∑ log (1 + e−yp x̊p w )
T
g(w) =
P p=1 P p=1
In [40]:
# define sigmoid function
def sigmoid(t):
return 1/(1 + np.exp(-t))
# compute cross-entropy
return cost/y.size
In [41]:
# the convex softmax cost function
def softmax(w):
cost = np.sum(np.log(1 + np.exp(-y*model(x,w))))
return cost/float(np.size(y))
1 P −yp x̊Tp w
∇g (w) = − ∑ e y x̊
P p=1 1 + e−yp x̊Tp w p p
In [44]:
animator.static_fig(weight_history,num_contours = 25,viewmax = 12)
There are 03 points in the example that look like they are on the wrong
side.
Note: in the classification context a 'noisy' point is one that has an
incorrect label. Such points are often misclassified by a trained
classifier, meaning that their true label value will not be correctly
predicted.
Two class classification datasets typically have noise of this kind and
are not often perfectly separable by a hyperplane.
In [50]:
demo.static_fig(weight_history[-1],view = [15,-140])
The Perceptron
The Perceptron
We treat classification as a particular form of non-linear regression
(e.g., employing a tanh nonlinearity for classification data with label
values yp ∈ {−1, +1}).
In the simplest instance our two classes of data are largely separated
by a linear decision boundary given by the collection of input x where
x̊T w = 0 with each class (largely) lying on either side.
The linear decision boundary cuts the input space into two half-spaces,
one lying "above" the hyperplane where x̊T w > 0, and one lying "below"
it where x̊T w < 0.
x̊Tp w > 0 if yp = +1
x̊Tp w < 0 if yp = −1.
The expression max (0, −yp x̊Tp w ) is always nonnegative, since it returns
zero if xp is classified correctly, and returns a positive value if the point
is classified incorrectly.
The functional form of this point-wise cost max (0, ⋅) is called a rectified
linear unit.
Because these point-wise costs are nonnegative and equal zero when
our weights are tuned correctly, we can take their average over the
entire dataset to form a proper cost function as:
1 P 1 P
g (w) = ∑ gp (w) = ∑ max (0, −yp x̊Tp w ) .
P p=1 P p=1
This cost function goes by many names such as the perceptron cost,
the rectified linear unit cost (or ReLU cost for short), and the hinge cost
(since when plotted a ReLU function looks like a hinge). This cost
function is always convex but has only a single (discontinuous)
derivative in each input dimension.
This implies that we can only use zero and first order local optimization
schemes (i.e., not Newton's method).
When dealing with a cost function that has some deficit, we replace it
with a smooth (or at least twice differentiable) cost function that
closely matches it everywhere.
If the approximation closely matches the true cost function, then for a
small amount of accuracy, we considerably broaden the set of
optimization tools we can use.
We replace the max function portion of the Perceptron cost with the
Softmax function defined as:
Example: when C = 2:
Suppose s0 ≤ s1 , so that max (s0 , s1 ) = s1 .
max (s0 , s1 ) = s0 + (s1 − s0 ) = log (es0 ) + log (es1 −s0 )
soft (s0 , s1 ) = log (es0 + es1 ) = log (es0 ) + log (1 + es1 −s0 )
soft (s0 , s1 ) is always larger than max (s0 , s1 ) but not by much.
gp (w) = soft (0, −yp x̊Tp w ) = log (e0 + e−yp x̊p w ) = log (1 + e−yp x̊p w )
T T
p=1 p=1
# load in dataset
data = np.loadtxt(data_path_1, delimiter = ',')
In [53]:
animator.static_fig(weight_history,num_contours = 25,viewmax = 25)
1 P
g (w0 ) = ∑ max (0, −yp x̊Tp w0 ) = 0.
P p=1
Any local optimization algorithm will halt immediately. This will not be
the case if we use the Softmax cost instead of the Perceptron cost.
p=1
With data that is indeed linearly separable, the Softmax cost achieves
this lower bound only only when the magnitude of the weights grows
to infinity.
P
g (w0 ) = ∑ log (1 + e−yp x̊p w ) > 0.
T 0
p=1
This decreases the Softmax cost as well with the minimum achieved as
C ⟶ ∞. However, regardless of the scalar C > 1, the decision boundary
defined by the initial weights x̊Tp w0 = 0 does not change location.
If we simply flip one of the labels - making this dataset not perfectly
linearly separable - the corresponding cost function does not have a
global minimum out at infinity, as illustrated in the contour plot below.
In [54]:
# switch a label
y[0,-1] = -1
data = np.vstack((x,y))
⎡ w0 ⎤
⎢ w1 ⎥
w=⎢
⎢
⎥.
⎥
⎢ ⋮ ⎥
⎣w ⎦
N
Take the difference between the decision boundary and its translation
evaluated at x′p and xp :
Since both formulae are equal to (x′p − xp )T ω we can set them equal to
each other:
d ∥ω∥2 = β
β b + xTp ω
d= = .
∥ω∥2 ∥ω∥2
If xp were to lie below the decision boundary and β < 0, we have the
same derivation.
T
b + x ω = b + xT ω = 0
∥ω∥2 ∥ω∥2 ∥ω∥2
Regardless of how large our weights w were to begin with we can always
normalize them in a consistent way by dividing off the magnitude of ω.
1 P
∑ log (1 + e−yp (b+xp ω ) )
T
minimize
b, ω P p=1
2
subject to ∥ω∥2 = 1
1 P
∑ log (1 + e−yp (b+xp ω ) ) + λ ∥ω∥22
T
g (b, ω) =
P p=1
The Margin-Perceptron
The translations above and below the separating hyperplane are more
generally defined as x̊T w = +β and x̊T w = −β respectively, where β > 0.
x̊Tp w ≥ 1 if yp = +1
x̊Tp w ≤ −1 if yp = −1
The additional 1 prevents the issue of a trivial zero solution with the
original Perceptron cost.
If the data is not linearly separable, a violation for the pth point adds the
positive value of 1 − yp x̊Tp w to the cost function.
The '1' used in the 1 − yp (x̊Tp w ) component of the cost could have been
any number we wanted - it was a normalization factor for the width of
the margin and, by convention, we used '1'.
While both perfectly distinguish between the two classes the green
separator (with smaller margin) divides up the space in a rather
awkward fashion given how the data is distributed, and will therefore
tend to more easily misclassify future datapoints.
On the other hand, the black separator (having a larger margin) divides
up the space more evenly with respect to the given data, and will tend
to classify future points more accurately.
∥x1 − x2 ∥2 .
Using the innner-product rule, and the fact that the two vectors x1 − x2
and ω are parallel to each other, we have:
2
∥x1 − x2 ∥2 =
∥ω∥2
There are infinitely many linear decision boundaries that separate the
two classes. Any of these can be found by Margin-Perceptron.
The SVM decision boundary is the one that provides the maximum
margin.
In [60]:
demo5.svm_comparison_fig()
In the right panel, the translates of the decision boundary pass through
points from both classes - equidistant from the SVM linear decision
boundary.
These points are called support vectors, hence the name Support
Vector Machines.
Thus with many datasets in practice the softmargin problem does not
provide a solution remarkably different than the perceptron or even
logistic regression.
Actually - with datasets that are not linearly separable - it often returns
exactly the same solution provided by the perceptron or logistic
regression.
p=1
p=1
As with the two class case we can in theory use any C distinct labels to
distinguish between the classes.
In [64]:
demo1.show_dataset()
With the cth two-class subproblem we simply assign temporary labels y~p
to the entire dataset, giving +1 labels to the cth class and −1 labels to
the remainder of the dataset
+1 if yp = c
y~p = { (2)
−1 if yp ≠ c
where again yp is the original label for the pth point, and run the two-class
classification scheme of our choice.
In [65]:
# solve the 2-class subproblems
demo1.solve_2class_subproblems()
With OvA we learn C two-class classifiers, and we can denote the weights
from the cth classifier as wc where
⎡ w0,c ⎤
⎢
⎢ 1,c
w ⎥
⎥
⎢ ⎥
wc = ⎢
⎢
⎥
⎥
⎢ ⎥
w2,c
⎢ ⎥
⎢ ⋮ ⎥
⎣w ⎦
N,c
In [66]:
demo1.show_fusion(region = 1)
All points in our toy dataset lie on the positive side of a single
classifier. These points are colored to match their respective classifier.
We repeat this logic for every point in the regions where two or more
classifiers are positive.
For points that are equidistant to two or more decision boundaries, we
assign a class label at random.
In [72]:
demo1.point_and_projection(point1 = [0.4,0.5] ,point2 = [0.45,0.45])
In [73]:
demo1.point_and_projection(point1 = [0.4,0.5] ,point2 = [0.45,0.45])
In [74]:
demo1.show_fusion(region = 3)
∥ω j ∥2
6: end for
7: To assign label y to a point x, apply the fusion rule: y =
T
argmax x̊ wc
c = 0,...,C−1
In [75]:
demo1.show_complete_coloring();
We have seen how the fusion rule defined class ownership for every
point x in the input space. With all two-class classifiers properly tuned,
ideally, we would like the fusion rule to hold true for as many points as
possible:
yp = argmax x̊Tp wj
j = 0,...,C−1
i.e., the signed distance from the point to its class decision boundary is
greater than (or equal to) its distances to every other two-class decision
boundary.
If our weights w0 , . . . , wC−1 are set ideally, gp (w0 , . . . , wC−1 ) should be zero
for as many points as possible.
We can now form a cost function by taking the average of the point-
wise cost over the entire dataset:
1 P
g (w0 , . . . , wC−1 ) = ∑ [( max x̊Tp wj ) − x̊Tp wyp ]
P p=1 j = 0,...,C−1
1 P
g (w0 , . . . , wC−1 ) = ∑ [( max x̊Tp wj ) − x̊Tp wyp ]
P p=1 j = 0,...,C−1
In [78]:
# compute C linear combinations of input point, one per classifier
def model(x,w):
a = w[0] + np.dot(x.T,w[1:])
return a.T
In [79]:
def multiclass_perceptron(w):
# pre-compute predictions on all points
all_evals = model(x,w)
# return average
return cost/float(np.size(y))
In [80]:
# load in dataset
data = np.loadtxt(dataset_path_1,delimiter = ',')
# visualize dataset
demo.show_dataset()
In [81]:
demo.show_complete_coloring(weight_history, cost = multiclass_perceptron)
In the left figure, because we did not train each individual two-class
classifier in a One-versus-All manner, each individual learned two-class
classifier peforms quite poorly in separating its class from the rest of
the data.
In the right figure, we show the fused multi-class decision boundary
formed by combining these individual One-versus-All boundaries via
the fusion rule. The final multi-class decision boundary achieves perfect
classification.
1 P
g (w0 , . . . , wC−1 ) = ∑ [( max x̊Tp wj ) − x̊Tp wyp ]
P p=1 j = 0,...,C−1
1 P
g (w0 , . . . , wC−1 ) = ∑ max (0, x̊Tp (wj − wyp )) . (4)
P p=1 j = 0,...,C−1
j≠yp
⎡ w1,c ⎤
⎢ w2,c ⎥
ωj = ⎢
⎢ ⎥
⎥.
⎢
⎢ ⋮ ⎥
(bias): bc = w0,c (feature-touching weights): (5)
⎥
⎣ wN,c ⎦
1 P
minimize ∑ [( max bj + xTp ωj ) − (byp + xTp ωyp )]
b0 , ω0 , ..., bC−1 , ωC−1 P p=1 j = 0,...,C−1
2
subject to ∥
∥ω j ∥
∥2 = 1, j = 0, . . . , C − 1
1 P
minimize ∑ [( max bj + xTp ωj ) − (byp + xTp ωyp )]
b0 , ω0 , ..., bC−1 , ωC−1 P p=1 j = 0,...,C−1
2
subject to ∥
∥ω j ∥
∥2 = 1, j = 0, . . . , C − 1
In [94]:
lam = 10**-5 # our regularization paramter
def multiclass_perceptron(w):
# pre-compute predictions on all points
all_evals = model(x,w)
# add regularizer
cost = cost + lam*np.linalg.norm(w[1:,:],'fro')**2
# return average
return cost/float(np.size(y))
In [95]:
# load in dataset
data = np.loadtxt(dataset_path_1,delimiter = ',')
# visualize dataset
demo.show_dataset()
In [96]:
demo.show_complete_coloring(weight_history, cost = multiclass_perceptron)
1 P
g (w0 , . . . , wC−1 ) = ∑ [( max x̊Tp wj ) − x̊Tp wyp ]
P p=1 j = 0,...,C−1
The multi-class Softmax cost function is convex, and (unlike the multi-
class Perceptron), it also has infinitely many smooth derivaties,
enabling us to use second-order methods in order to properly
minimize it.
It no longer has a trivial solution at zero.
In [107]:
def multiclass_softmax(w):
# pre-compute predictions on all points
all_evals = model(x,w)
# return average
return cost/float(np.size(y))
In [108]:
# load in dataset
data = np.loadtxt(dataset_path_2,delimiter = ',')
# visualize dataset
demo.show_dataset()
In [109]:
demo.show_complete_coloring(weight_history, cost = multiclass_softmax)
The p-th summand of the multi-class Softmax cost can be written as:
⎛ ∑C−1 ex̊Tp wj ⎞
log (∑ e )− = log (∑ e ) − log e = log ⎜ j=0T ⎟.
C−1 C−1
x̊Tp wj x̊Tp wj x̊Tp wyp
x̊Tp wyp
j=0 c=0 ⎝ ex̊p wyp ⎠
Altogether we have:
⎛ ∑C−1 ex̊p wj ⎞
T
Or we can have:
⎛ ∑C−1 ex̊p wj ⎞ ⎛ ⎞
T
In [122]:
# compute C linear combinations of input point, one per classifier
def model(x,w):
a = w[0] + np.dot(x.T,w[1:])
return a.T
In [123]:
# multiclass softmaax regularized by the summed length of all normal vectors
lam = 10**(-5) # our regularization paramter
def multiclass_softmax(w):
# pre-compute predictions on all points
all_evals = model(x,w)
# add regularizer
cost = cost + lam*np.linalg.norm(w[1:,:],'fro')**2
# return average
return cost/float(np.size(y))
In [124]:
# load in dataset
data = np.loadtxt(dataset_path_1,delimiter = ',')
In [125]:
# plot classification of space, individual learned classifiers (left panel) and joint boundary (middle panel), and cost-function panel in the right pan
demo.show_complete_coloring(weight_history, cost = multiclass_softmax)
In [132]:
# load in dataset
data = np.loadtxt(dataset_path_1,delimiter = ',')
In [133]:
demo.show_complete_coloring(weight_history, cost = multiclass_softmax)
In [ ]: