
Linear Models


Linear Regression


Linear Regression

Data for regression problems comes in the form of a set of P input/output observation pairs

$$(x_1, y_1),\,(x_2, y_2),\,\ldots,\,(x_P, y_P)$$

or $\{(x_p, y_p)\}_{p=1}^{P}$ for short, where $x_p$ and $y_p$ denote the $p$th input and output respectively.

In general, each input xp is a column vector of length N :

$$x_p = \begin{bmatrix} x_{1,p} \\ x_{2,p} \\ \vdots \\ x_{N,p} \end{bmatrix}$$

and each output yp is scalar-valued.

Geometrically, the linear regression problem is analogous to fitting a hyperplane to a scatter of points in (N + 1)-dimensional space.


Linear Regression

When N = 1 the inputs are also scalar-valued, and linear regression simplifies to fitting a line to the associated scatter of P data points in two-dimensional space.

A line in 2D is determined by two parameters: a vertical intercept $w_0$ and a slope $w_1$. We need to determine their values so that the following approximate linear relationship holds between the input/output data:

$$w_0 + x_p w_1 \approx y_p, \quad p = 1, \ldots, P.$$


Linear Regression

In general, when dealing with N-dimensional inputs, we have a bias weight and N associated slope weights to tune properly in order to fit a hyperplane, with the linear relationship written as:

$$w_0 + x_{1,p} w_1 + x_{2,p} w_2 + \cdots + x_{N,p} w_N \approx y_p, \quad p = 1, \ldots, P.$$

Because each dimension of the input is referred to as a feature or input


feature, we will often refer to w1 , w2 , . . . , wN as the feature-touching
weights. The only weight not touching a feature is the bias w0 .


We can use the notation $\mathring{x}_p$ to denote an input $x_p$ with a 1 placed on top of it:

$$\mathring{x}_p = \begin{bmatrix} 1 \\ x_{1,p} \\ x_{2,p} \\ \vdots \\ x_{N,p} \end{bmatrix}, \quad p = 1, \ldots, P$$

Now we can place all parameters in a single column vector w:

$$w = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_N \end{bmatrix}$$

We can now write the linear relationship more compactly as

$$\mathring{x}_p^T w \approx y_p, \quad p = 1, \ldots, P.$$
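As a concrete illustration (a minimal numpy sketch, not part of the original slides; the (N, P) input layout and (N+1, 1) weight vector are assumptions matching the helper functions used later in the notebook), we can stack a 1 on top of every input and evaluate all P inner products at once:

import numpy as np

N, P = 2, 5                          # toy sizes: N features, P points
x = np.random.randn(N, P)            # inputs stored column-wise, one column per point
w = np.random.randn(N + 1, 1)        # parameters [w_0, w_1, ..., w_N]^T

x_ring = np.vstack((np.ones((1, P)), x))   # place a 1 on top of every x_p
y_hat = x_ring.T @ w                       # row p holds x̊_p^T w, shape (P, 1)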


The Least Squares cost function

To find the parameters of the hyperplane which best fits a regression


dataset, we must form a cost function that measures how well a linear
model with a particular choice of parameters fits the regression data.

A popular choice is the Least Squares cost function.

For a given set of parameters w this cost function computes the total
squared error between the associated hyperplane and the data, giving
a good measure of how well the particular linear model fits the
dataset.


The best fitting hyperplane is the one whose parameters minimize this
error.

We want to find a weight vector w so that for any data point $(x_p, y_p)$:

$$\mathring{x}_p^T w \approx y_p$$

i.e., the approximation is as close as possible to the output. Or, equivalently, so that the error $\mathring{x}_p^T w - y_p$ is as small as possible.


The Least Squares cost function

By squaring the error $\mathring{x}_p^T w - y_p$ (so that both negative and positive errors of the same magnitude are treated equally), we can define

$$g_p(w) = \left(\mathring{x}_p^T w - y_p\right)^2$$

as a point-wise cost function that measures the error of a model (in this case a linear model) on the individual point $(x_p, y_p)$.

We want all P such values to be small. We can take their average over
the entire dataset, forming the Least Squares cost function for linear
regression:

$$g(w) = \frac{1}{P}\sum_{p=1}^{P} g_p(w) = \frac{1}{P}\sum_{p=1}^{P}\left(\mathring{x}_p^T w - y_p\right)^2$$

We want to find the optimal parameter vector w that minimizes g(w):

$$\underset{w}{\text{minimize}}\;\; \frac{1}{P}\sum_{p=1}^{P}\left(\mathring{x}_p^T w - y_p\right)^2$$


Minimization of the Least Squares cost function

Example: Visually verifying the convexity of the cost function for a toy dataset

Consider a toy dataset of 50 random points selected off the line y = x with
a small amount of Gaussian noise added to each point.
In [7]:
demo = linear_regression_visualizer(data)
demo.plot_data()


In [8]:
# compute linear combination of input points
def model(x,w):
    a = w[0] + np.dot(x.T,w[1:])
    return a.T

# an implementation of the least squares cost function for linear regression
def least_squares(w):
    # compute the least squares cost
    cost = np.sum((model(x,w) - y)**2)
    return cost/float(y.size)


The contour plot and the surface generated by the Least Squares cost
function:
In [9]:
static_visualizer().two_input_surface_contour_plot(least_squares,[],view = [10,70],xmin = -4.5, xmax = 4.5, ymin = -4.5, ymax = 4.5,num_contours = 30)

The upward bending shape of the cost function's surface and the elliptical
shape of its contour lines show that the Least Squares cost function is
indeed convex for linear regression models.


Minimization of the Least Squares cost function

The Least Squares cost function for linear regression is always convex
regardless of the input dataset.

We can easily apply first or second order methods to minimize the


Least Squares cost function for linear regression.

Important issues with first and second order methods still need to be
considered:
1. How should we choose a steplength / learning rate for gradient
descent?
2. Newton's method can only be applied when N is of moderate value
(e.g., in the thousands).


Minimization of the Least Squares cost function

The gradient of the Least Squares cost function can be computed as:

$$\nabla g(w) = \frac{2}{P}\sum_{p=1}^{P} \mathring{x}_p\left(\mathring{x}_p^T w - y_p\right)$$
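This gradient can be evaluated directly with numpy. The following is a minimal sketch (an illustration, not code from the slides), assuming x_ring holds the inputs with a row of ones prepended, shape (N+1, P), and y has shape (1, P):

def least_squares_grad(w, x_ring, y):
    # residuals x̊_p^T w - y_p for all points, collected as a (1, P) row
    r = w.T @ x_ring - y
    # gradient: (2/P) * sum_p x̊_p (x̊_p^T w - y_p), shape (N+1, 1)
    return (2.0 / y.size) * (x_ring @ r.T)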

If we use Newton's method, we can compute the Hessian of the Least


Squares cost function. Because the Least Squares cost function for
linear regression is a convex quadratic, only a single Newton step can
completely minimize it.

This single Newton step solution is often referred to as minimizing the Least Squares cost via its normal equations. The system of equations solved in taking this single Newton step is equivalent to:

$$\left(\sum_{p=1}^{P} \mathring{x}_p \mathring{x}_p^T\right) w = \sum_{p=1}^{P} \mathring{x}_p y_p.$$
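As an illustrative sketch (same assumed array shapes as above, not code from the slides), this linear system can be formed and solved directly:

def normal_equations(x_ring, y):
    # A = sum_p x̊_p x̊_p^T  and  b = sum_p x̊_p y_p
    A = x_ring @ x_ring.T          # shape (N+1, N+1)
    b = x_ring @ y.T               # shape (N+1, 1)
    # solve A w = b for the Least Squares weights
    return np.linalg.solve(A, b)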


Example: Using gradient descent

We use gradient descent to minimize the Least Squares cost over the
previous toy dataset.
We employ a fixed steplength value α = 0.5 for all 75 steps until
reaching the minimum of the function.
In [12]:
demo.animate_it_2d(video_path_1,weight_history,num_contours = 30,fps=20)
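The weight_history passed to the animation above comes from a gradient descent routine that is not shown in these slides. A minimal sketch of such a routine (an assumption about its behavior; the actual helper also records a cost history and likely differentiates the cost automatically, e.g., via autograd) is:

def gradient_descent(grad, alpha, max_its, w):
    # fixed-steplength gradient descent, recording the weights at every step
    weight_history = [w]
    for _ in range(max_its):
        w = w - alpha * grad(w)
        weight_history.append(w)
    return weight_history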


In [13]:
show_video(video_path_1, width=800)

Out[13]:

As gradient descent approaches the minimum of the cost function, the corresponding parameters provide a better and better fit to the data. The best fit occurs at the end of the run, at the point closest to the Least Squares cost's minimizer.


When we use a local optimization method like gradient descent, we


must properly tune the steplength parameter α.

We compare two steplength values α = 0.01 and α = 0.5.


In [15]:
static_visualizer().plot_cost_histories([cost_history_1,cost_history_2],start = 0,points = False,labels = [r'$\alpha = 0.5$',r'$\alpha = 0.01$'])

The steplength parameter is often called the learning rate in machine


learning, because this value determines how quickly the proper
parameters of the model are learned.


Linear Two-Class Classification


Linear Two-Class Classification

Two-class classification is a particular instance of regression or surface-


fitting wherein the data still comes in the form of P input/output pairs
{(xp , yp )}Pp=1 , and each input xp is an N -dimensional vector.

However, the corresponding output yp is no longer continuous but


takes on only two discrete numbers.

While the actual value of these numbers is in principle arbitrary, we suppose that the output of our data takes on either the value 0 or +1, i.e., $y_p \in \{0, +1\}$.

In the context of classification, the output values yp are called labels,


and all the points sharing the same label value are referred to as a class
of data.

A dataset containing points with label values yp ∈ {0, +1} is said to be a


dataset consisting of two classes.


For N = 1 (left), the bottom step is the region of the space containing class 0, i.e., $y_p = 0$. The top step contains class 1, i.e., $y_p = +1$.

Regression perspective: Two-class classification can be viewed as a case of nonlinear regression in which we fit a nonlinear step function to the data.


Perceptron perspective: We represent the dataset using its inputs only: blue for points with $y_p = 0$ and red for points with $y_p = +1$. The edge separating the two steps (and the data points on them), when projected onto the input space, takes the form of a single point (when N = 1).

When N = 2 (right), the separating edge is a line.

For general N, what separates the two classes of data will be a hyperplane, which is also called a decision boundary.


Fitting a discontinuous step function

We fit a line to a two-class dataset via linear regression. The line is a poor representation of this data because the data's output takes on just two discrete values.
In [19]:
demo1.naive_fitting_demo()

We pass this fully tuned linear regressor through a discrete step function by assigning the label +1 to all output values > 0.5 and the label 0 to all output values < 0.5. The resulting step function is still not a good representation of the data.


We can write a linear model of an N-dimensional input as:

$$\mathring{x}^T w = w_0 + x_1 w_1 + x_2 w_2 + \cdots + x_N w_N,$$

where

$$w = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_N \end{bmatrix} \quad \text{and} \quad \mathring{x} = \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix}.$$

The step function is defined as:

$$\text{step}(t) = \begin{cases} 1 & \text{if } t \ge 0 \\ 0 & \text{if } t < 0. \end{cases}$$

Inserting our linear model through this gives us a step function

$$\text{step}\left(\mathring{x}^T w\right)$$

with a linear decision boundary between its lower and upper steps, defined by all points $\mathring{x}$ where $\mathring{x}^T w = 0$.


To tune the weight vector w properly, we can set up a Least Squares


cost function that reflects the input and output of our dataset.

We want the point (xp , yp ) to lie on the correct side of the optimal
decision boundary, i.e., the output yp to lie on the proper step:

step (x̊Tp w ) = yp .

To find weights that satisfy this set of P equalities, we form a Least Squares cost function as:

$$g(w) = \frac{1}{P}\sum_{p=1}^{P}\left(\text{step}\left(\mathring{x}_p^T w\right) - y_p\right)^2$$

Our ideal weights then correspond to a minimizer of this cost function.


$$g(w) = \frac{1}{P}\sum_{p=1}^{P}\left(\text{step}\left(\mathring{x}_p^T w\right) - y_p\right)^2$$

It is very difficult to properly minimize the above Least Squares cost


function using local optimization methods. At virtually every point, the
function is completely flat locally.
In [21]:
demo2 = cost_visualizer(data)
demo2.plot_costs(viewmax = 25,view = [20,125])

The left figure shows that the Least Squares surface consists of
discrete steps at many different levels, and each step is completely flat.
Because of this, local optimization methods cannot be used to
effectively minimize it.


The logistic sigmoid function

To make the minimization of the Least Squares cost possible, we can


replace the step function with a continuous approximation that
matches it closely.

Such an approximation is the logistic sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$


Logistic regression using the Least Squares cost

Replacing the step function in $\text{step}\left(\mathring{x}_p^T w\right)$ with its sigmoid approximation gives the related set of approximate equalities:

$$\sigma\left(\mathring{x}_p^T w\right) = y_p, \quad p = 1, \ldots, P$$

The corresponding Least Squares cost function becomes:

$$g(w) = \frac{1}{P}\sum_{p=1}^{P}\left(\sigma\left(\mathring{x}_p^T w\right) - y_p\right)^2$$

Fitting a logistic sigmoid to classification data by minimizing this cost


function is referred to as performing logistic regression because we
are performing regression using a logistic function.


Using normalized gradient descent for Least Squares logistic regression


In [22]:
# define sigmoid function
def sigmoid(t):
    return 1/(1 + np.exp(-t))

# sigmoid non-convex logistic least squares cost function
def sigmoid_least_squares(w):
    cost = 0
    for p in range(y.size):
        x_p = x[:,p]
        y_p = y[:,p]
        cost += (sigmoid(w[0] + w[1]*x_p) - y_p)**2
    return cost/y.size

In [24]:
demo.plot_costs(viewmax = 25, view = [21,121])


The resulting cost function is non-convex, but it can still be minimized


using local optimization methods.
Although this cost function is very flat in many places, normalized
gradient descent is designed specifically to deal with costs like this.
In [27]:
demo2.static_fig(w_hist,num_contours = 25,viewmax = 31)

Initialize at the point w0 = −w1 = 20 and run normalized gradient descent


for 900 iterations with α = 1.


Logistic Regression using the Cross Entropy cost

Instead of using the squared error point-wise cost $g_p(w) = \left(\sigma\left(\mathring{x}_p^T w\right) - y_p\right)^2$, we could use the log error cost defined as:

$$g_p(w) = \begin{cases} -\log\left(\sigma\left(\mathring{x}_p^T w\right)\right) & \text{if } y_p = 1 \\ -\log\left(1 - \sigma\left(\mathring{x}_p^T w\right)\right) & \text{if } y_p = 0. \end{cases}$$

This log error point-wise cost is always non-negative (regardless of the


input and weight values) with a minimum value of 0.

Since our label values yp ∈ {0, 1} we can write the log error equivalently
in a single line:

gp (w) = −yp log (σ (x̊Tp w )) − (1 − yp ) log (1 − σ (x̊Tp w )) .

The overall cost function over all P data points is:

$$g(w) = \frac{1}{P}\sum_{p=1}^{P} g_p(w) = -\frac{1}{P}\sum_{p=1}^{P}\left[\, y_p \log\left(\sigma\left(\mathring{x}_p^T w\right)\right) + (1 - y_p)\log\left(1 - \sigma\left(\mathring{x}_p^T w\right)\right)\right]$$


This log error cost penalizes violations of our desired equalities much
more harshly than a squared error does:
In [28]:
fig = plt.figure(figsize = (9,3))
y = 1
alpha = np.linspace(0.5,.999,100)
least_squares = (y-alpha)**2
plt.plot(alpha, least_squares, 'b--')
log_error = -np.log(alpha)
plt.plot(alpha, log_error, 'r--')
y = 0
alpha = np.linspace(0.001,.5,100)
least_squares = (y-alpha)**2
plt.plot(alpha, least_squares, color='b', label='squared error')
log_error = -np.log(1-alpha)
plt.plot(alpha, log_error, color='r', label='log error')
plt.legend()
plt.xlabel('sigma',fontsize=14)
plt.ylabel('g_p', fontsize=14, rotation=0, labelpad=30)
plt.show()


Minimizing the Cross Entropy cost


In [29]:
# compute linear combination of input point
def model(x,w):
    a = w[0] + np.dot(x.T,w[1:])
    return a.T

In [30]:
# define sigmoid function
def sigmoid(t):
    return 1/(1 + np.exp(-t))

# the convex cross-entropy cost function
def cross_entropy(w):
    # compute sigmoid of model
    a = sigmoid(model(x,w))

    # compute cost of label 0 points
    ind = np.argwhere(y == 0)[:,1]
    cost = -np.sum(np.log(1 - a[:,ind]))

    # add cost on label 1 points
    ind = np.argwhere(y==1)[:,1]
    cost -= np.sum(np.log(a[:,ind]))

    # compute cross-entropy
    return cost/y.size


Minimizing the Cross Entropy cost

To optimally tune the parameters w, we want to minimize the Cross


Entropy cost:

$$\underset{w}{\text{minimize}}\; -\frac{1}{P}\sum_{p=1}^{P}\left[\, y_p \log\left(\sigma\left(\mathring{x}_p^T w\right)\right) + (1 - y_p)\log\left(1 - \sigma\left(\mathring{x}_p^T w\right)\right)\right]$$

To minimize the Cross Entropy cost, we can use any local optimization
method.

The gradient can be computed as:

$$\nabla g(w) = -\frac{1}{P}\sum_{p=1}^{P}\left(y_p - \sigma\left(\mathring{x}_p^T w\right)\right)\mathring{x}_p$$

The Hessian of the Cross Entropy cost function is:

$$\nabla^2 g(w) = \frac{1}{P}\sum_{p=1}^{P}\sigma\left(\mathring{x}_p^T w\right)\left(1 - \sigma\left(\mathring{x}_p^T w\right)\right)\mathring{x}_p\,\mathring{x}_p^T.$$
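As an illustrative sketch (not from the slides), the gradient above can be evaluated with the sigmoid helper already defined, assuming the notebook's (N, P) input array x and (1, P) label array y:

def cross_entropy_grad(w, x, y):
    # stack a 1 on top of each input to form x̊_p
    x_ring = np.vstack((np.ones((1, x.shape[1])), x))
    # residuals y_p - sigma(x̊_p^T w) for all points, shape (1, P)
    r = y - sigmoid(w.T @ x_ring)
    # gradient: -(1/P) * sum_p (y_p - sigma(x̊_p^T w)) x̊_p
    return -(x_ring @ r.T) / y.size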


Using gradient descent to perform logistic regression using the Cross


Entropy cost

Initialize at point w0 = w1 = 3, set α = 1, and run for 25 steps.


In [33]:
show_video(video_path_1)

Out[33]:

The plotted surface of the Cross Entropy cost function looks convex.
Indeed, unlike the Least Squares, the Cross Entropy cost is always
convex regardless of the dataset used.

For this reason, the Cross Entropy cost is often used to perform logistic
regression.


Using gradient descent to perform logistic regression using the Cross


Entropy cost

We re-run gradient descent with the same initialization w0 = w1 = 3 and


fixed steplength α = 1 for 2000 iterations.
In [34]:
# run gradient descent to minimize the cross-entropy cost
g = cross_entropy; w = np.array([3.0,3.0])[:,np.newaxis]; max_its = 2000; alpha_choice = 1;
weight_history,cost_history = gradient_descent(g,alpha_choice,max_its,w)

# create a static figure illustrating gradient descent steps


animator.static_fig(weight_history,num_contours = 25,viewmax = 12)


Logistic Regression and the Softmax Cost


Different labels, same story

If we change the label values from yp ∈ {0, 1} to yp ∈ {−1, +1}, we can


derive the same optimization problem.

Instead of our data sitting on a step function with lower and upper
steps taking on the values 0 and 1, respectively, they take on values -1
and +1.


This step function is called a sign function since it returns the numerical sign of its input:

$$\text{sign}(x) = \begin{cases} +1 & \text{if } x \ge 0 \\ -1 & \text{if } x < 0. \end{cases}$$

Inserting a linear model through the sign function gives us a step function

$$\text{sign}\left(\mathring{x}^T w\right)$$

with a linear decision boundary between its two steps, defined by all points $\mathring{x}$ where $\mathring{x}^T w = 0$.

Any input lying exactly on the decision boundary can be assigned a


label at random.

A point is classified correctly when its true label is predicted correctly:

sign (x̊Tp w ) ≈ yp

Otherwise, it is said to have been misclassified.


Previously, we used the logistic sigmoid function to replace the step


function.
We now use an adjusted version of the logistic sigmoid function so that its values range between -1 and 1 (instead of 0 and 1). This scaled version of the sigmoid, called the hyperbolic tangent function, is:

$$\tanh(x) = 2\,\sigma(x) - 1 = \frac{2}{1 + e^{-x}} - 1.$$

The sigmoid function σ(⋅) ranges smoothly between 0 and 1. The tanh(⋅)
ranges smoothly between -1 and +1.


We can form a Least Squares cost function with the tanh(⋅):

$$g(w) = \frac{1}{P}\sum_{p=1}^{P}\left(\tanh\left(\mathring{x}_p^T w\right) - y_p\right)^2$$

which is non-convex with undesirable flat regions, requiring specialized


local methods for proper minimization.
In [38]:
demo2.plot_costs(viewmax = 25,view = [20,125])


Logistic regression using the Softmax cost

We rearrange the hyperbolic tangent function in terms of the sigmoid:

$$\sigma(x) = \frac{\tanh(x) + 1}{2}.$$

We can note that:

tanh (x) ≈ +1 ⟺ σ (x) ≈ 1


tanh (x) ≈ −1 ⟺ σ (x) ≈ 0.

Therefore, we can employ the same point-wise log error cost:

$$g_p(w) = \begin{cases} -\log\left(\sigma\left(\mathring{x}_p^T w\right)\right) & \text{if } y_p = +1 \\ -\log\left(1 - \sigma\left(\mathring{x}_p^T w\right)\right) & \text{if } y_p = -1. \end{cases}$$

We have:

$$1 - \sigma(x) = 1 - \frac{1}{1 + e^{-x}} = \frac{1 + e^{-x}}{1 + e^{-x}} - \frac{1}{1 + e^{-x}} = \frac{e^{-x}}{1 + e^{-x}} = \frac{1}{1 + e^{x}} = \sigma(-x)$$


Then:

$$g_p(w) = \begin{cases} -\log\left(\sigma\left(\mathring{x}_p^T w\right)\right) & \text{if } y_p = +1 \\ -\log\left(\sigma\left(-\mathring{x}_p^T w\right)\right) & \text{if } y_p = -1. \end{cases}$$

Because we are using the label values ±1, we can move the label value in each case inside the innermost parentheses and write both cases in a single line as

$$g_p(w) = -\log\left(\sigma\left(y_p\,\mathring{x}_p^T w\right)\right).$$

Since $-\log(x) = \log\left(\frac{1}{x}\right)$, we can rewrite the point-wise cost above equivalently as

$$g_p(w) = -\log\left(\frac{1}{1 + e^{-y_p \mathring{x}_p^T w}}\right) = \log\left(1 + e^{-y_p \mathring{x}_p^T w}\right)$$

Taking the average of this point-wise cost over all P points, we have the Softmax cost for logistic regression:

$$g(w) = \frac{1}{P}\sum_{p=1}^{P} g_p(w) = \frac{1}{P}\sum_{p=1}^{P}\log\left(1 + e^{-y_p \mathring{x}_p^T w}\right)$$


Minimizing Softmax logistic regression using standard gradient descent


In [39]:
# compute linear combination of input point
def model(x,w):
    a = w[0] + np.dot(x.T,w[1:])
    return a.T

In [40]:
# define sigmoid function
def sigmoid(t):
    return 1/(1 + np.exp(-t))

# the convex softmax cost function, written in cross-entropy form for labels -1/+1
def softmax(w):
    # compute sigmoid of model
    a = sigmoid(model(x,w))

    # compute cost of label -1 points
    ind = np.argwhere(y == -1)[:,1]
    cost = -np.sum(np.log(1 - a[:,ind]))

    # add cost on label +1 points
    ind = np.argwhere(y==+1)[:,1]
    cost -= np.sum(np.log(a[:,ind]))

    # average to get the cost value
    return cost/y.size

In [41]:
# the convex softmax cost function, written compactly
def softmax(w):
    cost = np.sum(np.log(1 + np.exp(-y*model(x,w))))
    return cost/float(np.size(y))


Minimizing Softmax logistic regression using standard gradient descent

The gradient of the Softmax cost function is:

$$\nabla g(w) = -\frac{1}{P}\sum_{p=1}^{P}\frac{e^{-y_p \mathring{x}_p^T w}}{1 + e^{-y_p \mathring{x}_p^T w}}\, y_p\, \mathring{x}_p$$
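Numerically, the fraction in each summand is just $\sigma\left(-y_p\,\mathring{x}_p^T w\right)$, so a compact sketch of this gradient (an illustration using the sigmoid helper above and the notebook's assumed (N, P) / (1, P) array shapes, not code from the slides) is:

def softmax_grad(w, x, y):
    # stack a 1 on top of each input to form x̊_p
    x_ring = np.vstack((np.ones((1, x.shape[1])), x))
    # sigma(-y_p * x̊_p^T w) for every point, shape (1, P)
    s = sigmoid(-y * (w.T @ x_ring))
    # gradient: -(1/P) * sum_p sigma(-y_p x̊_p^T w) y_p x̊_p
    return -(x_ring @ (s * y).T) / y.size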

In [44]:
animator.static_fig(weight_history,num_contours = 25,viewmax = 12)

The Softmax cost for logistic regression is always convex.


Noisy classification datasets


In [46]:
demo.plot_data(view = [15,-140])

There are three points in the example that look like they lie on the wrong side.
Note: in the classification context, a 'noisy' point is one that has an incorrect label. Such points are often misclassified by a trained classifier, meaning that their true label value will not be correctly predicted.

Two-class classification datasets typically have noise of this kind and are often not perfectly separable by a hyperplane.


We run 100 steps of gradient descent with steplength α = 1.


In [48]:
static_visualizer().plot_cost_histories([cost_history],start = 0,points = False,labels = [r'$\alpha = 1$'])


In [50]:
demo.static_fig(weight_history[-1],view = [15,-140])


The Perceptron


The Perceptron
We treat classification as a particular form of non-linear regression
(e.g., employing a tanh nonlinearity for classification data with label
values yp ∈ {−1, +1}).

This results in learning a proper nonlinear regressor and a corresponding linear decision boundary:

$$\mathring{x}^T w = 0.$$

Instead of learning this decision boundary as a result of a nonlinear


regression, the Perceptron aims at determining this ideal linear
decision boundary directly.


The Perceptron cost function

With two-class classification we have a training set of P points


{(xp , yp )}Pp=1 , where yp ∈ {−1, +1} - consisting of two classes which we
would like to learn how to distinguish between automatically.

In the simplest instance our two classes of data are largely separated
by a linear decision boundary given by the collection of input x where
x̊T w = 0 with each class (largely) lying on either side.


The linear decision boundary cuts the input space into two half-spaces,
one lying "above" the hyperplane where x̊T w > 0, and one lying "below"
it where x̊T w < 0.

Because we can always flip the orientation of an ideal hyperplane by


multiplying it by -1, we can say in general that when the weights of a
hyperplane are tuned properly, members of the class yp = +1 lie (mostly)
"above" it, while members of the yp = −1 class lie (mostly) "below" it.

Our desired set of weights define a hyperplane where as often as


possible we have that

x̊Tp w > 0 if yp = +1
x̊Tp w < 0 if yp = −1.

We can combine both conditions into a single inequality:

$$-y_p\,\mathring{x}_p^T w < 0.$$

The ideal condition above representing a hyperplane correctly


classifying the point xp forms a point-wise cost:

gp (w) = max (0, −yp x̊Tp w ) = 0


The expression max (0, −yp x̊Tp w ) is always nonnegative, since it returns
zero if xp is classified correctly, and returns a positive value if the point
is classified incorrectly.

The functional form of this point-wise cost max (0, ⋅) is called a rectified
linear unit.

Because these point-wise costs are nonnegative and equal zero when
our weights are tuned correctly, we can take their average over the
entire dataset to form a proper cost function as:

$$g(w) = \frac{1}{P}\sum_{p=1}^{P} g_p(w) = \frac{1}{P}\sum_{p=1}^{P}\max\left(0, -y_p\,\mathring{x}_p^T w\right).$$

This cost function goes by many names such as the perceptron cost,
the rectified linear unit cost (or ReLU cost for short), and the hinge cost
(since when plotted a ReLU function looks like a hinge). This cost
function is always convex but has only a single (discontinuous)
derivative in each input dimension.

This implies that we can only use zero and first order local optimization
schemes (i.e., not Newton's method).

Note that the perceptron cost always has a trivial solution at w = 0,


since indeed g (0) = 0, thus one may need to take care in practice to
avoid finding it (or a point too close to it) accidentally.
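A direct implementation of this cost in the style of the notebook's other cost functions (a sketch assuming the model helper and the global x, y arrays used earlier, not code from the slides) might be:

# the perceptron / ReLU cost function
def perceptron(w):
    # point-wise costs max(0, -y_p * x̊_p^T w), averaged over the dataset
    cost = np.sum(np.maximum(0, -y*model(x,w)))
    return cost/float(np.size(y))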


The Softmax approximation to the Perceptron

When dealing with a cost function that has some deficit, a common approach is to replace it with a smooth (or at least twice-differentiable) cost function that closely matches it everywhere. If the approximation closely matches the true cost function, then in exchange for a small loss of accuracy we considerably broaden the set of optimization tools we can use.

We replace the max function portion of the Perceptron cost with the Softmax function defined as:

$$\text{soft}(s_0, s_1, \ldots, s_{C-1}) = \log\left(e^{s_0} + e^{s_1} + \cdots + e^{s_{C-1}}\right)$$

where $s_0, s_1, \ldots, s_{C-1}$ are any C scalar values. This is a generic smooth approximation to the max function, i.e.,

$$\text{soft}(s_0, s_1, \ldots, s_{C-1}) \approx \max(s_0, s_1, \ldots, s_{C-1})$$


Example: when C = 2, suppose $s_0 \le s_1$, so that $\max(s_0, s_1) = s_1$. Then

$$\max(s_0, s_1) = s_0 + (s_1 - s_0) = \log\left(e^{s_0}\right) + \log\left(e^{s_1 - s_0}\right)$$

$$\text{soft}(s_0, s_1) = \log\left(e^{s_0} + e^{s_1}\right) = \log\left(e^{s_0}\right) + \log\left(1 + e^{s_1 - s_0}\right)$$

so $\text{soft}(s_0, s_1)$ is always larger than $\max(s_0, s_1)$, but not by much.
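We can check how tight the approximation is numerically; numpy's logaddexp evaluates $\log\left(e^{s_0} + e^{s_1}\right)$ in a stable way (a quick sketch with arbitrarily chosen values, not from the slides):

import numpy as np

s0, s1 = 1.0, 3.0
print(max(s0, s1))             # 3.0
print(np.logaddexp(s0, s1))    # log(e^1 + e^3) ≈ 3.127, only slightly above the max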


We replace the ReLU perceptron cost in the point-wise cost function with its softmax approximation:

$$g_p(w) = \text{soft}\left(0, -y_p\,\mathring{x}_p^T w\right) = \log\left(e^{0} + e^{-y_p \mathring{x}_p^T w}\right) = \log\left(1 + e^{-y_p \mathring{x}_p^T w}\right)$$

The overall cost function now is:

$$g(w) = \sum_{p=1}^{P} g_p(w) = \sum_{p=1}^{P}\log\left(1 + e^{-y_p \mathring{x}_p^T w}\right)$$

Like the ReLU cost, the Softmax cost is convex.


Unlike the ReLU cost, the Softmax cost has infinitely many derivatives
and Newton's method can be used.
The Softmax cost does not have a trivial solution at zero.


Example: Using Newton's method to minimize the Softmax cost


In [52]:
### define softmax cost ###
# compute linear combination of input point
def model(x,w):
    a = w[0] + np.dot(x.T,w[1:])
    return a.T

# the convex softmax cost function
def softmax(w):
    cost = np.sum(np.log(1 + np.exp(-y*model(x,w))))
    return cost/float(np.size(y))

# load in dataset
data = np.loadtxt(data_path_1, delimiter = ',')

# get input/output pairs
x = data[:-1,:]
y = data[-1:,:]

# run Newton's method to minimize the softmax cost
g = softmax; w = np.ones((2,1)); max_its = 5;
weight_history,cost_history = newtons_method(g,max_its,w,epsilon = 10**(-7))

# create a static figure illustrating the Newton's method steps
animator = classification_2d_visualizer(data,g)

In [53]:
animator.static_fig(weight_history,num_contours = 25,viewmax = 25)


The Softmax cost and a problem with linearly separable datasets

Suppose that we have a dataset whose two classes can be perfectly


separated by a hyperplane, and that we have chosen an appropriate
cost function to minimize it in order to determine proper weights for
our model.

Suppose further that we are extremely lucky and our initialization $w^0$ produces a linear decision boundary $\mathring{x}^T w^0 = 0$ with perfect separation.

This means that for each of our P points, we have:

−yp x̊Tp w0 < 0

The point-wise Perceptron/ReLU cost is zero for every point i.e.,


gp (w0 ) = max (0, −yp x̊Tp w0 ) = 0

The Perceptron/ReLU cost is exactly equal to zero:

$$g\left(w^0\right) = \frac{1}{P}\sum_{p=1}^{P}\max\left(0, -y_p\,\mathring{x}_p^T w^0\right) = 0.$$

Any local optimization algorithm will halt immediately. This will not be
the case if we use the Softmax cost instead of the Perceptron cost.


The Softmax cost and a problem with linearly separable datasets

We always have $e^{-y_p \mathring{x}_p^T w^0} > 0$.

The Softmax point-wise cost is therefore strictly positive:

$$g_p\left(w^0\right) = \log\left(1 + e^{-y_p \mathring{x}_p^T w^0}\right) > 0.$$

The Softmax cost is strictly positive as well:

$$g\left(w^0\right) = \sum_{p=1}^{P}\log\left(1 + e^{-y_p \mathring{x}_p^T w^0}\right) > 0.$$

Using any local optimization method, e.g., gradient descent, we will


take steps away from the initialization w0 to drive the value of the
Softmax cost lower and lower toward its minimum at zero.

With data that is indeed linearly separable, the Softmax cost approaches this lower bound only as the magnitude of the weights grows to infinity.


The Softmax cost and a problem with linearly separable datasets

$$g\left(w^0\right) = \sum_{p=1}^{P}\log\left(1 + e^{-y_p \mathring{x}_p^T w^0}\right) > 0.$$

Each individual term $\log\left(1 + e^{-C}\right) \to 0$ only as $C \to \infty$.

If we multiply our initialization $w^0$ by any constant $C > 1$, we can decrease the value of every negative exponential involving one of our data points, since $C\left(-y_p\,\mathring{x}_p^T w^0\right) < -y_p\,\mathring{x}_p^T w^0$ and so $e^{-y_p \mathring{x}_p^T C w^0} < e^{-y_p \mathring{x}_p^T w^0}$.

This decreases the Softmax cost as well with the minimum achieved as
C ⟶ ∞. However, regardless of the scalar C > 1, the decision boundary
defined by the initial weights x̊Tp w0 = 0 does not change location.

This can cause numerical instability issues with local optimization


methods that make large progress at each step, e.g., Newton's method.


If we simply flip one of the labels - making this dataset not perfectly
linearly separable - the corresponding cost function does not have a
global minimum out at infinity, as illustrated in the contour plot below.
In [54]:
# switch a label
y[0,-1] = -1
data = np.vstack((x,y))

# draw contour of this new data


animator = classification_2d_visualizer(data,softmax)

# create a static figure illustrating gradient descent steps


animator.static_fig([],num_contours = 15,viewmax = 25)


Normalizing feature-touching weights

To control the magnitude of w means that we want to control the size of the N + 1 individual weights it contains:

$$w = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_N \end{bmatrix}.$$

We can do this by directly controlling the size of just N of these weights, and it is particularly convenient to do so using the final N feature-touching weights $w_1, w_2, \ldots, w_N$, because these define the normal vector to the linear decision boundary $\mathring{x}^T w = 0$.

We express the linear decision boundary as:

$$\mathring{x}^T w = b + x^T \omega = 0$$

where the feature-touching weights ω define the normal vector of the linear decision boundary.


A linear decision boundary written as $b + x^T \omega = 0$ has a normal vector ω defined by its feature-touching weights.

The normal vector to a hyperplane (like our decision boundary) is


always perpendicular to it.


We have a point $x_p$ lying 'above' the linear decision boundary, on a translate of the decision boundary where $b + x^T \omega = \beta > 0$.

Suppose we know the location of the vertical projection of xp onto the


decision boundary, called x′p

The signed distance between $x_p$ and its vertical projection $x'_p$ is:

$$d = \left\|x'_p - x_p\right\|_2 \,\text{sign}(\beta) = \left\|x'_p - x_p\right\|_2$$

where sign(β) = +1 in this example.


Because the vector $x'_p - x_p$ is perpendicular to the decision boundary (and so is parallel to the normal vector ω), the inner product rule gives:

$$\left(x'_p - x_p\right)^T \omega = -\left\|x'_p - x_p\right\|_2 \left\|\omega\right\|_2 = -d\,\|\omega\|_2$$

Take the difference between the decision boundary and its translation evaluated at $x'_p$ and $x_p$:

$$0 - \beta = \left(b + \left(x'_p\right)^T \omega\right) - \left(b + x_p^T \omega\right) = \left(x'_p - x_p\right)^T \omega$$


Since both formulae are equal to (x′p − xp )T ω we can set them equal to
each other:

d ∥ω∥2 = β

The signed distance d of $x_p$ to the decision boundary is

$$d = \frac{\beta}{\|\omega\|_2} = \frac{b + x_p^T \omega}{\|\omega\|_2}.$$

If the feature-touching weights have unit length as ∥ω∥2 = 1 then the


signed distance d of a point xp to the decision boundary is given simply
by its evaluation b + xTp ω.

If xp were to lie below the decision boundary and β < 0, we have the
same derivation.
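As a concrete sketch (a hypothetical helper, not part of the slides), the signed distance of a point to a linear decision boundary with bias b and feature-touching weights ω is just its evaluation divided by the length of ω:

import numpy as np

def signed_distance(x_p, b, omega):
    # d = (b + x_p^T omega) / ||omega||_2 ; positive above the boundary, negative below
    return (b + x_p @ omega) / np.linalg.norm(omega)

b, omega = -1.0, np.array([3.0, 4.0])                    # ||omega||_2 = 5
print(signed_distance(np.array([1.0, 2.0]), b, omega))   # (-1 + 3 + 8)/5 = 2.0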


Another example, with input dimension N = 3, where the decision boundary is a true hyperplane.


We can scale any linear decision boundary by a non-zero scalar C and it still defines the same hyperplane. If we multiply by $C = \frac{1}{\|\omega\|_2}$ we have:

$$\frac{b + x^T \omega}{\|\omega\|_2} = \frac{b}{\|\omega\|_2} + x^T \frac{\omega}{\|\omega\|_2} = 0$$

Our feature-touching weights then have unit length: $\left\|\frac{\omega}{\|\omega\|_2}\right\|_2 = 1$.

Regardless of how large our weights w were to begin with we can always
normalize them in a consistent way by dividing off the magnitude of ω.

This normalization scheme is particularly useful in the context of the


technical issue with the Softmax/Cross Entropy highlighted above
because clearly a decision boundary that perfectly separates two
classes of data can be feature-weight normalized to prevent its weights
from growing too large (and diverging to infinity).


Regularizing two-class classification

We constrain the Softmax / Cross Entropy cost so that the feature-touching weights always have length one, i.e., $\|\omega\|_2 = 1$:

$$\begin{aligned} \underset{b,\,\omega}{\text{minimize}} \quad & \frac{1}{P}\sum_{p=1}^{P}\log\left(1 + e^{-y_p\left(b + x_p^T \omega\right)}\right) \\ \text{subject to} \quad & \|\omega\|_2^2 = 1 \end{aligned}$$

We can solve the constrained optimization problem above by solving the highly related unconstrained, regularized version of the original Softmax cost:

$$g(b, \omega) = \frac{1}{P}\sum_{p=1}^{P}\log\left(1 + e^{-y_p\left(b + x_p^T \omega\right)}\right) + \lambda\,\|\omega\|_2^2$$

The regularization parameter λ is used to balance how strongly we


pressure one term or the other in minimizing their sum.

Minimizing the regularizer $\lambda\,\|\omega\|_2^2$ prevents the magnitude of the weights from diverging, since if their size does start to grow our entire cost function 'suffers' because of it and becomes large. Because of this, the value of λ is typically chosen to be small (and positive) in practice, although some fine-tuning can be useful.
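A sketch of this regularized cost in the notebook's style (an illustration assuming the model helper, the usual x, y arrays, and weights stored as w = [b, ω]; not code from the slides):

lam = 10**(-3)   # regularization parameter lambda (a typical small value)

# the regularized softmax cost function
def regularized_softmax(w):
    # softmax / cross-entropy term
    cost = np.sum(np.log(1 + np.exp(-y*model(x,w)))) / float(np.size(y))
    # penalty on the feature-touching weights omega = w[1:] only (the bias w[0] is not penalized)
    return cost + lam*np.sum(w[1:]**2)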


Using Newton's method to minimize a regularized Softmax cost


In [56]:
# create a static figure illustrating gradient descent steps
animator.static_fig(weight_history,num_contours = 25,viewmax = 100)

We repeat the previous experiment but add a regularizer with $\lambda = 10^{-3}$ to the Softmax cost.

The global minimum no longer lies at infinity.


We still learn a perfect decision boundary by a tightly fitting tanh (⋅)
function.


Support Vector Machines


The Margin-Perceptron

Suppose we have a two-class classification training dataset of P points


{(xp , yp )}Pp=1 with the labels yp ∈ {−1, +1}.

Suppose the dataset is linearly separable and we have a linear decision


boundary x̊T w =0 passing evenly through the region separating two
classes.


This separating hyperplane creates a buffer between the two classes


confined between two evenly shifted versions of itself:
1. One version that lies on the positive side of the separator and
just touches the class having labels yp = +1 (colored red in the
figure) taking the form x̊T w = +1.
2. One lying on the negative side of it just touching the class with
labels yp = −1 (colored blue in the figure) taking the form
x̊T w = −1.


The translations above and below the separating hyperplane are more
generally defined as x̊T w = +β and x̊T w = −β respectively, where β > 0.

By dividing off β in both equations and reassigning the variables as $w \longleftarrow \frac{w}{\beta}$, we can leave out the redundant parameter β and write the two translations as stated: $\mathring{x}^T w = \pm 1$.


All points in the +1 class lie exactly on or on the positive side of


x̊T w = +1.

All points in the −1 class lie exactly on or on the negative side of


x̊T w = −1.

x̊Tp w ≥ 1 if yp = +1
x̊Tp w ≤ −1 if yp = −1

We can combine these conditions into a single statement by


multiplying each by their respective label values, giving the single
inequality yp x̊Tp w ≥1 which can be equivalently written as a point-wise
cost

gp (w) = max (0, 1 − yp x̊Tp w ) = 0


gp (w) = max (0, 1 − yp x̊Tp w ) = 0

Again, this value is always nonnegative. Summing up all P equations of the form above gives the Margin-Perceptron cost:

$$g(w) = \sum_{p=1}^{P}\max\left(0, 1 - y_p\,\mathring{x}_p^T w\right)$$

Notice that the original Perceptron cost is:

$$g(w) = \sum_{p=1}^{P}\max\left(0, -y_p\,\mathring{x}_p^T w\right)$$

The additional 1 prevents the issue of a trivial zero solution with the
original Perceptron cost.

If the data is indeed linearly separable any hyperplane passing


between the two classes will have parameters w where g (w) = 0.

If the data is not linearly separable, a violation for the pth point adds the
positive value of 1 − yp x̊Tp w to the cost function.


Relation to the Softmax cost

As with the perceptron, one way to smooth out the margin-perceptron here is by replacing the max operator with softmax:

$$\text{soft}\left(0, 1 - y_p\,\mathring{x}_p^T w\right) = \log\left(1 + e^{1 - y_p \mathring{x}_p^T w}\right)$$

The '1' used in the 1 − yp (x̊Tp w ) component of the cost could have been
any number we wanted - it was a normalization factor for the width of
the margin and, by convention, we used '1'.

Had we instead chosen any other value ϵ > 0, the desired condition would be

$$\max\left(0, \epsilon - y_p\,\mathring{x}_p^T w\right) = 0$$

for all p, and the Margin-Perceptron cost could be stated as:

$$g(w) = \sum_{p=1}^{P}\max\left(0, \epsilon - y_p\,\mathring{x}_p^T w\right)$$


$$\max\left(0, \epsilon - y_p\,\mathring{x}_p^T w\right) = 0$$

The softmax version is:

$$\text{soft}\left(0, \epsilon - y_p\,\mathring{x}_p^T w\right) = \log\left(1 + e^{\epsilon - y_p \mathring{x}_p^T w}\right)$$

When ϵ is quite small we of course have that $\log\left(1 + e^{\epsilon - y_p \mathring{x}_p^T w}\right) \approx \log\left(1 + e^{-y_p \mathring{x}_p^T w}\right)$, the same summand used for the (smoothed) perceptron and logistic regression.

Thus we can in fact use the same softmax cost function here,

$$g(w) = \sum_{p=1}^{P}\log\left(1 + e^{-y_p \mathring{x}_p^T w}\right) \tag{1}$$

as a smoothed version of our Margin-Perceptron cost.


Maximum margin decision boundaries

When two classes of data are linearly separable infinitely many


hyperplanes could be drawn to separate the data.


While both perfectly distinguish between the two classes, the green separator (with smaller margin) divides up the space in a rather awkward fashion given how the data is distributed, and will therefore tend to more easily misclassify future datapoints.


On the other hand, the black separator (having a larger margin) divides
up the space more evenly with respect to the given data, and will tend
to classify future points more accurately.


We can express a linear decision boundary as:

$$\mathring{x}^T w = b + x^T \omega = 0.$$

To find the separating hyperplane with maximum margin we aim to


find a set of parameters so that the region defined by b + xT ω = ±1, with
each translate just touching one of the two classes, has the largest
possible margin.


The margin can be determined by calculating the distance between


any two points x1 and x2 (one from each translated hyperplane) lying
on the normal vector ω:

∥x1 − x2 ∥2 .


Taking the difference of the two translates evaluated at $x_1$ and $x_2$:

$$\left(b + x_1^T \omega\right) - \left(b + x_2^T \omega\right) = \left(x_1 - x_2\right)^T \omega = 2$$


Maximum margin decision boundaries


$$\left(b + x_1^T \omega\right) - \left(b + x_2^T \omega\right) = \left(x_1 - x_2\right)^T \omega = 2$$

Using the inner-product rule, and the fact that the two vectors $x_1 - x_2$ and ω are parallel to each other, we have:

$$\left\|x_1 - x_2\right\|_2 = \frac{2}{\|\omega\|_2}$$

Therefore, finding the separating hyperplane with maximum margin is


equivalent to finding the one with the smallest possible normal vector
ω.


The hard-margin and soft-margin SVM problems


We have the hard-margin support vector machine problem to find a
separating hyperplane for the data with minimum length normal
vector:
2
minimize ∥ω∥2
b, ω
subject to max (0, 1 − yp (b + xTp ω)) = 0, p = 1, . . . , P .

The Margin-Perceptron constraint here guarantee that the hyperplane


separates the data perfectly.


We can relax the constraints and form an unconstrained formulation


that is similar to the previous regularization approach:
$$g(b, \omega) = \sum_{p=1}^{P}\max\left(0, 1 - y_p\left(b + x_p^T \omega\right)\right) + \lambda\,\|\omega\|_2^2$$

The parameter λ ≥ 0 is called a penalty or regularization parameter.


When λ is set to a small positive value we put more 'pressure' on the
cost function to make sure the constraints hold:

max (0, 1 − yp (b + xTp ω)) = 0, p = 1, . . . , P

This regularized form of the Margin-Perceptron cost function is


referred to as the soft-margin support vector machine cost.
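A sketch of this soft-margin cost in the notebook's style (an illustration assuming the model helper, the usual x, y arrays, weights stored as w = [b, ω], and an arbitrarily chosen penalty value; not code from the slides):

lam = 10**(-2)   # penalty / regularization parameter lambda (hypothetical value)

# the soft-margin SVM cost function
def soft_margin_svm(w):
    # margin-perceptron / hinge term: sum_p max(0, 1 - y_p (b + x_p^T omega))
    cost = np.sum(np.maximum(0, 1 - y*model(x,w)))
    # add the squared-norm penalty on the feature-touching weights omega = w[1:]
    return cost + lam*np.sum(w[1:]**2)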


Comparing the SVM decision boundary on linearly separable data


In [59]:
demo5.svm_comparison_fig()

Compare three decision boundaries learned via the Margin-Perceptron


(left) to the support vector machine decision boundary (right).

There are infinitely many linear decision boundaries that separate the
two classes. Any of these can be found by Margin-Perceptron.

The SVM decision boundary is the one that provides the maximum
margin.


In [60]:
demo5.svm_comparison_fig()

In the right panel, the translates of the decision boundary pass through
points from both classes - equidistant from the SVM linear decision
boundary.

These points are called support vectors, hence the name Support
Vector Machines.


SVMs and noisy data

A very big practical benefit of the soft-margin SVM problem is that it allows us to deal with noisy, imperfectly (linearly) separable data, which arise far more commonly in practice than datasets that are perfectly linearly separable.

"Noise" makes at least one of the constraints in the hard-margin


problem impossible to satisfy (and thus the problem is technically
impossible to solve), i.e., for some p:

max (0, 1 − yp (b + xTp ω)) > 0

The soft-margin relaxation can always be properly minimized and is


therefore much more frequently used in practice.


SVMs and noisy data

However, notice that once we forgo the assumption of perfect (linear) separability, the added value of a 'maximum margin hyperplane' provided by the SVM solution disappears, since we no longer have a margin to begin with.

Thus with many datasets in practice the soft-margin problem does not provide a solution remarkably different from the Perceptron or even logistic regression.

Indeed, with datasets that are not linearly separable, it often returns exactly the same solution provided by the Perceptron or logistic regression.


SVMs and noisy data

Consider the soft-margin SVM problem:

g (b, ω) = ∑_{p=1}^{P} max (0, 1 − yp (b + xpᵀ ω)) + λ ∥ω∥₂²

We smooth the Margin-Perceptron portion of the cost using the
Softmax function:

g (b, ω) = ∑_{p=1}^{P} log (1 + e^{−yp (b + xpᵀ ω)}) + λ ∥ω∥₂²

This is similar to a regularized Perceptron or logistic regression.

All three methods of linear two-class classification are very deeply
connected. They tend to provide similar results on realistic (not linearly
separable) datasets.
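A minimal numpy sketch of this smoothed, regularized cost, under the same assumed data layout as the earlier soft-margin sketch (x of shape (N, P), y of shape (1, P) with ±1 labels, w = [b, ω]):

import numpy as np

lam = 10**(-3)  # assumed regularization value

def smoothed_soft_margin(w, x, y):
    raw = w[0] + np.dot(x.T, w[1:]).flatten()               # b + x_p^T omega
    softmax_terms = np.log(1 + np.exp(-y.flatten() * raw))  # log(1 + e^{-y_p (b + x_p^T omega)})
    return np.sum(softmax_terms) + lam * np.sum(w[1:]**2)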


Linear Multi-Class Classification


One-versus-All Multi-Class Classification

A multiclass dataset {(xp, yp)}_{p=1}^{P} consists of C distinct classes of data.

As with the two-class case, we can in theory use any C distinct labels to
distinguish between the classes.

For the sake of this derivation it is convenient to use the label values
yp ∈ {0, 1, . . . , C − 1}.

In [64]:
demo1.show_dataset()


Training C One-versus-All classifiers

The goal of multi-class classification is to learn how to distinguish each
class of our data from the other C − 1 classes.

A good first step would be to learn C two-class classifiers on the entire


dataset, with the cth classifier trained to distinguish the cth class from
the remainder of the data.

With the cth two-class subproblem we simply assign temporary labels ỹp
to the entire dataset, giving +1 labels to the cth class and −1 labels to
the remainder of the dataset

ỹp = +1 if yp = c,   ỹp = −1 if yp ≠ c    (2)

where again yp is the original label for the pth point, and run the two-class
classification scheme of our choice.
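A small numpy sketch of forming these temporary labels for the cth subproblem (y is assumed to hold the original labels 0, . . . , C − 1; the helper name is illustrative):

import numpy as np

def ova_labels(y, c):
    # +1 for points of class c, -1 for every other class
    return np.where(y == c, 1, -1)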


In [65]:
# solve the 2-class subproblems
demo1.solve_2class_subproblems()

# illustrate dataset with each subproblem and learned decision boundary


demo1.plot_data_and_subproblem_separators()

We learn 3 linear classifiers with logistic regression, one for each subproblem.

We solve each using Newton's method.


1. Points on the positive side of a single classifier

With OvA we learn C two-class classifiers, and we can denote the weights
from the cth classifier as wc, where

wc = [w0,c  w1,c  w2,c  ⋯  wN,c]ᵀ

and then the corresponding decision boundary as x̊ᵀ wc = 0.

In the previous example, in each case, because each subproblem is
perfectly linearly separable, the class to be distinguished from the rest lies
on the positive side of its respective classifier, with the remainder of the
points lying on the negative side.
This means that for the cth classifier we have, for the pth point xp,

x̊pᵀ wc > 0 if yp = c,   and   x̊pᵀ wc < 0 if yp ≠ c.


1. Points on the positive side of a single classifier

When evaluated by each two-class classifier individually, the one
identifying a point's true label always provides the largest evaluation,
i.e.,

x̊pᵀ wyp = max_{c = 0,...,C−1} x̊pᵀ wc

How do we classify arbitrary points? The predicted label y for an input
point x here can be written as

y = argmax_{c = 0,...,C−1} x̊ᵀ wc
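A minimal sketch of this fusion rule in numpy, assuming a weight matrix W whose cth column stacks the bias of classifier c on top of its feature-touching weights:

import numpy as np

def fuse_predict(x, W):
    # x: inputs of shape (N, P); W: weights of shape (N + 1, C)
    evals = W[0, :] + np.dot(x.T, W[1:, :])   # x̊^T w_c for every point and classifier
    return np.argmax(evals, axis=1)           # predicted label = index of the largest evaluation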


In [66]:
demo1.show_fusion(region = 1)

All points in our toy dataset lie on the positive side of a single
classifier. These points are colored to match their respective classifier.

There are regions left uncolored, which include regions where:


1. Points are on the positive side of more than one classifier.
2. Points are on the positive side of none of the classifiers.


2. Points on the positive side of more than one classifier


In [67]:
demo1.point_and_projection(point1 = [0.4,1] ,point2 = [0.6,1]);

We think of a classifier as being 'more confident' of the class identity of
a given point the farther the point lies from the classifier's decision
boundary.


2. Points on the positive side of more than one classifier


In [68]:
demo1.point_and_projection(point1 = [0.4,1] ,point2 = [0.6,1]);

This is a simple geometric/probabilistic concept: the bigger a point's
distance to the boundary, the deeper it lies within one of the classifier's
half-spaces, and thus we can be much more confident in its class
identity than in that of a point closer to the boundary.
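A quick sketch of this confidence measure as a signed distance (b and ω are the assumed bias and 1-D normal vector of a single classifier):

import numpy as np

def signed_distance(x, b, omega):
    # signed distance of each column of x (shape (N, P)) to the boundary b + x^T omega = 0
    return (b + np.dot(x.T, omega)) / np.linalg.norm(omega)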


2. Points on the positive side of more than one classifier


In [69]:
demo1.point_and_projection(point1 = [0.4,1] ,point2 = [0.6,1]);

If we slightly perturbed the decision boundary, those points originally
close to it might end up on the other side of the perturbed hyperplane,
changing classes, whereas points farther from the boundary are less
likely to be affected.


2. Points on the positive side of more than one classifier


In [70]:
demo1.show_fusion(region = 2)

We repeat this logic for every point in the regions where two or more
classifiers are positive.
For points that are equidistant to two or more decision boundaries, we
assign a class label at random.


3. Points on the negative side of all classifiers


In [71]:
demo1.point_and_projection(point1 = [0.4,0.5] ,point2 = [0.45,0.45])

When a point x lies on the positive side of none of our C One-versus-
All classifiers, it means that each of our classifiers designates it as not
belonging to its respective class.


In [72]:
demo1.point_and_projection(point1 = [0.4,0.5] ,point2 = [0.45,0.45])

Which classifier is the least confident that x does not belong to its
class? In other words, which decision boundary is closest to the point x?


In [73]:
demo1.point_and_projection(point1 = [0.4,0.5] ,point2 = [0.45,0.45])

Using the notion of signed distance to the decision boundary:

Assuming the weights of our classifiers have been normalized, for an
input point x we assign the label y providing the maximum signed
distance, i.e.,

y = argmax_{c = 0,...,C−1} x̊ᵀ wc .


In [74]:
demo1.show_fusion(region = 3)


Putting it all together

One-versus-All multi-class classification

1: Input: multiclass dataset {(xp, yp)}_{p=1}^{P} where yp ∈ {0, . . . , C − 1}, a two-class
   classification scheme and an optimizer
2: for j = 0, . . . , C − 1
3:   form temporary labels ỹp = +1 if yp = j, and ỹp = −1 if yp ≠ j
4:   solve the two-class subproblem on {(xp, ỹp)}_{p=1}^{P} to find weights wj
5:   normalize the classifier weights by the magnitude of the feature-touching
     portion: wj ⟵ wj / ∥ωj∥₂
6: end for
7: To assign a label y to a point x, apply the fusion rule: y = argmax_{c = 0,...,C−1} x̊ᵀ wc
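A rough numpy sketch of this procedure, assuming a generic two-class trainer train_two_class(x, y_tmp) that returns a weight vector of length N + 1 (the trainer name is hypothetical):

import numpy as np

def one_versus_all(x, y, C, train_two_class):
    # x: (N, P) inputs, y: (P,) labels in {0, ..., C-1}
    W = []
    for j in range(C):
        y_tmp = np.where(y == j, 1, -1)       # step 3: temporary +1 / -1 labels
        w_j = train_two_class(x, y_tmp)       # step 4: solve the two-class subproblem
        w_j = w_j / np.linalg.norm(w_j[1:])   # step 5: normalize by the feature-touching weights
        W.append(w_j)
    return np.stack(W, axis=1)                # (N + 1, C) weight matrix for the fusion rule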


In [75]:
demo1.show_complete_coloring();


Multi-Class Classification and the Perceptron


The multi-class Perceptron cost function

We have seen how the fusion rule defined class ownership for every
point x in the input space. With all two-class classifiers properly tuned,
ideally, we would like the fusion rule to hold true for as many points as
possible:

yp = argmax_{j = 0,...,C−1} x̊pᵀ wj

Now, instead of tuning our C two-class classifiers one-by-one and then
combining them in this way, we can learn the weights of all C classifiers
simultaneously so as to satisfy this ideal condition as often as possible.

If the above equation holds for the pth point, then:

x̊pᵀ wyp = max_{j = 0,...,C−1} x̊pᵀ wj

i.e., the signed distance from the point to its class decision boundary is
greater than (or equal to) its distance to every other two-class decision
boundary.


Subtracting x̊pᵀ wyp from the right-hand side, we have a point-wise cost
function that is always non-negative and minimal at zero:

gp (w0 , . . . , wC−1 ) = ( max_{j = 0,...,C−1} x̊pᵀ wj ) − x̊pᵀ wyp .    (3)

If our weights w0 , . . . , wC−1 are set ideally, gp (w0 , . . . , wC−1 ) should be zero
for as many points as possible.

We can now form a cost function by taking the average of the point-
wise cost over the entire dataset:

g (w0 , . . . , wC−1 ) = (1/P) ∑_{p=1}^{P} [( max_{j = 0,...,C−1} x̊pᵀ wj ) − x̊pᵀ wyp ]

This cost function is called the multi-class Perceptron cost, providing
a way to tune all classifier weights simultaneously to obtain weights
that satisfy the fusion rule as well as possible.


Example: Minimizing the multi-class Perceptron cost

g (w0 , . . . , wC−1 ) = (1/P) ∑_{p=1}^{P} [( max_{j = 0,...,C−1} x̊pᵀ wj ) − x̊pᵀ wyp ]

The multi-class Perceptron cost also has a trivial solution at zero (all
weights w0 = ⋯ = wC−1 = 0 give zero cost).
This undesirable behavior can be avoided by initializing any local
optimization method away from the origin.

We are restricted to using zero- and first-order optimization methods
to minimize the multi-class Perceptron cost, as its second derivative is
zero (wherever it is defined).


In [78]:
# compute C linear combinations of input point, one per classifier
def model(x,w):
    a = w[0] + np.dot(x.T,w[1:])
    return a.T

In [79]:
def multiclass_perceptron(w):
    # pre-compute predictions on all points: all_evals has shape (C, P)
    all_evals = model(x,w)

    # maximum evaluation across the C classifiers, for each data point
    a = np.max(all_evals,axis = 0)

    # pick out each point's evaluation under its true class via fancy indexing
    b = all_evals[y.astype(int).flatten(),np.arange(np.size(y))]
    cost = np.sum(a - b)

    # return average
    return cost/float(np.size(y))


In [80]:
# load in dataset
data = np.loadtxt(dataset_path_1,delimiter = ',')

# get input/output pairs


x = data[:-1,:]
y = data[-1:,:]

# create an instance of the ova demo


demo = MulticlassVisualizer(data)

# visualize dataset
demo.show_dataset()

# run gradient descent to minimize cost


g = multiclass_perceptron; w = 0.1*np.random.randn(3,3); max_its = 2000; alpha_choice = 10**(-1);
weight_history = demo.gradient_descent(g,w,alpha=alpha_choice,max_its=max_its)


In [81]:
demo.show_complete_coloring(weight_history, cost = multiclass_perceptron)

In the left figure, because we did not train each individual two-class
classifier in a One-versus-All manner, each individual learned two-class
classifier performs quite poorly in separating its class from the rest of
the data.
In the right figure, we show the fused multi-class decision boundary
formed by combining these individual One-versus-All boundaries via
the fusion rule. The final multi-class decision boundary achieves perfect
classification.


Alternative formulations of the multi-class Perceptron cost

g (w0 , . . . , wC−1 ) = (1/P) ∑_{p=1}^{P} [( max_{j = 0,...,C−1} x̊pᵀ wj ) − x̊pᵀ wyp ]

The multi-class Perceptron cost can also be derived as a direct
generalization of its two-class version. Using the following property of
the max function

max (s0 , s1 , . . . , sC−1 ) − z = max (s0 − z, s1 − z, . . . , sC−1 − z)

we can rewrite the point-wise multi-class Perceptron cost as:

max_{j = 0,...,C−1} x̊pᵀ (wj − wyp ) .

Because for j = yp we have x̊pᵀ (wj − wyp ) = 0, we can equivalently write this as:

max (0, max_{j = 0,...,C−1, j≠yp} x̊pᵀ (wj − wyp )) .

The entire multi-class Perceptron cost is then:

g (w0 , . . . , wC−1 ) = (1/P) ∑_{p=1}^{P} max (0, max_{j = 0,...,C−1, j≠yp} x̊pᵀ (wj − wyp )) .    (4)
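Under the same data conventions as the notebook's cost implementations (x of shape (N, P), y of shape (1, P) holding labels 0, . . . , C − 1, w of shape (N + 1, C)), a rough sketch of this equivalent form is:

import numpy as np

def multiclass_perceptron_alt(w, x, y):
    all_evals = (w[0] + np.dot(x.T, w[1:])).T             # shape (C, P): x̊_p^T w_j
    y_idx = y.astype(int).flatten()
    true_evals = all_evals[y_idx, np.arange(np.size(y))]  # x̊_p^T w_{y_p} for each point
    diffs = np.maximum(0, all_evals - true_evals)         # max(0, x̊_p^T (w_j - w_{y_p}))
    # the j = y_p entries are exactly zero, so the column-wise max reproduces equation (4)
    return np.sum(np.max(diffs, axis=0)) / float(np.size(y))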


Regularizing the multi-class Perceptron

In minimizing the multi-class Perceptron cost, we should subject it to
the constraints that all vectors of feature-touching weights have unit
length, so that we can compare the signed distance of each input xp to
each of our One-versus-All two-class decision boundaries.
Each weight vector wc consists of two components:

(bias): bc = w0,c    (feature-touching weights): ωc = [w1,c  w2,c  ⋯  wN,c]ᵀ    (5)

In this notation each linear combination can be written as x̊pᵀ wc = bc + xpᵀ ωc .

A proper constrained optimization problem involving our multi-class
Perceptron takes the form:

minimize over b0 , ω0 , . . . , bC−1 , ωC−1:   (1/P) ∑_{p=1}^{P} [( max_{j = 0,...,C−1} bj + xpᵀ ωj ) − (byp + xpᵀ ωyp )]
subject to:   ∥ωj∥₂² = 1,   j = 0, . . . , C − 1


Regularizing the multi-class Perceptron

minimize over b0 , ω0 , . . . , bC−1 , ωC−1:   (1/P) ∑_{p=1}^{P} [( max_{j = 0,...,C−1} bj + xpᵀ ωj ) − (byp + xpᵀ ωyp )]
subject to:   ∥ωj∥₂² = 1,   j = 0, . . . , C − 1

We can phrase this constrained problem in an unconstrained
regularized form by relaxing the constraints but penalizing their
magnitude:

(1/P) ∑_{p=1}^{P} [( max_{j = 0,...,C−1} bj + xpᵀ ωj ) − (byp + xpᵀ ωyp )] + λ ∑_{j=0}^{C−1} ∥ωj∥₂²

λ is typically set to a small value (e.g., 10⁻³ or smaller).

The regularized form does not quite match the original constrained
formulation, as regularizing all normal vectors together will not
necessarily guarantee that ∥ωj∥₂² = 1 for all j. However, it will generally
force the magnitude of all normal vectors to "behave well".


In [94]:
lam = 10**-5 # our regularization parameter
def multiclass_perceptron(w):
    # pre-compute predictions on all points: all_evals has shape (C, P)
    all_evals = model(x,w)

    # maximum evaluation across the C classifiers, for each data point
    a = np.max(all_evals,axis = 0)

    # pick out each point's evaluation under its true class via fancy indexing
    b = all_evals[y.astype(int).flatten(),np.arange(np.size(y))]
    cost = np.sum(a - b)

    # add regularizer: lambda times the squared Frobenius norm of the feature-touching weights
    cost = cost + lam*np.linalg.norm(w[1:,:],'fro')**2

    # return average
    return cost/float(np.size(y))


In [95]:
# load in dataset
data = np.loadtxt(dataset_path_1,delimiter = ',')

# get input/output pairs


x = data[:-1,:]
y = data[-1:,:]

# create an instance of the ova demo


demo = MulticlassVisualizer(data)

# visualize dataset
demo.show_dataset()

# run gradient descent to minimize cost


g = multiclass_perceptron; w = 0.1*np.random.randn(3,3); max_its = 2000; alpha_choice = 10**(-1);
weight_history = demo.gradient_descent(g,w,alpha=alpha_choice,max_its=max_its)


In [96]:
demo.show_complete_coloring(weight_history, cost = multiclass_perceptron)


The multi-class Softmax cost function

Recall that the softmax function

soft (s0 , s1 , . . . , sC−1 ) = log (e^{s0} + e^{s1} + ⋯ + e^{sC−1})

is a close and smooth approximation to the maximum of C scalar numbers
s0 , . . . , sC−1 , i.e.,

soft (s0 , s1 , . . . , sC−1 ) ≈ max (s0 , s1 , . . . , sC−1 ) .

Replacing the max function in the multi-class Perceptron cost

g (w0 , . . . , wC−1 ) = (1/P) ∑_{p=1}^{P} [( max_{j = 0,...,C−1} x̊pᵀ wj ) − x̊pᵀ wyp ]

gives the following multi-class Softmax cost function:

g (w0 , . . . , wC−1 ) = (1/P) ∑_{p=1}^{P} [log (∑_{j=0}^{C−1} e^{x̊pᵀ wj}) − x̊pᵀ wyp ] .


Minimizing the multi-class Softmax cost function

The multi-class Softmax cost function is convex, and (unlike the multi-
class Perceptron) it also has infinitely many smooth derivatives,
enabling us to use second-order methods in order to properly
minimize it.
It no longer has a trivial solution at zero.
In [107]:
def multiclass_softmax(w):
    # pre-compute predictions on all points: all_evals has shape (C, P)
    all_evals = model(x,w)

    # softmax (log-sum-exp) across the C classifiers, for each data point
    a = np.log(np.sum(np.exp(all_evals),axis = 0))

    # pick out each point's evaluation under its true class via fancy indexing
    b = all_evals[y.astype(int).flatten(),np.arange(np.size(y))]
    cost = np.sum(a - b)

    # return average
    return cost/float(np.size(y))


In [108]:
# load in dataset
data = np.loadtxt(dataset_path_2,delimiter = ',')

# get input/output pairs


x = data[:-1,:]
y = data[-1:,:]

# create instance of multiclass visualizer


demo = MulticlassVisualizer(data)

# visualize dataset
demo.show_dataset()

# run gradient descent to minimize cost


g = multiclass_softmax; w = 0.1*np.random.randn(3,4); max_its = 5;
weight_history = demo.newtons_method(g,w,max_its=max_its)


In [109]:
demo.show_complete_coloring(weight_history, cost = multiclass_softmax)


Alternative formulations of the multi-class Softmax

g (w0 , . . . , wC−1 ) = (1/P) ∑_{p=1}^{P} [log (∑_{j=0}^{C−1} e^{x̊pᵀ wj}) − x̊pᵀ wyp ] .

The pth summand of the multi-class Softmax cost can be written as:

log (∑_{j=0}^{C−1} e^{x̊pᵀ wj}) − x̊pᵀ wyp = log (∑_{j=0}^{C−1} e^{x̊pᵀ wj}) − log (e^{x̊pᵀ wyp}) = log ( (∑_{j=0}^{C−1} e^{x̊pᵀ wj}) / e^{x̊pᵀ wyp} ) .

Altogether we have:

g (w0 , . . . , wC−1 ) = (1/P) ∑_{p=1}^{P} log ( (∑_{j=0}^{C−1} e^{x̊pᵀ wj}) / e^{x̊pᵀ wyp} ) = (1/P) ∑_{p=1}^{P} log (1 + ∑_{j=0, j≠yp}^{C−1} e^{x̊pᵀ (wj − wyp)} ) .

Or, equivalently:

g (w0 , . . . , wC−1 ) = (1/P) ∑_{p=1}^{P} log ( (∑_{j=0}^{C−1} e^{x̊pᵀ wj}) / e^{x̊pᵀ wyp} ) = − (1/P) ∑_{p=1}^{P} log ( e^{x̊pᵀ wyp} / ∑_{j=0}^{C−1} e^{x̊pᵀ wj} ) .
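The last form is the familiar negative log-likelihood of normalized class scores. As a rough sketch (not the notebook's own implementation), it can be computed in a numerically stable way with scipy's logsumexp, under the usual data layout assumed here (x of shape (N, P), y of shape (1, P), w of shape (N + 1, C)):

import numpy as np
from scipy.special import logsumexp

def multiclass_softmax_nll(w, x, y):
    all_evals = w[0] + np.dot(x.T, w[1:])                                # shape (P, C): x̊_p^T w_j
    log_probs = all_evals - logsumexp(all_evals, axis=1, keepdims=True)  # log of normalized class scores
    y_idx = y.astype(int).flatten()
    return -np.mean(log_probs[np.arange(np.size(y)), y_idx])             # average negative log-likelihood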


Regularization and the multi-class Softmax

It is common to regularize the multi-class Softmax via its feature-
touching weights as:

(1/P) ∑_{p=1}^{P} [log (∑_{j=0}^{C−1} e^{bj + xpᵀ ωj}) − (byp + xpᵀ ωyp )] + λ ∑_{j=0}^{C−1} ∥ωj∥₂²

λ can be set to some small value like 10⁻³.

Regularization can also help prevent local optimization methods like
Newton's method (which takes large steps) from diverging when
dealing with perfectly separable data.


In [122]:
# compute C linear combinations of input point, one per classifier
def model(x,w):
    a = w[0] + np.dot(x.T,w[1:])
    return a.T

In [123]:
# multiclass softmax regularized by the summed length of all normal vectors
lam = 10**(-5) # our regularization parameter
def multiclass_softmax(w):
    # pre-compute predictions on all points: all_evals has shape (C, P)
    all_evals = model(x,w)

    # softmax (log-sum-exp) across the C classifiers, for each data point
    a = np.log(np.sum(np.exp(all_evals),axis = 0))

    # pick out each point's evaluation under its true class via fancy indexing
    b = all_evals[y.astype(int).flatten(),np.arange(np.size(y))]
    cost = np.sum(a - b)

    # add regularizer: lambda times the squared Frobenius norm of the feature-touching weights
    cost = cost + lam*np.linalg.norm(w[1:,:],'fro')**2

    # return average
    return cost/float(np.size(y))


In [124]:
# load in dataset
data = np.loadtxt(dataset_path_1,delimiter = ',')

# get input/output pairs


x = data[:-1,:]
y = data[-1:,:]

# create instance of multiclass visualizer


demo = MulticlassVisualizer(data)

# run gradient descent to minimize cost


g = multiclass_softmax; w = 0.1*np.random.randn(3,3); max_its = 200; alpha_choice = 1;
weight_history = demo.gradient_descent(g,w,alpha=alpha_choice,max_its=max_its)

In [125]:
# plot classification of space, individual learned classifiers (left panel) and joint boundary (middle panel), and cost-function panel in the right pan
demo.show_complete_coloring(weight_history, cost = multiclass_softmax)


In [132]:
# load in dataset
data = np.loadtxt(dataset_path_1,delimiter = ',')

# get input/output pairs


x = data[:-1,:]
y = data[-1:,:]

# create instance of multiclass visualizer


demo = MulticlassVisualizer(data)

# run gradient descent to minimize cost


g = multiclass_softmax; w = 0.1*np.random.randn(3,3); max_its = 5; alpha_choice = 5;
weight_history = demo.newtons_method(g,w,alpha=alpha_choice,max_its=max_its)

In [133]:
demo.show_complete_coloring(weight_history, cost = multiclass_softmax)
