0% found this document useful (0 votes)

9 views

LinearModels Slides

Uploaded by

bmhungqb

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

9 views

LinearModels Slides

Uploaded by

bmhungqb

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 130

6/24/24, 4:34 AM LinearModels slides

Linear Models

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 1/130

6/24/24, 4:34 AM LinearModels slides

Linear Regression

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 2/130

6/24/24, 4:34 AM LinearModels slides

Linear Regression

Data for regression problems comes in the form of a set of P

input/output observation pairs

(x1 , y1 ) , (x2 , y2 ) , . . . , (xP , yP )

P
or {(xp , yp )}p=1 for short, where xp and yp denote the pth input and
output respectively.

In general, each input xp is a column vector of length N :

⎡ x1,p ⎤
⎢
⎢
x2,p ⎥
⎥
xp = ⎢ ⎥
⎢
⎢ ⋮ ⎥ ⎥
⎣ xN,p ⎦

and each output yp is scalar-valued.

Geometrically, the linear regression problem is analogously to fitting a

hyperplane to a scatter of points in N + 1 dimensional space.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 3/130

6/24/24, 4:34 AM LinearModels slides

Linear Regression

When N = 1, the inputs are also scalar-valued, linear regression

simplifies to fitting a line to the associated scatter of P data points in
two-dimensional space.

A line in 2D is determined by two parameters: a vertical intercept w0

and a slope w1 . We need to determine their values so that the following
approximate linear relationship holds between the input/output data:
w 0 + xp w 1 ≈ y p , p = 1, . . . , P .

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 4/130

6/24/24, 4:34 AM LinearModels slides

Linear Regression

In general, when dealing with N -dimensional inputs, we have a bias

weight and N associated slope weights to tune properly in order to fit
a hyperplane with the linear relationship written as:
w0 + x1,p w1 + x2,p w2 + ⋯ + xN,p wN ≈ yp , p = 1, . . . , P .

Because each dimension of the input is referred to as a feature or input

feature, we will often refer to w1 , w2 , . . . , wN as the feature-touching
weights. The only weight not touching a feature is the bias w0 .

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 5/130

6/24/24, 4:34 AM LinearModels slides

We can use the notation x̊ to denote an input x with a 1 placed on top

of it as

⎡ 1 ⎤
⎢ x1,p ⎥
⎢ ⎥
⎢ ⎥
x̊p = ⎢ x2,p ⎥ , p = 1, . . . , P
⎢ ⎥
⎢ ⎥
⎢ ⋮ ⎥
⎣x ⎦
N,p

Now we can place all parameters in a single column vector w:

⎡ w0 ⎤
⎢ w1 ⎥
⎢
⎢ ⎥
⎥
w = ⎢ w2 ⎥
⎢
⎢ ⎥
⎥
⎢ ⋮ ⎥
⎣w ⎦
N

We can now write the linear relationship more compactly as

x̊Tp w ≈ yp p = 1, . . . , P .

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 6/130

6/24/24, 4:34 AM LinearModels slides

The Least Squares cost function

To find the parameters of the hyperplane which best fits a regression

dataset, we must form a cost function that measures how well a linear
model with a particular choice of parameters fits the regression data.

A popular choice is the Least Squares cost function.

For a given set of parameters w this cost function computes the total
squared error between the associated hyperplane and the data, giving
a good measure of how well the particular linear model fits the
dataset.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 7/130

6/24/24, 4:34 AM LinearModels slides

The best fitting hyperplane is the one whose parameters minimize this
error.

We want to find a weight vector w so that for any data point (xp , yp ):

x̊Tp w ≈ yp

i.e., the approximate is as close as possible to the output.

Or equivalently, the error between x̊Tp w − yp is small as possible.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 8/130

6/24/24, 4:34 AM LinearModels slides

The Least Squares cost function

By squaring the error x̊Tp w − yp (so that both negative and positive
errors of the same magnitude are treated equally), we can define:
2
gp (w) = (x̊Tp w − yp )

as a point-wise cost function that measures the error of a model (in this
case a linear model) on the individual point (xp , yp ).

We want all P such values to be small. We can take their average over
the entire dataset, forming the Least Squares cost function for linear
regression:

1 P 1 P 2
g (w) = ∑ gp (w) = ∑ (x̊Tp w − yp )
P p=1 P p=1

We want to find the optimal parameter vector w that minimizes g(w):

1 P 2
minimize ∑ (x̊Tp w − yp )
w P p=1

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 9/130

6/24/24, 4:34 AM LinearModels slides

Minimization of the Least Squares cost function

Example: Visually verifying the convexity of the cost function for a toy dataset

Consider a toy dataset of 50 random points selected off the line y = x with
a small amount of Gaussian noise added to each point.
In [7]:
demo = linear_regression_visualizer(data)
demo.plot_data()

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 10/130

6/24/24, 4:34 AM LinearModels slides

In [8]:
# compute linear combination of input points
def model(x,w):
a = w[0] + np.dot(x.T,w[1:])
return a.T

# an implementation of the least squares cost function for linear regression

def least_squares(w):
# compute the least squares cost
cost = np.sum((model(x,w) - y)**2)
return cost/float(y.size)

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 11/130

6/24/24, 4:34 AM LinearModels slides

The contour plot and the surface generated by the Least Squares cost
function:
In [9]:
static_visualizer().two_input_surface_contour_plot(least_squares,[],view = [10,70],xmin = -4.5, xmax = 4.5, ymin = -4.5, ymax = 4.5,num_contours = 30)

The upward bending shape of the cost function's surface and the elliptical
shape of its contour lines show that the Least Squares cost function is
indeed convex for linear regression models.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 12/130

6/24/24, 4:34 AM LinearModels slides

Minimization of the Least Squares cost function

The Least Squares cost function for linear regression is always convex
regardless of the input dataset.

We can easily apply first or second order methods to minimize the

Least Squares cost function for linear regression.

Important issues with first and second order methods still need to be
considered:
1. How should we choose a steplength / learning rate for gradient
descent?
2. Newton's method can only be applied when N is of moderate value
(e.g., in the thousands).

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 13/130

6/24/24, 4:34 AM LinearModels slides

Minimization of the Least Squares cost function

The gradient of the Least Squares cost function can be computed as:

2 P
∇g (w) = ∑ x̊ (x̊T w − yp )
P p=1 p p

If we use Newton's method, we can compute the Hessian of the Least

Squares cost function. Because the Least Squares cost function for
linear regression is a convex quadratic, only a single Newton step can
completely minimize it.

This single Newton step solution is often referred to as minimizing the

Least Squares cosft via its normal equations. The system of equations
solved in taking this single Newton step is equivalent to:

(∑ x̊p x̊Tp ) w = ∑ x̊p yp .

P P

p=1 p=1

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 14/130

6/24/24, 4:34 AM LinearModels slides

Example: Using gradient descent

We use gradient descent to minimize the Least Squares cost over the
previous toy dataset.
We employ a fixed steplength value α = 0.5 for all 75 steps until
reaching the minimum of the function.
In [12]:
demo.animate_it_2d(video_path_1,weight_history,num_contours = 30,fps=20)

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 15/130

6/24/24, 4:34 AM LinearModels slides

In [13]:
show_video(video_path_1, width=800)

Out[13]:

As gradient descent approaches the mininum of the cost function, the

corresponding parameters provide a better and better fit to the data.
The best fit occurs at the end of the run at the point closest to the
Least Squares cost's minimizer.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 16/130

6/24/24, 4:34 AM LinearModels slides

When we use a local optimization method like gradient descent, we

must properly tune the steplength parameter α.

We compare two steplength values α = 0.01 and α = 0.5.

In [15]:
static_visualizer().plot_cost_histories([cost_history_1,cost_history_2],start = 0,points = False,labels = [r'$\alpha = 0.5$',r'$\alpha = 0.01$'])

The steplength parameter is often called the learning rate in machine

learning, because this value determines how quickly the proper
parameters of the model are learned.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 17/130

6/24/24, 4:34 AM LinearModels slides

Linear Two-Class Classification

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 18/130

6/24/24, 4:35 AM LinearModels slides

Linear Two-Class Classificationression and the Cross Entropy Cost

Two-class classification is a particular instance of regression or surface-

fitting wherein the data still comes in the form of P input/output pairs
{(xp , yp )}Pp=1 , and each input xp is an N -dimensional vector.

However, the corresponding output yp is no longer continuous but

takes on only two discrete numbers.

While the actual value of these numbers in principle arbitrary, we

suppose that the output of our data takes on either the value of 0 or
+1, i.e., yp ∈ {0, +1}.

In the context of classification, the output values yp are called labels,

and all the points sharing the same label value are referred to as a class
of data.

A dataset containing points with label values yp ∈ {0, +1} is said to be a

dataset consisting of two classes.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 19/130

6/24/24, 4:35 AM LinearModels slides

For N = 1 (left), the bottom step is the region of the space containing
class 0, i.e., yp = 0. The top step contains class 1, i. e. , yp = +1.

Regression perspective: Two-class classification can be viewed as a case

of nonlinear regression to fit a nonlinear step function to the data.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 20/130

6/24/24, 4:35 AM LinearModels slides

Perceptron perspective: We represent the dataset using its input only:

blue for points with yp = 0 and red for points with yp = +1. The edge
seperating the two steps (and the data points on them) when
projected onto the input space takes the form of a single point (when
N = 1).

When N = 2 (right) the seperating edge is a line.

For general N , what seperates the two classes of data will be a

hyperplane, which is also called a decision boundary.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 21/130

6/24/24, 4:35 AM LinearModels slides

Fitting a discontinuous step function

We fit a line to a two-class dataset via linear regression. The line is a

poor representation of this data because its output takes just two
discrete values.
In [19]:
demo1.naive_fitting_demo()

We pass this fully tuned linear regressor through a discrete step

function by assigningthe label +1 to all output values > 0.5 and the
label 0 to all output < 0.5. The resulting step function is still not a
good representation for the data.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 22/130

6/24/24, 4:35 AM LinearModels slides

We can write a linear model of N -dimensional input as:

T
x̊ w = w0 + x1 w1 + x2 w2 + ⋯ + xN wN .

where

⎡ w0 ⎤ ⎡ 1 ⎤
⎢
⎢ 1
w ⎥
⎥ ⎢
⎢ 1
x ⎥
⎥
⎢
w = ⎢ w2 ⎥ ⎢ ⎥.
⎢ ⎥
⎥
and x̊ = ⎢ x2
⎢ ⎥
⎥
⎢
⎢ ⋮ ⎥
⎥ ⎢
⎢ ⋮ ⎥
⎥
⎣w ⎦ ⎣x ⎦
N N

The step function is defined as: the step function is defined as

step(t) = {
1 if t ≥ 0
.
0 if t < 0

Inserting our linear model through this gives us a step function:

step (x̊ w )
T

with a linear decision boundary between its lower and upper steps,
defined by all points x̊ where x̊ w = 0.
T

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 23/130

6/24/24, 4:35 AM LinearModels slides

To tune the weight vector w properly, we can set up a Least Squares

cost function that reflects the input and output of our dataset.

We want the point (xp , yp ) to lie on the correct side of the optimal
decision boundary, i.e., the output yp to lie on the proper step:

step (x̊Tp w ) = yp .

To find weights that satisfy this set of P equalities, we form a Least

Squares cost function as:

1 P 2
g(w) = ∑ (step (x̊Tp w ) − yp )
P p=1

Our ideal weights then correspond to a minimizer of this cost function.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 24/130

6/24/24, 4:35 AM LinearModels slides

1 P 2
g(w) = ∑ (step (x̊Tp w ) − yp )
P p=1

It is very difficult to properly minimize the above Least Squares cost

function using local optimization methods. At virtually every point, the
function is completely flat locally.
In [21]:
demo2 = cost_visualizer(data)
demo2.plot_costs(viewmax = 25,view = [20,125])

The left figure shows that the Least Squares surface consists of
discrete steps at many different levels, and each step is completely flat.
Because of this, local optimization methods cannot be used to
effectively minimize it.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 25/130

6/24/24, 4:35 AM LinearModels slides

The logistic sigmoid function

To make the minimization of the Least Squares cost possible, we can

replace the step function with a continuous approximation that
matches it closely.

Such an approximation is the logistic sigmoid function:

1
σ (x) = .
1 + e−x

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 26/130

6/24/24, 4:35 AM LinearModels slides

Logistic regression using the Least Squares cost

Replacing the step function in step (x̊Tp w ) with its sigmoid

approximation gives the related set of approximate equalities:

σ (x̊Tp w ) = yp p = 1, . . , P

The corresponding Least Squares cost function becomes:

1 P 2
g (w) = ∑ (σ (x̊Tp w ) − yp )
P p=1

Fitting a logistic sigmoid to classification data by minimizing this cost

function is referred to as performing logistic regression because we
are performing regression using a logistic function.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 27/130

6/24/24, 4:35 AM LinearModels slides

Using normalized gradient descent for Least Squares logistic regression

In [22]:
# define sigmoid function
def sigmoid(t):
return 1/(1 + np.exp(-t))

# sigmoid non-convex logistic least squares cost function

def sigmoid_least_squares(w):
cost = 0
for p in range(y.size):
x_p = x[:,p]
y_p = y[:,p]
cost += (sigmoid(w[0] + w[1]*x_p) - y_p)**2
return cost/y.size

In [24]:
demo.plot_costs(viewmax = 25, view = [21,121])

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 28/130

6/24/24, 4:35 AM LinearModels slides

The resulting cost function is non-convex, but it can still be minimized

using local optimization methods.
Although this cost function is very flat in many places, normalized
gradient descent is designed specifically to deal with costs like this.
In [27]:
demo2.static_fig(w_hist,num_contours = 25,viewmax = 31)

Initialize at the point w0 = −w1 = 20 and run normalized gradient descent

for 900 iterations with α = 1.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 29/130

6/24/24, 4:35 AM LinearModels slides

Logistic Regression using the Cross Entropy cost

Instead of using the squared error point-wise cost

2
gp (w) = (σ (x̊Tp w ) − yp ) , we could use the log error cost defined as:

⎧
⎪ −log (σ (x̊Tp w )) if yp = 1
gp (w) = ⎨
⎩ −log (1 − σ (x̊Tp w )) if yp = 0.
⎪

This log error point-wise cost is always non-negative (regardless of the

input and weight values) with a minimum value of 0.

Since our label values yp ∈ {0, 1} we can write the log error equivalently
in a single line:

gp (w) = −yp log (σ (x̊Tp w )) − (1 − yp ) log (1 − σ (x̊Tp w )) .

The overall cost function over all P data points is:

1 P 1 P
g (w) = ∑ gp (w) = − ∑ yp log (σ (x̊Tp w )) + (1 − yp ) log (1 − σ (x̊Tp w ))
P p=1 P p=1

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 30/130

6/24/24, 4:35 AM LinearModels slides

This log error cost penalizes violations of our desired equalities much
more harshly than a squared error does:
In [28]:
fig = plt.figure(figsize = (9,3))
y = 1
alpha = np.linspace(0.5,.999,100)
least_squares = (y-alpha)**2
plt.plot(alpha, least_squares, 'b--')
log_error = -np.log(alpha)
plt.plot(alpha, log_error, 'r--')
y = 0
alpha = np.linspace(0.001,.5,100)
least_squares = (y-alpha)**2
plt.plot(alpha, least_squares, color='b', label='squared error')
log_error = -np.log(1-alpha)
plt.plot(alpha, log_error, color='r', label='log error')
plt.legend()
plt.xlabel('sigma',fontsize=14)
plt.ylabel('g_p', fontsize=14, rotation=0, labelpad=30)
plt.show()

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 31/130

6/24/24, 4:35 AM LinearModels slides

Minimizing the Cross Entropy cost

In [29]:
# compute linear combination of input point
def model(x,w):
a = w[0] + np.dot(x.T,w[1:])
return a.T

In [30]:
# define sigmoid function
def sigmoid(t):
return 1/(1 + np.exp(-t))

# the convex cross-entropy cost function

def cross_entropy(w):
# compute sigmoid of model
a = sigmoid(model(x,w))

# compute cost of label 0 points

ind = np.argwhere(y == 0)[:,1]
cost = -np.sum(np.log(1 - a[:,ind]))

# add cost on label 1 points

ind = np.argwhere(y==1)[:,1]
cost -= np.sum(np.log(a[:,ind]))

# compute cross-entropy
return cost/y.size

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 32/130

6/24/24, 4:35 AM LinearModels slides

Minimizing the Cross Entropy cost

To optimally tune the parameters w, we want to minimize the Cross

Entropy cost:

1 P
minimize − ∑ yp log (σ (x̊Tp w )) + (1 − yp ) log (1 − σ (x̊Tp w ))
w P p=1

To minimize the Cross Entropy cost, we can use any local optimization
method.

The gradient can be computed as:

1 P
∇g (w) = − ∑ (yp − σ (x̊Tp w )) x̊p
P p=1

The Hessian of the Cross Entropy cost function is:

1 P
∇2 g (w) = ∑ σ (x̊Tp w ) (1 − σ (x̊Tp w )) x̊p x̊Tp .
P p=1

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 33/130

6/24/24, 4:35 AM LinearModels slides

Using gradient descent to perform logistic regression using the Cross

Entropy cost

Initialize at point w0 = w1 = 3, set α = 1, and run for 25 steps.

In [33]:
show_video(video_path_1)

Out[33]:

The plotted surface of the Cross Entropy cost function looks convex.
Indeed, unlike the Least Squares, the Cross Entropy cost is always
convex regardless of the dataset used.

For this reason, the Cross Entropy cost is often used to perform logistic
regression.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 34/130

6/24/24, 4:35 AM LinearModels slides

Using gradient descent to perform logistic regression using the Cross

Entropy cost

We re-run gradient descent with the same initialization w0 = w1 = 3 and

fixed steplength α = 1 for 2000 iterations.
In [34]:
# run gradient descent to minimize the softmax cost
g = cross_entropy; w = np.array([3.0,3.0])[:,np.newaxis]; max_its = 2000; alpha_choice = 1;
weight_history,cost_history = gradient_descent(g,alpha_choice,max_its,w)

# create a static figure illustrating gradient descent steps

animator.static_fig(weight_history,num_contours = 25,viewmax = 12)

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 35/130

6/24/24, 4:35 AM LinearModels slides

Logistic Regression and the Softmax Cost

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 36/130

6/24/24, 4:35 AM LinearModels slides

Different labels, same story

If we change the label values from yp ∈ {0, 1} to yp ∈ {−1, +1}, we can

derive the same optimization problem.

Instead of our data sitting on a step function with lower and upper
steps taking on the values 0 and 1, respectively, they take on values -1
and +1.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 37/130

6/24/24, 4:35 AM LinearModels slides

This step function is called a sign function since it returns the

numerical sign of its input:

sign(x) = {
+1 if x ≥ 0
.
−1 if x < 0

Inserting a linear model through the sign function gives us a step

function:

sign (x̊ w )
T

with a linear decision boundary between its two steps defined by all points
x̊ where x̊ w = 0.
T T

Any input lying exactly on the decision boundary can be assigned a

label at random.

A point is classified correctly when its true label is predicted correctly:

sign (x̊Tp w ) ≈ yp

Otherwise, it is said to have been misclassified.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 38/130

6/24/24, 4:35 AM LinearModels slides

Previously, we used the logistic sigmoid function to replace the step

function.
We now use an adjusted version of the logistic sigmoid function so
that its values range between -1 and 1 (instead of 0 and 1). This scaled
version of the sigmoid, called the hyperbolic tangent function, is:
2
tanh(x) = 2 σ (x) − 1 = − 1.
1 + e−x

The sigmoid function σ(⋅) ranges smoothly between 0 and 1. The tanh(⋅)
ranges smoothly between -1 and +1.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 39/130

6/24/24, 4:35 AM LinearModels slides

We can form a Least Squares cost function with the tanh(⋅):

1 P 2
g(w) = ∑ (tanh (x̊Tp w ) − yp )
P p=1

which is non-convex with undesirable flat regions, requiring specialized

local methods for proper minimization.
In [38]:
demo2.plot_costs(viewmax = 25,view = [20,125])

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 40/130

6/24/24, 4:35 AM LinearModels slides

Logistic regression using the Softmax cost

We rearrange the hyperbolic tangent function in terms of the sigmoid:

tanh(x) + 1
σ (x) = .
2

We can note that:

tanh (x) ≈ +1 ⟺ σ (x) ≈ 1

tanh (x) ≈ −1 ⟺ σ (x) ≈ 0.

Therefore, we can employ the same point-wise log error cost:

⎧
⎪ −log (σ (x̊Tp w )) if yp = +1
gp (w) = ⎨
⎩ −log (1 − σ (x̊Tp w )) if yp = −1.
⎪

We have:

1 1 + e−x 1 e−x 1
1 − σ (x) = 1 − = − = = = σ(−x)
1+e −x 1+e −x 1+e −x 1+e −x 1 + ex

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 41/130

6/24/24, 4:35 AM LinearModels slides

Then:

⎧
⎪ −log (σ (x̊Tp w )) if yp = +1
gp (w) = ⎨
⎩ −log (σ (−x̊Tp w )) if yp = −1.
⎪

Because we are using the label values ±1 we can move the label value in
each case inside the inner most paraenthesis, and we can write both
cases in a single line as

gp (w) = −log (σ (yp x̊Tp w )) .

Since −log (x) = 1

x we can re-write the point-wise cost above equivalently
as

gp (w) = −log ( ) = log (1 + e−yp x̊p w )

1 T

−yp x̊Tp w
1+e

Taking the average of this point-wise cost over all P points we have the
Softmax cost for logistic regression:

1 P 1 P
∑ gp (w) = ∑ log (1 + e−yp x̊p w )
T
g(w) =
P p=1 P p=1

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 42/130

6/24/24, 4:35 AM LinearModels slides

Minimizing Softmax logistic regression using standard gradient descent

In [39]:
# compute linear combination of input point
def model(x,w):
a = w[0] + np.dot(x.T,w[1:])
return a.T

In [40]:
# define sigmoid function
def sigmoid(t):
return 1/(1 + np.exp(-t))

# the convex cross-entropy cost function

def softmax(w):
# compute sigmoid of model
a = sigmoid(model(x,w))

# compute cost of label 0 points

ind = np.argwhere(y == -1)[:,1]
cost = -np.sum(np.log(1 - a[:,ind]))

# add cost on label 1 points

ind = np.argwhere(y==+1)[:,1]
cost -= np.sum(np.log(a[:,ind]))

# compute cross-entropy
return cost/y.size

In [41]:
# the convex softmax cost function
def softmax(w):
cost = np.sum(np.log(1 + np.exp(-y*model(x,w))))
return cost/float(np.size(y))

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 43/130

6/24/24, 4:35 AM LinearModels slides

Minimizing Softmax logistic regression using standard gradient descent

The gradient of the Softmax cost function:

1 P −yp x̊Tp w
∇g (w) = − ∑ e y x̊
P p=1 1 + e−yp x̊Tp w p p

In [44]:
animator.static_fig(weight_history,num_contours = 25,viewmax = 12)

The Softmax cost for logistic regression is always convex.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 44/130

6/24/24, 4:35 AM LinearModels slides

Noisy classification datasets

In [46]:
demo.plot_data(view = [15,-140])

There are 03 points in the example that look like they are on the wrong
side.
Note: in the classification context a 'noisy' point is one that has an
incorrect label. Such points are often misclassified by a trained
classifier, meaning that their true label value will not be correctly
predicted.

Two class classification datasets typically have noise of this kind and
are not often perfectly separable by a hyperplane.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 45/130

6/24/24, 4:35 AM LinearModels slides

We run 100 steps of gradient descent with steplength α = 1.

In [48]:
static_visualizer().plot_cost_histories([cost_history],start = 0,points = False,labels = [r'$\alpha = 1$'])

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 46/130

6/24/24, 4:35 AM LinearModels slides

In [50]:
demo.static_fig(weight_history[-1],view = [15,-140])

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 47/130

6/24/24, 4:35 AM LinearModels slides

The Perceptron

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 48/130

6/24/24, 4:35 AM LinearModels slides

The Perceptron
We treat classification as a particular form of non-linear regression
(e.g., employing a tanh nonlinearity for classification data with label
values yp ∈ {−1, +1}).

This results in learning a proper nonlinear regressor and a

corresponding linear decision boundary:
T
x̊ w = 0.

Instead of learning this decision boundary as a result of a nonlinear

regression, the Perceptron aims at determining this ideal linear
decision boundary directly.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 49/130

6/24/24, 4:35 AM LinearModels slides

The Perceptron cost function

With two-class classification we have a training set of P points

{(xp , yp )}Pp=1 , where yp ∈ {−1, +1} - consisting of two classes which we
would like to learn how to distinguish between automatically.

In the simplest instance our two classes of data are largely separated
by a linear decision boundary given by the collection of input x where
x̊T w = 0 with each class (largely) lying on either side.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 50/130

6/24/24, 4:35 AM LinearModels slides

The linear decision boundary cuts the input space into two half-spaces,
one lying "above" the hyperplane where x̊T w > 0, and one lying "below"
it where x̊T w < 0.

Because we can always flip the orientation of an ideal hyperplane by

multiplying it by -1, we can say in general that when the weights of a
hyperplane are tuned properly, members of the class yp = +1 lie (mostly)
"above" it, while members of the yp = −1 class lie (mostly) "below" it.

Our desired set of weights define a hyperplane where as often as

possible we have that

x̊Tp w > 0 if yp = +1
x̊Tp w < 0 if yp = −1.

We can write both into a single equation:

−yp x̊Tp w < 0.

The ideal condition above representing a hyperplane correctly

classifying the point xp forms a point-wise cost:

gp (w) = max (0, −yp x̊Tp w ) = 0

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 51/130

6/24/24, 4:35 AM LinearModels slides

The expression max (0, −yp x̊Tp w ) is always nonnegative, since it returns
zero if xp is classified correctly, and returns a positive value if the point
is classified incorrectly.

The functional form of this point-wise cost max (0, ⋅) is called a rectified
linear unit.

Because these point-wise costs are nonnegative and equal zero when
our weights are tuned correctly, we can take their average over the
entire dataset to form a proper cost function as:

1 P 1 P
g (w) = ∑ gp (w) = ∑ max (0, −yp x̊Tp w ) .
P p=1 P p=1

This cost function goes by many names such as the perceptron cost,
the rectified linear unit cost (or ReLU cost for short), and the hinge cost
(since when plotted a ReLU function looks like a hinge). This cost
function is always convex but has only a single (discontinuous)
derivative in each input dimension.

This implies that we can only use zero and first order local optimization
schemes (i.e., not Newton's method).

Note that the perceptron cost always has a trivial solution at w = 0,

since indeed g (0) = 0, thus one may need to take care in practice to
avoid finding it (or a point too close to it) accidentally.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 52/130

6/24/24, 4:35 AM LinearModels slides

The Softmax approximation to the Perceptron

When dealing with a cost function that has some deficit, we replace it
with a smooth (or at least twice differentiable) cost function that
closely matches it everywhere.
If the approximation closely matches the true cost function, then for a
small amount of accuracy, we considerably broaden the set of
optimization tools we can use.

We replace the max function portion of the Perceptron cost with the
Softmax function defined as:

soft (s0 , s1 , . . . , sC−1 ) = log (es0 + es1 + ⋯ + esC−1 )

where s0 , s1 , . . . , sC−1 are any C scalar values - which is a generic smooth

approximation to the max function, i.e.,
soft (s0 , s1 , . . . , sC−1 ) ≈ max (s0 , s1 , . . . , sC−1 )

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 53/130

6/24/24, 4:35 AM LinearModels slides

Example: when C = 2:
Suppose s0 ≤ s1 , so that max (s0 , s1 ) = s1 .
max (s0 , s1 ) = s0 + (s1 − s0 ) = log (es0 ) + log (es1 −s0 )

soft (s0 , s1 ) = log (es0 + es1 ) = log (es0 ) + log (1 + es1 −s0 )

soft (s0 , s1 ) is always larger than max (s0 , s1 ) but not by much.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 54/130

6/24/24, 4:35 AM LinearModels slides

We replace the ReLU perceptron cost in the point-wise cost function

with its softmax approximation:

gp (w) = soft (0, −yp x̊Tp w ) = log (e0 + e−yp x̊p w ) = log (1 + e−yp x̊p w )
T T

The overall cost function now is:

P P
g (w) = ∑ gp (w) = ∑ log (1 + e−yp x̊p w )
T

p=1 p=1

Like the ReLU cost, the Softmax cost is convex.

Unlike the ReLU cost, the Softmax cost has infinitely many derivatives
and Newton's method can be used.
The Softmax cost does not have a trivial solution at zero.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 55/130

6/24/24, 4:35 AM LinearModels slides

Example: Using Newton's method to minimize the Softmax cost

In [52]:
### define softmax cost ###
# compute linear combination of input point
def model(x,w):
a = w[0] + np.dot(x.T,w[1:])
return a.T

# the convex softmax cost function

def softmax(w):
cost = np.sum(np.log(1 + np.exp(-y*model(x,w))))
return cost/float(np.size(y))

# load in dataset
data = np.loadtxt(data_path_1, delimiter = ',')

# get input/output pairs

x = data[:-1,:]
y = data[-1:,:]

# run gradient descent to minimize the softmax cost

g = softmax; w = np.ones((2,1)); max_its = 5;
weight_history,cost_history = newtons_method(g,max_its,w,epsilon = 10**(-7))

# create a static figure illustrating gradient descent steps

animator = classification_2d_visualizer(data,g)

In [53]:
animator.static_fig(weight_history,num_contours = 25,viewmax = 25)

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 56/130

6/24/24, 4:35 AM LinearModels slides

The Softmax cost and a problem with linearly separable datasets

Suppose that we have a dataset whose two classes can be perfectly

separated by a hyperplane, and that we have chosen an appropriate
cost function to minimize it in order to determine proper weights for
our model.

Suppose further that we are extremely lucky and our initialization w0

produces a linear decision boundary x̊ with perfect sepearation.
T
w0 = 0

This means that for each of our P points, we have:

−yp x̊Tp w0 < 0

The point-wise Perceptron/ReLU cost is zero for every point i.e.,

gp (w0 ) = max (0, −yp x̊Tp w0 ) = 0

The Perceptron/ReLU cost is exactly equal to zero:

1 P
g (w0 ) = ∑ max (0, −yp x̊Tp w0 ) = 0.
P p=1

Any local optimization algorithm will halt immediately. This will not be
the case if we use the Softmax cost instead of the Perceptron cost.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 57/130

6/24/24, 4:35 AM LinearModels slides

The Softmax cost and a problem with linearly separable datasets

We always have e−y x̊ w > 0.

T 0
p p

The softmax point-wise cost is also nonnegative

gp (w0 ) = log (1 + e−y x̊ w ) > 0.
T 0
p p

The Softmax cost is nonnegative as well

P
g (w0 ) = ∑ log (1 + e−yp x̊p w ) > 0.
T 0

p=1

Using any local optimization method, e.g., gradient descent, we will

take steps away from the initialization w0 to drive the value of the
Softmax cost lower and lower toward its minimum at zero.

With data that is indeed linearly separable, the Softmax cost achieves
this lower bound only only when the magnitude of the weights grows
to infinity.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 58/130

6/24/24, 4:35 AM LinearModels slides

The Softmax cost and a problem with linearly separable datasets

P
g (w0 ) = ∑ log (1 + e−yp x̊p w ) > 0.
T 0

p=1

Each individual term log (1 + e−C ) = 0 only as C ⟶ ∞.

If we multiply our initialization w0 by any constant C > 1 we can

decrease the value of any negative exponential involving one of our
data points since C(−yp x̊Tp w0 ) < −yp x̊Tp w0 and so e−y x̊ .
T 0 T 0
p p Cw < e−yp x̊p w

This decreases the Softmax cost as well with the minimum achieved as
C ⟶ ∞. However, regardless of the scalar C > 1, the decision boundary
defined by the initial weights x̊Tp w0 = 0 does not change location.

This can cause numerical instability issues with local optimization

methods that make large progress at each step, e.g., Newton's method.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 59/130

6/24/24, 4:35 AM LinearModels slides

If we simply flip one of the labels - making this dataset not perfectly
linearly separable - the corresponding cost function does not have a
global minimum out at infinity, as illustrated in the contour plot below.
In [54]:
# switch a label
y[0,-1] = -1
data = np.vstack((x,y))

# draw contour of this new data

animator = classification_2d_visualizer(data,softmax)

# create a static figure illustrating gradient descent steps

animator.static_fig([],num_contours = 15,viewmax = 25)

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 60/130

6/24/24, 4:35 AM LinearModels slides

Normalizing feature-touching weights

To control the magnitude of w means that we want to control the size

of the N + 1 individual weights it contains

⎡ w0 ⎤
⎢ w1 ⎥
w=⎢
⎢
⎥.
⎥
⎢ ⋮ ⎥
⎣w ⎦
N

We can do this by directly controling the size of just N of these

weights, and it is particularly convenient to do so using the final N
feature touching weights w1 , w2 , . . . , wN because these define the normal
vector to the linear decision boundary x̊ w = 0.
T

We express the linear decision boundary as:

T T
x̊ w = b + x ω = 0

where the feature-touching weights ω define the normal vector of the

linear decision boundary.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 61/130

6/24/24, 4:35 AM LinearModels slides

A linear decision boundary written as b + x has a normal vector ω

T
ω=0

defined by its feature-touching weights.

The normal vector to a hyperplane (like our decision boundary) is

always perpendicular to it.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 62/130

6/24/24, 4:35 AM LinearModels slides

We have a point xp lying 'above' the linear decision boundary on a

translate of the decision boundary where b + x
T
ω=β>0

Suppose we know the location of the vertical projection of xp onto the

decision boundary, called x′p

The signed distance between xp and its vertical projection x′p :

d = ∥x′p − xp ∥2 sign (β) = ∥x′p − xp ∥2

where sign(β) = +1 in this example.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 63/130

6/24/24, 4:35 AM LinearModels slides

Because the vector x′p − xp is perpendicular to the decision boundary

(and so is parallel to the normal vector ω), the inner produce rule gives:
T
(x′p − xp ) ω = −∥x′p − xp ∥2 ∥ω∥2 = −d ∥ω∥2

Take the difference between the decision boundary and its translation
evaluated at x′p and xp :

0 − β = (b + (xp ) ω) − (b + xTp ω) = (x′p − xp ) ω

′ T T

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 64/130

6/24/24, 4:35 AM LinearModels slides

Since both formulae are equal to (x′p − xp )T ω we can set them equal to
each other:

d ∥ω∥2 = β

The signed distance d of xp to the decision boundary is

β b + xTp ω
d= = .
∥ω∥2 ∥ω∥2

If the feature-touching weights have unit length as ∥ω∥2 = 1 then the

signed distance d of a point xp to the decision boundary is given simply
by its evaluation b + xTp ω.

If xp were to lie below the decision boundary and β < 0, we have the
same derivation.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 65/130

6/24/24, 4:35 AM LinearModels slides

Another example with the input dimension N = 3 and the decision

boundary is a true hyperplane.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 66/130

6/24/24, 4:35 AM LinearModels slides

We can scale any linear decision boundary by a non-zero scalar C and

it still defines the same hyperplane. If we multiply by C = 1
∥ω∥2
we have:

T
b + x ω = b + xT ω = 0
∥ω∥2 ∥ω∥2 ∥ω∥2

Our feature-touching weights have unit length as ∥∥ ∥ω∥

ω ∥ = 1.
∥2
2

Regardless of how large our weights w were to begin with we can always
normalize them in a consistent way by dividing off the magnitude of ω.

This normalization scheme is particularly useful in the context of the

technical issue with the Softmax/Cross Entropy highlighted above
because clearly a decision boundary that perfectly separates two
classes of data can be feature-weight normalized to prevent its weights
from growing too large (and diverging to infinity).

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 67/130

6/24/24, 4:35 AM LinearModels slides

Regularizing two-class classification

We constrain the Softmax / Cross-Entropy cost so that feature-

touching weights always have length one i.e., ∥ω∥2 = 1.

1 P
∑ log (1 + e−yp (b+xp ω ) )
T
minimize
b, ω P p=1
2
subject to ∥ω∥2 = 1

We can solve the constrained optimization problem above by solving

the highly-related unconstrained regularized version of the original
Softmax cost:

1 P
∑ log (1 + e−yp (b+xp ω ) ) + λ ∥ω∥22
T
g (b, ω) =
P p=1

The regularization parameter λ is used to balance how strongly we

pressure one term or the other in minimizing their sum.

Minimizing the regularizer λ ∥ω∥22 prevents the divergence of their

magnitude since if their size does start to grow we our entire cost
function 'suffers' because of it, and becomes large. Because of this the
value of λ is typically chosen to be small (and positive) in practice,
although some fine-tuning can be useful.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 68/130

6/24/24, 4:35 AM LinearModels slides

Using Newton's method to minimize a regularized Softmax cost

In [56]:
# create a static figure illustrating gradient descent steps
animator.static_fig(weight_history,num_contours = 25,viewmax = 100)

We repeat the previous experiment but add a regularizer with λ = 10−3

to the Softmax cost.

The global minimum no longer lies at infinity.

We still learn a perfect decision boundary by a tightly fitting tanh (⋅)
function.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 69/130

6/24/24, 4:35 AM LinearModels slides

Support Vector Machines

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 70/130

6/24/24, 4:35 AM LinearModels slides

The Margin-Perceptron

Suppose we have a two-class classification training dataset of P points

{(xp , yp )}Pp=1 with the labels yp ∈ {−1, +1}.

Suppose the dataset is linearly separable and we have a linear decision

boundary x̊T w =0 passing evenly through the region separating two
classes.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 71/130

6/24/24, 4:35 AM LinearModels slides

This separating hyperplane creates a buffer between the two classes

confined between two evenly shifted versions of itself:
1. One version that lies on the positive side of the separator and
just touches the class having labels yp = +1 (colored red in the
figure) taking the form x̊T w = +1.
2. One lying on the negative side of it just touching the class with
labels yp = −1 (colored blue in the figure) taking the form
x̊T w = −1.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 72/130

6/24/24, 4:35 AM LinearModels slides

The translations above and below the separating hyperplane are more
generally defined as x̊T w = +β and x̊T w = −β respectively, where β > 0.

By dividing off β in both equations and reassigning the variables as

w ⟵ w
β
we can leave out the redundant parameter β and have the two
translations as stated x̊T w = ±1.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 73/130

6/24/24, 4:35 AM LinearModels slides

All points in the +1 class lie exactly on or on the positive side of

x̊T w = +1.

All points in the −1 class lie exactly on or on the negative side of

x̊T w = −1.

x̊Tp w ≥ 1 if yp = +1
x̊Tp w ≤ −1 if yp = −1

We can combine these conditions into a single statement by

multiplying each by their respective label values, giving the single
inequality yp x̊Tp w ≥1 which can be equivalently written as a point-wise
cost

gp (w) = max (0, 1 − yp x̊Tp w ) = 0

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 74/130

6/24/24, 4:35 AM LinearModels slides

gp (w) = max (0, 1 − yp x̊Tp w ) = 0

Again, this value is always nonnegative. Summing up all P equations of

the form above gives the Margin-Perceptron cost:
P
g (w) = ∑ max (0, 1 − yp x̊Tp w )
p=1

Notice that the original Perceptron cost is:

P
g (w) = ∑ max (0, −yp x̊Tp w )
p=1

The additional 1 prevents the issue of a trivial zero solution with the
original Perceptron cost.

If the data is indeed linearly separable any hyperplane passing

between the two classes will have parameters w where g (w) = 0.

If the data is not linearly separable, a violation for the pth point adds the
positive value of 1 − yp x̊Tp w to the cost function.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 75/130

6/24/24, 4:35 AM LinearModels slides

Relation to the Softmax cost

As with the perceptron, one way to smooth out the margin-perceptron

here is by replacing the max operator with softmax:

soft (0, 1 − yp x̊Tp w ) = log (1 + e1−yp x̊p w )

The '1' used in the 1 − yp (x̊Tp w ) component of the cost could have been
any number we wanted - it was a normalization factor for the width of
the margin and, by convention, we used '1'.

Instead we could have chosen any value ϵ > 0, we have:

max (0, ϵ − yp x̊Tp w ) = 0

For all p, the Margin-Perceptron cost can be stated as:

P
g (w) = ∑ max (0, ϵ − yp x̊Tp w )
p=1

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 76/130

6/24/24, 4:35 AM LinearModels slides

max (0, ϵ − yp x̊Tp w ) = 0

The softmax version is:

soft (0, ϵ − yp x̊Tp w ) = log (1 + eϵ−yp x̊p w )

When ϵ is quite small we of course have that

log (1 + eϵ−yp x̊p w ) ≈ log (1 + e−yp x̊p w ), the same summand used for the
T T

(smoothed) perceptron and logistic regression.

Thus we can in fact use the same softmax cost function here
P
g (w) = ∑ log (1 + e−yp x̊p w )
T
(1)
p=1

as a smoothed version of our Margin-Perceptron cost.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 77/130

6/24/24, 4:35 AM LinearModels slides

Maximum margin decision boundaries

When two classes of data are linearly separable infinitely many

hyperplanes could be drawn to separate the data.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 78/130

6/24/24, 4:35 AM LinearModels slides

While both perfectly distinguish between the two classes the green
separator (with smaller margin) divides up the space in a rather
awkward fashion given how the data is distributed, and will therefore
tend to more easily misclassify future datapoints.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 79/130

6/24/24, 4:35 AM LinearModels slides

On the other hand, the black separator (having a larger margin) divides
up the space more evenly with respect to the given data, and will tend
to classify future points more accurately.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 80/130

6/24/24, 4:35 AM LinearModels slides

We can express a linear decision boundary as:

T T
x̊ w = b + x ω = 0.

To find the separating hyperplane with maximum margin we aim to

find a set of parameters so that the region defined by b + xT ω = ±1, with
each translate just touching one of the two classes, has the largest
possible margin.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 81/130

6/24/24, 4:35 AM LinearModels slides

The margin can be determined by calculating the distance between

any two points x1 and x2 (one from each translated hyperplane) lying
on the normal vector ω:

∥x1 − x2 ∥2 .

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 82/130

6/24/24, 4:35 AM LinearModels slides

Taking the difference of the two translates evaluated at x1 and x2 :

T
(b + xT1 ω ) − (b + xT2 ω ) = (x1 − x2 ) ω = 2

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 83/130

6/24/24, 4:35 AM LinearModels slides

Maximum margin decision boundaries

T
(b + xT1 ω ) − (b + xT2 ω ) = (x1 − x2 ) ω = 2

Using the innner-product rule, and the fact that the two vectors x1 − x2
and ω are parallel to each other, we have:
2
∥x1 − x2 ∥2 =
∥ω∥2

Therefore, finding the separating hyperplane with maximum margin is

equivalent to finding the one with the smallest possible normal vector
ω.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 84/130

6/24/24, 4:35 AM LinearModels slides

The hard-margin and soft-margin SVM problems

We have the hard-margin support vector machine problem to find a
separating hyperplane for the data with minimum length normal
vector:
2
minimize ∥ω∥2
b, ω
subject to max (0, 1 − yp (b + xTp ω)) = 0, p = 1, . . . , P .

The Margin-Perceptron constraint here guarantee that the hyperplane

separates the data perfectly.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 85/130

6/24/24, 4:35 AM LinearModels slides

We can relax the constraints and form an unconstrained formulation

that is similar to the previous regularization approach:
P
g (b, ω) = ∑ max (0, 1 − yp (b + xTp ω)) + λ∥ω∥22
p=1

The parameter λ ≥ 0 is called a penalty or regularization parameter.

When λ is set to a small positive value we put more 'pressure' on the
cost function to make sure the constraints hold:

max (0, 1 − yp (b + xTp ω)) = 0, p = 1, . . . , P

This regularized form of the Margin-Perceptron cost function is

referred to as the soft-margin support vector machine cost.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 86/130

6/24/24, 4:35 AM LinearModels slides

Comparing the SVM decision boundary on linearly separable data

In [59]:
demo5.svm_comparison_fig()

Compare three decision boundaries learned via the Margin-Perceptron

(left) to the support vector machine decision boundary (right).

There are infinitely many linear decision boundaries that separate the
two classes. Any of these can be found by Margin-Perceptron.

The SVM decision boundary is the one that provides the maximum
margin.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 87/130

6/24/24, 4:35 AM LinearModels slides

In [60]:
demo5.svm_comparison_fig()

In the right panel, the translates of the decision boundary pass through
points from both classes - equidistant from the SVM linear decision
boundary.

These points are called support vectors, hence the name Support
Vector Machines.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 88/130

6/24/24, 4:35 AM LinearModels slides

SVMs and noisy data

A very big practical benefit of the soft-margin SVM problem is that it

allows us it to deal with noisy imperfectly (linearly) separable data –
which arise far more commonly in practice than datasets that are
perfectly linearly separable.

"Noise" makes at least one of the constraints in the hard-margin

problem impossible to satisfy (and thus the problem is technically
impossible to solve), i.e., for some p:

max (0, 1 − yp (b + xTp ω)) > 0

The soft-margin relaxation can always be properly minimized and is

therefore much more frequently used in practice.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 89/130

6/24/24, 4:35 AM LinearModels slides

SVMs and noisy data

However notice that once we forgo the assumption of perfectly (linear)

separability the added value of a 'maximum margin hyperplane'
provided by the SVM solution disappears since we no longer have a
margin to begin with.

Thus with many datasets in practice the softmargin problem does not
provide a solution remarkably different than the perceptron or even
logistic regression.

Actually - with datasets that are not linearly separable - it often returns
exactly the same solution provided by the perceptron or logistic
regression.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 90/130

6/24/24, 4:35 AM LinearModels slides

SVMs and noisy data

Consider the soft-margin SVM problem:

P
g (b, ω) = ∑ max (0, 1 − yp (b + xTp ω)) + λ∥ω∥2
2

p=1

We smoothen the Margin-Perceptron portion of the cost using the

Softmax function:
P
g (b, ω) = ∑ log (1 + e−yp (b+xp ω) ) + λ∥ω∥22
T

p=1

This is similar to a regularized Perceptron or logistic regression.

All three methods of linear two-class classification are very deeply

connected. They tend to provide similar results on realistic (not linearly
separable) datasets.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 91/130

6/24/24, 4:35 AM LinearModels slides

Linear Multi-Class Classification

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 92/130

6/24/24, 4:35 AM LinearModels slides

One-versus-All Multi-Class Classification

A multiclass dataset {(xp, yp )}Pp=1 consists of C distinct classes of data.

As with the two class case we can in theory use any C distinct labels to
distinguish between the classes.

For the sake of this derivation it is convenient to use label values

yp ∈ {0, 1, . . . , C − 1}.

In [64]:
demo1.show_dataset()

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 93/130

6/24/24, 4:35 AM LinearModels slides

Training C One-versus-All classifiers

The goal of multi-class classification is that we want to learn how to

distinguish each class of our data from the other C − 1 classes.

A good first step would be to learn C two-class classifiers on the entire

dataset, with the cth classifier trained to distinguish the cth class from
the remainder of the data.

With the cth two-class subproblem we simply assign temporary labels y~p
to the entire dataset, giving +1 labels to the cth class and −1 labels to
the remainder of the dataset

+1 if yp = c
y~p = { (2)
−1 if yp ≠ c

where again yp is the original label for the pth point, and run the two-class
classification scheme of our choice.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 94/130

6/24/24, 4:35 AM LinearModels slides

In [65]:
# solve the 2-class subproblems
demo1.solve_2class_subproblems()

# illustrate dataset with each subproblem and learned decision boundary

demo1.plot_data_and_subproblem_separators()

We learn 3 linear classifiers with logistic regression for each subproblem.

We solve each using the Newton's method.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 95/130

6/24/24, 4:35 AM LinearModels slides

1. Points on the positive side of a single classifier

With OvA we learn C two-class classifiers, and we can denote the weights
from the cth classifier as wc where

⎡ w0,c ⎤
⎢
⎢ 1,c
w ⎥
⎥
⎢ ⎥
wc = ⎢
⎢
⎥
⎥
⎢ ⎥
w2,c
⎢ ⎥
⎢ ⋮ ⎥
⎣w ⎦
N,c

and then the corresponding decision boundary as x̊ w c = 0.

In the previous example, in each case, because each subproblem is

perfectly linearly separable, the class to be distinguished from the rest lies
on the positive side of its respective classifier, with the remainder of the
points lying on the negative side.
This means that for the jth classifier we have for the pth point xp that
>0 if yp = c
x̊ wc {
T
<0 if yp ≠ c.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 96/130

6/24/24, 4:35 AM LinearModels slides

1. Points on the positive side of a single classifier

When evaluated by each two-class classifier individually, the one

identifying a point's true label always provides the largest evaluation,
i.e.,
T T
x̊ wj = max x̊ wc
c = 0,...,C−1

How do we classify arbitrary points? The predicted label y for input

point x here can be written as
T
y = argmax x̊ wc
c = 0,...,C−1

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 97/130

6/24/24, 4:35 AM LinearModels slides

In [66]:
demo1.show_fusion(region = 1)

All points in our toy dataset lie on the positive side of a single
classifier. These points are colored to match their respective classifier.

There are regions left uncolored, which include regions where:

1. Points are on the positive side of more than one classifier.
2. Points are on the positive side of none of the classifiers.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 98/130

6/24/24, 4:35 AM LinearModels slides

2. Points on the positive side of more than one classifier

In [67]:
demo1.point_and_projection(point1 = [0.4,1] ,point2 = [0.6,1]);

We think of a classifier as being 'more confident' of the class identity of

given a point the farther the point lies from the classifier's decision
boundary.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 99/130

6/24/24, 4:35 AM LinearModels slides

2. Points on the positive side of more than one classifier

In [68]:
demo1.point_and_projection(point1 = [0.4,1] ,point2 = [0.6,1]);

This is a simple geometric/ probabilistic concept, the bigger a point's

distance to the boundary the deeper into one region of a classifier's
half-space it lies, and thus we can be much more confident in its class
identity than a point closer to the boundary.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 100/130

6/24/24, 4:35 AM LinearModels slides

2. Points on the positive side of more than one classifier

In [69]:
demo1.point_and_projection(point1 = [0.4,1] ,point2 = [0.6,1]);

If we slightly perturbed the decision boundary: those points originally

close to its boundary might end up on the other side of the perturbed
hyperplane, changing classes, whereas those points farther from the
boundary are less likely to be so affected.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 101/130

6/24/24, 4:35 AM LinearModels slides

2. Points on the positive side of more than one classifier

In [70]:
demo1.show_fusion(region = 2)

We repeat this logic for every point in the regions where two or more
classifiers are positive.
For points that are equidistant to two or more decision boundaries, we
assign a class label at random.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 102/130

6/24/24, 4:35 AM LinearModels slides

3. Points on the negative side of all classifiers

In [71]:
demo1.point_and_projection(point1 = [0.4,0.5] ,point2 = [0.45,0.45])

When a point x lies on the positive side of none of our C One-versus-

All classifiers, it means that each of our classifiers designated it as not
in their respective class.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 103/130

6/24/24, 4:35 AM LinearModels slides

In [72]:
demo1.point_and_projection(point1 = [0.4,0.5] ,point2 = [0.45,0.45])

Which classifier is the least confident about x not belonging to their

class?
Which decision boundary is the closest to the point x.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 104/130

6/24/24, 4:35 AM LinearModels slides

In [73]:
demo1.point_and_projection(point1 = [0.4,0.5] ,point2 = [0.45,0.45])

Using the notion of signed distance to decision boundary:

Assuming the weights of our classifiers have been normalized, for an
input point x we assign the label y providing the maximum signed
distance, i.e.,
T
y = argmax x̊ wc .
c = 0,...,C−1

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 105/130

6/24/24, 4:35 AM LinearModels slides

In [74]:
demo1.show_fusion(region = 3)

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 106/130

6/24/24, 4:35 AM LinearModels slides

Putting all together

One-versus-All multi-class classification

1: Input: multiclass dataset {(xp, yp )}Pp=1 where yp ∈ {0, . . . , C − 1}, two-class

classification scheme and optimizer
2: for j = 0, . . . , C − 1
+1 if yp = j
3: form temporary labels y~p = {
−1 if yp ≠ j

4: solve two-class subproblem on {(xp, y~p )}Pp=1 to find weights wj

5: normalize classifier weights by magnitude of feature-touching

portion wj ⟵ ∥ w ∥ j

∥ω j ∥2

6: end for
7: To assign label y to a point x, apply the fusion rule: y =
T
argmax x̊ wc
c = 0,...,C−1

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 107/130

6/24/24, 4:35 AM LinearModels slides

In [75]:
demo1.show_complete_coloring();

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 108/130

6/24/24, 4:35 AM LinearModels slides

Multi-Class Classification and the Perceptron

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 109/130

6/24/24, 4:35 AM LinearModels slides

The multi-class Perceptron cost function

We have seen how the fusion rule defined class ownership for every
point x in the input space. With all two-class classifiers properly tuned,
ideally, we would like the fusion rule to hold true for as many points as
possible:

yp = argmax x̊Tp wj
j = 0,...,C−1

Now, instead of tuning our C two-class classifiers one-by-one and then

combining them in this way, we can learn the weights of all C classifiers
simultaneously so as to satisfy this ideal condition as often as possible.

If the above Equation holds for the pth point, then:

x̊Tp wyp = argmax x̊Tp wj

j = 0,...,C−1

i.e., the signed distance from the point to its class decision boundary is
greater than (or equal to) its distances to every other two-class decision
boundary.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 110/130

6/24/24, 4:35 AM LinearModels slides

Subtracting x̊Tp wy from the right-hand side, we have a point-wise cost

function that is always non-negative and minimal at zero:

gp (w0 , . . . , wC−1 ) = ( max x̊Tp wj ) − x̊Tp wyp . (3)

j = 0,...,C−1

If our weights w0 , . . . , wC−1 are set ideally, gp (w0 , . . . , wC−1 ) should be zero
for as many points as possible.

We can now form a cost function by taking the average of the point-
wise cost over the entire dataset:

1 P
g (w0 , . . . , wC−1 ) = ∑ [( max x̊Tp wj ) − x̊Tp wyp ]
P p=1 j = 0,...,C−1

This cost function is called the multi-class Perceptron cost, providing

a way to tune all classifier weights simulataneously to obtain weights
that satisfy the fusion rule as well as possible.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 111/130

6/24/24, 4:35 AM LinearModels slides

Example: Minimizing the multi-class Perceptron cost

1 P
g (w0 , . . . , wC−1 ) = ∑ [( max x̊Tp wj ) − x̊Tp wyp ]
P p=1 j = 0,...,C−1

The multi-class Perceptron cost also has a trivial solution at zero.

This undesirable behavior can be avoided by initializing any local
optimization method away from the origin.

We are restricted to using zero- and first-order optimization methods

to minimize the multi-class Perceptron cost as it second derivative is
zero.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 112/130

6/24/24, 4:35 AM LinearModels slides

In [78]:
# compute C linear combinations of input point, one per classifier
def model(x,w):
a = w[0] + np.dot(x.T,w[1:])
return a.T

In [79]:
def multiclass_perceptron(w):
# pre-compute predictions on all points
all_evals = model(x,w)

# compute maximum across data points

a = np.max(all_evals,axis = 0)

# compute cost in compact form using numpy broadcasting

b = all_evals[y.astype(int).flatten(),np.arange(np.size(y))]
cost = np.sum(a - b)

# return average
return cost/float(np.size(y))

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 113/130

6/24/24, 4:35 AM LinearModels slides

In [80]:
# load in dataset
data = np.loadtxt(dataset_path_1,delimiter = ',')

# get input/output pairs

x = data[:-1,:]
y = data[-1:,:]

# create an instance of the ova demo

demo = MulticlassVisualizer(data)

# visualize dataset
demo.show_dataset()

# run gradient descent to minimize cost

g = multiclass_perceptron; w = 0.1*np.random.randn(3,3); max_its = 2000; alpha_choice = 10**(-1);
weight_history = demo.gradient_descent(g,w,alpha=alpha_choice,max_its=max_its)

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 114/130

6/24/24, 4:35 AM LinearModels slides

In [81]:
demo.show_complete_coloring(weight_history, cost = multiclass_perceptron)

In the left figure, because we did not train each individual two-class
classifier in a One-versus-All manner, each individual learned two-class
classifier peforms quite poorly in separating its class from the rest of
the data.
In the right figure, we show the fused multi-class decision boundary
formed by combining these individual One-versus-All boundaries via
the fusion rule. The final multi-class decision boundary achieves perfect
classification.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 115/130

6/24/24, 4:35 AM LinearModels slides

Alternative formulations of the multi-class Perceptron cost

1 P
g (w0 , . . . , wC−1 ) = ∑ [( max x̊Tp wj ) − x̊Tp wyp ]
P p=1 j = 0,...,C−1

The multi-class Perceptron cost can also be derived as a direct

generalization of its two-class version. Using the following property of
the max function, we have:

max (s0 , s1 , . . . , sC−1 ) − z = max (s0 − z, s1 − z, . . . , sC−1 − z)

We can rewrite the point-wise multi-class Perceptron cost as:

max x̊Tp (wj − wyp ) .

j = 0,...,C−1

Because for j = yp , we have x̊Tp (wj − wy ) = 0, then we can have:

max (0, x̊Tp (wj − wyp )) .

j = 0,...,C−1
j≠yp

The entire multi-class Perceptron cost is:

1 P
g (w0 , . . . , wC−1 ) = ∑ max (0, x̊Tp (wj − wyp )) . (4)
P p=1 j = 0,...,C−1
j≠yp

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 116/130

6/24/24, 4:35 AM LinearModels slides

Regularizing the multi-class Perceptron

In minimizing the multi-class Perceptron cost, we should subject it to

the constraints that all vectors of feature-touching weights have unit
length so that we can compre the signed distance of each input xp to
each of our One-versus-All two-class decision boundaries.
Each weight vector wc consists of two components

⎡ w1,c ⎤
⎢ w2,c ⎥
ωj = ⎢
⎢ ⎥
⎥.
⎢
⎢ ⋮ ⎥
(bias): bc = w0,c (feature-touching weights): (5)
⎥
⎣ wN,c ⎦

In this notation each linear combination can be written as x̊Tp wc = bc + xTp ωc .

A proper constrained optimization problem involving our multi-class

Perceptron takes the form:

1 P
minimize ∑ [( max bj + xTp ωj ) − (byp + xTp ωyp )]
b0 , ω0 , ..., bC−1 , ωC−1 P p=1 j = 0,...,C−1
2
subject to ∥
∥ω j ∥
∥2 = 1, j = 0, . . . , C − 1

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 117/130

6/24/24, 4:35 AM LinearModels slides

Regularizing the multi-class Perceptron

1 P
minimize ∑ [( max bj + xTp ωj ) − (byp + xTp ωyp )]
b0 , ω0 , ..., bC−1 , ωC−1 P p=1 j = 0,...,C−1
2
subject to ∥
∥ω j ∥
∥2 = 1, j = 0, . . . , C − 1

We can phrase this constrained problem in an unconstrained

regularized form by relaxing the constraints but penalizing their
magnitude:
C−1
1 P
∑ [( max bj + xTp ωj ) − (byp + xTp ωyp )] + λ ∑ ∥∥ωj ∥∥2
2

P p=1 j = 0,...,C−1 j=0

λ is typically set to a small value (e.g., 10−3 or smaller).

The regularized form does not quite match the original constrained
formulation as regularizing all normal vectors together will not
necessarily guarantee that ∥ωj ∥22 = 1 for all j. However, it will generally
force the magnitude of all normal vectors to ''behave well''.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 118/130

6/24/24, 4:35 AM LinearModels slides

In [94]:
lam = 10**-5 # our regularization paramter
def multiclass_perceptron(w):
# pre-compute predictions on all points
all_evals = model(x,w)

# compute maximum across data points

a = np.max(all_evals,axis = 0)

# compute cost in compact form using numpy broadcasting

b = all_evals[y.astype(int).flatten(),np.arange(np.size(y))]
cost = np.sum(a - b)

# add regularizer
cost = cost + lam*np.linalg.norm(w[1:,:],'fro')**2

# return average
return cost/float(np.size(y))

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 119/130

6/24/24, 4:35 AM LinearModels slides

In [95]:
# load in dataset
data = np.loadtxt(dataset_path_1,delimiter = ',')

# get input/output pairs

x = data[:-1,:]
y = data[-1:,:]

# create an instance of the ova demo

demo = MulticlassVisualizer(data)

# visualize dataset
demo.show_dataset()

# run gradient descent to minimize cost

g = multiclass_perceptron; w = 0.1*np.random.randn(3,3); max_its = 2000; alpha_choice = 10**(-1);
weight_history = demo.gradient_descent(g,w,alpha=alpha_choice,max_its=max_its)

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 120/130

6/24/24, 4:35 AM LinearModels slides

In [96]:
demo.show_complete_coloring(weight_history, cost = multiclass_perceptron)

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 121/130

6/24/24, 4:35 AM LinearModels slides

The multi-class Softmax cost function

Recall that the softmax function:

soft (s0 , s1 , . . . , sC−1 ) = log (es0 + es1 + ⋯ + esC−1 )

is a close and smooth approximation to the maximum of C scalar numbers

s0 , . . . , sC−1 , i.e.,

soft (s0 , s1 , . . . , sC−1 ) ≈ max (s0 , s1 , . . . , sC−1 ) .

Replacing the max function in the multi-class Perceptron cost

1 P
g (w0 , . . . , wC−1 ) = ∑ [( max x̊Tp wj ) − x̊Tp wyp ]
P p=1 j = 0,...,C−1

gives the following multi-class Softmax cost function:

∑ [log (∑ ex̊p wj ) − x̊Tp wyp ] .

C−1
1 P T
g (w0 , . . . , wC−1 ) =
P p=1 j=0

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 122/130

6/24/24, 4:35 AM LinearModels slides

Minimizing the multi-class Softmax cost function

The multi-class Softmax cost function is convex, and (unlike the multi-
class Perceptron), it also has infinitely many smooth derivaties,
enabling us to use second-order methods in order to properly
minimize it.
It no longer has a trivial solution at zero.
In [107]:
def multiclass_softmax(w):
# pre-compute predictions on all points
all_evals = model(x,w)

# compute softmax across data points

a = np.log(np.sum(np.exp(all_evals),axis = 0))

# compute cost in compact form using numpy broadcasting

b = all_evals[y.astype(int).flatten(),np.arange(np.size(y))]
cost = np.sum(a - b)

# return average
return cost/float(np.size(y))

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 123/130

6/24/24, 4:35 AM LinearModels slides

In [108]:
# load in dataset
data = np.loadtxt(dataset_path_2,delimiter = ',')

# get input/output pairs

x = data[:-1,:]
y = data[-1:,:]

# create instance of multiclass visualizer

demo = MulticlassVisualizer(data)

# visualize dataset
demo.show_dataset()

# run gradient descent to minimize cost

g = multiclass_softmax; w = 0.1*np.random.randn(3,4); max_its = 5;
weight_history = demo.newtons_method(g,w,max_its=max_its)

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 124/130

6/24/24, 4:35 AM LinearModels slides

In [109]:
demo.show_complete_coloring(weight_history, cost = multiclass_softmax)

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 125/130

6/24/24, 4:35 AM LinearModels slides

Alternative formulations of the multi-class Softmax

∑ [log (∑ ex̊p wj ) − x̊Tp wyp ] .

C−1
1 P T
g (w0 , . . . , wC−1 ) =
P p=1 j=0

The p-th summand of the multi-class Softmax cost can be written as:

⎛ ∑C−1 ex̊Tp wj ⎞
log (∑ e )− = log (∑ e ) − log e = log ⎜ j=0T ⎟.
C−1 C−1
x̊Tp wj x̊Tp wj x̊Tp wyp
x̊Tp wyp
j=0 c=0 ⎝ ex̊p wyp ⎠

Altogether we have:

⎛ ∑C−1 ex̊p wj ⎞
T

∑ log ⎜ j=0T ⎟= ∑ log (1 + ∑ ex̊p (wj −wyp ) ) .

C−1
1 P 1 P T
g (w0 , . . . , wC−1 ) =
P p=1 ⎝ ex̊p wyp ⎠ P p=1 j=0

Or we can have:

⎛ ∑C−1 ex̊p wj ⎞ ⎛ ⎞
T

∑ log ⎜ j=0T ⎟ = − ∑ log ⎜ C−1 ⎟.

1 P 1 P T
ex̊p wyp
g (w0 , . . . , wC−1 ) =
P p=1 ⎝ ex̊p wyp ⎠ P p=1 ⎝∑ e x̊Tp wj ⎠
j=0

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 126/130

6/24/24, 4:35 AM LinearModels slides

Regularization and the multi-class Softmax

It is common to regularize the Multi-class Softmax via its feature

touching weights as:

∑ [log (∑ ebj +xp ωj ) − (byp + xTp ωyp )] + λ ∑ ∥∥ωj ∥∥2

C−1 C−1
1 P T 2

P p=1 j=0 j=0

λ can be set to some small value like 10−3 .

Regularization can also help prevent local optimizaton methods like
Newton's method (which take large steps) from diverging when
dealing with perfectly separable data.

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 127/130

6/24/24, 4:35 AM LinearModels slides

In [122]:
# compute C linear combinations of input point, one per classifier
def model(x,w):
a = w[0] + np.dot(x.T,w[1:])
return a.T

In [123]:
# multiclass softmaax regularized by the summed length of all normal vectors
lam = 10**(-5) # our regularization paramter
def multiclass_softmax(w):
# pre-compute predictions on all points
all_evals = model(x,w)

# compute softmax across data points

a = np.log(np.sum(np.exp(all_evals),axis = 0))

# compute cost in compact form using numpy broadcasting

b = all_evals[y.astype(int).flatten(),np.arange(np.size(y))]
cost = np.sum(a - b)

# add regularizer
cost = cost + lam*np.linalg.norm(w[1:,:],'fro')**2

# return average
return cost/float(np.size(y))

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 128/130

6/24/24, 4:35 AM LinearModels slides

In [124]:
# load in dataset
data = np.loadtxt(dataset_path_1,delimiter = ',')

# get input/output pairs

x = data[:-1,:]
y = data[-1:,:]

# create instance of multiclass visualizer

demo = MulticlassVisualizer(data)

# run gradient descent to minimize cost

g = multiclass_softmax; w = 0.1*np.random.randn(3,3); max_its = 200; alpha_choice = 1;
weight_history = demo.gradient_descent(g,w,alpha=alpha_choice,max_its=max_its)

In [125]:
# plot classification of space, individual learned classifiers (left panel) and joint boundary (middle panel), and cost-function panel in the right pan
demo.show_complete_coloring(weight_history, cost = multiclass_softmax)

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 129/130

6/24/24, 4:35 AM LinearModels slides

In [132]:
# load in dataset
data = np.loadtxt(dataset_path_1,delimiter = ',')

# get input/output pairs

x = data[:-1,:]
y = data[-1:,:]

# create instance of multiclass visualizer

demo = MulticlassVisualizer(data)

# run gradient descent to minimize cost

g = multiclass_softmax; w = 0.1*np.random.randn(3,3); max_its = 5; alpha_choice = 5;
weight_history = demo.newtons_method(g,w,alpha=alpha_choice,max_its=max_its)

In [133]:
demo.show_complete_coloring(weight_history, cost = multiclass_softmax)

In [ ]:

file:///C:/Users/bmhun/Documents/UIT/Year3/HKII/CS401-Mạng nơ ron và giải thuật di truyền/Slides/08 - Linear Models/LinearModels.slides.html#/4 130/130

MANNA - Marshall Brain
No ratings yet
MANNA - Marshall Brain
3 pages
Tutorial On Helmholtz Machine
No ratings yet
Tutorial On Helmholtz Machine
26 pages
Model With One-Word Context: 2vec 2vec 2vec 2vec
100% (1)
Model With One-Word Context: 2vec 2vec 2vec 2vec
17 pages
Lecture 06
No ratings yet
Lecture 06
6 pages
Lect5 Reg
No ratings yet
Lect5 Reg
16 pages
Appendix I Difference Equations
No ratings yet
Appendix I Difference Equations
2 pages
Multiple Regression - Estimation
No ratings yet
Multiple Regression - Estimation
18 pages
Lecture 15
No ratings yet
Lecture 15
5 pages
Backpropagation
No ratings yet
Backpropagation
4 pages
Analysis of Variance and Design of Experiments1
No ratings yet
Analysis of Variance and Design of Experiments1
5 pages
神经网络中涉及的向量和矩阵求导
100% (1)
神经网络中涉及的向量和矩阵求导
18 pages
Vandermonde_and_Lagrange_Basis (1)
No ratings yet
Vandermonde_and_Lagrange_Basis (1)
4 pages
04_lecturenote_MLE_MAP_discriminative
No ratings yet
04_lecturenote_MLE_MAP_discriminative
6 pages
Lecture 1: Introduction To Uncertainty Quantification: Today
No ratings yet
Lecture 1: Introduction To Uncertainty Quantification: Today
12 pages
Generalized mean
No ratings yet
Generalized mean
5 pages
Chapter - Two - CT - 1
No ratings yet
Chapter - Two - CT - 1
6 pages
5.02 The Matrix of A Linear Transformation I
No ratings yet
5.02 The Matrix of A Linear Transformation I
5 pages
Module 7 Beam Deflections
No ratings yet
Module 7 Beam Deflections
21 pages
Lecture 05
No ratings yet
Lecture 05
7 pages
Curve Fitting and Interpolation
No ratings yet
Curve Fitting and Interpolation
14 pages
Maximum Likelihood
No ratings yet
Maximum Likelihood
7 pages
Linalg Review
No ratings yet
Linalg Review
11 pages
Cadeia de Markov
No ratings yet
Cadeia de Markov
184 pages
Simple Regression
No ratings yet
Simple Regression
8 pages
CH 4
No ratings yet
CH 4
33 pages
Cadeia de Markov
No ratings yet
Cadeia de Markov
178 pages
DF Formulae
No ratings yet
DF Formulae
7 pages
Sst414-Leson 1
No ratings yet
Sst414-Leson 1
6 pages
Lecture 4: Stochastic Gradient Based Adaptation: Least Mean Square (LMS) Algorithm
No ratings yet
Lecture 4: Stochastic Gradient Based Adaptation: Least Mean Square (LMS) Algorithm
25 pages
Mathematical Foundations of Computer Science Lecture Outline
No ratings yet
Mathematical Foundations of Computer Science Lecture Outline
4 pages
Week One Presentation Laff
No ratings yet
Week One Presentation Laff
33 pages
Exercise 1
No ratings yet
Exercise 1
8 pages
Appunti pg8 1 PDF
100% (1)
Appunti pg8 1 PDF
86 pages
Projections
No ratings yet
Projections
35 pages
Notes On MSE Gradients For Neural Networks: 1 2 Mean Squared Error (MSE)
No ratings yet
Notes On MSE Gradients For Neural Networks: 1 2 Mean Squared Error (MSE)
10 pages
Lecture-02_student
No ratings yet
Lecture-02_student
18 pages
Regression Analysis: Ordinary Least Squares
No ratings yet
Regression Analysis: Ordinary Least Squares
12 pages
Digital Communication - Information Theory
100% (1)
Digital Communication - Information Theory
4 pages
10 Matlab Fitting
No ratings yet
10 Matlab Fitting
36 pages
Solution To Problem Set Six - Choice Over Budget Sets and The Dual Problem
No ratings yet
Solution To Problem Set Six - Choice Over Budget Sets and The Dual Problem
8 pages
Solutions for Exercises in Linear Systems Theory by Joao Hespanha
No ratings yet
Solutions for Exercises in Linear Systems Theory by Joao Hespanha
9 pages
09 - Monte Carlo Learning
No ratings yet
09 - Monte Carlo Learning
24 pages
Spring 2021: Numerical Analysis Assignment 5 (Due Thursday April 22nd 10:00am)
No ratings yet
Spring 2021: Numerical Analysis Assignment 5 (Due Thursday April 22nd 10:00am)
4 pages
6 25 Notes
No ratings yet
6 25 Notes
10 pages
Position Basis Wavefunction
No ratings yet
Position Basis Wavefunction
2 pages
cs188 Fa23 Note22
No ratings yet
cs188 Fa23 Note22
3 pages
0709 2661
No ratings yet
0709 2661
6 pages
Velocity Potential
No ratings yet
Velocity Potential
14 pages
Lecture 07
No ratings yet
Lecture 07
4 pages
Log-Linear Models, Memms, and CRFS: 1 Notation
No ratings yet
Log-Linear Models, Memms, and CRFS: 1 Notation
11 pages
CVX
No ratings yet
CVX
9 pages
Uncertainties Formula Sheet
No ratings yet
Uncertainties Formula Sheet
2 pages
Ch. VIII Potential Flow and Computational Fluid Dynamics: Review of Velocity-Potential Concepts
No ratings yet
Ch. VIII Potential Flow and Computational Fluid Dynamics: Review of Velocity-Potential Concepts
5 pages
LectureNotes6
No ratings yet
LectureNotes6
1 page
Lecture 7
No ratings yet
Lecture 7
10 pages
MIT2 092F09 Lec05 PDF
No ratings yet
MIT2 092F09 Lec05 PDF
4 pages
Micro-CH4-cem
No ratings yet
Micro-CH4-cem
11 pages
AutoEncoder
No ratings yet
AutoEncoder
11 pages
2-D Transformations Why Do We Need Transformations?: - Glusphere Draws A Sphere of Radius
No ratings yet
2-D Transformations Why Do We Need Transformations?: - Glusphere Draws A Sphere of Radius
7 pages
A-level Maths Revision: Cheeky Revision Shortcuts
From Everand
A-level Maths Revision: Cheeky Revision Shortcuts
Scool Revision
3.5/5 (8)
Curve Fitting
100% (1)
Curve Fitting
43 pages
Curve Fitting (Lecturers)
No ratings yet
Curve Fitting (Lecturers)
27 pages
IntroEconmerics AcFn 5031 (q1)
No ratings yet
IntroEconmerics AcFn 5031 (q1)
302 pages
Part 2 Exploring Relationships Among Variables
No ratings yet
Part 2 Exploring Relationships Among Variables
8 pages
Olaf Abel Ethylene Apc
No ratings yet
Olaf Abel Ethylene Apc
32 pages
Multiple Regression Analysis
100% (7)
Multiple Regression Analysis
6 pages
Nonlinear Parameter Optimization Using R Tools 1st Edition John C. Nash - The ebook with rich content is ready for you to download
100% (1)
Nonlinear Parameter Optimization Using R Tools 1st Edition John C. Nash - The ebook with rich content is ready for you to download
57 pages
PDF (Ebook PDF) Principles of Econometrics, 5th Edition Download
100% (3)
PDF (Ebook PDF) Principles of Econometrics, 5th Edition Download
51 pages
Forecast Calculation Examples
No ratings yet
Forecast Calculation Examples
6 pages
Casio Classpad Manual
No ratings yet
Casio Classpad Manual
57 pages
A Tutorial on MM Algorithms
No ratings yet
A Tutorial on MM Algorithms
8 pages
Inverse Problem Theory and Methods for Model Parameter Estimation 1st Edition Albert Tarantola - The full ebook version is available, download now to explore
100% (1)
Inverse Problem Theory and Methods for Model Parameter Estimation 1st Edition Albert Tarantola - The full ebook version is available, download now to explore
46 pages
Production Planning and Control - Forecasting-1
No ratings yet
Production Planning and Control - Forecasting-1
53 pages
Core Data Analysis Worksheet 6
No ratings yet
Core Data Analysis Worksheet 6
20 pages
(Springer Texts in Statistics) Jonathan D. Cryer, Kung-Sik Chan - Time Series Analysis - With Applications in R-Springer (2008) - 40-63
No ratings yet
(Springer Texts in Statistics) Jonathan D. Cryer, Kung-Sik Chan - Time Series Analysis - With Applications in R-Springer (2008) - 40-63
24 pages
Chapter 3-Cost Behavior
No ratings yet
Chapter 3-Cost Behavior
68 pages
Chap 006
92% (12)
Chap 006
89 pages
Pepper
No ratings yet
Pepper
14 pages
STA301 Short Notes (1 to 22) (1)
No ratings yet
STA301 Short Notes (1 to 22) (1)
94 pages
Nlreg10e PDF
No ratings yet
Nlreg10e PDF
30 pages
Chapter 14 - Section A - Mathcad Solutions
No ratings yet
Chapter 14 - Section A - Mathcad Solutions
56 pages
CHAPTER I Lecture Note
No ratings yet
CHAPTER I Lecture Note
20 pages
Z S N-1E Wiener: Filtér
No ratings yet
Z S N-1E Wiener: Filtér
4 pages
An Estimation of Solar Radiation Using Robust Linear Regression Method
No ratings yet
An Estimation of Solar Radiation Using Robust Linear Regression Method
8 pages
Unit 6 Point Estimation: Structure
No ratings yet
Unit 6 Point Estimation: Structure
18 pages
High-Low Method Formulas
No ratings yet
High-Low Method Formulas
3 pages
Fox 2016 PDF
100% (1)
Fox 2016 PDF
817 pages
Math in The Modern World All in Source by Jayson Lucena
86% (14)
Math in The Modern World All in Source by Jayson Lucena
104 pages
Managerial Accounting Chapter 2
No ratings yet
Managerial Accounting Chapter 2
61 pages