
Lecture Notes 02


Lecture 2: Basic Artificial Neural Networks

Xuming He
SIST, ShanghaiTech
Fall, 2020

9/9/2020 Xuming He – CS 280 Deep Learning 1


Logistics
 Course project
 Each team consists of 3–5 members
 Exceptions may be made if you are among the top 10% in the first 3 quizzes

 Full course schedule on Piazza


 HW1 out next Monday
 Tutorial schedule: please vote on Piazza

 TA office hours
 See Piazza for detailed schedule and location

9/9/2020 Xuming He – CS 280 Deep Learning 2


Outline
 Artificial neuron
   Perceptron algorithm
 Single layer neural networks
   Network models
   Example: Logistic Regression
 Multi-layer neural networks
   Limitations of single layer networks
   Networks with single hidden layer

Acknowledgement: Hugo Larochelle’s, Mehryar Mohri@NYU’s & Yingyu Liang@Princeton’s course notes
9/9/2020 Xuming He – CS 280 Deep Learning 3
Mathematical model of a neuron

9/9/2020 4
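The figure for this slide is not reproduced in the extracted text. A standard formulation of the artificial neuron it illustrates (the notation on the slide may differ) is

\[
a = \mathbf{w}^\top \mathbf{x} + b = \sum_{i=1}^{d} w_i x_i + b, \qquad h = g(a),
\]

where \(\mathbf{x} \in \mathbb{R}^d\) is the input, \(\mathbf{w}\) are the connection weights, \(b\) is the bias, and \(g(\cdot)\) is an activation function such as the sign, sigmoid, or tanh function.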
Single neuron as a linear classifier
 Binary classification

9/9/2020 Xuming He – CS 280 Deep Learning 5
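As a concrete illustration (a minimal sketch, not code from the course), a single neuron with a sign activation acts as a linear binary classifier whose decision boundary is the hyperplane w·x + b = 0:

import numpy as np

def neuron_predict(w, b, x):
    """Single-neuron binary classifier: the sign of the pre-activation w.x + b."""
    a = np.dot(w, x) + b           # pre-activation
    return 1 if a >= 0 else -1     # predicted label in {+1, -1}

# Example with hypothetical weights: points on either side of the
# hyperplane w.x + b = 0 receive different labels.
w, b = np.array([1.0, -2.0]), 0.5
print(neuron_predict(w, b, np.array([3.0, 1.0])))   # +1
print(neuron_predict(w, b, np.array([0.0, 2.0])))   # -1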


How do we determine the weights?
 Learning problem

9/9/2020 Xuming He – CS 280 Deep Learning 6


Linear classification
 Learning problem: simple approach

• Drawback: Sensitive to “outliers”

9/9/2020 Xuming He – CS 280 Deep Learning 7


1D Example
 Compare two predictors

9/9/2020 Xuming He – CS 280 Deep Learning 8


Perceptron algorithm
 Learn a single neuron for binary classification

https://towardsdatascience.com/perceptron-explanation-implementation-and-a-visual-example-3c8e76b4e2d1

9/9/2020 Xuming He – CS 280 Deep Learning 9


Perceptron algorithm
 Learn a single neuron for binary classification

 Task formulation

9/9/2020 Xuming He – CS 280 Deep Learning 10


Perceptron algorithm
 Algorithm outline

9/9/2020 Xuming He – CS 280 Deep Learning 11
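The algorithm itself is shown in the slide figure. A minimal NumPy sketch of the standard mistake-driven perceptron (labels in {+1, -1}; a bias can be handled by appending a constant-1 feature to each input) is

import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """Perceptron: whenever an example is misclassified, add y_i * x_i to the weights."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * np.dot(w, X[i]) <= 0:   # misclassified (or exactly on the boundary)
                w += y[i] * X[i]              # mistake-driven update
                mistakes += 1
        if mistakes == 0:                     # a full pass with no mistakes: stop
            break
    return w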


Perceptron algorithm
 Intuition: correct the current mistake

9/9/2020 Xuming He – CS 280 Deep Learning 12
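One way to state the "correct the current mistake" intuition (a standard argument, written out here because the slide's derivation is in the figure): after an update on a misclassified example \((\mathbf{x}, y)\), the score of that example moves in the right direction,

\[
y\,\langle \mathbf{w} + y\,\mathbf{x},\, \mathbf{x}\rangle
= y\,\langle \mathbf{w}, \mathbf{x}\rangle + y^{2}\,\|\mathbf{x}\|^{2}
= y\,\langle \mathbf{w}, \mathbf{x}\rangle + \|\mathbf{x}\|^{2}
> y\,\langle \mathbf{w}, \mathbf{x}\rangle,
\]

so the updated weights score the mistaken example less wrongly, although a single update need not fix it completely.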


Perceptron algorithm
 The Perceptron theorem

9/9/2020 Xuming He – CS 280 Deep Learning 13
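The theorem on this slide appears as an image. Its standard statement (constants and notation may differ from the slide's version): if every example satisfies \(\|\mathbf{x}_t\| \le R\) and there exists a unit vector \(\mathbf{w}^*\) with margin \(y_t \langle \mathbf{w}^*, \mathbf{x}_t\rangle \ge \gamma > 0\) for all \(t\), then the perceptron algorithm makes at most

\[
\left(\frac{R}{\gamma}\right)^{2}
\]

mistakes, regardless of the order in which the examples are presented.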


Hyperplane Distance
Perceptron algorithm
 The Perceptron theorem: proof

9/9/2020 Xuming He – CS 280 Deep Learning 15


Perceptron algorithm
 The Perceptron theorem: proof

9/9/2020 Xuming He – CS 280 Deep Learning 16


Perceptron algorithm
 The Perceptron theorem: proof intuition

9/9/2020 Xuming He – CS 280 Deep Learning 17


Perceptron algorithm
 The Perceptron theorem: proof

9/9/2020 Xuming He – CS 280 Deep Learning 18


Perceptron algorithm
 The Perceptron theorem

9/9/2020 Xuming He – CS 280 Deep Learning 19


Perceptron Learning problem
 What loss function is minimized?

9/9/2020 Xuming He – CS 280 Deep Learning 20
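The loss itself is shown in the slide figures. The perceptron update can be read as stochastic gradient descent on the standard perceptron loss (a common formulation; the slide's notation may differ):

\[
L(\mathbf{w}) = \sum_{i=1}^{n} \max\bigl(0,\, -y_i\,\mathbf{w}^\top \mathbf{x}_i\bigr),
\]

which is zero for correctly classified examples and grows linearly with how far a misclassified example lies on the wrong side of the boundary.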


Perceptron algorithm
 What loss function is minimized?

9/9/2020 Xuming He – CS 280 Deep Learning 21


Perceptron algorithm
 What loss function is minimized?

9/9/2020 Xuming He – CS 280 Deep Learning 22


Outline
 Artificial neuron
   Perceptron algorithm
 Single layer neural networks
   Network models
   Example: Logistic Regression
 Multi-layer neural networks
   Limitations of single layer networks
   Networks with single hidden layer

Acknowledgement: Hugo Larochelle’s, Mehryar Mohri@NYU’s & Yingyu Liang@Princeton’s course notes
9/9/2020 Xuming He – CS 280 Deep Learning 23
Single layer neural network

9/9/2020 24
Single layer neural network

9/9/2020 25
Single layer neural network

9/9/2020 26
What is the output?
 Element-wise nonlinear functions
 Independent feature/attribute detectors

9/9/2020 Xuming He – CS 280 Deep Learning 27
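A minimal sketch of the usual element-wise nonlinearities this slide refers to (sigmoid, tanh, and ReLU are standard examples; the slide's exact list is in the figure):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))   # squashes to (0, 1)

def tanh(a):
    return np.tanh(a)                 # squashes to (-1, 1)

def relu(a):
    return np.maximum(0.0, a)         # zero for negative pre-activations

# Each function is applied independently to every pre-activation,
# so each hidden unit acts as an independent feature/attribute detector.
a = np.array([-2.0, 0.0, 3.0])
print(sigmoid(a), tanh(a), relu(a))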


What is the output?
 Nonlinear functions with vector input
 Competition between neurons

9/9/2020 Xuming He – CS 280 Deep Learning 28


What is the output?
 Nonlinear functions with vector input
 Example: Winner-Take-All (WTA)

9/9/2020 Xuming He – CS 280 Deep Learning 29


A probabilistic perspective
 Change the output nonlinearity

 From WTA to Softmax function

9/9/2020 Xuming He – CS 280 Deep Learning 30
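A minimal, numerically stable sketch of the softmax function, the smooth counterpart of winner-take-all discussed on this slide:

import numpy as np

def softmax(a):
    """softmax(a)_k = exp(a_k) / sum_j exp(a_j), computed stably."""
    a = a - np.max(a)        # shifting by a constant does not change the result
    e = np.exp(a)
    return e / np.sum(e)

# Unlike hard WTA (a one-hot vector at the argmax), softmax returns a
# probability distribution that still favors the largest pre-activation.
print(softmax(np.array([2.0, 1.0, 0.1])))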


Multiclass linear classifiers

 The WTA prediction is a one-hot encoding of the predicted label

9/9/2020 Xuming He – CS 280 Deep Learning 31


Probabilistic outputs

9/9/2020 Xuming He – CS 280 Deep Learning 32


How to learn a multiclass classifier?
 Define a loss function and minimize it

9/9/2020 Xuming He – CS 280 Deep Learning 33


Learning a multiclass linear classifier
 Design a loss function for multiclass classifiers
 Perceptron?
 Yes, see homework
 Hinge loss
 The SVM and max-margin (see CS231n)
 Probabilistic formulation
 Log loss and logistic regression
 Generalization issue
 Avoid overfitting by regularization

9/9/2020 Xuming He – CS 280 Deep Learning 34


Example: Logistic Regression
 Learning loss: negative log likelihood

9/9/2020 Xuming He – CS 280 Deep Learning 35
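The loss on this slide is shown in the figure. For reference, the standard negative log-likelihood for a softmax (multiclass logistic regression) output is (notation may differ from the slide)

\[
L(\mathbf{W}) \;=\; -\sum_{i=1}^{n} \log p\bigl(y_i \mid \mathbf{x}_i; \mathbf{W}\bigr)
\;=\; -\sum_{i=1}^{n} \log \frac{\exp\bigl(\mathbf{w}_{y_i}^\top \mathbf{x}_i\bigr)}{\sum_{k} \exp\bigl(\mathbf{w}_{k}^\top \mathbf{x}_i\bigr)}.
\]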


Logistic Regression
 Learning loss: example

9/9/2020 Xuming He – CS 280 Deep Learning 36


Logistic Regression
 Learning loss: questions

9/9/2020 Xuming He – CS 280 Deep Learning 37


Logistic Regression
 Learning loss: questions

9/9/2020 Xuming He – CS 280 Deep Learning 38


Learning with regularization
 Constraints on hypothesis space
 Similar to Linear Regression

9/9/2020 Xuming He – CS 280 Deep Learning 39


Learning with regularization
 Regularization terms
  Priors on the weights
  Bayesian: integrating out the weights
  Empirical: computing the MAP estimate of W

9/9/2020 Xuming He – CS 280 Deep Learning 40
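Written out (a standard formulation; the slide's exact expression is in the figure), the MAP view adds a penalty derived from the prior to the data loss:

\[
\min_{\mathbf{W}} \; -\sum_{i=1}^{n} \log p\bigl(y_i \mid \mathbf{x}_i; \mathbf{W}\bigr) \;+\; \lambda\, \Omega(\mathbf{W}),
\]

where a Gaussian prior on the weights yields the L2 penalty \(\Omega(\mathbf{W}) = \|\mathbf{W}\|_2^2\) and a Laplace prior yields the L1 penalty \(\Omega(\mathbf{W}) = \|\mathbf{W}\|_1\).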


L1 vs L2 regularization

https://www.youtube.com/watch?v=jEVh0uheCPk
9/9/2020 Xuming He – CS 280 Deep Learning 41
L1 vs L2 regularization
 Sparsity

9/9/2020 Xuming He – CS 280 Deep Learning 42


Optimization: gradient descent
 Gradient descent

 Learning rate matters

9/9/2020 Xuming He – CS 280 Deep Learning 43
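For reference, the update this slide discusses has the standard form

\[
\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \eta\, \nabla_{\boldsymbol{\theta}} L\bigl(\boldsymbol{\theta}^{(t)}\bigr),
\]

where the learning rate \(\eta\) controls the step size: too small and convergence is slow, too large and the iterates can oscillate or diverge.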


Optimization: gradient descent
 Stochastic gradient descent

9/9/2020 Xuming He – CS 280 Deep Learning 44


Optimization: gradient descent
 Stochastic gradient descent

9/9/2020 Xuming He – CS 280 Deep Learning 45
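A minimal sketch of minibatch stochastic gradient descent for the softmax-regression loss above (a generic illustration under standard assumptions, not the course's reference implementation):

import numpy as np

def softmax_rows(A):
    A = A - A.max(axis=1, keepdims=True)          # stabilize before exponentiating
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def sgd_softmax_regression(X, y, num_classes, lr=0.1, epochs=10, batch_size=32):
    """Minibatch SGD on the multiclass NLL; y holds integer labels in {0, ..., num_classes-1}."""
    n, d = X.shape
    W = np.zeros((d, num_classes))
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(n)                # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            P = softmax_rows(Xb @ W)              # predicted class probabilities
            P[np.arange(len(idx)), yb] -= 1.0     # gradient of the NLL w.r.t. the logits
            W -= lr * (Xb.T @ P) / len(idx)       # gradient step on this minibatch
    return W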


Interpreting network weights
 What are those weights?

9/9/2020 Xuming He – CS 280 Deep Learning 46


Outline
 Artificial neuron
   Perceptron algorithm
 Single layer neural networks
   Network models
   Example: Logistic Regression
 Multi-layer neural networks
   Limitations of single layer networks
   Networks with single hidden layer

Acknowledgement: Hugo Larochelle’s, Mehryar Mohri@NYU’s & Yingyu Liang@Princeton’s course notes
9/9/2020 Xuming He – CS 280 Deep Learning 47
Capacity of single neuron
 Binary classification
 A neuron (with sigmoid activation) estimates p(y = 1 | x)
 Its decision boundary is linear, determined by its weights

9/9/2020 Xuming He – CS 280 Deep Learning 48


Capacity of single neuron
 Can solve linearly separable problems

 Examples

9/9/2020 Xuming He – CS 280 Deep Learning 49


Capacity of single neuron
 Can’t solve problems that are not linearly separable

 Can we use multiple neurons to achieve this?

9/9/2020 Xuming He – CS 280 Deep Learning 50


Capacity of single neuron
 Can’t solve problems that are not linearly separable
 Unless the input is transformed into a better representation

9/9/2020 Xuming He – CS 280 Deep Learning 51


Capacity of single neuron
 Can’t solve problems that are not linearly separable
 Unless the input is transformed into a better representation

9/9/2020 Xuming He – CS 280 Deep Learning 52
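XOR is the classic example of a problem that is not linearly separable (whether it is the exact example on this slide is not visible in the extracted text). A quick sketch of how a transformed representation fixes it: appending the product feature x1*x2 makes XOR linearly separable:

import numpy as np

# XOR in the original 2D representation: no single line separates the classes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# Transformed representation: append x1 * x2 as a third feature.
Phi = np.hstack([X, X[:, :1] * X[:, 1:2]])

# In the new space the linear rule sign(x1 + x2 - 2*x1*x2 - 0.5) is perfect.
w, b = np.array([1.0, 1.0, -2.0]), -0.5
print(np.sign(Phi @ w + b))   # [-1.  1.  1. -1.], matching y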


Adding one more layer
 Single hidden layer neural network
 Also called a 2-layer neural network (the input units are not counted as a layer)

 Q: What happens if the hidden layer uses a linear activation?

9/9/2020 Xuming He – CS 280 Deep Learning 53
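A minimal sketch of the forward pass of a single-hidden-layer network (layer sizes and the tanh activation are placeholders, not the slide's specific choices). The final comment hints at the question on the slide:

import numpy as np

def mlp_forward(x, W1, b1, W2, b2, g=np.tanh):
    """2-layer network: nonlinear hidden layer followed by a linear output layer."""
    h = g(W1 @ x + b1)        # hidden-layer activations
    return W2 @ h + b2        # output pre-activations (feed to softmax/sigmoid as needed)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
print(mlp_forward(np.array([1.0, -1.0, 0.5]), W1, b1, W2, b2))

# Hint for the question above: if g is the identity (a linear activation), then
# W2 @ (W1 @ x + b1) + b2 = (W2 @ W1) @ x + (W2 @ b1 + b2),
# i.e. the two layers collapse into a single linear layer.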


Capacity of neural network
 Single hidden layer neural network
 Partition the input space into regions

9/9/2020 Xuming He – CS 280 Deep Learning 54


Capacity of neural network
 Single hidden layer neural network
 Form a stump/delta function

9/9/2020 Xuming He – CS 280 Deep Learning 55


Capacity of neural network
 Single hidden layer neural network

9/9/2020 Xuming He – CS 280 Deep Learning 56


Multi-layer perceptron
 Boolean case
 Multilayer perceptrons (MLPs) can compute more complex
Boolean functions
 MLPs can compute any Boolean function
 Since they can emulate individual gates
 MLPs are universal Boolean functions

9/9/2020 Xuming He – CS 280 Deep Learning 57


Capacity of neural network
 Universal approximation
 Theorem (Hornik, 1991): A single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units.
 The result applies to sigmoid, tanh, and many other hidden-layer activation functions.

 Caveat: a reassuring result, but not very useful in practice


 How many hidden units?
 How to find the parameters by a learning algorithm?

9/9/2020 Xuming He – CS 280 Deep Learning 58


General neural network
 Multi-layer neural network

9/9/2020 Xuming He – CS 280 Deep Learning 59


Multilayer networks
Multilayer networks
Why more layers (deeper)?
 A deep architecture can represent certain functions more
compactly
 (Montufar et al., NIPS’14): Functions representable by a deep rectifier network can require an exponential number of hidden units when represented by a shallow one.

9/9/2020 Xuming He – CS 280 Deep Learning 62


Why more layers (deeper)?
 A deep architecture can represent certain functions more
compactly
 Example: Boolean functions
 There are Boolean functions that require an exponential number of hidden units in the single-layer case
 but only a polynomial number of hidden units if the number of layers can be adapted

 Example: multivariate polynomials (Rolnick & Tegmark, ICLR’18)


 The total number of neurons m required to approximate natural classes of multivariate polynomials of n variables grows only linearly with n for deep neural networks, but grows exponentially when merely a single hidden layer is allowed.

9/9/2020 Xuming He – CS 280 Deep Learning 63


Why more layers (deeper)?

9/9/2020 Xuming He – CS 280 Deep Learning 64


Summary
 Artificial neurons
 Single-layer network
 Multi-layer neural networks
 Next time
 Computation in neural networks
 Convolutional neural networks

9/9/2020 Xuming He – CS 280 Deep Learning 65
