
Lecture Notes 02


Lecture 2: Basic Artificial Neural Networks

Xuming He
SIST, ShanghaiTech
Fall, 2020

9/9/2020 Xuming He – CS 280 Deep Learning 1


Logistics
 Course project
 Each team consists of 3–5 members
 Exceptions may be made if you are among the top 10% in the first 3 quizzes

 Full course schedule on Piazza


 HW1 out next Monday
 Tutorial schedule: please vote on Piazza

 TA office hours
 See Piazza for detailed schedule and location

9/9/2020 Xuming He – CS 280 Deep Learning 2


Outline
 Artificial neuron
   Perceptron algorithm
 Single layer neural networks
   Network models
   Example: Logistic Regression
 Multi-layer neural networks
   Limitations of single layer networks
   Networks with single hidden layer

Acknowledgement: Hugo Larochelle’s, Mehryar Mohri@NYU’s & Yingyu Liang@Princeton’s course notes
9/9/2020 Xuming He – CS 280 Deep Learning 3
Mathematical model of a neuron

9/9/2020 4
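The figure for this slide is not reproduced in the extracted text. A standard formulation of the artificial neuron it illustrates (the notation on the slide may differ) is

\[
a = \mathbf{w}^\top \mathbf{x} + b = \sum_{i=1}^{d} w_i x_i + b, \qquad h = g(a),
\]

where \(\mathbf{x} \in \mathbb{R}^d\) is the input, \(\mathbf{w}\) are the connection weights, \(b\) is the bias, and \(g(\cdot)\) is an activation function such as the sign, sigmoid, or tanh function.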
Single neuron as a linear classifier
 Binary classification

9/9/2020 Xuming He – CS 280 Deep Learning 5
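As a concrete illustration (a minimal sketch, not code from the course), a single neuron with a sign activation acts as a linear binary classifier whose decision boundary is the hyperplane w·x + b = 0:

import numpy as np

def neuron_predict(w, b, x):
    """Single-neuron binary classifier: the sign of the pre-activation w.x + b."""
    a = np.dot(w, x) + b           # pre-activation
    return 1 if a >= 0 else -1     # predicted label in {+1, -1}

# Example with hypothetical weights: points on either side of the
# hyperplane w.x + b = 0 receive different labels.
w, b = np.array([1.0, -2.0]), 0.5
print(neuron_predict(w, b, np.array([3.0, 1.0])))   # +1
print(neuron_predict(w, b, np.array([0.0, 2.0])))   # -1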


How do we determine the weights?
 Learning problem

9/9/2020 Xuming He – CS 280 Deep Learning 6


Linear classification
 Learning problem: simple approach

• Drawback: Sensitive to “outliers”

9/9/2020 Xuming He – CS 280 Deep Learning 7


1D Example
 Compare two predictors

9/9/2020 Xuming He – CS 280 Deep Learning 8


Perceptron algorithm
 Learn a single neuron for binary classification

https://towardsdatascience.com/perceptron-explanation-implementation-and-a-visual-example-3c8e76b4e2d1

9/9/2020 Xuming He – CS 280 Deep Learning 9


Perceptron algorithm
 Learn a single neuron for binary classification

 Task formulation

9/9/2020 Xuming He – CS 280 Deep Learning 10


Perceptron algorithm
 Algorithm outline

9/9/2020 Xuming He – CS 280 Deep Learning 11
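The algorithm itself is shown in the slide figure. A minimal NumPy sketch of the standard mistake-driven perceptron (labels in {+1, -1}; a bias can be handled by appending a constant-1 feature to each input) is

import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """Perceptron: whenever an example is misclassified, add y_i * x_i to the weights."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            if y[i] * np.dot(w, X[i]) <= 0:   # misclassified (or exactly on the boundary)
                w += y[i] * X[i]              # mistake-driven update
                mistakes += 1
        if mistakes == 0:                     # a full pass with no mistakes: stop
            break
    return w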


Perceptron algorithm
 Intuition: correct the current mistake

9/9/2020 Xuming He – CS 280 Deep Learning 12
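One way to state the "correct the current mistake" intuition (a standard argument, written out here because the slide's derivation is in the figure): after an update on a misclassified example \((\mathbf{x}, y)\), the score of that example moves in the right direction,

\[
y\,\langle \mathbf{w} + y\,\mathbf{x},\, \mathbf{x}\rangle
= y\,\langle \mathbf{w}, \mathbf{x}\rangle + y^{2}\,\|\mathbf{x}\|^{2}
= y\,\langle \mathbf{w}, \mathbf{x}\rangle + \|\mathbf{x}\|^{2}
> y\,\langle \mathbf{w}, \mathbf{x}\rangle,
\]

so the updated weights score the mistaken example less wrongly, although a single update need not fix it completely.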


Perceptron algorithm
 The Perceptron theorem

9/9/2020 Xuming He – CS 280 Deep Learning 13
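The theorem on this slide appears as an image. Its standard statement (constants and notation may differ from the slide's version): if every example satisfies \(\|\mathbf{x}_t\| \le R\) and there exists a unit vector \(\mathbf{w}^*\) with margin \(y_t \langle \mathbf{w}^*, \mathbf{x}_t\rangle \ge \gamma > 0\) for all \(t\), then the perceptron algorithm makes at most

\[
\left(\frac{R}{\gamma}\right)^{2}
\]

mistakes, regardless of the order in which the examples are presented.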


Hyperplane Distance
Perceptron algorithm
 The Perceptron theorem: proof

9/9/2020 Xuming He – CS 280 Deep Learning 15


Perceptron algorithm
 The Perceptron theorem: proof

9/9/2020 Xuming He – CS 280 Deep Learning 16


Perceptron algorithm
 The Perceptron theorem: proof intuition

9/9/2020 Xuming He – CS 280 Deep Learning 17


Perceptron algorithm
 The Perceptron theorem: proof

9/9/2020 Xuming He – CS 280 Deep Learning 18


Perceptron algorithm
 The Perceptron theorem

9/9/2020 Xuming He – CS 280 Deep Learning 19


Perceptron Learning problem
 What loss function is minimized?

9/9/2020 Xuming He – CS 280 Deep Learning 20
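The loss itself is shown in the slide figures. The perceptron update can be read as stochastic gradient descent on the standard perceptron loss (a common formulation; the slide's notation may differ):

\[
L(\mathbf{w}) = \sum_{i=1}^{n} \max\bigl(0,\, -y_i\,\mathbf{w}^\top \mathbf{x}_i\bigr),
\]

which is zero for correctly classified examples and grows linearly with how far a misclassified example lies on the wrong side of the boundary.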


Perceptron algorithm
 What loss function is minimized?

9/9/2020 Xuming He – CS 280 Deep Learning 21


Perceptron algorithm
 What loss function is minimized?

9/9/2020 Xuming He – CS 280 Deep Learning 22


Outline
 Artificial neuron
   Perceptron algorithm
 Single layer neural networks
   Network models
   Example: Logistic Regression
 Multi-layer neural networks
   Limitations of single layer networks
   Networks with single hidden layer

Acknowledgement: Hugo Larochelle’s, Mehryar Mohri@NYU’s & Yingyu Liang@Princeton’s course notes
9/9/2020 Xuming He – CS 280 Deep Learning 23
Single layer neural network

9/9/2020 24
Single layer neural network

9/9/2020 25
Single layer neural network

9/9/2020 26
What is the output?
 Element-wise nonlinear functions
 Independent feature/attribute detectors

9/9/2020 Xuming He – CS 280 Deep Learning 27
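A minimal sketch of the usual element-wise nonlinearities this slide refers to (sigmoid, tanh, and ReLU are standard examples; the slide's exact list is in the figure):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))   # squashes to (0, 1)

def tanh(a):
    return np.tanh(a)                 # squashes to (-1, 1)

def relu(a):
    return np.maximum(0.0, a)         # zero for negative pre-activations

# Each function is applied independently to every pre-activation,
# so each hidden unit acts as an independent feature/attribute detector.
a = np.array([-2.0, 0.0, 3.0])
print(sigmoid(a), tanh(a), relu(a))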


What is the output?
 Nonlinear functions with vector input
 Competition between neurons

9/9/2020 Xuming He – CS 280 Deep Learning 28


What is the output?
 Nonlinear functions with vector input
 Example: Winner-Take-All (WTA)

9/9/2020 Xuming He – CS 280 Deep Learning 29


A probabilistic perspective
 Change the output nonlinearity

 From WTA to Softmax function

9/9/2020 Xuming He – CS 280 Deep Learning 30
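A minimal, numerically stable sketch of the softmax function, the smooth counterpart of winner-take-all discussed on this slide:

import numpy as np

def softmax(a):
    """softmax(a)_k = exp(a_k) / sum_j exp(a_j), computed stably."""
    a = a - np.max(a)        # shifting by a constant does not change the result
    e = np.exp(a)
    return e / np.sum(e)

# Unlike hard WTA (a one-hot vector at the argmax), softmax returns a
# probability distribution that still favors the largest pre-activation.
print(softmax(np.array([2.0, 1.0, 0.1])))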


Multiclass linear classifiers

 The WTA prediction is a one-hot encoding of the predicted label

9/9/2020 Xuming He – CS 280 Deep Learning 31


Probabilistic outputs

9/9/2020 Xuming He – CS 280 Deep Learning 32


How to learn a multiclass classifier?
 Define a loss function and minimize it

9/9/2020 Xuming He – CS 280 Deep Learning 33


Learning a multiclass linear classifier
 Design a loss function for multiclass classifiers
 Perceptron?
 Yes, see homework
 Hinge loss
 The SVM and max-margin (see CS231n)
 Probabilistic formulation
 Log loss and logistic regression
 Generalization issue
 Avoid overfitting by regularization

9/9/2020 Xuming He – CS 280 Deep Learning 34


Example: Logistic Regression
 Learning loss: negative log likelihood

9/9/2020 Xuming He – CS 280 Deep Learning 35
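The loss on this slide is shown in the figure. For reference, the standard negative log-likelihood for a softmax (multiclass logistic regression) output is (notation may differ from the slide)

\[
L(\mathbf{W}) \;=\; -\sum_{i=1}^{n} \log p\bigl(y_i \mid \mathbf{x}_i; \mathbf{W}\bigr)
\;=\; -\sum_{i=1}^{n} \log \frac{\exp\bigl(\mathbf{w}_{y_i}^\top \mathbf{x}_i\bigr)}{\sum_{k} \exp\bigl(\mathbf{w}_{k}^\top \mathbf{x}_i\bigr)}.
\]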


Logistic Regression
 Learning loss: example

9/9/2020 Xuming He – CS 280 Deep Learning 36


Logistic Regression
 Learning loss: questions

9/9/2020 Xuming He – CS 280 Deep Learning 37


Logistic Regression
 Learning loss: questions

9/9/2020 Xuming He – CS 280 Deep Learning 38


Learning with regularization
 Constraints on hypothesis space
 Similar to Linear Regression

9/9/2020 Xuming He – CS 280 Deep Learning 39


Learning with regularization
 Regularization terms
  Priors on the weights
  Bayesian: integrating out the weights
  Empirical: computing the MAP estimate of W

9/9/2020 Xuming He – CS 280 Deep Learning 40
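Written out (a standard formulation; the slide's exact expression is in the figure), the MAP view adds a penalty derived from the prior to the data loss:

\[
\min_{\mathbf{W}} \; -\sum_{i=1}^{n} \log p\bigl(y_i \mid \mathbf{x}_i; \mathbf{W}\bigr) \;+\; \lambda\, \Omega(\mathbf{W}),
\]

where a Gaussian prior on the weights yields the L2 penalty \(\Omega(\mathbf{W}) = \|\mathbf{W}\|_2^2\) and a Laplace prior yields the L1 penalty \(\Omega(\mathbf{W}) = \|\mathbf{W}\|_1\).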


L1 vs L2 regularization

https://www.youtube.com/watch?v=jEVh0uheCPk
9/9/2020 Xuming He – CS 280 Deep Learning 41
L1 vs L2 regularization
 Sparsity

9/9/2020 Xuming He – CS 280 Deep Learning 42


Optimization: gradient descent
 Gradient descent

 Learning rate matters

9/9/2020 Xuming He – CS 280 Deep Learning 43
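For reference, the update this slide discusses has the standard form

\[
\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \eta\, \nabla_{\boldsymbol{\theta}} L\bigl(\boldsymbol{\theta}^{(t)}\bigr),
\]

where the learning rate \(\eta\) controls the step size: too small and convergence is slow, too large and the iterates can oscillate or diverge.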


Optimization: gradient descent
 Stochastic gradient descent

9/9/2020 Xuming He – CS 280 Deep Learning 44


Optimization: gradient descent
 Stochastic gradient descent

9/9/2020 Xuming He – CS 280 Deep Learning 45
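A minimal sketch of minibatch stochastic gradient descent for the softmax-regression loss above (a generic illustration under standard assumptions, not the course's reference implementation):

import numpy as np

def softmax_rows(A):
    A = A - A.max(axis=1, keepdims=True)          # stabilize before exponentiating
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def sgd_softmax_regression(X, y, num_classes, lr=0.1, epochs=10, batch_size=32):
    """Minibatch SGD on the multiclass NLL; y holds integer labels in {0, ..., num_classes-1}."""
    n, d = X.shape
    W = np.zeros((d, num_classes))
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(n)                # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            P = softmax_rows(Xb @ W)              # predicted class probabilities
            P[np.arange(len(idx)), yb] -= 1.0     # gradient of the NLL w.r.t. the logits
            W -= lr * (Xb.T @ P) / len(idx)       # gradient step on this minibatch
    return W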


Interpreting network weights
 What are those weights?

9/9/2020 Xuming He – CS 280 Deep Learning 46


Outline
 Artificial neuron
   Perceptron algorithm
 Single layer neural networks
   Network models
   Example: Logistic Regression
 Multi-layer neural networks
   Limitations of single layer networks
   Networks with single hidden layer

Acknowledgement: Hugo Larochelle’s, Mehryar Mohri@NYU’s & Yingyu Liang@Princeton’s course notes
9/9/2020 Xuming He – CS 280 Deep Learning 47
Capacity of single neuron
 Binary classification
 A neuron (with sigmoid activation) estimates p(y = 1 | x)
 Its decision boundary is linear, determined by its weights

9/9/2020 Xuming He – CS 280 Deep Learning 48


Capacity of single neuron
 Can solve linearly separable problems

 Examples

9/9/2020 Xuming He – CS 280 Deep Learning 49


Capacity of single neuron
 Can’t solve problems that are not linearly separable

 Can we use multiple neurons to achieve this?

9/9/2020 Xuming He – CS 280 Deep Learning 50


Capacity of single neuron
 Can’t solve problems that are not linearly separable
 Unless the input is transformed into a better representation

9/9/2020 Xuming He – CS 280 Deep Learning 51


Capacity of single neuron
 Can’t solve problems that are not linearly separable
 Unless the input is transformed into a better representation

9/9/2020 Xuming He – CS 280 Deep Learning 52
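XOR is the classic example of a problem that is not linearly separable (whether it is the exact example on this slide is not visible in the extracted text). A quick sketch of how a transformed representation fixes it: appending the product feature x1*x2 makes XOR linearly separable:

import numpy as np

# XOR in the original 2D representation: no single line separates the classes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# Transformed representation: append x1 * x2 as a third feature.
Phi = np.hstack([X, X[:, :1] * X[:, 1:2]])

# In the new space the linear rule sign(x1 + x2 - 2*x1*x2 - 0.5) is perfect.
w, b = np.array([1.0, 1.0, -2.0]), -0.5
print(np.sign(Phi @ w + b))   # [-1.  1.  1. -1.], matching y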


Adding one more layer
 Single hidden layer neural network
 Also called a 2-layer neural network (the input units are not counted as a layer)

 Q: What happens if the hidden layer uses a linear activation?

9/9/2020 Xuming He – CS 280 Deep Learning 53
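A minimal sketch of the forward pass of a single-hidden-layer network (layer sizes and the tanh activation are placeholders, not the slide's specific choices). The final comment hints at the question on the slide:

import numpy as np

def mlp_forward(x, W1, b1, W2, b2, g=np.tanh):
    """2-layer network: nonlinear hidden layer followed by a linear output layer."""
    h = g(W1 @ x + b1)        # hidden-layer activations
    return W2 @ h + b2        # output pre-activations (feed to softmax/sigmoid as needed)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
print(mlp_forward(np.array([1.0, -1.0, 0.5]), W1, b1, W2, b2))

# Hint for the question above: if g is the identity (a linear activation), then
# W2 @ (W1 @ x + b1) + b2 = (W2 @ W1) @ x + (W2 @ b1 + b2),
# i.e. the two layers collapse into a single linear layer.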


Capacity of neural network
 Single hidden layer neural network
 Partition the input space into regions

9/9/2020 Xuming He – CS 280 Deep Learning 54


Capacity of neural network
 Single hidden layer neural network
 Form a stump/delta function

9/9/2020 Xuming He – CS 280 Deep Learning 55


Capacity of neural network
 Single hidden layer neural network

9/9/2020 Xuming He – CS 280 Deep Learning 56


Multi-layer perceptron
 Boolean case
 Multilayer perceptrons (MLPs) can compute more complex
Boolean functions
 MLPs can compute any Boolean function
 Since they can emulate individual gates
 MLPs are universal Boolean functions

9/9/2020 Xuming He – CS 280 Deep Learning 57


Capacity of neural network
 Universal approximation
 Theorem (Hornik, 1991): A single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units.
 The result applies to sigmoid, tanh, and many other hidden-layer activation functions.

 Caveat: a reassuring result, but not very useful in practice


 How many hidden units?
 How to find the parameters by a learning algorithm?

9/9/2020 Xuming He – CS 280 Deep Learning 58


General neural network
 Multi-layer neural network

9/9/2020 Xuming He – CS 280 Deep Learning 59


Multilayer networks
Multilayer networks
Why more layers (deeper)?
 A deep architecture can represent certain functions more
compactly
 (Montufar et al., NIPS’14): Functions representable by a deep rectifier network can require an exponential number of hidden units when represented by a shallow one.

9/9/2020 Xuming He – CS 280 Deep Learning 62


Why more layers (deeper)?
 A deep architecture can represent certain functions more
compactly
 Example: Boolean functions
 There are Boolean functions that require an exponential number of hidden units in the single-layer case
 but only a polynomial number of hidden units if the number of layers can be adapted

 Example: multivariate polynomials (Rolnick & Tegmark, ICLR’18)


 The total number of neurons m required to approximate natural classes of multivariate polynomials of n variables grows only linearly with n for deep neural networks, but grows exponentially when merely a single hidden layer is allowed.

9/9/2020 Xuming He – CS 280 Deep Learning 63


Why more layers (deeper)?

9/9/2020 Xuming He – CS 280 Deep Learning 64


Summary
 Artificial neurons
 Single-layer network
 Multi-layer neural networks
 Next time
 Computation in neural networks
 Convolutional neural networks

9/9/2020 Xuming He – CS 280 Deep Learning 65
