
CSCI218: Foundations of Artificial Intelligence
Week 4: Learning II
Classical stats/ML: Minimize loss function
§ Which hypothesis space H to choose?
§ E.g., linear combinations of features: h_w(x) = w^T x
§ How to measure degree of fit?
§ Loss function, e.g., squared error Σ_j (y_j − w^T x_j)^2
§ How to trade off degree of fit vs. complexity?
§ Regularization: complexity penalty, e.g., ||w||^2
§ How do we find a good h?
§ Optimization (closed-form, numerical); discrete search
§ How do we know if a good h will predict well?
§ Try it and see (cross-validation, bootstrap, etc.)
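As a concrete illustration of the recipe on this slide, here is a minimal numpy sketch (mine, not from the slides): fit h_w(x) = w^T x by minimizing squared error plus an L2 penalty (the closed-form ridge solution), then "try it and see" on held-out data. The function name fit_ridge and the synthetic dataset are illustrative.

# A minimal sketch (not from the slides) of the classical recipe above:
# linear hypothesis h_w(x) = w^T x, squared-error loss, L2 regularization,
# closed-form optimization, and a hold-out check of predictive quality.
import numpy as np

def fit_ridge(X, y, lam=0.1):
    """Minimize sum_j (y_j - w^T x_j)^2 + lam * ||w||^2 in closed form."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=200)

# "Try it and see": hold out part of the data to estimate predictive error.
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]
w = fit_ridge(X_train, y_train, lam=0.1)
print("held-out MSE:", np.mean((y_test - X_test @ w) ** 2))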

Deep Learning/Neural Network

Image Classification
Very loose inspiration: Human neurons

[Figure: biological neuron: dendrites, cell body (soma), nucleus, axon, axonal arborization; an axon from another cell connects via a synapse]


Simple model of a neuron (McCulloch & Pitts, 1943)
[Figure: a single unit: a fixed bias input a_0 = 1 with bias weight w_{0,j}; inputs a_i arrive on input links with weights w_{i,j}; a summation produces the total input in_j; the activation function g produces the output a_j on the output links]

§ Inputs a_i come from the output of node i to this node j (or from “outside”)
§ Each input link has a weight w_{i,j}
§ There is an additional fixed input a_0 = 1 with bias weight w_{0,j}
§ The total input is in_j = Σ_i w_{i,j} a_i
§ The output is a_j = g(in_j) = g(Σ_i w_{i,j} a_i) = g(w · a)
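A small illustrative sketch (not from the slides) of one such unit; the function name unit_output and the example weights are my own.

# One McCulloch & Pitts style unit: output a_j = g(sum_i w_ij * a_i), with a_0 = 1.
import numpy as np

def unit_output(inputs, weights, g):
    """inputs: activations a_1..a_n; weights: w_0j..w_nj (w_0j is the bias weight)."""
    a = np.concatenate(([1.0], inputs))      # prepend the fixed bias input a_0 = 1
    in_j = np.dot(weights, a)                # total input in_j = sum_i w_ij * a_i
    return g(in_j)                           # output a_j = g(in_j)

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
print(unit_output(np.array([0.5, -1.0]), np.array([0.1, 2.0, 0.3]), sigmoid))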
Activation functions g
[Figure: (a) threshold (step) function; (b) sigmoid g(in_i) = 1/(1 + e^{-in_i})]
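For concreteness, here are the two activation functions from the figure written out as code (a sketch of mine):

import numpy as np

def threshold(x):
    """(a) Hard threshold: outputs 1 when the input is >= 0, else 0."""
    return np.where(x >= 0, 1.0, 0.0)

def sigmoid(x):
    """(b) Sigmoid 1 / (1 + e^{-x}): a smooth, differentiable version of (a)."""
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(threshold(xs))   # [0. 0. 1. 1. 1.]
print(sigmoid(xs))     # values rising smoothly from about 0.05 to about 0.95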
Reminder: Linear Classifiers

▪ Inputs are feature values
▪ Each feature has a weight
▪ Sum is the activation: activation_w(x) = Σ_i w_i · f_i(x) = w · f(x)
▪ If the activation is:
  ▪ Positive, output +1
  ▪ Negative, output −1

[Figure: features f_1, f_2, f_3 enter a weighted sum Σ with weights w_1, w_2, w_3, followed by a "> 0?" test]
How to get probabilistic decisions?

If the activation z = w · f(x) is very positive, we want the probability of +1 to go to 1

If z is very negative, we want the probability to go to 0

Sigmoid function:  φ(z) = 1 / (1 + e^{-z})
Best w?
Maximum likelihood estimation:

    max_w  ll(w) = max_w  Σ_j log P(y_j | x_j; w)

with:
    P(y_j = +1 | x_j; w) = 1 / (1 + e^{-w · f(x_j)})
    P(y_j = −1 | x_j; w) = 1 − 1 / (1 + e^{-w · f(x_j)})

= Logistic Regression
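A hedged numpy sketch of this objective, assuming labels y_j in {+1, −1} and using the identity P(y_j | x_j; w) = 1/(1 + e^{−y_j · w·f(x_j)}); the data and names are illustrative.

import numpy as np

def log_likelihood(w, F, y):
    """F: n x d matrix of feature vectors f(x_j); y: labels in {+1, -1}."""
    # P(y_j | x_j; w) = sigmoid(y_j * w.f(x_j)) covers both label values.
    margins = y * (F @ w)
    return np.sum(-np.log1p(np.exp(-margins)))

w = np.array([0.5, -1.0])
F = np.array([[1.0, 2.0], [0.0, -1.0], [2.0, 0.5]])
y = np.array([+1, -1, +1])
print(log_likelihood(w, F, y))   # maximum-likelihood training maximizes this over w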
Multiclass Logistic Regression
Multi-class linear classification
A weight vector for each class:  w_y

Score (activation) of a class y:  w_y · f(x)

Prediction: the class with the highest score wins:  y = argmax_y w_y · f(x)

How to make the scores into probabilities? Apply the softmax function:

    P(y | x; w) = e^{z_y} / Σ_{y'} e^{z_{y'}},   where z_y = w_y · f(x)

(original activations z → softmax activations P)
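A small sketch of the softmax step (the max-shift is a standard numerical-stability trick, not part of the slide):

import numpy as np

def softmax(z):
    z = z - np.max(z)                 # shift for numerical stability (same result)
    e = np.exp(z)
    return e / np.sum(e)

scores = np.array([2.0, 1.0, -1.0])   # original activations z_y for three classes
print(softmax(scores))                # softmax activations: positive, sum to 1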


Best w?
Maximum likelihood estimation:

    max_w  ll(w) = max_w  Σ_j log P(y_j | x_j; w)

with:
    P(y_j | x_j; w) = e^{w_{y_j} · f(x_j)} / Σ_y e^{w_y · f(x_j)}

= Multi-Class Logistic Regression


Optimization

i.e., how do we solve:  max_w  Σ_j log P(y_j | x_j; w)


Hill Climbing
A simple, general idea
Start wherever
Repeat: move to the best neighboring state
If no neighbors better than current, quit

What’s particularly tricky when hill-climbing for multiclass logistic regression?
• Optimization over a continuous space
• Infinitely many neighbors!
• How to do this efficiently?
1-D Optimization

Could evaluate g(w_0 + h) and g(w_0 − h)
Then step in the best direction

Or, evaluate the derivative:  ∂g(w_0)/∂w = lim_{h→0} [g(w_0 + h) − g(w_0 − h)] / (2h)

It tells us which direction to step in
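A tiny sketch of both 1-D options, using an illustrative objective g of my own choosing:

g = lambda w: -(w - 3.0) ** 2          # example objective with its maximum at w = 3

w0, h, alpha = 0.0, 1e-4, 0.1

# Option 1: evaluate g(w0 + h) and g(w0 - h), step toward the better one.
step = h if g(w0 + h) > g(w0 - h) else -h

# Option 2: estimate the derivative; its sign tells us which direction to step in.
dg = (g(w0 + h) - g(w0 - h)) / (2 * h)
w1 = w0 + alpha * dg
print(step, dg, w1)                    # dg is about 6 at w0 = 0, so we step right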
2-D Optimization

Source: offconvex.org
Gradient Ascent
Perform an update in the uphill direction for each coordinate
The steeper the slope (i.e., the larger the derivative), the bigger the step for that coordinate

E.g., consider a function g(w_1, w_2)

Updates:
    w_1 ← w_1 + α · ∂g/∂w_1 (w_1, w_2)
    w_2 ← w_2 + α · ∂g/∂w_2 (w_1, w_2)

▪ Updates in vector notation:  w ← w + α · ∇_w g(w),   with ∇_w g(w) = gradient
Steepest Descent
o Idea:
o Start somewhere
o Repeat: Take a step in the steepest descent direction

Figure source: Mathworks


Steepest Direction
o Steepest Direction = direction of the gradient

∇g = [ ∂g/∂w_1,  ∂g/∂w_2,  …,  ∂g/∂w_n ]^T
Optimization Procedure: Gradient Ascent

init w
for iter = 1, 2, …
    w ← w + α · ∇_w g(w)

▪ α: the learning rate, a hyperparameter that needs to be chosen carefully
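A minimal sketch of this procedure on a concrete 2-D objective (the objective and the learning rate are my own illustrative choices):

import numpy as np

def grad_g(w):
    """Gradient of g(w) = -(w1 - 1)^2 - (w2 + 2)^2, maximized at (1, -2)."""
    return np.array([-2 * (w[0] - 1.0), -2 * (w[1] + 2.0)])

alpha = 0.1                     # learning rate: too small -> slow, too large -> divergence
w = np.zeros(2)                 # init w
for it in range(100):           # for iter = 1, 2, ...
    w = w + alpha * grad_g(w)   # w <- w + alpha * grad g(w)
print(w)                        # approaches (1, -2)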
Batch Gradient Ascent on the Log Likelihood Objective

max_w  ll(w) = max_w  Σ_j log P(y_j | x_j; w)

init w
for iter = 1, 2, …
    w ← w + α · Σ_j ∇ log P(y_j | x_j; w)
Stochastic Gradient Ascent on the Log Likelihood Objective

Observation: once the gradient on one training example has been computed, we might as well incorporate it before computing the next one

init w
for iter = 1, 2, …
    pick random j
    w ← w + α · ∇ log P(y_j | x_j; w)
Mini-Batch Gradient Ascent on the Log Likelihood Objective

Observation: the gradient over a small set of training examples (= mini-batch) can be computed in parallel, so we might as well do that instead of using a single example

init w
for iter = 1, 2, …
    pick a random subset of training examples J
    w ← w + α · Σ_{j∈J} ∇ log P(y_j | x_j; w)
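The following hedged numpy sketch contrasts the three variants for the multi-class logistic-regression objective above; the synthetic dataset, batch size, and helper names (softmax_rows, grad_log_lik) are my own.

import numpy as np

rng = np.random.default_rng(0)
n, d, C = 300, 4, 3
F = rng.normal(size=(n, d))                       # feature vectors f(x_j)
y = rng.integers(0, C, size=n)                    # labels in {0, ..., C-1}

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def grad_log_lik(W, idx):
    """Gradient of sum_{j in idx} log P(y_j | x_j; W) for weights W (C x d)."""
    P = softmax_rows(F[idx] @ W.T)                # P[j, c] = P(c | x_j; W)
    Y = np.zeros_like(P)
    Y[np.arange(len(idx)), y[idx]] = 1.0          # one-hot true labels
    return (Y - P).T @ F[idx]                     # shape C x d

alpha, W = 0.01, np.zeros((C, d))
for it in range(200):
    # Batch: use all examples each iteration.
    # W = W + alpha * grad_log_lik(W, np.arange(n))

    # Stochastic: pick one random example j.
    # j = rng.integers(n); W = W + alpha * grad_log_lik(W, np.array([j]))

    # Mini-batch: pick a random subset J (the gradient parallelizes over J).
    J = rng.choice(n, size=32, replace=False)
    W = W + alpha * grad_log_lik(W, J)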
Neural Networks
Multi-class Logistic Regression
= special case of neural network (single layer, no hidden layer)
[Figure: features f_1(x), f_2(x), f_3(x), …, f_K(x) feed class scores z_1, z_2, z_3, … into a softmax layer]
Multi-layer Perceptron

[Figure: inputs x_1, x_2, x_3, …, x_L pass through several hidden layers, each applying a nonlinear activation g, followed by a softmax output layer]

g = nonlinear activation function
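A compact sketch (mine) of the forward pass in the figure, assuming ReLU for g and a softmax output layer; the layer sizes are illustrative.

import numpy as np

def g(z):                                   # nonlinear activation (here: ReLU)
    return np.maximum(0.0, z)

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def mlp_forward(x, weights, biases):
    """weights/biases define each layer; the last layer feeds the softmax."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = g(W @ a + b)                    # hidden layers: linear map + nonlinearity
    return softmax(weights[-1] @ a + biases[-1])

rng = np.random.default_rng(0)
x = rng.normal(size=5)                      # inputs x_1..x_L with L = 5
Ws = [rng.normal(size=(8, 5)), rng.normal(size=(8, 8)), rng.normal(size=(3, 8))]
bs = [np.zeros(8), np.zeros(8), np.zeros(3)]
print(mlp_forward(x, Ws, bs))               # class probabilities, sum to 1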


Multi-layer Perceptron
Common Activation Functions

[Figure: common activation functions, e.g., sigmoid, tanh, ReLU; source: MIT 6.S191 introtodeeplearning.com]


Multi-layer Perceptron
Training the MLP neural network is just like logistic regression:

just w tends to be a (much) larger vector

just run gradient ascent ⇒ the Back-propagation algorithm
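A hedged sketch of what "gradient ascent + back-propagation" looks like for a one-hidden-layer network with a softmax output; the toy dataset, layer sizes, and learning rate are my illustrative choices, not the lecture's.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                   # inputs
y = (X[:, 0] + X[:, 1] > 0).astype(int)         # two classes, a simple rule
C, H = 2, 16

W1, b1 = 0.1 * rng.normal(size=(H, 4)), np.zeros(H)
W2, b2 = 0.1 * rng.normal(size=(C, H)), np.zeros(C)
alpha = 0.05

for it in range(300):
    # Forward pass.
    Z1 = X @ W1.T + b1
    A1 = np.maximum(0.0, Z1)                    # ReLU hidden layer
    Z2 = A1 @ W2.T + b2
    Z2 = Z2 - Z2.max(axis=1, keepdims=True)
    P = np.exp(Z2); P /= P.sum(axis=1, keepdims=True)

    # Backward pass (chain rule) for the log-likelihood sum_j log P(y_j | x_j).
    Y = np.zeros_like(P); Y[np.arange(len(y)), y] = 1.0
    dZ2 = Y - P                                 # d ll / d Z2
    dW2, db2 = dZ2.T @ A1, dZ2.sum(axis=0)
    dA1 = dZ2 @ W2
    dZ1 = dA1 * (Z1 > 0)                        # ReLU derivative
    dW1, db1 = dZ1.T @ X, dZ1.sum(axis=0)

    # Gradient ascent step on all parameters.
    W1 += alpha * dW1 / len(y); b1 += alpha * db1 / len(y)
    W2 += alpha * dW2 / len(y); b2 += alpha * db2 / len(y)

print("train accuracy:", np.mean(P.argmax(axis=1) == y))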


Neural Networks Properties
Theorem (Universal Function Approximators). A two-layer
neural network with a sufficient number of neurons can
approximate any continuous function to any desired accuracy.

Practical considerations
Can deal with more complex, nonlinear classification & regression problems
Large numbers of neurons and weights
Danger of overfitting
Deep Learning Model

Neural network as a general computation graph

(Krizhevsky, Sutskever & Hinton, 2012)


Deep Learning Model
§ We need good features!

[Figure: pipeline: Feature Extraction (built from prior knowledge, experience) → Classification → “Panda”?; challenges: pose, occlusion, multiple objects, inter-class similarity]

Image courtesy of M. Ranzato


Deep Learning Model

§ Directly learn feature representations from data.

§ Jointly learn the feature representation and the classifier.

[Figure: Low-level Features → Mid-level Features → High-level Features → Classifier → “Panda”?; increasingly abstract representations]

Deep Learning: train the layers of features so that the classifier works well.

Image courtesy of M. Ranzato


Deep Learning Model
Have we been here before?
➢ Yes.
  • The basic ideas are common to past neural network research
  • Standard machine learning strategies are still relevant
➢ No. Today’s deep learning is enabled by:
  • Large-scale data
  • Computational power
  • New algorithms
Deep Learning Model
Convolutional Neural Networks (CNNs)
§ A special multi-stage architecture inspired by the visual system
§ Higher stages compute more global, more invariant features
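A minimal sketch (not the lecture's code) of the core convolution stage: a small filter slides over the image, so units in later stages depend on larger, more global regions; the image and edge_filter here are illustrative.

import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (really cross-correlation, as in most CNN libraries)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.random.default_rng(0).normal(size=(8, 8))
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)               # responds to vertical edges
feature_map = np.maximum(0.0, conv2d(image, edge_filter))    # convolution + ReLU
print(feature_map.shape)                                     # (6, 6): one CNN stage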
Deep Learning Model

[Figure: LeNet-5, a classic CNN architecture; source: https://www.datasciencecentral.com/lenet-5-a-classic-cnn-architecture/]
Different Neural Network Architectures
§ Exploration of different neural network architectures
§ ResNet: residual networks
§ Networks with attention
§ Transformer networks
§ Neural network architecture search
§ Really large models
§ GPT-2, GPT-3
§ CLIP

Acknowledgement

The lecture slides are based on materials from ai.berkeley.edu.


Thank you. Questions?
