The Little Book
of
Deep Learning
François Fleuret
beta-2023.05.06
François Fleuret is professor of computer science
at the University of Geneva, Switzerland.
List of Figures
Foreword

I Foundations

1 Machine Learning
1.1 Learning from data
1.2 Basis function regression
1.3 Under- and over-fitting
1.4 Categories of models

2 Efficient computation
2.1 GPUs, TPUs, and batches
2.2 Tensors

3 Training
3.1 Losses
3.2 Autoregressive models
3.3 Gradient descent
3.4 Backpropagation
3.5 Training protocols
3.6 Training data

II Deep models

4 Model components
4.1 The notion of layer
4.2 Linear layers
4.3 Activation functions
4.4 Pooling
4.5 Dropout
4.6 Normalizing layers
4.7 Skip connections
4.8 Attention layers
4.9 Token embedding
4.10 Positional encoding

5 Architectures
5.1 Multi-Layer Perceptrons
5.2 Convolutional networks
5.3 Attention models

Afterword
Bibliography
Index
List of Figures

4.1 1d convolution
4.2 2d convolution
4.3 Stride, padding, and dilation
4.4 Receptive field
4.5 Activation functions
4.6 Max pooling
4.7 Dropout
4.8 Batch normalization
4.9 Skip connections
4.10 Attention operator
4.11 Interpretation of the attention operator
4.12 Multi-Head Attention layer
Foreword
If you did not get this book from its official URL
https://fleuret.org/public/lbdl.pdf

François Fleuret
April 21, 2023
Part I
Foundations
Chapter 1
Machine Learning
1.1 Learning from data
The simplest use case for a model trained from
data is when a signal x is accessible, for instance
the picture of a license plate, from which one
wants to predict a quantity y, such as the string
of characters written on the plate.
1.2 Basis function regression
We can illustrate the training of a model in a sim-
ple case where xn and yn are two real numbers,
the loss is the mean squared error
\[
\mathscr{L}(w) = \frac{1}{N}\sum_{n=1}^{N} \big(y_n - f(x_n;w)\big)^2, \qquad (1.1)
\]
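To make this concrete, here is a minimal PyTorch sketch (not taken from the book) that fits such a model with a polynomial basis, for which minimizing the MSE of Equation 1.1 is a linear least-squares problem:

    import torch

    # Synthetic 1d data: y = sin(x) + noise
    N = 100
    x = torch.empty(N).uniform_(-3, 3)
    y = torch.sin(x) + 0.1 * torch.randn(N)

    # Basis function model f(x; w) = sum_k w_k x^k with K = 8 monomial basis functions
    K = 8
    X = torch.stack([x**k for k in range(K)], dim=1)         # N x K design matrix

    # Minimizing the mean squared error is a linear least-squares problem in w
    w = torch.linalg.lstsq(X, y.unsqueeze(1)).solution.squeeze(1)

    print(((X @ w - y)**2).mean().item())                    # training MSE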
This is over-fitting.
1.4 Categories of models
We can organize the use of machine learning
models into three main categories:
Chapter 2
Efficient computation
2.1 GPUs, TPUs, and batches
Graphical Processing Units (GPUs) were originally designed for real-time image synthesis, which requires highly parallel architectures that happen to be well suited to deep models. As their usage for AI increased, GPUs were equipped with dedicated sub-components referred to as tensor cores, and deep-learning specialized chips such as Google's Tensor Processing Units (TPUs) have been produced.
to the cache memory near the actual computing
units. Proceeding by batches allows for copying
the model parameters only once, instead of doing
it for every sample. In practice a GPU processes
a batch that fits in memory almost as quickly as
a single sample.
2.2 Tensors
GPUs and deep learning frameworks such as PyTorch or JAX manipulate the quantities to be processed by organizing them as tensors, which are series of scalars arranged along several discrete axes. They are elements of ℝ^{N_1×···×N_D} that generalize the notion of vector and matrix.
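For instance, with PyTorch (a minimal sketch for illustration):

    import torch

    # A tensor is a series of scalars arranged along several discrete axes,
    # here a batch of 16 RGB images of resolution 32x32.
    x = torch.zeros(16, 3, 32, 32)

    print(x.shape)       # torch.Size([16, 3, 32, 32])
    print(x.dtype)       # torch.float32
    print(x.numel())     # 16 * 3 * 32 * 32 = 49152 scalars

    # Tensors can be reshaped, and moved to an accelerator if one is available.
    v = x.reshape(16, -1)                  # a 16 x 3072 matrix
    if torch.cuda.is_available():
        v = v.to("cuda")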
Chapter 3
Training
3.1 Losses
The example of the mean squared error of Equation 1.1 is a standard loss for predicting a continuous value.

For classification, the standard approach is to have the model output one logit per class, and to convert them into an estimate of the posterior probabilities with a softmax

\[
\hat{P}(Y = y \mid X = x) = \frac{\exp f(x;w)_y}{\sum_{z} \exp f(x;w)_z},
\]

and the loss is then the cross-entropy, that is, the average over the training samples of the negative log of this probability for the correct class.
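As a rough sketch (not from the book), this computation and the resulting cross-entropy loss correspond to:

    import torch
    import torch.nn.functional as F

    logits = torch.randn(16, 10)             # f(x; w) for a batch of 16 samples, 10 classes
    targets = torch.randint(0, 10, (16,))    # correct classes y

    # Posterior probabilities through a softmax, as in the equation above
    post = logits.softmax(dim=1)

    # Cross-entropy: negative log-probability of the correct class, averaged over the batch
    loss_manual = -post[torch.arange(16), targets].log().mean()
    loss = F.cross_entropy(logits, targets)  # same value, computed in a numerically safer way

    print(loss_manual.item(), loss.item())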
For density modeling, the standard loss is the negative log-likelihood of the data: if f(x;w) is interpreted as a normalized log-probability or log-density, the loss is the opposite of its sum over the training samples.
3.2 Autoregressive models
Many spectacular applications in computer vision and natural language processing have been tackled by modeling the distribution of a high-dimensional discrete vector with the chain rule

\[
P(X_1 = x_1, \dots, X_T = x_T) = P(X_1 = x_1)\,P(X_2 = x_2 \mid X_1 = x_1)\cdots P(X_T = x_T \mid X_1 = x_1, \dots, X_{T-1} = x_{T-1}),
\]

which implies that one can sample a full sequence of length T by sampling the x_t's one after another, each according to the predicted posterior distribution, given the x_1, …, x_{t−1} already sampled. This is an autoregressive generative model.
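A minimal sampling loop for such a model could look as follows (a sketch; `model` is a hypothetical module mapping a sequence of token indices to next-token logits):

    import torch

    def autoregressive_sample(model, T, start_token=0):
        # x holds the sequence sampled so far, starting with a conventional initial token
        x = torch.full((1, 1), start_token, dtype=torch.long)
        for _ in range(T):
            logits = model(x)                        # shape (1, t, vocabulary size)
            probs = logits[:, -1].softmax(dim=-1)    # posterior for the next token x_t
            next_token = torch.multinomial(probs, num_samples=1)
            x = torch.cat([x, next_token], dim=1)    # append it and condition on it next
        return x[:, 1:]                              # drop the initial token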
3.3 Gradient descent
Except in specific cases like the basis function regression of § 1.2, the optimal parameters w* have no closed-form expression. In the general case, the tool of choice to minimize a function is gradient descent. It consists of initializing the parameters with a random w_0, and then improving this estimate by iterating gradient steps, each consisting of computing the gradient of the loss with respect to the parameters, and subtracting a fraction of it:

\[
w_{n+1} = w_n - \eta \,\nabla \mathscr{L}|_{w_n},
\]

where the positive scalar η is the learning rate.
As for many algorithms, intuition tends to break down in very high dimensions, and although it seems that this procedure would easily get trapped in a local minimum, in reality, due to the number of parameters, the design of the models, and the stochasticity of the data, its efficiency is far greater than one might expect.
Since the loss is an average of per-sample terms,

\[
\mathscr{L}(w) = \frac{1}{N}\sum_{n=1}^{N} \ell_n(w), \qquad \text{where}\quad \ell_n(w) = L(f(x_n;w), y_n)
\]

for some per-sample loss L, the gradient is

\[
\nabla \mathscr{L}|_{w} = \frac{1}{N}\sum_{n=1}^{N} \nabla \ell_n|_{w}. \qquad (3.2)
\]
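A bare-bones training loop implementing these gradient steps over mini-batches could look as follows (a sketch, assuming a `model`, training tensors `x_train` and `y_train`, and a mean squared error loss):

    import torch

    def train(model, x_train, y_train, lr=1e-2, nb_epochs=10, batch_size=100):
        for _ in range(nb_epochs):
            # Visit the training set in random mini-batches
            for idx in torch.randperm(x_train.size(0)).split(batch_size):
                loss = ((model(x_train[idx]) - y_train[idx])**2).mean()

                model.zero_grad()
                loss.backward()            # gradient averaged over the mini-batch, as in Eq. 3.2

                with torch.no_grad():
                    for p in model.parameters():
                        p -= lr * p.grad   # gradient step: subtract a fraction of the gradient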
different training speeds in different parts of a model.
3.4 Backpropagation
Using gradient descent requires a technical means to compute ∇ℓ|_w where ℓ = L(f(x;w), y). Given that f and L are both compositions of standard tensor operations, as for any mathematical expression, the chain rule allows us to get an expression of it.
[Figure: the forward pass computes x^(d) = f_d(x^(d−1); w_d); the backward pass obtains ∇ℓ|x^(d−1) and ∇ℓ|w_d from ∇ℓ|x^(d) through products with the Jacobians J_{f_d}|x and J_{f_d}|w.]
Forward and backward passes
Consider the simple case of a composition of
mappings
f = f1 ◦f2 ◦···◦fD .
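In PyTorch, this machinery is provided by autograd; a minimal sketch for illustration:

    import torch

    # A composition of two differentiable mappings, with parameters w1 and w2
    w1 = torch.randn(5, 3, requires_grad=True)
    w2 = torch.randn(1, 5, requires_grad=True)

    x, y = torch.randn(3), torch.tensor([1.0])

    h = torch.tanh(w1 @ x)              # forward pass, first mapping
    loss = (w2 @ h - y).pow(2).mean()   # forward pass, second mapping and loss

    loss.backward()                     # backward pass: the chain rule, applied automatically
    print(w1.grad.shape, w2.grad.shape) # gradients of the loss with respect to the parameters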
Resource usage
Regarding the computational cost, as we will see, the bulk of the computation goes into linear operations, which require one matrix product for the forward pass and two for the products by the Jacobians in the backward pass. This makes the latter roughly twice as costly as the former.
Vanishing gradient

A key historical issue when training a large network is that when the gradient propagates backwards through many operators it may decrease or increase exponentially. When it decreases exponentially, this is referred to as the vanishing gradient problem.

3.5 Training protocols

Training a model requires splitting the available data into at least two parts: one is the training set used to optimize the model parameters, and the other is a test set to estimate the performance of the trained model.
[Figure: typical evolution of the training and validation losses as a function of the number of epochs.]
An important design choice is the learning rate schedule during training. The general policy is that the learning rate should be initially large, to avoid having the optimization trapped in a bad local minimum early on, and that it should then be reduced, so that the optimized parameter values do not bounce around and can reach a good minimum in a narrow valley of the loss landscape.
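With PyTorch's optimizers this policy can be expressed with a scheduler; a sketch for illustration, using a stand-in linear model:

    import torch

    model = torch.nn.Linear(10, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-1)

    # Divide the learning rate by 10 after epochs 30 and 60
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60], gamma=0.1)

    for epoch in range(90):
        # ... one pass over the training set, calling optimizer.step() on every mini-batch ...
        scheduler.step()    # update the learning rate once per epoch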
3.6 Training data
One key aspect of deep learning is the steady improvement of performance with the training set size, even in the multi-billion-sample regime.
the computing device’s memory.
Part II
Deep models
Chapter 4
Model components
[Figure: example of the graphical notation used to depict models: an input tensor X of size 32×32, an operator f repeated ×K times, an operator g with meta-parameter n=4, and an output tensor Y of size 4×4.]
• non-default valued meta-parameters are
added in blue on their right,
4.2 Linear layers
Linear layers are the most important modules
in terms of computation and number of parame-
ters. They benefit from decades of research and
engineering in algorithmic and chip design for
matrix operations.
A fully connected layer, parameterized by a trainable weight matrix W and bias vector b, applies the same affine transformation to every vector along the last axis of its input tensor X to produce the output Y:

\[
\forall d_1, \dots, d_K, \quad Y[d_1, \dots, d_K] = W\, X[d_1, \dots, d_K] + b.
\]
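In PyTorch such a layer is provided by `torch.nn.Linear`; a minimal sketch:

    import torch

    # A fully connected layer with a weight matrix W of size 25 x 50 and a bias b of size 25
    layer = torch.nn.Linear(in_features=50, out_features=25)

    # It applies the same affine mapping along the last axis of an arbitrarily shaped input
    x = torch.randn(16, 8, 50)      # e.g. a batch of 16 sequences of 8 feature vectors
    print(layer(x).shape)           # torch.Size([16, 8, 25])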
Convolutional layers
A linear layer can take as input an arbitrarily
shaped tensor by reshaping it into a vector, as
long as it has the right number of coefficients.
However such a layer is poorly adapted to deal-
ing with large tensors since the number of pa-
rameters and number of operations are propor-
tional to the product of the input and output
dimensions. For instance, to process an RGB image of size 256×256 as input and compute a result of the same size, it would require ≃ 4×10^10 parameters and multiplications.
Figure 4.1: A 1d convolution (left) takes as input
a D×T tensor X, applies the same affine mapping
ϕ(·;w) to every sub-tensor of shape D×K, and stores
the resulting D′ ×1 tensors into Y . A 1d transposed
convolution (right) takes as input a D×T tensor, ap-
plies the same affine mapping ψ(·;w) to every sub-
tensor of shape D×1, and sums the shifted resulting
D′ ×K tensors. Both can process inputs of different
size.
Figure 4.2: A 2d convolution (left) takes as input a
D×H ×W tensor X, applies the same affine map-
ping ϕ(·;w) to every sub-tensor of shape D×K ×L,
and stores the resulting D′ ×1×1 tensors into Y . A
2d transposed convolution (right) takes as input a
D×H ×W tensor, applies the same affine mapping
ψ(·;w) to every D×1×1 sub-tensor, and sums the
shifted resulting D′ ×K ×L tensors into Y .
Figure 4.3: Besides its kernel size and number of input/output channels, a convolution admits three meta-parameters: the stride s (left) modulates the step size when going through the input tensor, the padding p (top right) specifies how many zero entries are added around the input tensor before processing it, and the dilation d (bottom right) parameterizes the index count between the coefficients of the filter.
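These meta-parameters map directly to the arguments of `torch.nn.Conv2d`; a small sketch, for illustration, of their effect on the output shape:

    import torch

    x = torch.randn(1, 3, 32, 32)                        # one RGB image of size 32x32

    conv = torch.nn.Conv2d(3, 8, kernel_size=5)          # D=3 input, D'=8 output channels
    print(conv(x).shape)                                 # torch.Size([1, 8, 28, 28])

    conv_s = torch.nn.Conv2d(3, 8, kernel_size=5, stride=2)
    print(conv_s(x).shape)                               # torch.Size([1, 8, 14, 14])

    conv_p = torch.nn.Conv2d(3, 8, kernel_size=5, padding=2)
    print(conv_p(x).shape)                               # torch.Size([1, 8, 32, 32])

    conv_d = torch.nn.Conv2d(3, 8, kernel_size=5, dilation=2)
    print(conv_d(x).shape)                               # torch.Size([1, 8, 24, 24])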
the same operator everywhere.
Figure 4.4: Given an activation in a series of convolu-
tion layers, here in red, its receptive field is the area in
the input signal, in blue, that modulates its value. Each
intermediate convolutional layer increases the width
and height of that area by roughly those of the kernel.
[Figure: activation functions, including Tanh and ReLU.]

The GELU is defined as

\[
\operatorname{gelu}(x) = x\,P(Z \leq x),
\]

where Z is a standard Gaussian random variable.
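A quick numerical check of this definition against PyTorch's implementation (a sketch, for illustration):

    import math
    import torch
    import torch.nn.functional as F

    x = torch.linspace(-3, 3, 7)

    # P(Z <= x) for a standard Gaussian, written with the error function
    cdf = 0.5 * (1 + torch.erf(x / math.sqrt(2)))

    print(torch.allclose(x * cdf, F.gelu(x), atol=1e-6))   # True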
4.4 Pooling
A classical strategy to reduce the signal size is to
use a pooling operation that combines multiple
activations into one that ideally summarizes the
information. The most standard operation of this
class is the max pooling layer which, similarly
to convolution, can operate in 1d and 2d, and is
defined by a kernel size.
[Figure: 1d max pooling, applying a max operator to successive sub-tensors of the kernel size.]
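In PyTorch (a minimal sketch):

    import torch

    x = torch.randn(1, 8, 64)                  # a batch of one 8-channel 1d signal of length 64

    pool = torch.nn.MaxPool1d(kernel_size=2)   # max over non-overlapping windows of size 2
    print(pool(x).shape)                       # torch.Size([1, 8, 32])

    # The 2d counterpart halves both spatial dimensions of an image-like tensor
    z = torch.nn.MaxPool2d(kernel_size=2)(torch.randn(1, 8, 32, 32))
    print(z.shape)                             # torch.Size([1, 8, 16, 16])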
The average pooling layer computes the average instead of the max over the sub-tensors. This is a linear operation, while max pooling is not.
4.5 Dropout
Some layers have been designed to explicitly
facilitate training, or improve the quality of the
learned representations.
Figure 4.7: Dropout can process a tensor of arbitrary
shape. During training (left), it sets activations at ran-
dom to zero with probability p and applies a multiply-
ing factor to keep the expected values unchanged. Dur-
ing test (right), it keeps all the activations unchanged.
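In PyTorch (a minimal sketch):

    import torch

    drop = torch.nn.Dropout(p=0.5)
    x = torch.ones(2, 8)

    drop.train()          # training mode: roughly half the activations are zeroed,
    print(drop(x))        # the others are rescaled by 1/(1-p) = 2

    drop.eval()           # test mode: the input is passed through unchanged
    print(drop(x))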
4.6 Normalizing layers
An important class of operators to facilitate the training of deep architectures are the normalizing layers, which force the empirical mean and variance of groups of activations to fixed values.
[Figure: batchnorm and layernorm both compute (x − m̂)/√(v̂ + ϵ) followed by x ⊙ γ + β, but estimate the statistics m̂ and v̂ over different groups of activations: across the batch B for batchnorm, and across the components D (and H, W) of each sample for layernorm.]
deviation γ_d:

\[
z_{b,d} = \frac{x_{b,d} - \hat m_d}{\sqrt{\hat v_d + \epsilon}}, \qquad y_{b,d} = \gamma_d\, z_{b,d} + \beta_d.
\]
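In PyTorch these correspond to `nn.BatchNorm1d` and `nn.LayerNorm`; a minimal sketch:

    import torch

    x = torch.randn(16, 8)               # a batch of 16 samples with 8 components each

    bn = torch.nn.BatchNorm1d(8)         # statistics per component, across the batch
    ln = torch.nn.LayerNorm(8)           # statistics per sample, across the components

    print(bn(x).mean(dim=0), bn(x).std(dim=0))   # roughly 0 and 1 for every component
    print(ln(x).mean(dim=1), ln(x).std(dim=1))   # roughly 0 and 1 for every sample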
4.7 Skip connections
Another technique that mitigates the vanishing gradient and allows the training of deep architectures are the skip connections [Long et al., 2014; Ronneberger et al., 2015]. They are not layers per se, but an architectural design in which the outputs of some layers are transported as-is to other layers further in the model, bypassing the processing in-between. This unmodified signal can be concatenated or added to the input of the layer the connection branches into (see Figure 4.9). A particular type of skip connection is the residual connection, which combines the signal with a sum, and usually skips only a few layers (see Figure 4.9, right).
[Figure 4.9: skip connections transporting the output of some layers to layers further in the model, either over long ranges (left, center) or as residual connections added back after a few layers (right).]
Their role can also be to facilitate multi-scale rea-
soning in models that reduce the signal size be-
fore re-expanding it, by connecting layers with
compatible size. In the case of residual connec-
tions, they may also facilitate the learning by
simplifying the task to finding a differential im-
provement instead of a full update.
4.8 Attention layers
In many applications there is a need for a pro-
cessing able to combine local information at lo-
cations far apart in a tensor. This can be for
instance distant details for coherent and realistic
image synthesis, or words at different positions
in a paragraph to make a grammatical or seman-
tic decision in natural language processing.
Attention layers address this need; they are the core building block of Transformers, the dominant architecture for large language models. See § 5.3 and § 7.1.
Attention operator
Given query, key, and value tensors Q, K, and V, the attention operator computes a tensor

\[
Y = \operatorname{att}(K, Q, V)
\]
[Figure: the attention operator computes an attention matrix A from Q and K, and uses it to average the values V into Y.]
attention scores:

\[
Y_n = \sum_{m} A_{n,m} V_m. \qquad (4.2)
\]
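A direct implementation of this operator (a sketch, with the usual scaling of the dot products by 1/√D_QK before the softmax):

    import math
    import torch

    def att(K, Q, V):
        # Q: (N^Q, D_QK), K: (N^KV, D_QK), V: (N^KV, D_V)
        A = torch.softmax(Q @ K.t() / math.sqrt(Q.size(-1)), dim=-1)   # attention matrix
        return A @ V                                                   # Y_n = sum_m A_{n,m} V_m

    Q, K, V = torch.randn(4, 16), torch.randn(10, 16), torch.randn(10, 32)
    print(att(K, Q, V).shape)    # torch.Size([4, 32])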
Figure 4.12: The Multi-head Attention layer applies
for each of its h = 1,...,H heads a parametrized lin-
ear transformation to individual elements of the input
sequences X Q ,X K ,X V to get sequences Q,K,V that
are processed by the attention operator to compute Yh .
These H sequences are concatenated along features,
and individual elements are passed through one last
linear operator to get the final result sequence Y .
parameters a number H of heads, and the shapes of three series of H trainable weight matrices:

• W^Q of size H×D×D_QK,
• W^K of size H×D×D_QK, and
• W^V of size H×D×D_V.

It takes as input three sequences

• X^Q of size N^Q×D,
• X^K of size N^KV×D, and
• X^V of size N^KV×D,
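PyTorch provides this layer as `torch.nn.MultiheadAttention`; a usage sketch, with that implementation's convention of a single dimension D shared by queries, keys, and values:

    import torch

    D, H = 64, 4
    mha = torch.nn.MultiheadAttention(embed_dim=D, num_heads=H, batch_first=True)

    xq = torch.randn(1, 5, D)       # X^Q: N^Q = 5 elements of dimension D
    xkv = torch.randn(1, 10, D)     # X^K = X^V: N^KV = 10 elements

    y, a = mha(xq, xkv, xkv)        # query, key, and value sequences
    print(y.shape)                  # torch.Size([1, 5, 64])
    print(a.shape)                  # torch.Size([1, 5, 10]), attention averaged over heads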
4.9 Token embedding
In many situations, we need to convert discrete
tokens into vectors. This can be done with an
embedding layer
which consists of a lookup table
that directly maps integers to vectors.
\[
\forall d_1, \dots, d_K, \quad Y[d_1, \dots, d_K] = M[X[d_1, \dots, d_K]].
\]
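In PyTorch (a minimal sketch):

    import torch

    vocab_size, D = 1000, 32
    embed = torch.nn.Embedding(vocab_size, D)       # lookup table M of size 1000 x 32

    tokens = torch.randint(0, vocab_size, (2, 7))   # integer token indices, any shape
    print(embed(tokens).shape)                      # torch.Size([2, 7, 32])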
4.10 Positional encoding
While the processing of a fully connected layer
is specific to both the positions of the features
in the input tensor, and to the position of the
resulting activation in the output tensor, convo-
lutional layers and multi-head attention layers
are oblivious to the absolute position in the ten-
sor. This is key to their strong invariance and
inductive bias, which is beneficial to deal with a
stationary signal.
D, Vaswani et al. [2017] add

\[
\text{pos-enc}[t,d] =
\begin{cases}
\sin\!\big(\tfrac{t}{T^{d/D}}\big) & \text{if } d \in 2\mathbb{N}, \\
\cos\!\big(\tfrac{t}{T^{(d-1)/D}}\big) & \text{otherwise,}
\end{cases}
\]

with T = 10^4.
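A direct implementation of this encoding (a sketch, for illustration):

    import torch

    def positional_encoding(t_max, D, T=1e4):
        t = torch.arange(t_max, dtype=torch.float32).unsqueeze(1)    # positions t
        d = torch.arange(D, dtype=torch.float32).unsqueeze(0)        # dimensions d
        pe = torch.empty(t_max, D)
        pe[:, 0::2] = torch.sin(t / T ** (d[:, 0::2] / D))           # even dimensions
        pe[:, 1::2] = torch.cos(t / T ** ((d[:, 1::2] - 1) / D))     # odd dimensions
        return pe

    print(positional_encoding(t_max=50, D=64).shape)   # torch.Size([50, 64])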
Chapter 5
Architectures
5.1 Multi-Layer Perceptrons
The simplest deep architecture is the
Multi-Layer Perceptron
(MLP), which takes the form
of a succession of fully connected layers sepa-
rated by activation functions. See an example
on Figure 5.1. For historical reasons, in such a
model, the number of hidden layers refers to the
number of linear layers, excluding the last one.
Figure 5.1: This multi-layer perceptron takes as input a one-dimensional tensor of size 50, and is composed of three fully connected layers with output dimensions of 25, 10, and 2 respectively, the first two followed by ReLU layers.
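The model of Figure 5.1 can be written in a few lines of PyTorch (a sketch, for illustration):

    import torch

    mlp = torch.nn.Sequential(
        torch.nn.Linear(50, 25), torch.nn.ReLU(),
        torch.nn.Linear(25, 10), torch.nn.ReLU(),
        torch.nn.Linear(10, 2),
    )

    x = torch.randn(16, 50)      # a batch of 16 input vectors of dimension 50
    print(mlp(x).shape)          # torch.Size([16, 2])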
any continuous function f can be approximated
arbitrarily well uniformly on a compact by a
model of the form l2 ◦σ◦l1 where l1 and l2 are
affine. Such a model is an MLP with a single hidden layer, and this result implies that it can approximate anything of practical value. However, this approximation holds only if the dimension of the first linear layer's output can be arbitrarily large.
5.2 Convolutional networks
The standard architecture for processing images
is a convolutional network, or convnet, that
combines multiple convolutional layers, either
to reduce the signal size before it can be pro-
cessed by fully connected layers, or to output a
2d signal also of large size.
LeNet-like
The original LeNet model for image classifica-
tion [LeCun et al., 1998] combines a series of 2d
convolutional layers and max pooling layers that
play the role of feature extractor, with a series of
fully connected layers which act like an MLP and perform the classification per se. See Figure 5.2
for an example.
Residual networks
Standard convolutional neural networks that fol-
low the architecture of the LeNet family are not
easily extended to deep architectures and suffer from the vanishing gradient problem.
[Figure 5.2 depicts, from input to output: X (1×28×28) → conv-2d k=5 → 32×24×24 → maxpool k=3 → 32×8×8 → relu → conv-2d k=5 → 64×4×4 → maxpool k=2 → 64×2×2 → relu → reshape → 256 → fully-conn → 200 → relu → fully-conn → 10 → P̂(Y); the convolutional part is the feature extractor, the fully connected part the classifier.]
Figure 5.2: Example of a small LeNet-like network for classifying 28×28 grayscale images of handwritten digits [LeCun et al., 1998]. Its first half is convolutional, and alternates convolutional layers per se and max pooling layers, reducing the signal dimension from 28×28 scalars to 256. Its second half processes this 256-dimensional feature vector through a one-hidden-layer perceptron to compute 10 logit scores corresponding to the ten possible digits.
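The network of Figure 5.2 can be expressed directly in PyTorch (a sketch, not the original implementation):

    import torch
    from torch import nn

    lenet = nn.Sequential(
        # Feature extractor: 1x28x28 -> 256
        nn.Conv2d(1, 32, kernel_size=5),      # 32 x 24 x 24
        nn.MaxPool2d(kernel_size=3),          # 32 x 8 x 8
        nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=5),     # 64 x 4 x 4
        nn.MaxPool2d(kernel_size=2),          # 64 x 2 x 2
        nn.ReLU(),
        nn.Flatten(),                         # 256
        # Classifier: one-hidden-layer perceptron
        nn.Linear(256, 200), nn.ReLU(),
        nn.Linear(200, 10),                   # 10 logit scores
    )

    x = torch.randn(16, 1, 28, 28)
    print(lenet(x).shape)                     # torch.Size([16, 10])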
[Figure 5.3 depicts, from input to output: X (C×H×W) → conv-2d k=1 → C/2×H×W → batchnorm → relu → conv-2d k=3 p=1 → batchnorm → relu → conv-2d k=1 → C×H×W → batchnorm → + X → relu → Y (C×H×W).]
Figure 5.3: A residual block.
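A sketch of this block in PyTorch (not the book's own code):

    import torch
    from torch import nn

    class ResBlock(nn.Module):
        def __init__(self, C):
            super().__init__()
            self.f = nn.Sequential(
                nn.Conv2d(C, C // 2, kernel_size=1), nn.BatchNorm2d(C // 2), nn.ReLU(),
                nn.Conv2d(C // 2, C // 2, kernel_size=3, padding=1), nn.BatchNorm2d(C // 2), nn.ReLU(),
                nn.Conv2d(C // 2, C, kernel_size=1), nn.BatchNorm2d(C),
            )

        def forward(self, x):
            return torch.relu(x + self.f(x))   # residual connection: add the input back, then ReLU

    x = torch.randn(2, 64, 16, 16)
    print(ResBlock(64)(x).shape)               # torch.Size([2, 64, 16, 16])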
[Figure 5.4 depicts a residual branch conv-2d k=1 → batchnorm → relu → conv-2d k=3 s=S p=1 → batchnorm → relu → conv-2d k=1 → batchnorm, and a parallel skip branch conv-2d k=1 s=S → batchnorm, whose outputs are summed and passed through a relu, mapping X of size C×H×W to Y of size 4C/S × H/S × W/S.]
Figure 5.4: A downscaling residual block. It admits a
meta-parameter S, the stride of the first convolution
layer, which modulates the reduction of the tensor size.
[Figure 5.5 depicts, from input to output: X (3×224×224) → conv-2d k=7 s=2 p=3 → 64×112×112 → batchnorm → relu → maxpool k=3 s=2 p=1 → 64×56×56 → dresblock S=1 → 256×56×56 → resblock ×2 → dresblock S=2 → 512×28×28 → resblock ×3 → dresblock S=2 → 1024×14×14 → resblock ×5 → dresblock S=2 → 2048×7×7 → resblock ×2 → avgpool k=7 → 2048×1×1 → reshape → 2048 → fully-conn → 1000 → P̂(Y).]
Figure 5.5: Structure of the ResNet-50 [He et al., 2015].
tation. However the parameter count of a con-
volutional layer, and its computational cost, are
quadratic with the number of channels. This
residual block mitigates this problem by first re-
ducing the number of channels with a 1×1 con-
volution, then operating spatially with a 3×3
convolution on this reduced number of chan-
nels, and then up-scaling the number of chan-
nels, again with a 1×1 convolution.
blocks. Surprisingly, in the first section, there
is no downscaling, only an increase of the num-
ber of channels by a factor of 4. The output of
the last residual block is 2048×7×7, which is
converted to a vector of dimension 2048 by an
average pooling of kernel size 7×7, and then
processed through a fully connected layer to get
the final logits, here for 1000 classes.
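For experiments, an implementation of this architecture is available in torchvision; a usage sketch (the `weights` argument is that library's API in recent versions, not something defined in this book):

    import torch
    import torchvision

    # ResNet-50 as in Figure 5.5, with randomly initialized parameters
    model = torchvision.models.resnet50(weights=None)

    x = torch.randn(1, 3, 224, 224)
    print(model(x).shape)          # torch.Size([1, 1000]), one logit per class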
5.3 Attention models
As stated in § 4.8, many applications, in partic-
ular from natural language processing, greatly
benefit from models that include attention mech-
anisms. The architecture of choice for such tasks,
which has been instrumental in recent advances
in deep learning, is the Transformer proposed
by Vaswani et al. [2017].
Transformer
The original Transformer, pictured on Figure 5.7,
was designed for sequence-to-sequence trans-
lation. It combines an encoder that processes
the input sequence to get a refined representa-
tion, and an autoregressive decoder that gener-
ates each token of the result sequence, given the
encoder’s representation of the input sequence,
and the output tokens generated so far. As with the residual convolutional networks of § 5.2, both the encoder and the decoder of the Transformer are sequences of compounded blocks built with residual connections.
[Figure 5.6 depicts, from input to output: layernorm → mha (Q, K, V) → + → layernorm → fully-conn → gelu → fully-conn → dropout → +, where the self-attention block (left) computes Q, K, and V from a single input sequence X^QKV, and the cross-attention block (right) computes Q from X^Q and K, V from X^KV.]
Figure 5.6: Self-attention block (left) and
cross-attention block
(right). These specific structures proposed by
Radford et al. [2018] differ slightly from the original
architecture of Vaswani et al. [2017], in particular by
having the layer normalization first in the residual
blocks.
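A sketch of the self-attention block of Figure 5.6 in PyTorch (for illustration; the 4D hidden width of the MLP is a common choice, not specified in the figure):

    import torch
    from torch import nn

    class SelfAttentionBlock(nn.Module):
        def __init__(self, D, H, dropout=0.1):
            super().__init__()
            self.ln1 = nn.LayerNorm(D)
            self.mha = nn.MultiheadAttention(D, H, batch_first=True)
            self.ln2 = nn.LayerNorm(D)
            self.mlp = nn.Sequential(
                nn.Linear(D, 4 * D), nn.GELU(), nn.Linear(4 * D, D), nn.Dropout(dropout),
            )

        def forward(self, x):
            a = self.ln1(x)                                    # layer normalization first (pre-norm)
            x = x + self.mha(a, a, a, need_weights=False)[0]   # residual attention sub-block
            x = x + self.mlp(self.ln2(x))                      # residual MLP sub-block
            return x

    x = torch.randn(2, 10, 64)                   # a batch of 2 sequences of 10 tokens, D=64
    print(SelfAttentionBlock(64, 4)(x).shape)    # torch.Size([2, 10, 64])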
[Figure 5.7 depicts the original encoder-decoder Transformer: the encoder embeds the input tokens X_1,...,X_T, adds a positional encoding, and applies N self-attention blocks to produce the representation Z_1,...,Z_T of size T×D; the decoder embeds the shifted result tokens 0, Y_1,...,Y_{S−1}, adds a positional encoding, applies N blocks alternating causal self-attention and cross-attention (queries from the decoder, keys and values from Z), and a final fully connected layer produces the S×V logits of P̂(Y_1),...,P̂(Y_S | Y_{s<S}).]
The cross-attention blocks take as input two sequences, one to compute the queries, and one the keys and values.
[Figure: a GPT-like autoregressive decoder: the shifted tokens 0, X_1,...,X_{T−1} are embedded, a positional encoding is added, N causal self-attention blocks are applied, and a final fully connected layer produces the T×V logits of P̂(X_1),...,P̂(X_T | X_{t<T}).]
[Figure 5.9 depicts the Vision Transformer: the image is cut into M patches X_1,...,X_M of 3P² values each, which are mapped linearly by W^E to embeddings E_1,...,E_M; a learned token E_0 is prepended, a positional encoding is added, N self-attention blocks produce Z_0, Z_1,...,Z_M, and an MLP readout maps Z_0 to the C class logits P̂(Y).]
Vision Transformer
Transformers have been put to use for image
classification with the Vision Transformer (ViT)
model [Dosovitskiy et al., 2020], see Figure 5.9.
Part III
Applications
Chapter 6
Prediction
6.1 Image denoising
A direct application of deep models to image pro-
cessing is to recover from degradation by using
the redundancy in the statistical structure of im-
ages. The petals of a sunflower on a grayscale
picture can be colored with high confidence, and
the texture of a geometric shape such as a table
on a low-light grainy picture can be corrected
by averaging it over a large area likely to be
uniform.
6.2 Image classification
Image classification is the simplest strategy to
extract semantics from an image, and consists
of predicting a class among a finite predefined
number of classes, given an input image.
6.3 Object detection
A more complex task for image understanding
is object detection, in which case the objective
is, given an input image, to predict the classes
and positions of objects of interest.
Figure 6.2: Examples of object detection with the Single-
Shot Detector [Liu et al., 2015].
it. This results in a non-ambiguous matching of any bounding box (x_1, x_2, y_1, y_2) to a (s, h, w), determined respectively by max(x_2 − x_1, y_2 − y_1), (y_1 + y_2)/2, and (x_1 + x_2)/2.
regression of geometric quantities.
6.4 Semantic segmentation
The finest grain prediction task for image under-
standing is semantic segmentation, which con-
sists of predicting for every pixel the class of the
object it belongs to. This can be achieved with a standard convolutional neural network that outputs a convolutional map with as many channels as classes, carrying the estimated logits for every pixel.
Figure 6.3: Semantic segmentation results with the
Pyramid Scene Parsing Network [Zhao et al., 2016].
backbone, concatenate the resulting multi-scale
representation after upscaling, before making
the final per-pixel prediction [Zhao et al., 2016].
6.5 Speech recognition
Speech recognition consists of converting a
sound sample into a sequence of words. There
have been plenty of approaches to this problem
historically, but a conceptually simple and re-
cent one consists of casting it as a sequence-to-
sequence translation and then solving it with a
standard attention-based Transformer.
6.6 Text-image representations
A powerful approach to image understanding
consists of learning consistent image and text
representations.
resulting in an N×N matrix of similarity scores.
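A rough sketch of this computation (for illustration; the symmetric cross-entropy below is the usual contrastive formulation, and the temperature scaling used in practice is omitted):

    import torch
    import torch.nn.functional as F

    N, D = 8, 512
    image_emb = torch.randn(N, D)     # embeddings of N images
    text_emb = torch.randn(N, D)      # embeddings of their N captions

    # Cosine similarities between every image and every caption: an N x N matrix
    sim = F.normalize(image_emb, dim=1) @ F.normalize(text_emb, dim=1).t()

    # Contrastive loss: each matching pair (the diagonal) should dominate its row and column
    targets = torch.arange(N)
    loss = F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)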
Figure 6.4: The CLIP text-image embedding [Radford et al., 2021] allows zero-shot prediction by predicting which class description embedding is the most consistent with the image embedding.
Chapter 7
Synthesis
7.1 Text generation
The standard approach to text synthesis is to use an attention-based autoregressive model. The most successful in this domain is the GPT [Radford et al., 2018], which we described in § 5.3.
7.2 Image generation
Multiple deep methods have been developed to model and sample from a high-dimensional density. A powerful one for image synthesis relies on inverting a diffusion process.
importance of x_0, and x_t's density can rapidly be approximated with a normal.
which are modulated dynamically.
discriminator’s loss. It can be shown that at the
equilibrium the generator produces samples in-
distinguishable from real data. In practice, when
the gradient flows through the discriminator to
the generator, it informs the latter about the cues
that the discriminator uses, that should be fixed.
• Many applications require processing signals that are not organized regularly on a grid. For instance, molecules, proteins, 3D meshes, or geographic locations are more naturally structured as graphs. Standard convolutional networks or even attention models are poorly adapted to process such data, and the tool of choice for such a task are Graph Neural Networks (GNN, [Scarselli et al., 2009]). These models are composed of layers that compute activations at each vertex by combining linearly the activations located at its immediate neighboring vertices. This operation is very similar to a standard convolution, except that the data structure does not reflect any geometrical information associated with the feature vectors they carry. A sketch of such a layer is given below.
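For illustration, a minimal version of such a layer, combining each vertex's activations with the average of its immediate neighbors' through an adjacency matrix:

    import torch
    from torch import nn

    class GraphLayer(nn.Module):
        def __init__(self, d_in, d_out):
            super().__init__()
            self.w_self = nn.Linear(d_in, d_out)
            self.w_neigh = nn.Linear(d_in, d_out)

        def forward(self, x, adj):
            # x: (V, d_in) vertex activations, adj: (V, V) adjacency matrix
            deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
            neigh = (adj @ x) / deg                        # average over immediate neighbors
            return torch.relu(self.w_self(x) + self.w_neigh(neigh))

    adj = torch.tensor([[0., 1., 1., 0.],
                        [1., 0., 1., 0.],
                        [1., 1., 0., 1.],
                        [0., 0., 1., 0.]])
    x = torch.randn(4, 8)
    print(GraphLayer(8, 16)(x, adj).shape)                 # torch.Size([4, 16])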
Afterword
Bibliography
D. Hendrycks and K. Gimpel. Gaussian Error Linear Units (GELUs). CoRR, abs/1606.08415, 2016. [pdf]. 61

D. Hendrycks, K. Zhao, S. Basart, et al. Natural Adversarial Examples. CoRR, abs/1907.07174, 2019. [pdf]. 117

J. Ho, A. Jain, and P. Abbeel. Denoising Diffusion Probabilistic Models. CoRR, abs/2006.11239, 2020. [pdf]. 121, 122, 123

S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997. [pdf]. 124

S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning (ICML), 2015. [pdf]. 67

D. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. CoRR, abs/1412.6980, 2014. [pdf]. 34

D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. CoRR, abs/1312.6114, 2013. [pdf]. 125

A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Neural Information Processing Systems (NIPS), 2012. [pdf]. 8, 87
V. Mnih, K. Kavukcuoglu, D. Silver, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015. [pdf]. 126
H. Zhao, J. Shi, X. Qi, et al. Pyramid Scene Parsing Network. CoRR, abs/1612.01105, 2016. [pdf]. 112, 113
Index
1d convolution, 55
2d convolution, 55
activation, 22, 37
activation function, 59, 85
activation map, 57
Adam, 34
artificial neural network, 8, 11
attention layer, 74
attention operator, 75
autoencoder, 125
autograd, 38
autoregressive model, 29, 120
average pooling, 62
backpropagation, 38
backward pass, 38
basis function regression, 14
batch, 20, 34
batch normalization, 67
bias vector, 50, 55
BPE, 115, 120
Byte Pair Encoding, 115
cache memory, 20
capacity, 15
causal, 30, 75
causal model, 29, 77, 97
channel, 22
classification, 17
CLIP, 116
CLS token, 100
computational cost, 39
contrastive loss, 26, 116
convnet, 87
convolutional layer, 53, 87
convolutional network, 87
cross-attention block, 80, 95
cross-entropy, 25
filter, 55
fine tuning, 126
flops, 21
forward pass, 37
FP32, 21
framework, 22
fully connected layer, 50, 85, 87
GAN, 125
GELU, 61
Generative Adversarial Networks, 125
generator, 125
GNN, 127
GPT, 98, 116, 120
GPU, 8, 19
gradient descent, 31, 33, 36
gradient step, 31
Graph Neural Network, 127
Graphical Processing Unit, 8, 19
ground truth, 17
hidden layer, 85
hidden state, 124
max pooling, 62
mean squared error, 14, 25
memory requirement, 39
memory speed, 20
meta parameter, 13, 41
metric learning, 26
MLP, 85, 95
model, 12
Multi-Head Attention layer, 94
multi-head attention layer, 77
multi-layer perceptron, 85
padding, 55, 62
parameter, 12
parametric model, 12
peak performance, 21
pooling, 62
positional encoding, 82, 97
posterior probability, 25
pre-trained model, 109, 113
query, 75
random initialization, 51
receptive field, 56, 57, 106
rectified linear unit, 59, 124
recurrent neural network, 124
regression, 17
reinforcement learning, 126
ReLU, 59
residual block, 90
residual connection, 71, 89
residual network, 71, 89
resnet, 71, 89
ResNet-50, 89, 116
RL, 126
RNN, 124
tanh, 60
tensor, 22
tensor cores, 20
Tensor Processing Units, 20
test set, 41
text synthesis, 120
tokenizer, 28, 115, 120
tokens, 28
TPU, 20
trainable parameter, 12
training, 12
training set, 12, 24, 41, 44
Transformer, 71, 75, 94, 96, 114
transposed convolution, 58
under-fitting, 15
universal approximation theorem, 85
unsupervised learning, 18
VAE, 125
validation set, 41
value, 75
vanishing gradient, 40, 47
variational autoencoder, 125
Vision Transformer, 100
ViT, 100
vocabulary, 28
weight, 13
weight decay, 26
weight matrix, 50
This book is licensed under the Creative Com-
mons BY-NC-SA 4.0 International License.