Homework For The Course "Advanced Learning Models": 1 Neural Networks
Anja PANTOVIC
master MSIAM DS
anja.pantovic@grenoble-inp.org
Predrag PILIPOVIC
master MSIAM DS
predrag.pilipovic@grenoble-inp.org
1 Neural Networks
Let X = (x_{ij})_{ij}, i, j ∈ {1, ..., 5} denote the input of a convolutional layer with no bias. Let W = (w_{ij})_{ij}, i, j ∈ {1, ..., 3} denote the weights of the convolutional filter. Let Y = (y_{ij})_{ij}, i ∈ {1, ..., I}, j ∈ {1, ..., J} denote the output of the convolution operation.
2. Let us suppose that we are in situation 1.(b) (i.e. stride 2 and no padding). Let us
also assume that the output of the convolution goes through a ReLU activation, whose
output is denoted by Z = (z_{ij})_{ij}, i ∈ {1, ..., I}, j ∈ {1, ..., J}:
(a) Derive the expression of the output pixels zij as a function of the input and the
weights.
We have seen that I = J = 2 in our case. Let us see what happens when we apply the filter at the top-left corner of the input, so we have

y_{11} = w_{11} x_{11} + w_{12} x_{12} + w_{13} x_{13}
       + w_{21} x_{21} + w_{22} x_{22} + w_{23} x_{23}
       + w_{31} x_{31} + w_{32} x_{32} + w_{33} x_{33}
     = ∑_{i=1}^{3} ∑_{j=1}^{3} w_{ij} x_{ij}.
If we do the same thing for the remaining y_{ij}, we get, for instance,

y_{12} = w_{11} x_{13} + w_{12} x_{14} + w_{13} x_{15}
       + w_{21} x_{23} + w_{22} x_{24} + w_{23} x_{25}
       + w_{31} x_{33} + w_{32} x_{34} + w_{33} x_{35}
     = ∑_{i=1}^{3} ∑_{j=1}^{3} w_{ij} x_{i,j+2},

and in general, for l, k ∈ {1, 2},

y_{lk} = ∑_{i=1}^{3} ∑_{j=1}^{3} w_{ij} x_{i+2(l−1), j+2(k−1)}.
Finally, we know that z_{lk} = σ(y_{lk}), where σ is the ReLU activation function, more precisely σ(x) = max{0, x}.
(b) How many multiplications and additions are needed to compute the output (the
forward pass)?
As we saw in the first part of the question, computing each y_{lk} requires 9 multiplications and 8 additions. Since the output of the convolution has 4 cells, the forward pass needs 4 · 9 = 36 multiplications and 4 · 8 = 32 additions, i.e. 4 · (9 + 8) = 68 operations in total.
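To make the count concrete, here is a minimal NumPy sketch of this forward pass (the function name conv_forward and the random test data are our own choices, not part of the exercise); it reproduces the count of 36 multiplications and 32 additions.

```python
import numpy as np

def conv_forward(X, W, stride=2):
    """Illustrative sketch (our own helper): valid (no-padding) convolution
    followed by ReLU, counting multiplications and additions."""
    I = (X.shape[0] - W.shape[0]) // stride + 1   # output height, here 2
    J = (X.shape[1] - W.shape[1]) // stride + 1   # output width, here 2
    Y = np.zeros((I, J))
    mults = adds = 0
    for l in range(I):
        for k in range(J):
            patch = X[stride * l:stride * l + W.shape[0],
                      stride * k:stride * k + W.shape[1]]
            Y[l, k] = np.sum(W * patch)   # 9 multiplications and 8 additions per cell
            mults += W.size               # 9
            adds += W.size - 1            # 8
    Z = np.maximum(Y, 0.0)                # ReLU
    return Z, mults, adds

rng = np.random.default_rng(0)            # random test data, only for illustration
Z, mults, adds = conv_forward(rng.standard_normal((5, 5)), rng.standard_normal((3, 3)))
print(Z.shape, mults, adds)               # (2, 2) 36 32
```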
3. Assume now that we are provided with the derivative of the loss w.r.t. the output of the convolution layer, ∂L/∂z_{ij}, ∀i ∈ {1, ..., I}, j ∈ {1, ..., J}:
(a) Derive the expression of ∂L/∂x_{ij}, ∀i, j ∈ {1, ..., 5}.
We use the chain rule:

∂L/∂x_{ij} = ∑_{l=1}^{2} ∑_{k=1}^{2} ∂L/∂z_{lk} · ∂z_{lk}/∂y_{lk} · ∂y_{lk}/∂x_{ij}.
We assumed that ∂L/∂z_{lk} is known, so we only need to compute the last two partial derivatives. We can easily see that

∂z_{lk}/∂y_{lk} = 1 if y_{lk} > 0, and 0 if y_{lk} < 0

(the derivative is undefined at y_{lk} = 0; in practice it is taken to be 0 there).
For the last partial derivative, we re-index the two sums in the formula for y_{lk}:

y_{lk} = ∑_{i=2l−1}^{2l+1} ∑_{j=2k−1}^{2k+1} w_{i−2(l−1), j−2(k−1)} x_{ij}.

Hence ∂y_{lk}/∂x_{ij} = w_{i−2(l−1), j−2(k−1)} whenever 2l − 1 ≤ i ≤ 2l + 1 and 2k − 1 ≤ j ≤ 2k + 1, and 0 otherwise.
(b) Derive the expression of ∂L/∂w_{ij}, ∀i, j ∈ {1, ..., 3}.
Similarly, we have

∂L/∂w_{ij} = ∑_{l=1}^{2} ∑_{k=1}^{2} ∂L/∂z_{lk} · ∂z_{lk}/∂y_{lk} · ∂y_{lk}/∂w_{ij} = ∑_{l=1}^{2} ∑_{k=1}^{2} ∂L/∂z_{lk} · ∂z_{lk}/∂y_{lk} · x_{i+2(l−1), j+2(k−1)}.
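For illustration, here is a short NumPy sketch of the corresponding backward pass (the helper conv_backward and its interface are our own): it recomputes Y, applies the ReLU mask, scatters ∂L/∂y_{lk} · W into ∂L/∂X and accumulates the input patches into ∂L/∂W, exactly as in the two formulas above.

```python
import numpy as np

def conv_backward(X, W, dL_dZ, stride=2):
    """Illustrative sketch (our own helper): backward pass of the stride-2,
    no-padding convolution + ReLU, given dL/dZ for the 2x2 output."""
    I, J = dL_dZ.shape
    h, w = W.shape
    Y = np.zeros((I, J))
    for l in range(I):
        for k in range(J):
            Y[l, k] = np.sum(W * X[stride*l:stride*l+h, stride*k:stride*k+w])
    dL_dY = dL_dZ * (Y > 0)               # ReLU derivative (taken as 0 at Y == 0)
    dL_dX = np.zeros_like(X)
    dL_dW = np.zeros_like(W)
    for l in range(I):
        for k in range(J):
            # dL/dx_{ij} += dL/dy_{lk} * w_{i-2(l-1), j-2(k-1)} on the (l, k)-th patch
            dL_dX[stride*l:stride*l+h, stride*k:stride*k+w] += dL_dY[l, k] * W
            # dL/dw_{ij} += dL/dy_{lk} * x_{i+2(l-1), j+2(k-1)}
            dL_dW += dL_dY[l, k] * X[stride*l:stride*l+h, stride*k:stride*k+w]
    return dL_dX, dL_dW
```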
Let us now consider a fully connected layer with two input and two output neurons, without bias and with a sigmoid activation. Let x_i, i = 1, 2 denote the inputs, and z_j, j = 1, 2 the outputs. Let w_{ij} denote the weight connecting input i to output j. Let us also assume that the gradient of the loss at the output, ∂L/∂z_j, j = 1, 2, is provided.
(b) ∂L/∂w_{ij}:
Similarly, we have

∂L/∂w_{ij} = ∂L/∂z_j · ∂z_j/∂w_{ij} = ∂L/∂z_j · x_i exp(−(w_{1j} x_1 + w_{2j} x_2)) / (1 + exp(−(w_{1j} x_1 + w_{2j} x_2)))²,

but this time without the sum, because z_j is the only output that depends on w_{ij}.
(c) ∂²L/∂w_{ij}²:
Having in mind that ∂L/∂z_j is a function of w_{ij}, we have

∂²L/∂w_{ij}² = ∂/∂w_{ij} ( ∂L/∂z_j · ∂z_j/∂w_{ij} ) = ∂²L/(∂z_j ∂w_{ij}) · ∂z_j/∂w_{ij} + ∂L/∂z_j · ∂²z_j/∂w_{ij}².
The only thing left to compute is ∂²z_j/∂w_{ij}², because we assumed that ∂L/∂z_j is known, and hence ∂²L/(∂z_j ∂w_{ij}) is known as well. We will need the second derivative of σ:

σ''(x) = 2 exp(−2x) / (1 + exp(−x))³ − exp(−x) / (1 + exp(−x))².
Finally, we have

∂²z_j/∂w_{ij}² = ∂/∂w_{ij} [ x_i exp(−(w_{1j} x_1 + w_{2j} x_2)) / (1 + exp(−(w_{1j} x_1 + w_{2j} x_2)))² ] = x_i² · σ''(w_{1j} x_1 + w_{2j} x_2).
(d) ∂²L/(∂w_{ij} ∂w_{i'j'}), i ≠ i', j ≠ j': Again, from the chain rule we have

∂²L/(∂w_{ij} ∂w_{i'j'}) = ∂/∂w_{i'j'} ( ∂L/∂z_j · ∂z_j/∂w_{ij} ) = ∂²L/(∂z_j ∂w_{i'j'}) · ∂z_j/∂w_{ij} + ∂L/∂z_j · ∂²z_j/(∂w_{ij} ∂w_{i'j'}).

But now we can see from the previous computations that the last term is always zero for j ≠ j', because ∂z_j/∂w_{ij} does not depend on w_{i'j'}, so only the first term remains.
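These formulas can be sanity-checked numerically with finite differences. The sketch below is our own illustration: it chooses the concrete loss L = (1/2)||z − target||² only so that ∂L/∂z_j is available in closed form (the exercise itself merely assumes this gradient is given) and compares the analytic gradient from (b) with a central difference; the Hessian entries from (c) and (d) can be checked the same way by differencing the gradient.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def d_sigmoid(u):
    s = sigmoid(u)
    return s * (1.0 - s)                         # = exp(-u) / (1 + exp(-u))**2

rng = np.random.default_rng(1)                   # our own test data
x = rng.standard_normal(2)
W = rng.standard_normal((2, 2))                  # W[i, j] connects input i to output j
target = rng.standard_normal(2)

def forward(W):
    return sigmoid(x @ W)                        # z_j = sigma(w_{1j} x_1 + w_{2j} x_2)

def loss(W):                                     # our own choice of loss, only for the check
    return 0.5 * np.sum((forward(W) - target) ** 2)

# Analytic gradient from (b): dL/dw_{ij} = dL/dz_j * x_i * sigma'(w_{1j} x_1 + w_{2j} x_2)
z, pre = forward(W), x @ W
grad = np.outer(x, (z - target) * d_sigmoid(pre))

# Central finite differences
eps, num = 1e-6, np.zeros_like(W)
for i in range(2):
    for j in range(2):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)
print(np.max(np.abs(grad - num)))                # ~1e-10: the formula checks out
```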
(e) The elements in (c) and (d) are the entries of the Hessian matrix of L w.r.t. the weight vector. Imagine now that storing the weights of a network requires 40 MB of disk space: how much would it require to store the gradient? And the Hessian?
If storing the weights of the network requires 40 MB and we have 4 weights, then each stored number takes 10 MB. The gradient also has 4 elements, so it requires 40 MB as well. Since the Hessian is a symmetric matrix, we only need to store its upper triangle, i.e. n(n + 1)/2 numbers for an n × n matrix. In our case n = 4, so we need 10 numbers, or 100 MB.
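A short check of this arithmetic (pure illustration):

```python
n_weights = 4
per_number_MB = 40 / n_weights                        # 10 MB per stored number
gradient_MB = n_weights * per_number_MB               # 40 MB
hessian_entries = n_weights * (n_weights + 1) // 2    # upper triangle: 10 entries
hessian_MB = hessian_entries * per_number_MB          # 100 MB
print(gradient_MB, hessian_MB)                        # 40.0 100.0
```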
A kernel k is conditionally positive definite (c.p.d.) if it is symmetric and ∑_{i,j=1}^{n} a_i a_j k(x_i, x_j) ≥ 0 for any n ∈ N, (x_1, x_2, ..., x_n) ∈ X^n and (a_1, a_2, ..., a_n) ∈ R^n with ∑_{i=1}^{n} a_i = 0.
Since the defining inequality of a positive definite function holds for any (a_1, ..., a_n) ∈ R^n and any n ∈ N, it holds in particular when ∑_{i=1}^{n} a_i = 0. Hence, any positive definite function is conditionally positive definite.
2. Is a constant function p.d.? Is it c.p.d.?
Let k be a constant function

k : X × X → R,  (x_i, x_j) ↦ c.

Since k(x_i, x_j) = k(x_j, x_i) = c for all (x_i, x_j) ∈ X × X, the symmetry holds.
Let n ∈ N, (x_1, ..., x_n) ∈ X^n and (a_1, ..., a_n) ∈ R^n. Then

∑_{i=1}^{n} ∑_{j=1}^{n} a_i a_j k(x_i, x_j) = c ∑_{i=1}^{n} ∑_{j=1}^{n} a_i a_j = c ( ∑_{i=1}^{n} a_i )²,

which is nonnegative for every choice of the a_i if and only if c ≥ 0. So k is positive definite if and only if c ≥ 0. We already know from the first question that k is c.p.d. when c ≥ 0. Let us check whether k is c.p.d. for an arbitrary c.
Let n ∈ N, (x_1, ..., x_n) ∈ X^n and (a_1, ..., a_n) ∈ R^n such that ∑_{i=1}^{n} a_i = 0. We have

∑_{i=1}^{n} ∑_{j=1}^{n} a_i a_j k(x_i, x_j) = c · ( ∑_{i=1}^{n} a_i )² = 0 ≥ 0,

so the constant kernel k is c.p.d. for any c ∈ R.
For the kernel k(x, y) = −||x − y||² (the kernel from question 3), we have

k(x, y) = −||x − y||² = −||x||² − ||y||² + 2⟨x, y⟩ = −||y − x||² = k(y, x),

hence k is symmetric. Let n ∈ N, (x_1, ..., x_n) ∈ X^n and (a_1, ..., a_n) ∈ R^n such that ∑_{i=1}^{n} a_i = 0. Then
∑_{i,j=1}^{n} a_i a_j k(x_i, x_j) = − ∑_{i,j=1}^{n} a_i a_j ||x_i||² − ∑_{i,j=1}^{n} a_i a_j ||x_j||² + 2 ∑_{i,j=1}^{n} a_i a_j ⟨x_i, x_j⟩
= − ( ∑_{j=1}^{n} a_j ) ( ∑_{i=1}^{n} a_i ||x_i||² ) − ( ∑_{i=1}^{n} a_i ) ( ∑_{j=1}^{n} a_j ||x_j||² ) + 2 ∑_{i,j=1}^{n} a_i a_j ⟨x_i, x_j⟩
= 2 ∑_{i,j=1}^{n} a_i a_j ⟨x_i, x_j⟩ = 2 || ∑_{i=1}^{n} a_i x_i ||² ≥ 0,

where the first two terms vanish because ∑_{j=1}^{n} a_j = ∑_{i=1}^{n} a_i = 0.
Thus, k is a conditionally positive definite function. Intuitively, k is not a positive definite function; to prove this, one counterexample suffices. Take n = 2, x_1 ≠ x_2 and a_1 = a_2 = 1. Since k(x_1, x_1) = k(x_2, x_2) = 0,

∑_{i,j=1}^{2} a_i a_j k(x_i, x_j) = a_1² k(x_1, x_1) + 2 a_1 a_2 k(x_1, x_2) + a_2² k(x_2, x_2) = −2 a_1 a_2 ||x_1 − x_2||² = −2 ||x_1 − x_2||² < 0,

so k is not p.d.
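A quick numerical illustration of both facts, with our own random test points (a sketch, not part of the exercise): the Gram matrix of k has a negative eigenvalue, while a^T [k] a stays nonnegative whenever the weights sum to zero.

```python
import numpy as np

rng = np.random.default_rng(2)                        # our own random test points
X = rng.standard_normal((6, 3))                       # 6 points in R^3
K = -np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # k(x, y) = -||x - y||^2

# Not p.d.: the Gram matrix has a negative eigenvalue (its trace is 0).
print(np.min(np.linalg.eigvalsh(K)))                  # < 0

# C.p.d.: a^T K a >= 0 for random weight vectors with sum(a) = 0.
for _ in range(1000):
    a = rng.standard_normal(6)
    a -= a.mean()                                     # enforce sum(a) = 0
    assert a @ K @ a >= -1e-9
```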
Let x_0 ∈ X and define k̃(x, y) := k(x, y) − k(x_0, x) − k(x_0, y) + k(x_0, x_0) (question 4); we show that k̃ is p.d. if and only if k is c.p.d.

(⇒) Suppose k̃ is positive definite, i.e. for all n ∈ N, (a_1, ..., a_n) ∈ R^n and (x_1, ..., x_n) ∈ X^n, we have

∑_{i,j=1}^{n} a_i a_j k̃(x_i, x_j) ≥ 0.

Let us fix n ∈ N and choose (a_1, ..., a_n) ∈ R^n such that ∑_{i=1}^{n} a_i = 0. For any (x_1, ..., x_n) ∈ X^n we have
∑_{i,j=1}^{n} a_i a_j k̃(x_i, x_j) = ∑_{i,j=1}^{n} a_i a_j ( k(x_i, x_j) − k(x_0, x_i) − k(x_0, x_j) + k(x_0, x_0) )
= ∑_{i,j=1}^{n} a_i a_j k(x_i, x_j) − ( ∑_{j=1}^{n} a_j ) ( ∑_{i=1}^{n} a_i k(x_0, x_i) ) − ( ∑_{i=1}^{n} a_i ) ( ∑_{j=1}^{n} a_j k(x_0, x_j) ) + k(x_0, x_0) ( ∑_{i=1}^{n} a_i )²
= ∑_{i,j=1}^{n} a_i a_j k(x_i, x_j),

since ∑_{i=1}^{n} a_i = 0. Hence ∑_{i,j=1}^{n} a_i a_j k(x_i, x_j) = ∑_{i,j=1}^{n} a_i a_j k̃(x_i, x_j) ≥ 0, which shows that k is c.p.d.
(⇐) Suppose now that k is c.p.d., and let n ∈ N, (x_1, ..., x_n) ∈ X^n and (a_1, ..., a_n) ∈ R^n be arbitrary. To use the assumption, we introduce a_0 := − ∑_{i=1}^{n} a_i, so that ∑_{i=0}^{n} a_i = 0. Now we have

∑_{i,j=1}^{n} a_i a_j k̃(x_i, x_j) = ∑_{i,j=1}^{n} a_i a_j k(x_i, x_j) − ∑_{i,j=1}^{n} a_i a_j k(x_0, x_j) − ∑_{i,j=1}^{n} a_i a_j k(x_0, x_i) + ∑_{i,j=1}^{n} a_i a_j k(x_0, x_0)
= ∑_{i,j=1}^{n} a_i a_j k(x_i, x_j) − ( ∑_{i=1}^{n} a_i ) ∑_{j=1}^{n} a_j k(x_0, x_j) − ( ∑_{j=1}^{n} a_j ) ∑_{i=1}^{n} a_i k(x_i, x_0) + ( ∑_{i=1}^{n} a_i )² k(x_0, x_0)
= ∑_{i,j=1}^{n} a_i a_j k(x_i, x_j) + a_0 ∑_{j=1}^{n} a_j k(x_0, x_j) + a_0 ∑_{i=1}^{n} a_i k(x_i, x_0) + a_0² k(x_0, x_0)
= ∑_{i,j=0}^{n} a_i a_j k(x_i, x_j) ≥ 0,

where the inequality follows from the conditional positive definiteness of k applied to (a_0, a_1, ..., a_n) and (x_0, x_1, ..., x_n), since ∑_{i=0}^{n} a_i = 0. Hence k̃ is p.d.
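A small numerical illustration of the (⇐) direction, using the c.p.d. kernel from question 3 and our own random points: the Gram matrix of k̃ has no (significantly) negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(3)                 # our own random test points
pts = rng.standard_normal((7, 2))

def k(x, y):                                   # c.p.d. kernel from question 3
    return -np.sum((x - y) ** 2)

x0 = pts[0]                                    # the fixed point x_0
def k_tilde(x, y):                             # the construction from this question
    return k(x, y) - k(x0, x) - k(x0, y) + k(x0, x0)

K_tilde = np.array([[k_tilde(x, y) for y in pts] for x in pts])
print(np.min(np.linalg.eigvalsh(K_tilde)))     # >= -1e-9: the Gram matrix of k_tilde is psd
```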
Let k be a c.p.d. kernel on X such that k(x, x) = 0 for any x ∈ X. From the previous question, we know how to construct a positive definite kernel from k. Let

k̃(x, y) := (1/2) ( k(x, y) − k(x_0, x) − k(x_0, y) + k(x_0, x_0) ),

where x_0 ∈ X is fixed. Then k̃ is p.d. (a nonnegative multiple of a p.d. kernel is p.d.), and hence we can use the Aronszajn theorem, which says that there exists a Hilbert space H and a mapping Φ : X → H such that, for any x, y ∈ X,

k̃(x, y) = ⟨Φ(x), Φ(y)⟩_H.

The only thing left to prove is

k(x, y) = −||Φ(x) − Φ(y)||².
For this part, we will use the assumption k(x, x) = 0 for any x ∈ X. We have

||Φ(x) − Φ(y)||² = ||Φ(x)||² − 2⟨Φ(x), Φ(y)⟩_H + ||Φ(y)||²
= k̃(x, x) − 2 k̃(x, y) + k̃(y, y)
= (1/2) ( k(x, x) − 2 k(x_0, x) + k(x_0, x_0) )
  − ( k(x, y) − k(x_0, x) − k(x_0, y) + k(x_0, x_0) )
  + (1/2) ( k(y, y) − 2 k(x_0, y) + k(x_0, x_0) )
= −k(x, y),

where we used k(x, x) = k(y, y) = 0 and the fact that the k(x_0, x), k(x_0, y) and k(x_0, x_0) terms cancel.
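In the finite-dimensional case the feature map can be made explicit by an eigendecomposition of the Gram matrix of k̃, which gives a quick numerical check of the identity (a sketch with our own random points; the (1/2)-scaled construction is the one used above):

```python
import numpy as np

rng = np.random.default_rng(4)                 # our own random test points
pts = rng.standard_normal((6, 2))

def k(x, y):                                   # c.p.d. with k(x, x) = 0
    return -np.sum((x - y) ** 2)

K = np.array([[k(x, y) for y in pts] for x in pts])
K_tilde = 0.5 * (K - K[[0], :] - K[:, [0]] + K[0, 0])   # x_0 = pts[0], the (1/2)-scaled k_tilde

# Finite-dimensional stand-in for the Aronszajn feature map: K_tilde = Phi @ Phi.T
w, V = np.linalg.eigh(K_tilde)
Phi = V * np.sqrt(np.clip(w, 0.0, None))       # rows are Phi(x_i)

D2 = np.sum((Phi[:, None, :] - Phi[None, :, :]) ** 2, axis=-1)
print(np.max(np.abs(K + D2)))                  # ~0: k(x, y) = -||Phi(x) - Phi(y)||^2
```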
6. Show that if k is c.p.d., then the function exp(tk(x, y)) is p.d. for all t ≥ 0.
Firstly, we need to show that the product of two p.d. functions is also a p.d. function. Let k_1 and k_2 be two p.d. functions, and let [k_1], [k_2] be their (positive semidefinite) similarity matrices on points x_1, ..., x_n. Since [k_2] is symmetric positive semidefinite, it has a symmetric positive semidefinite square root S, i.e. [k_2] = S², more precisely

[k_2]_{ij} = ∑_{l=1}^{n} S_{il} S_{lj} = ∑_{l=1}^{n} S_{il} S_{jl}.

Therefore, for any (a_1, ..., a_n) ∈ R^n,

∑_{i,j=1}^{n} a_i a_j [k_1]_{ij} [k_2]_{ij} = ∑_{l=1}^{n} ∑_{i,j=1}^{n} (a_i S_{il}) (a_j S_{jl}) [k_1]_{ij} ≥ 0.

We used that each inner sum is nonnegative, by applying the positive definiteness of k_1 to the weights (ã_1, ã_2, ..., ã_n) ∈ R^n with ã_i := a_i S_{il}. So we proved that the similarity matrix [k], defined by [k]_{ij} = [k_1]_{ij} [k_2]_{ij}, is positive semidefinite. The symmetry of the kernel k is an immediate consequence of the symmetry of k_1 and k_2. This means we proved that the product kernel k is indeed a p.d. kernel.
We also need to prove that if a given sequence of p.d. kernels {k_n}_{n∈N} converges pointwise to k, i.e. for all x, y ∈ X,

lim_{n→∞} k_n(x, y) = k(x, y),
then k is a p.d. kernel. First of all, by uniqueness of the limit, we can indeed define the pointwise limit k as a function. It is symmetric as an immediate consequence of the symmetry of all the kernels k_n. Let m ∈ N, (x_1, x_2, ..., x_m) ∈ X^m and (a_1, a_2, ..., a_m) ∈ R^m; then we have
∑_{i,j=1}^{m} a_i a_j k(x_i, x_j) = ∑_{i,j=1}^{m} a_i a_j lim_{n→∞} k_n(x_i, x_j)
= lim_{n→∞} ∑_{i,j=1}^{m} a_i a_j k_n(x_i, x_j) ≥ 0,

since each term of the sequence is nonnegative (every k_n is p.d.) and the limit preserves non-strict inequalities.
Now we can return to the original question. If k is c.p.d., we know that we can associate with k a p.d. kernel k̃ such that

k̃(x, y) = k(x, y) − k(x_0, x) − k(x_0, y) + k(x_0, x_0),

for any x, y ∈ X and some fixed point x_0 ∈ X. From the previous line it follows that

k(x, y) = k̃(x, y) + k(x_0, x) + k(x_0, y) − k(x_0, x_0),

or

exp(t k(x, y)) = exp(t k̃(x, y)) · exp(t k(x_0, x)) exp(t k(x_0, y)) exp(−t k(x_0, x_0)) =: k_1(x, y) · k_2(x, y),

where k_1(x, y) := exp(t k̃(x, y)) and k_2(x, y) := exp(t k(x_0, x)) exp(t k(x_0, y)) exp(−t k(x_0, x_0)).
We know that k̃ is p.d., and using the Taylor expansion we can write

exp(t k̃(x, y)) = ∑_{m=0}^{∞} (t k̃(x, y))^m / m! = lim_{M→∞} ∑_{m=0}^{M} (t k̃(x, y))^m / m!.

On the right-hand side we have a pointwise limit of finite sums of products of p.d. kernels scaled by the nonnegative coefficients t^m / m! (a sum of p.d. kernels, a product of p.d. kernels, and a nonnegative multiple of a p.d. kernel are all p.d.), which means that exp(t k̃(x, y)) is also p.d.
On the other hand, for all (x_1, ..., x_n) ∈ X^n and (a_1, a_2, ..., a_n) ∈ R^n we have

∑_{i,j=1}^{n} a_i a_j exp(t k(x_0, x_i)) exp(t k(x_0, x_j)) exp(−t k(x_0, x_0))
= exp(−t k(x_0, x_0)) ∑_{i,j=1}^{n} ( a_i exp(t k(x_0, x_i)) ) ( a_j exp(t k(x_0, x_j)) )
= exp(−t k(x_0, x_0)) ( ∑_{i=1}^{n} a_i exp(t k(x_0, x_i)) )² ≥ 0.

Finally, we have proved that k_1 and k_2 are p.d.; hence exp(t k) is p.d. as a product of k_1 and k_2.
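For k(x, y) = −||x − y||², the kernel exp(t k(x, y)) = exp(−t ||x − y||²) is exactly the Gaussian (RBF) kernel, and its positive definiteness can be observed numerically (a sketch with our own random points):

```python
import numpy as np

rng = np.random.default_rng(5)                 # our own random test points
pts = rng.standard_normal((8, 3))
K = -np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=-1)   # c.p.d. kernel -||x - y||^2

for t in [0.1, 0.5, 1.0, 5.0]:
    G = np.exp(t * K)                          # = exp(-t ||x - y||^2), the Gaussian kernel
    print(t, np.min(np.linalg.eigvalsh(G)))    # >= -1e-9: exp(t k) is p.d.
```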
7. Conversely, show that if the function exp(tk(x, y)) is p.d. for any t ≥ 0, then k is c.p.d.
We know that

k(x, y) = lim_{t→0+} (exp(t k(x, y)) − 1)/t,

for all x, y ∈ X. We assumed that exp(t k) is p.d., so it must be c.p.d. as well. Now, for any n ∈ N, (x_1, x_2, ..., x_n) ∈ X^n, any (a_1, a_2, ..., a_n) ∈ R^n such that ∑_{i=1}^{n} a_i = 0, and any t > 0, we have

∑_{i,j=1}^{n} a_i a_j (exp(t k(x_i, x_j)) − 1)/t = (1/t) ( ∑_{i,j=1}^{n} a_i a_j exp(t k(x_i, x_j)) − (∑_{i=1}^{n} a_i)(∑_{j=1}^{n} a_j) ) ≥ 0,

since the first sum is nonnegative (exp(t k) is p.d.) and the second term is zero.
This means that (exp(t k) − 1)/t is c.p.d. for every t > 0. And finally,

∑_{i,j=1}^{n} a_i a_j k(x_i, x_j) = ∑_{i,j=1}^{n} a_i a_j lim_{t→0+} (exp(t k(x_i, x_j)) − 1)/t
= lim_{t→0+} ∑_{i,j=1}^{n} a_i a_j (exp(t k(x_i, x_j)) − 1)/t ≥ 0,

so k is c.p.d.
8. Show that the shortest-path distance on a tree is c.p.d over the set of vertices (a tree is
an undirected graph without loops. The shortest-path distance between two vertices
is the number of edges of the unique path that connects them). Is the shortest-path
distance over graphs c.p.d. in general?
Let G = (V, E) be a tree (V being the set of vertices and E the set of edges) and x_0 ∈ V its root. Let us represent each node x ∈ V by Φ(x) ∈ R^{|E|}, defined componentwise by

Φ_i(x) = 1 if the i-th edge lies on the path between x and x_0, and 0 otherwise.
We know that for each vertex x ∈ V the path to x_0 is unique. For two vertices x and y, the edges shared by their paths to x_0 cancel in Φ(x) − Φ(y), and the remaining edges are exactly those of the unique path between x and y. Then the graph distance d_G(x, y) between any two vertices x and y (the length of the shortest path between x and y) is given by

d_G(x, y) = ||Φ(x) − Φ(y)||².

In problem 2.3 we have seen that −d_G is then c.p.d. Now, using the result of question 6, we can conclude that exp(−t d_G(x, y)) is p.d. for all t ≥ 0.
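A small sketch with a hand-built tree (our own example) that verifies d_G(x, y) = ||Φ(x) − Φ(y)||² for all pairs of vertices and checks that exp(−t d_G) has no negative eigenvalues:

```python
import numpy as np
from itertools import combinations

# A small hand-built tree on 6 vertices, rooted at x_0 = 0 (our own example).
edges = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5)]
parent = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2}

def path_edges(v):                 # edges on the unique path from v to the root 0
    out = set()
    while v != 0:
        out.add((parent[v], v))
        v = parent[v]
    return out

def phi(v):                        # the edge-indicator vector Phi(v) in R^{|E|}
    p = path_edges(v)
    return np.array([1.0 if e in p else 0.0 for e in edges])

def d_tree(u, v):                  # shortest-path distance = size of the symmetric difference
    return len(path_edges(u) ^ path_edges(v))

for u, v in combinations(range(6), 2):
    assert d_tree(u, v) == np.sum((phi(u) - phi(v)) ** 2)

D = np.array([[d_tree(u, v) for v in range(6)] for u in range(6)])
print(np.min(np.linalg.eigvalsh(np.exp(-0.7 * D))))   # >= -1e-9: exp(-t d_G) is p.d.
```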
On the other hand, general graphs do not have the property that −d_G is c.p.d. We can see this with a counterexample. Consider the graph below.

[Figure: the complete bipartite graph K_{2,3} on vertices 1-5, where vertices 1 and 5 are each adjacent to vertices 2, 3 and 4.]
In order for the shortest-path distance to correspond to a c.p.d. function, exp(−t d_G(x, y)) must be p.d. for all t ≥ 0. We can write down the matrix [exp(−t d_G(x_i, x_j))]_{ij} and show that it is not positive semidefinite. We have

[exp(−t d_G)] =
[ 1        e^{−t}    e^{−t}    e^{−t}    e^{−2t} ]
[ e^{−t}   1         e^{−2t}   e^{−2t}   e^{−t}  ]
[ e^{−t}   e^{−2t}   1         e^{−2t}   e^{−t}  ]
[ e^{−t}   e^{−2t}   e^{−2t}   1         e^{−t}  ]
[ e^{−2t}  e^{−t}    e^{−t}    e^{−t}    1       ]
We can use the fact that every principal minor of a positive semidefinite matrix is nonnegative (a variant of Sylvester's criterion) to show that this matrix is not positive semidefinite. A direct computation of the determinant gives

det [exp(−t d_G)] = e^{−10t} (e^{2t} − 2)(e^{2t} − 1)^4.

For 0 < t < (ln 2)/2 we have e^{2t} < 2, so the determinant is negative and the matrix cannot be positive semidefinite. Hence, we can conclude that the shortest-path distance over graphs is not c.p.d. in general.
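The counterexample can also be checked numerically (a sketch; the distance matrix below is read off the graph described above):

```python
import numpy as np

# Shortest-path distances on the graph above (vertices 1 and 5 each adjacent to 2, 3, 4).
D = np.array([[0, 1, 1, 1, 2],
              [1, 0, 2, 2, 1],
              [1, 2, 0, 2, 1],
              [1, 2, 2, 0, 1],
              [2, 1, 1, 1, 0]])

for t in [0.1, 0.2, 0.3]:
    M = np.exp(-t * D)
    # For small t the determinant and the smallest eigenvalue are negative,
    # so exp(-t d_G) is not p.d., and hence -d_G is not c.p.d.
    print(t, np.linalg.det(M), np.min(np.linalg.eigvalsh(M)))
```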