Technical Report
Multidimensional, Downsampled Convolution For Autoencoders
Ian Goodfellow
August 9, 2010
Abstract
This technical report describes discrete convolution with a multidimensional
kernel. Convolution implements matrix multiplication by a sparse matrix with
several elements constrained to be equal to each other. To implement a
convolutional autoencoder, the gradients of this operation, the transpose of
this operation, and the gradients of the transpose are all needed. When using
standard convolution, each of these supplementary operations can be described
as a convolution on slightly modified arguments. When the output is implicitly
downsampled by moving the kernel more than one pixel at each step, we must
define two new operations in order to compute all of the necessary values.
1 Definitions
Let $L$ be our loss function, $W$ our weights defining the kernel, $d$ a vector of
strides, $H$ our hidden units, and $V$ our visible units. $H_{cij}$ indexes position $c$
(an $N$-dimensional index) within feature map $i$ for example $j$. $V$ is of the same
format as $H$. $W_{cij}$ indexes the weight at position $c$ within the kernel, connecting
visible channel $i$ to hidden channel $j$. Throughout, $\circ$ denotes elementwise
multiplication, so $d \circ c$ is the position $c$ scaled by the strides.

Convolution with downsampling is performed (assuming $W$ is pre-flipped) by

\[
H_{cij} = \sum_{k,m} W_{kmi} V_{d \circ c + k, m, j}
\]
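
As a concrete reference, here is a minimal numpy sketch of this operation for a two-dimensional kernel. It is illustrative only: the function name conv_strided and the shape conventions (spatial position first, then channel, then example, matching the index order of $H_{cij}$) are assumptions made for this sketch, not definitions from the report.

\begin{verbatim}
import numpy as np

def conv_strided(W, V, d):
    # H[c,i,j] = sum over k,m of W[k,m,i] * V[d*c + k, m, j]
    # Assumed shapes (2-D spatial case):
    #   W: (kh, kw, n_vis, n_hid)  kernel position k, visible ch. m, hidden ch. i
    #   V: (vh, vw, n_vis, n_ex)   position, channel, example
    #   d: (dh, dw)                stride along each spatial dimension
    kh, kw, n_vis, n_hid = W.shape
    vh, vw, _, n_ex = V.shape
    dh, dw = d
    hh = (vh - kh) // dh + 1       # number of valid kernel placements per axis
    hw = (vw - kw) // dw + 1
    H = np.zeros((hh, hw, n_hid, n_ex))
    for c0 in range(hh):
        for c1 in range(hw):
            patch = V[dh*c0:dh*c0 + kh, dw*c1:dw*c1 + kw]  # (kh, kw, n_vis, n_ex)
            # contract over kernel position k and visible channel m
            H[c0, c1] = np.tensordot(W, patch, axes=([0, 1, 2], [0, 1, 2]))
    return H
\end{verbatim}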
2 Basic gradient
The gradient of the loss function with respect to the weights is given by
\begin{align*}
\frac{\partial L}{\partial W_{cij}} &= \sum_{k,m,n} \frac{\partial L}{\partial H_{kmn}} \frac{\partial H_{kmn}}{\partial W_{cij}} \\
&= \sum_{k,m,n} \frac{\partial L}{\partial H_{kmn}} \frac{\partial \sum_{p,q} W_{pqm} V_{d \circ k + p, q, n}}{\partial W_{cij}} \\
&= \sum_{k,n} \frac{\partial L}{\partial H_{kjn}} \frac{\partial \sum_{p,q} W_{pqj} V_{d \circ k + p, q, n}}{\partial W_{cij}} \\
&= \sum_{k,n} \frac{\partial L}{\partial H_{kjn}} V_{d \circ k + c, i, n}
\end{align*}

so

\[
\frac{\partial L}{\partial W_{cij}} = \sum_{k,m} \frac{\partial L}{\partial H_{kjm}} V_{d \circ k + c, i, m}
\]
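
This formula is easy to check numerically. The sketch below (again illustrative, reusing the hypothetical conv_strided above) compares one entry of the analytic weight gradient against a finite-difference estimate, taking $L = \frac{1}{2}\|H\|^2$ so that $\partial L / \partial H = H$.

\begin{verbatim}
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3, 2, 4))   # kernel: position, visible ch., hidden ch.
V = rng.standard_normal((8, 8, 2, 5))   # visible units: position, channel, example
d = (2, 2)

H = conv_strided(W, V, d)
G = H                                   # dL/dH when L = 0.5 * ||H||^2

# dL/dW[c,i,j] = sum over k,m of dL/dH[k,j,m] * V[d*k + c, i, m]
dW = np.zeros_like(W)
for c0 in range(3):                     # kernel positions c
    for c1 in range(3):
        for k0 in range(H.shape[0]):    # hidden positions k
            for k1 in range(H.shape[1]):
                v = V[d[0]*k0 + c0, d[1]*k1 + c1]  # (n_vis, n_ex)
                dW[c0, c1] += v @ G[k0, k1].T      # sum over examples m

eps = 1e-6
W2 = W.copy()
W2[0, 0, 0, 0] += eps
num = (0.5 * np.sum(conv_strided(W2, V, d)**2) - 0.5 * np.sum(H**2)) / eps
print(np.allclose(dW[0, 0, 0, 0], num, atol=1e-4))  # True
\end{verbatim}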
3 Transpose
We can think of strided convolution as multiplication by a matrix $M$. Let $h$ be
$H$ reshaped into a vector and $v$ be $V$ reshaped into a vector. Then

\[
h = M v
\]

Let $hr(c, i, j)$ be a reshaping function that maps indices in $H$ to indices in
$h$. Let $vr(c, i, j)$ be the same for $V$ and $v$. Then

\[
h_{hr(c,i,j)} = \sum_{k,m} W_{kmi} \, v_{vr(d \circ c + k, m, j)}
\]
Multiplying by the transpose, $r = M^T h$, gives

\[
r_a = \sum_{c,i,j,k,m \,|\, vr(d \circ c + k, m, j) = a} W_{kmi} H_{cij}
\]

or, reshaping $r$ back into the same format as $V$ and calling the result $R$,

\[
R_{qmj} = \sum_{c,k \,|\, d \circ c + k = q} \; \sum_i W_{kmi} H_{cij}
\]

To sum over the correct set of values for $c$ and $k$, we will need a modulus
operator or saved information from a previous iteration of a for loop, unless
$d = \vec{1}$. So this is not a convolution in the large stride case.
In the case where $d = \vec{1}$, we have

\[
R_{qmj} = \sum_p \sum_i W_{w - p, m, i} \, H_{q - w + p, i, j}
\]
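
In code, the transpose can be handled for any stride as a scatter-add over kernel placements instead of as a convolution. A minimal sketch, under the same assumed conventions as the conv_strided sketch above (the name conv_transpose_strided is likewise made up):

\begin{verbatim}
def conv_transpose_strided(W, H, d, v_spatial):
    # R[q,m,j] = sum over c,k with d*c + k = q, and over i, of W[k,m,i] * H[c,i,j]
    # Each hidden position c scatters the kernel back onto the window it read
    # from, which sidesteps the modulus bookkeeping described above.
    kh, kw, n_vis, n_hid = W.shape
    hh, hw, _, n_ex = H.shape
    R = np.zeros((v_spatial[0], v_spatial[1], n_vis, n_ex))
    for c0 in range(hh):
        for c1 in range(hw):
            # contract over the hidden channel i: result is (kh, kw, n_vis, n_ex)
            contrib = np.tensordot(W, H[c0, c1], axes=([3], [0]))
            R[d[0]*c0:d[0]*c0 + kh, d[1]*c1:d[1]*c1 + kw] += contrib
    return R
\end{verbatim}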
4 New notation
I'm going to make up some new notation now, since our operation isn't really
convolution (downsampling is built into the operation, we don't flip the kernel,
etc.). From here on out, I will write

\[
H_{cij} = \sum_{k,m} W_{kmi} V_{d \circ c + k, m, j}
\]

as

\[
H = W \,@_d\, V
\]

and
\[
R_{qmj} = \sum_{c,k \,|\, d \circ c + k = q} \; \sum_i W_{kmi} H_{cij}
\]

as

\[
R = W \,@^T_d\, H
\]

and

\[
\frac{\partial L(H = W @_d V)}{\partial W_{cij}} = \sum_{k,m} \frac{\partial L}{\partial H_{kjm}} V_{d \circ k + c, i, m}
\]

as

\[
\nabla_W L(H = W @_d V) = (\nabla_H L) \,\#_d\, V
\]
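
In code, the # op is a third distinct operation. A sketch under the same assumed conventions (B hidden-shaped like $\nabla_H L$, C visible-shaped like $V$, result kernel-shaped like $W$); the quadruple loop in the gradient check of section 2 computes exactly pound_op(G, V, d, (3, 3)):

\begin{verbatim}
def pound_op(B, C, d, k_spatial):
    # A[c,i,j] = sum over k,m of B[k,j,m] * C[d*k + c, i, m]
    hh, hw, n_hid, n_ex = B.shape
    kh, kw = k_spatial
    A = np.zeros((kh, kw, C.shape[2], n_hid))
    for k0 in range(hh):
        for k1 in range(hw):
            patch = C[d[0]*k0:d[0]*k0 + kh, d[1]*k1:d[1]*k1 + kw]
            # patch: (kh, kw, n_vis, n_ex); contract over the example index m
            A += np.tensordot(patch, B[k0, k1], axes=([3], [1]))
    return A
\end{verbatim}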
5 Autoencoder gradients
To make an autoencoder, we'll need to be able to compute the reconstruction
$R = W \,@^T_d\, H$. This means we're also going to need to be able to take the
gradient of $L(W @^T_d H)$ with respect to both $W$ (so we know how to update
the encoding weights) and $H$, so we'll be able to propagate gradients back to
the encoding layer. Finally, when we stack the autoencoders into a convolutional
MLP, we'll need to be able to propagate gradients back from one layer to
another, so we must also find the gradient of $L(W @_d V)$ with respect to $V$.
\[
R_{qmj} = \sum_{c,k \,|\, d \circ c + k = q} \; \sum_i W_{kmi} H_{cij}
\]

so

\begin{align*}
\frac{\partial L}{\partial W_{xyz}} &= \sum_{q,m,j} \frac{\partial L}{\partial R_{qmj}} \frac{\partial R_{qmj}}{\partial W_{xyz}} \\
&= \sum_{q,m,j} \frac{\partial L}{\partial R_{qmj}} \frac{\partial \sum_{c,k|d \circ c + k = q} \sum_i W_{kmi} H_{cij}}{\partial W_{xyz}} \\
&= \sum_{q,j} \frac{\partial L}{\partial R_{qyj}} \frac{\partial \sum_{c|d \circ c + x = q} W_{xyz} H_{czj}}{\partial W_{xyz}} \\
&= \sum_{q,j} \frac{\partial L}{\partial R_{qyj}} \sum_{c|d \circ c + x = q} H_{czj} \\
&= \sum_{c,j} \frac{\partial L}{\partial R_{d \circ c + x, y, j}} H_{czj}
\end{align*}
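
As a sanity check, this gradient can be compared against finite differences using the running sketches, taking $L = \frac{1}{2}\|R\|^2$ so $\nabla_R L = R$; the call to the hypothetical pound_op anticipates the identification with the # op made just below.

\begin{verbatim}
R = conv_transpose_strided(W, H, d, (8, 8))
# dL/dW[x,y,z] = sum over c,j of dL/dR[d*c + x, y, j] * H[c,z,j]
dW_t = pound_op(H, R, d, (3, 3))

eps = 1e-6
W2 = W.copy()
W2[0, 0, 0, 0] += eps
R2 = conv_transpose_strided(W2, H, d, (8, 8))
num = (0.5 * np.sum(R2**2) - 0.5 * np.sum(R**2)) / eps
print(np.allclose(dW_t[0, 0, 0, 0], num, atol=1e-4))  # True
\end{verbatim}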
Recall that the gradient of $L(W @_d V)$ with respect to the kernel is:

\[
\frac{\partial L}{\partial W_{cij}} = \sum_{k,m} \frac{\partial L}{\partial H_{kjm}} V_{d \circ k + c, i, m}
\]
This has the same form as the gradient we just derived, i.e. both use the new
# operation. Thus we can write the gradient of $L(W @^T_d H)$ with respect to
the kernel as

\[
\nabla_W L(R = W @^T_d H) = H \,\#_d\, \nabla_R L
\]

For the gradient with respect to $H$, recall again that

\[
R_{qmj} = \sum_{c,k \,|\, d \circ c + k = q} \; \sum_i W_{kmi} H_{cij}
\]
so
\begin{align*}
\frac{\partial L}{\partial H_{xyz}} &= \sum_{q,m,j} \frac{\partial L}{\partial R_{qmj}} \frac{\partial R_{qmj}}{\partial H_{xyz}} \\
&= \sum_{q,m,j} \frac{\partial L}{\partial R_{qmj}} \frac{\partial \sum_{c,k|d \circ c + k = q} \sum_i W_{kmi} H_{cij}}{\partial H_{xyz}} \\
&= \sum_{q,m} \frac{\partial L}{\partial R_{qmz}} \frac{\partial \sum_{k|d \circ x + k = q} W_{kmy} H_{xyz}}{\partial H_{xyz}} \\
&= \sum_{q,m} \frac{\partial L}{\partial R_{qmz}} \sum_{k|d \circ x + k = q} W_{kmy} \\
&= \sum_{k,m} \frac{\partial L}{\partial R_{d \circ x + k, m, z}} W_{kmy}
\end{align*}
Remember that

\[
H_{cij} = \sum_{k,m} W_{kmi} V_{d \circ c + k, m, j}
\]

so we can write

\[
\nabla_H L(R = W @^T_d H) = W \,@_d\, \nabla_R L
\]
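
This identity too can be checked numerically, continuing the running example (still $L = \frac{1}{2}\|R\|^2$):

\begin{verbatim}
dH = conv_strided(W, R, d)   # claimed dL/dH = W @d (dL/dR), with dL/dR = R

eps = 1e-6
H2 = H.copy()
H2[0, 0, 0, 0] += eps
R2 = conv_transpose_strided(W, H2, d, (8, 8))
num = (0.5 * np.sum(R2**2) - 0.5 * np.sum(R**2)) / eps
print(np.allclose(dH[0, 0, 0, 0], num, atol=1e-4))  # True
\end{verbatim}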
Similarly, the gradient of $L(W @_d V)$ with respect to $V$ is

\begin{align*}
\frac{\partial L}{\partial V_{xyz}} &= \sum_{c,i} \frac{\partial L}{\partial H_{ciz}} \sum_{k \,|\, d \circ c + k = x} W_{kyi} \\
&= \sum_i \; \sum_{c,k \,|\, d \circ c + k = x} W_{kyi} \frac{\partial L}{\partial H_{ciz}}
\end{align*}

which has the form of the transpose operation, so

\[
\nabla_V L(H = W @_d V) = W \,@^T_d\, \nabla_H L
\]
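
Putting the pieces together, here is how the three ops and the gradient identities assemble into one forward/backward pass of a linear convolutional autoencoder with tied weights, in terms of the running sketches (illustrative; a practical autoencoder would add biases and nonlinearities). The loss is $L = \frac{1}{2}\|R - V\|^2$.

\begin{verbatim}
H  = conv_strided(W, V, d)                         # encode:  H = W @d V
R  = conv_transpose_strided(W, H, d, (8, 8))       # decode:  R = W @Td H
dR = R - V                                         # dL/dR

dW_dec = pound_op(H, dR, d, (3, 3))                # grad wrt W through the decoder
dH     = conv_strided(W, dR, d)                    # grad wrt H:  W @d dR
dW_enc = pound_op(dH, V, d, (3, 3))                # grad wrt W through the encoder
dV     = conv_transpose_strided(W, dH, d, (8, 8))  # grad to the layer below

dW = dW_dec + dW_enc                               # total gradient, tied weights
\end{verbatim}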
6 The rest of the gradients
We now know enough to make a stacked autoencoder. However, there are still
some gradients that may be taken, and it would be nice if our ops supported
all of them. The @ op's gradient can be expressed in terms of @T and #, and
the @T op's gradient can be expressed in terms of @ and #. Thus if we add
a gradient method to the # op, our ops will be infinitely differentiable for all
combinations of variables.
Let $A = B \,\#_d\, C$, so that $A_{cij} = \sum_{k,m} B_{kjm} C_{d \circ k + c, i, m}$. Then

\begin{align*}
\frac{\partial L}{\partial B_{xyz}} &= \sum_{c,i,j} \frac{\partial L(A)}{\partial A_{cij}} \frac{\partial \sum_{k,m} B_{kjm} C_{d \circ k + c, i, m}}{\partial B_{xyz}} \\
&= \sum_{c,i} \frac{\partial L(A)}{\partial A_{ciy}} C_{d \circ x + c, i, z}
\end{align*}

So

\[
\nabla_B L(A = B \,\#_d\, C) = (\nabla_A L) \,@_d\, C
\]
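
This can be exercised with the running sketches; note that pound_op returns a kernel-shaped array, so $\nabla_A L$ slots into conv_strided exactly where a kernel would. With $L = \frac{1}{2}\|A\|^2$, so $\nabla_A L = A$:

\begin{verbatim}
A  = pound_op(H, V, d, (3, 3))   # A = H #d V, kernel-shaped
dB = conv_strided(A, V, d)       # claimed dL/dB = (grad_A L) @d C, hidden-shaped

eps = 1e-6
H2 = H.copy()
H2[0, 0, 0, 0] += eps
A2 = pound_op(H2, V, d, (3, 3))
num = (0.5 * np.sum(A2**2) - 0.5 * np.sum(A**2)) / eps
print(np.allclose(dB[0, 0, 0, 0], num, atol=1e-4))  # True
\end{verbatim}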
Similarly, for the gradient with respect to $C$:

\begin{align*}
\frac{\partial L}{\partial C_{xyz}} &= \sum_{c,i,j} \frac{\partial L(A)}{\partial A_{cij}} \frac{\partial \sum_{k,m} B_{kjm} C_{d \circ k + c, i, m}}{\partial C_{xyz}} \\
&= \sum_j \; \sum_{c,k \,|\, d \circ k + c = x} \frac{\partial L(A)}{\partial A_{cyj}} \frac{\partial B_{kjz} C_{xyz}}{\partial C_{xyz}} \\
&= \sum_j \; \sum_{c,k \,|\, d \circ k + c = x} \frac{\partial L(A)}{\partial A_{cyj}} B_{kjz}
\end{align*}
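
This last gradient has the same scatter-add structure as the transpose sketch, with the channel axes playing different roles. A sketch (the name pound_grad_C is made up, and the characterization as a separate scatter-add op is my reading of the derivation):

\begin{verbatim}
def pound_grad_C(B, dA, d, v_spatial):
    # dL/dC[x,y,z] = sum over j, and over c,k with d*k + c = x,
    #                of B[k,j,z] * dL/dA[c,y,j]
    hh, hw, n_hid, n_ex = B.shape
    kh, kw, n_vis, _ = dA.shape
    dC = np.zeros((v_spatial[0], v_spatial[1], n_vis, n_ex))
    for k0 in range(hh):
        for k1 in range(hw):
            # contract over the hidden channel j: result is (kh, kw, n_vis, n_ex)
            contrib = np.tensordot(dA, B[k0, k1], axes=([3], [0]))
            dC[d[0]*k0:d[0]*k0 + kh, d[1]*k1:d[1]*k1 + kw] += contrib
    return dC

# e.g. pound_grad_C(H, A, d, (8, 8)) gives dL/dC for the
# L = 0.5 * ||A||^2 example above
\end{verbatim}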
7 Summary
We have defined these operations:

Strided convolution:

\[
H = W \,@_d\, V \;\Rightarrow\; H_{cij} = \sum_{k,m} W_{kmi} V_{d \circ c + k, m, j}
\]

Its transpose:

\[
R = W \,@^T_d\, H \;\Rightarrow\; R_{qmj} = \sum_{c,k \,|\, d \circ c + k = q} \; \sum_i W_{kmi} H_{cij}
\]

The # op:

\[
A = B \,\#_d\, C \;\Rightarrow\; A_{cij} = \sum_{k,m} B_{kjm} C_{d \circ k + c, i, m}
\]

And we have derived these gradients:

\begin{align*}
\nabla_W L(H = W @_d V) &= (\nabla_H L) \,\#_d\, V \\
\nabla_V L(H = W @_d V) &= W \,@^T_d\, \nabla_H L \\
\nabla_W L(R = W @^T_d H) &= H \,\#_d\, \nabla_R L \\
\nabla_H L(R = W @^T_d H) &= W \,@_d\, \nabla_R L
\end{align*}