Kernel
Julien Mairal
Inria Grenoble
Paradigm 3: Deep Kernel Machines
Building a functional space for CNNs (or similar objects).
Deriving a measure of model complexity.
Learning aspects
A quick zoom on convolutional neural networks
Map data to a Hilbert space (RKHS) $\mathcal{H}$ and work with linear forms: $f(x) = \langle f, \varphi(x)\rangle_{\mathcal{H}}$.
[Figure: the feature map $\varphi$ sends data points (e.g., $x$, $z$) from the input space $X$ to the Hilbert space $\mathcal{H}$, where the model acts as a linear form.]
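As a toy illustration of this paradigm (my own sketch, not from the slides), the snippet below builds a function in the RKHS of a Gaussian kernel via kernel ridge regression, so that $f(x) = \sum_i \alpha_i K(x_i, x) = \langle f, \varphi(x)\rangle_{\mathcal{H}}$; the data, bandwidth, and regularization are placeholders.

import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # pairwise kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))                                       # placeholder training points
y = np.sin(X[:, 0])                                                    # placeholder targets
alpha = np.linalg.solve(gaussian_kernel(X, X) + 0.1 * np.eye(50), y)   # kernel ridge regression

def f(x):
    # a linear form in the RKHS: f(x) = <f, phi(x)>_H = sum_i alpha_i K(x_i, x)
    return gaussian_kernel(X, x[None, :]).ravel() @ alpha

print(f(X[0]), y[0])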
$$f(x) = \sigma_k(W_k\,\sigma_{k-1}(W_{k-1}\cdots\sigma_2(W_2\,\sigma_1(W_1 x))\cdots)) = \langle f, \Phi(x)\rangle_{\mathcal{H}}.$$
Why do we care?
Φ(x) is related to the network architecture and is independent
of training data. Is it stable? Does it lose signal information?
f is a predictive model. Can we control its stability?
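To make the nested composition concrete, here is a hypothetical two-layer instance of the formula above (weights and activation are placeholders; this only evaluates the left-hand side, not $\Phi$ itself).

import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((16, 8)), rng.standard_normal((1, 16))
sigma = lambda u: np.maximum(u, 0.0)     # placeholder activation

def f(x):
    # f(x) = sigma_2(W_2 sigma_1(W_1 x)), the k = 2 case of the formula above
    return sigma(W2 @ sigma(W1 @ x))

print(f(rng.standard_normal(8)))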
[Mallat, 2012, Allassonnière, Amit, and Trouvé, 2007, Trouvé and Younes, 2005]...
Definition of stability
Representation $\Phi(\cdot)$ is stable [Mallat, 2012] if
$$\|\Phi(L_\tau x) - \Phi(x)\| \le \big(C_1\,\|\nabla\tau\|_\infty + C_2\,\|\tau\|_\infty\big)\,\|x\|,$$
where $L_\tau x(u) = x(u - \tau(u))$ is the action of a diffeomorphism $\tau$, $\|\nabla\tau\|_\infty$ measures the amount of deformation, and $\|\tau\|_\infty$ the amount of translation.
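As a rough numerical illustration of this definition (my own sketch: $\Phi$ here is a toy representation given by Gaussian pooling of $|x|$, not the multilayer kernel mapping studied in these slides), one can deform a 1-D signal with $L_\tau x(u) = x(u - \tau(u))$ and compare the two sides of the inequality, ignoring the constants.

import numpy as np

def gaussian_pool(x, sigma):
    u = np.arange(-3 * int(sigma), 3 * int(sigma) + 1)
    w = np.exp(-u**2 / (2 * sigma**2))
    return np.convolve(x, w / w.sum(), mode="same")

def phi(x, pool=8.0):
    # toy representation: pointwise modulus followed by Gaussian pooling
    return gaussian_pool(np.abs(x), pool)

rng = np.random.default_rng(0)
n = 512
x = gaussian_pool(rng.standard_normal(n), 4.0)       # a smooth 1-D signal
u = np.arange(n, dtype=float)
tau = 2.0 * np.sin(2 * np.pi * u / n)                # small, smooth displacement field
x_tau = np.interp(u - tau, u, x)                     # L_tau x(u) = x(u - tau(u))

lhs = np.linalg.norm(phi(x_tau) - phi(x))
rhs = (np.abs(np.gradient(tau)).max() + np.abs(tau).max()) * np.linalg.norm(x)
print(lhs, rhs)   # constants omitted; for small, smooth tau, lhs stays well below rhs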
Signal representation
Signal preservation of the multi-layer kernel mapping Φ.
Conditions of non-trivial stability for Φ.
Constructions to achieve group invariance.
On learning
Bounds on the RKHS norm $\|\cdot\|_{\mathcal{H}}$ to control stability and generalization of a predictive model $f$.
One layer of the construction: given a feature map $x_{k-1}: \Omega \to \mathcal{H}_{k-1}$,
patch extraction: $P_k x_{k-1}(u) := (v \in S_k \mapsto x_{k-1}(u+v)) \in \mathcal{P}_k = \mathcal{H}_{k-1}^{S_k}$;
kernel mapping: $M_k P_k x_{k-1}: \Omega \to \mathcal{H}_k$;
linear pooling: $x_k := A_k M_k P_k x_{k-1}: \Omega \to \mathcal{H}_k$.
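A minimal 1-D discrete sketch of these three steps, assuming a placeholder nonlinear feature map in place of the kernel mapping $M_k$ and Gaussian weights for the pooling $A_k$ (illustrative only, not the exact construction of the slides).

import numpy as np

def extract_patches(x, size):
    # P_k: at each position u, the patch (x[u+v])_{v in S_k}, with zero padding
    xp = np.pad(x, ((size // 2, size // 2), (0, 0)))
    return np.stack([xp[u:u + size].ravel() for u in range(len(x))])

def kernel_map(P):
    # M_k: placeholder nonlinear feature map standing in for the kernel mapping
    rng = np.random.default_rng(0)
    W = rng.standard_normal((P.shape[1], 32)) / np.sqrt(P.shape[1])
    return np.cos(P @ W)

def linear_pool(M, sigma=2.0):
    # A_k: Gaussian pooling along positions, applied to each coordinate
    u = np.arange(-3 * int(sigma), 3 * int(sigma) + 1)
    w = np.exp(-u**2 / (2 * sigma**2)); w = w / w.sum()
    return np.stack([np.convolve(M[:, j], w, mode="same") for j in range(M.shape[1])], axis=1)

x_prev = np.random.default_rng(1).standard_normal((64, 4))    # x_{k-1}: 64 positions, values in R^4
x_next = linear_pool(kernel_map(extract_patches(x_prev, 3)))  # x_k = A_k M_k P_k x_{k-1}
print(x_next.shape)                                           # (64, 32)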
Examples
$\kappa_{\exp}(\langle z, z'\rangle) = e^{\langle z, z'\rangle - 1} = e^{-\frac{1}{2}\|z - z'\|^2}$ (if $\|z\| = \|z'\| = 1$).
$\kappa_{\text{inv-poly}}(\langle z, z'\rangle) = \dfrac{1}{2 - \langle z, z'\rangle}.$
[Schoenberg, 1942, Scholkopf, 1997, Smola et al., 2001, Cho and Saul, 2010, Zhang
et al., 2016, 2017, Daniely et al., 2016, Bach, 2017, Mairal, 2016]...
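A quick numerical check of the identity $\kappa_{\exp}(\langle z, z'\rangle) = e^{-\frac{1}{2}\|z - z'\|^2}$ for unit-norm vectors (a small sketch of my own):

import numpy as np

rng = np.random.default_rng(0)
z, zp = rng.standard_normal(10), rng.standard_normal(10)
z, zp = z / np.linalg.norm(z), zp / np.linalg.norm(zp)     # ||z|| = ||z'|| = 1

k_exp_dot = np.exp(z @ zp - 1)                              # e^{<z,z'> - 1}
k_exp_rbf = np.exp(-0.5 * np.linalg.norm(z - zp) ** 2)      # e^{-||z - z'||^2 / 2}
k_inv_poly = 1.0 / (2.0 - z @ zp)
print(np.isclose(k_exp_dot, k_exp_rbf), k_inv_poly)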
Multilayer representation
Prediction layer
e.g., linear $f(x) = \langle w, \Phi_n(x)\rangle$.
“linear kernel” $K(x, x') = \langle \Phi_n(x), \Phi_n(x')\rangle = \int_\Omega \langle x_n(u), x'_n(u)\rangle\, du$.
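On a discrete grid this kernel is simply a sum of pointwise inner products of the final feature maps; a tiny sketch with random placeholder maps:

import numpy as np

rng = np.random.default_rng(0)
xn, xnp = rng.standard_normal((64, 32)), rng.standard_normal((64, 32))  # final maps x_n, x'_n
K = np.sum(xn * xnp)     # K(x, x') = sum_u <x_n(u), x'_n(u)>, discrete version of the integral
print(K)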
[Mallat, 2012, Bruna and Mallat, 2013, Sifre and Mallat, 2013]...
$P x(u) = (x(uv))_{v \in S}.$
$\langle f_w, \bar M_k \bar P_k \bar x_{k-1}(u)\rangle = f_w(\bar P_k \bar x_{k-1}(u)) = \langle w, \bar P_k \bar x_{k-1}(u)\rangle,$
and
$\bar P_k \bar x_{k-1}(u) = \sum_{w \in B} \langle f_w, \bar M_k \bar P_k \bar x_{k-1}(u)\rangle\, w.$
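In practice, such a finite expansion can be obtained with a Nyström-type projection onto the span of a few anchor points; below is a hypothetical sketch for a single patch $z$, with anchors $W$ and the dot-product kernel $\kappa_{\exp}$ (my own illustration, not necessarily the slides' exact scheme).

import numpy as np

kappa = lambda u: np.exp(u - 1.0)                  # dot-product kernel kappa_exp

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 12))                   # 8 anchor points ("filters") in R^12
W /= np.linalg.norm(W, axis=1, keepdims=True)
z = rng.standard_normal(12); z /= np.linalg.norm(z)

# Coordinates of the projection of phi(z) onto span{phi(w_1), ..., phi(w_8)},
# expressed in an orthonormal basis: kappa(W W^T)^{-1/2} kappa(W z).
Kww = kappa(W @ W.T)
evals, evecs = np.linalg.eigh(Kww)
Kww_inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(np.maximum(evals, 1e-12))) @ evecs.T
psi_z = Kww_inv_sqrt @ kappa(W @ z)

# Sanity check: <psi(z), psi(z)> approaches kappa(<z, z>) as the anchors cover the data.
print(psi_z @ psi_z, kappa(z @ z))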
[Figure: one layer of the discrete construction, with labels $\bar x_{k-1}$, $\bar P_k \bar x_{k-1}(u) \in \mathcal{P}_k$, dot-product kernel, linear pooling, downsampling, $\bar A_k$, $\bar x_k$, and deconvolution.]
$f: z \mapsto \|z\|\,\sigma(\langle g, z\rangle / \|z\|).$
Smooth activations: $\sigma(u) = \sum_{j=0}^{\infty} a_j u^j$ with $a_j \ge 0$.
Norm: $\|f\|_{\mathcal{H}_k}^2 \le C_\sigma^2(\|g\|^2) = \sum_{j=0}^{\infty} \frac{a_j^2}{b_j}\,\|g\|^{2j} < \infty.$
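A quick numerical check (with a placeholder smooth activation) that such an $f$ is positively homogeneous and coincides with $z \mapsto \sigma(\langle g, z\rangle)$ on the unit sphere:

import numpy as np

sigma = lambda u: np.exp(u - 1.0)                  # placeholder smooth activation (a_j >= 0)
rng = np.random.default_rng(0)
g, z = rng.standard_normal(6), rng.standard_normal(6)

f = lambda z: np.linalg.norm(z) * sigma(g @ z / np.linalg.norm(z))
lam = 3.7
print(np.isclose(f(lam * z), lam * f(z)))          # positive homogeneity: f(lam z) = lam f(z)
print(np.isclose(f(z / np.linalg.norm(z)), sigma(g @ z / np.linalg.norm(z))))  # agrees with sigma on the sphere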
$$f(x) = \sigma_k(W_k\,\sigma_{k-1}(W_{k-1}\cdots\sigma_2(W_2\,\sigma_1(W_1 x))\cdots)) = \langle f, \Phi(x)\rangle_{\mathcal{H}}.$$
Leads to margin bound $O\!\big(\|\hat f_N\|\,R/(\gamma\sqrt{N})\big)$ for a learned CNN $\hat f_N$
with margin (confidence) $\gamma > 0$.
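For a rough sense of scale, with purely illustrative numbers (not from the slides):
$$\|\hat f_N\| = 10,\quad R = 1,\quad \gamma = 0.1,\quad N = 10^4 \;\Longrightarrow\; \frac{\|\hat f_N\|\,R}{\gamma\sqrt{N}} = \frac{10 \times 1}{0.1 \times 100} = 1.$$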
Related to recent generalization bounds for neural networks based
on product of spectral norms [e.g., Bartlett et al., 2017,
Neyshabur et al., 2018].
Questions:
Better regularization?
How does SGD control capacity in CNNs?
What about networks with no pooling layers? ResNet?
[Williams and Seeger, 2001, Smola and Schölkopf, 2000, Zhang et al., 2008]...
[Figure: approximation of one layer by projection: points $x, x'$ from the map $I_0$ are mapped via the kernel trick to $\varphi_1(x), \varphi_1(x')$ in the Hilbert space $\mathcal{H}_1$, projected onto a finite-dimensional subspace $\mathcal{F}_1$ to give $\psi_1(x), \psi_1(x')$ in the intermediate map $M_1$, and linear pooling yields $I_1$.]
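The cited works approximate a kernel matrix from a subset of its columns (landmark points); a standard Nyström sketch, written from scratch rather than taken from the slides:

import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
idx = rng.choice(500, size=50, replace=False)        # landmark (anchor) points

K = gaussian_kernel(X, X)                            # exact kernel matrix (for comparison)
C = gaussian_kernel(X, X[idx])                       # n x m block
Wm = gaussian_kernel(X[idx], X[idx])                 # m x m block
K_nystrom = C @ np.linalg.pinv(Wm) @ C.T             # Nystrom approximation of K

print(np.linalg.norm(K - K_nystrom) / np.linalg.norm(K))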
$$\sup_{x, x' \in L^2(\Omega, \mathcal{H}_0)} \frac{\|\Phi(x) - \Phi(x')\|}{\|x - x'\|} = 1.$$