2019). Teshima et al. (2020) show that several flow networks, including those from Huang et al. (2018) and Jaini et al. (2019), are also universal approximators of diffeomorphisms.

The injective flows considered here have key applications in inference and inverse problems; for an overview of deep learning approaches to inverse problems, see (Arridge et al., 2019). Bora et al. (2017) proposed to regularize compressed sensing problems by constraining the recovery to the range of (pre-trained) generative models. Injective flows with efficient inverses as generative models give an efficient algorithmic projection² on the range, which facilitates implementation of reconstruction algorithms. An alternative approach is Bayesian, where flows are used to obtain tractable variational approximations of posterior distributions over parameters of interest, via supervised training on labeled input-output data pairs. Ardizzone et al. (2018) encode the dimension-reducing forward process by an invertible neural network (INN), with additional outputs used to encode posterior variability. Invertibility guarantees that a model of the inverse process is learned implicitly. For a given measurement, the inverse pass of the INN approximates the posterior over parameters. Sun & Bouman (2020) propose variational approximations of the posterior using an untrained deep generative model. They train a normalizing flow which produces samples from the posterior, with the prior and the noise model given implicitly by the regularized misfit functional. In Kothari et al. (2021) this procedure is adapted to priors specified by injective flows, which yields significant improvements in computational efficiency.

² Idempotent but in general not orthogonal.

1.2. Our Contribution

We derive new approximation results for neural networks composed of bijective flows and injective expansive layers, including those introduced by (Brehmer & Cranmer, 2020) and (Kothari et al., 2021). We show that these networks universally jointly approximate a large class of manifolds and densities supported on them.

We build on the results of Teshima et al. (2020) and develop a new theoretical device which we refer to as the embedding gap. This gap is a measure of how nearly a mapping from R^o to R^m embeds an n-dimensional manifold in R^m, where n ≤ o. We find a natural relationship between the embedding gap and the problem of approximating probability measures with low-dimensional support.

We then relate the embedding gap to a relaxation of universality we call the manifold embedding property. We show that this property captures the essential geometric aspects of universality and uncover important topological restrictions on the approximation power of these networks, to our knowledge heretofore unknown in the literature. We give an example of an absolutely continuous measure µ and an embedding f : R^2 → R^3 such that f_#µ can not be approximated with combinations of flow layers and linear expansive layers. This may be surprising since it was previously conjectured that networks such as those of Brehmer & Cranmer (2020) can approximate any “nice” density supported on a “nice” manifold. We establish universality for manifolds with suitable topology, described in terms of extendable embeddings. We find that the set of extendable embeddings is a proper subset of all embeddings, but when m ≥ 3n + 1, via an application of the clean trick from algebraic topology, we show that all diffeomorphisms are extendable and thus injective flows approximate distributions on arbitrary manifolds. Our universality proof also implies that optimality of the approximating network can be established in reverse: optimality of a given layer can be established without optimality of preceding layers. This settles a (generalization of a) conjecture posed for a three-part network (composed of two flow networks and zero padding) in (Brehmer & Cranmer, 2020). Finally, we show that these universal architectures are also practical and admit exact layer-wise projections, as well as other properties discussed in Section 3.5.

2. Architectures Considered

Let C(X, Y) denote the space of continuous functions X → Y. Our goal is to make statements about networks in F ⊂ C(X, Y) that are of the form

F = T_L^{n_L} ◦ R_L^{n_{L-1},n_L} ◦ ⋯ ◦ T_1^{n_1} ◦ R_1^{n_0,n_1} ◦ T_0^{n_0},   (1)

where R_ℓ^{n_{ℓ-1},n_ℓ} ⊂ C(R^{n_{ℓ-1}}, R^{n_ℓ}), T_ℓ^{n_ℓ} ⊂ C(R^{n_ℓ}, R^{n_ℓ}), L ∈ N, n_0 = n, n_L = m, and n_ℓ ≥ n_{ℓ-1} for ℓ = 1, ..., L. We introduce a well-tuned shorthand notation and write H ◦ G := {h ◦ g : h ∈ H, g ∈ G} throughout the paper.

We identify R with the expansive layers and T with the bijective flows. Loosely speaking, the purpose of the expansive layers is to allow the network to parameterize high-dimensional functions by low-dimensional coordinates in an injective way. The flow networks give the network the expressivity necessary for universal approximation of manifold-supported distributions.

2.1. Expansive Layers

The expansive elements transform an n-dimensional manifold M embedded in R^{n_{ℓ-1}} and embed it in a higher dimensional space R^{n_ℓ}. To preserve the topology of the manifold they are injective. We thus make the following assumptions about the expansive elements:

Definition 2.1 (Expansive Element). A family of functions R ⊂ C(R^n, R^m) is called a family of expansive elements if m > n, and each R ∈ R is both injective and Lipschitz.
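To make Eqn. (1) concrete, the following is a minimal numerical sketch of one expansive/flow pair: a zero-padding expansive element (injective and 1-Lipschitz) followed by an additive coupling layer playing the role of the bijective flow. The specific maps, dimensions, and the NumPy implementation are illustrative assumptions, not the architectures used in the paper.

```python
import numpy as np

def expansive_zero_pad(x, m):
    # Injective, Lipschitz expansive element R : R^n -> R^m, R(x) = (x, 0).
    n = x.shape[-1]
    return np.concatenate([x, np.zeros(x.shape[:-1] + (m - n,))], axis=-1)

def coupling_flow(z, d, s):
    # Bijective flow T : R^m -> R^m; additive coupling layer that shifts the
    # last m-d coordinates by a function s of the first d coordinates.
    return np.concatenate([z[..., :d], z[..., d:] + s(z[..., :d])], axis=-1)

def coupling_flow_inverse(z, d, s):
    return np.concatenate([z[..., :d], z[..., d:] - s(z[..., :d])], axis=-1)

# One layer E = T o R of the composition in Eqn. (1), with n = 2, m = 3.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 2))
s = lambda u: np.tanh(u @ np.array([[0.7], [-1.3]]))  # arbitrary smooth shift
z = coupling_flow(expansive_zero_pad(x, 3), d=2, s=s)

# The flow is exactly invertible, so the low-dimensional coordinates can be
# recovered from points on the embedded manifold.
x_rec = coupling_flow_inverse(z, d=2, s=s)[..., :2]
assert np.allclose(x, x_rec)
```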
Examples of bijective flow networks include

(T1) Coupling flows, introduced by (Dinh et al., 2014), which consider T(x) = H_k ◦ ⋯ ◦ H_1(x) where

H_i(x) = ( h_i([x]_{1:d}, g_i([x]_{d+1:n})), [x]_{d+1:n} ).   (3)

B_{K,W}(f, g) = inf_{r ∈ emb(f(K), g(W))} ||I − r||_{L∞(f(K))}   (6)

³ Note that if X is a compact set, then continuity of f^{-1} : f(X) → X is automatic and need not be assumed (Sutherland, 2009, Cor. 13.27). Moreover, if f : R^n → R^m is a continuous injective map that satisfies |f(x)| → ∞ as |x| → ∞, then by (Mukherjee, 2015, Cor. 2.1.23) the map f^{-1} : f(R^n) → R^n is continuous.
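As a concrete illustration of the coupling layer in Eqn. (3), the sketch below uses an additive h_i (the first block is shifted by a function of the pass-through block), which is one common choice; the particular h_i, g_i and dimensions here are assumptions made for the example, not the ones used in the paper.

```python
import numpy as np

d, n = 2, 4
rng = np.random.default_rng(1)
A = rng.standard_normal((n - d, d))

def g(x2):
    # Conditioner g_i : R^{n-d} -> R^d (an arbitrary smooth choice).
    return np.tanh(x2 @ A)

def H(x):
    # Coupling layer of Eqn. (3) with the additive choice h_i(a, c) = a + c.
    x1, x2 = x[..., :d], x[..., d:]
    return np.concatenate([x1 + g(x2), x2], axis=-1)

def H_inv(y):
    # Invertible because the pass-through block x2 is available to undo h_i.
    y1, y2 = y[..., :d], y[..., d:]
    return np.concatenate([y1 - g(y2), y2], axis=-1)

x = rng.standard_normal((3, n))
assert np.allclose(H_inv(H(x)), x)
```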
diffeomorphism f̃ : R^m → R^m, ℓ ≥ k, see (Hirsch, 2012, Ch. 2, Theorem 2.7). Because of this, we have to pay attention to the smoothness of the maps in the subset F ⊂ emb(K, R^m).

Definition 3.7 (Extendable Embeddings). We define the set of Extendable Embeddings as

I(R^n, R^m) := D ◦ L,
D = Diff^1(R^m, R^m),
L = { L ∈ R^{m×n} : rank(L) = n },

where Diff^k(R^m, R^m) is the set of C^k-smooth diffeomorphisms from R^m to itself. Note that I(R^n, R^m) ⊂ emb(R^n, R^m).

The word extendable in the name extendable embeddings refers to the fact that the family D in Definition 3.7 is a proper subset of emb(L(K), R^m) for some compact K ⊂ R^n and linear L ∈ R^{m×n}. Mappings in the set D are embeddings D : L(K) → R^m that extend to diffeomorphisms from all of R^m to itself. Said differently, a D ∈ D is a map in emb^1(L(K), R^m) that can be extended to a map D̃ ∈ Diff^1(R^m, R^m) such that D̃|_{L(K)} = D. This distinction is important, as there are maps in emb^1(L(K), R^m) that can not be extended to diffeomorphisms on all of R^m, as can be seen from the counterexample developed at the beginning of this section.

We also present here a theorem stating that when m is more than three times larger than n, any differentiable embedding from a compact K ⊂ R^n to R^m is necessarily extendable.

Theorem 3.8. When m ≥ 3n + 1 and k ≥ 1, for any C^k embedding f ∈ emb^k(R^n, R^m) and compact set K ⊂ R^n, there is a map E in the closure of I^k(R^n, R^m) (that is, E is in the closure of the set of flow-type neural networks) such that E(K) = f(K). Moreover,

I^k(K, R^m) = emb^k(K, R^m).   (9)

The proof of Theorem 3.8 is in Appendix C.3.3. We also remark here that the proof of the above theorem relies on the so-called 'clean trick' from differential topology. This trick is related to the fact that in R^4 all knots can be reduced to the simple knot continuously.

3.4. Universality

We now combine the notions of universality and extendable embeddings to produce a result stating that many commonly used networks of the form studied in Section 2 have the MEP.

Lemma 3.9. (i) If R ⊂ emb(R^n, R^m) is a uniform universal approximator of C(R^n, R^m) and I ∈ T, where I is the identity map, then E := T ◦ R has the MEP w.r.t. emb(R^n, R^m).

(ii) If R is such that there is an injective R ∈ R and an open set U ⊂ R^o such that R|_U is linear, and T is a sup-universal approximator, in the sense of (Teshima et al., 2020), in the space Diff^2(R^m, R^m) of C^2-smooth diffeomorphisms, then E := T ◦ R has the MEP w.r.t. I(R^n, R^m).

For uniform universal approximators that satisfy the assumptions of (i), see e.g. (Puthawala et al., 2020). The proof of Lemma 3.9 is in Appendix C.4.1. It has the following implications for the architectures studied in Section 2.

Example 1. Let E := T ◦ R and (T1), (T2), (R1), ..., (R4) be as described in Section 2. Then

(i) If T is either (T1) or (T2) and R is (R4), then E has the m, n, o MEP w.r.t. emb(R^n, R^m).

(ii) If T is (T2) with sigmoidal activations (Huang et al., 2018) and R is any of (R1), ..., (R4), then E has the m, n, o MEP w.r.t. I(R^n, R^m).

The proof of Example 1 is in Appendix C.4.2.

We now present our universal approximation result for networks given in Eqn. 1 and a decoupling property. Below, we say that a measure µ in R^n is absolutely continuous if it is absolutely continuous w.r.t. the Lebesgue measure.

Theorem 3.10. Let n_0 = n, n_L = m, K ⊂ R^n be compact, and µ ∈ P(K) be an absolutely continuous measure. Further let, for each ℓ = 1, ..., L, E_ℓ^{n_{ℓ-1},n_ℓ} := T_ℓ^{n_ℓ} ◦ R_ℓ^{n_{ℓ-1},n_ℓ}, where R_ℓ^{n_{ℓ-1},n_ℓ} is a family of injective expansive elements that contains a linear map, and T_ℓ^{n_ℓ} is a family of bijective flow networks. Finally, let T_0^n be distributionally universal, i.e. for any absolutely continuous µ ∈ P(R^n) and ν ∈ P(R^n), there is a sequence {T_i}_{i=1,2,...} such that T_{i#}µ → ν in distribution. Let one of the following two cases hold:

(i) f ∈ F_L^{n_{L-1},m} ◦ ⋯ ◦ F_1^{n,n_1} and E_ℓ^{n_{ℓ-1},n_ℓ} have the n_ℓ, n_{ℓ-1}, n_{ℓ-1} MEP for ℓ = 1, ..., L with respect to F_ℓ^{n_{ℓ-1},n_ℓ}.

(ii) f ∈ emb^1(R^n, R^m) is a C^1-smooth embedding, for ℓ = 1, ..., L, n_ℓ ≥ 3n_{ℓ-1} + 1, and the families T_ℓ^{n_ℓ} are dense in Diff^2(R^{n_ℓ}).

Then, there is a sequence {E_i}_{i=1,2,...} ⊂ E_L^{n_{L-1},m} ◦ ⋯ ◦ E_1^{n_1,n} ◦ T_0^n such that

lim_{i→∞} W_2(f_#µ, E_{i#}µ) = 0.   (10)

The proof of Theorem 3.10 is in Appendix C.4.3. The results of Theorems 3.8 and 3.10 have a simple interpretation,
There is a gap between the negation of Theorem 3.10 and Lemma 3.11. That is, it is possible for a family of functions E to satisfy Lemma 3.11 but nevertheless satisfy the conclusion of Theorem 3.10; these functions approximate measures without matching manifolds. Theorem 3.10 considers approximating measures, whereas Lemma 3.11 refers to matching manifolds exactly. As discussed in Section 3.3, there are no extendable embeddings that map S^1 to the trefoil knot in R^3. Nevertheless, it is possible to construct a sequence of functions (E_i)_{i=1,...} so that W_2(ν, E_{i#}µ) → 0, where µ and ν are the uniform distributions on S^1 and the trefoil knot, respectively.

The proof of Theorem 3.10 also implies the following result which, loosely speaking, says that optimality of later layers can be determined without requiring optimality of earlier layers, while still having a network that is end-to-end optimal. The conditions and result of this are visualized on a toy example in Figure 3.

Corollary 3.12. Let F^{n,o} ⊂ emb(R^n, R^o), F^{o,m} ⊂ emb(R^o, R^m), and let E^{o,m} ⊂ emb(R^o, R^m) have the m, n, o MEP w.r.t. F^{o,m} ◦ F^{n,o}. Then for every f ∈ F^{o,m} ◦ F^{n,o} and compact sets K ⊂ R^n and W ⊂ R^o
provably the least-squares minimizer. Further, if each expansive layer is any combination of (R1), (R2), or (R3), then the entire network can be inverted end-to-end by using either the above result or by solving the normal equations directly.
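As a hedged illustration of the normal-equations route for a single linear expansive layer: if R(x) = Wx with W of full column rank, the least-squares preimage of a point y solves W^T W x = W^T y. The matrix and data below are arbitrary placeholders, not weights from a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 8
W = rng.standard_normal((m, n))          # injective linear expansive layer (rank n)
x_true = rng.standard_normal(n)
y = W @ x_true + 1e-3 * rng.standard_normal(m)  # noisy point near the range

# Solve the normal equations W^T W x = W^T y; W^T W is invertible because
# W has full column rank, so this gives the least-squares preimage.
x_hat = np.linalg.solve(W.T @ W, W.T @ y)

# W @ x_hat is the orthogonal projection of y onto the range of W.
print(np.linalg.norm(x_true - x_hat))
```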
4. Conclusion

Bijective flow networks are a powerful tool for learning push-forward mappings in a space of fixed dimension. Increasingly, these flow networks have been used in combination with networks that increase dimension in order to produce networks which are purportedly universal.

In this work, we have studied the theory underpinning these flow and expansive networks by introducing two new notions, the embedding gap and the manifold embedding property. We show that these notions are both necessary and sufficient for proving universality, but require important topological and geometrical considerations which are, heretofore, under-explored in the literature. We also find that optimality of the studied networks can be established 'in reverse,' by minimizing the embedding gap, which we expect opens the door to convergence of layer-wise training schemes. Without compromising universality, we can also use specific expansive layers with a new layerwise projection result.

References

Ambrosio, L. and Gigli, N. A user's guide to optimal transport. In Modelling and optimisation of flows on networks, volume 2062 of Lecture Notes in Math., pp. 1–155. Springer, Heidelberg, 2013. doi: 10.1007/978-3-642-32160-3_1. URL https://doi.org/10.1007/978-3-642-32160-3_1.

Ambrosio, L., Gigli, N., and Savaré, G. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.

Ardizzone, L., Kruse, J., Wirkert, S., Rahner, D., Pellegrini, E. W., Klessen, R. S., Maier-Hein, L., Rother, C., and Köthe, U. Analyzing inverse problems with invertible neural networks. arXiv preprint arXiv:1808.04730, 2018.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

Arridge, S., Maass, P., Öktem, O., and Schönlieb, C.-B. Solving inverse problems using data-driven models. Acta Numerica, 28:1–174, 2019.

Billingsley, P. Convergence of probability measures. Wiley Series in Probability and Statistics: Probability and Statistics. John Wiley & Sons, Inc., New York, second edition, 1999. ISBN 0-471-19745-9. doi: 10.1002/9780470316962. URL https://doi.org/10.1002/9780470316962. A Wiley-Interscience Publication.

Durkan, C., Bekasov, A., Murray, I., and Papamakarios, G. Cubic-spline flows. arXiv preprint arXiv:1906.02145, 2019a.
Durkan, C., Bekasov, A., Murray, I., and Papamakarios, G. Neural spline flows. Advances in Neural Information Processing Systems, 32:7511–7522, 2019b.

Golub, G. H. Matrix computations. Johns Hopkins University Press, 1996.

Gomez, A. N., Ren, M., Urtasun, R., and Grosse, R. B. The reversible residual network: Backpropagation without storing activations. In Advances in Neural Information Processing Systems, pp. 2214–2224, 2017.

Grathwohl, W., Chen, R. T., Bettencourt, J., Sutskever, I., and Duvenaud, D. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.

Hirsch, M. W. Differential topology, volume 33. Springer Science & Business Media, 2012.

Huang, C.-W., Krueger, D., Lacoste, A., and Courville, A. Neural autoregressive flows. In International Conference on Machine Learning, pp. 2078–2087. PMLR, 2018.

Jacobsen, J.-H., Smeulders, A., and Oyallon, E. i-RevNet: Deep invertible networks. arXiv preprint arXiv:1802.07088, 2018.

Jaini, P., Selby, K. A., and Yu, Y. Sum-of-squares polynomial flow. In International Conference on Machine Learning, pp. 3009–3018. PMLR, 2019.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pp. 10215–10224, 2018.

Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934, 2016.

Kobyzev, I., Prince, S., and Brubaker, M. Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

Kothari, K., Khorashadizadeh, A., de Hoop, M., and Dokmanić, I. Trumpets: Injective flows for inference and inverse problems. arXiv preprint arXiv:2102.10461, 2021.

Kruse, J., Detommaso, G., Scheichl, R., and Köthe, U. HINT: Hierarchical invertible neural transport for density estimation and Bayesian inference. arXiv preprint arXiv:1905.10687, 2019.

Kruse, J., Ardizzone, L., Rother, C., and Köthe, U. Benchmarking invertible architectures on inverse problems. arXiv preprint arXiv:2101.10763, 2021.

Lee, H., Ge, R., Ma, T., Risteski, A., and Arora, S. On the ability of neural nets to express distributions. In Conference on Learning Theory, pp. 1271–1296. PMLR, 2017.

Lei, Q., Jalal, A., Dhillon, I. S., and Dimakis, A. G. Inverting deep generative models, one layer at a time. In Advances in Neural Information Processing Systems, pp. 13910–13919, 2019.

Lu, Y. and Lu, J. A universal approximation theorem of deep neural networks for expressing distributions. arXiv preprint arXiv:2004.08867, 2020.

Madsen, I. H., Tornehave, J., et al. From calculus to cohomology: de Rham cohomology and characteristic classes. Cambridge University Press, 1997.

Milnor, J. On manifolds homeomorphic to the 7-sphere. Annals of Mathematics, pp. 399–405, 1956.

Mukherjee, A. Differential topology. Hindustan Book Agency, New Delhi; Birkhäuser/Springer, Cham, second edition, 2015. ISBN 978-3-319-19044-0; 978-3-319-19045-7. doi: 10.1007/978-3-319-19045-7. URL https://doi.org/10.1007/978-3-319-19045-7.

Müller, S. Uniform approximation of homeomorphisms by diffeomorphisms. Topology and its Applications, 178:315–319, 2014.

Murasugi, K. Knot theory & its applications. Modern Birkhäuser Classics. Birkhäuser Boston, Inc., Boston, MA, 2008. ISBN 978-0-8176-4718-6. doi: 10.1007/978-0-8176-4719-3. URL https://doi.org/10.1007/978-0-8176-4719-3. Translated from the 1993 Japanese original by Bohdan Kurpita, Reprint of the 1996 translation [MR1391727].

Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762, 2019.

Puthawala, M., Kothari, K., Lassas, M., Dokmanić, I., and de Hoop, M. Globally injective ReLU networks. arXiv preprint arXiv:2006.08464, 2020.

Rolnick, D. and Körding, K. Reverse-engineering deep ReLU networks. In International Conference on Machine Learning, pp. 8178–8187. PMLR, 2020.

Séquin, C. H. Tori story. In Sarhangi, R. and Séquin, C. H. (eds.), Proceedings of Bridges 2011: Mathematics, Music, Art, Architecture, Culture, pp. 121–130. Tessellations Publishing, 2011. ISBN 978-0-9846042-6-5.
A. Summary of Notation
Throughout the paper we make heavy use of the following notation.
1. Unless otherwise stated, X and Y always refer to subsets of Euclidean space, and K and W always refer to compact
subsets of Euclidean space.
2. f ∈ C(X, Y ) means that f : X → Y is continuous.
3. For families of functions F and G, where each f ∈ F maps X → Y and each g ∈ G maps Y → Z, we define G ◦ F = {g ◦ f : X → Z : f ∈ F, g ∈ G}.
4. f ∈ emb(X, Y) means that f ∈ C(X, Y) is continuous and injective, i.e. an embedding, and furthermore that f^{-1} : f(X) → X is continuous.
5. µ ∈ P(X) means that µ is a probability measure over X.
6. W2 (µ, ν) for µ, ν ∈ P(X) refers to the Wasserstein-2 distance, always with `2 ground metric.
7. k·kLp (X) refers to the Lp norm of functions, from X to R.
8. For vector-valued f : X → Y, ||f||_{L∞(X)} = ess sup_{x∈X} ||f(x)||_2. Note that Y is always finite dimensional, and so all discrete ℓ^q norms, 1 ≤ q ≤ ∞, are equivalent.
9. Lip(f) refers to the Lipschitz constant of f.
10. For x ∈ R^n, [x]_i ∈ R is the i'th component of x. Similarly, for a matrix A ∈ R^{m×n}, [A]_{ij} refers to the element in the i'th row and j'th column.
C. Proofs
C.1. Main Results
C.1.1. Embedding Gap

To aid all of our subsequent proofs, we first present the following lemma, which collects inequalities and identities for the embedding gap.
Lemma C.1. For all of the following results, f ∈ emb(K, R^m), g ∈ emb(W, R^m) and n ≤ o ≤ m.

1.

2. Let X, Y ⊂ W, let g be Lipschitz on W, and let r ∈ emb(X, Y). Then there is an r' ∈ emb(g(X), g(Y)) such that g ◦ r = r' ◦ g and ||I − r'||_{L∞(g(X))} ≤ ||I − r||_{L∞(X)} Lip(g).

3.

4. Let K ⊂ R^n, X ⊂ R^p and W ⊂ R^o be compact sets. Also, let f ∈ emb(K, W) and h ∈ emb(X, W), and let g ∈ emb(W, R^m) be a Lipschitz map. Then

5. B_{K,W}(f, g) ≤ sup_{x∈K} ||g ◦ h(x) − f(x)||_2, where h ∈ emb(K, R^o) is a map satisfying h(K) ⊂ W.

6. For any X that is the closure of an open set, if h ∈ emb(X, W) then

8. For any r ∈ emb(f(K), g(W)) and h ∈ emb(X, W), where X ⊂ R^p is the closure of a set U which is open in the subspace topology of some vector space of dimension p, where n ≤ p ≤ o, we have that

Proof. 1. We compute

||I − r||_{L∞(f(K))} = sup_{x_n ∈ K} ||(I − r)f(x_n)||_2 = sup_{x_n ∈ K} ||f(x_n) − r ◦ f(x_n)||_2
2. g is injective on X, hence we can define r' := g ◦ r ◦ g^{-1} : g(X) → g(r(X)) ⊂ g(Y) such that r' ∈ emb(g(X), g(Y)), and thus for all x ∈ X,

But r is clearly surjective onto its range, hence taking the supremum over all x ∈ X yields

||I − r^{-1}||_{L∞(r(K))} = ||I − r||_{L∞(K)}   (28)

4. As g ∈ emb(W, R^m), the map g : W → g(W) is a homeomorphism and there is g^{-1} ∈ emb(g(W), W). For a map r ∈ emb(g ◦ f(K), g ◦ h(X)), we see that r̂ = g^{-1} ◦ r ◦ g ∈ emb(f(K), h(X)). The converse also holds: if r̂ ∈ emb(f(K), h(X)) then r = g ◦ r̂ ◦ g^{-1} ∈ emb(g ◦ f(K), g ◦ h(X)). Thus

6. Given that g ◦ h(X) ⊂ g(W), we have that emb(f(K), g ◦ h(X)) ⊂ emb(f(K), g(W)); thus the infimum in Eqn. 6 is taken over a smaller set, and so B_{K,W}(f, g) ≤ B_{K,X}(f, g ◦ h).

7. Note that for any r' ∈ emb(r ◦ f(K), g(W)), r' ◦ r ∈ emb(f(K), g(W)), and so we have

where we have used that r is injective for the final equality. This holds for all possible r', hence we have the result.
8. Recall that f ∈ emb(K, W), g ∈ emb(W, R^m), h ∈ emb(X, W) and r ∈ emb(f(K), g(W)). Then g^{-1} ∈ emb(g(W), W). As r ◦ f(K) ⊂ g(W), we see that

r ◦ f = g ◦ g^{-1} ◦ r ◦ f.

W_2(f_#µ, g_#µ') = W_2(f_#µ, (r ◦ f)_#µ)   (33)

and so

W_2(f_#µ, g_#µ') ≤ ( ∫_K ||I − r||_2^2 d f_#µ )^{1/2} ≤ B_{K,W}(f, g) + ε.   (34)

As the set W is compact, by Prokhorov's theorem, see (Billingsley, 1999, Theorem 5.1), the set of probability measures P(W) is a compact set in the topology of weak convergence. Thus there is a sequence ε_i → 0 such that the measures µ'_i converge weakly to a probability measure µ_o. As g : W → K is a continuous function, the push-forward operation µ → g_#µ is continuous, g_# : P(W) → P(K), and thus g_#µ'_i converge weakly to g_#µ_o. Finally, as the g_#µ'_i are supported in a compact set K, their second moments converge to those of g_#µ_o as i → ∞. By (Ambrosio & Gigli, 2013), Theorem 2.7, see also Remark 28, the weak convergence and the convergence of the second moments imply convergence in the Wasserstein-2 metric. Hence, g_#µ'_i converge to g_#µ_o in the Wasserstein-2 metric and we see that
for all E_1^{p,o} ∈ E_1^{p,o} and for all compact sets W ⊂ R^p that satisfy E_1^{p,o}(W_1) ⊂ W. We observe that if W' ⊂ R^p is a compact set such that W' ⊂ W, we have

Thus, inequality Eq. 38 holds for all E_1^{p,o} ∈ E_1^{p,o} and for all compact sets W ⊂ R^p. Summarising, we have seen that there are f ∈ F and ε > 0 such that for all E_1^{p,o} ∈ E_1^{p,o} and for all compact sets W ⊂ R^p we have ε ≤ B_{K,W}(f, E_2^{o,m} ◦ E_1^{p,o}). Hence E_2^{o,m} does not have the m, n, o MEP with respect to F, and we have obtained a contradiction, which proves the result.
where f : S^1 → Σ_1 ⊂ R^3 is a smooth embedding of S^1 to a trefoil knot Σ_1, v(θ) ∈ R^3 is a unit vector normal to Σ_1 at the point f(θ) such that v(θ) is a smooth function of θ, and a > 0 is a small number. In this case, M_1 = F(K) is a 2-dimensional submanifold of R^3 with boundary, which can be visualized as a knotted ribbon.
We now show that there are no maps E = T ◦ R such that E(K) = F(K), where T : R^3 → R^3 is an embedding and R : R^2 → R^3 is injective and linear. The key insight is that if such a T existed, then the trefoil knot would be equivalent to S^1 in R^3, which is known to be false.
Let U_ρ(A) denote the ρ-neighborhood of the set A in R^3. It is easy to see that R^2 \ ({0} × [−1, 1]) is homeomorphic to R^2 \ B_{R^2}(0, 1), which is further homeomorphic to R^2 \ {0}. Thus, using tubular coordinates near Σ_1 and a sufficiently small ρ > 0, we see that R^3 \ M_1 is homeomorphic to R^3 \ U_ρ(Σ_1), which is further homeomorphic to R^3 \ Σ_1. Also, when R : R^2 → R^3 is an injective linear map, we see that M_2 = R(K) is an un-knotted band in R^3 and R^3 \ M_2 is homeomorphic to R^3 \ Σ_2. If R^3 \ M_1 and R^3 \ M_2 were homeomorphic, then R^3 \ Σ_1 and R^3 \ Σ_2 would also be homeomorphic, which is not possible by knot theory; see (Murasugi, 2008, Definition 1.3.1 and Theorem 1.3.1). This shows that there are no injective linear maps R : R^2 → R^3 and homeomorphisms Φ : R^3 → R^3 such that (Φ ◦ R)(K) = M_1.
Similar examples can be obtained in higher-dimensional cases by using a knotted torus (Séquin, 2011)⁶ and its Cartesian products.

⁶ On the knotted torus, see http://gallery.bridgesmathart.org/exhibitions/2011-bridges-conference/sequin.
be the sphere bundle of R^m, that is, a manifold of dimension 2m − 1. By the proofs of Whitney's embedding theorem by Hirsch, (Hirsch, 2012, Chapter 1, Theorems 3.4 and 3.5), there is a set of 'problem points' H_1 ⊂ S^{m−1} of Hausdorff dimension 2n such that for all w ∈ R^m \ H_1 the orthogonal projection

P_w : R^m → {w}^⊥ = {y ∈ R^m : y ⊥ w}

P_w|_M : M → {w}^⊥.

Moreover, let T_xM be the tangent space of the manifold M at the point x and let us define another set of 'problem points' as

H_2 = {v ∈ S^{m−1} : ∃x ∈ M, v ∈ T_xM}.

Z(y) ∈ M,   P_w(Z(y)) = y,

that is, it is the inverse of P_w|_M : M → P_w(M), where P_w(M) ⊂ {w}^⊥. Let g : Ñ = P_w(M) → R be the function

Φ_1(M) ⊂ {w}^⊥.
In the case when m ≥ 3n + 1, we can repeat this construction n times. This is possible as m − n ≥ 2n + 1. Then we obtain C^k-diffeomorphisms Φ_j : R^m → R^m, j = 1, ..., n, such that their composition Φ_n ◦ ⋯ ◦ Φ_1 : R^m → R^m is a C^k-diffeomorphism for which

M_0 = Φ_n ◦ ⋯ ◦ Φ_1(M) ⊂ Y',

where Y' ⊂ R^m is an (m − n)-dimensional linear space. By letting Ψ = Q ◦ Φ_n ◦ ⋯ ◦ Φ_1 for a rotation matrix Q ∈ R^{m×m}, we have that Y := Q(Y') = {0}^n × R^{m−n}. Also, let X = R^n × {0}^{m−n}, A = Q(M_0) ⊂ X, and let φ : X → R^m be the map

φ(x, 0) = Ψ(f(x)) ∈ Y,

where f is the function given in Eq. 39 and B = Ψ(f(A)) ⊂ Y. Then A is a C^k-submanifold of X, B is a C^k-submanifold of Y, and φ : A → B is a C^k-diffeomorphism. We observe that m − n ≥ 2n + 1 and so we can apply (Madsen et al., 1997, Lemma 7.6) to extend φ to a C^k-diffeomorphism

h : R^m → R^m

such that h|_A = φ. Note that (Madsen et al., 1997, Lemma 7.6) concerns an extension of a homeomorphism, but as the extension h is given by an explicit formula which is locally a finite sum of C^k functions, the same proof gives a C^k-diffeomorphic extension h of the diffeomorphism φ. Indeed, let A' ⊂ R^n and B' ⊂ R^{m−n} be such sets that A = A' × {0}^{m−n} and B = {0}^n × B'. Moreover, let φ̃ : A' → R^{m−n} and ψ̃ : B' → R^n be such C^k-smooth maps that φ(x, 0) = (0, φ̃(x)) for (x, 0) ∈ A and φ^{-1}(0, y) = (ψ̃(y), 0) for (0, y) ∈ B. As A' and B' are C^k-submanifolds, the map φ̃ has a C^k-smooth extension f_1 : R^n → R^{m−n} and the map ψ̃ has a C^k-smooth extension f_2 : R^{m−n} → R^n, that is, f_1|_{A'} = φ̃ and f_2|_{B'} = ψ̃. Following (Madsen et al., 1997, Lemma 7.6), we define the maps h_1 : R^n × R^{m−n} → R^n × R^{m−n},

h_1(x, y) = (x, y + f_1(x)),

and h_2 : R^n × R^{m−n} → R^n × R^{m−n},

h_2(x, y) = (x + f_2(y), y).

Observe that h_2 has the inverse map h_2^{-1}(x, y) = (x − f_2(y), y). Then the map

h = h_2^{-1} ◦ h_1 : R^n × R^{m−n} → R^n × R^{m−n}

is a C^k-diffeomorphism that satisfies h|_A = φ. This technique is called the 'clean trick'.
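The mechanics of the clean trick can be checked numerically. The sketch below takes n = 1, m = 2, with the illustrative choices φ̃ = sinh (its own global smooth extension f_1) and ψ̃ = arcsinh (extension f_2); these maps are assumptions made for the example, not the maps appearing in the proof. It verifies that h = h_2^{-1} ◦ h_1 agrees with φ(a, 0) = (0, φ̃(a)) on A = A' × {0}.

```python
import numpy as np

f1 = np.sinh      # extends phi_tilde : A' -> R^{m-n}
f2 = np.arcsinh   # extends psi_tilde : B' -> R^n

def h1(x, y):
    # h1(x, y) = (x, y + f1(x)): shear in the second block of coordinates
    return x, y + f1(x)

def h2_inv(x, y):
    # inverse of h2(x, y) = (x + f2(y), y)
    return x - f2(y), y

def h(x, y):
    # h = h2^{-1} o h1 is a smooth diffeomorphism of R^n x R^{m-n}
    return h2_inv(*h1(x, y))

# On points (a, 0) of A = A' x {0}, h agrees with phi(a, 0) = (0, phi_tilde(a)).
a = np.linspace(-2.0, 2.0, 5)
hx, hy = h(a, np.zeros_like(a))
assert np.allclose(hx, 0.0) and np.allclose(hy, np.sinh(a))
print(np.stack([hx, hy], axis=1))
```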
Finally, to obtain the claim, we observe that when R : R^n → R^m, R(x) = (x, 0) ∈ R^n × {0}^{m−n}, is the zero-padding operator, we have

f(x) = Ψ^{-1}(φ(R(x))),   x ∈ R^n.

As h|_X = φ and R(x) ∈ X, this yields

that is,

f = E ◦ R,

where E = Ψ^{-1} ◦ h : R^m → R^m is a C^k-diffeomorphism. Thus f ∈ I^k(R^n, R^m). This proves Eq. 9 when m ≥ 3n + 1.
C.4. Universality

C.4.1. The Proof of Lemma 3.9

The proof of Lemma 3.9. (i) Let us consider ε > 0, a compact set K ⊂ R^n and f ∈ emb(R^n, R^m). Let W = K × {0}^{o−n} and let F : R^o → R^m be the map given by F(x, y) = f(x), (x, y) ∈ R^n × R^{o−n}. Because R^{o,m} ⊂ emb(R^o, R^m) is a uniform universal approximator of C(R^n, R^m), there is an R ∈ R^{o,m} such that ||F − R||_{L∞(W)} < ε. Then for the map E = I ◦ R we have that B_{K,W}(f, E) < ε. This is true for every ε > 0, and so E^{o,m} has the MEP w.r.t. the family emb(R^n, R^m).
(ii) Recall that f := Φ_0 ◦ R_0 for Φ_0 ∈ Diff^1(R^m, R^m) and linear R_0 : R^n → R^m, and that R ∈ R is such that R|_U is linear for an open set U. We present the proof in the case when n = o, and we make the assumption that R|_K is linear. In this case, we have the existence of an affine map A : R^m → R^m so that R_0 = A ◦ R, so that K̃ := R_0(K) = A(R(K)).

Let ε > 0 be given. By (Hirsch, 2012, Chapter 2, Theorem 2.7), the space Diff^2(R^m, R^m) is dense in the space Diff^1(R^m, R^m), and so there is some Φ_1 ∈ Diff^2(R^m, R^m) such that

||Φ_1|_{K̃} − Φ_0|_{K̃}||_{L∞(K̃; R^m)} < ε/2.

Then, let T ∈ T^m be such that ||T − Φ_1 ◦ A||_{L∞(R(K); R^m)} < ε/2. Then we have that

Hence, if we let r = T ◦ R ◦ f^{-1} ∈ emb(f(K), T ◦ R(K)), then we obtain that B_{K,K}(f, T ◦ R) < ε. This holds for any ε, and hence we have that T ◦ R has the MEP for I(R^n, R^m).

The proof in the case that o ≥ n follows with minor modification, applying Lemma C.1 point 5.
(ii) Let T^m be the family of autoregressive flows with sigmoidal activations defined in (Huang et al., 2018). By (Teshima et al., 2020, App. G, Theorem 1 and Proposition 7), T^m is a sup-universal approximator in the space Diff^2(R^m, R^m) of C^2-smooth diffeomorphisms Φ : R^m → R^m. When R^{o,m} is one of (R1) or (R2) the network is always linear, hence the conditions are satisfied. If R^{o,m} is (R4), then R^{o,m} contains linear mappings, and if it is (R3), then we can shift the origin so that R(x) is linear on K. In all cases, Lemma 3.9 part (ii) applies.
Next, from the universality of T_0^n, for any ε_2 > 0 we have the existence of a T_0 ∈ T_0^n so that W_2(µ'_0, T_{0#}µ) < ε_2. From Lemma C.1 points 7 and 8 we have that

W_2(f_#µ, (Ẽ ◦ T_0)_#µ) ≤ 2ε_1 + ε_2 Lip(Ẽ).   (42)

has the MEP w.r.t. F = I^1(R^n, R^m). By Theorem 3.8, we have that the closure of I^1(R^n, R^m) coincides with the space emb^1(R^n, R^m). Finally, the assumption that T_0^{n_0} is dense in the space of C^2-diffeomorphisms Diff^2(R^{n_ℓ}) implies that T_0^{n_0} is an L^p-universal approximator for the set of C^∞-smooth triangular maps for all p < ∞. Hence, by Lemma 3 in Appendix A of (Teshima et al., 2020), T_0^{n_0} is distributionally universal. From these facts, the claim in case (ii) follows in the same way as case (i), using the family F = emb^1(R^n, R^m).
Because E^{n,o}(X, W) has the o, n, n MEP, for each i = 1, ... we can find an E'_i ∈ E^{n,o}(X, W) such that B_{K,X}(E_i^{-1} ◦ r_i ◦ f, E'_i) ≤ ε_i/(1 + Lip(E_i)), and so B_{K,X}(f, E_i ◦ E'_i) ≤ 3ε_i. For this choice of E'_i, we have that lim_{i→∞} B_{K,X}(f, E_i ◦ E'_i) = 0.

From Lemma C.1 point 9, we have that for any absolutely continuous µ ∈ P(K), there is an absolutely continuous µ' ∈ P(X) such that W_2(f_#µ, (E_i ◦ E'_i)_#µ') ≤ 3ε_i. By the universality of T^n, continuity of E_i ◦ E'_i, and absolute continuity of µ and µ', we have the existence of T_i ∈ T^n so that
Figure 5: An example showing how the unknot (left) can be deformed to approximate the trefoil knot (right). The black parts of both knots are identical, and the red section can be made arbitrarily skinny by bringing the black points together. This can be done while sending the measure of the red sections to zero, if the starting measure has no atoms. In this way, we can construct a sequence of diffeomorphisms (E_i)_{i=1,...} so that W_2(E_{i#}µ, ν) → 0, where µ is the uniform measure on S^1 and ν the uniform measure on the trefoil knot. We would like to thank Reviewer 4 for suggesting this discussion and providing the figure (in tikz code!).
lim_{i→∞} W_2(E_{i#}µ, ν) = 0.
Proof. The key idea of the construction is shown in Figure 5. In that figure, the unknot is bent so that it overlaps the trefoil knot outside of an exceptional set (shown in red in Figure 5), which can be made as small as desired. The result follows by constructing a sequence of functions which 'squeeze' this red section as small as possible.
Let µ be the uniform probability measure on S 1 ⊂ R2 , and ν the uniform probability measure on the trefoil knot, M. Let
R : R2 → R3 be a fixed linear map of the form R(x) = (x, 0).
We define a sequence (X_i)_{i=1,...} of unknots in the following way. For any choice of two points on the top of the trefoil knot, as shown in black in Figure 5a, we can replace the straight-line red section with a U-shaped section as shown in Figure 5a so that the resulting knot is the unknot. We obtain X_1 by letting the black points be a distance 1 apart, X_2 by letting them be a distance 1/2 apart, and so on, so that for X_i the two points are a distance 1/i apart. Further, for each X_i, we define A_i and B_i, where A_i is the U-shaped piece of X_i (in red) and B_i = X_i \ A_i. Observe that B_i ⊂ M.
Let (T'_i)_{i=1,...} be a family of diffeomorphisms so that T'_i : R^3 → R^3 maps S^1 × {0} to X_i. Further, let (T''_i)_{i=1,...} be such that T''_i : X_i → X_i and χ_{B_i}(T''_i ◦ T'_i ◦ R)_#µ = χ_{B_i}ν, where χ_{B_i} is the characteristic function of the set B_i.
Then we define E_i := T''_i ◦ T'_i ◦ R and compute

W_2(E_{i#}µ, ν) ≤ W_2(χ_{A_i}E_{i#}µ, χ_{M\B_i}ν) + W_2(χ_{B_i}E_{i#}µ, χ_{B_i}ν)
             = W_2(χ_{A_i}E_{i#}µ, χ_{M\B_i}ν).
As i increases, the length of M \ B_i goes to zero, thus ν(M \ B_i) = µ(A_i) converges to zero. Hence taking limits yields

lim_{i→∞} W_2(E_{i#}µ, ν) ≤ lim_{i→∞} W_2(χ_{A_i}E_{i#}µ, χ_{M\B_i}ν) = 0.
Finally, E_i is certainly an extendable embedding, as R is linear and the T''_i ◦ T'_i are diffeomorphisms.
The above proof also applies when ν or µ have finitely many atoms. The same construction works if Ai is chosen so that
it contains no atoms for sufficiently large i.
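The Wasserstein-2 convergence used here (and in Theorem 3.10) can also be monitored numerically on samples. The sketch below estimates W_2 between two empirical measures with the ℓ2 ground metric by solving the discrete optimal assignment between equally weighted samples; the circle and trefoil parameterizations are illustrative stand-ins, not the maps E_i constructed in the proof, and the uniform-in-parameter sampling of the trefoil is an assumption made for simplicity.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def empirical_w2(x, y):
    # Exact W2 between two empirical measures with equal numbers of equally
    # weighted samples (an optimal coupling is then a permutation).
    cost = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return np.sqrt(cost[rows, cols].mean())

rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, size=400)

# mu: uniform measure on S^1, pushed into R^3 by the zero-padding map R.
circle = np.stack([np.cos(theta), np.sin(theta), np.zeros_like(theta)], axis=1)

# nu: a measure supported on a trefoil curve (uniform in the parameter theta).
trefoil = np.stack([np.sin(theta) + 2 * np.sin(2 * theta),
                    np.cos(theta) - 2 * np.cos(2 * theta),
                    -np.sin(3 * theta)], axis=1)

print("W2 estimate:", empirical_w2(circle, trefoil))
```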
Next, we show that all function sequences for which the implication of Theorem 3.10 and the conditions of Lemma 3.11 apply are not uniformly Lipschitz. This implies that if they are differentiable, they have unbounded gradients.
Lemma C.2. Let f be continuous and let E_i be a sequence of continuous functions that are uniformly Lipschitz with constant L. Let E_i be such that for all compact subsets K and W of R^n, there is an ε > 0 so that for all i and all r ∈ emb(f(K), E_i(W)), ||I − r||_{L∞(K)} ≥ ε. If µ is the indicator function of d, then lim_{i→∞} W_2(f_#µ, E_{i#}µ) > 0.
Proof. Let E_i be uniformly Lipschitz with constant L. Consider an ε/2 tubular neighborhood of f(K). From the fact that ||I − r||_{L∞(K)} ≥ ε, we have that there is a point x ∈ E_i(W) so that x lies outside of this neighborhood. From the uniform Lipschitzness of E_i, for each i there is a ball B of radius ε/(4L) around x so that all points in E_i ∩ B are more than ε/4 away from f(K). We also have that µ(E_i ∩ B) > c, where c is the volume of the n-dimensional ball. Thus, W_2(f_#µ, E_{i#}µ) > cε/(4L) for each i, and so lim_{i→∞} W_2(f_#µ, E_{i#}µ) > 0.
min_{x∈R^n} ||y − R(x)||_2^2 = min_{x∈R^n} ||M_y(y − Wx)||_2^2 + ||M̃_y y||_2^2.   (45)

But (I^{n×n} − ∆_y − ∆_y D) is a full-rank diagonal matrix (with entries either 1 or [D]_{i,i}), and B is full rank by assumption, hence M_y [B; −DB] is too.
2. Because B is square and full rank, there exists a basis⁷ {b̂_i}_{i=1,...,n} of R^n such that

⟨b̂_j, b_i⟩ = 1 if i = j, and 0 if i ≠ j.   (47)
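A small numerical check of Eqn. (47), under the assumption (consistent with the footnote's choice b̂_j = columns of B^{-1}) that b_i denotes the i'th row of B:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
B = rng.standard_normal((n, n))          # square and full rank with prob. 1
B_inv = np.linalg.inv(B)

# b_i = i'th row of B, bhat_j = j'th column of B^{-1}; then <bhat_j, b_i> is
# the (i, j) entry of B @ B^{-1}, i.e. the identity, which is Eqn. (47).
gram = np.array([[B_inv[:, j] @ B[i, :] for j in range(n)] for i in range(n)])
assert np.allclose(gram, np.eye(n))
```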
Eqn. 50 is clearly minimized by minimizing each term in the sum, hence we search for a minimizer of the i'th term

min_{α_i∈R} ([y]_i − max(α_i, 0))^2 + ([y]_{i+n} − max(−[D]_{ii} α_i, 0))^2.   (51)

Writing f(α_i) for the quantity inside the minimum of Eqn. 51, we consider the positive, negative and zero α_i cases of Eqn. 51 separately and get

min_{α_i∈R^+} f(α_i) = min_{α_i∈R^+} ([y]_i − α_i)^2 + [y]_{i+n}^2 = [y]_{i+n}^2,   (52)

min_{α_i∈R^−} f(α_i) = min_{α_i∈R^−} [y]_i^2 + ([y]_{i+n} + [D]_{ii} α_i)^2 = [y]_i^2,   (53)

f(0) = [y]_i^2 + [y]_{i+n}^2.   (54)

If [y]_{i+n} > [y]_i, then the minimizer of Eqn. 51 is α_i = −[y]_{i+n}/[D]_{ii} < 0. Conversely, if [y]_{i+n} < [y]_i, then the minimizer of Eqn. 51 is α_i = [y]_i > 0. This argument applies for all i = 1, ..., n, and hence if [y]_i ≠ [y]_{i+n} for all i = 1, ..., n, then the minimizing x is unique.

If [y]_i = [y]_{i+n}, then there are exactly two minimizers of f(α_i), −[y]_{i+n}/[D]_{ii} and [y]_i, for both of which f(α_i) = [y]_i^2 = [y]_{i+n}^2.
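The case analysis in Eqns. (52)-(54) can be sanity-checked numerically: for values of [y]_i and [y]_{i+n} drawn positive (so that the minimizers stated in Eqns. (52)-(53) are attained), the minimum of the i'th term of Eqn. (51) over α_i equals min([y]_i^2, [y]_{i+n}^2). The particular ranges of [y]_i, [y]_{i+n} and [D]_{ii} below are arbitrary choices for the check.

```python
import numpy as np

def term(alpha, yi, yin, dii):
    # The i'th term of Eqn. (51), vectorized over alpha.
    return (yi - np.maximum(alpha, 0.0)) ** 2 + (yin - np.maximum(-dii * alpha, 0.0)) ** 2

rng = np.random.default_rng(0)
alphas = np.linspace(-20.0, 20.0, 40001)            # dense grid search over alpha
for _ in range(100):
    yi, yin = rng.uniform(0.1, 3.0, size=2)         # positive entries of y
    dii = rng.uniform(0.5, 5.0)                     # D has positive diagonal
    grid_min = term(alphas, yi, yin, dii).min()
    # Per Eqns. (52)-(54), the minimum over alpha is min(yi^2, yin^2).
    assert abs(grid_min - min(yi ** 2, yin ** 2)) < 1e-3
```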
3. If we suppose that [y]_{i+n} − [y]_i > 0, then [c(y)]_i = 0 and [c(y)]_{i+n} > 0, thus [∆_y]_{ii} = 1. Hence, if we let x_min be the minimizing x from part 1, then

([y]_i − max(⟨x_min, b_i⟩, 0))^2 + ([y]_{i+n} − max(⟨x_min, −[D]_{ii} b_i⟩, 0))^2   (55)
= [y]_i^2 + ([y]_{i+n} − max(⟨x_min, −[D]_{ii} b_i⟩, 0))^2   (56)
= [M̃_y y]_i^2 + [M_y(y − W x_min)]_i^2.   (57)

Thus, combining Eqns. 48, 49, 57 and 60 for each i = 1, ..., n, we have that

min_{x∈R^n} ||y − R(x)||_2^2 = min_{x∈R^n} ||M_y(y − Wx)||_2^2 + ||M̃_y y||_2^2.   (61)
⁷ Namely, the columns of the matrix B^{-1}.
4. For the final point, combining all of the above points we have

min_{x∈R^n} ||y − R(x)||_2^2 = min_{x∈R^n} ||M_y(y − Wx)||_2^2.   (62)

Further, we have from Point 1 that M_y W is full rank, hence (M_y W)^{-1} M_y y = R†(y) is a minimizer of Eqn. 62. If [y]_i ≠ [y]_{i+n} for all i = 1, ..., n, then Part 2 applies, and R†(y) is the unique minimizer of ||y − R(x)||_2^2. In either case, we have that R†(y) is a minimizer.
where φ is coordinate-wise homogeneous of degree 1 (such as ReLU) and W_1 ∈ R^{n_1×n_2} and W_2 ∈ R^{n_2×n_3}. If we let P ∈ R^{n_2×n_2} be any permutation matrix and D_+ be a diagonal matrix with strictly positive elements, then we can write

f(x) = W_2 P^T D_+^{-1} φ(D_+ P W_1 x)   (64)

as well. Thus ReLU networks can only ever be uniquely given subject to these two isometries. When we describe unique recovery in the rest of this section, we mean modulo these two isometries.
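The permutation/positive-rescaling isometry in Eqn. (64) is easy to verify numerically; the sketch below does so for a one-hidden-layer ReLU map f(x) = W_2 φ(W_1 x), written in the usual column-vector convention, which may index the dimensions differently than the text. The weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 6, 3
W1 = rng.standard_normal((n_hidden, n_in))
W2 = rng.standard_normal((n_out, n_hidden))
relu = lambda z: np.maximum(z, 0.0)             # coordinate-wise, 1-homogeneous

# A permutation P and a positive diagonal rescaling D+ of the hidden units.
P = np.eye(n_hidden)[rng.permutation(n_hidden)]
D = np.diag(rng.uniform(0.5, 2.0, n_hidden))

x = rng.standard_normal((n_in, 10))
f = W2 @ relu(W1 @ x)
# Eqn. (64): reparametrized network with identical input-output behaviour.
f_reparam = (W2 @ P.T @ np.linalg.inv(D)) @ relu(D @ P @ W1 @ x)
assert np.allclose(f, f_reparam)
```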
In (Rolnick & Körding, 2020), the authors describe how all parameters of a ReLU network can be recovered uniquely (called reverse engineered in (Rolnick & Körding, 2020)), subject to the so-called 'linear⁸ region assumption' (LRA). The input space R^n can be partitioned into a finite number of open sets {S_i}_{i=1}^{n_i} where, for each i and x ∈ S_i, f(x) = W_i x + b_i, i.e. the network corresponds to an affine polyhedron in the output space. The algorithms (Rolnick & Körding, 2020, Alg.s 1 & 2) are roughly described below.

⁸ The use of 'linear' in this context is somewhat non-standard, and instead means affine. In this section we use the term 'linear region assumption', but use 'affine' where (Rolnick & Körding, 2020) would use 'linear', to preserve mathematical meaning.
First, identify at least one point within each affine polyhedron {H_j}_{j=1}^{n_j}. Then identify the boundaries between polyhedra. The boundaries between sections are always one affine 'piece' of the piecewise hyperplanes {H_j}_{j=1}^{n_j}. These {H_j}_{j=1}^{n_j} are the central objects which indicate the (de)activation of an element of a ReLU somewhere in the network. If the H_j are full hyperplanes, then the ReLU that (de)activates occurs in the first layer of the network. If H_j is not a full hyperplane, then it necessarily has a bend where it intersects another hyperplane H_{j'}. Further, except for a Lebesgue measure 0 set, when H_j intersects H_{j'} the latter does not have a bend. If this is the case, then H_{j'} corresponds to a ReLU (de)activation in an earlier layer than H_j. In this way the activation functions of every layer can be deduced. Once this is done, the normals of the hyperplanes can be used to infer the row-vectors of the various weight matrices, letting one recover the entire network.
The above algorithm recovers all of the weights exactly provided that the LRA is satisfied. The LRA is satisfied if for every pair of distinct S_i and S_{i'}, either W_i ≠ W_{i'} or b_i ≠ b_{i'}. That is, different sign patterns produce different affine sections in the output space. This is a natural assumption, as the reconstruction algorithm described above works by first detecting the boundaries between adjacent affine polyhedra, which is only possible if the LRA holds.
Given the weights of a network, there is, to our knowledge, currently no simple way to detect whether the LRA is satisfied. Nevertheless, the authors of (Rolnick & Körding, 2020) show that if it is satisfied, then unique recovery follows. Recovery of the range of the entire network is possible regardless, but this recovery may not be unique.
In (Bui Thi Mai & Lampert, 2020) the authors also consider the problem of recovering the weights of a ReLU neural network; however, they study the question of when there exist isometries beyond the two natural ones described above. In particular, the main result (Bui Thi Mai & Lampert, 2020, Theorem 1) shows the following. Let E^{n_0,n_L} be a ReLU network that is L layers deep and non-increasing. Suppose that E_1, E_2 ∈ E^{n_0,n_L}, that E_1 and E_2 are general⁹, and that E_1(x) = E_2(x) for all x ∈ R^{n_0}; then E_1 is parametrically identical to E_2 subject to the two natural isometries.

This work provides the stronger result, but it does not apply to the networks that we consider out of the box. It does apply to our expansive elements (provided that they use ReLU activation functions and are non-increasing), but it does not necessarily apply to the network as a whole.
⁹ A set is general in the topological sense if its complement is closed and nowhere dense.