Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Puthawala 22 A

Download as pdf or txt
Download as pdf or txt
You are on page 1of 25

Universal Joint Approximation of Manifolds and Densities by Simple Injective

Flows

Michael Puthawala 1 Matti Lassas 2 Ivan Dokmanić 3 Maarten de Hoop 1

Abstract By design, however, invertible flows are bijective and may


We study approximation of probability measures not be a natural choice when the target distribution has
supported on n-dimensional manifolds embed- low-dimensional support. This problem can be overcome
ded in Rm by injective flows—neural networks by combining bijective flows with expansive, injective lay-
composed of invertible flows and injective lay- ers, which map to higher dimensions (Brehmer & Cranmer,
ers. We show that in general, injective flows be- 2020; Cunningham et al., 2020; Kothari et al., 2021). De-
tween Rn and Rm universally approximate mea- spite their empirical success, the theoretical aspects of such
sures supported on images of extendable em- globally injective architectures are not well understood.
beddings, which are a subset of standard em- In this work, we address approximation-theoretic proper-
beddings: when the embedding dimension m ties of injective flows. We prove that under mild condi-
is small, topological obstructions may preclude tions these networks universally approximate probability
certain manifolds as admissible targets. When measures supported on low-dimensional manifolds and de-
the embedding dimension is sufficiently large, scribe how their design enables applications to inference
m ≥ 3n + 1, we use an argument from alge- and inverse problems.
braic topology known as the clean trick to prove
that the topological obstructions vanish and in- 1.1. Prior Work
jective flows universally approximate any differ-
entiable embedding. Along the way we show that The idea to combine invertible (coupling) layers with ex-
the studied injective flows admit efficient projec- pansive layers has been explored by (Brehmer & Cran-
tions on the range, and that their optimality can mer, 2020) and (Kothari et al., 2021). Brehmer & Cranmer
be established ”in reverse,” resolving a conjec- (2020) combine two flow networks with a simple expansive
ture made in (Brehmer & Cranmer, 2020) element (in the sense made precise in Section 2.1) and ob-
tain a network that parameterizes probability distributions
supported on manifolds.1
1. Introduction Kothari et al. (2021) propose expansive coupling layers
Invertible flow networks emerged as powerful deep learn- and build networks similar to that of Brehmer & Cran-
ing models to learn maps between distributions (Durkan mer (2020) but with an arbitrary number of expressive and
et al., 2019a; Grathwohl et al., 2018; Huang et al., 2018; expansive elements. They observe that the resulting net-
Jaini et al., 2019; Kingma et al., 2016; Kingma & Dhari- work trains very fast with a small memory footprint, while
wal, 2018; Kobyzev et al., 2020; Kruse et al., 2019; Papa- producing high-quality samples on a variety of benchmark
makarios et al., 2019). They generate high-quality samples datasets.
(Kingma & Dhariwal, 2018) and facilitate solving scien- While (to the best of our knowledge) there are no
tific inference problems (Brehmer & Cranmer, 2020; Kruse approximation-theoretic results for injective flows, there
et al., 2021). exists a body of work on universality of invertible flows;
1
Department of Computational and Applied Math, Rice
see Kobyzev et al. (2020) for an overview. Several works
University, Houston, TX, USA 2 Department of Mathematics show that certain bijective flow architectures are distri-
and Statistics, University of Helsinki, Finland 3 Department butionally universal. This was proved for autoregressive
of Mathematics and Computer Science, University of Basel, flows with sigmoidal activations by Huang et al. (2018)
Basel, Switzerland. Correspondence to: Michael Puthawala and for sum-of-squares polynomial flows (Jaini et al.,
<map19@rice.edu>.
1
More precisely, distributions on manifolds are parameterized
Proceedings of the 39 th International Conference on Machine by the pushforward (via their network) of a simple probability
Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copy- measure in the latent space.
right 2022 by the author(s).
Universal Joint Approximation of Manifolds and Densities by Simple Injective Flows

2019). Teshima et al. (2020) show that several flow net- our knowledge, heretofore unknown in the literature. We
works including those from Huang et al. (2018) and Jaini give an example of an absolutely continuous measure µ
et al. (2019) are also universal approximators of diffeomor- and embedding f : R2 → R3 such that f # µ can not be
phisms. approximated with combinations of flow layers and linear
expansive layers. This may be surprising since it was previ-
The injective flows considered here have key applications
ously conjectured that networks such as those of Brehmer
in inference and inverse problems; for an overview of
& Cranmer (2020) can approximate any “nice” density sup-
deep learning approaches to inverse problems, see (Arridge
ported on a “nice” manifold. We establish universality for
et al., 2019). Bora et al. (2017) proposed to regularize
manifolds with suitable topology, described in terms of ex-
compressed sensing problems by constraining the recov-
tendable embeddings. We find that the set of extendable
ery to the range of (pre-trained) generative models. Injec-
embeddings is a proper subset of all embeddings, but when
tive flows with efficient inverses as generative models give
m ≥ 3n + 1, via an application of the clean trick from al-
an efficient algorithmic projection2 on the range, which fa-
gebraic topology, we show that all diffeomorphisms are ex-
cilitates implementation of reconstruction algorithms. An
tendable and thus injective flows approximate distributions
alternative approach is Bayesian, where flows are used to
on arbitrary manifolds. Our universality proof also implies
obtain tractable variational approximations of posterior dis-
that optimality of the approximating network can be estab-
tributions over parameters of interest, via supervised train-
lished in reverse: optimality of a given layer can be estab-
ing on labeled input-output data pairs. Ardizzone et al.
lished without optimality of preceding layers. This settles a
(2018) encode the dimension-reducing forward process by
(generalization of a) conjecture posed for a three-part net-
an invertible neural network (INN), with additional outputs
work (composed of two flow networks and zero padding) in
used to encode posterior variability. Invertibility guaran-
(Brehmer & Cranmer, 2020). Finally, we show that these
tees that a model of the inverse process is learned implicitly.
universal architectures are also practical and admit exact
For a given measurement, the inverse pass of the INN ap-
layer-wise projections, as well as other properties discussed
proximates the posterior over parameters. Sun & Bouman
in Section 3.5.
(2020) propose variational approximations of the posterior
using an untrained deep generative model. They train a nor-
malizing flow which produces samples from the posterior, 2. Architectures Considered
with the prior and the noise model given implicitly by the
Let C(X, Y ) denote the space of continuous functions
regularized misfit functional. In Kothari et al. (2021) this
X → Y . Our goal is to make statements about networks in
procedure is adapted to priors specified by injective flows
F ⊂ C(X, Y ) that are of the form:
which yields significant improvements in computational ef-
n ,nL
ficiency. F = TLnL ◦ RLL−1 ◦ · · · ◦ T1n1 ◦ Rn1 0 ,n1 ◦ T0n0 (1)
n ,n
1.2. Our Contribution where R` `−1 ` ⊂ C(Rn`−1 , Rn` ), T`n` ⊂ C(Rn` , Rn` ),
L ∈ N, n0 = n, nL = m, and n` ≥ n`−1 for ` = 1, . . . , L.
We derive new approximation results for neural networks
We introduce a well-tuned shorthand notation and write H◦
composed of bijective flows and injective expansive layers,
G := {h ◦ g : h ∈ H, g ∈ G} throughout the paper.
including those introduced by (Brehmer & Cranmer, 2020)
and (Kothari et al., 2021). We show that these networks We identify R with the expansive layers and T with the bi-
universally jointly approximate a large class of manifolds jective flows. Loosely speaking, the purpose of the expan-
and densities supported on them. sive layers is to allow the network to parameterize high-
dimensional functions by low-dimensional coordinates in
We build on the results of Teshima et al. (2020) and develop
an injective way. The flow networks give the network
a new theoretical device which we refer to as the embed-
the expressivity necessary for universal approximation of
ding gap. This gap is a measure of how nearly a mapping
manifold-supported distributions.
from Ro → Rm embeds an n-dimensional manifold in Rm ,
where n ≤ o. We find a natural relationship between the
2.1. Expansive Layers
embedding gap and the problem of approximating proba-
bility measures with low-dimensional support. The expansive elements transform an n-dimensional man-
We then relate the embedding gap to a relaxation of univer- ifold M embedded in Rn`−1 , and embed it in a higher
sality we call the manifold embedding property. We show dimensional space Rn` . To preserve the topology of the
that this property captures the essential geometric aspects manifold they are injective. We thus make the following
of universality and uncover important topological restric- assumptions about the expansive elements:
tions on the approximation power of these networks, to Definition 2.1 (Expansive Element). A family of functions
R ⊂ C(Rn , Rm ) is called an family of expansive elements
2
Idempotent but in general not orthogonal. if m > n, and each R ∈ R is both injective and Lipschitz.
Universal Joint Approximation of Manifolds and Densities by Simple Injective Flows

Examples of expansive elements include In Eqn. 3, hi : Rd × Re → Rd is invertible w.r.t. the


first argument given the second, and gi : Rn−d → Re
T
is arbitrary. Typically in practice the operation in Eqn.

(R1) Zero padding: R(x) = xT , 0(m−n) where 0(m−n)
is the zero vector (Brehmer & Cranmer, 2020). 3 is combined with additional invertible operations
such as permutations, masking or convolutions (Dinh
(R2) Multiplication by an arbitrary full-rank matrix, or one- et al., 2014; 2016; Kingma & Dhariwal, 2018).
by-one convolution:
(T2) Autoregressive flows, introduced by Kingma et al.
R(x) = W x, or R(x) = w ? x (2) (2016) are generalizations of triangular flows
A : Rn → Rn where for i = 1, . . . , n the i’th value of
where W ∈ Rm×n and rank(W ) = n (Cunningham
A is given by of the form
et al., 2020), and w is a convolution kernel ? denotes
convolution (Kingma & Dhariwal, 2018).

[A]i (x) = hi [x]i , gi [x]1:i−1 (4)
(R3) Injective ReLU layers: R(x) = ReLU(W x), In Eqn. 4, hi : R × Rm → R where again hi is
 T T
W = B , −DB T , M T , or R(x) = invertible w.r.t. the first argument given the second,
T T
 
ReLU w , −w ? x for matrix B ∈ GLn (R), and gi : Ri−1 → Rm is arbitrary except for g1 = 0.
positive diagonal matrix D ∈ Rn×n , and arbitrary In Huang et al. (2018), the authors choose hi (x, y),
matrix M ∈ R(m−2n)×n (Puthawala et al., 2020). where y ∈ Rm , to be a multi-layer perceptron (MLP)
of the form
(R4) Injective ReLU networks (Puthawala et al., 2020,
Theorem 5). These are functions R : Rn → Rm hi (x, y) = φ ◦ Wp,y ◦ · · · ◦ φ ◦ W1,y (x) (5)
of the form R(x) = WL+1 ReLU(. . . ReLU(W1 x +
b1 ) . . . ) + bL where W` are n`+1 × n` matrices and b` where φ is a sigmoidal increasing non-linear activa-
are the bias vectors in Rn`+1 . The weight matrices WL tion function.
satisfy the Directed Spanning Set (DSS) condition for
` ≤ L (that make all layers injective) and WL+1 is a 3. Main Results
generic matrix which makes the map R : Rn → Rm
injective where m ≥ 2n+1. Note that the DSS condi- 3.1. Embedding Gap
tion requires that n` ≥ 2n`−1 for ` ≤ L and we have We call a function f an embedding and denote it by f ∈
n1 = n and nL+1 = m. emb(X, Y ) if f : X → Y is continuous, injective, and
f −1 : f (X) → X is continuous3 . Also we denote
Continuous piecewise-differentiable functions with by embk (Rn , Rm ) the set of maps f ∈ emb(Rn , Rm ) ∩
bounded gradients are always Lipschitz. Thus, the C k (Rn , Rm ) which differential df |x : Rn → Rm is in-
Lipschitzness assumption is automatically satisfied by jective at all points x ∈ Rn . We now introduce the em-
feed-forward networks with piecewise-differentiable bedding gap, a non-symmetric notion of distance between
activation functions with bounded gradients. This includes f and g. This quantifies the degree to which a mapping
compositions of ReLU and sigmoid layers. g ∈ emb(Ro , Rm ) fails to embed a manifold M = f (K)
for compact K ⊂ Rn where f ∈ emb(K, Rm ). Later in
2.2. Bijective Flow Networks the paper, f will be the function to be approximated, and g
The bulk of our theoretical analysis is devoted to the bijec- an approximating flow-network.
tive flow networks, which bend the range of the expansive Definition 3.1 (Embedding Gap). Let n ≤ p ≤ o ≤ m,
elements into the correct shape. We make the following K ⊂ Rn be compact and non-empty, W ⊂ Ro be compact
assumptions about the expressive elements: and contain the closure of set U which is open in the sub-
Definition 2.2 (Bijective Flow Network). Let T ⊂ space topology of some vector subspace V of dimension
C(Rn , Rn ) for n ∈ N. We call T a family of bijective p, where f ∈ emb(K, Rm ) and g ∈ emb(W, Rm ). The
flow networks if every T ∈ T is Lipschitz and bijective. Embedding Gap between f and g on K and W is

Examples of bijective flow networks include BK,W (f, g) = inf kI − rkL∞ (f (K)) (6)
r∈emb(f (K),g(W ))

3
(T1) Coupling flows, introduced by (Dinh et al., 2014) con- Note that if X is a compact set, then continuity of the of
sider R(x) = Hk ◦ · · · ◦ H1 (x) where f −1 : f (X) → X is automatic, and need not be assumed (Suther-
land, 2009, Cor. 13.27). Moreover, if f : Rn → Rm is a contin-

hi [x]1:d , gi [x]d+1:n
 uous injective map that satisfies |f (x)| → ∞ as |x| → ∞, then
Hi (x) = . (3) by (Mukherjee, 2015, Cor. 2.1.23) the map f −1 : f (Rn ) → Rn
[x]d+1:n is continuous.
Universal Joint Approximation of Manifolds and Densities by Simple Injective Flows

where I : f (K) → f (K) is the identity function and


khkL∞ (X) = ess supx∈X kh(x)k2 for h : X → Y , where
Y is some L∞ space. We refer to the embedding gap be-
tween f and g without specifying K and W when it is clear
from context.
Remark 3.2. As W ⊂ Ro contains U , an open set in V ,
there is an affine map A : Rn → V such that A(K) ⊂ W .
Thus, the map r0 = g ◦ A ◦ f −1 : f (K) → g(W ) is an Figure 1: A visualization of the embedding gap. In all
injective continuous map from a compact set to its range three figures we plot f (K) and gi (W ) for Left: i = 1,
and hence r0 ∈ emb(f (K), g(W )). This proves that the Center: i = 2 and Right: i = 3. Visually, we see that
infimum in 6 is non-empty. gi (W ) approaches f (W ) as i increases, and we compute
Before giving properties of BK,W (f, g), we briefly de- BK,W (f , g1 ) > BK,W (f , g2 ) > BK,W (f , g3 ) = 0.
scribe its interpretation and meaning. We denote by P(X)
the set of probability measures over X. If the embedding
gap between f and g is small, then g −1 ◦r embeds the range 3.2. Manifold Embedding Property
of f for an r that is nearly the identity. Hence g −1 nearly We now introduce a central concept, the manifold embed-
embeds the range of f into Ro . BK,W (f, g) also serves as ding property (MEP). A family of networks has the MEP
an upper bound if it can, as measured by the embedding gap, nearly embed
a large class of manifolds of certain dimension and reg-
inf W2 (f # µn , g # µo ) ≤ BK,W (f, g) ularity. The MEP is a property of a family of functions
µo ∈P(W )
E ⊂ emb(W, Rm ) where W ⊂ Ro . In this manuscript,
where µn ∈ P(K) is given, and W2 (ν1 , ν2 ) denotes E will always be formed by taking E := T ◦ R, where R
the Wasserstein-2 distance with `2 ground metric (Villani, and T are the expansive layers and bijective flow networks
2008). This is proven in Lemma C.1 part 9. The above re- described in sections 2.1 and 2.2 respectively.
sult has a simple meaning in the context of machine learn-
We note here that E having the MEP is closely related to
ing. Suppose we want to learn a generative model g to
the question of whether or not a given n-dimensional man-
(approximately) sample from a probability measure ν with
ifold M = f (K) for f ∈ emb(K, Rm ), K ⊂ Rn , can be
low-dimensional support, by applying g to samples from
approximated by an E ∈ E. This choice of first applying
a base distribution µo . Suppose further that ν is a push-
(possibly non-universal) expansive layers, and then univer-
forward of some (known or unknown) distribution µn via
sal layers puts some topological restrictions on the expres-
f . The embedding gap BK,W (f, g) then upper bounds the
sivity, which we discuss in great detail in Section 3.3.
2-Wasserstein distance between ν and g# µ0 for the best
possible choice of µo .4 In anticipation of these topological difficulties, when we
refer to the MEP, we consider it with respect to a class of
In the context of optimal transport, the embedding r can be
functions F ⊂ emb(Rn , Rm ). The MEP can be interpreted
interpreted as a candidate transport map from any measure
as a density statement, saying that our networks E are dense
pushed forward by f , that can be pulled back through g.
in some set F ⊂ emb(Rn , Rm ) in the topology induced
Loosely speaking, for µ0o = g −1 ◦ r ◦ f # µn , r transports
by the ‘BK,W distance.’ Two examples of F that we are
f # µn to g # µ0o with cost no more than kI − rkL∞ (f (K)) .
particularly interested in are the following. When F =
See Fig. 1 for a visualization of the embedding gap be-
emb(Rn , Rm ), and also when each f ∈ F can be written
tween two toy functions. The embedding gap satisfies in-
as f = D ◦ L where L : Rm×n is a linear map of rank n
equalities useful for studying networks of the form of Eqn.
and D : Rm → Rm is a C k diffeomorphism with k ≥ 1.
1, see Lemma C.1.
Definition 3.3 (Manifold Embedding Property). Let E ⊂
In the remainder of this section we use the embedding gap
emb(Ro , Rm ) and F ⊂ emb(Rn , Rm ) be two families of
to prove universality of neural networks. The set f (K) will
functions. We say that E has the m, n, o Manifold Embed-
be a target manifold to approximate, and g will be a neural
ding Property (MEP) w.r.t. F if for every compact non-
network of the form Eq. 1. The embedding gap requires
empty set K ⊂ Rn , f ∈ F, and  > 0, there is an E ∈ E
g to be a proper embedding and so, in particular, injective.
and a compact set W ⊂ Ro such that the restriction of f to
This is why we require injectivity of both the expansive and
K and the restriction of E to W satisfies
bijective flow layers.
4
The choice of p-Wasserstein distance is suitable for measures BK,W (f, E) < . (7)
with mismatched low-dimensional support; this has been widely
exploited in training generative models (Arjovsky et al., 2017). When it is clear from the context, we abbreviate the m, n, o
Universal Joint Approximation of Manifolds and Densities by Simple Injective Flows

MEP w.r.t. F simply by the m, n, o MEP, or simply the


MEP.

We also present the following two lemmas which relate to


the algebra of the MEP.
Lemma 3.4. Let E1p,o ⊂ emb(Rp , Ro ) have the o, n, p
MEP w.r.t. F1n,o ⊂ emb(Rn , Ro ), and likewise let
E2o,m ⊂ emb(Ro , Rm ) have the m, o, o MEP w.r.t. F2o,m ⊂
emb(Ro , Rm ). If each E2o,m ∈ E2o,m is locally Lipschitz,
then E2o,m ◦ E1p,o has the m, n, p MEP w.r.t. F o,m ◦ F n,o .
Figure 2: An illustration of the case when n = 2, m = 3,
The proof of Lemma 3.4 is in Appendix C.2.1. and K = S 1 is the circle. Here f : S 1 → R3 is an
embedding such that the curve M = f (S 1 ) is a trefoil
We note that when the elements of E2o,m are differentiable, knot. Due to knot theoretical reasons, there are no map
local Lipschitzness is automatic, and need not be assumed, E = T ◦ R : R2 → R3 such that E(S 1 ) = M, where
see e.g. (Tao, 2009, Ex. 10.2.6). We also record the fol- R : R2 → R3 is a full rank linear map and T : R3 → R3
lowing lemma, proved in C.2.2, which is a weak-converse is a homeomorphism. This shows that a combination of
of Lemma 3.4. It states that if E2o,m ◦ E1p,o has the m, n, p linear maps and coupling flow maps can not represent all
MEP, then E2o,m has the m, n, o MEP. embedded manifolds. For this reason, we define the class
Lemma 3.5. Let E1p,o ⊂ emb(Rp , Ro ) and E2o,m ⊂ I(Rn , Rm ) of extendable embeddings f in Definition 3.7.
emb(Ro , Rm ) be such that E2o,m ◦E1p,o has the m, n, p MEP A similar 2-dimensional example can be obtained to a knot-
with respect to family F ⊂ emb(Rn , Rm ). Then E2o,m has ted ribbon, see Sec. C.3.1.
the m, n, o MEP with respect to family F.
Definition 3.6 (Uniform Universal Approximator). For a
non-empty subset F n,m ⊂ C(Rn , Rm ), a family E n,m ⊂ That is, E is the set of maps that can be written as composi-
C(Rn , Rm ) is said to be a uniform universal approximator tions of linear maps from R2 to R3 and homeomorphisms
of F n,m if for every f ∈ F n,m , every non-empty compact on all of R3 . Let f ∈ emb(K, R3 ) be an embedding that
K ⊂ Rn , and each  > 0, there is an E ∈ E n,m satisfying: maps K to a trefoil knot M = f (S 1 ), see Fig. 2. Such a
function f can not be written as a restriction of an E ∈ E
sup kf (x) − E(x)k2 < . (8)
x∈K to S 1 . In Sec. C.3.1 we prove this fact and build a re-
lated example where a measure, µ ∈ P(R2 ), supported on
If E ⊂ emb(Ro , Rm ) is a uniform universal approximator an annulus is pushed forward to a measure supported on a
of F o,m = C 0 (Ro , Rm ) on compact sets, then it has the knotted ribbon in R3 by an embedding g : R2 → R3 . For
m, n, o MEP w.r.t C 0 (Rn , Rm ) for any n ≤ o, see Lemma this measure, there are no E ∈ E such that g # µ = E # µ.
3.9. As an example, when m ≥ 2o + 1 injective ReLU net- We note that the counterexample is still valid if E is re-
works E : Ro → Rm (i.e., mappings of the form (R4)) are placed with Ê = T ◦ D where T = hom(R3 , R3 ) and
uniform universal approximator of C 0 (Ro , Rm ) on com- D = hom(R3 , R3 ) ◦ R3×2 . See C.3.2 for a proof. The
pact sets, see e.g. (Puthawala et al., 2020) and (Yarotsky, point here is not that R is linear, but rather that it embeds
2017; 2018). Thus, networks that are uniform universal all of R2 into R3 , rather than only S 1 into R3 .
approximators automatically possess the MEP. Generaliza-
With this difficulty in mind, we define the MEP property
tions of this are considered in Lemma 3.9.
with respect to a certain subclass of manifolds {f (K) :
With the definition of the MEP and uniform universal ap- f ∈ F}. Additionally, when considering flow networks
proximator established, we now discuss in detail the nature which are universal approximators of C 2 diffeomorphisms,
of the topological obstructions to approximating all one- we restrict the class of manifolds to be approximated even
chart manifolds. further. This is necessary because manifolds that are home-
omorphic are not necessarily diffeomorphic5 . Moreover,
3.3. Topological Obstructions to Manifold Learning it is known that C 2 -smooth diffeomorphisms can not ap-
with Neural Networks proximate general homeomorphisms in the C 0 topology,
see (Müller, 2014) for a precise statement. All C 1 -smooth
We show that using non-universal expansive layers and diffeomorphisms f : Rm → Rm , however, can be ap-
flow layers imposes some topological restrictions on what proximated in the strong topology of C 1 by C 2 -smooth
can be approximated. Let n = 2, m = 3, and K = S 1 ⊂
5
R2 be the circle, and let A classic example are the exotic spheres. These are topolog-
ical structures that are homeomorphic, but not diffeomorphic, to
E = T ◦ R ∈ C(R2 , R3 ) : R ∈ R3×2 , T ∈ hom(R3 , R3 ) . the sphere (Milnor, 1956).

Universal Joint Approximation of Manifolds and Densities by Simple Injective Flows

diffeomorphism f˜ : Rm → Rm , ` ≥ k, see (Hirsch, I is the identity map, then E := T ◦ R has the MEP
2012, Ch. 2, Theorem 2.7). Because of this, we have to w.r.t. emb(Rn , Rm ).
pay attention to the smoothness of the maps in the subset
F ⊂ emb(K, Rm ). (ii) If R is such that there is an injective R ∈ R and
open set U ⊂ Ro such that R U is linear, and
Definition 3.7 (Extendable Embeddings). We define the
T is a sup universal approximator in the space of
set of Extendable Embeddings as
Diff2 (Rm , Rm ), in the sense of (Teshima et al., 2020),
I(Rn , Rm ) := D ◦ L of the C 2 -smooth diffeomorphisms, then E := T ◦ R
has the MEP w.r.t. I(Rn , Rm ).
D = Diff1 (Rm , Rm )
L = L ∈ Rm×n : rank(L) = n ,

For uniform universal approximators that satisfy the as-
sumptions of (i), see e.g. (Puthawala et al., 2020). The
where Diffk (Rm , Rm ) is the set of C k -smooth diffeo- proof of Lemma 3.9 is in Appendix C.4.1. It has the fol-
morphisms from Rm to itself. Note that I(Rn , Rm ) ⊂ lowing implications for the architectures studied in Section
emb(Rn , Rm ). 2.
The word extendable in the name extendable embeddings Example 1. Let E := T ◦ R and (T1), (T2), (R1), . . . , (R4)
refers to the fact that the family D in Definition 3.7 is a be as described in Section 2. Then
proper subset of emb(L(K), Rm ) for some compact K ⊂
Rn and linear L ∈ Rm×n . Mappings in the set D are em- (i) If T is either (T1) or (T2) and R is (R4), then E has
beddings D : L(K) → Rm that extend to diffeomorphisms the m, n, o MEP w.r.t. emb(Rn , Rm ).
from all of Rm to itself. Said differently, a D ∈ D is a
map in emb1 (L(K), Rm ) that can be extended to a map (ii) If T is (T2) with sigmoidal activations (Huang et al.,
1 m m
D̃ ∈ Diff (R , R ) such that D̃ = D. This distinc- 2018), then if R is any of (R1), ..., (R4), then E has the
L(K) m, n, o MEP w.r.t. I(Rn , Rm ).
tion is important, as there are maps in emb1 (L(K), Rm )
that can not be extended to diffeomorphisms on all of Rm , The proof of Example 1 is in Appendix C.4.2.
as can be seen from the counterexample developed at the
beginning of this section. We now present our universal approximation result for net-
works given in Eqn. 1 and a decoupling property. Below,
We also present here a theorem that states that when m we say that a measure µ in Rn is absolutely continuous if
is more than three times larger than n, any differentiable it is absolutely continuous w.r.t. the Lebesgue measure.
embedding from compact K ⊂ Rn to Rm is necessarily
extendable. Theorem 3.10. Let n0 = n, nL = m K ⊂ Rn be compact,
µ ∈ P(K) be an absolutely continuous measure. Further
Theorem 3.8. When m ≥ 3n + 1 and k ≥ 1, for any C k n ,n n ,n
let, for each ` = 1, . . . , L, E` `−1 ` := T`n` ◦ R` `−1 `
embedding f ∈ embk (Rn , Rm ) and compact set K ⊂ Rn , n`−1 ,n`
where R` is a family of injective expansive elements
there is a map E ∈ I k (Rn , Rm ) (that is, E is in the closure
that contains a linear map, and T`n` is a family of bijective
of the set of flow type neural networks) such that E(K) =
family networks. Finally let T0n be distributionally univer-
f (K). Moreover,
sal, i.e. for any absolutely continuous µ ∈ P(Rn ) and
I k (K, Rm ) = embk (K, Rm ) (9) ν ∈ P(Rn ), there is a {Ti }i=1,2,... such that Ti# µ → ν in
distribution. Let one of the following two cases hold:
The proof of Theorem 3.8 in Appendix C.3.3. We also re- n ,m n ,n
(i) f ∈ FL L−1 ◦· · ·◦F1n,n1 and E` `−1 ` have the the n` ,
mark here that the proof of the above theorem relies on the n ,n
n`−1 ,n`−1 MEP for ` = 1, . . . , L with respect to F` `−1 ` .
so called ‘clean trick’ from differential topology. This trick
is related to fact that in R4 , all knots can be reduced to the (ii) f ∈ emb1 (Rn , Rm ) be a C 1 -smooth embedding, for
simple knot continuously. ` = 1, . . . , L n` ≥ 3n`−1 + 1 and the families T`n` are
dense in Diff2 (Rn` ).
n ,m
3.4. Universality Then, there is a sequence of {Ei }i=1,2,... ⊂ EL L−1 ◦· · ·◦
We now combine the notions of universality and extend- E1n1 ,n ◦ T0n such that
able embeddings to produce a result stating that many com-
monly used networks of the form studied in Section 2 have lim W2 (f # µ, Ei# µ) = 0. (10)
i→∞
the MEP.
Lemma 3.9. (i) If R ⊂ emb(Rn , Rm ) is a uniform uni- The proof of Theorem 3.10 is in Appendix C.4.3. The re-
versal approximator of C(Rn , Rm ) and I ∈ T where sults of Theorems 3.8 and 3.10 have a simple interpretation,
Universal Joint Approximation of Manifolds and Densities by Simple Injective Flows

omitting some technical details. Densities on ‘nice’ man-


ifolds embedded in high-dimensional spaces can always
be approximated by neural networks of the form Eq. 1.
Here, ‘nice’ manifolds are smooth and homeomorphic to
Rn . This proves that networks like Eqn. 1 are ‘up to task’
of solving generation problems.
As discussed in the above and in Figure 2, there are topo- (a) f and E1 (b) f and E2 (c) f and E3
logical obstructions to obtaining the results of Theorem
3.10 with a general embedding f : Rn → Rm . When
n = 2, m = 3, L = 1, and µ is the uniform measure on
an annulus K ⊂ R2 target measure F# µ is the uniform
measure on a knotted ribbon M = f (K) ⊂ R3 . There
are no injective linear maps R : R2 → R3 and diffeomor-
phisms T : R3 → R3 such that E = T ◦ R would satisfy (d) E1 ◦ E10 (e) E2 ◦ E20 (f) E3 ◦ E30
M = E(K) and E# µ = F# µ.
Figure 3: A visualization of the construction described in
We note that our networks are designed expressly to ap- Corollary 3.12 applied to a toy example when m = 3, o =
proximate manifolds, and hence injectivity is key. This 2 and n = 1. In all figures, f (K) is the red curve, Ei (W )
separates our results from, e.g. (Lee et al., 2017, Theorem are the orange surfaces, Ei ◦ Ei0 (W 0 ) are the black curves,
3.1) or (Lu & Lu, 2020, Theorem 2.1), where universality Ti and µ are not pictured. (a) - (c) The orange surfaces
results of ReLU networks are also obtained. approach the red curves. This means that the sequence of
The previous theorem states that the entire network is uni- E1 , E2 and E3 send BK,W (f , Ei ) to zero as i increases.
versal if it can be broken into pieces that have the MEP. The (d) - (f) The black curves, a subset of the orange surfaces,
following lemma, proved in Appendix C.4.4, shows that if approach the red curves. This means that given E1 , E2
E n,m = Ho,m ◦ G n,o , then Ho,m must have the m, n, o and E3 we can always find another sequence E10 , E20 and
MEP if E n,m is universal. E30 that sends BK,W (f , Ei ◦ Ei0 ) to zero as i increases too.
This as a consequence, sends W2 (f # µ, Ei ◦ Ei0 ◦ Ti# µ)
Lemma 3.11. Suppose that E n,m = Ho,m ◦ G n,o where
to zero as i increases too for some choice of T1 , T2 and T3 .
E n,m ⊂ emb(Rn , Rm ), Ho,m ⊂ emb(Ro , Rm ), and
G n,o ⊂ emb(Rn , Ro ). If Ho,m does not have the m, n, o
MEP w.r.t. F, then there exists a f ∈ F, compact
K ⊂ Rn and  > 0 such that for all E ∈ E n,m , and
r ∈ emb(f (K), E(W )) trefoil knot respectively. Such a construction is given in
C.4.6.
kI − rkL∞ (K) ≥ . (11)
Although there are sequences of functions that approximate
Lemma 3.11 has a simple takeaway: If a bijective neural measure without matching manifolds, these sequences are
network is universal, then the last layer, last two layers, never uniformly Lipschitz. This is proven in C.4.6. Under
etc., must have the MEP. In other words, a network is only an idealization of training, we may consider a network un-
as universal as its last layer. Earlier layers, on the other dergoing training as successively better and better approx-
hand, need not satisfy the MEP. ‘Strong’ layers close to the imators of a target mapping. If the target mapping does
output can compensate for ‘weak’ layers closer to the input, not match the topology, then training necessarily leads to
but not the other way around. gradient blowup.

There is a gap between the negation of Theorem 3.10 and The proof of Theorem 3.10 also implies the following re-
Lemma 3.11. That is, it is possible for a family of functions sult which, loosely speaking, says that optimality of later
E to satisfy Lemma 3.11 but nevertheless satisfy the con- layers can be determined without requiring optimality of
clusion of Theorem 3.10; these functions approximate mea- earlier layers, while still having a network that is end-to-
sures without matching manifolds. Theorem 3.10 consid- end optimal. The conditions and result of this is visualized
ers approximating measures, whereas Lemma 3.11 refers to on a toy example in Figure 3.
matching manifolds exactly. As discussed in Section 3.3,
there are no extendable embeddings that map S 1 to the tre- Corollary 3.12. Let F n,o ⊂ emb(Rn , Ro ), F o,m ⊂
foil knot in R3 . Nevertheless, it is possible to construct a emb(Ro , Rm ), and let E o,m ⊂ emb(Ro , Rm ) have the
sequence of functions (Ei )i=1,... so that W2 (ν, Ei# µ) = m, n, o MEP w.r.t. F o,m ◦ F n,o . Then for every f ∈
0 where µ and ν are the uniform distributions on S 1 and F o,m ◦ F n,o and compact sets K ⊂ Rn and W ⊂ Ro
Universal Joint Approximation of Manifolds and Densities by Simple Injective Flows

there is a sequence {Ei }i=1,2,... ⊂ E o,m such that


lim BK,W (f, Ei ) = 0. (12)
i→∞

Further,if there is a compact W 0 ⊂ Rn and E n,o ⊂


emb(W 0 , Ro ) has the o, n, n MEP w.r.t. F n,o , and a T n
is a universal approximator for distributions, then for any
absolutely continuous µ ∈ P(K) where K ⊂ Rn is
compact, there is a sequence {Ei0 }i=1,2,... ⊂ E n,o and
{Ti }i=1,2,..., ⊂ T n so that
lim W2 (f # µ, Ei ◦ Ei0 ◦ Ti# µ) = 0. (13) Figure 4: A schematic showing that, for a toy problem, the
i→∞
least-squares projection to a piecewise affine range can be
The proof of Corollary 3.12 is in Appendix C.4.5. Approx- discontinuous. Left: A partitioning of R2 into classes with
imation results for neural networks are typically given in gray boundaries. Two points y, y 0 are in the same class if
terms of the network end-to-end. Corollary 3.12 shows they are both closest to the same affine piece of R(R), the
that the layers of approximating networks can in fact be range of R. The three points y1 , y2 and y3 are each pro-
built one at a time. This is related to an observation made jected to the closest three points on R(R) yielding ỹ1 , ỹ2
in (Brehmer & Cranmer, 2020, Section B) about training and ỹ3 . Note that the projection operation is continuous
strategies, where the authors remark that they ‘expect faster within each section, but discontinuous across gray bound-
and more robust training of a network’ of the form in Eqn. aries between section.
1 when L = 1, that is F = T1m ◦Rn,m 1 ◦T0n . Corollary 3.12
shows that there exists a minimizing sequence in T1m that
need only minimize Eqn. 12; the T0n layers can be mini- (see (Golub, 1996, Section 5.3).) This includes cases (R1)
mized after. We can further combine Lemma 3.11 and Cor. or (R2). For (R3) we have the following result when D =
3.12 to prove that not only can the network from (Brehmer I n×n and M ∈ R0×n .
& Cranmer, 2020) be trained layerwise, but that any univer- t
Definition 3.14. Let W = B t −DB t ∈ R2n×n and

sal network can necessarily be trained layerwise, provided y ∈ R2n be given, and let R(x) = ReLU(W x). Then
that it can be written as a composition of two smaller lay- define c(y) ∈ R2n , ∆y ∈ Rn×n , My ∈ Rn×2n where
ers.
 n×n
−I n×n
 
I
c(y) := max y, 0 (14)
3.5. Layer-wise Inversion and Recovery of Weights −I n×n I n×n

In this subsection, we describe how our network can be 0 if i 6= j

augmented with more useful properties if the architecture [∆y ]i,j := 0 if [c(y)]i+n = 0 (15)
satisfies a few more assumptions without affecting univer- 

1 if [c(y)]i+n > 0
sal approximation. We focus on a new layerwise projection
My := (I n×n − ∆y ) ∆y
 
result, with a further discussion of black-box recovery of (16)
our network’s weights in Appendix C.5.2.
where the max in Eqn. 14 is taken element-wise.
Given a point y ∈ Rm that does not lie in the range of the
Theorem 3.15. Let y ∈ R2n . If for i = 1, . . . , n, [y]i 6=
network, projecting y onto the range of the network is a
[y]i+n then
practical problem without an obvious answer. The crux of
the problem is inverting the injective (but non-invertible) −1
R† (y) := (My W ) My y = argmin ky − R(x)k2 .
R layers when R contains only full-rank matrices as in x∈Rn
(R1) or (R2) then we can compute a least-squares solution. (17)
If, however, R contains layers which are only piecewise
linear, as in (R3), then the problem of computing a least Further, if there is a i ∈ {1, . . . , n} such that [y]i = [y]i+n ,
squares solution is more difficult, see Fig. 4. Neverthe- then there are multiple minimizers of ky − R(x)k2 , one of
less, we find that if R is (R3) we can still compute a least- which is R† (y).
squares solution.
The proof of Theorem 3.15 is given in Appendix C.5.1.
Assumption 3.13. Let R be given by one of (R1) or (R2),
or else (R3) when m = 2n. Remark 3.16. We note that Theorem 3.15 is different from
many of the existing work on inverting expansive layers,
If R only contains linear operators, then the least-squares e.g. (Aberdam et al., 2020; Bora et al., 2017; Lei et al.,
problem can be computed by solving the normal equations 2019), our result gives a direct inversion algorithm that is
Universal Joint Approximation of Manifolds and Densities by Simple Injective Flows

provably the least-squares minimizer. Further, if each ex- pp. 1–155. Springer, Heidelberg, 2013. doi: 10.1007/
pansive layer is any combination of (R1), (R2), or (R3) then 978-3-642-32160-3\ 1. URL https://doi.org/
the entire network can be inverted end-to-end by using ei- 10.1007/978-3-642-32160-3_1.
ther the above result or solving the normal equations di-
rectly. Ambrosio, L., Gigli, N., and Savaré, G. Gradient flows: in
metric spaces and in the space of probability measures.
Springer Science & Business Media, 2008.
4. Conclusion
Ardizzone, L., Kruse, J., Wirkert, S., Rahner, D., Pelle-
Bijective flow networks are a powerful tool for learning grini, E. W., Klessen, R. S., Maier-Hein, L., Rother, C.,
push-forward mappings in a space of fixed dimension. In- and Köthe, U. Analyzing inverse problems with invert-
creasingly, these flow networks have been used in combi- ible neural networks. arXiv preprint arXiv:1808.04730,
nation with networks that increase dimension in order to 2018.
produce networks which are purportedly universal.
Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein gan.
In this work, we have studied the theory underpinning
arXiv preprint arXiv:1701.07875, 2017.
these flow and expansive networks by introducing two new
notions, the embedding gap and the manifold embedding Arridge, S., Maass, P., Öktem, O., and Schönlieb, C.-
property. We show that these notions are both necessary B. Solving inverse problems using data-driven models.
and sufficient for proving universality, but require impor- Acta Numerica, 28:1–174, 2019.
tant topological and geometrical considerations which are,
heretofore, under-explored in the literature. We also find Billingsley, P. Convergence of probability measures. Wi-
that optimality of the studied networks can be established ley Series in Probability and Statistics: Probability and
‘in reverse,’ by minimizing the embedding gap, which we Statistics. John Wiley & Sons, Inc., New York, sec-
expect opens the door to convergence of layer-wise training ond edition, 1999. ISBN 0-471-19745-9. doi: 10.
schemes. Without compromising universality, we can also 1002/9780470316962. URL https://doi.org/
use specific expansive layers with a new layerwise projec- 10.1002/9780470316962. A Wiley-Interscience
tion result. Publication.

Bora, A., Jalal, A., Price, E., and Dimakis, A. G.


5. Acknowledgements Compressed sensing using generative models. In
We would like to thank Anastasis Kratsios for his editorial Proceedings of the 34th International Conference on
input and mathematical discussions that helped us refine Machine Learning-Volume 70, pp. 537–546. JMLR. org,
and trim our presentation; Pekka Pankka for his suggestion 2017.
of the ‘clean trick,’ which was crucial to the development Brehmer, J. and Cranmer, K. Flows for simultaneous man-
of the proof of Lemma 3.8; and Reviewer for supplying 5 ifold learning and density estimation. arXiv preprint
and suggesting the addition of C.4.6. arXiv:2003.13913, 2020.
I.D. was supported by the European Research Council
Bui Thi Mai, P. and Lampert, C. Functional vs. paramet-
Starting Grant 852821—SWING. M.L. was supported by
ric equivalence of relu networks. In 8th International
Academy of Finland, grants 284715, 312110. M.V.dH.
Conference on Learning Representations, 2020.
gratefully acknowledges support from the Department of
Energy under grant DE-SC0020345, the Simons Founda- Cunningham, E., Zabounidis, R., Agrawal, A., Fiterau, I.,
tion under the MATH + X program, and the corporate and Sheldon, D. Normalizing flows across dimensions.
members of the Geo-Mathematical Imaging Group at Rice arXiv preprint arXiv:2006.13070, 2020.
University.
Dinh, L., Krueger, D., and Bengio, Y. Nice: Non-linear
independent components estimation. arXiv preprint
References
arXiv:1410.8516, 2014.
Aberdam, A., Simon, D., and Elad, M. When and how
can deep generative models be inverted? arXiv preprint Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density esti-
arXiv:2006.15555, 2020. mation using real nvp. arXiv preprint arXiv:1605.08803,
2016.

Ambrosio, L. and Gigli, N. A user’s guide to opti- Durkan, C., Bekasov, A., Murray, I., and Papamakarios, G.
mal transport. In Modelling and optimisation of flows Cubic-spline flows. arXiv preprint arXiv:1906.02145,
on networks, volume 2062 of Lecture Notes in Math., 2019a.
Universal Joint Approximation of Manifolds and Densities by Simple Injective Flows

Durkan, C., Bekasov, A., Murray, I., and Papamakarios, Lee, H., Ge, R., Ma, T., Risteski, A., and Arora, S. On
G. Neural spline flows. Advances in Neural Information the ability of neural nets to express distributions. In
Processing Systems, 32:7511–7522, 2019b. Conference on Learning Theory, pp. 1271–1296. PMLR,
2017.
Golub, G. H. Matrix computations. Johns Hopkins Univer-
sity Press, 1996. Lei, Q., Jalal, A., Dhillon, I. S., and Dimakis, A. G. In-
verting deep generative models, one layer at a time.
Gomez, A. N., Ren, M., Urtasun, R., and Grosse, R. B. The
In Advances in Neural Information Processing Systems,
reversible residual network: Backpropagation without
pp. 13910–13919, 2019.
storing activations. In Advances in neural information
processing systems, pp. 2214–2224, 2017. Lu, Y. and Lu, J. A universal approximation theorem of
Grathwohl, W., Chen, R. T., Bettencourt, J., Sutskever, I., deep neural networks for expressing distributions. arXiv
and Duvenaud, D. Ffjord: Free-form continuous dy- preprint arXiv:2004.08867, 2020.
namics for scalable reversible generative models. arXiv Madsen, I. H., Tornehave, J., et al. From calculus to
preprint arXiv:1810.01367, 2018. cohomology: de Rham cohomology and characteristic
Hirsch, M. W. Differential topology, volume 33. Springer classes. Cambridge university press, 1997.
Science & Business Media, 2012.
Milnor, J. On manifolds homeomorphic to the 7-sphere.
Huang, C.-W., Krueger, D., Lacoste, A., and Courville, A. Annals of Mathematics, pp. 399–405, 1956.
Neural autoregressive flows. In International Conference
on Machine Learning, pp. 2078–2087. PMLR, 2018. Mukherjee, A. Differential topology. Hindustan
Book Agency, New Delhi; Birkhäuser/Springer,
Jacobsen, J.-H., Smeulders, A., and Oyallon, E. i- Cham, second edition, 2015. ISBN 978-
revnet: Deep invertible networks. arXiv preprint 3-319-19044-0; 978-3-319-19045-7. doi:
arXiv:1802.07088, 2018. 10.1007/978-3-319-19045-7. URL https:
//doi.org/10.1007/978-3-319-19045-7.
Jaini, P., Selby, K. A., and Yu, Y. Sum-of-squares poly-
nomial flow. In International Conference on Machine Müller, S. Uniform approximation of homeomorphisms by
Learning, pp. 3009–3018. PMLR, 2019. diffeomorphisms. Topology and its Applications, 178:
315–319, 2014.
Kingma, D. P. and Dhariwal, P. Glow: Generative
flow with invertible 1x1 convolutions. In Advances Murasugi, K. Knot theory & its applications. Modern
in Neural Information Processing Systems, pp. 10215– Birkhäuser Classics. Birkhäuser Boston, Inc., Boston,
10224, 2018. MA, 2008. ISBN 978-0-8176-4718-6. doi: 10.1007/
978-0-8176-4719-3. URL https://doi.org/10.
Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X.,
1007/978-0-8176-4719-3. Translated from the
Sutskever, I., and Welling, M. Improving variational in-
1993 Japanese original by Bohdan Kurpita, Reprint of
ference with inverse autoregressive flow. arXiv preprint
the 1996 translation [MR1391727].
arXiv:1606.04934, 2016.
Kobyzev, I., Prince, S., and Brubaker, M. Normalizing Papamakarios, G., Nalisnick, E., Rezende, D. J., Mo-
flows: An introduction and review of current methods. hamed, S., and Lakshminarayanan, B. Normalizing
IEEE Transactions on Pattern Analysis and Machine flows for probabilistic modeling and inference. arXiv
Intelligence, 2020. preprint arXiv:1912.02762, 2019.

Kothari, K., Khorashadizadeh, A., de Hoop, M., and Dok- Puthawala, M., Kothari, K., Lassas, M., Dokmanić, I., and
manić, I. Trumpets: Injective flows for inference and in- de Hoop, M. Globally injective relu networks. arXiv
verse problems. arXiv preprint arXiv:2102.10461, 2021. preprint arXiv:2006.08464, 2020.

Kruse, J., Detommaso, G., Scheichl, R., and Köthe, U. Rolnick, D. and Körding, K. Reverse-engineering deep
Hint: Hierarchical invertible neural transport for den- relu networks. In International Conference on Machine
sity estimation and bayesian inference. arXiv preprint Learning, pp. 8178–8187. PMLR, 2020.
arXiv:1905.10687, 2019.
Séquin, C. H. Tori story. In Sarhangi, R. and Séquin,
Kruse, J., Ardizzone, L., Rother, C., and Köthe, U. Bench- C. H. (eds.), Proceedings of Bridges 2011: Mathematics,
marking invertible architectures on inverse problems. Music, Art, Architecture, Culture, pp. 121–130. Tessel-
arXiv preprint arXiv:2101.10763, 2021. lations Publishing, 2011. ISBN 978-0-9846042-6-5.
Universal Joint Approximation of Manifolds and Densities by Simple Injective Flows

Sun, H. and Bouman, K. L. Deep probabilistic imag-


ing: Uncertainty quantification and multi-modal solu-
tion characterization for computational imaging. arXiv
preprint arXiv:2010.14462, 2020.
Sutherland, W. A. Introduction to metric and
topological spaces. Oxford University Press, Ox-
ford, 2009. ISBN 978-0-19-956308-1. Sec-
ond edition [of MR0442869], Companion web site:
www.oup.com/uk/companion/metric.
Tao, T. Analysis, volume 185. Springer, 2009.
Teshima, T., Ishikawa, I., Tojo, K., Oono, K., Ikeda, M.,
and Sugiyama, M. Coupling-based invertible neural
networks are universal diffeomorphism approximators.
arXiv preprint arXiv:2006.11469, 2020.
Villani, C. Optimal transport: old and new, volume 338.
Springer Science & Business Media, 2008.

Yarotsky, D. Error bounds for approximations with deep


relu networks. Neural Networks, 94:103–114, 2017.
Yarotsky, D. Optimal approximation of continuous func-
tions by very deep relu networks. In Conference on
Learning Theory, pp. 639–649. PMLR, 2018.
Universal Joint Approximation of Manifolds and Densities by Simple Injective Flows

A. Summary of Notation
Throughout the paper we make heavy use of the following notation.

1. Unless otherwise stated, X and Y always refer to subsets of Euclidean space, and K and W always refer to compact
subsets of Euclidean space.
2. f ∈ C(X, Y ) means that f : X → Y is continuous.
3. For families of functions F and G where each F 3 f : X → Y and G 3 g : Y → Z, then we define G ◦ F =
{g ◦ f : X → Z : f ∈ F, g ∈ G}.
4. f ∈ emb(X, Y ) means that f ∈ C(X, Y ) is continuous and injective on the range of f , i.e. an embedding, and
furthermore that f −1 : f (X) → X is continuous.
5. µ ∈ P(X) means that µ is a probability measure over X.
6. W2 (µ, ν) for µ, ν ∈ P(X) refers to the Wasserstein-2 distance, always with `2 ground metric.
7. k·kLp (X) refers to the Lp norm of functions, from X to R.
8. For vector-valued f : X → Y , kf kL∞ (X) = ess supx∈X kf k2 . Note that Y is always finite dimensional, and so all
discrete 1 ≤ q ≤ ∞ norms are equivalent.
9. Lip(g) refers to the Lipschitz constant of f .
10. For x ∈ Rn , [x]i ∈ R is the i’th component of x. Similarly, for matrix A ∈ Rm×n , [A]ij refers to the j’th element in
the i’th column.

B. Detailed Comparison to Prior work


B.1. Connection to Brehmer & Cranmer (2020)
In (Brehmer & Cranmer, 2020), the authors introduce manifold-learning flows as an invertible method for learning proba-
bility density supported on a low-dimensional manifold. Their model can be written as

F = T1m ◦ Rn,m ◦ T0n (18)


I n×n
 
where T1m ⊂ C(Rm , Rm ), T0m ⊂ C(Rn , Rn ), and R = (m−n)×n is a zero-padding (R1). They invert f ∈ F in
0
two different ways. For manifold-learning flows (M-flows) they restrict T1m to be an invertible flow, and for manifold-
learning flows with separate encoder (Me -flows) they place no such restrictions on T1m and instead train a separate neural
network e to invert elements of T1m .
Our results apply out-of-the-box to the architectures used in Experiment A of (Brehmer & Cranmer, 2020). The architecture
described in Eqn. 18 is of the form of Eqn. 1 where L = 1. Further, although they are not studied here, our analysis can
also be applied to quadratic flows.
The network used in (Brehmer & Cranmer, 2020, Experiment 4.A) uses coupling networks, (T1), where T1m and T0n are
both 5 layers deep. For (Brehmer & Cranmer, 2020, Experiments 4.B and 4.C) the authors choose expressive elements T
that are rational quadratic flows (Durkan et al., 2019b) for both T1m and T0n . In Experiment 4.B they let T1 and T0 again
be 5 layers deep, and in 4.C they again let T1 by 20 layers deep and T0 15 layers. For the final experiment, 4.D, the choose
more complicated expressive elements that combine Glow (Kingma & Dhariwal, 2018) and Real NVP (Dinh et al., 2016)
architectures. These elements include the actnorm, 1 × 1 convolutions and rational-quadratic coupling transformations
along with a multi-scale transformation.
The authors mention universality of their network without our proof, but our universality results in Theorem 3.10 apply
to their networks from Experiment A wholesale. Further in their work the authors describe how training can be split into
a manifold phase and density phase, wherein the manifold phase T1m is trained to learn the manifold, and in the density
phase T1m if fixed and T0n is trained to learn the density thereupon. This statement is made formal and proven by our Cor.
3.12.
Universal Joint Approximation of Manifolds and Densities by Simple Injective Flows

B.2. Connection to Kothari et al. (2021)


In (Kothari et al., 2021), the authors introduce the ‘Trumpet’ architecture, for its architecture, which has many alternating
flow networks & expansive layers with many flow-networks in the low-dimensional early stages of the network, which
gives the architecture a shape similar to the titular instrument.
The architecture studied in (Kothari et al., 2021) is precisely of the form of Eqn. 1, where the bijective flow networks are
revnets (Gomez et al., 2017; Jacobsen et al., 2018) architecture, and the expansive elements are 1 × 1 convolutions, as in
(R2). To out knowledge, there are no results that show that the revnets used are universal approximators, but if they revnets
are substituted with either (T1) or (T2), then the, we could apply Theorem 3.10 to the resulting architecture.

C. Proofs
C.1. Main Results
C.1.1. E MBEDDING G AP
To aid all of our subsequent proofs, we first present the following lemma which present inequalities and identities for the
embedding gap.
Lemma C.1. For all of the following results, f ∈ emb(K, Rm ) and g ∈ emb(W, Rm ) and n ≤ o ≤ m.

1.

BK,W (f, g) ≥ sup inf kg(xo ) − f (xn )k2 . (19)


xn ∈K xo ∈W

2. Let X, Y ⊂ W , let g be Lipschitz on W , and r ∈ emb(X, Y ). Then, there is a r0 ∈ emb(g(X), g(Y )) such that
g ◦ r = r0 ◦ g and kI − r0 kL∞ (g(X)) ≤ kI − rkL∞ (X) Lip(g).
3.

kI − rkL∞ (K) = I − r−1 L∞ (r(K))


(20)

4. Let K ⊂ Rn , X ⊂ Rp and W ⊂ Ro be compact sets. Also, let f ∈ emb(K, W ) and h ∈ emb(X, W ), and let
g ∈ emb(W, Rm ) be a Lipschitz map. Then

BK,X (g ◦ f, g ◦ h) ≤ Lip(g)BK,X (f, h). (21)

5. BK,W (f, g) ≤ supx∈K kg ◦ h(x) − f (x)k2 where h ∈ emb(K, Ro ) is a map satisfying h(K) ⊂ W .
6. For any X that is the closure of an open set , if h ∈ emb(X, W ) then

BK,W (f, g) ≤ BK,X (f, g ◦ h) (22)

7. For any r ∈ emb(f (K), Rm ),

BK,W (f, g) ≤ kI − rkL∞ (f (K)) + BK,W (r ◦ f, g). (23)

8. For any r ∈ emb(f (K), g(W )) and h ∈ emb(X, W ) where X ⊂ Rp is the closure of a set U which is open in the
subspace topology of some vector space of dimension p, where n ≤ p ≤ o we have that

BK,X (f, g ◦ h) ≤ kI − rkL∞ (f (K)) + Lip(g)BK,X (g −1 ◦ r ◦ f, h) (24)

where Lip(g) denotes the Lipschitz constant of g.


9. For any µn ∈ P(K) there is a µo ∈ P(W ) such that

W2 (f # µn , g # µo ) ≤ BK,W (f, g) (25)


Universal Joint Approximation of Manifolds and Densities by Simple Injective Flows

Proof. 1. Let r ∈ C(f (K), g(W )), then

kI − rkL∞ (f (K)) = sup k(I − r)f (xn )k2 = sup kf (xn ) − r ◦ f (xn )k2
xn ∈K xn ∈K

= sup kf (xn ) − g(xo )k2 where xo = g −1 ◦ r ◦ f (xn )


xn ∈K

≥ sup inf kf (xn ) − g(xo )k2 .


xn ∈K xo ∈W

2. g is injective on X, hence we can define r0 such that r0 = g ◦ r ◦ g −1 : g(X) → g(r(X)) ⊂ g(Y ) such that
r0 ∈ emb(g(X), g(Y )), and thus ∀x ∈ X,

k(I − r0 ) ◦ g(x)k2 = kg(x) − g ◦ r(x)k2 ≤ Lip(g) kI − rkL∞ (X) (26)

where we have used kr(x) − xk2 ≤ kI − rkL∞ (X) .

3. For every x ∈ r(K), we have a y ∈ K such that x = r(y), thus ∀x ∈ r(K),

I − r−1 (x) 2 = k(r − I) (y)k2 .



(27)

But r is clearly surjective onto it’s range, hence taking the supremum over all x ∈ X yields

I − r−1 L∞ (r(K))
= kI − rkL∞ (K) (28)

4. As g ∈ emb(W, Rm ), the map g : W → g(W ) is a homeomorphism and there is g −1 ∈ emb(g(W ), W ). For a map
r ∈ emb(g ◦ f (K), g ◦ h(X)), we see that r̂ = g −1 ◦ r ◦ g ∈ emb(f (K), h(X)). Also, the opposite is valid as if
r̂ ∈ emb(f (K), h(X)) then r = g ◦ r̂ ◦ g −1 ∈ emb(g ◦ f (K), g ◦ h(X)). Thus

BK,X (g ◦ f, g ◦ h) = inf kI − rkL∞ (g◦f (K))


r∈emb(g◦f (K),g◦h(X))

= inf kI − g ◦ r̂ ◦ g −1 kL∞ (g◦f (K))


r=g◦r̂◦g −1 ∈emb(g◦f (K),g◦h(X))

= inf kg ◦ (I − r̂) ◦ g −1 kL∞ (g◦f (K))


r̂∈emb(f (K),h(X))

≤ Lip(g) inf k(I − r̂) ◦ g −1 kL∞ (g◦f (K))


r̂∈emb(f (K),h(X))

≤ Lip(g) inf kI − r̂kL∞ (f (K))


r̂∈emb(f (K),h(X))

≤ Lip(g) BK,X (f, h)

5. If we let r := g ◦ h ◦ f −1 , then r ∈ emb(f (K), g(W )), and

BK,W (f, g) ≤ kk(I − r) ◦ f (x)k2 kL∞ (K) (29)


= f (x) − g ◦ h ◦ f −1 ◦ f (x) 2 L∞ (K)
≤ sup kf (x) − g ◦ h(x)k2 . (30)
x∈K

6. Given that g ◦ h(X) ⊂ g(W ), we have that emb(f (K), g ◦ h(X)) ⊂ emb(f (K), g(W )), thus the infimum in Eqn. 6
is taken over a smaller set, thus BK,W (f, g) ≤ BK,X (f, g ◦ h).

7. Note that for any r0 ∈ emb(r ◦ f (K), g(W )), r0 ◦ r ∈ emb(f (K), g(W )), and so we have

BK,W (f, g) ≤ kI − r0 ◦ rkL∞ (f (K)) ≤ kI − rkL∞ (f (K)) + kr − r0 ◦ rkL∞ (f (K)) (31)


0
= kI − rkL∞ (f (K)) + kI − r kL∞ (r◦f (K)) (32)

where we have used that r is injective for the final equality. This holds for all possible r0 , hence we have the result.
Universal Joint Approximation of Manifolds and Densities by Simple Injective Flows

8. Recall that f ∈ emb(K, W ), g ∈ emb(W, Rm ), h ∈ emb(X, W ) and r ∈ emb(f (K), g(W )). Then g −1 ∈
emb(g(W ), W ). As r ◦ f (K) ⊂ g(W ), we see that

r ◦ f = g ◦ g −1 ◦ r ◦ f.

Thus Lemma C.1 points 4 and 8 yield that

BK,X (f, g ◦ h) ≤ kI − rkL∞ (f (K)) + BK,X (r ◦ f, g ◦ h)


≤ kI − rkL∞ (f (K)) + BK,X (g ◦ g −1 ◦ r ◦ f, g ◦ h)
≤ kI − rkL∞ (f (K)) + Lip(g) BK,X (g −1 ◦ r ◦ f, h),

which proves the claim.


9. Let r_ε ∈ emb(f(K), g(W)) be such that ‖I − r_ε‖_{L^∞(f(K))} ≤ B_{K,W}(f, g) + ε. Then for every x ∈ K there exists y ∈ W such that g(y) = r_ε ◦ f(x), and from injectivity of g we have y = g^{−1} ◦ r_ε ◦ f(x). Note that g^{−1} ◦ r_ε ◦ f ∈ emb(K, W), hence K' := g^{−1} ◦ r_ε ◦ f(K) ⊂ W is compact. Define µ' ∈ P(K') by µ' := (g^{−1} ◦ r_ε ◦ f)_# µ. Clearly g_# µ' = (r_ε ◦ f)_# µ, and thus

W_2(f_# µ, g_# µ') = W_2(f_# µ, (r_ε ◦ f)_# µ)    (33)

and so

W_2(f_# µ, g_# µ') ≤ ( ∫_{f(K)} ‖I − r_ε‖_2^2 d f_# µ )^{1/2} ≤ B_{K,W}(f, g) + ε.    (34)

As the set W is compact, by Prokhorov's theorem, see (Billingsley, 1999, Theorem 5.1), the set of probability measures P(W) is compact in the topology of weak convergence. Thus there is a sequence ε_i → 0 such that the measures µ'_{ε_i} converge weakly to a probability measure µ_o. As g is continuous on W, the push-forward operation µ ↦ g_# µ is continuous as a map g_# : P(W) → P(g(W)), and thus g_# µ'_{ε_i} converge weakly to g_# µ_o. Finally, as the measures g_# µ'_{ε_i} are supported in the compact set g(W), their second moments converge to those of g_# µ_o as i → ∞. By (Ambrosio & Gigli, 2013), Theorem 2.7, see also Remark 28, the weak convergence and the convergence of the second moments imply convergence in the Wasserstein-2 metric. Hence, g_# µ'_{ε_i} converge to g_# µ_o in the Wasserstein-2 metric and we see that

W_2(f_# µ, g_# µ_o) ≤ B_{K,W}(f, g).    (35)

C.2. Manifold Embedding Property


C.2.1. The Proof of Lemma 3.4

The proof of Lemma 3.4. Let f = F_2 ◦ F_1, where F_2 ∈ F^{o,m} and F_1 ∈ F^{n,o}, let ε > 0 be given, and let E^{o,m} ∈ E^{o,m}. Clearly, B_{K,W}(f, E^{o,m}) ≤ B_{K,W}(F_2, E^{o,m}), and so by the m, o, o MEP of E^{o,m} with respect to F^{o,m}, we have the existence of an r ∈ emb(f(K), E^{o,m}(W)) such that ‖I − r‖_{L^∞(f(K))} < ε. The set K_o := (E^{o,m})^{−1} ◦ r ◦ f(K) is compact, hence E^{o,m} is Lipschitz on K_o, and we can apply Lemma C.1 point 8 to obtain

B_{K,W}(f, E^{o,m} ◦ E^{p,o}) ≤ ‖I − r‖_{L^∞(f(K))} + Lip(E^{o,m}) B_{K,W}((E^{o,m})^{−1} ◦ r ◦ f, E^{p,o}).    (36)

But, because f ∈ F^{o,m} ◦ F^{n,o}, we can choose an E^{p,o} ∈ E_1^{p,o} so that B_{K,W}((E^{o,m})^{−1} ◦ r ◦ f, E^{p,o}) ≤ ε / (2 Lip(E^{o,m})), which, combined with Eqn. 36 and the fact that ε > 0 was arbitrary, proves the result.

C.2.2. The Proof of Lemma 3.5


The proof of Lemma 3.5. Recall that F ⊂ emb(R^n, R^m). Suppose that E_2^{o,m} does not have the m, n, o MEP with respect to F. Then there are some ε > 0 and f ∈ F so that for all E_2^{o,m} ∈ E_2^{o,m} and all compact W_1 ⊂⊂ R^o,

B_{K,W_1}(f, E_2^{o,m}) ≥ ε.    (37)



From Lemma C.1 point 6, we have that

ε ≤ B_{K,W_1}(f, E_2^{o,m}) ≤ B_{K,W}(f, E_2^{o,m} ◦ E_1^{p,o})    (38)

for all E_1^{p,o} ∈ E_1^{p,o} and for all compact sets W ⊂ R^p that satisfy E_1^{p,o}(W) ⊂ W_1. We observe that if W' ⊂ R^p is a compact set such that W' ⊂ W, we have

B_{K,W}(f, E_2^{o,m} ◦ E_1^{p,o}) ≤ B_{K,W'}(f, E_2^{o,m} ◦ E_1^{p,o}).

Thus, inequality Eq. 38 holds for all E_1^{p,o} ∈ E_1^{p,o} and for all compact sets W ⊂ R^p. Summarising, we have seen that there are f ∈ F and ε > 0 such that for all E_1^{p,o} ∈ E_1^{p,o} and for all compact sets W ⊂ R^p we have ε ≤ B_{K,W}(f, E_2^{o,m} ◦ E_1^{p,o}). Hence E_2^{o,m} ◦ E_1^{p,o} does not have the m, n, p MEP with respect to F, and we have obtained a contradiction, which proves the result.

C.3. Topological Obstructions to Manifold Learning with Neural Networks


C.3.1. S^1 Cannot Be Mapped Extendably to the Trefoil Knot

We first show that there are no maps E := T ◦ R, where R : R^2 → R^3 is injective and linear and T is a homeomorphism of R^3, such that E(S^1) is a trefoil knot. We use the fact that the trivial knot S^1 and the trefoil knot M = f(S^1) are not equivalent, that is, there is no homeomorphism of R^3 that maps S^1 to M. Indeed, by (Murasugi, 2008, Section 3.2), the trefoil knot M and its mirror image are not equivalent, whereas the trivial knot S^1 and its mirror image are equivalent. Hence, M and R(S^1) are not equivalent knots in R^3. Thus by (Murasugi, 2008, Definition 1.3.1 and Theorem 1.3.1), we see that there is no orientation-preserving homeomorphism T : R^3 → R^3 such that T(R^3 \ R(S^1)) = R^3 \ M. As the orientation of the map T can be changed by composing T with the reflection J : R^3 → R^3 across the plane Range(R), which defines a homeomorphism J : R^3 \ R(S^1) → R^3 \ R(S^1), we see that there is no homeomorphism T : R^3 → R^3 at all such that T(R^3 \ R(S^1)) = R^3 \ M.
This example shows that the composition E = T ◦ R of a linear map R and a coupling flow T cannot have the property that E(S^1) = f(S^1) for this embedding f. Moreover, the complement R^3 \ E(S^1) is never homeomorphic to R^3 \ f(S^1) for any such map E.
for any such map E.
We now construct another example, similar to Figure 2, where an annulus is mapped to a knotted ribbon in R^3. To do this, replace the circle S^1 by the annulus K = {x ∈ R^2 : 1/2 ≤ |x| ≤ 3/2}, which in polar coordinates is {(r, θ) : 1/2 ≤ r ≤ 3/2}, and define a map F : K → R^3 in polar coordinates by

F(r, θ) = f(θ) + a(r − 1)v(θ),

where f : S^1 → Σ_1 ⊂ R^3 is a smooth embedding of S^1 onto a trefoil knot Σ_1, v(θ) ∈ R^3 is a unit vector normal to Σ_1 at the point f(θ) that depends smoothly on θ, and a > 0 is a small number. In this case, M_1 = F(K) is a 2-dimensional submanifold of R^3 with boundary, which can be visualized as a knotted ribbon.
We now show that there are no maps E = T ◦ R such that E(K) = F(K), where T : R^3 → R^3 is an embedding and R : R^2 → R^3 is injective and linear. The key insight is that if such a T existed, then the trefoil knot would be equivalent to S^1 in R^3, which is known to be false.
Let U_ρ(A) denote the ρ-neighborhood of the set A in R^3. It is easy to see that R^2 \ ({0} × [−1, 1]) is homeomorphic to R^2 \ B_{R^2}(0, 1), which is further homeomorphic to R^2 \ {0}. Thus, using tubular coordinates near Σ_1 and a sufficiently small ρ > 0, we see that R^3 \ M_1 is homeomorphic to R^3 \ U_ρ(Σ_1), which is further homeomorphic to R^3 \ Σ_1. Also, when R : R^2 → R^3 is an injective linear map, we see that M_2 = R(K) is an un-knotted band in R^3 and R^3 \ M_2 is homeomorphic to R^3 \ Σ_2, where Σ_2 = R(S^1) is a trivial knot. If R^3 \ M_1 and R^3 \ M_2 were homeomorphic, then R^3 \ Σ_1 and R^3 \ Σ_2 would also be homeomorphic, which is not possible by knot theory, see (Murasugi, 2008, Definition 1.3.1 and Theorem 1.3.1). This shows that there are no injective linear maps R : R^2 → R^3 and homeomorphisms Φ : R^3 → R^3 such that (Φ ◦ R)(K) = M_1.
Similar examples can be obtained in higher dimensions by using knotted tori (Séquin, 2011)^6 and their Cartesian products.
^6 On the knotted torus, see http://gallery.bridgesmathart.org/exhibitions/2011-bridges-conference/sequin.

C.3.2. Linear Homeomorphism Composition


In this subsection we prove that the topological obstructions to universality presented in Section 3.3 still apply when the
expansive elements are allowed to be hom(R3 , R3 ) ◦ R3×2 . This fact follows from the observation that hom(R3 , R3 ) ◦
hom(R3 , R3 ) = hom(R3 , R3 ), which yields that Ê = E.

C.3.3. The Proof of Theorem 3.8


Given an f ∈ emb^k(K, R^m), for k ≥ 1, we first show that for m ≥ 2n + 1 there is always a diffeomorphism Ψ : R^m → R^m so that Ψ ◦ f : R^n → {0}^n × R^{m−n}. The construction of such a Ψ borrows ideas from Whitney's embedding theorem (Hirsch, 2012, Theorems 3.4 & 3.5) and proceeds by iteratively constructing injective projections.
Next, if m − n ≥ 2n + 1, then we can apply (Madsen et al., 1997, Lemma 7.6), a result analogous to the Tietze extension theorem, to show that Ψ|_M : M → {0}^n × R^{m−n} can be extended to a diffeomorphism of the entire space, h : R^m → R^m. Hence f(x) = Ψ^{−1} ◦ h ◦ R(x) for the diffeomorphism Ψ^{−1} ◦ h : R^m → R^m and the zero-padding operator R : R^n → R^m, and thus f ∈ I^k(K, R^m). The fact that such a diffeomorphism can always be extended when m is sufficiently large compared to n is related to the fact that in 4 dimensions all knots can be opened. This can be contrasted with the case in Figure 2.
We now present our proof.

Proof. Let us next prove Eq. 9 when m ≥ 3n + 1. Let

f ∈ embk (Rn , Rm ) (39)

be a C k map and M = f (Rn ) be an embedded submanifold of Rm .


We have that m ≥ 3n + 1 > 2n + 1. Let S^{m−1} be the unit sphere of R^m and let

S R^m = {(x, v) ∈ R^m × R^m : ‖v‖ = 1}

be the sphere bundle of R^m, which is a manifold of dimension 2m − 1. By the proof of Whitney's embedding theorem (Hirsch, 2012, Chapter 1, Theorems 3.4 and 3.5), there is a set of 'problem points' H_1 ⊂ S^{m−1} of Hausdorff dimension at most 2n such that for all w ∈ S^{m−1} \ H_1 the orthogonal projection

P_w : R^m → {w}^⊥ = {y ∈ R^m : y ⊥ w}

restricts to an injective map

P_w|_M : M → {w}^⊥.

Moreover, let Tx M be the tangent space of manifold M at the point x and let us define another set of ‘problem points’ as

H2 = {v ∈ S m−1 : ∃x ∈ M, v ∈ Tx M}.

For w ∈ S^{m−1} \ H_2 the map

P_w|_M : M → {w}^⊥ ⊂ R^m

is an immersion, that is, it has an injective differential. The sphere tangent bundle SM of M has dimension 2n − 1, and the set H_2 has Hausdorff dimension at most 2n − 1. Thus H = H_1 ∪ H_2 has Hausdorff dimension at most 2n < m − 1, and hence the set S^{m−1} \ H is non-empty. For w ∈ S^{m−1} \ H the map P_w|_M : M → {w}^⊥ is a C^k injective immersion and thus

Ñ = P_w(M) ⊂ {w}^⊥

is a C^k submanifold.
Let Z : P_w(M) → M be the C^k function defined by

Z(y) ∈ M,  P_w(Z(y)) = y,

that is, Z is the inverse of P_w|_M : M → P_w(M), where P_w(M) ⊂ {w}^⊥. Let g : Ñ = P_w(M) → R be the function

g(y) = (Z(y) − y) · w,  y ∈ P_w(M).



Then Ñ is an n-dimensional C^k submanifold of the (m − 1)-dimensional Euclidean space {w}^⊥, and g is a C^k function defined on it. By the definition of a C^k submanifold, any point x ∈ Ñ has a neighborhood U ⊂ {w}^⊥ with local C^k coordinates ψ : U → R^{m−1} such that ψ(Ñ ∩ U) = ({0}^{m−1−n} × R^n) ∩ ψ(U). Using these coordinates, we see that g can be extended to a C^k function in U. Using a suitable partition of unity, we see that there is a C^k map G : {w}^⊥ → R that is a C^k extension of g, that is, G|_Ñ = g.
Then the map

Φ_1 : R^m → R^m,  Φ_1(x) = x − G(P_w(x)) w

is a C^k diffeomorphism of R^m that maps M into the (m − 1)-dimensional space {w}^⊥, that is,

Φ_1(M) ⊂ {w}^⊥.

In the case when m ≥ 3n + 1, we can repeat this construction n times. This is possible as m − n ≥ 2n + 1. We then obtain C^k diffeomorphisms Φ_j : R^m → R^m, j = 1, . . . , n, whose composition Φ_n ◦ ⋯ ◦ Φ_1 : R^m → R^m is a C^k-diffeomorphism for which

M' = Φ_n ◦ ⋯ ◦ Φ_1(M) ⊂ Y',

where Y' ⊂ R^m is an (m − n)-dimensional linear subspace. By letting Ψ = Q ◦ Φ_n ◦ ⋯ ◦ Φ_1 for a rotation matrix Q ∈ R^{m×m} chosen so that Y := Q(Y') = {0}^n × R^{m−n}, and setting X = R^n × {0}^{m−n}, we may define φ : X → R^m by

φ(x, 0) = Ψ(f(x)) ∈ Y,

where f is the function given in Eq. 39. Let A = X and B = Ψ(f(R^n)) = Q(M') ⊂ Y. Then A is a C^k-submanifold of R^m contained in X, B is a C^k-submanifold of Y, and φ : A → B is a C^k-diffeomorphism. We observe that m − n ≥ 2n + 1, and so we can apply (Madsen et al., 1997, Lemma 7.6) to extend φ to a C^k-diffeomorphism

h : R^m → R^m

such that h|_A = φ. Note that (Madsen et al., 1997, Lemma 7.6) concerns the extension of a homeomorphism, but as the extension h is given by an explicit formula which is locally a finite sum of C^k functions, the same proof gives a C^k-diffeomorphic extension h of the diffeomorphism φ. Indeed, let A' ⊂ R^n and B' ⊂ R^{m−n} be the sets such that A = A' × {0}^{m−n} and B = {0}^n × B'. Moreover, let φ̃ : A' → R^{m−n} and ψ̃ : B' → R^n be the C^k-smooth maps such that φ(x, 0) = (0, φ̃(x)) for (x, 0) ∈ A and φ^{−1}(0, y) = (ψ̃(y), 0) for (0, y) ∈ B. As A' and B' are C^k-submanifolds, the map φ̃ has a C^k-smooth extension f_1 : R^n → R^{m−n} and the map ψ̃ has a C^k-smooth extension f_2 : R^{m−n} → R^n, that is, f_1|_{A'} = φ̃ and f_2|_{B'} = ψ̃. Following (Madsen et al., 1997, Lemma 7.6), we define the maps h_1 : R^n × R^{m−n} → R^n × R^{m−n},

h_1(x, y) = (x, y + f_1(x)),

and h_2 : R^n × R^{m−n} → R^n × R^{m−n},

h_2(x, y) = (x + f_2(y), y).

Observe that h_2 has the inverse map h_2^{−1}(x, y) = (x − f_2(y), y). Then the map

h = h_2^{−1} ◦ h_1 : R^n × R^{m−n} → R^n × R^{m−n}

is a C^k-diffeomorphism that satisfies h|_A = φ: indeed, for (x, 0) ∈ A we have h_1(x, 0) = (x, φ̃(x)) and h_2^{−1}(x, φ̃(x)) = (x − ψ̃(φ̃(x)), φ̃(x)) = (0, φ̃(x)) = φ(x, 0). This technique is called the 'clean trick'.
Finally, to obtain the claim, we observe that when R : R^n → R^m, R(x) = (x, 0) ∈ R^n × {0}^{m−n} = X is the zero-padding operator, we have

f(x) = Ψ^{−1}(φ(R(x))),  x ∈ R^n.

As h|_X = φ and R(x) ∈ X, this yields

f(x) = Ψ^{−1}(h(R(x))),  x ∈ R^n,

that is,

f = E ◦ R,

where E = Ψ^{−1} ◦ h : R^m → R^m is a C^k diffeomorphism. Thus f ∈ I^k(R^n, R^m). This proves Eq. 9 when m ≥ 3n + 1.
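As a concrete illustration of the clean trick, the following is a minimal numerical sketch (our own, not part of the proof) with n = 1 and m − n = 1, using the toy assumption φ̃(x) = x^3 and ψ̃ = φ̃^{−1} in place of the extensions f_1 and f_2; it checks that h = h_2^{−1} ◦ h_1 agrees with φ on A = {(x, 0)}.

```python
import numpy as np

# Toy instance of the clean trick with n = 1, m - n = 1 (illustrative assumption:
# phi_tilde(x) = x**3 is a diffeomorphism of R onto R, psi_tilde is its inverse).
f1 = lambda x: x**3            # extension of phi_tilde : A' -> R^{m-n}
f2 = lambda y: np.cbrt(y)      # extension of psi_tilde : B' -> R^n

h1     = lambda x, y: (x, y + f1(x))        # h1(x, y) = (x, y + f1(x))
h2_inv = lambda x, y: (x - f2(y), y)        # inverse of h2(x, y) = (x + f2(y), y)
h      = lambda x, y: h2_inv(*h1(x, y))     # h = h2^{-1} o h1

# On A = {(x, 0)}, h agrees with phi(x, 0) = (0, phi_tilde(x)).
for x in np.linspace(-2.0, 2.0, 9):
    u, v = h(x, 0.0)
    assert np.isclose(u, 0.0) and np.isclose(v, f1(x))
```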

C.4. Universality
C.4.1. The Proof of Lemma 3.9

The proof of Lemma 3.9. (i) Let us consider ε > 0, a compact set K ⊂ R^n and f ∈ emb(R^n, R^m). Let W = K × {0}^{o−n} and let F : R^o → R^m be the map given by F(x, y) = f(x), (x, y) ∈ R^n × R^{o−n}. Because R^{o,m} ⊂ emb(R^o, R^m) is a uniform universal approximator of C(R^o, R^m), there is an R ∈ R^{o,m} such that ‖F − R‖_{L^∞(W)} < ε. Then for the map E = I ◦ R we have that B_{K,W}(f, E) < ε. This is true for every ε > 0, and so E^{o,m} has the MEP w.r.t. the family emb(R^n, R^m).

(ii) Recall that f := Φ_0 ◦ R_0 for Φ_0 ∈ Diff^1(R^m, R^m) and linear R_0 : R^n → R^m, and that R ∈ R is such that R|_U is linear for an open set U. We present the proof in the case when n = o, under the assumption that R|_K is linear. In this case, there exists an affine map A : R^m → R^m such that R_0 = A ◦ R, and hence K̃ := R_0(K) = A(R(K)).
Let ε > 0 be given. By (Hirsch, 2012, Chapter 2, Theorem 2.7), the space Diff^2(R^m, R^m) is dense in the space Diff^1(R^m, R^m), and so there is some Φ_1 ∈ Diff^2(R^m, R^m) such that

‖Φ_1|_{K̃} − Φ_0|_{K̃}‖_{L^∞(K̃; R^m)} < ε/2.

Then, let T ∈ T^m be such that ‖T − Φ_1 ◦ A‖_{L^∞(R(K); R^m)} < ε/2. Then we have that

‖T ◦ R − f‖_{L^∞(K)} = ‖T ◦ R − Φ_0 ◦ R_0‖_{L^∞(K)}
≤ ‖T ◦ R − Φ_1 ◦ A ◦ R‖_{L^∞(K)} + ‖Φ_1 ◦ A ◦ R − Φ_0 ◦ R_0‖_{L^∞(K)}
≤ ‖T − Φ_1 ◦ A‖_{L^∞(R(K))} + ‖Φ_1 ◦ A ◦ R − Φ_0 ◦ A ◦ R‖_{L^∞(K)}
< ε/2 + ε/2 = ε.

Hence, if we let r = T ◦ R ◦ f^{−1} ∈ emb(f(K), T ◦ R(K)), then we obtain that B_{K,K}(f, T ◦ R) < ε. This holds for any ε, and hence the family T^m ◦ R^{o,m} has the MEP for I(R^n, R^m).
The proof in the case that o ≥ n follows with minor modification, applying Lemma C.1 point 5.

C.4.2. The Proof of Example 1


Proof. (i) From (Puthawala et al., 2020, Theorem 15) we have that Ro,m can approximate any continuous function
f ∈ emb(Rn , Rm ). Further, clearly (T1) and (T2) both contain the identity map, thus Lemma 3.9 (i) applies.

(ii) Let T^m be the family of autoregressive flows with sigmoidal activations defined in (Huang et al., 2018). By (Teshima et al., 2020, App. G, Theorem 1 and Proposition 7), T^m is a sup-universal approximator in the space Diff^2(R^m, R^m)
of C 2 -smooth diffeomorphisms Φ : Rm → Rm . When Ro,m is one of (R1) or (R2) the network is always linear,
hence the conditions are satisfied. If Ro,m is (R4), then Ro,m contains linear mappings, and if (R3), then we can shift
the origin, so that R(x) is linear on K. In all cases, Lemma 3.9 part (ii) applies.

C.4.3. The Proof of Theorem 3.10


The proof of Theorem 3.10. First we prove the claim under assumption (i).
Let W ⊂ R^n be an open relatively compact set. From Lemma 3.4 we have that

E^{n,m} := E_L^{n_{L−1},m} ◦ ⋯ ◦ E_1^{n,n_1}    (40)

has the m, n, n MEP w.r.t. F := F_L^{n_{L−1},m} ◦ ⋯ ◦ F_1^{n,n_1}. Thus for any ε_1 > 0, we have an Ẽ ∈ E^{n,m} s.t. B_{K,W}(f, Ẽ) < ε_1.
From Lemma C.1 point 9, we have the existence of a µ' ∈ P(W) so that W_2(f_# µ, Ẽ_# µ') < ε_1. By convolving µ' with a suitable mollifier φ, we can obtain a measure µ'' = µ' ∗ φ ∈ P(W) that is absolutely continuous with respect to the Lebesgue measure and satisfies

W_2(µ', µ'') < ε_1 / (1 + Lip(Ẽ)),

see (Ambrosio et al., 2008, Lemma 7.1.10), and so W_2(Ẽ_# µ', Ẽ_# µ'') < ε_1. Hence,

W_2(f_# µ, Ẽ_# µ'') < 2ε_1.    (41)

Next, from the universality of T_0^n, for any ε_2 > 0 we have the existence of a T_0 ∈ T_0^n so that W_2(µ'', T_{0#} µ) < ε_2. From Lemma C.1 points 7 and 8 we have that

W_2(f_# µ, (Ẽ ◦ T_0)_# µ) ≤ 2ε_1 + ε_2 Lip(Ẽ).    (42)

For a given ε > 0, choosing ε_1 < ε/4 and ε_2 < ε/(2(1 + Lip(Ẽ))) yields that the map E = Ẽ ◦ T_0 ∈ E is such that W_2(f_# µ, E_# µ) < ε. This yields the result.
Next we prove the claim under the assumptions (ii). By our assumptions, in the weak topology of the space C^2(R^{n_j}, R^{n_j}), the closure of the set T^{n_j} ⊂ C^2(R^{n_j}, R^{n_j}) contains the space Diff^2(R^{n_j}, R^{n_j}). Moreover, by our assumptions R^{n_{j−1},n_j} contains a linear map R. We observe that, as R^{n_{j−1},n_j} is a space of expansive elements, the map R is injective, and hence by Lemma 3.9 the family

E_j^{n_{j−1},n_j} = T^{n_j} ◦ R^{n_{j−1},n_j}

has the MEP w.r.t. F = I^1(R^n, R^m). By Theorem 3.8, we have that I^1(R^n, R^m) coincides with the space emb^1(R^n, R^m). Finally, the assumption that T_0^{n_0} is dense in the space of C^2-diffeomorphisms Diff^2(R^{n_0}) implies that T_0^{n_0} is an L^p-universal approximator for the set of C^∞-smooth triangular maps for all p < ∞. Hence, by Lemma 3 in Appendix A of (Teshima et al., 2020), T_0^{n_0} is distributionally universal. From these facts, the claim in case (ii) follows in the same way as case (i), using the family F = emb^1(R^n, R^m).

C.4.4. The Proof of Lemma 3.11


The proof of Lemma 3.11. The proof follows from taking the logical negation of the MEP for F. If the MEP is not satisfied, then there are some f ∈ F and ε > 0 so that B_{K,W}(f, E) ≥ ε for all E ∈ E. Applying the definition of B_{K,W}(f, E) from Eqn. 6 yields the result.

C.4.5. The Proof of Cor. 3.12


The proof of Cor. 3.12. The proof of Eqn 12 follows from the definition of the MEP.
From Eqn. 12, for i = 1, . . . we have the existence of ε_i := B_{K,W}(f, E_i), where lim_{i→∞} ε_i = 0, and of r_i ∈ emb(f(K), E_i(W)) such that ‖I − r_i‖_{L^∞(f(K))} ≤ 2ε_i. Applying Lemma C.1 point 8, we have that for any E' ∈ E^{n,o}(X, W)

B_{K,X}(f, E_i ◦ E') ≤ 2ε_i + Lip(E_i) B_{K,X}(E_i^{−1} ◦ r_i ◦ f, E').    (43)

Because E^{n,o}(X, W) has the o, n, n MEP, for each i = 1, . . . we can find an E'_i ∈ E^{n,o}(X, W) such that B_{K,X}(E_i^{−1} ◦ r_i ◦ f, E'_i) ≤ ε_i / (1 + Lip(E_i)), and so B_{K,X}(f, E_i ◦ E'_i) ≤ 3ε_i. For this choice of E'_i, we have that lim_{i→∞} B_{K,X}(f, E_i ◦ E'_i) = 0.
From Lemma C.1 point 9, we have that for any absolutely continuous µ ∈ P(K) there is an absolutely continuous µ'_i ∈ P(X) such that W_2(f_# µ, (E_i ◦ E'_i)_# µ'_i) ≤ 3ε_i. By the universality of T^n, the continuity of E_i ◦ E'_i, and the absolute continuity of µ and µ'_i, we have the existence of T_i ∈ T^n so that

W_2(f_# µ, (E_i ◦ E'_i ◦ T_i)_# µ) ≤ 4ε_i    (44)

for each i = 1, . . . . This proves the claim.



(a) The unknot. (b) The trefoil knot.
Figure 5: An example showing how the unknot (left) can be deformed to approximate the trefoil knot (right). The black parts of both knots are identical, and the red section can be made arbitrarily skinny by bringing the black points together. This can be done while sending the measure of the red sections to zero, provided the starting measure has no atoms. In this way, we can construct a sequence of diffeomorphisms (E_i)_{i=1,...} so that W_2(E_{i#} µ, ν) → 0, where µ is the uniform measure on S^1 and ν the uniform measure on the trefoil knot. We would like to thank Reviewer 4 for suggesting this discussion and providing the figure (in TikZ code!).

C.4.6. Further Discussion on Matching Topology Exactly vs Approximately


In this section we discuss a theoretical gap between the positive approximation results of Theorem 3.10 and the negative
exact mapping results of Lemma 3.11. We show two main results.
First, we construct sequences of maps of the form E = T ◦ R whose pushforwards of the uniform measure on S^1 converge to the uniform measure on the trefoil knot. As discussed in Section 3.3, there are no mappings of this form which map S^1 to the trefoil knot exactly, but there are approximate mappings. This shows that there is some overlap between the two results: measures supported on the images of non-extendable embeddings may be approximated in distribution by extendable mappings.
Second we prove that sequences of functions that approximate non-extendable embeddings with extendable ones neces-
sarily have unbounded gradients. This result shows that, when restricted to approximation by sequences with bounded
gradients, either Theorem 3.10 or Lemma 3.11 can apply, but never both.
Example 2. There is a sequence of extendable embeddings (E_i)_{i=1,...} that maps the uniform measure on S^1, denoted µ, arbitrarily close to the uniform measure on the trefoil knot, denoted ν, in the sense that

lim_{i→∞} W_2(E_{i#} µ, ν) = 0.

Proof. The key idea of the construction is shown in Figure 5. In that figure the unknot is bent so that it overlaps the trefoil
knot, outside of an exceptional set (shown in red in Figure 5) which can be made as small as desired. The result follows by
constructing a sequence of functions which ‘squeeze’ this red section as small as possible.
Let µ be the uniform probability measure on S 1 ⊂ R2 , and ν the uniform probability measure on the trefoil knot, M. Let
R : R2 → R3 be a fixed linear map of the form R(x) = (x, 0).
We define a sequence (X_i)_{i=1,...} of unknots in the following way. For any choice of two points on the top of the trefoil knot, shown in black in Figure 5, we can replace the straight-line red section between them with a U-shaped section as shown in Figure 5a, so that the resulting knot is the unknot. We obtain X_1 by letting the black points be a distance 1 apart, X_2 by letting them be a distance 1/2 apart, and so on, so that for X_i the two points are a distance 1/i apart. Further, for each X_i we define A_i and B_i, where A_i is the U-shaped piece of X_i (in red) and B_i = X_i \ A_i. Observe that B_i ⊂ M.
Let (T'_i)_{i=1,...} be a family of diffeomorphisms such that T'_i : R^3 → R^3 maps S^1 × {0} to X_i. Further, let (T''_i)_{i=1,...} be such that T''_i : X_i → X_i and χ_{B_i} (T''_i ◦ T'_i ◦ R)_# µ = χ_{B_i} ν, where χ_{B_i} is the characteristic function of the set B_i.
Then we define E_i := T''_i ◦ T'_i ◦ R and compute

W_2(E_{i#} µ, ν) ≤ W_2(χ_{A_i} E_{i#} µ, χ_{M\B_i} ν) + W_2(χ_{B_i} E_{i#} µ, χ_{B_i} ν)
= W_2(χ_{A_i} E_{i#} µ, χ_{M\B_i} ν).

As i increases, the length of M \ B_i goes to zero, thus ν(M \ B_i) = E_{i#} µ(A_i) converges to zero. Hence taking limits yields

lim_{i→∞} W_2(E_{i#} µ, ν) ≤ lim_{i→∞} W_2(χ_{A_i} E_{i#} µ, χ_{M\B_i} ν) = 0.

Finally, E_i is certainly an extendable embedding, as R is linear and T''_i ◦ T'_i is a diffeomorphism.

The above proof also applies when ν or µ have finitely many atoms. The same construction works if Ai is chosen so that
it contains no atoms for sufficiently large i.
Next, we show that any sequence of functions to which both the implication of Theorem 3.10 and the conditions of Lemma 3.11 apply cannot be uniformly Lipschitz. This implies that if such functions are differentiable, they have unbounded gradients.
Lemma C.2. Let f be continuous and let (E_i)_{i=1,...} be a sequence of continuous functions that are uniformly Lipschitz with constant L. Suppose that for all compact subsets K and W of R^n there is an ε > 0 so that for all i and all r ∈ emb(f(K), E_i(W)), ‖I − r‖_{L^∞(f(K))} ≥ ε. If µ is the uniform measure on K, then lim_{i→∞} W_2(f_# µ, E_{i#} µ) > 0.

Proof. Let E_i be uniformly Lipschitz with constant L. Consider an ε/2 tubular neighborhood of f(K). From the fact that ‖I − r‖_{L^∞(f(K))} ≥ ε for all r ∈ emb(f(K), E_i(W)), we have that there is a point x ∈ E_i(W) that lies outside of this neighborhood. From the uniform Lipschitzness of E_i, for each i there is a ball B of radius ε/(4L) around x so that all points in E_i(W) ∩ B are more than ε/4 away from f(K). We also have that E_{i#} µ(B) > c, where c is the volume of an n-dimensional ball of radius ε/(4L). Thus W_2(f_# µ, E_{i#} µ) > cε/(4L) for each i, and so lim_{i→∞} W_2(f_# µ, E_{i#} µ) > 0.

C.5. Layerwise Inversion and Recovery of Weights


C.5.1. Layer-wise Projection

Here we provide the details of our closed-form layerwise projection algorithm. The flow layers are injective, and are often implemented to be numerically easy to invert. Thus, the crux of the algorithm comes from inverting the injective expansive layers, R. The range of a ReLU layer is piece-wise affine, hence the inversion follows a two-step program. First, identify which affine piece (described algebraically by a sign pattern) to project onto. Second, project onto this piece using a standard least-squares solver.
The second step is always straightforward to analyze, but the first is more complicated.
 
The key step in our algorithm is the fact that for the specific choice of weight matrix W = [B; −DB] (the matrices B and −DB stacked vertically), given any y ∈ R^{2n}, we can always solve the least-squares inversion problem exactly.
We prove this result in several parts given below.

1. For any y ∈ R2n , My W ∈ Rn×n is full-rank.


2. If [y]_i ≠ [y]_{i+n} for each i = 1, . . . , n, then the argmin in Eqn. 17 is well defined, i.e., there is a unique minimizer. Otherwise there are 2^I minimizers, where I is the number of distinct i such that [y]_i = [y]_{i+n}.

3. If M̃_y = [∆_y  (I^{n×n} − ∆_y)], then

min_{x∈R^n} ‖y − R(x)‖_2^2 = min_{x∈R^n} ‖M_y(y − Wx)‖_2^2 + ‖M̃_y y‖_2^2.    (45)

4. We verify Eqn. 17.

The proof of Theorem 3.15. 1. Using the definition of M_y, we have

M_y W = M_y [B; −DB] = (I^{n×n} − ∆_y)B − ∆_y DB = (I^{n×n} − ∆_y − ∆_y D) B.    (46)

But (I^{n×n} − ∆_y − ∆_y D) is a full-rank diagonal matrix (with diagonal entries either 1 or −[D]_{i,i}), and B is full rank by assumption, hence M_y W is too.
−DB
2. Because B is square and full rank, there exists a basis {b̂_i}_{i=1,...,n} of R^n (namely, the columns of the matrix B^{−1}) such that

⟨b̂_j, b_i⟩ = 1 if i = j, and 0 if i ≠ j.    (47)

For x ∈ R^n, let α_i = ⟨x, b_i⟩, i = 1, . . . , n, be the coefficients of x in the b̂_i basis. Then

min_{x∈R^n} ‖y − R(x)‖_2^2 = min_{x∈R^n} Σ_{i=1}^{2n} [y − R(x)]_i^2    (48)
= min_{x∈R^n} Σ_{i=1}^{n} ([y]_i − max(⟨x, b_i⟩, 0))^2 + ([y]_{i+n} − max(⟨x, −[D]_{ii} b_i⟩, 0))^2.    (49)

We now minimize Eqn. 49 by minimizing over the basis expansion in terms of the α_i,

min_{α_i∈R} Σ_{i=1}^{n} ([y]_i − max(α_i, 0))^2 + ([y]_{i+n} − max(−[D]_{ii} α_i, 0))^2.    (50)

Eqn. 50 is clearly minimized by minimizing each term in the sum, hence we search for a minimizer of the i'th term,

min_{α_i∈R} ([y]_i − max(α_i, 0))^2 + ([y]_{i+n} − max(−[D]_{ii} α_i, 0))^2.    (51)

Denoting by f(α_i) the quantity inside the minimum of Eqn. 51, we consider the positive, negative and zero α_i cases of Eqn. 51 separately and get

min_{α_i∈R^+} f(α_i) = min_{α_i∈R^+} ([y]_i − α_i)^2 + [y]_{i+n}^2 = [y]_{i+n}^2,    (52)
min_{α_i∈R^−} f(α_i) = min_{α_i∈R^−} [y]_i^2 + ([y]_{i+n} + [D]_{ii} α_i)^2 = [y]_i^2,    (53)
f(0) = [y]_i^2 + [y]_{i+n}^2.    (54)

If [y]_{i+n} > [y]_i, then the minimizer of Eqn. 51 is α_i = −[y]_{i+n}/[D]_{ii} < 0. Conversely, if [y]_{i+n} < [y]_i, then the minimizer of Eqn. 51 is α_i = [y]_i > 0. This argument applies to all i = 1, . . . , n, and hence if [y]_i ≠ [y]_{i+n} for all i = 1, . . . , n, then the minimizing x is unique.
If [y]_i = [y]_{i+n}, then there are exactly two minimizers of f(α_i), namely −[y]_{i+n}/[D]_{ii} and [y]_i, for both of which f(α_i) = [y]_i^2 = [y]_{i+n}^2.
3. If we suppose that [y]_{i+n} − [y]_i > 0, then [c(y)]_i = 0 and [c(y)]_{i+n} > 0, thus [∆_y]_{ii} = 1. Hence, letting x_min be the minimizing x from part 2,

([y]_i − max(⟨x_min, b_i⟩, 0))^2 + ([y]_{i+n} − max(⟨x_min, −[D]_{ii} b_i⟩, 0))^2    (55)
= [y]_i^2 + ([y]_{i+n} − max(⟨x_min, −[D]_{ii} b_i⟩, 0))^2    (56)
= [M̃_y y]_i^2 + [M_y(y − W x_min)]_i^2.    (57)

If [y]_{i+n} − [y]_i ≤ 0, then we have

([y]_i − max(⟨x_min, b_i⟩, 0))^2 + ([y]_{i+n} − max(⟨x_min, −[D]_{ii} b_i⟩, 0))^2    (58)
= ([y]_i − max(⟨x_min, b_i⟩, 0))^2 + [y]_{i+n}^2    (59)
= [M_y(y − W x_min)]_i^2 + [M̃_y y]_i^2.    (60)

Thus, combining Eqns. 48, 49, 57 and 60 for each i = 1, . . . , n, we have that

min_{x∈R^n} ‖y − R(x)‖_2^2 = min_{x∈R^n} ‖M_y(y − Wx)‖_2^2 + ‖M̃_y y‖_2^2.    (61)

4. For the final point, combining all of the above points, we have by Eqn. 61 that minimizing ‖y − R(x)‖_2^2 over x ∈ R^n is equivalent to solving

min_{x∈R^n} ‖M_y(y − Wx)‖_2^2.    (62)

Further, we have from point 1 that M_y W is full rank, hence (M_y W)^{−1} M_y y = R^†(y) is a minimizer of Eqn. 62. If [y]_i ≠ [y]_{i+n} for all i = 1, . . . , n, then part 2 applies and R^†(y) is the unique minimizer of ‖y − R(x)‖_2^2. In either case, we have that R^†(y) is a minimizer.
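As an illustration of the closed-form projection, the following is a minimal numpy sketch (our own, not the paper's implementation). It assumes the block forms M_y = [I − ∆_y  ∆_y] and M̃_y = [∆_y  I − ∆_y] read off from the proof above, with [∆_y]_{ii} = 1 exactly when [y]_{i+n} > [y]_i, and checks that R^†(y) = (M_y W)^{−1} M_y y recovers x exactly when y = R(x) lies on the range of the layer.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def expansive_layer(B, D):
    """R(x) = ReLU(W x) with W = [B; -D B] stacked, an injective expansive ReLU layer."""
    W = np.vstack([B, -D @ B])
    return lambda x: relu(W @ x)

def project(y, B, D):
    """Closed-form least-squares inverse R_dagger(y) = (M_y W)^{-1} M_y y."""
    n = B.shape[0]
    W = np.vstack([B, -D @ B])
    # sign pattern Delta_y: 1 where the 'negative copy' coordinate dominates
    delta = (y[n:] > y[:n]).astype(float)
    M_y = np.hstack([np.diag(1.0 - delta), np.diag(delta)])  # n x 2n selector
    return np.linalg.solve(M_y @ W, M_y @ y)

# sanity check: exact recovery on the range of R
rng = np.random.default_rng(0)
n = 4
B = rng.standard_normal((n, n))        # full rank with probability one
D = np.diag(rng.uniform(0.5, 2.0, n))  # strictly positive diagonal
R = expansive_layer(B, D)
x = rng.standard_normal(n)
assert np.allclose(project(R(x), B, D), x)
```

The check only exercises points on the range of R, where the recovery is exact; off-range behaviour follows the minimizer analysis in part 2 of the proof.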

C.5.2. Black-box Recovery


We now discuss assumptions that enable black-box recovery of the weights of our entire network post-training.
Assumption C.3. For each ℓ = 1, . . . , L, R_ℓ is an affine ReLU layer. Each T_ℓ and T_0 is constructed from a finite number of affine ReLU layers.
Remark C.4. If a network F of the form of Eqn. 1 satisfies Assumption C.3, then, given black-box access to the network, the range of the network can be recovered exactly.
Further, if the linear region assumption from (Rolnick & Körding, 2020) is satisfied, then the exact weights are recovered, subject to the two natural isometries discussed below.
Remark C.5. The ReLU part of Assumption C.3 is satisfied by all examples in Sec. 2.1. Further, it is also satisfied by both flows considered in Sec. 2.2, provided that the various g_i are given by layers of affine ReLUs.
In (Rolnick & Körding, 2020), the authors show that, although ReLU networks depend on the values of their weight matrices in non-linear ways, it is still possible to recover the exact weights of a given ReLU network in a black-box way, subject to natural isometries. The authors show that this is possible not only in theory, but in numerical applications as well.
The works (Rolnick & Körding, 2020; Bui Thi Mai & Lampert, 2020) imply that, provided the activation functions of the expressive elements are ReLU, the entire network can be recovered in a black-box way. Further, provided that either the 'linear region assumption' from (Rolnick & Körding, 2020) or the generality assumption from (Bui Thi Mai & Lampert, 2020) is satisfied, the entire network can be recovered uniquely modulo the natural isometries of rescaling and permutation of weight matrices.
First we describe the two natural isometries of scaling and permutation. Consider the function

f(x) = W_2 φ(W_1 x),    (63)

where φ is coordinate-wise homogeneous of degree 1 (such as ReLU), W_1 ∈ R^{n_2×n_1} and W_2 ∈ R^{n_3×n_2}. If we let P ∈ R^{n_2×n_2} be any permutation matrix and D_+ be a diagonal matrix with strictly positive entries, then we can also write

f(x) = W_2 P^⊤ D_+^{−1} φ(D_+ P W_1 x).    (64)

Thus ReLU networks can only ever be identified up to these two isometries. When we describe unique recovery in the rest of this section, we mean modulo these two isometries.
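The following is a small numpy check of this rescaling and permutation invariance (our own illustration; the dimensions and random weights are arbitrary): the two networks have different weight matrices but define the same function.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

rng = np.random.default_rng(1)
n1, n2, n3 = 5, 7, 3
W1 = rng.standard_normal((n2, n1))
W2 = rng.standard_normal((n3, n2))

P  = np.eye(n2)[rng.permutation(n2)]      # permutation matrix
Dp = np.diag(rng.uniform(0.5, 2.0, n2))   # strictly positive diagonal rescaling

f = lambda x: W2 @ relu(W1 @ x)                                         # Eqn. 63
g = lambda x: (W2 @ P.T @ np.linalg.inv(Dp)) @ relu((Dp @ P @ W1) @ x)  # Eqn. 64

x = rng.standard_normal(n1)
assert np.allclose(f(x), g(x))   # identical functions, different parameters
```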
In (Rolnick & Körding, 2020), the authors describe how all parameters of a ReLU network can be recovered uniquely (called reverse engineering in (Rolnick & Körding, 2020)), subject to the so-called 'linear region assumption' (LRA).^8
The input space R^n can be partitioned into a finite number of open sets {S_i}_{i=1}^{n_i} such that on each S_i the network is affine, f(x) = W_i x + b_i; i.e., each region corresponds to an affine polyhedron in the output space. The algorithms (Rolnick & Körding, 2020, Alg.s 1 & 2) are roughly described below.
First, identify at least one point within each affine polyhedron. Then identify the boundaries between polyhedra. The boundaries between regions are always one affine 'piece' of the piecewise hyperplanes {H_j}_{j=1}^{n_j}. These {H_j}_{j=1}^{n_j} are the central objects, which indicate the (de)activation of an element of a ReLU somewhere in the network. If H_j is a full hyperplane, then the ReLU that (de)activates occurs in the first layer of the network. If H_j is not a full hyperplane, then it necessarily has a bend where it intersects another hyperplane H_{j'}. Further, except for a Lebesgue measure 0 set, when H_j intersects H_{j'}, the latter does not have a bend. If this is the case, then H_{j'} corresponds to a ReLU (de)activation in an earlier layer than H_j. In this way the activation functions of every layer can be deduced. Once this is done, the normals of the hyperplanes can be used to infer the row vectors of the various weight matrices, letting one recover the entire network.
^8 The use of 'linear' in this context is somewhat non-standard, and instead means affine. In this section we use the term 'linear region assumption', but use 'affine' where (Rolnick & Körding, 2020) would use 'linear' to preserve mathematical meaning.
The above algorithm recovers all of the weights exactly provided that the LRA is satisfied. The LRA is satisfied if for every distinct S_i and S_{i'}, either W_i ≠ W_{i'} or b_i ≠ b_{i'}. That is, different sign patterns produce different affine pieces in the output space. This is a natural assumption, as the reconstruction algorithm described above works by first detecting the boundaries between adjacent affine polyhedra, which is only possible if the LRA holds.
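The following numpy sketch (our own toy illustration, not the algorithm of (Rolnick & Körding, 2020)) makes the correspondence between sign patterns and affine pieces concrete: it samples inputs to a tiny ReLU network, reads off the activation pattern at each point, and recovers the affine map W_i x + b_i that the network equals on that point's region.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# a tiny two-layer ReLU network f(x) = W2 relu(W1 x + b1) + b2
rng = np.random.default_rng(2)
W1, b1 = rng.standard_normal((6, 2)), rng.standard_normal(6)
W2, b2 = rng.standard_normal((1, 6)), rng.standard_normal(1)

def affine_piece(x):
    """Return (A, c, sign pattern) with f = A x + c on x's linear region."""
    s = (W1 @ x + b1 > 0).astype(float)   # activation (sign) pattern at x
    A = W2 @ (s[:, None] * W1)            # Jacobian on this region
    c = W2 @ (s * (W1 @ x + b1)) + b2 - A @ x
    return A, c, tuple(s.astype(int))

# distinct sign patterns encountered = distinct affine polyhedra sampled
patterns = {affine_piece(x)[2] for x in rng.standard_normal((2000, 2))}
print(f"sampled {len(patterns)} distinct linear regions")
```

Counting distinct sign patterns is only the bookkeeping step; the boundary-tracing and weight-inference steps of (Rolnick & Körding, 2020, Alg.s 1 & 2) are not reproduced here.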
Given the weights of a network, there is, to our knowledge, currently no simple way to check whether the LRA is satisfied. Nevertheless, the authors of (Rolnick & Körding, 2020) show that if it is satisfied, then unique recovery follows. Even without the LRA, recovery of the range of the entire network is still possible, but this recovery may not be unique.
In (Bui Thi Mai & Lampert, 2020), the authors also consider the problem of recovering the weights of a ReLU neural network; however, they study the question of when there exist isometries beyond the two natural ones described above. In particular, the main result (Bui Thi Mai & Lampert, 2020, Theorem 1) shows the following. Let E^{n_0,n_L} be a family of ReLU networks that are L layers deep with non-increasing widths. Suppose that E_1, E_2 ∈ E^{n_0,n_L} are general^9 and E_1(x) = E_2(x) for all x ∈ R^{n_0}; then E_1 is parametrically identical to E_2, subject to the two natural isometries.
This work provides the stronger result; however, it does not apply out of the box to the networks that we consider. It does apply to our expressive elements (provided that they use ReLU activation functions and have non-increasing widths), but not necessarily to the network as a whole.

^9 A set is general in the topological sense if its complement is closed and nowhere dense.
