Solving Inverse Problems Using Data-Driven Models
© The Author(s), 2019
This is an Open Access article, distributed under the terms of the Creative Commons
Attribution-NonCommercial-NoDerivatives licence (http://creativecommons.org/licenses/by-nc-nd/4.0/),
which permits non-commercial re-use, distribution, and reproduction in any medium, provided the
original work is unaltered and is properly cited. The written permission of Cambridge University Press
must be obtained for commercial re-use or in order to create a derivative work.
doi:10.1017/S0962492919000059
Peter Maass
Department of Mathematics,
University of Bremen, Postfach 330 440,
28344 Bremen, Germany
E-mail: pmaass@math.uni-bremen.de
Ozan Öktem
Department of Mathematics,
KTH – Royal Institute of Technology,
SE-100 44 Stockholm, Sweden
E-mail: ozan@kth.se
Carola-Bibiane Schönlieb
Department of Applied Mathematics and Theoretical Physics,
Cambridge University, Wilberforce Road,
Cambridge, CB3 0WA, UK
E-mail: C.B.Schoenlieb@damtp.cam.ac.uk
CONTENTS
1 Introduction
2 Functional analytic regularization
3 Statistical regularization
4 Learning in functional analytic regularization
5 Learning in statistical regularization
6 Special topics
7 Applications
8 Conclusions and outlook
Acronyms
Appendices
References
1. Introduction
In several areas of science and industry there is a need to reliably recover
a hidden multi-dimensional model parameter from noisy indirect observa-
tions. A typical example is when imaging/sensing technologies are used in
medicine, engineering, astronomy and geophysics. These so-called inverse
problems are often ill-posed, meaning that small errors in data may lead to
large errors in the model parameter, or there are several possible model para-
meter values that are consistent with observations. Addressing ill-posedness
is critical in applications where decision making is based on the recovered
model parameter, for example in image-guided medical diagnostics. Fur-
thermore, many highly relevant inverse problems are large-scale: they in-
volve large amounts of data and the model parameter is high-dimensional.
Traditionally, an inverse problem is formalized as solving an equation of
the form
g = A(f ) + e.
Here g ∈ Y is the measured data, assumed to be given, and f ∈ X is the
model parameter we aim to reconstruct. In many applications, both g and
f are elements in appropriate function spaces Y and X, respectively. The
mapping A : X → Y is the forward operator, which describes how the model
parameter gives rise to data in the absence of noise and measurement errors,
and e ∈ Y is the observational noise that constitutes random corruptions in
the data g. The above view constitutes a knowledge-driven approach, where
the forward operator and the probability distribution of the observational
noise are derived from first principles.
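As a toy illustration of this formulation (the Gaussian blur forward operator, noise level and grid size below are ad hoc choices, not taken from the text), the following NumPy sketch discretizes a linear forward operator, generates data g = A(f) + e and shows how naive inversion amplifies even a small amount of noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discretization: X = Y = R^n, A a one-dimensional Gaussian blur.
n = 200
x = np.linspace(0.0, 1.0, n)
A = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.03) ** 2)
A /= A.sum(axis=1, keepdims=True)                 # row-normalized blur matrix

f_true = (np.abs(x - 0.3) < 0.1).astype(float) + 0.2 * np.sin(6 * np.pi * x)
e = 1e-3 * rng.standard_normal(n)                 # small observational noise
g = A @ f_true + e                                # g = A(f) + e

f_naive = np.linalg.solve(A, g)                   # 'inverting' the discretized operator

print("condition number of A  :", np.linalg.cond(A))
print("relative data error    :", np.linalg.norm(e) / np.linalg.norm(A @ f_true))
print("relative solution error:", np.linalg.norm(f_naive - f_true) / np.linalg.norm(f_true))
```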
Classical research on inverse problems has focused on establishing conditions which guarantee that solutions to such ill-posed problems exist and on developing regularized schemes for approximating them in a stable way.
1.1. Overview
This survey investigates algorithms for combining model- and data-driven
approaches for solving inverse problems. To do so, we start by reviewing
some of the main ideas of knowledge-driven approaches to inverse prob-
lems, namely functional analytic inversion (Section 2) and Bayesian inver-
sion (Section 3), respectively. These knowledge-driven inversion techniques
are derived from first principles of knowledge we have about the data, the
model parameter and their relationship to each other.
Knowledge- and data-driven approaches can now be combined in several
different ways depending on the type of reconstruction one seeks to com-
pute and the type of training data. Sections 4 and 5 represent the core
of the survey and discuss a range of inverse problem approaches that in-
troduce data-driven aspects in inverse problem solutions. Here, Section 4
is the data-driven sister section to functional analytic approaches in Sec-
tion 2. These approaches are primarily designed to combine data-driven
methods with functional analytic inversion. This is done either to make
functional analytic approaches more data-driven by appropriate paramet-
rization of these approaches and adapting these parametrizations to data, or
to accelerate an otherwise costly functional analytic reconstruction method.
Many reconstruction methods, however, are not naturally formulated
within the functional analytic view of inversion. An example is the posterior
mean reconstruction, whose formulation requires adopting the Bayesian
view of inversion. Section 5 is the data-driven companion to Bayesian inver-
sion in Section 3, and surveys methods that combine data- and knowledge-
driven methods in Bayesian inversion. The simplest is to apply data-driven
post-processing of a reconstruction obtained via a knowledge-driven method.
A more sophisticated approach is to use a learned iterative scheme that
integrates a knowledge-driven model for how data are generated into a
data-driven method for reconstruction. The latter is done by unrolling
a knowledge-driven iterative scheme, and both approaches, which compute
statistical estimators, can be combined with forward operators that are par-
tially learned via a data-driven method.
The above approaches come with different trade-offs concerning demands
on training data, statistical accuracy and robustness, functional complex-
ity, stability and interpretability. They also impact the choice of machine
learning methods and algorithms for training. Certain recent – and some-
what anecdotal – topics of data-driven inverse problems are discussed in
Section 6, and exemplar practical inverse problems and their data-driven
solutions are presented in Section 7.
Within data-driven approaches, deep neural networks will be a focus
of this survey. For an introduction to deep neural networks, we recommend Courville, Goodfellow and Bengio (2017) and Higham and
Higham (2018) for a general introduction to deep learning; see also Vidal,
Bruna, Giryes and Soatto (2017) for a survey of work that aims to provide a
mathematical justification for several properties of deep networks. Finally,
the reader may also consult Ye, Han and Cha (2018), who give a nice survey
of various types of deep neural network architectures.
The applications treated include variational models with mixed-noise data fidelity terms (Section 7.2), the application of learned iterative reconstruction from Section 5.1.4 to computed
tomography (CT) and photoacoustic tomography (PAT) (Section 7.3), ad-
versarial regularizers from Section 4.7 for CT reconstruction as an example
of variational regularization with a trained neural network as a regularizer
(Section 7.4), and the application of deep inverse priors from Section 4.10
to magnetic particle imaging (MPI) (Section 7.5).
In Section 8 we finish our discussion with a few concluding remarks and
comments on future research directions.
2.4. Regularization
Unfortunately, Hadamard’s dogma stigmatized the study of ill-posed prob-
lems and thereby severely hampered the development of the field. Math-
ematicians’ interest in studying ill-posed problems was revitalized by the
pioneering works of Calderón and Zygmund (1952, 1956), Calderón (1958)
and John (1955a, 1955b, 1959, 1960), who showed that instability is an in-
trinsic property in some of the most interesting and challenging problems
in mathematical physics and applied analysis. To some extent, these pa-
pers constitute the origin of the modern theory of inverse problems and
regularization.
The aim of functional analytic regularization theory is to develop stable
schemes for estimating ftrue from data g in (2.1) based on knowledge of A,
and to prove analytical results for the properties of the estimated solution.
More precisely, a regularization of the inverse problem in (2.1) is formally
a scheme that provides a well-defined parametrized mapping Rθ : Y → X
(existence) that is continuous in Y for fixed θ (stability) and convergent. The
latter means there is a way to select θ so that Rθ (g) → ftrue as g → A(ftrue ).
Besides existence, stability and convergence, a complete mathematical
analysis of a regularization method also includes proving convergence rates
and stability estimates. Convergence rates provide an estimate of the dif-
ference between a regularized solution Rθ (g) and the solution of (2.1) with
e = 0 (provided it exists), whereas stability estimates provide a bound on
the difference between Rθ (g) and Rθ (A(ftrue )) depending on the error kek.
These theorems rely on ‘source conditions’: for example, convergence rate
results are obtained under the assumption that the true solution ftrue lies in the range of A∗ (or, more generally, of a power of A∗ ◦ A): see e.g. Vainikko (1990) and Natterer (1977). Such concepts have been discussed in
the framework of parameter identification for partial differential equations
(quasi-reversibility): see Lattès and Lions (1969) for an early reference and
Hämarik, Kaltenbacher, Kangro and Resmerita (2016) and Kaltenbacher,
Kirchner and Vexler (2011) for some recent developments.
Variational methods. The idea here is to minimize a measure of data
misfit that is penalized using a regularizer (Kaltenbacher, Neubauer and
Scherzer 2008, Scherzer et al. 2009):
\[ R_\theta(g) := \arg\min_{f \in X} \{ L(A(f), g) + S_\theta(f) \}, \tag{2.7} \]
where we make use of the notation in Definitions 2.2, 2.4 and 2.5. This is a
generic, yet highly adaptable, framework for reconstruction with a natural
plug-and-play structure where the forward operator A, the data discrepancy
L and the regularizer S θ are chosen to fit the specific aspects of the inverse
problem. Well-known examples are classical Tikhonov regularization and
TV regularization.
Sections 2.5 and 2.6 provide a closer look at the development of variational
methods since these play an important role in Section 4, where data-driven
methods are used in functional analytic regularization. To simplify these
descriptions it is convenient to establish some key notions.
Definition 2.2. A regularization functional S : X → R+ quantifies how
well a model parameter possesses desirable features: a larger value usually
means less desirable properties.
In variational approaches to inverse problems, the value of S is considered
as a penalty term, and in Bayesian approaches it is seen as the negative log
of a prior probability distribution. Henceforth we will use S θ to denote a
regularization term that depends on a parameter set θ ∈ Θ; in particular
we will use θ as parameters that will be learned.
Remark 2.3. In some cases θ is a single scalar. We will use the notation
λ S(f ) ≡ S θ (f ) wherever such usage is unambiguous, and with the implic-
ation that θ = λ ∈ R+ . Furthermore, we will sometimes express the set θ
explicitly, e.g. S α,β , where the usage is unambiguous.
Definition 2.4. A data discrepancy functional L : Y × Y → R is a scalar
quantification of the similarity between two elements of data space Y .
The data discrepancy functional is considered to be a data fitting term
in variational approaches to inverse problems. Although often taken to
be a metric on data space, choosing it as an affine transformation of the
negative log-likelihood of data allows for a statistical interpretation, since
Bearing in mind the notation in Remarks 2.3 and 2.6, note that (2.10) has the form of (2.7) with L given by the squared L2-distance. Here X and Y are Hilbert spaces (typically both are L2 spaces). Moreover, the choice S(f) := ½‖f‖² is the most common one for the penalty term in (2.10). In fact, if A is linear, with A∗ denoting its adjoint, then standard arguments for minimizing quadratic functionals yield
\[ R_\lambda = (A^* \circ A + \lambda\, \mathrm{id})^{-1} \circ A^*. \tag{2.11} \]
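For a finite-dimensional linear forward operator, (2.11) can be evaluated directly; the following minimal sketch (with an assumed matrix A and data vector g) does exactly that:

```python
import numpy as np

def tikhonov(A, g, lam):
    """Evaluate R_lam(g) = (A^T A + lam * id)^{-1} A^T g, cf. (2.11), for a matrix A."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ g)

# Illustrative usage with an assumed matrix A and data vector g:
# f_rec = tikhonov(A, g, lam=1e-3)
```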
The data discrepancy. Here the choice is ideally guided by statistical consid-
erations for the observation noise (Bertero, Lantéri and Zanni 2008). Ideally
one selects L as an appropriate affine transform of the negative log-likelihood
of data, in which case minimizing f 7→ L(A(f ), g) becomes the same as com-
puting an maximum likelihood estimator. Hence, Poisson-distributed data
that typically appear in photography (Costantini and Susstrunk 2004) and
emission tomography applications (Vardi, Shepp and Kaufman 1985) lead
to a data discrepancy given by the Kullback–Leibler divergence (Sawatzky,
Brune, Müller and Burger 2009, Hohage and Werner 2016), while additive
normally distributed data, as for Gaussian noise, result in a least-squares
fit model.
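The two discrepancies just mentioned are easily written down as affine transforms of the corresponding negative log-likelihoods; the sketch below is purely illustrative and uses generic array arguments:

```python
import numpy as np

def l2_discrepancy(Af, g):
    """Least-squares data fit: an affine transform of the Gaussian negative log-likelihood."""
    return 0.5 * np.sum((Af - g) ** 2)

def kl_discrepancy(Af, g, eps=1e-12):
    """Kullback-Leibler divergence: an affine transform of the Poisson negative
    log-likelihood, defined for non-negative expected data Af and counts g."""
    Af = np.maximum(Af, eps)
    return np.sum(Af - g + g * np.log(np.maximum(g, eps) / Af))
```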
and Fatemi (1992) for image denoising due to its edge-preserving properties,
favouring images f that have a sparse gradient. Here, the TV regularizer is
given as
\[ S(f) := \mathrm{TV}(f) := |Df|(\Omega) = \int_\Omega \mathrm{d}|Df|, \tag{2.12} \]
where Ω ⊂ Rd is a fixed open and bounded set. The above functional (TV
regularizer) uses the total variation measure of the distributional derivat-
ive of f defined on Ω (Ambrosio, Fusco and Pallara 2000). A drawback
of using such a regularization procedure is apparent as soon as the true
model parameter not only consists of constant regions and jumps but also
possesses more complicated, higher-order structures, e.g. piecewise linear
parts. In this case, TV introduces jumps that are not present in the true
solution, which is referred to as staircasing (Ring 2000). Examples of gener-
alizations of TV for addressing this drawback typically incorporate higher-
order derivatives, e.g. total generalized variation (TGV) (Bredies, Kunisch
and Pock 2011) and the infimal-convolution total variation (ICTV) model
(Chambolle and Lions 1997). These read as
\[ S_{\alpha,\beta}(f) := \mathrm{ICTV}_{\alpha,\beta}(f) = \min_{\substack{v \in W^{1,1}(\Omega)\\ \nabla v \in \mathrm{BV}(\Omega)}} \bigl\{ \alpha \|Df - \nabla v\|_{\mathcal{M}(\Omega;\mathbb{R}^{2})} + \beta \|D\nabla v\|_{\mathcal{M}(\Omega;\mathbb{R}^{2\times 2})} \bigr\}, \tag{2.13} \]
and the second-order TGV (Bredies and Valkonen 2011, Bredies, Kunisch and Valkonen 2013) reads as
\[ S_{\alpha,\beta}(f) := \mathrm{TGV}^2_{\alpha,\beta}(f) = \min_{w \in \mathrm{BD}(\Omega)} \bigl\{ \alpha \|Df - w\|_{\mathcal{M}(\Omega;\mathbb{R}^{2})} + \beta \|\mathcal{E}w\|_{\mathcal{M}(\Omega;\mathrm{Sym}^2(\mathbb{R}^{2}))} \bigr\}. \tag{2.14} \]
Here
\[ \mathrm{BD}(\Omega) := \{ w \in L^1(\Omega;\mathbb{R}^d) \mid \| \mathcal{E} w \|_{\mathcal{M}(\Omega;\mathbb{R}^{d\times d})} < \infty \} \]
is the space of vector fields of bounded deformation on Ω with E denoting
the symmetrized gradient and Sym2 (R2 ) the space of symmetric tensors of
order 2 with arguments in R2 . Here α and β are fixed positive parameters. The main difference between (2.13) and (2.14) is that we do not
generally have w = ∇v for any function v. That results in some qualitative
differences of ICTV and TGV regularization: see e.g. Benning, Brune, Bur-
ger and Müller (2013). One may also consider Banach-space norms other
than TV, such as Besov norms (Lassas, Saksman and Siltanen 2009), which
behave more nicely with respect to discretization (see also Section 3.4). Dif-
ferent TV-type regularizers and their adaptation to data by bilevel learning
of parameters (e.g. α and β in ICTV and TGV) will be discussed in more
detail in Section 4.3.1 and numerical results will be given in Section 7.2.
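As a simple concrete instance of TV-type variational regularization, the following sketch runs gradient descent on a smoothed 1-D TV-denoising energy; the smoothing parameter, regularization weight and step size are ad hoc choices for illustration (for the non-smooth case one would use proximal methods instead):

```python
import numpy as np

def tv_denoise_1d(g, alpha=0.2, beta=0.05, step=0.05, iters=1000):
    """Gradient descent on E(f) = 0.5*||f - g||^2 + alpha * sum_i sqrt((Df)_i^2 + beta^2),
    a smoothed 1-D total variation denoising model (A = id); beta smooths |.| near 0."""
    f = g.astype(float).copy()
    for _ in range(iters):
        df = np.diff(f)                               # (Df)_i = f[i+1] - f[i]
        w = df / np.sqrt(df ** 2 + beta ** 2)         # derivative of the smoothed |.|
        dtw = np.concatenate(([-w[0]], w[:-1] - w[1:], [w[-1]]))   # D^T w
        f -= step * ((f - g) + alpha * dtw)           # gradient step on E
    return f
```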
where
\[ \hat{\xi} = \arg\min_{\xi \in \Xi} \{ L(A(E^*_D(\xi)), g) + \lambda \|\xi\|_0 \}, \]
with θ = {D, λ}, i.e. θ comprises the scalar λ > 0 and the entire dictionary that defines the synthesis operator E∗D. In the corresponding analysis approach, we get
\[ R_\theta(g) := \arg\min_{f \in X} \{ L(A(f), g) + \lambda \| E_D(f) \|_0 \}. \tag{2.21} \]
where AD := A ◦ E ∗D . Here, Ts (ξ) sets all but the largest (in magnitude)
s elements of ξ to zero. This is therefore a proximal-gradient method
with the proximal of the function being 0 at 0 and 1 everywhere else (see
Section 8.2.7). Other examples are matching pursuit (MP) (Mallat and
Zhang 1993), orthogonal matching pursuit (OMP) (Tropp and Gilbert 2007)
and variants thereof such as StOMP (Donoho, Tsaig, Drori and Starck
2012), ROMP (Needell and Vershynin 2009) and CoSamp (Needell and
Tropp 2009).
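To make the greedy/thresholding idea concrete, here is a minimal sketch of iterative hard thresholding for a linear forward operator given as a matrix (the step size and the assumed sparsity level s are illustrative):

```python
import numpy as np

def hard_threshold(xi, s):
    """T_s: keep the s largest-magnitude entries of xi and set the rest to zero."""
    out = np.zeros_like(xi)
    keep = np.argsort(np.abs(xi))[-s:]
    out[keep] = xi[keep]
    return out

def iht(A, g, s, iters=200):
    """Iterative hard thresholding for g ~ A xi with an s-sparse xi (A a matrix)."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2            # step for the fit 0.5*||A xi - g||^2
    xi = np.zeros(A.shape[1])
    for _ in range(iters):
        xi = hard_threshold(xi + step * (A.T @ (g - A @ xi)), s)
    return xi
```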
The above approaches for solving (2.20) and (2.21) all have their ad-
vantages and disadvantages. First of all, greedy methods will generally not
give the same solution as convex relaxation. However, if the restricted iso-
metry property (RIP) from Section 2.7.3 holds, then both approaches have
the same solution. Convex relaxation has the advantage that it succeeds
with a very small number of possibly noisy measurements. However, its numerical solution tends to be computationally burdensome. Combinator-
ial algorithms, on the other hand, can be extremely fast (sublinear in the
length of the target signal) but they require a very specific structure of the
forward operator A and a large number of samples. The performance of
greedy methods falls in between those of convex relaxation and combinat-
orial algorithms in their run-time and sampling efficiency.
then
\[ \| \hat{f}_\delta - f_{\mathrm{true}} \|_2 \le C \Bigl( \delta + \frac{\| f_{\mathrm{true}} - f^{(s)}_{\mathrm{true}} \|_1}{\sqrt{s}} \Bigr). \tag{2.25} \]
In the above, f^(s)_true ∈ Rn is the vector consisting of the s largest (in magnitude) coefficients of ftrue and zeros otherwise.
Examples of matrices satisfying RIP are sub-Gaussian matrices and par-
tial bounded orthogonal matrices (Chen and Needell 2016). Theorem 2.8
states that the reconstruction error is at most proportional to the norm of
the noise in the data plus the tail ftrue − f^(s)_true of the signal. Cohen, Dahmen
and DeVore (2009) show that this error bound is optimal (up to the precise
value of C). Moreover, if ftrue is s-sparse and δ = 0 (noise-free data), then
ftrue can be reconstructed exactly. Furthermore, if ftrue is compressible with
(2.19), then
\[ \| \hat{f}_\delta - f_{\mathrm{true}} \|_2 \le C ( \delta + C' s^{1/2 - 1/q} ). \tag{2.26} \]
Finally, error estimates of the above type have been extended to the infinite-
dimensional setting in Adcock and Hansen (2016).
The choice of dictionary is clearly a central topic in sparsity-promoting
regularization and, as outlined in Section 4.4, the dictionary can be learned
beforehand or jointly alongside the signal recovery.
3. Statistical regularization
Statistical regularization, and Bayesian inversion in particular, is a complete
statistical inferential methodology for inverse problems. It offers a rich set
of tools for incorporating data into the recovery of the model parameter,
so it is a natural framework to consider when data-driven approaches from
machine learning are to be used for solving ill-posed inverse problems.
A key element is to treat both the data and model parameter as real-
izations of certain random variables and phrase the inverse problem as a
statistical inference question. In contrast, the functional analytic viewpoint
(Section 2) allows for data to be interpreted as samples generated by a
random variable, but there are no statistical assumptions on the model
parameters.
Remark 3.1. In functional analytic regularization, a statistical model for
data is mostly used to justify the choice of data discrepancy in a variational
method and for selecting an appropriate regularization parameter. Within
functional analytic regularization, one can more carefully account for stat-
istical properties of data which can be useful for uncertainty quantification
(Bissantz, Hohage, Munk and Ruymgaart 2007).
Bayesian statistics offers a natural setting for such a quest since it is natural to interpret measured data in an inverse problem as a sample of a random variable conditioned on the model parameter, whose distribution is the data likelihood.
The data likelihood can often be derived using knowledge-driven modelling.
Solving an inverse problem can then be stated as finding the distribution
of the model parameter conditioned on data (posterior distribution). The
Remark 3.2. In this setting, integrals over X and/or Y , which are needed
for defining expectation, are interpreted as a Bochner integral that extends
the Lebesgue integral to functions that take values in a Banach space. See
Dashti and Stuart (2017, Section A.2) for a brief survey of Banach and
Hilbert space-valued random variables.
(measured data) from the data model with unknown true model parameter.
A more precise statement reads as follows.
Definition 3.5. A statistical inverse problem is the task of recovering the conditional distribution Πgpost ∈ PX of (f | g = g) under µ from measured data g ∈ Y , where
\[ g \text{ is a single sample of } (\mathbf{g} \mid \mathbf{f} = f_{\mathrm{true}}) \sim \Pi^{f_{\mathrm{true}}}_{\mathrm{data}}. \tag{3.4} \]
Here ftrue ∈ X is unknown while f ↦ Π^f_data, which describes how data are generated, is known.
The conceptual difference that comes from adopting such a statistical
view brings with it several potential advantages. The posterior, assuming
it exists, describes all possible solutions, so recovering it represents a more
complete solution to the inverse problem than recovering an approximation
of ftrue , which is the goal in functional analytic regularization. This is
particularly the case when one seeks to quantify the uncertainty in the
recovered model parameter in terms of statistical properties of the data.
However, recovering the entire posterior is often not feasible, such as in
inverse problems that arise in imaging. As an alternative, one can settle
for exploring the posterior by computing suitable estimators (Section 3.3).
Some may serve as approximations of ftrue whereas others are designed for
quantifying the uncertainty.
\[ \frac{\mathrm{d}\Pi^f_{\mathrm{data}}}{\mathrm{d}\Pi_{\mathrm{noise}}}(g) = \exp(-L(f, g)) \quad \text{for all } f \in X, \tag{3.5} \]
with
\[ \mathbb{E}_{g \sim \Pi_{\mathrm{noise}}}[\exp(-L(f, g))] = 1. \]
The mapping f ↦ −L(f, g) is called the (data) log-likelihood for the data g ∈ Y .
Remark 3.6. The equality for the Radon–Nikodym derivative in (3.5) means that
\[ \mathbb{E}_{g \sim \Pi^f_{\mathrm{data}}}[F(g)] = \mathbb{E}_{g \sim \Pi_{\mathrm{noise}}}[\exp(-L(f, g))\, F(g)] \]
for any measurable F : Y → R for which the expectations exist.
3.2.1. Existence
Existence for Bayesian inversion follows when Bayes’ theorem holds. Below
we state the precise existence theorem (Dashti and Stuart 2017, Theorem 16)
for the setting in Section 3.1.3, which covers the case when model parameter
and data spaces are infinite-dimensional separable Banach spaces.
Theorem 3.9 (existence for Bayes inversion). Assume that L : X × Y → R in (3.5) is continuously differentiable on some X′ ⊂ X that contains the support of the prior Πprior and Πprior(X′ ∩ B) > 0 for some bounded set B ⊂ X. Also, assume there exist mappings M1, M2 : R+ × R+ → R+ that are component-wise monotone, non-decreasing and such that
\[ -L(f, g) \le M_1(r, \|f\|), \qquad |L(f, g) - L(f, v)| \le M_2(r, \|f\|)\, \|g - v\| \tag{3.8} \]
for f ∈ X and g, v ∈ Br(0) ⊂ Y . Then Z in (3.6) is finite, i.e. 0 < Z(g) < ∞ for any g ∈ Y , and the posterior given by (3.7) yields a well-defined PX-valued mapping on Y : g ↦ Πgpost.
Under certain circumstances it is possible to work with improper priors
on X, for example by computing posterior distributions that approximate
the posteriors one would have obtained using proper conjugate priors whose
extreme values coincide with the improper prior.
3.2.2. Stability
One can show that small changes in the data lead to small changes in the
posterior distribution (in Hellinger metric) on PX . The precise formulation
given by Dashti and Stuart (2017, Theorem 16) reads as follows.
3.2.3. Convergence
Posterior consistency is the Bayesian analogue of the notion of convergence
in functional analytic regularization. More precisely, the requirement is that
the posterior Πgpost , where g is a sample of (g | f = ftrue ), concentrates in any
small neighbourhood of the true model parameter ftrue ∈ X as the information in data increases indefinitely.1
1 Increasing the ‘information in data’ indefinitely means e → 0 in (3.2), and if the data space Y is finite-dimensional, one also lets its dimension (sample size) increase. See Ghosal and van der Vaart (2017, Definition 6.1) for the precise definition.
To sidestep this difficulty, Castillo and Nickl (2013, 2014) seek to determine
maximal families Ψ that replace C ∞ in (3.11) and where such an asymp-
totic characterization holds. This leads to non-parametric Bernstein–von
Mises theorems, and while Castillo and Nickl (2013, 2014) considered ‘direct’
problems in non-parametric regression and probability density estimation,
recent papers have obtained non-parametric Bernstein–von Mises theorems
for certain classes of inverse problems. For example, Monard, Nickl and
Paternain (2019) consider the case of inverting the (generalized) ray trans-
form, whereas Nickl (2017a) considers PDE parameter estimation problems.
The case of a general linear forward operator is treated by Giordano and
Kekkonen (2018), who build upon the techniques of Monard et al. (2019).
To give a flavour of the type of results obtained, we consider Theorem 2.5
of Monard et al. (2019), which is relevant to tomographic inverse problems
involving inversion of the ray transform for recovering a function (images)
defined on Ω ⊂ Rd , i.e. X ⊂ L2 (Ω). This theorem states that
\[ \frac{1}{\delta} \bigl\langle (\mathbf{f} \mid \mathbf{g} = g^\delta) - \mathbb{E}[\mathbf{f} \mid \mathbf{g} = g^\delta], \, \phi \bigr\rangle_{L^2} \to \mathrm{N}\bigl(0, \| A \circ (A^* \circ A)^{-1}(\phi) \|_Y \bigr) \tag{3.12} \]
as δ → 0 for any φ ∈ C∞(Ω).
2 A Bayesian credible set is a subset of the model parameter space X that contains a predefined fraction, say 95%, of the posterior mass. A frequentist confidence region is a subset of X that includes the unknown true model parameter with a predefined frequency as the experiment is repeated indefinitely.
The assumptions of the Bernstein–von Mises theorem are fragile and easy to violate for a
dataset or analysis, and it is difficult to know, without outside information,
when this will occur. As nicely outlined in Nickl (2013, Section 2.25), the
parametric (finite-dimensional) setting already requires many assumptions,
such as a consistent maximum likelihood estimator, a true model parameter
in the support of the prior, and a log-likelihood that is sufficiently regular.
For example, data in an inverse problem are observational, and therefore it
is unlikely that an estimator of ftrue , such as maximum likelihood or the
posterior mean, is consistent, in which case a Bernstein–von Mises theorem
does not apply.
The advantage of the maximum likelihood estimator is that it does not in-
volve any integration over X, so it is computationally feasible to use on
large-scale problems. Furthermore, it only requires access to the data likeli-
hood (no need to specify a prior), which is known. On the other hand, it
does not act as a regularizer, so it is not suitable for ill-posed problems.
Maximum a posteriori (MAP) estimator. This estimator maximizes the pos-
terior probability, that is, it is the ‘most likely’ model parameter given
measured data g. In the finite-dimensional setting, the prior and posterior
distribution can typically be described by densities with respect to the Le-
besgue measure, and the MAP estimator is defined as the reconstruction
operator R : Y → X given by
\[ R(g) := \arg\max_{f \in X} \pi_{\mathrm{post}}(f \mid g) = \arg\min_{f \in X} \{ L(f, g) - \log \pi_{\mathrm{prior}}(f) \}. \]
3 The Cameron–Martin space E associated with Π ∈ PX consists of elements f ∈ X such that δ_f ∗ Π ≪ Π, that is, the translated measure B ↦ Π(B − f) is absolutely continuous with respect to Π. The Cameron–Martin space is fundamental when dealing with the differential structure in X, mainly in connection with integration by parts formulas, and it inherits a natural Hilbert space structure from the space X∗.
4 The Bregman distance for x ↦ ⟨x, x⟩ gives the L2-loss.
for more abrupt (discontinuous) changes in the values of the unknown model
parameter at specific locations. Yet another is sparsity-promoting priors (see
Section 2.7), which encode the a priori belief that the unknown model para-
meter is compressible with respect to some underlying dictionary, that is,
it can be transformed into a linear combination of dictionary elements where
most coefficients vanish (or are small). Finally, there are hierarchical priors
which are formed by combining other priors hierarchically into an overall
prior. This is typically done in a two-step process where one first specifies
some underlying priors, often taken as natural conjugate priors, and then
mixes them in a second stage over hyper-parameters. Recently, Calvetti,
Somersalo and Strang (2019) have reformulated the question of sparse re-
covery as an inverse problem in the Bayesian framework, and expressed the
sparsity criteria by means of a hierarchical prior model. More information
and further examples of priors in the finite-dimensional setting are given
by Kaipio and Somersalo (2005), Calvetti and Somersalo (2008, 2017) and
Calvetti et al. (2019).
Priors on function spaces. Defining priors when X is an infinite-dimensional
function space is somewhat involved. A common approach is to consider a
convergent series expansion and then let the coefficients be generated by a
random variable.
More precisely, consider the case when X is a Banach space of real-valued
functions on some fixed domain Ω ⊂ Rd . Following Dashti and Stuart (2017,
Section 2.1), let {φi }i ⊂ X be a countable sequence whose elements are
normalized, i.e. kφi kX = 1. Now consider model parameters f ∈ X of the
form
\[ f = f_0 + \sum_i \alpha_i \phi_i, \]
3.5. Challenges
The statistical view of an inverse problem in Bayesian inversion extends
the functional analytic one, since the output is ideally the posterior that
describes all possible solutions. This is very attractive and fits well within the scientific tradition of presenting data and inferred quantities with error bars. Most priors, however, are chosen to regularize the problem rather than to improve the output quality. Next, algorithmic advances (Section 3.5.2) have res-
ulted in methods that can sample in a computationally feasible manner
from a posterior distribution in a high-dimensional setting, say up to 10^6 dimensions. This is still not sufficient for large-scale two-dimensional ima-
ging or regular three-dimensional imaging applications. Furthermore, these
methods require an explicit prior, which may not be feasible if one uses
learning to obtain it. They may also make use of analytic approximations
such as those given in Section 3.2.5, which restricts the priors that can come
into question. For these reasons, most applications of Bayesian inversion on
large-scale problems only compute a MAP estimator, whereas estimators re-
quiring integration over the model parameter space remain computationally
unfeasible. These include Bayes estimators and the conditional mean as well
as estimators relevant for uncertainty quantification.
In conclusion, the above difficulties in specifying a ‘good’ prior and in
meeting the computational requirements have seriously limited the dissem-
ination of Bayes inversion in large-scale inverse problems, such as those
arising in imaging. Before providing further remarks on this, let us men-
tion that Section 5.1 shows how techniques from deep learning can be used
to address the above challenges when computing a wide range of estimators.
Likewise, in Section 5.2 we show how deep learning can be used to efficiently
sample from the posterior.
distribution that coincides with the posterior in the limit. Other variants
use Gibbs sampling, which reduces the autocorrelation between samples.
Technically, Gibbs sampling can be seen as a special case of Metropolis–
Hastings dynamics and it requires computation of conditional distribu-
tions. Further variants are auxiliary variable MCMC methods, such as
slice sampling (Neal 2003), proximal MCMC (Green, Łatuszyński, Pereyra and Robert 2015, Durmus, Moulines and Pereyra 2018, Repetti, Pereyra
and Wiaux 2019) and Hamiltonian Monte Carlo (Girolami and Calderhead
2011, Betancourt 2017). See also Dashti and Stuart (2017, Section 5) for
a nice abstract description of MCMC in the context of infinite-dimensional
Bayesian inversion.
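As a minimal illustration of posterior sampling (not one of the cited methods in full generality), the sketch below runs a random-walk Metropolis–Hastings chain that only requires the unnormalized log-posterior; the Gaussian likelihood/prior in the usage comment is an assumed toy model:

```python
import numpy as np

def metropolis_hastings(log_post, f0, n_samples=5000, prop_std=0.05, seed=0):
    """Random-walk Metropolis-Hastings targeting the density proportional to exp(log_post)."""
    rng = np.random.default_rng(seed)
    f, lp = f0.copy(), log_post(f0)
    samples = []
    for _ in range(n_samples):
        f_prop = f + prop_std * rng.standard_normal(f.shape)
        lp_prop = log_post(f_prop)
        if np.log(rng.random()) < lp_prop - lp:       # accept with probability min(1, ratio)
            f, lp = f_prop, lp_prop
        samples.append(f.copy())
    return np.array(samples)

# Assumed toy posterior: Gaussian likelihood g ~ N(A f, sigma^2 I) and prior N(0, I):
# log_post = lambda f: -0.5 * np.sum((A @ f - g) ** 2) / sigma ** 2 - 0.5 * np.sum(f ** 2)
# samples = metropolis_hastings(log_post, f0=np.zeros(A.shape[1]))
# posterior_mean = samples[1000:].mean(axis=0)        # discard burn-in
```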
An alternative approach to MCMC seeks to approximate the posterior
with more tractable distributions (deterministic inference), for example in
variational Bayes inference (Fox and Roberts 2012, Blei, Küçükelbir and
McAuliffe 2017) and expectation propagation (Minka 2001). Variational
Bayes inference has indeed emerged as a popular alternative to the clas-
sical MCMC methods for sampling from a difficult-to-compute probability
distribution, which in Bayesian inversion is the posterior distribution. The
idea is to start from a fixed family of probability distributions (variational
family) and select the one that best approximates the target distribution
under some similarity measure, such as the Kullback–Leibler divergence.
Blei et al. (2017, p. 860) try to provide some guidance on when to use
MCMC and when to use variational Bayes. MCMC methods tend to be
more computationally intensive than variational inference, but they also
provide guarantees of producing (asymptotically) exact samples from the
target density (Robert and Casella 2004). Variational inference does not
enjoy such guarantees: it can only find a density close to the target but
tends to be faster than MCMC. A recent development is the proof of a
Bernstein–von Mises theorem (Wang and Blei 2017, Theorem 5), which
shows that the variational Bayes posterior is asymptotically normal around
the variational frequentist estimate. Hence, if the variational frequentist
estimate is consistent, then the variational Bayes posterior converges to a
Gaussian with a mean centred at the true model parameter. Furthermore,
since variational Bayes rests on optimization, variational inference easily
takes advantage of methods such as stochastic optimization (Robbins and
Monro 1951, Kushner and Yin 1997) and distributed optimization (though
some MCMC methods can also exploit these innovations (Welling and Teh
2011, Ahmed et al. 2012)). Thus, variational inference is suited to large
data sets and scenarios where we want to quickly explore many models;
MCMC is suited to smaller data sets and scenarios where we are happy to
pay a heavier computational cost for more precise samples. Another factor
is the geometry of the posterior distribution. For example, the posterior
of a mixture model admits multiple modes, each corresponding to a relabelling of the mixture components (label switching).
λ is chosen so that
\[ L(A(f_\lambda), g) \le \epsilon \tag{4.1} \]
for a tolerance ε reflecting the noise level. Another method within this class is the L-curve (Hansen 1992). Here the regularization parameter λ is chosen where the log–log plot of λ ↦ (L(A(fλ), g), S(fλ)) has the highest curvature (i.e. it exhibits a corner).
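The discrepancy principle (4.1) can be implemented by a simple bisection over the regularization parameter, assuming the residual is monotone in λ; the sketch below is illustrative and takes the reconstruction map and the tolerance ε as inputs (the `tikhonov` helper from the earlier sketch and `noise_level` in the usage comment are assumptions):

```python
import numpy as np

def discrepancy_principle(reconstruct, residual, eps, lam_lo=1e-8, lam_hi=1e2, iters=40):
    """Bisection on log(lambda) for the rule L(A(f_lam), g) <= eps in (4.1), assuming the
    residual lam -> L(A(f_lam), g) increases monotonically with lam.
    `reconstruct(lam)` returns f_lam and `residual(f)` returns L(A(f), g)."""
    for _ in range(iters):
        lam = np.sqrt(lam_lo * lam_hi)                # geometric midpoint
        if residual(reconstruct(lam)) > eps:
            lam_hi = lam                              # residual too large: less regularization
        else:
            lam_lo = lam                              # within tolerance: can regularize more
    return lam_lo

# e.g. with the `tikhonov` sketch above:
# lam = discrepancy_principle(lambda l: tikhonov(A, g, l),
#                             lambda f: 0.5 * np.sum((A @ f - g) ** 2), eps=noise_level)
```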
formulation:
\[ \hat{\theta} \in \arg\min_{\theta} \mathbb{E}_{(\mathbf{f},\mathbf{g}) \sim \mu}\bigl[ \ell_X( R_\theta(\mathbf{g}), \mathbf{f} ) \bigr], \quad\text{where}\quad R_\theta(g) := \arg\min_{f \in X} \{ L(A(f), g) + S_\theta(f) \}. \tag{4.3} \]
Note here that θ̂ is a Bayes estimator, but µ is not fully known. Instead it is replaced by its empirical counterpart given by the supervised training data, in which case θ̂ corresponds to empirical risk minimization (Section 3.3).
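On a very small scale, the empirical-risk version of (4.3) with a single scalar parameter can be emulated by a grid search; the sketch below is an illustrative stand-in for the gradient-based bilevel methods discussed in this section (the reconstruction map and the parameter grid are assumptions):

```python
import numpy as np

def learn_lambda(train_pairs, reconstruct, lambdas):
    """Empirical-risk version of (4.3) for a single scalar parameter: pick the lambda
    minimizing the mean squared reconstruction error over supervised pairs (f_i, g_i).
    `reconstruct(g, lam)` solves the lower-level variational problem."""
    def risk(lam):
        return np.mean([np.sum((reconstruct(g, lam) - f) ** 2) for f, g in train_pairs])
    return min(lambdas, key=risk)

# e.g. lam_hat = learn_lambda(pairs, lambda g, lam: tikhonov(A, g, lam),
#                             lambdas=np.logspace(-6, 1, 30))
```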
In the bilevel optimization literature, as in the optimization literature
as a whole, there are two main and mostly distinct approaches. The first
one is the discrete approach that first discretizes the problem (4.2) and
subsequently optimizes its parameters. In this way, optimality conditions
and their well-posedness are derived in finite dimensions, which circumvents
often difficult topological considerations related to convergence in infinite-
dimensional function spaces, but also jeopardizes preservation of continuous
structure (i.e. optimizing the discrete problem is not automatically equi-
valent to discretizing the optimality conditions of the continuous problem
(De los Reyes 2015)) and dimension-invariant convergence properties.
Alternatively, (4.2) and its parameter θ are optimized in the continuum
(i.e. appropriate infinite-dimensional function spaces) and then discretized.
The resulting problems present several difficulties due to the frequent non-
smoothness of the lower-level problem (think of TV regularization), which,
in general, makes it impossible to verify Karush–Kuhn–Tucker constraint
qualification conditions. This issue has led to the development of alternat-
ive analytical approaches in order to obtain first-order necessary optimal-
ity conditions (Bonnans and Tiba 1991, De los Reyes 2011, Hintermüller,
Laurain, Löbhard, Rautenberg and Surowiec 2014). The bilevel problems
under consideration are also related to generalized mathematical programs
with equilibrium constraints in function spaces (Luo, Pang and Ralph 1996,
Outrata 2000).
One of the first examples of the above is the paper by Haber and Tenorio
(2003), who considered a regularization functional S θ : X → R that can de-
pend on location and involves derivatives or other filters. Concrete examples
are the anisotropic weighted Dirichlet energy where θ is a function, that is,
\[ S_\theta(f) := \| \theta(\cdot)\, \nabla f(\cdot) \|_2^2 \quad \text{for } \theta : \Omega \to \mathbb{R}, \]
and the anisotropic weighted TV,
\[ S_\theta(f) := \| \theta(|\nabla f(\cdot)|) \|_1 \quad \text{for } \theta : \mathbb{R} \to \mathbb{R}. \]
The paper contains no formal mathematical statements or proofs, but there
are many numerical examples showing how to use supervised learning tech-
niques to determine a regularization functional given a training set of feas-
ible solutions.
This framework is the basis for the analysis of the learning model, in which
convexity of the variational model and compactness properties in the space
of functions of bounded variation are crucial for proving existence of an
optimal solution: see De los Reyes, Schönlieb and Valkonen (2016). Richer
parametrizations for bilevel learning are discussed in Chen et al. (2012,
2015), for example, where non-linear functions and convolution kernels are
learned. Chen et al., however, treat the learning model in finite dimensions,
and a theoretical investigation of these more general bilevel learning models
in a function space setting is a matter for future research.
In order to derive sharp optimality conditions for optimal parameters of
(4.4) more regularity on the lower-level problem is needed. For shifting
the problem (4.4) into a more regular setting, the Radon norms are regu-
larized with Huber regularization and a convex, proper and weak* lower-
semicontinuous smoothing functional H : X → [0, ∞] is added to the lower-
level problem, typically H(f) = ½‖∇f‖². In particular, the former is required for the single-valued differentiability of the solution map (λ, α) ↦ fα,λ that current numerical methods rely on, irrespective of whether we
are in a function space setting (see e.g. Rockafellar and Wets 1998, The-
orem 9.56, for the finite-dimensional case). For parameters µ ≥ 0 and
γ ∈ (0, ∞], the lower-level problem in (4.4) is then replaced by
with
\[ S^\gamma_\theta(f) := \sum_{j=1}^{N} \alpha_j \| J_j(f) \|^\gamma_{\mathcal{M}(\Omega;\mathbb{R}^{m_j})}. \]
Definition 4.1. Given γ ∈ (0, ∞], the Huber regularization for the norm ‖·‖2 on Rn is defined by
\[ \|g\|_\gamma = \begin{cases} \|g\|_2 - \dfrac{1}{2\gamma}, & \|g\|_2 \ge \dfrac{1}{\gamma}, \\[6pt] \dfrac{\gamma}{2} \|g\|_2^2, & \|g\|_2 < \dfrac{1}{\gamma}. \end{cases} \]
Then, for µ ∈ M(Ω; Rmj) with Lebesgue decomposition µ = ν L^n + µ^s we have the Huber-regularized total variation measure
\[ |\mu|_\gamma(V) := \int_V |\nu(x)|_\gamma \, \mathrm{d}x + |\mu^s|(V) \quad (V \subset \Omega \text{ Borel-measurable}), \]
and finally its Radon norm,
\[ \|\mu\|^\gamma_{\mathcal{M}(\Omega;\mathbb{R}^{m_j})} := \bigl\| |\mu|_\gamma \bigr\|_{\mathcal{M}(\Omega;\mathbb{R}^{m_j})}. \]
In all of these, we interpret the choice γ = ∞ to give back the standard
unregularized total variation measure or norm. In this setting existence
of optimal parameters and differentiability of the solution operator can be
proved, and with this an optimality system can be derived: see De los Reyes
et al. (2016). More precisely, for the special case of the TV-denoising model
the following theorem holds.
Theorem 4.2 (TV denoising (De los Reyes et al. 2016)). Consider the denoising problem (2.2) where g, ftrue ∈ BV(Ω) ∩ L2(Ω), and assume TV(g) > TV(ftrue). Also, let TV^γ(f) := ‖Df‖^γ_M(Ω;Rn). Then there exist µ̄, γ̄ > 0 such that any optimal solution αγ,µ ∈ [0, ∞] to the problem
\[ \min_{\alpha \in [0,\infty]} \frac{1}{2} \| f_{\mathrm{true}} - f_\alpha \|^2_{L^2(\Omega)}, \qquad f_\alpha = \arg\min_{f \in \mathrm{BV}(\Omega)} \Bigl\{ \frac{1}{2} \| g - f \|^2_{L^2(\Omega)} + \alpha\, \mathrm{TV}^\gamma(f) + \frac{\mu}{2} \| \nabla f \|^2_{L^2(\Omega;\mathbb{R}^n)} \Bigr\}, \]
satisfies αγ,µ > 0 whenever µ ∈ [0, µ̄] and γ ∈ [γ̄, ∞].
Theorem 4.2 states that if g is a noisy image which oscillates more than
the noise-free image ftrue , then the optimal parameter is strictly positive,
which is exactly what we would naturally expect. De los Reyes et al. (2016)
proved a similar result for second-order TGV and ICTV regularization for
the case when X = Y . The result was not extended to data with a general
Y , but it is possible with additional assumptions on the parameter space.
Moreover, in much of the analysis for (4.4) we could allow for spatially
dependent parameters α and λ. However, the parameters would then need
to lie in a finite-dimensional subspace of C0 (Ω; RN ): see De los Reyes and
Schönlieb (2013) and Van Chung et al. (2017). Observe that Theorem 4.2 al-
lows for infinite parameters α. Indeed, for regularization parameter learning
Figure 4.1. Parameter optimality for TV denoising in Theorem 4.2. The non-
convexity of the loss function, even for this one-parameter optimization problem,
is clearly visible. Courtesy of Pan Liu.
for the TGV2 case, which also holds for TV. In particular, these stronger
results open the door to further necessary and sufficient optimality condi-
tions. Further, using the adjoint optimality condition, gradient formulas for the reduced cost functional can be derived, which in turn feed into the
design of numerical algorithms for solving (4.4): see Calatroni et al. (2016,
Section 3).
and non-convex ρ(z) = log(1 + |z|), and the non-smooth but convex `1 -
norm ρ(z) = |z|. Their experiments in particular suggested that, while the
two log-type non-linearities gave very similar regularization performance,
the convex `1 -norm regularizer clearly did worse. Moreover, Chen et al.
(2015) parametrized the non-linearities ρ with radial basis functions, whose
coefficients are learned as well. Earlier MRF-based bilevel learning schemes
exist: see e.g. Samuel and Tappen (2009).
A further development is the variational networks introduced by Kobler,
Klatzer, Hammernik and Pock (2017) and Hammernik et al. (2018). The
idea was to replace the above bilevel scheme with learning to optimize
(Section 4.9) using a supervised loss, thereby leading to a learned iterat-
ive method. This will be discussed in more detail in Section 5.1.4, where
the variational networks re-emerge in the framework of learned iterative
schemes.
where
\[ S_\theta(f, \xi_1, \ldots, \xi_N) := \sum_{j=1}^{N} \bigl[ \lambda_j \| P_j(f) - E^*_D(\xi_j) \|_2^2 + \mu_j \| \xi_j \|_p^p \bigr], \tag{4.10} \]
with θ = (λ_j, µ_j)_{j=1}^N ∈ (R^2)^N and E∗D : Ξ → X denoting the synthesis operator associated with the given dictionary D. Bai et al. (2017) propose the following alternating scheme to solve the sparse-land reconstruction problem:
\[ \begin{aligned} f^{\,i+1} &:= \arg\min_{f \in X} \Bigl\{ L(A(f), g) + \sum_{j=1}^{N} \lambda_j \| P_j(f) - E^*_D(\xi^{\,i}_j) \|_2^2 \Bigr\}, \\ \xi^{\,i+1}_j &:= \arg\min_{\xi_j \in \Xi} \bigl\{ \lambda_j \| P_j(f^{\,i+1}) - E^*_D(\xi_j) \|_2^2 + \mu_j \| \xi_j \|_p^p \bigr\} \quad \text{for } j = 1, \ldots, N. \end{aligned} \tag{4.11} \]
The advantage of these approaches over plain-vanilla dictionary learning
is that sparse-land models are computationally more feasible. Sparse-land
models are one example of a dictionary learning approach, which will be
discussed in the next section.
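For concreteness, the following sketch implements an alternating scheme in the spirit of (4.11), specialized to denoising (A = id), p = 1 and non-overlapping patches; the ISTA inner solver, the patch size and all weights are illustrative assumptions rather than the scheme of Bai et al. (2017) itself:

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding, the proximal operator of t*||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_land_denoise(g, D, patch=8, lam=1.0, mu=0.1, outer=5, inner=30):
    """Alternating scheme in the spirit of (4.11), specialized (as an illustration) to
    A = id, L = 0.5*||. - g||^2, p = 1 and non-overlapping patches P_j.
    D has shape (patch*patch, n_atoms); g is a square image whose side length
    is a multiple of `patch`."""
    f = g.astype(float).copy()
    n = g.shape[0] // patch
    xi = [np.zeros(D.shape[1]) for _ in range(n * n)]
    step = 1.0 / (2 * lam * np.linalg.norm(D, 2) ** 2)
    for _ in range(outer):
        # xi-update: a few ISTA steps on lam*||D xi - P_j f||^2 + mu*||xi||_1
        patches = [f[i*patch:(i+1)*patch, j*patch:(j+1)*patch].ravel()
                   for i in range(n) for j in range(n)]
        for k, p in enumerate(patches):
            for _ in range(inner):
                grad = 2 * lam * D.T @ (D @ xi[k] - p)
                xi[k] = soft(xi[k] - step * grad, step * mu)
        # f-update: with A = id and non-overlapping patches this is a pixelwise average
        approx = np.zeros_like(f)
        k = 0
        for i in range(n):
            for j in range(n):
                approx[i*patch:(i+1)*patch, j*patch:(j+1)*patch] = \
                    (D @ xi[k]).reshape(patch, patch)
                k += 1
        f = (g + 2 * lam * approx) / (1 + 2 * lam)
    return f
```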
where
\[ S_\theta(f, \xi_1, \ldots, \xi_N, D) := \sum_{j=1}^{N} \bigl[ \lambda_j \| P_j(f) - E^*_D(\xi_j) \|_2^2 + \mu_j \| \xi_j \|_p^p \bigr], \tag{4.13} \]
with θ = (λ_j, µ_j)_{j=1}^N ∈ (R^2)^N, and E∗D : Ξ → X denotes the synthesis operator associated with the dictionary D. Usually an alternating minimization scheme is used to optimize over the three variables in (4.12).
All three formulations are posed in terms of the `0-norm and are NP-hard.
If D is fixed then the sum in (4.17) decouples and leads to the convex
relaxation of the sparse coding problem in (2.20) for A = id.
In the finite-dimensional setting X = Rm and Ξ is replaced by Rn for some n. The dictionary D := {φk}k=1,...,n ⊂ Rm is then represented by an n × m matrix D, so the synthesis operator becomes a mapping E∗D : Rm → Rn with E∗D(ξ) = D·ξ. In this setting (4.17) becomes
\[ (\hat{D}, \hat{\xi}_i) := \arg\min_{\xi_i \in \mathbb{R}^m, \; D \in \mathbb{R}^{n \times m}} \sum_{i=1}^{N} \bigl[ \ell_X(f_i, D \cdot \xi_i) + \theta \| \xi_i \|_1 \bigr]. \tag{4.18} \]
Here, if D satisfies the RIP then the convex relaxation preserves the sparse
solution (Candès et al. 2006). State-of-the-art dictionary learning algorithms
are K-SVD (Aharon, Elad and Bruckstein 2006), geometric multi-resolution
analysis (GRMA) (Allard, Chen and Maggioni 2012) and online dictionary
learning (Mairal, Bach, Ponce and Sapiro 2010). Most work on dictionary
learning to date has been done in the context of denoising, i.e. A = id; see
also Rubinstein et al. (2010).
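A minimal flavour of dictionary learning can be conveyed by alternating sparse coding with a least-squares dictionary update (a MOD-style scheme rather than K-SVD); in the sketch below the training signals, the number of atoms and the sparsity weight are assumed inputs, and the dictionary is stored column-wise (transposed relative to the convention in (4.18)):

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding, the proximal operator of t*||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def dictionary_learning(F, n_atoms, theta=0.1, outer=20, inner=50, seed=0):
    """Alternate ISTA sparse coding with a least-squares (MOD-style) dictionary update
    for min_{D, Xi} 0.5*||D Xi - F||_F^2 + theta*||Xi||_1, with unit-norm atoms.
    F holds the training signals f_i as columns."""
    rng = np.random.default_rng(seed)
    m, N = F.shape
    D = rng.standard_normal((m, n_atoms))
    D /= np.linalg.norm(D, axis=0)
    Xi = np.zeros((n_atoms, N))
    for _ in range(outer):
        step = 1.0 / np.linalg.norm(D, 2) ** 2
        for _ in range(inner):                        # sparse coding step (columnwise ISTA)
            Xi = soft(Xi - step * (D.T @ (D @ Xi - F)), step * theta)
        D = F @ np.linalg.pinv(Xi)                    # dictionary update (least squares)
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    return D, Xi
```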
matrices (a union of banded and circulant matrices). Set up like this, convolutional dictionaries render shift-invariant dictionaries computationally feasible, where atoms depend on the entire signal.
Convolutional sparse coding. Consider now the inverse problem of recover-
ing ftrue ∈ X from (2.1) with the assumption that ftrue is compressible with
respect to convolution dictionary D := {φi } ⊂ X.
In convolutional sparse coding (CSC), this is done by performing a syn-
thesis using convolutional dictionaries, that is, atoms act by convolutions.
More precisely, the reconstruction operator Rθ : Y → X is given as
\[ R_\theta(g) := \sum_i \hat{\xi}_i * \phi_i, \tag{4.19} \]
where
\[ \hat{\xi}_i \in \arg\min_{\xi_i \in X} \Bigl\{ L\Bigl( A\Bigl( \sum_i \xi_i * \phi_i \Bigr), g \Bigr) + \lambda \| \xi_i \|_0 \Bigr\}. \]
Computational methods for solving (4.19) for denoising use convex relaxa-
tion followed by the alternating direction method of multipliers (ADMM)
in frequency space (Bristow, Eriksson and Lucey 2013) and its variants. See
also Sreter and Giryes (2017) on using LISTA in this context. So far CSC
has only been analysed in the context of denoising (Bristow et al. 2013,
Wohlberg 2014, Gu et al. 2015, Garcia-Cardona and Wohlberg 2017) with
theoretical properties given in Papyan, Sulam and Elad (2016a, 2016b).
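The convex relaxation of (4.19) for denoising can be solved with ISTA when the convolutions are taken to be periodic, so that the adjoint is exact in the Fourier domain; the following 1-D sketch makes these (illustrative) choices:

```python
import numpy as np

def csc_ista(g, filters, lam=0.1, iters=300):
    """Convolutional sparse coding for denoising (A = id) via ISTA on the convex
    relaxation of (4.19): min_xi 0.5*||sum_i xi_i * phi_i - g||^2 + lam*sum_i ||xi_i||_1,
    with periodic 1-D convolutions so that the adjoint is exact."""
    n = g.size
    Phi = np.array([np.fft.fft(np.pad(phi, (0, n - phi.size))) for phi in filters])
    Xi = np.zeros((len(filters), n))
    step = 1.0 / np.sum(np.abs(Phi) ** 2, axis=0).max()   # 1 / ||synthesis operator||^2
    for _ in range(iters):
        synth = np.real(np.fft.ifft(np.sum(Phi * np.fft.fft(Xi, axis=1), axis=0)))
        r_hat = np.fft.fft(synth - g)                     # residual in the Fourier domain
        grad = np.real(np.fft.ifft(np.conj(Phi) * r_hat, axis=1))   # adjoint convolutions
        Z = Xi - step * grad
        Xi = np.sign(Z) * np.maximum(np.abs(Z) - step * lam, 0.0)   # soft thresholding
    recon = np.real(np.fft.ifft(np.sum(Phi * np.fft.fft(Xi, axis=1), axis=0)))
    return Xi, recon
```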
Convolutional dictionary learning. Learning a dictionary in the context of
CSC is called convolutional dictionary learning. Here, given unsupervised
training data f1, . . . , fm ∈ X and a loss function ℓX : X × X → R, one solves
\[ \arg\min_{\phi_i, \, \xi_{j,i} \in X} \Bigl\{ \sum_{j=1}^{m} \ell_X\Bigl( f_j, \sum_i \xi_{j,i} * \phi_i \Bigr) + \lambda \sum_{j=1}^{m} \sum_i \| \xi_{j,i} \|_1 \Bigr\}, \tag{4.20} \]
\[ \xi^k = \sum_j \bigl( \xi^{k+1}_j * \phi^{k+1}_j \bigr) \quad \text{for } k = 1, \ldots, L-1, \]
and ‖ξ^k‖0,∞ ≤ sk for k = 1, . . . , L. Hence, atoms φk,i ∈ Dk in the kth
convolution dictionary are compressible in the (k + 1)th dictionary Dk+1 for
k = 1, . . . , L − 1.
The ML-CSC model is a special case of CSC where intermediate repres-
entations have a specific structure (Sulam et al. 2017, Lemma 1). Building
on the theory for CSC, Sulam et al. (2017) provide a theoretical study of this
novel model and its associated pursuits for dictionary learning and sparse
coding in the context of denoising. Further, consequences for the theoretical
analysis of CNNs can be extracted from ML-CSC using the fact that the
resulting layered thresholding algorithm and the layered basis pursuit share
many similarities with a forward pass of a deep CNN.
Indeed, Papyan, Romano and Elad (2017) show that ML-CSC yields a
Bayesian model that is implicitly imposed on fˆ when deploying a CNN, and
that consequently characterizes signals belonging to the model behind a deep
CNN. Among other properties, one can show that the CNN is guaranteed
to recover an estimate of the underlying representations of an input signal,
assuming these are sparse in a local sense (Papyan et al. 2017, Theorem 4)
and the recovery is stable (Papyan et al. 2017, Theorems 8 and 10). Many of
these results also hold for fully connected networks, and they can be used to
formulate new algorithms for CNNs, for example to propose an alternative to
the commonly used forward pass algorithm in CNN. This is related to both
deconvolutional (Zeiler, Krishnan, Taylor and Fergus 2010, Pu et al. 2016)
and recurrent networks (Bengio, Simard and Frasconi 1994).
An essential technique for proving the key results in the cited references
for ML-CSC is based on unrolling, which establishes a link between sparsity-
promoting regularization (compressed sensing) and deep neural networks.
More precisely, one starts with a variational formulation like that in (4.20)
and specifies a suitable iterative optimization scheme. In this setting one
can prove several theoretical results, such as convergence, stability and er-
ror estimates. Next, one unrolls the truncated optimization iterates and
identifies the updating between iterates as layers in a deep neural network
(Section 4.9.1). The properties of this network, such as stability and con-
vergence, can now be analysed using methods from compressed sensing.
Deep dictionary learning. Another recent approach in the context of dic-
tionary learning is deep dictionary learning. Here, the two popular rep-
resentation learning paradigms – dictionary learning and deep learning –
come together. Conceptually, while dictionary learning focuses on learning
\[ \Gamma^N_{s_1,\ldots,s_N}(f) = \bigl[ \phi * (\rho \circ W_{s_N}) \circ \cdots \circ (\rho \circ W_{s_1}) \bigr](f), \]
is equivalent to
\[ (\hat{f}, \hat{h}) := \arg\min_{f, h} \{ L(A(f), g) + \lambda S(h) \} \quad \text{subject to } f = h. \]
The latter can be solved using ADMM (Section 8.2.7), where the update in h is computed using a proximal operator,
\[ h^{(k+1)} = \mathrm{prox}_{\tau \lambda S}\bigl( h^{(k)} - f + u \bigr), \tag{4.21} \]
where u is a Lagrange (dual) variable.
The idea is now to replace the proximal operator with a generic denoising
operator, which implies that the regularization functional S is not neces-
sarily explicitly defined. This opens the door to switching in any demon-
strably successful denoising algorithms without redesigning a reconstruction
algorithm – hence the name ‘Plug-and-Play’. However, the method comes
with some disadvantages. The lack of an explicit representation of the reg-
ularizer detracts from a strict Bayesian interpretation of the regularizer as
a prior probability distribution in MAP estimation and prevents explicit
monitoring of the change in posterior probability of the iterative estimates
of the solution. Next, the method is by design tied to the ADMM iter-
ative scheme, which may not be optimal and which requires a non-trivial
tuning of parameters of the ADMM algorithm itself (e.g. the Lagrangian
penalty weighting term). Finally, it is not provably convergent for arbitrary
denoising ‘engines’.
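The Plug-and-Play construction itself is short to write down; the sketch below uses the standard scaled form of ADMM for a least-squares data fit, with the h-update replaced by a generic denoiser. The penalty parameter and the moving-average 'denoiser' in the usage comment are placeholders:

```python
import numpy as np

def pnp_admm(A, g, denoiser, tau=1.0, iters=50):
    """Plug-and-Play ADMM for 0.5*||A f - g||^2 + lam*S(h) subject to f = h, in scaled
    form, with the proximal h-update replaced by a generic `denoiser` (cf. (4.21))."""
    n = A.shape[1]
    f, h, u = np.zeros(n), np.zeros(n), np.zeros(n)   # primal variables and scaled dual
    M = A.T @ A + (1.0 / tau) * np.eye(n)             # system matrix for the f-update
    Atg = A.T @ g
    for _ in range(iters):
        f = np.linalg.solve(M, Atg + (h - u) / tau)   # data-fit step (quadratic problem)
        h = denoiser(f + u)                           # denoiser replaces prox_{tau*lam*S}
        u = u + f - h                                 # dual update
    return f

# e.g. with a simple moving-average filter standing in for a learned denoiser:
# smooth = lambda x: np.convolve(x, np.ones(5) / 5, mode='same')
# f_rec = pnp_admm(A, g, smooth)
```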
Regularization by denoising (RED). The RED method (Romano, Elad and
Milanfar 2017a) is motivated by the P 3 method. It is a variational method
where the reconstruction operator is given as in (2.7), with the regularization functional explicitly given as
\[ S(f) := \langle f, f - \Lambda(f) \rangle \quad \text{for some } \Lambda : X \to X. \tag{4.22} \]
The Λ operator above is a general (non-linear) denoising operator : it can
for example be a trained deep neural network. It does, however, need to
satisfy two key properties, which are justifiable in terms of the desirable
features of an image denoiser.
Local homogeneity. The denoising operator should locally commute with
scaling, that is,
Λ(cf) = cΛ(f)
for all f ∈ X and |c − 1| ≤ ε with ε small.5
Strong passivity. The derivative of Λ should have a spectral radius less
than unity:
ρ(∂Λ(f )) ≤ 1.
This is justified by imposing a condition that the effect of the denoiser
should not increase the norm of the model parameter:
‖Λ(f)‖ = ‖∂Λ(f)f‖ ≤ ρ(∂Λ(f)) ‖f‖ ≤ ‖f‖.
The key implication from local homogeneity is that the directional deriv-
ative of Λ along f is just the application of the denoiser to the f itself:
∂Λ(f )f = Λ(f ).
5 This is a less restrictive condition than requiring equality for all c ≥ 0.
The key implication from strong passivity is that it allows for convergence
of the proposed RED methods by ensuring convexity of their associated
regularization functionals.
Defining W : X → L(X, X) implicitly through the relation W(f)f = Λ(f) and assuming local homogeneity and strong passivity yields the following computationally feasible expression for the gradient of the regularizer:
\[ \nabla S(f) = f - \Lambda(f) = (\mathrm{id} - W(f))(f). \tag{4.23} \]
Here, the operator (id − W (f )) is further interpreted as an ‘image adapt-
ive Laplacian-based’ regularizer. The above allows one to implement the RED framework within any optimization scheme, such as gradient descent, fixed-point iteration or ADMM, in contrast to the P 3 approach, which is coupled to the ADMM
scheme. See also Reehorst and Schniter (2018) for further clarifications and
new interpretations of the regularizing properties of the RED method.
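Using the gradient identity (4.23), RED reduces to ordinary gradient descent on the variational objective; the sketch below assumes a matrix forward operator and an arbitrary denoiser standing in for Λ, with illustrative step size and weight:

```python
import numpy as np

def red_gradient_descent(A, g, denoiser, lam=0.5, step=0.05, iters=200):
    """Gradient descent on 0.5*||A f - g||^2 + lam*S(f) with S as in (4.22), using the
    identity grad S(f) = f - Lambda(f) from (4.23); `denoiser` plays the role of Lambda."""
    f = A.T @ g                                       # simple initialization
    for _ in range(iters):
        grad = A.T @ (A @ f - g) + lam * (f - denoiser(f))
        f = f - step * grad
    return f
```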
where θb ∈ arg minθ L(θ), with the loss function θ 7→ L(θ) defined as
We also point out that, unlike other data-driven approaches for inverse
problems, the above method can be adapted to work with only unsuper-
vised training data. A special case is to have gi ≈ A(fi ), which gives an unsupervised formulation of (4.26).
Lunz et al. (2018) chose a Wasserstein-flavoured loss functional (Gulrajani
et al. 2017) to train the regularizer, that is, one solves (4.24) with the loss
function
The last term in the loss function serves to enforce the trained regularizer
S θ to be Lipschitz-continuous with constant one (Gulrajani et al. 2017).
Under appropriate assumptions on ρ and π (see Assumptions 4.4 and 4.5)
and for the asymptotic case of S θb having been trained to perfection, the
loss (4.27) coincides with the 1-Wasserstein distance defined in (B.2).
A list of qualitative properties of S θb can be proved: for example, The-
orem 1 of Lunz et al. (2018) shows that under appropriate regularity as-
sumptions on the Wasserstein distance between ρ and π, starting from ele-
ments in ρ and taking a gradient descent step of S θb (which results in a new
distribution ρη ) strictly decreases the Wasserstein distance between the new
distribution ρη and π. This is a good indicator that using S θb as a variational
regularization term, and consequently penalizing it, indeed introduces the
highly desirable incentive to align the distribution of regularized solutions
with the distribution of ground truth samples Πprior . Another characteriz-
ation of such a trained regularizer S θb using the Wasserstein loss in (B.2) is
the updates using a deep neural network. A key part of their approach
is ensuring that the learned scheme has convergence guarantees. See also
Rizzuti, Siahkoohi and Herrmann (2019) for a similar approach to solving
the Helmholtz equation.
Finally, we also mention Ingraham, Riesselman, Sander and Marks (2019),
who unroll a Monte Carlo simulation as a model for protein folding. They
compose a neural energy function with a novel and efficient simulator based
on Langevin dynamics to build an end-to-end-differentiable model of atomic
protein structure given amino acid sequence information.
where
\[ \hat{\theta}(g) \in \arg\min_{\theta} \| A \circ \Psi^\dagger_\theta(\xi_0) - g \|^2. \]
At the core of the DIP approach is the assumption that one can construct a
(decoder) network Ψ†θ : Ξ → X which outputs elements in X, which are close
to or have a high probability of belonging to the set of feasible parameters.
We emphasize that the training is with respect to θ: the input ξ0 is kept
fixed. Furthermore, machine learning approaches generally use large sets of
training data, so it is somewhat surprising that deep image priors are trained
on a single data set g. In summary, the main ingredients of deep inverse
priors (DIPs) are a single data set, a well-chosen network architecture and a
stopping criterion for terminating the training process (Ulyanov et al. 2018).
One might assume that the network architecture Ψ†θ would need to incor-
porate rather specific details about the forward operator A, or even more
importantly about the prior distribution Πprior of feasible model paramet-
ers. This seems not to be the case: empirical evidence suggests that rather
generic network architectures work for different inverse problems, and the
obtained numerical results demonstrate the potential of DIP approaches for
large-scale inverse problems such as MPI.
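The DIP training loop itself is conceptually simple; the sketch below uses a tiny two-layer network with hand-coded gradients as an illustrative stand-in for the deep architectures used in practice, with the fixed random input ξ0, the width, the learning rate and the fixed iteration budget (playing the role of the stopping rule) all being ad hoc choices:

```python
import numpy as np

def dip_reconstruct(A, g, width=64, iters=2000, lr=1e-3, seed=0):
    """Deep-image-prior style sketch: f = W2 @ relu(W1 @ z0) with a fixed random input z0;
    only the weights are trained to minimize 0.5*||A f - g||^2, and a fixed iteration
    budget plays the role of the early-stopping rule."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    z0 = rng.standard_normal(width)                   # fixed network input xi_0
    W1 = rng.standard_normal((width, width)) / np.sqrt(width)
    W2 = rng.standard_normal((n, width)) / np.sqrt(width)
    for _ in range(iters):
        a = W1 @ z0
        h = np.maximum(a, 0.0)                        # ReLU hidden layer
        f = W2 @ h                                    # network output Psi_theta(xi_0)
        r = A @ f - g
        grad_f = A.T @ r                              # gradient of 0.5*||A f - g||^2 in f
        grad_W2 = np.outer(grad_f, h)                 # backpropagation by hand
        grad_W1 = np.outer((W2.T @ grad_f) * (a > 0), z0)
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2
    return W2 @ np.maximum(W1 @ z0, 0.0)
```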
So far, investigations related to DIP have been predominantly experi-
mental and mostly restricted to problems that do not fall within the class
of ill-posed inverse problems. However, some work has been done in the
context of inverse problems: for example, Van Veen et al. (2018) consider
the DIP approach to solve an inverse problem with a linear forward oper-
ator. They introduce a novel learned regularization technique which further
reduces the number of measurements required to achieve a given reconstruc-
tion error. An approach similar to DIP is considered by Gupta et al. (2018)
for CT reconstruction. Here one regularizes by projection onto a convex set
(see Gupta et al. 2018, equation (3)) and the projector is constructed by
training a U-Net against unsupervised data. Gupta et al. (2018, Theorem 3)
also provide guarantees for convergence to a local minimum.
We now briefly summarize the known theoretical foundations of DIP for
inverse problems based on the recent paper by Dittmer, Kluth, Maass and
Baguer (2018), who analyse and prove that certain network architectures in
combination with suitable stopping rules do indeed lead to regularization schemes; this gives rise to the notion of ‘regularization by architecture’. We
also include numerical results for the integration operator; more complex
results for MPI are presented in Section 7.5.
there is potential for training networks with a single data point. Also,
Landweber iterations converge from rather arbitrary starting points, indic-
ating that the choice of ξ0 in the general case is indeed of minor importance.
Using this characterization of fˆ, we define the analytic deep prior as the
network, which is obtained by a gradient descent method with respect to B
for
\[ L(B, g) = \frac{1}{2} \| A(\hat{f}) - g \|^2. \tag{4.55} \]
The resulting deep prior network has proxλ S as activation function, and the
linear map W and its bias b are as described in (ii) and (iii). This allows
us to obtain an explicit description of the gradient descent for B, which in
turn leads to an iteration of functionals J B .
Simple example. We here examine analytic deep priors for linear inverse
problems, i.e. A : X → Y is linear, and compare them to classical Tikhonov regularization with S(f) = ½‖f‖². Let fλ ∈ X denote the solution obtained with classical Tikhonov regularization, which by (2.11) can be expressed as
\[ f_\lambda = (A^* \circ A + \lambda\, \mathrm{id})^{-1} \circ A^*(g). \]
This is equivalent to the solution obtained by the analytic deep prior ap-
proach, with B = A without any iteration. Now, take B = A as a starting
point for computing a gradient descent with respect to B using the DIP
approach, and compare the resulting fˆ with fλ .
The proximal mapping for the functional S above is given by

prox_{λS}(z) = z / (1 + λ).
A rather lengthy calculation (see Dittmer, Kluth, Maass and Baguer 2018) yields an explicit formula for the derivative of F with respect to B in the iteration

B^{k+1} = B^k − η ∂F(B^k).
The expression stated there can be made explicit for special settings. For
illustration we assume the rather unrealistic case that f^+ = h, where h ∈ X
is a singular function for A with singular value σ. The dual singular function
is denoted by v ∈ Y , i.e. A h = σv and A∗ v = σh, and we further assume
that the measurement noise in g is in the direction of this singular function,
i.e. g = (σ +δ)v. In this case, the problem is indeed restricted to the span of
h and the span of v, respectively. The iterates B^k only change the singular value β_k of h, that is,

B^{k+1} = B^k − c_k ⟨ · , h⟩ v,

with a suitable c_k = c(λ, δ, σ, η).
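For the simple example just described, the inner Tikhonov problem has a closed-form solution, so the gradient descent over B can also be run numerically with automatic differentiation rather than the explicit formula. A minimal sketch, with an assumed discretization, noise level and step size:

import torch

torch.manual_seed(0)
n = 50
A = torch.tril(torch.ones(n, n)) / n                       # assumed discretized forward operator
f_true = torch.sin(torch.linspace(0.0, 3.14, n))
g = A @ f_true + 0.01 * torch.randn(n)

lam, eta = 1e-2, 1e-2
B = A.detach().clone().requires_grad_(True)                # start the descent at B = A

for k in range(200):
    # Inner problem: classical Tikhonov solution for the current operator B.
    f_hat = torch.linalg.solve(B.T @ B + lam * torch.eye(n), B.T @ g)
    loss = 0.5 * torch.sum((A @ f_hat - g) ** 2)           # L(B, g) as in (4.55)
    loss.backward()
    with torch.no_grad():
        B -= eta * B.grad                                  # B^{k+1} = B^k - eta dF(B^k)
    B.grad.zero_()

f_hat_final = torch.linalg.solve(B.T @ B + lam * torch.eye(n), B.T @ g).detach()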
Deep inverse priors for the integration operator. We now illustrate the use
of deep inverse prior approaches for solving an inverse problem with the
integration operator A : L²([0, 1]) → L²([0, 1]), defined by

A(f)(t) = ∫_0^t f(s) ds.   (4.56)
Here A is linear and compact, hence the task of evaluating its inverse is an
ill-posed inverse problem.
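The ill-posedness is easy to observe numerically: discretizing A on a uniform grid (an arbitrary grid size is used below) and inspecting its singular values shows the decay towards zero that characterizes a compact operator.

import numpy as np

n = 200
# Discretize A(f)(t) = int_0^t f(s) ds by cumulative sums scaled with the step size.
A = np.tril(np.ones((n, n))) / n

s = np.linalg.svd(A, compute_uv=False)
print(s[0], s[n // 2], s[-1])      # singular values decay towards zero
print(np.linalg.cond(A))           # the condition number grows with n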
(Figures: deep inverse prior results for the integration operator, showing the data g, the ground truth f_true, the Tikhonov solution f_λ and the DIP reconstruction f(B^k̂), together with the corresponding L-curves, true-error curves and ‖B^k̂‖² plots. Panels cover fixed regularization parameters λ = 0.01 and λ = 0.001, as well as runs with λ = 0.1, λ_k̂ = 1.31 × 10⁻⁴; λ = 0.01, λ_k̂ = 4.09 × 10⁻⁴; and λ = 0.001, λ_k̂ = 3.51 × 10⁻⁴.)
5.1.1. Overview
There are various ways of combining techniques from deep learning with
statistical regularization. The statistical characteristics of the training data together with the choice of loss function determine the training problem one seeks to solve during learning. This in turn determines the type of
estimator (reconstruction operator) one is approximating.
Supervised learning. The training data are given as samples (f_i, g_i) ∈ X × Y generated by (f, g) ∼ µ. One can then approximate the Bayes estimator, that is, we seek R_θ̂ : Y → X, where θ̂ solves

θ̂ ∈ arg min_{θ∈Θ} E_{(f,g)∼µ}[ℓ_X(f, R_θ(g))].

The actual training involves replacing the joint law µ with its empirical counterpart induced by the supervised training data. Examples of methods that build on the above are surveyed in Section 5.1.2.
Learned prior. The training data f_i ∈ X are samples generated by a µ_f-distributed random variable, where µ_f ∈ P_X is the f-marginal of µ. One can then learn the negative log prior density in a MAP estimator, that is, R_θ̂ : Y → X is given by

R_θ̂(g) := arg min_{f∈X} {− log π_data(g | f) + S_θ̂(f)}.

Here π_data(· | f) is the density for the data likelihood Π^f_data ∈ P_Y and θ̂ is learned such that S_θ̂(f) ≈ − log(π_f(f)), with π_f denoting the density for µ_f ∈ P_X, which is the f-marginal of µ. The actual training involves replacing µ_f with its empirical counterpart induced by the training data. Examples of methods that build on the above are surveyed in Section 4.7; a small sketch of this estimator is given at the end of this overview.
Unsupervised learning. The training data gi ∈ Y are samples generated
by a µg -distributed random variable where µg ∈ PY is the g-marginal of
µ. It is not possible to learn a prior in a MAP estimator from such training
data, but one can improve upon the computational feasibility for evaluating
a given MAP estimator. We do that by considering R_θ̂ : Y → X, where θ̂ solves

θ̂ := arg min_θ E_{g∼µ_g}[− log π_data(g | R_θ(g)) + S_λ(R_θ(g))].   (5.2)

In the above, both the density π_data(· | f) for the data likelihood Π^f_data ∈ P_Y and the negative log density S_λ : X → R of the prior are handcrafted. The actual training involves replacing µ_g ∈ P_Y with its empirical counterpart µ̂_g induced by the training data. Here L : Y × Y → R denotes the negative data log-likelihood and S_λ : X → R the negative log-prior; neither of these is learned. Examples of methods that build on the above are surveyed in Section 4.9.
Semi-supervised learning. The training data fi ∈ X and gi ∈ Y are
semi-supervised, i.e. unpaired samples from the marginal distributions µf
and µ_g of µ, respectively. One can then compute an estimator R_θ̂ : Y → X, where θ̂ solves

θ̂ ∈ arg min_θ { E_{(f,g)∼µ_f⊗µ_g}[ℓ_Y(A(R_θ(g)), g) + ℓ_X(R_θ(g), f)] + λ ℓ_{P_X}((R_θ)_#(µ_g), µ_f) }.   (5.3)

In the above, ℓ_X : X × X → R and ℓ_Y : Y × Y → R are loss functions on X and Y, respectively. Next, ℓ_{P_X} : P_X × P_X → R is a distance notion between probability distributions on X, and (R_θ)_#(µ_g) ∈ P_X denotes the pushforward of the measure µ_g ∈ P_Y by R_θ : Y → X. It is common to evaluate ℓ_{P_X} using techniques from GANs, which introduce a separate deep
neural network (discriminator/critic). Finally, the parameter λ controls the
balance between the distributional consistency, noise suppression and data
consistency. One can also consider further variants of the above, for example
when there is access to a large sample of unpaired data combined with a
small amount of paired data, or when parts of the probability distributions
involved are known.
The choice of neural network architecture for the reconstruction operator
Rθb : Y → X is formally independent of the choice of loss function and the
set-up of the training problem. The choice does, however, impact the train-
ability of the learning, especially when there is little training data. In such
cases, it is important to make use of all the information. In inverse problems
one has explicit knowledge about how data are generated that comes in the
form of a forward operator, or one might have an expression for the en-
tire data likelihood. Architectures that embed such explicit knowledge, e.g.
the forward operator and the adjoint of its derivative, perform better when
there is little training data. They also have better generalization properties, and adversarial attacks (Chakraborty et al. 2018, Akhtar and Mian 2018) against them are more difficult to design since a successful attack needs to be consistent with how data are generated. Architectures that account
for such information can be defined by unrolling (Section 4.9.4).
The remaining sections survey various approaches from the literature in
computing the above estimators.
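As an illustration of the learned-prior MAP estimator above, the following sketch runs plain gradient descent on f ↦ −log π_data(g | f) + S_θ̂(f) for a linear forward operator and Gaussian noise. The learned regularizer enters only through its gradient, which is treated as a black box, and all parameter values are illustrative.

import numpy as np

def map_with_learned_prior(A, g, grad_S, sigma=0.05, tau=1e-3, n_iter=1000):
    """Gradient descent on ||A f - g||^2 / (2 sigma^2) + S_theta(f);
    grad_S(f) returns the gradient of the (already trained) regularizer."""
    f = np.zeros(A.shape[1])
    for _ in range(n_iter):
        f -= tau * (A.T @ (A @ f - g) / sigma ** 2 + grad_S(f))
    return f

# Hypothetical learned quadratic regularizer S_theta(f) = 0.5 f^T Q f, so grad_S(f) = Q f:
# f_map = map_with_learned_prior(A, g, grad_S=lambda f: Q @ f)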
such training data, then one can approximate the Bayes estimator in (5.4) by the neural network R_θ̂ : Y → X, where the finite-dimensional network parameter θ̂ ∈ Θ is learned from data by solving the following empirical risk minimization problem:

θ̂ ∈ arg min_{θ∈Θ} (1/m) Σ_{i=1}^m ℓ_X(f_i, R_θ(g_i)),   (5.7)

where (f_i, g_i) ∈ Σ_m as in (5.6). Note now that (5.7) does not explicitly require specifying a prior f ↦ Π_prior(f) or a data likelihood g ↦ Π^f_data that models how data are generated given a model parameter. Information about both of these is implicitly contained in the supervised training data Σ_m ⊂ X × Y.
Fully learned Bayes estimation (Section 5.1.3) refers to approaches where
one assumes there is enough supervised training data to learn the joint law
µ, that is, one disregards the explicit knowledge about the data likelihood.
In contrast, learned iterative schemes (Section 5.1.4) include the information
about the data likelihood by using an appropriate architecture of Rθ : Y →
X. Learned post-processing methods (Section 5.1.5) offer an alternative
way to account for the data likelihood since these methods apply an initial
reconstruction operator that maps data to a model parameter. This is
actually an estimator different from the above Bayes estimator, but if the
loss is the squared L2 -norm and the initial reconstruction operator is a linear
sufficient statistic, then these estimators coincide in the ‘large-sample’ or
‘small-noise’ limit.
Regularizing the learning. The problem in (5.7) is ill-posed in itself, so one
should not try to solve it in the formal sense. A wide range of techniques
have been developed within supervised learning for implicitly or explicitly
regularizing the empirical risk minimization problem in (5.7) as surveyed
and categorized by Kukačka, Golkov and Cremers (2017). A key challenge
is to handle the non-convexity, and the energy landscape for the objective in
(5.7) typically has many local minima: for example, for binary classification
there is an exponential number (in terms of network parameters) of distinct
local minima (Auer, Herbster and Warmuth 1996).
Similar to Shalev-Shwartz and Ben-David (2014, Section 2.1), we define a training algorithm
for (5.7) as an operator mapping a probability measure on X × Y to a
parameter in Θ that approximately solves (5.7):
T : P_{X×Y} → Θ,   (5.8)

where T(µ̂) ≈ θ̂ with θ̂ ∈ Θ denoting a solution to (5.7). Thus, the train-
ing algorithm is a method for approximately solving (5.7) given a fixed
neural network architecture. This also includes necessary regularization
techniques: for example, a common strategy for solving (5.7) is to use some
variant of stochastic gradient descent that is cleverly initialized (often at
AutoMap was used to reconstruct 128×128 pixel images from MRI and PET
imaging data. The dependence on fully connected layers results in a large
number of neural network parameters that have to be trained. Primarily
motivated by this difficulty, a further development of AutoMap is ETER-
net (Oh et al. 2018), which uses a recurrent neural network architecture in
place of the fully connected/convolutional auto-encoder architecture. Also
addressing 128 × 128 pixel images from MRI, Oh et al. (2018) found a re-
duction in required parameters by over 80%. A method similar to AutoMap
is used by Yoo et al. (2017) to solve the non-linear reconstruction problem
in diffuse optical tomography. Here the forward problem is the Lippmann–
Schwinger equation but only a single fully connected layer is used in the
backprojection step. Yoo et al. (2017) exploit the intrinsically ill-posed
nature of the forward problem to argue that the mapping induced by the
auto-encoder step is low-rank and therefore sets an upper bound on the
dimension of the hidden convolution layers.
The advantage of fully learned Bayes estimation lies in its simplicity,
since one avoids making use of an explicit forward operator (or data likeli-
hood). On the other hand, any generic approach to reconstruction by deep
neural networks requires having connected layers that represent the relation
between model parameters and data. For this reason, generic fully learned
Bayes estimation will always scale badly: for example, in three-dimensional
tomographic reconstruction it is common to have an inverse problem which,
when discretized, involves recovering a (512 × 512 × 512 ≈ 10⁸)-dimensional model parameter from data of the same order of magnitude. Hence, a fully learned generic approach would involve learning at least 10¹⁶ weights
from supervised data! There have been several attempts to address the
above issue by considering neural network architectures that are adapted
to specific direct and inverse problems. One example is that of Khoo and
Ying (2018), who provide a novel neural network architecture (SwitchNet)
for solving inverse scattering problems involving the wave equation. By
leveraging the inherent low-rank structure of the scattering problems and
introducing a novel switching layer with sparse connections, the SwitchNet
architecture uses far fewer parameters than a U-Net architecture for such
problems. Another example is that of Ardizzone et al. (2018), who propose encoding the forward operator using an invertible neural network, also called
a reversible residual network (Gomez, Ren, Urtasun and Grosse 2017). The
reconstruction operator is then obtained as the inverse of the invertible
neural network for the forward operator. However, it is unclear whether
this is a clever approach to problems that are ill-posed, since an inverse of
the forward operator is not stable. A third example, already described above, is the AutoMap-like architecture that Yoo et al. (2017) apply to non-linear reconstruction in diffuse optical tomography. The above approaches can
also to some extent be seen as further refinements of methods in Section 4.2.
However, none of the above efforts addresses the challenge of finding sufficient supervised training data for the training. Furthermore,
any changes to the acquisition protocol or instrumentation may require re-
training, making the method impractical. In particular, due to the lack of
training data, fully learned Bayes estimation is inapplicable to cases when
data are acquired using novel instrumentation. A practical case would be
spectral CT, where novel direct counting energy resolving detectors are be-
ing developed.
The trained network is used to invert the Fourier transform (MRI image
reconstruction). However, the whole approach is unnecessarily complex,
and it is now surpassed by learned iterative methods that have a more
transparent logic. The survey will therefore focus on these latter variants.
cost for the training of the unrolled network. One approach is to replace
the end-to-end training of the entire neural network and instead break down
each iteration and train the sub-networks sequentially (gradient boosting).
This is the approach taken by Hauptmann et al. (2018) and Wu et al.
(2018a), but as shown by Wu et al. (2018a), the output quality has minor im-
provements over learned post-processing (Section 5.1.5), which also scales to
the three-dimensional setting. A better alternative could be to use a revers-
ible residual network architecture (Gomez et al. 2017, Ardizzone et al. 2018)
for a learned iterative method, since these are much better at managing
memory consumption in training (mainly when calculating gradients using
backpropagation) as networks grow deeper and wider. However, this is yet
to be done.
Learned iterative in both model parameter and data spaces. The final en-
hancement to the learned iterative schemes is to introduce an explicit learned
updating in the data space as well. To see how this can be achieved, one
can unroll a primal–dual-type scheme of the form
v^0 = g and f^0 ∈ X given,
v^{k+1} = Γ_{θ_k^d}(v^k, A(f^k), g),
f^{k+1} = Γ_{θ_k^m}(f^k, [∂A(f^k)]^*(v^{k+1})),
for k = 0, . . . , N − 1.   (5.14)

Here, Γ_{θ_k^m} : X × X → X and Γ_{θ_k^d} : Y × Y × Y → Y are the learned updating operators in the model parameter and data spaces, respectively.
This is illustrated in Figure 5.2, and similar networks are also suggested by
Vogel and Pock (2017) and Kobler et al. (2017), who extend the approach of
Hammernik et al. (2018) by parametrizing and learning the data discrepancy
L. Applications are for inverting the two-dimensional Fourier transform
(two-dimensional MRI image reconstruction). See also He et al. (2019),
who unroll an ADMM scheme with updates in both reconstruction and
data spaces and apply that to two-dimensional CT reconstruction.
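A minimal sketch of an unrolled network of the form (5.14), for a linear forward operator (so that [∂A(f)]^* = A^T) and with small fully connected updates standing in for the convolutional networks used in practice, reads as follows; the dimensions, the depth N and the update architecture are illustrative assumptions.

import torch

n_f, n_g, N = 64, 48, 3                     # assumed model-parameter/data dimensions and unrolled depth
A = torch.randn(n_g, n_f) / n_f ** 0.5      # placeholder linear forward operator

def mlp(d_in, d_out):
    return torch.nn.Sequential(torch.nn.Linear(d_in, 128), torch.nn.ReLU(),
                               torch.nn.Linear(128, d_out))

class UnrolledPrimalDual(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # One learned data-space update and one model-space update per unrolled iteration.
        self.gamma_d = torch.nn.ModuleList(mlp(3 * n_g, n_g) for _ in range(N))
        self.gamma_m = torch.nn.ModuleList(mlp(2 * n_f, n_f) for _ in range(N))

    def forward(self, g):
        v, f = g, torch.zeros(n_f)          # v^0 = g and f^0 given
        for k in range(N):                  # unrolling the scheme in (5.14)
            v = self.gamma_d[k](torch.cat([v, A @ f, g]))
            f = self.gamma_m[k](torch.cat([f, A.T @ v]))
        return f

The resulting operator is then trained end-to-end against supervised pairs, for example with the empirical loss in (5.7).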
Finally, allowing for some memory in both model parameter and data
spaces leads to the learned primal–dual scheme of Adler and Öktem (2018b),
which is used for low-dose two-dimensional CT reconstruction. The robust-
ness of this approach against uncertainties in the image and uncertainties
in system settings is empirically studied by Boink, van Gils, Manohar and
Brune (2018). The conclusion is that learning improves pure knowledge-
Figure 5.2. Learned iterative method in both model parameter and data spaces.
Illustration of the operator obtained by unrolling the scheme in (5.14) for N = 3
in the context of CT image reconstruction (Section 7.3.1).
Mardani et al. (2017b) (Section 5.1.6), which uses the same architecture as
learned iterative methods but a different loss, that is, it computes a different
estimator.
it does not satisfy the frame condition and it overly emphasizes the low-
frequency component of the signal, which leads to blurring artefacts in the
post-processed CT images. To address this, Han and Ye (2018) suggest a
U-Net-based network architecture with directional wavelets that satisfy the
frame condition. Finally, Ye et al. (2018) develop a mathematical frame-
work to understand deep learning approaches for inverse problems based on
these deep convolutional framelets. Such architectures represent a signal
decomposition similar to using wavelets or framelets, but here the basis is
learned from the training data. This idea of using techniques from applied
harmonic analysis and sparse signal processing to analyse approximation
properties of certain classes of deep neural networks bears similarities to
Bölcskei, Grohs, Kutyniok and Petersen (2019) (see Section 8.2.1) and the
scattering networks discussed in Section 4.5 as well as work related to multi-
layer convolutional sparse coding outlined in Section 4.4.2.
Yet another CNN architecture (Mixed-Scale Dense CNN) is proposed
in Pelt, Batenburg and Sethian (2018) for denoising and removing streak
artefacts from limited angle CT reconstructions. Empirical evidence shows
that this architecture comes with some advantages over encoder–decoder
networks. It can be trained on relatively small training sets and the same
hyper-parameters in training can often be re-used across a wide variety of
problems. This removes the need to perform a time-consuming trial-and-
error search for hyper-parameter values.
Besides architectures, one may also consider the choice of loss function,
as in Zhang et al. (2018), who consider CNN-based denoising of CT images
using a loss function that is a linear combination of squared L2 and multi-
scale structural similarity index (SSIM). A closely related work is that of
Zhang and Yu (2018), which uses a CNN trained on image patches with a
loss function that is a sum of squared L2 losses over the patches. The aim
here is to reduce streak artefacts from highly scattering media, such as metal
implants. A number of papers use techniques from GANs to post-process
CT images. Shan et al. (2018) use a conveying path-based convolutional
encoder–decoder network. A novel feature of this approach is that an initial
three-dimensional denoising model can be directly obtained by extending
a trained two-dimensional CNN, which is then fine-tuned to incorporate
three-dimensional spatial information from adjacent slices (transfer learning
from two to three dimensions). The paper also contains a summary of deep
learning network architectures for CT post-processing listing the loss func-
tion (squared L2, adversarial or perceptual loss). A similar approach is taken
by Yang et al. (2018c), who denoise CT images via a GAN with Wasserstein
distance and perceptual similarity. The perceptual loss suppresses noise by
comparing the perceptual features of a denoised output against those of
the ground truth in an established feature space, while the generator fo-
cuses more on migrating the data noise distribution. Another approach is
that of You et al. (2018a), who use a generator from a GAN to generate
high-resolution CT images from low-resolution counterparts. The model is
trained on semi-supervised training data and the training is regularized by
enforcing a cycle-consistency expressed in terms of the Wasserstein distance.
See also You et al. (2018b) for similar work, along with an investigation of
the impact of different loss functions for training the GAN.
For PET image reconstruction, da Luis and Reader (2017) use a CNN
to denoise PET reconstructions obtained by ML-EM. A more involved ap-
proach is presented in Yang, Ying and Tang (2018a), which learns a post-
processing step for enhancing PET reconstructions obtained by MAP with
a Green smoothness prior (see Yang, Ying and Tang 2018a, equation (6)).
More precisely, this is supervised training on tuples consisting of ground
truth and a number of MAP solutions where one varies parameters defin-
ing the prior. The training seeks to learn a mapping that takes a set of
small image patches at the same location from the MAP solutions to the
corresponding patch in the ground truth, thereby resulting in a learned
patch-based image denoising scheme. A related approach is that of Kim
et al. (2018), who train a CNN to map low-dose PET images to a full-dose
one. Both low-dose and full-dose reconstructions are obtained using ordered
subsets ML-EM. Since the resulting trained CNN denoiser produces addi-
tional bias induced by the disparity of noise levels, one considers learning
a regularizer that includes the CNN denoiser (see Kim et al. 2018, equa-
tion (8)). The resulting variational problem is solved using the ADMM
method.
Concerning MRI, most learned post-processing applications seek to train a mapping that takes a zero-filling reconstruction (computed by setting to zero all Fourier coefficients that are not measured in the MRI data, and then applying a normal inverse Fourier transform) obtained from under-sampled MRI data to the MRI reconstruction that corresponds to fully sampled data. For example, Wang et al. (2016) use a CNN for this pur-
pose, and the deep learning output is either used as an initialization or as
a regularization term in classical compressed sensing approaches to MRI
image reconstruction. Another example is that of Hyun et al. (2018), who
use a CNN to process a zero-filling reconstruction followed by a particu-
lar k-space correction. This outperforms plain zero-filling reconstruction as
well as learned post-processing where Fourier inversion is combined with a
trained denoiser based on a plain U-Net architecture.
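For concreteness, the zero-filling reconstruction that these post-processing networks take as input can be sketched in a few lines of NumPy; the image, the random 30% sampling mask and the absence of measurement noise are placeholder assumptions.

import numpy as np

rng = np.random.default_rng(0)
f_true = rng.standard_normal((128, 128))         # stand-in for an MR image

mask = rng.random((128, 128)) < 0.3              # assumed random k-space sampling pattern
k_measured = np.fft.fft2(f_true) * mask          # under-sampled k-space data

f_zero_fill = np.real(np.fft.ifft2(k_measured))  # zero-filling reconstruction
# A learned post-processing network is then trained to map f_zero_fill towards
# the reconstruction obtained from fully sampled data.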
Similar to CT image processing, there has been some work on using GANs
to post-process MRI reconstructions. One example is that of Quan, Mem-
ber, Nguyen-Duc and Jeong (2018), who use a generator within a GAN
setting to learn a post-processing operator that maps a zero-filling recon-
struction image to a full reconstruction image. Training is regularized using
a loss that includes a cyclic loss consistency term that promotes accurate in-
terpolation of the given under-sampled k-space data. The generator consists
of multiple end-to-end networks chained together, where the first network
translates a zero-filling reconstruction image to a full reconstruction image,
and the following networks improve accuracy of the full reconstruction im-
age (refinement step). Another approach using a generator from a trained
GAN is given by Yang et al. (2018b), who use a U-Net architecture with
skip connections for the generator. The loss consists of an adversarial loss
term, a novel content loss term considering both squared L2 and a percep-
tual loss term defined by pre-trained deep convolutional networks. There is
also a squared L2 in both model parameter and data spaces, and the latter
involves applying the forward operator to the training data to evaluate the
squared L2 in data space. See Yang et al. (2018b, equation (13)) for the full
expression.
We conclude by mentioning some approaches that involve pre-processing.
For CT imaging, deep-learning-based pre-processing targets sinogram in-
painting, which is the task of mapping observed, sparsely sampled, CT pro-
jection data onto corresponding densely sampled CT projection data. Lee
et al. (2019) achieve this via a plain CNN whereas Ghani and Karl (2018)
use a generator from a trained conditional GAN. A CNN is also used by
Hong et al. (2018) to pre-process PET data. Here one uses a deep residual
CNN for PET super-resolution, that is, to map PET sinogram data from
a scanner with large pixellated crystals to one with small pixellated crys-
tals. The CNN-based method was designed and applied as an intermediate
step between the projection data acquisition and the image reconstruction.
Results are validated using both analytically simulated data, Monte Carlo
simulated data and experimental pre-clinical data. In a similar manner,
Allman, Reiter and Bell (2018) use a CNN to pre-process photoacoustic
data to identify and remove noise artefacts. Finally, we cite Huizhuo, Jin-
zhu and Zhanxing (2018), who use a CNN to jointly pre- and post-process
CT data and images. The pre-processing amounts to sinogram in-painting
and the post-processing is image denoising, and the middle reconstruction
step is performed using FBP. The loss function for this joint pre- and post-
processing scheme is given in Huizhuo et al. (2018, equation (3)).
The joint law µ above can be replaced by its empirical counterpart given
from supervised training data (fi , gi ), so the µ-expectation is replaced by an
averaging over training data. The resulting networks will then approximate
the conditional mean and the conditional pointwise variance, respectively.
As already shown, by using (5.19) it is possible to rewrite many estimat-
ors as minimizers of an expectation. Such estimators can then be approx-
imated using the direct estimation approach outlined here. This should
coincide with computing the same estimator by posterior sampling (Sec-
tion 5.2.1). Direct estimation is significantly faster, but not as flexible as
posterior sampling since each estimator requires a new neural network that is specifically trained for that estimator. Section 7.7 compares the outcomes
of the two approaches.
To describe how a Wasserstein GAN can be used for this purpose, let data g ∈ Y be fixed and assume that Π^g_post, the posterior of f at g = g, can be approximated by elements in a parametrized family {G_θ(g)}_{θ∈Θ} of probability measures on X. The best such approximation is defined as G_{θ*}(g), where θ* ∈ Θ solves

θ* ∈ arg min_{θ∈Θ} ℓ_{P_X}(G_θ(g), Π^g_post).   (5.20)
In the above, σ is the probability distribution for data and the random
variable g ∼ σ generates data.
Observe now that evaluating the objective in (5.21) requires access to
the very posterior that we seek to approximate. Furthermore, the dis-
tribution σ of data is often unknown, so an approach based on (5.21) is
essentially useless if the purpose is to sample from an unknown posterior.
Finally, evaluating the Wasserstein 1-distance directly from its definition is
not computationally feasible.
On the other hand, as we shall see, all of these drawbacks can be circum-
vented by rewriting (5.21) as an expectation over the joint law (f, g) ∼ µ.
This makes use of the Kantorovich–Rubinstein duality for the Wasserstein
1-distance (see (B.2)), and one obtains the following approximate version of
(5.21):
θ* ∈ arg min_{θ∈Θ} sup_{φ∈Φ} E_{(f,g)∼µ}[ D_φ(f, g) − E_{z∼η}[D_φ(G_θ(z, g), g)] ].   (5.22)
At first sight, it might be unclear why (5.22) is better suited than (5.21)
to sampling from the posterior, especially since the joint law µ in (5.22) is
unknown. The advantage becomes clear when one has access to supervised
training data for the inverse problem, i.e. i.i.d. samples (f1 , g1 ), . . . , (fm , gm )
generated by the random variable (f, g) ∼ µ. The µ-expectation in (5.22)
can then be replaced by an averaging over training data.
To summarize, solving (5.22) given supervised training data in X × Y amounts to learning a generator G_{θ*}(z, ·) : Y → X such that G_{θ*}(z, g) with z ∼ η is approximately distributed as the posterior Π^g_post. In particular, for given g ∈ Y we can sample from Π^g_post by generating values of z ↦ G_{θ*}(z, g) ∈ X in which z ∈ Z is generated by sampling from z ∼ η.
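Given such a trained generator, posterior statistics follow by plain Monte Carlo sampling. The sketch below assumes a generator G taking (z, g) and a standard Gaussian η; it returns estimates of the conditional mean and pointwise standard deviation of the kind shown in Section 7.7.

import torch

def posterior_statistics(G, g, n_samples=1000, z_dim=100):
    """Monte Carlo estimates of the conditional mean and pointwise standard
    deviation from samples G(z, g) with z ~ eta (here a standard Gaussian)."""
    with torch.no_grad():
        samples = torch.stack([G(torch.randn(z_dim), g) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.std(dim=0)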
An important part of the implementation is the concrete parametrizations
of the generator and discriminator:
Gθ : Z × Y → X and Dφ : X × Y → R.
We use deep neural networks for this purpose, and following Gulrajani et al.
(2017), we softly enforce the 1-Lipschitz condition on the discriminator by
including a gradient penalty term in the training objective function in (5.22).
Furthermore, if (5.22) is implemented as is, then in practice z is not used by
the generator (so called mode-collapse). To solve this problem, we introduce
a novel conditional Wasserstein GAN discriminator that can be used with
conditional WGAN without impairing its analytical properties: see Adler
and Öktem (2018a) for more details.
We conclude by referring to Section 7.7 for an example of how the con-
ditional Wasserstein GAN can be used in clinical image-guided decision
making.
6. Special topics
In this section we address several topics of machine learning that do not
strictly fall within the previously covered contexts of functional analytic or
statistical regularization. In Section 6.1 we discuss regularization methods
that go beyond pure reconstruction. These methods include, at least partially, the decision process that typically follows the solution of an inverse problem, for example the examination of a CT reconstruction by a medical expert. Then Section 6.2.1 aims at investigating the connections
between neural networks and differential equations, and Section 6.2 dis-
cusses the case where the forward operator is incorrectly known. Finally,
Section 6.2.2 discusses total least-squares approaches, which are classical
tools for updating the operator as well as the reconstruction based on meas-
ured data. We are well aware that this is still very much an incomplete
list of topics not covered in the previous sections. As already mentioned
The training data are samples (f_i, g_i) generated by (f, g) for computing θ̂, and (g_i, d_i) generated by (g, d) for computing φ̂.
Golub et al. (1999) conclude that if f_γ with ‖L f_γ‖ < γ solves the R-TLS problem, then it also solves the TLS problem without regularization. Moreover, this approach has been extended to include sparsity-constrained optimization (Zhu, Leus and Giannakis 2011), whose equivalent formulation then reads

arg min_{δA, f}  ½‖(A + δA)f − g‖² + α‖f‖_1 + (β/2)‖δA‖²_F.
The previous sparsity-promoting approach, as well as the original R-TLS
approach, can be easily extended if sets of training data (fi , gi ) are available.
One either aims for a two-stage approach to first update the operator and
then solve the inverse problem with some new data point g, or one can
integrate both steps at once leading to
arg min_{δA, f}  Σ_i ½‖(A + δA)f_i − g_i‖² + ½‖(A + δA)f − g‖² + (α/2)‖L f‖² + (β/2)‖δA‖²_F.
(Figure: reconstructions comparing Tikhonov and sparsity regularization, each combined with least-squares (LS) and total least-squares (TLS) data fitting; panels (a)–(h).)
set for the model parameter, and this explicit relation is referred to as the
(microlocal) canonical relation. Using the canonical relation one can recover
the wavefront set from data without solving the inverse problem, a process
that can be highly non-trivial.
Second, the canonical relation also describes which part of the wavefront
set one can recover from data. This was done by Quinto (1993) for the case
when the two- or three-dimensional ray transform is restricted to parallel
lines, and by Quinto and Öktem (2008) for an analysis in the region-of-
interest limited angle setting. Faber, Katsevich and Ramm (1995) derived
a related principle for the three-dimensional ray transform restricted to lines
given by helical acquisition, which is common in medical imaging. Similar
principles hold for transforms integrating along other types of curves, e.g.
ellipses with foci on the x-axis and geodesics (Uhlmann and Vasy 2012).
Finally, recovering the wavefront set of the model parameter from data is
a less ill-posed procedure than attempting to recover the model parameter
itself. This was demonstrated in Davison (1983), where the severely ill-posed
reconstruction problem in limited angle CT becomes mildly ill-posed if one
settles for recovering the wavefront. See also Quinto and Öktem (2008) for
an application of this principle to cryo-electron tomography.
Data-driven extraction of the wavefront set. The above motivates the in-
verse problems community to work with the wavefront set. One difficulty
that has limited use of the wavefront set is that it is virtually impossible
to extract it numerically from a digitized signal. This is due to its defini-
tion, which depends on the asymptotic behaviour of the Fourier transform
after a localization procedure. An alternative possibility is to identify the
wavefront set after transforming the signal using a suitable representation,
e.g. a curvelet or shearlet transform (Candès and Donoho 2005, Kutyniok
and Labate 2009). This requires analysing the rate of decay of transformed
signal, which again is unfeasible in large-scale imaging applications.
A recent paper (Andrade-Loarca, Kutyniok, Öktem and Petersen 2019)
uses a data-driven approach to training a wavefront set extractor applicable
to noisy digitized signals. The idea is to construct a deep neural network
classifier that predicts the wavefront set from the shearlet coefficients of
a signal. The approach is successfully demonstrated on two-dimensional
imaging examples where it outperforms all conventional edge-orientation
estimators as well as alternative data-driven methods including the current
state of the art. This learned wavefront set extractor can now be combined
with a learned iterative method using the framework in Section 6.1.
Using the canonical relation to guide data-driven recovery. In a recent paper
Bubba et al. (2018) consider using the aforementioned microlocal canonical
relation to steer a data-driven component in limited angle CT reconstruc-
tion, which is a severely ill-posed inverse problem.
7. Applications
In this section we revisit some of the machine learning methods for inverse
problems discussed in the previous sections, and demonstrate their applic-
ability to prototypical examples of inverse problems.
which only uses information about the operator A_ε. Here, given data g we estimate f_true by R^Tik_σ(g), where

R^Tik_σ(g) := (A_ε^* ∘ A_ε + σ² id)^{-1} ∘ A_ε^*(g) = (A_ε^T · A_ε + σ² I)^{-1} · A_ε^T · g.   (7.1)
The second inversion is based on a trained neural network: given data g we estimate f_true by R^NN_{W*}(g), where R^NN_{W*} : Y → X is a trained neural network with W* given by

W* ∈ arg min_W (1/m) Σ_{i=1}^m ‖R^NN_W(g^(i)) − f^(i)‖².   (7.2)

The trained network is then evaluated on n test pairs (f^(i), g^(i)) ∈ X × Y with g^(i) := A f^(i) + e^(i), generated as in the training set above but clearly distinct from the training examples.
The design of the network is crucial. We use a minimal network which
allows us to reproduce a matrix vector multiplication. Hence the network is
– in principle – capable of recovering the Tikhonov regularization operator
or even an improvement of it. We use a network with a single hidden layer
with four nodes. We restrict the eight weights connecting the two input
variables with the first layer by setting
w1 = −w3 = w11 , w2 = −w4 = w12 ,
w5 = −w7 = w21 , w6 = −w8 = w22 ,
as depicted in Figure 7.1. We obtain a neural network depending on four
variables w11 , w12 , w21 , w22 and the network acts as a multiplication of the
matrix

W = ( w11  w12
      w21  w22 )

with the input vector z = (z1, z2). We denote the output of such a neural network by R^NN_W(z) = W z.
For later use we define (2 × m) matrices X, Y and E that store the vectors
f (i) , g (i) and e(i) column-wise, so the training data can be summarized as
Y = Aε · X + E . (7.3)
Figure 7.1. The network design with eight parameters, a setting that yields a
matrix–vector multiplication of the input.
The training of such a network for modelling the forward problem is equivalent (using the Frobenius norm for matrices) to minimizing the expected mean square error

min_W (1/n) Σ_{i=1}^n ‖W f^(i) − g^(i)‖² = min_W (1/n) ‖W X − Y‖²,   (7.4)
and the training model (7.2) for the inverse problem simplifies to

min_W (1/n) Σ_{i=1}^n ‖W g^(i) − f^(i)‖² = min_W (1/n) ‖W Y − X‖².   (7.5)
In the next paragraph we report some numerical examples before analysing
these networks.
Testing error convergence for various values of ε. We train these networks
using a set of training data (f (i) , g (i) )i=1,...,m with m = 10 000, i.e. g (i) =
Aε f (i) + e(i) . The network design with restricted coefficients as described
above has four degrees of freedom w = (w11 , w12 , w21 , w22 ). The cor-
responding loss function is minimized by a gradient descent algorithm,
that is, the gradient of the loss function with respect to w is computed
by backpropagation (Rumelhart, Hinton and Williams 1986, Martens and
Sutskever 2012, Byrd, Chin, Nocedal and Wu 2012). We used 3000 itera-
tions (epochs) of this gradient descent to minimize the loss function of a network for the forward operator using (7.4) or, respectively, for training a
network for solving the inverse problem using (7.2). The MSE errors on the
training data were close to zero in both cases.
After training we tested the resulting networks by drawing n = 10 000
new data vectors f (i) as well as error vectors e(i) . The g (i) were computed
as above. Table 7.1 lists the resulting values using this set of test data
Table 7.1. The errors of the inverse net with an ill-conditioned matrix A_ε (i.e. ε ≪ 1) are large and the computed reconstructions with the test data are meaningless.
for the network trained for the forward problem and the inverse problem,
respectively:
NMSE_fwd := (1/n) Σ_{i=1}^n ‖W f^(i) − g^(i)‖²,

NMSE_inv := (1/n) Σ_{i=1}^n ‖W g^(i) − f^(i)‖².
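The experiment can be reproduced schematically in NumPy. Here the ill-conditioned matrix A_ε, the noise level and the Gaussian training distribution are assumptions, and the inverse net is obtained by solving the least-squares problem (7.5) in closed form rather than by the gradient descent training described above.

import numpy as np

rng = np.random.default_rng(0)
eps, sigma, m = 1e-3, 1e-2, 10_000
A_eps = np.array([[1.0, 0.0], [0.0, eps]])           # assumed ill-conditioned forward matrix

X = rng.standard_normal((2, m))                      # training model parameters, column-wise
Y = A_eps @ X + sigma * rng.standard_normal((2, m))  # training data, cf. (7.3)

W_inv = X @ Y.T @ np.linalg.inv(Y @ Y.T)             # least-squares solution of (7.5)
R_tik = np.linalg.inv(A_eps.T @ A_eps + sigma ** 2 * np.eye(2)) @ A_eps.T   # Tikhonov as in (7.1)

X_test = rng.standard_normal((2, m))                 # fresh test pairs
Y_test = A_eps @ X_test + sigma * rng.standard_normal((2, m))
nmse_inv = np.mean(np.sum((W_inv @ Y_test - X_test) ** 2, axis=0))
nmse_tik = np.mean(np.sum((R_tik @ Y_test - X_test) ** 2, axis=0))
print(nmse_inv, nmse_tik)                            # compare the trained linear net with Tikhonov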
(a) too low β, high oscillation; (b) optimal β; (c) too high β, almost TV; (d) too low α, low β: good match to noisy data; (e) too low α, optimal β: optimal TV2-like behaviour; (f) too high α, high β: bad TV2-like behaviour.
Figure 7.2. (a–c) Effect of choosing β on total generalized variation (TGV2) denoising with optimal α. (d–f) Effect of choosing α too large in TGV2 denoising.
Figure 7.3. Contour plot of the objective functional in TGV2 denoising in the
(α, β)-plane.
Figure 7.4. Optimal denoising results for TGV2, ICTV and TV, all with squared L2 as data discrepancy.
Table 7.2. Quantified results for the parrot image (s := image width/height in pixels = 256), using squared L2 discrepancy.
Denoise Initial (α, β) Result (α̂, β̂) Objective SSIM PSNR Its Figure
TGV2 (α̂TV /s, α̂TV ) (0.058/s2 , 0.041/s) 6.412 0.890 31.992 11 7.4(b)
ICTV (α̂/s, α̂TV ) (0.051/s2 , 0.041/s) 6.439 0.887 31.954 7 7.4(c)
TV 0.1/s 0.042/s 6.623 0.879 31.710 12 7.4(d)
Table 7.4. Cross-validated computations on the BSDS300 data set (Martin et al.
2001) split into two halves of 100 images each. TGV2 regularization with L2 -
discrepancy. ‘Learning’ and ‘validation’ indicate the halves used for learning α and
for computing the average PSNR and SSIM, respectively. Noise variance σ = 10.
Validation    Learning    α̂    Average PSNR    Average SSIM
Figures 7.5 and 7.6 present denoising results with optimally learned para-
meters for mixed Gaussian and impulse noise and for mixed Gaussian and
Poisson noise, respectively. See Calatroni et al. (2017) for more details.
The original image has been corrupted with Gaussian noise of zero mean
and variance 0.005, and then 5% of the pixels have been corrupted with impulse noise. The parameters have been chosen to be γ = 10⁴, µ = 10⁻¹⁵ and the mesh step size h = 1/312. The computed optimal weights
are λ̂1 = 734.25 and λ̂2 = 3401.2. Together with an optimal denoised image,
the results show the decomposition of the noise into its sparse and Gaussian
components: see Calatroni et al. (2017) for more details.
Remark 7.1. When optimizing only a handful of scalar parameters, as in
the examples discussed above, bilevel optimization is by no means the most
efficient approach for parameter learning. In fact, brute force line-search
methods are in this context still computationally feasible as the dimension-
ality of the parameter space being explored is small. However, even in
Figure 7.5. Optimized impulse-Gaussian denoising: (a) original image, (b) noisy
image with Gaussian noise of variance 0.005 and (c) with 5% of pixels corrup-
ted with impulse noise, (d) impulse noise residuum, (e) Gaussian noise residuum.
Optimal parameters λ̂1 = 734.25 and λ̂2 = 3401.2.
Figure 7.6. Optimized Poisson–Gauss denoising: (a) original image, (b) noisy image
corrupted by Poisson noise and Gaussian noise with mean zero and variance 0.001,
(c) denoised image. Optimal parameters λ̂1 = 1847.75 and λ̂2 = 73.45.
Figure 7.7. Example from supervised training data used to train the learned iter-
ative and learned post-processing methods used in Figure 7.8.
(a) data: pre-log sinogram (b) ground truth (c) filtered backprojection
Figure 7.9. Reconstructions of a 512 × 512 pixel human phantom along with two
zoom-in regions indicated by small circles. The left zoom-in has a true feature
whereas texture in the right zoom-in is uniform. The window is set to [−200, 200]
Hounsfield units. Among the methods tested, only the learned iterative method
(learned primal–dual algorithm) correctly recovers these regions. In the others, the
true feature in the left zoom-in is indistinguishable from other false features of the
same size/contrast, and the right-zoom in has a streak artefact. The improvement
that comes with using a learned iterative method thus translates into true clinical
usefulness.
Table 7.5. Summary of results shown in Figures 7.8 and 7.9 where an SSIM score of 1 corresponds to a perfect match. Note that the
learned iterative method (learned primal–dual algorithm) significantly outperforms TV regularization even when reconstructing
the Shepp–Logan phantom. With respect to run-time, the learned iterative method involves calls to the forward operator, and
Valluru, Wilson and Willmann 2016, Zhou, Yao and Wang 2016, Xia and
Wang 2014). In the setting considered here, data are collected as a time
series on a two-dimensional sensor, Y = Γ × [0, T] with Γ ⊂ R², on the surface of a domain X = Ω ⊂ R³. Several methods exist for reconstruction, including
filtered backprojection-type inversions of the spherical Radon transform and
numerical techniques such as time-reversal. As in problems such as CT
(Section 2.2.4) and MRI (Section 2.2.5), data subsampling may be employed
to accelerate image acquisition, which leads consequently to the need for
regularization to prevent noise propagation and artefact generation. The
long reconstruction times ensuing from conventional iterative reconstruction
algorithms have motivated consideration of machine learning methods.
The deep gradient descent (DGD) method (Hauptmann et al. 2018) for
PAT is an example of a learned iterative method (see Section 5.1.4). The
main aspects can be summarized as follows.
• Each iteration adds an update by combining measurement information
delivered via the gradient ∇L(g, A f_k) = A^*(A f_k − g) with an image processing step

f_{k+1} = G_{θ_k}(∇L(g, A f_k), f_k),   (7.7)
where the layer operators Gθk correspond to convolutional neural net-
works (CNNs) with different, learned parameters θk but with the same
architecture. The initialization for the iterations was the backprojec-
tion of the data f0 = A∗ g.
• The training data were taken from the publicly available data from
the ELCAP Public Lung Image Database.9 The data set consists of
50 whole-lung CT scans, from which about 1200 volumes of vessel
structures were segmented, and scaled up to the final target size of
80 × 240 × 240. Out of these volumes 1024 were chosen as the ground
truth ftrue for the training and simulated limited-view, subsampled
data, using the same measurement set-up as in the in vivo data. Pre-
computing the gradient information for each CNN took about 10 hours.
• Initial results from training on synthetic data showed a failure to ef-
fectively threshold the noise-like artefacts in the low absorption re-
gions (see Figure 7.10). This effect was ameliorated by simulating the
effect of the low absorbing background as a Gaussian random field
with short spatial correlation length. The synthetic CT volumes with
the added background were then used for the data generation, i.e. g^i_back = A f^i_back + ε, whereas the clean volumes f_true were used as reference for the training.
9
http://www.via.cornell.edu/databases/lungdb.html
Figure 7.10. Reconstruction from real measurement data of a human palm, without
adjustments of the training data. The images shown are top-down maximum in-
tensity projections. (a) Result of the DGD trained on images without added back-
ground. (b) TV reconstruction obtained from fully sampled data.
• The results were further improved using transfer training with a set
of 20 (fully sampled) measurements of a human finger, wrist and palm
from the same experimental system. To update the DGD an additional
five epochs of training on the pairs {greal , fTV } were performed with a
reduced learning rate taking only 90 minutes. The effect of the updated
DGD is shown in Figure 7.11.
Table 7.6. CT reconstruction on the LIDC dataset using various methods. Note
that the learned post-processing and RED methods require training on supervised
data, while the adversarial regularizer only requires training on unsupervised data.
where s_ℓ denotes the kernel of the linear operator. Combining the meas-
urements of all receive coils yields – after discretization – a linear system of
equations Sc = g. Typically, the rows of S are normalized, resulting in the
final form of the linearized inverse problem denoted by
A c = g. (7.8)
This is a coarse simplification of the physical set-up, which neglects non-
linear magnetization effects of the nanoparticles as well as the non-homo-
geneity of the spatial sensitivity of the receive coils and also the small but
non-negligible particle–particle interactions. Hence this is a perfect set-up
for exploiting the potential of neural networks for matching complex and
high-dimensional non-linear models.
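For reference, the Kaczmarz-type baseline used for Figure 7.13 can be sketched as a damped row-action iteration on (7.8) with a non-negativity projection; the damping parameter, the number of sweeps and the projection are illustrative choices, not the exact solver settings used in the experiments.

import numpy as np

def regularized_kaczmarz(A, g, lam=1e-3, sweeps=10):
    """Damped Kaczmarz sweeps for A c = g, projecting onto non-negative
    concentrations after each sweep."""
    m, n = A.shape
    c = np.zeros(n)
    for _ in range(sweeps):
        for i in range(m):
            a = A[i]
            c += (g[i] - a @ c) / (a @ a + lam) * a
        c = np.maximum(c, 0.0)
    return c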
We test the capability of the deep imaging prior approach to improving
image reconstruction obtained by standard Tikhonov regularization. For
the experiments we use datasets generated by the Bruker preclinical MPI
system at the University Medical Center, Hamburg–Eppendorf.
We use the deep image prior network introduced by Ulyanov et al. (2018),
specifically their U-Net architecture. Our implementation is based on Tensor-
Flow (Abadi et al. 2015) and Keras (Chollet et al. 2015), and has the follow-
ing specifications. Between the encoder and decoder part of the U-Net our
skip connection has four channels. The convolutional encoder goes from the
input to 32, 32, 64 and 128 channels, each with strides of 2 × 2 and filters
of size 3 × 3. Then the convolutional decoder has the mirrored architecture
plus first a resize-nearest-neighbour layer to reach the desired output shape
and second an additional ReLU convolutional layer with filters of size 1.
The number of channels of this last layer is three for data set 1 (DS1) to
accommodate three slices (three two-dimensional scans, one above another)
of a two-dimensional phantom centred at the central slice of the three. The
Figure 7.13. MPI reconstructions of two phantoms using different methods: (a)–
(d) phantom with 4 mm distance between tubes containing ferromagnetic nano-
particles; (e)–(h) phantom with 2 mm distance. The methods used are Kaczmarz
with L2 -discrepancy (λ̃ = 5 × 10−4 ), `1 -regularization (λ̃ = 5 × 10−3 ) and DIP
(η = 5 × 10−5 ) for both cases. Photos of phantoms taken by T. Kluth at the
University Medical Center, Hamburg–Eppendorf.
(Figure: plots of the losses ℓ_X and ℓ_D against the parameter C ∈ {0.01, 0.1, 0.5, 0.9, 0.99, 0.999}; panels (a) and (b).)
Figure 7.15. Test data: (a) subset of CT data from an ultra-low-dose three-dimen-
sional helical scan and (b) the corresponding FBP reconstruction. Images are
shown using a display window set to [−150, 200] Hounsfield units.
Figure 7.16. Conditional mean and pointwise standard deviation (pStd) computed
from test data (Figure 7.15) using posterior sampling (Section 5.2.1) and direct
estimation (Section 5.1.6).
Figure 7.17. (b) Suspected tumour (red) and reference region (blue) shown in the
sample posterior mean image. (c) Average contrast differences between the tumour
and reference region. The histogram is computed by posterior sampling applied to
test data (Figure 7.15); the yellow curve is from direct estimation (Section 5.1.6),
and the true value is the red threshold. (a) The normal dose image that confirms
the presence of the feature.
The functional analytic and Bayesian viewpoints. The way deep learning
is used for solving an inverse problem depends on whether one adopts the
functional analytic or the Bayesian viewpoint.
Within the functional analytic viewpoint, a deep neural network is simply
a parametrized family of operators, and learning amounts to calibrating
the parameters against example data by minimizing some appropriate loss
function.
In Bayesian inversion, a deep neural network corresponds to a statistical
decision rule, so methods from deep learning constitute a computational
framework for statistical decision making in high dimensions. For example, many estimators that were previously computationally unfeasible are now computable: the conditional mean seems to be well approximated by learned iterative schemes (Section 5.1.4). Likewise, a trained generative network can be used to sample from the posterior in a computationally feasible manner, as shown in Section 5.2.
Handling lack of training data. Inverse problems in the sciences and en-
gineering often have little training data compared to the dimensionality
of the model parameter. Furthermore, it is impractical to have a method
that requires retraining as soon as the measurement protocol changes. This
becomes an issue in medical imaging, where data in multi-centre studies are typically acquired using different CT or MRI scanners.
For these reasons, black-box machine learning algorithms (Section 7.1)
are not suitable for solving such inverse problems. On the other hand, in
these inverse problems there is often a knowledge-driven model for how
data are generated and it is important to integrate this information into
the data-driven method. Learned iterative schemes (Section 5.1.4) employ
a deep neural network that embeds this model for data into its architecture.
of the posterior distribution as the noise level tends to zero. Here, the reg-
ularization functional (in functional analytic regularization) and the prior
distribution (in Bayesian inversion) primarily act as regularizers.
The above viewpoint does not acknowledge the potential that lies in en-
coding knowledge about the true model parameter into the regularization
functional or prior. Furthermore, in applications data are fixed with some
given noise level, and there is little, if any, guidance from the above theory
on which regularizer or prior to select in such a setting. Empirical evidence
suggests that instead of hand-crafting a regularization functional or a prior,
one can learn it from example data. This allows us to pick up information
related to the inverse problem that is difficult, if not impossible, to account
for otherwise.
8.2. Outlook
We identify several interesting directions for future research in the context
of inverse problems and machine learning.
Acknowledgements
This article builds on lengthy discussions and long-standing collaborations
with a large number of people. These include Jonas Adler, Sebastian Ban-
ert, Martin Benning, Marta Betcke, Luca Calatroni, Juan Carlos De Los
Reyes, Andreas Hauptmann, Lior Horesh, Bangti Jin, Iasonas Kokkinos,
Felix Lucka, Sebastian Lunz, Thomas Pock, Tuomo Valkonen and Olivier
Verdier. The authors are moreover grateful to the following people for
proofreading the manuscript and providing valuable feedback on its content
Acronyms
ADMM alternating direction method of multipliers
AutoMap automated transform by manifold approximation
BFGS Broyden–Fletcher–Goldfarb–Shanno
CG conjugate gradient
CNN convolutional neural network
CSC convolutional sparse coding
CT computed tomography
DGD deep gradient descent
DIP deep inverse prior
FBP filtered backprojection
FFP field-free point
FoE Field of Experts
GAN generative adversarial network
ICA independent component analysis
ICTV infimal-convolution total variation
ISTA Iterative Soft-Thresholding Algorithm
KL Kullback–Leibler
LISTA Learned Iterative Soft-Thresholding Algorithm
MAP maximum a posteriori
MCMC Markov chain Monte Carlo
ML-CSC multi-layer convolutional sparse coding
ML-EM maximum likelihood expectation-maximization
MPI magnetic particle imaging
MRF Markov random field
MRI magnetic resonance imaging
NETT neural network Tikhonov
P3 Plug-and-Play Prior
PAT photoacoustic tomography
PCA principal component analysis
PDE partial differential equation
Appendices
A. Optimization of convex non-smooth functionals
Suppose in general that we want to optimize a problem defined as the sum
of two parts,
min_{f∈X} [J(f) := Φ(f) + S(f)],   (A.1)
rearranging yields

f − τ A^*(A f − g) ∈ f + τλ ∂‖f‖_1.

Using (A.9) to invert the term on the right-hand side yields

S_{τλ}(f − τ A^*(A f − g)) = f.

Hence this is a fixed-point condition, which is a necessary condition for all minimizers of f ↦ J_λ(f). Turning the fixed-point condition into an iteration scheme yields

f^{k+1} = S_{τλ}(f^k − τ A^*(A f^k − g)) = S_{τλ}((id − τ A^* A) f^k + τ A^* g).   (A.10)
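A direct NumPy implementation of the iteration (A.10), with the step size chosen from the operator norm, reads as follows.

import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, g, lam, n_iter=200):
    """Iterative soft-thresholding (A.10):
    f_{k+1} = S_{tau lam}((id - tau A^T A) f_k + tau A^T g)."""
    tau = 1.0 / np.linalg.norm(A, 2) ** 2    # step size tau <= 1 / ||A||^2
    f = np.zeros(A.shape[1])
    for _ in range(n_iter):
        f = soft_threshold(f - tau * A.T @ (A @ f - g), tau * lam)
    return f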
In the above, Π(p, q) ⊂ PX×X denotes the family of joint probability meas-
ures on X × X that has p and q as marginals. Note also that we assume
PX only contains measures where the Wasserstein distance takes finite val-
ues (Wasserstein space): see Villani (2009, Definition 6.4) for the formal
definition.
The Wasserstein 1-distance in (B.1) can be rewritten using the Kantoro-
vich–Rubinstein dual characterization (Villani 2009, Remark 6.5, p. 95),
resulting in
W(p, q) = sup_{D : X→R, D ∈ Lip(X)} { E_{f∼q}[D(f)] − E_{h∼p}[D(h)] }   for p, q ∈ P_X.   (B.2)
REFERENCES
M. Abadi et al. (2015), TensorFlow: Large-scale machine learning on heterogeneous
systems. Software available from https://www.tensorflow.org.
B. Adcock and A. C. Hansen (2016), ‘Generalized sampling and infinite-dimen-
sional compressed sensing’, Found. Comput. Math. 16, 1263–1323.
J. Adler and S. Lunz (2018), Banach Wasserstein GAN. In Advances in Neural In-
formation Processing Systems 31 (NIPS 2018) (S. Bengio et al., eds), Curran
Associates, pp. 6754–6763.
J. Adler and O. Öktem (2017), ‘Solving ill-posed inverse problems using iterative
deep neural networks’, Inverse Problems 33, 124007.
J. Adler and O. Öktem (2018a), Deep Bayesian inversion: Computational uncer-
tainty quantification for large scale inverse problems. arXiv:1811.05910
J. Adler and O. Öktem (2018b), ‘Learned primal–dual reconstruction’, IEEE Trans.
Medical Imaging 37, 1322–1332.
J. Adler, S. Lunz, O. Verdier, C.-B. Schönlieb and O. Öktem (2018), Task adapted
reconstruction for inverse problems. arXiv:1809.00948
L. Affara, B. Ghanem and P. Wonka (2018), Supervised convolutional sparse cod-
ing. arXiv:1804.02678
Y. Chen, T. Pock and H. Bischof (2012), Learning ℓ1-based analysis and synthesis
sparsity priors using bi-level optimization. In Workshop on Analysis Operator
Learning vs. Dictionary Learning (NIPS 2012).
Y. Chen, T. Pock, R. Ranftl and H. Bischof (2013), Revisiting loss-specific training
of filter-based MRFs for image restoration. In German Conference on Pattern
Recognition (GCPR 2013), Vol. 8142 of Lecture Notes in Computer Science,
Springer, pp. 271–281.
Y. Chen, R. Ranftl and T. Pock (2014), ‘Insights into analysis operator learning:
From patch-based sparse models to higher order MRFs’, IEEE Trans. Image
Process. 23, 1060–1072.
Y. Chen, W. Yu and T. Pock (2015), On learning optimized reaction diffusion
processes for effective image restoration. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR 2015), pp. 5261–5269.
F. Chollet et al. (2015), Keras: The Python Deep Learning library. https://keras.io
A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous and Y. LeCun (2015),
The loss surfaces of multilayer networks. In 18th International Conference on
Artificial Intelligence and Statistics (AISTATS 2015), pp. 192–204.
I. Y. Chun, X. Zheng, Y. Long and J. A. Fessler (2017), Sparse-view X-ray CT
reconstruction using ℓ1 regularization with learned sparsifying transform. In
14th International Meeting on Fully Three-Dimensional Image Reconstruction
in Radiology and Nuclear Medicine (Fully3D 2017).
J. Chung and M. I. Espanol (2017), ‘Learning regularization parameters for general-
form Tikhonov’, Inverse Problems 33, 074004.
C. Clason, T. Helin, R. Kretschmann and P. Piiroinen (2018), Generalized modes
in Bayesian inverse problems. arXiv:1806.00519
A. Cohen, W. Dahmen and R. DeVore (2009), ‘Compressed sensing and best k-term
approximation’, J. Amer. Math. Soc. 22, 211–231.
J. Cohen, E. Rosenfeld and J. Z. Kolter (2019), Certified adversarial robustness
via randomized smoothing. arXiv:1902.02918v1
P. L. Combettes and J.-C. Pesquet (2011), Proximal splitting methods in signal
processing. In Fixed-Point Algorithms for Inverse Problems in Science and
Engineering (H. H. Bauschke et al., eds), Vol. 49 of Springer Optimization
and its Applications, Springer, pp. 185–212.
P. L. Combettes and J.-C. Pesquet (2012), ‘Primal–dual splitting algorithm for
solving inclusions with mixtures of composite, Lipschitzian, and parallel-sum
type monotone operators’, Set-Valued Var. Anal. 20, 307–330.
P. L. Combettes and V. R. Wajs (2005), ‘Signal recovery by proximal forward–
backward splitting’, Multiscale Model. Simul. 4, 1168–1200.
R. Costantini and S. Susstrunk (2004), Virtual sensor design. In Electronic Imaging
2004, International Society for Optics and Photonics, pp. 408–419.
A. Courville, I. Goodfellow and Y. Bengio (2017), Deep Learning, MIT Press.
G. R. Cross and A. K. Jain (1983), ‘Markov random field texture models’, IEEE
Trans. Pattern Anal. Mach. Intel. 5, 25–39.
G. Cybenko (1989), ‘Approximation by superpositions of a sigmoidal function’,
Math. Control Signals Syst. 2, 303–314.
C. O. da Luis and A. J. Reader (2017), Deep learning for suppression of resolution-
recovery artefacts in MLEM PET image reconstruction. In 2017 IEEE
D. Kim and J. A. Fessler (2016), ‘Optimized first-order methods for smooth convex
minimization’, Math. Program. 159, 81–107.
K. Kim, G. E. Fakhri and Q. Li (2017), ‘Low-dose CT reconstruction using spatially
encoded nonlocal penalty’, Med. Phys. 44, 376–390.
K. Kim, D. Wu, K. Gong, J. Dutta, J. H. Kim, Y. D. Son, H. K. Kim, G. E. Fakhri
and Q. Li (2018), ‘Penalized PET reconstruction using deep learning prior
and local linear fitting’, IEEE Trans. Medical Imaging 37, 1478–1487.
S.-J. Kim, K. Koh, M. Lustig, S. Boyd and D. Gorinevsky (2007), ‘An interior-
point method for large-scale ℓ1-regularized least squares', IEEE J. Selected
Topics Signal Process. 1, 606–617.
S. Kindermann (2011), ‘Convergence analysis of minimization-based noise level-free
parameter choice rules for linear ill-posed problems’, Electron. Trans. Numer.
Anal. 38, 233–257.
A. Kirsch (2011), An Introduction to the Mathematical Theory of Inverse Problems,
second edition, Vol. 120 of Applied Mathematical Sciences, Springer.
T. Klatzer and T. Pock (2015), Continuous hyper-parameter learning for support
vector machines. In 20th Computer Vision Winter Workshop (CVWW).
B. J. K. Kleijn and Y. Y. Zhao (2018), Criteria for posterior consistency.
arXiv:1308.1263v5
T. Kluth (2018), ‘Mathematical models for magnetic particle imaging’, Inverse
Problems 34, 083001.
T. Kluth and P. Maass (2017), ‘Model uncertainty in magnetic particle imaging:
Nonlinear problem formulation and model-based sparse reconstruction’, In-
ternat. J. Magnetic Particle Imaging 3, 1707004.
B. T. Knapik, B. T. Szabó, A. W. van der Vaart and J. H. van Zanten (2016),
‘Bayes procedures for adaptive inference in inverse problems for the white
noise model’, Probab. Theory Related Fields 164, 771–813.
B. T. Knapik, A. W. van der Vaart and J. H. van Zanten (2011), ‘Bayesian inverse
problems with Gaussian priors’, Ann. Statist. 39, 2626–2657.
B. T. Knapik, A. W. van der Vaart and J. H. van Zanten (2013), ‘Bayesian recov-
ery of the initial condition for the heat equation’, Commun. Statist. Theory
Methods 42, 1294–1313.
T. Knopp, N. Gdaniec and M. Möddel (2017), ‘Magnetic particle imaging: From
proof of principle to preclinical applications’, Phys. Med. Biol. 62, R124.
T. Knopp, T. Viereck, G. Bringout, M. Ahlborg, J. Rahmer and M. Hofmann
(2016), MDF: Magnetic particle imaging data format. arXiv:1602.06072
S. Ko, D. Yu and J.-H. Won (2017), On a class of first-order primal–dual algorithms
for composite convex minimization problems. arXiv:1702.06234
E. Kobler, T. Klatzer, K. Hammernik and T. Pock (2017), Variational networks:
connecting variational methods and deep learning. In German Conference on
Pattern Recognition (GCPR 2017), Vol. 10496 of Lecture Notes in Computer
Science, Springer, pp. 281–293.
F. Kokkinos and S. Lefkimmiatis (2018), Deep image demosaicking using a cascade
of convolutional residual denoising networks. arXiv:1803.05215
V. Kolehmainen, M. Lassas, K. Niinimäki and S. Siltanen (2012), ‘Sparsity-
promoting Bayesian inversion’, Inverse Problems 28, 025005.
X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang and S. P. Smolley (2016), Least squares
generative adversarial networks. arXiv:1611.04076
M. Mardani, E. Gong, J. Y. Cheng, J. Pauly and L. Xing (2017a), Recurrent
generative adversarial neural networks for compressive imaging. In IEEE 7th
International Workshop on Computational Advances in Multi-Sensor Adapt-
ive Processing (CAMSAP 2017).
M. Mardani, E. Gong, J. Y. Cheng, S. Vasanawala, G. Zaharchuk, M. Alley,
N. Thakur, S. Han, W. Dally, J. M. Pauly and L. Xing (2017b), Deep gener-
ative adversarial networks for compressed sensing (GANCS) automates MRI.
arXiv:1706.00051
M. Markkanen, L. Roininen, J. M. J. Huttunen and S. Lasanen (2019), ‘Cauchy
difference priors for edge-preserving Bayesian inversion’, J. Inverse Ill-Posed
Problems 27, 225–240.
I. Markovsky and S. Van Huffel (2007), ‘Overview of total least-squares methods’,
Signal Processing 87, 2283–2302.
J. Martens and I. Sutskever (2012), Training deep and recurrent networks with
Hessian-free optimization. In Neural Networks: Tricks of the Trade, Vol. 7700
of Lecture Notes in Computer Science, Springer, pp. 479–535.
D. Martin, C. Fowlkes, D. Tal and J. Malik (2001), A database of human segmented
natural images and its application to evaluating segmentation algorithms and
measuring ecological statistics. In 8th International Conference on Computer
Vision (ICCV 2001), Vol. 2, pp. 416–423.
M. T. McCann and M. Unser (2019), Algorithms for biomedical image reconstruc-
tion. arXiv:1901.03565
M. T. McCann, K. H. Jin and M. Unser (2017), ‘Convolutional neural networks
for inverse problems in imaging: A review’, IEEE Signal Process. Mag. 34,
85–95.
T. Meinhardt, M. Moeller, C. Hazirbas and D. Cremers (2017), Learning proximal
operators: Using denoising networks for regularizing inverse imaging prob-
lems. In IEEE International Conference on Computer Vision (ICCV 2017),
pp. 1799–1808.
K. Miller (1970), ‘Least squares methods for ill-posed problems with a prescribed
bound’, SIAM J. Math. Anal. 1, 52–74.
D. D. L. Minh and D. Le Minh (2015), ‘Understanding the Hastings algorithm’,
Commun. Statist. Simul. Comput. 44, 332–349.
T. Minka (2001), Expectation propagation for approximate Bayesian inference. In
17th Conference on Uncertainty in Artificial Intelligence (UAI ’01) (J. S.
Breese and D. Koller, eds), Morgan Kaufmann, pp. 362–369.
F. Monard, R. Nickl and G. P. Paternain (2019), 'Efficient nonparametric Bayesian
inference for X-ray transforms', Ann. Statist. 47, 1113–1147.
N. Moriakov, K. Michielsen, J. Adler, R. Mann, I. Sechopoulos and J. Teuwen
(2018), Deep learning framework for digital breast tomosynthesis reconstruc-
tion. arXiv:1808.04640
V. A. Morozov (1966), ‘On the solution of functional equations by the method of
regularization’, Soviet Math. Doklady 7, 414–417.
A. Mousavi and R. G. Baraniuk (2017), Learning to invert: Signal recovery via deep
convolutional networks. In 2017 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pp. 2272–2276.
J. L. Mueller and S. Siltanen (2012), Linear and Nonlinear Inverse Problems with
Practical Applications, SIAM.
D. Mumford and J. Shah (1989), ‘Optimal approximations by piecewise smooth
functions and associated variational problems’, Commun. Pure Appl. Math.
42, 577–685.
K. Murase, M. Aoki, N. Banura, K. Nishimoto, A. Mimura, T. Kuboyabu and
I. Yabata (2015), ‘Usefulness of magnetic particle imaging for predicting the
therapeutic effect of magnetic hyperthermia’, Open J. Medical Imaging 5, 85.
F. Natterer (1977), ‘Regularisierung schlecht gestellter Probleme durch Projek-
tionsverfahren’, Numer. Math. 28, 329–341.
F. Natterer (2001), The Mathematics of Computerized Tomography, Vol. 32 of
Classics in Applied Mathematics, SIAM.
F. Natterer and F. Wübbeling (2001), Mathematical Methods in Image Reconstruc-
tion, SIAM.
R. M. Neal (2003), ‘Slice sampling’, Ann. Statist. 31, 705–767.
D. Needell and J. A. Tropp (2009), ‘CoSaMP: iterative signal recovery from incom-
plete and inaccurate samples’, Appl. Comput. Harmon. Anal. 26, 301–321.
D. Needell and R. Vershynin (2009), ‘Uniform uncertainty principle and signal
recovery via regularized orthogonal matching pursuit’, Found. Comput. Math.
9, 317–334.
Y. Nesterov (2004), Introductory Lectures on Convex Optimization: A Basic
Course, Vol. 87 of Applied Optimization, Springer.
Y. Nesterov (2007), Gradient methods for minimizing composite objective function.
CORE Discussion Papers no. 2007076, Center for Operations Research and
Econometrics (CORE), Université Catholique de Louvain.
A. Neubauer and H. K. Pikkarainen (2008), ‘Convergence results for the Bayesian
inversion theory’, J. Inverse Ill-Posed Problems 16, 601–613.
R. Nickl (2013), Statistical Theory. Lecture notes, University of Cambridge.
http://www.statslab.cam.ac.uk/~nickl/Site/ files/stat2013.pdf
R. Nickl (2017a), ‘Bernstein–von Mises theorems for statistical inverse problems,
I: Schrödinger equation,’ J. Eur. Math. Soc., to appear. arXiv:1707.01764
R. Nickl (2017b), ‘On Bayesian inference for some statistical inverse problems with
partial differential equations’, Bernoulli News 24, 5–9.
R. Nickl and J. Söhl (2017), ‘Nonparametric Bayesian posterior contraction rates
for discretely observed scalar diffusions’, Ann. Statist. 45, 1664–1693.
R. Nickl, S. van de Geer and S. Wang (2018), Convergence rates for penalised least
squares estimators in PDE-constrained regression problems. arXiv:1809.08818
L. Nie and X. Chen (2014), ‘Structural and functional photoacoustic molecular
tomography aided by emerging contrast agents’, Chem. Soc. Review 43, 7132–
70.
J. Nocedal and S. Wright (2006), Numerical Optimization, Springer Series in Op-
erations Research and Financial Engineering, Springer.
C. Oh, D. Kim, J.-Y. Chung, Y. Han and H. W. Park (2018), ETER-net: End to
end MR image reconstruction using recurrent neural network. In International
E. T. Quinto (1993), ‘Singularities of the X-ray transform and limited data tomo-
graphy in R^2 and R^3', SIAM J. Math. Anal. 24, 1215–1225.
E. T. Quinto and O. Öktem (2008), ‘Local tomography in electron microscopy’,
SIAM J. Appl. Math. 68, 1282–1303.
J. Radon (1917), ‘Über die Bestimmung von Funktionen durch ihre Integralwerte
längs gewisser Mannigfaltigkeiten’, Ber. Verh. Sächs. Akad. Wiss. (Leipzig)
69, 262–277.
M. Raissi and G. E. Karniadakis (2017), Hidden physics models: Machine learning
of nonlinear partial differential equations. arXiv:1708.00588v2
R. Ranftl and T. Pock (2014), A deep variational model for image segmentation.
In 36th German Conference on Pattern Recognition (GCPR 2014), Vol. 8753
of Lecture Notes in Computer Science, Springer, pp. 107–118.
K. Ray (2013), ‘Bayesian inverse problems with non-conjugate priors’, Electron. J.
Statist. 7, 2516–2549.
E. T. Reehorst and P. Schniter (2018), Regularization by denoising: Clarifications
and new interpretations. arXiv:1806.02296
A. Repetti, M. Pereyra and Y. Wiaux (2019), ‘Scalable Bayesian uncertainty quan-
tification in imaging inverse problems via convex optimization’, SIAM J. Ima-
ging Sci. 12, 87–118.
W. Ring (2000), ‘Structural properties of solutions to total variation regularization
problems’, ESAIM Math. Model. Numer. Anal. 34, 799–810.
S. Rizzo, F. Botta, S. Raimondi, D. Origgi, C. Fanciullo, A. G. Morganti and
M. Bellomi (2018), ‘Radiomics: The facts and the challenges of image ana-
lysis’, European Radiol. Exp. 2, 36.
G. Rizzuti, A. Siahkoohi and F. J. Herrmann (2019), Learned iterative solv-
ers for the Helmholtz equation. Submitted to 81st EAGE Conference and
Exhibition 2019. Available from https://www.slim.eos.ubc.ca/content/learned-
iterative-solvers-helmholtz-equation.
H. Robbins and S. Monro (1951), ‘A stochastic approximation method’, Ann. Math.
Statist. 22, 400–407.
C. P. Robert and G. Casella (2004), Monte Carlo Statistical Methods, Springer
Texts in Statistics, Springer.
R. T. Rockafellar and R. J.-B. Wets (1998), Variational Analysis, Springer.
Y. Romano and M. Elad (2015), Patch-disagreement as a way to improve K-SVD
denoising. In IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pp. 1280–1284.
Y. Romano, M. Elad and P. Milanfar (2017a), ‘The little engine that could: Reg-
ularization by denoising (RED)’, SIAM J. Imaging Sci. 10, 1804–1844.
Y. Romano, J. Isidoro and P. Milanfar (2017b), ‘RAISR: Rapid and accurate image
super resolution’, IEEE Trans. Comput. Imaging 3, 110–125.
O. Ronneberger, P. Fischer and T. Brox (2015), U-Net: Convolutional networks
for biomedical image segmentation. In 18th International Conference on Med-
ical Image Computing and Computer-Assisted Intervention (MICCAI 2015)
(N. Navab et al., eds), Vol. 9351 of Lecture Notes in Computer Science,
Springer, pp. 234–241.
B. T. Szabó, A. W. van der Vaart and J. H. van Zanten (2013), ‘Empirical Bayes
scaling of Gaussian priors in the white noise model’, Electron. J. Statist.
7, 991–1018.
C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow and R. Fer-
gus (2014), Intriguing properties of neural networks. arXiv:1312.6199v4
M. F. Tappen (2007), Utilizing variational optimization to learn Markov random
fields. In IEEE Conference on Computer Vision and Pattern Recognition
(CVPR 2007), pp. 1–8.
A. Tarantola (2005), Inverse Problem Theory and Methods for Model Parameter
Estimation, second edition, SIAM.
A. Tarantola and B. Valette (1982), ‘Inverse Problems = Quest for Information’,
J. Geophys. 50, 159–170.
S. Tariyal, A. Majumdar, R. Singh and M. Vatsa (2016), ‘Deep dictionary learning’,
IEEE Access 4, 10096–10109.
U. Tautenhahn (2008), ‘Regularization of linear ill-posed problems with noisy right
hand side and noisy operator’, J. Inverse Ill-Posed Problems 16, 507–523.
A. Taylor, J. Hendrickx and F. Glineur (2017), ‘Smooth strongly convex interpol-
ation and exact worst-case performance of first-order methods’, Math. Pro-
gram. 161, 307–345.
M. Thoma (2016), A survey of semantic segmentation. arXiv:1602.06541
R. Tibshirani (1996), ‘Regression shrinkage and selection via the Lasso’, J. Royal
Statist. Soc. B 58, 267–288.
A. N. Tikhonov (1943), On the stability of inverse problems. Dokl. Akad. Nauk
SSSR 39, 195–198.
A. N. Tikhonov (1963), Solution of incorrectly formulated problems and the regu-
larization method. Dokl. Akad. Nauk. 151, 1035–1038.
A. N. Tikhonov and V. Y. Arsenin (1977), Solutions of Ill-Posed Problems, Win-
ston.
J. Tompson, K. Schlachter, P. Sprechmann and K. Perlin (2017), Accelerating
Eulerian fluid simulation with convolutional networks. arXiv:1607.03597v6
A. Traverso, L. Wee, A. Dekker and R. Gillies (2018), ‘Repeatability and reprodu-
cibility of radiomic features: A systematic review’, Imaging Radiation Onco-
logy 102, 1143–1158.
J. A. Tropp and A. C. Gilbert (2007), ‘Signal recovery from random measurements
via orthogonal matching pursuit’, IEEE Trans. Inform. Theory 53, 4655–
4666.
G. Uhlmann and A. Vasy (2012), ‘The inverse problem for the local geodesic X-ray
transform’, Inventio. Math. 205, 83–120.
D. Ulyanov, A. Vedaldi and V. Lempitsky (2018), Deep image prior. In IEEE
Conference on Computer Vision and Pattern Recognition (CVPR 2018),
pp. 9446–9454.
M. Unser and T. Blu (2000), ‘Fractional splines and wavelets’, SIAM Review 42,
43–67.
K. Valluru, K. Wilson and J. Willmann (2016), ‘Photoacoustic imaging in oncology:
Translational preclinical and early clinical experience’, Radiology 280, 332–
349.
C. Van Chung, J. De los Reyes and C.-B. Schönlieb (2017), ‘Learning optimal
spatially-dependent regularization parameters in total variation image de-
noising’, Inverse Problems 33, 074005.
D. Van Veen, A. Jalal, E. Price, S. Vishwanath and A. G. Dimakis (2018), Com-
pressed sensing with deep image prior and learned regularization.
arXiv:1806.06438
Y. Vardi, L. Shepp and L. Kaufman (1985), ‘A statistical model for positron emis-
sion tomography’, J. Amer. Statist. Assoc. 80 (389), 8–20.
B. S. Veeling, J. Linmans, J. Winkens, T. Cohen and M. Welling (2018), Rotation
equivariant CNNs for digital pathology. arXiv:1806.03962
S. V. Venkatakrishnan, C. A. Bouman and B. Wohlberg (2013), Plug-and-play
priors for model based reconstruction. In IEEE Global Conference on Signal
and Information Processing (GlobalSIP 2013), pp. 945–948.
R. Vidal, J. Bruna, R. Giryes and S. Soatto (2017), Mathematics of deep learning.
arXiv:1712.04741
C. Villani (2009), Optimal Transport: Old and New, Vol. 338 of Grundlehren der
mathematischen Wissenschaften, Springer.
C. Viroli and G. J. McLachlan (2017), Deep Gaussian mixture models.
arXiv:1711.06929
C. Vogel and T. Pock (2017), A primal dual network for low-level vision problems.
In GCPR 2017: Pattern Recognition (V. Roth and T. Vetter, eds), Vol. 10496
of Lecture Notes in Computer Science, Springer, pp. 189–202.
G. Wang, J. C. Ye, K. Mueller and J. A. Fessler (2018), ‘Image reconstruction
is a new frontier of machine learning’, IEEE Trans. Medical Imaging 37,
1289–1296.
L. V. Wang (2009), ‘Multiscale photoacoustic microscopy and computed tomo-
graphy’, Nature Photonics 3, 503–509.
S. Wang, Z. Su, L. Ying, X. Peng, S. Zhu, F. Liang, D. Feng and D. Liang (2016),
Accelerating magnetic resonance imaging via deep learning. In 2016 IEEE
13th International Symposium on Biomedical Imaging (ISBI), pp. 514–517.
Y. Wang and D. M. Blei (2017), Frequentist consistency of variational Bayes.
arXiv:1705.03439
Z. Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli (2004), ‘Image quality
assessment: From error visibility to structural similarity’, IEEE Trans. Image
Process. 13, 600–612.
J. Weickert (1998), Anisotropic Diffusion in Image Processing, ECMI series, Teub-
ner.
M. Weiler, M. Geiger, M. Welling, W. Boomsma and T. Cohen (2018), 3D steer-
able CNNs: Learning rotationally equivariant features in volumetric data.
arXiv:1807.02547
J. Weizenecker, B. Gleich, J. Rahmer, H. Dahnke and J. Borgert (2009), ‘Three-
dimensional real-time in vivo magnetic particle imaging’, Phys. Med. Biol.
54, L1–L10.
M. Welling and Y. W. Teh (2011), Bayesian learning via stochastic gradient
Langevin dynamics. In 28th International Conference on Machine Learning
(ICML ’11), pp. 681–688.