DiffuserCam Algorithm Guide
1 Introduction
This guide is meant as a tutorial for the lensless image reconstruction algorithms used in DiffuserCam.
It provides a brief overview of the optics involved and how they were used to develop the current version of the algorithm.
See our other document (“How to build a (Pi) DiffuserCam”) for information on how to actually build and
calibrate DiffuserCam.
For most 2D imaging applications, lens-based systems have been optimized in design and fabrication
to be the best option. However, lensless imaging systems have not been investigated nearly as much.
DiffuserCam is a lensless system that replaces the lens element with a diffuser (a thin, transparent, lightly
scattering material). See Figure 1 below.
• Possibility of 3D imaging/microscopy. We’ve also shown that lensless cameras can capture 3D images
and are robust to missing or dead pixels (see this paper), both of which are promising in the field of
microscopy.
1.2 DiffuserCam
Every diffuser has a “focal plane”. Instead of mapping a faraway point source to a point in this plane
(as lenses do), the diffuser maps a point source to a “caustic pattern” (see Fig. 2a) over the entire plane. So,
replacing the lens in a camera with a diffuser of the same focal length creates a system that maps points in
the scene to many points on the sensor (see Fig. 2b).
Figure 2: (a) Caustic image of a single point source. (b) Sensor reading of a hand. (c) Reconstructed image of a hand.
The key to DiffuserCam’s operation is that, while light information is spread out over the sensor, none
of that information is lost. You can see in Fig. 2b that the sensor reading won’t look like the object.
However, we can recover the object image using a reconstruction algorithm that requires a single calibration
measurement of the caustic produced by a point source. This measurement, called a point spread function
(PSF), completely characterizes the scattering behavior of the diffuser (under certain assumptions).
To derive the algorithm and understand where these assumptions come from, it’s helpful to think of the
imaging system as a function that maps objects in the real world to images on the sensor. More precisely, it
is a function f that maps a 2D array v of light intensity values (the scene) to a 2D array of pixel values b on
the sensor. Recovering the scene v from a sensor reading b is equivalent to inverting this function (though
sometimes the function isn’t invertible):
f(v) = b =⇒ v = f^{-1}(b)
2 Problem Specification
Roughly speaking, f is the composition of everything that happens to light as it travels from the object
scene to the sensor. Each ray from a point in the scene propagates a certain distance to the diffuser and is
locally refracted by the diffuser surface, then propagated again to the sensor plane. Whether or not the ray
hits the sensor depends on how it was bent – we will start by ignoring this issue and addressing the finite
sensor size after constructing the rest of the model.
We make the following approximations:
• Shift invariance: A lateral shift of the point source causes a lateral translation of the sensor reading.
Figure 3: As the point source shifts to the right, the image on the sensor shifts to the left
• Linearity: Scaling the intensity of a point source corresponds to scaling the intensity of the sensor
reading by the same amount. Also, the pattern due to two point sources is the sum of their individual
contributions. These two assumptions amount to having incoherent light sources and a sensor that
responds to light intensity linearly. Both of these conditions are often satisfied.
Figure 4: (a) Point source on axis. (b) Point source off axis. (c) Superposition of both point sources. Each point source creates a pattern on the sensor. When two point sources are present, the sensor reads the superposition of the patterns created by each individual point source.
In short, the diffuser system is assumed to be linear shift-invariant (LSI). We assume that v can be
represented as the sum of many point sources of varying intensity and position. By the LSI property of the
system, the output f (v) corresponding to the input v can be represented as a 2D convolution with a single
PSF h:
f (v) = h ∗ v
To model the finite sensor, write the convolution with h as a matrix M and the crop to the sensor area as a matrix C, so that the full forward model is b = CMv = Av, with A = CM. A first approach to solving for v, which ignores the crop, would be to try Wiener deconvolution. This method is a common way to reverse a convolution, but it relies on diagonalizing the measurement matrix and cannot model the cropping behavior at all (see our ADMM Jupyter notebook for an explanation of diagonalization). While Wiener deconvolution would work if A were purely convolutional, i.e. A = M, adding the crop makes A too complex to invert analytically.
Instead, we must find an efficient numerical way to “invert” f . In general, f isn’t invertible at all:
multiple v’s can be mapped to the same b. We can see A isn’t invertible for two reasons:
• Information is lost in the crop operation, so C is not an invertible matrix.
• Convolving with a fixed function, e.g. h, is not always invertible, so M is not necessarily invertible.
The typical approach to solving Av = b for non-invertible A is to formulate it as an optimization problem,
which has the same form regardless of whether A is convolutional or not:
v^* = argmin_v (1/2) ‖Av − b‖_2^2

At the minimizer v^*, the residual ‖Av^* − b‖_2^2 is as small as possible; if b lies in the range of A, then Av^* = b exactly.
It is worth noting that A is extremely large, and scales with the area of the sensor. Our sensor has ∼10^6 pixels, so A would have on the order of 10^6 × 10^6 = 10^{12} entries. While A is useful mathematically, it's
computationally useless to ever load/store it in memory. Whichever algorithm we choose to solve the
minimization problem has to avoid ever loading A in memory. Our general approach to addressing this issue
will be to make sure the algorithm can be implemented in terms of the linear operators that make up f :
crop and convolution. Both of these operations have fast implementations on 2D images that don’t require
loading their corresponding matrices.
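To make this concrete, here is a minimal numpy sketch of the two operators; the function names, array shapes, and padding convention are our own, not necessarily those of the notebooks:

import numpy as np

def convolve(v, H_fft):
    # Circular convolution with the PSF h via the FFT; H_fft = np.fft.fft2(h).
    # In practice the arrays are zero-padded first to avoid wrap-around.
    return np.real(np.fft.ifft2(H_fft * np.fft.fft2(v)))

def crop(x, sensor_shape):
    # Keep only the central region that actually lands on the sensor.
    r0 = (x.shape[0] - sensor_shape[0]) // 2
    c0 = (x.shape[1] - sensor_shape[1]) // 2
    return x[r0:r0 + sensor_shape[0], c0:c0 + sensor_shape[1]]

def A(v, H_fft, sensor_shape):
    # Forward model f(v) = crop(h * v).
    return crop(convolve(v, H_fft), sensor_shape)

Neither function ever builds a matrix: the convolution costs O(n log n) via the FFT, and the crop is just array slicing.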
3 Solving for v
3.1 Gradient Descent
Gradient descent is an iterative algorithm that finds the minimum of a convex function by following the
slope “downhill” until it reaches a minimum. To solve the minimization problem
minimize g(x),
we find the gradient of g wrt x, ∇x g, and use the property that the gradient always points in the direction
of steepest ascent. In order to minimize g, we go the other direction:
x_0 = initial guess
x_{k+1} ← x_k − α_k ∇g(x_k),
where α is a step size that determines how far in the descent direction we go at each iteration.
Applied to our problem:
g(v) = (1/2) ‖Av − b‖_2^2
∇_v g(v) = A^H (Av − b),
where A^H is the adjoint of A. Again, we want to write A as a composition of linear operators that are easy to implement, so we never have to deal with A itself. For a product of arbitrary linear matrices FG, the adjoint is (FG)^H = G^H F^H. In our case:
Av = CMv
A^H v = M^H C^H v
We’ve reduced the problem of finding the adjoint of A to finding the adjoints of M and C.
Finding the adjoint of M: The adjoint of M, a convolution, can be found by writing the operation using
Fourier transforms. The convolution theorem states:
Mv ⇐⇒ h ∗ v = F^{-1}(F(h) · F(v)),
where the · denotes pointwise multiplication, and F denotes the 2D Fourier transform operator. This
theorem is also known as “convolution of two signals in real space is multiplication in Fourier space.” Next,
we vectorize the previous statement by recognizing that 2D Fourier transforms are linear operators, so we
have the equivalence F(v) ⇐⇒ Fv. To fully write M as a product of matrices, we must also convert the pointwise multiplication to a matrix multiplication, using a diagonal matrix:

M = F^{-1} diag(Fh) F
Also, F^H = F^{-1} by "unitarity" of the Fourier transform. Finally, the adjoint of a diagonal matrix is formed
by taking the complex conjugate of its entries.
In summary,

M^H v = (F^{-1} diag(Fh) F)^H v
      = F^H diag(Fh)^H (F^{-1})^H v
      = F^{-1} diag(Fh)^* F v,

where ^* denotes complex conjugation.
Finding the adjoint of C: Finally, we note that the adjoint of cropping, C^H, is zero-padding (see section 2.4 of the appendix).
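Continuing the sketch above, the adjoint is the same recipe run backwards: zero-pad, then filter with the conjugated PSF spectrum. A quick inner-product test confirms these really are adjoints (again, names and shapes are our own assumptions):

def pad(b, full_shape):
    # Adjoint of crop: embed the sensor image in a field of zeros.
    x = np.zeros(full_shape)
    r0 = (full_shape[0] - b.shape[0]) // 2
    c0 = (full_shape[1] - b.shape[1]) // 2
    x[r0:r0 + b.shape[0], c0:c0 + b.shape[1]] = b
    return x

def A_adj(b, H_fft, full_shape):
    # A^H b = M^H C^H b: zero-pad, then multiply by conj(F(h)) in Fourier space.
    return np.real(np.fft.ifft2(np.conj(H_fft) * np.fft.fft2(pad(b, full_shape))))

# Adjoint check: <A v, y> should equal <v, A^H y> for random v and y.
rng = np.random.default_rng(0)
full, sensor = (64, 64), (32, 32)
H_fft = np.fft.fft2(rng.standard_normal(full))
v, y = rng.standard_normal(full), rng.standard_normal(sensor)
assert np.isclose(np.vdot(A(v, H_fft, sensor), y), np.vdot(v, A_adj(y, H_fft, full)))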
Plugging in to the formula for A^H, we find

A   = C F^{-1} diag(Fh) F        ⇐⇒   f(v)   = crop( F^{-1}{ F(h) · F(v) } )
A^H = F^{-1} diag(Fh)^* F C^H    ⇐⇒   f^H(x) = F^{-1}{ F(h)^* · F(pad[x]) },
where we have written A in its matrix formulation (left) and the corresponding way it is implemented
in code (right). Note that we converted efficient operations like pointwise multiplication to matrices purely
for the derivation. See the GD Jupyter notebook for the actual implementation of these operators.
3.1.1 GD Implementation
v_0 = anything
v_{k+1} ← v_k − α_k A^H(A v_k − b)
repeat forever
F(h) can be precomputed (because h is measured beforehand), and the action of diag(Fh)^H can be implemented as pointwise multiplication with the conjugate F(h)^*. Since all the other operations involve only
Fourier transforms, every operation in the gradient calculation can be efficiently calculated. For implemen-
tation details, see the GD Jupyter notebook.
In our problem, we need to keep in mind the physical interpretation of v. Since it represents an image,
it must be non-negative. We can add this constraint into the algorithm by “projecting” v onto the space
of non-negative images. In short, we zero all negative pixel values in the current image estimate at every
iteration.
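Continuing the numpy sketch, this projection is a one-liner:

def proj_nonneg(v):
    # Project onto the set of physical images: zero out negative pixels.
    return np.maximum(v, 0)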
One thing to keep in mind is the step size, αk . We want it to be large at first – “coarse” jumps to get
closer to the minimum quickly. As we get closer, large steps will cause the estimate to “bounce around” the
minimum, overshooting it each time. Ideally we would want to decrease the step size with each iteration
at a rate that would ensure continual progress. While varying step size might yield a faster convergence, it
requires hand tuning and can be time consuming. A constant but sufficiently small step size is guaranteed to
converge, with no parameter tuning necessary. In our case, it is possible to calculate the largest constant step size that guarantees convergence in terms of A: 0 < α < 2/‖A^H A‖_2, where ‖A^H A‖_2 is the maximum singular value of A^H A (see this page for why). The GD Jupyter notebook shows how we actually approximate this singular value (using M instead).
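One standard way to approximate this quantity is power iteration, which only requires applying M^H M repeatedly. A sketch under the same assumptions as above, not the notebook's exact code:

def max_eigenvalue(H_fft, shape, n_iters=100):
    # Estimate ||M^H M||_2 by power iteration. Each application of M^H M is
    # a pointwise multiply by |F(h)|^2 in Fourier space.
    x = np.random.rand(*shape)
    mag2 = np.abs(H_fft) ** 2
    for _ in range(n_iters):
        x = np.real(np.fft.ifft2(mag2 * np.fft.fft2(x)))
        x = x / np.linalg.norm(x)
    return np.linalg.norm(np.real(np.fft.ifft2(mag2 * np.fft.fft2(x))))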
Lastly, all convergence guarantees are for an infinite number of iterations: “repeat forever”. In practice,
after a certain number of iterations (which varies by application) the updates are too small to change the
estimate significantly. In our case, after incorporating the speedup techniques below, most of the progress is
seen in the first 150-200 iterations. Sharper, more detailed images may require a few hundred more.
We also need to supply an initial “guess” of our image. It doesn’t actually matter what we use for this.
Currently, we are using a uniform image of half intensity, but you could initialize with all 0’s or a random
image.
Incorporating all of these details, we have:
v_0 = I/2
for k = 0 to num_iters:
    v′_{k+1} ← v_k − (1.8 / ‖A^H A‖_2) A^H(A v_k − b)
    v_{k+1} ← proj_{v≥0}(v′_{k+1})
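Putting the pieces together in numpy, reusing the hypothetical A, A_adj, proj_nonneg, and max_eigenvalue sketches above (a sketch, not the notebook's exact code):

def grad_descent(psf, b, n_iters=200):
    # Projected gradient descent for v* = argmin_{v >= 0} 0.5 ||A v - b||^2.
    full_shape = psf.shape
    H_fft = np.fft.fft2(psf)                      # precomputed once
    alpha = 1.8 / max_eigenvalue(H_fft, full_shape)
    v = 0.5 * np.ones(full_shape)                 # uniform half-intensity guess
    for _ in range(n_iters):
        residual = A(v, H_fft, b.shape) - b       # A v_k - b
        v = proj_nonneg(v - alpha * A_adj(residual, H_fft, full_shape))
    return v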
3.1.2 Gradient Descent Speedup
Gradient descent as written above works, but in practice one almost always adds a "momentum term" that
incorporates the old descent direction into the calculation of the new descent direction. This guards against
changing the descent direction too much and too often, which can be counterproductive. We implement
momentum by introducing µ, a factor that determines how much the new descent direction is determined
by the old descent direction. Typically µ = 0.9 is a good place to start. Another common practice is to
use “Nesterov” momentum, which involves an intermediate update p. We call this method, along with the
projection step, “accelerated projected gradient descent”.
v_0 = I/2, µ = 0.9, p_0 = 0
for k = 0 to num_iters:
    p_{k+1} ← µ p_k − α_k ∇g(v_k)
    v′_{k+1} ← v_k − µ p_k + (1 + µ) p_{k+1}
    v_{k+1} ← proj_{v≥0}(v′_{k+1})
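In numpy, with grad(v) computing A^H(Av − b) as before, the accelerated loop might read (a sketch under the assumptions above):

def accelerated_grad_descent(grad, v0, alpha, mu=0.9, n_iters=200):
    # Projected gradient descent with Nesterov-style momentum.
    v, p = v0.copy(), np.zeros_like(v0)
    for _ in range(n_iters):
        p_new = mu * p - alpha * grad(v)      # mix old direction with new gradient
        v = v - mu * p + (1 + mu) * p_new     # Nesterov update
        v = np.maximum(v, 0)                  # projection onto v >= 0
        p = p_new
    return v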
See this page for more details on parameter updates using momentum terms.
3.1.3 FISTA
Another way to speed up gradient descent is the Fast Iterative Shrinkage-Thresholding Algorithm
(FISTA). This also computes the accelerated projected gradient descent, but is more flexible about what the
projection step (or more generally the “proximal” step pL ) does. For example, one can show that doing ac-
celerated descent with `1 -regularization only requires exchanging the projection step with a soft-thresholding
step. Enforcing sparsity in other domains (for instance, on the gradient of the image rather than the image
itself) can be achieved via soft-thresholding transformations of the image. This algorithm is very useful for
solving linear inverse problems in image processing.
Each iteration is as follows (see this paper for a derivation and explanation of each term):
v_1 = I/2, t_1 = 1, x_0 = v_1
for k = 1 to num_iters:
    x_k ← p_L(v_k)
    t_{k+1} ← (1 + √(1 + 4 t_k^2)) / 2
    v_{k+1} ← x_k + ((t_k − 1) / t_{k+1}) (x_k − x_{k−1})
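A direct transcription into numpy, with the proximal map p_L passed in as a function (a sketch; for our problem it is the projected gradient step):

def fista(grad, prox, v0, alpha, n_iters=200):
    # FISTA: proximal gradient descent with momentum via the t_k sequence.
    v, x_prev, t = v0.copy(), v0.copy(), 1.0
    for _ in range(n_iters):
        x = prox(v - alpha * grad(v))                # proximal step p_L
        t_next = (1 + np.sqrt(1 + 4 * t ** 2)) / 2
        v = x + ((t - 1) / t_next) * (x - x_prev)    # momentum extrapolation
        x_prev, t = x, t_next
    return x_prev

For plain non-negativity, the prox argument is just proj_nonneg; for ℓ1 regularization it would be a soft-thresholding function instead.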
3.2 ADMM
Although gradient descent is a reliable algorithm that is guaranteed to converge, it is still slow. If we
want to process larger sets of data (e.g. 3D imaging), have a live feed of DiffuserCam, or just want to process
images more quickly, we need to tailor the algorithm more closely to the optical system involved. While this
introduces more tuning parameters (“knobs” to turn), speed of reconstruction can be drastically improved.
Here we present (without proof) the result of using the alternating direction method of multipliers (ADMM) to
reconstruct the image.
We will only briefly motivate the use of ADMM and then provide the derivation of the update steps specific to our problem. For background on ADMM, please refer to sections 2 and 3 of Prof. Boyd's ADMM tutorial. To understand this document, background knowledge from Chapters 5 (Duality) and 9 (Unconstrained minimization) of his textbook on optimization may be necessary.
Recall the original minimization problem:
v̂ = argmin_{v≥0} (1/2) ‖b − Av‖_2^2,        (1)
where 2D images are interpreted as vectors. We seek to split the single minimization over the vector v into
separable minimizations – for example:
v̂ = argmin_{w≥0, x} (1/2) ‖b − Cx‖_2^2        (2)
      s.t. x = Mv, w = v,
where we have decomposed the action of DiffuserCam, A = CM, into the convolution M followed by a crop C. The primary reason is to make the expression more amenable to the ADMM algorithm, which adds a set of "update steps" for each additional constraint. If we don't find a nice decomposition, some of these updates will be inefficient to calculate.
In addition, because of these parallel update steps, we can add constraints (prior information) easily. A
common useful prior we add is to encourage the gradient of the image to be sparse – most natural images
can be approximated by piecewise constant intensities. Typically, gradient sparsity is enforced through "total
variation” regularization, where we include the `1 -norm of the gradient in our objective function:
v̂ = argmin_{w≥0, u, x} (1/2) ‖b − Cx‖_2^2 + τ ‖u‖_1        (3)
      s.t. x = Mv, u = Ψv, w = v,

where Ψ is the operator that computes the image gradient.
The Lagrangian dual approach to minimizing the objective function is to form the augmented Lagrangian L, with a dual variable (ξ, η, ρ) and a penalty weight (µ_1, µ_2, µ_3) for each of the three constraints, and solve the following optimization problem:

maximize_{ξ,η,ρ}  min_{u,x,w,v}  L({u, x, w, v}, {ξ, η, ρ})        (5)
The min above indicates that, ideally, we would want to jointly minimize over all the primal variables
(u, x, w, v) first, before performing the outer maximization over the dual variables (ξ, η, ρ). The ADMM
algorithm is a specific way of iteratively finding this optimal point. In reality, we only have estimates of each of the variables, so at every iteration the algorithm updates the estimates of the minimizing primal variables while taking one step toward the maximizing dual variables.
Based on this paradigm, we can write down all the intermediate updates that take place in one “global”
update step:
Primal updates:
    u_{k+1} ← argmin_u L({u, x_k, w_k, v_k}, {ξ_k, η_k, ρ_k})
    x_{k+1} ← argmin_x L({u_{k+1}, x, w_k, v_k}, {ξ_k, η_k, ρ_k})
    w_{k+1} ← argmin_w L({u_{k+1}, x_{k+1}, w, v_k}, {ξ_k, η_k, ρ_k})
    v_{k+1} ← argmin_v L({u_{k+1}, x_{k+1}, w_{k+1}, v}, {ξ_k, η_k, ρ_k})

Dual updates:
    ξ_{k+1} ← ξ_k + µ_1 (M v_{k+1} − x_{k+1})
    η_{k+1} ← η_k + µ_2 (Ψ v_{k+1} − u_{k+1})
    ρ_{k+1} ← ρ_k + µ_3 (v_{k+1} − w_{k+1})
Notice that each dual update step tries to solve the maximization problem via gradient ascent. In each
global iteration, we make one step in the ascent direction.
Next, for each primal variable, the individual optimization problem depends only on the terms in the Lagrangian involving that variable. For example, in the u-update we only need to include the terms τ‖u‖_1, (µ_2/2)‖u − Ψv‖_2^2, and η^T(u − Ψv); all the other terms are constant with respect to u. So, we have:

u_{k+1} ← argmin_u τ‖u‖_1 + (µ_2/2)‖u − Ψv_k‖_2^2 + η_k^T(u − Ψv_k) = T_{τ/µ_2}(Ψv_k − η_k/µ_2),

where T_κ denotes element-wise soft-thresholding at level κ.
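Soft-thresholding itself is a one-liner in numpy (continuing the sketch):

def soft_threshold(x, kappa):
    # Shrink every entry toward zero by kappa; this is exactly the u-update
    # above with kappa = tau / mu_2.
    return np.sign(x) * np.maximum(np.abs(x) - kappa, 0)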
The primal minimization updates can be solved using standard convex optimization techniques, which are worked out in the DiffuserCam Derivations Supplement. For example, setting the gradient of the Lagrangian terms involving v to zero shows that the v-update amounts to solving the linear system

(µ_1 M^T M + µ_2 Ψ^T Ψ + µ_3 I) v_{k+1} = r_k,

where

r_k = (µ_3 w_{k+1} − ρ_k) + Ψ^T(µ_2 u_{k+1} − η_k) + M^T(µ_1 x_{k+1} − ξ_k)
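Because M (and Ψ, under periodic boundary conditions) is diagonalized by the 2D Fourier transform, this system can be solved pointwise in the frequency domain rather than with an iterative solver. A sketch for the simplified case without the TV term, so the µ_2 Ψ^T Ψ term drops out (our own simplification, not the supplement's exact update):

def v_update(r, H_fft, mu1, mu3):
    # Solve (mu1 M^T M + mu3 I) v = r pointwise in Fourier space.
    denom = mu1 * np.abs(H_fft) ** 2 + mu3
    return np.real(np.fft.ifft2(np.fft.fft2(r) / denom))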