1 Introduction
From interactive gaming to precision industrial manufacturing, depth sensors have enabled advances in a broad set of consumer and research applications. Their ability to recover 3D data at scale [Silberman et al. 2012; Chang et al. 2015; Dai et al. 2017] and produce high-fidelity scene reconstructions [Izadi et al. 2011; Tulsiani et al. 2018] drives developments in 3D scene understanding [Dai et al. 2018; Song et al. 2015; Hickson et al. 2014], which in turn influence the fields of augmented reality, virtual reality, robotic scanning, autonomous vehicle guidance, and path planning for delivery drones.
Some of the most successful depth acquisition approaches for wide operating ranges are based on active time-of-flight sensing, as they offer high depth precision at a small sensor-illumination baseline [Hansard et al. 2012]. Passive approaches, which infer distance from parallax [Subbarao and Surya 1994; Mahjourian et al. 2018] or visual cues in monocular images [Bhat et al. 2021; Saxena et al. 2005], do not offer the same range and depth precision, as they struggle with textureless regions and complex geometries [Smolyanskiy et al. 2018; Lazaros et al. 2008]. Active sensing approaches tackle this challenge by projecting light into the scene and reconstructing depth from the returned signal. Structured light methods such as active stereo systems use spatially patterned light to aid stereo matching [Ahuja and Abbott 1993; Baek and Heide 2021]. While robust to textureless scenes, their accuracy is limited by illumination pattern density and sensor baseline, resulting in a large form factor.
Time-of-flight (ToF) depth sensing approaches avoid these limitations by estimating depth from the travel time of photons leaving from and returning to the device, allowing for co-axial sensor setups with virtually no illumination-camera baseline. Direct ToF systems, such as light detection and ranging (LiDAR) sensors [Schwarz 2010], directly measure the round-trip time of emitted light pulses to estimate point depths, and can theoretically provide accuracy over a long range. However, this direct acquisition approach demands fast pulsed lasers, accurate synchronization, narrow-band filters, and picosecond-resolution time-tagged detectors such as single-photon avalanche diodes (SPADs) [Aull et al. 2002; Bronzi et al. 2015; Niclass et al. 2005; Rochas et al. 2003]. Though affordable SPADs have recently entered the market, they offer only 20 cm depth resolution [Callenberg et al. 2021], more than \(50{\times }\) lower than their costly picosecond-resolution counterparts.
Amplitude-modulated continuous-wave (AMCW) ToF methods, which we hereon refer to as correlation ToF methods [Lange and Seitz 2001; Su et al. 2018; Shrestha et al. 2016; Gupta et al. 2015], flood a scene with periodic amplitude-modulated light and indirectly infer depth from the phase shift of the returned light. In contrast to direct ToF sensing approaches, this modulation and correlation does not require ultra-short pulse generation and time-tagging, which lowers sensor and laser complexity requirements. Correlation ToF sensors that demodulate the amplitude-modulated flash illumination on-sensor have been widely adopted, for example, in the Microsoft Kinect One camera. These sensors implement multiple charge buckets per pixel and shift a photo-electron to an individual bucket by applying an electrical potential between the individual quantum wells [Lange and Seitz 2001]. Though amplitude modulation allows for depth precision comparable to picosecond-pulsed direct ToF at meter-scale distances, while remaining low-cost thanks to scalable CMOS technology, it is also this sensing mode that fundamentally limits the sensor. Specifically, modulation after photo-electric conversion limits the maximum achievable modulation frequency to a few hundred MHz in practice, restricted by the photon absorption depth in silicon [Lange and Seitz 2001]. This has limited the depth precision of existing correlation ToF sensors to the sub-centimeter regime. Fiber-coupled modulation approaches from optical communication that bypass this limit suffer from low modulation contrast due to coupling loss [Kadambi and Raskar 2017; Rogers et al. 2021; Marchetti et al. 2017; Bandyopadhyay et al. 2020].
In this work, we co-opt free-space electro-optic modulators (EOMs) from optical communication and combine them with a phase unwrapping neural network to build a GHz correlation ToF system. EOM-based ranging systems are known to offer fast intensity modulation and can be integrated with conventional intensity sensors and a continuous-wave laser, bypassing the more complex hardware requirements of time-tagged ToF devices [Froome and Bradsell 1961]. Inspired by existing EOM-based ranging methods, we devise a two-pass EOM-based GHz ToF sensing system that achieves a 7 GHz modulation frequency with \(\gt 50\%\) contrast. Our system inherits the benefits of EOM-based systems, namely large-area free-space modulation and single-digit driving voltages, while using conventional intensity sensors and continuous-wave lasers.
Although a higher modulation frequency can increase phase contrast and allow for more precise depth measurement, it also greatly complicates the task of phase unwrapping, a major obstacle in applying EOMs to depth sensing. At 7 GHz, even a 2 cm depth change results in a phase wrap, in contrast to 3 m of unambiguous depth for a 100 MHz ToF camera. In addition to a few dozen wraps, imaging noise and the small modulation bandwidth of EOMs, only a few MHz, impose a further challenge for conventional look-up table approaches. We tackle this challenge with a segmentation-inspired neural phase unwrapping network, where the problem is decomposed into ordinal classification, mapping regions of measured data to their wrap count. Trained end-to-end on simulated ToF data and fine-tuned on a small set of experimental measurements, the proposed network exploits the correlation of adjacent measurements to robustly unwrap them.
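To make this formulation concrete, below is a minimal PyTorch sketch of unwrapping posed as per-pixel classification over wrap counts. The layer widths, the wrap-count bound, and the single-scale encoder-decoder are illustrative assumptions for exposition, not the exact architecture used in this work; the point is that posing unwrapping as ordinal classification, rather than direct phase regression, lets segmentation-style networks and losses be reused and avoids averaging estimates across wrap boundaries.

```python
import torch
import torch.nn as nn

MAX_WRAPS = 40  # assumed upper bound on wrap counts at GHz frequencies

class NeuralUnwrapNet(nn.Module):
    """Toy segmentation-style network: maps stacked wrapped-phase
    measurements to a per-pixel distribution over wrap counts."""
    def __init__(self, in_channels=2, num_classes=MAX_WRAPS + 1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
            nn.Conv2d(32, num_classes, 1),  # per-pixel class logits
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Wrapped phase maps at two demodulation settings, stacked as channels
# (the network is untrained here; this only demonstrates the shapes).
wrapped = torch.rand(1, 2, 128, 128) * 2 * torch.pi
logits = NeuralUnwrapNet()(wrapped)           # (1, 41, 128, 128)
n = logits.argmax(dim=1)                      # per-pixel wrap count
unwrapped = wrapped[:, 0] + 2 * torch.pi * n  # phi = phi_hat + 2*pi*n
```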
We validate the proposed ToF system in simulation and experimentally, and demonstrate robust depth imaging for macroscopic diffuse scenes with free-space centimeter-wave modulation at mW laser powers, corresponding to \(\lt\)100 femtosecond temporal resolution; see Figure 1. Jointly with the learned unwrapping, the all-optical modulation without coupling losses allows for robustness to low-reflectance texture regions and highly specular objects with low diffuse reflectance components. We assess the neural phase unwrapping network extensively on real and simulated data, and validate that it outperforms existing conventional and learned unwrapping approaches across all tested scenarios. We further validate precision and compare extensively against post-photoconversion modulation, which fails in low-flux scenarios, and interferometric approaches, which are limited to small ranges. As our free-space modulation is all-optical, we demonstrate that it can be readily combined with interferometric modulation, allowing us to narrow the gap between interferometry and correlation ToF imaging, with the future potential for photon-efficient imaging of macro-scale ultrafast phenomena.
Specifically, we make the following contributions in this work:
— We introduce computational ToF imaging with fully optical free-space correlation and an EOM-based two-pass intensity modulation that allows for \(\ge\)10 GHz frequencies.
— To tackle phase unwrapping at centimeter wavelengths, we introduce a segmentation-based phase unwrapping network that poses phase recovery as a classification problem.
— We validate the proposed method experimentally with a prototype, achieving robust depth imaging with free-space centimeter-wave modulation for macroscopic scenes.
To ensure reproducibility, we will share the schematics, code, and optical design of the proposed method.
2 Related Work
In this section, we seek to give the reader a broad overview of the current state of depth imaging to better illustrate the gap our work fills in the 3D vision ecosystem.
Depth Imaging. The wide family of modern depth imaging methods can be broadly categorized into passive and active systems. Passive approaches, which leverage solely image cues such as parallax [Hirschmuller 2005; Baek et al. 2016; Meuleman et al. 2020] or defocus [Subbarao and Surya 1994], can offer low-cost depth estimation solutions using commodity camera hardware [Garg et al. 2019]. Their reliance on visual features, however, means they struggle to achieve sub-cm accuracy even in favorable conditions, and can fail catastrophically for complex scene geometries and textureless regions [Smolyanskiy et al. 2018]. Active methods, which first project a known signal into the scene before attempting to recover depth, can reduce this reliance on visual features. For example, structured light approaches, such as those used in the Kinect V1 and Intel D415 depth cameras, improve local image contrast with active illumination patterns [Baek and Heide 2021; Scharstein and Szeliski 2003; Ahuja and Abbott 1993], at a detriment to form factor and power consumption. Even active stereo methods, however, still cannot disambiguate mm-scale features, as these are smaller than the illumination feature size itself and make finding accurate stereo correspondences infeasible. ToF imaging is an active method that does not rely on visual cues, and so avoids the pitfalls of stereo matching completely. ToF cameras instead directly or indirectly measure the travel time of light to infer distances [Lange and Seitz 2001; Hansard et al. 2012], with modern continuous-wave correlation ToF systems achieving sub-cm accuracy at megahertz-scale modulation frequencies. Interferometry extends this principle to the terahertz range, measuring the interference of electromagnetic waves to estimate their travel time. These systems can achieve micron-scale accuracy at the cost of mm-scale operating ranges [Hariharan 2003]. In this work, we seek to bridge the gap between commodity MHz-frequency correlation ToF systems and THz-frequency interferometry with a GHz-frequency correlation ToF system for meter-scale imaging.
Pulsed ToF. Pulsed ToF systems, such as LiDAR, are direct ToF acquisition methods, which directly measure the travel time of photon packets to infer depth. They send discrete laser pulses into the scene and detect their reflections with avalanche photodiodes [Cova et al. 1996; Pandey et al. 2011] or single-photon detectors [McCarthy et al. 2009; Heide et al. 2018; Gupta et al. 2019a, 2019b]. These sensors can extract depth from measured pulse returns without phase wrap ambiguities. Their depth precision is limited by their temporal resolution, however, and the complex detectors and narrow-band filters used to reject ambient light contend with high cost as a result of fabrication complexity when compared to conventional intensity sensors. Recently, low-cost pulsed sensors have appeared, however at the cost of a coarse 20 cm depth precision [Callenberg et al. 2021]. In this work, we revisit indirect ToF with amplitude modulation paired with learned phase unwrapping as an approach to precise depth imaging that does not mandate time-resolved sensors and time-tagging electronics.
Correlation ToF. Amplitude-modulated continuous-wave ToF, which we refer to simply as correlation ToF, floods the scene with periodically modulated illumination and infers distance from phase differences in the returned light [Lange and Seitz 2001; Remondino and Stoppa 2013; Ringbeck 2007]. These systems, such as the cameras in the prolific Microsoft Kinect series [Tölgyessy et al. 2021], can rely on affordable CMOS sensors and conventional CW laser diodes to produce dense depth measurements. This flood illumination can lead to multipath interference, though there exists a large body of work to mitigate this [Achar et al. 2017; Fuchs 2010; Freedman et al. 2014; Kirmani et al. 2013; Bhandari et al. 2014; Kadambi et al. 2013; Jiménez et al. 2014; Naik et al. 2015]. Correlation ToF measurements can also be used to resolve the travel time of light in flight [Heide et al. 2013; Kadambi et al. 2013]. These time-resolved transient images have found a number of emerging applications, such as non-line-of-sight imaging [Heide et al. 2014; Kadambi et al. 2016], imaging through scattering media [Heide et al. 2014], and material classification [Su et al. 2016], which have also been solved with pulsed ToF systems [O’Toole et al. 2018; Heide et al. 2019] and interferometric methods [Gkioulekas et al. 2015]. All these methods, however, are restricted to modulation frequencies of only a few hundred MHz due to the photon absorption depth in silicon [Lange and Seitz 2001], which governs how these devices perform photo-electric conversion. This limit places the depth resolution of modern correlation ToF sensors at mm- to cm-scale for operating ranges of up to several meters. Previous attempts at pushing this modulation frequency to the GHz regime struggle with low modulation contrast due to the energy loss from fiber coupling within eye-safe laser power levels [Kadambi and Raskar 2017; Li et al. 2018]. Li et al. [2018] overcome some of these limitations but rely solely on interferometric modulation, making the method susceptible to speckle, vibration, laser frequency drift, and other common interferometry errors. Notably, Bamji et al. [2018] achieve a 200 MHz modulation frequency at high contrast but are limited to single-frequency modulation. Gupta et al. [2018] achieve a 500 MHz modulation frequency with a fast photodiode and analog radio-frequency (RF) modulation, but contend with low modulation contrast in the GHz regime due to modulation after photo-conversion.
Interferometry and Frequency-Modulated Continuous-Wave ToF. Optical interferometry leverages the interference of electromagnetic waves to infer their path lengths, which are encoded in the measured amplitude and/or phase patterns. A detailed review of interferometry can be found in [Hariharan 2003]. Methods such as optical coherence tomography (OCT) [Huang et al. 1991] have found prolific use in biomedical applications [Fujimoto and Swanson 2016] for their ability to resolve micron-scale features in optically scattering media. This, however, comes with the caveat of an mm-scale operating range, as diffuse scattering leads to a sharp decline in SNR. In graphics, OCT approaches have been successfully employed to achieve micron-scale light transport decompositions [Gkioulekas et al. 2015] and light transport probing [Kotwal et al. 2020]. Fourier-domain OCT systems mitigate some of the sensitivity to vibration by using a spectrometer and a broadband light source [Leitgeb et al. 2003]. While these methods provide high temporal resolution, they are also limited to cm-scale scenes.
Frequency-modulated continuous-wave (FMCW) ToF systems employ an alternative interferometric approach to measuring distance. These methods continuously apply frequency modulation to their output illumination, which, when combined in a waveguide with the delayed light returned from the scene, produces constructive and destructive interference patterns from which travel time (and thereby depth) can be inferred. Experimental FMCW LiDAR setups can achieve millimeter precision for scenes at decimeter range [Behroozpour et al. 2016], but require complex tunable laser systems [Sandborn et al. 2016; Amann 1992; Gao and Hui 2012]. We instead revisit continuous-wave intensity modulation, which allows us to use conventional continuous-wave lasers modulated and demodulated in free space.
Phase Unwrapping. In correlation ToF systems, the analog correlation signal can experience phase shifts of more than one wavelength. To recover the true phase, and thereby accurately reconstruct depth, phase unwrapping algorithms are required [Dorrington et al. 2011; Crabb and Manduchi 2015; Lawin et al. 2016; An et al. 2016]. Single-frequency phase unwrapping approaches are only able to recover relative depth, and require a priori assumptions to estimate scale [Crabb and Manduchi 2015; Ghiglia and Pritt 1998; Bioucas-Dias and Valadao 2007; Bioucas-Dias et al. 2008]. Multi-frequency phase unwrapping methods overcome this limitation by unwrapping high-frequency phases with their lower-frequency counterparts. The wrap count is recovered either by weighing Euclidean division candidates [Bioucas-Dias et al. 2009; Droeschel et al. 2010; Kirmani et al. 2013; Freedman et al. 2014; Lawin et al. 2016] or by using a frequency-space lookup table [Gupta et al. 2015]. All of these methods, while powerful for MHz ToF imaging, fail in the presence of noise for the dozens of wrap counts observed in GHz correlation imaging. To tackle this challenge, we introduce a neural network capable of unwrapping GHz-frequency ToF correlation measurements.
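For reference, the sketch below illustrates the classical candidate-weighing strategy for two modulation frequencies: enumerate wrap counts and keep the pair whose implied depths agree best. The frequencies, target depth, and search bound are illustrative assumptions. At MHz frequencies with a handful of wraps this search is reliable, but at GHz frequencies the candidate depths crowd together and measurement noise flips the selected wrap count, which is the failure mode our network addresses.

```python
import numpy as np

C = 3e8  # speed of light (m/s)

def unwrap_two_freq(phi1, phi2, f1=100e6, f2=120e6, max_wraps=10):
    """Brute-force candidate search: find wrap counts (n1, n2) whose
    implied depths z = c * (phi + 2*pi*n) / (4*pi*f) agree best."""
    best, best_err = None, np.inf
    for n1 in range(max_wraps):
        z1 = C * (phi1 + 2 * np.pi * n1) / (4 * np.pi * f1)
        for n2 in range(max_wraps):
            z2 = C * (phi2 + 2 * np.pi * n2) / (4 * np.pi * f2)
            if abs(z1 - z2) < best_err:
                best, best_err = 0.5 * (z1 + z2), abs(z1 - z2)
    return best

# A 5 m target observed at both frequencies, phases wrapped to [0, 2*pi).
z_true = 5.0
phi1 = (4 * np.pi * 100e6 * z_true / C) % (2 * np.pi)
phi2 = (4 * np.pi * 120e6 * z_true / C) % (2 * np.pi)
print(unwrap_two_freq(phi1, phi2))  # ~5.0
```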
Electro-optic Modulators. EOMs control the refractive index of a crystal with an electric field to modulate the phase, frequency, amplitude, and polarization of incident light [Yariv and Yeh 2007]. As such, they have been employed in diverse applications, including fiber communications [Phare et al. 2015], frequency modulation spectroscopy [Tai et al. 2016], laser mode locking [Hudson et al. 2005], and optical interferometry [Minoni et al. 1991]. In particular, EOMs have been used in LiDAR systems to change the optical-carrier frequency for FMCW sensing [Behroozpour et al. 2017] or to facilitate pulsed sensing [Chen et al. 2018]. Instead, we repurpose these EOMs for continuous-wave correlation ToF imaging. We employ a two-pass modulation scheme for our ranging system that modulates intensity, rather than optical frequency, with high contrast. We combine this acquisition scheme with a neural phase unwrapping method to unwrap the dozens of phase wraps we encounter in the GHz regime.
3 Correlation ToF Imaging
In this section, we review the principles of correlation ToF imaging; for a detailed introduction, see [Lange 2000].
Image Formation. Correlation ToF cameras start by sending amplitude-modulated light into the scene,
\(p(t) = \alpha \cos (\omega _p t) + \beta ,\)  (1)
where \(\omega _p\) is the modulation frequency, \(\alpha\) is the amplitude, and \(\beta\) is a DC offset. After traveling through the scene and reflecting off a target, the measured return signal
\(\tilde{p}(t-\tau) = \tilde{\alpha }\cos (\omega _p t - \phi) + \tilde{\beta }\)  (2)
is \(p(t)\) delayed by the travel time \(\tau\), with an observed attenuation in amplitude \(\tilde{\alpha }\), a shift in bias \(\tilde{\beta }\), and a phase shift \(\phi = \omega _p \tau\) induced by the time delay. This measured signal is then correlated with a reference
\(r(t) = \cos (\omega _r t + \psi),\)  (3)
where \(\omega _r\) and \(\psi\) are the demodulation frequency and phase, respectively. In existing multi-bucket imagers, this correlation occurs during exposure via photonic mixer device pixels [Lange and Seitz 2001; Foix et al. 2011], which are modulated according to the reference function \(r(t)\). When we modulate and demodulate at the same frequency, that is, \(\omega _p = \omega _r = \omega\), this is called homodyne imaging. Integrating this signal over the exposure time \(T\), we get a correlation measurement
\(C_\psi = \int _0^T \tilde{p}(t-\tau)\, r(t)\, dt \approx \frac{T\tilde{\alpha }}{2} \cos (\phi + \psi) + K,\)  (4)
where \(K\) is a general constant offset, meant to model a non-zero modulation bias on the sensor. Given this measurement, we aim to estimate the phase delay \(\phi\), from which the scene depth can be computed. As illustrated in Figure 1(b), the correlation measurement \(C_\psi\) is a constant that depends on the demodulation phase offset \(\psi\), achieving its maximum when \(\phi + \psi\) is a multiple of \(2\pi\). In practice, this means we never have to explicitly sample \(\tilde{p}(t-\tau)\), which would require expensive ultrafast detectors and modulation electronics. Although the correlation measurement \(C_\psi\) does not directly give us access to the true phase \(\phi\), by sampling this function for multiple demodulation phase offsets \(\psi\) we can make use of Fourier analysis to discern the true phase \(\phi\). Existing correlation imagers typically acquire four equally spaced correlation measurements at \(\psi \in \lbrace 0,\, \pi /2, \, \pi , \, 3\pi /2\rbrace\). Using these, we can estimate the phase offset \(\hat{\phi }\) wrapped to the \(2\pi\) range as \(\hat{\phi }= \arctan ({\frac{C_{3\pi /2}-C_{\pi /2}}{C_{0}-C_{\pi }}})\). Phase unwrapping amounts to estimating the integer factor \(n\) to recover the unwrapped phase \(\phi =\hat{\phi }+ 2\pi n\). If successful, we can convert this phase estimate \(\phi\) to depth as \(z=\phi c/4\pi \omega _p\), where \(c\) is the speed of light.
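For concreteness, below is a minimal numpy sketch of this four-tap decoding under the correlation model of Equation (4), with noiseless samples and an oracle wrap count standing in for the unwrapping step; the modulation frequency and signal constants are illustrative assumptions.

```python
import numpy as np

C = 3e8     # speed of light (m/s)
FREQ = 7e9  # modulation frequency (Hz)

def correlation(z, psi, amp=1.0, offset=0.1):
    """Correlation sample C_psi ~ (amp/2) * cos(phi + psi) + K, per Eq. (4)."""
    phi = 4 * np.pi * FREQ * z / C  # round-trip phase delay
    return 0.5 * amp * np.cos(phi + psi) + offset

z_true = 1.234  # meters
C0, C90, C180, C270 = (correlation(z_true, psi)
                       for psi in (0, np.pi / 2, np.pi, 3 * np.pi / 2))

# Four-tap phase estimate; arctan2 resolves the quadrant, then wrap to [0, 2*pi).
phi_hat = np.arctan2(C270 - C90, C0 - C180) % (2 * np.pi)

# With the correct wrap count n, depth follows from z = phi * c / (4*pi*f).
n = np.floor(4 * np.pi * FREQ * z_true / C / (2 * np.pi))  # oracle wrap count
z_hat = (phi_hat + 2 * np.pi * n) * C / (4 * np.pi * FREQ)
print(z_hat)  # ~1.234
```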
Modulation Frequency. As we noted earlier in Equation (2), the round-trip path of the amplitude-modulated illumination imparts on it a phase shift \(\phi\). Setting \(t=0\), \(\tilde{\beta } = 0\), and \(\omega =100\) MHz (a common modulation frequency in conventional ToF cameras) in Equation (2), we observe a 0.0009% signal difference for a 1 mm change in depth \(z\); see Figure 2. This means that, with realistic imaging noise and quantization in existing sensors, we would practically not be able to discern millimeter-scale features on object surfaces at this modulation frequency. To achieve higher precision, we move to higher frequencies: the same experiment repeated for \(\omega =8\) GHz leads to a more detectable 5.6% difference in signal amplitude. In practice, there are many factors that affect signal contrast, which we explore in the remainder of this work.
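These sensitivity figures can be verified in a few lines. The sketch below evaluates the change of the normalized return signal of Equation (2) at \(t=0\) and \(\tilde{\beta }=0\) for a 1 mm depth step, assuming the step starts from zero phase, which reproduces the reported numbers.

```python
import numpy as np

C = 3e8  # speed of light (m/s)

def signal_change(freq, dz=1e-3):
    """Relative change of cos(phi) in Eq. (2) for a depth step dz from z=0."""
    dphi = 4 * np.pi * freq * dz / C  # round-trip phase change
    return abs(np.cos(dphi) - np.cos(0.0))

print(f"{signal_change(100e6):.2e}")  # ~8.8e-06, i.e., ~0.0009%
print(f"{signal_change(8e9):.3f}")    # ~0.056,  i.e., ~5.6%
```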
9 Discussion
We have introduced a computational imaging method that presents a complementary direction to existing ToF methods. Specifically, we have jointly designed the optics, sensing, and neural network reconstruction such that computation that is typically done on the sensor, or digitally after sensing, is executed optically on the incident photon stream. In doing so, we introduce concepts from electro-optic modulation in optics to the graphics and imaging community, while devising a new method for two-pass modulation and a new method for unwrapping high-frequency phase measurements. Although we validate experimentally and synthetically that our system performs effective centimeter-wave ToF depth imaging, as a nascent technology, our work also leaves the reader with some open questions regarding its future, which we discuss below.
Implementing Array Sensors. We have opted for sequential point-wise scanning using a galvo system, as the beam diameter passing through our EOMs is limited by the EOM’s small active area of 2.5 \(\times\) 2.5 mm. An alternative implementation, requiring further engineering effort, is the use of telescope optics to spatially expand the EOM-modulated light, exploiting the fact that correlation ToF only mandates global intensity modulation instead of per-pixel intensity control; see also [Kim et al. 2019].
Flood Illumination and Multi-Path Interference. As our prototype performs point-wise scanning, direct reflection dominates the measurement, which mitigates multi-path interference. When implementing the proposed system with flood illumination in the future, using 2D array sensing with large-area EOMs, retraining the network on flood-illuminated data is an immediate potential solution to multi-path interference. We also note that the proposed high-frequency modulation may already provide sufficient robustness to the multi-path problem [Gupta et al. 2015].
Generalization to Complex Geometry and Reflectance. For scenes with simple shapes, such as the moving planar targets in Figure 12 and the gage blocks and slanted plane in Figure 13, we demonstrate micron-scale depth resolution. In the future, we hope this approach can be extended to resolve micron-scale features in more complex scenes. While the proposed method outperforms previous methods for complex macroscopic scenes, capturing accurate depth still proves challenging; local geometries induce phase noise as the angled light beams are integrated over uneven depths. In the future, narrow-beam sampling or flood-illuminated setups with array sensing might be a hardware solution to this challenge. In addition, more accurate ground-truth sensing in the fine-tuning step might also overcome this domain-gap issue in the neural network reconstruction.
Phase Unwrapping and Denoising. The proposed neural unwrapping method exploits the ordinal nature of the wrap counts and segmentation-based image semantics to recover dozens of wrap counts, while existing methods fail for more than a handful. While this approach shares some similarities with denoising in that we want to recover clean phase measurements from noisy readings, it does so in a joint manner. Rather than performing denoising and unwrapping sequentially, the proposed network ingests both correlation measurements simultaneously and can use their joint information—and independent noise distributions—to inform unwrapping. In this way, we avoid accidentally denoising phase measurements into the wrong wrap count bin.