1 Introduction
From interactive gaming to precision industrial manufacturing, depth sensors have enabled advances in a broad set of consumer and research applications. Their ability to recover 3D data at scale [Silberman et al. 2012; Chang et al. 2015; Dai et al. 2017] and produce high-fidelity scene reconstructions [Izadi et al. 2011; Tulsiani et al. 2018] drives developments in 3D scene understanding [Dai et al. 2018; Song et al. 2015; Hickson et al. 2014], which in turn influence the fields of augmented reality, virtual reality, robotic scanning, autonomous vehicle guidance, and path planning for delivery drones.
Some of the most successful depth acquisition approaches for wide operating ranges are based on active time-of-flight sensing, as they offer high depth precision at a small sensor-illumination baseline [Hansard et al. 2012]. Passive approaches, which infer distance from parallax [Subbarao and Surya 1994; Mahjourian et al. 2018] or visual cues in monocular images [Bhat et al. 2021; Saxena et al. 2005], do not offer the same range and depth precision, as they struggle with textureless regions and complex geometries [Smolyanskiy et al. 2018; Lazaros et al. 2008]. Active sensing approaches tackle this challenge by projecting light into the scene and reconstructing depth from the returned signal. Structured light methods such as active stereo systems use spatially patterned light to aid stereo matching [Ahuja and Abbott 1993; Baek and Heide 2021]. While robust to textureless scenes, their accuracy is limited by illumination pattern density and sensor baseline, resulting in a large form factor.
Time-of-flight (ToF) depth sensing approaches avoid these limitations by estimating depth from the travel time of photons leaving from and returning to the device, allowing for co-axial sensor setups with virtually no illumination-camera baseline. Direct ToF systems, such as light detection and ranging (LiDAR) sensors [Schwarz 2010], directly measure the round-trip time of emitted light pulses to estimate point depths, and can theoretically provide accuracy over a long range. However, this direct acquisition approach demands fast pulsed lasers, accurate synchronization, narrow-band filters, and picosecond-resolution time-tagged detectors such as single-photon avalanche diodes (SPADs) [Aull et al. 2002; Bronzi et al. 2015; Niclass et al. 2005; Rochas et al. 2003]. Though affordable SPADs have recently entered the market, they offer only 20 cm depth resolution [Callenberg et al. 2021], more than \(50{\times }\) lower than their costly picosecond-resolution counterparts.
Amplitude-modulated continuous-wave (AMCW) ToF methods, which we hereon refer to as correlation ToF methods [Lange and Seitz 2001; Su et al. 2018; Shrestha et al. 2016; Gupta et al. 2015], flood a scene with periodic amplitude-modulated light and indirectly infer depth from the phase shift of the returned light. In contrast to direct ToF sensing approaches, this modulation and correlation does not require ultra-short pulse generation and time-tagging, which lowers sensor and laser complexity requirements. Correlation ToF sensors that demodulate the amplitude-modulated flash illumination on-sensor have been widely adopted, for example, in the Microsoft Kinect One camera. These sensors implement multiple charge buckets per pixel and shift a photo-electron to an individual bucket by applying an electrical potential between the individual quantum wells [Lange and Seitz 2001]. Though amplitude modulation allows for depth precision comparable to picosecond-pulsed direct ToF at meter-scale distances, while remaining low-cost thanks to scalable CMOS technology, it is also this sensing mode that fundamentally limits the sensor. Specifically, modulation after photo-electric conversion limits the maximum achievable modulation frequency to a few hundred MHz in practice, restricted by the photon absorption depth in silicon [Lange and Seitz 2001]. This has limited the depth precision of existing correlation ToF sensors to the sub-centimeter regime. Fiber-coupled modulation approaches from optical communication that bypass this limit suffer from low modulation contrast due to coupling loss [Kadambi and Raskar 2017; Rogers et al. 2021; Marchetti et al. 2017; Bandyopadhyay et al. 2020].
In this work, we co-opt free-space electro-optic modulators (EOMs) from optical communication and combine them with a phase unwrapping neural network to build a GHz correlation ToF system. EOM-based ranging systems are known to offer fast intensity modulation and can be integrated with conventional intensity sensors and a continuous-wave laser, bypassing the more complex hardware requirements of time-tagged ToF devices [Froome and Bradsell 1961]. Inspired by existing EOM-based ranging methods, we devise a two-pass EOM-based GHz ToF sensing system that achieves a 7 GHz modulation frequency with \(\gt 50\%\) contrast. Our system inherits the benefits of EOM-based systems, namely large-area free-space modulation and single-digit driving voltages, while using conventional intensity sensors and continuous-wave lasers.
Although a higher modulation frequency can increase phase contrast and allow for more precise depth measurement, it also greatly complicates the task of phase unwrapping, a major obstacle in applying EOMs to depth sensing. At 7 GHz, even a 2 cm depth change results in a phase wrap, in contrast to 3 m of unambiguous depth for a 100 MHz ToF camera. In addition to a few dozen wraps, imaging noise and the small modulation bandwidth of EOMs, only a few MHz, impose a further challenge for conventional look-up table approaches. We tackle this challenge with a segmentation-inspired neural phase unwrapping network, where the problem is decomposed into ordinal classification, mapping regions of measured data to their wrap count. Trained end-to-end on simulated ToF data and fine-tuned on a small set of experimental measurements, the proposed network exploits the correlation of adjacent measurements to robustly unwrap them.
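To make this formulation concrete, below is a minimal PyTorch sketch of unwrapping posed as per-pixel classification over wrap counts. The layer widths, the wrap-count bound, and the single-scale encoder-decoder are illustrative assumptions for exposition, not the exact architecture used in this work; the point is that posing unwrapping as ordinal classification, rather than direct phase regression, lets segmentation-style networks and losses be reused and avoids averaging estimates across wrap boundaries.

```python
import torch
import torch.nn as nn

MAX_WRAPS = 40  # assumed upper bound on wrap counts at GHz frequencies

class NeuralUnwrapNet(nn.Module):
    """Toy segmentation-style network: maps stacked wrapped-phase
    measurements to a per-pixel distribution over wrap counts."""
    def __init__(self, in_channels=2, num_classes=MAX_WRAPS + 1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
            nn.Conv2d(32, num_classes, 1),  # per-pixel class logits
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Wrapped phase maps at two demodulation settings, stacked as channels
# (the network is untrained here; this only demonstrates the shapes).
wrapped = torch.rand(1, 2, 128, 128) * 2 * torch.pi
logits = NeuralUnwrapNet()(wrapped)           # (1, 41, 128, 128)
n = logits.argmax(dim=1)                      # per-pixel wrap count
unwrapped = wrapped[:, 0] + 2 * torch.pi * n  # phi = phi_hat + 2*pi*n
```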
We validate the proposed ToF system in simulation and experimentally, and demonstrate robust depth imaging for macroscopic diffuse scenes with free-space centimeter-wave modulation at mW laser powers, corresponding to \(\lt\)100 femtosecond temporal resolution; see Figure 1. Jointly with the learned unwrapping, the all-optical modulation without coupling losses allows for robustness to low-reflectance texture regions and highly specular objects with low diffuse reflectance components. We assess the neural phase unwrapping network extensively on real and simulated data, and validate that it outperforms existing conventional and learned unwrapping approaches across all tested scenarios. We further validate precision and compare extensively against post-photoconversion modulation, which fails in low-flux scenarios, and interferometric approaches, which are limited to small ranges. As our free-space modulation is all-optical, we demonstrate that it can be readily combined with interferometric modulation, allowing us to narrow the gap between interferometry and correlation ToF imaging, with the future potential for photon-efficient imaging of macro-scale ultrafast phenomena.
Specifically, we make the following contributions in this work:
— We introduce computational ToF imaging with fully optical free-space correlation and an EOM-based two-pass intensity modulation that allows for \(\ge\)10 GHz frequencies.
— To tackle phase unwrapping at centimeter wavelengths, we introduce a segmentation-based phase unwrapping network that poses phase recovery as a classification problem.
— We validate the proposed method experimentally with a prototype, achieving robust depth imaging with free-space centimeter-wave modulation for macroscopic scenes.
To ensure reproducibility, we will share the schematics, code, and optical design of the proposed method.
2 Related Work
In this section, we seek to give the reader a broad overview of the current state of depth imaging to better illustrate the gap our work fills in the 3D vision ecosystem.
Depth Imaging. The wide family of modern depth imaging methods can be broadly categorized into passive and active systems. Passive approaches, which leverage solely image cues such as parallax [Hirschmuller 2005; Baek et al. 2016; Meuleman et al. 2020] or defocus [Subbarao and Surya 1994], can offer low-cost depth estimation solutions using commodity camera hardware [Garg et al. 2019]. Their reliance on visual features, however, means they struggle to achieve sub-cm accuracy even in favorable conditions, and can fail catastrophically for complex scene geometries and textureless regions [Smolyanskiy et al. 2018]. Active methods, which first project a known signal into the scene before attempting to recover depth, can reduce this reliance on visual features. For example, structured light approaches, such as those used in the Kinect V1 and Intel D415 depth cameras, improve local image contrast with active illumination patterns [Baek and Heide 2021; Scharstein and Szeliski 2003; Ahuja and Abbott 1993], at a detriment to form factor and power consumption. Even active stereo methods, however, still cannot disambiguate mm-scale features, as these are smaller than the illumination feature size itself and make finding accurate stereo correspondences infeasible. ToF imaging is an active method that does not rely on visual cues, and so avoids the pitfalls of stereo matching completely. ToF cameras instead directly or indirectly measure the travel time of light to infer distances [Lange and Seitz 2001; Hansard et al. 2012], with modern continuous-wave correlation ToF systems achieving sub-cm accuracy at megahertz-scale modulation frequencies. Interferometry extends this principle to the terahertz range, measuring the interference of electromagnetic waves to estimate their travel time. These systems can achieve micron-scale accuracy at the cost of mm-scale operating ranges [Hariharan 2003]. In this work, we seek to bridge the gap between commodity MHz-frequency correlation ToF systems and THz-frequency interferometry with a GHz-frequency correlation ToF system for meter-scale imaging.
Pulsed ToF. Pulsed ToF systems, such as LiDAR, are direct ToF acquisition methods, which directly measure the travel time of photon packets to infer depth. They send discrete laser pulses into the scene and detect their reflections with avalanche photodiodes [Cova et al. 1996; Pandey et al. 2011] or single-photon detectors [McCarthy et al. 2009; Heide et al. 2018; Gupta et al. 2019a, 2019b]. These sensors can extract depth from measured pulse returns without phase wrap ambiguities. Their depth precision is limited by their temporal resolution, however, and the complex detectors and narrow-band filters used to reject ambient light contend with high cost as a result of fabrication complexity when compared to conventional intensity sensors. Recently, low-cost pulsed sensors have appeared, however at the cost of a coarse 20 cm depth precision [Callenberg et al. 2021]. In this work, we revisit indirect ToF with amplitude modulation paired with learned phase unwrapping as an approach to precise depth imaging that does not mandate time-resolved sensors and time-tagging electronics.
Correlation ToF. Amplitude-modulated continuous-wave ToF, which we refer to simply as correlation ToF, floods the scene with periodically modulated illumination and infers distance from phase differences in the returned light [Lange and Seitz 2001; Remondino and Stoppa 2013; Ringbeck 2007]. These systems, such as the cameras in the prolific Microsoft Kinect series [Tölgyessy et al. 2021], can rely on affordable CMOS sensors and conventional CW laser diodes to produce dense depth measurements. This flood illumination can lead to multipath interference, though there exists a large body of work to mitigate this [Achar et al. 2017; Fuchs 2010; Freedman et al. 2014; Kirmani et al. 2013; Bhandari et al. 2014; Kadambi et al. 2013; Jiménez et al. 2014; Naik et al. 2015]. Correlation ToF measurements can also be used to resolve the travel time of light in flight [Heide et al. 2013; Kadambi et al. 2013]. These time-resolved transient images have found a number of emerging applications, such as non-line-of-sight imaging [Heide et al. 2014; Kadambi et al. 2016], imaging through scattering media [Heide et al. 2014], and material classification [Su et al. 2016], which have also been solved with pulsed ToF systems [O’Toole et al. 2018; Heide et al. 2019] and interferometric methods [Gkioulekas et al. 2015]. All these methods, however, are restricted to modulation frequencies of only a few hundred MHz due to the photon absorption depth in silicon [Lange and Seitz 2001], which governs how these devices perform photo-electric conversion. This limit places the depth resolution of modern correlation ToF sensors at mm- to cm-scale for operating ranges of up to several meters. Previous attempts at pushing this modulation frequency to the GHz regime struggle with low modulation contrast due to the energy loss from fiber coupling within eye-safe laser power levels [Kadambi and Raskar 2017; Li et al. 2018]. Li et al. [2018] overcome some of these limitations but rely solely on interferometric modulation, making the method susceptible to speckle, vibration, laser frequency drift, and other common interferometry errors. Notably, Bamji et al. [2018] achieve a 200 MHz modulation frequency at high contrast but are limited to single-frequency modulation. Gupta et al. [2018] achieve a 500 MHz modulation frequency with a fast photodiode and analog radio-frequency (RF) modulation, but contend with low modulation contrast in the GHz regime due to modulation after photo-conversion.
Interferometry and Frequency-Modulated Continuous-Wave ToF. Optical interferometry leverages the interference of electromagnetic waves to infer their path lengths, which are encoded in the measured amplitude and/or phase patterns. A detailed review of interferometry can be found in [Hariharan 2003]. Methods such as optical coherence tomography (OCT) [Huang et al. 1991] have found prolific use in biomedical applications [Fujimoto and Swanson 2016] for their ability to resolve micron-scale features in optically scattering media. This, however, comes with the caveat of an mm-scale operating range, as diffuse scattering leads to a sharp decline in SNR. In graphics, OCT approaches have been successfully employed to achieve micron-scale light transport decompositions [Gkioulekas et al. 2015] and light transport probing [Kotwal et al. 2020]. Fourier-domain OCT systems mitigate some of the sensitivity to vibration by using a spectrometer and a broadband light source [Leitgeb et al. 2003]. While these methods provide high temporal resolution, they are also limited to cm-scale scenes.
Frequency-modulated continuous-wave (FMCW) ToF systems employ an alternative interferometric approach to measuring distance. These methods continuously apply frequency modulation to their output illumination, which, when combined in a waveguide with the delayed light returned from the scene, produces constructive and destructive interference patterns from which travel time (and thereby depth) can be inferred. Experimental FMCW LiDAR setups can achieve millimeter precision for scenes at decimeter range [Behroozpour et al. 2016], but require complex tunable laser systems [Sandborn et al. 2016; Amann 1992; Gao and Hui 2012]. We instead revisit continuous-wave intensity modulation, which allows us to use conventional continuous-wave lasers modulated and demodulated in free space.
Phase Unwrapping. In correlation ToF systems, the analog correlation signal can experience phase shifts of more than one wavelength. To recover the true phase, and thereby accurately reconstruct depth, phase unwrapping algorithms are required [Dorrington et al. 2011; Crabb and Manduchi 2015; Lawin et al. 2016; An et al. 2016]. Single-frequency phase unwrapping approaches are only able to recover relative depth, and require a priori assumptions to estimate scale [Crabb and Manduchi 2015; Ghiglia and Pritt 1998; Bioucas-Dias and Valadao 2007; Bioucas-Dias et al. 2008]. Multi-frequency phase unwrapping methods overcome this limitation by unwrapping high-frequency phases with their lower-frequency counterparts. The wrap count is recovered either by weighing Euclidean division candidates [Bioucas-Dias et al. 2009; Droeschel et al. 2010; Kirmani et al. 2013; Freedman et al. 2014; Lawin et al. 2016] or by using a frequency-space lookup table [Gupta et al. 2015]. All of these methods, while powerful for MHz ToF imaging, fail in the presence of noise for the dozens of wrap counts observed in GHz correlation imaging. To tackle this challenge, we introduce a neural network capable of unwrapping GHz-frequency ToF correlation measurements.
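For reference, the sketch below illustrates the classical candidate-weighing strategy for two modulation frequencies: enumerate wrap counts and keep the pair whose implied depths agree best. The frequencies, target depth, and search bound are illustrative assumptions. At MHz frequencies with a handful of wraps this search is reliable, but at GHz frequencies the candidate depths crowd together and measurement noise flips the selected wrap count, which is the failure mode our network addresses.

```python
import numpy as np

C = 3e8  # speed of light (m/s)

def unwrap_two_freq(phi1, phi2, f1=100e6, f2=120e6, max_wraps=10):
    """Brute-force candidate search: find wrap counts (n1, n2) whose
    implied depths z = c * (phi + 2*pi*n) / (4*pi*f) agree best."""
    best, best_err = None, np.inf
    for n1 in range(max_wraps):
        z1 = C * (phi1 + 2 * np.pi * n1) / (4 * np.pi * f1)
        for n2 in range(max_wraps):
            z2 = C * (phi2 + 2 * np.pi * n2) / (4 * np.pi * f2)
            if abs(z1 - z2) < best_err:
                best, best_err = 0.5 * (z1 + z2), abs(z1 - z2)
    return best

# A 5 m target observed at both frequencies, phases wrapped to [0, 2*pi).
z_true = 5.0
phi1 = (4 * np.pi * 100e6 * z_true / C) % (2 * np.pi)
phi2 = (4 * np.pi * 120e6 * z_true / C) % (2 * np.pi)
print(unwrap_two_freq(phi1, phi2))  # ~5.0
```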
Electro-optic Modulators. EOMs control the refractive index of a crystal with an electric field to modulate the phase, frequency, amplitude, and polarization of incident light [Yariv and Yeh 2007]. As such, they have been employed in diverse applications, including fiber communications [Phare et al. 2015], frequency modulation spectroscopy [Tai et al. 2016], laser mode locking [Hudson et al. 2005], and optical interferometry [Minoni et al. 1991]. In particular, EOMs have been used in LiDAR systems to change the optical-carrier frequency for FMCW sensing [Behroozpour et al. 2017] or to facilitate pulsed sensing [Chen et al. 2018]. Instead, we repurpose these EOMs for continuous-wave correlation ToF imaging. We employ a two-pass modulation scheme for our ranging system that modulates intensity, rather than optical frequency, with high contrast. We combine this acquisition scheme with a neural phase unwrapping method to unwrap the dozens of phase wraps we encounter in the GHz regime.
3 Correlation ToF Imaging
In this section, we review the principles of correlation ToF imaging; for a detailed introduction, see [Lange 2000].
Image Formation. Correlation ToF cameras start by sending amplitude-modulated light into the scene,
\(p(t) = \alpha \cos (\omega _p t) + \beta ,\)  (1)
where \(\omega _p\) is the modulation frequency, \(\alpha\) is the amplitude, and \(\beta\) is a DC offset. After traveling through the scene and reflecting off a target, the measured return signal
\(\tilde{p}(t-\tau) = \tilde{\alpha }\cos (\omega _p t - \phi) + \tilde{\beta }\)  (2)
is \(p(t)\) delayed by the travel time \(\tau\), with an observed attenuation in amplitude \(\tilde{\alpha }\), a shift in bias \(\tilde{\beta }\), and a phase shift \(\phi = \omega _p \tau\) induced by the time delay. This measured signal is then correlated with a reference
\(r(t) = \cos (\omega _r t + \psi),\)  (3)
where \(\omega _r\) and \(\psi\) are the demodulation frequency and phase, respectively. In existing multi-bucket imagers, this correlation occurs during exposure via photonic mixer device pixels [Lange and Seitz 2001; Foix et al. 2011], which are modulated according to the reference function \(r(t)\). When we modulate and demodulate at the same frequency, that is, \(\omega _p = \omega _r = \omega\), this is called homodyne imaging. Integrating this signal over the exposure time \(T\), we get a correlation measurement
\(C_\psi = \int _0^T \tilde{p}(t-\tau)\, r(t)\, dt \approx \frac{T\tilde{\alpha }}{2} \cos (\phi + \psi) + K,\)  (4)
where \(K\) is a general constant offset, meant to model a non-zero modulation bias on the sensor. Given this measurement, we aim to estimate the phase delay \(\phi\), from which the scene depth can be computed. As illustrated in Figure 1(b), the correlation measurement \(C_\psi\) is a constant that depends on the demodulation phase offset \(\psi\), achieving its maximum when \(\phi + \psi\) is a multiple of \(2\pi\). In practice, this means we never have to explicitly sample \(\tilde{p}(t-\tau)\), which would require expensive ultrafast detectors and modulation electronics. Although the correlation measurement \(C_\psi\) does not directly give us access to the true phase \(\phi\), by sampling this function for multiple demodulation phase offsets \(\psi\) we can make use of Fourier analysis to discern the true phase \(\phi\). Existing correlation imagers typically acquire four equally spaced correlation measurements at \(\psi \in \lbrace 0,\, \pi /2, \, \pi , \, 3\pi /2\rbrace\). Using these, we can estimate the phase offset \(\hat{\phi }\) wrapped to the \(2\pi\) range as \(\hat{\phi }= \arctan ({\frac{C_{3\pi /2}-C_{\pi /2}}{C_{0}-C_{\pi }}})\). Phase unwrapping amounts to estimating the integer factor \(n\) to recover the unwrapped phase \(\phi =\hat{\phi }+ 2\pi n\). If successful, we can convert this phase estimate \(\phi\) to depth as \(z=\phi c/4\pi \omega _p\), where \(c\) is the speed of light.
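For concreteness, below is a minimal numpy sketch of this four-tap decoding under the correlation model of Equation (4), with noiseless samples and an oracle wrap count standing in for the unwrapping step; the modulation frequency and signal constants are illustrative assumptions.

```python
import numpy as np

C = 3e8     # speed of light (m/s)
FREQ = 7e9  # modulation frequency (Hz)

def correlation(z, psi, amp=1.0, offset=0.1):
    """Correlation sample C_psi ~ (amp/2) * cos(phi + psi) + K, per Eq. (4)."""
    phi = 4 * np.pi * FREQ * z / C  # round-trip phase delay
    return 0.5 * amp * np.cos(phi + psi) + offset

z_true = 1.234  # meters
C0, C90, C180, C270 = (correlation(z_true, psi)
                       for psi in (0, np.pi / 2, np.pi, 3 * np.pi / 2))

# Four-tap phase estimate; arctan2 resolves the quadrant, then wrap to [0, 2*pi).
phi_hat = np.arctan2(C270 - C90, C0 - C180) % (2 * np.pi)

# With the correct wrap count n, depth follows from z = phi * c / (4*pi*f).
n = np.floor(4 * np.pi * FREQ * z_true / C / (2 * np.pi))  # oracle wrap count
z_hat = (phi_hat + 2 * np.pi * n) * C / (4 * np.pi * FREQ)
print(z_hat)  # ~1.234
```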
Modulation Frequency. As we noted earlier in Equation (2), the round-trip path of the amplitude-modulated illumination imparts on it a phase shift \(\phi\). Setting \(t=0\), \(\tilde{\beta } = 0\), and \(\omega =100\) MHz (a common modulation frequency in conventional ToF cameras) in Equation (2), we observe a 0.0009% signal difference for a 1 mm change in depth \(z\); see Figure 2. This means that, with realistic imaging noise and quantization in existing sensors, we would practically not be able to discern millimeter-scale features on object surfaces at this modulation frequency. To achieve higher precision, we move to higher frequencies: the same experiment repeated for \(\omega =8\) GHz leads to a more detectable 5.6% difference in signal amplitude. In practice, there are many factors that affect signal contrast, which we explore in the remainder of this work.
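These sensitivity figures can be verified in a few lines. The sketch below evaluates the change of the normalized return signal of Equation (2) at \(t=0\) and \(\tilde{\beta }=0\) for a 1 mm depth step, assuming the step starts from zero phase, which reproduces the reported numbers.

```python
import numpy as np

C = 3e8  # speed of light (m/s)

def signal_change(freq, dz=1e-3):
    """Relative change of cos(phi) in Eq. (2) for a depth step dz from z=0."""
    dphi = 4 * np.pi * freq * dz / C  # round-trip phase change
    return abs(np.cos(dphi) - np.cos(0.0))

print(f"{signal_change(100e6):.2e}")  # ~8.8e-06, i.e., ~0.0009%
print(f"{signal_change(8e9):.3f}")    # ~0.056,  i.e., ~5.6%
```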
9 Discussion
We have introduced a computational imaging method that presents a complementary direction to existing ToF methods. Specifically, we have jointly designed the optics, sensing, and neural network reconstruction such that computation that is typically done on the sensor, or digitally after sensing, is executed optically on the incident photon stream. In doing so, we introduce concepts from electro-optic modulation in optics to the graphics and imaging community, while devising a new method for two-pass modulation and a new method for unwrapping high-frequency phase measurements. Although we validate experimentally and synthetically that our system performs effective centimeter-wave ToF depth imaging, as a nascent technology, our work also leaves the reader with some open questions regarding its future, which we discuss below.
Implementing Array Sensors. We have opted for sequential point-wise scanning using a galvo system, as the beam diameter passing through our EOMs is limited by the EOM’s small active area of 2.5 \(\times\) 2.5 mm. An alternative implementation, requiring further engineering effort, is the use of telescope optics to spatially expand the EOM-modulated light, exploiting the fact that correlation ToF only mandates global intensity modulation instead of per-pixel intensity control; see also [Kim et al. 2019].
Flood Illumination and Multi-Path Interference. As our prototype performs point-wise scanning, direct reflection dominates the measurement, which mitigates multi-path interference. When implementing the proposed system with flood illumination in the future, using 2D array sensing with large-area EOMs, retraining the network on flood-illuminated data is an immediate potential solution to multi-path interference. We also note that the proposed high-frequency modulation may already provide sufficient robustness to the multi-path problem [Gupta et al. 2015].
Generalization to Complex Geometry and Reflectance. For scenes with simple shapes, such as the moving planar targets in Figure 12 and the gage blocks and slanted plane in Figure 13, we demonstrate micron-scale depth resolution. In the future, we hope this approach can be extended to resolve micron-scale features in more complex scenes. While the proposed method outperforms previous methods for complex macroscopic scenes, capturing accurate depth still proves challenging; local geometries induce phase noise as the angled light beams are integrated over uneven depths. In the future, narrow-beam sampling or flood-illuminated setups with array sensing might be a hardware solution to this challenge. In addition, more accurate ground-truth sensing in the fine-tuning step might also overcome this domain-gap issue in the neural network reconstruction.
Phase Unwrapping and Denoising. The proposed neural unwrapping method exploits the ordinal nature of the wrap counts and segmentation-based image semantics to recover dozens of wrap counts, while existing methods fail for more than a handful. While this approach shares some similarities with denoising in that we want to recover clean phase measurements from noisy readings, it does so in a joint manner. Rather than performing denoising and unwrapping sequentially, the proposed network ingests both correlation measurements simultaneously and can use their joint information—and independent noise distributions—to inform unwrapping. In this way, we avoid accidentally denoising phase measurements into the wrong wrap count bin.