1 Introduction
There is a spectrum of approaches to novel view synthesis, from local ray interpolation to image-based rendering to global radiance fields. Local ray interpolation methods, popularized with light fields or lumigraphs, interpolate novel ray colors from input photographs via a two-plane ray parameterization [Gortler et al. 1996; Levoy and Hanrahan 1996]. For accurate scene reproduction, local ray interpolation methods require capturing the light field with a high spatio-angular resolution. To reduce this capture burden, unstructured lumigraphs use geometric proxies to guide the selection of corresponding rays to interpolate. Such proxies can be global [Buehler et al. 2001; Eisemann et al. 2008] or per input view [Hedman et al. 2018; 2016]. Rendering quality is determined by the accuracy of the geometric proxy, which itself must be reconstructed. Beyond interpolating input colors, we can also create novel views by reconstructing a global scene representation of color and geometry, such as a neural radiance field (NeRF) [Mildenhall et al. 2020] or a set of 3D Gaussians (3DGS) [Kerbl et al. 2023]. These representations are optimized by minimizing the difference between the input photographs and their reproduction through volumetric rendering.
One challenging visual phenomenon is reflections that appear to move with camera motion; a second is the related phenomenon of refractions through transparent objects. Many methods, including NeRF and 3DGS, often handle reflections by optimizing virtual geometry to lie ‘behind’ surfaces at the total reflected light path length. For planar reflectors, a global virtual reflection geometry provides a plausible explanation of the visual phenomenon that is consistent with all input photographs [Sinha et al. 2012]. However, for curved reflectors, it is difficult to maintain a consistent virtual geometry, resulting in blurred reflections (Fig. 1). To better model curved reflections, methods have adopted additional physical modeling through surface normal estimation and material decomposition [Verbin et al. 2022; Zhang et al. 2021b]. Such approaches may assume distant environmental lighting [Verbin et al. 2022], which limits their applicability to specific real-world scenarios. Factorizing materials increases the number of free parameters, leading to increased ambiguity during optimization [Zhang et al. 2021b]. As a result, curved reflective surfaces and transparent objects remain challenging, and producing sharp reflections is difficult.
Given this difficulty, we revisit ideas from geometry-guided ray interpolation techniques with modern neural fields, differentiable rendering, and end-to-end optimization. Interpolation-based methods can avoid the blurring that occurs in global optimization of view-dependent appearance such as reflections, because only a few similar neighboring photographs are used to produce ray colors. But the appearance still depends significantly on the proxy geometry, and consistent recovery of reflector geometry is tricky for curved reflectors. Rather than a global proxy geometry, we propose to define a local proxy geometry for each input view. As each local geometry only has to remain consistent across the small set of neighboring views used to produce a novel view, this makes it possible to represent complex curved reflectors in a ‘piecewise’ way, helping to maintain sharp reflections.
For per-view geometry, we use a density field approximated by a mixture of ten Gaussians per ray. The Gaussian parameters per ray are encoded via an MLP into a feature, which is then stored within a 2D hash grid per photo (Fig. 2). To calculate soft visibility, we use the Gaussian mixture cumulative distribution function (CDF) along the ray. During rendering, input photo colors are warped backwards from a small set of neighboring views, blended using optimized weights from an MLP, and alpha-composited. This local representation performs well in modeling curved reflectors, producing sharper reflections on objects like cars and glass busts than Ref-NeRF and 3DGS, and producing results as sharp as the global front-facing multi-plane geometry method NeX [Wizadwongsa et al. 2021] without being restricted to front-facing scenes.
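To make the soft-visibility idea concrete, the sketch below evaluates a per-ray Gaussian mixture CDF and turns it into a transmittance-style visibility. It is a minimal PyTorch illustration that assumes the accumulated CDF acts directly as opacity; all names and the exact mapping from CDF to visibility are illustrative rather than our precise formulation.

```python
import torch

def mixture_cdf(t, mu, sigma, w):
    """CDF of a 1D Gaussian mixture evaluated at ray depths t.

    t:     (S,)  sample depths along the ray
    mu:    (N,)  per-ray Gaussian means
    sigma: (N,)  per-ray Gaussian standard deviations
    w:     (N,)  mixture weights (assumed to sum to one)
    """
    z = (t[:, None] - mu[None, :]) / (sigma[None, :] * 2.0 ** 0.5)
    per_comp = 0.5 * (1.0 + torch.erf(z))          # (S, N) per-component CDFs
    return (w[None, :] * per_comp).sum(dim=-1)     # (S,) mixture CDF

def soft_visibility(t, mu, sigma, w):
    """Transmittance-style visibility derived from the mixture CDF.

    Treats the CDF value as accumulated opacity, so visibility decays
    monotonically along the ray; a hypothetical stand-in for the paper's
    visibility-aware weighting.
    """
    acc = mixture_cdf(t, mu, sigma, w).clamp(0.0, 1.0)
    return 1.0 - acc

# Toy example: ten Gaussians per ray, 192 samples along the ray.
t = torch.linspace(0.0, 1.0, 192)
mu = torch.rand(10).sort().values
sigma = torch.full((10,), 0.02)
w = torch.softmax(torch.randn(10), dim=0)
vis = soft_visibility(t, mu, sigma, w)    # (192,) soft visibility per sample
```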
In summary, our main technical contributions are as follows.
• A per-view Gaussian density mixture representation and image-based rendering approach that is well-suited to modeling high-frequency reflections for curved and transparent objects.
• An end-to-end optimization scheme with photometric and consistency losses to encourage coherence across per-view proxies, and a sparse voxel grid sampling for efficiency.
Compared with other state-of-the-art neural- and image-based rendering methods, our method can produce sharper results with high-fidelity view-dependent appearance (Fig. 1).
2 Related work
Warping Image-based Rendering. Unstructured lumigraph rendering [Buehler et al. 2001] warps and interpolates a collection of input images through a proxy geometry. Many methods rely on global geometry reconstructed from captured images [Chaurasia et al. 2011; Goesele et al. 2010; Ortiz-Cayon et al. 2015]. Global geometry can take the form of depth images, visual and opacity hulls for pixel visibility, and 3D meshes for view-dependent texturing and surface light fields [Chaurasia et al. 2013; Debevec et al. 1996; Fitzgibbon et al. 2005; Matusik et al. 2000; 2002; Wood et al. 2000]. Later methods use per-view information to improve rendering quality. For instance, Chaurasia et al. [2013] use super-pixels as constraints to derive per-pixel depth, which significantly mitigates image warping artifacts along occlusion edges. Hedman et al. [2016] reconstruct a global geometry and refine the depth map for each view to align edges between the depth channel and the RGB channels. The resulting per-view meshes effectively handle large occlusions and motion parallax. Subsequently, the DeepBlending method [Hedman et al. 2018] integrates two distinct multi-view stereo (MVS) reconstructions for per-view depth refinement. To reduce ghosting, a deep neural network blends images warped with per-view meshes. Wang et al. [2021] also combine image-based rendering with neural networks and propose a generic view interpolation method. While these methods can reproduce view-dependent effects to some extent, they struggle with specular effects because they rely on a proxy geometry with only a single surface, which often fails to represent reflections well without considering reflected rays.
Layered Reflections. We might also try to separate the input images into separate layers containing reflections [Kopf et al. 2013; Szeliski et al. 2000]. Sinha et al. [2012] estimate the foreground and reflected depth to handle planar reflections. Xu et al. [2021] explicitly reconstruct a two-layer geometry along with diffuse and reflected images. Rodriguez et al. [2020] estimate curved reflector geometry for car windows with a two-layer representation (background and car window) using reflective flows. These methods advance our ability to model reflections in IBR by introducing a reflection layer.
Layered Geometry. These representations help to handle complex occlusions and appearance. Layered depth images (LDI) [Shade et al. 1998] store scene geometry within a projective volume at a specific viewpoint. Penner et al. [2017] extend this concept by constructing projective volumes with additional depth uncertainty for captured images, leading to higher quality view synthesis at occlusion edges. Hedman et al. [2017] use two-layer color-and-depth panoramas to produce perspective views near captured viewpoints with motion parallax effects. Layered representations can also be predicted using deep neural networks, such as end-to-end deep stereo for unstructured view interpolation [Flynn et al. 2016], and deep view synthesis based on multiplane image (MPI) techniques [Li and Khademi Kalantari 2020; Mildenhall et al. 2019; Srinivasan et al. 2019; Wizadwongsa et al. 2021; Xu et al. 2019; Zhou et al. 2018]. LLFF [Mildenhall et al. 2019] and NeX360 [Phongthawee et al. 2022] extend a single MPI to multiple MPIs, where novel views are rendered by blending adjacent MPIs.
Our method connects to layered representations as we use a fixed number of Gaussians per ray. This is related to mixture density distributions in stereo matching [Tosi et al. 2021]. However, unlike an MPI that shares the same depth sampling for all pixels in each plane, our method employs individual depth sampling for each ray.
Global scenes. Many methods create a global representation via reconstruction. Neural radiance fields (NeRF) encode scene density and appearance compactly within a multi-layer perceptron (MLP) neural network, which is volume rendered to produce an image. Volume rendering has been accelerated through hash grids and direct voxel storage [Fridovich-Keil et al. 2022; Müller et al. 2022], scaled through tiling [Wu et al. 2023; 2022], and extended to dynamic scenes [Pumarola et al. 2021]. Our approach also uses 2D hash grids to encode the local Gaussian mixtures.
Some neural methods are designed for complex appearance [Ma et al. 2024; Verbin et al. 2024; Wu et al. 2024]. Ref-NeRF [Verbin et al. 2022] optimizes a spatial MLP to predict diffuse colors and surface normals and then produces specular reflections via normal-reflected rays and a directional MLP. Other inverse rendering representations incorporate surface details, normals, lighting, albedo, and bidirectional reflectance distribution functions (BRDFs) [Hasselgren et al. 2022; Laine et al. 2020; Liu et al. 2019; Munkberg et al. 2022; Yao et al. 2022; Zhang et al. 2021a; 2021b]. These approaches can be susceptible to over-fitting and are sensitive to initialization and regularization.
Points or primitives have also been studied [Kopanas et al. 2021; Lassner and Zollhöfer 2021; Yifan et al. 2019]. Kopanas et al. [2022] separate reflections among a point cloud using a neural warp field. 3D Gaussian splatting (3DGS) [Kerbl et al. 2023] uses anisotropic Gaussians with color and density as scene primitives, achieving faster rendering speed compared to NeRF-based approaches.
Our method also uses Gaussians to represent the density. However, instead of optimizing a global representation of color and density, we optimize local per-ray density only and use input views for color rendering. This scheme better renders high-frequency view-dependent appearance. Further, since our Gaussian representation is 1D along each ray, it is challenging to determine the coverage area to be splatted onto other views, so we ray cast instead of splat.
4 Implementation Details
Hash grid and network architecture. Storing the Gaussian mixture mean, standard deviation, and weight parameters requires W × H × N × 3 floats for each W × H image, which may exceed available GPU memory during optimization. To overcome this, we reduce memory use by representing the Gaussian parameters compactly. The approach uses a hash table of features unique to each image, followed by a shallow MLP shared across all features (Fig. 2) [Müller et al. 2022]. The hash table features are collectively optimized such that the Gaussian mixtures can reproduce the scene. This allows similar ray density distributions to share MLP capacity through the embedding space while allowing the image-space location of those distributions to move.
Each 2D hash grid has 16 levels, where the coarsest level resolution is 16 × 16 (e.g., in a 640 × 480 image, each cell covers 40 × 30 pixels at the coarsest level due to its 16 × 16 subdivision) and the highest resolution matches that of the input view. The feature size at each level is two, and the hash table size for each 2D hash grid is 2^16.
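A minimal sketch of this encoding pipeline is shown below, using tiny-cuda-nn's HashGrid encoding as a stand-in for the per-image 2D hash grid and a plain PyTorch MLP as the shared decoder. The parameter heads (softplus for standard deviations, softmax for weights) and the ReLU activation are illustrative assumptions, not our exact design, which uses a Gaussian activation as described below.

```python
import math
import torch
import tinycudann as tcnn  # multiresolution hash encoding [Müller et al. 2022]

N_GAUSSIANS = 10
img_w, img_h = 640, 480

# One 2D hash grid per input photo; the finest level roughly matches the
# image resolution, so per_level_scale is derived from base and finest sizes.
n_levels, base_res = 16, 16
finest_res = max(img_w, img_h)
per_level_scale = math.exp(math.log(finest_res / base_res) / (n_levels - 1))

encoding = tcnn.Encoding(
    n_input_dims=2,
    encoding_config={
        "otype": "HashGrid",
        "n_levels": n_levels,
        "n_features_per_level": 2,
        "log2_hashmap_size": 16,      # hash table size 2^16 per grid
        "base_resolution": base_res,
        "per_level_scale": per_level_scale,
    },
)

# Shallow MLP shared across all per-image grids; decodes hash features into
# per-ray Gaussian mixture parameters (mean, std, weight for each Gaussian).
decoder = torch.nn.Sequential(
    torch.nn.Linear(encoding.n_output_dims, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 3 * N_GAUSSIANS),
).cuda()

# Query a batch of rays by their normalized pixel coordinates in [0, 1]^2.
uv = torch.rand(2048, 2, device="cuda")
params = decoder(encoding(uv).float())            # (2048, 30)
mu, raw_sigma, raw_w = params.split(N_GAUSSIANS, dim=-1)
sigma = torch.nn.functional.softplus(raw_sigma)   # keep standard deviations positive
w = torch.softmax(raw_w, dim=-1)                  # normalized mixture weights
```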
We use three MLPs. The first MLP decodes features from the hash grid to produce the Gaussian mixture parameters. The second MLP decodes features from the hash grid to produce colors. Both of these MLPs have two 64-neuron layers. The third MLP produces the neural blending weight and comprises two sub-modules. The first sub-module encodes 3D points into 16-dimension features. These features are concatenated with ray directions and input to the second sub-module. Both sub-modules have two 32-neuron layers. Every MLP is Gaussian activated [Ramasinghe and Lucey 2022] with a variance of 0.01.
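As a point of reference, a Gaussian activation of this kind might look like the following; the exp(-x²/(2·variance)) form and the layer sizes in the example follow the description above, while the exact input/output dimensions of the sub-module are assumptions.

```python
import torch

class GaussianActivation(torch.nn.Module):
    """Gaussian activation exp(-x^2 / (2 * variance)), in the spirit of
    Ramasinghe and Lucey [2022]; variance 0.01 as stated in the text."""

    def __init__(self, variance: float = 0.01):
        super().__init__()
        self.variance = variance

    def forward(self, x):
        return torch.exp(-x.pow(2) / (2.0 * self.variance))

# Example: first sub-module of the blending-weight MLP, encoding 3D points
# into 16-dimensional features with two 32-neuron layers (I/O sizes assumed).
point_encoder = torch.nn.Sequential(
    torch.nn.Linear(3, 32), GaussianActivation(),
    torch.nn.Linear(32, 32), GaussianActivation(),
    torch.nn.Linear(32, 16),
)
features = point_encoder(torch.rand(2048, 3))   # (2048, 16)
```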
Baking and rendering. To accelerate rendering at the cost of memory, we can precompute and store the optimized Gaussian mixture parameters {μn, σn, ωn} for each view as a 2D grid. This eliminates the computational overhead of hashing and executing the MLP to recover these parameters. For the neural blending weights, the MLP is small enough to be stored in CUDA shared memory. For rendering, our method samples 192 points for each ray in the target view and blends density and colors for each point across 8 neighboring views.
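The baking step can be sketched as a single dense evaluation of the hash grid and decoder over the image plane; the snippet below reuses the hypothetical encoding/decoder names from the earlier sketch and is only illustrative.

```python
import torch

@torch.no_grad()
def bake_view(encoding, decoder, img_w, img_h, n_gaussians=10):
    """Evaluate the per-view hash grid + MLP once per pixel and cache the
    Gaussian mixture parameters {mu, sigma, w} as (H, W, N) grids, so hashing
    and MLP evaluation can be skipped at render time."""
    ys, xs = torch.meshgrid(
        torch.arange(img_h, device="cuda"),
        torch.arange(img_w, device="cuda"),
        indexing="ij",
    )
    uv = torch.stack([(xs + 0.5) / img_w, (ys + 0.5) / img_h], dim=-1)
    params = decoder(encoding(uv.reshape(-1, 2)).float())
    mu, raw_sigma, raw_w = params.split(n_gaussians, dim=-1)
    sigma = torch.nn.functional.softplus(raw_sigma)
    w = torch.softmax(raw_w, dim=-1)
    shape = (img_h, img_w, n_gaussians)
    return mu.reshape(shape), sigma.reshape(shape), w.reshape(shape)
```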
Ray sampling strategy. To maintain balanced gradient scales during each optimization iteration, we adopt a strategy of randomly sampling an equal number of pixels for each view.
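A hypothetical version of this balanced sampling, assuming equal-resolution input views and illustrative names, is shown below.

```python
import torch

def sample_balanced_pixels(num_views, img_w, img_h, rays_per_view):
    """Draw the same number of random pixels from every input view so that
    each view contributes equally to the gradient in an iteration."""
    view_ids = torch.arange(num_views).repeat_interleave(rays_per_view)
    xs = torch.randint(0, img_w, (num_views * rays_per_view,))
    ys = torch.randint(0, img_h, (num_views * rays_per_view,))
    return view_ids, xs, ys

# e.g., a 2,048-ray batch spread evenly over 32 views gives 64 rays per view
view_ids, xs, ys = sample_balanced_pixels(32, 640, 480, 64)
```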
Point sampling and occupancy grid. We use an occupancy grid to speed up volume point sampling. For every sample point, if the visibility-aware weight w (Eq. 10) exceeds 0.01, the voxel covering this point is marked as occupied. When processing each target ray, we begin by sampling 64 points in disparity space. Subsequently, we uniformly sample 128 points only within the occupied voxels.
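The two-stage sampling could be sketched as follows; the unit scene bound, the voxel lookup, and the function names are simplifying assumptions for illustration.

```python
import torch

def sample_ray_points(origin, direction, near, far, occupancy,
                      n_coarse=64, n_fine=128):
    """Per-ray point sampling: n_coarse samples uniform in disparity, then
    n_fine samples uniform in depth kept only inside occupied voxels.
    `occupancy` is a boolean (R, R, R) grid over an assumed [0, 1]^3 bound."""
    grid_res = occupancy.shape[0]

    # Stage 1: uniform in disparity (1/depth) between near and far.
    disp = torch.linspace(1.0 / near, 1.0 / far, n_coarse, device=origin.device)
    t_coarse = 1.0 / disp

    # Stage 2: uniform in depth, filtered by the occupancy grid.
    t_fine = torch.linspace(near, far, n_fine, device=origin.device)
    pts = origin[None, :] + t_fine[:, None] * direction[None, :]
    idx = (pts.clamp(0.0, 1.0 - 1e-6) * grid_res).long()
    keep = occupancy[idx[:, 0], idx[:, 1], idx[:, 2]]
    t_fine = t_fine[keep]

    return torch.sort(torch.cat([t_coarse, t_fine])).values
```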
Further, we adopt a strategy of gradually subdividing our occupied grid to increase its resolution [Liu et al. 2020; Wu et al. 2023]. This subdivision is performed every 1,000 iterations. We start with an initial resolution of 8^3 and progressively increase it until reaching a resolution of 512^3.
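One way to realize this schedule, assuming the resolution simply doubles at each subdivision (the growth factor is not stated in the text), is sketched below.

```python
import torch

def maybe_subdivide(occupancy, iteration, max_res=512):
    """Every 1,000 iterations, double the occupancy grid resolution until
    max_res^3 is reached; each occupied voxel marks its eight children."""
    if iteration % 1000 != 0 or occupancy.shape[0] >= max_res:
        return occupancy
    return (occupancy.repeat_interleave(2, 0)
                     .repeat_interleave(2, 1)
                     .repeat_interleave(2, 2))

occupancy = torch.ones(8, 8, 8, dtype=torch.bool)   # initial 8^3 grid
for it in range(1, 60001):
    occupancy = maybe_subdivide(occupancy, it)
    # ... re-evaluate visibility-aware weights and prune unoccupied voxels ...
```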
Viewport extension. In certain scenes, some points are only observed from the target view and so warp outside all neighboring views. During optimization, since each input view is treated as a target view, these points receive no information from neighboring views to correctly minimize the reconstruction loss, resulting in floating geometries. To address this issue, we extend each input view by 50 pixels on each side. The 2D hash feature grid is also extended accordingly. As the extended pixels have no captured colors, we reuse the 2D hash grid features to generate their colors using an additional color MLP shared across all images; these colors are optimized with respect to the target view. After optimization, the extended feature grids and the color MLP are discarded.
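A hypothetical sketch of how an extended pixel might be shaded is given below, with the 50-pixel pad from the text and illustrative names for the padded grid and the extra color MLP.

```python
import torch

PAD = 50  # viewport extension in pixels on each side

def view_color(u, v, image, padded_encoding, extra_color_mlp):
    """Return the color contributed by one input view at pixel (u, v), which
    may lie in the padded range [-PAD, W+PAD) x [-PAD, H+PAD)."""
    H, W, _ = image.shape
    if 0 <= u < W and 0 <= v < H:
        return image[int(v), int(u)]     # captured color inside the viewport
    # Extended pixel: decode a color from the padded hash grid features.
    uv = torch.tensor([[(u + PAD) / (W + 2 * PAD),
                        (v + PAD) / (H + 2 * PAD)]], device="cuda")
    return extra_color_mlp(padded_encoding(uv).float())[0]
```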
Hyperparameter configuration. Throughout all of our experiments, we optimize for 60k iterations, with each batch consisting of 2,048 rays. We use the Adam [Kingma and Ba 2014] optimizer with a learning rate that decays from 1e-3 to 1e-4. The training time for each scene is approximately 4-5 hours on a single Nvidia V100 GPU.
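In PyTorch terms, this configuration could look like the sketch below; the exponential decay curve and the reuse of the earlier encoding/decoder names are assumptions, while the learning-rate endpoints, iteration count, and batch size come from the text.

```python
import torch

params = list(encoding.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

# Decay the learning rate from 1e-3 to 1e-4 over 60k iterations.
total_iters = 60_000
gamma = (1e-4 / 1e-3) ** (1.0 / total_iters)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

for it in range(total_iters):
    optimizer.zero_grad()
    # loss = photometric_loss(batch) + consistency_loss(batch)  # 2,048-ray batch
    # loss.backward()
    optimizer.step()
    scheduler.step()
```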
6 Conclusion
We have shown that a per-view Gaussian density mixture with image-based rendering can be optimized end to end to achieve high-frequency reflections for curved and transparent objects.
Limitations and future work. As an IBR method, our results can still show some visual ‘snapping’ on curved reflections as the target view moves between different sets of neighbors. In these cases, there is a visual trade-off against global scene methods: snapping in our case versus blurring in the case of Ref-NeRF and 3DGS.
Our approach is not yet constructed for fast rendering, as it uses a volume sampling method. Each 952 × 535 view takes around 270 ms to render on an NVIDIA RTX 3090Ti, which is faster than NeRF but slower than 3DGS. One way to increase rendering speed is to reduce the number of sampled points per ray. A coarse-to-fine strategy per ray may increase speed without reducing quality.
Our method achieves its highest quality with dense scene capture, especially for complex reflections or refractions. In diffuse areas, this dense capture results in redundant duplication. But, even with the many views in our LGDM dataset, global scene representations like Ref-NeRF or 3DGS still cannot achieve as high a quality of reflections as ours; this discrepancy is worth investigating in future work.
Another direction for future work is to explore why our method reconstructs view-dependent effects better than other approaches, particularly methods that construct global geometries. We have observed that flexible local geometries and direct color blending are factors in achieving these results. Developing a rigorous evaluation framework or theory to support these findings would be valuable.