1 Introduction
There is a spectrum of approaches to novel view synthesis, from local ray interpolation to image-based rendering to global radiance fields. Local ray interpolation methods, popularized with light fields or lumigraphs, interpolate novel ray colors from input photographs via a two-plane ray parameterization [Gortler et al. 1996; Levoy and Hanrahan 1996]. For accurate scene reproduction, local ray interpolation methods require capturing the light field with a high spatio-angular resolution. To reduce this capture burden, unstructured lumigraphs use geometric proxies to guide the selection of corresponding rays to interpolate. Such proxies can be global [Buehler et al. 2001; Eisemann et al. 2008] or per input view [Hedman et al. 2018; 2016]. Rendering quality is determined by the accuracy of the geometric proxy, which itself must be reconstructed. Beyond interpolating input colors, we can also create novel views by reconstructing a global scene representation of color and geometry, such as a neural radiance field (NeRF) [Mildenhall et al. 2020] or a set of 3D Gaussians (3DGS) [Kerbl et al. 2023]. These representations are optimized by minimizing the difference between the input photographs and their reproduction through volumetric rendering.
One challenging visual phenomenon is reflections that appear to move with camera motion; a second is the related phenomenon of refractions through transparent objects. Many methods, including NeRF and 3DGS, often handle reflections by optimizing virtual geometry to lie ‘behind’ surfaces at the total reflected light path length. For planar reflectors, a global virtual reflection geometry provides a plausible explanation of the visual phenomenon that is consistent with all input photographs [Sinha et al. 2012]. However, for curved reflectors, it is difficult to maintain a consistent virtual geometry, resulting in blurred reflections (Fig. 1). To better model curved reflections, methods have adopted additional physical modeling through surface normal estimation and material decomposition [Verbin et al. 2022; Zhang et al. 2021b]. Such approaches may assume distant environmental lighting [Verbin et al. 2022], which limits their applicability to specific real-world scenarios. Factorizing materials increases the number of free parameters, leading to increased ambiguity during optimization [Zhang et al. 2021b]. As a result, curved reflective surfaces and transparent objects remain challenging, and producing sharp reflections is difficult.
Given this difficulty, we revisit ideas from geometry-guided ray interpolation techniques with modern neural fields, differentiable rendering, and end-to-end optimization. Interpolation-based methods can avoid the blurring that occurs in global optimization of view-dependent appearance such as reflections, because only a few similar neighboring photographs are used to produce ray colors. But the appearance still depends significantly on the proxy geometry, and consistent recovery of reflector geometry is tricky for curved reflectors. Rather than a global proxy geometry, we propose to define a local proxy geometry for each input view. As each local geometry only has to remain consistent across the small set of neighboring views used to produce a novel view, this makes it possible to represent complex curved reflectors in a ‘piecewise’ way, helping to maintain sharp reflections.
For per-view geometry, we use a density field approximated by a mixture of ten Gaussians per ray. The Gaussian parameters per ray are encoded via an MLP into a feature, which is then stored within a 2D hash grid per photo (Fig. 2). To calculate soft visibility, we use the Gaussian mixture cumulative distribution function (CDF) along the ray. During rendering, input photo colors are warped backwards from a small set of neighboring views, blended using optimized weights from an MLP, and alpha-composited. This local representation performs well in modeling curved reflectors, producing sharper reflections on objects like cars and glass busts than Ref-NeRF and 3DGS, and producing results as sharp as the global front-facing multi-plane geometry method NeX [Wizadwongsa et al. 2021] without being restricted to front-facing scenes.
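To make the soft-visibility idea concrete, the sketch below evaluates a per-ray Gaussian mixture CDF and turns it into a transmittance-style visibility. It is a minimal PyTorch illustration that assumes the accumulated CDF acts directly as opacity; all names and the exact mapping from CDF to visibility are illustrative rather than our precise formulation.

```python
import torch

def mixture_cdf(t, mu, sigma, w):
    """CDF of a 1D Gaussian mixture evaluated at ray depths t.

    t:     (S,)  sample depths along the ray
    mu:    (N,)  per-ray Gaussian means
    sigma: (N,)  per-ray Gaussian standard deviations
    w:     (N,)  mixture weights (assumed to sum to one)
    """
    z = (t[:, None] - mu[None, :]) / (sigma[None, :] * 2.0 ** 0.5)
    per_comp = 0.5 * (1.0 + torch.erf(z))          # (S, N) per-component CDFs
    return (w[None, :] * per_comp).sum(dim=-1)     # (S,) mixture CDF

def soft_visibility(t, mu, sigma, w):
    """Transmittance-style visibility derived from the mixture CDF.

    Treats the CDF value as accumulated opacity, so visibility decays
    monotonically along the ray; a hypothetical stand-in for the paper's
    visibility-aware weighting.
    """
    acc = mixture_cdf(t, mu, sigma, w).clamp(0.0, 1.0)
    return 1.0 - acc

# Toy example: ten Gaussians per ray, 192 samples along the ray.
t = torch.linspace(0.0, 1.0, 192)
mu = torch.rand(10).sort().values
sigma = torch.full((10,), 0.02)
w = torch.softmax(torch.randn(10), dim=0)
vis = soft_visibility(t, mu, sigma, w)    # (192,) soft visibility per sample
```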
In summary, our main technical contributions are as follows.
• A per-view Gaussian density mixture representation and image-based rendering approach that is well-suited to modeling high-frequency reflections for curved and transparent objects.
• An end-to-end optimization scheme with photometric and consistency losses to encourage coherence across per-view proxies, and a sparse voxel grid sampling for efficiency.
Compared with other state-of-the-art neural- and image-based rendering methods, our method can produce sharper results with high-fidelity view-dependent appearance (Fig. 1).
2 Related work
Warping Image-based Rendering. Unstructured lumigraph rendering [Buehler et al. 2001] warps and interpolates a collection of input images through a proxy geometry. Many methods rely on global geometry reconstructed from captured images [Chaurasia et al. 2011; Goesele et al. 2010; Ortiz-Cayon et al. 2015]. Global geometry can take the form of depth images, visual and opacity hulls for pixel visibility, and 3D meshes for view-dependent texturing and surface light fields [Chaurasia et al. 2013; Debevec et al. 1996; Fitzgibbon et al. 2005; Matusik et al. 2000; 2002; Wood et al. 2000]. Later methods use per-view information to improve rendering quality. For instance, Chaurasia et al. [2013] use super-pixels as constraints to derive per-pixel depth, which significantly mitigates image warping artifacts along occlusion edges. Hedman et al. [2016] reconstruct a global geometry and refine the depth map for each view to align edges between the depth channel and the RGB channels. The resulting per-view meshes effectively handle large occlusions and motion parallax. Subsequently, the DeepBlending method [Hedman et al. 2018] integrates two distinct multi-view stereo (MVS) reconstructions for per-view depth refinement. To reduce ghosting, a deep neural network blends images warped with per-view meshes. Wang et al. [2021] also combine image-based rendering with neural networks and propose a generic view interpolation method. While these methods can reproduce view-dependent effects to some extent, they struggle with specular effects because they rely on a proxy geometry with only a single surface, which often fails to represent reflections well without considering reflected rays.
Layered Reflections. We might also try to separate the input images into separate layers containing reflections [Kopf et al. 2013; Szeliski et al. 2000]. Sinha et al. [2012] estimate the foreground and reflected depth to handle planar reflections. Xu et al. [2021] explicitly reconstruct a two-layer geometry along with diffuse and reflected images. Rodriguez et al. [2020] estimate curved reflector geometry for car windows with a two-layer representation (background and car window) using reflective flows. These methods advance our ability to model reflections in IBR by introducing a reflection layer.
Layered Geometry. These representations help to handle complex occlusions and appearance. Layered depth images (LDI) [Shade et al. 1998] store scene geometry within a projective volume at a specific viewpoint. Penner et al. [2017] extend this concept by constructing projective volumes with additional depth uncertainty for captured images, leading to higher quality view synthesis at occlusion edges. Hedman et al. [2017] use two-layer color-and-depth panoramas to produce perspective views near captured viewpoints with motion parallax effects. Layered representations can also be predicted using deep neural networks, such as end-to-end deep stereo for unstructured view interpolation [Flynn et al. 2016], and deep view synthesis based on multiplane image (MPI) techniques [Li and Khademi Kalantari 2020; Mildenhall et al. 2019; Srinivasan et al. 2019; Wizadwongsa et al. 2021; Xu et al. 2019; Zhou et al. 2018]. LLFF [Mildenhall et al. 2019] and NeX360 [Phongthawee et al. 2022] extend a single MPI to multiple MPIs, where novel views are rendered by blending adjacent MPIs.
Our method connects to layered representations as we use a fixed number of Gaussians per ray. This is related to mixture density distributions in stereo matching [Tosi et al. 2021]. However, unlike an MPI that shares the same depth sampling for all pixels in each plane, our method employs individual depth sampling for each ray.
Global scenes. Many methods create a global representation via reconstruction. Neural radiance fields (NeRF) encode scene density and appearance compactly within a multi-layer perceptron (MLP) neural network, which is volume rendered to produce an image. Volume rendering has been accelerated through hash grids and direct voxel storage [Fridovich-Keil et al. 2022; Müller et al. 2022], scaled through tiling [Wu et al. 2023; 2022], and extended to dynamic scenes [Pumarola et al. 2021]. Our approach also uses 2D hash grids to encode the local Gaussian mixtures.
Some neural methods are designed for complex appearance [Ma et al. 2024; Verbin et al. 2024; Wu et al. 2024]. Ref-NeRF [Verbin et al. 2022] optimizes a spatial MLP to predict diffuse colors and surface normals and then produces specular reflections via normal-reflected rays and a directional MLP. Other inverse rendering representations incorporate surface details, normals, lighting, albedo, and bidirectional reflectance distribution functions (BRDFs) [Hasselgren et al. 2022; Laine et al. 2020; Liu et al. 2019; Munkberg et al. 2022; Yao et al. 2022; Zhang et al. 2021a; 2021b]. These approaches can be susceptible to over-fitting and are sensitive to initialization and regularization.
Points or primitives have also been studied [Kopanas et al. 2021; Lassner and Zollhöfer 2021; Yifan et al. 2019]. Kopanas et al. [2022] separate reflections among a point cloud using a neural warp field. 3D Gaussian splatting (3DGS) [Kerbl et al. 2023] uses anisotropic Gaussians with color and density as scene primitives, achieving faster rendering speed compared to NeRF-based approaches.
Our method also uses Gaussians to represent the density. However, instead of optimizing a global representation of color and density, we optimize local per-ray density only and use input views for color rendering. This scheme better renders high-frequency view-dependent appearance. Further, since our Gaussian representation is 1D along each ray, it is challenging to determine the coverage area to be splatted onto other views, so we ray cast instead of splat.
4 Implementation Details
Hash grid and network architecture. Storing the Gaussian mixture mean, standard deviation, and weight parameters requires W × H × N × 3 floats for each W × H image, which may exceed available GPU memory during optimization. To overcome this, we reduce memory use by representing the Gaussian parameters compactly. The approach uses a hash table of features unique to each image, followed by a shallow MLP shared across all features (Fig. 2) [Müller et al. 2022]. The hash table features are collectively optimized such that the Gaussian mixtures can reproduce the scene. This allows similar ray density distributions to share MLP capacity through the embedding space while allowing the image-space location of those distributions to move.
Each 2D hash grid has 16 levels, where the coarsest level resolution is 16 × 16 (e.g., in a 640 × 480 image, each cell covers 40 × 30 pixels at the coarsest level due to its 16 × 16 subdivision) and the highest resolution matches that of the input view. The feature size at each level is two, and the hash table size for each 2D hash grid is 2^16.
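A minimal sketch of this encoding pipeline is shown below, using tiny-cuda-nn's HashGrid encoding as a stand-in for the per-image 2D hash grid and a plain PyTorch MLP as the shared decoder. The parameter heads (softplus for standard deviations, softmax for weights) and the ReLU activation are illustrative assumptions, not our exact design, which uses a Gaussian activation as described below.

```python
import math
import torch
import tinycudann as tcnn  # multiresolution hash encoding [Müller et al. 2022]

N_GAUSSIANS = 10
img_w, img_h = 640, 480

# One 2D hash grid per input photo; the finest level roughly matches the
# image resolution, so per_level_scale is derived from base and finest sizes.
n_levels, base_res = 16, 16
finest_res = max(img_w, img_h)
per_level_scale = math.exp(math.log(finest_res / base_res) / (n_levels - 1))

encoding = tcnn.Encoding(
    n_input_dims=2,
    encoding_config={
        "otype": "HashGrid",
        "n_levels": n_levels,
        "n_features_per_level": 2,
        "log2_hashmap_size": 16,      # hash table size 2^16 per grid
        "base_resolution": base_res,
        "per_level_scale": per_level_scale,
    },
)

# Shallow MLP shared across all per-image grids; decodes hash features into
# per-ray Gaussian mixture parameters (mean, std, weight for each Gaussian).
decoder = torch.nn.Sequential(
    torch.nn.Linear(encoding.n_output_dims, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 3 * N_GAUSSIANS),
).cuda()

# Query a batch of rays by their normalized pixel coordinates in [0, 1]^2.
uv = torch.rand(2048, 2, device="cuda")
params = decoder(encoding(uv).float())            # (2048, 30)
mu, raw_sigma, raw_w = params.split(N_GAUSSIANS, dim=-1)
sigma = torch.nn.functional.softplus(raw_sigma)   # keep standard deviations positive
w = torch.softmax(raw_w, dim=-1)                  # normalized mixture weights
```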
We use three MLPs. The first MLP decodes features from the hash grid to produce the Gaussian mixture parameters. The second MLP decodes features from the hash grid to produce colors. Both of these MLPs have two 64-neuron layers. The third MLP produces the neural blending weight and comprises two sub-modules. The first sub-module encodes 3D points into 16-dimension features. These features are concatenated with ray directions and input to the second sub-module. Both sub-modules have two 32-neuron layers. Every MLP is Gaussian activated [Ramasinghe and Lucey 2022] with a variance of 0.01.
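As a point of reference, a Gaussian activation of this kind might look like the following; the exp(-x²/(2·variance)) form and the layer sizes in the example follow the description above, while the exact input/output dimensions of the sub-module are assumptions.

```python
import torch

class GaussianActivation(torch.nn.Module):
    """Gaussian activation exp(-x^2 / (2 * variance)), in the spirit of
    Ramasinghe and Lucey [2022]; variance 0.01 as stated in the text."""

    def __init__(self, variance: float = 0.01):
        super().__init__()
        self.variance = variance

    def forward(self, x):
        return torch.exp(-x.pow(2) / (2.0 * self.variance))

# Example: first sub-module of the blending-weight MLP, encoding 3D points
# into 16-dimensional features with two 32-neuron layers (I/O sizes assumed).
point_encoder = torch.nn.Sequential(
    torch.nn.Linear(3, 32), GaussianActivation(),
    torch.nn.Linear(32, 32), GaussianActivation(),
    torch.nn.Linear(32, 16),
)
features = point_encoder(torch.rand(2048, 3))   # (2048, 16)
```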
Baking and rendering. To accelerate rendering at the cost of memory, we can precompute and store the optimized Gaussian mixture parameters {μn, σn, ωn} for each view as a 2D grid. This eliminates the computational overhead of hashing and executing the MLP to recover these parameters. For the neural blending weights, the MLP is small enough to be stored in CUDA shared memory. For rendering, our method samples 192 points for each ray in the target view and blends density and colors for each point across 8 neighboring views.
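The baking step can be sketched as a single dense evaluation of the hash grid and decoder over the image plane; the snippet below reuses the hypothetical encoding/decoder names from the earlier sketch and is only illustrative.

```python
import torch

@torch.no_grad()
def bake_view(encoding, decoder, img_w, img_h, n_gaussians=10):
    """Evaluate the per-view hash grid + MLP once per pixel and cache the
    Gaussian mixture parameters {mu, sigma, w} as (H, W, N) grids, so hashing
    and MLP evaluation can be skipped at render time."""
    ys, xs = torch.meshgrid(
        torch.arange(img_h, device="cuda"),
        torch.arange(img_w, device="cuda"),
        indexing="ij",
    )
    uv = torch.stack([(xs + 0.5) / img_w, (ys + 0.5) / img_h], dim=-1)
    params = decoder(encoding(uv.reshape(-1, 2)).float())
    mu, raw_sigma, raw_w = params.split(n_gaussians, dim=-1)
    sigma = torch.nn.functional.softplus(raw_sigma)
    w = torch.softmax(raw_w, dim=-1)
    shape = (img_h, img_w, n_gaussians)
    return mu.reshape(shape), sigma.reshape(shape), w.reshape(shape)
```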
Ray sampling strategy. To maintain balanced gradient scales during each optimization iteration, we adopt a strategy of randomly sampling an equal number of pixels for each view.
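A hypothetical version of this balanced sampling, assuming equal-resolution input views and illustrative names, is shown below.

```python
import torch

def sample_balanced_pixels(num_views, img_w, img_h, rays_per_view):
    """Draw the same number of random pixels from every input view so that
    each view contributes equally to the gradient in an iteration."""
    view_ids = torch.arange(num_views).repeat_interleave(rays_per_view)
    xs = torch.randint(0, img_w, (num_views * rays_per_view,))
    ys = torch.randint(0, img_h, (num_views * rays_per_view,))
    return view_ids, xs, ys

# e.g., a 2,048-ray batch spread evenly over 32 views gives 64 rays per view
view_ids, xs, ys = sample_balanced_pixels(32, 640, 480, 64)
```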
Point sampling and occupancy grid. We use an occupancy grid to speed up volume point sampling. For every sample point, if the visibility-aware weight w (Eq. 10) exceeds 0.01, the voxel covering this point is marked as occupied. When processing each target ray, we begin by sampling 64 points in disparity space. Subsequently, we uniformly sample 128 points only within the occupied voxels.
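The two-stage sampling could be sketched as follows; the unit scene bound, the voxel lookup, and the function names are simplifying assumptions for illustration.

```python
import torch

def sample_ray_points(origin, direction, near, far, occupancy,
                      n_coarse=64, n_fine=128):
    """Per-ray point sampling: n_coarse samples uniform in disparity, then
    n_fine samples uniform in depth kept only inside occupied voxels.
    `occupancy` is a boolean (R, R, R) grid over an assumed [0, 1]^3 bound."""
    grid_res = occupancy.shape[0]

    # Stage 1: uniform in disparity (1/depth) between near and far.
    disp = torch.linspace(1.0 / near, 1.0 / far, n_coarse, device=origin.device)
    t_coarse = 1.0 / disp

    # Stage 2: uniform in depth, filtered by the occupancy grid.
    t_fine = torch.linspace(near, far, n_fine, device=origin.device)
    pts = origin[None, :] + t_fine[:, None] * direction[None, :]
    idx = (pts.clamp(0.0, 1.0 - 1e-6) * grid_res).long()
    keep = occupancy[idx[:, 0], idx[:, 1], idx[:, 2]]
    t_fine = t_fine[keep]

    return torch.sort(torch.cat([t_coarse, t_fine])).values
```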
Further, we adopt a strategy of gradually subdividing our occupied grid to increase its resolution [Liu et al. 2020; Wu et al. 2023]. This subdivision is performed every 1,000 iterations. We start with an initial resolution of 8^3 and progressively increase it until reaching a resolution of 512^3.
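One way to realize this schedule, assuming the resolution simply doubles at each subdivision (the growth factor is not stated in the text), is sketched below.

```python
import torch

def maybe_subdivide(occupancy, iteration, max_res=512):
    """Every 1,000 iterations, double the occupancy grid resolution until
    max_res^3 is reached; each occupied voxel marks its eight children."""
    if iteration % 1000 != 0 or occupancy.shape[0] >= max_res:
        return occupancy
    return (occupancy.repeat_interleave(2, 0)
                     .repeat_interleave(2, 1)
                     .repeat_interleave(2, 2))

occupancy = torch.ones(8, 8, 8, dtype=torch.bool)   # initial 8^3 grid
for it in range(1, 60001):
    occupancy = maybe_subdivide(occupancy, it)
    # ... re-evaluate visibility-aware weights and prune unoccupied voxels ...
```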
Viewport extension. In certain scenes, some points are only observed from the target view and so warp outside all neighboring views. During optimization, since each input view is treated as a target view, these points receive no information from neighboring views to correctly minimize the reconstruction loss, resulting in floating geometries. To address this issue, we extend each input view by 50 pixels on each side. The 2D hash feature grid is also extended accordingly. As the extended pixels have no captured colors, we reuse the 2D hash grid features to generate their colors using an additional color MLP shared across all images; these colors are optimized with respect to the target view. After optimization, the extended feature grids and the color MLP are discarded.
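A hypothetical sketch of how an extended pixel might be shaded is given below, with the 50-pixel pad from the text and illustrative names for the padded grid and the extra color MLP.

```python
import torch

PAD = 50  # viewport extension in pixels on each side

def view_color(u, v, image, padded_encoding, extra_color_mlp):
    """Return the color contributed by one input view at pixel (u, v), which
    may lie in the padded range [-PAD, W+PAD) x [-PAD, H+PAD)."""
    H, W, _ = image.shape
    if 0 <= u < W and 0 <= v < H:
        return image[int(v), int(u)]     # captured color inside the viewport
    # Extended pixel: decode a color from the padded hash grid features.
    uv = torch.tensor([[(u + PAD) / (W + 2 * PAD),
                        (v + PAD) / (H + 2 * PAD)]], device="cuda")
    return extra_color_mlp(padded_encoding(uv).float())[0]
```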
Hyperparameter configuration. Throughout all of our experiments, we optimize for 60k iterations, with each batch consisting of 2,048 rays. We use the Adam [Kingma and Ba 2014] optimizer with a learning rate that decays from 1e-3 to 1e-4. The training time for each scene is approximately 4-5 hours on a single Nvidia V100 GPU.
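In PyTorch terms, this configuration could look like the sketch below; the exponential decay curve and the reuse of the earlier encoding/decoder names are assumptions, while the learning-rate endpoints, iteration count, and batch size come from the text.

```python
import torch

params = list(encoding.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

# Decay the learning rate from 1e-3 to 1e-4 over 60k iterations.
total_iters = 60_000
gamma = (1e-4 / 1e-3) ** (1.0 / total_iters)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

for it in range(total_iters):
    optimizer.zero_grad()
    # loss = photometric_loss(batch) + consistency_loss(batch)  # 2,048-ray batch
    # loss.backward()
    optimizer.step()
    scheduler.step()
```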
6 Conclusion
We have shown that a per-view Gaussian density mixture with image-based rendering can be optimized end to end to achieve high-frequency reflections for curved and transparent objects.
Limitations and future work. As an IBR method, our results can still show some visual ‘snapping’ on curved reflections as the target view moves between different sets of neighbors. In these cases, there is a visual trade-off against global scene methods: snapping in our case versus blurring in the case of Ref-NeRF and 3DGS.
Our approach is not yet constructed for fast rendering, as it uses a volume sampling method. Each 952 × 535 view takes around 270 ms to render on an NVIDIA RTX 3090Ti, which is faster than NeRF but slower than 3DGS. One way to increase rendering speed is to reduce the number of sampled points per ray. A coarse-to-fine strategy per ray may increase speed without reducing quality.
Our method achieves its highest quality with dense scene capture, especially for complex reflections or refractions. In diffuse areas, this dense capture results in redundant duplication. But, even with the many views in our LGDM dataset, global scene representations like Ref-NeRF or 3DGS still cannot achieve as high a quality of reflections as ours; this discrepancy is worth investigating in future work.
Another direction for future work is to explore why our method reconstructs view-dependent effects better than other approaches, particularly methods that construct global geometries. We have observed that flexible local geometries and direct color blending are factors in achieving these results. Developing a rigorous evaluation framework or theory to support these findings would be valuable.