Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering

Zhiwen Yan  Weng Fei Low  Yu Chen  Gim Hee Lee
Department of Computer Science, National University of Singapore
yan.zhiwen@u.nus.edu  {wengfei.low, chenyu}@comp.nus.edu.sg  gimhee.lee@nus.edu.sg
Abstract

3D Gaussians have recently emerged as a highly efficient representation for 3D reconstruction and rendering. Despite their high rendering quality and speed at high resolutions, both deteriorate drastically when rendering at lower resolutions or from faraway camera positions. During low-resolution or faraway rendering, the pixel sampling rate of the image can fall below the Nyquist frequency required by the screen size of each splatted 3D Gaussian, which leads to aliasing. The rendering is also drastically slowed down by the sequential alpha blending of more splatted Gaussians per pixel. To address these issues, we propose a multi-scale 3D Gaussian splatting algorithm, which maintains Gaussians at different scales to represent the same scene. Higher-resolution images are rendered with more small Gaussians, and lower-resolution images are rendered with fewer larger Gaussians. With similar training time, our algorithm achieves 13%-66% PSNR and 160%-2400% rendering speed improvements at 4×-128× scale rendering on the Mip-NeRF360 dataset compared to single-scale 3D Gaussian splatting. Project website: https://jokeryan.github.io/projects/ms-gs/.

Figure 1: The rendering quality and speed of the original 3D Gaussian splatting [12] deteriorate severely at low resolutions or from distant cameras due to aliasing. Conversely, our multi-scale 3D Gaussian representation utilizes selective rendering to achieve faster (160%-2400% at 4×-128× resolution) and more accurate rendering at lower resolutions.

1 Introduction

3D Gaussian Splatting [12] has recently emerged as a highly efficient representation for novel view synthesis. Compared to the time-consuming ray marching used in most neural radiance fields (NeRF) [15, 16, 2], a high-resolution image can be rendered in real time by rasterizing the splatted 3D Gaussians. However, this rasterization algorithm is subject to severe aliasing and speed deterioration when rendering the same scene at low resolution or from distant positions, as shown in Fig. 1. This limitation significantly constrains the application of the 3D Gaussian splatting algorithm in reconstructing and rendering large-scale scenes.

Aliasing is a consequence of a sampling frequency that is too low to capture the continuous signal accurately. In the context of rendering, image pixels are sampled at an interval of one pixel size. The signal can be considered as the 3D scene, represented implicitly as in NeRF or explicitly as in 3D Gaussians. When part of the 3D scene is represented with high detail but rendered at low resolution or from distant positions, the disparity between the low sampling frequency and the high signal frequency culminates in aliasing artifacts. A naive solution is to render at high resolution and subsequently downscale the rendered image to a lower resolution. However, this solution is not viable for scenes containing both near and far regions, which are very common. Because the 3D Gaussian splatting algorithm cannot accommodate varying resolutions within a single image, rendering the entire image at an even higher resolution for the sake of faraway regions is neither time nor memory efficient.

We postulate that the pronounced aliasing artifacts observed when rendering with 3D Gaussians, as opposed to other techniques such as NeRF, are primarily attributable to the splatting of small Gaussians. 3D regions with intricate details are represented with a large number of small Gaussians. When rendering these regions at low resolution or from a distant view, many splatted small Gaussians are crammed into one pixel, and the pixel color of this region is therefore dominated by the front-most Gaussian, even if this Gaussian is much smaller than the others and not at the center. This problem is further aggravated by the low-pass filter in [12, 19], which is applied to each individual Gaussian with the intention of mitigating aliasing on edges at high resolutions. This problem is explained in more detail in Sec. 3.2.

In addition to the aliasing artifacts, the rendering speed of 3D Gaussians is also affected at low resolution. The number of 3D Gaussians that need to be rendered remains constant at lower resolutions, but they are concentrated into fewer pixels. The Gaussians that are splatted to the same pixel cannot be rendered in parallel. This means that image rendering becomes even slower at lower resolutions, whereas NeRF rendering time decreases linearly with decreasing resolution. Hence, although aliasing is not a problem exclusive to 3D Gaussian splatting, it is more prominent and more difficult to tackle.

Contributions

To mitigate the aliasing problem for 3D Gaussian splatting, we propose a novel multi-scale 3D Gaussian representation that models the scene at different levels of detail (LOD), as shown in Fig. 2. This is inspired by the mipmap and LOD algorithms widely used in computer graphics, which pre-compute textures and polygons at different scales to be rendered under different resolutions and distances. Similarly, we add larger, coarser Gaussians for lower resolutions by aggregating the smaller and finer Gaussians from higher resolutions. Depending on the pixel coverage of the splatted Gaussians during rendering, only a subset of the Gaussians is used. Put simply, the coarse Gaussians are used to render low-resolution images and the fine Gaussians are used to render high-resolution images. With fewer than 5% additional Gaussians and a similar training time, our method achieves 13%-66% PSNR and 160%-2400% rendering speed improvements at 4×-128× scale rendering on the Mip-NeRF360 dataset [2], while maintaining comparable rendering quality and speed at 1× scale.

Figure 2: Overall pipeline of our algorithm. At the early stage of training (left), small Gaussians below a certain size threshold in each voxel are aggregated, enlarged and inserted into the scene at different resolution scales. During rendering (right), the multi-scale Gaussians with the appropriate “pixel coverage” at the current render resolution are selected for rendering. If the rendering resolution scale equals the scale of the Gaussians, the expected “pixel coverage” range of the Gaussians is updated accordingly.

2 Related Works

2.1 Anti-Aliasing in Computer Graphics

Aliasing is a long-standing problem in computer graphics when rendering a scene to a discrete image. Traditional anti-aliasing techniques primarily target mesh representations. Supersampling Anti-Aliasing (SSAA) [5] renders the scene at a higher resolution before downscaling, which leads to significantly higher time and memory demands, and is therefore rarely used in real-time applications. The Multisample Anti-Aliasing (MSAA) [1, 5] algorithm selectively supersamples pixels on the edges, reducing resource and time consumption. This technique is not well suited to 3D Gaussian splatting because of its requirement for regular grids and lack of support for variable sampling resolution at different pixels. The more recent Fast Approximate Anti-Aliasing (FXAA) [14, 11] is a post-processing algorithm that smooths the jagged edges after the image is rendered. Unfortunately, this technique is also not suitable for the Gaussian representation, as the front-most Gaussian dominates the pixel color and produces chunky artifacts rather than the jagged artifacts seen in mesh rendering.

In contrast to the supersampling methods mentioned above, our method takes inspiration from the hierarchical mipmap [18] and level of detail (LOD) [9, 7] algorithms to address aliasing for 3D Gaussians. Mipmap uses multi-scale textures for rendering at different resolutions or from different distances. The LOD algorithm represents the models in a scene with different complexity to be rendered at different distances. Both techniques not only mitigate aliasing by reducing the complexity of the scene representation, but also enhance rendering speed, particularly for large-scale scenes.

2.2 Anti-Aliasing in Neural Representation

The recent success of neural representations, especially Neural Radiance Fields (NeRF) [15, 16, 6], has also inspired work on anti-aliasing algorithms for neural representations beyond the traditional mesh representation. Mip-NeRF [2, 3] employs low-pass filters on the positional encoding of the input spatial coordinates to reduce the scene signal frequency. Building on the hash grid representation used by InstantNGP [16], which has no positional encoding, Zip-NeRF [4] proposes a multi-sampling strategy in the conical frustum instead of along the camera ray, at the cost of 6× rendering time. Similar to the mipmap algorithm in mesh texture rendering, Tri-MipNeRF [10] and MipGrid [17] propose to use multi-scale feature grids for rendering at different resolutions or distances.

Conversely, 3D Gaussian splatting [12] presents unique anti-aliasing challenges due to its distinct scene representation. It does not have any positional encoding or feature grid, and its requirement for regular grids conflicts with the more flexible multi-sampling strategies. The concentration of small Gaussians in detail-rich regions exacerbates aliasing and speed issues, more so than in NeRF representations. To the best of our knowledge, we are the first to propose an anti-aliasing algorithm for scene reconstruction using 3D Gaussian splatting.

3 Preliminaries

3.1 3D Gaussian Splatting

3D Gaussian splatting was first proposed in EWA Splatting [19], and later used by [12] for scene reconstruction and novel view synthesis. The scene is represented by a set of $\mathbf{K}$ 3D Gaussians $\{\mathcal{G}_{\hat{V}^{k},\hat{\mu}^{k}},\sigma_{k},\mathbf{c}_{k}\mid k\in[1,\mathbf{K}]\}$ with variance $\hat{V}^{k}$, center $\hat{\mu}^{k}$, density $\sigma_{k}$ and color $\mathbf{c}_{k}$. During rendering, the 3D Gaussians are splatted onto the 2D screen by the perspective transformation to form 2D Gaussians $\mathcal{G}_{V^{k},\mu^{k}}$. The image is then divided into regular 16×16 tiles, and all 2D Gaussians touching each tile are sorted by their original depth. The color of each pixel in the tile is then rasterized by sequentially alpha blending the 2D Gaussians from front to back.
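The front-to-back alpha blending step can be illustrated for a single pixel. The following is a minimal sketch, not the rasterizer of [12]: the function name `composite_front_to_back`, the per-Gaussian `alphas` (opacity already multiplied by the 2D Gaussian falloff at the pixel), and the early-stop threshold are assumptions made for illustration.

```python
import numpy as np

def composite_front_to_back(colors, alphas, min_transmittance=1e-4):
    """Blend depth-sorted (front first) Gaussians at one pixel.

    colors: (K, 3) RGB per Gaussian; alphas: (K,) effective opacity at the pixel.
    min_transmittance is an assumed early-stop threshold.
    """
    colors = np.asarray(colors, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    pixel = np.zeros(3)
    transmittance = 1.0  # T_k: fraction of light not yet absorbed
    for color, alpha in zip(colors, alphas):
        pixel += transmittance * alpha * color
        transmittance *= 1.0 - alpha      # depends on all earlier Gaussians
        if transmittance < min_transmittance:
            break                          # remaining Gaussians are invisible
    return pixel, transmittance

# Two half-transparent Gaussians: red in front, green behind.
pixel, remaining = composite_front_to_back([[1, 0, 0], [0, 1, 0]], [0.5, 0.5])
```

Because each transmittance value depends on every Gaussian in front of it, the loop over one pixel is inherently sequential, which is why packing more Gaussians into fewer pixels slows rendering.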

3.2 Cause of Aliasing in 3D Gaussian Splatting

Aliasing can occur when sampling a continuous signal $g(x)$ with a discrete sampling function $\delta_{s}(x,\Delta x)=\sum_{n=-\infty}^{\infty}\delta(x-n\cdot\Delta x)$, where $\delta$ is an impulse function. The result of the sampling in the spatial domain is:

$$g_{s}(x)=\delta_{s}(x,\Delta x)\cdot g(x). \tag{1}$$

This sampled function, converted into the frequency domain using the Fourier transform operator $\mathcal{F}$, becomes:

$$\begin{aligned}
\mathcal{F}[g_{s}(u)] &= \frac{1}{\Delta x}\sum_{k=-\infty}^{\infty}\delta\left(u-\frac{k}{\Delta x}\right)\ast\mathcal{F}[g(x)] \\
&= \frac{1}{\Delta x}\sum_{k=-\infty}^{\infty}\mathbf{G}\left(u-\frac{k}{\Delta x}\right).
\end{aligned} \tag{2}$$

When the highest frequency component $f_{max}$ of the signal is greater than half of the sampling frequency $f_{s}=\frac{1}{\Delta x}$, the shifted copies $\mathbf{G}(u-\frac{k}{\Delta x})$ of the spectrum $\mathbf{G}=\mathcal{F}[g]$ in the summation overlap with each other and cause the sampled signal to diverge from the actual signal. This phenomenon is the aliasing effect, and the minimum sampling frequency needed to avoid aliasing is $f_{Ny}=2\cdot f_{max}$, known as the Nyquist frequency (or Nyquist rate).
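A small numerical example makes the overlap concrete. The sketch below samples two cosines at one sample per pixel; the frequencies 0.8 and 0.2 cycles per pixel and the helper `alias_frequency` are illustrative choices, not values from the paper.

```python
import numpy as np

def alias_frequency(f_signal, f_sample):
    """Apparent frequency of a sinusoid of f_signal sampled at rate f_sample."""
    # Fold the signal frequency into the baseband [0, f_sample / 2].
    f_mod = f_signal % f_sample
    return min(f_mod, f_sample - f_mod)

f_sample = 1.0                        # one sample per pixel
n = np.arange(64)                     # pixel indices
high = np.cos(2 * np.pi * 0.8 * n)    # 0.8 cycles/px, above f_sample / 2
low = np.cos(2 * np.pi * 0.2 * n)     # 0.2 cycles/px, below f_sample / 2

# After sampling, the two signals are indistinguishable.
indistinguishable = np.allclose(high, low)
apparent = alias_frequency(0.8, f_sample)
```

Here the 0.8 cycles/px signal exceeds $f_s/2$ and is indistinguishable from a 0.2 cycles/px signal once sampled, which is exactly the spectral overlap described by Eq. 2.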

The EWA splatting [19] used by 3D Gaussian splatting [12] also tries to mitigate the aliasing problem by applying a low-pass filter to the rendered color. To approximate this efficiently, it applies a Gaussian kernel $h(\mathbf{x})$ as the low-pass filter on each splatted 2D signal $g_{c}(\mathbf{x})$ independently to produce a band-limited signal:

$$\begin{aligned}
g_{c}^{\prime}(\mathbf{x}) &= g_{c}(\mathbf{x})\ast h(\mathbf{x}) \\
&\approx \sum_{k}\sigma_{k}\mathbf{c}_{k}T_{k}\int_{\mathcal{R}^{2}}q_{k}(\eta)\,h(\mathbf{x}-\eta)\,d\eta \\
&= \sum_{k}\sigma_{k}\mathbf{c}_{k}T_{k}\cdot(q_{k}\ast h)(\mathbf{x}),
\end{aligned} \tag{3}$$

where $\mathcal{R}^{2}$ is the range of one pixel, $q_{k}(\mathbf{x})$ is the 2D integrated Gaussian kernel, and $\sigma_{k},\mathbf{c}_{k},T_{k}$ are the opacity, color, and transmittance at each Gaussian, respectively. By combining the reconstruction Gaussian kernel $\mathcal{G}_{V^{k}}$ and the low-pass Gaussian kernel $\mathcal{G}_{V^{h}}$ with covariance matrices $V^{k}$ and $V^{h}$, the band-limited function becomes:

$$\begin{aligned}
g_{c}^{\prime}(\mathbf{x}) &= \sum_{k}\alpha_{k}\cdot(\mathcal{G}_{V^{k}}\ast\mathcal{G}_{V^{h}})(\mathbf{x}) \\
&= \sum_{k}\alpha_{k}\cdot\mathcal{G}_{V^{k}+V^{h}}(\mathbf{x}),
\end{aligned} \tag{4}$$

where $\alpha_{k}$ collects all coefficients at each Gaussian that do not depend on $\mathbf{x}$, and $V^{h}$ is determined by the screen pixel size. Put simply, the covariance of each splatted Gaussian is increased based on the screen pixel size.
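The second line of Eq. 4 relies on the fact that convolving two Gaussians yields a Gaussian whose covariance is the sum of the two. A quick 1D numerical check, with illustrative variances that are assumptions rather than values from the paper, might look like this:

```python
import numpy as np

def gaussian(x, var):
    """Normalized zero-mean 1D Gaussian with variance var."""
    return np.exp(-0.5 * x**2 / var) / np.sqrt(2.0 * np.pi * var)

x = np.linspace(-20.0, 20.0, 4001)    # symmetric grid with spacing 0.01
dx = x[1] - x[0]
var_k, var_h = 1.5, 0.3               # assumed reconstruction / low-pass variances

# Discrete approximation of the continuous convolution (G_{V^k} * G_{V^h})(x).
conv = np.convolve(gaussian(x, var_k), gaussian(x, var_h), mode="same") * dx

# Second moment of the result; it should match var_k + var_h.
measured_var = np.sum(x**2 * conv) / np.sum(conv)
matches_closed_form = np.allclose(conv, gaussian(x, var_k + var_h), atol=1e-6)
```

The same identity holds for 2D covariance matrices, which is what allows the low-pass filter to be folded into each splat by a simple matrix addition.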

This method of applying a low pass filter to each 3D Gaussian independently helps to smooth the edges of the Gaussians when the Gaussians are not too small compared to the pixel size. However, it also gives rise to two substantial issues at low resolutions:

  1. $V^{h}$ added to the original covariance $V^{k}$ effectively increases the extent of each Gaussian, especially when $V^{h}$ is large compared to $V^{k}$ at low resolutions. Small Gaussians in the front dominate the color of the pixel and cause the severe artifacts shown in Fig. 7.

  2. The number of Gaussians involved in the sequential $\sum_{k}$ for each pixel increases with decreasing image resolution. Due to the incremental calculation of the transmittance $T_{k}$, the rendering becomes even slower at lower resolutions.
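The first issue can be made concrete with a back-of-the-envelope calculation. The sketch below assumes an isotropic splat, a low-pass variance `V_H` fixed in units of rendered pixels (the value 0.3 is an assumed placeholder), and a linear downsampling factor; none of these specifics come from the text above.

```python
import math

V_H = 0.3  # assumed low-pass variance in px^2 of the rendered image

def filtered_std_px(std_fullres_px, downsample):
    """Std (in rendered pixels) of an isotropic splat after adding V^h."""
    # The splat footprint shrinks with the linear downsampling factor.
    var_k = (std_fullres_px / downsample) ** 2
    return math.sqrt(var_k + V_H)

for s in (1, 4, 16, 64):
    before = 2.0 / s                     # a splat with std 2 px at full resolution
    after = filtered_std_px(2.0, s)
    print(f"{s:>2}x: std {before:.3f} px -> {after:.3f} px (x{after / before:.1f})")
```

Under these assumptions the filter barely changes the splat at full resolution, but at strong downsampling the filtered extent is many times the original, so a tiny front-most Gaussian can cover most of a pixel.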

4 Our Method

4.1 Multi-Scale Gaussians Based on Pixel Coverage

To mitigate the aliasing artifacts of 3D Gaussians [12] while avoiding the two problems of the EWA splatting [19], we introduce multi-scale 3D Gaussians (cf. Fig. 2) that tackle the problem at the scene level instead of on each individual Gaussian. The 3D scene is represented with Gaussians from 4 levels of detail, corresponding to the 1×, 4×, 16×, and 64× downsampled resolutions. Small finer-level Gaussians are aggregated to create larger Gaussians for coarser levels during training. Each 3D Gaussian $\mathcal{G}_{k}^{l}$ belongs to one of the levels $l$ and is included or excluded independently during rendering based on its “pixel coverage”.

Pixel Coverage of Gaussian.

The “pixel coverage” of a Gaussian reflects the size of the Gaussian when splatted onto the screen space compared to the pixel size at the current rendering resolution. The “pixel coverage” $S_{k}$ of a splatted 2D Gaussian $\mathcal{G}_{(\mu^{k},V^{k})}$ is defined as the length of its horizontal or vertical axis up to the low-opacity level set, whichever is smaller, as shown in Fig. 3. The pixel coverage is measured in pixel count, and the opacity threshold $\sigma_{T}$ is set to $\frac{1}{255}$.

Figure 3: The pixel coverage of a 3D Gaussian is its horizontal or vertical size, whichever is smaller, measured at the level set.

The pixel coverage approximates the extent of a 2D splatted Gaussian in the spatial domain. During rendering from a given camera direction, the color of each splatted Gaussian is constant within this pixel coverage. As a result, the pixel coverage approximates the inverse of the highest frequency component in this region, $f_{max}=1/S_{k}$. With a sampling frequency of $f_{s}=1\,\mathrm{px}^{-1}$ during rasterization, a signal frequency of $f_{max}>f_{s}/2$ means that the sampling frequency falls below the Nyquist frequency needed to avoid aliasing.

Consequently, Gaussians with pixel coverage $S_{k}<S_{T}=2\,\mathrm{px}$ should be filtered out during rendering to avoid aliasing. Since the 3D Gaussian representation does not encode signals of different frequencies in different Gaussians, naively filtering out the small Gaussians results in holes or missing parts in the scene, as shown in Fig. 4. To address this issue, we propose to aggregate the small Gaussians to form large Gaussians that encode the low-frequency signal. These large Gaussians appear when the small Gaussians are filtered out.
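The pixel-coverage test described above can be sketched as follows. This is one possible reading of the definition, stated as an assumption: the horizontal and vertical sizes are taken as the axis-aligned extents of the ellipse where the splat's opacity falls to $\sigma_T$. The function names and the linear downsampling used in the usage lines are hypothetical.

```python
import math
import numpy as np

SIGMA_T = 1.0 / 255.0   # opacity threshold from the text
S_T = 2.0               # coverage threshold in pixels from the text

def pixel_coverage(cov2d, opacity):
    """Smaller of the horizontal/vertical extents of the sigma_T level set, in px."""
    if opacity <= SIGMA_T:
        return 0.0      # the splat never rises above the threshold
    # opacity * exp(-0.5 * d^T V^-1 d) = SIGMA_T  =>  d^T V^-1 d = r2
    r2 = 2.0 * math.log(opacity / SIGMA_T)
    # Axis-aligned half-extents of the ellipse d^T V^-1 d = r2 (assumed reading).
    half_w = math.sqrt(r2 * cov2d[0, 0])
    half_h = math.sqrt(r2 * cov2d[1, 1])
    return 2.0 * min(half_w, half_h)

def is_aliasing_prone(cov2d, opacity):
    """True when f_max = 1 / S_k exceeds half the 1 px^-1 sampling frequency."""
    return pixel_coverage(cov2d, opacity) < S_T

cov_full = np.diag([4.0, 1.0])    # splat covariance in px^2 at full resolution
cov_low = cov_full / 16.0**2      # same splat after 16x linear downsampling
```

With these example values, the splat spans several pixels at full resolution and passes the test, while the downsampled splat covers well under 2 px and would be filtered out.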

Figure 4: Missing parts caused by naive small Gaussian filtering at different resolution scales.

Aggregate to Insert Large Gaussians.

All 3D Gaussians initialized from the input point cloud at the start of training belong to the finest level $l=1$. They are densified by splitting and cloning as in [12], and all the densified Gaussians inherit the same level. After the warm-up stage of the first 1,000 iterations, we introduce coarse-level Gaussians by aggregating fine-level Gaussians that are too small, as visualized in Fig. 5 and described in Algorithm 1. The procedure is outlined as follows:

  1. For all levels $\{l_{m}\mid 2\leq l_{m}\leq l_{max}\}$, we render all 3D Gaussians from levels $[1,l_{m}-1]$ at the $4^{l_{m}-1}$ times downsampled resolution of all training images. All 3D Gaussians with a minimal “pixel coverage” $S_{k}$ smaller than the filter threshold $S_{T}$ are chosen for aggregation.

  2. The chosen 3D Gaussians are binned into a $(400/l_{m})^{3}$ resolution voxel grid based on their positions. The attributes of all Gaussians within each voxel, including position, scaling, opacity and color, are aggregated to create a new Gaussian using average pooling. More details are included in the supplementary.

  3. Based on the average “pixel coverage” $S_{avg}$ of the Gaussians in each voxel, the scaling of each new Gaussian is enlarged by $S_{T}/S_{avg}$ so that it is of a size suitable to be rendered at $l_{m}$. This new Gaussian belongs to level $l_{m}$.

Figure 5: Large Gaussians are created by aggregating the small Gaussians in each voxel below the pixel coverage threshold, and then enlarged by the pixel coverage multiplier.

Not all Gaussians from the fine levels are small. Many Gaussians in the background or in the textureless regions are large and do not need to be aggregated. The number of Gaussians created is often fewer than 5% of the final total number of Gaussians.
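The three aggregation steps above can be sketched in NumPy for a single target level. This is an illustrative sketch under several assumptions: the voxel grid spans the bounding box of the selected Gaussians, scaling is averaged directly rather than in any log-parameterized form, rotation is ignored, and the function and argument names are hypothetical. The exact pooling details are deferred to the supplementary.

```python
import numpy as np

def aggregate_small_gaussians(pos, scale, opacity, color, coverage,
                              s_t=2.0, grid_res=200):
    """Pool Gaussians with coverage < s_t per voxel, then enlarge the result.

    pos, scale, color: (K, 3); opacity, coverage: (K,).
    grid_res plays the role of 400 / l_m (200 for l_m = 2).
    """
    small = coverage < s_t
    if not small.any():
        return None
    pos, scale, opacity, color, coverage = (
        a[small] for a in (pos, scale, opacity, color, coverage))

    # Step 2: voxel index of each selected Gaussian (grid extent is assumed).
    lo, hi = pos.min(axis=0), pos.max(axis=0)
    cell = np.floor((pos - lo) / np.maximum(hi - lo, 1e-9) * grid_res)
    cell = np.minimum(cell, grid_res - 1).astype(np.int64)
    flat = (cell[:, 0] * grid_res + cell[:, 1]) * grid_res + cell[:, 2]
    _, inverse, counts = np.unique(flat, return_inverse=True, return_counts=True)
    inverse = inverse.ravel()

    def mean_per_voxel(values):
        # Average pooling: sum rows per voxel, then divide by the voxel count.
        out = np.zeros((counts.size,) + values.shape[1:])
        np.add.at(out, inverse, values)
        return out / counts.reshape((-1,) + (1,) * (values.ndim - 1))

    # Step 3: enlarge by S_T / S_avg so the new Gaussian reaches the threshold.
    s_avg = mean_per_voxel(coverage)
    return {
        "pos": mean_per_voxel(pos),
        "scale": mean_per_voxel(scale) * (s_t / s_avg)[:, None],
        "opacity": mean_per_voxel(opacity),
        "color": mean_per_voxel(color),
    }
```

Only occupied voxels produce a new Gaussian here, which is consistent with the observation that the number of created Gaussians stays small relative to the total.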

Algorithm 1 Aggregate Small Gaussians
procedure AggregateGaussians($\mathcal{G}_{1:K}^{1:l_{max}}$)
    for $l_{m}\leftarrow 2$ to $l_{max}$ do
        $S_{1:K}$ = PixelCoverage($\mathcal{G}_{1:K}^{1:l_{m}-1}$, scale $4^{l_{m}-1}$)
        $G_{small}=\{\mathcal{G}_{k}\mid S_{k}<S_{T},\forall k\in[1:K]\}$
        for $n\leftarrow 1$ to $(400/l_{m})^{3}$ do
            $G_{n}=\{\mathcal{G}_{k}\mid\mathcal{G}_{k}$ in voxel $n,\forall\mathcal{G}_{k}\in G_{small}\}$
            $\mathcal{G}_{new,n}^{l_{m}}=$ Enlarge(Average($G_{n}$))
            InsertIntoScene($\mathcal{G}_{new,n}^{l_{m}}$)
        end for
    end for
end procedure

Algorithm 2 Selective Rendering Based on Pixel Coverage
procedure SelectiveRender($\mathcal{G}_{1:K}^{1:l_{max}}$, scale $l_{r}$)
    $S_{1:K}=$ PixelCoverage($\mathcal{G}_{1:K}^{1:l_{max}}$)
    $G_{1}=\{\mathcal{G}_{k}\mid S_{k}/S_{k}^{max}\leq S_{rel}^{max},\forall k\}$
    $G_{2}=\{\mathcal{G}_{k}\mid S_{k}/S_{k}^{min}\geq S_{rel}^{min}\lor S_{k}\geq S_{T},\forall k\}$
    $G_{l_{r}}=\{\mathcal{G}_{k}^{l}\mid l=l_{r},\forall k\}$
    for $\mathcal{G}_{k}^{l}\in G_{l_{r}}$ do
        UpdateRange($S_{k}^{max},S_{k}^{min},S_{k}$)
    end for
    return Render($G_{1}\cap G_{2}$, $l_{r}$)
end procedure

Multi-Scale Training and Selective Rendering.

After the large Gaussians are added, the model is trained with both the original images and the downsampled images. A maximum pixel coverage $S_k^{max}$ and a minimum pixel coverage $S_k^{min}$ of each Gaussian are stored for the selective rendering. If the rendering downsample scale equals the downsample scale at which the Gaussian $\mathcal{G}_k$ was created, its $S_k^{max}$ and $S_k^{min}$ values are updated with the new pixel coverage $S_k$:

$$\begin{aligned} S_k^{\prime max} &= \max(\lambda_1 S_k^{max}, S_k), \\ S_k^{\prime min} &= \min(\lambda_2 S_k^{min}, S_k), \end{aligned} \tag{5}$$

where $\lambda_1$ and $\lambda_2$ are decay coefficients taking the empirical values of $0.95$ and $1.05$, respectively.
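The update in Eq. 5 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `update_range` and its argument names are hypothetical, and the defaults simply reuse the empirical values of $\lambda_1$ and $\lambda_2$ quoted above.

```python
def update_range(s_max, s_min, s_k, lam1=0.95, lam2=1.05):
    """Return the updated (S_k^max, S_k^min) after observing coverage s_k (Eq. 5).

    Hypothetical helper; lam1 and lam2 default to the empirical values in the text.
    """
    new_max = max(lam1 * s_max, s_k)  # decay the stored maximum unless s_k exceeds it
    new_min = min(lam2 * s_min, s_k)  # relax the stored minimum unless s_k is below it
    return new_max, new_min


if __name__ == "__main__":
    s_max, s_min = 10.0, 2.0
    for s_k in (5.0, 12.0, 1.0):
        s_max, s_min = update_range(s_max, s_min, s_k)
        print(f"S_k={s_k:5.1f} -> S_max={s_max:.3f}, S_min={s_min:.3f}")
```

Because the stored extremes are multiplied by $\lambda_1 < 1$ and $\lambda_2 > 1$ at every update, a coverage value observed early in training gradually loses influence unless it keeps being observed.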

Figure 6: Based on the rendering resolution, the current pixel coverage of a Gaussian relative to its minimum and maximum pixel coverages determines whether it is selected for rendering.

During rendering at any resolution or camera distance, a Gaussian is selected for rendering if its pixel coverage $S_k$ on the screen satisfies the following condition:

$$\left(\frac{S_k}{S_k^{max}} \leq S_{rel}^{max}\right) \land \left(\frac{S_k}{S_k^{min}} \geq S_{rel}^{min} \lor S_k \geq S_T\right), \tag{6}$$

where $S_{rel}^{max}$ and $S_{rel}^{min}$ are the maximum and minimum relative pixel coverages, taking the empirical values of $1.5$ and $0.5$ respectively. If the pixel coverage of a Gaussian exceeds $S_k^{max}$ by more than the factor $S_{rel}^{max}$, it is filtered out from rendering. Similarly, if it falls below $S_k^{min}$ by more than the factor $S_{rel}^{min}$ and is also smaller than $S_T$, it is filtered out from rendering (cf. Fig. 6). The absolute threshold $S_T$ is used to preserve the large Gaussians from the lower scales, as they do not cause aliasing as long as their screen size is not too small. This selective rendering procedure is described in Algorithm 2. Additionally, Gaussians from the finest level are never filtered for being too large, and Gaussians from the coarsest level are never filtered for being too small, so that the scene can still be rendered above the maximum and below the minimum training resolutions.
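As a rough illustration of the test in Eq. 6, the following Python sketch evaluates the condition for a single Gaussian and for a list of Gaussians. The names `is_selected` and `select` are hypothetical, the default `s_t=2.0` follows the value $S_T = 2\,\mathrm{px}$ given in the supplementary material, and the exemption for the finest and coarsest levels is deliberately left out. Stored coverages are assumed to be strictly positive.

```python
def is_selected(s_k, s_max, s_min, s_rel_max=1.5, s_rel_min=0.5, s_t=2.0):
    """Evaluate Eq. 6 for one Gaussian (hypothetical helper, level exemptions omitted)."""
    not_too_large = s_k / s_max <= s_rel_max
    # A Gaussian is kept if it is not far below its training range,
    # or if it is still at least s_t pixels wide in absolute terms.
    not_too_small = (s_k / s_min >= s_rel_min) or (s_k >= s_t)
    return not_too_large and not_too_small


def select(coverages, ranges, **thresholds):
    """Return indices of Gaussians passing Eq. 6.

    coverages: current pixel coverage S_k of each Gaussian.
    ranges: (S_k^max, S_k^min) pairs stored during training.
    """
    return [
        k
        for k, (s_k, (s_max, s_min)) in enumerate(zip(coverages, ranges))
        if is_selected(s_k, s_max, s_min, **thresholds)
    ]


if __name__ == "__main__":
    coverages = [3.0, 10.0, 0.5, 2.5]
    ranges = [(4.0, 2.0), (4.0, 2.0), (4.0, 2.0), (10.0, 8.0)]
    print(select(coverages, ranges))
```

In this toy input, the second Gaussian is rejected for being too large, the third for being too small, and the fourth survives only because of the absolute threshold $S_T$.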

The pixel coverage range of each Gaussian allows the model to maintain multi-scale Gaussians for different levels of detail. The appropriate subset of Gaussians is chosen for rendering at different resolutions and distances. A larger number of small Gaussians encoding the high-frequency information is rendered at high resolution, while fewer, larger Gaussians encoding the low-frequency information are rendered at low resolution, which reduces aliasing and improves speed.

Table 1: Quantitative comparison and ablation study on the 360 dataset [3] at various downsampled scales, with time in “ms”.
| Method | 1x PSNR↑ | 1x LPIPS↓ | 1x Time↓ | 4x PSNR↑ | 4x LPIPS↓ | 4x Time↓ | 16x PSNR↑ | 16x LPIPS↓ | 16x Time↓ | 64x PSNR↑ | 64x LPIPS↓ | 64x Time↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3D Gaussian[12] | 27.52 | 0.142 | 10.5 | 22.50 | 0.137 | 9.3 | 17.79 | 0.149 | 27.9 | 15.23 | N.A. | 103.3 |
| 3DGS + MS Train | 27.35 | 0.155 | 11.3 | 23.50 | 0.126 | 7.7 | 20.21 | 0.115 | 22.8 | 19.38 | N.A. | 84.8 |
| 3DGS + Filter Small | 27.40 | 0.153 | 10.0 | 23.81 | 0.149 | 5.4 | 20.02 | 0.186 | 4.8 | 17.38 | N.A. | 4.6 |
| 3DGS + Insert Large | 18.02 | 0.604 | 9.7 | 18.75 | 0.531 | 2.5 | 20.23 | 0.256 | 2.7 | 21.53 | N.A. | 7.1 |
| Our Full Method | 27.39 | 0.155 | 9.1 | 24.82 | 0.132 | 5.4 | 24.75 | 0.066 | 4.9 | 25.35 | N.A. | 4.9 |

Table 2: Quantitative comparison and ablation study on Tank and Temples dataset [13] at various downsampled scales, with time in “ms”.
| Method | 1x PSNR↑ | 1x LPIPS↓ | 1x Time↓ | 4x PSNR↑ | 4x LPIPS↓ | 4x Time↓ | 16x PSNR↑ | 16x LPIPS↓ | 16x Time↓ | 64x PSNR↑ | 64x LPIPS↓ | 64x Time↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3D Gaussian[12] | 23.74 | 0.096 | 6.5 | 19.70 | 0.105 | 11.1 | 15.61 | 0.068 | 43.4 | 13.88 | N.A. | 82.6 |
| 3DGS + MS Train | 22.97 | 0.118 | 6.0 | 21.46 | 0.086 | 9.6 | 18.56 | 0.049 | 37.4 | 16.54 | N.A. | 71.7 |
| 3DGS + Filter Small | 23.78 | 0.100 | 5.6 | 20.12 | 0.107 | 4.5 | 17.41 | 0.072 | 4.4 | 14.95 | N.A. | 4.7 |
| 3DGS + Insert Large | 10.84 | 0.697 | 5.1 | 11.15 | 0.703 | 1.7 | 11.73 | 0.447 | 1.7 | 12.62 | N.A. | 2.5 |
| Our Full Method | 23.46 | 0.111 | 7.6 | 21.92 | 0.087 | 4.7 | 20.91 | 0.034 | 4.8 | 19.67 | N.A. | 5.9 |

Table 3: Quantitative comparison and ablation study on the Deep Blending dataset [8] at various downsampled scales, with time in “ms”.
| Method | 1x PSNR↑ | 1x LPIPS↓ | 1x Time↓ | 4x PSNR↑ | 4x LPIPS↓ | 4x Time↓ | 16x PSNR↑ | 16x LPIPS↓ | 16x Time↓ | 64x PSNR↑ | 64x LPIPS↓ | 64x Time↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3D Gaussian[12] | 29.65 | 0.094 | 8.6 | 27.48 | 0.066 | 7.5 | 22.06 | 0.067 | 20.7 | 17.75 | N.A. | 59.7 |
| 3DGS + MS Train | 29.46 | 0.102 | 6.6 | 28.18 | 0.062 | 5.3 | 24.13 | 0.055 | 14.3 | 20.03 | N.A. | 41.3 |
| 3DGS + Filter Small | 29.68 | 0.095 | 6.7 | 28.26 | 0.064 | 4.2 | 24.52 | 0.078 | 3.6 | 18.29 | N.A. | 3.2 |
| 3DGS + Insert Large | 20.59 | 0.379 | 4.6 | 20.83 | 0.336 | 1.6 | 21.29 | 0.143 | 2.1 | 20.10 | N.A. | 4.2 |
| Our Full Method | 29.70 | 0.096 | 7.4 | 28.43 | 0.064 | 3.9 | 27.66 | 0.036 | 3.4 | 25.70 | N.A. | 3.4 |

5 Experiments

In this section, we present a comprehensive evaluation of our proposed model, which builds on the official release of the 3D Gaussian Splatting code. To achieve a training time similar to that of the baseline model, our models are trained for 40000 iterations with all other hyper-parameters unchanged. All rendering speeds are measured on a single RTX3090 GPU. We evaluate the vanilla 3D Gaussian Splatting[12] algorithm and our model on the multi-scale 360[3], Tank And Temples[13], and Deep Blending[8] datasets, in line with the data used by the original paper. These datasets cover a wide range of object-centric, indoor, and outdoor scenes.

Our evaluation focuses on the rendering quality and speed at multiple downsampling scales of 1x, 4x, 16x, and 64x derived from the test views. The rendering quality is measured in PSNR and LPIPS, while the speed is measured in per-image rendering time. This multi-scale evaluation is aimed at simulating the rendering performance in scenarios of low-resolution imaging or when captured from distant cameras. More detailed evaluations, including the results for more resolution scales and per-scene decomposition, are included in the supplementary materials due to the space constraint. Additionally, the supplementary materials include a video that offers an intuitive qualitative comparison of the two algorithms, vividly demonstrating the improvement of our algorithm in quality and speed from multiple viewpoints.
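For reference, PSNR follows the standard definition $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The sketch below computes it over flat lists of pixel intensities. The function name `psnr` and the assumption that intensities lie in $[0, 1]$ are illustrative choices and are not taken from the evaluation code used in the paper.

```python
import math


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists.

    Assumes intensities in [0, max_value]; hypothetical helper for illustration.
    """
    squared_errors = [(r - g) ** 2 for r, g in zip(rendered, reference)]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    reference = [0.0, 0.0, 0.0, 0.0]
    rendered = [0.1, 0.1, 0.1, 0.1]
    print(f"{psnr(rendered, reference):.2f} dB")
```

Since the scale is logarithmic, a uniform error of 0.1 on a unit intensity range corresponds to roughly 20 dB, and each further 10 dB corresponds to a tenfold reduction in mean squared error.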

Figure 7: Qualitative Comparison on 360 dataset[3] for different resolution scales.
Figure 8: Qualitative Comparison on Tank and Temples dataset[13] for different resolution scales.
Figure 9: Qualitative Comparison on Deep Blending dataset[8] for different resolution scales.

Quantitative Comparison.

As shown in Tab. 1, Tab. 2, and Tab. 3, our method can achieve substantial quality and speed improvements compared to the original 3D Gaussian Splatting [12] at lower resolutions. The quality and speed improvements become more pronounced as the resolution reduces, with the most noticeable 6-10dB PSNR and 20-30$\times$ speed gain at the $64\times$ resolution scale. As the resolution reduces, the original splatting algorithm slows down while our method accelerates. The rendering quality and speed at the original resolution ($1\times$) remain comparable, indicating the effectiveness of our multi-scale Gaussians in representing both the high and low resolutions together.

Qualitative Comparison.

We present the qualitative comparison with the original 3D Gaussian Splatting [12] in Fig. 7, Fig. 8, and Fig. 9. At higher resolutions ($1\times$-$8\times$), both our algorithm and the original one render the novel view rather faithfully. However, as the resolution reduces further ($16\times$-$64\times$), the original splatting algorithm produces severe artifacts, where the foreground becomes larger and larger, dominating the pixel colors as explained in Sec. 3.2. In contrast, the images rendered by our method closely resemble the ground truth across all resolution scales.

Ablations.

To evaluate the effectiveness of the different components proposed, we present the ablation quantitative results in Tab. 1, Tab. 2, and Tab. 3 and the ablation qualitative results in the supplementary. The three ablation methods evaluated are named “3DGS+MS Train”, “3DGS+Filter Small”, and “3DGS+Insert Large”. “3DGS+MS Train” reports the result with multi-scale training on top of the original 3D Gaussian splatting. The “3DGS+Filter Small” reports the result with small Gaussian filtering using pixel coverage and the multi-scale training, which is needed to update the maximum and minimum pixel coverage. Similarly, the “3DGS+Insert Large” reports the result with large Gaussian insertion and the multi-scale training.

The ablation results reflect that multi-scale training marginally improves low-resolution rendering quality, but the rendering speed remains very slow. When filtering out the small Gaussians with multi-scale training, the speed at low resolution is increased by 20-30$\times$ with minimal rendering quality loss. The speed gain is caused by the considerably fewer Gaussians rendered. When inserting the large Gaussians and training with multi-scale supervision, without the small Gaussian filtering, the rendering quality drops significantly because the details of the scene are covered with large Gaussians completely for all resolutions. However, when adding the large Gaussians together with the small Gaussian filtering, the rendering quality and speed at low resolution are enhanced significantly without jeopardizing the high-resolution quality. This indicates the effectiveness of all three components and the full method proposed.

6 Limitations

Since all Gaussian filtering in our proposed method relies on the pixel coverage, it can only be done after the initial splatting process, when the coverage is calculated. Although the splatting of individual Gaussians is performed in parallel and does not take more time at lower resolutions, it is still a considerable overhead when rendering at a very low resolution. Even if only a very small portion of the Gaussians is used for rendering in the end, all Gaussians still need to be splatted. This is the main reason why our rendering time does not decrease linearly as the resolution decreases. In future work, we will look into a more lightweight criterion to filter small and large Gaussians before splatting them onto the screen to achieve an even faster rendering speed.

7 Conclusion

We analyzed the cause of the severe aliasing artifacts and speed degradation of the existing 3D Gaussian splatting. We identified that the key challenge in mitigating aliasing for 3D Gaussian splatting lies in representing the scene with Gaussians of appropriate scale. Based on this observation, we proposed to calculate the pixel coverage of 3D Gaussians during splatting and use it as a criterion for selective rendering. Gaussians that are too large or too small at the current rendering resolution are filtered for anti-aliasing and speed improvements. We also proposed to insert large Gaussians by aggregating small Gaussians during training to preserve the low-frequency details and prevent missing parts. Our experiments on various datasets support the effectiveness of our algorithm in rendering quality and speed at both high and low resolutions, mitigating the severe aliasing artifacts of the original 3D Gaussian splatting.

Acknowledgement.

This work is supported by the Agency for Science, Technology and Research (A*STAR) under its MTC Programmatic Funds (Grant No. M23L7b0021).


Supplementary Material

8 Video Comparison

To better demonstrate the improvement of our algorithm in quality and speed for different resolutions, we include a video comparing our results with the original 3D Gaussian Splatting[12] on multiple scenes from different views and resolutions.

9 Details of Gaussian Aggregation Algorithm

Due to the space constraint of the main paper, some details of the Gaussian aggregation process are omitted. In this section, we will elaborate further with some examples to help the readers understand and reproduce our work. The process consists of the following steps:

Render at Lower Resolution.

Since we want to insert large Gaussians that are of appropriate size to be rendered at lower resolutions, we need to aggregate small Gaussians to form large Gaussians. Because pixel coverage is used to determine whether a Gaussian is too small, we first need to render all Gaussians to calculate their pixel coverage at all training cameras. For all coarse levels $l_m \in [2, l_{max}]$, we render all Gaussians from levels $[1, l_m - 1]$ at $4^{l_m - 1}$ times downsampled resolution. For example, we render all Gaussians from level 1 to 3 at the $64\times$ downsampled resolution from all training cameras to add large Gaussians for level 4. A Gaussian splatted to any of the training cameras with a pixel coverage $S_k$ smaller than $S_T$ is considered too small, and is included in the next step of aggregation.
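The level-to-resolution mapping and the "too small" test described above can be sketched as follows. The helper names and the dictionary layout are hypothetical, and the sketch assumes that only cameras in which a Gaussian is actually splatted contribute a coverage value. The default threshold uses $S_T = 2\,\mathrm{px}$ from Sec. 10.

```python
def downsample_factor(level):
    """Downsampling factor 4^(l_m - 1) used when building coarse level `level`."""
    return 4 ** (level - 1)


def aggregation_candidates(coverage_per_camera, s_t=2.0):
    """Return ids of Gaussians considered too small at the current level.

    coverage_per_camera: dict mapping a Gaussian id to its pixel coverages,
    one entry per training camera in which it is splatted (assumed layout).
    """
    return [
        gid
        for gid, coverages in coverage_per_camera.items()
        if any(s < s_t for s in coverages)
    ]


if __name__ == "__main__":
    print(downsample_factor(4))  # resolution factor used when adding level 4
    coverages = {0: [3.1, 2.4], 1: [2.2, 1.7], 2: [0.4, 0.9]}
    print(aggregation_candidates(coverages))
```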

10 Theoretical Anti-aliasing Effectiveness of Gaussian Aggregation for 1D Signals

Our algorithm eschews low-pass filters for individual Gaussians as they do not mitigate the slow rendering speed. Instead, as shown in Fig. 10, we opt to substitute smaller Gaussians with fewer, larger ones, reducing the signal bandwidth and the number of primitives rendered. Heeding the reviewer’s suggestion, we now delve deeper into the signal-processing analysis of our algorithm’s anti-aliasing effect from first principles. Aliasing arises when a signal’s bandwidth surpasses half the sampling frequency, as per the Nyquist sampling theorem. Taking the mixture of 1D Gaussians $\Sigma_i e^{-a_i (x - x_i)^2}$ as an example, where $a_i = \frac{1}{2\sigma_i^2}$, we aim to prove that in our algorithm, they are consistently substituted with a Gaussian whose 3dB bandwidth is below the aliasing frequency threshold $0.5\,\mathrm{px}^{-1}$.

According to our algorithm, the mixture of Gaussians is first aggregated into an average Gaussian $g_{avg}(x) = e^{-a(x-\mu)^2}$ with average $a = \frac{1}{N}\Sigma_i a_i$ and $\mu = \frac{1}{N}\Sigma_i x_i$. We can apply the Fourier transform to convert it to the frequency domain, where, up to a unit-magnitude phase factor due to the shift $\mu$, it becomes $\mathcal{F}[g_{avg}(x)](f) = \sqrt{\frac{\pi}{a}} e^{-\pi^2 f^2 / a}$. The 3dB bandwidth $f_{3dB}$ is the frequency where the magnitude is $1/\sqrt{2}$ of its peak magnitude. By solving

$$\sqrt{\frac{\pi}{a}} e^{-\pi^2 f_{3dB}^2 / a} = \frac{1}{\sqrt{2}} \sqrt{\frac{\pi}{a}}, \tag{7}$$

we find $f_{3dB} = \frac{1}{\pi}\sqrt{a \ln{\sqrt{2}}}$.

Our algorithm then scales the standard deviation up by $S_T / S$, where $S_T = 2\,\mathrm{px}$ is the selective rendering threshold and $S$ is the pixel coverage of the Gaussian. We determine $S$ by calculating the size at its level set, solving $e^{-a(0.5S)^2} = 1/N$ with $1/N = 1/255$ on 8-bit color images. This yields $S = 2\sqrt{\frac{1}{a}\ln{N}}$ and thus the scaled standard deviation becomes $\sigma' = \frac{2}{2\sqrt{\frac{1}{a}\ln{N}}}\sigma$. Given $a = \frac{1}{2\sigma^2}$, and hence $a' = \frac{1}{2\sigma'^2} = a\,(S/S_T)^2$, the scaled $a' = \frac{1}{a}\ln N \cdot a = \ln N$. Consequently, we calculate the 3dB bandwidth $f'_{3dB}$ of the scaled Gaussian as:

$$\begin{aligned} f'_{3dB} &= \frac{1}{\pi}\sqrt{a' \ln{\sqrt{2}}} = \frac{1}{\pi}\sqrt{\ln{N} \cdot \ln{\sqrt{2}}} \\ &= 0.441\,\mathrm{px}^{-1} < 0.5\,\mathrm{px}^{-1}. \end{aligned} \tag{8}$$

This indicates that the bandwidth of the scaled Gaussians remains invariant to the attributes of the smaller Gaussians they replace, and is below half of the sampling frequency to avoid aliasing. While differing from the traditional low-pass filtering, our method is equally effective in anti-aliasing but more efficient in rendering.
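The derivation can be checked numerically with a short Python sketch. The function names are hypothetical. The formulas are those of Eq. 7 and Eq. 8, with $N = 255$ and $S_T = 2\,\mathrm{px}$ as stated above.

```python
import math


def f_3db(a):
    """3dB bandwidth of exp(-a x^2): (1/pi) * sqrt(a * ln(sqrt(2)))."""
    return math.sqrt(a * math.log(math.sqrt(2.0))) / math.pi


def coverage(a, n=255):
    """Pixel coverage S from the level set exp(-a (0.5 S)^2) = 1/n."""
    return 2.0 * math.sqrt(math.log(n) / a)


def rescaled_a(a, s_t=2.0, n=255):
    """Coefficient a' after scaling sigma by S_T / S, i.e. a' = a * (S / S_T)^2."""
    return a * (coverage(a, n) / s_t) ** 2


if __name__ == "__main__":
    for a in (0.5, 20.0, 500.0):
        a_new = rescaled_a(a)
        print(f"a={a:7.1f}  a'={a_new:.4f}  f'_3dB={f_3db(a_new):.4f} px^-1")
```

Whatever value of $a$ the averaged Gaussian starts with, the rescaled coefficient collapses to $\ln N$, so the resulting bandwidth is the same constant in every case, matching the invariance claimed above.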

Figure 10: Instead of low-pass filters, we replace smaller Gaussians with fewer larger Gaussians, ensuring their bandwidth is below the aliasing threshold $0.5\,\mathrm{px}^{-1}$.

Unbounded Scene Normalization.

The Gaussians can be located anywhere in the range $(-\infty, \infty)$ in unbounded scenes. This is not suitable for the later voxelization, as only a limited number of voxels can be used. To normalize the unbounded space, the center region and the outer region are handled in different manners. The space bounded by an axis-aligned cube of length $B$ defined by the span of all training cameras is considered the center region, and the rest is considered the outer region. To preserve the structure in the center region, the coordinates are linearly scaled from $[-B, B]$ to $[-1, 1]$. To normalize the unbounded outer region, the coordinates are non-linearly scaled from $(-\infty, \infty)$ to $(-2, 2)$. The exact normalization is as follows:

$$\mathbf{x}_{norm} = \begin{cases} \mathbf{x}/B, & \text{if } \max(|\mathbf{x}|) \leq B \\ 2 - B/\mathbf{x}, & \text{otherwise} \end{cases}. \tag{9}$$
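A minimal Python sketch of this normalization is given below. It makes two assumptions that are not spelled out in Eq. 9: the mapping is applied independently to each coordinate, and the outer branch is mirrored for negative coordinates so that they land in $(-2, -1)$, in line with the stated target range $(-2, 2)$. The function names are hypothetical.

```python
import math


def normalize_coord(x, b):
    """Normalize one coordinate: linear inside [-b, b], contracted outside.

    Assumption: the outer branch is applied symmetrically via the sign of x.
    """
    if abs(x) <= b:
        return x / b
    return math.copysign(2.0 - b / abs(x), x)


def normalize_point(point, b):
    """Apply the per-coordinate normalization to a 3D point (assumed usage)."""
    return tuple(normalize_coord(c, b) for c in point)


if __name__ == "__main__":
    b = 2.0
    for p in [(1.0, -1.0, 0.0), (4.0, -4.0, 2.0), (1e6, -1e6, 0.5)]:
        print(p, "->", normalize_point(p, b))
```

The two branches agree at $|x| = B$, where both give a magnitude of 1, so the mapping is continuous across the boundary of the center region.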

Voxelization.

After the Gaussian positions are normalized to $[-2, 2]$, they need to be voxelized so that all Gaussians in one voxel are grouped together for the aggregation later. The size of the voxel increases as the resolution decreases because coarser levels require fewer, larger Gaussians. Specifically, when inserting large Gaussians for level $l_m$, the voxel size is chosen to be an empirical value of $(400/l_m)^3$. All Gaussians with their center in one voxel are grouped together for the next step. Although it is possible for a Gaussian to extend beyond the voxel while its center resides in the voxel, it is unlikely to reach too far as large Gaussians are filtered out in the earlier procedure.
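The grouping step can be sketched in Python as below. The sketch reads $(400/l_m)^3$ as a grid of $400/l_m$ voxels per axis over the normalized range $[-2, 2]$, which is an interpretation rather than something the text states explicitly. It is at least consistent with voxels growing at coarser levels. The function name `voxelize` and its parameters are hypothetical.

```python
from collections import defaultdict


def voxelize(points, level, base=400, lo=-2.0, hi=2.0):
    """Group point indices by the voxel containing each normalized center.

    Assumed reading: the grid has (base // level) voxels along each axis.
    """
    n = max(1, base // level)
    groups = defaultdict(list)
    for idx, point in enumerate(points):
        key = []
        for c in point:
            i = int((c - lo) / (hi - lo) * n)
            key.append(min(n - 1, max(0, i)))  # clamp so boundary values stay in range
        groups[tuple(key)].append(idx)
    return dict(groups)


if __name__ == "__main__":
    centers = [(0.0, 0.0, 0.0), (0.001, 0.001, 0.001), (1.5, -1.5, 0.3)]
    for voxel, members in voxelize(centers, level=2).items():
        print(voxel, members)
```

Using a dictionary keyed by voxel index keeps memory proportional to the number of occupied voxels, which matters because most of a fine grid is empty.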

Average Pooling and Enlargement.

After the small Gaussians are grouped in individual voxels, their parameters are averaged to create the large Gaussian. Specifically, the large Gaussian takes the average position, rotation, spherical harmonics features, opacity and scaling. However, the new Gaussian would be too small if it remained at this scaling. Consequently, we calculate the average pixel coverage $S_{avg}$ of all the aggregated small Gaussians using their pixel coverage derived earlier. The scaling of the new Gaussian is then enlarged by $S_T / S_{avg}$ so that its pixel coverage is approximately $S_T$, which is suitable for rendering at level $l_m$. This average pooling is not perfect, but it is simple and effective enough to produce a reasonable initialization for the multi-scale training later.
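The pooling and enlargement step might look like the following Python sketch. It assumes each Gaussian is a dictionary of attribute vectors stored in linear space with a `"scale"` entry, and it averages every attribute component-wise, which is a simplification for rotations in particular. The function names and data layout are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def aggregate_voxel(gaussians, coverages, s_t=2.0):
    """Average the Gaussians of one voxel and enlarge the result by S_T / S_avg.

    gaussians: list of dicts mapping attribute name -> list of floats (assumed layout).
    coverages: pixel coverage of each Gaussian at the target level.
    """
    merged = {}
    for name in gaussians[0]:
        columns = zip(*(g[name] for g in gaussians))
        merged[name] = [mean(column) for column in columns]
    factor = s_t / mean(coverages)  # enlargement so that coverage is roughly S_T
    merged["scale"] = [s * factor for s in merged["scale"]]
    return merged


if __name__ == "__main__":
    small = [
        {"position": [0.0, 0.0, 0.0], "scale": [0.1, 0.1, 0.1], "opacity": [0.4]},
        {"position": [2.0, 2.0, 2.0], "scale": [0.3, 0.3, 0.3], "opacity": [0.8]},
    ]
    print(aggregate_voxel(small, coverages=[0.5, 1.5]))
```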

11 Qualitative Ablation Study

To better compare the effectiveness of each of our proposed modules qualitatively, we present the rendering results of our method and various ablation models in Fig. 11-15. The ablation model design follows the experiment section in the main paper. Specifically, the “+MS Train” model is trained using multi-scale images, but the Gaussians are only of a single scale as in 3D Gaussian Splatting [12]. The low-resolution performance is slightly improved, but the rendering speed is as slow as the original method. The “+Filter Small” model filters the small Gaussians based on the pixel coverage on top of the multi-scale training. It significantly accelerates the low-resolution rendering process, but some parts of the scene are missing, as shown in the rendered images. The rendered image also has artifacts such as black dots at low resolutions, caused by the filtered small Gaussians. The “+Insert Large” model inserts the large Gaussians from aggregation on top of the multi-scale training. It has good rendering speed and quality at low resolutions, but the image rendered at high resolution is over-smoothed. This is because the finer-level Gaussians are not filtered out but are optimized together with the inserted large Gaussians at low resolutions. Our “Full Method” overcomes the weaknesses of the ablation models and produces high-quality rendering at fast speed at both high and low resolutions. Filtering the small Gaussians improves the speed, and inserting the large Gaussians improves the quality at low resolutions. The qualitative ablation supports the effectiveness of our proposed components.

Figure 11: Qualitative ablation results of our proposed method on the “Bicycle” scene.
Figure 12: Qualitative ablation results of our proposed method on the “Counter” scene.
Figure 13: Qualitative ablation results of our proposed method on the “Garden” scene.
Figure 14: Qualitative ablation results of our proposed method on the “Treehill” scene.
Figure 15: Qualitative ablation results of our proposed method on the “Truck” scene.

12 Quantitative Results on More Resolutions

We present the quantitative results of our method, the original 3D Gaussian Splatting [12], and the various ablation methods at more downsampled resolutions. These include resolutions that are not used during training, which demonstrates the performance and robustness of our model. The experiments are conducted on the Mip-NeRF 360 dataset [3] as shown in Tab. 4, the Tanks and Temples dataset [13] as shown in Tab. 5, and the Deep Blending dataset [8] as shown in Tab. 6.

Table 4: Quantitative comparison and ablation study on the Mip-NeRF 360 dataset [3] at more downsampled scales. Time is measured in ms.

| Method | 1x PSNR↑ | 1x LPIPS↓ | 1x Time↓ | 2x PSNR↑ | 2x LPIPS↓ | 2x Time↓ | 4x PSNR↑ | 4x LPIPS↓ | 4x Time↓ | 8x PSNR↑ | 8x LPIPS↓ | 8x Time↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3D Gaussian [12] | 27.52 | 0.142 | 10.5 | 25.96 | 0.124 | 8.0 | 22.50 | 0.137 | 9.3 | 19.79 | 0.154 | 14.6 |
| 3DGS + MS Train | 27.35 | 0.155 | 11.3 | 26.33 | 0.128 | 7.3 | 23.50 | 0.126 | 7.7 | 21.38 | 0.131 | 12.1 |
| 3DGS + Filter Small | 27.40 | 0.153 | 10.0 | 26.42 | 0.129 | 6.8 | 23.81 | 0.149 | 5.4 | 21.73 | 0.175 | 5.1 |
| 3DGS + Insert Large | 18.02 | 0.604 | 9.7 | 18.28 | 0.593 | 3.4 | 18.75 | 0.531 | 2.5 | 19.39 | 0.419 | 2.2 |
| Our Method | 27.39 | 0.155 | 9.1 | 26.44 | 0.134 | 6.3 | 24.82 | 0.132 | 5.4 | 24.44 | 0.112 | 5.1 |

| Method | 16x PSNR↑ | 16x LPIPS↓ | 16x Time↓ | 32x PSNR↑ | 32x LPIPS↓ | 32x Time↓ | 64x PSNR↑ | 64x LPIPS↓ | 64x Time↓ | 128x PSNR↑ | 128x LPIPS↓ | 128x Time↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3D Gaussian [12] | 17.79 | 0.149 | 27.9 | 16.30 | 0.084 | 55.2 | 15.23 | N.A. | 103.3 | 14.55 | N.A. | 123.2 |
| 3DGS + MS Train | 20.21 | 0.115 | 22.8 | 19.80 | 0.060 | 45.6 | 19.38 | N.A. | 84.8 | 18.75 | N.A. | 100.1 |
| 3DGS + Filter Small | 20.02 | 0.186 | 4.8 | 18.81 | 0.090 | 4.4 | 17.38 | N.A. | 4.6 | 16.13 | N.A. | 4.8 |
| 3DGS + Insert Large | 20.23 | 0.256 | 2.7 | 21.17 | 0.081 | 4.6 | 21.53 | N.A. | 7.1 | 20.25 | N.A. | 9.4 |
| Our Method | 24.75 | 0.066 | 4.9 | 25.06 | 0.025 | 4.7 | 25.35 | N.A. | 4.9 | 22.55 | N.A. | 5.0 |

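
As a worked reading of Table 4, the short sketch below computes the PSNR gain in dB and the rendering-time ratio of our method over the original 3D Gaussian Splatting at each scale. The numbers are copied from the “3D Gaussian [12]” and “Our Method” rows of Table 4; the function name `compare` and the dictionary layout are illustrative choices and not part of any released code.

```python
# (PSNR in dB, render time in ms) per downsampling scale, copied from Table 4.
BASELINE = {1: (27.52, 10.5), 2: (25.96, 8.0), 4: (22.50, 9.3), 8: (19.79, 14.6),
            16: (17.79, 27.9), 32: (16.30, 55.2), 64: (15.23, 103.3),
            128: (14.55, 123.2)}
OURS = {1: (27.39, 9.1), 2: (26.44, 6.3), 4: (24.82, 5.4), 8: (24.44, 5.1),
        16: (24.75, 4.9), 32: (25.06, 4.7), 64: (25.35, 4.9),
        128: (22.55, 5.0)}

def compare(baseline, ours):
    """Return {scale: (psnr_gain_db, time_ratio)}.

    psnr_gain_db is ours minus baseline; time_ratio is baseline time divided
    by our time, so a value above 1 means our method renders faster.
    """
    result = {}
    for scale in sorted(baseline):
        base_psnr, base_ms = baseline[scale]
        ours_psnr, ours_ms = ours[scale]
        result[scale] = (ours_psnr - base_psnr, base_ms / ours_ms)
    return result

if __name__ == "__main__":
    for scale, (gain, ratio) in compare(BASELINE, OURS).items():
        print(f"{scale:>3}x  PSNR gain {gain:+.2f} dB  time ratio {ratio:.2f}")
```

At 1x the two methods are close in both metrics, whereas at 128x the baseline time of 123.2 ms against 5.0 ms gives a time ratio of about 24.6.
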
Table 5: Quantitative comparison and ablation study on the Tanks and Temples dataset [13] at more downsampled scales. Time is measured in ms.

| Method | 1x PSNR↑ | 1x LPIPS↓ | 1x Time↓ | 2x PSNR↑ | 2x LPIPS↓ | 2x Time↓ | 4x PSNR↑ | 4x LPIPS↓ | 4x Time↓ | 8x PSNR↑ | 8x LPIPS↓ | 8x Time↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3D Gaussian [12] | 23.74 | 0.096 | 6.5 | 22.55 | 0.080 | 7.1 | 19.70 | 0.105 | 11.1 | 17.34 | 0.117 | 21.5 |
| 3DGS + MS Train | 22.97 | 0.118 | 6.0 | 23.04 | 0.083 | 6.3 | 21.46 | 0.086 | 9.6 | 20.18 | 0.080 | 18.5 |
| 3DGS + Filter Small | 23.78 | 0.100 | 5.6 | 22.76 | 0.079 | 5.1 | 20.12 | 0.107 | 4.5 | 18.62 | 0.122 | 4.4 |
| 3DGS + Insert Large | 10.84 | 0.697 | 5.1 | 10.96 | 0.719 | 2.4 | 11.15 | 0.703 | 1.7 | 11.40 | 0.631 | 1.6 |
| Our Method | 23.46 | 0.111 | 7.6 | 22.44 | 0.095 | 5.6 | 21.92 | 0.087 | 4.7 | 20.88 | 0.082 | 4.6 |

| Method | 16x PSNR↑ | 16x LPIPS↓ | 16x Time↓ | 32x PSNR↑ | 32x LPIPS↓ | 32x Time↓ | 64x PSNR↑ | 64x LPIPS↓ | 64x Time↓ |
|---|---|---|---|---|---|---|---|---|---|
| 3D Gaussian [12] | 15.61 | 0.068 | 43.4 | 14.45 | N.A. | 70.9 | 13.88 | N.A. | 82.6 |
| 3DGS + MS Train | 18.56 | 0.049 | 37.4 | 17.41 | N.A. | 61.7 | 16.54 | N.A. | 71.7 |
| 3DGS + Filter Small | 17.41 | 0.072 | 4.4 | 16.05 | N.A. | 4.5 | 14.95 | N.A. | 4.7 |
| 3DGS + Insert Large | 11.73 | 0.447 | 1.7 | 12.14 | N.A. | 2.1 | 12.62 | N.A. | 2.5 |
| Our Method | 20.91 | 0.034 | 4.8 | 21.01 | N.A. | 5.4 | 19.67 | N.A. | 5.9 |

Table 6: Quantitative comparison and ablation study on the Deep Blending dataset [8] at more downsampled scales. Time is measured in ms.

| Method | 1x PSNR↑ | 1x LPIPS↓ | 1x Time↓ | 2x PSNR↑ | 2x LPIPS↓ | 2x Time↓ | 4x PSNR↑ | 4x LPIPS↓ | 4x Time↓ | 8x PSNR↑ | 8x LPIPS↓ | 8x Time↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3D Gaussian [12] | 29.65 | 0.094 | 8.6 | 29.41 | 0.065 | 6.6 | 27.48 | 0.066 | 7.5 | 24.67 | 0.076 | 11.3 |
| 3DGS + MS Train | 29.46 | 0.102 | 6.6 | 29.42 | 0.069 | 4.8 | 28.18 | 0.062 | 5.3 | 26.15 | 0.065 | 8.0 |
| 3DGS + Filter Small | 29.68 | 0.095 | 6.7 | 29.53 | 0.064 | 4.9 | 28.26 | 0.064 | 4.2 | 26.51 | 0.082 | 3.8 |
| 3DGS + Insert Large | 20.59 | 0.379 | 4.6 | 20.67 | 0.381 | 2.2 | 20.83 | 0.336 | 1.6 | 21.07 | 0.263 | 1.7 |
| Our Method | 29.70 | 0.096 | 7.4 | 29.58 | 0.065 | 4.8 | 28.43 | 0.064 | 3.9 | 27.59 | 0.063 | 3.6 |

| Method | 16x PSNR↑ | 16x LPIPS↓ | 16x Time↓ | 32x PSNR↑ | 32x LPIPS↓ | 32x Time↓ | 64x PSNR↑ | 64x LPIPS↓ | 64x Time↓ |
|---|---|---|---|---|---|---|---|---|---|
| 3D Gaussian [12] | 22.06 | 0.067 | 20.7 | 19.74 | N.A. | 36.3 | 17.75 | N.A. | 59.7 |
| 3DGS + MS Train | 24.13 | 0.055 | 14.3 | 22.09 | N.A. | 24.8 | 20.03 | N.A. | 41.3 |
| 3DGS + Filter Small | 24.52 | 0.078 | 3.6 | 22.01 | N.A. | 3.3 | 18.29 | N.A. | 3.2 |
| 3DGS + Insert Large | 21.29 | 0.143 | 2.1 | 21.14 | N.A. | 2.8 | 20.10 | N.A. | 4.2 |
| Our Method | 27.66 | 0.036 | 3.4 | 27.22 | N.A. | 3.3 | 25.70 | N.A. | 3.4 |

13 Per-Scene Quantitative Results

We present the per-scene breakdown of the quantitative results of our method and the original 3D Gaussian Splatting [12] at various resolutions. The experiments are carried out on the Mip-NeRF 360 dataset [3] as shown in Tab. 7, the Tanks and Temples dataset [13] as shown in Tab. 8, and the Deep Blending dataset [8] as shown in Tab. 9. The tested scenes follow the experiments carried out in the original 3D Gaussian Splatting paper [12].

Table 7: Per-scene performance breakdown on the Mip-NeRF 360 dataset [3]. Time is measured in ms.

| Scene | Method | 1x PSNR↑ | 1x LPIPS↓ | 1x Time↓ | 4x PSNR↑ | 4x LPIPS↓ | 4x Time↓ | 16x PSNR↑ | 16x LPIPS↓ | 16x Time↓ | 64x PSNR↑ | 64x LPIPS↓ | 64x Time↓ | 128x PSNR↑ | 128x LPIPS↓ | 128x Time↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| garden | 3D-GS [12] | 27.27 | 0.070 | 15.0 | 20.42 | 0.136 | 14.4 | 16.74 | 0.166 | 48.8 | 14.92 | N.A. | 200.9 | 14.29 | N.A. | 245.0 |
| garden | Ours | 27.16 | 0.080 | 11.8 | 23.99 | 0.112 | 7.8 | 26.41 | 0.044 | 7.5 | 24.79 | N.A. | 8.6 | 21.19 | N.A. | 9.6 |
| flowers | 3D-GS [12] | 21.41 | 0.309 | 9.1 | 18.89 | 0.239 | 8.8 | 15.46 | 0.165 | 24.9 | 13.90 | N.A. | 93.2 | 13.69 | N.A. | 112.2 |
| flowers | Ours | 21.11 | 0.333 | 8.1 | 20.83 | 0.234 | 5.7 | 21.97 | 0.093 | 5.1 | 22.69 | N.A. | 4.9 | 21.82 | N.A. | 5.0 |
| treehill | 3D-GS [12] | 22.60 | 0.274 | 10.0 | 21.63 | 0.232 | 9.7 | 18.71 | 0.193 | 24.6 | 16.19 | N.A. | 90.6 | 15.52 | N.A. | 97.0 |
| treehill | Ours | 22.64 | 0.291 | 8.7 | 22.31 | 0.239 | 5.8 | 23.55 | 0.072 | 5.4 | 24.28 | N.A. | 4.9 | 22.27 | N.A. | 5.0 |
| bicycle | 3D-GS [12] | 25.15 | 0.164 | 18.8 | 19.71 | 0.178 | 15.5 | 16.27 | 0.215 | 43.9 | 14.99 | N.A. | 163.8 | 15.15 | N.A. | 187.0 |
| bicycle | Ours | 24.44 | 0.210 | 13.4 | 24.76 | 0.131 | 7.4 | 25.00 | 0.081 | 6.4 | 26.02 | N.A. | 6.5 | 21.56 | N.A. | 6.9 |
| counter | 3D-GS [12] | 29.15 | 0.099 | 7.5 | 24.81 | 0.084 | 6.4 | 17.94 | 0.101 | 19.2 | 14.32 | N.A. | 60.4 | 13.39 | N.A. | 74.6 |
| counter | Ours | 29.17 | 0.100 | 6.6 | 26.77 | 0.076 | 3.3 | 23.44 | 0.057 | 2.8 | 24.59 | N.A. | 2.7 | 21.14 | N.A. | 2.7 |
| kitchen | 3D-GS [12] | 31.70 | 0.064 | 9.3 | 23.95 | 0.081 | 8.5 | 18.50 | 0.093 | 35.4 | 15.00 | N.A. | 124.4 | 14.15 | N.A. | 150.3 |
| kitchen | Ours | 31.64 | 0.064 | 8.1 | 25.93 | 0.089 | 4.2 | 24.16 | 0.049 | 3.9 | 25.35 | N.A. | 3.3 | 21.50 | N.A. | 3.2 |
| room | 3D-GS [12] | 31.63 | 0.093 | 8.0 | 26.60 | 0.057 | 5.1 | 19.50 | 0.096 | 12.0 | 15.50 | N.A. | 49.2 | 14.37 | N.A. | 70.8 |
| room | Ours | 31.51 | 0.094 | 6.6 | 28.95 | 0.053 | 3.1 | 28.15 | 0.025 | 2.9 | 25.77 | N.A. | 2.9 | 21.82 | N.A. | 2.9 |
| stump | 3D-GS [12] | 26.75 | 0.138 | 10.6 | 22.24 | 0.152 | 10.1 | 18.57 | 0.188 | 26.5 | 17.33 | N.A. | 95.2 | 16.97 | N.A. | 114.0 |
| stump | Ours | 26.59 | 0.152 | 12.9 | 23.52 | 0.150 | 8.2 | 25.22 | 0.112 | 7.2 | 29.22 | N.A. | 7.1 | 29.09 | N.A. | 7.2 |
| bonsai | 3D-GS [12] | 32.04 | 0.065 | 6.0 | 24.23 | 0.075 | 5.3 | 18.43 | 0.126 | 15.4 | 14.95 | N.A. | 52.4 | 13.46 | N.A. | 57.9 |
| bonsai | Ours | 32.27 | 0.067 | 5.5 | 26.32 | 0.106 | 3.3 | 24.87 | 0.062 | 2.8 | 25.40 | N.A. | 2.9 | 22.53 | N.A. | 2.8 |

Table 8: Per-scene performance breakdown on the Tanks and Temples dataset [13]. Time is measured in ms.

| Scene | Method | 1x PSNR↑ | 1x LPIPS↓ | 1x Time↓ | 4x PSNR↑ | 4x LPIPS↓ | 4x Time↓ | 16x PSNR↑ | 16x LPIPS↓ | 16x Time↓ | 64x PSNR↑ | 64x LPIPS↓ | 64x Time↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| truck | 3D-GS [12] | 25.39 | 0.064 | 7.3 | 19.97 | 0.103 | 11.3 | 15.69 | 0.064 | 49.2 | 14.20 | N.A. | 89.1 |
| truck | Ours | 24.94 | 0.078 | 9.0 | 23.67 | 0.059 | 5.4 | 22.62 | 0.024 | 6.0 | 19.99 | N.A. | 8.6 |
| train | 3D-GS [12] | 22.09 | 0.129 | 5.8 | 19.42 | 0.108 | 10.9 | 15.54 | 0.072 | 37.6 | 13.57 | N.A. | 76.1 |
| train | Ours | 21.98 | 0.144 | 6.2 | 20.17 | 0.114 | 3.9 | 19.21 | 0.044 | 3.5 | 19.36 | N.A. | 3.3 |

Table 9: Per-scene performance breakdown on the Deep Blending dataset [8]. Time is measured in ms.

| Scene | Method | 1x PSNR↑ | 1x LPIPS↓ | 1x Time↓ | 4x PSNR↑ | 4x LPIPS↓ | 4x Time↓ | 16x PSNR↑ | 16x LPIPS↓ | 16x Time↓ | 64x PSNR↑ | 64x LPIPS↓ | 64x Time↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| drjohnson | 3D-GS [12] | 29.14 | 0.106 | 10.1 | 27.23 | 0.079 | 9.3 | 22.73 | 0.078 | 26.3 | 18.60 | N.A. | 67.6 |
| drjohnson | Ours | 29.19 | 0.108 | 8.6 | 27.96 | 0.078 | 4.4 | 26.80 | 0.051 | 3.9 | 27.19 | N.A. | 3.8 |
| playroom | 3D-GS [12] | 30.15 | 0.082 | 7.0 | 27.72 | 0.053 | 5.7 | 21.40 | 0.056 | 15.0 | 16.89 | N.A. | 51.8 |
| playroom | Ours | 30.20 | 0.084 | 6.2 | 28.89 | 0.051 | 3.4 | 28.53 | 0.020 | 3.0 | 24.22 | N.A. | 3.0 |

References

[1] Kurt Akeley. Reality engine graphics. In Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques, pages 109–116, New York, NY, USA, 1993. Association for Computing Machinery.
[2] Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P. Srinivasan. Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields. ICCV, 2021.
[3] Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. Mip-NeRF 360: Unbounded anti-aliased neural radiance fields. CVPR, 2022.
[4] Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. Zip-NeRF: Anti-aliased grid-based neural radiance fields. ICCV, 2023.
[5] Kristof Beets and David L. Barron. Super-sampling anti-aliasing analyzed. 2000.
[6] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. TensoRF: Tensorial radiance fields. In European Conference on Computer Vision (ECCV), 2022.
[7] Carl Erikson. Polygonal simplification. Technical Report 96-016, 1996.
[8] Peter Hedman, Julien Philip, True Price, Jan-Michael Frahm, George Drettakis, and Gabriel Brostow. Deep blending for free-viewpoint image-based rendering. ACM Trans. Graph., 37(6), 2018.
[9] Tan Kim Heok and D. Daman. A review on level of detail. In Proceedings. International Conference on Computer Graphics, Imaging and Visualization, 2004. CGIV 2004., pages 70–75, 2004.
[10] Wenbo Hu, Yuling Wang, Lin Ma, Bangbang Yang, Lin Gao, Xiao Liu, and Yuewen Ma. Tri-MipRF: Tri-mip representation for efficient anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19774–19783, 2023.
[11] Jorge Jimenez, Diego Gutiérrez, Jason Yang, Alexander Reshetov, Pete Demoreuille, Tobias Berghoff, Cedric Perthuis, Henry Yu, Morgan McGuire, Timothy Lottes, Hugh Malan, and Emil Persson. Filtering approaches for real-time anti-aliasing. ACM SIGGRAPH 2011 Courses, SIGGRAPH’11, 2011.
[12] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.
[13] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and Temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics, 36(4), 2017.
[14] Timothy Lottes. FXAA. Technical report, NVIDIA, 2011.
[15] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
[16] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph., 41(4):102:1–102:15, 2022.
[17] Seungtae Nam, Daniel Rho, Jong Hwan Ko, and Eunbyung Park. Mip-Grid: Anti-aliased grid representations for neural radiance fields. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[18] Lance Williams. Pyramidal parametrics. SIGGRAPH Comput. Graph., 17(3):1–11, 1983.
[19] M. Zwicker, H. Pfister, J. van Baar, and M. Gross. EWA volume splatting. In Proceedings Visualization, 2001. VIS ’01., pages 29–538, 2001.