
GaussianCube: A Structured and Explicit
Radiance Representation for 3D Generative Modeling

Bowen Zhang1∗    Yiji Cheng2∗   Jiaolong Yang3†   Chunyu Wang3†
Feng Zhao1    Yansong Tang2    Dong Chen3‡    Baining Guo3
1University of Science and Technology of China  2Tsinghua University  3Microsoft Research Asia
Abstract

We introduce a radiance representation that is both structured and fully explicit and thus greatly facilitates 3D generative modeling. Existing radiance representations either require an implicit feature decoder, which significantly degrades the modeling power of the representation, or are spatially unstructured, making them difficult to integrate with mainstream 3D diffusion methods. We derive GaussianCube by first using a novel densification-constrained Gaussian fitting algorithm, which yields high-accuracy fitting using a fixed number of free Gaussians, and then rearranging these Gaussians into a predefined voxel grid via Optimal Transport. Since GaussianCube is a structured grid representation, it allows us to use a standard 3D U-Net as our backbone in diffusion modeling without elaborate designs. More importantly, the high-accuracy fitting of the Gaussians allows us to achieve a high-quality representation with one to two orders of magnitude fewer parameters than previous structured representations of comparable quality. The compactness of GaussianCube greatly eases the difficulty of 3D generative modeling. Extensive experiments conducted on unconditional and class-conditioned object generation, digital avatar creation, and text-to-3D synthesis all show that our model achieves state-of-the-art generation results both qualitatively and quantitatively, underscoring the potential of GaussianCube as a highly accurate and versatile radiance representation for 3D generative modeling. Project page: https://gaussiancube.github.io/.

∗ Interns at Microsoft Research Asia. † Equal advising. ‡ Corresponding author.

1 Introduction

The field of 3D generative modeling [55, 36, 5, 53, 46, 8, 17] has witnessed remarkable growth, driven by advancements in generative modeling [23, 18, 38, 15, 66, 27]. Most of the prior works in this domain leverage variants of Neural Radiance Field (NeRF) [35, 8, 53, 37] as their underlying 3D representations, which typically consist of an explicit structured proxy representation and an implicit feature decoder. However, such hybrid NeRF variants exhibit degraded representation power, particularly when used for generative modeling where a single implicit feature decoder is shared across all objects. Additionally, the high computational complexity of volumetric rendering leads to both slow rendering speed and extensive memory costs.

Recently, the emergence of 3D Gaussian Splatting (GS) [28] has enabled improved reconstruction quality and real-time rendering capabilities [64, 33, 58]. The fully explicit nature of 3DGS eliminates the need for a shared implicit decoder, providing another key advantage over NeRFs. Although 3DGS has been widely studied in scene reconstruction tasks, its spatially unstructured nature presents a significant challenge when applied to mainstream generative modeling frameworks.

To overcome these barriers, we introduce GaussianCube – an innovative radiance representation that is both structured and fully explicit, with strong fitting capabilities (see Table 1 for comparisons with prior works). The proposed approach first ensures high-accuracy fitting with a predefined number of free Gaussians, and subsequently organizes these Gaussians into a structured voxel grid. Such an explicit grid-based structure permits the seamless application of standard 3D convolutional architectures, such as U-Net, thereby eliminating the need for complex, specialized network designs [69, 55] that are often necessary with unstructured or implicitly decoded representations.

Structuring 3D Gaussians without sacrificing fitting quality is not a trivial task. A naive starting point would be obtaining a fixed number of Gaussians by omitting the densification and pruning steps in GS. However, such simplification fails to lead the Gaussians close to the object surfaces and results in significant quality degradation. In contrast, we propose a densification-constrained fitting strategy, which retains the original pruning process yet constrains the number of Gaussians that perform densification, ensuring the total does not exceed a predefined maximum $N^3$. For the subsequent structuralization, we allocate the Gaussians across an $N \times N \times N$ voxel grid using Optimal Transport (OT). Consequently, our fitted Gaussians are systematically arranged within the voxel grid, with each voxel containing the features of a Gaussian. The proposed OT-based structuralization achieves maximal spatial correspondence, characterized by minimal total transport distances, while preserving the expressive power of 3DGS.

Representation Spatially-structured Fully-explicit Real-time Rendering Rel. Parameters ↓
Instant-NGP [35] 26.63×
Neural Voxels [53] 145.9×
Triplane [8] 13.7×
Gaussian Splatting [28] 4.0×
Our GaussianCube 1.0×
Table 1: Comparison with previous 3D representations with respect to spatial structure, explicitness, real-time rendering capability, and relative parameter count (Rel. Parameters) for representations of comparable quality.

The structured nature of GaussianCube enables us to perform efficient 3D diffusion [23] modeling for the following three reasons: 1) It allows the use of standard 3D U-Net as our backbone for diffusion modeling without elaborate designs. 2) The spatial coherence of GaussianCube permits the use of standard 3D convolutions to capture the correlations among neighboring Gaussians, facilitating efficient feature extraction. 3) GaussianCube enables high-quality fitting with orders of magnitude fewer parameters than prior structured representations of similar quality. Since recent works [30, 3] have demonstrated diffusion models’ struggle in handling high-dimensional distributions, the compactness of GaussianCube significantly reduces the modeling difficulty of the generative framework.

We conduct comprehensive experiments to verify the efficacy of our approach. The model’s capability for unconditional and class-conditioned generation is evaluated on the ShapeNet [9] and OmniObject3D [59] datasets. Both the quantitative and qualitative comparisons indicate that our model surpasses all previous methods. We also perform digital avatar generation on a synthetic avatar dataset [57]. Our model is capable of producing high-fidelity 3D avatars conditioned on single portrait images, excelling beyond prior art in both identity preservation and detail creation. Additionally, we assess our model’s capacity for the challenging text-to-3D creation task on Objaverse [12]. Our model demonstrates competitive performance both quantitatively and qualitatively, producing results consistent with the given text prompts in just 5 seconds. All experiments show the strong capabilities of our GaussianCube and suggest its potential as a powerful and versatile 3D representation for a variety of applications. Some generated samples of our method are presented in Figure 1.

Figure 1: Our diffusion model is able to create diverse objects with complex geometry and rich texture details (top three rows). Our method also supports creating high-fidelity digital avatars (the fourth row) conditioned on single portrait images (visualized in dashed boxes) and high-quality 3D assets given text prompts (the fifth row).

2 Related Work

Radiance field representation. Radiance fields model ray interactions with scene surfaces and can be in either implicit or explicit forms. Early works on neural radiance fields (NeRFs) [35, 67, 40, 1, 42] are often in an implicit form, which represents scenes without defining geometry. These works optimize a continuous scene representation using volumetric ray-marching, which leads to extremely high computational costs. Recent works introduce the use of an explicit proxy representation [8, 24, 16, 48, 37, 63] followed by an implicit feature decoder to enable faster rendering. More recently, 3D Gaussian Splatting methods [28, 64, 58, 11, 29] utilize 3D Gaussians as their underlying representation and offer impressive reconstruction quality. The fully explicit representation also provides real-time rendering speed. However, 3D Gaussians are an unstructured representation and require per-scene optimization to achieve photo-realistic quality. In contrast, our work proposes a structured representation termed GaussianCube for 3D generative tasks.

3D generation. Previous works on SDS-based optimization [41, 52, 62, 49, 10, 50, 65] distill 2D diffusion priors [44] into a 3D representation with the score functions, but these works are notably time-intensive, often taking several minutes to hours. While 3D-aware GANs [8, 17, 7, 19, 39, 14, 61] facilitate view-dependent image generation from single-image collections, they struggle to capture the complexity of diverse objects with intricate geometric variations [60]. Although recent works [55, 36, 20, 53, 46] have utilized diffusion models with structured proxy representations for 3D generation, the use of a shared implicit feature decoder across different assets restricts expressiveness, and the computational demands of NeRF hinder efficient training. In contrast, we introduce a structured and fully explicit radiance representation for 3D generative modeling, building upon 3DGS [28]. A concurrent work [21] includes elaborate designs to form the Gaussians into a volumetric representation during fitting, yet does not thoroughly address global correspondence. In contrast, our approach only restricts the total count of Gaussians while allowing freedom in their spatial distribution during the fitting. We then organize these Gaussians into a voxel grid using Optimal Transport, which yields a spatially coherent arrangement with minimal global offset cost, effectively easing the difficulty of generative modeling.

3 Method

Following prior works, our framework comprises two primary stages as shown in Figure 2: representation construction and diffusion modeling. In the representation construction phase, we first apply a densification-constrained 3DGS fitting algorithm to each object to obtain a constant number of Gaussians. These Gaussians are then organized into the proposed spatially structured GaussianCube via Optimal Transport between the positions of the Gaussians and the centers of a predefined voxel grid. For diffusion modeling, we train a 3D diffusion model to learn the distribution of GaussianCubes. We detail our designs for each stage below.

Figure 2: Overall framework. Our framework comprises two main stages: representation construction and 3D diffusion. In the representation construction stage, given multi-view renderings of a 3D asset, we perform densification-constrained fitting to obtain a constant number of 3D Gaussians. Subsequently, the Gaussians are structured into GaussianCube via Optimal Transport. In the 3D diffusion stage, our 3D diffusion model is trained to generate GaussianCube from Gaussian noise.

3.1 Representation Construction

We build our GaussianCube upon 3DGS, an explicit representation that offers impressive fitting quality and real-time rendering speed. However, it fails to yield a fixed number of Gaussians, since the adaptive density control during GS fitting can lead to a varying number of Gaussians for different objects. Furthermore, the lack of a predetermined spatial ordering for Gaussians leads to a disorganized spatial structure. These aspects pose significant challenges to 3D generative modeling. To overcome these obstacles, we first introduce our densification-constrained fitting strategy to obtain a fixed number of free Gaussians. Then, we systematically arrange the resulting Gaussians within a predefined voxel grid via Optimal Transport, thereby achieving a structured and explicit radiance representation.

Formally, a 3D asset is represented by a collection of 3D Gaussians as introduced in Gaussian Splatting [28]. The geometry of the $i$-th 3D Gaussian $\bm{g}_i$ is given by

$$\bm{g}_i(\bm{x}) = \exp\left(-\frac{1}{2}\left(\bm{x}-\bm{\mu}_i\right)^{\top}\bm{\Sigma}_i^{-1}\left(\bm{x}-\bm{\mu}_i\right)\right), \tag{1}$$

where $\bm{\mu}_i \in \mathbb{R}^3$ is the center of the Gaussian and $\bm{\Sigma}_i \in \mathbb{R}^{3 \times 3}$ is the covariance matrix defining the shape and size, which can be decomposed into a quaternion $\bm{q}_i \in \mathbb{R}^4$ and a vector $\bm{s}_i \in \mathbb{R}^3$ for rotation and scaling, respectively. Moreover, each Gaussian $\bm{g}_i$ has an opacity value $\alpha_i \in \mathbb{R}$ and a color feature $\bm{c}_i \in \mathbb{R}^3$ for rendering. Combining them together, the $C$-channel feature vector $\bm{\theta}_i = \{\bm{\mu}_i, \bm{s}_i, \bm{q}_i, \alpha_i, \bm{c}_i\} \in \mathbb{R}^C$ fully characterizes the Gaussian $\bm{g}_i$.
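To make the parameterization concrete, the following is a minimal sketch that assembles one feature vector and evaluates Eq. (1). It assumes the covariance factorization $\bm{\Sigma} = \bm{R}\bm{S}\bm{S}^{\top}\bm{R}^{\top}$ used in Gaussian Splatting [28], a (w, x, y, z) quaternion ordering, and a scale vector holding the per-axis standard deviations directly; the function names and the numeric values are illustrative only. With the dimensions stated above, the channel count works out to $C = 3 + 3 + 4 + 1 + 3 = 14$.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)  # normalize so the result is a valid rotation
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def covariance(q, s):
    """Sigma = R S S^T R^T with R from the quaternion and S = diag(s) (assumed form)."""
    R = quat_to_rotmat(q)
    S = np.diag(s)
    return R @ S @ S.T @ R.T

def gaussian_value(x, mu, sigma):
    """Evaluate Eq. (1) at point x."""
    d = x - mu
    return float(np.exp(-0.5 * d @ np.linalg.solve(sigma, d)))

# Illustrative values for a single Gaussian (not taken from the paper).
mu = np.array([0.1, -0.2, 0.3])      # center, 3 channels
s = np.array([0.05, 0.02, 0.01])     # scaling, 3 channels
q = np.array([0.9, 0.1, 0.3, 0.2])   # rotation quaternion, 4 channels
alpha = np.array([0.8])              # opacity, 1 channel
c = np.array([0.5, 0.4, 0.3])        # color, 3 channels

theta = np.concatenate([mu, s, q, alpha, c])  # C = 14 channels
sigma = covariance(q, s)
```

The Gaussian attains its maximum value of 1 at its own center, and the eigenvalues of `sigma` are the squared entries of `s`, which is a quick way to sanity-check the decomposition.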

Densification-constrained fitting. Our approach begins with the aim of maintaining a constant number of Gaussians $\bm{g} \in \mathbb{R}^{N_{\text{max}} \times C}$ across different objects during the fitting. A simplistic approach might involve omitting the densification and pruning in the original GS. However, we argue that such simplifications significantly harm the fitting quality, with empirical evidence shown in Table 6. Instead, we propose to retain the pruning process while imposing a new constraint on the densification phase, as shown in Figure 3 (a). The fitting process encompasses several distinct stages: 1) Densification detection: Assuming the current iteration includes $N_{\text{c}}$ Gaussians, we identify densification candidates by selecting those with view-space position gradient magnitudes exceeding a predefined threshold $\tau$. We denote the number of candidates as $N_d$. 2) Candidate sampling: To prevent exceeding the predefined maximum of $N_{\text{max}}$ Gaussians, we select the $\min(N_{\text{max}} - N_{\text{c}}, N_d)$ Gaussians with the largest view-space positional gradients from the candidates for densification. 3) Densification: We modify the densification approach by performing the cloning and splitting actions in separate, alternating steps. 4) Pruning detection and pruning: We identify and remove the Gaussians with $\alpha$ less than a small threshold $\epsilon$. After completing the fitting process, we pad Gaussians with $\alpha = 0$ to reach the target count of $N_{\text{max}}$ without affecting the rendering results. Benefiting from our proposed strategy, we attain a high-quality representation with orders of magnitude fewer parameters compared to existing works of similar quality, which significantly reduces the modeling difficulty for the diffusion models.
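The candidate selection in steps 1) and 2) and the final padding can be sketched as follows. This is an illustration under stated assumptions rather than the fitting code itself: the function names, the all-zero padding rows, and the use of a single helper that prunes and pads in one call are hypothetical, and the cloning and splitting operations are omitted.

```python
import numpy as np

def select_densification_candidates(grad_norms, n_max, tau):
    """Return indices of the Gaussians allowed to densify in this iteration.

    grad_norms: (N_c,) view-space positional gradient magnitudes.
    n_max:      cap N_max on the total number of Gaussians.
    tau:        gradient threshold for densification detection.
    """
    n_c = grad_norms.shape[0]
    candidates = np.flatnonzero(grad_norms > tau)   # step 1: detection, N_d candidates
    budget = max(n_max - n_c, 0)                    # remaining room under the cap
    k = min(budget, candidates.size)                # step 2: min(N_max - N_c, N_d)
    if k == 0:
        return np.empty(0, dtype=np.int64)
    # Keep the k candidates with the largest gradient magnitudes.
    order = np.argsort(-grad_norms[candidates], kind="stable")
    return candidates[order[:k]]

def prune_and_pad(features, alpha_index, eps, n_max):
    """Drop Gaussians with opacity below eps, then pad with alpha = 0 rows.

    Assumes features has at most n_max rows. Padding rows are all zeros
    (an assumption); only their zero opacity matters for rendering.
    """
    kept = features[features[:, alpha_index] >= eps]
    pad = np.zeros((n_max - kept.shape[0], features.shape[1]), dtype=features.dtype)
    return np.concatenate([kept, pad], axis=0)
```

For example, with gradient magnitudes `[0.1, 0.5, 0.3, 0.05, 0.9]`, a threshold of 0.2 and a cap of 7, three Gaussians are candidates but only two slots remain, so the two with the largest gradients (indices 4 and 1) are selected.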

Gaussian structuralization via Optimal Transport. To further organize the obtained Gaussians into a spatially structured representation for 3D generative modeling, we propose to map the Gaussians to a predefined structured voxel grid $\bm{v} \in \mathbb{R}^{N_v \times N_v \times N_v \times C}$, where $N_v = \sqrt[3]{N_{\text{max}}}$. Intuitively, we aim to “move” each Gaussian into a voxel while preserving their geometric relations as much as possible. Naive approaches such as nearest neighbor transport fall short in conserving these relations because they disregard the global arrangement (see Figure 10). We therefore formulate this as an Optimal Transport (OT) problem [54, 4] between the Gaussians’ spatial positions $\{\bm{\mu}_i, i = 1, \ldots, N_{\text{max}}\}$ and the voxel grid centers $\{\bm{x}_j, j = 1, \ldots, N_{\text{max}}\}$. Let $\mathbf{D}$ be a distance matrix with $\mathbf{D}_{ij}$ being the moving distance between $\bm{\mu}_i$ and $\bm{x}_j$, i.e., $\mathbf{D}_{ij} = \|\bm{\mu}_i - \bm{x}_j\|^2$. The transport plan is represented by a binary matrix $\mathbf{T} \in \mathbb{R}^{N_{\text{max}} \times N_{\text{max}}}$, and the optimal transport plan is given by:

$$\begin{array}{ll}
\underset{\mathbf{T}}{\operatorname{minimize}} & \sum_{i=1}^{N_{\text{max}}}\sum_{j=1}^{N_{\text{max}}}\mathbf{T}_{ij}\mathbf{D}_{ij} \\
\text{subject to} & \sum_{j=1}^{N_{\text{max}}}\mathbf{T}_{ij}=1 \quad \forall i\in\{1,\ldots,N_{\text{max}}\} \\
& \sum_{i=1}^{N_{\text{max}}}\mathbf{T}_{ij}=1 \quad \forall j\in\{1,\ldots,N_{\text{max}}\} \\
& \mathbf{T}_{ij}\in\{0,1\} \qquad \forall (i,j)\in\{1,\ldots,N_{\text{max}}\}\times\{1,\ldots,N_{\text{max}}\}.
\end{array} \tag{2}$$

The solution is a bijective transport plan $\mathbf{T}^{*}$ that minimizes the total transport distance. We employ the Jonker-Volgenant algorithm [25] to solve the OT problem. We provide a 2D illustration in Figure 3 (b). We organize the Gaussians according to the solution, with the $j$-th voxel encapsulating the feature vector of the corresponding Gaussian $\bm{\theta}_k = \{\bm{\mu}_k - \bm{x}_j, \bm{s}_k, \bm{q}_k, \alpha_k, \bm{c}_k\} \in \mathbb{R}^C$, where $k$ is determined by the optimal transport plan (i.e., $\mathbf{T}^{*}_{kj} = 1$). Note that we replace the original Gaussian positions with offsets from the current voxel center to reduce the solution space for diffusion models. As a result, our fitted Gaussians are systematically arranged within a voxel grid $\bm{v}$ and preserve the spatial correspondence of neighboring Gaussians, which further facilitates generative modeling.
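The assignment in Eq. (2) can be tried out on a toy problem with the sketch below. It uses `scipy.optimize.linear_sum_assignment`, which the SciPy documentation describes as a modified Jonker-Volgenant algorithm; whether this matches the solver used for the paper is not stated in the text, so treat it as a stand-in. The grid extent $[-1, 1]^3$, the function names, and the convention that the first three feature channels hold $\bm{\mu}$ are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def voxel_centers(n_v, lo=-1.0, hi=1.0):
    """Centers of an n_v^3 grid over [lo, hi]^3, flattened in C order."""
    edges = np.linspace(lo, hi, n_v + 1)
    mid = 0.5 * (edges[:-1] + edges[1:])
    gx, gy, gz = np.meshgrid(mid, mid, mid, indexing="ij")
    return np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)

def structuralize(features, n_v):
    """Arrange (N_max, C) Gaussian features into an (n_v, n_v, n_v, C) grid.

    The first three channels are assumed to hold the positions mu.
    Returns the grid and the total transport cost of the optimal plan.
    """
    centers = voxel_centers(n_v)
    mu = features[:, :3]
    # D_ij = ||mu_i - x_j||^2 as a dense matrix (only practical for small N_max).
    cost = ((mu[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    rows, cols = linear_sum_assignment(cost)    # Gaussian rows[i] goes to voxel cols[i]
    grid = np.empty_like(features)
    grid[cols] = features[rows]
    grid[cols, :3] = mu[rows] - centers[cols]   # store offsets instead of positions
    total = float(cost[rows, cols].sum())
    return grid.reshape(n_v, n_v, n_v, -1), total

rng = np.random.default_rng(0)
n_v = 4
feats = rng.uniform(-1.0, 1.0, size=(n_v ** 3, 14))  # toy Gaussians, C = 14
cube, total_cost = structuralize(feats, n_v)
```

The dense cost matrix grows as $N_{\text{max}}^2$, so this form is only meant for small toy grids; adding each voxel center back to its stored offset recovers the original Gaussian positions.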

Figure 3: Illustration of representation construction. First, we perform densification-constrained fitting to yield a fixed number of Gaussians, as shown in (a). We then employ Optimal Transport to organize the resultant Gaussians into a voxel grid. A 2D illustration of this process is presented in (b).

3.2 3D Diffusion on GaussianCube

We now introduce our 3D diffusion model incorporated with the proposed expressive, efficient and spatially structured representation. After organizing the fitted Gaussians $\bm{g}$ into GaussianCube $\bm{y}$ for each object, we aim to model the distribution of GaussianCube, i.e., $p(\bm{y})$.

Formally, the generation procedure can be formulated as the inversion of a discrete-time Markov forward process. During the forward phase, we gradually add noise to $\bm{y}_0 \sim p(\bm{y})$ and obtain a sequence of increasingly noisy samples $\{\bm{y}_t \mid t \in [0, T]\}$ according to $\bm{y}_t := \alpha_t \bm{y}_0 + \sigma_t \bm{\epsilon}$, where $\bm{\epsilon} \sim \mathcal{N}(\mathbf{0}, \bm{I})$ represents the added Gaussian noise, and $\alpha_t, \sigma_t$ constitute the noise schedule. As a result, $\bm{y}_T$ will finally reach isotropic Gaussian noise after sufficient destruction steps. By reversing the above process, we are able to perform the generation process by gradually denoising the sample starting from pure Gaussian noise $\bm{y}_T \sim \mathcal{N}(\mathbf{0}, \bm{I})$ until reaching $\bm{y}_0$. Our diffusion model is trained to denoise $\bm{y}_t$ into $\bm{y}_0$ for each timestep $t$, facilitating both unconditional and conditional generation.
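The forward process $\bm{y}_t = \alpha_t \bm{y}_0 + \sigma_t \bm{\epsilon}$ can be sketched in a few lines. The text does not specify the noise schedule at this point, so the cosine-style schedule below, with $\alpha_t^2 + \sigma_t^2 = 1$, is a hypothetical choice made only so the example runs; the function names and the toy grid size are likewise illustrative.

```python
import numpy as np

def cosine_schedule(t, T):
    """Hypothetical schedule with alpha_t^2 + sigma_t^2 = 1 (not from the paper)."""
    angle = 0.5 * np.pi * t / T
    return np.cos(angle), np.sin(angle)

def forward_diffuse(y0, t, T, rng):
    """Sample y_t = alpha_t * y_0 + sigma_t * eps with eps ~ N(0, I)."""
    alpha_t, sigma_t = cosine_schedule(t, T)
    eps = rng.standard_normal(y0.shape)
    return alpha_t * y0 + sigma_t * eps, eps

rng = np.random.default_rng(0)
y0 = rng.uniform(-1.0, 1.0, size=(4, 4, 4, 14))   # toy GaussianCube
T = 1000
y_start, _ = forward_diffuse(y0, 0, T, rng)       # t = 0: no noise added yet
y_end, eps_end = forward_diffuse(y0, T, T, rng)   # t = T: essentially pure noise
```

Under this schedule the sample at $t = 0$ equals $\bm{y}_0$ and the sample at $t = T$ is numerically indistinguishable from the drawn noise, matching the two ends of the process described above.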

Model architecture. Thanks to the spatially structured organization of the proposed GaussianCube, standard 3D convolution is sufficient to effectively extract and aggregate the features of neighboring Gaussians without elaborate designs. We leverage the standard U-Net network for diffusion [38, 15] and simply replace the original 2D operators including convolution, attention, upsampling and downsampling with their 3D counterparts.

Conditioning mechanism. Our model supports a variety of condition signals to control the generation process. When performing class-conditioned diffusion modeling, we employ adaptive group normalization (AdaGN) [15] to inject the class labels into our model. For image-conditioned digital avatar creation, we leverage a pretrained vision transformer [6] to encode the conditional image into a sequence of feature tokens. We subsequently adopt cross-attention to make the model learn the correspondence between 3D activations and 2D image feature tokens following [5]. We also leverage cross-attention as our condition mechanism when creating 3D objects from text, similar to previous text-to-image diffusion models [44].
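As a rough illustration of class conditioning, adaptive group normalization as described in [15] computes $\text{AdaGN}(h, y) = y_s \cdot \text{GroupNorm}(h) + y_b$, where the scale $y_s$ and shift $y_b$ come from a linear projection of the condition embedding. The sketch below applies this to a 3D feature volume; the tensor layout (channels first), the group count, and the helper names are assumptions, and the actual layer in the model may differ.

```python
import numpy as np

def group_norm(h, num_groups, eps=1e-5):
    """Normalize a (C, D, H, W) volume within each group of channels."""
    g = h.reshape(num_groups, -1)  # consecutive channels share a group
    g = (g - g.mean(axis=1, keepdims=True)) / np.sqrt(g.var(axis=1, keepdims=True) + eps)
    return g.reshape(h.shape)

def condition_to_scale_shift(embedding, weight, bias):
    """Hypothetical linear projection of a condition embedding to (y_s, y_b)."""
    out = weight @ embedding + bias  # weight: (2C, E), bias: (2C,)
    scale, shift = np.split(out, 2)
    return scale, shift

def ada_gn(h, scale, shift, num_groups):
    """AdaGN(h, y) = y_s * GroupNorm(h) + y_b, broadcast over the spatial axes."""
    normed = group_norm(h, num_groups)
    return scale[:, None, None, None] * normed + shift[:, None, None, None]
```

With unit scale and zero shift this reduces to plain group normalization, so the condition only modulates the normalized activations per channel.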

Representation Spatially-structured PSNR ↑ LPIPS ↓ SSIM ↑ Rel. Speed ↑ Params (M) ↓
Instant-NGP 33.98 0.0386 0.9809 1× 12.25
Gaussian Splatting 35.32 0.0303 0.9874 2.58× 1.84
Voxels 28.95 0.0959 0.9470 1.73× 0.47
Voxels∗ 25.80 0.1407 0.9111 1.73× 0.47
Triplane 32.61 0.0611 0.9709 1.05× 6.30
Triplane∗ 31.39 0.0759 0.9635 1.05× 6.30
Our GaussianCube 34.94 0.0347 0.9863 3.33× 0.46
Table 2: Comparison with prior 3D representations in terms of spatial structure, fitting quality, relative fitting speed (Rel. Speed), and parameter size on ShapeNet Car. ∗ denotes that the implicit feature decoder is shared across different objects. All methods are evaluated at 30K iterations.
Figure 4: Qualitative results of object fitting. Columns, from left to right: Ground-truth, Instant-NGP, Gaussian Splatting, Voxel∗, Triplane∗, Our GaussianCube.

Training objective. In our 3D diffusion training, we parameterize our model $\hat{\bm{y}}_\theta$ to predict the noise-free input $\bm{y}_0$ using:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\bm{y}_0,\bm{\epsilon}}\left[\left\|\hat{\bm{y}}_\theta\left(\alpha_t\bm{y}_0+\sigma_t\bm{\epsilon},\, t,\, \bm{c}_{\text{cls}}\right)-\bm{y}_0\right\|_2^2\right], \tag{3}$$

where the condition signal $\bm{c}_{\text{cls}}$ is only needed when training conditional diffusion models. We additionally impose image-level supervision to improve the rendering quality of generated GaussianCube, which has been demonstrated to effectively enhance the perceptual details in previous works [55, 36]. Specifically, we penalize the discrepancy between the rasterized images $I_{\text{pred}}$ of the predicted GaussianCubes and the ground-truth images $I_{\text{gt}}$ using:

$$\mathcal{L}_{\text{image}}=\mathbb{E}_{t,I_{\text{pred}}}\left(\sum_{l}\left\|\Psi^{l}\left(I_{\text{pred}}\right)-\Psi^{l}\left(I_{\text{gt}}\right)\right\|_{2}^{2}\right)+\mathbb{E}_{t,I_{\text{pred}}}\left(\left\|I_{\text{pred}}-I_{\text{gt}}\right\|_{2}\right), \tag{4}$$

where $\Psi^{l}$ is the multi-resolution feature extracted using the pre-trained VGG [47]. Benefiting from the fast rendering and low memory cost of GS [28], we are able to perform fast training with high-resolution renderings. Our overall training loss can be formulated as:

$$\mathcal{L}=\mathcal{L}_{\text{simple}}+\lambda\mathcal{L}_{\text{image}}, \tag{5}$$

where $\lambda$ is a balancing weight.
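As an illustration of how Eqs. (3) to (5) fit together, the following is a minimal single-sample sketch in plain Python. It is not the training code: tensors are flat lists, the per-level feature vectors stand in for VGG features, and all function names (`diffuse`, `simple_loss`, `image_loss`, `total_loss`) are hypothetical. The default `lam=10.0` matches the loss weight reported in Section 4.2.

```python
import math

def diffuse(y0, eps, alpha_t, sigma_t):
    """Noised network input alpha_t * y0 + sigma_t * eps from Eq. (3)."""
    return [alpha_t * y + sigma_t * e for y, e in zip(y0, eps)]

def squared_l2(a, b):
    """Squared L2 distance between two flat vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def simple_loss(y0_pred, y0):
    """Single-sample estimate of Eq. (3): error of the predicted clean input."""
    return squared_l2(y0_pred, y0)

def image_loss(feats_pred, feats_gt, img_pred, img_gt):
    """Single-sample estimate of Eq. (4).

    feats_pred / feats_gt: one flat feature vector per level l
    (placeholders for VGG features, an assumption of this sketch).
    """
    perceptual = sum(squared_l2(fp, fg) for fp, fg in zip(feats_pred, feats_gt))
    pixel = math.sqrt(squared_l2(img_pred, img_gt))  # un-squared L2 norm
    return perceptual + pixel

def total_loss(l_simple, l_image, lam=10.0):
    """Eq. (5): weighted sum of the two terms."""
    return l_simple + lam * l_image
```

In practice the expectations in Eqs. (3) and (4) would be estimated by averaging these per-sample values over a mini-batch of objects, timesteps and noise draws.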

| Method | ShapeNet Car FID-50K↓ | ShapeNet Car KID-50K(‰)↓ | ShapeNet Chair FID-50K↓ | ShapeNet Chair KID-50K(‰)↓ | OmniObject3D FID-50K↓ | OmniObject3D KID-50K(‰)↓ |
|---|---|---|---|---|---|---|
| EG3D | 30.48 | 20.42 | 27.98 | 16.01 | - | - |
| GET3D | 17.15 | 9.58 | 19.24 | 10.95 | - | - |
| DiffTF | 51.88 | 41.10 | 47.08 | 31.29 | 46.06 | 22.86 |
| Ours | 13.01 | 8.46 | 15.99 | 9.95 | 11.62 | 2.78 |
Table 3: Quantitative results of unconditional generation on ShapeNet Car and Chair [9] and class-conditioned generation on OmniObject3D [59].
(Figure 5 image. Columns, left to right: EG3D [8], GET3D [17], DiffTF [5], Ours.)
Figure 5: Qualitative comparison of unconditional 3D generation on ShapeNet Car and Chair datasets. Our model is capable of generating results of complex geometry with rich details.
(Figure 6 image. Columns, left to right: DiffTF [5], Ours.)
Figure 6: Qualitative comparison of class-conditioned 3D generation on large-vocabulary OmniObject3D [59]. Our model is able to handle diverse distribution with semantically accurate results.

4 Experiments

4.1 Dataset and Metrics

To measure the expressiveness and efficiency of various 3D representations, we fit 100 objects in ShapeNet Car [9] using each representation and report the PSNR, LPIPS [68] and Structural Similarity Index Measure (SSIM) metrics when synthesizing novel views. Furthermore, we conduct experiments on single-category unconditional generation on ShapeNet [9] Car and Chair, and class-conditioned generation on the real-world scanned dataset OmniObject3D [59]. We compute the FID [22] and KID [2] scores between 50K generated renderings and 50K ground-truth renderings. For image-conditioned digital avatar generation, we utilize the synthetic avatar dataset [57], which comprises highly detailed 3D avatars created by a synthetic pipeline. We assess the generation quality on 5K renderings from 500 test avatars and additionally include the cosine similarity of identity embeddings [13] (CSIM) to measure identity preservation. The text-to-3D generation experiments are performed on the large-scale, challenging Objaverse dataset [12]. We numerically evaluate the text alignment quality using the CLIP score [43] on 300 test prompts. All images are rendered at $512\times 512$ resolution. For more details on the data, please refer to Section A.1.
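Two of the metrics above have simple closed forms, sketched below in plain Python for reference. The sketch assumes pixel values in [0, 1] (so the peak value is 1.0) and treats images and identity embeddings as flat lists; the function names are hypothetical, and the actual evaluation presumably relies on standard library implementations and on a face recognition network [13] to produce the embeddings.

```python
import math

def psnr(img_a, img_b, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two flat images."""
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (the CSIM score)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Higher is better for both quantities: PSNR grows as the mean squared error shrinks, and CSIM approaches 1 as the two identity embeddings align.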

| Method | PSNR↑ | LPIPS↓ | SSIM↑ | CSIM↑ | FID-5K↓ | KID-5K(‰)↓ |
|---|---|---|---|---|---|---|
| Rodin w/o 2D SR | 18.80 | 0.2842 | 0.7439 | 0.6594 | 32.07 | 24.78 |
| Rodin | 18.59 | 0.2821 | 0.7373 | 0.6466 | 20.02 | 9.24 |
| Ours | 21.87 | 0.1768 | 0.7703 | 0.7821 | 8.32 | 2.67 |
Table 4: Quantitative results of digital avatar creation conditioned on single portrait image.
(Figure 7 image. Columns, left to right: Reference, Rodin [55], Ours.)
Figure 7: Qualitative comparison of 3D avatar creation conditioned on single frontal portraits.

4.2 Implementation Details

To construct GaussianCube for each object, we perform the proposed densification-constrained fitting for 30K iterations, where $N_{\text{max}}$ is set to 32,768. After OT-based structuralization, we obtain a $32\times 32\times 32\times 14$ GaussianCube for each object. For the 3D diffusion model, we adopt the ADM U-Net network [38, 15]. We perform full attention at the resolutions of $8^{3}$ and $4^{3}$ within the network. The number of diffusion timesteps is set to $1{,}000$ and we train the models using the cosine noise schedule [38] with the loss weight $\lambda$ set to $10$. For more training details, please refer to Section A.1.
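For concreteness, the cosine noise schedule of [38] and the resulting data size can be sketched as follows. The offset `s = 0.008` is the value proposed in [38]; mapping the cumulative signal level to the coefficients $\alpha_t$ and $\sigma_t$ of Eq. (3) via $\alpha_t^2+\sigma_t^2=1$ is an assumption of this sketch (a variance-preserving convention), and the function names are hypothetical.

```python
import math

def alpha_bar(t, T=1000, s=0.008):
    """Cumulative signal level of the cosine schedule at step t in [0, T]."""
    def f(u):
        return math.cos((u / T + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return f(t) / f(0)

def alpha_sigma(t, T=1000):
    """Coefficients (alpha_t, sigma_t), assuming alpha_t^2 + sigma_t^2 = 1."""
    ab = alpha_bar(t, T)
    return math.sqrt(ab), math.sqrt(max(0.0, 1.0 - ab))

# Size of one GaussianCube with the reported settings.
n_gaussians = 32 ** 3        # one Gaussian per voxel, equal to N_max
n_values = n_gaussians * 14  # 14 channels per Gaussian
```

With these settings each object is described by 32,768 Gaussians and 458,752 scalar values in total. The schedule starts at pure signal (`alpha_sigma(0)` gives `(1.0, 0.0)`) and decays smoothly towards pure noise at `t = T`.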

4.3 Main Results

3D fitting. We first evaluate the object-fitting capability of our representation against previous NeRF-based representations, including Voxels [53] with similar parameter sizes and Triplane [8], which are widely adopted in previous 3D generation works [8, 55, 5, 36, 53]. We also include Instant-NGP [37] and the original Gaussian Splatting [28] for reference, despite their unsuitability for generative modeling due to their spatially unstructured nature. As shown in Table 2, our GaussianCube outperforms all NeRF-based representations across all metrics. Figure 3 illustrates that GaussianCube can faithfully reconstruct geometric details and intricate textures. Moreover, we achieve such high-quality fitting with minimal parameters thanks to the densification-constrained fitting, showcasing the compactness of our representation. Notably, the shared implicit feature decoder in the multi-object fitting of NeRF-based methods leads to significant decreases in quality compared to single-object fitting, as evidenced in Table 2. In contrast, the fully explicit nature of GS results in no quality gap between single-object and multi-object fitting.

| Metric | DreamGaussian | VolumeDiffusion | Shap-E | LGM | Ours |
|---|---|---|---|---|---|
| CLIP Score↑ | 26.38 | 24.41 | 30.52 | 30.06 | 30.56 |
| Inference Time (s)↓ | ~120 | 5 | 7 | 2 | 5 |
Table 5: Quantitative results of text-to-3D creation. Inference time is measured on a single A100 GPU. While Shap-E and LGM achieve CLIP scores comparable to ours, they either utilize millions of training samples or leverage a 2D diffusion prior.
(Figure 8 image. Columns, left to right: DreamGaussian [50], VolumeDiffusion [53], Shap-E [26], LGM [51], Ours.)
Figure 8: Qualitative comparison of text-to-3D generation on Objaverse [12]. Our model is able to generate high-quality samples according to the given text prompts.

Single-category unconditional generation. For unconditional generation, we compare our method with the state-of-the-art 3D generation works including 3D-aware GANs [8, 17] and Triplane diffusion models [5]. As shown in Table 3, our method surpasses all prior works in terms of both FID and KID scores and sets new records. We provide visual comparisons in Figure 5, where EG3D and DiffTF tend to generate blurry results with poor geometry, and GET3D fails to provide satisfactory textures. In contrast, our method yields high-fidelity results with authentic geometry and sharp textures.

Large-vocabulary class-conditioned generation. We also compare class-conditioned generation with DiffTF [5] on the more diverse and challenging OmniObject3D [59] dataset. We achieve significantly better FID and KID scores than DiffTF, as shown in Table 3. Visual comparisons in Figure 6 reveal that DiffTF often struggles to create intricate geometry and detailed textures, whereas our method is able to generate objects with complex geometry and realistic textures.

Image-conditioned avatar generation. For 3D avatar generation conditioned on a single reference image, we compare our method with the state-of-the-art Triplane diffusion model Rodin [55]. Our model surpasses Rodin on all evaluated metrics, as shown in Table 4. Although Rodin utilizes a 2D refiner [56] to boost the visual quality of facial areas, which significantly compromises 3D consistency, our model still outperforms it through direct, genuinely 3D generation. Results in Figure 7 demonstrate that our model faithfully preserves the identity, expression and accessories of the references with rich details, while Rodin struggles to provide satisfactory results even with 2D refinement.

| Method | Densify & Prune | Fitting PSNR↑ | Fitting LPIPS↓ | Fitting SSIM↑ | Generation FID-50K↓ | Generation KID-50K(‰)↓ |
|---|---|---|---|---|---|---|
| A. Voxel grid w/o offset | ✗ | 25.87 | 0.1228 | 0.9217 | - | - |
| B. Voxel grid w/ offset | ✗ | 30.18 | 0.0780 | 0.9628 | 40.52 | 24.35 |
| C. Ours w/o OT | ✓ | 34.94 | 0.0346 | 0.9863 | 21.41 | 14.37 |
| D. Ours | ✓ | 34.94 | 0.0346 | 0.9863 | 13.01 | 8.46 |
Table 6: Quantitative ablation of both representation fitting and generation quality on ShapeNet Car.
(Figure 9 image. Columns, left to right: Ground-truth, Table 6 A., Table 6 B., Table 6 D. (Ours).)
Figure 9: Qualitative ablation of representation fitting.
(Figure 10 image. Panel (a), left to right: OT (Ours), Nearest Neighbor. Panel (b), left to right: Table 6 B., Table 6 C., Table 6 D. (Ours).)
Figure 10: Visual ablation of the Gaussian organization methods and 3D generation. To visualize the Gaussian structuralization in (a), we map the coordinates of the voxel assigned to each Gaussian to RGB values. Our OT-based solution also results in the best generation quality, as shown in (b).

Text-to-3D generation. We compare text-to-3D generation with prior art, including diffusion models [26, 53], an optimization-based method [50] and a feed-forward method [51]. Our model achieves competitive text-3D alignment results, as shown in Table 5. The visual comparison in Figure 8 shows that our model is able to create high-quality samples aligned with the text prompts in 5 seconds. DreamGaussian tends to create over-saturated results and suffers from the Janus problem. VolumeDiffusion produces unsatisfactory textures with poor text alignment. Shap-E can produce semantically accurate results but struggles to generate complex geometry. LGM reconstructs 3D Gaussians from multi-view images generated by a text-conditioned multi-view diffusion pipeline [45], but the inconsistency [51] of the generated views often results in inaccurate geometric reconstruction.

4.4 Ablation Study

We first examine the key factors in representation construction on ShapeNet Car. To spatially structure the Gaussians, a simplistic approach would be anchoring the positions of Gaussians to a predefined voxel grid while omitting densification and pruning, which leads to severe failure when fitting the objects as shown in Figure 9. Even by introducing learnable offsets to the voxel grid, the results still lack details. We observe the offsets are typically too small to effectively lead the Gaussians close to the object surfaces, which indicates the importance of densification in the fitting process. Instead, GaussianCube can capture both complex geometry and intricate details as shown in Figure 9. The numerical comparison in Table 6 also demonstrates the superior fitting quality of GaussianCube.

We also evaluate how the representation affects 3D generative modeling on ShapeNet Car, as shown in Table 6 and Figure 10. Limited by the poor fitting quality, performing diffusion modeling on a voxel grid with learnable offsets leads to blurry generation results, as shown in Figure 10. To validate the importance of organizing Gaussians via Optimal Transport (OT), we compare with an organization based on nearest neighbor transport. We linearly map the coordinates of the voxel assigned to each Gaussian to an RGB color to visualize the different organizations. As shown in Figure 10 (a), our proposed OT approach yields smooth color transitions, indicating that our method successfully preserves the spatial correspondence. In contrast, nearest neighbor transport results in abrupt color transitions due to its disregard for global structure. Both the quantitative results in Table 6 and the visual comparisons in Figure 10 indicate that our globally structured arrangement facilitates generative modeling by reducing its complexity, leading to superior generation quality.
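The difference between the two organizations can be seen on a toy problem. The sketch below is an illustration only: it uses three 1D positions, brute-force search in place of a real assignment solver, and an order-dependent greedy rule as one plausible reading of nearest neighbor transport (an assumption; the exact baseline procedure is not spelled out above). All function names are hypothetical.

```python
from itertools import permutations

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def total_cost(points, voxels, assignment):
    """Sum of squared distances when points[i] goes to voxels[assignment[i]]."""
    return sum(sq_dist(points[i], voxels[j]) for i, j in enumerate(assignment))

def greedy_nearest(points, voxels):
    """Each point, in input order, takes its nearest still-free voxel."""
    free = list(range(len(voxels)))
    assignment = []
    for p in points:
        j = min(free, key=lambda k: sq_dist(p, voxels[k]))
        free.remove(j)
        assignment.append(j)
    return assignment

def optimal_assignment(points, voxels):
    """Exhaustive one-to-one assignment with minimal total cost (tiny N only)."""
    n = len(points)
    best = min(permutations(range(n)),
               key=lambda perm: total_cost(points, voxels, perm))
    return list(best)

points = [(0.4,), (-0.5,), (2.1,)]
voxels = [(0.0,), (1.0,), (2.0,)]
greedy = greedy_nearest(points, voxels)
optimal = optimal_assignment(points, voxels)
```

On this input the greedy rule sends the leftmost point (-0.5) to the middle voxel because the first voxel was already taken, which breaks the spatial ordering, whereas the minimal-cost assignment keeps points and voxels in the same left-to-right order. This is the kind of local decision that shows up as abrupt color changes in Figure 10 (a).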

5 Conclusion

We have presented GaussianCube, a structured and explicit radiance representation crafted for 3D generative models. We begin by fitting each 3D object with a constant number of Gaussians using our proposed densification-constrained fitting algorithm. We further organize the obtained Gaussians into a spatially structured representation by solving the Optimal Transport problem between the positions of the Gaussians and the predefined voxel grid. The proposed GaussianCube is spatially structured, allowing the use of a standard 3D U-Net for diffusion modeling without elaborate designs. Moreover, GaussianCube can achieve high-quality fitting using far fewer parameters than prior works of similar quality, which further eases the difficulty of generative modeling. Our 3D diffusion models equipped with GaussianCube achieve state-of-the-art generation quality on the evaluated datasets, underscoring the potential of GaussianCube as a versatile and powerful radiance representation for 3D generation.

References

  • Barron et al. [2022] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5470–5479, 2022.
  • Bińkowski et al. [2018] Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans. arXiv preprint arXiv:1801.01401, 2018.
  • Blattmann et al. [2023] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023.
  • Burkard and Cela [1999] Rainer E Burkard and Eranda Cela. Linear assignment problems and extensions. In Handbook of combinatorial optimization: Supplement volume A, pages 75–149. Springer, 1999.
  • Cao et al. [2023] Ziang Cao, Fangzhou Hong, Tong Wu, Liang Pan, and Ziwei Liu. Large-vocabulary 3d diffusion model with transformer. arXiv preprint arXiv:2309.07920, 2023.
  • Caron et al. [2021] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
  • Chan et al. [2021] Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5799–5809, 2021.
  • Chan et al. [2022] Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16123–16133, 2022.
  • Chang et al. [2015] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
  • Cheng et al. [2023] Yiji Cheng, Fei Yin, Xiaoke Huang, Xintong Yu, Jiaxiang Liu, Shikun Feng, Yujiu Yang, and Yansong Tang. Efficient text-guided 3d-aware portrait generation with score distillation sampling on distribution. arXiv preprint arXiv:2306.02083, 2023.
  • Cotton and Peyton [2024] R James Cotton and Colleen Peyton. Dynamic gaussian splatting from markerless motion capture reconstruct infants movements. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 60–68, 2024.
  • Deitke et al. [2023] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023.
  • Deng et al. [2019] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019.
  • Deng et al. [2022] Yu Deng, Jiaolong Yang, Jianfeng Xiang, and Xin Tong. Gram: Generative radiance manifolds for 3d-aware image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10673–10683, 2022.
  • Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
  • Fridovich-Keil et al. [2022] Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5501–5510, 2022.
  • Gao et al. [2022] Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3d: A generative model of high quality 3d textured shapes learned from images. arXiv preprint arXiv:2209.11163, 2022.
  • Goodfellow et al. [2020] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
  • Gu et al. [2021] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. Stylenerf: A style-based 3d-aware generator for high-resolution image synthesis. arXiv preprint arXiv:2110.08985, 2021.
  • Gupta et al. [2023] Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, and Barlas Oğuz. 3dgen: Triplane latent diffusion for textured mesh generation. arXiv preprint arXiv:2303.05371, 2023.
  • He et al. [2024] Xianglong He, Junyi Chen, Sida Peng, Di Huang, Yangguang Li, Xiaoshui Huang, Chun Yuan, Wanli Ouyang, and Tong He. Gvgen: Text-to-3d generation with volumetric representation. arXiv preprint arXiv:2403.12957, 2024.
  • Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • Hu et al. [2023] Wenbo Hu, Yuling Wang, Lin Ma, Bangbang Yang, Lin Gao, Xiao Liu, and Yuewen Ma. Tri-miprf: Tri-mip representation for efficient anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19774–19783, 2023.
  • Jonker and Volgenant [1988] Roy Jonker and Ton Volgenant. A shortest augmenting path algorithm for dense and sparse linear assignment problems. In DGOR/NSOR: Papers of the 16th Annual Meeting of DGOR in Cooperation with NSOR/Vorträge der 16. Jahrestagung der DGOR zusammen mit der NSOR, pages 622–622. Springer, 1988.
  • Jun and Nichol [2023] Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463, 2023.
  • Karras et al. [2019] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
  • Kerbl et al. [2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.
  • Li et al. [2024] Mengtian Li, Shengxiang Yao, Zhifeng Xie, Keyu Chen, and Yu-Gang Jiang. Gaussianbody: Clothed human reconstruction via 3d gaussian splatting. arXiv preprint arXiv:2401.09720, 2024.
  • Li et al. [2023] Zhengqi Li, Richard Tucker, Noah Snavely, and Aleksander Holynski. Generative image dynamics. arXiv preprint arXiv:2309.07906, 2023.
  • Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, ICLR, 2019.
  • Lu et al. [2022] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.
  • Luiten et al. [2023] Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. arXiv preprint arXiv:2308.09713, 2023.
  • Meng et al. [2021] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
  • Mildenhall et al. [2021] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  • Müller et al. [2023] Norman Müller, Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulo, Peter Kontschieder, and Matthias Nießner. Diffrf: Rendering-guided 3d radiance field diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4328–4338, 2023.
  • Müller et al. [2022] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (ToG), 41(4):1–15, 2022.
  • Nichol and Dhariwal [2021] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
  • Niemeyer and Geiger [2021] Michael Niemeyer and Andreas Geiger. Giraffe: Representing scenes as compositional generative neural feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11453–11464, 2021.
  • Park et al. [2021] Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5865–5874, 2021.
  • Poole et al. [2022] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
  • Pumarola et al. [2021] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10318–10327, 2021.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  • Shi et al. [2023] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023.
  • Shue et al. [2023] J Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner, Jiajun Wu, and Gordon Wetzstein. 3d neural field generation using triplane diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20875–20886, 2023.
  • Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Sun et al. [2022] Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5459–5469, 2022.
  • Sun et al. [2023] Jingxiang Sun, Bo Zhang, Ruizhi Shao, Lizhen Wang, Wen Liu, Zhenda Xie, and Yebin Liu. Dreamcraft3d: Hierarchical 3d generation with bootstrapped diffusion prior. arXiv preprint arXiv:2310.16818, 2023.
  • Tang et al. [2023a] Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653, 2023a.
  • Tang et al. [2024] Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian model for high-resolution 3d content creation. arXiv preprint arXiv:2402.05054, 2024.
  • Tang et al. [2023b] Junshu Tang, Tengfei Wang, Bo Zhang, Ting Zhang, Ran Yi, Lizhuang Ma, and Dong Chen. Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior. arXiv preprint arXiv:2303.14184, 2023b.
  • Tang et al. [2023c] Zhicong Tang, Shuyang Gu, Chunyu Wang, Ting Zhang, Jianmin Bao, Dong Chen, and Baining Guo. Volumediffusion: Flexible text-to-3d generation with efficient volumetric encoder. arXiv preprint arXiv:2312.11459, 2023c.
  • Villani et al. [2009] Cédric Villani et al. Optimal transport: old and new, volume 338. Springer, 2009.
  • Wang et al. [2023] Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, et al. Rodin: A generative model for sculpting 3d digital avatars using diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4563–4573, 2023.
  • Wang et al. [2021] Xintao Wang, Yu Li, Honglun Zhang, and Ying Shan. Towards real-world blind face restoration with generative facial prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9168–9178, 2021.
  • Wood et al. [2021] Erroll Wood, Tadas Baltrušaitis, Charlie Hewitt, Sebastian Dziadzio, Thomas J Cashman, and Jamie Shotton. Fake it till you make it: face analysis in the wild using synthetic data alone. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3681–3691, 2021.
  • Wu et al. [2023a] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. arXiv preprint arXiv:2310.08528, 2023a.
  • Wu et al. [2023b] Tong Wu, Jiarui Zhang, Xiao Fu, Yuxin Wang, Jiawei Ren, Liang Pan, Wayne Wu, Lei Yang, Jiaqi Wang, Chen Qian, et al. Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 803–814, 2023b.
  • Xia and Xue [2023] Weihao Xia and Jing-Hao Xue. A survey on deep generative 3d-aware image synthesis. ACM Computing Surveys, 56(4):1–34, 2023.
  • Xiang et al. [2022] Jianfeng Xiang, Jiaolong Yang, Yu Deng, and Xin Tong. Gram-hd: 3d-consistent image generation at high resolution with generative radiance manifolds. arXiv preprint arXiv:2206.07255, 2022.
  • Xu et al. [2022a] Jiale Xu, Xintao Wang, Weihao Cheng, Yan-Pei Cao, Ying Shan, Xiaohu Qie, and Shenghua Gao. Dream3d: Zero-shot text-to-3d synthesis using 3d shape prior and text-to-image diffusion models. arXiv preprint arXiv:2212.14704, 2022a.
  • Xu et al. [2022b] Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-nerf: Point-based neural radiance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5438–5448, 2022b.
  • Xu et al. [2023] Yuelang Xu, Benwang Chen, Zhe Li, Hongwen Zhang, Lizhen Wang, Zerong Zheng, and Yebin Liu. Gaussian head avatar: Ultra high-fidelity head avatar via dynamic gaussians. arXiv preprint arXiv:2312.03029, 2023.
  • Yi et al. [2023] Taoran Yi, Jiemin Fang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, and Xinggang Wang. Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud priors. arXiv preprint arXiv:2310.08529, 2023.
  • Zhang et al. [2022] Bowen Zhang, Shuyang Gu, Bo Zhang, Jianmin Bao, Dong Chen, Fang Wen, Yong Wang, and Baining Guo. Styleswin: Transformer-based gan for high-resolution image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11304–11314, 2022.
  • Zhang et al. [2020] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. Nerf++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020.
  • Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
  • Zhou et al. [2021] Linqi Zhou, Yilun Du, and Jiajun Wu. 3d shape generation and completion through point-voxel diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5826–5835, 2021.

Appendix A Appendix

A.1 Additional Implementation Details

Dataset preparation. We conduct experiments on the ShapeNet Car [9], ShapeNet Chair [9], OmniObject3D [59], Synthetic Avatar [57], and Objaverse [12] datasets. For each dataset, Table 7 reports the total number of objects used for training, the number of views rendered per object for GaussianCube fitting, and the distribution of camera poses used for rendering. For the Objaverse dataset, we exclude low-quality objects, such as those without textures or with defective reconstructions, following [53]. Table 7 also reports the object bounding box $\bm{b}$ of each dataset in the world coordinate system, which is used to construct the predefined voxel grid within $[-\bm{b},\bm{b}]^{3}$ during OT-based Gaussian structuralization.

Representation construction. We set $N_{\text{max}}$ to 32,768 and $C$ to 14, omitting the view-dependent spherical harmonics. This simplification appears to have a negligible impact on object fitting while reducing the data dimension, thereby easing diffusion modeling. During our densification-constrained fitting procedure, we primarily follow the hyper-parameters of the original Gaussian Splatting [28]. For OT-based Gaussian structuralization, we adopt an approximate solution to the OT problem because of the $O\left(N_{\text{max}}^{3}\right)$ time complexity of the Jonker-Volgenant algorithm [25]. Specifically, we divide the positions of the Gaussians and of the voxel grid into four sorted segments and apply the Jonker-Volgenant solver to each segment individually. We empirically find that this approximation strikes a balance between computational efficiency and spatial structure preservation. The proposed densification-constrained fitting takes around 2.67 minutes per object for 30K iterations, and the OT-based voxelization takes around 2 minutes, which can be run on CPU in parallel.
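The segment-wise approximation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sort key (the x-coordinate), the placement of voxel centers at cell midpoints, and the function name `segmented_assignment` are all hypothetical choices. SciPy's `linear_sum_assignment`, which is documented as using a modified Jonker-Volgenant algorithm, stands in for the solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist


def segmented_assignment(gaussians, voxels, num_segments=4):
    """Approximately match each Gaussian position to a unique voxel center.

    Both point sets are sorted (assumption: by x-coordinate), cut into
    `num_segments` equal chunks, and each chunk pair is solved exactly.
    Returns `match` such that gaussians[i] is assigned to voxels[match[i]].
    """
    n = len(gaussians)
    assert len(voxels) == n and n % num_segments == 0
    g_order = np.argsort(gaussians[:, 0], kind="stable")
    v_order = np.argsort(voxels[:, 0], kind="stable")
    match = np.empty(n, dtype=np.int64)
    size = n // num_segments
    for s in range(num_segments):
        g_idx = g_order[s * size:(s + 1) * size]
        v_idx = v_order[s * size:(s + 1) * size]
        # Squared Euclidean transport cost within this segment only.
        cost = cdist(gaussians[g_idx], voxels[v_idx], "sqeuclidean")
        rows, cols = linear_sum_assignment(cost)
        match[g_idx[rows]] = v_idx[cols]
    return match


# Toy example: 4^3 = 64 Gaussians mapped onto a 4x4x4 grid inside [-b, b]^3.
b, n_v = 0.5, 4
axis = (np.arange(n_v) + 0.5) / n_v * 2 * b - b  # assumed cell-center layout
grid = np.meshgrid(axis, axis, axis, indexing="ij")
voxels = np.stack(grid, axis=-1).reshape(-1, 3)
rng = np.random.default_rng(0)
gaussians = rng.uniform(-b, b, size=(n_v ** 3, 3))
match = segmented_assignment(gaussians, voxels)
```

Each segment solves a problem one quarter of the original size, so under cubic scaling the four sub-problems together cost roughly 4 × (1/4)^3 = 1/16 of the full solve, at the price of forbidding matches across segment boundaries.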

3D Diffusion. To train the 3D diffusion model, we first compute the instance-wise statistics of mean $\bar{\bm{\mu}}\in\mathbb{R}^{N_{v}\times N_{v}\times N_{v}\times C}$ and standard deviation $\bar{\bm{\sigma}}\in\mathbb{R}^{N_{v}\times N_{v}\times N_{v}\times C}$ from the GaussianCubes of each training dataset respectively. These statistics are then used to normalize the training data. For our 3D diffusion model architecture, we adopt the ADM-UNet from [15] and replace the convolution, upsampling, downsampling, and attention operations with 3D implementations. We train our model using the AdamW optimizer [31] and apply an exponential moving average (EMA) with a rate of 0.9999 during training. For unconditional generation on ShapeNet, we train the model with a base learning rate of 5e-5 for 850K iterations and then decay the learning rate to 5e-6 for another 150K iterations. For 3D digital avatar creation from a single portrait image, we adopt the pretrained DINO ViT-B/16 [6] to encode the $512\times 512$ conditional images into $1025\times 768$ conditional feature tokens. For text-to-3D creation, we use CLIP-L/14 [43] to encode the text prompts into $77\times 768$ conditional feature tokens.
We provide more detailed configurations of the model architectures, diffusion training and inference for each dataset in Table 8.
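The normalization step can be illustrated with a short sketch. It assumes the statistics are taken per voxel and per channel across all GaussianCubes of a dataset, which matches the stated shape $N_{v}\times N_{v}\times N_{v}\times C$; the small `eps` term and the function names are hypothetical additions for illustration.

```python
import numpy as np


def fit_statistics(cubes, eps=1e-6):
    """Per-voxel, per-channel mean and std over a stack of GaussianCubes.

    cubes has shape (num_objects, N_v, N_v, N_v, C); eps (an assumption)
    guards against division by zero for constant entries.
    """
    mean = cubes.mean(axis=0)
    std = cubes.std(axis=0) + eps
    return mean, std


def normalize(cube, mean, std):
    """Map raw GaussianCube values to the normalized training space."""
    return (cube - mean) / std


def denormalize(cube, mean, std):
    """Map a generated sample back to raw GaussianCube values."""
    return cube * std + mean


# Toy stack of 8 cubes with N_v = 4 and C = 14.
rng = np.random.default_rng(0)
cubes = rng.normal(loc=2.0, scale=3.0, size=(8, 4, 4, 4, 14))
mean, std = fit_statistics(cubes)
normalized = normalize(cubes, mean, std)
restored = denormalize(normalized, mean, std)
```

The inverse mapping matters at inference time: a sample drawn in the normalized space has to be de-normalized with the same statistics before it can be rendered.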

Implementation of Gaussian organization visualization in Figure 10 (a). For the $i$-th Gaussian, we obtain its corresponding voxel grid center $\bm{x}_{k}\in\mathbb{R}^{3}$ according to the Optimal Transport plan $\mathbf{T}^{*}$ (i.e., $\mathbf{T}^{*}_{ik}=1$) as described in Section 3.1. To visualize the coordinates of $\bm{x}_{k}$, we map them to an RGB color $\bm{C}_{k}\in\mathbb{R}^{3}$ using:

$$\bm{C}_{k}=\frac{(\bm{x}_{k}+\bm{b})}{2\bm{b}}\times\bm{255}, \tag{6}$$

where $\bm{b}$ is the bounding box in the world coordinate system. The resulting point-cloud-like visualizations are shown in Figure 10 (a), where smooth color transitions indicate that spatial correspondence is coherently preserved.
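As a worked instance of Equation (6), the following sketch maps a voxel center to an RGB triple. It treats $\bm{b}$ as a single scalar per dataset, as listed in Table 7; the function name is hypothetical.

```python
def voxel_center_to_rgb(x, b):
    """Map a voxel center x in [-b, b]^3 to an RGB triple in [0, 255] (Eq. 6)."""
    return tuple((xi + b) / (2.0 * b) * 255.0 for xi in x)


# ShapeNet Car uses b = 0.45 (Table 7): the corner -b maps to 0, the
# origin to mid-gray, and the corner +b to 255 along each axis.
b = 0.45
low, mid, high = voxel_center_to_rgb((-b, 0.0, b), b)
```

Because the mapping is affine in each coordinate, neighboring voxels receive nearby colors, so a Gaussian assigned to a distant voxel shows up as a color discontinuity in the visualization.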

A.2 Additional Ablation Study and Analysis

Ablation of $N_{\text{max}}$ in densification-constrained fitting. We conduct experiments to evaluate how $N_{\text{max}}$ affects fitting on ShapeNet Car. The results in Table 9 show a clear trend: increasing $N_{\text{max}}$ improves fitting accuracy. However, a larger $N_{\text{max}}$ also incurs higher computational costs during diffusion training. Therefore, we set $N_{\text{max}}$ to 32,768 to strike a balance between high-quality fitting and computational efficiency.
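The trade-off can be made concrete with a little arithmetic on Table 9. The sketch below checks that each $N_{\text{max}}$ equals $N_{v}^{3}$, counts the values stored per object with $C=14$, and computes the PSNR gain of each step; the variable and function names are illustrative only.

```python
# Rows of Table 9: (N_max, N_v, PSNR in dB).
TABLE_9 = [
    (4096, 16, 32.56),
    (13824, 24, 34.32),
    (32768, 32, 34.94),
    (110592, 48, 35.29),
    (262144, 64, 35.34),
]
C = 14  # channels per Gaussian, as set in A.1


def values_per_object(n_v, channels=C):
    """Number of scalars in one GaussianCube of resolution n_v."""
    return n_v ** 3 * channels


# Every N_max in the table fills the voxel grid exactly.
grid_is_full = all(n_max == n_v ** 3 for n_max, n_v, _ in TABLE_9)

# PSNR gain of each row over the previous one.
psnr_gains = [
    round(curr[2] - prev[2], 2) for prev, curr in zip(TABLE_9, TABLE_9[1:])
]
```

With $N_{v}=32$ a GaussianCube holds 458,752 values; doubling the resolution to 64 multiplies that by 8 while the PSNR rises by only 0.40 dB, which is the diminishing return behind the chosen setting.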

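Ablation of classifier-free guidance in class-conditioned generation. We study how classifier-free guidance (CFG) affects generation quality when running inference with class-conditioned diffusion models. We report the FID and KID metrics under different CFG scales in Table 10. Both metrics improve with moderate guidance, reach their best values at a scale of 2.0 (the setting used for OmniObject3D in Table 8), and degrade sharply at a scale of 6.0.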
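For reference, one common parameterization of classifier-free guidance is sketched below. The paper does not state which parameterization it uses, so this form, in which a scale of 1 recovers the purely conditional prediction, is an assumption, as is the function name.

```python
import numpy as np


def apply_cfg(eps_cond, eps_uncond, scale):
    """Blend conditional and unconditional noise predictions.

    Assumed convention: scale = 1 returns the conditional prediction,
    scale = 0 the unconditional one, and scale > 1 extrapolates away
    from the unconditional prediction.
    """
    return eps_uncond + scale * (eps_cond - eps_uncond)


# Toy predictions on a tiny 2x2x2 grid with 14 channels.
rng = np.random.default_rng(0)
eps_cond = rng.normal(size=(2, 2, 2, 14))
eps_uncond = rng.normal(size=(2, 2, 2, 14))
guided = apply_cfg(eps_cond, eps_uncond, scale=2.0)
```

Under this convention, large scales push the prediction far outside the range seen during training, which is one plausible reading of the sharp FID increase at a scale of 6.0.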

Visualization of intermediate results in the denoising process. During inference, our model starts from Gaussian noise and progressively denoises it to yield a high-quality GaussianCube. We present visualizations of the intermediate renderings $\bm{y}_{t}$ at various timesteps $t\in[0,T]$ throughout the denoising process, offering detailed insight into the GaussianCube diffusion procedure. As illustrated in Figure 11, our model first establishes the global structure and then incrementally refines the details, similar to previous 3D diffusion models [55, 46].

| Dataset | # Objects | # Views per object | Rotation angle | Elevation angle | Bounding box |
|---|---|---|---|---|---|
| ShapeNet Car | 7,462 | 150 | $[0, 2\pi]$ | $[\frac{1}{6}\pi, \frac{1}{2}\pi]$ | 0.45 |
| ShapeNet Chair | 6,775 | 150 | $[0, 2\pi]$ | $[\frac{1}{6}\pi, \frac{1}{2}\pi]$ | 0.35 |
| OmniObject3D | 5,795 | 100 | $[0, 2\pi]$ | $[0, \frac{1}{2}\pi]$ | 1.0 |
| Synthetic Avatar | 98,000 | 300 | $[0, 2\pi]$ | $[\frac{1}{6}\pi, \frac{2}{3}\pi]$ | 40.0 |
| Objaverse | 125,653 | 150 | $[0, 2\pi]$ | $[0, \frac{2}{3}\pi]$ | 0.5 |

Table 7: Details of each dataset.
| | ShapeNet Car | ShapeNet Chair | OmniObject3D | Synthetic Avatar | Objaverse |
|---|---|---|---|---|---|
| Diffusion steps | 1,000 | 1,000 | 1,000 | 1,000 | 1,000 |
| Noise schedule | Cosine | Cosine | Cosine | Cosine | Cosine |
| NFEs | 300 | 300 | 300 | 250 | 44 |
| Inference sampler | DPM-solver [32] | DPM-solver [32] | DPM-solver [32] | DPM-solver [32] | DPM-solver [32] |
| CFG scale | - | - | 2.0 | 1.3 | 3.5 |
| Model size | 82M | 82M | 82M | 339M | 339M |
| Channels | 64 | 64 | 64 | 128 | 128 |
| Channel mult | (1,2,3,4) | (1,2,3,4) | (1,2,3,4) | (1,2,3,4) | (1,2,3,4) |
| Num res blocks | 3 | 3 | 3 | 3 | 3 |
| Attn resolutions | (8, 4) | (8, 4) | (8, 4) | (8, 4) | (8, 4) |
| Num head channels | 64 | 64 | 64 | 64 | 64 |
| Dropout | 0 | 0 | 0 | 0 | 0 |
| Scale shift norm | True | True | True | True | True |
| Training steps | 1,000K | 1,000K | 700K | 1,200K | 1,800K |
| Training GPUs | 16 | 16 | 16 | 16 | 32 |
| Batch size | 128 | 128 | 128 | 128 | 256 |
| Base lr | 5e-5 | 5e-5 | 5e-5 | 5e-5 | 5e-5 |
| Lr decay steps | 850K | 850K | - | - | - |

Table 8: Detailed configuration of model architecture, diffusion training and inference on each dataset.
| $N_{\text{max}}$ | $N_{v}$ | PSNR ↑ | LPIPS ↓ | SSIM ↑ |
|---|---|---|---|---|
| 4096 | 16 | 32.56 | 0.0547 | 0.9765 |
| 13824 | 24 | 34.32 | 0.0396 | 0.9842 |
| 32768 | 32 | 34.94 | 0.0347 | 0.9863 |
| 110592 | 48 | 35.29 | 0.0307 | 0.9874 |
| 262144 | 64 | 35.34 | 0.0301 | 0.9875 |

Table 9: Quantitative ablation of $N_{\text{max}}$ in densification-constrained fitting. We set $N_{\text{max}}$ to 32,768 in this paper.
| Scale | w/o CFG | 1.3 | 1.5 | 2.0 | 3.0 | 6.0 |
|---|---|---|---|---|---|---|
| FID-50K ↓ | 13.39 | 12.07 | 11.72 | 11.62 | 12.99 | 32.80 |
| KID-50K (‰) ↓ | 4.01 | 3.12 | 3.00 | 2.78 | 3.17 | 14.36 |

Table 10: Quantitative ablation of CFG scale in the class-conditioned generation of OmniObject3D [59].
Figure 11: Visualization of generation results in intermediate diffusion timesteps.
Figure 12: Visualization of nearest neighbor search on ShapeNet Car and Chair.
Panel labels (left to right): Text Condition, Generated Sample, Text Condition, Generated Sample.
Figure 13: Failure cases.

Nearest neighbors analysis. We perform nearest neighbor search of some unconditionally generated samples in the paper according to the similarity of pretrained CLIP [43] features. The results in Figure 12 demonstrate that our model is capable of generating novel geometry and textures rather than simply memorizing the training data.

Panel labels (left to right): Reference, Rodin [55], Ours.
Figure 14: Additional qualitative comparison of 3D avatars creation conditioned on single in-the-wild portraits.
Panel labels (left to right): Reference, Rodin [55], Ours.
Figure 15: Qualitative comparison of generated digital avatars conditioned on synthetic portraits.
Panel labels (left to right): DreamGaussian, VolumeDiffusion, Shap-E, LGM, Ours.
Figure 16: Additional qualitative comparison of text-to-3D generation on Objaverse [12]. Our model is capable of creating high-quality samples following input text prompts.
Figure 17: Additional results of text-to-3D generation.
Panel labels (left to right): Text Condition, Sample 1, Sample 2, Sample 3.
Figure 18: Variation of text-to-3D generation. Our model is able to generate diverse results conditioned on the same text prompt.
Figure 19: Example of text-guided 3D editing.
Figure 20: Additional generated samples on ShapeNet Car.
Figure 21: Additional generated samples on ShapeNet Chair.
Figure 22: Additional generated samples on OmniObject3D.

A.3 Additional Visual Results

For 3D avatar generation, although trained on a synthetic dataset, our model generalizes to in-the-wild portrait input. We provide more visual comparisons with Rodin [55] on 3D avatar creation conditioned on in-the-wild portraits in Figure 14. We also include additional comparisons conditioned on synthetic input from our test set in Figure 15. Our model faithfully retains the identity of the reference portrait and provides high-fidelity results with rich details, e.g., hair, glasses, and clothing. Despite using a pretrained 2D super-resolution module, which significantly compromises 3D consistency, Rodin struggles to follow the conditional images and fails to produce detailed textures in non-facial areas, e.g., clothing and hair.

We include additional qualitative comparison and generated samples of text-to-3D generation in Figure 16 and Figure 17 respectively. Our model yields samples with better visual quality, and is capable of handling challenging prompts. The results in Figure 18 show the generation diversity of our results given the same text prompt. Our model is also capable of performing text-guided editing of generated objects by leveraging SDEdit [34] as depicted in Figure 19, demonstrating the promise of achieving controllable 3D generation.

We provide more generated samples of unconditional and class-conditioned generation in Figure 20, Figure 21, and Figure 22. The additional results demonstrate the strong capability of our model to create high-quality 3D assets with complex geometry and intricate textures.

We also provide an additional video in the supplementary material, which illustrates our approach and visualizes the generated results.

A.4 Limitations

While GaussianCube represents a substantial step forward in developing an ideal representation for 3D content generation, it still has some limitations. Specifically, although the GaussianCube construction procedure is considerably more rapid than that of NeRF-based methods and can be executed in parallel, it still requires approximately 5 minutes to construct each object. This presents a challenge for scaling up training on extensive 3D datasets. In future work, we plan to investigate more time-efficient methods for GaussianCube construction. Additionally, akin to prior 2D diffusion models, our text-to-3D diffusion model encounters difficulties in presenting the specified number of objects within prompts as shown in Figure 13. To address this, we will look into enhancing the precision and controllability of 3D generation in the future.

A.5 Broader Impacts

The proposed GaussianCube enables high-quality 3D asset fitting with few parameters, which significantly eases 3D generative modeling. Our diffusion model is capable of generating high-quality 3D assets with complex geometry and intricate textures while also accommodating a variety of conditional signals to steer the creation process. The strong capability of GaussianCube suggests its potential to serve as a versatile 3D representation for a variety of applications in future 3D research.

As with all generative models, particular caution is required when dealing with sensitive tasks involving human representations. Our avatar creation model is trained exclusively on a synthetic dataset [57] composed of large-scale 3D digital avatars generated through a graphics pipeline. We conceptualize digital avatars as analogous to those created by specialized 3D artists, rather than photorealistic human images. This choice of training data mitigates privacy and copyright issues that might arise from using real human photo collections. Nevertheless, it is crucial to acknowledge that avatars generated by our model from real-world imagery could still be misused for spreading disinformation. As such, we advocate implementing rigorous safeguards and promoting responsible use of our technology and related ones to mitigate such risks.