GaussianCube: A Structured and Explicit
Radiance Representation for 3D Generative Modeling

Bowen Zhang¹∗ Yiji Cheng²∗ Jiaolong Yang³† Chunyu Wang³†
Feng Zhao¹ Yansong Tang² Dong Chen³‡ Baining Guo³
¹University of Science and Technology of China ²Tsinghua University ³Microsoft Research Asia

Abstract

We introduce a radiance representation that is both structured and fully explicit and thus greatly facilitates 3D generative modeling. Existing radiance representations either require an implicit feature decoder, which significantly degrades the modeling power of the representation, or are spatially unstructured, making them difficult to integrate with mainstream 3D diffusion methods. We derive GaussianCube by first using a novel densification-constrained Gaussian fitting algorithm, which yields high-accuracy fitting using a fixed number of free Gaussians, and then rearranging these Gaussians into a predefined voxel grid via Optimal Transport. Since GaussianCube is a structured grid representation, it allows us to use a standard 3D U-Net as our backbone in diffusion modeling without elaborate designs. More importantly, the high-accuracy fitting of the Gaussians allows us to achieve a high-quality representation with one to two orders of magnitude fewer parameters than previous structured representations of comparable quality. The compactness of GaussianCube greatly eases the difficulty of 3D generative modeling. Extensive experiments conducted on unconditional and class-conditioned object generation, digital avatar creation, and text-to-3D synthesis all show that our model achieves state-of-the-art generation results both qualitatively and quantitatively, underscoring the potential of GaussianCube as a highly accurate and versatile radiance representation for 3D generative modeling. Project page: https://gaussiancube.github.io/.

∗Interns at Microsoft Research Asia. †Equal advising. ‡Corresponding author.

1 Introduction

The field of 3D generative modeling [55, 36, 5, 53, 46, 8, 17] has witnessed remarkable growth, driven by advancements in generative modeling [23, 18, 38, 15, 66, 27]. Most of the prior works in this domain leverage variants of Neural Radiance Field (NeRF) [35, 8, 53, 37] as their underlying 3D representations, which typically consist of an explicit structured proxy representation and an implicit feature decoder. However, such hybrid NeRF variants exhibit degraded representation power, particularly when used for generative modeling where a single implicit feature decoder is shared across all objects. Additionally, the high computational complexity of volumetric rendering leads to both slow rendering speed and extensive memory costs.

Recently, the emergence of 3D Gaussian Splatting (GS) [28] has enabled improved reconstruction quality and real-time rendering capabilities [64, 33, 58]. The fully explicit nature of 3DGS eliminates the need for a shared implicit decoder, providing another key advantage over NeRFs. Although 3DGS has been widely studied in scene reconstruction tasks, its spatially unstructured nature presents a significant challenge when applied to mainstream generative modeling frameworks.

To overcome these barriers, we introduce GaussianCube – an innovative radiance representation that is both structured and fully explicit, with strong fitting capabilities (see Table 1 for comparisons with prior works). The proposed approach first ensures high-accuracy fitting with a predefined number of free Gaussians, and subsequently organizes these Gaussians into a structured voxel grid. Such an explicit grid-based structure permits the seamless application of standard 3D convolutional architectures, such as U-Net, thereby eliminating the need for complex, specialized network designs [69, 55] that are often necessary with unstructured or implicitly decoded representations.

Structuring 3D Gaussians without sacrificing fitting quality is not a trivial task. A naive starting point would be obtaining a fixed number of Gaussians by omitting the densification and pruning steps in GS. However, such simplification fails to lead the Gaussians close to the object surfaces and results in significant quality degradation. In contrast, we propose a densification-constrained fitting strategy, which retains the original pruning process yet constrains the number of Gaussians that perform densification, ensuring the total does not exceed a predefined maximum $N^{3}$ . For the subsequent structuralization, we allocate the Gaussians across an $N\times N\times N$ voxel grid using Optimal Transport (OT). Consequently, our fitted Gaussians are systematically arranged within the voxel grid, with each voxel containing the features of a Gaussian. The proposed OT-based structuralization achieves maximal spatial correspondence, characterized by minimal total transport distances, while preserving the expressive power of 3DGS.

Representation	Spatially-structured	Fully-explicit	Real-time Rendering	Rel. Parameters $\downarrow$
Instant-NGP [37]	✗	✗	✗	$26.63\times$
Neural Voxels [53]	✓	✗	✗	$145.9\times$
Triplane [8]	✓	✗	✗	$13.7\times$
Gaussian Splatting [28]	✗	✓	✓	$4.0\times$
Our GaussianCube	✓	✓	✓	$\mathbf{1.0\times}$

Table 1: Comparison with previous 3D representations with respect to spatial structure, explicitness, real-time rendering capability, and relative parameter count (Rel. Parameters) for representations of comparable quality.

The structured nature of GaussianCube enables us to perform efficient 3D diffusion [23] modeling for the following three reasons: 1) It allows the use of standard 3D U-Net as our backbone for diffusion modeling without elaborate designs. 2) The spatial coherence of GaussianCube permits the use of standard 3D convolutions to capture the correlations among neighboring Gaussians, facilitating efficient feature extraction. 3) GaussianCube enables high-quality fitting with orders of magnitude fewer parameters than prior structured representations of similar quality. Since recent works [30, 3] have demonstrated diffusion models’ struggle in handling high-dimensional distributions, the compactness of GaussianCube significantly reduces the modeling difficulty of the generative framework.

We conduct comprehensive experiments to verify the efficacy of our approach. The model’s capability for unconditional and class-conditioned generation is evaluated on the ShapeNet [9] and OmniObject3D [59] datasets. Both the quantitative and qualitative comparisons indicate that our model surpasses all previous methods. We also perform digital avatar generation on a synthetic avatar dataset [57]. Our model is capable of producing high-fidelity 3D avatars conditioned on single portrait images, excelling beyond prior art in both identity preservation and detail creation. Additionally, we assess our model’s capacity for the challenging text-to-3D creation task on Objaverse [12]. Our model demonstrates competitive performance both quantitatively and qualitatively, producing results consistent with the given text prompts in just 5 seconds. All experiments show the strong capabilities of our GaussianCube and suggest its potential as a powerful and versatile 3D representation for a variety of applications. Some generated samples of our method are presented in Figure 1.

Figure 1: Our diffusion model is able to create diverse objects with complex geometry and rich texture details (top three rows). Our method also supports creating high-fidelity digital avatars (the fourth row) conditioned on single portrait images (visualized in dashed boxes) and high-quality 3D assets given text prompts (the fifth row).

2 Related Work

Radiance field representation. Radiance fields model ray interactions with scene surfaces and can take either implicit or explicit forms. Early works on neural radiance fields (NeRFs) [35, 67, 40, 1, 42] are often in an implicit form, which represents scenes without defining geometry. These works optimize a continuous scene representation using volumetric ray-marching, which leads to extremely high computational costs. Later works introduce explicit proxy representations [8, 24, 16, 48, 37, 63] followed by an implicit feature decoder to enable faster rendering. More recently, 3D Gaussian Splatting methods [28, 64, 58, 11, 29] utilize 3D Gaussians as their underlying representation and offer impressive reconstruction quality. The fully explicit representation also provides real-time rendering speed. However, 3D Gaussians are an unstructured representation and require per-scene optimization to achieve photo-realistic quality. In contrast, our work proposes a structured representation termed GaussianCube for 3D generative tasks.

3D generation. Previous works on SDS-based optimization [41, 52, 62, 49, 10, 50, 65] distill 2D diffusion priors [44] into a 3D representation with the score functions, but these works are notably time-intensive, often taking several minutes to hours. While 3D-aware GANs [8, 17, 7, 19, 39, 14, 61] facilitate view-dependent image generation from single-image collections, they struggle to capture the complexity of diverse objects with intricate geometric variations [60]. Although recent works [55, 36, 20, 53, 46] have utilized diffusion models with structured proxy representations for 3D generation, the use of a shared implicit feature decoder across different assets restricts expressiveness, and the computational demands of NeRF hinder efficient training. In contrast, we introduce a structured and fully explicit radiance representation for 3D generative modeling, building upon 3DGS [28]. A concurrent work [21] includes elaborate designs to form the Gaussians into a volumetric representation during fitting, yet does not thoroughly address global correspondence. Our approach instead only restricts the total count of Gaussians while allowing freedom in their spatial distribution during the fitting. We then organize these Gaussians into a voxel grid using Optimal Transport, which yields a spatially coherent arrangement with minimal global offset cost, effectively easing the difficulty of generative modeling.

3 Method

Following prior works, our framework comprises two primary stages as shown in Figure 2: representation construction and diffusion modeling. In the representation construction phase, we first apply a densification-constrained 3DGS fitting algorithm to each object to obtain a constant number of Gaussians. These Gaussians are then organized into the proposed spatially structured GaussianCube via Optimal Transport between the positions of the Gaussians and the centers of a predefined voxel grid. For diffusion modeling, we train a 3D diffusion model to learn the distribution of GaussianCubes. We detail our designs for each stage below.

3.1 Representation Construction

We build our GaussianCube upon 3DGS, an explicit representation that offers impressive fitting quality and real-time rendering speed. However, it fails to yield Gaussians of fixed length since the adaptive density control during GS fitting can lead to a varying number of Gaussians for different objects. Furthermore, the lack of a predetermined spatial ordering for Gaussians leads to a disorganized spatial structure. These aspects pose significant challenges to 3D generative modeling. To overcome these obstacles, we first introduce our densification-constrained fitting strategy to obtain a fixed number of free Gaussians. Then, we systematically arrange the resulting Gaussians within a predefined voxel grid via Optimal Transport, thereby achieving a structured and explicit radiance representation.

Formally, a 3D asset is represented by a collection of 3D Gaussians as introduced in Gaussian Splatting [28]. The geometry of the $i$ -th 3D Gaussian $\bm{g}_{i}$ is given by

$$\bm{g}_{i}(\bm{x})=\exp\left(-\frac{1}{2}\left(\bm{x}-\bm{\mu}_{i}\right)^{\top}\bm{\Sigma}_{i}^{-1}\left(\bm{x}-\bm{\mu}_{i}\right)\right), \tag{1}$$

where $\bm{\mu}_{i}\in\mathbb{R}^{3}$ is the center of the Gaussian and $\bm{\Sigma}_{i}\in\mathbb{R}^{3\times 3}$ is the covariance matrix defining the shape and size, which can be decomposed into a quaternion $\bm{q}_{i}\in\mathbb{R}^{4}$ and a vector $\bm{s}_{i}\in\mathbb{R}^{3}$ for rotation and scaling, respectively. Moreover, each Gaussian $\bm{g}_{i}$ has an opacity value $\alpha_{i}\in\mathbb{R}$ and a color feature $\bm{c}_{i}\in\mathbb{R}^{3}$ for rendering. Combining them together, the $C$-channel feature vector $\bm{\theta}_{i}=\{\bm{\mu}_{i},\bm{s}_{i},\bm{q}_{i},\alpha_{i},\bm{c}_{i}\}\in\mathbb{R}^{C}$ fully characterizes the Gaussian $\bm{g}_{i}$.
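To make the parameterization concrete, the following is a minimal NumPy sketch of how a quaternion and a scaling vector can be turned into a covariance matrix and used to evaluate Eq. (1). It assumes the factorization $\bm{\Sigma}=\bm{R}\bm{S}\bm{S}^{\top}\bm{R}^{\top}$ used in the original Gaussian Splatting formulation, a $(w,x,y,z)$ quaternion ordering, and scales that are already in linear (activated) form. The function names and the channel ordering in `pack_features` are illustrative assumptions, not taken from the authors' code.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix (assumed ordering)."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def covariance(q, s):
    """Sigma = R S S^T R^T, with S = diag(s)."""
    R = quat_to_rotmat(q)
    S = np.diag(s)
    return R @ S @ S.T @ R.T

def gaussian_value(x, mu, q, s):
    """Evaluate Eq. (1) at a 3D point x."""
    d = x - mu
    return float(np.exp(-0.5 * d @ np.linalg.solve(covariance(q, s), d)))

def pack_features(mu, s, q, alpha, c):
    """Concatenate (mu, s, q, alpha, c) into one feature vector theta_i."""
    return np.concatenate([mu, s, q, [alpha], c])
```

With three position, three scale, four rotation, one opacity and three color channels, `pack_features` returns a 14-dimensional vector, which matches the channel count reported in Section 4.2.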

Densification-constrained fitting. Our approach begins with the aim of maintaining a constant number of Gaussians $\bm{g}\in\mathbb{R}^{N_{\text{max}}\times C}$ across different objects during the fitting. A simplistic approach might involve omitting the densification and pruning in the original GS. However, we argue that such simplifications significantly harm the fitting quality, as the empirical evidence in Table 6 shows. Instead, we propose to retain the pruning process while imposing a new constraint on the densification phase, as shown in Figure 3 (a). The fitting process encompasses several distinct stages: 1) Densification detection: Assuming the current iteration includes $N_{\text{c}}$ Gaussians, we identify densification candidates by selecting those with view-space position gradient magnitudes exceeding a predefined threshold $\tau$. We denote the number of candidates as $N_{d}$. 2) Candidate sampling: To prevent exceeding the predefined maximum of $N_{\text{max}}$ Gaussians, we select the $\min{(N_{\text{max}}-N_{\text{c}},N_{d})}$ candidates with the largest view-space positional gradients for densification. 3) Densification: We modify the densification approach by performing the cloning and splitting actions in separate, alternating steps. 4) Pruning detection and pruning: We identify and remove the Gaussians with $\alpha$ less than a small threshold $\epsilon$. After completing the fitting process, we pad with Gaussians of $\alpha=0$ to reach the target count of $N_{\text{max}}$ without affecting the rendering results. Benefiting from our proposed strategy, we attain a high-quality representation with orders of magnitude fewer parameters compared to existing works of similar quality, which significantly reduces the modeling difficulty for the diffusion models.
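The candidate sampling, pruning and padding steps above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the gradient magnitudes are taken as a precomputed array, and each densified Gaussian is assumed to add exactly one Gaussian to the total (as cloning does, and as splitting one Gaussian into two does in the original GS), so that a budget of $N_{\text{max}}-N_{\text{c}}$ candidates keeps the total within $N_{\text{max}}$.

```python
import numpy as np

def select_densify_candidates(grad_mag, n_max, tau):
    """Indices of Gaussians allowed to densify in this iteration.

    grad_mag: (N_c,) view-space positional gradient magnitudes.
    """
    n_c = grad_mag.shape[0]
    candidates = np.flatnonzero(grad_mag > tau)         # stage 1: detection, N_d candidates
    budget = max(min(n_max - n_c, candidates.size), 0)  # stage 2: min(N_max - N_c, N_d)
    if budget == 0:
        return np.empty(0, dtype=np.int64)
    order = np.argsort(grad_mag[candidates])[::-1]      # largest gradients first
    return candidates[order[:budget]]

def keep_mask(alpha, eps):
    """Stage 4: boolean mask that drops Gaussians with opacity below eps."""
    return alpha >= eps

def pad_to_n_max(features, n_max):
    """Pad an (N, C) feature array with all-zero rows up to N_max.

    All-zero rows have alpha = 0, so they do not contribute to rendering;
    the values of their other channels are arbitrary placeholders here.
    """
    n, c = features.shape
    pad = np.zeros((n_max - n, c), dtype=features.dtype)
    return np.concatenate([features, pad], axis=0)
```

For example, with five current Gaussians, `n_max=7` and three candidates above the threshold, only the two candidates with the largest gradients are selected.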

Gaussian structuralization via Optimal Transport. To further organize the obtained Gaussians into a spatially structured representation for 3D generative modeling, we propose to map the Gaussians to a predefined structured voxel grid $\bm{v}\in\mathbb{R}^{N_{v}\times N_{v}\times N_{v}\times C}$ where $N_{v}=\sqrt[3]{N_{\text{max}}}$. Intuitively, we aim to “move” each Gaussian into a voxel while preserving their geometric relations as much as possible. Naive approaches such as nearest neighbor transport fall short in conserving these relations because they disregard the global arrangement (see Figure 10). We therefore formulate this as an Optimal Transport (OT) problem [54, 4] between the Gaussians’ spatial positions $\{\bm{\mu}_{i},i=1,\ldots,N_{\text{max}}\}$ and the voxel grid centers $\{\bm{x}_{j},j=1,\ldots,N_{\text{max}}\}$. Let $\mathbf{D}$ be a cost matrix with $\mathbf{D}_{ij}$ being the cost of moving $\bm{\mu}_{i}$ to $\bm{x}_{j}$, taken as the squared Euclidean distance, i.e., $\mathbf{D}_{ij}=\|\bm{\mu}_{i}-\bm{x}_{j}\|^{2}$. The transport plan is represented by a binary matrix $\mathbf{T}\in\{0,1\}^{N_{\text{max}}\times N_{\text{max}}}$, and the optimal transport plan is given by:

$$\begin{array}{ll}\underset{\mathbf{T}}{\operatorname{minimize}}&\sum_{i=1}^{N_{\text{max}}}\sum_{j=1}^{N_{\text{max}}}\mathbf{T}_{ij}\mathbf{D}_{ij}\\ \text{subject to}&\sum_{j=1}^{N_{\text{max}}}\mathbf{T}_{ij}=1\quad\forall i\in\{1,\ldots,N_{\text{max}}\}\\ &\sum_{i=1}^{N_{\text{max}}}\mathbf{T}_{ij}=1\quad\forall j\in\{1,\ldots,N_{\text{max}}\}\\ &\mathbf{T}_{ij}\in\{0,1\}\qquad\forall(i,j)\in\{1,\ldots,N_{\text{max}}\}\times\{1,\ldots,N_{\text{max}}\}.\end{array} \tag{2}$$

The solution is a bijective transport plan $\mathbf{T}^{*}$ that minimizes the total transport distance. We employ the Jonker-Volgenant algorithm [25] to solve the OT problem. We provide a 2D illustration in Figure 3 (b). We organize the Gaussians according to the solution, with the $j$-th voxel encapsulating the feature vector of the corresponding Gaussian $\bm{\theta}_{k}=\{\bm{\mu}_{k}-\bm{x}_{j},\bm{s}_{k},\bm{q}_{k},\alpha_{k},\bm{c}_{k}\}\in\mathbb{R}^{C}$, where $k$ is determined by the optimal transport plan (i.e., $\mathbf{T}^{*}_{kj}=1$). Note that we replace the original Gaussian positions with offsets from the current voxel center to reduce the solution space for diffusion models. As a result, our fitted Gaussians are systematically arranged within a voxel grid $\bm{v}$ and preserve the spatial correspondence of neighboring Gaussians, which further facilitates generative modeling.
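The assignment in Eq. (2) can be sketched on a toy grid with SciPy's `linear_sum_assignment`, which the SciPy documentation describes as a modified Jonker-Volgenant algorithm. This is a sketch under assumptions rather than the authors' implementation: the grid is assumed to cover $[-1,1]^{3}$, the first three feature channels are assumed to hold $\bm{\mu}$, and the dense cost matrix used here is only practical for small $N_{v}$ (at $N_{\text{max}}=32{,}768$ it would hold roughly $10^{9}$ entries).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def voxel_centers(n_v, lo=-1.0, hi=1.0):
    """Centers of an n_v^3 grid covering [lo, hi]^3 (assumed bounds), shape (n_v^3, 3)."""
    edges = np.linspace(lo, hi, n_v + 1)
    c = 0.5 * (edges[:-1] + edges[1:])
    gx, gy, gz = np.meshgrid(c, c, c, indexing="ij")
    return np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)

def structuralize(theta, n_v):
    """Arrange (n_v^3, C) Gaussian features into an (n_v, n_v, n_v, C) cube.

    Returns the cube and the total transport cost of the assignment.
    """
    mu = theta[:, :3]
    x = voxel_centers(n_v)
    # D_ij = ||mu_i - x_j||^2
    D = ((mu[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    rows, cols = linear_sum_assignment(D)   # T*_{rows[k], cols[k]} = 1
    cube = np.empty_like(theta)
    cube[cols] = theta[rows]                # Gaussian rows[k] goes to voxel cols[k]
    cube[cols, :3] = mu[rows] - x[cols]     # store offsets from the voxel center
    return cube.reshape(n_v, n_v, n_v, -1), float(D[rows, cols].sum())
```

Adding the voxel centers back to the stored offsets recovers the original positions, so the rearrangement itself loses no information.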

Figure 3: Illustration of representation construction. First, we perform densification-constrained fitting to yield a fixed number of Gaussians, as shown in (a). We then employ Optimal Transport to organize the resultant Gaussians into a voxel grid. A 2D illustration of this process is presented in (b).

3.2 3D Diffusion on GaussianCube

We now introduce our 3D diffusion model incorporated with the proposed expressive, efficient and spatially structured representation. After organizing the fitted Gaussians $\bm{g}$ into GaussianCube $\bm{y}$ for each object, we aim to model the distribution of GaussianCube, i.e., $p(\bm{y})$ .

Formally, the generation procedure can be formulated as the inversion of a discrete-time Markov forward process. During the forward phase, we gradually add noise to $\bm{y}_{0}\sim p(\bm{y})$ and obtain a sequence of increasingly noisy samples $\{\bm{y}_{t}\mid t\in[0,T]\}$ according to $\bm{y}_{t}:=\alpha_{t}\bm{y}_{0}+\sigma_{t}\bm{\epsilon}$, where $\bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\bm{I})$ represents the added Gaussian noise, and $\alpha_{t},\sigma_{t}$ constitute the noise schedule. As a result, $\bm{y}_{T}$ reaches isotropic Gaussian noise after sufficient destruction steps. By reversing the above process, we are able to perform generation by gradually denoising the sample, starting from pure Gaussian noise $\bm{y}_{T}\sim\mathcal{N}(\mathbf{0},\bm{I})$ until reaching $\bm{y}_{0}$. Our diffusion model is trained to denoise $\bm{y}_{t}$ into $\bm{y}_{0}$ for each timestep $t$, facilitating both unconditional and conditional generation.
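The forward process can be sketched in a few lines. The sketch below assumes a variance-preserving parameterization ($\alpha_{t}^{2}+\sigma_{t}^{2}=1$) and a simplified continuous form of the cosine schedule with offset `s=0.008`; both are assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def cosine_schedule(T, s=0.008):
    """Return (alpha_t, sigma_t) for t = 0..T with alpha_t^2 + sigma_t^2 = 1."""
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = np.clip(f / f[0], 0.0, 1.0)  # cumulative signal level
    return np.sqrt(alpha_bar), np.sqrt(1.0 - alpha_bar)

def q_sample(y0, t, alpha, sigma, rng):
    """Draw y_t = alpha_t * y_0 + sigma_t * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(y0.shape)
    return alpha[t] * y0 + sigma[t] * eps, eps
```

At `t = 0` the sample equals `y0`, and at `t = T` the signal coefficient is numerically zero, so the sample is pure Gaussian noise.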

Model architecture. Thanks to the spatially structured organization of the proposed GaussianCube, standard 3D convolution is sufficient to effectively extract and aggregate the features of neighboring Gaussians without elaborate designs. We leverage the standard U-Net network for diffusion [38, 15] and simply replace the original 2D operators including convolution, attention, upsampling and downsampling with their 3D counterparts.

Conditioning mechanism. Our model supports a variety of condition signals to control the generation process. When performing class-conditioned diffusion modeling, we employ adaptive group normalization (AdaGN) [15] to inject the class labels into our model. For image-conditioned digital avatar creation, we leverage a pretrained vision transformer [6] to encode the conditional image into a sequence of feature tokens. We subsequently adopt cross-attention to make the model learn the correspondence between 3D activations and 2D image feature tokens following [5]. We also leverage cross-attention as our condition mechanism when creating 3D objects from text, similar to previous text-to-image diffusion models [44].

Representation	Spatially-structured	PSNR $\uparrow$	LPIPS $\downarrow$	SSIM $\uparrow$	Rel. Speed $\uparrow$	Params (M) $\downarrow$
Instant-NGP	✗	33.98	0.0386	0.9809	$1\times$	12.25
Gaussian Splatting	✗	35.32	0.0303	0.9874	$2.58\times$	1.84
Voxels	✓	28.95	0.0959	0.9470	$1.73\times$	0.47
Voxels^∗	✓	25.80	0.1407	0.9111	$1.73\times$	0.47
Triplane	✓	32.61	0.0611	0.9709	$1.05\times$	6.30
Triplane^∗	✓	31.39	0.0759	0.9635	$1.05\times$	6.30
Our GaussianCube	✓	34.94	0.0347	0.9863	$\mathbf{3.33\times}$	0.46

Table 2: Comparison with prior 3D representations of spatial structure, fitting quality, relative fitting speed (Rel. Speed) and parameter sizes on ShapeNet Car. ^∗ denotes that the implicit feature decoder is shared across different objects. All methods are evaluated at 30K iterations.

Figure 4: Qualitative results of object fitting. Columns, from left to right: Ground-truth, Instant-NGP, Gaussian Splatting, Voxel^∗, Triplane^∗, Our GaussianCube.

Training objective. In our 3D diffusion training, we parameterize our model $\hat{\bm{y}}_{\theta}$ to predict the noise-free input $\bm{y}_{0}$ using:

$$\mathcal{L}_{\text{simple}}=\mathbb{E}_{t,\bm{y}_{0},\bm{\epsilon}}\left[\left\|\hat{\bm{y}}_{\theta}\left(\alpha_{t}\bm{y}_{0}+\sigma_{t}\bm{\epsilon},t,\bm{c}_{\text{cls}}\right)-\bm{y}_{0}\right\|_{2}^{2}\right], \tag{3}$$

where the condition signal $\bm{c}_{\text{cls}}$ is only needed when training conditional diffusion models. We additionally impose image-level supervision to improve the rendering quality of generated GaussianCube, which has been demonstrated to effectively enhance the perceptual details in previous works [55, 36]. Specifically, we penalize the discrepancy between the rasterized images $I_{\text{pred}}$ of the predicted GaussianCubes and the ground-truth images $I_{\text{gt}}$ using:

$$\mathcal{L}_{\text{image}}=\mathbb{E}_{t,I_{\text{pred}}}\left(\sum_{l}\left\|\Psi^{l}\left(I_{\text{pred}}\right)-\Psi^{l}\left(I_{\text{gt}}\right)\right\|_{2}^{2}\right)+\mathbb{E}_{t,I_{\text{pred}}}\left(\left\|I_{\text{pred}}-I_{\text{gt}}\right\|_{2}\right), \tag{4}$$

where $\Psi^{l}$ is the multi-resolution feature extracted using the pre-trained VGG [47]. Benefiting from the fast rendering and low memory cost of GS [28], we are able to perform fast training with high-resolution renderings. Our overall training loss can be formulated as:

$$\mathcal{L}=\mathcal{L}_{\text{simple}}+\lambda\mathcal{L}_{\text{image}}, \tag{5}$$

where $\lambda$ is a balancing weight.
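As a rough illustration of how Eqs. (3) to (5) combine, the sketch below computes the three losses on NumPy arrays. It makes several loud assumptions: the rasterization that produces $I_{\text{pred}}$ is omitted and the images are passed in directly, the expectations are replaced by batch means, and `features` is a hypothetical stand-in (repeated average pooling) for the pre-trained VGG features $\Psi^{l}$, used only so the example is self-contained.

```python
import numpy as np

def l_simple(y_pred, y0):
    """Eq. (3): squared L2 error per sample, averaged over the batch."""
    diff = (y_pred - y0).reshape(y0.shape[0], -1)
    return float((diff ** 2).sum(axis=1).mean())

def avg_pool2(img):
    """2x average pooling on (B, H, W, 3) images with even H and W."""
    b, h, w, c = img.shape
    return img.reshape(b, h // 2, 2, w // 2, 2, c).mean(axis=(2, 4))

def features(img, levels=3):
    """HYPOTHETICAL stand-in for the multi-resolution VGG features Psi^l."""
    out = []
    for _ in range(levels):
        img = avg_pool2(img)
        out.append(img)
    return out

def l_image(i_pred, i_gt):
    """Eq. (4): feature-space squared L2 term plus pixel-space L2 term."""
    b = i_pred.shape[0]
    perceptual = sum(
        ((fp - fg).reshape(b, -1) ** 2).sum(axis=1)
        for fp, fg in zip(features(i_pred), features(i_gt))
    )
    pixel = np.linalg.norm((i_pred - i_gt).reshape(b, -1), axis=1)
    return float((perceptual + pixel).mean())

def total_loss(y_pred, y0, i_pred, i_gt, lam=10.0):
    """Eq. (5): L = L_simple + lambda * L_image."""
    return l_simple(y_pred, y0) + lam * l_image(i_pred, i_gt)
```

The default `lam=10.0` follows the loss weight reported in Section 4.2; everything else about the feature extractor is a placeholder.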

Method	ShapeNet Car FID-50K $\downarrow$	ShapeNet Car KID-50K(‰) $\downarrow$	ShapeNet Chair FID-50K $\downarrow$	ShapeNet Chair KID-50K(‰) $\downarrow$	OmniObject3D FID-50K $\downarrow$	OmniObject3D KID-50K(‰) $\downarrow$
EG3D	30.48	20.42	27.98	16.01	-	-
GET3D	17.15	9.58	19.24	10.95	-	-
DiffTF	51.88	41.10	47.08	31.29	46.06	22.86
Ours	13.01	8.46	15.99	9.95	11.62	2.78

Table 3: Quantitative results of unconditional generation on ShapeNet Car and Chair [9] and class-conditioned generation on OmniObject3D [59].

Figure 5: Qualitative comparison of unconditional 3D generation on ShapeNet Car and Chair datasets. Columns, from left to right: EG3D, GET3D, DiffTF, Ours. Our model is capable of generating results of complex geometry with rich details.

Figure 6: Qualitative comparison of class-conditioned 3D generation on large-vocabulary OmniObject3D [59]. Columns, from left to right: DiffTF, Ours. Our model is able to handle diverse distributions with semantically accurate results.

4 Experiments

4.1 Dataset and Metrics

To measure the expressiveness and efficiency of various 3D representations, we fit 100 objects in ShapeNet Car [9] using each representation and report the PSNR, LPIPS [68] and Structural Similarity Index Measure (SSIM) metrics when synthesizing novel views. Furthermore, we conduct experiments on single-category unconditional generation on ShapeNet [9] Car and Chair, and on class-conditioned generation on the real-world scanned dataset OmniObject3D [59]. We compute the FID [22] and KID [2] scores between 50K generated renderings and 50K ground-truth renderings. For image-conditioned digital avatar generation, we utilize the synthetic avatar dataset [57], which comprises highly detailed 3D avatars created by a synthetic pipeline. We assess the generation quality on 5K renderings from 500 test avatars and additionally include the cosine similarity of identity embeddings [13] (CSIM) to measure ID preservation. The experiments on text-to-3D generation are performed on the large-scale, challenging Objaverse dataset [12]. We numerically evaluate the text alignment quality using the CLIP score [43] on 300 test prompts. All images are rendered at $512\times 512$ resolution. For more details on the data, please refer to Section A.1.
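For reference, two of the simpler metrics above have closed forms that can be sketched directly. PSNR follows its standard definition, and CSIM is the cosine similarity between two embedding vectors; the identity-embedding network itself is not reproduced here, so the vectors passed to `csim` are assumed to be precomputed. Images are assumed to be scaled to $[0,1]$.

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, max_val]."""
    mse = np.mean((pred - gt) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def csim(e1, e2):
    """Cosine similarity between two (precomputed) identity embeddings."""
    return float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))
```

A uniform per-pixel error of 0.1 on a $[0,1]$ image gives a mean squared error of 0.01 and therefore a PSNR of 20 dB.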

Method	PSNR $\uparrow$	LPIPS $\downarrow$	SSIM $\uparrow$	CSIM $\uparrow$	FID-5K $\downarrow$	KID-5K(‰) $\downarrow$
Rodin w/o 2D SR	18.80	0.2842	0.7439	0.6594	32.07	24.78
Rodin	18.59	0.2821	0.7373	0.6466	20.02	9.24
Ours	21.87	0.1768	0.7703	0.7821	8.32	2.67

Table 4: Quantitative results of digital avatar creation conditioned on single portrait image.

Figure 7: Qualitative comparison of 3D avatar creation conditioned on single frontal portraits. Columns, from left to right: Reference, Rodin, Ours.

4.2 Implementation Details

To construct GaussianCube for each object, we perform the proposed densification-constrained fitting for 30K iterations, where $N_{\text{max}}$ is set to 32,768. After OT-based structuralization, we obtain a $32\times 32\times 32\times 14$ GaussianCube for each object. For the 3D diffusion model, we adopt the ADM U-Net network [38, 15]. We perform full attention at the resolutions of $8^{3}$ and $4^{3}$ within the network. The number of diffusion timesteps is set to $1,000$, and we train the models using the cosine noise schedule [38] with the loss weight $\lambda$ set to $10$. For more training details, please refer to Section A.1.
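The sizes quoted here can be checked against Tables 1 and 2 with simple arithmetic. The short script below assumes that the 14 channels are the position offset, scale, quaternion, opacity and color defined in Section 3.1, and that the 0.46 M entry in Table 2 is the rounded element count of one GaussianCube; the dictionary simply copies parameter sizes from Table 2.

```python
# Per-Gaussian channels: offset (3), scale (3), quaternion (4), opacity (1), color (3).
channels = 3 + 3 + 4 + 1 + 3
n_max = 32 ** 3
cube_params = n_max * channels
print(n_max, channels, cube_params)  # 32768 14 458752

# Parameter sizes in millions, copied from Table 2.
table2_params_m = {"Instant-NGP": 12.25, "Triplane": 6.30, "Gaussian Splatting": 1.84}
ratios = {name: p / 0.46 for name, p in table2_params_m.items()}
for name, r in ratios.items():
    print(f"{name}: {r:.2f}x")
```

The resulting ratios (about 26.63, 13.70 and 4.00) agree with the relative parameter counts listed in Table 1 for these three representations. The Neural Voxels entry of Table 1 is not covered by this check, since Table 2 reports Voxels at a parameter size similar to ours rather than at comparable quality.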

4.3 Main Results

3D fitting. We first evaluate the object-fitting capability of our representation against previous NeRF-based representations, including Voxels [53] with similar parameter sizes and Triplane [8], which are widely adopted in previous 3D generation works [8, 55, 5, 36, 53]. We also include Instant-NGP [37] and the original Gaussian Splatting [28] for reference, despite their unsuitability for generative modeling due to their unstructured spatial nature. As shown in Table 2, our GaussianCube outperforms all NeRF-based representations across all metrics. Figure 4 illustrates that GaussianCube can faithfully reconstruct geometric details and intricate textures. Moreover, we achieve such high-quality fitting with the fewest parameters thanks to the densification-constrained fitting, showcasing the compactness of our representation. Notably, the shared implicit feature decoder in the multi-object fitting of NeRF-based methods leads to significant decreases in quality compared to single-object fitting, as evidenced in Table 2. In contrast, the fully explicit nature of GS results in no quality gap between single- and multiple-object fitting.

	DreamGaussian	VolumeDiffusion	Shap-E	LGM	Ours
CLIP Score $\uparrow$	26.38	24.41	30.52	30.06	30.56
Inference Time (s) $\downarrow$	$\sim 120$	5	7	2	5

Table 5: Quantitative results of text-to-3D creation. Inference time is measured on a single A100 GPU. While Shap-E and LGM achieve CLIP scores comparable to ours, they either utilize millions of training samples or leverage a 2D diffusion prior.

Figure 8: Qualitative comparison of text-to-3D generation on Objaverse [12]. Columns, from left to right: DreamGaussian, VolumeDiffusion, Shap-E, LGM, Ours. Our model is able to generate high-quality samples according to the given text prompts.

Single-category unconditional generation. For unconditional generation, we compare our method with the state-of-the-art 3D generation works including 3D-aware GANs [8, 17] and Triplane diffusion models [5]. As shown in Table 3, our method surpasses all prior works in terms of both FID and KID scores and sets new records. We provide visual comparisons in Figure 5, where EG3D and DiffTF tend to generate blurry results with poor geometry, and GET3D fails to provide satisfactory textures. In contrast, our method yields high-fidelity results with authentic geometry and sharp textures.

Large-vocabulary class-conditioned generation. We also compare class-conditioned generation with DiffTF [5] on the more diverse and challenging OmniObject3D [59] dataset. We achieve significantly better FID and KID scores than DiffTF, as shown in Table 3. Visual comparisons in Figure 6 reveal that DiffTF often struggles to create intricate geometry and detailed textures, whereas our method is able to generate objects with complex geometry and realistic textures.

Image-conditioned avatar generation. For 3D avatar generation conditioned on a single reference image, we compare our method with the state-of-the-art Triplane diffusion model, Rodin [44]. Our model surpasses Rodin across all evaluated metrics, as shown in Table 4. Rodin utilizes a 2D refiner [56] to boost the visual quality of facial areas, which significantly compromises 3D consistency; even so, our model outperforms it through direct, genuinely 3D generation. Results in Figure 7 demonstrate that our model faithfully preserves the identity, expression and accessories of the references with rich details, while Rodin struggles to provide satisfactory results even with 2D refinement.

Method	Densify & Prune	Fitting PSNR $\uparrow$	Fitting LPIPS $\downarrow$	Fitting SSIM $\uparrow$	Generation FID-50K $\downarrow$	Generation KID-50K(‰) $\downarrow$
A. Voxel grid w/o offset	✗	25.87	0.1228	0.9217	-	-
B. Voxel grid w/ offset	✗	30.18	0.0780	0.9628	40.52	24.35
C. Ours w/o OT	✓	34.94	0.0346	0.9863	21.41	14.37
D. Ours	✓	34.94	0.0346	0.9863	13.01	8.46

Table 6: Quantitative ablation of both representation fitting and generation quality on ShapeNet Car.

Figure 9: Qualitative ablation of representation fitting. Columns, from left to right: Ground-truth, Table 6 A., Table 6 B., Table 6 D. (Ours).

\begin{overpic}[width=424.94574pt]{imgs/results/ablation/ablation_mapping_% results_small.jpg} \put(5.0,-1.5){{OT (Ours)}} \put(20.0,-1.5){Nearest Neighbor} \put(41.0,-1.5){~{}\lx@cref{creftypecap~refnum}{tab:ablation_fitting_generatio% n} B.} \put(65.0,-1.5){~{}\lx@cref{creftypecap~refnum}{tab:ablation_fitting_generatio% n} C.} \put(83.0,-1.5){{~{}\lx@cref{creftypecap~refnum}{tab:ablation_fitting_generati% on} D. (Ours)}} \put(17.0,-4.0){\small{(a)}} \put(67.0,-4.0){\small{(b)}} \end{overpic}

Figure 10: Visual ablation of the Gaussian organization methods and 3D generation. To visualize the Gaussian structuralization in (a), we map the coordinates of each Gaussian's corresponding voxel to RGB values. Our OT-based solution also yields the best generation quality, as shown in (b).

Text-to-3D generation. We compare text-to-3D generation with prior work, including diffusion models [26, 53], an optimization-based method [50] and a feed-forward method [51]. Our model achieves competitive text-3D alignment, as shown in Table 5. The visual comparison in Figure 8 shows that our model creates high-quality samples aligned with the text prompts in 5 seconds. DreamGaussian tends to create over-saturated results and suffers from the Janus problem. VolumeDiffusion produces unsatisfactory textures with poor text alignment. Shap-E can produce semantically accurate results but struggles to generate complex geometry. LGM reconstructs 3D Gaussians from multi-view images generated by a text-conditioned multi-view diffusion pipeline [45], but the inconsistency [51] of the generated multi-view images often results in inaccurate geometric reconstruction.

4.4 Ablation Study

We first examine the key factors in representation construction on ShapeNet Car. To spatially structure the Gaussians, a simplistic approach would be to anchor the positions of the Gaussians to a predefined voxel grid while omitting densification and pruning. This leads to severe failures when fitting the objects, as shown in Figure 9. Even with learnable offsets added to the voxel grid, the results still lack details. We observe that the offsets are typically too small to move the Gaussians close to the object surfaces, which indicates the importance of densification in the fitting process. In contrast, GaussianCube captures both complex geometry and intricate details, as shown in Figure 9. The numerical comparison in Table 6 also demonstrates the superior fitting quality of GaussianCube.

We also evaluate how the representation affects 3D generative modeling on ShapeNet Car, as shown in Table 6 and Figure 10. Limited by its poor fitting quality, diffusion modeling on the voxel grid with learnable offsets leads to blurry generation results, as shown in Figure 10. To validate the importance of organizing Gaussians via Optimal Transport (OT), we compare with an organization based on nearest neighbor transport. We linearly map the coordinates of each Gaussian's corresponding voxel to an RGB color to visualize the different organizations. As shown in Figure 10 (a), our proposed OT approach yields smooth color transitions, indicating that it successfully preserves the spatial correspondence. In contrast, nearest neighbor transport results in abrupt color transitions because it disregards global structure. Both the quantitative results in Table 6 and the visual comparisons in Figure 10 indicate that our globally structured arrangement facilitates generative modeling by reducing its complexity, leading to superior generation quality.
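The contrast between a globally optimal and a locally greedy organization can be illustrated on a toy problem. The sketch below is not the paper's implementation: the grid layout (cell centers of a $4^{3}$ grid), the random stand-in positions, the squared Euclidean cost, and the greedy rule in the hypothetical helper `greedy_nearest_assignment` are all assumptions made for illustration, and the paper's nearest neighbor baseline may be defined differently. It only shows that a one-to-one OT assignment never has a higher total transport cost than a greedy one-to-one assignment.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def voxel_centers(n_v, b):
    """Centers of an n_v^3 grid of cells covering [-b, b]^3 (assumed layout)."""
    axis = -b + (2.0 * b / n_v) * (np.arange(n_v) + 0.5)
    gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
    return np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)


def greedy_nearest_assignment(cost):
    """Each Gaussian, in index order, takes its nearest voxel that is still free."""
    n = cost.shape[0]
    taken = np.zeros(n, dtype=bool)
    assign = np.empty(n, dtype=np.int64)
    for i in range(n):
        k = int(np.argmin(np.where(taken, np.inf, cost[i])))
        assign[i] = k
        taken[k] = True
    return assign


rng = np.random.default_rng(0)
n_v, b = 4, 0.45
centers = voxel_centers(n_v, b)                     # (64, 3)
positions = rng.uniform(-b, b, size=(n_v ** 3, 3))  # stand-in Gaussian positions

# Squared Euclidean transport cost between every Gaussian and every voxel.
diff = positions[:, None, :] - centers[None, :, :]
cost = np.sum(diff * diff, axis=-1)

rows, ot_assign = linear_sum_assignment(cost)  # globally optimal one-to-one map
nn_assign = greedy_nearest_assignment(cost)    # locally greedy one-to-one map

ot_cost = cost[rows, ot_assign].sum()
nn_cost = cost[np.arange(len(nn_assign)), nn_assign].sum()
print(f"OT total cost: {ot_cost:.4f}, greedy nearest-neighbor cost: {nn_cost:.4f}")
```

In the greedy variant, Gaussians processed late are pushed to whatever voxels remain, which is one plausible source of the abrupt transitions described above.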

5 Conclusion

We have presented GaussianCube, a structured and explicit radiance representation crafted for 3D generative models. We begin by fitting each 3D object with a constant number of Gaussians using our proposed densification-constrained fitting algorithm. We then organize the obtained Gaussians into a spatially structured representation by solving the Optimal Transport problem between the positions of the Gaussians and the predefined voxel grid. Because GaussianCube is spatially structured, it allows us to use a standard 3D U-Net for diffusion modeling without elaborate designs. Moreover, GaussianCube achieves high-quality fitting with far fewer parameters than prior works of similar quality, which further eases the difficulty of generative modeling. Our 3D diffusion models equipped with GaussianCube achieve state-of-the-art generation quality on the evaluated datasets, underscoring the potential of GaussianCube as a versatile and powerful radiance representation for 3D generation.

References

Barron et al. [2022] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5470–5479, 2022.
Bińkowski et al. [2018] Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans. arXiv preprint arXiv:1801.01401, 2018.
Blattmann et al. [2023] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023.
Burkard and Cela [1999] Rainer E Burkard and Eranda Cela. Linear assignment problems and extensions. In Handbook of combinatorial optimization: Supplement volume A, pages 75–149. Springer, 1999.
Cao et al. [2023] Ziang Cao, Fangzhou Hong, Tong Wu, Liang Pan, and Ziwei Liu. Large-vocabulary 3d diffusion model with transformer. arXiv preprint arXiv:2309.07920, 2023.
Caron et al. [2021] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
Chan et al. [2021] Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5799–5809, 2021.
Chan et al. [2022] Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3d generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16123–16133, 2022.
Chang et al. [2015] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
Cheng et al. [2023] Yiji Cheng, Fei Yin, Xiaoke Huang, Xintong Yu, Jiaxiang Liu, Shikun Feng, Yujiu Yang, and Yansong Tang. Efficient text-guided 3d-aware portrait generation with score distillation sampling on distribution. arXiv preprint arXiv:2306.02083, 2023.
Cotton and Peyton [2024] R James Cotton and Colleen Peyton. Dynamic gaussian splatting from markerless motion capture reconstruct infants movements. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 60–68, 2024.
Deitke et al. [2023] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023.
Deng et al. [2019] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019.
Deng et al. [2022] Yu Deng, Jiaolong Yang, Jianfeng Xiang, and Xin Tong. Gram: Generative radiance manifolds for 3d-aware image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10673–10683, 2022.
Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
Fridovich-Keil et al. [2022] Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5501–5510, 2022.
Gao et al. [2022] Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3d: A generative model of high quality 3d textured shapes learned from images. arXiv preprint arXiv:2209.11163, 2022.
Goodfellow et al. [2020] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
Gu et al. [2021] Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. Stylenerf: A style-based 3d-aware generator for high-resolution image synthesis. arXiv preprint arXiv:2110.08985, 2021.
Gupta et al. [2023] Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, and Barlas Oğuz. 3dgen: Triplane latent diffusion for textured mesh generation. arXiv preprint arXiv:2303.05371, 2023.
He et al. [2024] Xianglong He, Junyi Chen, Sida Peng, Di Huang, Yangguang Li, Xiaoshui Huang, Chun Yuan, Wanli Ouyang, and Tong He. Gvgen: Text-to-3d generation with volumetric representation. arXiv preprint arXiv:2403.12957, 2024.
Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
Hu et al. [2023] Wenbo Hu, Yuling Wang, Lin Ma, Bangbang Yang, Lin Gao, Xiao Liu, and Yuewen Ma. Tri-miprf: Tri-mip representation for efficient anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19774–19783, 2023.
Jonker and Volgenant [1988] Roy Jonker and Ton Volgenant. A shortest augmenting path algorithm for dense and sparse linear assignment problems. In DGOR/NSOR: Papers of the 16th Annual Meeting of DGOR in Cooperation with NSOR/Vorträge der 16. Jahrestagung der DGOR zusammen mit der NSOR, pages 622–622. Springer, 1988.
Jun and Nichol [2023] Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463, 2023.
Karras et al. [2019] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
Kerbl et al. [2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.
Li et al. [2024] Mengtian Li, Shengxiang Yao, Zhifeng Xie, Keyu Chen, and Yu-Gang Jiang. Gaussianbody: Clothed human reconstruction via 3d gaussian splatting. arXiv preprint arXiv:2401.09720, 2024.
Li et al. [2023] Zhengqi Li, Richard Tucker, Noah Snavely, and Aleksander Holynski. Generative image dynamics. arXiv preprint arXiv:2309.07906, 2023.
Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, ICLR, 2019.
Lu et al. [2022] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022.
Luiten et al. [2023] Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. arXiv preprint arXiv:2308.09713, 2023.
Meng et al. [2021] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
Mildenhall et al. [2021] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
Müller et al. [2023] Norman Müller, Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulo, Peter Kontschieder, and Matthias Nießner. Diffrf: Rendering-guided 3d radiance field diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4328–4338, 2023.
Müller et al. [2022] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (ToG), 41(4):1–15, 2022.
Nichol and Dhariwal [2021] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
Niemeyer and Geiger [2021] Michael Niemeyer and Andreas Geiger. Giraffe: Representing scenes as compositional generative neural feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11453–11464, 2021.
Park et al. [2021] Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5865–5874, 2021.
Poole et al. [2022] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022.
Pumarola et al. [2021] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10318–10327, 2021.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
Shi et al. [2023] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023.
Shue et al. [2023] J Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner, Jiajun Wu, and Gordon Wetzstein. 3d neural field generation using triplane diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20875–20886, 2023.
Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Sun et al. [2022] Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5459–5469, 2022.
Sun et al. [2023] Jingxiang Sun, Bo Zhang, Ruizhi Shao, Lizhen Wang, Wen Liu, Zhenda Xie, and Yebin Liu. Dreamcraft3d: Hierarchical 3d generation with bootstrapped diffusion prior. arXiv preprint arXiv:2310.16818, 2023.
Tang et al. [2023a] Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653, 2023a.
Tang et al. [2024] Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian model for high-resolution 3d content creation. arXiv preprint arXiv:2402.05054, 2024.
Tang et al. [2023b] Junshu Tang, Tengfei Wang, Bo Zhang, Ting Zhang, Ran Yi, Lizhuang Ma, and Dong Chen. Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior. arXiv preprint arXiv:2303.14184, 2023b.
Tang et al. [2023c] Zhicong Tang, Shuyang Gu, Chunyu Wang, Ting Zhang, Jianmin Bao, Dong Chen, and Baining Guo. Volumediffusion: Flexible text-to-3d generation with efficient volumetric encoder. arXiv preprint arXiv:2312.11459, 2023c.
Villani et al. [2009] Cédric Villani et al. Optimal transport: old and new, volume 338. Springer, 2009.
Wang et al. [2023] Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, et al. Rodin: A generative model for sculpting 3d digital avatars using diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4563–4573, 2023.
Wang et al. [2021] Xintao Wang, Yu Li, Honglun Zhang, and Ying Shan. Towards real-world blind face restoration with generative facial prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9168–9178, 2021.
Wood et al. [2021] Erroll Wood, Tadas Baltrušaitis, Charlie Hewitt, Sebastian Dziadzio, Thomas J Cashman, and Jamie Shotton. Fake it till you make it: face analysis in the wild using synthetic data alone. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3681–3691, 2021.
Wu et al. [2023a] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. arXiv preprint arXiv:2310.08528, 2023a.
Wu et al. [2023b] Tong Wu, Jiarui Zhang, Xiao Fu, Yuxin Wang, Jiawei Ren, Liang Pan, Wayne Wu, Lei Yang, Jiaqi Wang, Chen Qian, et al. Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 803–814, 2023b.
Xia and Xue [2023] Weihao Xia and Jing-Hao Xue. A survey on deep generative 3d-aware image synthesis. ACM Computing Surveys, 56(4):1–34, 2023.
Xiang et al. [2022] Jianfeng Xiang, Jiaolong Yang, Yu Deng, and Xin Tong. Gram-hd: 3d-consistent image generation at high resolution with generative radiance manifolds. arXiv preprint arXiv:2206.07255, 2022.
Xu et al. [2022a] Jiale Xu, Xintao Wang, Weihao Cheng, Yan-Pei Cao, Ying Shan, Xiaohu Qie, and Shenghua Gao. Dream3d: Zero-shot text-to-3d synthesis using 3d shape prior and text-to-image diffusion models. arXiv preprint arXiv:2212.14704, 2022a.
Xu et al. [2022b] Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-nerf: Point-based neural radiance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5438–5448, 2022b.
Xu et al. [2023] Yuelang Xu, Benwang Chen, Zhe Li, Hongwen Zhang, Lizhen Wang, Zerong Zheng, and Yebin Liu. Gaussian head avatar: Ultra high-fidelity head avatar via dynamic gaussians. arXiv preprint arXiv:2312.03029, 2023.
Yi et al. [2023] Taoran Yi, Jiemin Fang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, and Xinggang Wang. Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud priors. arXiv preprint arXiv:2310.08529, 2023.
Zhang et al. [2022] Bowen Zhang, Shuyang Gu, Bo Zhang, Jianmin Bao, Dong Chen, Fang Wen, Yong Wang, and Baining Guo. Styleswin: Transformer-based gan for high-resolution image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11304–11314, 2022.
Zhang et al. [2020] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. Nerf++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020.
Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
Zhou et al. [2021] Linqi Zhou, Yilun Du, and Jiajun Wu. 3d shape generation and completion through point-voxel diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5826–5835, 2021.

A Appendix

A.1 Additional Implementation Details

Dataset preparation. We conduct experiments on the ShapeNet Car [9], ShapeNet Chair [9], OmniObject3D [59], Synthetic Avatar [57] and Objaverse [12] datasets. For each dataset, Table 7 reports the total number of objects used for training, the number of views rendered per object for GaussianCube fitting, and the distribution of camera poses used for rendering. For the Objaverse dataset, we exclude low-quality objects, such as those without textures or with defective reconstructions, following [53]. Table 7 also reports the object bounding box $\bm{b}$ in the world coordinate system of each dataset, which is used to construct the predefined voxel grid within $[-\bm{b},\bm{b}]^{3}$ during OT-based Gaussian structuralization.

Representation construction. We set $N_{\text{max}}$ to 32,768 and $C$ to 14, omitting the view-dependent spherical harmonics. This simplification appears to have a negligible impact on object fitting while reducing the data dimension, thereby alleviating the difficulty of diffusion modeling. During our densification-constrained fitting procedure, we primarily follow the hyper-parameters of the original Gaussian Splatting [28]. For OT-based Gaussian structuralization, we adopt an approximate solution to the OT problem because of the $O\left(N_{\text{max}}^{3}\right)$ time complexity of the Jonker-Volgenant algorithm [25]. We divide the positions of the Gaussians and the voxel grid into four sorted segments and then apply the Jonker-Volgenant solver to each segment individually. We empirically find that this approximation strikes a balance between computational efficiency and spatial structure preservation. The proposed densification-constrained fitting takes around $2.67$ minutes per object for 30K iterations, and the OT-based voxelization takes around $2$ minutes and can be run on CPU in parallel.
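The segmented approximation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the text does not specify the sort key, so the hypothetical helper `segmented_assignment` sorts both point sets lexicographically by $(x, y, z)$; the grid layout and the random stand-in positions are also assumptions. Each segment is solved with SciPy's `linear_sum_assignment`, which the SciPy documentation describes as a modified Jonker-Volgenant algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def segmented_assignment(positions, centers, num_segments=4):
    """Approximate one-to-one map from Gaussians to voxels, solved per segment.

    Returns, for each Gaussian index, the index of its assigned voxel center.
    """
    n = positions.shape[0]
    if centers.shape[0] != n or n % num_segments != 0:
        raise ValueError("point sets must have equal size divisible by num_segments")
    # np.lexsort uses the LAST key as primary, so this sorts by x, then y, then z.
    pos_order = np.lexsort((positions[:, 2], positions[:, 1], positions[:, 0]))
    ctr_order = np.lexsort((centers[:, 2], centers[:, 1], centers[:, 0]))
    assignment = np.empty(n, dtype=np.int64)
    for pos_idx, ctr_idx in zip(np.split(pos_order, num_segments),
                                np.split(ctr_order, num_segments)):
        # Exact assignment inside the segment, with squared Euclidean cost.
        diff = positions[pos_idx][:, None, :] - centers[ctr_idx][None, :, :]
        cost = np.sum(diff * diff, axis=-1)
        rows, cols = linear_sum_assignment(cost)
        assignment[pos_idx[rows]] = ctr_idx[cols]
    return assignment


rng = np.random.default_rng(0)
n_v, half_extent = 4, 0.45  # toy grid; the paper uses a much larger N_max
axis = -half_extent + (2.0 * half_extent / n_v) * (np.arange(n_v) + 0.5)
centers = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).reshape(-1, 3)
positions = rng.uniform(-half_extent, half_extent, size=(n_v ** 3, 3))

assignment = segmented_assignment(positions, centers)
print("one-to-one:", len(np.unique(assignment)) == len(assignment))
```

Under a cubic cost model, four segments of size $N_{\text{max}}/4$ cost $4\cdot(N_{\text{max}}/4)^{3}=N_{\text{max}}^{3}/16$ in total, which is where the saving comes from; the price is that Gaussians cannot be matched to voxels in another segment.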

3D Diffusion. To train the 3D diffusion model, we first compute the instance-wise statistics of mean $\bar{\bm{\mu}}\in\mathbb{R}^{N_{v}\times N_{v}\times N_{v}\times C}$ and standard deviation $\bar{\bm{\sigma}}\in\mathbb{R}^{N_{v}\times N_{v}\times N_{v}\times C}$ from the GaussianCubes of each training dataset. These statistics are then used to normalize the training data. For our 3D diffusion model architecture, we adopt the ADM-UNet from [15] and replace the convolution, upsampling, downsampling and attention operations with 3D implementations. We train our model using the AdamW optimizer [31] and apply an exponential moving average (EMA) with a rate of 0.9999 during training. For unconditional generation on ShapeNet, we train the model with a base learning rate of $5e-5$ for 850K iterations and then decay the learning rate to $5e-6$ for another 150K iterations. For 3D digital avatar creation from a single portrait image, we adopt the pretrained DINO ViT-B/16 [6] to encode the $512\times 512$ conditional images into $1025\times 768$ conditional feature tokens ($32\times 32$ patch tokens plus one class token). For text-to-3D creation, we use CLIP-L/14 [43] to encode the text prompts into $77\times 768$ conditional feature tokens. We provide more detailed configurations of the model architectures, diffusion training and inference for each dataset in Table 8.
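The normalization step can be sketched as below, assuming the statistics are taken over the objects of one dataset so that the mean and standard deviation keep the shape $N_{v}\times N_{v}\times N_{v}\times C$ stated above. The function names, the random stand-in data and the `eps` floor on the standard deviation are illustrative assumptions, not details from the paper.

```python
import numpy as np


def fit_statistics(cubes, eps=1e-6):
    """Per-voxel, per-channel mean and std over a dataset.

    cubes: array of shape (num_objects, N_v, N_v, N_v, C).
    eps is an assumed floor that avoids division by zero.
    """
    mean = cubes.mean(axis=0)
    std = np.maximum(cubes.std(axis=0), eps)
    return mean, std


def normalize(cube, mean, std):
    """Map raw GaussianCube values to the normalized space used for training."""
    return (cube - mean) / std


def denormalize(normalized, mean, std):
    """Invert the normalization, e.g. before rendering a generated sample."""
    return normalized * std + mean


rng = np.random.default_rng(0)
num_objects, n_v, channels = 8, 4, 14  # toy sizes; C = 14 as in the text
cubes = rng.normal(loc=2.0, scale=3.0, size=(num_objects, n_v, n_v, n_v, channels))

mean, std = fit_statistics(cubes)
normalized = normalize(cubes, mean, std)
restored = denormalize(normalized, mean, std)
print(mean.shape, float(np.abs(restored - cubes).max()))
```

A generated sample would be passed through `denormalize` with the same dataset statistics before it is interpreted as Gaussian parameters.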

Implementation of Gaussian organization visualization in Figure 10 (a). For the $i$-th Gaussian, we obtain its corresponding voxel grid center $\bm{x}_{k}\in\mathbb{R}^{3}$ according to the Optimal Transport plan $\mathbf{T}^{*}$ (i.e., $\mathbf{T}^{*}_{ik}=1$), as described in Section 3.1. To visualize the coordinates of $\bm{x}_{k}$, we map them to an RGB color $\bm{C}_{k}\in\mathbb{R}^{3}$ using:

$$\bm{C}_{k}=\frac{\bm{x}_{k}+\bm{b}}{2\bm{b}}\times\bm{255},\tag{6}$$

where $\bm{b}$ is the bounding box in the world coordinate system. The resulting point-cloud-like visualizations are shown in Figure 10 (a), where smooth color transitions indicate that spatial correspondence is coherently preserved.
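Equation (6) amounts to an affine rescaling of each coordinate from $[-\bm{b},\bm{b}]$ to $[0,255]$. A minimal sketch follows; the function name `voxel_center_to_rgb` and the choice of a scalar bounding box (here the ShapeNet Car value from Table 7) are assumptions for illustration.

```python
import numpy as np


def voxel_center_to_rgb(x, b):
    """Eq. (6): map voxel-center coordinates in [-b, b] to RGB values in [0, 255]."""
    x = np.asarray(x, dtype=np.float64)
    return (x + b) / (2.0 * b) * 255.0


b = 0.45  # bounding box of ShapeNet Car in Table 7
points = np.array([
    [-b, -b, -b],     # one corner of the bounding box
    [0.0, 0.0, 0.0],  # the center of the bounding box
    [b, b, b],        # the opposite corner
])
colors = voxel_center_to_rgb(points, b)
print(colors)
```

The two opposite corners map to black and white and the center maps to mid-gray, so neighboring voxels receive similar colors and any discontinuity in the rendered colors reflects a discontinuity in the Gaussian-to-voxel assignment.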

A.2 Additional Ablation Study and Analysis

Ablation of $N_{\text{max}}$ in densification-constrained fitting. We conduct experiments to evaluate how $N_{\text{max}}$ affects fitting on ShapeNet Car. The results in Table 9 indicate that there is a clear trend where increasing $N_{\text{max}}$ leads to improved fitting accuracy. However, a larger $N_{\text{max}}$ also incurs higher computational costs during diffusion training. Therefore, we set $N_{\text{max}}$ to 32,768 to strike a balance between high-quality fitting and computational efficiency.
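The values in Table 9 are consistent with $N_{\text{max}}=N_{v}^{3}$, i.e. one Gaussian per voxel. Assuming that relation and $C=14$ channels per Gaussian, the short sketch below tabulates how the number of values in one GaussianCube grows with the grid resolution; the helper `cube_size` is hypothetical and only does this arithmetic.

```python
def cube_size(n_v, channels=14):
    """Return (N_max, total values) for an n_v^3 grid with one Gaussian per voxel."""
    n_max = n_v ** 3
    return n_max, n_max * channels


sizes = {n_v: cube_size(n_v) for n_v in (16, 24, 32, 48, 64)}
for n_v, (n_max, total) in sizes.items():
    print(f"N_v={n_v:2d}  N_max={n_max:6d}  values per object={total}")
```

Going from $N_{v}=32$ to $N_{v}=64$ multiplies the data per object by eight, while Table 9 reports a PSNR gain of 0.40 dB, which is the trade-off behind the chosen setting.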

Ablation of classifier-free guidance in class-conditioned generation. We study how classifier-free guidance (CFG) affects generation quality when sampling from class-conditioned diffusion models. We report the FID and KID metrics under different CFG scales in Table 10.
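For readers unfamiliar with CFG, the sketch below shows one common way the guidance scale enters sampling: the conditional and unconditional noise predictions are linearly extrapolated. The paper does not state which parameterization it uses, so the formula, the function name `cfg_combine` and the convention that a scale of 1 recovers the purely conditional prediction are assumptions.

```python
import numpy as np


def cfg_combine(eps_cond, eps_uncond, scale):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    towards the conditional one. scale=1 returns eps_cond unchanged."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


rng = np.random.default_rng(0)
shape = (4, 4, 4, 14)  # toy stand-in for an N_v x N_v x N_v x C prediction
eps_cond = rng.normal(size=shape)
eps_uncond = rng.normal(size=shape)

guided = cfg_combine(eps_cond, eps_uncond, scale=2.0)
print(guided.shape)
```

Under this convention, larger scales push samples further along the class-conditional direction, which typically trades diversity for fidelity and is consistent with the sharp degradation at scale 6.0 in Table 10.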

Visualization of intermediate results in the denoising process. During inference, our model starts from Gaussian noise and progressively denoises to yield the high-quality GaussianCube. We present visualizations of the intermediate renderings $\bm{y}_{t}$ at various timesteps $t\in[0,T]$ throughout the denoising process, offering a detailed insight into the GaussianCube diffusion procedure. As illustrated in Figure 11, our model first establishes the global structure and then incrementally enhances the details, which is similar to previous 3D diffusion models [55, 46].

Dataset	# Objects	# Views per object	Rotation Angle	Elevation Angle	Bounding Box
ShapeNet Car	7,462	150	$[0,2\pi]$	$[\frac{1}{6}\pi,\frac{1}{2}\pi]$	0.45
ShapeNet Chair	6,775	150	$[0,2\pi]$	$[\frac{1}{6}\pi,\frac{1}{2}\pi]$	0.35
OmniObject3D	5,795	100	$[0,2\pi]$	$[0,\frac{1}{2}\pi]$	1.0
Synthetic Avatar	98,000	300	$[0,2\pi]$	$[\frac{1}{6}\pi,\frac{2}{3}\pi]$	40.0
Objaverse	125,653	150	$[0,2\pi]$	$[0,\frac{2}{3}\pi]$	0.5

Table 7: Details of each dataset.

	ShapeNet Car	ShapeNet Chair	OmniObject3D	Synthetic Avatar	Objaverse
Diffusion steps	1,000	1,000	1,000	1,000	1,000
Noise schedule	Cosine	Cosine	Cosine	Cosine	Cosine
NFEs	300	300	300	250	44
Inference sampler	DPM-solver [32]	DPM-solver [32]	DPM-solver [32]	DPM-solver [32]	DPM-solver [32]
CFG scale	-	-	2.0	1.3	3.5
Model size	82M	82M	82M	339M	339M
Channels	64	64	64	128	128
Channel mult	(1,2,3,4)	(1,2,3,4)	(1,2,3,4)	(1,2,3,4)	(1,2,3,4)
Num res blocks	3	3	3	3	3
Attn resolutions	(8, 4)	(8, 4)	(8, 4)	(8, 4)	(8, 4)
Num head channels	64	64	64	64	64
Dropout	0	0	0	0	0
Scale shift norm	True	True	True	True	True
Training steps	1,000K	1,000K	700K	1,200K	1,800K
Training GPUs	16	16	16	16	32
Batch size	128	128	128	128	256
Base lr	$5e-5$	$5e-5$	$5e-5$	$5e-5$	$5e-5$
Lr decay steps	850K	850K	-	-	-

Table 8: Detailed configuration of model architecture, diffusion training and inference on each dataset.

$N_{\text{max}}$	$N_{v}$	PSNR $\uparrow$	LPIPS $\downarrow$	SSIM $\uparrow$
4096	16	32.56	0.0547	0.9765
13824	24	34.32	0.0396	0.9842
32768	32	34.94	0.0347	0.9863
110592	48	35.29	0.0307	0.9874
262144	64	35.34	0.0301	0.9875

Table 9: Quantitative ablation of $N_{\text{max}}$ in densification-constrained fitting. We set $N_{\text{max}}$ to 32,768 in this paper.

Scale	w/o CFG	1.3	1.5	2.0	3.0	6.0
FID-50K $\downarrow$	13.39	12.07	11.72	11.62	12.99	32.80
KID-50K(‰) $\downarrow$	4.01	3.12	3.00	2.78	3.17	14.36

Table 10: Quantitative ablation of CFG scale in the class-conditioned generation of OmniObject3D [59].

Figure 13 column labels: Text Condition, Generated Sample, Text Condition, Generated Sample.

Figure 13: Failure cases.

Nearest neighbors analysis. For several unconditionally generated samples shown in the paper, we search the training set for their nearest neighbors according to the similarity of pretrained CLIP [43] features. The results in Figure 12 demonstrate that our model is capable of generating novel geometry and textures rather than simply memorizing the training data.
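Such a search can be sketched as a cosine-similarity ranking over feature vectors. The sketch assumes that each generated sample and each training object has already been reduced to a single feature vector, for example by encoding a rendering with a CLIP image encoder; that pipeline, the feature dimension, the random stand-in features and the helper `nearest_neighbors` are assumptions, since the text only states that pretrained CLIP features are compared.

```python
import numpy as np


def nearest_neighbors(query_feats, train_feats, k=3):
    """Return indices and cosine similarities of the k most similar training items."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    sim = q @ t.T                                # (num_queries, num_train)
    idx = np.argsort(-sim, axis=1)[:, :k]        # most similar first
    return idx, np.take_along_axis(sim, idx, axis=1)


rng = np.random.default_rng(0)
train_feats = rng.normal(size=(100, 768))  # stand-in for training-set features
query_feats = np.stack([
    3.0 * train_feats[5],                  # a memorized copy, up to scale
    rng.normal(size=768),                  # an unrelated, novel sample
])

idx, sims = nearest_neighbors(query_feats, train_feats, k=3)
print(idx[0], np.round(sims[:, 0], 3))
```

A generated sample that copies a training object would show a top similarity close to 1, as the first query does here, whereas a novel sample has a clearly lower best match.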

Figure 15 column labels: Reference, Rodin [55], Ours.

Figure 15: Qualitative comparison of generated digital avatars conditioned on synthetic portraits.

Figure 16 column labels: DreamGaussian [50], VolumeDiffusion [53], Shap-E [26], LGM [51], Ours.

Figure 16: Additional qualitative comparison of text-to-3D generation on Objaverse [12]. Our model is capable of creating high-quality samples following input text prompts.

Figure 18 column labels: Text Condition, Sample 1, Sample 2, Sample 3.

Figure 18: Variation of text-to-3D generation. Our model is able to generate diverse results conditioned on the same text prompt.


Figure 19: Example of text-guided 3D editing.

A.3 Additional Visual Results

For 3D avatar generation, although trained on a synthetic dataset, our model is capable of generalizing to in-the-wild portrait input. We provide more visual comparisons with Rodin [55] on 3D avatar creation conditioned on in-the-wild portraits in Figure 14. We also include additional comparisons conditioned on synthetic input from our test set in Figure 15. Our model faithfully retains the identity of the reference portrait and provides high-fidelity results with rich details, e.g. hair, glasses and clothing. Although it utilizes a pretrained 2D super-resolution module, which significantly compromises 3D consistency, Rodin struggles to follow the conditional images and fails to produce detailed textures in non-facial areas, e.g. clothing and hair.

We include additional qualitative comparisons and generated samples of text-to-3D generation in Figure 16 and Figure 17 respectively. Our model yields samples with better visual quality and is capable of handling challenging prompts. The results in Figure 18 show the diversity of our generated results given the same text prompt. Our model is also capable of performing text-guided editing of generated objects by leveraging SDEdit [34], as depicted in Figure 19, demonstrating the promise of achieving controllable 3D generation.

We provide more generated samples of unconditional and class-conditioned generation in Figure 20, Figure 21 and Figure 22. The additional results demonstrate the strong capability of our model to create high-quality 3D assets with complex geometry and intricate textures.

Furthermore, we also provide an additional video in supplementary material, which intuitively illustrates our approach and visualizes the generated results.

A.4 Limitations

While GaussianCube represents a substantial step forward in developing an ideal representation for 3D content generation, it still has some limitations. Specifically, although the GaussianCube construction procedure is considerably more rapid than that of NeRF-based methods and can be executed in parallel, it still requires approximately 5 minutes to construct each object. This presents a challenge for scaling up training on extensive 3D datasets. In future work, we plan to investigate more time-efficient methods for GaussianCube construction. Additionally, akin to prior 2D diffusion models, our text-to-3D diffusion model encounters difficulties in presenting the specified number of objects within prompts as shown in Figure 13. To address this, we will look into enhancing the precision and controllability of 3D generation in the future.

A.5 Broader Impacts

The proposed GaussianCube enables high-quality 3D asset fitting with few parameters, which significantly simplifies the challenges of 3D generative modeling. Our diffusion model is capable of generating high-quality 3D assets of complex geometry and intricate textures while also accommodating a variety of conditional signals to steer the creating procedure. The strong capability of GaussianCube suggests its potential to serve as a versatile 3D representation for a variety of applications in future 3D research endeavors.

As with all generative models, particular caution is required when dealing with sensitive tasks involving human representations. Our avatar creation model is trained exclusively on a synthetic dataset [57] composed of large-scale 3D digital avatars generated through a graphics pipeline. We conceptualize digital avatars as analogous to those created by specialized 3D artists, rather than photorealistic human images. This strategy in selecting training data mitigates privacy and copyright issues that might arise from utilizing real human photo collections. Nevertheless, it is crucial to acknowledge that avatars generated by our model from real-world imagery could still be misused for spreading disinformation. As such, we advocate implementing rigorous safeguards and promoting responsible use of our technology and related ones to mitigate such risks.
