¹ The Hong Kong University of Science and Technology   ² Huawei Noah’s Ark Lab
Equal contribution.   Corresponding Author.   Email: jchcyan@gmail.com
https://jointdreamer.github.io

JointDreamer: Ensuring Geometry Consistency and Text Congruence in Text-to-3D Generation via Joint Score Distillation

Chenhan Jiang*¹ (ORCID: 0000-0001-8771-3641)    Yihan Zeng²    Tianyang Hu²    Songcun Xu²    Wei Zhang²    Hang Xu²    Dit-Yan Yeung¹
Abstract

Score Distillation Sampling (SDS) by well-trained 2D diffusion models has shown great promise in text-to-3D generation. However, this paradigm distills view-agnostic 2D image distributions into the rendering distribution of 3D representation for each view independently, overlooking the coherence across views and yielding 3D inconsistency in generations. In this work, we propose Joint Score Distillation (JSD), a new paradigm that ensures coherent 3D generations. Specifically, we model the joint image distribution, which introduces an energy function to capture the coherence among denoised images from the diffusion model. We then derive the joint score distillation on multiple rendered views of the 3D representation, as opposed to a single view in SDS. In addition, we instantiate three universal view-aware models as energy functions, demonstrating compatibility with JSD. Empirically, JSD significantly mitigates the 3D inconsistency problem in SDS, while maintaining text congruence. Moreover, we introduce the Geometry Fading scheme and Classifier-Free Guidance (CFG) Switching strategy to enhance generative details. Our framework, JointDreamer, establishes a new benchmark in text-to-3D generation, achieving outstanding results with an 88.5% CLIP R-Precision and 27.7% CLIP Score. These metrics demonstrate exceptional text congruence, as well as remarkable geometric consistency and texture fidelity.

Keywords: 3D Vision · 3D Generation · Energy Function
Figure 1: Text-to-3D generations by JointDreamer from scratch. JointDreamer excels in generating geometrically consistent and high-fidelity 3D assets, adhering to complex textual descriptions that are challenging for previous methods.

1 Introduction

3D content creation is essential for diverse applications, including gaming, robotics simulation, and virtual reality. However, it is labor-intensive, demanding substantial time for skilled designers to create a single 3D asset. Hence, automating 3D creation with text input has attracted considerable attention. Recently, the score distillation sampling (SDS) algorithm pioneered by DreamFusion [37] shows promise in text-to-3D tasks, which lifts image distribution from a well-trained diffusion model [38] into parameterized 3D representation like NeRF [31]. Compared to 3D generative models [34, 19, 2] that struggle with producing arbitrary objects due to limited text-3D training data, SDS-based methods [37, 24, 42, 6] can generate arbitrary 3D assets with diverse text input.

Although SDS-based methods benefit from the generalizability of diffusion models, they often encounter a common issue known as Janus artifacts [37, 24, 42, 6]. These artifacts manifest as repeated content from different viewpoints of a 3D generation, yielding a lack of realism and coherence in the rendered views. We investigate the Janus artifacts by visualizing image distributions of the diffusion model [38] from multiple viewpoints, as illustrated in Fig. 2(a). The results reveal the view-agnostic nature of the diffusion model and its content inconsistency across views. Consequently, SDS optimizes each rendered view of the 3D representation independently and inherits an image distribution that lacks a coherent multi-view perspective, leading to geometric inconsistency in 3D generations.

Existing works [1, 22] address the aforementioned challenges within the SDS framework by employing prompt engineering techniques. However, the effectiveness of such methods remains unsatisfactory, as evident from the results depicted in Fig. 2(b). Alternative methods [41, 16] propose to finetune view-aware diffusion models using rendered images of 3D datasets [8, 7]. Nevertheless, they are prone to overfitting on limited text-3D training data [23], decreasing the text congruence when handling complex text inputs. These observations call for rethinking the SDS optimization to enhance the 3D coherence of generations while maintaining generalizability.

Figure 2: Illustration of text-conditioned images for different viewpoints, where input texts are augmented with corresponding direction prompts for each view. (a) The original generations from the 2D diffusion model [38] are view-agnostic and inconsistent across views. (b) Text prompt tuning [1] yields limited improvement in the directional structure of generated images for each view. (c) JSD injects a coherence measurement from the proposed binary classifier (refer to Section 4.2), contributing to modified directional structures and semantic consistency across views.

In this work, we first present Joint Score Distillation (JSD), which significantly promotes the 3D consistency of generation and inherits generalizability from diffusion models. Specifically, we model the joint image distribution of diffusion model via an energy function measuring coherence across denoised images. It facilitates the extension of the KL-Divergence in SDS from single-view into multi-view. We then derive the joint score distillation function from multi-view KL-Divergence, which ensures inter-view coherence in the optimization process of 3D generation. We show that SDS is a special case of JSD with the energy term omitted, which indicates the absence of coherence constraint across views.

Building upon JSD optimization, we present three view-aware models as energy terms to showcase the compatibility of JSD: the Binary Classification Model, the Image-to-Image Translation Model, and the Multi-view Generation Model. Through empirical analysis, it is observed that different view-aware models introduce distinct coherence measurements, leading to diverse 3D generations, while all contributing to 3D consistency. Furthermore, to facilitate a more comprehensive comparison with existing text-to-3D generation methods, we introduce JointDreamer, an innovative framework capable of producing geometric-consistent and high-fidelity 3D assets adhering to complex text descriptions. Notably, in addition to incorporating a Multi-view Generation model as an energy term in JSD, we introduce two complementary techniques, namely the Geometry Fading scheme and the Classifier-Free Guidance (CFG) Switching strategy, to enhance generative details.

We systematically assess the quality of our approach, both qualitatively and quantitatively, compared to existing methods. A gallery of qualitative results can be found in Fig. 1. Our JointDreamer consistently produces high-fidelity 3D assets and mitigates Janus artifacts in SDS. It maintains text congruence even when confronted with complex text input, achieving 88.5% CLIP R-Precision and 27.7% CLIP Score.

In brief, our contributions are summarized as follows:

  • We introduce a novel Joint Score Distillation (JSD) for text-to-3D generation, optimizing multiple views jointly via an energy function to capture inter-view coherence.

  • We present three view-aware models as energy functions to show compatibility with JSD, all of which mitigate the Janus problem in SDS.

  • We introduce the text-to-3D framework JointDreamer, incorporating complementary Geometry Fading and CFG Switching techniques. Our JointDreamer achieves geometrically consistent and high-fidelity 3D assets even with complex textual inputs.

2 Related Works

Text-to-3D Generation.

Existing text-to-3D generation methods can be categorized into two streams: 3D generative models and 2D optimization methods. The former encompasses various deep generative models such as Variational Autoencoders (VAEs) [12, 13], Generative Adversarial Networks (GANs) [33, 35, 9, 10, 4], diffusion models [34, 5, 28] and transformer architectures [2, 19]. These models are efficient in inference but often struggle with generalizability and training stability, attributed to the limited scope and complexity of available 3D datasets. The latter approach centers around the Score Distillation Sampling (SDS) algorithm proposed by [37], which leverages 2D diffusion model priors [38] for optimizing 3D representations. Subsequent advancements have refined this technique by improving the 3D representations [24, 6], the sampling scheduler [17] and the loss design [42]. However, the above approaches overlook the geometric consistency problem and face the multi-face Janus issues inherent in SDS. Prior works try to alleviate the Janus issues with prompt tuning [1], yet they achieve limited effect. The very recent work MVDream [41] addresses the problem by fine-tuning a multi-view diffusion model, but it is susceptible to overfitting on scarce 3D training data, compromising semantic consistency in text-to-3D generations. In this work, we address the fundamental flaw of SDS, namely that it optimizes each view independently, by introducing a joint optimization function that enforces inter-view consistency, essentially solving the Janus issues in SDS while preserving its generalizability.

Diffusion-based Novel View Synthesis.

As an alternative to 3D generation, novel view synthesis models the challenge as a view-conditioned image-to-image translation task. There have been proposals for pose-conditioned image-to-image diffusion models [43] that generate novel views on synthetic 3D data. Recently, [26] promoted the generalizability of novel view synthesis by fine-tuning the 2D diffusion model [38] on renderings of 3D datasets, which facilitates image-to-3D tasks with 3D reconstruction or the SDS algorithm. To further improve the 3D consistency between generated views and the input view, very recent works [40, 29, 27] recast the generation process as multi-view generation and present corresponding architecture designs. Generally, these methods take camera specifications as conditions and enable viewpoint-aware generation, but they can hardly capture a complete 3D scene consistently and densely. Our method acknowledges the potential of these models to discern relative inter-view relationships, which we harness to provide inter-view coherence for our JSD, thus serving as universal guidance models.

Figure 3: Overview of JointDreamer Framework. We introduce an energy function to model the joint distribution for multi-view denoised images from 2D diffusion model, facilitating the Joint Score Distillation (JSD) optimization for text-to-3D generation.

3 Preliminaries

We review the original SDS and address its fundamental limitation: it optimizes each view of the 3D representation independently and distills from the view-agnostic image distribution of the diffusion model. This results in geometric inconsistency, dubbed the Multi-Face Janus Problem.

Score Distillation Sampling (SDS).

SDS optimization is widely adopted by text-to-3D generation pipelines [37, 24, 6, 42, 30]. Given a 3D representation with learnable parameters $\theta$ and a pre-trained 2D diffusion model with noise prediction network $\epsilon_{\Phi}(\textbf{x}_t, t, y)$, SDS optimizes $\theta$ by minimizing the KL-divergence:

$$\min_{\theta} D_{KL}\big(q_t^{\theta}(\textbf{x}_t|c,y)\,\|\,p_t(\textbf{x}_t|y)\big). \tag{1}$$

Here, $p_t(\textbf{x}_t|y)$ is the image distribution sampled from the diffusion model, and $q_t^{\theta}(\textbf{x}_t|c,y)$ is the distribution of the rendered image $\textbf{x}_t = g(\theta, c)$ with respect to camera pose $c$ at timestep $t$ of the forward diffusion process, where $g$ is the renderer. To solve Eq. (1), the score distillation function is derived as:

$$\begin{split}\nabla_{\theta}L_{SDS}(\theta) &\triangleq \mathrm{E}_{t,\textbf{x}}\Big[w(t)\frac{\sigma_t}{\alpha_t}\nabla_{\theta}KL\big(q_t^{\theta}(\textbf{x}_t|c,y)\,\|\,p_t(\textbf{x}_t|y)\big)\Big]\\ &\triangleq \mathrm{E}_{t,\epsilon_{\Phi}}\Big[w(t)\big(\hat{\epsilon}_{\Phi}(\textbf{x}_t,t,y)-\epsilon\big)\frac{\delta g(\theta,c)}{\delta\theta}\Big],\end{split} \tag{2}$$

where $\hat{\epsilon}_{\Phi} := (1+s)\epsilon_{\Phi}(\textbf{x}_t,t,y) - s\epsilon_{\Phi}(\textbf{x}_t,t,\emptyset)$ is the predicted noise modified by classifier-free guidance (CFG) with scale $s$, and $w(t)$ is a time-dependent weighting function.
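As a minimal sketch of the two ingredients of Eq. (2), the snippet below combines conditional and unconditional noise predictions with CFG and forms the per-pixel term $w(t)(\hat{\epsilon}_{\Phi} - \epsilon)$ that is later multiplied by the renderer Jacobian. The function names `cfg_noise` and `sds_residual` and all numbers are hypothetical; plain Python lists stand in for image tensors, and no real diffusion model is called.

```python
def cfg_noise(eps_cond, eps_uncond, s):
    """Classifier-free guidance: (1 + s) * eps(x_t, t, y) - s * eps(x_t, t, empty)."""
    return [(1.0 + s) * c - s * u for c, u in zip(eps_cond, eps_uncond)]


def sds_residual(eps_cond, eps_uncond, eps, s, w):
    """Per-pixel SDS term w(t) * (eps_hat - eps), before the renderer Jacobian."""
    eps_hat = cfg_noise(eps_cond, eps_uncond, s)
    return [w * (h - e) for h, e in zip(eps_hat, eps)]


# Toy flattened "image" with three pixels; all values are made up.
eps_cond = [0.5, -0.25, 1.0]    # stand-in for eps_Phi(x_t, t, y)
eps_uncond = [0.25, 0.0, 0.5]   # stand-in for eps_Phi(x_t, t, empty prompt)
eps = [0.0, 0.25, 1.0]          # Gaussian noise that was added to the render

residual = sds_residual(eps_cond, eps_uncond, eps, s=2.0, w=1.0)
print(residual)
```

Setting `s=0.0` returns the conditional prediction unchanged, which is a quick sanity check on the sign convention of the guidance formula.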

Multi-Face Janus Problem. To achieve consistency, it is essential that the rendering distributions $q_0^{\theta}(\textbf{x}_0|c,y)$ adhere to the text condition $y$ and that the image distribution $p_t(\textbf{x}_t|y)$ stay consistent across views with different poses. For the image distribution $p_0(\textbf{x}_0|y)$ of the 2D diffusion model, the pose condition can be injected via input text with the corresponding directional prompt [37, 42]. As illustrated in Fig. 2(a), the pre-trained 2D image distribution, trained on individual images, is view-agnostic and lacks identity consistency across views. Even with a text tuning mechanism [1] specifically designed for multi-view image generation, as shown in Fig. 2(b), the above issues are far from resolved. Since SDS minimizes the KL-divergence between the image distribution and the rendering distribution independently for each rendered view, it inevitably inherits the 3D-awareness deficit of the 2D diffusion model, resulting in inconsistent 3D generation, which is commonly referred to as the Multi-Face Janus Problem of SDS.

4 Method

In this section, we introduce JointDreamer, a novel text-to-3D generation framework as illustrated in Fig. 3. We first present the derivation of Joint Score Distillation (JSD) in Sec. 4.1, which extends the single-view optimization in SDS into a multi-view KL-Divergence. Then we integrate universal view-aware models into JSD to show the compatibility of JSD in Sec. 4.2, where we instantiate three kinds of view-aware models to capture inter-view coherence. Finally, we elaborate on the overall framework JointDreamer in Sec. 4.3, where we integrate the multi-view generation model into JSD. We also propose a geometry fading scheme and CFG switching strategy to further enhance generative quality.

4.1 Joint Score Distillation (JSD)

To address the multi-face problem arising from SDS, we extend the score distillation from single-view to multi-view settings and promote inter-view coherence across 2D image distribution, and thus derive our JSD optimization function.

Coherence Modeling for Joint Image Distribution.

As discussed above, the rendering distributions of the 3D representation should maintain 3D consistency across views $\tilde{\textbf{x}} = \{\textbf{x}^1, \textbf{x}^2, \ldots, \textbf{x}^V\}$ with respect to different poses $\tilde{\textbf{c}} = \{c^1, c^2, \ldots, c^V\}$. However, for 2D pre-trained diffusion models, different views are generated independently. To ensure consistency, we propose modeling the joint image distribution of multiple views, denoted as $p_0(\tilde{\textbf{x}}|\tilde{\textbf{c}},y)$, within the diffusion model. Following the commonly adopted assumption in energy-based distribution modeling [44, 21, 45], we introduce an energy function that measures the inter-view coherence by $\mathcal{C}(\tilde{\textbf{x}},\tilde{\textbf{c}}): \mathbb{R}^{Vd} \to \mathbb{R}$ and define:

$$p_0(\tilde{\textbf{x}}|\tilde{\textbf{c}},y) \propto \exp\big(\mathcal{C}(\tilde{\textbf{x}},\tilde{\textbf{c}})\big)\prod_{i=1}^{V}p_0(\textbf{x}^i|c^i,y), \tag{3}$$

where a larger $\mathcal{C}(\tilde{\textbf{x}},\tilde{\textbf{c}})$ indicates greater coherence among the denoised view images. As a result, the joint image distribution is no longer view-independent. In practice, the joint energy function can be implemented via various view-aware models, as long as they can reflect the coherence across multiple views. In Sec. 4.2, we explore different choices in depth. The modeling of coherence across joint image distributions facilitates our JSD on multi-view, integrating inter-view coherence to ensure consistent 3D representation.
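To make the reweighting in Eq. (3) concrete, here is a toy sketch on a discrete space: two views (a front and a back camera) each show either a face or the back of a head, and a view-agnostic prior prefers faces everywhere. The prior values, the `coherence` energy, and the helper `joint_distribution` are all hypothetical illustrations, not part of the paper's implementation.

```python
import itertools
import math


def joint_distribution(marginals, energy):
    """Normalize exp(C(x)) * prod_i p(x^i) over all view configurations (Eq. 3)."""
    labels = [sorted(m) for m in marginals]
    weights = {}
    for config in itertools.product(*labels):
        independent = math.prod(m[x] for m, x in zip(marginals, config))
        weights[config] = math.exp(energy(config)) * independent
    total = sum(weights.values())
    return {config: w / total for config, w in weights.items()}


# Hypothetical view-agnostic 2D prior: it prefers a face regardless of the camera.
view_prior = {"back_of_head": 0.2, "face": 0.8}
marginals = [view_prior, view_prior]  # (front view, back view)


def coherence(config):
    """Hypothetical energy: reward a face in front and the back of the head behind."""
    return 3.0 if config == ("face", "back_of_head") else 0.0


independent = joint_distribution(marginals, lambda config: 0.0)
coupled = joint_distribution(marginals, coherence)

print(max(independent, key=independent.get))  # most likely configuration, no energy
print(max(coupled, key=coupled.get))          # most likely configuration, with energy
```

With a zero energy the joint is just the product of the marginals, so the two-faced (Janus) configuration dominates; the energy term shifts probability mass toward the coherent configuration without changing the per-view priors.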

KL-Divergence on Multiple Views.

We extend the single-view KL-divergence in SDS to a multi-view version, based on the joint image distribution:

$$\begin{split}&\min_{\theta} D_{KL}\big(q_t^{\theta}(\tilde{\textbf{x}}|\tilde{\textbf{c}},y)\,\|\,p_t(\tilde{\textbf{x}}|\tilde{\textbf{c}},y)\big)\\ &=\min_{\theta} D_{KL}\Big(q_t^{\theta}(\tilde{\textbf{x}}|\tilde{\textbf{c}},y)\,\Big\|\,\exp\big(\mathcal{C}(\tilde{\textbf{x}},\tilde{\textbf{c}})\big)\prod_{i=1}^{V}p_t(\textbf{x}^i|c^i,y)\Big),\end{split} \tag{4}$$

where the extra energy term $\mathcal{C}(\tilde{\textbf{x}},\tilde{\textbf{c}})$ introduced in Eq. (3) accounts for the inter-view coherence. Without this constraint, e.g., $\mathcal{C}(\tilde{\textbf{x}},\tilde{\textbf{c}}) \equiv 0$, the rendered views are optimized independently, each against the 2D diffusion model alone. In this sense, the original SDS can be seen as a special case of JSD.
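As a short worked step, the following sketch spells out why the energy enters the score additively and why dropping it recovers per-view SDS. It relies on two assumptions not stated in the text: $Z$ denotes a normalizing constant that does not depend on $\tilde{\textbf{x}}$, and the rendering distribution is taken to factorize across views given $\theta$ (independent noise per view). The authors' own proof is the one referred to in the Appendix.

```latex
\begin{align*}
\log p_t(\tilde{\textbf{x}}|\tilde{\textbf{c}},y)
  &= \mathcal{C}(\tilde{\textbf{x}},\tilde{\textbf{c}})
   + \sum_{i=1}^{V}\log p_t(\textbf{x}^i|c^i,y) - \log Z, \\
\nabla_{\textbf{x}^i}\log p_t(\tilde{\textbf{x}}|\tilde{\textbf{c}},y)
  &= \frac{\partial \mathcal{C}(\tilde{\textbf{x}},\tilde{\textbf{c}})}{\partial \textbf{x}^i}
   + \nabla_{\textbf{x}^i}\log p_t(\textbf{x}^i|c^i,y), \\
\mathcal{C}\equiv 0 \;\Rightarrow\;
D_{KL}\big(q_t^{\theta}(\tilde{\textbf{x}}|\tilde{\textbf{c}},y)\,\|\,p_t(\tilde{\textbf{x}}|\tilde{\textbf{c}},y)\big)
  &= \sum_{i=1}^{V} D_{KL}\big(q_t^{\theta}(\textbf{x}^i|c^i,y)\,\|\,p_t(\textbf{x}^i|c^i,y)\big).
\end{align*}
```

The second line is the only place where views interact: each view's score gains a term that depends on all other views through $\mathcal{C}$.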

Joint Score Distillation Function.

To obtain the gradient of the multi-view KL-divergence in Eq. (4), we derive a score distillation function that is conducted jointly on multiple views:

$$\begin{split}&\nabla_{\theta}L_{JSD}(\theta)\\ &\triangleq \mathrm{E}_{t,\epsilon_{\Phi}}\Big[w(t)\,\mathbb{E}\big(\nabla_{\tilde{\textbf{x}}}\log q_t(\tilde{\textbf{x}}_t|\tilde{\textbf{x}}_0)-\nabla_{\tilde{\textbf{x}}}\log p_t(\tilde{\textbf{x}}_t|y)\big)\Big]\\ &=\sum_{i=1}^{V}\mathrm{E}_{t,\epsilon^i_{\Phi}}\Big[w(t)\Big(\hat{\epsilon}_{\Phi}(\textbf{x}_t^i,y)-\frac{\partial\mathcal{C}(\tilde{\textbf{x}})}{\partial\textbf{x}_t^i}-\epsilon^i\Big)\frac{\delta g(\theta,c^i)}{\delta\theta}\Big],\end{split} \tag{5}$$

where $\{\epsilon^i\}_{i=1}^{V}$ are noises during score matching for different views. The proof can be found in the Appendix. To intuitively compare SDS with JSD, we sample multi-view images from the forward pass of the pre-trained diffusion model [38] as shown in Fig. 2(c). For each view $\textbf{x}_t$, we randomly select $\textbf{x}_t'$ from a different view and adopt the binary classification model presented in Sec. 4.2 as the energy function to measure coherence across the two views. The results illustrate that JSD significantly enhances the correspondence with directional prompts and consistency across different views, particularly for the initially biased views such as side and back views. These theoretical and empirical results prove that the multi-face Janus issue in SDS is rooted in the view-agnostic and view-biased 2D image distribution, a challenge effectively mitigated by JSD.
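The per-view term inside Eq. (5) can be sketched as follows. The quadratic energy used here, $\mathcal{C} = -\tfrac{1}{2}\sum_{j<k}\|\textbf{x}^j-\textbf{x}^k\|^2$, is a hypothetical toy that simply rewards identical views; it is a stand-in for a learned view-aware model, and the function names and inputs are illustrative assumptions. Lists of floats again replace image tensors.

```python
def toy_energy_grad(views, i):
    """dC/dx^i for the toy energy C = -0.5 * sum_{j<k} ||x^j - x^k||^2."""
    grad = [0.0] * len(views[i])
    for j, other in enumerate(views):
        if j != i:
            for p, (a, b) in enumerate(zip(views[i], other)):
                grad[p] -= a - b
    return grad


def jsd_residuals(views, eps_hat, eps, w, energy_grad=None):
    """Per-view term w(t) * (eps_hat^i - dC/dx^i - eps^i) from Eq. (5)."""
    out = []
    for i in range(len(views)):
        dC = energy_grad(views, i) if energy_grad else [0.0] * len(views[i])
        out.append([w * (h - g - e) for h, g, e in zip(eps_hat[i], dC, eps[i])])
    return out


# Two toy views with two pixels each; the noise predictions equal the added
# noise, so the plain SDS term vanishes and only the energy contributes.
views = [[1.0, 0.0], [0.0, 0.0]]
eps_hat = [[0.5, 0.5], [0.5, 0.5]]
eps = [[0.5, 0.5], [0.5, 0.5]]

sds = jsd_residuals(views, eps_hat, eps, w=1.0)                   # C == 0
jsd = jsd_residuals(views, eps_hat, eps, w=1.0, energy_grad=toy_energy_grad)
print(sds)
print(jsd)
```

With no energy the call reduces to the SDS term for every view. With the toy energy, a descent step on the residuals moves the differing pixel of the two views toward each other, which is the coupling that the energy gradient introduces.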

4.2 Universal View-Aware Models as Energy Function

JSD requires an energy function $\mathcal{C}(\tilde{\textbf{x}},\tilde{\textbf{c}})$ to measure coherence across denoised images, as presented in Eq. (3). This energy function plays a crucial role in assessing the consistency between different views in image distribution. To demonstrate the compatibility of JSD, we employ three different types of models trained for various representative multi-view tasks.

Figure 4: Illustration of the binary classification model and qualitative results with JSD. (a) The classification model $M_{\text{CLS}}$ produces the binary logit to measure the consistency between two input views $x^i$ and $x^j$. (b) JSD integrated with the classification model effectively alleviates Janus issues compared to SDS.
Figure 5: Comparison of text-to-3D generation. See Appendix for more results.

Binary classification model $M_{\text{CLS}}$.

The binary classification model $M_{\text{CLS}}$ is designed to classify the content consistency between two input images based on their relative camera pose. To ensure computational efficiency, we introduce a dedicated classification model, as shown in Fig. 4 (a). The training process takes two days with a single A800 GPU on the Objaverse dataset [8]. Further details can be found in the appendix. The classification model $M_{\text{CLS}}$ processes pairs of images $\textbf{x}^i$ and $\textbf{x}^j$ captured from different viewpoints $c^i$ and $c^j$. It extracts image features using the DINO-ViT/s16 backbone [3]. These image features are conditioned on the camera feature obtained through MLP layers from the relative camera transformation matrix $\Delta(c^j, c^i)$. Finally, the classification head produces the binary score. To incorporate it into JSD, we only consider neighboring views among the $V$ views as image pairs for coherence measurement, which is denoted as:

$$\mathcal{C}_{\text{CLS}}(\tilde{\textbf{x}},\tilde{\textbf{c}}) = \sum_{i,j\in 1,\dots,V;\; i\neq j} M_{\text{CLS}}\big(\textbf{x}^i_t,\textbf{x}^j_t,\Delta(c^j,c^i)\big), \tag{6}$$

where a higher logit indicates stronger geometric consistency.
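The summation in Eq. 6 can be illustrated with a minimal sketch. Everything here is hypothetical: `toy_pair_classifier` merely stands in for a trained $M_{\text{CLS}}$, the views are assumed to lie on a ring so that "neighboring" means adjacent indices, and camera poses are assumed to be 4x4 camera-to-world matrices.

```python
import numpy as np

def toy_pair_classifier(x_i, x_j, rel_pose):
    """Hypothetical stand-in for M_CLS: returns one consistency logit.

    A real M_CLS is a trained network; this toy version ignores rel_pose
    and scores a pair higher when the two images are more similar.
    """
    return -float(np.mean((x_i - x_j) ** 2))

def relative_pose(c_j, c_i):
    """Delta(c_j, c_i) for 4x4 camera-to-world matrices (assumed convention)."""
    return np.linalg.inv(c_j) @ c_i

def cls_energy(views, cams, classifier=toy_pair_classifier):
    """Sum classifier logits over ordered pairs (i, j), i != j, of neighboring views."""
    V = len(views)
    total = 0.0
    for i in range(V):
        # Neighbors of view i on an assumed ring of V views.
        for j in {(i - 1) % V, (i + 1) % V} - {i}:
            total += classifier(views[i], views[j], relative_pose(cams[j], cams[i]))
    return total
```

With this toy classifier, identical views give the maximal energy of zero and dissimilar neighboring views lower it, which mirrors the role of the energy term as a coherence score.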

Image-to-image translation model $M_{\text{I2I}}$.

The image-to-image translation model $M_{\text{I2I}}$ is tailored for novel view synthesis [26, 29, 40, 27]. We employ the most recent model, Wonder3D [29], as $M_{\text{I2I}}$, which is a viewpoint-conditioned image translation model that generates consistent content in the target viewpoint. When integrated with JSD, a random reference view $\textbf{x}^{\textbf{ref}}$ is selected from the set of 3D rendered images. We then input the relative camera transformation $\Delta(c^{i},c^{\textbf{ref}})$ and the rendered images to $M_{\text{I2I}}$. Consistency is measured by the reconstruction loss between the synthesized new image and its corresponding rendered image:

$$\mathcal{C}_{\text{I2I}}(\tilde{\textbf{x}},\tilde{\textbf{c}})=-\sum_{i\in 1,\dots,V}||M_{\text{I2I}}(\textbf{x}^{\textbf{ref}}_{t},\Delta(c^{i},c^{\textbf{ref}}))-\textbf{x}_{t}^{i}||_{2}^{2}, \tag{7}$$

where a smaller reconstruction loss indicates stronger geometric consistency under the estimation of $M_{\text{I2I}}$.
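As a rough illustration of Eq. 7, the sketch below computes the negative sum of squared reconstruction errors against a reference view. The function `toy_translate` is a hypothetical placeholder for $M_{\text{I2I}}$ (it is not the Wonder3D interface), and `rel_poses[i]` is assumed to hold $\Delta(c^{i},c^{\textbf{ref}})$.

```python
import numpy as np

def toy_translate(x_ref, rel_pose):
    """Hypothetical stand-in for M_I2I.

    A real model would synthesize the view at the target pose; this toy
    version simply returns the reference image unchanged.
    """
    return x_ref

def i2i_energy(views, rel_poses, ref_index, translate=toy_translate):
    """Negative sum over all V views of the squared L2 reconstruction error."""
    x_ref = views[ref_index]
    total = 0.0
    for x_i, pose in zip(views, rel_poses):
        pred = translate(x_ref, pose)              # synthesized view i
        total += float(np.sum((pred - x_i) ** 2))  # ||pred - x_i||_2^2
    return -total
```

The energy of Eq. 8 has the same form, except that the predictions come from a text- and camera-conditioned multi-view model rather than from a reference view.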

Multi-view synthesis model $M_{\text{MVS}}$.

The multi-view synthesis model is designed to generate multiple images conditioned on text prompts and camera poses; we employ the recent MVDream [41] as $M_{\textbf{MVS}}$. We compute the reconstruction loss between the generated views and the rendered views:

$$\mathcal{C}_{\textbf{MVS}}(\tilde{\textbf{x}},\tilde{\textbf{c}})=-||M_{\textbf{MVS}}(y,\tilde{\textbf{c}})-\tilde{\textbf{x}}||_{2}^{2} \tag{8}$$

A smaller reconstruction loss signifies better geometric consistency under $M_{\textbf{MVS}}$. These view-aware models measure the coherence across views according to their own 3D-aware priors. When incorporated with JSD, they provide distinct constraints, resulting in different 3D generations that nonetheless all show enhanced geometric consistency. We believe that JSD can be adapted to more universal view-aware models, which enables us to progressively redefine the benchmark of 3D generation as multi-view tasks advance.

Figure 6: Ablations on JSD incorporated with different energy functions $\mathcal{C}$. CLS: Binary Classification model; I2I: Image-to-Image Translation model (Wonder3D [29]); MVS: Multi-View Image Synthesis model (MVDream [41]).

4.3 Framework of JointDreamer

Building upon the JSD optimization, we propose the overall framework JointDreamer. The optimization is based on a neural radiance field (NeRF) [31], adopting Instant-NGP [32] with a volume renderer. We adopt the multi-view synthesis model $M_{\text{MVS}}$ as the energy function to integrate with JSD in JointDreamer. During optimization, we use common techniques including time-annealing and resolution scaling-up following [42, 41]. In addition, we propose two novel techniques to further enhance the generation quality: a Geometry Fading scheme and a Classifier-Free Guidance (CFG) scale Switching strategy.

Geometry Fading.

We aim to shift attention between geometric structure and texture details during optimization. Specifically, starting from iteration 5K, we reduce the learning rate of the density network of NeRF from $1e-2$ to $1e-6$ and set the orientation loss to 0. This benefits geometric convergence in the early phase of optimization, while allowing for decreased attention on geometry and increased attention on texture enhancement in the later stages.
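A minimal sketch of the Geometry Fading schedule follows. It assumes the change at iteration 5K is an abrupt step rather than a gradual decay, and `orient_weight_early` is a hypothetical placeholder, since the text only states that the orientation loss is set to 0 once fading starts.

```python
def geometry_fading(step, fade_start=5000, lr_early=1e-2, lr_late=1e-6,
                    orient_weight_early=1.0):
    """Return (density_lr, orientation_loss_weight) for a training step.

    orient_weight_early is an illustrative value, not taken from the paper.
    """
    if step < fade_start:
        return lr_early, orient_weight_early
    # From iteration 5K on: nearly freeze the density network, drop the loss.
    return lr_late, 0.0
```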

CFG Switching.

CFG scheduling strategies have been employed in the 2D domain [39, 15] to enhance quality. In this work, we propose to modify the CFG scale $s$ during training for 3D generation. We are motivated by the observation that a large CFG scale can accelerate geometric convergence but may result in under-optimized geometry and distorted texture. Unlike the annealed CFG approach in CLIP-Sculptor [39], our increasing strategy prioritizes texture while maintaining accurate geometry. Specifically, a smaller $s=30$ is employed in the early stages to preserve shape integrity, which allows for stronger coherence guidance from JSD. After 5K iterations, we increase $s$ to 50, enhancing texture fidelity and overall quality.
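The switching rule can be sketched as below. The guidance formula shown is one common classifier-free guidance convention and is an assumption here; the function names are illustrative, and noise predictions are represented as plain lists of floats.

```python
def cfg_scale(step, switch_step=5000, s_early=30.0, s_late=50.0):
    """CFG scale s: 30 in the early stage, 50 from iteration 5K on."""
    return s_early if step < switch_step else s_late

def guided_noise(eps_uncond, eps_cond, s):
    """Assumed CFG convention: eps_uncond + s * (eps_cond - eps_uncond)."""
    return [u + s * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

Under this convention a larger $s$ amplifies the text-conditional direction, which is consistent with reserving it for the later, texture-focused stage.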

5 Experiment

In this section, we present the text-to-3D generation results of JointDreamer with qualitative and quantitative evaluations, illustrating state-of-the-art performance. We also make further ablation analysis on the proposed JSD. More training and evaluation details can be found in the Appendix.

5.1 Text-to-3D Generation

Figure 7: Ablations on CFG Switching and Geometry Fading techniques in JointDreamer pipeline, demonstrating the effectiveness of quality enhancement.
Table 1: Quantitative results on textual consistency and user preference, tested on the object-centric subset of MS-COCO [25].
| Method | CLIP Score ↑ | R-Precision ↑ | User Study ↑ |
|---|---|---|---|
| DreamFusion [37] | 20.1 | 27.7 | 18.2 |
| ProlificDreamer [42] | 25.0 | 18.7 | 16.2 |
| MVDream [41] | 20.8 | 33.6 | 23.5 |
| JointDreamer | 27.7 | 88.5 | 42.1 |
Table 2: Ablation study on CFG Switching (CFGS) and Geometry Fading (GF).
| SDS | JSD | CFGS | GF | CLIP Score ↑ | FID ↓ |
|---|---|---|---|---|---|
| ✓ | | | | 20.0 | 429.2 |
| | ✓ | | | 27.6 | 360.7 |
| | ✓ | ✓ | | 28.2 | 357.6 |
| | ✓ | ✓ | ✓ | 28.8 | 353.9 |

Qualitative Comparisons.

We compare with several representative baselines implemented in the threestudio [11] project, which modularizes the text-to-3D framework and allows fair comparisons by ablating individual components. The generated samples are shown in Fig. 5. (i) DreamFusion [37]: Compared to DreamFusion, which uses traditional SDS optimization, JointDreamer significantly alleviates the multi-face Janus issue by introducing an energy term that ensures inter-view coherence. (ii) Magic3D [24]: Magic3D introduces a two-stage generation pipeline, transferring NeRF to DMTet in the second stage to enhance generation quality. We likewise transfer NeRF to DMTet, denoted JointDreamer-DMTet, which shows consistent superiority in geometry and texture quality by using JSD optimization instead of SDS. (iii) ProlificDreamer [42]: ProlificDreamer presents variational score distillation (VSD) as a variant of the score distillation function. While VSD enhances photorealism in 3D renderings by introducing a LoRA model, the ill-posed association between poses and images during LoRA training deepens the geometric inconsistency of the 3D representation, resulting in severe multi-view artifacts. In contrast, JointDreamer with JSD achieves geometric consistency while maintaining high-fidelity texture quality. (iv) MVDream [41]: Instead of distilling directly from a fine-tuned model as MVDream does, JointDreamer employs view-aware models as the coherence constraint in JSD while still inheriting the generalization capabilities of the original diffusion models [38]. The results indicate that JointDreamer mitigates the overfitting issues of MVDream: it responds accurately to complex input text and enhances the 3D consistency of generations.

Quantitative Comparisons.

Following [37, 18], we evaluate CLIP Score [14], CLIP R-Precision [36] and user preference on the object-centric caption subset of MS-COCO [25] with 153 prompts to measure text congruence and generative quality. For computational efficiency, we generate each 3D asset using 5K iterations of $64\times 64$ rendered images and render 20 images per caption for evaluation. CLIP ViT-B/32 is adopted as the feature extractor for CLIP Score and CLIP R-Precision. As shown in Table 1, JointDreamer outperforms all baselines on CLIP Score and CLIP R-Precision by large margins. Specifically, JointDreamer improves R-Precision by 60.8 and 54.9 percentage points over DreamFusion and MVDream, respectively, demonstrating its superior correspondence to the textual description. Notably, the severe Janus artifacts in ProlificDreamer compromise rendering quality with noisy backgrounds and semantic distortion, resulting in the lowest R-Precision. We also conduct a user study on shape preference for these prompts, reported in Table 1.
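For readers unfamiliar with the metric, CLIP R-Precision can be read as top-1 caption retrieval accuracy: a rendering counts as correct when its source caption is the most similar among all candidate captions. The sketch below assumes the image and text features have already been extracted (for example with CLIP ViT-B/32); the function name and array layout are illustrative, and how the 20 renderings per caption are aggregated in the actual evaluation is not specified here.

```python
import numpy as np

def clip_r_precision(image_feats, text_feats, caption_ids):
    """Top-1 caption retrieval accuracy.

    image_feats: (N, D) features of rendered images
    text_feats:  (P, D) features of all candidate captions
    caption_ids: (N,) index of the caption each image was generated from
    """
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sims = img @ txt.T                   # cosine similarities, shape (N, P)
    retrieved = np.argmax(sims, axis=1)  # best-matching caption per image
    return float(np.mean(retrieved == np.asarray(caption_ids)))
```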

5.2 Ablation Analyses

Ablations on Energy Functions.

Sec. 4.2 discusses the varying inter-view coherence measurements provided by the trained view-aware models. When incorporated with JSD, these models have distinct impacts on 3D generations, as shown in Fig. 6. The binary classifier effectively corrects inaccurate geometric structures in SDS. However, as a discriminative model it cannot introduce additional imaginative elements, resulting in oversaturated and monotonous textures. In contrast, as generative models, Wonder3D [29] and MVDream [41] employ a reconstruction loss to estimate 3D consistency. Hence, they not only guide geometric structural modifications but also influence texture quality. We further conduct a quantitative comparison using 16 complex multi-Janus prompts, as outlined in Table 3. The experimental setup details can be found in the Appendix. The results indicate that JSD consistently mitigates Janus artifacts across different view-aware models, with only a slight increase in computational requirements compared to SDS. We find that the energy function $\mathcal{C}_{\text{I2I}}$ derived from the image-to-image translation model exhibits comparatively poor performance, likely due to a mismatch in camera range and inaccurate translated results.

Table 3: Quantitative comparison of energy functions, showcasing the effectiveness of JSD in mitigating Janus artifacts with comparable computational efficiency to SDS.
| Methods | Janus Rate ↓ | GPU Memory ↓ | Train Time ↓ |
|---|---|---|---|
| SDS | 100% | 16.1 G | 50 min. |
| JSD w/ $\mathcal{C}_{\textbf{CLS}}$ | 12.5% | 22.1 G | 80 min. |
| JSD w/ $\mathcal{C}_{\textbf{I2I}}$ | 31.2% | 16.0 G | 119 min. |
| JSD w/ $\mathcal{C}_{\textbf{MVS}}$ | 6.2% | 19.4 G | 54 min. |
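Since the comparison uses 16 prompts, the Janus rates in Table 3 correspond to whole-number counts of failed prompts. The short calculation below makes this explicit; the counts are inferred from the reported percentages and are not stated in the paper.

```python
NUM_PROMPTS = 16  # number of multi-Janus prompts stated in the text

def janus_rate(num_janus, num_prompts=NUM_PROMPTS):
    """Percentage of prompts whose generation shows a Janus artifact."""
    return 100.0 * num_janus / num_prompts

# Counts inferred from the reported percentages (an assumption).
for method, count in [("SDS", 16), ("CLS", 2), ("I2I", 5), ("MVS", 1)]:
    print(f"{method}: {count}/{NUM_PROMPTS} = {janus_rate(count):.2f}%")
```

Under this reading, 5/16 = 31.25% and 1/16 = 6.25% appear in the table as 31.2% and 6.2%.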

Analysis on Geometry Fading and CFG Switching Mechanisms.

We conduct incremental ablations on the techniques proposed in JointDreamer, namely the Geometry Fading scheme and the CFG Switching strategy. As shown in Fig. 7, increasing the CFG value enhances texture detail compared to the original JSD, but over-optimizes the geometry, resulting in a bumpier shape. Compared to “+CFG Switching”, Geometry Fading effectively protects the shape under larger CFG guidance. We also conduct a quantitative evaluation on a 30% MS-COCO subset in Table 2. JSD demonstrates superior texture quality compared to SDS, as evidenced by the CLIP Score (CS) and FID metrics. The two proposed mechanisms further enhance the texture quality.

Discussions on Training Loss.

To further compare JSD and SDS, we aggregate the training losses from multiple prompts under the two optimization functions and visualize the training loss curves in Fig. 8. We observe that SDS experiences significant fluctuations due to the randomness of single-view optimization. In contrast, JSD converges gradually and smoothly, demonstrating that multi-view optimization with inter-view coherence reduces the randomness of optimization and contributes to better convergence of the 3D representation.

Figure 8: Training loss comparisons of JSD and SDS. JSD eliminates the random fluctuations seen in SDS convergence.

Figure 9: Impact of prompts and random seeds, demonstrating the robustness of JointDreamer. […] denotes the same content as the prompt in Fig. 7.

Discussions on Robustness for Prompts and Random Seed.

To show the impact of the seed and the prompt, we modify the seed and key components of the prompt in Fig. 7, such as the subject, object and verb. The results in Fig. 9 demonstrate the robustness of JointDreamer to different seeds and prompts. Note that the default seed is 0.

The Effectiveness of Classification Model.

The classification model surpasses MVDream in training speed by a factor of 48. Fig. 4 (b) provides additional high-quality results obtained using JSD in conjunction with the binary classification model $M_{\text{CLS}}$. Furthermore, we conduct an ablation study by replacing $\frac{\partial\mathcal{C}(\tilde{\textbf{x}})}{\partial\textbf{x}_{t}^{i}}$ in Eq. 5 with a randomly generated value in $[0,1]$. This yields shapeless results due to unrelated disturbances in the optimization process. These findings highlight the significance of the proposed classification model in achieving a balance between computational efficiency and generation quality.

Figure 10: Comparison of JointDreamer and SDS+MVDream, illustrating the superiority of JointDreamer in mitigating the Janus problem. SDS+MVDream consistently exhibits semantic missing or Janus issues with different weight options.

More Comparison with MVDream.

View-aware models can be incorporated into 3D generation by combining them with JSD or using them directly with SDS. In this discussion, we use MVDream [41] as an example to demonstrate the superiority of JSD. MVDream can generate consistent shapes when calculating the SDS loss directly. However, it may miss components of the complete input text due to its fine-tuning on limited 3D data, as illustrated in Fig. 5. To address this limitation, a straightforward approach is to combine the SDS loss with the MVDream loss, which we denote as “SDS+MVDream”.

However, balancing the impact of SDS and MVDream through such a simple combination is difficult. As shown in Fig. 10, setting $\lambda_{\textbf{SDS}}$ to 0 degrades to MVDream alone, while setting $\lambda_{\textbf{MVDream}}$ to 0 yields results similar to DreamFusion [37]. When $\lambda_{\textbf{MVDream}}$ is large, textual consistency remains constrained, whereas decreasing $\lambda_{\textbf{MVDream}}$ leads to the Janus problem. This is due to gradient misalignment across multiple views in the “SDS+MVDream” combination, as SDS lacks multi-view information and cannot derive our objective outlined in Eq. 5. In contrast, JSD, which is based on the joint image distribution, provides supervision for text consistency and high-fidelity texture, and also promotes inter-view consistency by introducing an energy term with the view-aware model.
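One plausible reading of the “SDS+MVDream” baseline is a weighted sum of the two distillation gradients. The sketch below encodes that reading only; the function and argument names are hypothetical, and the exact way the two terms are combined in the experiments is an assumption.

```python
import numpy as np

def combined_gradient(grad_sds, grad_mvdream, lambda_sds, lambda_mvdream):
    """Weighted sum of the SDS and MVDream gradients (assumed combination rule)."""
    return lambda_sds * grad_sds + lambda_mvdream * grad_mvdream
```

The two extreme settings reduce to a single guidance source, which matches the degenerate cases described for Fig. 10; intermediate weights only interpolate between two independently computed gradients and add no cross-view coupling.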

6 Conclusion

In this work, we introduce Joint Score Distillation (JSD) as a new paradigm for text-to-3D generation, which conducts multi-view optimization jointly and accounts for inter-view coherence. We demonstrate that JSD can significantly enhance 3D coherence while maintaining generalizability. With the other proposed techniques, our overall framework, JointDreamer, is capable of geometrically consistent and high-fidelity 3D generation that adheres to complex text input.

Limitations. While the training time of JointDreamer is comparable to existing SDS pipelines, there is room for improvement in terms of acceleration. Future work will explore alternative 3D representations, such as 3D Gaussian Splatting [20]. Additionally, JSD utilizes view-aware models to ensure geometry consistency and mitigate the impact of limited data. However, view-aware models still require 3D data for training, so more efficient 3D data collection or reconstruction from multi-view images is also worth investigating.

References

  • [1] Armandpour, M., Zheng, H., Sadeghian, A., Sadeghian, A., Zhou, M.: Re-imagine the negative prompt algorithm: Transform 2d diffusion into 3d, alleviate janus problem and beyond. In: ICLR (2024)
  • [2] Cao, Z., Hong, F., Wu, T., Pan, L., Liu, Z.: Large-vocabulary 3d diffusion model with transformer. In: ICLR (2024)
  • [3] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: ICCV. pp. 9650–9660 (2021)
  • [4] Chan, E.R., Lin, C.Z., Chan, M.A., Nagano, K., Pan, B., De Mello, S., Gallo, O., Guibas, L.J., Tremblay, J., Khamis, S., et al.: Efficient geometry-aware 3d generative adversarial networks. In: CVPR. pp. 16123–16133 (2022)
  • [5] Chen, H., Gu, J., Chen, A., Tian, W., Tu, Z., Liu, L., Su, H.: Single-stage diffusion nerf: A unified approach to 3d generation and reconstruction. arXiv preprint arXiv:2304.06714 (2023)
  • [6] Chen, R., Chen, Y., Jiao, N., Jia, K.: Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation. In: ICCV. pp. 22246–22256 (2023)
  • [7] Deitke, M., Liu, R., Wallingford, M., Ngo, H., Michel, O., Kusupati, A., Fan, A., Laforte, C., Voleti, V., Gadre, S.Y., et al.: Objaverse-xl: A universe of 10m+ 3d objects. In: NeurIPS (2024)
  • [8] Deitke, M., Schwenk, D., Salvador, J., Weihs, L., Michel, O., VanderBilt, E., Schmidt, L., Ehsani, K., Kembhavi, A., Farhadi, A.: Objaverse: A universe of annotated 3d objects. In: CVPR. pp. 13142–13153 (2023)
  • [9] Deng, Y., Yang, J., Xiang, J., Tong, X.: Gram: Generative radiance manifolds for 3d-aware image generation. In: CVPR. pp. 10673–10683 (2022)
  • [10] Gao, J., Shen, T., Wang, Z., Chen, W., Yin, K., Li, D., Litany, O., Gojcic, Z., Fidler, S.: Get3d: A generative model of high quality 3d textured shapes learned from images. NeurIPS 35, 31841–31854 (2022)
  • [11] Guo, Y.C., Liu, Y.T., Shao, R., Laforte, C., Voleti, V., Luo, G., Chen, C.H., Zou, Z.X., Wang, C., Cao, Y.P., Zhang, S.H.: threestudio: A unified framework for 3d content generation. https://github.com/threestudio-project/threestudio (2023)
  • [12] Henderson, P., Ferrari, V.: Learning single-image 3d reconstruction by generative modelling of shape, pose and shading. IJCV 128(4), 835–854 (2020)
  • [13] Henderson, P., Tsiminaki, V., Lampert, C.H.: Leveraging 2d data to learn textured 3d mesh generation. In: CVPR. pp. 7498–7507 (2020)
  • [14] Hessel, J., Holtzman, A., Forbes, M., Bras, R.L., Choi, Y.: CLIPScore: a reference-free evaluation metric for image captioning. In: EMNLP (2021)
  • [15] Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., et al.: Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022)
  • [16] Hu, Z., Zhao, M., Zhao, C., Liang, X., Li, L., Zhao, Z., Fan, C., Zhou, X., Yu, X.: Efficientdreamer: High-fidelity and robust 3d creation via orthogonal-view diffusion priors. In: CVPR. pp. 4949–4958 (2024)
  • [17] Huang, Y., Wang, J., Shi, Y., Tang, B., Qi, X., Zhang, L.: Dreamtime: An improved optimization strategy for diffusion-guided 3d generation. In: ICLR (2023)
  • [18] Jain, A., Mildenhall, B., Barron, J.T., Abbeel, P., Poole, B.: Zero-shot text-guided object generation with dream fields. In: CVPR. pp. 867–876 (2022)
  • [19] Jun, H., Nichol, A.: Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463 (2023)
  • [20] Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics 42(4), 1–14 (2023)
  • [21] LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., Huang, F.: A tutorial on energy-based learning. Predicting structured data 1(0) (2006)
  • [22] Li, M., Zhou, P., Liu, J.W., Keppo, J., Lin, M., Yan, S., Xu, X.: Instant3d: Instant text-to-3d generation. arXiv preprint arXiv:2311.08403 (2023)
  • [23] Li, W., Chen, R., Chen, X., Tan, P.: Sweetdreamer: Aligning geometric priors in 2d diffusion for consistent text-to-3d. In: ICLR (2024)
  • [24] Lin, C.H., Gao, J., Tang, L., Takikawa, T., Zeng, X., Huang, X., Kreis, K., Fidler, S., Liu, M.Y., Lin, T.Y.: Magic3d: High-resolution text-to-3d content creation. In: CVPR (2023)
  • [25] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. pp. 740–755. Springer (2014)
  • [26] Liu, R., Wu, R., Van Hoorick, B., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero-1-to-3: Zero-shot one image to 3d object. In: ICCV. pp. 9298–9309 (2023)
  • [27] Liu, Y., Lin, C., Zeng, Z., Long, X., Liu, L., Komura, T., Wang, W.: Syncdreamer: Generating multiview-consistent images from a single-view image. In: ICLR (2024)
  • [28] Liu, Z., Feng, Y., Black, M.J., Nowrouzezahrai, D., Paull, L., Liu, W.: Meshdiffusion: Score-based generative 3d mesh modeling. In: ICLR (2023)
  • [29] Long, X., Guo, Y.C., Lin, C., Liu, Y., Dou, Z., Liu, L., Ma, Y., Zhang, S.H., Habermann, M., Theobalt, C., et al.: Wonder3d: Single image to 3d using cross-domain diffusion. In: CVPR (2024)
  • [30] Luo, W., Hu, T., Zhang, S., Sun, J., Li, Z., Zhang, Z.: Diff-instruct: A universal approach for transferring knowledge from pre-trained diffusion models. NeurIPS 36 (2024)
  • [31] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65(1), 99–106 (2021)
  • [32] Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (ToG) 41(4), 1–15 (2022)
  • [33] Nguyen-Phuoc, T., Li, C., Theis, L., Richardt, C., Yang, Y.L.: Hologan: Unsupervised learning of 3d representations from natural images. In: ICCV. pp. 7588–7597 (2019)
  • [34] Nichol, A., Jun, H., Dhariwal, P., Mishkin, P., Chen, M.: Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751 (2022)
  • [35] Niemeyer, M., Geiger, A.: Giraffe: Representing scenes as compositional generative neural feature fields. In: CVPR. pp. 11453–11464 (2021)
  • [36] Park, D.H., Azadi, S., Liu, X., Darrell, T., Rohrbach, A.: Benchmark for compositional text-to-image synthesis. In: NeurIPS Datasets and Benchmarks Track (2021)
  • [37] Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: Dreamfusion: Text-to-3d using 2d diffusion. In: ICLR (2023)
  • [38] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR. pp. 10684–10695 (2022)
  • [39] Sanghi, A., Fu, R., Liu, V., Willis, K.D., Shayani, H., Khasahmadi, A.H., Sridhar, S., Ritchie, D.: Clip-sculptor: Zero-shot generation of high-fidelity and diverse shapes from natural language. In: CVPR. pp. 18339–18348 (2023)
  • [40] Shi, R., Chen, H., Zhang, Z., Liu, M., Xu, C., Wei, X., Chen, L., Zeng, C., Su, H.: Zero123++: a single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110 (2023)
  • [41] Shi, Y., Wang, P., Ye, J., Long, M., Li, K., Yang, X.: Mvdream: Multi-view diffusion for 3d generation. In: ICLR (2024)
  • [42] Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. In: NeurIPS (2024)
  • [43] Watson, D., Chan, W., Martin-Brualla, R., Ho, J., Tagliasacchi, A., Norouzi, M.: Novel view synthesis with diffusion models. In: ICLR (2023)
  • [44] Weese, J., Kaus, M., Lorenz, C., Lobregt, S., Truyen, R., Pekar, V.: Shape constrained deformable models for 3d medical image segmentation. In: Information Processing in Medical Imaging: 17th International Conference, IPMI 2001 Davis, CA, USA, June 18–22, 2001 Proceedings 17. pp. 380–387. Springer (2001)
  • [45] Zhao, W., Yan, L., Zhang, Y.: Geometric-constrained multi-view image matching method based on semi-global optimization. Geo-spatial information science 21(2), 115–126 (2018)

This supplementary material consists of five parts, including technical details of the experimental setup (Sec. 0.A), the derivation of Joint Score Distillation (JSD) (Sec. 0.B), additional ablation analysis (Sec. 0.C), additional experimental results (Sec. 0.D) and the Janus prompt list (Sec. 0.E).

Appendix 0.A Experimental Setup

0.A.1 Details of JointDreamer Pipeline.

In the main text, we adopt MVDream $\mathcal{M}_{\text{MVS}}$ as the energy function for the overall JointDreamer pipeline. Since MVDream is fine-tuned from SD-V2.1, we retain SD-V2.1 as the diffusion model. The whole training procedure comprises 6k iterations, taking around 1.5 h with batch size 4 on 1 Nvidia Tesla A800 GPU. Specifically, we warm up NeRF for the initial 600 training iterations with SDS and adopt JSD for the remaining iterations. We adopt the common time-annealing and resolution-increasing tricks from the open-source implementation, together with the two proposed mechanisms: the Geometry Fading scheme and the Classifier-Free Guidance (CFG) scale Switching strategy. We set $t=0.98$ with resolution 64 for the first 3k iterations and then anneal to $t\sim U(0.02,0.50)$ with resolution 256 for the next 2k iterations. Starting from iteration 5k, we scale up the resolution to 512 and apply the two proposed mechanisms: the learning rate of the density network is reduced from $1e-2$ to $1e-6$ and the CFG scale is switched from 30 to 50. Together, these two mechanisms allow greater influence from the coherence guidance in JSD on geometry optimization in the early training stages and enhance the fidelity of textures in later stages.
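The phase boundaries described above can be summarized as a small lookup. This is a sketch of the schedule as read from the text, not the released configuration: the function name and dictionary keys are illustrative, the timestep for the first phase is encoded literally as the stated value 0.98, and the timestep range after iteration 5k is assumed to stay unchanged.

```python
def training_phase(step):
    """Loss type, render resolution and timestep range for one iteration."""
    if not 0 <= step < 6000:
        raise ValueError("step must be in [0, 6000)")
    loss = "SDS" if step < 600 else "JSD"        # SDS warm-up, then JSD
    if step < 3000:
        resolution, t_range = 64, (0.98, 0.98)   # text states t = 0.98
    elif step < 5000:
        resolution, t_range = 256, (0.02, 0.50)  # t ~ U(0.02, 0.50)
    else:
        resolution, t_range = 512, (0.02, 0.50)  # assumed: range unchanged
    return {"loss": loss, "resolution": resolution, "t_range": t_range}
```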

0.A.2 Details of Binary Classification Model.

In this part, we elaborate on the model architecture and training procedure of the binary classification model discussed in Sec. 4.2 of the main paper.

Model Architecture.

We build the model based on the DINO framework. Specifically, we employ ViT-s16 as the backbone for extracting image features. The backbone is initially pre-trained following the DINO method, and during training, the first 9 blocks of the backbone are frozen. Besides, we use a 4-layer MLP with 256 hidden layer channels to extract the relative camera embedding of the transformation matrix between input images, which captures the camera-specific information. Next, we calculate the cross-attention between camera embedding and the concatenated image features of input image pairs. This cross-attention mechanism generates a residual feature input, combined with the concatenated image features as the final feature. Finally, the combined features are fed into the classification head consisting of a 3-layer MLP, which produces the classification logit prediction for input image pairs.
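To make the data flow concrete, the following NumPy sketch traces tensor shapes through one possible wiring of the described components, using random weights. Several details are assumptions that the text does not specify: the camera embedding is used as the attention query over the two image features, the residual is added to each image feature before concatenation, the head uses 256 hidden channels, and the relative pose is a flattened 4x4 matrix. The 384-dimensional feature width is that of ViT-S/16.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IMG, D_CAM = 384, 256  # ViT-S/16 feature width; camera MLP width from the text

def init_mlp(sizes):
    """Random weights for a stack of linear layers (illustrative only)."""
    return [(rng.normal(0, 0.02, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def mlp(x, weights):
    """Linear layers with ReLU in between, no activation after the last."""
    for k, (W, b) in enumerate(weights):
        x = x @ W + b
        if k < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

cam_mlp = init_mlp([16, D_CAM, D_CAM, D_CAM, D_CAM])  # 4-layer camera MLP
Wq = rng.normal(0, 0.02, (D_CAM, D_IMG))              # camera embedding -> query
Wk = rng.normal(0, 0.02, (D_IMG, D_IMG))
Wv = rng.normal(0, 0.02, (D_IMG, D_IMG))
head = init_mlp([2 * D_IMG, 256, 256, 1])             # 3-layer classification head

def classify_pair(feat_i, feat_j, rel_pose):
    """Return a consistency logit for two image features and a 4x4 relative pose."""
    tokens = np.stack([feat_i, feat_j])                 # (2, 384)
    q = mlp(rel_pose.reshape(-1), cam_mlp) @ Wq         # (384,)
    attn = softmax((tokens @ Wk) @ q / np.sqrt(D_IMG))  # (2,) attention weights
    residual = attn @ (tokens @ Wv)                     # (384,) residual feature
    fused = np.concatenate([feat_i + residual, feat_j + residual])  # (768,)
    return float(mlp(fused, head)[0])
```

In the described model the image features would come from the partially frozen DINO backbone; here they are simply passed in as vectors.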

Training Procedure.

For training data, we use rendered images from Objaverse [8] following Zero-1-to-3 [26]. For the binary classification training objective, we take pairs of images from the same object with the correct relative camera pose as positive samples, and image pairs from different objects or with incorrect relative camera poses as negative samples. Before training, we prepare the index list of positive and negative pairs for efficient training. During training, we randomly sample 1 million positive pairs and 1 million negative pairs from the index list as the training set. This design ensures that the classification model can identify the 3D consistency between rendered images conditioned on the relative camera pose. We adopt the AdamW optimizer with a learning rate of $5e-4$ and a weight decay of 0.04. We also adopt random color jitter, Gaussian blur, and solarization following DINO as data augmentation. We use an image size of $224\times 224$ and a total batch size of $640$, and train the model for 10 epochs. The training takes about 1 day on 2 Nvidia Tesla A800 GPUs. To validate the classification accuracy, we randomly sample 5000 pairs as the validation set. The training loss and validation accuracy curves can be found in Fig. A1.

Figure A1: Training loss and validation accuracy curves of the proposed Binary Classification Model.
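The positive and negative pair construction can be sketched as follows. The data layout is hypothetical (each object maps to a list of `(image, pose_id)` renders, and the relative pose is represented by a pair of pose ids), and the even split between the two kinds of negatives is an assumption. The sketch needs at least two objects with at least three views each.

```python
import random

def sample_pair(objects, rng, positive):
    """Return (img_a, img_b, rel_pose, label) for one training pair.

    objects: dict mapping object id -> list of (image, pose_id) renders.
    rel_pose is a placeholder tuple (pose_a, pose_b), not a real transform.
    """
    obj_a = rng.choice(sorted(objects))
    img_a, pose_a = rng.choice(objects[obj_a])
    if positive:
        # Same object, correct relative pose.
        img_b, pose_b = rng.choice([v for v in objects[obj_a] if v[1] != pose_a])
        return img_a, img_b, (pose_a, pose_b), 1
    if rng.random() < 0.5:
        # Negative type 1: second image comes from a different object.
        obj_b = rng.choice([o for o in sorted(objects) if o != obj_a])
        img_b, pose_b = rng.choice(objects[obj_b])
        return img_a, img_b, (pose_a, pose_b), 0
    # Negative type 2: same object, deliberately wrong relative pose.
    img_b, pose_b = rng.choice([v for v in objects[obj_a] if v[1] != pose_a])
    wrong = rng.choice([p for _, p in objects[obj_a] if p not in (pose_a, pose_b)])
    return img_a, img_b, (pose_a, wrong), 0
```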

0.A.3 Details of Text-to-3D Generation Comparison

Baseline Setup.

We implement the experiments in the open-source threestudio project and reproduce DreamFusion-IF, Magic3D-IF-SD, and ProlificDreamer as baselines, following the comparisons in the main paper of MVDream. Our MVDream baseline is reproduced with its officially released code. Following MVDream, we adopt DeepFloyd-IF [deepfloyd] as the 2D diffusion model for the DreamFusion-IF baseline and the first stage of Magic3D-IF-SD. For a fair comparison with JointDreamer, we use the same batch size, resolution, and time annealing strategy for DreamFusion-IF.

Evaluation Details.

We conducted a user study with 100 users on the 153 generated models from the object-centric MS-COCO subset. Each user is given 4 rendered videos, one per method, together with the corresponding text input. We ask the users to select a preferred 3D model from the four options, and then calculate the mean proportion of each method selected over all 153 prompts as the score. A higher score indicates greater user preference. For the CLIP Score and CLIP R-Precision, we adopt CLIP ViT-B/32 as the feature extractor.
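The user-study score described above can be computed as in the following sketch. The vote layout (one list of selected method names per prompt) and the function name are illustrative assumptions.

```python
def preference_scores(votes, methods):
    """Mean selection percentage of each method over all prompts.

    votes: dict mapping prompt -> list of selected method names, one per user.
    """
    totals = {m: 0.0 for m in methods}
    for choices in votes.values():
        for m in methods:
            # Proportion of users who picked method m for this prompt.
            totals[m] += choices.count(m) / len(choices)
    return {m: 100.0 * s / len(votes) for m, s in totals.items()}
```

Because every user picks exactly one of the four options, the scores of the four methods sum to 100, as they do in the User Study column of Table 1.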

Figure A2: More quality results of JSD with Classification Model.

0.A.4 Details of Computational Resource Comparison

We analyze the geometric consistency and computational efficiency of various view-aware models in Table 3 of the main paper, using the 16 complex multi-Janus prompts listed in Sec. 0.E from the DreamFusion [37] library. We keep the experimental parameters consistent: a batch size of 4, 5k training iterations, and a resolution of 64, along with the same optimizer and time annealing hyperparameters. The only variation is in the camera parameters, which follow each view-aware model’s settings. For the baseline SDS model, we adopt the DreamFusion camera parameters. Figure A2 shows examples of results incorporating $\mathcal{C}_{\textbf{CLS}}$; results incorporating $\mathcal{C}_{\textbf{MVS}}$ can be found in Section 0.D.

Appendix 0.B Theory of Joint Score Distillation

Given a well-trained text-to-image diffusion model, such as Stable Diffusion, the objective is to distill its knowledge into a 3D representation parameterized by $\theta$, such as a NeRF, while ensuring coherent 3D generations. To achieve this, we aim to model the joint rendering distribution across multiple views of $\theta$.

For ease of notation, we define $\tilde{\bm{x}}$ as the joint random variable comprising $\bm{x}^{1},\ldots,\bm{x}^{V}$, which are rendered images sampled from the 3D representation $\theta$. It is important to note that these views are not independent. In a 3D model, the views are inherently connected as they originate from the same underlying 3D object. This means that the rendered images $\bm{x}^{1},\ldots,\bm{x}^{V}$ exhibit dependencies and correlation.

Denote the joint rendering distribution of $\tilde{\bm{x}}$ as $\tilde{q}^{\theta}$. We can still define the marginal distributions as

$$q^{\theta}(\bm{x}^{i})=\int\tilde{q}^{\theta}(\tilde{\bm{x}})\,d\tilde{\bm{x}}^{-i},$$

where $\tilde{\bm{x}}^{-i}=\{\bm{x}^{1},\ldots,\bm{x}^{i-1},\bm{x}^{i+1},\ldots,\bm{x}^{V}\}$. This marginal distribution is the same as if only a single view is considered, i.e., $V=1$.

We can further define the log density ratio as

$$R(\tilde{\bm{x}})=\log\frac{\tilde{q}^{\theta}(\tilde{\bm{x}})}{\prod_{i=1}^{V}q^{\theta}(\bm{x}^{i})}$$

to capture the inter-relationship among different views. Equivalently, we can write

$$\tilde{q}^{\theta}(\tilde{\bm{x}})=\exp(R(\tilde{\bm{x}}))\prod_{i=1}^{V}q^{\theta}(\bm{x}^{i}).$$
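To make the role of the log density ratio concrete, the following sketch evaluates $R$ for a toy joint distribution over two binary "views". The table `joint` and the helper names are hypothetical. The example only illustrates that $R$ is zero when the views are independent, and that the factorization $\tilde{q}=\exp(R)\prod_i q$ holds by construction.

```python
import math

# Toy joint distribution over two correlated binary views (assumed values).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def marginal(joint, axis):
    """Marginal of one view, obtained by summing out the other view."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def log_density_ratio(joint):
    """R(x) = log( joint(x) / product of marginals ) for every outcome x."""
    m0, m1 = marginal(joint, 0), marginal(joint, 1)
    return {k: math.log(p / (m0[k[0]] * m1[k[1]])) for k, p in joint.items()}

R = log_density_ratio(joint)
```

Here agreeing views receive positive $R$ and disagreeing views negative $R$, which is the kind of inter-view coherence signal the ratio is meant to capture.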

To get the evaluations of $\tilde{\bm{x}}$ from the 2D diffusion model, we have

$$\tilde{p}(\tilde{\bm{x}})\propto\exp(\mathcal{C}(\tilde{\bm{x}}))\prod_{i=1}^{V}p(\bm{x}^{i})$$

since the diffusion model only takes a single image as input and different views are weighted by the introduced joint energy function $\mathcal{C}$.

Now we consider learning $\tilde{q}^{\theta}(\tilde{\bm{x}})$ such that the following Integral Kullback–Leibler (IKL) divergence is minimized along the forward diffusion process $\bm{x}_{t}=\alpha_{t}\bm{x}_{0}+\sigma_{t}\epsilon$, where $\epsilon$ follows a standard Gaussian distribution:

$$\begin{aligned}
\min_{\theta}D_{\mathrm{IKL}}(\tilde{q}^{\theta}(\tilde{\bm{x}})\,\|\,\tilde{p}(\tilde{\bm{x}}))
&=\min_{\theta}\int_{0}^{T}w(t)\frac{\sigma_{t}}{\alpha_{t}}D_{\mathrm{KL}}(\tilde{q}_{t}^{\theta}(\tilde{\bm{x}})\,\|\,\tilde{p}_{t}(\tilde{\bm{x}}))\,dt\\
&=\min_{\theta}\int_{0}^{T}w(t)\frac{\sigma_{t}}{\alpha_{t}}\mathbb{E}_{\tilde{\bm{x}}_{t}\sim\tilde{q}_{t}^{\theta}}\left(\log\frac{\tilde{q}_{t}^{\theta}(\tilde{\bm{x}}_{t})}{\tilde{p}_{t}(\tilde{\bm{x}}_{t})}\right)dt.
\end{aligned}$$

Taking the gradient with respect to $\theta$ gives

$$\begin{aligned}
\frac{\partial}{\partial\theta}D_{\mathrm{IKL}}(\tilde{q}^{\theta}(\bm{x})\,\|\,\tilde{p}(\bm{x}))
&=\int_{0}^{T}w(t)\frac{\sigma_{t}}{\alpha_{t}}\frac{\partial}{\partial\theta}\mathbb{E}_{\tilde{\bm{x}}_{t}\sim\tilde{q}_{t}^{\theta}}\left(\log\frac{\tilde{q}_{t}^{\theta}(\tilde{\bm{x}}_{t})}{\tilde{p}_{t}(\tilde{\bm{x}}_{t})}\right)dt\\
&=\int_{0}^{T}w(t)\frac{\sigma_{t}}{\alpha_{t}}\mathbb{E}_{\tilde{\bm{x}}_{t}\sim\tilde{q}_{t}^{\theta}}\left[\frac{\partial}{\partial\tilde{\bm{x}}_{t}}\left(\log\frac{\tilde{q}_{t}^{\theta}(\tilde{\bm{x}}_{t})}{\tilde{p}_{t}(\tilde{\bm{x}}_{t})}\right)\frac{\partial\tilde{\bm{x}}_{t}}{\partial\theta}+\frac{\partial}{\partial\theta}\log\tilde{q}_{t}^{\theta}(\bm{x})\Big|_{\bm{x}=\tilde{\bm{x}}_{t}}\right]dt\\
&:=A+B.
\end{aligned}$$

The term $B$ vanishes since

$$\begin{aligned}
B&=\int_{0}^{T}w(t)\frac{\sigma_{t}}{\alpha_{t}}\mathbb{E}_{\tilde{\bm{x}}_{t}\sim\tilde{q}_{t}^{\theta}}\frac{\partial}{\partial\theta}\log\tilde{q}_{t}^{\theta}(\bm{x})\Big|_{\bm{x}=\tilde{\bm{x}}_{t}}dt\\
&=\int_{0}^{T}w(t)\frac{\sigma_{t}}{\alpha_{t}}\mathbb{E}_{\tilde{\bm{x}}_{t}\sim\tilde{q}_{t}^{\theta}}\frac{\frac{\partial}{\partial\theta}\tilde{q}_{t}^{\theta}(\bm{x})\big|_{\bm{x}=\tilde{\bm{x}}_{t}}}{\tilde{q}_{t}^{\theta}(\tilde{\bm{x}}_{t})}dt\\
&=\int_{0}^{T}w(t)\frac{\sigma_{t}}{\alpha_{t}}\int\frac{\partial}{\partial\theta}\tilde{q}_{t}^{\theta}(\bm{x})\Big|_{\bm{x}=\tilde{\bm{x}}_{t}}d\tilde{\bm{x}}_{t}\,dt\\
&=\int_{0}^{T}w(t)\frac{\sigma_{t}}{\alpha_{t}}\frac{\partial}{\partial\theta}\int\tilde{q}_{t}^{\theta}(\bm{x})\,d\bm{x}\,dt\\
&=0,
\end{aligned}$$

where the last step uses $\int\tilde{q}_{t}^{\theta}(\bm{x})\,d\bm{x}=1$ for every $\theta$.
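The vanishing of $B$ is the standard fact that the expected derivative of a log-density with respect to its own parameter is zero. A minimal numerical check on a toy categorical family is sketched below. The softmax parameterization and the names `softmax_probs` and `expected_param_score` are assumptions for illustration and are unrelated to the rendering distribution in the paper.

```python
import math

def softmax_probs(theta, feats):
    """Toy family q_theta(x) proportional to exp(theta * feats[x])."""
    weights = [math.exp(theta * f) for f in feats]
    z = sum(weights)
    return [w / z for w in weights]

def expected_param_score(theta, feats, h=1e-5):
    """E_{x ~ q_theta}[ d/dtheta log q_theta(x) ] via central differences."""
    q = softmax_probs(theta, feats)
    q_plus = softmax_probs(theta + h, feats)
    q_minus = softmax_probs(theta - h, feats)
    # Derivative of log q_theta(x) w.r.t. theta with the outcome x held fixed.
    dlogq = [(math.log(a) - math.log(b)) / (2 * h) for a, b in zip(q_plus, q_minus)]
    # Expectation under q_theta; analytically this is d/dtheta of sum_x q_theta(x) = 0.
    return sum(p * g for p, g in zip(q, dlogq))
```

For any `theta` and feature list, the returned value is zero up to finite-difference error, mirroring the step that moves $\partial/\partial\theta$ outside the integral of a normalized density.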

The term $A$ is the score distillation loss

$$A=\int_{0}^{T}w(t)\frac{\sigma_{t}}{\alpha_{t}}\mathbb{E}_{\tilde{\bm{x}}_{0}\sim\tilde{q}_{0}^{\theta},\tilde{\epsilon}}\left(\nabla\log\tilde{q}_{t}^{\theta}(\tilde{\bm{x}}_{t})-\nabla\log\tilde{p}_{t}(\tilde{\bm{x}}_{t})\right)\frac{\partial\tilde{\bm{x}}_{t}}{\partial\theta}\,dt,$$

where $\tilde{\epsilon}=(\epsilon^{1},\ldots,\epsilon^{V})$ are the noises along the forward diffusion process. Putting things together, we have

$$\frac{\partial}{\partial\theta}D_{\mathrm{IKL}}(\tilde{q}^{\theta}(\bm{x})\,\|\,\tilde{p}(\bm{x}))=\mathbb{E}_{\tilde{\bm{x}}_{0}\sim\tilde{q}_{0}^{\theta},\tilde{\epsilon},t}\left[w(t)\frac{\sigma_{t}}{\alpha_{t}}\left(\nabla\log\tilde{q}_{t}^{\theta}(\tilde{\bm{x}}_{t})-\nabla\log\tilde{p}_{t}(\tilde{\bm{x}}_{t})\right)\frac{\partial\tilde{\bm{x}}_{t}}{\partial\theta}\right].$$

Notice that the NeRF rendering is a deterministic process given the view information. Therefore, the conditional distribution and marginal distribution coincide, i.e.,

$$\tilde{q}_{t}^{\theta}(\tilde{\bm{x}}_{t})\sim N(\alpha_{t}\tilde{\bm{x}}_{0},\sigma_{t}^{2}),\quad\nabla\log\tilde{q}_{t}^{\theta}(\tilde{\bm{x}}_{t})=-\tilde{\epsilon}/\sigma_{t}.$$
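The score identity above follows from differentiating the Gaussian log-density: $\nabla_{\bm{x}_t}\log N(\bm{x}_t;\alpha_t\bm{x}_0,\sigma_t^2)=-(\bm{x}_t-\alpha_t\bm{x}_0)/\sigma_t^2$, and $\bm{x}_t-\alpha_t\bm{x}_0=\sigma_t\epsilon$. A small numerical sketch is given below. The helper names are hypothetical and the vectors stand in for flattened images.

```python
import random

def forward_diffuse(x0, alpha, sigma, rng):
    """Sample x_t = alpha * x0 + sigma * eps with eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [alpha * x + sigma * e for x, e in zip(x0, eps)]
    return x_t, eps

def gaussian_score(x_t, mean, sigma):
    """Gradient w.r.t. x_t of log N(x_t; mean, sigma^2 I)."""
    return [-(x - m) / sigma**2 for x, m in zip(x_t, mean)]

rng = random.Random(0)
x0 = [0.2, -1.0, 0.5]
alpha, sigma = 0.8, 0.6
x_t, eps = forward_diffuse(x0, alpha, sigma, rng)
score = gaussian_score(x_t, [alpha * v for v in x0], sigma)
# Each entry of `score` matches -eps / sigma up to floating-point error.
```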

On the other hand, direct score matching tells us that

$$\nabla\log p_{t}(\bm{x}^{i}_{t})=\frac{\partial\mathcal{C}(\tilde{\bm{x}})}{\partial\bm{x}^{i}_{t}}-\hat{\epsilon}_{\Phi}(\bm{x}^{i}_{t},t)/\sigma_{t}.$$

Finally, combining $\frac{\partial\bm{x}_{t}^{i}}{\partial\theta}=\alpha_{t}\frac{\partial\bm{x}_{0}^{i}}{\partial\theta}$, we have

$$\frac{\partial}{\partial\theta}D_{\mathrm{IKL}}(\tilde{q}^{\theta}(\bm{x})\,\|\,\tilde{p}(\bm{x}))=\mathbb{E}_{\tilde{\bm{x}}_{0}\sim\tilde{q}_{0}^{\theta},\tilde{\epsilon},t}\left[w(t)\sum_{i=1}^{V}\left(\hat{\epsilon}_{\Phi}(\bm{x}^{i}_{t},t)-\frac{\partial\mathcal{C}(\tilde{\bm{x}})}{\partial\bm{x}^{i}_{t}}-\epsilon^{i}\right)\frac{\partial\bm{x}_{0}^{i}}{\partial\theta}\right].\tag{9}$$
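A single-sample estimate of the gradient in Eq. (9) can be sketched as follows. This is an illustrative toy, not the JointDreamer implementation: images are plain Python lists, the Jacobians $\partial\bm{x}_0^i/\partial\theta$ are passed in explicitly, and the pairwise energy $\mathcal{C}=-\tfrac{1}{2}\sum_{i<j}\|\bm{x}^i-\bm{x}^j\|^2$ is a hypothetical stand-in for the view-aware models used as energy functions. The names `jsd_gradient` and `toy_energy_grad` are assumptions.

```python
def toy_energy_grad(views):
    """dC/dx^i for the assumed toy energy C = -0.5 * sum_{i<j} ||x^i - x^j||^2."""
    grads = []
    for i, xi in enumerate(views):
        g = [0.0] * len(xi)
        for j, xj in enumerate(views):
            if j != i:
                # Each other view pulls x^i toward itself.
                g = [gk - (a - b) for gk, a, b in zip(g, xi, xj)]
        grads.append(g)
    return grads

def jsd_gradient(eps_hat, energy_grad, eps, jac, w_t):
    """One-sample estimate of Eq. (9).

    eps_hat, energy_grad, eps: V per-view vectors of length D.
    jac: V Jacobians d x_0^i / d theta, each with D rows and P columns.
    Returns a list with P entries.
    """
    num_params = len(jac[0][0])
    grad = [0.0] * num_params
    for e_hat, dc, e, J in zip(eps_hat, energy_grad, eps, jac):
        # Per-view residual: predicted noise minus energy gradient minus true noise.
        residual = [a - b - c for a, b, c in zip(e_hat, dc, e)]
        for d, r in enumerate(residual):
            for p in range(num_params):
                grad[p] += w_t * r * J[d][p]
    return grad
```

Setting `energy_grad` to zeros recovers the per-view SDS residual $\hat{\epsilon}_{\Phi}-\epsilon^{i}$ summed over views, so the energy gradient is the only term that couples the views in this sketch.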

Now we have finished extending SDS to multiple views. As it turns out, the joint energy term $R(\tilde{\bm{x}})$ does not show up in the gradient formula.

Appendix 0.C Additional Ablation Study

Figure A3: Comparison with SweetDreamer. SweetDreamer suffers from multi-faces (left) and missing components such as "legs" and "eyes" (right).

0.C.1 Comparison with SweetDreamer

We also compare with SweetDreamer [23]. SweetDreamer aligns geometric priors (AGP) in a fine-tuned diffusion model and combines AGP with SDS to address the Janus issue. In contrast, JSD improves the optimization objective of SDS with various energy functions, of which AGP can be one. For 3D generation, Fig. 10 shows that a simple combination such as SweetDreamer’s uses more memory and makes it harder to balance the components. Compared with SweetDreamer’s demos from its website, our JointDreamer achieves better shape and text congruence, without multi-faces or missing components ("arms", "big eyes"), as shown in Fig. A3.

Figure A4: Comparison with Image-to-3D methods. Compared with two alternative methods, all employing the Zero-1-to-3 XL model, our proposed JSD exhibits superior generative quality in novel view synthesis as evidenced by its geometric consistency.

0.C.2 Discussions on Image-to-3D Methods

Since view-aware models can also take part in 3D generation through SDS rather than JSD, we make comparisons to showcase the advantage of JSD. Section 5.2 details the comparison with MVDream; here, we extend it to different uses of the image-to-image translation model Zero-1-to-3 XL, which excels in image-to-3D tasks. Unlike text-to-3D approaches, which generate 3D models from textual descriptions, image-to-3D methods use a reference image to fix the reference view and generate the remaining views. As shown in Fig. A4, we take as the reference image the front-view rendering of “A DSLR photo of a squirrel playing guitar” from Fig. A6 and compare with two alternative uses of Zero-1-to-3 XL. (i) Zero-1-to-3 XL [26], which directly uses Zero-1-to-3 XL to compute the SDS loss for novel rendered views conditioned on the reference view. Because Zero-1-to-3 XL tends to overfit and generalizes poorly, generative quality drops, especially for views distant from the reference view. (ii) Magic123 [qian2023magic123], which merges the SDS losses of SD-V2.1 and Zero-1-to-3 XL into one objective function. By drawing on the generalizability of the original diffusion model, it can alleviate the distortion in novel views, but the result is still not satisfactory. By contrast, our JSD achieves better generation quality in novel views, with a more reasonable overall geometric structure. Notably, when applying JSD to image-to-3D generation, we compute the inter-view coherence between the reference view and random novel views so as to fix the reference view, which differs from the two random novel views used in text-to-3D generation. These comparisons further illustrate that JSD provides an effective way to combine the generalizability of 2D models with the geometric understanding of 3D-aware models.

0.C.3 Discussion on Failure Cases

Despite JointDreamer’s impressive performance in handling detailed descriptions and multi-object combinations in long texts (as depicted in Fig. 1 of the main paper), it faces difficulties in comprehending complex relationships among objects. Specifically, it struggles to grasp relative spatial arrangements and hierarchical dependencies, as evidenced in Fig. A5. Exploring the use of larger diffusion models, such as SDXL [podell2023sdxl], may offer a potential solution to overcome these limitations.

Figure A5: Failure Cases on MS-COCO Subset.
Figure A6: More comparison of text-to-3D generation.
Figure A7: More comparison of text-to-3D generation.
Figure A8: More comparison of text-to-3D generation.
Figure A9: More results of JointDreamer.

Appendix 0.D Additional Results of JointDreamer

We present more comparisons of text-to-3D generation in Figs. A6, A7, and A8. The results indicate that JointDreamer outperforms current text-to-3D generation methods in generation fidelity, geometric consistency, and text congruence. This further validates the effectiveness and generalization of the proposed JSD. We also provide more images and normal maps from additional generated results in Fig. A9, demonstrating the generalizability of JointDreamer to arbitrary textual descriptions.

Appendix 0.E Janus Prompts.

Our list of 16 Janus prompts is shown below:

"a blue jay standing on a large basket of rainbow macarons",

"a confused beagle sitting at a desk working on homework",

"Albert Einstein with grey suit is riding a moto",

"a panda rowing a boat in a pond",

"a wide angle zoomed out DSLR photo of a skiing penguin wearing a puffy jacket",

"a zoomed out DSLR photo of a baby monkey riding on a pig",

"a zoomed out DSLR photo of a fox working on a jigsaw puzzle",

"a DSLR photo of a pigeon reading a book",

"a DSLR photo of a cat lying on its side batting at a ball of yarn",

"A crocodile playing a drum set",

"a rabbit cutting grass with a lawnmower",

"A red dragon dressed in a tuxedo and playing chess",

"a zoomed out DSLR photo of a bear playing electric bass",

"A bald eagle carved out of wood, more detail",

"A pig wearing back pack",

"a lemur drinking boba".