CollaFuse: Collaborative Diffusion Models

Simeon Allmendinger Equal contribution. University of Bayreuth Fraunhofer FIT Domenique Zipperling University of Bayreuth Fraunhofer FIT Lukas Struppek Technical University of Darmstadt German Research Center for Artificial Intelligence (DFKI) Niklas Kühl University of Bayreuth Fraunhofer FIT
Abstract

In the landscape of generative artificial intelligence, diffusion-based models have emerged as a promising method for generating synthetic images. However, the application of diffusion models poses numerous challenges, particularly concerning data availability, computational requirements, and privacy. Traditional approaches to address these shortcomings, like federated learning, often impose significant computational burdens on individual clients, especially those with constrained resources. In response to these challenges, we introduce a novel approach for distributed collaborative diffusion models inspired by split learning. Our approach facilitates collaborative training of diffusion models while alleviating client computational burdens during image synthesis. This reduced computational burden is achieved by retaining data and computationally inexpensive processes locally at each client while outsourcing the computationally expensive processes to shared, more efficient server resources. Through experiments on the common CelebA dataset, our approach demonstrates enhanced privacy by reducing the necessity for sharing raw data. These capabilities hold significant potential across various application areas, including the design of edge computing solutions. Thus, our work advances distributed machine learning by contributing to the evolution of collaborative diffusion models.

1 Introduction

Figure 1: Our approach for collaborative image synthesis splits the denoising process between the server and clients. Based on client-specific conditioning, the first denoising steps are run on a trusted server, while the remaining denoising steps are run locally. Thereby, external resources can be utilized while keeping the clients’ raw data private.

Recently developed generative artificial intelligence (GenAI) methods exhibit astonishing results in generating images, among other modalities like music [26] and video [41, 3]. Recent advancements primarily rely on diffusion models [15, 42] that generate synthetic images from random noise through iterative denoising steps. We formally introduce diffusion models and related work in Sec. 2.1. In contrast to more traditional approaches [21, 10], diffusion models excel in providing high sample quality and strong mode coverage [20, 29, 6]. However, the strides in GenAI require large amounts of data, and the generation process itself is computationally expensive due to the multiple required denoising steps.

While large companies possess the necessary data and computational resources for training diffusion models, smaller organizations and private clients may face challenges in providing the required resources, limiting their ability to train and implement recent GenAI models. This might lead to high dependency on a few key players and even prevent the application of such models completely due to local data protection regulations. For example, the popular Stable Diffusion v2 has been trained on 256 A100 GPUs for about 200,000 hours [33]. But even smaller models still set notably high hardware and data requirements.

To address these constraints, organizations and clients can join forces to train machine learning models collaboratively with other clients in a decentralized way. A prominent representative of collaborative training is federated learning [27]. Conceptually, each client trains an individual model locally on their private data. After performing a certain number of training steps, the model parameters are sent to the server to build a global model by incorporating the individual weights. The server then makes this global model accessible to all clients. Yet, the necessity for each client to train and share an entire model remains, still entailing high computational resources for each client [45] and potentially introducing new privacy risks [40, 50].

As an alternative learning paradigm, split learning [11] supports collaborative model training by splitting the model into server-side and client-side components. In the conventional setup, clients share only intermediate network activations with the server, where the final computations are performed. Unlike federated learning, split learning reduces the computational load on clients and enhances privacy protection by sharing only intermediate representations instead of raw data or model weights. We provide a comprehensive overview of collaboratively training generative models in Sec. 2.2.
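As a minimal sketch of the conventional split learning setup described above, the following Python snippet lets a client compute the first layer of a toy network and hand only the resulting activations to a server, which finishes the forward pass. The two-layer network, the function names `client_forward` and `server_forward`, and all weight values are illustrative assumptions and are not taken from any cited method.

```python
def client_forward(x, client_weights):
    """Client side: linear layer followed by ReLU, computed on the private input x."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in client_weights]


def server_forward(activations, server_weights):
    """Server side: final linear layer; it only ever sees the activations."""
    return sum(w * a for w, a in zip(server_weights, activations))


private_input = [1.0, 2.0]                   # raw data, stays on the client
client_weights = [[1.0, 0.0], [-1.0, 0.0]]   # hypothetical client-side layer
server_weights = [2.0, 3.0]                  # hypothetical server-side layer

activations = client_forward(private_input, client_weights)  # sent to the server
prediction = server_forward(activations, server_weights)
```

Only `activations` crosses the client boundary in this sketch; neither `private_input` nor `client_weights` is shared.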

Our Approach: To address the data, computation, and privacy requirements of diffusion-based GenAI, we present our collaborative diffusion models in Sec. 3. We introduce a novel collaborative learning and inference approach tailored specifically for diffusion models. Drawing inspiration from the split learning framework, our approach divides the iterative denoising steps of diffusion models into two components. The initial denoising steps are computed by a shared model on a server, with limited information disclosure due to the inherent noise in the training data and generated samples. The client’s model then performs the remaining denoising steps, which are usually significantly fewer than the denoising steps on the server side.

Our collaborative diffusion approach also allows for personalized image conditioning by incorporating attribute labels during the generation. Our empirical results demonstrate that our collaborative diffusion approach improves the image quality compared to a setting where each client trains its own local diffusion model. By sharing a server model that performs most of the computationally heavy denoising steps, the computational burdens for each client are comparably small. At the same time, clients can better approximate their individual data distribution, which enables them to generate better characteristic features.

The proportion of denoising steps carried out on the server and client sides, respectively, is controlled by a single parameter called cut point. The higher the cut point, the more steps are computed on the client side. Whereas our approach is in principle also applicable to other diffusion model architectures, we focus our experiments in Sec. 4 on the common Denoising Diffusion Probabilistic Models (DDPM) [15]. In summary, we make the following contributions:

  1. We introduce the first collaborative diffusion model, which consists of a shared server component trained by multiple clients without revealing their original training data to other clients or the server.

  2. Collaborative diffusion models allow clients to outsource most of their computationally expensive denoising steps during training and inference to a shared server model.

  3. Our collaborative diffusion models improve the image quality compared to the setting where each client trains a local diffusion model on its own data.

2 Background and Related Work

We start by introducing diffusion models for generative image synthesis. We further describe related distributed collaborative machine learning approaches, such as federated and split learning, and their utilization for image synthesis or generative AI more generally.

2.1 Diffusion Models

Denoising Diffusion Probabilistic Models (DDPMs) [15] mark a significant advancement in generative image synthesis and consist of a diffusion process and a denoising process. The diffusion process is a Markov chain with $T$ timesteps that transforms a training image $x_0$ into a noisy image $x_T$ that follows a standard Gaussian distribution. The diffusion process of an image $x_t$ at time step $t$ is mathematically defined as

$$x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\, \epsilon, \quad \text{with } t = 1, \ldots, T. \tag{1}$$

Here, $\alpha_t$ denotes the variance schedule, and $\epsilon$ is the added Gaussian noise. A denoising network $\epsilon_\theta(x_t, t)$ with parameters $\theta$ is then trained to reverse the diffusion process and predict the noise added to the sample $x_t$ during time step $t$.
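The forward step in Eq. 1 can be illustrated with a small Python sketch. It treats an image as a flat list of floats to stay dependency-free, whereas a practical implementation would use a tensor library. The function name `diffusion_step` and the schedule values are hypothetical.

```python
import math
import random


def diffusion_step(x_prev, alpha_t, rng):
    """Apply one forward step of Eq. 1 to a flattened image (list of floats)."""
    signal_scale = math.sqrt(alpha_t)
    noise_scale = math.sqrt(1.0 - alpha_t)
    # Each pixel receives its own independent standard Gaussian noise sample.
    return [signal_scale * v + noise_scale * rng.gauss(0.0, 1.0) for v in x_prev]


rng = random.Random(0)
x = [1.0, -1.0, 0.5, 0.0]     # toy image x_0 with four pixels
alphas = [0.99, 0.98, 0.97]   # hypothetical values for alpha_1, alpha_2, alpha_3
for alpha_t in alphas:
    x = diffusion_step(x, alpha_t, rng)  # after the loop, x plays the role of x_3
```

With `alpha_t` close to 1, each step perturbs the image only slightly; repeating the step many times gradually replaces the signal with noise.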

Most denoising networks are built upon the common U-Net [34] architecture. With the denoising network, the image generation process, which iteratively removes the predicted noise $\epsilon_\theta(x_t, t)$ from the noisy sample $x_t$, can be defined as

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) \tag{2}$$

with $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.
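A minimal Python sketch of the update in Eq. 2 is given below. The predicted noise is passed in as a plain list, standing in for the output of a trained U-Net, which is not modeled here. The helper names `cumulative_alphas` and `denoise_step` are assumptions for illustration.

```python
import math


def cumulative_alphas(alphas):
    """Return alpha_bar_t, the running product of alpha_1..alpha_t, for t = 1..T."""
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return alpha_bars


def denoise_step(x_t, predicted_noise, alpha_t, alpha_bar_t):
    """Eq. 2: remove the predicted noise from x_t to obtain x_{t-1}."""
    coeff = (1.0 - alpha_t) / math.sqrt(1.0 - alpha_bar_t)
    return [(x - coeff * e) / math.sqrt(alpha_t) for x, e in zip(x_t, predicted_noise)]


# Sanity check at t = 1, where alpha_bar_1 equals alpha_1: if the noise is
# predicted perfectly, the step recovers the original image x_0.
alpha_1 = 0.9
x_0 = [0.5, -0.25]
eps = [0.3, -1.2]
x_1 = [math.sqrt(alpha_1) * v + math.sqrt(1.0 - alpha_1) * e for v, e in zip(x_0, eps)]
recovered = denoise_step(x_1, eps, alpha_1, cumulative_alphas([alpha_1])[0])
```

The check works because, for $t = 1$, the coefficient $(1-\alpha_1)/\sqrt{1-\bar{\alpha}_1}$ reduces to $\sqrt{1-\alpha_1}$, which is exactly the noise scale used in Eq. 1.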

Based on the idea of Markov chains, the distribution of the intermediate noise predictions in the denoising process $p_\theta(\cdot)$ is defined by

$$p_\theta(x_{0:T}) = p(x_T) \cdot \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t) \tag{3}$$

with $p(x_T) = \mathcal{N}(x_T; \mathbf{0}, \mathbf{I})$.

In DDPMs, the iterative application of U-Nets across $T$ timesteps is a fundamental characteristic, enabling the model to refine noisy data into structured outputs progressively. The Imagen model [36], as one of the most recognized text-conditioned diffusion models in the community, builds upon DDPMs. The authors employ a frozen text encoder and dynamic thresholding to generate photorealistic images conditioned by text prompts $y$. The loss function of the Imagen model is expressed by

$$\mathcal{L}_{\text{Imagen}} = \sum_{t=1}^{T} \omega_t \cdot \|\epsilon_\theta(x_t, t, y) - \epsilon\|_2^2. \tag{4}$$

In this context, $\omega_t$ is the guidance weight, which is integral to the denoising process. This guidance weight modulates the influence of the predicted noise $\epsilon_\theta(x_t, t, y)$ at each timestep $t$, enabling precise control over the image generation process, particularly in maintaining fidelity to the target distribution. For simplicity, we leave the explicit embedding process out of our notation and implicitly assume that all text labels have already been embedded before feeding the embeddings into the U-Net $\epsilon_\theta(\cdot)$.
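To make the structure of Eq. 4 concrete, the sketch below evaluates the weighted sum of squared noise-prediction errors for given predictions. The predictions are supplied as plain lists in place of a conditioned U-Net, and the function name `weighted_noise_loss` as well as the numeric values are hypothetical.

```python
def weighted_noise_loss(predicted, target, weights):
    """Sum over timesteps of w_t times the squared L2 error of the noise prediction."""
    total = 0.0
    for eps_pred, eps, w in zip(predicted, target, weights):
        squared_norm = sum((p - e) ** 2 for p, e in zip(eps_pred, eps))
        total += w * squared_norm
    return total


# Two timesteps, two pixels each (toy values).
predicted = [[1.0, 2.0], [0.0, 0.0]]
target = [[0.0, 0.0], [3.0, 4.0]]
weights = [1.0, 0.5]
loss = weighted_noise_loss(predicted, target, weights)  # 1.0 * 5.0 + 0.5 * 25.0
```

In practice, the sum over all timesteps is usually estimated by sampling timesteps per batch, as Alg. 1 does for the client and server models.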

2.2 Federated and Split Learning in Generative AI

Federated learning (FL) and split learning (SL) are among the most prominent approaches for training machine learning models collaboratively on distributed data sources. FL utilizes distributed data, with clients independently training models on their unique datasets. These models are subsequently shared, aggregated, and redistributed. The cycle repeats until the models converge. Conversely, SL divides a model among clients and a central server, decreasing the computational load on clients. Moreover, clients have the option to use FL for model aggregation, leading to the development of SplitFed learning [45]. These techniques have found applications in diverse fields such as the automotive industry [38], energy management [37], and healthcare [18], where they are combined with both discriminative [28] and generative AI approaches [39].
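The aggregation step of the FL cycle can be illustrated with a short Python sketch. It assumes a weighted average of client parameters by local dataset size, which is one common aggregation rule and not necessarily the one used in every cited work. The function name `aggregate` and the toy parameter vectors are hypothetical.

```python
def aggregate(client_weights, client_sizes):
    """Average client parameter vectors, weighting each by its local dataset size."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Two clients with two parameters each; the second client holds three times more data.
client_weights = [[1.0, 2.0], [3.0, 6.0]]
client_sizes = [1, 3]
global_weights = aggregate(client_weights, client_sizes)  # redistributed to all clients
```

Each round of the cycle would repeat local training on the redistributed `global_weights` followed by another aggregation.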

Especially for image synthesis, FL and SL possess significant potential owing to the extensive volume of data involved. Before diffusion models took over as the predominant architecture for image synthesis, Generative Adversarial Networks (GANs) [10] were the most common network architecture. GANs are composed of two components: a generator for generating images and a discriminator trained to distinguish between real and synthetic images. Existing research on collaborative training of GANs demonstrates different ways of integrating the two components into the FL process. Hardy et al. [12] introduce FL-GAN, which adopts the standard FL process for discriminators and generators alike. This vanilla approach is compared to the proposed MD-GAN, in which FL is applied only to the discriminator, while the generator is trained directly by a server. Expanding upon this foundation, Fan and Liu [8] empirically analyzed different strategies for synchronizing the discriminators and generators across clients in FL. Their analysis demonstrates that the best results are achieved when synchronizing both the discriminator and generator across clients. Li et al. [23] improved FL-GANs by employing maximum mean discrepancy for generator updates. Moreover, follow-up research [22, 49] has combined FL and SL to train GANs collaboratively.

Furthermore, there have been efforts to reduce privacy risks for GANs in FL settings. Augenstein et al. [1] proposed a novel algorithm for differentially private federated GANs, while Veeraragavan et al. [47] combine consortium blockchains and an efficient secret sharing algorithm to address trust-related weaknesses in existing solutions. Although Ohta et al. [30] do not focus on distributed learning, they offer a solution for privacy-preserving SL of GANs that can be expanded for collaborative learning.

In the domain of diffusion models, research on collaborative training methods is still scarce. Jothiraj and Mashhadi [19] made a first step and introduced the Phoenix technique for training unconditional diffusion models in a horizontal FL setting. Their objective is to address mode coverage issues often seen in distributed datasets that are not independent and identically distributed. Their data-sharing approach boosts performance by sharing only 4-5% of the data among clients, minimizing communication overhead. Personalization and threshold filtering techniques outperform comparison methods in terms of precision and recall but fall short in image quality compared to the proposed technique. The paper suggests further exploration to enhance image quality in future work.

Moreover, the potential of FL for AI-generated content, especially for DDPMs, was demonstrated by Huang et al. [17]. The authors discuss three different approaches for diffusion models in FL settings. The first is a parallel approach that mimics conventional FL. The second is a separate split approach that combines FL with SL. As a third solution, the authors discuss a sequential approach in which one client receives the current model from the server, trains the model on its data, and then transmits the current version to the next client. The trained model returns to the server only after every client has trained the model once. Based on the sequential FL approach, a LoRA-based [16] federated fine-tuning scheme is designed and examined in more detail, demonstrating the advantages of faster convergence and reduced memory consumption during the tuning process.

By mainly focusing on FL for GANs [24], the current literature has so far neglected the benefits of other collaborative paradigms and GenAI architectures. Combining DDPMs and SL promises various benefits, including reducing local resource requirements and increasing data privacy. Our proposed approach for distributed collaborative image synthesis with diffusion models taps into these advantages and combines the research areas of diffusion models and collaborative learning.

3 Collaborative Diffusion Models

We now formally introduce our novel approach for enabling collaborative image generation with diffusion models. In our setting, a certain number $k$ of clients $c \in C = \{c_1, c_2, \ldots, c_k\}$ want to collaboratively train a diffusion model for image synthesis. Although we assume that each client has a dataset from a similar domain, e.g., facial images, the specific feature distribution may differ. To stay with the facial image example, client A may have a dataset of facial images with eyeglasses, whereas client B’s dataset consists only of faces without eyeglasses. All clients now want to train a shared U-Net $\epsilon_\theta^s$ on the server that is available to each client and computes the initial denoising steps. Additionally, each client $c \in C$ trains an individual U-Net $\epsilon_\theta^c$ that is maintained locally and computes solely the remaining denoising steps. For notational simplicity, we assume that $\theta$ denotes the weights of each individual model, so there exist no shared weights between clients and the server.

The computational split between server and clients is manually set by the cut point $t_\zeta \in [0, T]$ that specifies the number of denoising steps performed on the client side after $T - t_\zeta$ steps were computed by the shared server model. The cut point is set as a hyperparameter and kept fixed during training and inference. For $t_\zeta = 0$, all denoising steps are computed by the server, which is trained on the joint set of all clients’ data. For $t_\zeta = T$, each client trains an individual diffusion model on its data that performs all denoising steps without any shared server model. The approximated data distribution of our collaborative denoising approach is formalized in Equation 5 with $p(x_T) = \mathcal{N}(x_T; \mathbf{0}, \mathbf{I})$:

$$p_{\theta^s, \theta^c}(x_{0:T}) = p(x_T) \cdot \prod_{t=1}^{t_\zeta} p_{\theta^s}(x_{t-1} \mid x_t) \cdot \prod_{t=t_\zeta}^{T} p_{\theta^c}(x_{t-1} \mid x_t). \tag{5}$$

Here, the first product operator describes the distribution approximated by the server model with weights $\theta^s$, and the second product operator consequently defines the distribution approximated by the client model with weights $\theta^c$.
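The role of the cut point can be sketched in Python as follows. The sketch follows the prose description: the server computes the first $T - t_\zeta$ denoising steps, and the client computes the remaining $t_\zeta$ steps. It is a simplification that assumes one model call per integer timestep and hands the first $T - t_\zeta$ timesteps, counted down from $T$, to the server. The names `split_timesteps` and `collaborative_denoise` and the stand-in step functions are hypothetical.

```python
def split_timesteps(T, t_cut):
    """Return (server_steps, client_steps), each listed in denoising order."""
    if not 0 <= t_cut <= T:
        raise ValueError("cut point must lie in [0, T]")
    server_steps = list(range(T, t_cut, -1))  # T, T-1, ..., t_cut + 1
    client_steps = list(range(t_cut, 0, -1))  # t_cut, t_cut - 1, ..., 1
    return server_steps, client_steps


def collaborative_denoise(x_T, T, t_cut, server_step, client_step):
    """Run the server steps first, then hand the intermediate sample to the client."""
    x = x_T
    server_steps, client_steps = split_timesteps(T, t_cut)
    for t in server_steps:
        x = server_step(x, t)
    # At this point x corresponds to the noisy sample at the cut point.
    for t in client_steps:
        x = client_step(x, t)
    return x


# Stand-in step functions that only record who processed which timestep.
trace = collaborative_denoise(
    [], 4, 1,
    server_step=lambda x, t: x + [("server", t)],
    client_step=lambda x, t: x + [("client", t)],
)
```

Setting `t_cut = 0` leaves the client list empty, so the server performs every step, while `t_cut = T` leaves the server list empty, matching the two extreme cases described above.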

Algorithm 1 Collaborative Training
Require: training dataset $D$; batch size $b$; number of time steps $T$; cut point $t_\zeta$; variance scheduler $\alpha$; noise scheduler $\sigma$; clients $C$; server $s$
  client dataset $D_c \subseteq D$
  while not converged do
    for $c \in C$ do
      for each batch $\{(x_0, y)\}^b \subseteq D_c$ do
        *** CLIENT NODE ***
        $t^c \sim \mathcal{U}[1, t_\zeta]^b$ and $t^s \sim \mathcal{U}[t_\zeta, T]^b$
        $\epsilon^c \sim \mathcal{N}(0, \mathbf{I})$, $\epsilon^s \sim \mathcal{N}(0, \mathbf{I})$
        $x_{t^c} \leftarrow \alpha(t^c) \cdot x_0 + \sigma(t^c) \cdot \epsilon^c$
        $x_{t_\zeta} \leftarrow \alpha(t_\zeta) \cdot x_0 + \sigma(t_\zeta) \cdot \epsilon^c$
        $x_{t^s} \leftarrow \alpha(t^s) \cdot x_{t_\zeta} + \sigma(t^s) \cdot \epsilon^s$
        $\mathcal{L}_{t^c} = \omega_{t^c} \cdot \|\epsilon_{\theta^c}(x_{t^c}, t^c, y) - \epsilon^c\|_2^2$
        Update $\theta^c$
        Pass $x_{t^s}$ and $\epsilon^s$ to server $s$
        *** SERVER NODE ***
        $\mathcal{L}_{t^s} = \omega_{t^s} \cdot \|\epsilon_{\theta^s}(x_{t^s}, t^s, y) - \epsilon^s\|_2^2$
        Update $\theta^s$
      end for
    end for
  end while

3.1 Collaborative Training

During training, client models ϵθcsubscriptsuperscriptitalic-ϵ𝑐𝜃\epsilon^{c}_{\theta}italic_ϵ start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT and the server model ϵθssubscriptsuperscriptitalic-ϵ𝑠𝜃\epsilon^{s}_{\theta}italic_ϵ start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT are updated independently. Alg. 1 provides our collaborative training procedure in pseudocode, which we now describe in more detail. For training, each client cC𝑐𝐶c\in Citalic_c ∈ italic_C has access to a private dataset Dc={(x0i,yi)}subscript𝐷𝑐subscriptsuperscript𝑥𝑖0superscript𝑦𝑖D_{c}=\{(x^{i}_{0},y^{i})\}italic_D start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT = { ( italic_x start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_y start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) } of images x0isubscriptsuperscript𝑥𝑖0x^{i}_{0}italic_x start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT with optional textual attribute labels yisuperscript𝑦𝑖y^{i}italic_y start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT. In principle, our approach also works with unlabeled data and other kinds of labels, e.g., one-hot encoded label vectors and segmentation maps. However, we focus on the use case of attribute-conditioned image generation in this work. By using textual feature descriptions as labels, our implementation can easily be extended to more elaborated text-guided image synthesis. 
As for the standard diffusion training process, each client samples a training batch {(x0,y)}bDcsuperscriptsubscript𝑥0𝑦𝑏subscript𝐷𝑐\{(x_{0},y)\}^{b}\subseteq D_{c}{ ( italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_y ) } start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT ⊆ italic_D start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT of batch size b𝑏bitalic_b together with client time steps tc𝒰[1,tζ]bsimilar-tosuperscript𝑡𝑐𝒰superscript1subscript𝑡𝜁𝑏t^{c}\sim\mathcal{U}[1,t_{\zeta}]^{b}italic_t start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT ∼ caligraphic_U [ 1 , italic_t start_POSTSUBSCRIPT italic_ζ end_POSTSUBSCRIPT ] start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT during each training step. Gaussian noise is added to each training sample based on tcsuperscript𝑡𝑐t^{c}italic_t start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT following the diffusion process defined in eq. 1. All noisy images are fed into the client’s model to predict the added noise and update the model’s parameters according to the loss function defined in eq. 4. In addition, each client uses the diffused image xtζsubscript𝑥subscript𝑡𝜁x_{t_{\zeta}}italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_ζ end_POSTSUBSCRIPT end_POSTSUBSCRIPT from the cut point and sampled additional server time steps ts𝒰[tζ,T]bsimilar-tosuperscript𝑡𝑠𝒰superscriptsubscript𝑡𝜁𝑇𝑏t^{s}\sim\mathcal{U}[t_{\zeta},T]^{b}italic_t start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT ∼ caligraphic_U [ italic_t start_POSTSUBSCRIPT italic_ζ end_POSTSUBSCRIPT , italic_T ] start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT to provide the noisy images xtssubscript𝑥superscript𝑡𝑠x_{t^{s}}italic_x start_POSTSUBSCRIPT italic_t start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT end_POSTSUBSCRIPT for the server. 
The final noise image and the noise added to $x_{t_\zeta}$ are then used to update the server model’s weights analogously. We note that the process of adding additional noise for the server could, in principle, also be performed on the server side. Importantly, the server only ever has access to samples at the noise level of $t_\zeta$ and never to the initial training samples, which limits the amount of information disclosed to the server.
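The client-side data preparation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard DDPM forward process $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ for eq. 1, assumes a cosine noise schedule, and uses hypothetical helper names (`cosine_alpha_bar`, `diffuse`, `make_training_batches`). The model updates themselves are omitted.

```python
import numpy as np


def cosine_alpha_bar(T: int, s: float = 0.008) -> np.ndarray:
    """Cumulative signal level alpha_bar[t] for t = 0..T (cosine schedule, an assumption)."""
    steps = np.arange(T + 1) / T
    f = np.cos((steps + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]


def diffuse(x_from, abar_from, abar_to, rng):
    """Add Gaussian noise to move each sample from level abar_from to level abar_to."""
    ratio = (abar_to / abar_from).reshape(-1, 1, 1, 1)
    noise = rng.standard_normal(x_from.shape)
    return np.sqrt(ratio) * x_from + np.sqrt(1.0 - ratio) * noise, noise


def make_training_batches(x0, T, t_zeta, rng):
    """Build the client batch (x_{t^c}) and the server batch (x_{t^s}) for one step."""
    abar = cosine_alpha_bar(T)
    b = x0.shape[0]
    clean = np.ones(b)  # alpha_bar of an undiffused image is 1

    # Client: t^c ~ U[1, t_zeta], noise is added directly to x_0.
    t_c = rng.integers(1, t_zeta + 1, size=b)
    x_tc, eps_c = diffuse(x0, clean, abar[t_c], rng)

    # Cut point: x_{t_zeta} is the least noisy image the server may ever see.
    abar_cut = np.full(b, abar[t_zeta])
    x_cut, _ = diffuse(x0, clean, abar_cut, rng)

    # Server: t^s ~ U[t_zeta, T], extra noise is added on top of x_{t_zeta}.
    t_s = rng.integers(t_zeta, T + 1, size=b)
    x_ts, eps_s = diffuse(x_cut, abar_cut, abar[t_s], rng)

    return (x_tc, t_c, eps_c), (x_ts, t_s, eps_s)
```

In this sketch, only the second tuple (noisy images, server time steps, and the noise added on top of $x_{t_\zeta}$) would leave the client; the clean images $x_0$ never do.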

Algorithm 2 Collaborative Inference
Require: number of time steps $T$, label $y$, cut point $t_\zeta$, client $c$, server $s$
  Sample initial noise: $x_T \sim \mathcal{N}(0, 1)$
  $M = \left\lfloor t_\zeta + \frac{t_\zeta}{T} \cdot (T - t_\zeta) \right\rfloor$
  $t^c_{list} = \text{linearly spaced array generator}(1, M, t_\zeta)$
  $t \leftarrow T$
  while $t \geq 1$ do
     if $t > t_\zeta$ then
        Compute $x_{t-1}$ using $\epsilon_{\theta^S}(x_t, t, y)$ and $\alpha(t), \sigma(t)$
     else
        Compute $x_{t-1}$ using $\epsilon_{\theta^C}(x_t, t^c_{list}[t], y)$ and $\alpha(t^c_{list}[t]), \sigma(t^c_{list}[t])$
     end if
     $t \leftarrow t - 1$
  end while

3.2 Collaborative Inference

After training, each client $c \in C$ can send a request to the server containing optional textual attribute labels $y$. The individual steps during inference are specified in Alg. 2. The server first samples initial noisy images $\hat{x}_T \sim \mathcal{N}(0, 1)$ and starts denoising them using $\epsilon_{\theta}^{s}$ for $T - t_\zeta$ steps conditioned on the label $y$. The still noisy samples $\hat{x}_{t_\zeta}^{s}$ are sent to the client $c$, which computes the final $t_\zeta$ denoising steps using its local model $\epsilon_{\theta}^{c}$. To account for the increased amount of noise in $\hat{x}_{t_\zeta}^{s}$ and hence allow for a higher noise reduction on the client node, the variance and noise scheduler are adapted considering the cut point $t_\zeta$. While keeping the total number of timesteps fixed, the maximum value $M$ is defined to adapt the schedulers.
A sufficiently large cut point $t_\zeta$ ensures that possibly sensitive features are generated on the client side while the server performs the less privacy-critical initial denoising steps. If multiple clients request samples from the same label $y$, the server-side denoising process can be run once to generate an intermediate noise sample, and each client solely has to compute the remaining denoising steps.
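The step allocation of Alg. 2 can be sketched in a few lines. The function names below are hypothetical, and the "linearly spaced array generator" is assumed to behave like a standard linspace (inclusive endpoints, real-valued entries); the actual denoising updates are left out, so the sketch only shows which node handles each step and which time value it passes to its model.

```python
from typing import List, Tuple


def client_timesteps(T: int, t_zeta: int) -> List[float]:
    """t_zeta linearly spaced time values from 1 to M (inclusive), as in Alg. 2."""
    if not 0 < t_zeta <= T:
        raise ValueError("expected 0 < t_zeta <= T")
    # M = floor(t_zeta + t_zeta / T * (T - t_zeta)), computed with integers.
    M = t_zeta + (t_zeta * (T - t_zeta)) // T
    if t_zeta == 1:
        return [1.0]
    step = (M - 1) / (t_zeta - 1)
    return [1 + i * step for i in range(t_zeta)]


def inference_plan(T: int, t_zeta: int) -> List[Tuple[str, float]]:
    """For t = T, ..., 1: which node denoises and which time value it uses."""
    t_list = client_timesteps(T, t_zeta) if t_zeta > 0 else []
    plan = []
    for t in range(T, 0, -1):
        if t > t_zeta:
            plan.append(("server", float(t)))
        else:
            # Alg. 2 indexes the list with t starting at 1; Python lists start at 0.
            plan.append(("client", t_list[t - 1]))
    return plan
```

For example, with $T = 1000$ and $t_\zeta = 100$ the sketch yields $M = 190$: the server runs 900 steps, and the client runs the remaining 100 steps with time values stretched over $[1, 190]$, which reflects the higher residual noise in the image it receives.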

4 Experiment

To assess our approach, we implement Alg. 1 and Alg. 2 and train the models on a common benchmark dataset. We simulate a scenario in which $k = 5$ clients use a trusted server to train a DDPM collaboratively. Each client has access to an individual subset from the same domain. Thereby, we investigate the influence of collaborative training on the fidelity of generated images. Furthermore, we analyze the influence of the chosen cut point $t_\zeta$ on sample fidelity with respect to disclosed information.

4.1 Experimental Protocol

To ensure consistent conditions and avoid confounding factors, we maintain identical training and inference hyperparameters and seeds across different runs and settings, if not stated otherwise. Our source code is available at https://github.com/SimeonAllmendinger/collafuse for reproducibility.

Model Architecture: We adopt the network architecture of Imagen [36], which is based on the U-Net architecture [34] to process $64 \times 64$ RGB images. As in the original Imagen implementation, each U-Net model is conditioned on text embeddings computed by a T5-Base model [32]. Unlike Saharia et al. [36], we do not apply any super-resolution model to increase the fidelity of the generated images, as the focus of our work lies on the feasibility of the collaborative training and inference process. However, our collaborative diffusion setting also allows each client to add individual or shared super-resolution models [35] to their pipeline to upscale the generated images.

Datasets: We train and evaluate our collaborative diffusion models on the common CelebA dataset of facial attributes. The CelebA dataset is a large-scale face attributes dataset collected by Liu et al. [25]. It consists of over 200,000 facial images of celebrities, each annotated with 40 binary attribute labels, including attributes such as gender appearance, perceived age, hair color, and facial expressions. The images in the dataset are diverse, featuring celebrities from various ethnicities, ages, and backgrounds.

Table 1: Attribute selection for dataset and clients. Each client dataset comprises samples of two to five attributes, e.g., black hair, brown hair, etc.
Dataset | Client 1 | Client 2 | Client 3 | Client 4 | Client 5
CelebA | Hair colors | Jewelry | Hair cut | Eyebrows | Eyes/Glasses

Training Parameters: Our training protocol consists of ten epochs, employing a learning rate of 0.001, a batch size of 50, a cosine scheduler, and $T = 1000$ timesteps. Each client holds 2,000 training images and 5,000 test images (hold-out dataset) according to an individual attribute group (cf. Tab. 1). The experiments were conducted utilizing an NVIDIA A100-SXM4-40GB for computational processing.
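For orientation, the stated hyperparameters can be collected in a small configuration object. The class and field names below are hypothetical and not taken from the released code; the derived update counts additionally assume that every training image is visited exactly once per epoch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters as stated in the text; names are illustrative only."""
    epochs: int = 10
    learning_rate: float = 1e-3
    batch_size: int = 50
    timesteps: int = 1000
    num_clients: int = 5
    train_images_per_client: int = 2000
    image_size: int = 64

    @property
    def updates_per_epoch(self) -> int:
        # Ceiling division, so a final partial batch would still count as one update.
        return -(-self.train_images_per_client // self.batch_size)

    @property
    def updates_per_client(self) -> int:
        return self.epochs * self.updates_per_epoch
```

Under these assumptions, each client performs 40 parameter updates per epoch and 400 over the full run.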

Figure 2: Comparison between random samples from the training set (top row) and images generated with our collaborative diffusion models trained with cut point $t_\zeta = 100$ (bottom row). Images were not cherry-picked and generated starting with the same initial noise. The results demonstrate that collaboratively training diffusion models can achieve high image quality and attribute fidelity.

Evaluation Metrics: To assess the quality of the generated images, we calculate the common Kernel Inception Distance (KID) [2], the Fréchet Inception Distance (FID) [13], and the Fréchet CLIP Distance (FCD) between the 2,100 real (test dataset) and generated images from each client. We differentiate between images generated by clients from pure Gaussian noise (client-only) and images generated based on the server image at the cut point. All metrics are computed with the implementations provided by Parmar et al. [31] to ensure stable and comparable evaluation results. For all three metrics, lower values indicate a better approximation of the training distribution and improved image quality. In the main paper, we report the FID and FCD values; the KID results are provided in the appendix due to the page limit.
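FID and FCD share the same closed form, the Fréchet distance between two Gaussians fitted to feature embeddings, $\lVert \mu_1 - \mu_2 \rVert^2 + \mathrm{Tr}\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}\big)$, and differ in the feature extractor (Inception versus CLIP). The sketch below illustrates only this formula with NumPy; it is not the implementation of Parmar et al. [31] used for the reported numbers, and the function names are hypothetical. Feature extraction is assumed to have happened beforehand.

```python
import numpy as np


def feature_stats(features: np.ndarray):
    """Mean and covariance of an (n_samples, n_features) embedding matrix."""
    return features.mean(axis=0), np.cov(features, rowvar=False)


def frechet_distance(mu1, sigma1, mu2, sigma2) -> float:
    """Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # Tr((S1 S2)^(1/2)) equals the sum of the square roots of the eigenvalues
    # of S1 S2, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)
```

A score would then be obtained as `frechet_distance(*feature_stats(real_feats), *feature_stats(generated_feats))`, where both arrays hold embeddings from the same extractor.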

As we are interested in the performance of the collaborative system, we calculate the image fidelity across the set of clients to compare the performance for different values of $t_\zeta$. Furthermore, we calculate the fidelity between the partially diffused images at the cut point $t_\zeta$ and original images of the clients, as well as the denoised images from the server model at cut point $t_\zeta$ and further denoised images at $t = 0$. More detailed results are provided in the appendix.

4.2 Experimental Results

Figure 3: Fidelity results for each client using the Fréchet Inception Distance (FID $\downarrow$) and Fréchet CLIP Distance (FCD $\downarrow$): We evaluate 10,500 real $x_0$ and collaboratively generated images ($\hat{x}_0^{c} \circ \hat{x}^{s}_{t_\zeta}(\epsilon)$) of the CelebA dataset across clients. In our experiment, cut points $t_\zeta \leq 300$ outperform the baseline of independent client models ($t_\zeta = 1000$). Moreover, small cut points even outperform the global model ($t_\zeta = 0$), which is trained on all client datasets.

In our experiments, we analyze the performance of collaborative diffusion models by focusing on specific distributed attributes among clients. Fig. 2 displays exemplary generated images of our collaborative approach, comparing them with their attributes and real images from the dataset. Our quantitative analysis includes a comparison with two baselines: the global model ($t_\zeta = 0$) and independent client models ($t_\zeta = 1000$). The single global model on the server node is trained using the combined datasets of all clients, while the independent client models are trained on client-specific distributed sub-datasets and operate separately on the client node. The FID and FCD scores in Fig. 3 show that models with cut points $t_\zeta \leq 300$ surpass the performance of the independent client models, in favor of collaborative image synthesis. Our experiments with smaller cut points even manage to outperform the global model. However, cut points $t_\zeta > 300$ weaken the fidelity of image generation, which converges to the performance of the independent client models for higher cut points. Furthermore, our findings highlight that a client-only approach, in which the client generates images from pure Gaussian noise at the cut point, proves ineffective, particularly at lower cut points, as detailed in the appendix.

In terms of image quality, we observe that the degradation of image fidelity and visual characteristics is evident in images generated by the server as the cut point $t_\zeta$ increases. Fig. 4 displays this deterioration, which affects the images generated by the server at higher cut points. Conversely, the client’s performance in generating images from noise is compromised when operating with low cut points. However, leveraging our collaborative approach, each client can produce meaningful images with improved fidelity, as shown in Fig. 3, making use of denoised images provided by the server. Additionally, our results indicate that incorporating an adjustment of the variance and noise scheduler into the collaborative inference process significantly enhances the denoising capabilities on the client node. This is particularly effective given the higher levels of residual noise in the images received from the server, leading to an improvement in the overall quality of the images. The adoption of the adjusted parameter $M$ in Alg. 2, especially with an increased emphasis on managing variance and noise in the collaborative setting, proves to be beneficial in refining the quality of the final image output.

Figure 4: Samples generated by collaborative diffusion models trained with different cut points $t_\zeta$. The top row depicts images produced by the server, which are then sent to the client. The bottom row shows the samples after the final denoising performed by the client. For $t_\zeta = 0$, the server performs the full denoising process; for $t_\zeta = 1000$, each client trains a separate diffusion model without a server side.

5 Discussion, Limitations, and Future Work

Our findings reveal that collaborative diffusion models produce images of higher quality than those generated by independently trained diffusion models using sub-datasets. In certain instances, they even surpass the performance of centrally trained diffusion models. This not only highlights the efficacy of our method but also demonstrates its capability to tackle the personalization versus generalization challenge frequently encountered in federated learning scenarios. Beyond the aspect of image fidelity, it is pivotal to consider our findings within their broader impact on information disclosure and computational efficiency in distributed machine learning frameworks. Our approach circumvents the need to share raw data or complete model updates, opting instead to share only the diffused images alongside server-side noise. Furthermore, by delegating computationally intensive tasks, our approach significantly reduces the computational load on individual clients, leveraging the potential strengths of growing foundation models on large servers. Consequently, even in cases where image fidelity decreases, clients may still favor a collaborative diffusion model for its ability to lighten their computational load, even in simple one-to-one setups. Thus, our strategy facilitates a balanced optimization of performance, privacy, and computational resources and can be tailored to individual preferences.

Limitations: Our current approach assumes trustworthy clients and an honest server. However, collaborative approaches are known to be susceptible to backdoor attacks [48, 44] in which an adversarial client tries to inject secret functionalities into a collaboratively trained model. Also, existing research has shown that diffusion models are indeed susceptible to backdoor attacks [43, 5]. Our proposed collaborative diffusion models do not account for such adversarial cases. However, our focus lies on the conception of collaborative diffusion models and not on the security area.

Future Work: Our empirical evaluations are currently limited to images with $64 \times 64$ resolution. An important future avenue is the collaborative training of diffusion models of larger scales, for which the computational advantage to each client further increases. Also, collaborative diffusion applications based on more general text-to-image synthesis models like Stable Diffusion are interesting to investigate. However, training such models requires high computational resources and is, therefore, out of the scope of this work. Nevertheless, future research steps include examining vulnerabilities, like backdoor attacks, in collaborative diffusion models. Identifying suitable countermeasures and analyzing their impact on image fidelity and computational efficiency is central to bringing our approach closer to its real-world application. Another intriguing avenue is the combination of our collaborative diffusion models with differentially private training algorithms to provide formal privacy guarantees for trained models [7, 9]. Similarly, it would be interesting to investigate to what extent the phenomenon of memorization in diffusion models [4, 46, 14] can also occur in collaborative approaches with separate models.

6 Conclusion

Our collaborative diffusion approach offers a novel solution to the challenges of diffusion-based generative models. By dividing the denoising process between a shared server and client models, we address performance, information disclosure, and computational concerns effectively. Our approach enables clients to outsource computationally intensive denoising steps to the server, balancing image quality without the necessity of sharing raw data. Through experiments, we demonstrate the effectiveness of collaborative training in enhancing image quality tailored to each client’s domain while reducing the number of denoising steps on the client side. These findings highlight the potential of collaborative diffusion models in advancing distributed machine learning research and development.

Reproducibility Statement.

Our source code is publicly available at https://github.com/SimeonAllmendinger/collafuse to reproduce the experiments and facilitate further analysis.

References

  • Augenstein et al. [2020] Sean Augenstein, H. Brendan McMahan, Daniel Ramage, Swaroop Ramaswamy, Peter Kairouz, Mingqing Chen, Rajiv Mathews, and Blaise Agüera y Arcas. Generative models for effective ML on private, decentralized datasets. In International Conference on Learning Representations (ICLR), 2020.
  • Bińkowski et al. [2018] Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In International Conference on Learning Representations (ICLR), 2018.
  • Brooks et al. [2024] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. https://openai.com/research/video-generation-models-as-world-simulators, 2024. Accessed: 19-June-2024.
  • Carlini et al. [2023] Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, and Eric Wallace. Extracting training data from diffusion models. In USENIX Security Symposium (USENIX), pages 5253–5270, 2023.
  • Chou et al. [2023] Sheng-Yen Chou, Pin-Yu Chen, and Tsung-Yi Ho. How to backdoor diffusion models? In Conference on Computer Vision and Pattern Recognition (CVPR), pages 4015–4024, 2023.
  • Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat gans on image synthesis. In Advances in Neural Information Processing Systems (NeurIPS), pages 8780–8794, 2021.
  • Dockhorn et al. [2023] Tim Dockhorn, Tianshi Cao, Arash Vahdat, and Karsten Kreis. Differentially Private Diffusion Models. Transactions on Machine Learning Research (TMLR), 2023.
  • Fan and Liu [2020] Chenyou Fan and Ping Liu. Federated generative adversarial learning. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV), pages 3–15, 2020.
  • Ghalebikesabi et al. [2023] Sahra Ghalebikesabi, Leonard Berrada, Sven Gowal, Ira Ktena, Robert Stanforth, Jamie Hayes, Soham De, Samuel L. Smith, Olivia Wiles, and Borja Balle. Differentially private diffusion models generate useful synthetic images. arXiv Preprint, 2302.13861, 2023.
  • Goodfellow et al. [2020] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
  • Gupta and Raskar [2018] Otkrist Gupta and Ramesh Raskar. Distributed learning of deep neural network over multiple agents. Journal of Network and Computer Applications, 116:1–8, 2018.
  • Hardy et al. [2019] Corentin Hardy, Erwan Le Merrer, and Bruno Sericola. Md-gan: Multi-discriminator generative adversarial networks for distributed datasets. In International Parallel and Distributed Processing Symposium (IPDPS), pages 866–877, 2019.
  • Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems (NeurIPS), pages 6626–6637, 2017.
  • Hintersdorf et al. [2024] Dominik Hintersdorf, Lukas Struppek, Kristian Kersting, Adam Dziedzic, and Franziska Boenisch. Finding nemo: Localizing neurons responsible for memorization in diffusion models. arXiv preprint, arXiv:2406.02366, 2024.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 6840–6851, 2020.
  • Hu et al. [2022] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022.
  • Huang et al. [2023] Xumin Huang, Peichun Li, Hongyang Du, Jiawen Kang, Dusit Niyato, Dong In Kim, and Yuan Wu. Federated learning-empowered ai-generated content in wireless networks. arXiv preprint, arXiv:2307.07146, 2023.
  • Joshi et al. [2022] Madhura Joshi, Ankit Pal, and Malaikannan Sankarasubbu. Federated learning for healthcare domain - pipeline, applications and challenges. ACM Transactions on Computing for Healthcare, 3(4), 2022.
  • Jothiraj and Mashhadi [2023] Fiona Victoria Stanley Jothiraj and Afra Mashhadi. Phoenix: A federated generative diffusion model. arXiv preprint, arXiv:2306.04098, 2023.
  • Kazerouni et al. [2023] Amirhossein Kazerouni, Ehsan Khodapanah Aghdam, Moein Heidari, Reza Azad, Mohsen Fayyaz, Ilker Hacihaliloglu, and Dorit Merhof. Diffusion models in medical imaging: A comprehensive survey. Medical Image Analysis, 88:102846, 2023.
  • Kingma and Welling [2014] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In International Conference on Learning Representations (ICLR), 2014.
  • Kortoçi et al. [2022] Pranvera Kortoçi, Yilei Liang, Pengyuan Zhou, Lik-Hang Lee, Abbas Mehrabi, Pan Hui, Sasu Tarkoma, and Jon Crowcroft. Federated split gans. In ACM Workshop on Data Privacy and Federated Learning Technologies for Mobile Edge Network, 2022.
  • Li et al. [2022] Wei Li, Jinlin Chen, Zhenyu Wang, Zhidong Shen, Chao Ma, and Xiaohui Cui. Ifl-gan: Improved federated learning generative adversarial network with maximum mean discrepancy model aggregation. IEEE Transactions on Neural Networks and Learning Systems, pages 1–14, 2022.
  • Little et al. [2023] Claire Little, Mark Elliot, and Richard Allmendinger. Federated learning for generating synthetic data: a scoping review. International Journal of Population Data Science, 8(1), 2023.
  • Liu et al. [2015] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep Learning Face Attributes in the Wild. In International Conference on Computer Vision (ICCV), 2015.
  • Mariani et al. [2024] Giorgio Mariani, Irene Tallini, Emilian Postolache, Michele Mancusi, Luca Cosmo, and Emanuele Rodolà. Multi-source diffusion models for simultaneous music generation and separation. In International Conference on Learning Representations (ICLR), 2024.
  • McMahan et al. [2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1273–1282, 2017.
  • Nazir and Kaleem [2023] Sajid Nazir and Mohammad Kaleem. Federated learning for medical image analysis with deep neural networks. Diagnostics, 13(9), 2023.
  • Nichol and Dhariwal [2021] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning (ICML), pages 8162–8171, 2021.
  • Ohta and Nishio [2023] Shoki Ohta and Takayuki Nishio. $\Lambda$-split: A privacy-preserving split computing framework for cloud-powered generative AI. arXiv preprint, arXiv:2310.14651, 2023.
  • Parmar et al. [2022] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 11400–11410, 2022.
  • Raffel et al. [2020] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), volume 9351, pages 234–241, 2015.
  • Saharia et al. [2022a] Chitwan Saharia, William Chan, Huiwen Chang, Chris A. Lee, Jonathan Ho, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In Special Interest Group on Computer Graphics and Interactive Techniques (SIGGRAPH), pages 15:1–15:10, 2022a.
  • Saharia et al. [2022b] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Conference on Neural Information Processing Systems (NeurIPS), 2022b.
  • Schwermer et al. [2022] René Schwermer, Jonas Buchberger, Ruben Mayer, and Hans-Arno Jacobsen. Federated office plug-load identification for building management systems. In ACM International Conference on Future Energy Systems, pages 114–126, 2022.
  • Schwermer et al. [2023] René Schwermer, Ekin-Alp Bicer, Pascal Schirmer, Ruben Mayer, and Hans-Arno Jacobsen. Federated computing in electric vehicles to predict coolant temperature. In International Middleware Conference: Industrial Track, pages 8–14, 2023.
  • Shen et al. [2023] Yiqing Shen, Arcot Sowmya, Yulin Luo, Xiaoyao Liang, Dinggang Shen, and Jing Ke. A federated learning system for histopathology image analysis with an orchestral stain-normalization gan. IEEE Transactions on Medical Imaging, 42(7):1969–1981, 2023.
  • Shokri et al. [2017] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In Symposium on Security and Privacy (S&P), pages 3–18, 2017.
  • Singer et al. [2023] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation without text-video data. In International Conference on Learning Representations (ICLR), 2023.
  • Song and Ermon [2020] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • Struppek et al. [2023] Lukas Struppek, Dominik Hintersdorf, and Kristian Kersting. Rickrolling the artist: Injecting backdoors into text encoders for text-to-image synthesis. In International Conference on Computer Vision (ICCV), 2023.
  • Tajalli et al. [2023] Behrad Tajalli, Oguzhan Ersoy, and Stjepan Picek. On feasibility of server-side backdoor attacks on split learning. In IEEE Security and Privacy Workshops (SPW), pages 84–93, 2023.
  • Thapa et al. [2022] Chandra Thapa, Pathum Chamikara Mahawaga Arachchige, Seyit Camtepe, and Lichao Sun. Splitfed: When federated learning meets split learning. AAAI Conference on Artificial Intelligence (AAAI), 36(8):8485–8493, 2022.
  • van den Burg and Williams [2021] Gerrit J. J. van den Burg and Christopher K. I. Williams. On memorization in probabilistic deep generative models. In Conference on Neural Information Processing Systems (NeurIPS), pages 27916–27928, 2021.
  • Veeraragavan and Nygård [2023] Narasimha Raghavan Veeraragavan and Jan Franz Nygård. Securing federated gans: Enabling synthetic data generation for health registry consortiums. In International Conference on Availability, Reliability and Security (ARES), pages 89:1–89:9, 2023.
  • Wang et al. [2020] Hongyi Wang, Kartik Sreenivasan, Shashank Rajput, Harit Vishwakarma, Saurabh Agarwal, Jy-yong Sohn, Kangwook Lee, and Dimitris S. Papailiopoulos. Attack of the tails: Yes, you really can backdoor federated learning. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • Yin et al. [2023] Benshun Yin, Zhiyong Chen, and Meixia Tao. Predictive gan-powered multi-objective optimization for hybrid federated split learning. IEEE Transactions on Communications, 71(8):4544–4560, 2023.
  • Zhu et al. [2019] Ligeng Zhu, Zhijian Liu, and Song Han. Deep leakage from gradients. In Advances in Neural Information Processing Systems (NeurIPS), pages 14747–14756, 2019.

Appendix A Appendix

A.1 Exemplary images for different image generation processes: Client-only, server-only, cut point, and collaborative inference

Figure 5: Exemplary generated images for different scenarios across different cut points. Column I shows server-only generated images, and column II shows client-only generated images. Column III shows images at the cut point, and column IV shows collaboratively generated images.

A.2 Development of FID, FCD, KID across cut points for different image generations compared with the original images

Figure 6: Development of FID, FCD, and KID, calculated between real and generated images, across cut points. The green lines show the scores of the client-only generated images, while the purple lines show the scores of the server-only generated images. The blue line shows the scores of the server-generated images at the cut point, and the yellow line shows the scores of the diffused images at the cut point.

A.3 Development of FID, FCD and KID over cut points

Figure 7: Fidelity results for each client using FID$\downarrow$, FCD$\downarrow$, and KID$\downarrow$: We evaluate 10,500 real images $x_{0}$ and collaboratively generated images ($\hat{x}_{0}^{c} \circ \hat{x}^{s}_{t_{\zeta}}(\epsilon)$) of the CelebA dataset across clients. In our experiment, cut points $t_{\zeta} \leq 300$ outperform the baseline of independent client models ($t_{\zeta} = 1000$). Moreover, small cut points even surpass the global model ($t_{\zeta} = 0$), which is trained on all client datasets.
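
The composition $\hat{x}_{0}^{c} \circ \hat{x}^{s}_{t_{\zeta}}(\epsilon)$ in the caption above can be read as a split of the reverse-diffusion schedule at the cut point. The following is a minimal sketch of that reading, not the authors' implementation. It assumes a total of T = 1000 steps (implied by the client-only case $t_{\zeta} = 1000$), and the names `split_schedule`, `collaborative_sample`, `server_step`, and `client_step` are hypothetical placeholders for one denoising step of the respective model.

```python
from typing import Callable, List, Tuple, TypeVar

X = TypeVar("X")  # image representation (e.g., a tensor); left generic here

T = 1000  # assumed total number of diffusion steps (t_zeta = 1000 is client-only)


def split_schedule(t_zeta: int, total_steps: int = T) -> Tuple[List[int], List[int]]:
    """Split descending timesteps into a server part and a client part.

    The server handles T, ..., t_zeta + 1; the client handles t_zeta, ..., 1.
    """
    if not 0 <= t_zeta <= total_steps:
        raise ValueError("t_zeta must lie in [0, total_steps]")
    server_steps = list(range(total_steps, t_zeta, -1))
    client_steps = list(range(t_zeta, 0, -1))
    return server_steps, client_steps


def collaborative_sample(
    noise: X,
    t_zeta: int,
    server_step: Callable[[X, int], X],  # hypothetical: one server denoising step
    client_step: Callable[[X, int], X],  # hypothetical: one client denoising step
    total_steps: int = T,
) -> Tuple[X, X]:
    """Return (image at the cut point, final collaboratively generated image)."""
    server_steps, client_steps = split_schedule(t_zeta, total_steps)
    x = noise
    for t in server_steps:  # first denoising steps run on the server
        x = server_step(x, t)
    x_cut = x  # intermediate result handed over to the client
    for t in client_steps:  # remaining denoising steps run locally
        x = client_step(x, t)
    return x_cut, x
```

Under this reading, $t_{\zeta} = 1000$ leaves the server schedule empty (independent client models), $t_{\zeta} = 0$ leaves the client schedule empty (global model), and $t_{\zeta} = 300$ assigns 700 steps to the server and 300 to the client.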