CollaFuse: Collaborative Diffusion Models

Simeon Allmendinger* (University of Bayreuth, Fraunhofer FIT), Domenique Zipperling* (University of Bayreuth, Fraunhofer FIT), Lukas Struppek (Technical University of Darmstadt, German Research Center for Artificial Intelligence (DFKI)), Niklas Kühl (University of Bayreuth, Fraunhofer FIT)

* Equal contribution.

Abstract

In the landscape of generative artificial intelligence, diffusion-based models have emerged as a promising method for generating synthetic images. However, the application of diffusion models poses numerous challenges, particularly concerning data availability, computational requirements, and privacy. Traditional approaches to address these shortcomings, like federated learning, often impose significant computational burdens on individual clients, especially those with constrained resources. In response to these challenges, we introduce a novel approach for distributed collaborative diffusion models inspired by split learning. Our approach facilitates collaborative training of diffusion models while alleviating client computational burdens during image synthesis. This reduced computational burden is achieved by retaining data and computationally inexpensive processes locally at each client while outsourcing the computationally expensive processes to shared, more efficient server resources. Through experiments on the common CelebA dataset, our approach demonstrates enhanced privacy by reducing the necessity for sharing raw data. These capabilities hold significant potential across various application areas, including the design of edge computing solutions. Thus, our work advances distributed machine learning by contributing to the evolution of collaborative diffusion models.

1 Introduction

Figure 1: Our approach for collaborative image synthesis splits the denoising process between the server and clients. Based on client-specific conditioning, the first denoising steps are run on a trusted server, while the remaining denoising steps are run locally. Thereby, external resources can be utilized while keeping the clients’ raw data private.

Recently developed generative artificial intelligence (GenAI) methods exhibit astonishing results in generating images, among other modalities like music [26] and video [41, 3]. Recent advancements primarily rely on diffusion models [15, 42] that generate synthetic images from random noise through iterative denoising steps. We formally introduce diffusion models and related work in Sec. 2.1. In contrast to more traditional approaches [21, 10], diffusion models excel in providing high sample quality and strong mode coverage [20, 29, 6]. However, the strides in GenAI require large amounts of data, and the generation process itself is computationally expensive due to the multiple required denoising steps.

While large companies possess the necessary data and computational resources for training diffusion models, smaller organizations and private clients may face challenges in providing the required resources, limiting their ability to train and implement recent GenAI models. This might lead to a high dependency on a few key players and even prevent the application of such models completely due to local data protection regulations. For example, the popular Stable Diffusion v2 was trained on 256 A100 GPUs for about 200,000 GPU hours [33]. Even smaller models still impose notably high hardware and data requirements.

To address these constraints, organizations and clients can join forces to train machine learning models collaboratively with other clients in a decentralized way. A prominent representative of collaborative training is federated learning [27]. Conceptually, each client trains an individual model locally on its private data. After performing a certain number of training steps, the model parameters are sent to the server, which builds a global model by aggregating the individual weights. The server then makes this global model accessible to all clients. Yet, each client still has to train and share an entire model, which demands substantial computational resources from each client [45] and potentially introduces new privacy risks [40, 50].

As an alternative learning paradigm, split learning [11] supports collaborative model training by splitting the model into server-side and client-side components. In the conventional setup, clients share only intermediate network activations with the server, where the final computations are performed. Unlike federated learning, split learning reduces the computational load on clients and enhances privacy protection by sharing only intermediate representations instead of raw data or model weights. We provide a comprehensive overview of collaboratively training generative models in Sec. 2.2.

Our Approach: To address the challenges posed by the data, computation, and privacy requirements of diffusion-based GenAI, we present our collaborative diffusion models in Sec. 3. We introduce a novel collaborative learning and inference approach tailored specifically for diffusion models. Drawing inspiration from the split learning framework, our approach divides the iterative denoising steps of diffusion models into two components. The initial denoising steps are computed by a shared model on a server, with limited information disclosure due to the inherent noise in the training data and generated samples. The client’s model then performs the remaining denoising steps, which are usually significantly fewer than the denoising steps on the server side.

Our collaborative diffusion approach also allows for personalized image conditioning by incorporating attribute labels during the generation. Our empirical results demonstrate that our collaborative diffusion approach improves the image quality compared to a setting where each client trains its own local diffusion model. Because a shared server model performs most of the computationally heavy denoising steps, the computational burden for each client is comparatively small. At the same time, clients can better approximate their individual data distribution, which enables them to generate characteristic features more faithfully.

The proportion of denoising steps carried out on the server and client sides, respectively, is controlled by a single parameter called cut point. The higher the cut point, the more steps are computed on the client side. Whereas our approach is in principle also applicable to other diffusion model architectures, we focus our experiments in Sec. 4 on the common Denoising Diffusion Probabilistic Models (DDPM) [15]. In summary, we make the following contributions:

1. We introduce the first collaborative diffusion model, which consists of a shared server component trained by multiple clients without revealing their original training data to other clients or the server.

2. Collaborative diffusion models allow clients to outsource most of their computationally expensive denoising steps during training and inference to a shared server model.

3. Our collaborative diffusion models improve the image quality compared to the setting where each client trains a local diffusion model on its own data.

2 Background and Related Work

We start by introducing diffusion models for generative image synthesis. We further describe related distributed collaborative machine learning approaches, such as federated and split learning, and their utilization for image synthesis or generative AI more generally.

2.1 Diffusion Models

Denoising Diffusion Probabilistic Models (DDPMs) [15] mark a significant advancement in generative image synthesis and consist of a diffusion process and a denoising process. The diffusion process is a Markov chain with $T$ timesteps that transforms a training image $x_{0}$ into a noisy image $x_{T}$ that approximately follows a standard Gaussian distribution. A single diffusion step, which yields the image $x_{t}$ at time step $t$, is defined as

$$x_{t}=\sqrt{\alpha_{t}}\,x_{t-1}+\sqrt{1-\alpha_{t}}\,\epsilon,\quad\text{with }t=1,\ldots,T. \tag{1}$$

Here, $\alpha_{t}$ denotes the variance schedule, and $\epsilon$ is the added Gaussian noise. A denoising network $\epsilon_{\theta}(x_{t},t)$ with parameters $\theta$ is then trained to reverse the diffusion process and predict the noise added to the sample $x_{t}$ during time step $t$ .
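As an illustration of Eq. 1, the following minimal sketch applies the diffusion step repeatedly to a toy image stored as a flat list of pixel values. The linear schedule in `linear_alphas` and its default values are assumptions made only for this example; the paper does not prescribe them here. The helper names are hypothetical.

```python
import math
import random


def linear_alphas(T, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear schedule (assumption): alpha_t = 1 - beta_t for t = 1..T."""
    step = (beta_end - beta_start) / (T - 1)
    return [1.0 - (beta_start + i * step) for i in range(T)]


def diffuse_step(x_prev, alpha_t, rng):
    """One application of Eq. 1 to a flat list of pixel values."""
    keep = math.sqrt(alpha_t)         # weight of the previous image
    noise = math.sqrt(1.0 - alpha_t)  # weight of the fresh Gaussian noise
    return [keep * v + noise * rng.gauss(0.0, 1.0) for v in x_prev]


def diffuse(x0, alphas, rng):
    """Run the full Markov chain x_0 -> x_T."""
    x = list(x0)
    for alpha_t in alphas:
        x = diffuse_step(x, alpha_t, rng)
    return x


rng = random.Random(0)
alphas = linear_alphas(T=1000)
x0 = [1.0] * 1024  # toy "image" with constant pixel values
xT = diffuse(x0, alphas, rng)

# After T steps the pixels should look like standard Gaussian noise.
mean = sum(xT) / len(xT)
var = sum((v - mean) ** 2 for v in xT) / len(xT)
```

Under these assumptions, the empirical mean and variance of `xT` should be close to 0 and 1, which reflects that almost no information about $x_{0}$ remains at $t=T$.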

Most denoising networks are built upon the common U-Net [34] architecture. With the denoising network, the image generation process, which iteratively removes the predicted noise $\epsilon_{\theta}(x_{t},t)$ from the noisy sample $x_{t}$ , can be defined as

$$x_{t-1}=\frac{1}{\sqrt{\alpha_{t}}}\left(x_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\,\epsilon_{\theta}(x_{t},t)\right) \tag{2}$$

with $\bar{\alpha}_{t}=\prod^{t}_{s=1}\alpha_{s}$ .
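The denoising update of Eq. 2 can be sketched as follows. This is a minimal example, not the authors' implementation: it covers only the deterministic update written in Eq. 2, and the noise prediction is passed in as a plain list instead of coming from a trained U-Net. The function names are hypothetical.

```python
import math
import random


def alpha_bar(alphas, t):
    """Cumulative product alpha_1 * ... * alpha_t (t is 1-indexed)."""
    prod = 1.0
    for a in alphas[:t]:
        prod *= a
    return prod


def denoise_step(x_t, eps_pred, alphas, t):
    """Eq. 2 applied element-wise to a flat list of pixel values."""
    a_t = alphas[t - 1]
    coef = (1.0 - a_t) / math.sqrt(1.0 - alpha_bar(alphas, t))
    return [(x - coef * e) / math.sqrt(a_t) for x, e in zip(x_t, eps_pred)]


# Sanity check at t = 1 with an oracle that knows the true noise:
# x_1 is built with Eq. 1, so Eq. 2 should recover x_0 exactly.
rng = random.Random(0)
alphas = [0.99, 0.98, 0.97]
x0 = [0.5, -0.25, 1.0]
eps = [rng.gauss(0.0, 1.0) for _ in x0]
x1 = [math.sqrt(alphas[0]) * v + math.sqrt(1.0 - alphas[0]) * e
      for v, e in zip(x0, eps)]
x0_rec = denoise_step(x1, eps, alphas, t=1)
```

At $t=1$ we have $\bar{\alpha}_{1}=\alpha_{1}$, so the coefficient in Eq. 2 reduces to $\sqrt{1-\alpha_{1}}$ and the update inverts Eq. 1 when the predicted noise equals the true noise.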

Following the Markov chain formulation, the joint distribution $p_{\theta}(\cdot)$ over the intermediate samples of the denoising process is defined by

$$p_{\theta}(x_{0:T})=p(x_{T})\cdot\prod^{T}_{t=1}p_{\theta}(x_{t-1}|x_{t}) \tag{3}$$

with $p(x_{T})=\mathcal{N}(x_{T};\mathbf{0},\mathbf{\text{I}})$ .

In DDPMs, the iterative application of U-Nets across $T$ timesteps is a fundamental characteristic, enabling the model to refine noisy data into structured outputs progressively. The Imagen model [36], as one of the most recognized text-conditioned diffusion models in the community, builds upon DDPMs. The authors employ a frozen text encoder and dynamic thresholding to generate photorealistic images conditioned by text prompts $y$ . The loss function of the Imagen model is expressed by

$$\mathcal{L}_{\text{Imagen}}=\sum_{t=1}^{T}\omega_{t}\cdot||\epsilon_{\theta}(x_{t},t,y)-\epsilon||^{2}_{2}. \tag{4}$$

In this context, $\omega_{t}$ is the guidance weight, which is integral to the denoising process. This guidance weight modulates the influence of the predicted noise $\epsilon_{\theta}(x_{t},t,y)$ at each timestep $t$ , enabling precise control over the image generation process, particularly in maintaining fidelity to the target distribution. For simplicity, we leave the explicit embedding process out of our notation and implicitly assume that all text labels have already been embedded before feeding the embeddings into the U-Net $\epsilon_{\theta}(\cdot)$ .
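To make the structure of Eq. 4 concrete, the following minimal sketch evaluates the weighted squared error for given noise predictions. The predictions, targets, and weights are toy values chosen for illustration, and the function names are hypothetical.

```python
def weighted_noise_loss(eps_pred, eps_true, weight):
    """One summand of Eq. 4: weight times the squared L2 norm of the error."""
    return weight * sum((p - e) ** 2 for p, e in zip(eps_pred, eps_true))


def total_loss(preds, targets, weights):
    """Sum of the per-timestep terms, one entry per timestep in each list."""
    return sum(
        weighted_noise_loss(p, e, w) for p, e, w in zip(preds, targets, weights)
    )


# Toy values for two timesteps with two "pixels" each.
preds = [[1.0, 2.0], [0.0, 1.0]]
targets = [[0.0, 0.0], [0.0, 0.0]]
weights = [0.5, 2.0]
loss = total_loss(preds, targets, weights)  # 0.5 * 5.0 + 2.0 * 1.0 = 4.5
```

In practice, training typically evaluates this sum on randomly sampled timesteps per batch instead of all $T$ terms, which is also how the time steps are drawn in the training procedure described in Sec. 3.1.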

2.2 Federated and Split Learning in Generative AI

Federated learning (FL) and split learning (SL) are among the most prominent approaches for training machine learning models collaboratively on distributed data sources. FL utilizes distributed data, with clients independently training models on their unique datasets. These models are subsequently shared, aggregated, and redistributed. The cycle repeats until the models converge. Conversely, SL divides a model among clients and a central server, decreasing the computational load on clients. Moreover, clients have the option to use FL for model aggregation, leading to the development of SplitFed learning [45]. These techniques have found applications in diverse fields such as the automotive industry [38], energy management [37], and healthcare [18], where they are combined with both discriminative [28] and generative AI approaches [39].

Especially for image synthesis, FL and SL possess significant potential owing to the extensive volume of data involved. Before diffusion models took over as the predominant architecture for image synthesis, Generative Adversarial Networks (GANs) [10] were the most common network architecture. GANs are composed of two components, a generator for generating images and a discriminator trained to distinguish between real and synthetic images. Existing research on collaborative training of GANs demonstrates different ways of integrating the two components into the FL process. Hardy et al. [12] introduce FL-GAN, which adopts the standard FL process for discriminators and generators alike. This vanilla approach is compared to the proposed MD-GAN, in which FL is only applied to the discriminator, while the generator is trained directly by a server. Expanding upon this foundation, Fan and Liu [8] have empirically analyzed different strategies for synchronizing the discriminators and generators across clients in FL. Their analysis demonstrates that the best results are achieved when synchronizing both the discriminator and generator across clients. Li et al. [23] improved FL-GANs by employing maximum mean discrepancy for generator updates. Moreover, follow-up research [22, 49] has combined FL and SL to train GANs collaboratively.

Furthermore, there have been efforts to reduce privacy risks for GANs in FL settings. Augenstein et al. [1] proposed a novel algorithm for differentially private federated GANs, while Veeraragavan et al. [47] combine consortium blockchains and an efficient secret sharing algorithm to address trust-related weaknesses in existing solutions. Although Ohta et al. [30] do not focus on distributed learning, they offer a solution for privacy-preserving SL of GANs that can be expanded for collaborative learning.

In the domain of diffusion models, research on collaborative training methods is still scarce. Jothiraj and Mashhadi [19] made a first step and introduced the Phoenix technique for training unconditional diffusion models in a horizontal FL setting. Their objective is to address mode coverage issues often seen in distributed datasets that are not independent and identically distributed. Their data-sharing approach boosts performance by sharing only 4-5% of the data among clients, minimizing communication overhead. Personalization and threshold filtering techniques outperform comparison methods in terms of precision and recall but fall short in image quality compared to the proposed technique. The paper suggests further exploration to enhance image quality in future work.

Moreover, the potential of FL for AI-generated content, especially for DDPMs, was demonstrated by Huang et al. [17]. The authors discuss three different approaches for diffusion models in FL settings. The first is a parallel approach that mimics conventional FL. The second is a split approach that combines FL with SL. The third is a sequential approach in which one client receives the current model from the server, trains the model on its data, and then transmits the current version to the next client. The trained model returns to the server only after every client has trained the model once. Based on the sequential FL approach, a LoRA-based [16] federated fine-tuning scheme is designed and examined in more detail, demonstrating the advantages of faster convergence and reduced memory consumption during the tuning process.

By mainly focusing on FL for GANs [24], the current literature has so far neglected the benefits of other collaborative paradigms and GenAI architectures. Combining DDPMs and SL promises various benefits, including reducing local resource requirements and increasing data privacy. Our proposed approach for distributed collaborative image synthesis with diffusion models taps into these advantages and combines the research areas of diffusion models and collaborative learning.

3 Collaborative Diffusion Models

We now formally introduce our novel approach for enabling collaborative image generation with diffusion models. In our setting, a certain number $k$ of clients $c\in C=\{c_{1},c_{2},...,c_{k}\}$ want to collaboratively train a diffusion model for image synthesis. Although we assume that each client has a dataset from a similar domain, e.g., facial images, the specific feature distribution may differ. To stay with the facial image example, client A may have a dataset of facial images with eyeglasses, whereas client B’s dataset consists only of faces without eyeglasses. All clients now want to train a shared U-Net $\epsilon_{\theta}^{s}$ on the server that is available to each client and computes the initial denoising steps. Additionally, each client $c\in C$ trains an individual U-Net $\epsilon_{\theta}^{c}$ that is maintained locally and computes solely the remaining denoising steps. For notational simplicity, we let $\theta$ denote the weights of each individual model; no weights are shared between the clients and the server.

The computational split between server and clients is manually set by the cut point $t_{\zeta}\in[0,T]$ that specifies the number of denoising steps performed on the client side after $T-t_{\zeta}$ steps were computed by the shared server model. The cut point is set as a hyperparameter and kept fixed during training and inference. For $t_{\zeta}=0$ , all denoising steps are computed by the server, which is trained on the joint set of all clients’ data. For $t_{\zeta}=T$ , each client trains an individual diffusion model on its data that performs all denoising steps without any shared server model. The approximated data distribution of our collaborative denoising approach is formalized in Equation 5 with $p(x_{T})=\mathcal{N}(x_{T};0,\text{I})$ :

$$p_{\theta^{s},\theta^{c}}(x_{0:T})=p(x_{T})\cdot\prod^{T}_{t=t_{\zeta}+1}p_{\theta^{s}}(x_{t-1}|x_{t})\cdot\prod^{t_{\zeta}}_{t=1}p_{\theta^{c}}(x_{t-1}|x_{t}). \tag{5}$$

Here, the first product operator describes the distribution approximated by the server model with weights $\theta^{s}$ , and the second product operator consequently defines the distribution approximated by the client model $\theta^{c}$ .
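The role of the cut point can be sketched as a simple partition of the denoising time steps. The example below assumes, in line with the description above and the condition $t>t_{\zeta}$ used during inference, that the server handles the steps $t=T,\ldots,t_{\zeta}+1$ and the client handles $t=t_{\zeta},\ldots,1$. The function name `split_timesteps` is hypothetical.

```python
def split_timesteps(T, t_zeta):
    """Partition the denoising steps T, T-1, ..., 1 at the cut point t_zeta.

    Returns (server_steps, client_steps), each in the order they are executed.
    """
    if not 0 <= t_zeta <= T:
        raise ValueError("cut point must lie in [0, T]")
    server_steps = list(range(T, t_zeta, -1))  # T, T-1, ..., t_zeta + 1
    client_steps = list(range(t_zeta, 0, -1))  # t_zeta, ..., 1
    return server_steps, client_steps


server_steps, client_steps = split_timesteps(T=1000, t_zeta=200)
# 800 steps run on the server, 200 steps run on the client.
```

With `t_zeta=0` the client list is empty (server-only model), and with `t_zeta=T` the server list is empty (independent client model), matching the two extreme cases described above.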

Algorithm 1 Collaborative Training
Require: training dataset $D$, batch size $b$, number of time steps $T$, cut point $t_{\zeta}$, variance scheduler $\alpha$, noise scheduler $\sigma$, clients $C$, server $s$
client dataset $D_{c}\subseteq D$
while not converged do
    for $c\in C$ do
        for each batch $\{(x_{0},y)\}^{b}\subseteq D_{c}$ do
            *** CLIENT NODE ***
            $t^{c}\sim\mathcal{U}[1,t_{\zeta}]^{b}$ and $t^{s}\sim\mathcal{U}[t_{\zeta},T]^{b}$
            $\epsilon^{c}\sim\mathcal{N}(0,\text{I})$
            $\epsilon^{s}\sim\mathcal{N}(0,\text{I})$
            $x_{t^{c}}\leftarrow\alpha(t^{c})\cdot x_{0}+\sigma(t^{c})\cdot\epsilon^{c}$
            $x_{t_{\zeta}}\leftarrow\alpha(t_{\zeta})\cdot x_{0}+\sigma(t_{\zeta})\cdot\epsilon^{c}$
            $x_{t^{s}}\leftarrow\alpha(t^{s})\cdot x_{t_{\zeta}}+\sigma(t^{s})\cdot\epsilon^{s}$
            $\mathcal{L}_{t^{c}}=\omega_{t^{c}}\cdot||\epsilon_{\theta^{c}}(x_{t^{c}},t^{c},y)-\epsilon^{c}||^{2}_{2}$
            Update $\theta^{c}$
            Pass $x_{t^{s}}$ and $\epsilon^{s}$ to server $s$
            *** SERVER NODE ***
            $\mathcal{L}_{t^{s}}=\omega_{t^{s}}\cdot||\epsilon_{\theta^{s}}(x_{t^{s}},t^{s},y)-\epsilon^{s}||^{2}_{2}$
            Update $\theta^{s}$
        end for
    end for
end while

3.1 Collaborative Training

During training, the client models $\epsilon^{c}_{\theta}$ and the server model $\epsilon^{s}_{\theta}$ are updated independently. Alg. 1 provides our collaborative training procedure in pseudocode, which we now describe in more detail. For training, each client $c\in C$ has access to a private dataset $D_{c}=\{(x^{i}_{0},y^{i})\}$ of images $x^{i}_{0}$ with optional textual attribute labels $y^{i}$. In principle, our approach also works with unlabeled data and other kinds of labels, e.g., one-hot encoded label vectors and segmentation maps. However, we focus on the use case of attribute-conditioned image generation in this work. By using textual feature descriptions as labels, our implementation can easily be extended to more elaborate text-guided image synthesis. As in the standard diffusion training process, each client samples a training batch $\{(x_{0},y)\}^{b}\subseteq D_{c}$ of batch size $b$ together with client time steps $t^{c}\sim\mathcal{U}[1,t_{\zeta}]^{b}$ during each training step. Gaussian noise is added to each training sample based on $t^{c}$ following the diffusion process defined in Eq. 1. All noisy images are fed into the client’s model to predict the added noise and update the model’s parameters according to the loss function defined in Eq. 4. In addition, each client takes the diffused image $x_{t_{\zeta}}$ at the cut point and samples additional server time steps $t^{s}\sim\mathcal{U}[t_{\zeta},T]^{b}$ to produce the noisy images $x_{t^{s}}$ for the server. These noisy images and the noise $\epsilon^{s}$ added to $x_{t_{\zeta}}$ are then used to update the server model’s weights analogously. We note that the process of adding additional noise for the server could, in principle, also be performed on the server side. In either case, the server only has access to samples at the noise level of $t_{\zeta}$ or higher and never to the initial training samples, which limits the amount of information disclosed to the server.
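The client-side data preparation of Alg. 1 can be sketched as follows. This is a minimal illustration under several assumptions: images are flat lists of floats, the schedulers `alpha` and `sigma` use a hypothetical cosine form, and the payload also carries $t^{s}$ and $y$ because the server loss in Alg. 1 depends on them. The model updates themselves are omitted, and all function names are hypothetical.

```python
import math
import random


def make_training_batches(batch, t_zeta, T, alpha, sigma, rng):
    """Prepare one batch as in the client part of Alg. 1.

    batch: list of (x0, y) pairs, x0 being a flat list of pixel values.
    Returns (client_batch, server_payload) as lists of (x, t, eps, y) tuples.
    """
    client_batch, server_payload = [], []
    for x0, y in batch:
        t_c = rng.randint(1, t_zeta)  # client time step in [1, t_zeta]
        t_s = rng.randint(t_zeta, T)  # server time step in [t_zeta, T]
        eps_c = [rng.gauss(0.0, 1.0) for _ in x0]
        eps_s = [rng.gauss(0.0, 1.0) for _ in x0]
        # Noisy input for the client model.
        x_tc = [alpha(t_c) * v + sigma(t_c) * e for v, e in zip(x0, eps_c)]
        # Image diffused up to the cut point.
        x_cut = [alpha(t_zeta) * v + sigma(t_zeta) * e for v, e in zip(x0, eps_c)]
        # Additional noise on top of the cut-point image for the server.
        x_ts = [alpha(t_s) * v + sigma(t_s) * e for v, e in zip(x_cut, eps_s)]
        client_batch.append((x_tc, t_c, eps_c, y))    # stays on the client
        server_payload.append((x_ts, t_s, eps_s, y))  # sent to the server
    return client_batch, server_payload


T, t_zeta = 1000, 200
# Hypothetical cosine-style schedulers (assumption, not taken from the paper).
alpha = lambda t: math.cos(0.5 * math.pi * t / T)
sigma = lambda t: math.sin(0.5 * math.pi * t / T)

rng = random.Random(0)
batch = [([0.1, 0.2, 0.3, 0.4], "eyeglasses") for _ in range(3)]
client_batch, server_payload = make_training_batches(
    batch, t_zeta, T, alpha, sigma, rng
)
```

Note that neither `x0` nor `x_cut` appears in `server_payload`: in this sketch only images that carry at least the noise level of the cut point leave the client.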

Algorithm 2 Collaborative Inference
Require: number of time steps $T$, label $y$, cut point $t_{\zeta}$, client $c$, server $s$
Sample initial noise: $x_{T}\sim\mathcal{N}(0,\text{I})$
$M=\left\lfloor t_{\zeta}+\frac{t_{\zeta}}{T}\cdot(T-t_{\zeta})\right\rfloor$
$t^{c}_{list}=\text{linearly spaced array generator}(1,M,t_{\zeta})$
$t\leftarrow T$
while $t\geq 1$ do
    if $t>t_{\zeta}$ then
        Compute $x_{t-1}$ using $\epsilon_{\theta^{s}}(x_{t},t,y)$ and $\alpha(t),\sigma(t)$
    else
        Compute $x_{t-1}$ using $\epsilon_{\theta^{c}}(x_{t},t^{c}_{list}[t],y)$ and $\alpha(t^{c}_{list}[t]),\sigma(t^{c}_{list}[t])$
    end if
    $t\leftarrow t-1$
end while

3.2 Collaborative Inference

After training, each client $c\in C$ can send a request to the server containing optional textual attribute labels $y$. The individual steps during inference are specified in Alg. 2. The server first samples initial noisy images $\hat{x}_{T}\sim\mathcal{N}(0,\text{I})$ and denoises them using $\epsilon_{\theta}^{s}$ for $T-t_{\zeta}$ steps conditioned on the label $y$. The still noisy samples $\hat{x}_{t_{\zeta}}^{s}$ are sent to the client $c$, which computes the final $t_{\zeta}$ denoising steps using its local model $\epsilon_{\theta}^{c}$. To account for the increased amount of noise in $\hat{x}_{t_{\zeta}}^{s}$ and hence allow for a higher noise reduction on the client node, the variance and noise schedulers are adapted to the cut point $t_{\zeta}$: while the total number of timesteps is kept fixed, the client’s time values are spread linearly between 1 and a maximum value $M$. A sufficiently large cut point $t_{\zeta}$ ensures that possibly sensitive features are generated on the client side while the server performs the less privacy-critical initial denoising steps. If multiple clients request samples for the same label $y$, the server-side denoising process can be run once to generate an intermediate noisy sample, and each client solely has to compute the remaining denoising steps.
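The control flow of Alg. 2, including the adapted client time values, can be sketched as follows. This is a minimal illustration, not the authors' implementation: `server_step` and `client_step` are hypothetical callables that return $x_{t-1}$, the linearly spaced values are kept as floats because the rounding used in practice is not specified here, and $M$ is computed with integer arithmetic, which equals the floor expression for integer $t_{\zeta}$ and $T$.

```python
def client_time_list(T, t_zeta):
    """Return M and t_zeta linearly spaced time values from 1 to M (Alg. 2)."""
    M = t_zeta + (t_zeta * (T - t_zeta)) // T
    if t_zeta == 1:
        return M, [1.0]  # a single client step; avoids division by zero below
    step = (M - 1) / (t_zeta - 1)
    return M, [1 + i * step for i in range(t_zeta)]


def collaborative_inference(x_T, y, T, t_zeta, server_step, client_step):
    """Server denoises while t > t_zeta, the client handles the remaining steps.

    server_step(x, t, y) and client_step(x, t_adapted, y) are hypothetical
    callables that each return the next, less noisy sample.
    """
    _, t_list = client_time_list(T, t_zeta)
    x, t = x_T, T
    while t >= 1:
        if t > t_zeta:
            x = server_step(x, t, y)
        else:
            # Alg. 2 indexes the list with t (1-indexed); Python is 0-indexed.
            x = client_step(x, t_list[t - 1], y)
        t -= 1
    return x


M, t_list = client_time_list(T=1000, t_zeta=200)
# M == 360: the client's 200 steps cover time values from 360 down to 1.
```

For $T=1000$ and $t_{\zeta}=200$, the client thus starts at an adapted time value of 360 instead of 200, which corresponds to the stronger noise reduction on the client node described above.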

4 Experiment

To assess our approach, we implement Alg. 1 and Alg. 2 and train the models on a common benchmark dataset. We simulate a scenario in which $k=5$ clients use a trusted server to train a DDPM collaboratively. Each client has access to an individual subset from the same domain. Thereby, we investigate the influence of collaborative training on the fidelity of generated images. Furthermore, we analyze the influence of the chosen cut point $t_{\zeta}$ on sample fidelity with respect to disclosed information.

4.1 Experimental Protocol

To ensure consistent conditions and avoid confounding factors, we maintain identical training and inference hyperparameters and seeds across different runs and settings, if not stated otherwise. We provide our source code for reproducibility at https://github.com/SimeonAllmendinger/collafuse.

Model Architecture: We adopt the network architecture of Imagen [36], which is based on the U-Net architecture [34] to process $64\times 64$ RGB images. As in the original Imagen implementation, each U-Net model is conditioned on text embeddings computed by a T5-Base model [32]. Unlike Saharia et al. [36], we do not apply any super-resolution model to increase the fidelity of the generated images, as the focus of our work lies on the feasibility of the collaborative training and inference process. However, our collaborative diffusion setting also allows each client to add individual or shared super-resolution models [35] to their pipeline to upscale the generated images.

Datasets: We train and evaluate our collaborative diffusion models on the common CelebA dataset of facial attributes. The CelebA dataset is a large-scale face attributes dataset collected by Liu et al. [25]. It consists of over 200,000 facial images of celebrities, each annotated with 40 binary attribute labels, including attributes such as gender appearance, perceived age, hair color, and facial expressions. The images in the dataset are diverse, featuring celebrities from various ethnicities, ages, and backgrounds.

Table 1: Attribute selection for dataset and clients. Each client dataset comprises samples of two to five attributes, e.g., black hair, brown hair, etc.

Dataset	Client 1	Client 2	Client 3	Client 4	Client 5
CelebA	Hair colors	Jewelry	Hair cut	Eyebrows	Eyes/Glasses

Training Parameters: Our training protocol consists of ten epochs, employing a learning rate of 0.001, a batch size of 50, a cosine scheduler, and $T=1000$ timesteps. Each client holds 2,000 training images and 5,000 test images (hold-out dataset) according to an individual attribute group (cf. Tab. 1). The experiments were conducted utilizing an NVIDIA A100-SXM4-40GB for computational processing.

Evaluation Metrics: To assess the quality of the generated images, we calculate the common Kernel Inception Distance (KID) [2], the Fréchet Inception Distance (FID) [13], and the Fréchet CLIP Distance (FCD) between the 2,100 real images (test dataset) and the generated images from each client. We differentiate between images generated by clients from pure Gaussian noise (client-only) and images generated based on the server image at the cut point. All metrics are computed with the implementations provided by Parmar et al. [31] to ensure stable and comparable evaluation results. For all three metrics, lower values indicate a better approximation of the training distribution and improved image quality. Due to the page limitation, we report the FID and FCD values in the main paper and the KID results in the appendix.
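To give an intuition for the Fréchet-based metrics, the sketch below computes the squared Fréchet distance between two Gaussians fitted to feature vectors. It is a deliberate simplification: it assumes diagonal covariances and takes arbitrary feature lists as input, whereas FID and FCD use the full covariance of Inception or CLIP features. It is not the implementation of Parmar et al. [31], and the function names are hypothetical.

```python
import math


def mean_std(values):
    """Sample mean and sample standard deviation of a list of floats."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / (len(values) - 1)
    return m, math.sqrt(var)


def frechet_distance_diag(feats_a, feats_b):
    """Squared Frechet distance between Gaussians with diagonal covariance.

    For diagonal covariances the trace term reduces to a sum of squared
    differences of standard deviations, one per feature dimension.
    """
    dim = len(feats_a[0])
    total = 0.0
    for d in range(dim):
        mu_a, sd_a = mean_std([f[d] for f in feats_a])
        mu_b, sd_b = mean_std([f[d] for f in feats_b])
        total += (mu_a - mu_b) ** 2 + (sd_a - sd_b) ** 2
    return total


real = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
fake = [[1.0, 0.0], [2.0, 2.0], [3.0, 4.0]]  # first dimension shifted by 1
distance = frechet_distance_diag(real, fake)
```

Identical feature sets yield a distance of zero, and the distance grows as the means or spreads of the two sets diverge, which is why lower values indicate a closer match between generated and real images.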

As we are interested in the performance of the collaborative system, we calculate the image fidelity across the set of clients to compare the performance for different values of $t_{\zeta}$ . Furthermore, we calculate the fidelity between the partially diffused images at the cut point $t_{\zeta}$ and original images of the clients, as well as the denoised images from the server model at cut point $t_{\zeta}$ and further denoised images at $t=0$ . More detailed results are provided in the appendix.

4.2 Experimental Results

In our experiments, we analyze the performance of collaborative diffusion models by focusing on specific distributed attributes among clients. Fig. 2 displays exemplary generated images of our collaborative approach, comparing them with their attributes and real images from the dataset. Our quantitative analysis includes a comparison with two baselines: a global model ($t_{\zeta}=0$) and independent client models ($t_{\zeta}=1000$). The single global model on the server node is trained using the combined datasets of all clients, while the independent client models are trained on client-specific distributed sub-datasets and operate separately on the client nodes. The FID and FCD scores in Fig. 3 show that models with cut points $t_{\zeta}\leq 300$ surpass the performance of the independent client models, which favors collaborative image synthesis. Our experiments with smaller cut points even manage to outperform the global model. However, cut points $t_{\zeta}>300$ weaken the fidelity of image generation, which converges to the performance of the independent client models for higher cut points. Furthermore, our findings highlight that a client-only approach, in which the client generates images from pure Gaussian noise at the cut point, proves ineffective, particularly at lower cut points, as detailed in the appendix.

In terms of image quality, we observe that the fidelity and visual characteristics of the images generated by the server degrade as the cut point ${t_{\zeta}}$ increases, as displayed in Fig. 4. Conversely, the client’s ability to generate images from noise is compromised when operating with low cut points. However, by leveraging our collaborative approach and making use of the denoised images provided by the server, each client can produce meaningful images with improved fidelity, as shown in Fig. 3. Additionally, our results indicate that adjusting the variance and noise schedulers in the collaborative inference process significantly enhances the denoising capabilities on the client node. This is particularly effective given the higher levels of residual noise in the images received from the server. The adjusted parameter $M$ in Alg. 2 therefore proves beneficial for the quality of the final image output.

5 Discussion, Limitations, and Future Work

Our findings reveal that collaborative diffusion models produce images of higher quality than those generated by independently trained diffusion models using sub-datasets. In certain instances, they even surpass the performance of centrally trained diffusion models. This not only highlights the efficacy of our method but also demonstrates its capability to tackle the personalization versus generalization challenge frequently encountered in federated learning scenarios. Beyond the aspect of image fidelity, it is pivotal to consider our findings within their broader impact on information disclosure and computational efficiency in distributed machine learning frameworks. Our approach circumvents the need to share raw data or complete model updates, opting instead to share only the diffused images alongside server-side noise. Furthermore, by delegating computationally intensive tasks, our approach significantly reduces the computational load on individual clients, leveraging the potential strengths of growing foundation models on large servers. Consequently, even in cases where image fidelity decreases, clients may still favor a collaborative diffusion model for its ability to lighten their computational load, even in simple one-to-one setups. Thus, our strategy facilitates a balanced optimization of performance, privacy, and computational resources and can be tailored to individual preferences.

Limitations: Our current approach assumes trustworthy clients and an honest server. However, collaborative approaches are known to be susceptible to backdoor attacks [48, 44] in which an adversarial client tries to inject secret functionalities into a collaboratively trained model. Also, existing research has shown that diffusion models are indeed susceptible to backdoor attacks [43, 5]. Our proposed collaborative diffusion models do not account for such adversarial cases. However, our focus lies on the conception of collaborative diffusion models and not on the security area.

Future Work: Our empirical evaluations are currently limited to images with $64\times 64$ resolution. An important future avenue is the collaborative training of diffusion models of larger scales, for which the computational advantage to each client further increases. Also, collaborative diffusion applications based on more general text-to-image synthesis models like Stable Diffusion are interesting to investigate. However, training such models requires high computational resources and is, therefore, out of the scope of this work. Further research steps include examining vulnerabilities, like backdoor attacks, in collaborative diffusion models. Identifying suitable countermeasures and analyzing their impact on image fidelity and computational efficiency is central to bringing our approach closer to its real-world application. Another intriguing avenue is the combination of our collaborative diffusion models with differentially private training algorithms to provide formal privacy guarantees for trained models [7, 9]. Similarly, it would be interesting to investigate to what extent the phenomenon of memorization in diffusion models [4, 46, 14] can also occur in collaborative approaches with separate models.

6 Conclusion

Our collaborative diffusion approach offers a novel solution to the challenges of diffusion-based generative models. By dividing the denoising process between a shared server model and client models, we address performance, information disclosure, and computational concerns. Our approach enables clients to outsource computationally intensive denoising steps to the server, balancing image quality and computational load without the need to share raw data. Through experiments, we demonstrate the effectiveness of collaborative training in enhancing image quality tailored to each client’s domain while reducing the number of denoising steps on the client side. These findings highlight the potential of collaborative diffusion models in advancing distributed machine learning research and development.

Reproducibility Statement.

Our source code is publicly available at https://github.com/SimeonAllmendinger/collafuse to reproduce the experiments and facilitate further analysis.

References

Augenstein et al. [2020] Sean Augenstein, H. Brendan McMahan, Daniel Ramage, Swaroop Ramaswamy, Peter Kairouz, Mingqing Chen, Rajiv Mathews, and Blaise Agüera y Arcas. Generative models for effective ML on private, decentralized datasets. In International Conference on Learning Representations (ICLR), 2020.
Bińkowski et al. [2018] Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In International Conference on Learning Representations (ICLR), 2018.
Brooks et al. [2024] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. https://openai.com/research/video-generation-models-as-world-simulators, 2024. Accessed: 19-June-2024.
Carlini et al. [2023] Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, and Eric Wallace. Extracting training data from diffusion models. In USENIX Security Symposium (USENIX), pages 5253–5270, 2023.
Chou et al. [2023] Sheng-Yen Chou, Pin-Yu Chen, and Tsung-Yi Ho. How to backdoor diffusion models? In Conference on Computer Vision and Pattern Recognition (CVPR), pages 4015–4024, 2023.
Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat gans on image synthesis. In Advances in Neural Information Processing Systems (NeurIPS), pages 8780–8794, 2021.
Dockhorn et al. [2023] Tim Dockhorn, Tianshi Cao, Arash Vahdat, and Karsten Kreis. Differentially Private Diffusion Models. Transactions on Machine Learning Research (TMLR), 2023.
Fan and Liu [2020] Chenyou Fan and Ping Liu. Federated generative adversarial learning. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV), pages 3–15, 2020.
Ghalebikesabi et al. [2023] Sahra Ghalebikesabi, Leonard Berrada, Sven Gowal, Ira Ktena, Robert Stanforth, Jamie Hayes, Soham De, Samuel L. Smith, Olivia Wiles, and Borja Balle. Differentially private diffusion models generate useful synthetic images. arXiv preprint, arXiv:2302.13861, 2023.
Goodfellow et al. [2020] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
Gupta and Raskar [2018] Otkrist Gupta and Ramesh Raskar. Distributed learning of deep neural network over multiple agents. Journal of Network and Computer Applications, 116:1–8, 2018.
Hardy et al. [2019] Corentin Hardy, Erwan Le Merrer, and Bruno Sericola. Md-gan: Multi-discriminator generative adversarial networks for distributed datasets. In International Parallel and Distributed Processing Symposium (IPDPS), pages 866–877, 2019.
Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems (NeurIPS), pages 6626–6637, 2017.
Hintersdorf et al. [2024] Dominik Hintersdorf, Lukas Struppek, Kristian Kersting, Adam Dziedzic, and Franziska Boenisch. Finding nemo: Localizing neurons responsible for memorization in diffusion models. arXiv preprint, arXiv:2406.02366, 2024.
Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 6840–6851, 2020.
Hu et al. [2022] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022.
Huang et al. [2023] Xumin Huang, Peichun Li, Hongyang Du, Jiawen Kang, Dusit Niyato, Dong In Kim, and Yuan Wu. Federated learning-empowered ai-generated content in wireless networks. arXiv preprint, arXiv:2307.07146, 2023.
Joshi et al. [2022] Madhura Joshi, Ankit Pal, and Malaikannan Sankarasubbu. Federated learning for healthcare domain - pipeline, applications and challenges. ACM Transactions on Computing for Healthcare, 3(4), 2022.
Jothiraj and Mashhadi [2023] Fiona Victoria Stanley Jothiraj and Afra Mashhadi. Phoenix: A federated generative diffusion model. arXiv preprint, arXiv:2306.04098, 2023.
Kazerouni et al. [2023] Amirhossein Kazerouni, Ehsan Khodapanah Aghdam, Moein Heidari, Reza Azad, Mohsen Fayyaz, Ilker Hacihaliloglu, and Dorit Merhof. Diffusion models in medical imaging: A comprehensive survey. Medical Image Analysis, 88:102846, 2023.
Kingma and Welling [2014] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In International Conference on Learning Representations (ICLR), 2014.
Kortoçi et al. [2022] Pranvera Kortoçi, Yilei Liang, Pengyuan Zhou, Lik-Hang Lee, Abbas Mehrabi, Pan Hui, Sasu Tarkoma, and Jon Crowcroft. Federated split gans. In ACM Workshop on Data Privacy and Federated Learning Technologies for Mobile Edge Network, 2022.
Li et al. [2022] Wei Li, Jinlin Chen, Zhenyu Wang, Zhidong Shen, Chao Ma, and Xiaohui Cui. Ifl-gan: Improved federated learning generative adversarial network with maximum mean discrepancy model aggregation. IEEE Transactions on Neural Networks and Learning Systems, pages 1–14, 2022.
Little et al. [2023] Claire Little, Mark Elliot, and Richard Allmendinger. Federated learning for generating synthetic data: a scoping review. International Journal of Population Data Science, 8(1), 2023.
Liu et al. [2015] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep Learning Face Attributes in the Wild. In International Conference on Computer Vision (ICCV), 2015.
Mariani et al. [2024] Giorgio Mariani, Irene Tallini, Emilian Postolache, Michele Mancusi, Luca Cosmo, and Emanuele Rodolà. Multi-source diffusion models for simultaneous music generation and separation. In International Conference on Learning Representations (ICLR), 2024.
McMahan et al. [2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1273–1282, 2017.
Nazir and Kaleem [2023] Sajid Nazir and Mohammad Kaleem. Federated learning for medical image analysis with deep neural networks. Diagnostics, 13(9), 2023.
Nichol and Dhariwal [2021] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning (ICML), pages 8162–8171, 2021.
Ohta and Nishio [2023] Shoki Ohta and Takayuki Nishio. $\Lambda$-split: A privacy-preserving split computing framework for cloud-powered generative AI. arXiv preprint, arXiv:2310.14651, 2023.
Parmar et al. [2022] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 11400–11410, 2022.
Raffel et al. [2020] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.
Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.
Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), volume 9351, pages 234–241, 2015.
Saharia et al. [2022a] Chitwan Saharia, William Chan, Huiwen Chang, Chris A. Lee, Jonathan Ho, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In Special Interest Group on Computer Graphics and Interactive Techniques (SIGGRAPH), pages 15:1–15:10, 2022a.
Saharia et al. [2022b] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Conference on Neural Information Processing Systems (NeurIPS), 2022b.
Schwermer et al. [2022] René Schwermer, Jonas Buchberger, Ruben Mayer, and Hans-Arno Jacobsen. Federated office plug-load identification for building management systems. In ACM International Conference on Future Energy Systems, pages 114–126, 2022.
Schwermer et al. [2023] René Schwermer, Ekin-Alp Bicer, Pascal Schirmer, Ruben Mayer, and Hans-Arno Jacobsen. Federated computing in electric vehicles to predict coolant temperature. In International Middleware Conference: Industrial Track, pages 8–14, 2023.
Shen et al. [2023] Yiqing Shen, Arcot Sowmya, Yulin Luo, Xiaoyao Liang, Dinggang Shen, and Jing Ke. A federated learning system for histopathology image analysis with an orchestral stain-normalization gan. IEEE Transactions on Medical Imaging, 42(7):1969–1981, 2023.
Shokri et al. [2017] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In Symposium on Security and Privacy (S&P), pages 3–18, 2017.
Singer et al. [2023] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation without text-video data. In International Conference on Learning Representations (ICLR), 2023.
Song and Ermon [2020] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
Struppek et al. [2023] Lukas Struppek, Dominik Hintersdorf, and Kristian Kersting. Rickrolling the artist: Injecting backdoors into text encoders for text-to-image synthesis. In International Conference on Computer Vision (ICCV), 2023.
Tajalli et al. [2023] Behrad Tajalli, Oguzhan Ersoy, and Stjepan Picek. On feasibility of server-side backdoor attacks on split learning. In IEEE Security and Privacy Workshops (SPW), pages 84–93, 2023.
Thapa et al. [2022] Chandra Thapa, Pathum Chamikara Mahawaga Arachchige, Seyit Camtepe, and Lichao Sun. Splitfed: When federated learning meets split learning. AAAI Conference on Artificial Intelligence (AAAI), 36(8):8485–8493, 2022.
van den Burg and Williams [2021] Gerrit J. J. van den Burg and Christopher K. I. Williams. On memorization in probabilistic deep generative models. In Conference on Neural Information Processing Systems (NeurIPS), pages 27916–27928, 2021.
Veeraragavan and Nygård [2023] Narasimha Raghavan Veeraragavan and Jan Franz Nygård. Securing federated gans: Enabling synthetic data generation for health registry consortiums. In International Conference on Availability, Reliability and Security (ARES), pages 89:1–89:9, 2023.
Wang et al. [2020] Hongyi Wang, Kartik Sreenivasan, Shashank Rajput, Harit Vishwakarma, Saurabh Agarwal, Jy-yong Sohn, Kangwook Lee, and Dimitris S. Papailiopoulos. Attack of the tails: Yes, you really can backdoor federated learning. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
Yin et al. [2023] Benshun Yin, Zhiyong Chen, and Meixia Tao. Predictive gan-powered multi-objective optimization for hybrid federated split learning. IEEE Transactions on Communications, 71(8):4544–4560, 2023.
Zhu et al. [2019] Ligeng Zhu, Zhijian Liu, and Song Han. Deep leakage from gradients. In Advances in Neural Information Processing Systems (NeurIPS), pages 14747–14756, 2019.