Neural Network Diffusion
Kai Wang 1 Zhaopan Xu 1 Yukun Zhou 1 Zelin Zang 1 Trevor Darrell 2 Zhuang Liu *3 Yang You *1
* Equal advising. 1 National University of Singapore, 2 University of California, Berkeley, 3 Meta AI Research.
Code: https://github.com/NUS-HPC-AI-Lab/Neural-Network-Diffusion
arXiv preprint
Abstract

Diffusion models have achieved remarkable success in image and video generation. In this work, we demonstrate that diffusion models can also generate high-performing neural network parameters. Our approach is simple, utilizing an autoencoder and a standard latent diffusion model. The autoencoder extracts latent representations of a subset of the trained network parameters. A diffusion model is then trained to synthesize these latent parameter representations from random noise. It then generates new representations that are passed through the autoencoder's decoder, whose outputs are ready to use as new subsets of network parameters. Across various architectures and datasets, our diffusion process consistently generates models of comparable or improved performance over trained networks, with minimal additional cost. Notably, we empirically find that the generated models perform differently from the trained networks. Our results encourage more exploration of the versatile use of diffusion models.

Figure 1. Top: the standard diffusion process in image generation, moving between image and noise through the forward and reverse processes. Bottom: the distribution of batch normalization (BN) parameters during the training of ResNet-18 on CIFAR-100; the panel labels mark model initialization, SGD optimization, and the effect of adding noise (accuracy 76.6, 64.0, 42.1, 1.4). The upper half of each bracket: BN weights. The lower half of each bracket: BN biases.
1. Introduction
The origin of diffusion models can be traced back to non-equilibrium thermodynamics (Jarzynski, 1997; Sohl-Dickstein et al., 2015). Diffusion processes were first utilized to progressively remove noise from inputs and generate clear images in (Sohl-Dickstein et al., 2015). Later works, such as DDPM (Ho et al., 2020) and DDIM (Song et al., 2021), refine diffusion models with a training paradigm characterized by forward and reverse processes.

At that time, the quality of images generated by diffusion models had not yet reached the desired level. Guided-Diffusion (Dhariwal & Nichol, 2021) conducts sufficient ablations and finds a better architecture, which represents the pioneering effort to elevate diffusion models beyond GAN-based methods (Zhu et al., 2017; Isola et al., 2017) in terms of image quality. Subsequently, GLIDE (Nichol et al., 2021), Imagen (Saharia et al., 2022), DALL·E 2 (Ramesh et al., 2022), and Stable Diffusion (Rombach et al., 2022) achieve photorealistic images adopted by artists.

Despite the great success of diffusion models in visual generation, their potential in other domains remains relatively underexplored. In this work, we demonstrate the surprising capability of diffusion models in generating high-performing model parameters, a task fundamentally distinct from traditional visual generation. Parameter generation focuses on creating neural network parameters that can perform well on given tasks. It has been explored from prior and probability modeling aspects, i.e., stochastic neural networks (Sompolinsky et al., 1988; Bottou et al., 1991; Wong, 1991; Schmidt et al., 1992; Murata et al., 1994) and Bayesian neural networks (Neal, 2012; Kingma & Welling, 2013; Rezende et al., 2014; Kingma et al., 2015; Gal & Ghahramani, 2016). However, using a diffusion model for parameter generation has not been well explored yet.

Taking a closer look at neural network training and diffusion models, diffusion-based image generation shares commonalities with the stochastic gradient descent (SGD) learning process in the following aspects (illustrated in
Fig. 1). i) Both neural network training and the reverse process of diffusion models can be regarded as transitions from random noise/initialization to specific distributions. ii) High-quality images and high-performing parameters can also be degraded into simple distributions, such as a Gaussian distribution, through multiple noise additions.

Based on the observations above, we introduce a novel approach for parameter generation, named neural network diffusion (p-diff, where p stands for parameter), which employs a standard latent diffusion model to synthesize a new set of parameters.

We first briefly review the forward and reverse processes of diffusion models.

Forward process. Given a sample x_0, the forward process progressively adds Gaussian noise over T steps,

    q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t \mathbf{I}), \quad
    q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}),                                   (1)

where q and N represent the forward process and the addition of Gaussian noise parameterized by \beta_t, and I is the identity matrix.

Reverse process. Different from the forward process, the reverse process aims to train a denoising network to recursively remove the noise from x_t. It moves backward along the multi-step chain as t decreases from T to 0. Mathematically, the reverse process can be formulated as follows,

    p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)), \quad
    p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t).                       (2)
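To make these preliminaries concrete, the short sketch below is a minimal PyTorch illustration (not the released implementation; the linear \beta_t schedule and its range are assumptions) of one forward-noising step from Eq. (1) and of the closed-form jump from x_0 to x_t that the denoiser of Eq. (2) is trained against.

import torch

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)       # noise schedule beta_t (assumed linear)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # cumulative products, used for the direct jump

def forward_step(x_prev, t):
    # One step of Eq. (1): x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I).
    return torch.sqrt(1.0 - betas[t]) * x_prev + torch.sqrt(betas[t]) * torch.randn_like(x_prev)

def forward_jump(x0, t):
    # Equivalent closed form: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps.
    eps = torch.randn_like(x0)
    return torch.sqrt(alpha_bars[t]) * x0 + torch.sqrt(1.0 - alpha_bars[t]) * eps, eps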
Figure 2. Our approach consists of two processes: parameter autoencoder and generation. The parameter autoencoder extracts the latent representations and reconstructs model parameters via the decoder. The extracted representations are used to train a standard latent diffusion model (LDM). During inference, random noise is fed into the LDM and the trained decoder to obtain the generated parameters.

Preparing the data for training the autoencoder. In our paper, we default to synthesizing a subset of model parameters. Therefore, to collect the training data for the autoencoder, we train a model from scratch and densely
save checkpoints in the last epoch. It is worth noting that we only update the selected subset of parameters via the SGD optimizer and fix the remaining parameters of the model. The saved subsets of parameters S = [s_1, \ldots, s_k, \ldots, s_K] are utilized to train the autoencoder, where K is the number of training samples. For some large architectures that have been trained on large-scale datasets, considering the cost of training them from scratch, we fine-tune a subset of the parameters of the pre-trained model and densely save the fine-tuned parameters as training samples.

Training parameter autoencoder. We then flatten these parameters S into 1-dimensional vectors V = [v_1, \ldots, v_k, \ldots, v_K], where V \in \mathbb{R}^{K \times D} and D is the size of the subset of parameters. After that, an autoencoder is trained to reconstruct these parameters V. To enhance the robustness and generalization of the autoencoder, we introduce random noise augmentation on the input parameters and the latent representations simultaneously. The encoding and decoding processes can be formulated as

    Z = [z_1^0, \ldots, z_k^0, \ldots, z_K^0] = f_{\mathrm{encoder}}(V + \xi_V, \sigma),   (encoding)
    V' = [v_1', \ldots, v_k', \ldots, v_K'] = f_{\mathrm{decoder}}(Z + \xi_Z, \rho),       (decoding)        (4)

where f_{\mathrm{encoder}}(\cdot, \sigma) and f_{\mathrm{decoder}}(\cdot, \rho) denote the encoder and decoder parameterized by \sigma and \rho, respectively, Z represents the latent representations, \xi_V and \xi_Z denote random noise added to the input parameters V and the latent representations Z, and V' denotes the reconstructed parameters. We default to using an autoencoder with a 4-layer encoder and decoder. As in normal autoencoder training, we minimize the mean squared error (MSE) loss between V' and V,

    L_{\mathrm{MSE}} = \frac{1}{K} \sum_{k=1}^{K} \lVert v_k - v_k' \rVert^2,        (5)

where v_k' is the reconstructed parameters of the k-th model.
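The following PyTorch sketch illustrates one possible shape of Eq. (4)-(5): a 4-layer 1D-convolutional encoder/decoder over the flattened parameter vectors, with Gaussian noise \xi_V and \xi_Z injected at the input and at the latent, trained with the MSE loss. It is an illustration only; the channel widths, kernel sizes, and down-sampling factor are assumptions rather than the exact configuration used in our experiments, while the noise amplitudes follow the 0.001 and 0.1 reported in the training details.

import torch
import torch.nn as nn

class ParamAutoencoder(nn.Module):
    # 4-layer 1D-conv encoder/decoder over flattened parameter vectors.
    # Channel widths, kernel sizes, and the x16 down-sampling are illustrative assumptions.
    def __init__(self, channels=(1, 16, 32, 64, 8)):
        super().__init__()
        enc, dec = [], []
        for cin, cout in zip(channels[:-1], channels[1:]):
            enc += [nn.Conv1d(cin, cout, 3, stride=2, padding=1), nn.ReLU()]
        rev = channels[::-1]
        for cin, cout in zip(rev[:-1], rev[1:]):
            dec += [nn.ConvTranspose1d(cin, cout, 4, stride=2, padding=1), nn.ReLU()]
        self.encoder = nn.Sequential(*enc[:-1])   # drop trailing ReLU
        self.decoder = nn.Sequential(*dec[:-1])

    def forward(self, v, xi_v=1e-3, xi_z=1e-1):
        # v: (K, D) flattened parameter subsets; D assumed divisible by 16 in this sketch.
        v_noisy = v + xi_v * torch.randn_like(v)      # xi_V: noise on input parameters
        z = self.encoder(v_noisy.unsqueeze(1))        # Eq. (4), encoding
        z_noisy = z + xi_z * torch.randn_like(z)      # xi_Z: noise on latent representations
        v_rec = self.decoder(z_noisy).squeeze(1)      # Eq. (4), decoding
        return z, v_rec

def train_autoencoder(ae, V, epochs=1000, lr=1e-3):
    # Eq. (5): mean squared error between reconstructed and original parameter vectors.
    opt = torch.optim.Adam(ae.parameters(), lr=lr)
    for _ in range(epochs):
        _, V_rec = ae(V)
        loss = ((V - V_rec) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()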
2.4. Parameter generation

One of the most direct strategies is to synthesize the novel parameters by applying the diffusion process to the parameter vectors V themselves. However, the memory cost of this operation is too heavy, especially when the dimension of V is ultra-large. Based on this consideration, we apply the diffusion process to the latent representations by default. For Z = [z_1^0, \ldots, z_k^0, \ldots, z_K^0] extracted from the parameter autoencoder, we use the optimization of DDPM (Ho et al., 2020) as follows,

    \theta \leftarrow \theta - \nabla_\theta \lVert \epsilon - \epsilon_\theta(\sqrt{\alpha_t}\, z_k^0 + \sqrt{1 - \alpha_t}\, \epsilon, t) \rVert^2,        (6)

where t is uniform between 1 and T, the sequence of hyperparameters \alpha_t indicates the noise strength at each step, \epsilon is the added Gaussian noise, and \epsilon_\theta(\cdot) denotes the denoising network parameterized by \theta.
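A minimal sketch of the update in Eq. (6) is given below (again illustrative; eps_theta stands for any denoising network over the latent shape, and the schedule reuses the assumed linear betas from the earlier sketch).

import torch

T = 1000
alpha_bars = torch.cumprod(1.0 - torch.linspace(1e-4, 2e-2, T), dim=0)   # as in the earlier sketch

def train_latent_diffusion(eps_theta, Z0, steps=10000, lr=1e-3):
    # Z0: (K, C, L) latent representations z_k^0 from the parameter autoencoder.
    # eps_theta: an nn.Module mapping (z_t, t) to a prediction of the added noise.
    opt = torch.optim.Adam(eps_theta.parameters(), lr=lr)
    for _ in range(steps):
        k = torch.randint(0, Z0.shape[0], (1,)).item()     # pick one training latent
        t = torch.randint(0, T, (1,)).item()               # t drawn uniformly over the chain
        z0 = Z0[k : k + 1]
        eps = torch.randn_like(z0)
        z_t = torch.sqrt(alpha_bars[t]) * z0 + torch.sqrt(1.0 - alpha_bars[t]) * eps
        loss = ((eps - eps_theta(z_t, t)) ** 2).mean()     # the objective of Eq. (6)
        opt.zero_grad(); loss.backward(); opt.step()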
Table 1. We present results in the format of 'original / ensemble / p-diff'. Our method obtains similar or even higher performance than the baselines. The results of p-diff are averaged over three runs. Bold entries are the best results.
Network\Dataset MNIST CIFAR-10 CIFAR-100 STL-10 Flowers Pets F-101 ImageNet-1K
ResNet-18 99.2 / 99.2 / 99.3 92.5 / 92.5 / 92.7 76.7 / 76.7 / 76.9 75.5 / 75.5 / 75.4 49.1 / 49.1 / 49.7 60.9 / 60.8 / 61.1 71.2 / 71.3 / 71.3 78.7 / 78.5 / 78.7
ResNet-50 99.4 / 99.3 / 99.4 91.3 / 91.4 / 91.3 71.6 / 71.6 / 71.7 69.2 / 69.1 / 69.2 33.7 / 33.9 / 38.1 58.0 / 58.0 / 58.0 68.6 / 68.5 / 68.6 79.2 / 79.2 / 79.3
ViT-Tiny 99.5 / 99.5 / 99.5 96.8 / 96.8 / 96.8 86.7 / 86.8 / 86.7 97.3 / 97.3 / 97.3 87.5 / 87.5 / 87.5 89.3 / 89.3 / 89.3 78.5 / 78.4 / 78.5 73.7 / 73.7 / 74.1
ViT-Base 99.5 / 99.4 / 99.5 98.7 / 98.7 / 98.7 91.5 / 91.4 / 91.7 99.1 / 99.0 / 99.2 98.3 / 98.3 / 98.3 91.6 / 91.5 / 91.7 83.4 / 83.4 / 83.4 84.5 / 84.5 / 84.7
ConvNeXt-T 99.3 / 99.4 / 99.3 97.6 / 97.6 / 97.7 87.0 / 87.0 / 87.1 98.2 / 98.0 / 98.2 70.0 / 70.0 / 70.5 92.9 / 92.8 / 93.0 76.1 / 76.1 / 76.2 82.1 / 82.1 / 82.3
ConvNeXt-B 99.3 / 99.3 / 99.4 98.1 / 98.1 / 98.1 88.3 / 88.4 / 88.4 98.8 / 98.8 / 98.9 88.4 / 88.4 / 88.5 94.1 / 94.0 / 94.1 81.4 / 81.4 / 81.6 83.8 / 83.7 / 83.9
After finishing the training of the parameter generation, we directly feed random noise into the reverse process and the trained decoder to generate a new set of high-performing parameters. These generated parameters are concatenated with the remaining model parameters to form new models for evaluation. Neural network parameters and image pixels exhibit significant disparities in several key aspects, including data type, dimensions, range, and physical interpretation. Different from images, neural network parameters mostly have no spatial relevance, so we replace 2D convolutions with 1D convolutions in our parameter autoencoder and parameter generation processes.

3. Experiments
In this section, we first introduce the setup for reproducing our results. Then, we report the result comparisons and ablation studies.

3.1. Setup

Datasets and architectures. We evaluate our approach across a wide range of datasets, including MNIST (LeCun et al., 1998), CIFAR-10/100 (Krizhevsky et al., 2009), ImageNet-1K (Deng et al., 2009), STL-10 (Coates et al., 2011), Flowers (Nilsback & Zisserman, 2008), Pets (Parkhi et al., 2012), and F-101 (Bossard et al., 2014), to study the effectiveness of our method. We mainly conduct experiments on ResNet-18/50 (He et al., 2016), ViT-Tiny/Base (Dosovitskiy et al., 2020), and ConvNeXt-T/B (Liu et al., 2022).

Training details. The autoencoder and the latent diffusion model both include a 4-layer 1D-CNN-based encoder and decoder. We default to collecting 200 training samples for all architectures. For ResNet-18/50, we train the models from scratch. In the last epoch, we continue to train the last two normalization layers and fix the other parameters. We save 200 checkpoints in the last epoch, i.e., the original models. For ViT-Tiny/Base and ConvNeXt-T/B, we fine-tune the last two normalization layers of the released models in the timm library (Wightman, 2019). The noise terms \xi_V and \xi_Z are Gaussian with amplitudes of 0.001 and 0.1, respectively. In most cases, the autoencoder and latent diffusion training can be completed within 1 to 3 hours on a single Nvidia A100 40GB GPU.
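As an illustration of this data-collection step, one can freeze everything except the chosen normalization parameters and dump them repeatedly during the final epoch. The sketch below is ours, and the specific layer names picked as the "last two" BN layers of ResNet-18 are a hypothetical choice:

import torch
import torchvision

model = torchvision.models.resnet18(num_classes=100)
# Hypothetical names for the "last two" BN layers; adjust to the layers actually targeted.
subset = [n for n, _ in model.named_parameters() if n.startswith(("layer4.1.bn1", "layer4.1.bn2"))]

for name, p in model.named_parameters():
    p.requires_grad = name in subset          # fix all parameters outside the subset

opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=1e-3)

def save_subset(step):
    # Flatten the selected BN weights/biases into one vector s_k and store it.
    vec = torch.cat([p.detach().flatten() for n, p in model.named_parameters() if n in subset])
    torch.save(vec, f"ckpt_{step:03d}.pt")

Calling save_subset after successive training steps of the last epoch yields the K = 200 samples s_1, ..., s_K described above.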
Inference details. We synthesize 100 novel parameter sets by feeding random noise into the latent diffusion model and the trained decoder. These synthesized parameters are then concatenated with the aforementioned fixed parameters to form our generated models. From these generated models, we select the one with the best performance on the training set. Subsequently, we evaluate its accuracy on the validation set and report the results. This is done to make a fair comparison with the models trained using SGD optimization. We empirically find that performance on the training set is a good criterion for selecting models for testing.
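The selection procedure can be sketched as follows (our illustration; sample_latent and decoder are placeholders for the trained LDM's reverse process and the parameter autoencoder's decoder, and load_subset/evaluate are small helpers defined inside the sketch):

import torch

@torch.no_grad()
def generate_and_select(sample_latent, decoder, model, subset_names, train_loader, n=100):
    # Generate n parameter subsets, plug each into the frozen model, keep the best on the training set.
    model.eval()
    best_acc, best_vec = -1.0, None
    for _ in range(n):
        vec = decoder(sample_latent()).flatten()      # generated parameter subset
        load_subset(model, subset_names, vec)
        acc = evaluate(model, train_loader)           # selection uses the training set only
        if acc > best_acc:
            best_acc, best_vec = acc, vec
    load_subset(model, subset_names, best_vec)        # this model is then reported on the validation set
    return model

def load_subset(model, subset_names, vec):
    # Copy a flat vector back into the named parameters, in order.
    params, i = dict(model.named_parameters()), 0
    for name in subset_names:
        p = params[name]
        p.copy_(vec[i:i + p.numel()].view_as(p)); i += p.numel()

@torch.no_grad()
def evaluate(model, loader):
    correct = total = 0
    for x, y in loader:
        correct += (model(x).argmax(1) == y).sum().item(); total += y.numel()
    return correct / total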
Baselines. 1) The best validation accuracy among the original models is denoted as 'original'. 2) The average weight ensemble (Krogh & Vedelsby, 1994; Wortsman et al., 2022) of the original models is denoted as 'ensemble'.

3.2. Results

Tab. 1 shows the result comparisons with the two baselines across 8 datasets and 6 architectures. Based on the results, we make the following observations: i) In most cases, our method achieves similar or better results than the two baselines, which demonstrates that our method can efficiently learn the distribution of high-performing parameters and generate superior models from random noise. ii) Our method consistently performs well on various datasets, which indicates the good generality of our method.

3.3. Ablation studies and analysis

Extensive ablation studies are conducted in this section to illustrate the characteristics of our method. We default to training ResNet-18 on CIFAR-100 and report the best, average, and median accuracy (if not otherwise stated).

The number of training models. Tab. 2(a) varies the size of the training data, i.e., the number of original models. We find that the gap in the best results among different numbers of original models is minor. To comprehensively explore the influence of the number of training samples on performance stability, we also report the average (avg.) and median (med.) accuracy as stability metrics for our generated models. Notably, the stability of models generated with a small number of training instances is much worse than that observed in larger settings. This can be explained by the learning principle of the diffusion model: the diffusion process may struggle to model the target distribution well if only a few input samples are used for training.
Table 2. Main ablation experiments of p-diff. We ablate the number of original models K, the location where our approach is applied, and the effect of noise augmentation. The default settings are K = 200, applying p-diff to the deep BN parameters (between layers 16 and 18), and using noise augmentation on both the input parameters and the latent representations. Default settings are marked with '(default)'. Bold entries are the best results.
(a) Large K can improve the performance stability of our method.

K     best  avg.  med.
1     76.6  70.7  73.2
10    76.5  71.2  73.8
50    76.7  71.3  74.3
200   76.9  72.4  75.6  (default)
500   76.8  72.3  75.4

(b) P-diff works well on deep layers. The index of the layer is aligned with the standard ResNet-18.

parameters         best  avg.  med.
original models    76.7  76.6  76.6
BN-layer10 to 14   76.8  71.9  75.3
BN-layer14 to 16   76.9  72.2  75.5
BN-layer16 to 18   76.9  72.4  75.6  (default)

(c) Noise augmentation makes p-diff stronger. Adding noise on latent representations is more important than on parameters.

noise augmentation         best  avg.  med.
original models            76.7  -     -
no noise                   76.7  65.8  65.0
+ para. noise              76.7  66.7  67.3
+ latent noise             76.7  72.1  75.3
+ para. and latent noise   76.9  72.4  75.6  (default)
Table 3. Result comparisons of original, ensemble, and p-diff under the setting of synthesizing entire model parameters. Our method demonstrates good generalization on ConvNet-3 and MLP-3. Bold entries are the best results.
(a) Result comparisons on ConvNet-3 (three convolutional layers and one linear layer).

Dataset     original  ensemble  p-diff  parameter number
CIFAR-10    77.2      77.3      77.5    24714
CIFAR-100   57.2      57.2      57.3    70884

(b) Result comparisons on MLP-3 (three linear layers and ReLU activation).

Dataset     original  ensemble  p-diff  parameter number
MNIST       85.3      85.2      85.4    39760
CIFAR-10    48.1      48.1      48.2    155135
Where to apply p-diff. We default to synthesizing the parameters of the last two normalization layers. To investigate the effectiveness of p-diff at other depths of normalization layers, we also explore the performance of synthesizing shallower-layer parameters. To keep an equal number of BN parameters, we apply our approach to three sets of BN layers located between layers of different depths. As shown in Tab. 2(b), we empirically find that our approach achieves better performance (best accuracy) than the original models at all depths of BN layers. Another finding is that synthesizing deep layers achieves better accuracy than generating shallow ones. This is because generating shallow-layer parameters is more likely to accumulate errors during forward propagation than generating deep-layer parameters.

Noise augmentation. Noise augmentation is designed to enhance the robustness and generalization of training the autoencoder. We ablate the effectiveness of applying this augmentation to the input parameters and to the latent representations, respectively. The ablation results are presented in Tab. 2(c). Several observations can be summarized as follows: i) Noise augmentation plays a crucial role in generating stable and high-performing models. ii) The performance gains of applying noise augmentation to the latent representations are larger than those of applying it to the input parameters. iii) Our default setting, jointly using noise augmentation on parameters and representations, obtains the best performance (including the best, average, and median accuracy).

Generalization on entire model parameters. Until now, we have evaluated the effectiveness of our approach in synthesizing a subset of model parameters, i.e., batch normalization parameters. What about synthesizing entire model parameters? To evaluate this, we extend our approach to two small architectures, namely MLP-3 (three linear layers and a ReLU activation function) and ConvNet-3 (three convolutional layers and one linear layer). Different from the aforementioned training data collection strategy, we individually train these architectures from scratch with 200 different random seeds. We take CIFAR-10 as an example and show the details of these two architectures (convolutional layer: kernel size × kernel size, number of channels; linear layer: input dimension, output dimension) as follows; a PyTorch sketch of both networks is given after the list.

• ConvNet-3: conv1. 3×3, 32; conv2. 3×3, 32; conv3. 3×3, 32; linear layer. 2048, 10.
• MLP-3: linear layer1. 3072, 50; linear layer2. 50, 25; linear layer3. 25, 10.
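A PyTorch rendering of the two architectures following the listed specifications (pooling placement and padding are not specified in the text, so the reduction to the 2048-dimensional linear input is our assumption, and exact parameter counts may therefore differ slightly from Tab. 3):

import torch.nn as nn

class ConvNet3(nn.Module):
    # conv1-3: 3x3 kernels, 32 channels; classifier: 2048 -> 10.
    # The two max-pool layers are an assumption to reach a 32 x 8 x 8 = 2048-dim feature.
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(2048, num_classes)

    def forward(self, x):                      # x: (B, 3, 32, 32)
        return self.classifier(self.features(x).flatten(1))

class MLP3(nn.Module):
    # linear layers: 3072 -> 50 -> 25 -> 10, with ReLU activations.
    def __init__(self, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3072, 50), nn.ReLU(),
            nn.Linear(50, 25), nn.ReLU(),
            nn.Linear(25, num_classes),
        )

    def forward(self, x):                      # x: (B, 3, 32, 32) for CIFAR-10
        return self.net(x)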
We present result comparisons between our approach and the two baselines (i.e., original and ensemble) in Tab. 3. We report the comparisons and the parameter numbers of ConvNet-3 on CIFAR-10/100 and of MLP-3 on CIFAR-10 and MNIST. These experiments demonstrate the effectiveness and generalization of our approach in synthesizing entire model parameters, i.e., achieving similar or even improved performance over the baselines. These results suggest the practical applicability of our method. However, we cannot synthesize the entire parameters of large architectures, such as the ResNet, ViT, and ConvNeXt series; this is mainly constrained by the limitation of GPU memory.
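As a rough, back-of-the-envelope illustration of this constraint (our estimate, not a measurement from the paper): ResNet-18 alone has on the order of 1.2 × 10^7 parameters, so merely storing K = 200 flattened copies in fp32 takes about 200 × 1.2 × 10^7 × 4 bytes ≈ 9.6 GB, and ViT-Base (≈ 8.6 × 10^7 parameters) would need roughly 69 GB, before accounting for the autoencoder activations over such long 1D sequences or the diffusion model itself; this quickly exceeds a 40 GB A100.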
[...] our method in generating neural network parameters. To explore the intrinsic reason behind this, we use 3 random seeds to train the ResNet-18 model from scratch and visualize the parameters in Fig. 3. We visualize the heat map of the parameter distribution via min-max normalization in different layers individually. Based on the visualizations of the parameters of the convolutional (Conv.-layer2) and fully connected (FC-layer18) layers, there indeed exist specific parameter patterns among these layers. By learning these patterns, our approach can generate high-performing neural network parameters.

Figure 3. Visualizing the parameter distributions of convolutional (Conv.-layer2) and fully connected (FC-layer18) layers across three training seeds (columns: Seed 1, Seed 2, Seed 3). Parameters from different layers show different patterns, while parameters from the same layer show similar patterns. The index of the layer is aligned with the standard ResNet-18.
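The heat maps described here can be reproduced in spirit with a few lines of matplotlib (our sketch; reshaping each layer's flattened parameters into a square map, and the checkpoints and layer key in the usage comment, are assumptions made purely for display):

import matplotlib.pyplot as plt

def minmax_heatmap(param, ax, title):
    # Min-max normalize one layer's parameters and draw them as a 2D heat map.
    w = param.detach().flatten().float()
    w = (w - w.min()) / (w.max() - w.min() + 1e-12)
    side = int(w.numel() ** 0.5)                  # square layout chosen only for visualization
    ax.imshow(w[: side * side].reshape(side, side).cpu().numpy(), vmin=0.0, vmax=1.0)
    ax.set_title(title)
    ax.axis("off")

# Usage sketch: one row per layer, one column per seed; checkpoints[i] is a hypothetical
# state dict of the i-th seed's trained ResNet-18.
# fig, axes = plt.subplots(2, 3)
# minmax_heatmap(checkpoints[0]["layer1.0.conv2.weight"], axes[0, 0], "Seed 1, Conv.-layer2")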
[Figure legend: among original models; among p-diff models; between original and p-diff models.]