Neural Network Diffusion
Kai Wang 1 Zhaopan Xu 1 Yukun Zhou 1 Zelin Zang 1 Trevor Darrell 2 Zhuang Liu *3 Yang You *1
* Equal advising. 1 National University of Singapore, 2 University of California, Berkeley, 3 Meta AI Research.
Code: https://github.com/NUS-HPC-AI-Lab/Neural-Network-Diffusion
arXiv preprint
Abstract

Diffusion models have achieved remarkable success in image and video generation. In this work, we demonstrate that diffusion models can also generate high-performing neural network parameters. Our approach is simple, utilizing an autoencoder and a standard latent diffusion model. The autoencoder extracts latent representations of a subset of the trained network parameters. A diffusion model is then trained to synthesize these latent parameter representations from random noise. It then generates new representations that are passed through the autoencoder's decoder, whose outputs are ready to use as new subsets of network parameters. Across various architectures and datasets, our diffusion process consistently generates models of comparable or improved performance over trained networks, with minimal additional cost. Notably, we empirically find that the generated models perform differently from the trained networks. Our results encourage more exploration of the versatile use of diffusion models.

Figure 1. Top: the standard diffusion process in image generation, moving between image and noise through the forward and reverse processes. Bottom: the distribution of batch normalization (BN) parameters during the training of ResNet-18 on CIFAR-100; the panel labels mark model initialization, SGD optimization, and the effect of adding noise (accuracy 76.6, 64.0, 42.1, 1.4). The upper half of each bracket: BN weights. The lower half of each bracket: BN biases.
1. Introduction
The origin of diffusion models can be traced back to non-equilibrium thermodynamics (Jarzynski, 1997; Sohl-Dickstein et al., 2015). Diffusion processes were first utilized to progressively remove noise from inputs and generate clear images in (Sohl-Dickstein et al., 2015). Later works, such as DDPM (Ho et al., 2020) and DDIM (Song et al., 2021), refine diffusion models with a training paradigm characterized by forward and reverse processes.

At that time, the quality of images generated by diffusion models had not yet reached the desired level. Guided-Diffusion (Dhariwal & Nichol, 2021) conducts sufficient ablations and finds a better architecture, which represents the pioneering effort to elevate diffusion models beyond GAN-based methods (Zhu et al., 2017; Isola et al., 2017) in terms of image quality. Subsequently, GLIDE (Nichol et al., 2021), Imagen (Saharia et al., 2022), DALL·E 2 (Ramesh et al., 2022), and Stable Diffusion (Rombach et al., 2022) achieve photorealistic images adopted by artists.

Despite the great success of diffusion models in visual generation, their potential in other domains remains relatively underexplored. In this work, we demonstrate the surprising capability of diffusion models in generating high-performing model parameters, a task fundamentally distinct from traditional visual generation. Parameter generation focuses on creating neural network parameters that can perform well on given tasks. It has been explored from prior and probability modeling aspects, i.e., stochastic neural networks (Sompolinsky et al., 1988; Bottou et al., 1991; Wong, 1991; Schmidt et al., 1992; Murata et al., 1994) and Bayesian neural networks (Neal, 2012; Kingma & Welling, 2013; Rezende et al., 2014; Kingma et al., 2015; Gal & Ghahramani, 2016). However, using a diffusion model for parameter generation has not been well explored yet.

Taking a closer look at neural network training and diffusion models, diffusion-based image generation shares commonalities with the stochastic gradient descent (SGD) learning process in the following aspects (illustrated in
Fig. 1). i) Both neural network training and the reverse process of diffusion models can be regarded as transitions from random noise/initialization to specific distributions. ii) High-quality images and high-performing parameters can also be degraded into simple distributions, such as a Gaussian distribution, through multiple noise additions.

Based on the observations above, we introduce a novel approach for parameter generation, named neural network diffusion (p-diff, where p stands for parameter), which employs a standard latent diffusion model to synthesize a new set of parameters.

We first briefly review the forward and reverse processes of diffusion models.

Forward process. Given a sample x_0, the forward process progressively adds Gaussian noise over T steps,

    q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t \mathbf{I}), \quad
    q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}),                                   (1)

where q and N represent the forward process and the addition of Gaussian noise parameterized by \beta_t, and I is the identity matrix.

Reverse process. Different from the forward process, the reverse process aims to train a denoising network to recursively remove the noise from x_t. It moves backward along the multi-step chain as t decreases from T to 0. Mathematically, the reverse process can be formulated as follows,

    p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)), \quad
    p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t).                       (2)
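To make these preliminaries concrete, the short sketch below is a minimal PyTorch illustration (not the released implementation; the linear \beta_t schedule and its range are assumptions) of one forward-noising step from Eq. (1) and of the closed-form jump from x_0 to x_t that the denoiser of Eq. (2) is trained against.

import torch

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)       # noise schedule beta_t (assumed linear)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # cumulative products, used for the direct jump

def forward_step(x_prev, t):
    # One step of Eq. (1): x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I).
    return torch.sqrt(1.0 - betas[t]) * x_prev + torch.sqrt(betas[t]) * torch.randn_like(x_prev)

def forward_jump(x0, t):
    # Equivalent closed form: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps.
    eps = torch.randn_like(x0)
    return torch.sqrt(alpha_bars[t]) * x0 + torch.sqrt(1.0 - alpha_bars[t]) * eps, eps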
Figure 2. Our approach consists of two processes: parameter autoencoder and generation. The parameter autoencoder extracts the latent representations and reconstructs model parameters via the decoder. The extracted representations are used to train a standard latent diffusion model (LDM). During inference, random noise is fed into the LDM and the trained decoder to obtain the generated parameters.

Preparing the data for training the autoencoder. In our paper, we default to synthesizing a subset of model parameters. Therefore, to collect the training data for the autoencoder, we train a model from scratch and densely
save checkpoints in the last epoch. It is worth noting that we only update the selected subset of parameters via the SGD optimizer and fix the remaining parameters of the model. The saved subsets of parameters S = [s_1, \ldots, s_k, \ldots, s_K] are utilized to train the autoencoder, where K is the number of training samples. For some large architectures that have been trained on large-scale datasets, considering the cost of training them from scratch, we fine-tune a subset of the parameters of the pre-trained model and densely save the fine-tuned parameters as training samples.

Training parameter autoencoder. We then flatten these parameters S into 1-dimensional vectors V = [v_1, \ldots, v_k, \ldots, v_K], where V \in \mathbb{R}^{K \times D} and D is the size of the subset of parameters. After that, an autoencoder is trained to reconstruct these parameters V. To enhance the robustness and generalization of the autoencoder, we introduce random noise augmentation on the input parameters and the latent representations simultaneously. The encoding and decoding processes can be formulated as

    Z = [z_1^0, \ldots, z_k^0, \ldots, z_K^0] = f_{\mathrm{encoder}}(V + \xi_V, \sigma),   (encoding)
    V' = [v_1', \ldots, v_k', \ldots, v_K'] = f_{\mathrm{decoder}}(Z + \xi_Z, \rho),       (decoding)        (4)

where f_{\mathrm{encoder}}(\cdot, \sigma) and f_{\mathrm{decoder}}(\cdot, \rho) denote the encoder and decoder parameterized by \sigma and \rho, respectively, Z represents the latent representations, \xi_V and \xi_Z denote random noise added to the input parameters V and the latent representations Z, and V' denotes the reconstructed parameters. We default to using an autoencoder with a 4-layer encoder and decoder. As in normal autoencoder training, we minimize the mean squared error (MSE) loss between V' and V,

    L_{\mathrm{MSE}} = \frac{1}{K} \sum_{k=1}^{K} \lVert v_k - v_k' \rVert^2,        (5)

where v_k' is the reconstructed parameters of the k-th model.
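The following PyTorch sketch illustrates one possible shape of Eq. (4)-(5): a 4-layer 1D-convolutional encoder/decoder over the flattened parameter vectors, with Gaussian noise \xi_V and \xi_Z injected at the input and at the latent, trained with the MSE loss. It is an illustration only; the channel widths, kernel sizes, and down-sampling factor are assumptions rather than the exact configuration used in our experiments, while the noise amplitudes follow the 0.001 and 0.1 reported in the training details.

import torch
import torch.nn as nn

class ParamAutoencoder(nn.Module):
    # 4-layer 1D-conv encoder/decoder over flattened parameter vectors.
    # Channel widths, kernel sizes, and the x16 down-sampling are illustrative assumptions.
    def __init__(self, channels=(1, 16, 32, 64, 8)):
        super().__init__()
        enc, dec = [], []
        for cin, cout in zip(channels[:-1], channels[1:]):
            enc += [nn.Conv1d(cin, cout, 3, stride=2, padding=1), nn.ReLU()]
        rev = channels[::-1]
        for cin, cout in zip(rev[:-1], rev[1:]):
            dec += [nn.ConvTranspose1d(cin, cout, 4, stride=2, padding=1), nn.ReLU()]
        self.encoder = nn.Sequential(*enc[:-1])   # drop trailing ReLU
        self.decoder = nn.Sequential(*dec[:-1])

    def forward(self, v, xi_v=1e-3, xi_z=1e-1):
        # v: (K, D) flattened parameter subsets; D assumed divisible by 16 in this sketch.
        v_noisy = v + xi_v * torch.randn_like(v)      # xi_V: noise on input parameters
        z = self.encoder(v_noisy.unsqueeze(1))        # Eq. (4), encoding
        z_noisy = z + xi_z * torch.randn_like(z)      # xi_Z: noise on latent representations
        v_rec = self.decoder(z_noisy).squeeze(1)      # Eq. (4), decoding
        return z, v_rec

def train_autoencoder(ae, V, epochs=1000, lr=1e-3):
    # Eq. (5): mean squared error between reconstructed and original parameter vectors.
    opt = torch.optim.Adam(ae.parameters(), lr=lr)
    for _ in range(epochs):
        _, V_rec = ae(V)
        loss = ((V - V_rec) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()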
2.4. Parameter generation

One of the most direct strategies is to synthesize the novel parameters by applying the diffusion process to the parameter vectors V themselves. However, the memory cost of this operation is too heavy, especially when the dimension of V is ultra-large. Based on this consideration, we apply the diffusion process to the latent representations by default. For Z = [z_1^0, \ldots, z_k^0, \ldots, z_K^0] extracted from the parameter autoencoder, we use the optimization of DDPM (Ho et al., 2020) as follows,

    \theta \leftarrow \theta - \nabla_\theta \lVert \epsilon - \epsilon_\theta(\sqrt{\alpha_t}\, z_k^0 + \sqrt{1 - \alpha_t}\, \epsilon, t) \rVert^2,        (6)

where t is uniform between 1 and T, the sequence of hyperparameters \alpha_t indicates the noise strength at each step, \epsilon is the added Gaussian noise, and \epsilon_\theta(\cdot) denotes the denoising network parameterized by \theta.
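A minimal sketch of the update in Eq. (6) is given below (again illustrative; eps_theta stands for any denoising network over the latent shape, and the schedule reuses the assumed linear betas from the earlier sketch).

import torch

T = 1000
alpha_bars = torch.cumprod(1.0 - torch.linspace(1e-4, 2e-2, T), dim=0)   # as in the earlier sketch

def train_latent_diffusion(eps_theta, Z0, steps=10000, lr=1e-3):
    # Z0: (K, C, L) latent representations z_k^0 from the parameter autoencoder.
    # eps_theta: an nn.Module mapping (z_t, t) to a prediction of the added noise.
    opt = torch.optim.Adam(eps_theta.parameters(), lr=lr)
    for _ in range(steps):
        k = torch.randint(0, Z0.shape[0], (1,)).item()     # pick one training latent
        t = torch.randint(0, T, (1,)).item()               # t drawn uniformly over the chain
        z0 = Z0[k : k + 1]
        eps = torch.randn_like(z0)
        z_t = torch.sqrt(alpha_bars[t]) * z0 + torch.sqrt(1.0 - alpha_bars[t]) * eps
        loss = ((eps - eps_theta(z_t, t)) ** 2).mean()     # the objective of Eq. (6)
        opt.zero_grad(); loss.backward(); opt.step()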
Table 1. We present results in the format of 'original / ensemble / p-diff'. Our method obtains similar or even higher performance than the baselines. The results of p-diff are averaged over three runs. Bold entries are the best results.
Network\Dataset MNIST CIFAR-10 CIFAR-100 STL-10 Flowers Pets F-101 ImageNet-1K
ResNet-18 99.2 / 99.2 / 99.3 92.5 / 92.5 / 92.7 76.7 / 76.7 / 76.9 75.5 / 75.5 / 75.4 49.1 / 49.1 / 49.7 60.9 / 60.8 / 61.1 71.2 / 71.3 / 71.3 78.7 / 78.5 / 78.7
ResNet-50 99.4 / 99.3 / 99.4 91.3 / 91.4 / 91.3 71.6 / 71.6 / 71.7 69.2 / 69.1 / 69.2 33.7 / 33.9 / 38.1 58.0 / 58.0 / 58.0 68.6 / 68.5 / 68.6 79.2 / 79.2 / 79.3
ViT-Tiny 99.5 / 99.5 / 99.5 96.8 / 96.8 / 96.8 86.7 / 86.8 / 86.7 97.3 / 97.3 / 97.3 87.5 / 87.5 / 87.5 89.3 / 89.3 / 89.3 78.5 / 78.4 / 78.5 73.7 / 73.7 / 74.1
ViT-Base 99.5 / 99.4 / 99.5 98.7 / 98.7 / 98.7 91.5 / 91.4 / 91.7 99.1 / 99.0 / 99.2 98.3 / 98.3 / 98.3 91.6 / 91.5 / 91.7 83.4 / 83.4 / 83.4 84.5 / 84.5 / 84.7
ConvNeXt-T 99.3 / 99.4 / 99.3 97.6 / 97.6 / 97.7 87.0 / 87.0 / 87.1 98.2 / 98.0 / 98.2 70.0 / 70.0 / 70.5 92.9 / 92.8 / 93.0 76.1 / 76.1 / 76.2 82.1 / 82.1 / 82.3
ConvNeXt-B 99.3 / 99.3 / 99.4 98.1 / 98.1 / 98.1 88.3 / 88.4 / 88.4 98.8 / 98.8 / 98.9 88.4 / 88.4 / 88.5 94.1 / 94.0 / 94.1 81.4 / 81.4 / 81.6 83.8 / 83.7 / 83.9
After finishing the training of the parameter generation, we directly feed random noise into the reverse process and the trained decoder to generate a new set of high-performing parameters. These generated parameters are concatenated with the remaining model parameters to form new models for evaluation. Neural network parameters and image pixels exhibit significant disparities in several key aspects, including data type, dimensions, range, and physical interpretation. Different from images, neural network parameters mostly have no spatial relevance, so we replace 2D convolutions with 1D convolutions in our parameter autoencoder and parameter generation processes.

3. Experiments
In this section, we first introduce the setup for reproducing our results. Then, we report the result comparisons and ablation studies.

3.1. Setup

Datasets and architectures. We evaluate our approach across a wide range of datasets, including MNIST (LeCun et al., 1998), CIFAR-10/100 (Krizhevsky et al., 2009), ImageNet-1K (Deng et al., 2009), STL-10 (Coates et al., 2011), Flowers (Nilsback & Zisserman, 2008), Pets (Parkhi et al., 2012), and F-101 (Bossard et al., 2014), to study the effectiveness of our method. We mainly conduct experiments on ResNet-18/50 (He et al., 2016), ViT-Tiny/Base (Dosovitskiy et al., 2020), and ConvNeXt-T/B (Liu et al., 2022).

Training details. The autoencoder and the latent diffusion model both include a 4-layer 1D-CNN-based encoder and decoder. We default to collecting 200 training samples for all architectures. For ResNet-18/50, we train the models from scratch. In the last epoch, we continue to train the last two normalization layers and fix the other parameters. We save 200 checkpoints in the last epoch, i.e., the original models. For ViT-Tiny/Base and ConvNeXt-T/B, we fine-tune the last two normalization layers of the released models in the timm library (Wightman, 2019). The noise terms \xi_V and \xi_Z are Gaussian with amplitudes of 0.001 and 0.1, respectively. In most cases, the autoencoder and latent diffusion training can be completed within 1 to 3 hours on a single Nvidia A100 40GB GPU.
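As an illustration of this data-collection step, one can freeze everything except the chosen normalization parameters and dump them repeatedly during the final epoch. The sketch below is ours, and the specific layer names picked as the "last two" BN layers of ResNet-18 are a hypothetical choice:

import torch
import torchvision

model = torchvision.models.resnet18(num_classes=100)
# Hypothetical names for the "last two" BN layers; adjust to the layers actually targeted.
subset = [n for n, _ in model.named_parameters() if n.startswith(("layer4.1.bn1", "layer4.1.bn2"))]

for name, p in model.named_parameters():
    p.requires_grad = name in subset          # fix all parameters outside the subset

opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=1e-3)

def save_subset(step):
    # Flatten the selected BN weights/biases into one vector s_k and store it.
    vec = torch.cat([p.detach().flatten() for n, p in model.named_parameters() if n in subset])
    torch.save(vec, f"ckpt_{step:03d}.pt")

Calling save_subset after successive training steps of the last epoch yields the K = 200 samples s_1, ..., s_K described above.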
Inference details. We synthesize 100 novel parameter sets by feeding random noise into the latent diffusion model and the trained decoder. These synthesized parameters are then concatenated with the aforementioned fixed parameters to form our generated models. From these generated models, we select the one with the best performance on the training set. Subsequently, we evaluate its accuracy on the validation set and report the results. This is done to make a fair comparison with the models trained using SGD optimization. We empirically find that performance on the training set is a good criterion for selecting models for testing.
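The selection procedure can be sketched as follows (our illustration; sample_latent and decoder are placeholders for the trained LDM's reverse process and the parameter autoencoder's decoder, and load_subset/evaluate are small helpers defined inside the sketch):

import torch

@torch.no_grad()
def generate_and_select(sample_latent, decoder, model, subset_names, train_loader, n=100):
    # Generate n parameter subsets, plug each into the frozen model, keep the best on the training set.
    model.eval()
    best_acc, best_vec = -1.0, None
    for _ in range(n):
        vec = decoder(sample_latent()).flatten()      # generated parameter subset
        load_subset(model, subset_names, vec)
        acc = evaluate(model, train_loader)           # selection uses the training set only
        if acc > best_acc:
            best_acc, best_vec = acc, vec
    load_subset(model, subset_names, best_vec)        # this model is then reported on the validation set
    return model

def load_subset(model, subset_names, vec):
    # Copy a flat vector back into the named parameters, in order.
    params, i = dict(model.named_parameters()), 0
    for name in subset_names:
        p = params[name]
        p.copy_(vec[i:i + p.numel()].view_as(p)); i += p.numel()

@torch.no_grad()
def evaluate(model, loader):
    correct = total = 0
    for x, y in loader:
        correct += (model(x).argmax(1) == y).sum().item(); total += y.numel()
    return correct / total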
Baselines. 1) The best validation accuracy among the original models is denoted as 'original'. 2) The average weight ensemble (Krogh & Vedelsby, 1994; Wortsman et al., 2022) of the original models is denoted as 'ensemble'.

3.2. Results

Tab. 1 shows the result comparisons with the two baselines across 8 datasets and 6 architectures. Based on the results, we make the following observations: i) In most cases, our method achieves similar or better results than the two baselines, which demonstrates that our method can efficiently learn the distribution of high-performing parameters and generate superior models from random noise. ii) Our method consistently performs well on various datasets, which indicates the good generality of our method.

3.3. Ablation studies and analysis

Extensive ablation studies are conducted in this section to illustrate the characteristics of our method. We default to training ResNet-18 on CIFAR-100 and report the best, average, and median accuracy (if not otherwise stated).

The number of training models. Tab. 2(a) varies the size of the training data, i.e., the number of original models. We find that the gap in the best results among different numbers of original models is minor. To comprehensively explore the influence of the number of training samples on performance stability, we also report the average (avg.) and median (med.) accuracy as stability metrics for our generated models. Notably, the stability of models generated with a small number of training instances is much worse than that observed in larger settings. This can be explained by the learning principle of the diffusion model: the diffusion process may struggle to model the target distribution well if only a few input samples are used for training.
Table 2. Main ablation experiments of p-diff. We ablate the number of original models K, the location where our approach is applied, and the effect of noise augmentation. The default settings are K = 200, applying p-diff to the deep BN parameters (between layers 16 and 18), and using noise augmentation on both the input parameters and the latent representations. Default settings are marked with '(default)'. Bold entries are the best results.
(a) Large K can improve the performance stability of our method.

K     best  avg.  med.
1     76.6  70.7  73.2
10    76.5  71.2  73.8
50    76.7  71.3  74.3
200   76.9  72.4  75.6  (default)
500   76.8  72.3  75.4

(b) P-diff works well on deep layers. The index of the layer is aligned with the standard ResNet-18.

parameters         best  avg.  med.
original models    76.7  76.6  76.6
BN-layer10 to 14   76.8  71.9  75.3
BN-layer14 to 16   76.9  72.2  75.5
BN-layer16 to 18   76.9  72.4  75.6  (default)

(c) Noise augmentation makes p-diff stronger. Adding noise on latent representations is more important than on parameters.

noise augmentation         best  avg.  med.
original models            76.7  -     -
no noise                   76.7  65.8  65.0
+ para. noise              76.7  66.7  67.3
+ latent noise             76.7  72.1  75.3
+ para. and latent noise   76.9  72.4  75.6  (default)
Table 3. Result comparisons of original, ensemble, and p-diff under the setting of synthesizing entire model parameters. Our method demonstrates good generalization on ConvNet-3 and MLP-3. Bold entries are the best results.
(a) Result comparisons on ConvNet-3 (three convolutional layers and one linear layer).

Dataset     original  ensemble  p-diff  parameter number
CIFAR-10    77.2      77.3      77.5    24714
CIFAR-100   57.2      57.2      57.3    70884

(b) Result comparisons on MLP-3 (three linear layers and ReLU activation).

Dataset     original  ensemble  p-diff  parameter number
MNIST       85.3      85.2      85.4    39760
CIFAR-10    48.1      48.1      48.2    155135
Where to apply p-diff. We default to synthesizing the parameters of the last two normalization layers. To investigate the effectiveness of p-diff at other depths of normalization layers, we also explore the performance of synthesizing shallower-layer parameters. To keep an equal number of BN parameters, we apply our approach to three sets of BN layers located between layers of different depths. As shown in Tab. 2(b), we empirically find that our approach achieves better performance (best accuracy) than the original models at all depths of BN layers. Another finding is that synthesizing deep layers achieves better accuracy than generating shallow ones. This is because generating shallow-layer parameters is more likely to accumulate errors during forward propagation than generating deep-layer parameters.

Noise augmentation. Noise augmentation is designed to enhance the robustness and generalization of training the autoencoder. We ablate the effectiveness of applying this augmentation to the input parameters and to the latent representations, respectively. The ablation results are presented in Tab. 2(c). Several observations can be summarized as follows: i) Noise augmentation plays a crucial role in generating stable and high-performing models. ii) The performance gains of applying noise augmentation to the latent representations are larger than those of applying it to the input parameters. iii) Our default setting, jointly using noise augmentation on parameters and representations, obtains the best performance (including the best, average, and median accuracy).

Generalization on entire model parameters. Until now, we have evaluated the effectiveness of our approach in synthesizing a subset of model parameters, i.e., batch normalization parameters. What about synthesizing entire model parameters? To evaluate this, we extend our approach to two small architectures, namely MLP-3 (three linear layers and a ReLU activation function) and ConvNet-3 (three convolutional layers and one linear layer). Different from the aforementioned training data collection strategy, we individually train these architectures from scratch with 200 different random seeds. We take CIFAR-10 as an example and show the details of these two architectures (convolutional layer: kernel size × kernel size, number of channels; linear layer: input dimension, output dimension) as follows; a PyTorch sketch of both networks is given after the list.

• ConvNet-3: conv1. 3×3, 32; conv2. 3×3, 32; conv3. 3×3, 32; linear layer. 2048, 10.
• MLP-3: linear layer1. 3072, 50; linear layer2. 50, 25; linear layer3. 25, 10.
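A PyTorch rendering of the two architectures following the listed specifications (pooling placement and padding are not specified in the text, so the reduction to the 2048-dimensional linear input is our assumption, and exact parameter counts may therefore differ slightly from Tab. 3):

import torch.nn as nn

class ConvNet3(nn.Module):
    # conv1-3: 3x3 kernels, 32 channels; classifier: 2048 -> 10.
    # The two max-pool layers are an assumption to reach a 32 x 8 x 8 = 2048-dim feature.
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(2048, num_classes)

    def forward(self, x):                      # x: (B, 3, 32, 32)
        return self.classifier(self.features(x).flatten(1))

class MLP3(nn.Module):
    # linear layers: 3072 -> 50 -> 25 -> 10, with ReLU activations.
    def __init__(self, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3072, 50), nn.ReLU(),
            nn.Linear(50, 25), nn.ReLU(),
            nn.Linear(25, num_classes),
        )

    def forward(self, x):                      # x: (B, 3, 32, 32) for CIFAR-10
        return self.net(x)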
We present result comparisons between our approach and the two baselines (i.e., original and ensemble) in Tab. 3. We report the comparisons and the parameter numbers of ConvNet-3 on CIFAR-10/100 and of MLP-3 on CIFAR-10 and MNIST. These experiments demonstrate the effectiveness and generalization of our approach in synthesizing entire model parameters, i.e., achieving similar or even improved performance over the baselines. These results suggest the practical applicability of our method. However, we cannot synthesize the entire parameters of large architectures, such as the ResNet, ViT, and ConvNeXt series; this is mainly constrained by the limitation of GPU memory.
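As a rough, back-of-the-envelope illustration of this constraint (our estimate, not a measurement from the paper): ResNet-18 alone has on the order of 1.2 × 10^7 parameters, so merely storing K = 200 flattened copies in fp32 takes about 200 × 1.2 × 10^7 × 4 bytes ≈ 9.6 GB, and ViT-Base (≈ 8.6 × 10^7 parameters) would need roughly 69 GB, before accounting for the autoencoder activations over such long 1D sequences or the diffusion model itself; this quickly exceeds a 40 GB A100.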
[...] our method in generating neural network parameters. To explore the intrinsic reason behind this, we use 3 random seeds to train the ResNet-18 model from scratch and visualize the parameters in Fig. 3. We visualize the heat map of the parameter distribution via min-max normalization in different layers individually. Based on the visualizations of the parameters of the convolutional (Conv.-layer2) and fully connected (FC-layer18) layers, there indeed exist specific parameter patterns among these layers. By learning these patterns, our approach can generate high-performing neural network parameters.

Figure 3. Visualizing the parameter distributions of convolutional (Conv.-layer2) and fully connected (FC-layer18) layers across three training seeds (columns: Seed 1, Seed 2, Seed 3). Parameters from different layers show different patterns, while parameters from the same layer show similar patterns. The index of the layer is aligned with the standard ResNet-18.
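The heat maps described here can be reproduced in spirit with a few lines of matplotlib (our sketch; reshaping each layer's flattened parameters into a square map, and the checkpoints and layer key in the usage comment, are assumptions made purely for display):

import matplotlib.pyplot as plt

def minmax_heatmap(param, ax, title):
    # Min-max normalize one layer's parameters and draw them as a 2D heat map.
    w = param.detach().flatten().float()
    w = (w - w.min()) / (w.max() - w.min() + 1e-12)
    side = int(w.numel() ** 0.5)                  # square layout chosen only for visualization
    ax.imshow(w[: side * side].reshape(side, side).cpu().numpy(), vmin=0.0, vmax=1.0)
    ax.set_title(title)
    ax.axis("off")

# Usage sketch: one row per layer, one column per seed; checkpoints[i] is a hypothetical
# state dict of the i-th seed's trained ResNet-18.
# fig, axes = plt.subplots(2, 3)
# minmax_heatmap(checkpoints[0]["layer1.0.conv2.weight"], axes[0, 0], "Seed 1, Conv.-layer2")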
[Figure legend: among original models; among p-diff models; between original and p-diff models.]