Synthetic Image Data for Deep Learning

Abstract—Realistic synthetic image data rendered from 3D vehicle CAD models can be used to augment image sets and train image classification and semantic segmentation models. In this work, we […]

[…] to a brand new vehicle or a new model year and then generating a new training set is far less than that of acquiring new real world examples, especially when the […]
1) Metrics: While the image generation and training techniques share applicability with object detection and instance segmentation models with more actionable metrics, we quantify the performance of a standard multiclass U-net segmentation model simply with per-pixel mean intersection-over-union (mean IoU) with uniform class weighting and a prediction threshold of 50%.
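For reference, this metric can be computed as in the following minimal NumPy sketch; the function name and array layout are our own illustration, not taken from the paper's codebase.

```python
import numpy as np

def mean_iou(probs: np.ndarray, truth: np.ndarray, threshold: float = 0.5) -> float:
    """Per-pixel mean IoU with uniform class weighting.

    probs: (H, W, C) per-class prediction probabilities from the model.
    truth: (H, W, C) binary ground-truth masks, one channel per class.
    """
    preds = probs >= threshold  # the 50% prediction threshold
    ious = []
    for c in range(preds.shape[-1]):
        intersection = np.logical_and(preds[..., c], truth[..., c]).sum()
        union = np.logical_or(preds[..., c], truth[..., c]).sum()
        if union > 0:  # ignore classes absent from both prediction and truth
            ious.append(intersection / union)
    return float(np.mean(ious))  # uniform class weighting: unweighted average
```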
B. Real Dataset Supplementation

To determine how supplementing a dataset of real images with synthetic images would affect model training and accuracy, we trained instances of multiple model classes with different mixtures of images from both sets. Subsets of the real image set R of sizes N = {0, 16, 32, ..., 8192} were paired with subsets of the synthetic image set S from the same size range, forming the axes of the 11×11 matrices shown in Figure 2, with model classes at each intersection. For each model class, random samples from R and S were used to train individual U-net segmentation models with parameters reported above. The number of models trained in each class was sufficient that the confidence interval (α = 0.95) width of the mean truth/prediction IoU measurements on the real image validation set was less than 5% of the mean value, requiring between 7 and 30 model instances for each image set size pair. We refer to the resulting set of segmentation models as $M$, where $m_{r,s,i} \in M : r \in N, s \in N, i \in [0..|M_{r,s}|)$ is one instance of a class of U-net models trained on $(r, s)$ random images from datasets R and S, and we report aggregate statistics over the model class $M_{r,s}$ at each cell in the matrices of Figure 2.

Figure 2a aggregates the mean IoU predictions of each trained model class on the unseen validation set from the real image domain. We observe a general trend of increasing accuracy with larger samples of real images, with diminishing returns as the training images grow to sufficiently represent the domain features. Along the horizontal axis, we see that augmentation with synthetic data tended to increase accuracy, with greater yields in models trained on smaller real datasets. We also observe that models trained on purely synthetic data tend to predict the real domain poorly, even with thousands of examples.

To discuss the results of synthetic data augmentation, we first look at the effects of augmentation on model reliability. Figure 3 shows the summary statistics of mean validation set predictions for model classes trained on purely real images and those augmented with 2048 synthetic images, detailing columns 0 and 2048 of Figure 2a. Models trained with smaller random samples of real images tended to show more variation in their resulting prediction accuracy. We observe that augmentation tended to increase mean accuracy and decrease variance in models trained with fewer than 256-512 real images.
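The replicate-count criterion above can be implemented as a simple loop that trains additional model instances until the confidence interval is narrow enough; in the sketch below, train_and_evaluate is a hypothetical stand-in for the paper's training-and-validation pipeline.

```python
import numpy as np
from scipy import stats

def sample_model_class(train_and_evaluate, min_models=7, max_models=30,
                       confidence=0.95, rel_width=0.05):
    """Train replicate models until the 95% CI width of the mean IoU
    falls below 5% of the mean, mirroring the protocol described above."""
    ious = []
    while len(ious) < max_models:
        ious.append(train_and_evaluate())  # one model on a fresh random subset
        if len(ious) >= min_models:
            mean = np.mean(ious)
            lo, hi = stats.t.interval(confidence, df=len(ious) - 1,
                                      loc=mean, scale=stats.sem(ious))
            if (hi - lo) < rel_width * mean:
                break  # estimate of the class mean is precise enough
    return ious
```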
[Figure 2: 11×11 matrices of model classes indexed by # real images (rows, 0-8192) and # synthetic images (columns, 0-8192): (a) mean prediction IoU on the real validation set; (b) percentage increase in mean IoU relative to the pure real column 0; (c) one-sided T-test p-values, with p < 0.05 highlighted.]

Fig. 3: Mean IoU of model predictions on a validation set of 1176 real images. Each IQR plot describes between 7 and 30 individual U-net models trained on random subsets of real and synthetic images. In general, augmenting smaller (≤ 256) sets of real images resulted in higher accuracy and less variation in the trained models, with diminishing returns as the real data became sufficiently representative of the domain.

[Figure 4 plot: mean prediction IoU with real domain vs. synthetic:real ratio (1:1 through 512:1), one trend per real image set size.]
Augmenting the real training sample with varying amounts of synthetic data yields better results depending on how accurate the model is to begin with. Figure 2b reshapes the data in Figure 2a as a percentage increase in mean prediction IoU relative to that of the pure real set (column 0). We can see that augmenting models trained with 512 or more real images only results in a marginal increase, at best 0.6%. However, in models trained with 256 or fewer real images, the accuracy increase is substantial, up to 25.0% when only 16 real images are available. We can also see that the addition of any amount of real images results in models that are more accurate than those trained on synthetic data alone. This is supported by the p-values of one-sided T-tests, $H_a : \mathrm{IoU}(M_{r,s}) > \mathrm{IoU}(M_{r,0})\ \forall\ r, s \in N$, shown in Figure 2c with p < 0.05 highlighted.
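Such a comparison between two model classes can be reproduced with a one-sided two-sample T-test; the IoU samples below are illustrative values, not the paper's measurements, and the use of Welch's unequal-variance form is our assumption, as the paper does not specify it.

```python
from scipy import stats

# Mean IoU of replicate models from an augmented class M_{r,s}
# and a baseline class (illustrative numbers only).
iou_augmented = [0.743, 0.768, 0.771, 0.755, 0.762, 0.749, 0.766]
iou_baseline = [0.535, 0.601, 0.548, 0.612, 0.587, 0.559, 0.570]

# One-sided alternative: the augmented class has the higher mean IoU.
t_stat, p_value = stats.ttest_ind(iou_augmented, iou_baseline,
                                  equal_var=False, alternative="greater")
print(f"p = {p_value:.4f}")  # p < 0.05 cells are highlighted in Figure 2c
```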
Figure 2b also shows that in some cases, particularly those with 512 or more real images, the addition of large amounts of synthetic data correlates with a slight decrease in prediction accuracy, presumably due to dilution of the samples from the real domain and a limited capacity of the model to encompass both the real and synthetic domains. We can observe this trend more clearly when viewing the relationship between real and synthetic image set sizes as a ratio, shown in Figure 4. Each real image set size exhibits an inflection point where accuracy declines, which we suspect is dependent on the capacity of the model and the similarity between real and synthetic data in a particular use case.

Fig. 4: Mean prediction IoU of U-net models on real images, viewed by the ratio of synthetic to real data in the training datasets. Each trend exhibits an inflection point where accuracy decreased, presumably due to the limited capacity of the model to encompass both the real and synthetic domains.
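The ratio view in Figure 4 is a simple regrouping of the per-class results behind Figure 2a; a pandas sketch of the transformation is shown below, with illustrative values in place of the paper's data.

```python
import pandas as pd

# One row per model class, as in Figure 2a (values are illustrative).
df = pd.DataFrame({
    "n_real":      [16, 16, 16, 32, 32, 64],
    "n_synthetic": [16, 256, 8192, 32, 2048, 4096],
    "mean_iou":    [0.55, 0.67, 0.62, 0.61, 0.72, 0.78],
})
df["ratio"] = df["n_synthetic"] / df["n_real"]  # synthetic:real, e.g. 512.0 = 512:1
print(df.groupby("ratio")["mean_iou"].mean())   # one trend point per ratio
```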
To visualize the differences in prediction accuracy, Figure 5 presents the segmentation maps predicted by 10 different models, trained on 16-256 real images and augmented with either 0 or 2048 synthetic images. In contrast to the randomly selected images used to train the models in Figure 2, each real dataset larger than 16 images is a superset of the smaller datasets, and the same real datasets and 2048-image synthetic dataset are reused in each of the augmented models. For this example image, the quality of the predictions is fairly low in the pure real models, limiting usefulness depending on the use case. The addition of synthetic images results in clearly defined door/window boundaries with even the smallest real training set, and better identification of smaller features such as the door handles at 64 real images, compared to requiring 128 without augmentation.

V. TRANSFER LEARNING

Another potential use case for synthetic data is in pretraining models for later improvement with real data, either as a base for multiple specialized models or as a starting point for incremental training as real data becomes available. Our results from the previous section indicate that U-net models trained with 256 or fewer images from our real image dataset suffer from low applicability to new images, so in this section we will focus on pretrained model refinement with small numbers of real images.

The goals and requirements for transfer learning can vary widely, but in our exploration we will focus on use cases stemming from the unavailability of real labelled training images and from the need to specialize a general model for a particular task. As such, we will quantify results in terms of accuracy (in this case, mean prediction IoU on real data) and training time of the model specialization training.

A. U-Net

There are many strategies for transfer learning using the U-Net model [cite], most involving freezing, reinitializing, adding, or removing layers. It is beyond the scope of this work to explore the many factors involved in choosing the optimal strategy for a particular use case. We will instead focus on a relatively simple technique that compares well to our work with a more advanced model in the next subsection, which is to train a U-Net with purely synthetic data and then continue training with real images while optionally freezing or replacing part of the model. Our base synthetic-trained U-Net model uses parameters as described in the previous section, trained with a larger dataset of 36,480 synthetic images, which achieved 0.954 mean prediction IoU on the holdout set from the same synthetic domain. Accuracy on segmentation of real images was similar to the experiments with large pure synthetic datasets in the previous section, only achieving a mean prediction IoU of 0.618 on that domain.

Starting with an identical U-Net base model initialized with random weights, experiments were configured as follows:

• synth-random - only the contracting path (encoder) was initialized with weights from the pretrained base, allowing the untrained expanding path (decoder) to train completely on real data;
• synth-synth - both the encoder and decoder were initialized with pretrained base weights;
• VGG19-random - the encoder part of the model was replaced with VGG19, detailed below, and the decoder left with random weights;
• VGG19-synth - the encoder was replaced with VGG19, and the decoder initialized with pretrained base weights;
• control - the base model was used without freezing or replacing layers, and the initial random weights were unchanged. Note that this is the same configuration as models in the previous section, and the resulting model is trained on purely real data.

Finally, we doubled the above configurations with another parameter, choosing to either freeze the layers of the encoder portion of the model or allow the secondary training with real data to propagate and update the encoder weights. Our expectation was that freezing the encoder section of the model would reduce training time, as there were fewer parameters to update with each back-propagation, but could reduce the model's ability to adapt to the new data. Table II details the number of trainable parameters and mean training time per image-epoch for the four resulting model architectures, which indeed shows decreased time per image with fewer parameters to update.

For some experiments, the encoder layers of the model were replaced with a VGG19[25] model pretrained with weights from ImageNet[1], following the same procedure as the work done in [8] for comparability. With the models initialized with pretrained weights, we continued training using randomly selected subsets of real images until convergence, using the stopping criteria described in the previous section. All model variant and real image sample size permutations were repeated 30 times.
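Concretely, the encoder replacement and the frozen/trainable toggle could be wired together roughly as in the Keras sketch below; the skip-connection taps, decoder widths, and output activation are illustrative assumptions on our part, as the paper does not specify its exact architecture.

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG19

def build_vgg19_unet(input_shape=(256, 256, 3), n_classes=3, freeze_encoder=True):
    """U-Net-style model with a VGG19 encoder pretrained on ImageNet."""
    encoder = VGG19(include_top=False, weights="imagenet", input_shape=input_shape)
    encoder.trainable = not freeze_encoder  # frozen vs. trainable encoder variants

    # Feature maps tapped for skip connections, one per resolution level.
    skip_names = ["block1_conv2", "block2_conv2", "block3_conv4", "block4_conv4"]
    skips = [encoder.get_layer(name).output for name in skip_names]
    x = encoder.get_layer("block5_conv4").output  # bottleneck features

    # Expanding path: upsample, concatenate the matching skip, then convolve.
    for skip, filters in zip(reversed(skips), [512, 256, 128, 64]):
        x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skip])
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

    # Per-class sigmoid outputs match the 50% prediction threshold used earlier.
    outputs = layers.Conv2D(n_classes, 1, activation="sigmoid")(x)
    return Model(encoder.input, outputs)
```

The synth-* variants keep the original U-net contracting path instead; for those, the analogous step is loading the pretrained base weights into the encoder layers rather than swapping in VGG19.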
We first compare on the frozen/trainable encoder variable, visualized in Figure 6. In models using VGG19 as the encoder, we observed greater prediction accuracy and lower training time, while models using our encoder pretrained on synthetic data tended to perform better when the encoder was not frozen during secondary training. This is perhaps due to the large difference in the number of encoder neurons, as propagating the training feedback from each example through the larger VGG19 encoder is more costly and less impactful. We speculate that limiting the neurons being updated each epoch led to faster model convergence, while the models with more trainable weights slowed in training progress enough to trigger early stopping. The training logs support this conjecture, showing extremely slow improvement before training was terminated. It is possible that, given enough time, the accuracy differences between trainable and frozen versions of the same model would diminish. However, since all models use the same early stopping criteria, we present the results as comparable in a practical sense.
[Figure 5 panel labels, top row (pure real): r = 16, 32, 64, 128, 256, each with s = 0; IoU = 0.535, 0.537, 0.726, 0.829, 0.921.]

Fig. 5: Segmentation map predictions of U-net models trained with pure real images (top) vs. the same training sets augmented with 2048 synthetic images (bottom). The input image and ground truth are shown on the left for reference.
In the remainder of this work, comparisons with these models will use the better-performing frozen encoders in the case of VGG19, and trainable encoders for the synthetic data-trained models.

Next, we compare the mean prediction accuracy of the retrained models with frozen encoders to the control models trained from randomly initialized weights. In cases with 64 or fewer real images, we saw an increase in accuracy over a control model trained on purely real data. However, in larger real image classes, and against all control models trained on a mix of real and synthetic data, we saw significantly lower accuracy in the specialized models. We again speculate that the model training may have slowed enough to trigger our early termination criteria, and that a combination of refined learning rate, early termination parameters, and lengthened training time may result in improved accuracy. Our goals in this work are in comparability between experiments, though, so we present these results as a baseline to be improved upon.

In comparing the prediction accuracy of U-net models with different decoder weights, we saw mixed results: the pretrained synthetic data weights appeared to result in lower performance in models with synthetic-weighted encoders trained on 16 or 32 real images, while having the opposite effect in models with VGG19 encoders. In models trained on 64 or more real images, the results were less clear, and a two-sided T-test showed insufficient difference to conclude that the results are drawn from different distributions at p = 0.05.

Comparing encoder paths of the different model classes was more consistent, in that the U-net default layers trained with synthetic data resulted in higher mean prediction accuracy than models using the VGG19 encoder trained on ImageNet, across all real data sample sizes. We conclude from these findings that a relatively small encoder (4.72M parameters) trained on a few thousand images drawn from a synthetic domain similar to the target can outperform the already impressive feature extraction of a large (23.03M parameters) encoder trained on over a million generic real images.

Figure 7b compares the training times of retrained models to those of the control model for each real sample size class, with results between 10.0% and 20.8% of the time required for the control. The time can be accounted for in both the number of trainable parameters in the retrained models with frozen encoders and the number of epochs required to converge. As the mean training time for a purely synthetic U-net (r=0, s=2048) is 11,648 seconds, the training time for a retrained U-net is comparable to that of the control.

B. Double-U-Net

Since the introduction of U-net in 2015, a number of derivative models have been proposed that improve its applicability to certain use cases. One of these, the Double-U-net[8], improves upon the localization of segment instances by dividing the task between, as the name suggests, two U-net models linked together. The first U-net, using a VGG19 encoder trained on ImageNet, outputs feature maps from each […]
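The linkage can be outlined as follows. This is our reading of the DoubleU-Net design in [8], in which the mask from the first network gates the input to the second; both sub-networks here reuse the build_vgg19_unet sketch above purely for brevity, whereas the published model uses a plain convolutional encoder in the second network and inserts ASPP blocks between the paths.

```python
from tensorflow.keras import layers, Model

def build_double_unet(input_shape=(256, 256, 3)):
    """Outline of two linked U-nets: the first produces an initial mask
    that gates the input to the second, and both masks are kept."""
    inputs = layers.Input(input_shape)

    unet1 = build_vgg19_unet(input_shape, n_classes=1)  # first U-net, VGG19 encoder
    mask1 = unet1(inputs)                               # initial segmentation mask

    # Multiply the input by the first mask (broadcast over RGB channels)
    # so the second U-net refines the regions proposed by the first.
    gated = layers.Lambda(lambda t: t[0] * t[1])([inputs, mask1])

    unet2 = build_vgg19_unet(input_shape, n_classes=1)  # stand-in for the second U-net
    mask2 = unet2(gated)

    outputs = layers.Concatenate()([mask1, mask2])      # final output: both masks
    return Model(inputs, outputs)
```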
[Figure 6 plots: mean prediction IoU with real domain (a) and training time (b) vs. # real images (16-256), with one trend per model type: synth base (r=0, s=2048), synth-synth, synth-random, VGG19-synth, and VGG19-random, each with trainable or frozen encoder.]

Fig. 6: Comparisons of mean prediction IoU (a) and training time (b) of secondary training of pretrained U-Net models, with the weights of the contracting path (encoder) either trainable or frozen.

[Figure 7 plots: mean prediction IoU (a) and training time (b) vs. # real images (16-256), comparing synth-synth, synth-random, VGG19-synth, and VGG19-random against the synth base (r=0, s=2048) and the controls (r=x, s=0) and (r=x, s=2048).]

Fig. 7: Limiting to encoder type and decoder initial weights (synthetic pretrained vs. random) model permutations, we observed a sizeable tradeoff between mean prediction IoU (a) and training time (b) when compared to models initialized from randomness.
[Figure plots: mean prediction IoU (a) and training time in seconds (b) vs. # real images (16-256), for model types U-net (r=x, synth-synth), U-net (r=x, VGG19-synth), W-net (r=x, encoder=synth), and W-net (r=x, encoder=VGG19).]

We found that, for this image segmentation problem, synthetic images were an effective technique for augmenting limited sets of real training data. We observed that models trained on purely synthetic images had a very low mean prediction IoU on real validation images. We also observed that adding even very small amounts of real images to a synthetic dataset greatly improved accuracy, and that models trained on datasets augmented with synthetic images were more accurate than those trained on real images alone. We noted that for this domain, 256 to 512 images seemed to be enough to train a reasonably accurate model, with rapidly diminishing returns on adding synthetic images to the mix, eventually resulting in lower accuracy as the real:synthetic ratio dropped.

In use cases that benefit from incremental training or model specialization, we found that pretraining on synthetic images provided a usable base model for transfer learning. While we […]