
Synthetic Image Data for Deep Learning

Jason W. Anderson
BMW IT Research Center, Greenville, SC
jason.anderson@bmwgroup.com

Marcin Ziolkowski
BMW IT Research Center, Greenville, SC
marcin.ziolkowski@bmwgroup.com

Ken Kennedy
BMW IT Research Center, Greenville, SC
ken.kennedy@bmwgroup.com

Amy W. Apon
School of Computing, Clemson University, Clemson, USA
aapon@clemson.edu

arXiv:2212.06232v1 [cs.CV] 12 Dec 2022

Abstract—Realistic synthetic image data rendered from 3D models can be used to augment image sets and train image classification and semantic segmentation models. In this work, we explore how high quality physically-based rendering and domain randomization can efficiently create a large synthetic dataset based on production 3D CAD models of a real vehicle. We use this dataset to quantify the effectiveness of synthetic augmentation using U-net and Double-U-net models. We found that, for this domain, synthetic images were an effective technique for augmenting limited sets of real training data. We observed that models trained on purely synthetic images had a very low mean prediction IoU on real validation images. We also observed that adding even very small amounts of real images to a synthetic dataset greatly improved accuracy, and that models trained on datasets augmented with synthetic images were more accurate than those trained on real images alone. Finally, we found that in use cases that benefit from incremental training or model specialization, pretraining a base model on synthetic images provided a sizeable reduction in the training cost of transfer learning, allowing up to 90% of the model training to be front-loaded.

I. INTRODUCTION

In the field of image classification and segmentation with deep learning systems, access to sets of labelled training images with sufficient quantity and quality can be a formidable barrier to training an accurate model. Collecting, segmenting, and labelling high quality images can be prohibitively expensive both in time and monetary cost. In some cases, the barrier can be lowered by pretraining a model with a generic dataset such as ImageNet[1] and then fine-tuning it on a smaller set of images more directly related to the project goals. However, depending on the specificity requirements for the final model, a generalized dataset may not be useful.

A common alternative to vast quantities of readily available general images and costly task-specific images is synthetic image generation, where a 3D computer model of a scene relevant to the deep learning model is rendered to an image, segmented and/or classified, and then used to augment the training data available to the model. Synthetic data has been used successfully in a growing body of research, in many cases reducing the overall cost of training a model.

Advantages to using synthetic images are not limited to overcoming the time and safety constraints of capturing and annotating real images. 3D modeling systems are very flexible – scenes and assets can be changed and re-rendered with a cost likely far less than the real world equivalent. For example, in the use cases presented in this work, the cost of changing the vehicle CAD model to a brand new vehicle or a new model year and then generating a new training set is far less than that of acquiring new real world examples, especially when the goal is to have a working detection system before the model enters production. The costs of developing a synthetic image generation pipeline specific to a model's goals can be further recuperated in cases where similar images can be used to train other models, potentially requiring only minor alterations to the generator.

While the body of work around using synthetic images in deep learning models has broadened in recent years, we have found little exploration of using synthetic images to pretrain a multistage segmentation model such as the recently proposed DoubleU-net, which has been shown to be highly accurate in some applications. Our motivation in this work is to explore the performance effects of training such a model on various combinations of synthetic and real images.

We present our research on synthetic image training in the context of a real world anomaly detection system, including the results of testing on a large set of annotated proprietary production images. We believe the methodology presented here can be readily applied to other systems, and we make the case that synthetic images can replace real images and still achieve a potentially useful level of performance.

The remainder of this paper is organized as follows. Section II describes concepts related to deep learning systems, synthetic data generation, and the specific models used in this paper. Section III describes the technologies and processes used to generate synthetic images. Section IV shows how synthetic data can augment real images to improve model accuracy. Section V compares those results to techniques using pretrained models and transfer learning. Finally, Section VI summarizes our key results and addresses further questions remaining for exploration.

II. BACKGROUND AND RELATED WORK

Synthetic data can be used to train deep learning models in a number of ways.

First, at one extreme, the model may be trained with only synthetic images, which can be useful in models where acquiring examples of desired detection conditions can be time consuming or unsafe. For example, sufficient examples of rare flaws in products on an assembly line could be time consuming to capture for a quality control model, and examples of unsafe conditions may be challenging to acquire for a video
surveillance system. There has been some success with using purely synthetic data to train models [2], [3], [4], and this may be a good option depending on the use case. Real images, if they exist, can be used as all or part of the test set to prove the model's accuracy.

Next, synthetic images may be mixed with real images in some combination, augmenting the size and/or variation of the training set presented to the model. In published research, this method has been used to successfully decrease model training cost or improve model accuracy, and in some cases both [5], [6], [7].

Finally, synthetic images can also be used to pretrain a model in a two-stage process, either by fitting a model to the synthetic set and then increasing the model bias toward real world examples by iterating over the real image set, or in a multi-model system such as DoubleU-net[8], which is the primary focus of this paper. This method is similar to using generalized image sets to pretrain a system (such as robotic vision) on patterns common to the real world, and then using secondary training to adapt the model to a specific environment.

Jhang et al.[9] demonstrated training a Faster R-CNN[10] object detection model using synthetic images annotated with Unity Perception and generated at scale with Unity Simulation. They found that while a model trained purely on a large (400,000) set of synthetic images performed poorly at detecting objects in situations with occlusions and low lighting, augmenting the synthetic images with a small number of real images significantly improved the detection accuracy over a model trained purely on a small (760) set of real images. Their work was inspired by and complements findings from Hinterstoisser et al.[4], who described a method for domain randomization by composing a backdrop of random objects in front of which the objects of interest are rendered and labeled. Our process is distinguished by using a randomly oriented "skybox" surrounding the subject of interest, which achieves domain randomization with lowered scene complexity and randomized reflections.

Another method of domain adaptation is to insert simulated objects of interest into real images, as in [11].

Rendered images of 3D scenes have been used to train object detection models for a long time, as exemplified by [12] and [13]. More recently, advances in 3D rendering techniques have made photorealistic image generation practical [14], [15], [16].

Other researchers have applied full domain randomization[17] to synthetic image generation with varying degrees of success [4], [18], [19]. Our approach is a hybrid between full domain randomization and photorealistic rendering, varying the lighting and subject/background orientation and randomly sampling from a set of realistic textures. The approaches described in [20], [21] and [22] are most similar to our own in this regard.

Successful specialization of U-net models has been achieved [23], [24] using VGG encoders[25] pretrained on the ImageNet[1] dataset.

Recent research has shown that models developed using synthetic data can be used as a basis for more specific models. This transfer learning can be used in many tasks, such as enhancing detection of object position in [26], [11] and separating target objects from visual distractors in [27].

III. SYNTHETIC IMAGE GENERATION

In this section we describe the tools and workflow developed for creating synthetic images, followed by our experiment designs for validating the output images and using the generated data to train a deep learning model. Software used includes Unity 2020.1, Unity High Definition Rendering Pipeline (HDRP) 7.4.1, and PiXYZ Plugin 2019.2.1.14.

A. 3D Modeling

The image generator was built as a set of scene descriptions, models, and scripts in the Unity 3D game development platform.

For our use case, a vehicle model was translated from its native CATIA V5 CAD format into a Unity asset with the PiXYZ plugin. Importing the CAD object was relatively labor intensive due to a technical difficulty in mapping part materials to Unity textures, which is an area of current work. The work-around for our purposes was to manually assign textures to the approximately 10,000 visible surfaces in the imported Unity asset.

B. Realistic Rendering

In general, synthetic images for model training need to exemplify the characteristics of real images that the model relies on for accurate classification. While these qualities could be vastly different depending on the model, for our use case we needed images that embody the broad range of shadows and reflections seen in the production environment. Rather than attempting to identify and optimize for the most important image features, our approach was to create images as accurately as possible, with a goal of being indistinguishable from real images by a human observer.

Images were rendered using the Unity High Definition Rendering Pipeline. We relied on a number of Unity features designed for high rendering accuracy, and avoided many approximation features designed to improve rendering performance in a game setting requiring high framerates with limited hardware resources. Our settings for image accuracy were largely influenced by guidelines from Unity[28]. The Unity "camera" object was configured to mimic the properties of the physical camera used to capture real images. A full disclosure and justification of the rendering settings we used would be lengthy and beyond the scope of this paper, and will be made available on publication.

We found that several external resources were very helpful in creating realistic image rendering, especially in our use case with automotive models. In particular, Unity's automotive industry-focused Measured Materials library[29] helped us simulate the paint, glass, rubber, and plastic textures of a real vehicle. Skyboxes were sampled from the Unity HDRI pack, captured using techniques described by Lagarde et al.[30].
C. Domain Randomization

We chose a hybrid approach to domain randomization, rendering the image subject as accurately as possible with ambient lighting similar to the production environment. Randomized attributes included the subject position relative to the camera within plausible constraints, vehicle exterior paint colors from a set of possible values, and a single light source (the sun) with varying position.

To separate the subject from the background, we used a background skybox with a very "busy" texture, and then randomized its orientation on all 3 axes for every scene. This served a secondary purpose in creating randomized reflection patterns on all surfaces of the vehicle.

Randomization of objects in the scene was accomplished with a set of scripts written in C#, used natively in Unity for game logic.
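The per-scene randomization described above amounts to drawing a small set of parameters before each render. The sketch below illustrates that sampling logic in Python; the actual implementation is a set of C# scripts inside Unity, and every name and numeric range here is an illustrative assumption rather than a value from the paper.

```python
import random

# Hypothetical illustration of the per-scene randomization described above;
# the authors' actual implementation is C# scripting inside Unity.
PAINT_COLORS = ["white", "black", "grey", "blue"]  # placeholder set of possible values

def sample_scene_parameters():
    """Draw one randomized scene configuration (all ranges are assumptions)."""
    return {
        # subject position relative to the camera, within plausible constraints
        "camera_offset_m": (random.uniform(-0.5, 0.5),   # lateral
                            random.uniform(-0.2, 0.2),   # vertical
                            random.uniform(3.0, 5.0)),   # distance
        # exterior paint color drawn from a fixed set of possible values
        "paint_color": random.choice(PAINT_COLORS),
        # a single light source (the sun) with varying position
        "sun_elevation_deg": random.uniform(15.0, 75.0),
        "sun_azimuth_deg": random.uniform(0.0, 360.0),
        # "busy" background skybox rotated on all three axes, which also
        # randomizes the reflection patterns on the vehicle surfaces
        "skybox_rotation_deg": tuple(random.uniform(0.0, 360.0) for _ in range(3)),
    }

if __name__ == "__main__":
    print(sample_scene_parameters())
```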
D. Segment Labeling

We labeled image segments by capturing multiple images from each randomized scene – one fully rendered image, and then one false color image for each segment. This could have been achieved in many ways, but the approach we found to be most performant in Unity was to maintain a second "mask" copy of the subject model, completely colored with an "unlit" black texture and locked to the same position as the color model. Two identical cameras in the same position were used, one able to see the color model, background, and lighting, and the other able to see only the mask model.

After the normal image was captured with the color camera, the segment capture phase would iterate through the groups of components comprising each segment, recolor the group with an unlit white texture, capture an image with the mask camera, and then recolor the group to the unlit black texture. Figure 1 shows the resulting image segments. This approach had the performance advantage of minimizing the retexturing of materials on the model. It also allowed us to capture occlusions by components not part of the segment of interest, such as the door handles in the example images.
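The per-segment mask captures can then be folded into one-hot label maps like those shown in Figure 1. The following is a minimal sketch of that post-processing step, assuming one grayscale mask capture per feature class; the file layout, class order, and threshold are assumptions for illustration.

```python
import numpy as np
from PIL import Image

FEATURE_CLASSES = ["back door", "back window", "rear window", "front door",
                   "front window", "door handle", "mirror", "tail light"]

def masks_to_onehot(mask_paths, threshold=128):
    """Stack one binary mask per feature class into an (H, W, C) one-hot label array.

    mask_paths: list of file paths, one mask capture per class, in FEATURE_CLASSES order.
    Pixels recolored with the unlit white texture are treated as belonging to the class.
    """
    channels = []
    for path in mask_paths:
        mask = np.asarray(Image.open(path).convert("L"))   # grayscale mask capture
        channels.append((mask >= threshold).astype(np.float32))
    return np.stack(channels, axis=-1)                      # shape (H, W, len(FEATURE_CLASSES))
```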
IV. MODEL TRAINING

To validate the effectiveness of the synthetic image generator, we conducted experiments comparing models trained with varying amounts of real labelled images augmented with synthetic data. Our available data consisted of 14,125 labelled images of real vehicles in a production line, each of which contained one or more examples of eight distinct feature classes. From this dataset, a 10% holdout set was randomly selected for validating models, leaving 12,712 images in the real dataset R for training. The frequency of each feature's appearance is described in Table I, where the subset of the real image set R with one or more pixels belonging to a feature class f is given as R_f = {e | e ∈ R and f ∈ e}, and an example frequency of |R_f|/|R|.

Using the synthetic image generator described in Section III, we rendered a set of 40,406 synthetic images and labels S with the same feature classes as R. Due to a slightly smaller horizontal range of camera freedom, some classes were represented more or less heavily in the synthetic set, as detailed in Table I. However, as we weight each class equally in our metrics and present aggregate statistics over the entire dataset, we deemed that the example frequency weights would not affect the conclusions.

TABLE I: Feature Example Frequency in Image Sets

                    real images R              synthetic images S
feature           examples   frequency       examples   frequency
back door            5,994      47.09%         40,231      99.57%
back window          5,854      45.99%         40,263      99.65%
rear window          4,844      38.05%         24,080      59.60%
front door           6,599      51.84%         22,308      55.21%
front window         5,985      47.02%         26,171      64.77%
door handle          4,670      36.69%         40,084      99.20%
mirror               3,897      30.62%          6,932      17.16%
tail light           4,511      35.44%          8,501      21.04%

A. Training Methodology

Images and labels were used to train U-net[31] convolutional neural network models implemented in TensorFlow[32] 2.0.0 and Keras[33] 2.2.4-tf. Models were trained using an NVIDIA DGX-2 with Tesla V100 GPUs running Ubuntu 18.04.4 LTS.

In this section, all U-net model structure and parameters are identical, with the exception of the input datasets. The U-net implementation was derived from code provided by Jha et al. in their Double-U-net supplement, to be consistent with the further work in Section V. From the original U-net description, the only significant difference is the use of batch normalization[34] after the convolutional layers along the contracting path, which resulted in more consistent training and better generalization in our use case.

A hyperparameter search using real and synthetic datasets revealed optimal parameters that were similar enough to avoid differentiation between the domains. As the purpose of this work is to explore the tradeoffs of synthetic vs. real data, we chose parameters that resulted in consistent and stable training sessions rather than strictly optimizing for the highest possible accuracy. For our datasets, a dropout probability of 0.30, a batch size of 64, and a learning rate of 0.0020 resulted in models that converged quickly and consistently within a reasonable limit on training time and generalized well to the validation data.

As synthetic data can be seen as a form of data augmentation, we chose to forego any traditional augmentation techniques (randomized cropping, gamma shifts, etc.) to present clear results, with the single exception of randomly flipping all training images horizontally to match the real dataset's imaging of both sides of the vehicle. During training, models were evaluated each epoch against the disjoint validation set. To prevent overfitting, we used an early stopping mechanism to halt training and revert to the best weights if no improvement in validation set prediction loss was made over 30 epochs.
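A minimal sketch of this training configuration in TensorFlow 2.x/Keras is shown below. The optimizer, learning rate, dropout, batch size, horizontal-flip augmentation, and early-stopping settings follow the values reported above; build_unet(), the input resolution, the tf.data pipelines, and the per-pixel binary cross-entropy loss are assumptions for illustration only.

```python
import tensorflow as tf

# Sketch only: build_unet(), train_ds, val_ds, the image size, and the loss
# choice are assumptions, not the authors' code.
model = build_unet(input_shape=(512, 512, 3), num_classes=8, dropout=0.30)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0020),
              loss="binary_crossentropy")   # one output channel per feature class

def augment(image, mask):
    # The only augmentation used: random horizontal flips, applied
    # identically to the image and its label map.
    if tf.random.uniform(()) < 0.5:
        image = tf.image.flip_left_right(image)
        mask = tf.image.flip_left_right(mask)
    return image, mask

# Halt training and revert to the best weights if validation loss does not
# improve for 30 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=30, restore_best_weights=True)

model.fit(train_ds.map(augment).batch(64),
          validation_data=val_ds.batch(64),
          epochs=1000,
          callbacks=[early_stop])
```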
Fig. 1: A 3D generated image (top left) in addition to a series of one-hot encoded masks segmenting each object class.

1) Metrics: While the image generation and training techniques share applicability with object detection and instance segmentation models with more actionable metrics, we quantify the performance of a standard multiclass U-net segmentation model simply with per-pixel mean intersection-over-union (mean IoU), with uniform class weighting and a prediction threshold of 50%.
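A sketch of this metric as used here, assuming one binary ground-truth channel and one predicted probability channel per feature class; the handling of empty classes is an assumption, as it is not specified in the paper.

```python
import numpy as np

def mean_iou(y_true, y_pred, threshold=0.5, eps=1e-7):
    """Per-pixel mean IoU with uniform class weighting.

    y_true: binary ground-truth masks, shape (H, W, C)
    y_pred: per-pixel class probabilities, shape (H, W, C)
    The 50% prediction threshold converts probabilities to binary masks;
    classes are then averaged with equal weight.
    """
    pred = y_pred >= threshold
    true = y_true.astype(bool)
    ious = []
    for c in range(true.shape[-1]):
        intersection = np.logical_and(true[..., c], pred[..., c]).sum()
        union = np.logical_or(true[..., c], pred[..., c]).sum()
        ious.append((intersection + eps) / (union + eps))  # eps guards empty classes
    return float(np.mean(ious))
```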
B. Real Dataset Supplementation

To determine how supplementing a dataset of real images with synthetic images would affect model training and accuracy, we trained instances of multiple model classes with different mixtures of images from both sets. Subsets of the real image set R of sizes N = {0, 16, 32, ..., 8192} were paired with subsets of the synthetic image set S from the same size range, forming the axes of the 11x11 matrices shown in Figure 2, with model classes at each intersection. For each model class, random samples from R and S were used to train individual U-net segmentation models with the parameters reported above. The number of models trained in each class was sufficient that the confidence interval (α = 0.95) width of the mean truth/prediction IoU measurements on the real image validation set was less than 5% of the mean value, requiring between 7 and 30 model instances for each image set size pair. We refer to the resulting set of segmentation models as M, where m_{r,s,i} ∈ M : r ∈ N, s ∈ N, i ∈ [0..|M_{r,s}|) is one instance of a class of U-net models trained on (r, s) random images from datasets R and S, and we report aggregate statistics over the model class M_{r,s} at each cell in the matrices of Figure 2.
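A sketch of how one cell of this experiment matrix could be filled is shown below, assuming a train_and_score() helper that trains a single U-net instance on the sampled subset and returns its mean validation IoU; the stopping rule follows the confidence-interval criterion described above.

```python
import random
import numpy as np
from scipy import stats

SIZES = [0, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]

def evaluate_cell(real_pool, synth_pool, r, s, min_runs=7, max_runs=30):
    """Train repeated U-net instances on (r, s) random samples until the 95%
    confidence interval of the mean validation IoU is narrower than 5% of the
    mean.  train_and_score() is a placeholder for the training/evaluation code."""
    scores = []
    while len(scores) < max_runs:
        subset = random.sample(real_pool, r) + random.sample(synth_pool, s)
        scores.append(train_and_score(subset))          # mean validation IoU of one model
        if len(scores) >= min_runs:
            mean = np.mean(scores)
            sem = stats.sem(scores)
            lo, hi = stats.t.interval(0.95, len(scores) - 1, loc=mean, scale=sem)
            if (hi - lo) < 0.05 * mean:
                break
    return np.mean(scores), len(scores)
```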
Figure 2a aggregates the mean IoU predictions of each trained model class on the unseen validation set from the real image domain. We observe a general trend of increasing accuracy with larger samples of real images, with diminishing returns as the training images grow to sufficiently represent the domain features. Along the horizontal axis, we see that augmentation with synthetic data tended to increase accuracy, with greater yields in models trained on smaller real datasets. We also observe that models trained on purely synthetic data tend to poorly predict the real domain, even with thousands of examples.

To discuss the results of synthetic data augmentation, we first look at the effects of augmentation on model reliability. Figure 3 shows the summary statistics of mean validation set predictions for model classes trained on purely real images and those augmented with 2048 synthetic images, which details columns 0 and 2048 from Figure 2a. Models trained with smaller random samples of real images tended to show more variation in their resulting prediction accuracy. We observe that augmentation tended to increase mean accuracy and decrease variance in models trained with fewer than 256-512 real images.

Fig. 2: Aggregated mean prediction IoU (a) of U-net models trained on random samples from real and synthetic datasets. Models augmented with synthetic data showed up to 24.9% higher prediction accuracy (b) than the baseline, particularly with limited amounts of real training images. The p-values (c) of one-sided T-tests, Ha : IoU(M_{r,s}) > IoU(M_{r,0}), show significant accuracy increases (p ≤ 0.05, highlighted) in most models trained with 256 or fewer real images.

Fig. 3: Mean IoU of model predictions on a validation set of 1176 real images. Each IQR plot describes between 7 and 30 individual U-net models trained on random subsets of real and synthetic images. In general, augmenting smaller (≤ 256) sets of real images resulted in higher accuracy and less variation in the trained models, with diminishing returns as the real data became sufficiently representative of the domain.

Fig. 4: Mean prediction IoU of U-net models on real images, viewed by the ratio of real to synthetic data in the training datasets. Each trend exhibits an inflection point where accuracy decreased, presumably due to limited capacity of the model to encompass both the real and synthetic domain.

Augmenting the real training sample with varying amounts of synthetic data yields better results, depending on how accurate the model is to begin with. Figure 2b reshapes the data in Figure 2a as a percentage increase in mean prediction IoU relative to that of the pure real set (column 0). We can see that augmenting models trained with 512 or more real images only results in a marginal increase, at best 0.6%. However, in models trained with 256 or fewer real images, the accuracy increase is substantial, up to 25.0% when only 16 real images are available. We can also see that the addition of any amount of real images results in models that are more accurate than those trained on synthetic data alone. This is supported by the p-values of one-sided T-tests, Ha : IoU(M_{r,s}) > IoU(M_{r,0}) ∀ r, s ∈ N, shown in Figure 2c with p < 0.05 highlighted.
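A sketch of this test for a single cell of the matrix, using SciPy; treating the two model classes as independent samples with unequal variances (Welch's test) is an assumption, as the paper does not state the exact variant used.

```python
from scipy import stats

def augmentation_improves(iou_augmented, iou_baseline, alpha=0.05):
    """One-sided test as in Fig. 2c: does the model class trained with s
    synthetic images have a higher mean IoU than the purely real baseline (s=0)?

    Both arguments are lists of per-model validation mean IoU values
    (placeholders here).  The Welch variant (equal_var=False) is an assumption.
    """
    _, p_value = stats.ttest_ind(iou_augmented, iou_baseline,
                                 equal_var=False, alternative="greater")
    return p_value, p_value <= alpha
```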
Figure 2b also shows that in some cases, particularly in those with 512 or more real images, the addition of large amounts of synthetic data correlates with a slight decrease in prediction accuracy, presumably due to dilution of the samples from the real domain and a limited capacity of the model to encompass both the real and synthetic domains. We can observe this trend more clearly when viewing the relationship between real and synthetic image set sizes as a ratio, shown in Figure 4. Each real image set size exhibits an inflection point where accuracy declines, which we suspect is dependent on the capacity of the model and the similarity between real and synthetic data in a particular use case.
To visualize the differences in prediction accuracy, Figure 5 presents the segmentation maps predicted by 10 different models, trained on 16-256 real images and augmented with either 0 or 2048 synthetic images. In contrast to the randomly selected images used to train the models in Figure 2, each real dataset larger than 16 images is a superset of the smaller datasets, and the same real datasets and 2048-image synthetic dataset are reused in each of the augmented models. For this example image, the quality of the predictions is fairly low in the pure real models, limiting usefulness depending on the use case. The addition of synthetic images results in clearly defined door/window boundaries with even the smallest real training set, and better identification of smaller features such as the door handles at 64 real images, compared to requiring 128 real images without augmentation.

V. TRANSFER LEARNING

Another potential use case for synthetic data is pretraining models for later improvement with real data, either as a base for multiple specialized models or as a starting point for incremental training as real data becomes available. Our results from the previous section indicate that U-net models trained with 256 or fewer images from our real image dataset suffer from low applicability to new images, so in this section we focus on pretrained model refinement with small numbers of real images.

The goals and requirements for transfer learning can vary widely, but in our exploration we focus on use cases stemming from the unavailability of real labelled training images and from the need to specialize a general model for a particular task. As such, we quantify results in terms of accuracy (in this case, mean prediction IoU on real data) and the training time of the model specialization training.

A. U-Net

There are many strategies for transfer learning with the U-Net model, most involving freezing, reinitializing, adding, or removing layers. It is beyond the scope of this work to explore the many factors involved in choosing the optimal strategy for a particular use case. We instead focus on a relatively simple technique that compares well to our work with a more advanced model in the next subsection: training a U-Net with purely synthetic data, and then continuing training with real images while optionally freezing or replacing part of the model. Our base synthetic-trained U-Net model uses the parameters described in the previous section, trained with a larger dataset of 36,480 synthetic images, which achieved 0.954 mean prediction IoU on the holdout set from the same synthetic domain. Accuracy on segmentation of real images was similar to the experiments with large purely synthetic datasets in the previous section, only achieving a mean prediction IoU of 0.618 on that domain.

Starting with an identical U-Net base model initialized with random weights, experiments were configured as follows:

• synth-random – only the contracting path (encoder) was initialized with weights from the pretrained base, allowing the untrained expanding path (decoder) to train completely on real data;
• synth-synth – both the encoder and decoder were initialized with pretrained base weights;
• VGG19-random – the encoder part of the model was replaced with VGG19, detailed below, and the decoder left with random weights;
• VGG19-synth – the encoder was replaced with VGG19, and the decoder initialized with pretrained base weights;
• control – the base model was used without freezing or replacing layers, and the initial random weights were unchanged. Note that this is the same configuration as models in the previous section, and the resulting model is trained on purely real data.

Finally, we doubled the above configurations with another parameter, choosing either to freeze the layers of the encoder portion of the model or to allow the secondary training with real data to propagate and update the encoder weights. Our expectation was that freezing the encoder section of the model would reduce training time, as there were fewer parameters to update with each back-propagation, but could reduce the model's ability to adapt to the new data. Table II details the number of trainable parameters and mean training time per image-epoch for the four resulting model architectures, which indeed shows decreased time per image with fewer parameters to update.

For some experiments, the encoder layers of the model were replaced with a VGG19[25] model pretrained with weights from ImageNet[1], following the same procedure as the work done in [8] for comparability. With the models initialized with pretrained weights, we continued training using randomly selected subsets of real images until convergence, using the stopping criteria described in the previous section. All model variant and real image sample size permutations were repeated 30 times.
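A sketch of the two mechanisms these variants rely on (freezing the contracting path of a pretrained U-net, and substituting a VGG19/ImageNet encoder) in Keras is shown below; the file path, layer naming convention, input resolution, and build_unet_decoder() helper are assumptions, not the authors' code.

```python
import tensorflow as tf

# synth-synth / synth-random: start from the U-net pretrained on synthetic images.
base = tf.keras.models.load_model("unet_synthetic_base.h5")   # hypothetical path

# Optionally freeze the contracting path so secondary training on real images
# updates only the expanding path (decoder).
for layer in base.layers:
    if layer.name.startswith("encoder"):       # assumes encoder layers are named this way
        layer.trainable = False

# VGG19-* variants: replace the encoder with VGG19 pretrained on ImageNet and
# build a (randomly initialized or synthetic-pretrained) U-net decoder on top.
vgg19 = tf.keras.applications.VGG19(include_top=False,
                                    weights="imagenet",
                                    input_shape=(512, 512, 3))
vgg19.trainable = False                        # the frozen-encoder configuration
decoder_output = build_unet_decoder(vgg19)     # hypothetical helper
vgg_unet = tf.keras.Model(vgg19.input, decoder_output)
```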
We first compare on the frozen/trainable encoder variable, visualized in Figure 6. In models using VGG19 as the encoder, we observed greater prediction accuracy and lower training time, while models using our encoder pretrained on synthetic data tended to perform better when the encoder was not frozen during secondary training. This is perhaps due to the large difference in the number of encoder neurons, as propagating the training feedback from each example through the larger VGG19 encoder is more costly and less impactful. We speculate that limiting the neurons being updated each epoch led to faster model convergence, while the models with more trainable weights slowed in training progress enough to trigger early stopping. The training logs support this conjecture, showing extremely slow improvement before training was terminated. It is possible that, given enough time, the accuracy differences between trainable and frozen versions of the same model would minimize. However, since all models use the same early stopping criteria, we present the results as comparable in a practical sense.
Fig. 5: Segmentation map predictions of U-net models trained with pure real images (top) vs. the same training sets augmented with 2048 synthetic images (bottom). The input image and ground truth are shown on the left for reference. (Per-panel mean IoU for r = 16, 32, 64, 128, 256: 0.535, 0.537, 0.726, 0.829, 0.921 with s = 0; 0.954, 0.952, 0.969, 0.965, 0.975 with s = 2048.)

TABLE II: Model Size and Training Time

                                parameters (millions)    training time per
model variant                   total      trainable     epoch-image (s)
U-net                            7.77         7.77           0.0130
U-net frozen encoder             7.77         3.05           0.0120
U-net VGG19 encoder             23.86        23.86           0.0176
U-net frozen VGG19 encoder      23.86         3.83           0.0153
W-net frozen 1st U-net          10.11         2.34           0.0146
W-net frozen VGG19 encoder      26.59         6.56           0.0195

In the remainder of this work, comparisons with these models will use the better-performing frozen encoders in the case of VGG19, and trainable encoders for the synthetic data-trained models.

Next, we compare the mean prediction accuracy of the retrained models with frozen encoders to the control models trained from randomly initialized weights. We observed that in cases with 64 or fewer real images, we saw an increase in accuracy over a control model trained on purely real data. However, in larger real image classes and with all control models trained on a mix of real and synthetic data, we saw significantly lower accuracy in the specialized models. We again speculate that the model training may have slowed enough to trigger our early termination criteria, and that a combination of refined learning rate, early termination parameters, and lengthened training time may result in improved accuracy. Our goals in this work are in comparability between experiments, though, so we present these results as a baseline to be improved upon.

In comparing the prediction accuracy of U-net models with different decoder weights, we saw mixed results; the pretrained synthetic data weights appeared to result in lower performance in models with synthetic-weighted encoders trained on 16 or 32 real images, while having the opposite effect in models with VGG19 encoders. In models trained on 64 or more real images, the results were less clear, and a two-sided T-test showed insufficient difference to conclude that the results are drawn from different distributions at p = 0.05.

Comparing encoder paths of the different model classes was more consistent, in that the U-net default layers trained with synthetic data resulted in higher mean prediction accuracy than models using the VGG19 encoder trained on ImageNet, across all real data sample sizes. We conclude from these findings that a relatively small encoder (4.72m parameters) trained on a few thousand images drawn from a similar synthetic domain to the target can outperform the already impressive feature extraction of a large (23.03m parameters) encoder trained on over a million generic real images.

Figure 7b compares the training times of retrained models to those of the control model for each real sample size class, with results between 10.0% and 20.8% of the time required for the control. The time can be accounted for in both the number of trainable parameters in the retrained models with frozen encoders and the number of epochs required to converge. As the mean training time for a purely synthetic U-net (r=0, s=2048) is 11,648 seconds, the training time for a retrained U-net is comparable to that of the control.
Fig. 6: Comparisons of mean prediction IoU (a) and training time (b) of secondary training of pretrained U-Net models, with the weights of the contracting path (encoder) either trainable or frozen.

Fig. 7: Limiting to encoder type and decoder initial weights (synthetic pretrained vs. random) model permutations, we observed a sizeable tradeoff between mean prediction IoU (a) and training time (b) when compared to models initialized from randomness.

B. Double-U-Net

Since the introduction of U-net in 2015, a number of derivative models have been proposed that improve its applicability to certain use cases. One of these, the Double-U-net[8], improves upon the localization of segment instances by dividing the task between, as the name suggests, two U-net models linked together. The first U-net, using a VGG19 encoder trained on ImageNet, outputs feature maps from each level of the encoding process as well as an intermediate segmentation map from the decoder. The segmentation map is paired with the original image as input to the second U-net, while feature map outputs of the first U-net are linked to corresponding layers of the second U-net decoder. The authors' results showed impressive accuracy gains over a standard U-net on a variety of medical segmentation datasets.

As an exercise in applying transfer learning to a more complex model, we chose the Double-U-net (abbreviated W-net for the remainder of this work) because of its intuitive design as a logical extension to the standard U-net, as well as our experience and success using the model in some production use cases. Our experiments in this section expand on the previous section for ease of comparison, with the caveat that we made some implementation choices toward this goal while potentially sacrificing some peak performance. For example, the authors of W-net used squeeze-excite blocks[35] at the end of each convolutional block, which is not part of the original U-net specification. Additionally, in our image set, vehicle features were largely scale-invariant, as the images were captured from a fixed viewpoint with low variation in the vehicle's distance from the camera. This warranted omission of the Atrous Spatial Pyramid Pooling (ASPP) block between the encoder and decoder in each U-net, which was used in [8] to handle feature scaling. We conducted a limited exploration and found these features to contribute little to no performance gain on our particular use case, so we believe that the simplified model is a better comparison to the transfer learning results on a simple U-net in the previous section.

Our W-net implementation is simply two U-net models, identical to the implementation described in the previous section, with the following two additions. First, as in [8], the U-nets are connected with a pixel-wise multiplication layer, such that the second U-net receives the original image augmented with the segmentation map output of the first U-net. Second, the encoder layer-wise feature maps from the first U-net are concatenated to the inputs of the second U-net decoder, in the same manner as the feature maps from the second U-net encoder.
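A sketch of this wiring in the Keras functional API is shown below; first_unet and build_second_unet are placeholders for the U-net implementations described earlier, the input resolution is assumed, and collapsing the per-class maps to a single foreground map before the multiplication is one possible reading of the connection, not necessarily the authors' exact choice.

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(512, 512, 3))

# First U-net (frozen): per-class segmentation map plus its encoder feature maps.
seg1, skip_features = first_unet(inputs)        # seg1 has one channel per class

# Pixel-wise multiplication connecting the two networks: the per-class maps are
# collapsed to a single foreground map that gates the original image.
gated = tf.keras.layers.Lambda(
    lambda t: t[0] * tf.reduce_max(t[1], axis=-1, keepdims=True)
)([inputs, seg1])

# Second U-net: randomly initialized and fully trainable; its decoder also
# concatenates the first U-net's encoder feature maps, mirroring the skip
# connections coming from the second encoder.
seg2 = build_second_unet(gated, extra_skips=skip_features)

wnet = tf.keras.Model(inputs=inputs, outputs=seg2)
```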
Fig. 8: Results of transfer learning on U-net and W-net models with (first) encoders trained on synthetic data or VGG19/ImageNet, compared to training of control models initialized with random weights. The mean prediction IoU (a) and training time (b) suggest improved accuracy of synthetic-trained encoders, but in some cases with a time cost.

Following the work in the previous section and as an analog to [8], we chose to construct Double-U-nets with two model variations. In the first model, we use a U-net trained on synthetic data as described above, with the entire first U-net frozen. The second model, analogous to [8], uses a frozen VGG19 encoder and a trainable uninitialized decoder. In both models, the second U-net is initialized with random weights and is fully trainable. Our hyperparameter search revealed optimal parameters very close to those used to train the individual U-nets, so we opted to keep the original parameters for comparability.

Our results, shown in Figure 8, show accuracy improvements using the W-net model with the VGG19 encoder over all training image size classes, and similar or better results with the synthetic-trained first U-net. The accuracy improvements correlate with a training cost increase, however, especially with the VGG19-based models, which have more layers to train. The conclusion we draw from these results is that secondary training with a multipart model like W-net can be a viable accuracy enhancement if the time cost can be justified.

VI. CONCLUSIONS

We found that, for this image segmentation problem, synthetic images were an effective technique for augmenting limited sets of real training data. We observed that models trained on purely synthetic images had a very low mean prediction IoU on real validation images. We also observed that adding even very small amounts of real images to a synthetic dataset greatly improved accuracy, and that models trained on datasets augmented with synthetic images were more accurate than those trained on real images alone. We noted that for this domain, 256 to 512 images seemed to be enough to train a reasonably accurate model, with rapidly diminishing returns on adding synthetic images to the mix, eventually resulting in lower accuracy as the real:synthetic ratio dropped.

In use cases that benefit from incremental training or model specialization, we found that pretraining on synthetic images provided a usable base model for transfer learning. While we observed that models trained in a single session outperformed those pretrained on synthetic images and retrained on real data, we also saw that up to 90% of the total training time could be completed in the pretraining phase.

We conclude that synthetic image generation can be beneficial to segmentation model training when insufficient images are available to train a satisfactory model. However, testing must be done to find the break point where adding more synthetic images does not result in higher mean accuracy.

VII. FUTURE WORK

A natural progression from this work is to study the characteristics of synthetic data and identify features that contribute to model accuracy and can be adapted to more closely resemble the real domain, while separating less important features that should be randomized. Recent work in the field of Generative Adversarial Networks (GANs) could be used to automate the feature identification process and help design more robust synthetic image rendering processes. Another interesting topic would be exploring how synthetic images can be used in conjunction with other effective data augmentation techniques, which unfortunately was beyond the scope of this work.
REFERENCES

[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[2] S. R. Richter, V. Vineet, S. Roth, and V. Koltun, "Playing for data: Ground truth from computer games," in European Conference on Computer Vision. Springer, 2016, pp. 102–118.
[3] H. Su, C. R. Qi, Y. Li, and L. J. Guibas, "Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2686–2694.
[4] S. Hinterstoisser, O. Pauly, H. Heibel, M. Martina, and M. Bokeloh, "An annotation saved is an annotation earned: Using fully synthetic training for object detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Oct. 2019.
[5] X. Peng, B. Sun, K. Ali, and K. Saenko, "Learning deep object detectors from 3D models," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1278–1286.
[6] D. Dwibedi, I. Misra, and M. Hebert, "Cut, paste and learn: Surprisingly easy synthesis for instance detection," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1301–1310.
[7] B. Sun and K. Saenko, "From virtual to reality: Fast adaptation of virtual object detectors to real domains," in BMVC, vol. 1, no. 2, 2014, p. 3.
[8] D. Jha, M. A. Riegler, D. Johansen, P. Halvorsen, and H. D. Johansen, "DoubleU-Net: A deep convolutional neural network for medical image segmentation," arXiv preprint arXiv:2006.04868, 2020.
[9] Y.-C. Jhang, A. Palmar, B. Li, S. Dhakad, S. K. Vishwakarma, J. Hogins, A. Crespi, C. Kerr, S. Chockalingam, C. Romero, A. Thaman, and S. Ganguly, "Training a performant object detection ML model on synthetic data using Unity Perception tools," Sep. 2020. [Online]. Available: https://blogs.unity3d.com/2020/09/17/training-a-performant-object-detection-ml-model-on-synthetic-data-using-unity-perception-tools/
[10] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2016.
[11] M. Yan, I. Frosio, S. Tyree, and J. Kautz, "Sim-to-real transfer of accurate grasping with eye-in-hand observations and continuous control," arXiv preprint arXiv:1712.03303, 2017.
[12] R. Nevatia and T. O. Binford, "Description and recognition of curved objects," Artificial Intelligence, vol. 8, no. 1, pp. 77–98, 1977.
[13] D. G. Lowe, "Three-dimensional object recognition from single two-dimensional images," Artificial Intelligence, vol. 31, no. 3, pp. 355–395, 1987.
[14] T. Hodaň, V. Vineet, R. Gal, E. Shalev, J. Hanzelka, T. Connell, P. Urbina, S. N. Sinha, and B. Guenter, "Photorealistic image synthesis for object instance detection," in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 66–70.
[15] Y. Zhang, S. Song, E. Yumer, M. Savva, J.-Y. Lee, H. Jin, and T. Funkhouser, "Physically-based rendering for indoor scene understanding using convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5287–5295.
[16] Z. Li and N. Snavely, "CGIntrinsics: Better intrinsic image decomposition through physically-based rendering," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 371–387.
[17] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, "Domain randomization for transferring deep neural networks from simulation to the real world," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 23–30.
[18] J. Tremblay, T. To, B. Sundaralingam, Y. Xiang, D. Fox, and S. Birchfield, "Deep object pose estimation for semantic robotic grasping of household objects," arXiv preprint arXiv:1809.10790, 2018.
[19] J. Borrego, A. Dehban, R. Figueiredo, P. Moreno, A. Bernardino, and J. Santos-Victor, "Applying domain randomization to synthetic data for object category detection," arXiv preprint arXiv:1807.09834, 2018.
[20] C. Mitash, K. E. Bekris, and A. Boularias, "A self-supervised learning system for object detection using physics simulation and multi-view pose estimation," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 545–551.
[21] A. Prakash, S. Boochoon, M. Brophy, D. Acuna, E. Cameracci, G. State, O. Shapira, and S. Birchfield, "Structured domain randomization: Bridging the reality gap by context-aware synthetic data," in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 7249–7255.
[22] J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, and S. Birchfield, "Training deep networks with synthetic data: Bridging the reality gap by domain randomization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 969–977.
[23] V. Iglovikov and A. Shvets, "TernausNet: U-net with VGG11 encoder pre-trained on ImageNet for image segmentation," arXiv preprint arXiv:1801.05746, 2018.
[24] M. Frid-Adar, A. Ben-Cohen, R. Amer, and H. Greenspan, "Improving the segmentation of anatomical structures in chest radiographs using U-net with an ImageNet pre-trained encoder," in Image Analysis for Moving Organ, Breast, and Thoracic Images. Springer, 2018, pp. 159–168.
[25] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[26] T. Inoue, S. Choudhury, G. De Magistris, and S. Dasgupta, "Transfer learning from synthetic to real images using variational autoencoders for precise position detection," in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 2725–2729.
[27] F. Zhang, J. Leitner, M. Milford, and P. Corke, "Sim-to-real transfer of visuo-motor policies for reaching in clutter: Domain randomization and adaptation with modular networks," world, vol. 7, no. 8, 2017.
[28] P. Y. Donzallaz, "How to set up Unity's high definition render pipeline for high-end visualizations," Jan. 2020. [Online]. Available: https://blogs.unity3d.com/2020/01/09/how-to-set-up-unitys-unitys-high-definition-render-pipeline-for-high-end-visualizations/
[29] E. Martin and L. Vo Van, "We have you covered with the measured materials library," Feb. 2019. [Online]. Available: https://blogs.unity3d.com/2019/02/08/we-have-you-covered-with-the-measured-materials-library/
[30] S. Lagarde, S. Lachambre, and C. Jover, "An artist-friendly workflow for panoramic HDRI," in ACM SIGGRAPH 2016 Courses, ser. SIGGRAPH '16. New York, NY, USA: Association for Computing Machinery, 2016. [Online]. Available: https://doi.org/10.1145/2897826.2927353
[31] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[32] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
[33] F. Chollet et al., "Keras," https://keras.io, 2015.
[34] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning. PMLR, 2015, pp. 448–456.
[35] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
