Object Detection Using Deep CNNs Trained On Synthetic Images
Abstract—The need for large annotated image datasets for training Convolutional Neural Networks (CNNs) has been a significant impediment for their adoption in computer vision applications. We show that with transfer learning an effective object detector can be trained almost entirely on synthetically rendered datasets. We apply this strategy for detecting packaged food products clustered in refrigerator scenes. Our CNN trained only with 4000 synthetic images achieves mean average precision (mAP) of 24 on a test set with 55 distinct products as objects of interest and 17 distractor objects. A further increase of 12% in the mAP is obtained by adding only 400 real images to these 4000 synthetic images in the training set. A high degree of photorealism in the synthetic images was not essential in achieving this performance. We analyze factors like training data set size and 3D model dictionary size for their influence on detection performance. Additionally, training strategies like fine-tuning with selected layers and early stopping, which affect transfer learning from synthetic scenes to real scenes, are explored. Training CNNs with synthetic datasets is a novel application of high-performance computing and a promising approach for object detection applications in domains where there is a dearth of large annotated image data.

Keywords-Convolutional Neural Networks (CNN); Deep learning; Transfer learning; Synthetic datasets; Object Detection; 3D Rendering

I. INTRODUCTION

The field of Computer Vision has reached new heights over the last few years. In the past, methods like DPMs [1], SIFT [2] and HOG [3] were used for feature extraction, and linear classifiers were used for making predictions. Other methods [4] used correspondences between template images and the scene image. Later works focused on class-independent object proposals [5] using segmentation and classification using hand-crafted features. Today, methods based on Deep Neural Networks (DNNs) have achieved state-of-the-art performance on image classification, object detection, and segmentation [6], [7]. DNNs have been successfully deployed in numerous domains [6], [7]. Convolutional Neural Networks (CNNs), specifically, have fulfilled the demand for a robust feature extractor that can generalize to new types of scenes. CNNs were initially deployed for image classification [6] and later extended to object detection [8]. The R-CNN approach [8] used object proposals and features from a pre-trained object classifier. Recently published works like Faster R-CNN [9] and SSD [10] learn object proposals and object classification in an end-to-end fashion.

The availability of large sets of training images has been a prerequisite for successfully training CNNs [6]. Manual annotation of images for object detection, however, is a time-consuming and mechanical task; what is more, in some applications the cost of capturing images with sufficient variety is prohibitive. In fact, the largest image datasets are built upon only a few categories for which images can be feasibly curated (20 categories in PASCAL VOC [11], 80 in COCO [12], and 200 in ImageNet [13]). In applications where a large set of intra-category objects needs to be detected, the option of supervised learning with CNNs is even tougher as it is practically impossible to collect sufficient training material.

There have been solutions proposed to reduce annotation efforts by employing transfer learning or simulating scenes to generate large image sets. The research community has proposed multiple approaches for the problem of adapting vision-based models trained in one domain to a different domain [14]–[18]. Examples include: re-training a model in the target domain [19]; adapting the weights of a pre-trained model [20]; using pre-trained weights for feature extraction [21]; and learning common features between domains [22].

Attempts to use synthetic data for training CNNs to adapt to real scenarios have been made in the past. Peng et al. used available 3D CAD models, both with and without texture, and rendered images after varying the projections and orientations of the objects, evaluating on 20 categories in the PASCAL VOC 2007 data set [23]. The CNN employed for their approach used a general object proposal module [8] which operated independently from the fine-tuned classifier network. In contrast, Su and coworkers [24] used the rendered 2D images from 3D models on varying backgrounds for pose estimation. Their work also uses an object proposal stage and limits the objects of interest to a few specific categories from the PASCAL VOC data set. Georgakis and coworkers [25] propose to learn object detection with synthetic data generated by object instances being superimposed into real scenes at different positions, scales, and illumination. They
propose the use of existing object recognition data sets such as BigBird [26] rather than using 3D CAD models. They limit their synthesized scenes to low-occlusion scenarios with 11 products in the GMU-Kitchens data set. Gupta et al. generate a synthetic training set by taking advantage of scene segmentation to create synthetic training examples; however, the goal is text localization instead of object detection [21]. Tobin et al. perform domain randomization with low-fidelity rendered images from 3D meshes; however, their objective is to locate simpler polygon-shaped objects restricted to a table top in world coordinates [27]. In [28], [29], the Unity game engine is used to generate RGB-D rendered images and semantic labels for outdoor and indoor scenes. They show that by using photo-realistic rendered images the effort for annotation can be significantly reduced. They combine synthetic and real data to train models for semantic segmentation; however, the network requires depth map information for semantic segmentation.

None of the existing approaches to training with synthetic data consider the use of synthetic image datasets for training a general object detector in a scenario where high intra-class variance is present along with high clutter or occlusion. Additionally, while previous works have compared the performance using benchmark datasets, the study of cues or hyper-parameters involved in transfer learning has not received sufficient attention. We propose to detect object candidates in the scene with large intra-class variance, compared to an approach of detecting objects for a few specific categories. We are especially interested in synthetic datasets which do not require extensive effort towards achieving photorealism. In this work, we simulate scenes using 3D models and use the rendered RGB images to train a CNN-based object detector. We automate the process of rendering and annotating the 2D images with sufficient diversity to train the CNN end-to-end and use it for object detection in real scenes. Our experiments also explore the effects of different parameters like data set size and 3D model repository size. We also explore the effects of training strategies like fine-tuning selected layers and early stopping [30] on transfer learning from simulation to reality. The rest of this paper is organized as follows: our methodology is described in section II, followed by the results we obtain reported in section III, finally concluding the paper in section IV.

II. METHOD

Given an RGB image captured inside a refrigerator, our goal is to predict a bound-box and the object class category for each object of interest. In addition, there are a few objects in the scene that need to be ignored. Our approach is to train a deep CNN with synthetic rendered images from available 3D models. An overview of the approach is shown in Figure 1. Our work can be divided into two major parts, namely synthetic image rendering from 3D models and transfer learning by fine-tuning the deep neural network with synthetic images.

Figure 1. Overview of our approach to train object detectors for real images based on synthetic rendered images.

A. Synthetic Generation of Images from 3D Models

We use the open-source 3D graphics software Blender. The Blender-Python API makes it possible to load 3D models and automate the scene rendering. We use the Cycles Render Engine available with Blender, since it supports ray-tracing, to render the synthetic images. Since all the required annotation data is available, we use the KITTI [31] format with bound-box co-ordinates, truncation state and occlusion state for each object in the image.

Real-world images have a lot of information embedded about the environment, illumination, surface materials, shapes, etc. Since the trained model must, at test time, be able to generalize to real-world images, we take the following aspects into consideration during generation of each scenario:

• Number of objects
• Shape, texture, and materials of the objects
• Texture and materials of the refrigerator
• Packing pattern of the objects
• Position and orientation of the camera
• Illumination via light sources
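As a concrete illustration, the short script below sketches how such randomized scenes could be assembled, rendered, and annotated automatically through Blender's bpy Python API. It is only a sketch under stated assumptions, not the exact pipeline used in this work: the file paths, the light object name, the coordinate ranges, and the bounding-box projection are placeholders, and some bpy operator names differ between Blender versions.

```python
# Minimal sketch of an automated Blender scene-randomization and rendering
# loop (NOT the exact script used in this work). Paths, the light object
# name, and the coordinate ranges are placeholder assumptions.
import random
import bpy
import mathutils
from bpy_extras.object_utils import world_to_camera_view

RES = 512
scene = bpy.context.scene
scene.render.engine = 'CYCLES'          # Cycles supports ray-traced rendering
scene.render.resolution_x = RES
scene.render.resolution_y = RES

def image_bbox(obj, cam):
    """Project the corners of obj's 3D bounding box into pixel coordinates."""
    xs, ys = [], []
    for corner in obj.bound_box:
        world_co = obj.matrix_world @ mathutils.Vector(corner)
        u, v, _ = world_to_camera_view(scene, cam, world_co)
        xs.append(u * RES)
        ys.append((1.0 - v) * RES)      # image origin is top-left
    return min(xs), min(ys), max(xs), max(ys)

def render_random_scene(model_paths, out_stem):
    # Import 5-25 randomly chosen product models and scatter them on a tray.
    placed = []
    for path in random.sample(model_paths, random.randint(5, 25)):
        bpy.ops.import_scene.obj(filepath=path)   # name differs in Blender 3.x
        obj = bpy.context.selected_objects[0]
        obj.location = (random.uniform(-0.3, 0.3), random.uniform(-0.2, 0.2), 0.0)
        obj.rotation_euler = (0.0, 0.0, random.uniform(0.0, 6.28))
        placed.append(obj)

    # Vary illumination and camera pose per scene (dim, fridge-like lighting).
    bpy.data.objects['Light'].data.energy = random.uniform(50, 400)
    cam = scene.camera
    cam.location = (random.uniform(-0.5, 0.5), -1.0, random.uniform(0.3, 0.8))

    scene.render.filepath = out_stem + '.png'
    bpy.ops.render.render(write_still=True)

    # One KITTI-style line per object: class, truncation, occlusion, alpha,
    # 2D bbox, then 3D fields (zeroed here because only 2D boxes are needed).
    with open(out_stem + '.txt', 'w') as labels:
        for obj in placed:
            x1, y1, x2, y2 = image_bbox(obj, cam)
            labels.write(f"object 0.0 0 0.0 {x1:.1f} {y1:.1f} {x2:.1f} {y2:.1f} "
                         "0 0 0 0 0 0 0\n")
```

In the same spirit, the truncation and occlusion flags mentioned above would be filled in from the projected geometry rather than the zero placeholders used here.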
Figure 2. Overview of the training dataset. a) Snapshots of a few of the 3D models from the ShapeNet database used for rendering images. We illustrate the variety in object textures, surface materials and shapes in the 3D models used for rendering. b) Rendered non-photorealistic images with varying object textures, surface materials and shapes, arranged in random, grid and bin-packed patterns, captured from various camera angles with different illuminations. c) A few real images used to illustrate the difference between real and synthetic images. These images are a subset of the real dataset used for benchmarking the performance of the model trained with synthetic images.
In order to simulate the scenario, we need 3D models, their texture information and metadata. Thousands of 3D CAD models are available online. We choose the ShapeNet [32] database since it provides a large variety of objects of interest for our application. Among various categories from ShapeNet like bottles, tins, cans and food items, we selectively add 616 object models to an object repository (R0) for generating scenes. Figure 2a shows a few of the models in R0. The variety helps randomize the aspects of shape, texture and materials of the objects. For the refrigerator, we choose a model from Archive3D [33] suitable for the application. The design of the refrigerator remains the same for all the scenarios, though the textures and material properties are dynamically chosen.

For generating a training set of rendered images, the 3D scenes need to be distinct. The refrigerator model, together with 5-25 randomly selected objects from R0, is imported into each scene. To simulate clusters of objects packed in a refrigerator as in real-world scenarios, we use three placement patterns for the 3D models, namely grid, random and bin packing. The grid pattern places the objects in a particular scene on a refrigerator tray top at predefined distances. Random placement drops the objects at random locations on the refrigerator tray top. Bin packing tries to optimize the usage of the tray-top area, placing objects very close together and clustered in the scene to replicate common refrigerator scenarios. The light sources are placed such that the illumination varies in every scene and the images are not biased towards a well-lit environment, since refrigerators generally tend to have dim lighting. Multiple cameras are placed at random locations and orientations to render images from each scene. The refrigerator texture and material properties are dynamically chosen for every rendered image. Figure 2b shows a few rendered images used as the training set, while Figure 2c shows the subset of real-world images used in training.
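To make the three placement patterns concrete, the following is a minimal sketch of how the tray-top (x, y) positions could be generated; the tray dimensions, spacing, and greedy shelf-packing rule are illustrative assumptions rather than the exact procedure used for the dataset.

```python
# Illustrative generation of tray-top (x, y) placements for the three
# patterns; tray size, grid spacing and the packing rule are assumptions.
import random

TRAY_W, TRAY_D = 0.6, 0.4            # assumed tray dimensions (metres)

def grid_positions(n, spacing=0.12):
    """Grid: objects at predefined distances on the tray top."""
    cols = max(1, int(TRAY_W // spacing))
    return [(-TRAY_W / 2 + (i % cols) * spacing,
             -TRAY_D / 2 + (i // cols) * spacing) for i in range(n)]

def random_positions(n):
    """Random: objects dropped at random tray-top locations."""
    return [(random.uniform(-TRAY_W / 2, TRAY_W / 2),
             random.uniform(-TRAY_D / 2, TRAY_D / 2)) for _ in range(n)]

def bin_packed_positions(footprints):
    """Bin packing: greedily fill the tray row by row with no slack,
    producing tightly clustered arrangements. footprints = [(width, depth)]."""
    positions, x, y, row_depth = [], -TRAY_W / 2, -TRAY_D / 2, 0.0
    for w, d in footprints:
        if x + w > TRAY_W / 2:       # current row is full: start a new row
            x, y = -TRAY_W / 2, y + row_depth
            row_depth = 0.0
        positions.append((x + w / 2, y + d / 2))
        x += w
        row_depth = max(row_depth, d)
    return positions
```

The greedy shelf rule here is just one simple way to obtain the tightly packed, partially occluded arrangements described above.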
B. Deep Neural Network Architecture, Training and Evaluation

Figure 3. Work-flow for the major steps of the system. a) Using annotated images, the FCN generates a coverage map and bound-box co-ordinates. The training loss is a weighted sum of coverage and bound-box loss. b) At validation time, coverage map and bound-boxes are generated from the FCN.

Figure 3 provides a detailed illustration of the network architecture and the work-flow for the training and validation stages. For neural network training we use NVIDIA DIGITS DetectNet [34] with the Caffe [35] library as back-end. During training, the RGB images with a resolution of 512 x 512 pixels are labelled in the standard KITTI [31] format for object detection. We neglect objects that are truncated or highly occluded in the images, using the appropriate flags in the ground-truth labels generated while rendering. The dataset is then fed into a fully convolutional network (FCN) predicting a coverage map for each detected class. The FCN network, represented concisely in Figure 4, has the same structure as GoogLeNet [7] without the data input layers and output layers. For our experiments, we use pre-trained weights on ImageNet to initialize the FCN network, which has earlier been helpful for transfer learning [25]. The coverage loss is computed as the squared error between the coverage maps predicted by the network and ground truth:

\[
\frac{1}{2N}\sum_{i=1}^{N}\left(\mathrm{coverage}_{i}^{t}-\mathrm{coverage}_{i}^{p}\right)^{2} \qquad (1)
\]
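Read as code, Eq. (1) is one half of the squared difference between predicted and ground-truth coverage maps, averaged over the N training samples. A minimal NumPy illustration follows (the array shapes are an assumption; DetectNet's actual loss is implemented inside its Caffe layers):

```python
import numpy as np

def coverage_loss(pred, target):
    """Eq. (1): squared difference between predicted and ground-truth
    coverage maps, summed per sample and averaged over the N samples
    with the conventional 1/2 factor. Both arrays: shape (N, H, W)."""
    assert pred.shape == target.shape
    n = pred.shape[0]
    return float(np.sum((target - pred) ** 2) / (2.0 * n))

# Example: a batch of 4 coverage maps on a 32 x 32 grid of cells.
pred = np.random.rand(4, 32, 32)
target = (np.random.rand(4, 32, 32) > 0.5).astype(float)
print(coverage_loss(pred, target))
```

As the Figure 3 caption notes, the full training objective also contains a bounding-box regression term, weighted against this coverage term.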
B. Detection Accuracy

We evaluate our best object detector model on a set of 50 crowd-sourced refrigerator scenes with all cue variances, covering 55 distinct objects of interest considered as positives and 17 distractor objects as negatives. Figure 8 shows the variety in the test set and the predicted bound-boxes for all refrigerator images. The detector achieves a mAP of 24 on this dataset, which is a promising result considering that no distractor objects were used while training with synthetic images.

Figure 8. Scenes representing variance in scale, background, textures, illumination, packing patterns and material properties. Top Row: the object detector correctly predicts the bound-boxes for all objects of interest. Middle Row: the object detector misses objects of interest. Bottom Row: the object detector falsely predicts the presence of an object.
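The paper does not spell out the exact mAP protocol, so the sketch below is only a generic illustration of how average precision over such a test set is commonly computed (score-sorted greedy matching at an assumed IoU threshold of 0.5, 11-point interpolation); the mAP is then the mean of this value over the object classes.

```python
# Generic sketch of detection evaluation: greedy IoU matching at a fixed
# threshold, then 11-point interpolated average precision for one class.
import numpy as np

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def average_precision(detections, gt_boxes, iou_thr=0.5):
    """detections: list of (image_id, score, box); gt_boxes: dict mapping
    image_id -> list of ground-truth boxes for one class."""
    n_gt = sum(len(v) for v in gt_boxes.values())
    matched = {img: [False] * len(v) for img, v in gt_boxes.items()}
    tps, fps = [], []
    for img, score, box in sorted(detections, key=lambda d: -d[1]):
        cands = gt_boxes.get(img, [])
        best = max(range(len(cands)), key=lambda j: iou(box, cands[j]), default=-1)
        if best >= 0 and iou(box, cands[best]) >= iou_thr and not matched[img][best]:
            matched[img][best] = True
            tps.append(1.0); fps.append(0.0)
        else:
            tps.append(0.0); fps.append(1.0)
    tp, fp = np.cumsum(tps), np.cumsum(fps)
    recall = tp / max(n_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    # 11-point interpolation; mAP averages this quantity over all classes.
    return np.mean([precision[recall >= r].max() if np.any(recall >= r) else 0.0
                    for r in np.linspace(0, 1, 11)])
```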
We observe that the detector handles scale, shape and texture variance, though packing patterns like vertical stacking or highly oblique camera angles lead to false predictions. A few vegetables among the distractor objects are falsely predicted as objects of interest, suggesting the influence of pre-training.

IV. CONCLUSION

The question arises how well a network trained with synthetic images fares against one trained with real-world images. Hence we compare the performance of networks trained with three different training image sets, as illustrated in Figure 9. The synthetic training set consisted of 4000 images with 200 3D object models of interest, while the real training set consisted of 400 images parsed from the internet with 240 distinct products and 19 distractor objects. The hybrid set of synthetic and real images consisted of 3600 synthetic and 400 real images. All models were evaluated on a set of 50 refrigerator scenes with less than 5% object overlap between the test-set and train-set images. The CNN trained fully with 4000 synthetic images (24 mAP) underperforms against one trained with 400 real images (28 mAP), but the addition of the 4000 synthetic images to the real dataset boosts the detection performance by 12% (36 mAP), which signifies the importance of transferable cues from synthetic to real.

Figure 9. Performance plots illustrating the effect of including synthetic images while training neural networks.

To improve the observed performance, several tactics can be tried. The presence of distractor objects in the test set was observed to negatively impact performance. We are working on the addition of distractor objects to the 3D model repository for rendering scenes with distractor objects, to train the network to become aware of them. Optimizing the model architecture or replacing DetectNet with object
proposal networks might be another alternative. Training CNNs for semantic segmentation using synthetic images and the addition of depth information to the training sets are also expected to help in the case of images with a high degree of occlusion.

ACKNOWLEDGMENT

We acknowledge funding support from the Innit Inc. consultancy grant CNS/INNIT/EE/P0210/1617/0007 and High Performance Computing Lab support from Mr. Sudeep Banerjee. We thank Aalok Gangopadhyay for the insightful discussions.

REFERENCES

[1] D. Forsyth, "Object detection with discriminatively trained part-based models," Computer, vol. 47, no. 2, pp. 6–7, Feb. 2014.

[2] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, Nov. 2004.

[3] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. I. IEEE, 2005, pp. 886–893.

[4] S. Ekvall, F. Hoffmann, and D. Kragic, "Object recognition and pose estimation for robotic manipulation using color cooccurrence histograms," in Proceedings of the 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003), vol. 2. IEEE, 2003, pp. 1284–1289.

[5] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, "Selective search for object recognition," International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, Sep. 2013.

[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in International Conference on Neural Information Processing Systems. Curran Associates Inc., 2012, pp. 1097–1105.

[7] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, Jun. 2015, pp. 1–9.

[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Region-based convolutional networks for accurate object detection and segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 1, pp. 142–158, Jan. 2016.

[9] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, Jun. 2017.

[10] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in Lecture Notes in Computer Science, vol. 9905 LNCS, 2016, pp. 21–37.

[11] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The Pascal Visual Object Classes Challenge: A retrospective," International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, 2014.

[12] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Lecture Notes in Computer Science, vol. 8693 LNCS, part 5, 2014, pp. 740–755.

[13] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, Jun. 2009, pp. 248–255.

[14] W. Li, L. Duan, D. Xu, and I. W. Tsang, "Learning with augmented features for supervised and semi-supervised heterogeneous domain adaptation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 6, pp. 1134–1148, Jun. 2014.

[15] J. Hoffman, E. Rodner, J. Donahue, T. Darrell, and K. Saenko, "Efficient learning of domain-invariant image representations," in ICLR, Jan. 2013, pp. 1–9.

[16] J. Hoffman, S. Guadarrama, E. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell, and K. Saenko, "LSDA: Large scale detection through adaptation," in Proceedings of the 27th International Conference on Neural Information Processing Systems. MIT Press, 2014, pp. 3536–3544.

[17] B. Kulis, K. Saenko, and T. Darrell, "What you saw is not what you get: Domain adaptation using asymmetric kernel transforms," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, Jun. 2011, pp. 1785–1792.

[18] M. Long, Y. Cao, J. Wang, and M. I. Jordan, "Learning transferable features with deep adaptation networks," in Proceedings of the 32nd International Conference on Machine Learning, vol. 37, 2015, pp. 97–105.

[19] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, "How transferable are features in deep neural networks?" in Proceedings of the 27th International Conference on Neural Information Processing Systems. MIT Press, 2014, pp. 3320–3328.

[20] Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou, "Revisiting batch normalization for practical domain adaptation," arXiv preprint arXiv:1603.04779, Mar. 2016.

[21] A. Gupta, A. Vedaldi, and A. Zisserman, "Synthetic data for text localisation in natural images," arXiv preprint arXiv:1604.06646, Apr. 2016.

[22] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, "Deep domain confusion: Maximizing for domain invariance," arXiv preprint arXiv:1412.3474, Dec. 2014.

[23] X. Peng, B. Sun, K. Ali, and K. Saenko, "Learning deep object detectors from 3D models," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1278–1286.

[24] H. Su, C. R. Qi, Y. Li, and L. J. Guibas, "Render for CNN: Viewpoint estimation in images using CNNs trained with rendered 3D model views," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2686–2694.

[35] M. P. Vlastelica, S. Hayrapetyan, M. Tapaswi, and R. Stiefelhagen, "KIT at MediaEval 2015 - Evaluating visual cues for affective impact of movies task," in CEUR Workshop Proceedings, vol. 1436. ACM Press, 2015, pp. 675–678.

[36] D. Hoiem, Y. Chodpathumwan, and Q. Dai, "Diagnosing error in object detectors," in Lecture Notes in Computer Science, vol. 7574 LNCS, part 3. Springer, Berlin, Heidelberg, 2012, pp. 340–353.