Towards Learning 3d Object Detection and 6d Pose Estimation From Synthetic Data
Abstract—Deep Learning-based approaches for 3d object detection and 6d pose estimation typically require large amounts of labeled training data. Labeling image data is expensive and particularly the 6d pose information is difficult to obtain, as it requires a complex setup during image acquisition. Training with synthetic data is therefore very attractive. Large amounts of synthetic, labeled data can be generated, but it is not yet fully understood how certain aspects of data generation affect the detection and pose estimation performance. Our work therefore focuses on creating synthetic training data and investigating the effects on detection performance. We present two methods for data generation: rendering object views and pasting them on random background images, and simulating realistic scenes. The former is computationally simpler and achieved better results, but the detection performance is still very sensitive to small changes, e.g. the type of background images.

Index Terms—object detection, synthetic data, deep learning
I. INTRODUCTION
Sensing and perception of unstructured or less-structured environments are some of the major needs for advanced robotics in manufacturing industries [1]. Crucial capabilities to satisfy these needs are 3d object detection and 6d pose estimation, which we will call 6d detection. The SIXD Challenge 2017 [2] compared several state-of-the-art 6d detection methods on a variety of datasets and found that approaches based on point pair feature matching [3, 4] performed best. However, only recently a range of methods utilizing Deep Neural Networks to accomplish the 6d detection task has emerged, which are not yet considered in the benchmark.

Most approaches rely on pre-trained classification or detection networks and extend their architectures to generate pose hypotheses [5–8], but there are several different methods to generate the hypotheses. Some use a categorization approach, i.e. the network is able to distinguish a discrete number of views [6, 9]. The detector of [8] also implements this idea but uses an autoencoder to compare detected object views with template object views. Another approach is to predict the corner points of the 3d bounding box and compute the pose using a PnP algorithm [5, 7]. Only a few try to directly predict the rotation, e.g. by using a quaternion regression [10].

Although these methods already achieve competitive results, in contrast to point pair feature matching they require annotated training data. Particularly the 6d pose information is difficult to obtain, as it requires a complex setup during image acquisition. To overcome this hurdle, synthetic images, which can be created in arbitrary amounts, are used for training instead of real data. Yet, special care has to be taken to bridge the domain gap that emerges when training with synthetic data but testing with real data [11]. Some studies exist that investigate the effects of data generation on the detection performance for the simpler task of 2d detection [11–14], but to our knowledge no such works exist for 6d detection.

In this paper, we apply two data generation methods to create synthetic training data (as in Fig. 1) for Deep Learning-based detection methods and examine the effects on their detection performance. We aim to help the research community by sharing our insights on how synthetic training data should be shaped to achieve the best detection results.

Figure 1. Example images generated by the two approaches: render and compose (left) and simulated scene (right).

II. RELATED WORKS

Training with synthetic data is very attractive, as it renders the manual collection and annotation of datasets unnecessary. Of the methods mentioned in the introduction, [9] train with real data, and [5, 7, 10] additionally use synthetic images created by randomly pasting cropped views of the object into background images from datasets like ImageNet or Pascal VOC. Only [6, 8] completely rely on synthetic images, generated by rendering the 3d model in different poses, using various illumination settings and further augmentations. However, none of them thoroughly examined the effect of their data synthesis pipeline, although [8] did some ablation studies on the effect of different color augmentations and found that results improve with an increasing number of augmentations used.

This observation is the basis for the strategy of domain randomization as introduced in [11]. They argue that if the variability of synthetic data is high enough, the trained models will generalize to the real world without adaptation.
In their experiments, they used a physics simulation to generate scenes where they randomized e.g. the number of distractor objects in the scene, the textures of all objects, and the illumination settings. They could show good detection results on real data while using solely synthetic data for training. However, they used geometrically simple objects and did not estimate full 6d poses.

Other works used synthetic data for training of 2d detectors, and their insights are valuable for the task of 6d detection as well. In [13], they again rendered object views and pasted them on random real background images. One of their insights was that blurring the composite images (and particularly the border between rendered object and background) significantly improves the detection rates. They also found that freezing the weights of the feature extractor, which was pre-trained with real data, greatly improves recognition rates. However, [12] tried the same when using synthetic data to detect cars from the KITTI dataset and could not reproduce this effect. Instead, varying illumination settings and object textures were most important for their detection task. An approach based on purely synthetic data was proposed by [14], who generated synthetic backgrounds which were heavily cluttered with objects of similar size to the object in question. Furthermore, they achieved good results by applying a special training curriculum, first presenting similar object views and slowly increasing complexity.
In summary, although many different techniques have been applied and investigated, there is not yet a clear picture of which approaches are most suitable to yield good performance, particularly for the task of 6d object detection. Some aspects which seem crucial for certain applications turn out to be of less importance or even not helpful at all in other applications.
III. METHODOLOGY

Our goal is to investigate the effects of data generation when training state-of-the-art detectors with synthetic data. We designed two approaches to generate synthetic data: first, a render and compose schema which creates composites of random background images and rendered object views, and second, a scene simulation in which we use a rigid body simulation to compose realistic scenes. Fig. 1 shows samples.
A. Render and Compose

In the first stage, we render the object in question from various viewpoints, and in the second stage we compose images by placing the synthetic object views on real background images. The process is illustrated in Fig. 2.
The viewpoint sampling is modeled after the turntable approach, i.e. using Euler angles to rotate camera and object. We limit the rotations to views of the upper hemisphere. The camera's distance depends on the model's dimensions, but is fixed during rendering. Two light sources are randomly positioned in the scene. Optionally, we eliminate ambiguous views, i.e. views that have the same appearance but different poses, because some detector networks may not be able to properly deal with ambiguous training data.
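For illustration, the following is a minimal sketch of how such a turntable-style sampling of upper-hemisphere viewpoints could look. The angle step sizes, the distance heuristic, and the function names are our own assumptions for this sketch, not the exact parameters used in our pipeline.

```python
import numpy as np

def turntable_viewpoints(azimuth_step_deg=15.0, elevation_step_deg=15.0):
    """Sample camera viewpoints on the upper hemisphere (turntable style)."""
    viewpoints = []
    for elev in np.arange(0.0, 90.0 + 1e-6, elevation_step_deg):
        for azim in np.arange(0.0, 360.0, azimuth_step_deg):
            viewpoints.append((np.deg2rad(azim), np.deg2rad(elev)))
    return viewpoints

def camera_pose(azimuth, elevation, distance):
    """Camera position on a sphere of the given radius, looking at the origin."""
    # Spherical -> Cartesian; the object sits at the origin.
    cam_pos = distance * np.array([
        np.cos(elevation) * np.cos(azimuth),
        np.cos(elevation) * np.sin(azimuth),
        np.sin(elevation),
    ])
    # Build a look-at rotation whose third axis points from camera to object.
    forward = -cam_pos / np.linalg.norm(cam_pos)
    right = np.cross(np.array([0.0, 0.0, 1.0]), forward)
    # Degenerate at the pole (elevation = 90 deg): pick an arbitrary right vector.
    if np.linalg.norm(right) < 1e-8:
        right = np.array([1.0, 0.0, 0.0])
    right /= np.linalg.norm(right)
    up = np.cross(forward, right)
    R = np.stack([right, up, forward], axis=0)  # rows are the camera axes in world frame
    return R, cam_pos

# Fixed camera distance derived from the model's size (the factor is an assumption).
model_diameter = 0.25  # metres, example value
distance = 3.0 * model_diameter
poses = [camera_pose(a, e, distance) for a, e in turntable_viewpoints()]
```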
During the composition stage, the rendered views are pasted at random positions onto randomly chosen images of the COCO dataset [15], similar to [6, 8]. The views are randomly scaled with a factor in the range 0.6–1.6 (chosen in consideration of the scales in the real dataset) to simulate different camera distances. Note that this makes the approach computationally faster, because fewer views have to be rendered and scaling is a cheap operation. The ground truth pose information is adjusted to the new position and distance. HSV augmentations on the rendered views allow us to cover more possible appearances of the object. We also add distractor objects to the composed images by applying the same routine as for the target object; however, we make sure that no more than 20% of the target object's area is occluded by distractors.
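The sketch below shows how such a composition step could be implemented with OpenCV and NumPy. The helper names (hsv_jitter, paste_view, occlusion_ratio) and the HSV jitter ranges are assumptions made for illustration; only the scale range (0.6–1.6) and the 20% occlusion limit are taken from the text.

```python
import cv2
import numpy as np

def hsv_jitter(img_bgr, hue_shift=10, sat_scale=0.3, val_scale=0.3, rng=np.random):
    """Randomly perturb hue, saturation and value of a BGR image (ranges are assumptions)."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 0] = (hsv[..., 0] + rng.uniform(-hue_shift, hue_shift)) % 180
    hsv[..., 1] *= 1.0 + rng.uniform(-sat_scale, sat_scale)
    hsv[..., 2] *= 1.0 + rng.uniform(-val_scale, val_scale)
    hsv = np.clip(hsv, 0, 255).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)

def paste_view(background, view_bgr, mask, rng=np.random):
    """Paste a rendered object view onto a background at a random position and scale.

    Assumes the scaled view fits inside the background. Returns the composite image,
    the pasted object mask, and the placement (scale, top-left corner), from which the
    ground-truth pose adjustment would be derived.
    """
    scale = rng.uniform(0.6, 1.6)  # scale range from the paper
    view = cv2.resize(view_bgr, None, fx=scale, fy=scale, interpolation=cv2.INTER_LINEAR)
    m = cv2.resize(mask, None, fx=scale, fy=scale, interpolation=cv2.INTER_NEAREST)
    h, w = view.shape[:2]
    H, W = background.shape[:2]
    y0 = rng.randint(0, max(H - h, 1))
    x0 = rng.randint(0, max(W - w, 1))
    out = background.copy()
    roi = out[y0:y0 + h, x0:x0 + w]
    roi[m > 0] = view[m > 0]
    full_mask = np.zeros((H, W), dtype=np.uint8)
    full_mask[y0:y0 + h, x0:x0 + w] = m
    return out, full_mask, (scale, (x0, y0))

def occlusion_ratio(target_mask, distractor_mask):
    """Fraction of the target object covered by a distractor (kept below 0.2 in the paper)."""
    target_area = np.count_nonzero(target_mask)
    overlap = np.count_nonzero(np.logical_and(target_mask > 0, distractor_mask > 0))
    return overlap / max(target_area, 1)
```

Under a pinhole camera model, the returned placement roughly corresponds to the ground-truth adjustment mentioned above: the depth component of the translation is divided by the scale factor and the lateral components are shifted so that the projection lands at the new paste position.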
Figure 2. Illustration of the render and compose schema.
B. Simulated Scenes

Our second approach follows the notion that a realistic context, in which an object is most likely observed, may help to recognize the object's pose. For example, if a cup is placed on a table, two of its rotational and one of its translational degrees of freedom are already fixed and could be estimated by only observing the ground plane. Also, realistic shadows on the table can further help to distinguish pose hypotheses. To pursue this intuition we generated synthetic data by means of a rigid body simulation of a simple scene. We randomly placed the relevant object models in the air and simulated their falling onto a table. The views were then generated using the same rendering pipeline as above. We also used HSV and scale augmentations.
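As an illustration, a scene of this kind could be set up with Blender's Python API roughly as follows. This is a minimal sketch assuming a Blender 2.8x/2.9x-style bpy API; the operator calls, the drop heights, and the objects_to_drop list are assumptions, not the actual script used for the paper.

```python
import random
import bpy

# Ensure a rigid body world exists for the simulation.
if bpy.context.scene.rigidbody_world is None:
    bpy.ops.rigidbody.world_add()

# Table: a passive rigid body the objects fall onto.
bpy.ops.mesh.primitive_plane_add(size=2.0, location=(0.0, 0.0, 0.0))
bpy.ops.rigidbody.object_add(type='PASSIVE')

# Drop target and distractor objects from random poses above the table.
for obj in objects_to_drop:  # assumed: a list of already imported mesh objects
    bpy.context.view_layer.objects.active = obj
    bpy.ops.rigidbody.object_add(type='ACTIVE')
    obj.location = (random.uniform(-0.3, 0.3),
                    random.uniform(-0.3, 0.3),
                    random.uniform(0.3, 0.8))
    obj.rotation_euler = tuple(random.uniform(0.0, 6.2832) for _ in range(3))

# Step through the frames so the rigid body simulation settles,
# then render from sampled viewpoints as in the render and compose pipeline.
for frame in range(1, 101):
    bpy.context.scene.frame_set(frame)
```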
IV. EXPERIMENTS

In this section we first describe our experimental setup, i.e. our implementation as well as the used datasets and metrics. Subsequently, we present the results for different variations in data generation.

A. Experimental Setup

The implementation of our data generation pipeline is based on Blender for rendering and simulation, and OpenCV for image composition.
As pose detector, we use the Single Shot Pose (SSP) network by [7], which extends YOLO and is available as open source. It predicts the 3d bounding boxes and computes the pose hypothesis with a PnP algorithm. One instance of the network is trained for each object, as in [7]. The authors of SSP used the LINEMOD dataset [16] and split it into training and test sets. In all our experiments, we use the same scenes for evaluation as in [7]; however, we do not use the bowl, can and cup due to their strong rotational symmetry.
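For context, the step from predicted 3d bounding-box corners to a pose hypothesis is a standard PnP problem. A minimal sketch with OpenCV (our own illustration, not the SSP code) could look like this:

```python
import cv2
import numpy as np

def pose_from_corners(corners_3d, corners_2d, camera_matrix):
    """Recover rotation and translation from predicted bounding-box corner correspondences.

    corners_3d: (N, 3) corners of the object's 3d bounding box in model coordinates.
    corners_2d: (N, 2) corresponding image-plane predictions of the network.
    """
    dist_coeffs = np.zeros(5)  # assume an undistorted (rectified) image
    ok, rvec, tvec = cv2.solvePnP(
        corners_3d.astype(np.float64),
        corners_2d.astype(np.float64),
        camera_matrix, dist_coeffs,
        flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix
    return R, tvec
```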
Also similar to [7], we use two standard error metrics: first, the average distance metric ADD from [16], and second, the Intersection over Union IoU. Note that for 6d pose estimation, the IoU is computed over the area of the model's projections in ground truth and estimated pose, as in [6, 7]. It only describes similarity in appearance without directly checking the pose correctness, and is thus more permissive than the ADD metric.
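A compact sketch of both metrics is given below. It assumes the model is available as a point set and, for the projection-based IoU, approximates the projected silhouettes by the convex hulls of the projected points; this is an assumption for illustration and not necessarily how [6, 7] rasterize the model.

```python
import cv2
import numpy as np

def add_correct(model_points, R_gt, t_gt, R_est, t_est, diameter, threshold=0.1):
    """ADD metric [16]: mean distance between model points under the two poses.

    The pose counts as correct if the mean distance is below `threshold` times
    the model diameter (0.1 for ADD10%, 0.5 for ADD50%).
    """
    pts_gt = model_points @ R_gt.T + t_gt
    pts_est = model_points @ R_est.T + t_est
    mean_dist = np.linalg.norm(pts_gt - pts_est, axis=1).mean()
    return mean_dist < threshold * diameter

def projection_iou(model_points, R_gt, t_gt, R_est, t_est, camera_matrix, image_shape):
    """IoU of the model's projected silhouettes under ground-truth and estimated pose."""
    masks = []
    for R, t in ((R_gt, t_gt), (R_est, t_est)):
        rvec, _ = cv2.Rodrigues(R.astype(np.float64))
        proj, _ = cv2.projectPoints(model_points.astype(np.float64), rvec,
                                    t.astype(np.float64).reshape(3, 1),
                                    camera_matrix, None)
        hull = cv2.convexHull(proj.reshape(-1, 2).astype(np.int32))
        mask = np.zeros(image_shape[:2], dtype=np.uint8)
        cv2.fillConvexPoly(mask, hull, 1)
        masks.append(mask.astype(bool))
    inter = np.logical_and(*masks).sum()
    union = np.logical_or(*masks).sum()
    return inter / union if union > 0 else 0.0
```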
B. Results

We tested altogether four approaches for data generation on the whole LINEMOD dataset. The first two (rc1 and rc2) use the render and compose schema described in Section III-A. For both variants we used 180 discrete rotations per object and created 720 composite images. Each composite image contains one target object and three distractor objects, which were other objects from the same dataset. The difference between rc1 and rc2 is that only the latter applies the HSV augmentation.
The third approach (sim) is the simulation of scenes as described in Section III-B. We also use one instance of the target object and three instances of distractor objects, and we apply HSV and scale augmentation. However, no background images are added to the scenes. Lastly, we use the union of rc2 and sim as rc2+sim.
Table I
COMPARISON OF OUR DATA GENERATION APPROACHES. PERCENTAGES OF CORRECTLY ESTIMATED POSES ARE AVERAGED OVER ALL OBJECTS.

                       Training Datasets
Metric      real [7]    rc1      rc2      sim     rc2+sim
ADD10%       54.88      6.10     5.59     0.00     5.88
ADD50%       96.62     36.37    31.27     0.00    37.08
IoU50%       99.94     87.31    48.71     0.00    55.43
The recognition rates are presented in Table I and compared to the results of [7], who trained on real images and real object views pasted on random background images. While training on real data yields the best results, we can observe dramatic differences between the various synthetic training datasets. Even the two very similar approaches rc1 and rc2 greatly differ in recognition rates (based on the IoU metric). Interestingly, the variant without HSV augmentation performs better. Using simulated scenes did not yield any successful recognitions, which might emphasize the importance of background variance. However, if combined with composite images that have diverse backgrounds, the simulated scenes seem to be helpful, as the recognition rates slightly increase for rc2+sim with respect to rc2 alone.

The results from Table I reflect the fact that the IoU metric is more permissive than ADD. Interestingly, most approaches have considerably higher IoU than ADD, even if we lower the matching threshold to 50%, which means that the average distance of a model's points in ground-truth pose to their corresponding points in estimated pose must be lower than 50% of the model's diameter. This indicates that the detection of the object instances works reasonably well, while the estimation of the pose is more error-prone.

As the achieved results are not yet satisfactory, we made some more investigations with only two exemplary objects (cat and iron) using the ADD10% metric. To counteract the bad pose estimation we set up two datasets similar to rc1 where we used finer sampling of rotations during rendering. Using 440 object views in 2000 composite images slightly improved the recognition of the cat (from 5.3% to 9.6%) but not of the iron (from 16.7% to 15.9%). Training the SSP with 32k object views in 20k composite images with multiple object instances resulted in no successful detections at all.

To quantify the effects of rendering and composing separately, we used the same 172 images from LINEMOD which [7] used for training, and trained variations with real or rendered object views and real or random background images from COCO. Note that all variations use exactly the same object poses during training. The results are displayed in Table II and indicate that both object appearance and background strongly affect the recognition rates.

Table II
EFFECT OF RENDERING AND COMPOSING ON THE RECOGNITION RATES, USING THE ADD10% METRIC.

                              Correct Poses (%)
Object       Background       Cat      Iron
real         real             55.4     78.0
real         COCO             48.3     34.6
rendered     real              6.9     29.7
rendered     COCO              9.3      8.8

V. DISCUSSION

The preliminary results confirm that bridging the domain gap that occurs when training on synthetic data is challenging. Our prototypical data generation pipeline is not yet capable of creating training images that give similar recognition rates as real training data, but we were able to observe a great sensitivity with respect to certain changes in the training data.

Comparing the two proposed methods, the simpler render and compose approach works far better than the simulated scenes. We think this is mainly due to missing variations, particularly in the setup of the scenes as well as regarding the background. The simulated scenes had only a constant background color with no variation at all, and as we saw in Table II, the used background images can largely affect detection rates. Using varying scene setups, applying different textures to the table, and adding random background images could thus improve training with simulated scenes.
The results further prove the difficulty of the pose estimation task, as there is a large difference between results with the ADD metric and those with IoU, which does not require an exact pose estimate but rather a rough localization. However, simply using more poses during training was not successful. We suggest further experiments using an icosphere sampling during rendering (as in [16]) and adjusting the sampled distances more precisely to the target domain. Yet, as the results in Table II show, the rendering still impairs recognition rates even when poses from the target domain are used. Further variations and randomizations have to be tested to alleviate this effect.
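For reference, an even viewpoint distribution of the kind suggested above could be approximated as follows. This sketch uses a Fibonacci-lattice hemisphere rather than a true icosphere subdivision as in [16]; it only illustrates the idea of replacing the turntable grid with more uniformly spread views.

```python
import numpy as np

def hemisphere_fibonacci(n_views):
    """Approximately uniform viewpoints on the upper hemisphere via a Fibonacci lattice."""
    golden = (1.0 + 5 ** 0.5) / 2.0
    indices = np.arange(n_views)
    z = (indices + 0.5) / n_views          # heights in (0, 1] -> upper hemisphere only
    azimuth = 2.0 * np.pi * indices / golden
    r = np.sqrt(1.0 - z ** 2)
    return np.stack([r * np.cos(azimuth), r * np.sin(azimuth), z], axis=1)

# Example: 200 camera directions; each would be combined with a distance sampled
# to match the object distances observed in the target domain.
directions = hemisphere_fibonacci(200)
```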
VI. CONCLUSIONS AND OUTLOOK

In our work we presented two methods to generate synthetic training data and evaluated them using the Single Shot Pose detector [7] on the LINEMOD dataset [16]. Our render and compose approach seems promising but still has to be improved to be competitive. We set up further experiments to examine the influence of pose sampling, rendering, and background variation, and conclude that particularly background variation and rendering have a strong impact on recognition rates.

In the future, we will concentrate on identifying further influence factors and their impact for various datasets and more detector architectures. Automatic parameterization of the data generation methods from unlabeled, real images would be of particular interest.
REFERENCES

[1] Albert N. Link, Zachary T. Oliver, and Alan C. O'Connor. Economic Analysis of Technology Infrastructure Needs for Advanced Manufacturing: Advanced Robotics and Automation. 2016. URL: https://nvlpubs.nist.gov/nistpubs/gcr/2016/NIST.GCR.16-005.pdf (visited on 02/11/2019).
[2] Tomáš Hodaň et al. "BOP: benchmark for 6D object pose estimation". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 19–34.
[3] Bertram Drost, Markus Ulrich, Nassir Navab, and Slobodan Ilic. "Model globally, match locally: Efficient and robust 3D object recognition". In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE International Conference on. 2010, pp. 998–1005.
[4] Joel Vidal, Chyi-Yeu Lin, and Robert Martí. "6D pose estimation using an improved method based on point pair features". In: 2018 4th International Conference on Control, Automation and Robotics (ICCAR). 2018, pp. 405–409.
[5] Mahdi Rad and Vincent Lepetit. "BB8: a scalable, accurate, robust to partial occlusion method for predicting the 3D poses of challenging objects without using depth". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 3828–3836.
[6] Wadim Kehl, Fabian Manhardt, Federico Tombari, Slobodan Ilic, and Nassir Navab. "SSD-6D: Making RGB-Based 3D Detection and 6D Pose Estimation Great Again". In: The IEEE International Conference on Computer Vision (ICCV). 2017.
[7] Bugra Tekin, Sudipta N. Sinha, and Pascal Fua. "Real-time seamless single shot 6d object pose prediction". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 292–301.
[8] Martin Sundermeyer, Zoltan-Csaba Marton, Maximilian Durner, Manuel Brucker, and Rudolph Triebel. "Implicit 3d orientation learning for 6d object detection from rgb images". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 699–715.
[9] Patrick Poirson, Phil Ammirato, Cheng-Yang Fu, Wei Liu, Jana Kosecka, and Alexander C. Berg. "Fast single shot detection and pose estimation". In: 2016 Fourth International Conference on 3D Vision (3DV). 2016, pp. 676–684.
[10] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. "Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes". In: arXiv preprint arXiv:1711.00199 (2018).
[11] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. "Domain randomization for transferring deep neural networks from simulation to the real world". In: Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on. 2017, pp. 23–30.
[12] Jonathan Tremblay et al. "Training deep networks with synthetic data: Bridging the reality gap by domain randomization". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2018, pp. 969–977.
[13] Stefan Hinterstoisser, Vincent Lepetit, Paul Wohlhart, and Kurt Konolige. "On pre-trained image features and synthetic images for deep learning". In: European Conference on Computer Vision. 2018, pp. 682–697.
[14] Stefan Hinterstoisser, Olivier Pauly, Hauke Heibel, Martina Marek, and Martin Bokeloh. "An Annotation Saved is an Annotation Earned: Using Fully Synthetic Training for Object Instance Detection". In: arXiv preprint arXiv:1902.09967 (2019).
[15] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. "Microsoft coco: Common objects in context". In: European conference on computer vision. 2014, pp. 740–755.
[16] Stefan Hinterstoisser, Vincent Lepetit, Slobodan Ilic, Stefan Holzer, Gary Bradski, Kurt Konolige, and Nassir Navab. "Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes". In: Asian conference on computer vision. 2012, pp. 548–562.