Abstract
We present a two-phase algorithm that first identifies the categories and 2D proposal regions of 3D objects and then estimates the eight corners of the cuboids bounding the target objects. Given the predicted corners, the six-degrees-of-freedom (6-DoF) poses of the 3D objects are computed with the conventional perspective-n-point (PnP) algorithm and evaluated against manually annotated corners. In addition, we collect several 3D models with high-quality shapes and texture information, along with 2D images and annotations such as 2D boxes, 3D cuboids, and segmentation masks; these newly collected objects are also used to validate the proposed method. Our results are compared qualitatively and quantitatively with those of a baseline model on the publicly available LineMOD dataset, the additional annotations in the OCCLUSION dataset, and our own custom dataset. For both single and multiple objects in the test scenes, the proposed method shows clear improvements on all of these datasets as well as in real-world examples.
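The abstract's second phase rests on a standard geometric relationship: the eight cuboid corners are fixed 3D points in the object's model frame, and their 2D projections under a pinhole camera determine the 6-DoF pose via PnP. The paper's network architecture is not reproduced here, but the underlying camera model can be sketched in a few lines. The dimensions, intrinsics, and pose values below are illustrative assumptions, not values from the paper; in practice the inverse step (2D corners → pose) is solved with an off-the-shelf PnP routine such as OpenCV's `cv2.solvePnP`.

```python
def cuboid_corners(w, h, d):
    """The eight corners of an axis-aligned cuboid centered at the
    model-frame origin, with width w, height h, and depth d."""
    return [(sx * w / 2, sy * h / 2, sz * d / 2)
            for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]


def project(points, R, t, fx, fy, cx, cy):
    """Pinhole projection of 3D model points under pose (R, t).

    R is a 3x3 rotation (list of rows), t a 3-vector; (fx, fy, cx, cy)
    are the camera intrinsics. Returns pixel coordinates (u, v)."""
    pixels = []
    for X in points:
        # Transform model point into the camera frame: Xc = R @ X + t
        Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
        # Perspective divide and intrinsic scaling
        u = fx * Xc[0] / Xc[2] + cx
        v = fy * Xc[1] / Xc[2] + cy
        pixels.append((u, v))
    return pixels


if __name__ == "__main__":
    # Hypothetical unit cube seen 5 m in front of the camera (identity rotation).
    corners = cuboid_corners(1.0, 1.0, 1.0)
    R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    pix = project(corners, R, (0.0, 0.0, 5.0), 900, 900, 320, 240)
    for p in pix:
        print(p)
    # Given the predicted 2D corners `pix` and the known 3D corners
    # `corners`, cv2.solvePnP(corners, pix, K, dist) recovers (R, t).
```

Because the 3D corner layout is known once the object category is identified in phase one, the eight predicted 2D corners give exactly the 3D-2D correspondences PnP needs; this is why the method can estimate pose from a single RGB image without depth.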
Acknowledgements
This work was supported by the Soongsil University Research Fund (New Professor Support Research) of 2021.
Cite this article
Jang, Jh., Lee, J. & Kim, Sh. Two-Phase Approach for Monocular Object Detection and 6-DoF Pose Estimation. J. Electr. Eng. Technol. 19, 1817–1825 (2024). https://doi.org/10.1007/s42835-023-01640-7