Abstract
To aid humans in everyday tasks, robots need to know which objects exist in the scene, where they are, and how to grasp and manipulate them in different situations. Object recognition and grasping are therefore two key functionalities for autonomous robots. Most state-of-the-art approaches treat object recognition and grasping as two separate problems, even though both rely on visual input. Furthermore, the robot's knowledge is fixed after the training phase: if the robot encounters new object categories, it must be retrained to incorporate the new information without catastrophic forgetting. To address this problem, we propose a deep learning architecture with augmented memory capacity that handles open-ended object recognition and grasping simultaneously. In particular, our approach takes multiple views of an object as input and jointly estimates a pixel-wise grasp configuration and a deep scale- and rotation-invariant representation as output. The obtained representation is then used for open-ended object recognition through a meta-active learning technique. We demonstrate the ability of our approach to grasp never-seen-before objects and to rapidly learn new object categories on-site from very few examples, in both simulation and real-world settings. Our approach enables a robot to acquire knowledge about new object categories using, on average, fewer than five instances per category, and to achieve \(95\%\) object recognition accuracy and a grasp success rate above \(91\%\) in (highly) cluttered scenarios in both simulation and real-robot experiments. A video of these experiments is available online at: https://youtu.be/n9SMpuEkOgk
Acknowledgements
We thank the Center for Information Technology of the University of Groningen for their support and for providing access to the Peregrine high-performance computing cluster.
Author information
Contributions
Hamidreza Kasaei proposed the main idea and led the work. He also contributed to the development of the approach and performed experiments both in simulation and on real robots. Mohammadreza Kasaei contributed to the development of the proposed approach and performed simulation experiments. Georgios Tziafas developed the Vision Transformer part, and Sha Lou contributed to developing the simulation environment. Remo Sasso was partly involved in the development of the grasp network. All authors reviewed the manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Kasaei, H., Kasaei, M., Tziafas, G. et al. Simultaneous Multi-View Object Recognition and Grasping in Open-Ended Domains. J Intell Robot Syst 110, 62 (2024). https://doi.org/10.1007/s10846-024-02092-5
DOI: https://doi.org/10.1007/s10846-024-02092-5