
6-DoF grasp estimation method that fuses RGB-D data based on external attention

Published: 18 July 2024

Abstract

6-DoF grasp estimation from point clouds alone has long been a challenge in robotics: a single input modality limits the robot's perception of real-world scenes and therefore reduces robustness. In this work, we propose a 6-DoF grasp pose estimation method based on RGB-D data, which leverages ResNet to extract color-image features, uses the PointNet++ network to extract geometric features, and employs an external attention mechanism to fuse the two. Our method is designed end-to-end, and we validate its performance through benchmark tests on a large-scale dataset and evaluations in a simulated robot environment. It outperforms previous state-of-the-art methods on public datasets, achieving 47.75 mAP on seen objects and 40.08 mAP on unseen objects. We also test our grasp pose estimation method on multiple objects in a simulated robot environment, demonstrating higher grasp accuracy and robustness than previous methods.
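The abstract describes the fusion only at a high level: ResNet supplies color-image features, PointNet++ supplies per-point geometric features, and an external attention mechanism fuses them. As a minimal sketch of how such a fusion block might be wired up, the PyTorch snippet below concatenates per-point color and geometry features and refines the result with an external-attention layer in the two-linear-layer, double-normalization form of Guo et al.; the module names, feature widths, residual connection, and concatenation step are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExternalAttention(nn.Module):
    """External attention: two linear layers act as shared external
    memories M_k and M_v, with softmax + l1 double normalization."""

    def __init__(self, dim: int, mem_size: int = 64):
        super().__init__()
        self.mk = nn.Linear(dim, mem_size, bias=False)  # M_k
        self.mv = nn.Linear(mem_size, dim, bias=False)  # M_v

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) per-point fused features
        attn = self.mk(x)                                      # (B, N, mem_size)
        attn = F.softmax(attn, dim=1)                          # normalize over the N points
        attn = attn / (attn.sum(dim=2, keepdim=True) + 1e-9)   # l1-normalize over memory slots
        return self.mv(attn)                                   # (B, N, dim)


class RGBDFusion(nn.Module):
    """Hypothetical fusion head: concatenate per-point color features
    (e.g. sampled from a ResNet feature map) with PointNet++ geometry
    features, project to a common width, refine with external attention."""

    def __init__(self, rgb_dim: int = 256, geo_dim: int = 128, fused_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(rgb_dim + geo_dim, fused_dim)
        self.ea = ExternalAttention(fused_dim)

    def forward(self, rgb_feat: torch.Tensor, geo_feat: torch.Tensor) -> torch.Tensor:
        # rgb_feat: (B, N, rgb_dim), geo_feat: (B, N, geo_dim)
        fused = self.proj(torch.cat([rgb_feat, geo_feat], dim=-1))
        return fused + self.ea(fused)  # residual connection (assumed)


if __name__ == "__main__":
    fusion = RGBDFusion()
    out = fusion(torch.randn(2, 1024, 256), torch.randn(2, 1024, 128))
    print(out.shape)  # torch.Size([2, 1024, 256])
```

One appeal of external attention in this setting is that the two memory matrices are shared across all points, so the fusion step scales linearly with the number of points rather than quadratically as in standard self-attention.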



Published In

Journal of Visual Communication and Image Representation, Volume 101, Issue C
May 2024
313 pages

Publisher

Academic Press, Inc.

United States

Publication History

Published: 18 July 2024

Author Tags

  1. 6-DoF grasp
  2. External attention
  3. Data fusion
  4. Deep learning
  5. Pose estimation

Qualifiers

  • Research-article
