SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation

Zhang, Junjie; Bai, Chenjia; He, Haoran; Xia, Wenke; Wang, Zhigang; Zhao, Bin; Li, Xiu; Li, Xuelong

arXiv:2405.19586 (cs)

[Submitted on 30 May 2024]

Abstract: Acquiring a multi-task imitation policy in 3D manipulation poses challenges in terms of scene understanding and action prediction. Current methods employ both 3D representation and multi-view 2D representation to predict the poses of the robot's end-effector. However, they still require a considerable amount of high-quality robot trajectories, and suffer from limited generalization in unseen tasks and inefficient execution in long-horizon reasoning. In this paper, we propose SAM-E, a novel architecture for robot manipulation by leveraging a vision-foundation model for generalizable scene understanding and sequence imitation for long-term action reasoning. Specifically, we adopt Segment Anything (SAM) pre-trained on a huge number of images and promptable masks as the foundation model for extracting task-relevant features, and employ parameter-efficient fine-tuning on robot data for a better understanding of embodied scenarios. To address long-horizon reasoning, we develop a novel multi-channel heatmap that enables the prediction of the action sequence in a single pass, notably enhancing execution efficiency. Experimental results from various instruction-following tasks demonstrate that SAM-E achieves superior performance with higher execution efficiency compared to the baselines, and also significantly improves generalization in few-shot adaptation to new tasks.
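
The abstract does not spell out how the multi-channel heatmap is decoded, so the following is only a minimal sketch of the general idea under stated assumptions: each channel is assumed to correspond to one future step, and the peak of each channel is taken as that step's 2D location in a single view. The function name `decode_action_sequence`, the type alias `Heatmap`, and the toy data are hypothetical and are not taken from the paper or its code.

```python
from typing import List, Tuple

# Assumption: one heatmap channel is an H x W grid of scores for one future step.
Heatmap = List[List[float]]


def decode_action_sequence(heatmaps: List[Heatmap]) -> List[Tuple[int, int]]:
    """Return the (row, col) of the peak in each channel, in channel order.

    All channels come from one forward pass, so the whole sequence is
    read out without calling the model again for each step.
    """
    sequence = []
    for channel in heatmaps:
        best_value = float("-inf")
        best_pos = (0, 0)
        for r, row in enumerate(channel):
            for c, value in enumerate(row):
                if value > best_value:
                    best_value = value
                    best_pos = (r, c)
        sequence.append(best_pos)
    return sequence


# Toy example: 3 channels (a horizon of 3 steps) over a 2 x 3 image.
toy = [
    [[0.1, 0.7, 0.2], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.9, 0.05, 0.05]],
    [[0.0, 0.0, 0.3], [0.0, 0.0, 0.7]],
]
print(decode_action_sequence(toy))  # [(0, 1), (1, 0), (1, 2)]
```

In this sketch the efficiency gain comes from reading several steps out of one prediction; how SAM-E maps such peaks to end-effector poses is not described in the abstract and is left out here.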

Comments:	ICML 2024. Project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Cite as:	arXiv:2405.19586 [cs.CV]
	(or arXiv:2405.19586v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2405.19586

Submission history

From: Chenjia Bai
[v1] Thu, 30 May 2024 00:32:51 UTC (9,076 KB)
