
TieBot: Learning to Knot a Tie from Visual Demonstration through a Real-to-Sim-to-Real Approach

Weikun Peng1, Jun Lv2, Yuwei Zeng1, Haonan Chen3, Siheng Zhao3,
Jicheng Sun2, Cewu Lu2, Lin Shao1†
1School of Computing, National University of Singapore
2Department of Computer Science, Shanghai Jiao Tong University
3Department of Computer Science and Technology, Nanjing University
† Corresponding author. E-mail: linshao@nus.edu.sg
Abstract

The tie-knotting task is highly challenging due to the tie’s high deformation and long-horizon manipulation actions. This work presents TieBot, a Real-to-Sim-to-Real learning-from-visual-demonstration system for robots to learn to knot a tie. We introduce the Hierarchical Feature Matching approach to estimate a sequence of the tie’s meshes from the demonstration video. With these estimated meshes used as subgoals, we first learn a teacher policy using privileged information. Then, we learn a student policy with point cloud observations by imitating the teacher policy. Lastly, our pipeline learns a residual policy when the learned policy is applied to real-world execution, mitigating the Sim2Real gap. We demonstrate the effectiveness of TieBot in simulation and the real world. In the real-world experiment, a dual-arm robot successfully knots a tie, achieving a 50% success rate over 10 trials. Videos can be found on our website.

1 Introduction

Learning cloth manipulation holds great utility across a wide range of applications. One intriguing domain is robotic tie knotting. Service robots must be adept at tasks like aiding the elderly or individuals with disabilities in dressing for certain social events. Teaching robots to knot ties, as a special case of cloth manipulation, typically pushes the limits of robotic cloth manipulation. This offers valuable insights for tie knotting and the broader field of robotic cloth manipulation.

Cloth manipulation presents challenges for robots due to its high-dimensional state and complex dynamics. Extracting and modeling state information are difficult problems. In contrast, humans have accumulated extensive knowledge about cloth manipulation. These priors make learning from demonstration (LfD) a promising direction. LfD empowers a robot to acquire a policy from expert demonstrations, significantly reducing the need to design task-specific reward functions manually. Consequently, LfD stands as a potent and efficient framework for instructing robots in the execution of complex skills.

However, existing LfD methods struggle with tie-knotting tasks. Kinesthetic demonstration and teleoperation suffer from the complexity of tie-knotting tasks, which require bimanual operation and place high demands on human operators’ skills and equipment. For instance, Zhang et al. use VR headsets for teleoperation [1]. Thus, simple behavior cloning may be significantly labor-intensive. Learning from visual demonstration usually makes collecting demonstration data easier, but it introduces embodiment gaps. Therefore, researchers attempt to find object-centric representations that robots can use to generate correct actions, overcoming embodiment gaps. Several methods attempt to learn a general visual representation of simple pick-place skills via large-scale pre-training on actionless videos [2, 3, 4]. These works show strong generalization of the learned visual representations, but none of them demonstrates the ability to learn dexterous manipulation skills that can knot a tie. Other methods such as [5, 6, 7, 8] leverage object trajectories or keypoints as representations to guide policy learning. Such representations are sufficient to describe simple object motions but fail to capture the tie’s complex topology and subtle dynamics.

Figure 1: Our proposed TieBot performs a tie-knotting task. We leverage cloth simulation to recover the cloth’s state from human demonstration and learn a goal-conditioned policy to accomplish the tie-knotting task.

Compared to the existing LfD work mentioned in the previous paragraph, our insight is that a mesh is the most suitable representation for tie-knotting tasks and other complex cloth manipulation tasks. It captures the accurate geometric structure and physical properties of the tie, which are crucial for tie-knotting tasks. It also discards irrelevant information in the visual demonstrations, such as the environment background and object colors, enabling the learned policy to apply to different test settings. Therefore, we propose a Real-to-Sim-to-Real LfD framework. First, we propose a Hierarchical Feature Matching method to iteratively estimate the tie’s meshes with cloth simulation from the demonstrated video. We use a cloth simulator called DiffClothAI [9] that supports intersection-free contact for cloth to maintain the tie’s topological structure during the estimation process. These estimated meshes from the demonstrated video are then used as subgoals. To learn where to grasp the tie and where to pull the tie from point cloud observations in simulation, we adopt a teacher-student training paradigm similar to [10]. Lastly, our pipeline learns a residual policy when applying the policy to real-world settings, mitigating the Sim2Real gap.

In summary, we make the following contributions: 1) We introduce a systematic LfD framework for a dual-arm robot to learn to knot a tie. 2) We propose a Hierarchical Feature Matching approach to estimate the tie’s mesh under high deformation from the demonstrated RGB-D video using cloth simulation. 3) With estimated meshes as subgoals, we present a teacher-student training paradigm to learn grasping points and placing points from point cloud observations in simulation. 4) We conduct experiments in simulation and the real world to demonstrate the effectiveness and advantages of our pipeline. To the best of our knowledge, this work is the first effort to develop a robot system that integrates perception, modeling, and robot learning and successfully completes the tie-knotting task, which is one of the most complex and challenging cloth manipulation tasks.

2 Related Work

2.1 Cloth Manipulation

Previous work mainly addresses short-horizon cloth manipulation tasks that only involve simple pick-place actions. There are several approaches to learning cloth manipulation skills. One approach uses model-free RL or a learned dynamics model to learn cloth unfolding, rope rearranging, and dressing assistance tasks from raw sensor input [11, 12, 13, 14, 15]. Other approaches collect and annotate data from images [16, 17] or generate demonstration trajectories in simulation [18] to learn a policy. Because these tasks are short-horizon and involve simple actions, it is also possible to infer correct actions from visual representations, such as the flow between the current observation and target images [19].

In contrast, tie-knotting tasks require flipping or rotating a part of the tie, which makes it difficult to annotate robot actions or design action primitives. Therefore, collecting and annotating robot actions on observations is infeasible. It’s also difficult to generate demonstrations or directly apply RL in simulation since the trajectories of tie-knotting tasks are much longer and the possible state space is much larger. Thus, in this work, we choose to learn skills from human demonstration.

2.2 Learning from Visual Demonstration

One line of research explores pre-training neural representations from actionless videos [20, 21, 2, 22, 3, 4, 23]. This approach aims at learning general representations for different actions, whereas none of them shows the ability to learn dexterous manipulation skills that can knot a tie. Another line of research attempts to learn from visual priors extracted from visual demonstrations, such as object trajectories [5, 24], hand poses [25, 26], keypoints positions [27, 6], graph relations [7], or affordances [28]. The third approach is to learn a video or trajectory prediction model to guide policy learning [29, 30, 31, 32, 8]. These approaches require in-domain demonstrations, placing restrictions on visual demonstrations. Moreover, the prediction model may suffer from the long-horizon feature of tie-knotting tasks. ORION is the most closely related work, which builds a graph representation from object motions that can generalize across diverse test environments [33]. However, simple graph representations cannot capture the tie’s complex topology and subtle dynamics during the tie-knotting process.

Consequently, we propose explicitly modeling the demonstration as a sequence of meshes. Mesh can accurately describe the tie’s structure and dynamics, which is crucial to learning correct robot actions and generalizing them to different test scenarios.

2.3 Cloth State Estimation

One line of cloth state estimation methods directly predicts cloth states using deep learning [34, 35, 36]. Non-rigid point cloud registration methods such as coherent point drift are also applied to linear deformable object tracking [37, 38, 39]. However, purely vision-based methods do not guarantee correct cloth topology due to the lack of physics priors. Huang et al. propose a method to reconstruct and track cloth state with a dynamics model [40]. However, this method requires known actions, which are not always easy to obtain from human demonstration.

Therefore, we propose a Hierarchical Feature Matching method to iteratively estimate the tie mesh in the demonstration video with cloth simulation. Cloth simulation provides important physics prior for state estimation, such as non-penetration, which is crucial for maintaining correct topology.

3 Technical Approach

Figure 2: TieBot utilizes simulation to estimate the tie’s meshes from the demonstrated video. Then, using mesh sequences as subgoals, we introduce how to generate the robot’s actions to manipulate the tie. The pipeline finally learns a residual policy to reduce the sim2real gap.

This work presents a Real-to-Sim-to-Real LfD framework called TieBot to guide the dual-arm robot shown in Fig. 1 to knot a tie from an RGB-D demonstration video. An overview of our proposed method is given in Fig. 2. We first describe the procedure to estimate the tie’s mesh sequence from the demonstrated video (Sec. 3.1). Using the tie’s mesh sequence as subgoals, we introduce a pipeline to generate robot actions to manipulate the tie, using a teacher-student training paradigm (Sec. 3.2). Lastly, we discuss learning a residual policy to mitigate Sim2Real gaps (Sec. 3.3).

3.1 Real2Sim

To better estimate the tie’s mesh, we propose to integrate cloth simulation into our pipeline, which provides important physical priors such as non-penetration in the estimation process. We segment the tie in the demonstrated RGB-D video using Track-Anything [41] and transform the associated segmented depth images into point clouds. Meanwhile, a tie mesh is loaded into DiffClothAI. At time step $t$, we use the tie mesh’s vertices, denoted as $\mathcal{X}^S_t$, to describe the tie’s shape. From the RGB images and segmented point clouds, denoted as $\{\mathcal{I}^D_t\}$ and $\{\mathcal{X}^D_t\}$, the Real2Sim pipeline estimates the tie’s mesh sequence $\{\mathcal{X}^S_t\}$ with simulation. The mesh is manually aligned with the initial frame. We assume the initial mesh fully overlaps with the initial point cloud.
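The conversion from a segmented depth image to a point cloud can be sketched with the standard pinhole back-projection. This is a minimal illustration and not the authors' code: the function name, the nested-list image layout, and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions introduced here.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def depth_to_point_cloud(
    depth: Sequence[Sequence[float]],
    mask: Sequence[Sequence[int]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Back-project masked depth pixels into 3D camera coordinates.

    depth[v][u] is the depth (e.g. in meters) at pixel row v, column u.
    mask[v][u] is truthy where the segmentation marks the tie.
    fx, fy, cx, cy are hypothetical pinhole camera intrinsics.
    """
    points: List[Point] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            # Skip pixels outside the tie and pixels with invalid depth.
            if keep and z > 0.0:
                x = (u - cx) * z / fx
                y = (v - cy) * z / fy
                points.append((x, y, z))
    return points
```

A real pipeline would typically vectorize this with an array library; plain lists keep the sketch dependency-free.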

3.1.1 Local Feature Matching

Figure 3: Local feature matching between two images. A hand caused a gap along the length of the tie during the demonstration.

If $\mathcal{X}^S_{t-1}$ and $\mathcal{X}^D_{t-1}$ are aligned and there are correspondences between $\mathcal{X}^D_{t-1}$ and $\mathcal{X}^D_t$, we can build up the correspondences from the tie’s mesh to the next demonstrated point cloud $\mathcal{X}^D_t$ and move the tie’s vertices to align $\mathcal{X}^S_t$ towards $\mathcal{X}^D_t$.

Here we adopt an off-the-shelf feature matching model called LoFTR [42] to build up correspondences between two RGB images $\mathcal{I}^D_{t-1}$ and $\mathcal{I}^D_t$, as shown in Fig. 3. Typically, LoFTR can provide more than a hundred reliable correspondences between two images, which cover almost every visible part of the tie. From the correspondences between $\mathcal{I}^D_{t-1}$ and $\mathcal{I}^D_t$, we can find the feature points on $\mathcal{X}^D_{t-1}$ and their corresponding feature points on $\mathcal{X}^D_t$. Then, to control the mesh in DiffClothAI and align it with $\mathcal{X}^D_t$, we need to define several vertices on the mesh as control vertices $\mathcal{V}$. Since $\mathcal{X}^S_{t-1}$ aligns well with $\mathcal{X}^D_{t-1}$, we map the feature points on $\mathcal{X}^D_{t-1}$ to their nearest vertices on $\mathcal{X}^S_{t-1}$. These vertices are assigned as control vertices $\mathcal{V}$. Finally, we move $\mathcal{V}$ to the positions of the feature points on $\mathcal{X}^D_t$ to align $\mathcal{X}^S_t$ towards $\mathcal{X}^D_t$ in DiffClothAI.
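The mapping from matched feature points to control vertices can be sketched as a nearest-vertex lookup. This is a minimal illustration under stated assumptions: the matched feature points are already lifted to 3D, the function name is hypothetical, and the tie-breaking rule (when two features map to the same vertex, keep the closer one) is a choice made here, not one stated in the paper.

```python
import math
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float, float]


def assign_control_vertices(
    mesh_prev: Sequence[Point],
    feats_prev: Sequence[Point],
    feats_curr: Sequence[Point],
) -> Dict[int, Point]:
    """Map each matched feature pair to a control vertex and its target.

    mesh_prev:  mesh vertices at time t-1 (assumed aligned with the point cloud).
    feats_prev: matched feature points on the point cloud at time t-1.
    feats_curr: corresponding feature points on the point cloud at time t.
    Returns {vertex index: target position at time t}.
    """
    best: Dict[int, Tuple[float, Point]] = {}
    for p_prev, p_curr in zip(feats_prev, feats_curr):
        # Nearest mesh vertex to the feature point at t-1.
        idx = min(range(len(mesh_prev)), key=lambda i: math.dist(mesh_prev[i], p_prev))
        d = math.dist(mesh_prev[idx], p_prev)
        # Assumed tie-break: keep the feature closest to the vertex.
        if idx not in best or d < best[idx][0]:
            best[idx] = (d, p_curr)
    return {idx: target for idx, (_, target) in best.items()}
```

The returned targets would then be handed to the simulator as positional goals for the control vertices; how DiffClothAI consumes them is outside the scope of this sketch.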

However, vanilla local feature matching cannot create correspondences in occluded regions, which is common in tie-knotting tasks. We lose the motion information due to occlusion, and the estimation will deviate. Therefore, we propose to add global keypoints information to amend this pipeline.

3.1.2 Keypoints Detection

Figure 4: The oriented keypoints used to represent the state of the tie. The x, y, and z axes are represented by the red, green, and blue arrows, respectively.

Keypoints detection can directly build correspondences between mesh vertices and the point cloud. Thus, it will not be affected by occlusion. We define five keypoints along the tie’s surface and the corresponding five key vertices on the mesh, shown in Fig. 4. For each keypoint as the origin, we define the local frame as follows. The z direction is the surface normal from the tie’s positive side to the negative side. The x direction is the direction of the tie’s middle skeleton. The y direction is derived using the right-hand rule. These five keypoints, in a predefined order, play the role of the skeleton definition.
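The local frame convention described above can be sketched as follows. This is an illustration only: the function names are hypothetical, and the Gram-Schmidt step that makes the skeleton direction exactly perpendicular to the normal is an assumption added here so that the three axes form an orthonormal, right-handed frame.

```python
import math
from typing import Sequence, Tuple

Vec = Tuple[float, float, float]


def normalize(v: Sequence[float]) -> Vec:
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return (v[0] / n, v[1] / n, v[2] / n)


def cross(a: Sequence[float], b: Sequence[float]) -> Vec:
    """Cross product a x b."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def keypoint_frame(normal: Sequence[float], skeleton_dir: Sequence[float]) -> Tuple[Vec, Vec, Vec]:
    """Build (x, y, z) axes for one keypoint.

    z: surface normal (positive side to negative side of the tie).
    x: skeleton direction, projected onto the plane perpendicular to z.
    y: z cross x, which completes a right-handed frame.
    """
    z = normalize(normal)
    along_z = sum(s * c for s, c in zip(skeleton_dir, z))
    x = normalize(tuple(s - along_z * c for s, c in zip(skeleton_dir, z)))
    y = cross(z, x)
    return x, y, z
```

With unit, mutually perpendicular x and z, the cross product z × x is already unit length, so y needs no further normalization.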

Then, we train Pointnet++ [43] to predict the keypoints and associated local frames on the demonstrated point clouds. However, the high-dimensional state makes it challenging to generate sufficient training data to cover all the states encountered in the knotting procedure. A successful tie-knotting trajectory occupies only a small portion of the whole state space of the tie. Thus, uniformly applying random actions to the initial tie mesh in simulation to produce training data fails to cover these states. Instead, we generate training data based on the current estimated mesh. When we detect that the chamfer distance between $\mathcal{X}^S_t$ and $\mathcal{X}^D_t$ is larger than a threshold, we backtrack to the previous time step, take the tie’s shape $\mathcal{X}^S_{t-1}$, and apply random actions to the tie’s mesh at $t-1$ in the simulation to generate annotated training data and train the keypoint prediction network.
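The backtracking trigger can be sketched as a chamfer distance check. Several chamfer conventions exist (squared or unsquared, summed or averaged); the version below, which sums the two directional mean nearest-neighbor distances, is an assumption, as are the function names and the threshold parameter.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric chamfer distance between two non-empty point sets.

    Assumed convention: mean nearest-neighbor distance from a to b
    plus mean nearest-neighbor distance from b to a.
    """
    def one_way(src: Sequence[Point], dst: Sequence[Point]) -> float:
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


def needs_keypoint_retraining(
    mesh_vertices: Sequence[Point],
    demo_points: Sequence[Point],
    threshold: float,
) -> bool:
    """True when the estimated mesh has drifted too far from the demo point cloud."""
    return chamfer_distance(mesh_vertices, demo_points) > threshold
```

This brute-force version is quadratic in the number of points; a spatial index such as a k-d tree would be the usual choice for full-resolution point clouds.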

3.1.3 Hierarchical Feature Matching (HFM)

Finally, we combine local feature matching and keypoint detection into Hierarchical Feature Matching (HFM) for state estimation. The control vertices assigned in local feature matching and the key vertices are used together to pull the mesh to the target positions specified by local feature matching and keypoint detection. Local feature matching provides the detailed motion of vertices, while the global keypoints indicate the tie’s global structure. This global structural information ensures the estimation does not deviate too much, alleviating the shortcomings of the local feature matching method. We use this method to estimate the tie’s meshes from the demonstration and output a sequence of meshes $\{\mathcal{X}^{\mathcal{G}}_t\}$. The next part uses these meshes as subgoals to guide robot action generation.

3.2 Learn@Sim

Our pipeline begins to sequentially generate feasible robotic actions in the simulation to guide the tie $\{\mathcal{X}^S_t\}$ towards these subgoals. Since the tie-knotting task is a long-horizon task with multiple grasp and pull actions, we aim to learn where to grasp the tie and where to pull the tie.

For where to pull, we apply a simple strategy. Once we identify the grasping vertices, we pull these vertices to the positions of the same vertices on the subgoal. For where to grasp, we adopt a teacher-student training paradigm similar to [10] to ease policy learning. Directly learning from high-dimensional observations such as point clouds is data-inefficient because the policy needs to simultaneously learn which features to extract from visual observations and which actions yield high reward. In contrast, learning a policy via RL from sufficient state information is much easier, as suggested by [10]. Therefore, we first use privileged information to learn a teacher policy, and then train a student policy that imitates the teacher policy with point clouds as observations.
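The pulling strategy amounts to an index lookup, because the simulated mesh and the subgoal mesh share the same vertex indexing. A minimal sketch, with hypothetical function names:

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def pull_targets(grasp_vertex_ids: Sequence[int], subgoal_vertices: Sequence[Point]) -> List[Point]:
    """Target position for each grasped vertex: the same vertex on the subgoal mesh."""
    return [subgoal_vertices[i] for i in grasp_vertex_ids]


def pull_displacements(
    grasp_vertex_ids: Sequence[int],
    current_vertices: Sequence[Point],
    subgoal_vertices: Sequence[Point],
) -> List[Point]:
    """Displacement each grasped vertex must travel to reach its subgoal position."""
    return [
        tuple(g - c for c, g in zip(current_vertices[i], subgoal_vertices[i]))
        for i in grasp_vertex_ids
    ]
```

With the pull target fixed this way, the only decision left to learn is which one or two vertex indices to grasp.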

Teacher Policy

We first learn a teacher policy to select proper grasping points using privileged information. The state $s$ contains the previous positions of the tie’s vertices and the point-wise displacement from each tie vertex to the subgoal. The action $a$ selects one or two grasping vertices among all the tie mesh’s vertices. The reward function $\mathcal{R}$ is defined in Equation 1.

Note that we specify the action space as a discrete space (the vertex index on the tie). Although there are multiple 6D poses with which the robotic grippers can grasp one vertex position of the mesh, the learned policy still reflects the overall grasping quality of the 6D poses associated with that vertex. In practice, we record each grasping pose offline so that once we determine the grasping vertices on the tie’s mesh at each time step, we can automatically produce feasible grasping poses for the specific hardware platform using inverse kinematics.

$$
\mathcal{R}(s,a)=\begin{cases}
C_{1}, & \text{if knotting-tie succeeds}\\
-C_{2}, & \text{if fails to reach any subgoal along the trajectory}\\
C_{3}-\|\mathcal{X}^{S}_{t}-\mathcal{X}^{\mathcal{G}}_{t}\|, & \text{otherwise}
\end{cases}
\tag{1}
$$

In the reward function, $C_1, C_2, C_3$ are positive constants and $\mathcal{X}^S_t$ is the resulting tie mesh. A failure to reach a subgoal means that the distance $\|\mathcal{X}^S_t-\mathcal{X}^{\mathcal{G}}_t\|$ is larger than a given threshold, i.e., the tie could not be pulled close to the subgoal because a wrong vertex was selected for grasping. Otherwise, the reward is $C_3-\|\mathcal{X}^S_t-\mathcal{X}^{\mathcal{G}}_t\|$ for intermediate steps or $C_1$ for the final step.
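The reward in Equation 1 can be sketched as a small function. The paper does not specify the norm or the constants, so the Euclidean norm over stacked vertex coordinates, the function names, and the reading of "knotting-tie succeeds" as "the final subgoal is reached within the threshold" are assumptions made for this illustration.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def mesh_distance(current: Sequence[Point], goal: Sequence[Point]) -> float:
    """Assumed norm: Euclidean norm over all stacked vertex coordinates."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(current, goal)))


def subgoal_reward(
    current: Sequence[Point],
    subgoal: Sequence[Point],
    is_final_step: bool,
    threshold: float,
    c1: float,
    c2: float,
    c3: float,
) -> float:
    """Reward following the three cases of Equation 1 (c1, c2, c3 > 0)."""
    d = mesh_distance(current, subgoal)
    if d > threshold:
        # The tie could not be pulled close enough to the subgoal.
        return -c2
    if is_final_step:
        # Last subgoal reached: the knot is considered complete.
        return c1
    # Intermediate step: closer to the subgoal means higher reward.
    return c3 - d
```

Checking the failure case first is equivalent to the ordering in Equation 1 under the assumption above, since success then implies the final distance is within the threshold.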

Student Policy

To learn actions from point clouds, we train a student policy to imitate the teacher policy. We add perturbations to the size and position of the mesh and update the associated trajectories accordingly to generate training data in simulation. We render point clouds from meshes in PyBullet [44] as the input to our policy network $\pi^{sim}$, which outputs the positions of the grasping points and placing points. We use Pointnet++ [43] as the policy network and train it in a supervised manner.
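The size and position perturbation can be sketched as a uniform scale about the mesh centroid followed by a translation, applied identically to the mesh and to the teacher's grasping and placing points so the labels stay consistent. The perturbation model, the ranges, and all names below are hypothetical; the paper does not give these details.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def centroid(points: Sequence[Point]) -> Point:
    """Mean position of a non-empty point set."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def transform(points: Sequence[Point], center: Point, scale: float, offset: Point) -> List[Point]:
    """Scale points about `center`, then translate by `offset`."""
    return [
        tuple(center[i] + scale * (p[i] - center[i]) + offset[i] for i in range(3))
        for p in points
    ]


def perturb_sample(
    mesh_vertices: Sequence[Point],
    grasp_points: Sequence[Point],
    place_points: Sequence[Point],
    rng: random.Random,
    scale_range: Tuple[float, float] = (0.9, 1.1),  # hypothetical range
    max_shift: float = 0.02,                         # hypothetical range
) -> Tuple[List[Point], List[Point], List[Point]]:
    """Apply one random size/position perturbation to a training sample."""
    scale = rng.uniform(*scale_range)
    offset = tuple(rng.uniform(-max_shift, max_shift) for _ in range(3))
    c = centroid(mesh_vertices)
    return (
        transform(mesh_vertices, c, scale, offset),
        transform(grasp_points, c, scale, offset),
        transform(place_points, c, scale, offset),
    )
```

Passing an explicit `random.Random` instance keeps the generated dataset reproducible from a seed.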

3.3 Sim2Real

When the robot knots the tie in the real world, it receives a segmented point cloud of the tie, denoted as $\mathcal{X}^{real}_t$, at each time step. Instead of directly applying the action $\pi^{sim}(\mathcal{X}^{real}_t)$ in the real world, we train a residual policy that takes in the real point cloud $\mathcal{X}^{real}_t$ and the output of $\pi^{sim}(\mathcal{X}^{real}_t)$ and outputs small offsets to the positions of the predicted grasping points and placing points. Combining the small offsets with the predicted grasping and placing point positions, we generate the final action in the real setting. We follow the training process for the residual policy described in [45].
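The way the residual policy corrects the simulation-trained policy can be sketched as an additive composition. The callables below stand in for the trained networks, and the per-axis clipping that keeps the offsets small is an assumption added for illustration, not a detail from the paper.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]
SimPolicy = Callable[[Sequence[Point]], List[Point]]
ResidualPolicy = Callable[[Sequence[Point], Sequence[Point]], List[Point]]


def compose_action(
    real_point_cloud: Sequence[Point],
    sim_policy: SimPolicy,
    residual_policy: ResidualPolicy,
    max_offset: float = 0.02,  # hypothetical bound on each offset component
) -> List[Point]:
    """Final real-world action: sim prediction plus a small learned offset.

    sim_policy maps the point cloud to grasping/placing point positions.
    residual_policy maps (point cloud, sim prediction) to per-point offsets.
    """
    base = sim_policy(real_point_cloud)
    offsets = residual_policy(real_point_cloud, base)
    action: List[Point] = []
    for point, offset in zip(base, offsets):
        # Assumed safeguard: clip each offset component to [-max_offset, max_offset].
        clipped = tuple(max(-max_offset, min(max_offset, o)) for o in offset)
        action.append(tuple(p + o for p, o in zip(point, clipped)))
    return action
```

Because the residual only adds a bounded correction, the behavior learned in simulation remains the dominant term of the executed action.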

4 Experiments

In this section, we introduce our experimental setup and conduct quantitative and qualitative evaluations to demonstrate the effectiveness of our approach. Our experiments focus on answering the following questions.

  • How do our pipeline and baseline methods perform on tie-knotting task?

  • How does HFM compare to other cloth state estimation methods?

  • Can our HFM apply to other cloth manipulation tasks?

Considering the complexity of the entire system, we provide additional experiment results, along with detailed explanations of submodules, in the supplementary materials and website.

4.1 Comparing TieBot and baseline

We first evaluate the whole pipeline of TieBot and a baseline method in a tie-knotting task. We estimate a sequence of meshes from one human demonstration video. Then, we divide the whole trajectory into 6 parts with 6 subgoals. Our teacher policy learns to select proper grasping points using PPO [46], and student policy imitates the teacher policy to infer grasping points and placing points from the point cloud. We evaluate TieBot and the baseline method 10 times for each of the two different ties in DiffClothAI and evaluate TieBot on two real ties with a dual-arm robot.

Metrics

We compare the success rate of our pipeline with that of ATM [8]. In simulation experiments, we consider a trial a success if the distance from the final tie mesh to the target tie mesh is smaller than a threshold. In real-world experiments, we consider a trial a success if the little end of the tie is inserted into the hole, as shown in the final stage in Fig. 5. Since the tie-knotting task is long-horizon, we also compute the average number of achieved subgoals for further evaluation.

ATM

ATM proposes to model tasks as points trajectories [8]. It first learns a trajectory prediction model, and then learns policy with the learned prediction model using imitation learning. Following similar experiment settings in ATM, we collect 100 demonstration videos in simulation to train the trajectory prediction module. Then, we use the 45 demonstration videos with ground truth action annotations to train the policy network and test the policy in simulation. The action is the 3D offset of the grasping vertices.

| Success Rate / Average Achieved Subgoals | Ours | ATM |
|---|---|---|
| normal tie | 60% / 5.1 | 0% / 0.0 |
| larger tie (unseen) | 30% / 4.3 | 0% / 0.0 |
| real tie 1 (softer) | 50% / 5.0 | NA / NA |
| real tie 2 (harder, unseen) | 30% / 4.15 | NA / NA |

Table 1: Success rate and average achieved subgoals of policy rollouts

Experiment Result

We test TieBot on two real ties that differ in material: one is softer, and the other is harder. We test each of them 10 times. We also test TieBot and ATM in simulation on two ties of different sizes, 10 times each. The quantitative results are shown in Tab. 1, and qualitative results of TieBot are shown in Fig. 5. ATM quickly deviates from the correct trajectory since it cannot capture the subtle dynamics of the tie, and therefore fails to achieve even one subgoal. This comparison suggests that object trajectories are insufficient to represent the subtle dynamics and topology of the tie in tie-knotting tasks, and that explicitly modeling the tie as a mesh is necessary. For qualitative evaluation, Fig. 5 shows that although the tie in the demonstration video, the mesh in the simulation, and the tie used for real robot manipulation are different, our pipeline can overcome these gaps and learn a feasible robot policy.

Figure 5: The results of TieBot at different stages. We show different sides of the tie in red and blue and manipulation action in yellow to better visualize.

4.2 Evaluating Hierarchical Feature Matching (HFM)

Real2Sim is the most important part of our pipeline. Without accurate state estimation, particularly estimating the correct topology for the tie, it’s impossible to learn feasible policy to finish the task. To illustrate the importance of different components of HFM and its performance against other cloth state estimation methods, we design three experiments in simulation to test baseline methods and the ablation versions of HFM.

Coherent Point Drift

Coherent Point Drift (CPD) is a non-rigid point cloud registration algorithm. We employ CPD to predict the target positions of the mesh vertices in the target point cloud and directly align the mesh to the target positions.

Ablated Version

Ours w/o KP stands for only using local feature matching; Ours w/o LF stands for using local feature matching and the predicted keypoints positions; Ours w/o FM stands for only using predicted keypoints positions and local frames.

Experiment Result

The qualitative results are shown in Fig. 6. Neither CPD nor the ablated versions of HFM estimate the target mesh correctly across these three experiments. We also compute the L2 distance between the vertices of the target mesh and the estimated mesh as a quantitative evaluation, shown in Tab. 2. It likewise suggests that performance degrades if we remove parts of HFM, while CPD deviates considerably from the correct states.

L2 Distance   Ours     Ours w/o FM   Ours w/o KP   Ours w/o LF   CPD
exp1          0.0248   0.0732        0.2424        0.0512        0.1384
exp2          0.0053   0.0107        0.0123        0.0088        0.0661
exp3          0.0032   0.0093        0.0053        0.0049        0.1049
Table 2: Quantitative results of ablation study and comparison to CPD in simulation
Figure 6: Visualization of the ablation study of HFM in simulation. A cross in the bottom-right corner of an image marks a failure to estimate the correct target state, as judged by human evaluation. Red and blue represent the two sides of the mesh.

4.3 Apply HFM on Other Cloth Manipulation Tasks

We demonstrate that HFM can be applied to other cloth manipulation tasks: one is a different way of knotting a tie, and the other is folding a towel. We visualize the estimation results in Fig. 7.

Figure 7: Visualization of a different way to knot a tie and of towel folding. The first row is the human demonstration of tie-knotting, the second row is the estimated states in simulation, and the third row shows the towel folding. For better visualization, we show the manipulation actions as yellow dots and arrows.

5 Conclusion

This work introduces TieBot, a Real-to-Sim-to-Real LfD framework for robots to learn to knot a tie from visual demonstration. TieBot introduces the Hierarchical Feature Matching approach to iteratively estimate a sequence of tie meshes from the demonstration video. TieBot adopts a teacher-student training paradigm to learn grasping points and placing points from point clouds. Lastly, our pipeline learns a residual policy when the imitated policy is applied to real-world execution, mitigating the Sim2Real gap. We demonstrate that a dual-arm robot successfully knots the tie with our framework, achieving a 50% success rate over 10 trials.

Nonetheless, our pipeline has some limitations. First, it still requires manually setting the initial state of the tie at the beginning of the Real2Sim stage. Second, our Real2Sim module requires training keypoint detection models iteratively, which is time-consuming. Third, due to hardware limits, the last step in the real-world experiments shown in Fig. 5 is a hard-coded action. Better cloth mesh reconstruction methods, video tracking methods, and more dexterous robot arms may alleviate these issues.

References

  • Zhang et al. [2018] T. Zhang, Z. McCarthy, O. Jow, D. Lee, X. Chen, K. Goldberg, and P. Abbeel. Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 5628–5635, 2018. doi:10.1109/ICRA.2018.8461249.
  • Nair et al. [2022] S. Nair, A. Rajeswaran, V. Kumar, C. Finn, and A. Gupta. R3m: A universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601, 2022.
  • Radosavovic et al. [2023] I. Radosavovic, T. Xiao, S. James, P. Abbeel, J. Malik, and T. Darrell. Real-world robot learning with masked visual pre-training. In Conference on Robot Learning, pages 416–426. PMLR, 2023.
  • Shaw et al. [2023] K. Shaw, S. Bahl, and D. Pathak. Videodex: Learning dexterity from internet videos. In Conference on Robot Learning, pages 654–665. PMLR, 2023.
  • Seita et al. [2022] D. Seita, Y. Wang, S. Shetty, E. Li, Z. Erickson, and D. Held. ToolFlowNet: Robotic Manipulation with Tools via Predicting Tool Flow from Point Clouds. In Conference on Robot Learning (CoRL), 2022.
  • Das et al. [2021] N. Das, S. Bechtle, T. Davchev, D. Jayaraman, A. Rai, and F. Meier. Model-based inverse reinforcement learning from visual demonstrations. In Conference on Robot Learning, pages 1930–1942. PMLR, 2021.
  • Kumar et al. [2023] S. Kumar, J. Zamora, N. Hansen, R. Jangir, and X. Wang. Graph inverse reinforcement learning from diverse videos. In Conference on Robot Learning, pages 55–66. PMLR, 2023.
  • Wen et al. [2023] C. Wen, X. Lin, J. So, K. Chen, Q. Dou, Y. Gao, and P. Abbeel. Any-point trajectory modeling for policy learning, 2023.
  • Yu et al. [2023] X. Yu, S. Zhao, S. Luo, G. Yang, and L. Shao. Diffclothai: Differentiable cloth simulation with intersection-free frictional contact and differentiable two-way coupling with articulated rigid bodies. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2023.
  • Chen et al. [2023] T. Chen, M. Tippur, S. Wu, V. Kumar, E. Adelson, and P. Agrawal. Visual dexterity: In-hand reorientation of novel and complex object shapes. Science Robotics, 8(84):eadc9244, 2023. doi:10.1126/scirobotics.adc9244. URL https://www.science.org/doi/abs/10.1126/scirobotics.adc9244.
  • Ha and Song [2022] H. Ha and S. Song. Flingbot: The unreasonable effectiveness of dynamic manipulation for cloth unfolding. In Conference on Robot Learning, pages 24–33. PMLR, 2022.
  • Wu et al. [2019] Y. Wu, W. Yan, T. Kurutach, L. Pinto, and P. Abbeel. Learning to manipulate deformable objects without demonstrations. arXiv preprint arXiv:1910.13439, 2019.
  • Deng et al. [2022] Y. Deng, C. Xia, X. Wang, and L. Chen. Deep reinforcement learning based on local gnn for goal-conditioned deformable object rearranging. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1131–1138. IEEE, 2022.
  • Wang et al. [2023] Y. Wang, Z. Sun, Z. Erickson, and D. Held. One policy to dress them all: Learning to dress people with diverse poses and garments. In Robotics: Science and Systems (RSS), 2023.
  • Yan et al. [2020] M. Yan, Y. Zhu, N. Jin, and J. Bohg. Self-supervised learning of state estimation for manipulating deformable linear objects. IEEE robotics and automation letters, 5(2):2372–2379, 2020.
  • Avigal et al. [2022] Y. Avigal, L. Berscheid, T. Asfour, T. Kröger, and K. Goldberg. Speedfolding: Learning efficient bimanual folding of garments. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1–8. IEEE, 2022.
  • Seita et al. [2019] D. Seita, N. Jamali, M. Laskey, A. K. Tanwani, R. Berenstein, P. Baskaran, S. Iba, J. Canny, and K. Goldberg. Deep transfer learning of pick points on fabric for robot bed-making. In The International Symposium of Robotics Research, pages 275–290. Springer, 2019.
  • Deng et al. [2023] Y. Deng, K. Mo, C. Xia, and X. Wang. Learning language-conditioned deformable object manipulation with graph dynamics. arXiv preprint arXiv:2303.01310, 2023.
  • Weng et al. [2022] T. Weng, S. M. Bajracharya, Y. Wang, K. Agrawal, and D. Held. Fabricflownet: Bimanual cloth manipulation with a flow-based policy. In Conference on Robot Learning, pages 192–202. PMLR, 2022.
  • Sermanet et al. [2018] P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, and S. Levine. Time-contrastive networks: Self-supervised learning from video. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1134–1141. IEEE, 2018.
  • Ma et al. [2022] Y. J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V. Kumar, and A. Zhang. Vip: Towards universal visual reward and representation via value-implicit pre-training. arXiv preprint arXiv:2210.00030, 2022.
  • Xiao et al. [2022] T. Xiao, I. Radosavovic, T. Darrell, and J. Malik. Masked visual pre-training for motor control. arXiv preprint arXiv:2203.06173, 2022.
  • Mendonca et al. [2023] R. Mendonca, S. Bahl, and D. Pathak. Structured world models from human videos. 2023.
  • Vecerik et al. [2023] M. Vecerik, C. Doersch, Y. Yang, T. Davchev, Y. Aytar, G. Zhou, R. Hadsell, L. Agapito, and J. Scholz. Robotap: Tracking arbitrary points for few-shot visual imitation. arXiv preprint arXiv:2308.15975, 2023.
  • Bahl et al. [2022] S. Bahl, A. Gupta, and D. Pathak. Human-to-robot imitation in the wild. 2022.
  • Bharadhwaj et al. [2023] H. Bharadhwaj, A. Gupta, S. Tulsiani, and V. Kumar. Zero-shot robot manipulation from passive human videos. arXiv preprint arXiv:2302.02011, 2023.
  • Xiong et al. [2021] H. Xiong, Q. Li, Y.-C. Chen, H. Bharadhwaj, S. Sinha, and A. Garg. Learning by watching: Physical imitation of manipulation skills from human videos. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7827–7834. IEEE, 2021.
  • Bahl et al. [2023] S. Bahl, R. Mendonca, L. Chen, U. Jain, and D. Pathak. Affordances from human videos as a versatile representation for robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13778–13790, 2023.
  • Schmeckpeper et al. [2020] K. Schmeckpeper, A. Xie, O. Rybkin, S. Tian, K. Daniilidis, S. Levine, and C. Finn. Learning predictive models from observation and interaction. In European Conference on Computer Vision, pages 708–725. Springer, 2020.
  • Escontrela et al. [2024] A. Escontrela, A. Adeniji, W. Yan, A. Jain, X. B. Peng, K. Goldberg, Y. Lee, D. Hafner, and P. Abbeel. Video prediction models as rewards for reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024.
  • Du et al. [2024] Y. Du, S. Yang, B. Dai, H. Dai, O. Nachum, J. Tenenbaum, D. Schuurmans, and P. Abbeel. Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems, 36, 2024.
  • Ko et al. [2023] P.-C. Ko, J. Mao, Y. Du, S.-H. Sun, and J. B. Tenenbaum. Learning to act from actionless videos through dense correspondences. arXiv preprint arXiv:2310.08576, 2023.
  • Zhu et al. [2024] Y. Zhu, A. Lim, P. Stone, and Y. Zhu. Vision-based manipulation from single human video with open-world object graphs. arXiv preprint arXiv:2405.20321, 2024.
  • Chi and Song [2021] C. Chi and S. Song. Garmentnets: Category-level pose estimation for garments via canonical space shape completion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3324–3333, 2021.
  • Xue et al. [2023] H. Xue, W. Xu, J. Zhang, T. Tang, Y. Li, W. Du, R. Ye, and C. Lu. Garmenttracking: Category-level garment pose tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21233–21242, 2023.
  • Huang et al. [2022] Z. Huang, X. Lin, and D. Held. Mesh-based dynamics with occlusion reasoning for cloth manipulation. arXiv preprint arXiv:2206.02881, 2022.
  • Tang et al. [2017] T. Tang, Y. Fan, H.-C. Lin, and M. Tomizuka. State estimation for deformable objects by point registration and dynamic simulation. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2427–2433. IEEE, 2017.
  • Tang et al. [2018] T. Tang, C. Wang, and M. Tomizuka. A framework for manipulating deformable linear objects by coherent point drift. IEEE Robotics and Automation Letters, 3(4):3426–3433, 2018.
  • Chi and Berenson [2019] C. Chi and D. Berenson. Occlusion-robust deformable object tracking without physics simulation. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6443–6450. IEEE, 2019.
  • Huang et al. [2023] Z. Huang, X. Lin, and D. Held. Self-supervised cloth reconstruction via action-conditioned cloth tracking. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 7111–7118. IEEE, 2023.
  • Yang et al. [2023] J. Yang, M. Gao, Z. Li, S. Gao, F. Wang, and F. Zheng. Track anything: Segment anything meets videos, 2023.
  • Sun et al. [2021] J. Sun, Z. Shen, Y. Wang, H. Bao, and X. Zhou. LoFTR: Detector-free local feature matching with transformers. CVPR, 2021.
  • Qi et al. [2017] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413, 2017.
  • Coumans and Bai [2016–2021] E. Coumans and Y. Bai. Pybullet, a python module for physics simulation for games, robotics and machine learning. http://pybullet.org, 2016–2021.
  • Silver et al. [2018] T. Silver, K. Allen, J. Tenenbaum, and L. Kaelbling. Residual policy learning. arXiv preprint arXiv:1812.06298, 2018.
  • Schulman et al. [2017] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.