
TacDiffusion: Force-domain Diffusion Policy for Precise Tactile Manipulation

Yansong Wu*, Zongxie Chen*, Fan Wu, Lingyun Chen, Liding Zhang,
Zhenshan Bing, Abdalla Swikir, Alois Knoll, Sami Haddadin
The authors are with the Munich Institute of Robotics and Machine Intelligence (MIRMI), Technical University of Munich, Germany. Authors marked with * contributed equally to this work. Corresponding author: Fan Wu (f.wu@tum.de). The authors acknowledge the financial support by the Bavarian State Ministry for Economic Affairs, Regional Development and Energy (StMWi) for the Lighthouse Initiative KI.FABRIK (Phase 1: Infrastructure as well as the research and development program, grant no. DIK0249). Code available: https://github.com/popnut123/TacDiffusion
Abstract

Assembly is a crucial skill for robots in both modern manufacturing and service robotics. However, mastering transferable insertion skills that can handle a variety of high-precision assembly tasks remains a significant challenge. This paper presents a novel framework that utilizes diffusion models to generate 6D wrenches for high-precision tactile robotic insertion tasks. It learns from demonstrations performed on a single task and achieves a zero-shot transfer success rate of 95.7% across various novel high-precision tasks. Our method effectively inherits the self-adaptability demonstrated by our previous work. In this framework, we address the frequency misalignment between the diffusion policy and the real-time control loop with a dynamic system-based filter, significantly improving the task success rate by 9.15%. Furthermore, we provide a practical guideline regarding the trade-off between diffusion models’ inference ability and speed.

I Introduction

Assembly tasks are crucial in robotics, serving as the backbone of modern manufacturing and service applications [1]. As the demand for flexible manufacturing grows, robotic assembly increasingly takes place in dynamic environments, where objects are not precisely positioned at known locations and part holders are often not viable [2]. Achieving both broad transferability and precise control capabilities in these conditions remains a significant challenge. Human workers, on the other hand, demonstrate exceptional dexterity in assembling diverse objects with tight-clearance components, primarily by leveraging tactile feedback from their fingertips throughout the process [3, 4]. Similarly, a versatile high-precision robotic assembly system must exhibit both task-level transferability—generalizing across a wide range of objects and parts—and control-level self-adaptability, enabling it to respond to environmental changes often sensed through tactile feedback [5, 6].

Throughout the history of robotics research, the importance of tactile feedback and force control for high-precision assembly has been consistently recognized [7, 8, 9, 10, 11, 12, 5, 13]. However, several challenges persist in precise force control, including the difficulty of accessing appropriate robot hardware and expensive force sensors, the complexity of ensuring stability and safety while regulating force, the sensitivity of force control to environmental changes, the difficulty of estimating environment constraints and contact dynamics in dynamic settings, and the challenge of collecting high-quality tactile data for learning force control. Due to these barriers, the robot learning community often favors simpler motion-domain action spaces, with impedance control serving as an indirect force control method. Nevertheless, the increasing diversity of contact-rich manipulation tasks highlights the equal importance of simultaneously regulating motion, compliance, and force, so that agents can autonomously perform a wide range of tasks stably and robustly, without the need for explicit controller switching [14]. Despite the recent successes of transformer-based [15, 16, 17, 18, 19, 20] and/or diffusion-based [21, 22, 23, 24, 25, 26, 27, 28, 29, 30] policies for robot manipulation that exhibit excellent generalization capability, it remains unexplored how to integrate force control with these models for high-precision tactile manipulation, so that the benefits of these generative models for multi-modal modelling and prediction can be fully exploited.

To address this gap, and aiming to achieve both task-level transferability and control-level self-adaptability, we propose TacDiffusion, a novel framework that leverages a diffusion policy for high-precision tactile manipulation. To the authors’ knowledge, it is the first framework to employ diffusion models in generating force-domain actions for tactile-based robotic manipulation in tight-clearance insertion tasks. TacDiffusion learns from demonstrations performed by expert policies on a single task and achieves an overall 95.7% zero-shot transfer success rate across various novel high-precision, sub-millimeter-level peg-in-hole tasks. By imitating the expert policies, which are based on a behavior tree-based skill proposed in our previous work [31], TacDiffusion successfully inherits its self-adaptability, characterized by the ability to switch skill primitives based on real-time tactile sensing. Importantly, compared to the expert policy, TacDiffusion also outperforms in execution time and robustness on these novel tasks in a zero-shot transfer manner.

To further enhance real-time performance, we investigate how model size affects the trade-off between accuracy and inference speed, providing practical guidelines for optimal model selection. Moreover, to handle the frequency misalignment between the diffusion policy’s inference process and the low-level controller, a dynamic system-based filter is designed to smooth the output of the diffusion model for high-frequency force-impedance control, significantly improving the task success rate by 9.15%.

In summary, our main contributions are: (i) a novel diffusion-based policy that outputs a 6D wrench for tactile manipulation; (ii) learning from a behavior tree-based expert policy to inherit its tactile-based self-adaptability; (iii) a dynamic system-based filter that smooths the low-frequency outputs of the diffusion model and aligns them with the high-frequency control loop, with experimental evidence showing a significant effect on task performance; (iv) an investigation of the trade-off between accuracy and inference speed, resulting in insights for optimal model selection in practice.

II Related Works

In this section, we focus our review on (i) High-Precision Assembly Tasks, (ii) Transferability and (iii) Diffusion model in robotics.

II-A High-Precision Assembly Tasks

Due to the robot’s accuracy limitation, position-based control methods are insufficient for high-precision assembly tasks that require accuracy exceeding the robot’s precision [10]. To address this issue, recent studies have shifted to designing actions in the force domain rather than the position domain to perform high-precision robotic assembly tasks. According to the control strategy, these methods span four main categories: force controller [9], admittance controller [8], hybrid position/force controller [10, 13], and impedance controller with feed-forward force [11, 12]. Nevertheless, these works normally focus on a specific tight-clearance task and lack investigation into the method’s transferability and adaptability to novel tasks [5].

II-B Enhancing Transferability in Robotic Assembly

Over the last decade, an extensive literature has emerged on generating robotic assembly policies with broad generalization. Deep reinforcement learning-based methods, for instance, typically achieve generalization by training with multiple objects [32, 33]. Another noteworthy case is meta-learning, which pre-trains a model using online or offline data from a diverse and comprehensive set of tasks, enabling domain adaptation through fine-tuning [34, 35]. Furthermore, sim-to-real approaches have gained attention for their cost-effective data collection in simulation, and zero-shot sim-to-real transfer for perception-initialized assembly has only recently been demonstrated [36, 37]. In addition, to tackle precise manipulation, RVT-2 [20] trained a transformer-based multi-task policy. Despite improving performance on multi-task learning benchmarks, its success rate on high-precision (millimeter-level) insertion tasks, roughly 50%, is far from satisfactory for deployment in real assembly production. Aside from these approaches, evolutionary algorithms with parameterized robot skills have shown transferability across tasks via fine-tuning [31]. However, achieving zero-shot transfer on high-precision tasks with a satisfactory success rate in the real world remains an open challenge.

II-C Diffusion Model in Robotics

Meanwhile, in other areas of robotics, diffusion models [38] have made significant progress. Compared to traditional discriminative models, diffusion models excel in generalization, achieving superior performance on unseen tasks and scenarios, by establishing a stochastic transport map between an empirically observed target distribution and a known prior [30]. Recent works have typically used scene images as input to solve planning problems [23, 24, 29] and perform manipulation tasks [25, 26, 27, 28] in robotics. However, the application of diffusion models with other input modalities remains relatively underexplored in robotics, with only a few studies addressing this area [22]. In addition, considering diffusion model applications in sequential behavior imitation [21] and time series processing [39], there is great potential for adapting diffusion models to force-domain actions in robotics.

In summary, although significant progress has been made in insertion tasks, achieving zero-shot transfer in high-precision assembly tasks remains an ongoing challenge. Additionally, the application of diffusion models to force-domain actions has not yet been explored. To bridge these gaps, we propose a novel framework that leverages diffusion models to enable more efficient zero-shot transfer in high-precision insertion tasks.

III Methods

To solve the aforementioned issues, we develop a framework that adapts the diffusion model to force-domain actions for high-precision tactile assembly tasks. In the following subsections, we first provide an overview of the framework, followed by a detailed explanation of the concrete modules, i.e., the diffusion model, the impedance control with feed-forward force, and the dynamic system-based filter.

III-A Framework Overview

Our framework comprises two key functional modules: the diffusion policy-based action generation module and the execution module based on impedance control with feed-forward force. (Footnote 1: A practical consideration here is the compatibility issue between the real-time kernel and the NVIDIA CUDA Toolkit.) As illustrated in the framework overview figure, the diffusion-based policy is integrated into the behavior tree (BT) based Insertion skill by replacing the original sub-tree, which contained two primitives and a state estimator. The resultant behavior tree is simplified into a sequence of skill primitives, with “approach” and “contact” as two preceding primitives. As the BT is simplified into a sequence and the diffusion model handles primitive-switching, the discussion of the preceding skill primitives for contact initialization is beyond the scope of this work. For more details, we refer readers to our previous work [31].

During the assembly process, the interaction between the robot and the environment is captured as the observation $\bm{o}$, which includes the external wrench, the internal wrench, and the end-effector’s speed. The diffusion model then predicts the force-domain action ($\bm{a} := \bm{F}_{df}$) based on both the current observation $\bm{o}_{curr}$ and the previous observation $\bm{o}_{prev}$. Due to limited computational resources, the diffusion model’s inference frequency typically ranges from 50 Hz to 500 Hz (Table II), which is misaligned with the robot’s 1000 Hz real-time control loop. To mitigate this, we designed a dynamic system-based filter to interpolate the diffusion model’s predictions $\bm{F}_{df}$. The filtered action is then transmitted to the impedance controller with feed-forward force. Based on the desired goal $\bm{x}_d$ (the insertion hole’s pose) and the force command, it regulates the robot’s motion and force behavior simultaneously.

III-B Diffusion Model

Denoising diffusion probabilistic model (DDPM) [38, 40, 41] is a specific type of diffusion model designed to generate data by learning to reverse a noise injection process. DDPM consists of two processes: diffusion and denoising. The diffusion process systematically transforms the data into noise, while the denoising process is responsible for converting this noise back into data.

Figure 1: Network architecture of the noise estimator.

III-B1 Diffusion Process

The diffusion process is a forward process that progressively corrupts data with noise over a series of steps. By progressively injecting noise into a “clean” initial action $\bm{a}_0$, a sequence of “polluted” actions $\bm{a}_1, \bm{a}_2, \cdots, \bm{a}_T$ converging to a Gaussian distribution is obtained, according to the diffusion rule [21]:

$$\alpha_\tau = 1 - \beta_\tau, \tag{1}$$
$$\bm{a}_\tau = \sqrt{\alpha_\tau}\,\bm{a}_{\tau-1} + \sqrt{\beta_\tau}\,\bm{\epsilon}_\tau, \tag{2}$$

where $\tau \in [1, T]$ denotes the diffusion step, with $T$ referring to the total number of denoising steps (not to be confused with the environment time step, as is common in time series). $\bm{a}_\tau$ and $\bm{\epsilon}_\tau \sim \mathcal{N}(\bm{0}, \bm{I})$ represent the diffused action and the corresponding noise in the $\tau$-th diffusion step. $\alpha_\tau$ and $\beta_\tau$ are variance schedule parameters that regulate the noise mixed in at each diffusion step.
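As an illustration of Eqs. (1) and (2), the following minimal sketch iterates the forward diffusion rule on a single 6D action. It assumes a linear increase of the noise weight from 10^-4 to 10^-2 over T = 50 steps (Table I only states that the weight is "increased" between these values), and the clean wrench `a0` is a made-up example value rather than data from the paper.

```python
import numpy as np

T = 50                               # diffusion horizon (Table I)
betas = np.linspace(1e-4, 1e-2, T)   # assumption: linear schedule between the Table I bounds
alphas = 1.0 - betas                 # Eq. (1)


def forward_diffusion(a0, rng):
    """Return the sequence a_0, a_1, ..., a_T obtained by iterating Eq. (2)."""
    trajectory = [np.asarray(a0, dtype=float)]
    for tau in range(1, T + 1):
        eps = rng.standard_normal(trajectory[-1].shape)  # eps_tau ~ N(0, I)
        a_next = (np.sqrt(alphas[tau - 1]) * trajectory[-1]
                  + np.sqrt(betas[tau - 1]) * eps)
        trajectory.append(a_next)
    return np.stack(trajectory)


rng = np.random.default_rng(0)
a0 = np.array([0.0, 0.0, 5.0, 0.0, 0.0, 0.0])  # hypothetical clean 6D wrench
traj = forward_diffusion(a0, rng)              # shape (T + 1, 6)
```

Row `traj[tau]` holds the action after `tau` noise injections, so `traj[0]` is the clean action and `traj[T]` is the most heavily corrupted one.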

Furthermore, the noise $\bm{\epsilon}_\tau$ also plays a crucial role in the subsequent denoising process. To account for this, we construct the noise estimator $\hat{\bm{\epsilon}}(\cdot)$ using a residual neural network [42], as illustrated in Fig. 1, and train it by minimizing the following loss function:

$$\bm{\mathcal{L}}_{DDPM} = \mathbb{E}\left[\left\| \hat{\bm{\epsilon}}_\tau(\bm{o}, \bm{a}_\tau, \tau) - \bm{\epsilon}_\tau \right\|_2^2\right], \tag{3}$$

where $\bm{o}$ includes both the current and previous observations, as incorporating historical information helps identify trends and enhances the accuracy of predicting future actions. The diffusion step $\tau$ serves as positional information, enabling the network to recognize the current diffusion stage effectively [43].
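To make the training objective in Eq. (3) concrete, the sketch below evaluates it on one mini-batch. It relies on two assumptions that the text does not spell out: the diffusion step is drawn uniformly per sample, and the noisy action is produced in one shot with the standard DDPM closed form (the cumulative product of the alphas) instead of iterating Eq. (2). The name `noise_estimator` and the 36-dimensional conditioning vector (current plus previous 18-dimensional observation) are placeholders, not the authors' network interface.

```python
import numpy as np


def ddpm_loss(noise_estimator, obs, a0, alpha_bar, rng):
    """Monte Carlo estimate of Eq. (3) on one mini-batch (sketch, not the authors' code)."""
    batch, T = a0.shape[0], alpha_bar.shape[0]
    tau = rng.integers(1, T + 1, size=batch)   # assumption: uniform diffusion step per sample
    eps = rng.standard_normal(a0.shape)        # target noise eps_tau
    ab = alpha_bar[tau - 1][:, None]
    # Standard DDPM closed form of repeated Eq. (2): a_tau = sqrt(ab) a_0 + sqrt(1 - ab) eps
    a_tau = np.sqrt(ab) * a0 + np.sqrt(1.0 - ab) * eps
    eps_hat = noise_estimator(obs, a_tau, tau)
    return float(np.mean(np.sum((eps_hat - eps) ** 2, axis=1)))


rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 1e-2, 50))  # assumed linear schedule
obs = np.zeros((8, 36))                                    # placeholder [o_curr, o_prev]
a0 = np.ones((8, 6))                                       # placeholder clean actions


def zero_estimator(o, a, tau):
    """Untrained stand-in that always predicts zero noise."""
    return np.zeros_like(a)


loss = ddpm_loss(zero_estimator, obs, a0, alpha_bar, rng)
```

In a real training loop the stand-in would be replaced by the residual network of Fig. 1 and the loss would be minimized with a gradient-based optimizer.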

III-B2 Denoising Process

In contrast to the diffusion process, the denoising process reconstructs data from noise in reverse, as illustrated by the linen-colored block in the framework overview figure. Leveraging the previously trained noise estimator $\hat{\bm{\epsilon}}(\cdot)$, the model progressively removes the noise from a random sample $\bm{a}_T \sim \mathcal{N}(\bm{0}, \bm{I})$, following the denoising rule:

$$\sigma_\tau = \sqrt{\beta_\tau}, \tag{4}$$
$$\bar{\alpha}_\tau = \prod_{i=1}^{\tau} \alpha_i, \tag{5}$$
$$\bm{a}_{\tau-1} = \frac{1}{\sqrt{\alpha_\tau}} \left[ \bm{a}_\tau - \frac{1-\alpha_\tau}{\sqrt{1-\bar{\alpha}_\tau}}\, \hat{\bm{\epsilon}}_\tau(\bm{o}, \bm{a}_\tau, \tau) \right] + \sigma_\tau \bm{\epsilon}_\tau, \tag{6}$$

where the variance schedule parameters $\bar{\alpha}_\tau$ and $\sigma_\tau$ modulate the noise subtracted in each step. After $T$ iterations (the diffusion horizon), we obtain a probabilistically reconstructed action $\bm{a}_0$. An illustrative example is provided in Sec. IV-B3.
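The denoising rule of Eqs. (4) to (6) can be sketched as a short sampling loop. This is an illustration under stated assumptions, not the released implementation: `noise_estimator` is a placeholder for the trained network, the schedule is assumed linear, and the fresh noise is dropped at the last step, which is a common DDPM convention that Eq. (6) itself does not state.

```python
import numpy as np


def ddpm_sample(noise_estimator, obs, betas, rng, action_dim=6):
    """Reverse process: start from Gaussian noise a_T and iterate Eq. (6) down to a_0."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)        # Eq. (5)
    sigmas = np.sqrt(betas)               # Eq. (4)
    T = betas.shape[0]
    a = rng.standard_normal(action_dim)   # a_T ~ N(0, I)
    for tau in range(T, 0, -1):
        i = tau - 1                       # zero-based index of diffusion step tau
        eps_hat = noise_estimator(obs, a, tau)
        mean = (a - (1.0 - alphas[i]) / np.sqrt(1.0 - alpha_bar[i]) * eps_hat) / np.sqrt(alphas[i])
        # Assumption: no fresh noise when producing a_0 (tau == 1)
        noise = rng.standard_normal(action_dim) if tau > 1 else np.zeros(action_dim)
        a = mean + sigmas[i] * noise      # Eq. (6)
    return a


def zero_estimator(o, a, tau):
    """Untrained stand-in that always predicts zero noise."""
    return np.zeros_like(a)


betas = np.linspace(1e-4, 1e-2, 50)       # assumed linear schedule, T = 50
obs = np.zeros(36)                        # placeholder [o_curr, o_prev]
action = ddpm_sample(zero_estimator, obs, betas, np.random.default_rng(0))
```

Each call runs the estimator T times, which is why the network width directly limits the achievable inference frequency.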

III-C Impedance Control with Feed-forward Force

Consider a torque-controlled robot with $n$ degrees of freedom. Its second-order rigid-body dynamics are written as:

$$\bm{M}(\bm{q})\ddot{\bm{q}} + \bm{C}(\bm{q},\dot{\bm{q}})\dot{\bm{q}} + \bm{g}(\bm{q}) = \bm{\tau}_m + \bm{\tau}_{ext}, \tag{7}$$

where $\bm{q} \in \mathbb{R}^{n}$ is the joint state. $\bm{M}(\bm{q}) \in \mathbb{R}^{n \times n}$ corresponds to the mass matrix, $\bm{C}(\bm{q},\dot{\bm{q}}) \in \mathbb{R}^{n \times n}$ is the Coriolis matrix and $\bm{g}(\bm{q}) \in \mathbb{R}^{n}$ is the gravity vector. The motor torque (control input) and external torque are denoted by $\bm{\tau}_m \in \mathbb{R}^{n}$ and $\bm{\tau}_{ext} \in \mathbb{R}^{n}$, respectively. The impedance control law with feed-forward force profile is defined as [44]:

$$\bm{\tau}_m(t) = \bm{J}(\bm{q})^{\mathsf{T}} \left[ \bm{F}_{ff}(t) + \bm{K}(t)\bm{e} + \bm{D}\dot{\bm{e}} + \bm{M}(\bm{q})\ddot{\bm{x}}_d \right] + \bm{C}(\bm{q},\dot{\bm{q}})\dot{\bm{q}} + \bm{g}(\bm{q}), \tag{8}$$

where $\bm{F}_{ff}(t)$ denotes the feed-forward wrench, $\bm{x}_d$ is the desired trajectory, and $\bm{x}$ indicates the robot’s current position. $\bm{e} = \bm{x}_d - \bm{x}$ and $\dot{\bm{e}} = \dot{\bm{x}}_d - \dot{\bm{x}}$ are the position and velocity errors, respectively. $\bm{K}(t)$ and $\bm{D}$ are the stiffness and damping matrices in Cartesian space. $\bm{J}(\bm{q})$ represents the robot Jacobian matrix. The internal wrench $\bm{F}_{in}$ applied by the robot on objects is calculated with:

$$\bm{J}_{binv} = \bm{J}_{body}^{\dagger}, \tag{9}$$
$$\bm{F}_{in} = \bm{J}_{binv}^{\mathsf{T}} \left( \bm{\tau}_m - \bm{C}(\bm{q},\dot{\bm{q}})\dot{\bm{q}} - \bm{g}(\bm{q}) \right), \tag{10}$$

where $\bm{J}_{binv}$ represents the pseudo-inverse of the body Jacobian $\bm{J}_{body}$, which relates joint velocities to the End-Effector (EE) twist expressed in the body frame (a frame at the EE).
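A small numerical sketch can show how Eqs. (8) and (10) fit together. It assumes a static goal, so that the desired velocity and acceleration vanish and the inertial feed-forward term of Eq. (8) drops out, and it uses the same 6 x n Jacobian in both equations. All numeric values (Jacobian, gains, errors) are random or made-up placeholders, not robot data.

```python
import numpy as np


def impedance_torque(J, F_ff, K, D, e, e_dot, coriolis, gravity):
    """Eq. (8) for a static goal (desired acceleration zero, so the inertial term vanishes)."""
    wrench = F_ff + K @ e + D @ e_dot     # commanded Cartesian wrench
    return J.T @ wrench + coriolis + gravity


def internal_wrench(J_body, tau_m, coriolis, gravity):
    """Eqs. (9) and (10): map compensated motor torque back to an EE wrench."""
    J_binv = np.linalg.pinv(J_body)       # Eq. (9), shape (n, 6)
    return J_binv.T @ (tau_m - coriolis - gravity)


rng = np.random.default_rng(0)
n = 7                                     # e.g. a 7-DoF arm
J = rng.standard_normal((6, n))           # placeholder Jacobian (full row rank)
K = np.diag([800.0] * 3 + [40.0] * 3)     # hypothetical stiffness gains
D = np.diag([40.0] * 3 + [4.0] * 3)       # hypothetical damping gains
e, e_dot = 1e-3 * rng.standard_normal(6), np.zeros(6)
F_ff = np.array([0.0, 0.0, -5.0, 0.0, 0.0, 0.0])   # hypothetical feed-forward wrench
coriolis, gravity = rng.standard_normal(n), rng.standard_normal(n)

tau_m = impedance_torque(J, F_ff, K, D, e, e_dot, coriolis, gravity)
F_in = internal_wrench(J, tau_m, coriolis, gravity)
```

Under these assumptions the recovered internal wrench equals the commanded Cartesian wrench whenever the Jacobian has full row rank, which is a quick sanity check on the two formulas.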

III-D Dynamic System based Filter

To solve the frequency misalignment between the diffusion model and the impedance controller with feed-forward force, we interpolate the diffusion model’s output $\bm{F}_{df}$ with a dynamic system-based filter, according to the equation:

$$\ddot{\bm{F}}_{ff} = \alpha\left(\beta(\bm{F}_{df} - \bm{F}_{ff}) - \dot{\bm{F}}_{ff}\right), \tag{11}$$

where $\bm{F}_{df}$ refers to the raw output of the diffusion model and $\bm{F}_{ff}$ indicates the filtered and interpolated 1000 Hz feed-forward force to be executed by the controller. The first and second derivatives of $\bm{F}_{ff}$ are initialized as zero vectors. $\alpha$ and $\beta$ are two constant scaling factors. (Footnote 2: In this work, $\alpha$ and $\beta$ are fixed as 0.9 and 0.3, respectively, based on several trials that demonstrated their effectiveness.)
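The filter in Eq. (11) can be sketched as a small stateful class that is stepped once per control tick while the latest diffusion output is held constant between inferences. The paper does not state the integration scheme or step size, so the semi-implicit Euler update, the step `dt` (set to 1.0 per tick here), the zero initial value of the filtered wrench, and the hold length of 7 ticks are all assumptions for illustration.

```python
import numpy as np


class WrenchFilter:
    """Second-order filter of Eq. (11), integrated with semi-implicit Euler (assumed scheme)."""

    def __init__(self, alpha=0.9, beta=0.3, dim=6):
        self.alpha, self.beta = alpha, beta   # values reported in the paper's footnote
        self.F_ff = np.zeros(dim)             # assumption: filtered wrench starts at zero
        self.dF_ff = np.zeros(dim)            # first derivative initialized to zero

    def step(self, F_df, dt):
        """Advance one control tick toward the held diffusion output F_df."""
        ddF = self.alpha * (self.beta * (F_df - self.F_ff) - self.dF_ff)  # Eq. (11)
        self.dF_ff = self.dF_ff + dt * ddF
        self.F_ff = self.F_ff + dt * self.dF_ff
        return self.F_ff.copy()


target = np.array([0.0, 0.0, 5.0, 0.0, 0.0, 0.0])  # hypothetical diffusion output
predictions = [target] * 30                         # 30 low-rate predictions
ticks_per_prediction = 7                            # roughly 1000 Hz / 141.8 Hz
wrench_filter = WrenchFilter()
commands = []
for F_df in predictions:
    for _ in range(ticks_per_prediction):           # hold F_df between inferences
        commands.append(wrench_filter.step(F_df, dt=1.0))
commands = np.stack(commands)                       # one filtered wrench per control tick
```

With a constant input the filtered wrench settles on the held prediction, and a change of prediction produces a smooth transition instead of a step in the commanded force.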

IV Experiment

To evaluate our proposed method, we designed a set of experiments to: (i) demonstrate the effectiveness of our proposed framework and validate its capability to generalize to novel tasks, (ii) provide a practical guideline for balancing inference ability and speed by evaluating the performance of models of varying sizes, and (iii) showcase the feasibility of our dynamic system-based filter for mitigating the frequency misalignment between the diffusion model and the real-time controller.

Figure 2: Experiment Setup. The object grasped by the robot in the left figure is the training object: (a) Cuboid: A 35 mm × 25 mm × 60 mm dimensional cuboid (0.1 mm clearance). The four objects on the right are applied to validate the transferability: (b) Key: A 37 mm long key; (c) Cyl-S: A 50 mm long cylinder with a diameter of 20 mm (0.02 mm clearance); (d) Cyl-L: A cylinder with a length of 50 mm and diameter of 30 mm (0.025 mm clearance); (e) Prism: A 50 mm long octagonal prism with a side length of 11 mm (0.05 mm clearance).

IV-A Experiment Setup

The experiment setup shown in Fig. 2 consists of a Franka Emika Panda robot and 5 tight-clearance insertion objects. The robot is controlled by a PC running Ubuntu 20.04 with an Intel i9-10900K CPU and a real-time kernel, and the diffusion module is implemented in PyTorch. Training and inference are performed on another PC with an NVIDIA RTX 3090 GPU and the CUDA Toolkit.

IV-B Data Collection & Training

Figure 3: An example view of observations in the dataset.
(a) training loss
(b) validation loss
Figure 4: Training loss and validation loss. Validation is conducted every 5 epochs throughout the training process.

IV-B1 Data Collection

To train the diffusion model, we collect a comprehensive dataset comprising 1500 expert demonstrations of the assembly task, using the setup shown in Fig. 2. Demonstrations are generated by executing our previous method [31] to perform the insertion task (Cuboid) from various initial poses. The data is recorded at 1000 Hz, resulting in a 24-dimensional sequence, i.e., an 18-dimensional observation $\bm{o}$, which includes the external wrench, internal wrench, and EE’s speed (Fig. 3), paired with the corresponding 6-dimensional action $\bm{F}_{ff}$.
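One possible way to turn such a recorded sequence into training pairs is sketched below. The column order (18 observation channels followed by 6 action channels) and the choice of the immediately preceding sample as the "previous observation" are assumptions made for illustration; the paper does not specify the storage layout or the spacing between the two observations.

```python
import numpy as np


def make_training_pairs(sequence):
    """Split an (N, 24) log into conditioning inputs [o_curr, o_prev] and target actions.

    Assumed layout: columns 0-17 are the observation, columns 18-23 the action F_ff.
    """
    obs, act = sequence[:, :18], sequence[:, 18:]
    cond = np.concatenate([obs[1:], obs[:-1]], axis=1)  # current, then previous observation
    return cond, act[1:]                                # action aligned with o_curr


sequence = np.arange(5 * 24, dtype=float).reshape(5, 24)  # placeholder log, 5 samples
cond, target = make_training_pairs(sequence)
```

The first logged sample has no predecessor, so a sequence of N samples yields N - 1 training pairs.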

TABLE I: Hyperparameters for Training Diffusion Models
| Hyperparameter | Value |
| --- | --- |
| Epochs | 1500 |
| Batch Size | 4096 |
| Learning Rate | $10^{-3}$ |
| Diffusion Horizon ($T$) | 50 |
| Diffusion Weight ($\beta_\tau$) | increased from $10^{-4}$ to $10^{-2}$ |

IV-B2 Training

Selecting the optimal model involves a trade-off. Larger models offer stronger inference capabilities, but smaller models provide faster inference speeds that are better suited to our controller. Therefore, an appropriate size is crucial for balancing performance and real-time control requirements, especially in our scenario where computational efficiency is critical.

TABLE II: Details of Four Diffusion Models
| Model | Neurons ($N$) | Final Loss | Inference Frequency |
| --- | --- | --- | --- |
| $DF_1$ | 128 | 0.2751 | 503.8 Hz |
| $DF_2$ | 256 | 0.1653 | 297.5 Hz |
| $DF_3$ | 512 | 0.0716 | 141.8 Hz |
| $DF_4$ | 1024 | 0.0288 | 51.2 Hz |

To address this problem, we train diffusion models with varying neuron numbers $N$ (highlighted in red in Fig. 1) to provide a practical guideline. 80% of the data is used as training data. Hyperparameters employed in this process are detailed in Table I. Moreover, all trained models were exported to the ONNX format to optimize the inference speed. Table II provides the details of each model. In addition, as shown by the corresponding learning curve in Fig. 4(a), all the candidate models successfully converge within 1,000,000 iteration steps. As the model size increases, there is a clear improvement in accuracy on the training dataset, evidenced by the decreasing final loss. However, larger models also require more computational resources, leading to an evident frequency drop from 503.8 Hz to 51.2 Hz.
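To relate these inference frequencies to the 1000 Hz control loop, the short calculation below converts each value from Table II into the number of control ticks that elapse per prediction. The dictionary keys are shorthand labels for the four models; nothing beyond the reported frequencies is assumed.

```python
control_hz = 1000.0  # real-time control loop frequency
inference_hz = {"DF1": 503.8, "DF2": 297.5, "DF3": 141.8, "DF4": 51.2}  # Table II

# Control ticks that pass while one diffusion prediction is being computed
ticks_per_prediction = {name: control_hz / hz for name, hz in inference_hz.items()}

for name, ticks in ticks_per_prediction.items():
    print(f"{name}: one new prediction every {ticks:.1f} control ticks")
```

The smallest model refreshes its output roughly every 2 ticks, whereas the largest leaves the controller without a new prediction for about 19.5 ticks, which is the gap that a filter between the policy and the controller has to bridge.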

Figure 5: Denoising process with model $DF_3$. From the top down, the red curves indicate the change in the diffused actions during the denoising process. The black curves show the corresponding ground truth.

IV-B3 Validation

The remaining 20% of the data is used for validation. The validation losses in Fig. 4(b) indicate that the models have converged without overfitting. Fig. 5 provides an intuitive instance of the denoising process, where the diffusion model reconstructs actions by progressively removing noise from a random Gaussian sample ($\tau = 50$). After 25 backward diffusion steps ($\tau = 25$), the model’s output already tends towards the ground truth. By the final step ($\tau = 1$), the model’s prediction closely matches the ground truth.
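The backward pass described above can be illustrated with a minimal, self-contained sketch of the standard DDPM reverse update [38]. Several parts are assumptions made only for illustration: the linear spacing of $\beta_{\tau}$ between $10^{-4}$ and $10^{-2}$, the choice $\sigma_{\tau} = \sqrt{\beta_{\tau}}$, the made-up 6D target, and in particular `oracle_noise`, a hypothetical stand-in that computes the noise analytically from the known target instead of querying a trained network. It is not the authors' implementation.

```python
import math
import random

NUM_STEPS = 50  # tau runs from 50 down to 1, as in the denoising example

# Assumption: beta_tau is spaced linearly between its two endpoint values.
BETAS = [1e-4 + (1e-2 - 1e-4) * i / (NUM_STEPS - 1) for i in range(NUM_STEPS)]
ALPHAS = [1.0 - b for b in BETAS]
ALPHA_BARS = []
_running = 1.0
for _alpha in ALPHAS:
    _running *= _alpha
    ALPHA_BARS.append(_running)  # cumulative product of alphas


def oracle_noise(x_tau, tau, target):
    """Hypothetical stand-in for the trained noise-prediction network."""
    ab = ALPHA_BARS[tau - 1]
    return [(x - math.sqrt(ab) * g) / math.sqrt(1.0 - ab)
            for x, g in zip(x_tau, target)]


def denoise(target, rng):
    """Run the DDPM reverse update from a Gaussian sample down to tau = 1."""
    x = [rng.gauss(0.0, 1.0) for _ in target]  # random Gaussian start
    for tau in range(NUM_STEPS, 0, -1):
        beta = BETAS[tau - 1]
        alpha = ALPHAS[tau - 1]
        ab = ALPHA_BARS[tau - 1]
        eps = oracle_noise(x, tau, target)
        # Posterior mean of the reverse step.
        mean = [(xi - beta / math.sqrt(1.0 - ab) * ei) / math.sqrt(alpha)
                for xi, ei in zip(x, eps)]
        # Assumption: sigma_tau = sqrt(beta_tau); no noise on the final step.
        sigma = math.sqrt(beta) if tau > 1 else 0.0
        x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return x


ground_truth = [1.0, -2.0, 0.5, 0.0, 0.1, -0.3]  # made-up 6D wrench
sample = denoise(ground_truth, random.Random(0))
```

Because the stand-in predictor knows the target, the final noise-free step lands on the ground truth up to floating-point error. A trained network only approximates this, which is consistent with the prediction closely matching, rather than exactly equalling, the ground truth at $\tau = 1$.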

It is noteworthy that the diffusion model successfully inherits the self-adaptability of our previous method, selecting appropriate primitives based on the assembly state. The model performs a wiggle motion to align the object with the hole before 0.9 s, and again to resolve a stuck state from 1.2 s to 4.2 s. When the object is properly aligned, it applies a force to push the object into the insertion hole.

IV-C Real-World Experiment Performance

IV-C1 Performance Test

In this section, we validate the efficacy of our diffusion models using the experimental setup depicted in Fig. 2. Among all demonstrated policies, we select the best-performing one as our baseline. We evaluate not only the performance of the candidates on the training object but also emphasize their zero-shot transferability to four novel objects.

As listed in Table III, a total of 25 test cases are created by combining the models with various objects. For each case, the model is evaluated on the corresponding task with 50 random initial poses. At each pose, the robot performs two insertion trials to account for variability and reduce the influence of random occurrences. The resulting success rates and execution times are reported in Table III and Fig. 6, respectively.

TABLE III: Success Rate [%]
Model | Cuboid (trained) | Key (novel) | Cyl-S (novel) | Cyl-L (novel) | Prism (novel) | Average (novel)
$DF_1$ | 90.0 | $\mathbf{99.0}$ | 86.0 | 85.0 | 40.0 | 77.5
$DF_2$ | 79.0 | 94.0 | 87.0 | 90.0 | 79.0 | 87.5
$DF_3$ | $\mathbf{98.0}$ | $\mathbf{99.0}$ | $\mathbf{97.0}$ | $\mathbf{96.0}$ | 91.0 | $\mathbf{95.7}$
$DF_4$ | 73.0 | 85.0 | 90.0 | 66.0 | 85.0 | 81.5
Baseline | 92.0 | 94.0 | 61.0 | 82.0 | $\mathbf{96.0}$ | 83.3
*The highest success rate for each task is highlighted in bold font. The novel objects are evaluated by zero-shot transfer, and the Average column is taken over the four novel objects. The detailed configuration of models $DF_1$ to $DF_4$ is provided in Table II.
Figure 6: Execution time. The colored bars represent the median execution time for each model, and the black lines denote their 25th and 75th percentiles.

According to the Common Industry Format for Usability Test Reports (ISO/IEC 25062:2006), the “core measure of efficiency” is the ratio of the task completion rate to the mean time per task [45]. We use this ratio as the performance metric for comparing the models. The results, illustrated by the radar plots in Fig. 7, show that $DF_3$ outperforms the baseline on demonstrated tasks in terms of efficiency.
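The efficiency ratio defined above can be written out as a short sketch. The success rates are taken from the Cuboid column of Table III. The mean execution times are hypothetical placeholders, not measurements from Fig. 6, and the function names are introduced here only for illustration.

```python
def efficiency(success_rate_pct, mean_time_s):
    """Core measure of efficiency: completion rate divided by mean time per task."""
    if mean_time_s <= 0:
        raise ValueError("mean_time_s must be positive")
    return success_rate_pct / mean_time_s


def relative_improvement_pct(candidate, baseline):
    """Relative change of `candidate` over `baseline`, in percent."""
    return (candidate - baseline) / baseline * 100.0


# Success rates from Table III (Cuboid); the times are placeholders only.
model_eff = efficiency(success_rate_pct=98.0, mean_time_s=4.0)
baseline_eff = efficiency(success_rate_pct=92.0, mean_time_s=5.0)
gain_pct = relative_improvement_pct(model_eff, baseline_eff)
```

Since the metric divides by time, a model can gain efficiency either by succeeding more often or by finishing faster, so the success rates in Table III alone do not determine the ranking.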

Notably, for novel tasks, all diffusion models achieve over a 10% improvement in efficiency, showcasing excellent zero-shot transferability. Among these models, $DF_3$ stands out with the best comprehensive performance on novel tasks, achieving an average success rate of 95.7%.

Figure 7: Radar plots for efficiency across different models

IV-C2 Trade-off between model accuracy and inference speed

As the model size increases, the model better captures latent relationships within the data, which is reflected in the increasing overall success rate from $DF_1$ to $DF_3$, as shown in Table III. However, larger models also experience a significant reduction in inference frequency, which exacerbates the misalignment with the 1000 Hz control loop. As listed in Table II, $DF_3$ maintains an acceptable frequency of 141.8 Hz, whereas $DF_4$ drops to only 51.2 Hz. This low output frequency limits the deployment potential of $DF_4$ despite its strong inference capability and results in a significant overall performance drop. Consequently, $DF_3$ (with $N = 512$) is the only model that outperforms the baseline on both demonstrated and novel tasks. It exhibits the most balanced and highest performance across all insertion tasks, achieving a 129.5% improvement in overall performance compared to the baseline.
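To make the misalignment concrete, the following sketch computes how many 1000 Hz control cycles elapse between two consecutive policy outputs, using the inference frequencies from Table II. The function and variable names are illustrative assumptions.

```python
CONTROL_HZ = 1000.0

# Inference frequencies in Hz, taken from Table II.
INFERENCE_HZ = {"DF1": 503.8, "DF2": 297.5, "DF3": 141.8, "DF4": 51.2}


def control_cycles_per_output(inference_hz, control_hz=CONTROL_HZ):
    """Average number of control cycles between two policy outputs."""
    return control_hz / inference_hz


cycles = {name: control_cycles_per_output(hz) for name, hz in INFERENCE_HZ.items()}
for name, value in cycles.items():
    print(f"{name}: {value:.2f} control cycles per policy output")
```

Under these numbers, the controller runs roughly 2 cycles per output for the smallest model and close to 20 for the largest, so the same wrench command is held for about ten times longer before it is refreshed.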

IV-C3 Dynamic system-based filter

Our dynamic system-based filter is designed to address the frequency misalignment issue. To validate its effectiveness, we repeat the experiments of Sec. IV-C1 for the diffusion models with the filter disabled. To distinguish them from the previous models ($DF_x$), the models without the filter are denoted as $DF_{xN}$. For ease of comparison, the results are presented in the same figure. As illustrated in Fig. 8, the models with filter assistance achieve higher success rates in 16 out of 20 scenarios, with three unchanged and one decreasing by 6%. Overall, our dynamic system-based filter mitigates the effects of frequency misalignment, leading to a 9.15% increase in success rates.

Figure 8: Impact of the dynamic system-based filter on the success rate of high-precision assembly tasks

Moreover, we compare the models’ performance on both demonstrated and novel objects, as illustrated in Fig. 10. Including the filter improves performance in both categories. A more concrete example is provided in Fig. 9, which shows the effect of our filter on the diffusion model outputs. The raw diffusion output, depicted by the black curves, exhibits higher variability and fluctuations in the force and torque components. In contrast, the filtered feed-forward force commands, indicated by the red curves, present a smoother profile at 1000 Hz. These results confirm that the filtering process mitigates the frequency misalignment issue.
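The smoothing effect described above can be illustrated with a generic first-order lag that tracks a low-rate target at the control rate. This is only a stand-in: it is not the dynamic system-based filter defined earlier in the paper. The time constant, the target values, and the function name are assumptions chosen for illustration.

```python
def hold_and_filter(targets, inference_hz, control_hz=1000.0, time_constant=0.02):
    """Upsample low-rate targets to the control rate.

    Returns the zero-order-hold staircase and a first-order-lag version of it.
    `time_constant` (seconds) is a hypothetical tuning value.
    """
    dt = 1.0 / control_hz
    # Backward-Euler discretisation of: time_constant * dy/dt = u - y
    gain = dt / (time_constant + dt)
    n_ticks = int(len(targets) * control_hz / inference_hz)
    held, filtered = [], []
    y = targets[0]
    for k in range(n_ticks):
        # Index of the most recent policy output available at tick k.
        idx = min(int(k * dt * inference_hz), len(targets) - 1)
        u = targets[idx]
        y += gain * (u - y)
        held.append(u)
        filtered.append(y)
    return held, filtered


def max_jump(signal):
    """Largest change between two consecutive samples."""
    return max(abs(b - a) for a, b in zip(signal, signal[1:]))


raw_force = [0.0, 5.0, 2.0, 6.0]  # made-up force targets in newtons
held, filtered = hold_and_filter(raw_force, inference_hz=141.8)
```

With these settings, the held signal jumps by several newtons whenever a new prediction arrives, while the filtered command changes by only a fraction of a newton per control cycle. The trade-off is added lag, which grows with `time_constant`.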

Figure 9: Impact of the filter on diffusion model’s predictions. The red curves represent the filtered feed-forward wrench, while the black curves correspond to the raw outputs from the diffusion model $DF_3$.

Figure 10: Model efficiency sorted in descending order. (a) demonstrated object; (b) novel objects.

V Conclusion

In this work, we present a novel framework leveraging diffusion models to generate 6D wrench for tactile manipulation in high-precision robotic assembly tasks. Our approach, the first force-domain diffusion policy, demonstrates improved zero-shot transferability compared to prior work, achieving an overall 95.7% success rate in zero-shot transfer in experimental evaluations. Additionally, we investigate the trade-off between accuracy and inference speed and provide a practical guideline for optimal model selection. Further, we address the frequency misalignment between the diffusion policy and the real-time control loop with a dynamic system-based filter, significantly improving the task success rate by 9.15%. The extensive experimental studies in our work underscore the effectiveness of our framework in real-world settings, showcasing a promising approach to high-precision tactile manipulation that learns diffusion-based transferable skills from expert policies containing primitive-switching logic. In future work, we will focus on extending the framework’s applicability to a broader range of high-precision assembly tasks and integrating additional sensing modalities to enhance system adaptability and robustness in real-time environments.

References

  • [1] D. E. Whitney, Mechanical assemblies: their design, manufacture, and role in product development.   Oxford University Press, New York, 2004, vol. 1.
  • [2] K. Nottensteiner, A. Sachtler, and A. Albu-Schäffer, “Towards Autonomous Robotic Assembly: Using Combined Visual and Tactile Sensing for Adaptive Task Execution,” Journal of Intelligent & Robotic Systems, vol. 101, no. 3, p. 49, Mar. 2021.
  • [3] R. S. Johansson and Å. B. Vallbo, “Tactile sensory coding in the glabrous skin of the human hand,” Trends in neurosciences, vol. 6, pp. 27–32, 1983.
  • [4] I. Birznieks, P. Jenmalm, A. W. Goodwin, and R. S. Johansson, “Encoding of direction of fingertip forces by human tactile afferents,” Journal of Neuroscience, vol. 21, no. 20, pp. 8222–8237, 2001.
  • [5] R. Li and H. Qiao, “A survey of methods and strategies for high-precision robotic grasping and assembly tasks—some new trends,” IEEE/ASME Transactions on Mechatronics, vol. 24, no. 6, pp. 2718–2732, 2019.
  • [6] K. Nottensteiner, A. Sachtler, and A. Albu-Schäffer, “Towards autonomous robotic assembly: Using combined visual and tactile sensing for adaptive task execution,” Journal of Intelligent & Robotic Systems, vol. 101, no. 3, p. 49, 2021.
  • [7] I. Hirochika, “Force Feedback in Precise Assembly Tasks,” in AI Memos.   MIT, 1974.
  • [8] H. Chen, G. Zhang, H. Zhang, and T. A. Fuhlbrigge, “Integrated robotic system for high precision assembly in a semi-structured environment,” Assembly Automation, vol. 27, no. 3, pp. 247–252, 2007.
  • [9] H. Chen, J. Wang, G. Zhang, T. Fuhlbrigge, and S. Kock, “High-precision assembly automation based on robot compliance,” The International Journal of Advanced Manufacturing Technology, vol. 45, no. 9, pp. 999–1006, 2009.
  • [10] T. Inoue, G. De Magistris, A. Munawar, T. Yokoya, and R. Tachibana, “Deep reinforcement learning for high precision assembly tasks,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2017, pp. 819–825.
  • [11] L. Johannsmeier, M. Gerchow, and S. Haddadin, “A framework for robot manipulation: Skill formalism, meta learning and adaptive control,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 5844–5850.
  • [12] J. Luo, E. Solowjow, C. Wen, J. A. Ojea, A. M. Agogino, A. Tamar, and P. Abbeel, “Reinforcement learning on variable impedance controller for high-precision robotic assembly,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 3080–3087.
  • [13] C. C. Beltran-Hernandez, D. Petit, I. G. Ramirez-Alpizar, and K. Harada, “Variable compliance control for robotic peg-in-hole assembly: A deep-reinforcement-learning approach,” Applied Sciences, vol. 10, no. 19, p. 6923, 2020.
  • [14] S. Haddadin and E. Shahriari, “Unified force-impedance control,” The International Journal of Robotics Research, p. 02783649241249194, Jul. 2024.
  • [15] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu et al., “PaLM-E: An embodied multimodal language model,” in International Conference on Machine Learning.   PMLR, 2023, pp. 8469–8488.
  • [16] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn et al., “RT-1: Robotics transformer for real-world control at scale,” arXiv preprint arXiv:2212.06817, 2022.
  • [17] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen et al., “RT-2: Vision-language-action models transfer web knowledge to robotic control,” arXiv preprint arXiv:2307.15818, 2023.
  • [18] M. Shridhar, L. Manuelli, and D. Fox, “Perceiver-actor: A multi-task transformer for robotic manipulation,” in Conference on Robot Learning.   PMLR, 2023, pp. 785–799.
  • [19] A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain et al., “Open X-Embodiment: Robotic learning datasets and RT-X models: Open X-Embodiment collaboration,” in 2024 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2024, pp. 6892–6903.
  • [20] A. Goyal, V. Blukis, J. Xu, Y. Guo, Y.-W. Chao, and D. Fox, “RVT-2: Learning Precise Manipulation from Few Demonstrations,” in Robotics: Science and Systems, Jun. 2024.
  • [21] T. Pearce, T. Rashid, A. Kanervisto, D. Bignell, M. Sun, R. Georgescu, S. V. Macua, S. Z. Tan, I. Momennejad, K. Hofmann et al., “Imitating human behaviour with diffusion models,” in Deep Reinforcement Learning Workshop NeurIPS, 2022.
  • [22] J. Carvalho, A. T. Le, M. Baierl, D. Koert, and J. Peters, “Motion planning diffusion: Learning and planning of robot motions with diffusion models,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2023, pp. 1916–1923.
  • [23] I. Kapelyukh, V. Vosylius, and E. Johns, “Dall-e-bot: Introducing web-scale diffusion models to robotics,” IEEE Robotics and Automation Letters, vol. 8, no. 7, pp. 3956–3963, 2023.
  • [24] U. A. Mishra, S. Xue, Y. Chen, and D. Xu, “Generative skill chaining: Long-horizon skill planning with diffusion models,” in Conference on Robot Learning.   PMLR, 2023, pp. 2905–2925.
  • [25] K. Black, M. Nakamoto, P. Atreya, H. Walke, C. Finn, A. Kumar, and S. Levine, “Zero-shot robotic manipulation with pretrained image-editing diffusion models,” arXiv preprint arXiv:2310.10639, 2023.
  • [26] C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” Robotics: Science and Systems, 2023.
  • [27] M. Reuss, M. Li, X. Jia, and R. Lioutikov, “Goal-conditioned imitation learning using score-based diffusion policies,” Robotics: Science and Systems, 2023.
  • [28] U. A. Mishra and Y. Chen, “Reorientdiff: Diffusion model based reorientation for object manipulation,” in 2024 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2024, pp. 10 867–10 873.
  • [29] A. Sridhar, D. Shah, C. Glossop, and S. Levine, “Nomad: Goal masked diffusion policies for navigation and exploration,” in 2024 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2024, pp. 63–70.
  • [30] P. Li, Z. Li, H. Zhang, and J. Bian, “On the generalization properties of diffusion models,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [31] Y. Wu, F. Wu, L. Chen, K. Chen, S. Schneider, L. Johannsmeier, Z. Bing, F. J. Abu-Dakka, A. Knoll, and S. Haddadin, “1 kHz behavior tree for self-adaptable tactile insertion,” in 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 16 002–16 008.
  • [32] O. Spector and D. Di Castro, “Insertionnet-a scalable solution for insertion,” IEEE Robotics and Automation Letters, vol. 6, no. 3, pp. 5509–5516, 2021.
  • [33] O. Spector, V. Tchuiev, and D. Di Castro, “Insertionnet 2.0: Minimal contact multi-step insertion using multimodal multiview sensory input,” in 2022 International Conference on Robotics and Automation (ICRA).   IEEE, 2022, pp. 6330–6336.
  • [34] G. Schoettler, A. Nair, J. A. Ojea, S. Levine, and E. Solowjow, “Meta-reinforcement learning for robotic industrial insertion tasks,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2020, pp. 9728–9735.
  • [35] T. Z. Zhao, J. Luo, O. Sushkov, R. Pevceviciute, N. Heess, J. Scholz, S. Schaal, and S. Levine, “Offline meta-reinforcement learning for industrial insertion,” in 2022 international conference on robotics and automation (ICRA).   IEEE, 2022, pp. 6386–6393.
  • [36] B. Tang, M. A. Lin, I. Akinola, A. Handa, G. S. Sukhatme, F. Ramos, D. Fox, and Y. Narang, “Industreal: Transferring contact-rich assembly tasks from simulation to reality,” arXiv preprint arXiv:2305.17110, 2023.
  • [37] B. Tang, I. Akinola, J. Xu, B. Wen, A. Handa, K. Van Wyk, D. Fox, G. S. Sukhatme, F. Ramos, and Y. Narang, “Automate: Specialist and generalist assembly policies over diverse geometries,” arXiv preprint arXiv:2407.08028, 2024.
  • [38] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.
  • [39] M. Kollovieh, A. F. Ansari, M. Bohlke-Schneider, J. Zschiegner, H. Wang, and Y. B. Wang, “Predict, refine, synthesize: Self-guiding diffusion models for probabilistic time series forecasting,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [40] A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilistic models,” in International conference on machine learning.   PMLR, 2021, pp. 8162–8171.
  • [41] Y. Yang, M. Jin, H. Wen, C. Zhang, Y. Liang, L. Ma, Y. Wang, C. Liu, B. Yang, Z. Xu et al., “A survey on diffusion models for time series and spatio-temporal data,” arXiv preprint arXiv:2404.18886, 2024.
  • [42] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14.   Springer, 2016, pp. 630–645.
  • [43] Y. Li, N. Miao, L. Ma, F. Shuang, and X. Huang, “Transformer for object detection: Review and benchmark,” Engineering Applications of Artificial Intelligence, vol. 126, p. 107021, 2023.
  • [44] C. Yang, G. Ganesh, S. Haddadin, S. Parusel, A. Albu-Schaeffer, and E. Burdet, “Human-like adaptation of force and impedance in stable and unstable interactions,” IEEE transactions on robotics, vol. 27, no. 5, pp. 918–930, 2011.
  • [45] B. Albert and T. Tullis, Measuring the user experience: collecting, analyzing, and presenting usability metrics.   Newnes, 2013.