TacDiffusion: Force-domain Diffusion Policy for Precise Tactile Manipulation

Yansong Wu^*, Zongxie Chen^*, Fan Wu, Lingyun Chen, Liding Zhang,
Zhenshan Bing, Abdalla Swikir, Alois Knoll, Sami Haddadin

The authors are with the Munich Institute of Robotics and Machine Intelligence (MIRMI), Technical University of Munich, Germany. Authors marked with ^* contributed equally to this work. Corresponding author: Fan Wu (f.wu@tum.de). The authors acknowledge the financial support of the Bavarian State Ministry for Economic Affairs, Regional Development and Energy (StMWi) for the Lighthouse Initiative KI.FABRIK (Phase 1: Infrastructure as well as the research and development program, under grant no. DIK0249). Code available: https://github.com/popnut123/TacDiffusion

Abstract

Assembly is a crucial skill for robots in both modern manufacturing and service robotics. However, mastering transferable insertion skills that can handle a variety of high-precision assembly tasks remains a significant challenge. This paper presents a novel framework that utilizes diffusion models to generate 6D wrench for high-precision tactile robotic insertion tasks. It learns from demonstrations performed on a single task and achieves a zero-shot transfer success rate of 95.7% across various novel high-precision tasks. Our method effectively inherits the self-adaptability demonstrated by our previous work. In this framework, we address the frequency misalignment between the diffusion policy and the real-time control loop with a dynamic system-based filter, significantly improving the task success rate by 9.15%. Furthermore, we provide a practical guideline regarding the trade-off between diffusion models’ inference ability and speed.

I Introduction

Assembly tasks are crucial in robotics, serving as the backbone of modern manufacturing and service applications [1]. As the demand for flexible manufacturing grows, robotic assembly increasingly takes place in dynamic environments, where objects are not precisely positioned at known locations and part holders are often not viable [2]. Achieving both broad transferability and precise control capabilities in these conditions remains a significant challenge. Human workers, on the other hand, demonstrate exceptional dexterity in assembling diverse objects with tight-clearance components, primarily by leveraging tactile feedback from their fingertips throughout the process [3, 4]. Similarly, a versatile high-precision robotic assembly system must exhibit both task-level transferability—generalizing across a wide range of objects and parts—and control-level self-adaptability, enabling it to respond to environmental changes often sensed through tactile feedback [5, 6].

Throughout the history of robotics research, the importance of tactile feedback and force control for high-precision assembly has been consistently recognized [7, 8, 9, 10, 11, 12, 5, 13]. However, several challenges persist in precise force control: the difficulty of accessing appropriate robot hardware and expensive force sensors, the complexity of ensuring stability and safety while regulating force, the sensitivity of force control to environmental changes, the difficulty of estimating environment constraints and contact dynamics in dynamic settings, and the challenge of collecting high-quality tactile data for learning force control. Due to these barriers, the robot learning community often favors simpler motion-domain action spaces, with impedance control as an indirect force control method. Nevertheless, the increasing diversity of contact-rich manipulation tasks highlights the equal importance of simultaneously regulating motion, compliance, and force, so that agents can autonomously perform a wide range of tasks stably and robustly, without the need for explicit controller switching [14]. Despite the recent successes of transformer-based [15, 16, 17, 18, 19, 20] and diffusion-based [21, 22, 23, 24, 25, 26, 27, 28, 29, 30] policies for robot manipulation, which exhibit excellent generalization capability, it remains unexplored how to integrate force control with these models for high-precision tactile manipulation, so that the benefits of these generative models for multi-modal modelling and prediction can be fully exploited.

To address this gap, and aiming to achieve both task-level transferability and control-level self-adaptability, we propose TacDiffusion, a novel framework that leverages a diffusion policy for high-precision tactile manipulation. To the authors’ knowledge, it is the first framework to employ diffusion models in generating force-domain actions for tactile-based robotic manipulation in tight-clearance insertion tasks. TacDiffusion learns from demonstrations performed by expert policies on a single task and achieves an overall 95.7% zero-shot transfer success rate across various novel high-precision, sub-millimeter-level peg-in-hole tasks. By imitating the expert policies, which are based on a behavior tree-based skill proposed in our previous work [31], TacDiffusion successfully inherits its self-adaptability, characterized by the ability to switch skill primitives based on real-time tactile sensing. Importantly, compared to the expert policy, TacDiffusion also outperforms in execution time and robustness on these novel tasks in a zero-shot transfer manner.

To further enhance real-time performance, we investigate how model size affects the trade-off between accuracy and inference speed, providing practical guidelines for optimal model selection. Moreover, to handle the frequency misalignment between the diffusion policy’s inference process and the low-level controller, a dynamic system-based filter is designed to smooth the output of the diffusion model for high-frequency force-impedance control, significantly improving the task success rate by 9.15%.

In summary, our main contributions are: (i) a novel diffusion-based policy that outputs a 6D wrench for tactile manipulation; (ii) learning from a behavior tree-based expert policy to inherit its tactile-based self-adaptability; (iii) a dynamic system-based filter that smooths the low-frequency outputs of the diffusion model and aligns them with high-frequency control, with experimental evidence showing a significant effect on task performance; (iv) an investigation of the trade-off between accuracy and inference speed, resulting in insights for optimal model selection in practice.

II Related Works

In this section, we focus our review on (i) High-Precision Assembly Tasks, (ii) Transferability and (iii) Diffusion model in robotics.

II-A High-Precision Assembly Tasks

Due to the robot’s accuracy limitation, position-based control methods are insufficient for high-precision assembly tasks that require accuracy exceeding the robot’s precision [10]. To address this issue, recent studies have shifted to designing actions in the force domain rather than the position domain to perform high-precision robotic assembly tasks. According to the control strategy, these methods fall into four main categories: force controller [9], admittance controller [8], hybrid position/force controller [10, 13], and impedance controller with feed-forward force [11, 12]. Nevertheless, these works normally focus on a specific tight-clearance task and lack investigation into the method’s transferability and adaptability to novel tasks [5].

II-B Enhancing Transferability in Robotic Assembly

Over the last decade, an extensive literature has emerged on generating robotic assembly policies with broad generalization. Deep Reinforcement Learning-based methods, for instance, typically achieve generalization by training with multiple objects [32, 33]. Another noteworthy case is meta-learning, which pre-trains a model using online or offline data from a diverse and comprehensive set of tasks, enabling domain adaptation through fine-tuning [34, 35]. Furthermore, sim-to-real approaches have gained attention for their cost-effective data collection in simulation, and zero-shot sim-to-real transfer for perception-initialized assembly has only recently been demonstrated [36, 37]. Besides, to tackle precise manipulation, RVT-2 [20] trained a transformer-based multi-task policy. Despite improving performance on a multi-task learning benchmark, its success rate on high-precision (millimeter-level) insertion tasks, roughly 50%, is far from sufficient for deployment in real assembly production. Aside from these approaches, evolutionary algorithms with parameterized robot skills have shown transferability across tasks via fine-tuning [31]. However, achieving zero-shot transfer on high-precision tasks with a satisfactory success rate in the real world remains an open challenge.

II-C Diffusion Model in Robotics

Meanwhile, in other areas of robotics, diffusion models [38] have made significant progress. Compared to traditional discriminative models, diffusion models excel in generalization, achieving superior performance on unseen tasks and scenarios, by establishing a stochastic transport map between an empirically observed target distribution and a known prior [30]. Recent works have typically used scene images as input to solve planning problems [23, 24, 29] and perform manipulation tasks [25, 26, 27, 28] in robotics. However, the application of diffusion models with other input modalities remains relatively underexplored in robotics, with only a few studies addressing this area [22]. In addition, considering the applications of diffusion models in sequential behavior imitation [21] and time series processing [39], there is great potential for adapting diffusion models to force-domain actions in robotics.

In summary, although significant progress has been made in insertion tasks, achieving zero-shot transfer in high-precision assembly tasks remains an ongoing challenge. Additionally, the application of diffusion models to force-domain actions has not yet been explored. To bridge these gaps, we propose a novel framework that leverages diffusion models to enable more efficient zero-shot transfer in high-precision insertion tasks.

III Methods

To solve the aforementioned issues, we develop a framework that adapts the diffusion model to force-domain actions for high-precision tactile assembly tasks. In the following subsections, we first provide an overview of the framework, followed by a detailed explanation of the concrete modules, i.e., the diffusion model, the impedance control with feed-forward force, and the dynamic system-based filter.

III-A Framework Overview

Our framework comprises two key functional modules: the diffusion policy-based action generation module and the impedance control with feed-forward-based execution module. (A practical consideration here is the compatibility issue between the real-time kernel and the NVIDIA CUDA Toolkit.) As illustrated in the framework overview figure, the diffusion-based policy is integrated into the behavior tree (BT) based Insertion skill by replacing the original sub-tree, which contained two primitives and a state estimator. The resultant behavior tree is simplified into a sequence of skill primitives, with “approach” and “contact” as two preceding primitives. As the BT is simplified into a sequence and the diffusion model handles primitive-switching, the discussion of the preceding skill primitives for contact initialization is beyond the scope of this work. For more details, we refer readers to our previous work [31].

During the assembly process, the interaction between the robot and the environment is captured as observation $\bm{o}$, which includes the external wrench, internal wrench, and end-effector’s speed. The diffusion model then predicts the force-domain actions ($\bm{a}:=\bm{F}_{df}$) based on both the current observation $\bm{o}_{curr}$ and the previous observation $\bm{o}_{prev}$. Due to the restrictions of computational resources, the diffusion model’s inference frequency typically ranges from 50 Hz to 500 Hz (Table II), which is misaligned with the robot’s 1000 Hz real-time control loop. To mitigate this, we designed a dynamic system-based filter to interpolate the diffusion model’s predictions $\bm{F}_{df}$. The filtered action is then transmitted to the impedance controller with feed-forward force. Based on the desired goal $\bm{x}_{d}$ (insertion hole’s pose) and the force command, the controller regulates the robot’s motion and force behavior simultaneously.

III-B Diffusion Model

Denoising diffusion probabilistic model (DDPM) [38, 40, 41] is a specific type of diffusion model designed to generate data by learning to reverse a noise injection process. DDPM consists of two processes: diffusion and denoising. The diffusion process systematically transforms the data into noise, while the denoising process is responsible for converting this noise back into data.

Figure 1: Network architecture of the noise estimator.

III-B1 Diffusion Process

The diffusion process is a forward progressive process that destructs data with noise over a series of steps. By progressively injecting noise into a “clean” initial action $\bm{a}_{0}$ , a sequence of “polluted” actions $\bm{a}_{1},\bm{a}_{2},\cdots,\bm{a}_{T}$ converging to a Gaussian distribution is obtained, according to the diffusion rule [21]:

$\alpha_{\tau}=1-\beta_{\tau},$ (1)
$\bm{a}_{\tau}=\sqrt{\alpha_{\tau}}\ \bm{a}_{\tau-1}+\sqrt{\beta_{\tau}}\ \bm{\epsilon}_{\tau},$ (2)

where $\tau\in[1,T]$ denotes the diffusion step, with $T$ referring to the total number of diffusion steps (not to be confused with the environment time step, as is common in time series). $\bm{a}_{\tau}$ and $\bm{\epsilon}_{\tau}\sim\mathcal{N}(\bm{0},\bm{I})$ represent the diffused action and the corresponding noise in the $\tau$-th diffusion step. $\alpha_{\tau}$ and $\beta_{\tau}$ are variance schedule parameters that regulate the noise mixed in at each diffusion step.
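As a minimal sketch of Eqs. (1) and (2), the forward process can be written as a short loop. The helper names (`linear_beta_schedule`, `forward_diffuse`), the linear spacing of $\beta_{\tau}$, and the example wrench are assumptions made for illustration; only $T=50$ and the range $10^{-4}$ to $10^{-2}$ come from Table I.

```python
import numpy as np

def linear_beta_schedule(T=50, beta_start=1e-4, beta_end=1e-2):
    """Variance schedule beta_1..beta_T; the linear spacing is an assumption."""
    return np.linspace(beta_start, beta_end, T)

def forward_diffuse(a0, betas, noises):
    """Apply Eq. (2) step by step and return [a_0, a_1, ..., a_T].

    noises[tau - 1] plays the role of epsilon_tau ~ N(0, I).
    """
    alphas = 1.0 - betas  # Eq. (1)
    actions = [np.asarray(a0, dtype=float)]
    for alpha, beta, eps in zip(alphas, betas, noises):
        # Eq. (2): shrink the previous action and mix in fresh noise
        actions.append(np.sqrt(alpha) * actions[-1] + np.sqrt(beta) * eps)
    return actions

rng = np.random.default_rng(0)
betas = linear_beta_schedule()
a0 = np.array([1.0, -2.0, 5.0, 0.1, 0.0, -0.3])  # made-up 6D wrench
noises = rng.standard_normal((len(betas), a0.size))
trajectory = forward_diffuse(a0, betas, noises)  # a_0 .. a_T
```

Passing the noise samples in explicitly keeps the function deterministic, which makes it easy to check a single step by hand.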

Furthermore, the noise $\bm{\epsilon}_{\tau}$ also plays a crucial role in the subsequent denoising process. To account for this, we construct the noise estimator $\hat{\bm{\epsilon}}(\cdot)$ using a residual neural network [42], as illustrated in Fig. 1, and train it by minimizing the following loss function:

$\bm{\mathcal{L}}_{DDPM}=\mathbb{E}\left[\left\|\hat{\bm{\epsilon}}_{\tau}(\bm{o},\bm{a}_{\tau},\tau)-\bm{\epsilon}_{\tau}\right\|_{2}^{2}\right],$ (3)

where $\bm{o}$ includes both the current and previous observations, as incorporating historical information helps identify trends and enhances the accuracy of predicting future actions. The diffusion step $\tau$ serves as positional information, enabling the network to recognize the current diffusion stage effectively [43].
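The loss in Eq. (3) can be estimated on a batch roughly as follows. This is a sketch under stated assumptions, not the authors' training code: the noisy action is drawn in one shot with the standard DDPM closed form $\bm{a}_{\tau}=\sqrt{\bar{\alpha}_{\tau}}\,\bm{a}_{0}+\sqrt{1-\bar{\alpha}_{\tau}}\,\bm{\epsilon}$, obtained by iterating Eq. (2), whereas the text only states the step-wise rule. The function name `ddpm_loss`, the uniform sampling of $\tau$, the 36-dimensional input (two stacked 18-dimensional observations), and the placeholder estimator are likewise hypothetical.

```python
import numpy as np

def ddpm_loss(eps_hat, obs, a0, betas, rng):
    """Monte Carlo estimate of Eq. (3) on one batch.

    eps_hat(obs, a_tau, tau) -> predicted noise with the same shape as a_tau.
    obs: (B, obs_dim), a0: (B, act_dim), betas: (T,)
    """
    alpha_bar = np.cumprod(1.0 - betas)             # cumulative products of alpha
    batch = a0.shape[0]
    tau = rng.integers(1, len(betas) + 1, size=batch)  # tau in [1, T] (uniform: assumption)
    eps = rng.standard_normal(a0.shape)             # regression target
    ab = alpha_bar[tau - 1][:, None]
    # Closed-form noising (assumption, see lead-in)
    a_tau = np.sqrt(ab) * a0 + np.sqrt(1.0 - ab) * eps
    err = eps_hat(obs, a_tau, tau) - eps
    return float(np.mean(np.sum(err ** 2, axis=1)))  # squared L2 norm, batch mean

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 1e-2, 50)                 # linear spacing is an assumption
obs = rng.standard_normal((8, 36))                  # [o_prev, o_curr] stacked (assumed layout)
a0 = rng.standard_normal((8, 6))                    # made-up clean actions

def zero_estimator(obs, a_tau, tau):
    """Placeholder for the residual network of Fig. 1."""
    return np.zeros_like(a_tau)

loss = ddpm_loss(zero_estimator, obs, a0, betas, rng)
```

In an actual training loop the placeholder would be replaced by the network and the same quantity would be minimized by gradient descent.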

III-B2 Denoising Process

In contrast to the diffusion process, the denoising process reconstructs data from noise in reverse, illustrated by the linen-colored block in the framework overview figure. Leveraging the previously trained noise estimator $\hat{\bm{\epsilon}}(\cdot)$, the model progressively removes the noise from a random sample $\bm{a}_{T}\sim\mathcal{N}(\bm{0},\bm{I})$, following the denoising rule:

$\sigma_{\tau}=\sqrt{\beta_{\tau}},$ (4)
$\bar{\alpha}_{\tau}=\prod_{i=1}^{\tau}\alpha_{i},$ (5)
$\bm{a}_{\tau-1}=\frac{1}{\sqrt{\alpha_{\tau}}}\left[\bm{a}_{\tau}-\frac{1-\alpha_{\tau}}{\sqrt{1-\bar{\alpha}_{\tau}}}\ \hat{\bm{\epsilon}}_{\tau}(\bm{o},\bm{a}_{\tau},\tau)\right]+\sigma_{\tau}\ \bm{\epsilon}_{\tau},$ (6)

where the variance schedule parameters $\bar{\alpha}_{\tau}$ and $\sigma_{\tau}$ modulate the noise subtracted and re-injected in each step. After $T$ iterations (the diffusion horizon), we obtain a probabilistically reconstructed action $\bm{a}_{0}$. An illustrative example is provided in Sec. IV-B3.
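The reverse loop of Eqs. (4) to (6) can be sketched as below. The function name `denoise`, the zero-returning placeholder estimator, the linear $\beta_{\tau}$ spacing, and the 36-dimensional observation vector are assumptions for illustration; a trained noise estimator would take the placeholder's place.

```python
import numpy as np

def denoise(obs, eps_hat, betas, rng, act_dim=6):
    """Run Eqs. (4)-(6) from tau = T down to tau = 1 and return a_0."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)        # Eq. (5)
    sigmas = np.sqrt(betas)               # Eq. (4)
    a = rng.standard_normal(act_dim)      # a_T ~ N(0, I)
    for tau in range(len(betas), 0, -1):
        i = tau - 1                       # 0-based index of diffusion step tau
        eps_pred = eps_hat(obs, a, tau)
        coef = (1.0 - alphas[i]) / np.sqrt(1.0 - alpha_bar[i])
        mean = (a - coef * eps_pred) / np.sqrt(alphas[i])
        a = mean + sigmas[i] * rng.standard_normal(act_dim)  # Eq. (6)
    return a

def zero_estimator(obs, a, tau):
    """Placeholder standing in for the trained network (assumption)."""
    return np.zeros_like(a)

betas = np.linspace(1e-4, 1e-2, 50)       # T = 50 as in Table I; spacing assumed
obs = np.zeros(36)                        # stacked o_prev and o_curr (assumed layout)
action = denoise(obs, zero_estimator, betas, np.random.default_rng(0))
```

The sketch follows Eq. (6) as printed and adds noise at every step, including $\tau=1$; many DDPM implementations omit the noise term on the final step, which would be a one-line change here.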

III-C Impedance Control with Feed-forward Force

Consider a torque-controlled robot with $n$ degrees of freedom. Its second-order rigid body dynamics are written as:

$\bm{M}(\bm{q})\ddot{\bm{q}}+\bm{C}(\bm{q},\dot{\bm{q}})\dot{\bm{q}}+\bm{g}(\bm{q})=\bm{\tau}_{m}+\bm{\tau}_{ext},$ (7)

where $\bm{q}\in\mathbb{R}^{n}$ is the joint state. $\bm{M}(\bm{q})\in\mathbb{R}^{n\times n}$ corresponds to the mass matrix, $\bm{C}(\bm{q},\dot{\bm{q}})\in\mathbb{R}^{n\times n}$ is the Coriolis matrix and $\bm{g}(\bm{q})\in\mathbb{R}^{n}$ is the gravity vector. The motor torque (control input) and external torque are denoted by $\bm{\tau}_{m}\in\mathbb{R}^{n}$ and $\bm{\tau}_{ext}\in\mathbb{R}^{n}$ , respectively. The impedance control law with feed-forward force profile is defined as [44]:

$\bm{\tau}_{m}(t)=\bm{J}(\bm{q})^{\mathsf{T}}[\bm{F}_{ff}(t)+\bm{K}(t)\bm{e}+\bm{D}\dot{\bm{e}}+\bm{M}(\bm{q})\ddot{\bm{x}}_{d}]+\bm{C}(\bm{q},\dot{\bm{q}})\dot{\bm{q}}+\bm{g}(\bm{q}),$ (8)

where $\bm{F}_{ff}(t)$ denotes the feed-forward wrench, $\bm{x}_{d}$ is the desired trajectory, and $\bm{x}$ is the robot’s current pose. $\bm{e}=\bm{x}_{d}-\bm{x}$ and $\dot{\bm{e}}={\dot{\bm{x}}}_{d}-\dot{\bm{x}}$ are the position and velocity errors, respectively. $\bm{K}(t)$ and $\bm{D}$ are the stiffness and damping matrices in Cartesian space. $\bm{J}(\bm{q})$ represents the robot Jacobian matrix. The internal wrench $\bm{F}_{in}$ applied by the robot on objects is calculated with:

$\bm{J}_{binv}=\bm{J}_{body}^{\dagger},$ (9)
$\bm{F}_{in}=\bm{J}_{binv}^{\mathsf{T}}(\bm{\tau}_{m}-\bm{C}(\bm{q},\dot{\bm{q}})\dot{\bm{q}}-\bm{g}(\bm{q})),$ (10)

where $\bm{J}_{binv}$ represents the pseudo-inverse of the body Jacobian $\bm{J}_{body}$ , which relates joint velocities to the End-Effector (EE) twist expressed in the body frame (a frame at the EE).
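A small numerical sketch of Eqs. (8) to (10) is given below. It assumes a static goal pose, so the inertial feed-forward term with $\ddot{\bm{x}}_{d}$ is dropped, and it uses the same Jacobian for the control law and for the internal-wrench computation. The function names, the random $6\times 7$ Jacobian, and all gain values are made up for illustration and are not taken from the paper.

```python
import numpy as np

def impedance_torque(J, F_ff, K, D, e, e_dot, coriolis, gravity):
    """Eq. (8) with the inertial feed-forward term dropped (static goal assumed).

    coriolis is the vector C(q, q_dot) @ q_dot, gravity is g(q), both shape (n,).
    """
    wrench = F_ff + K @ e + D @ e_dot          # commanded Cartesian wrench, shape (6,)
    return J.T @ wrench + coriolis + gravity   # joint torques, shape (n,)

def internal_wrench(J_body, tau_m, coriolis, gravity):
    """Eqs. (9)-(10): map the motor torque back to a wrench at the end-effector."""
    J_binv = np.linalg.pinv(J_body)            # Eq. (9), shape (n, 6)
    return J_binv.T @ (tau_m - coriolis - gravity)  # Eq. (10)

rng = np.random.default_rng(0)
n = 7                                          # 7-DoF arm such as the Panda
J = rng.standard_normal((6, n))                # stand-in Jacobian, not a real robot model
F_ff = np.array([0.0, 0.0, -5.0, 0.0, 0.0, 0.0])
K = np.diag([1000.0] * 3 + [50.0] * 3)         # made-up stiffness
D = np.diag([60.0] * 3 + [5.0] * 3)            # made-up damping
e = rng.standard_normal(6) * 1e-3
e_dot = np.zeros(6)
coriolis = rng.standard_normal(n)
gravity = rng.standard_normal(n)

tau_m = impedance_torque(J, F_ff, K, D, e, e_dot, coriolis, gravity)
F_in = internal_wrench(J, tau_m, coriolis, gravity)
```

When the Jacobian has full row rank, `F_in` recovers the commanded Cartesian wrench `F_ff + K @ e + D @ e_dot`, which is a convenient sanity check for an implementation.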

III-D Dynamic System based Filter

To solve the frequency misalignment between the diffusion model and the impedance controller with feed-forward force, we interpolate the diffusion model’s output $\bm{F}_{df}$ with a dynamic system-based filter, according to the equation:

$\ddot{\bm{F}}_{ff}=\alpha(\beta(\bm{F}_{df}-\bm{F}_{ff})-\dot{\bm{F}}_{ff}),$ (11)

where $\bm{F}_{df}$ refers to the raw output of the diffusion model and $\bm{F}_{ff}$ denotes the filtered and interpolated $1000$ Hz feed-forward force to be executed by the controller. The first and second derivatives of $\bm{F}_{ff}$ are initialized as zero vectors. $\alpha$ and $\beta$ are two constant scaling factors. (In this work, $\alpha$ and $\beta$ are fixed at $0.9$ and $0.3$, respectively, based on several trials that demonstrated their effectiveness.)
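One way to run Eq. (11) inside a 1000 Hz loop is sketched below. The text does not state the integration scheme or step size, so the semi-implicit Euler update, the step `dt = 1.0` (time measured in control ticks), the zero initial value of $\bm{F}_{ff}$, and the zero-order hold of each prediction are all assumptions; `WrenchFilter` and `upsample` are hypothetical names.

```python
import numpy as np

class WrenchFilter:
    """Discretization of Eq. (11) with semi-implicit Euler (assumed scheme)."""

    def __init__(self, alpha=0.9, beta=0.3, dt=1.0, dim=6):
        self.alpha, self.beta, self.dt = alpha, beta, dt
        self.F_ff = np.zeros(dim)    # initial F_ff: assumption (not stated in the text)
        self.dF_ff = np.zeros(dim)   # first derivative starts at zero, as in the text

    def step(self, F_df):
        """Advance one control tick towards the latest diffusion output F_df."""
        ddF = self.alpha * (self.beta * (F_df - self.F_ff) - self.dF_ff)  # Eq. (11)
        self.dF_ff = self.dF_ff + self.dt * ddF
        self.F_ff = self.F_ff + self.dt * self.dF_ff
        return self.F_ff.copy()

def upsample(predictions, ticks_per_prediction, filt):
    """Hold each prediction for several control ticks and filter on every tick."""
    out = []
    for F_df in predictions:
        for _ in range(ticks_per_prediction):
            out.append(filt.step(np.asarray(F_df, dtype=float)))
    return np.array(out)

# A constant push of -10 N along z, held for 7 ticks per prediction
# (about 1000 Hz / 141.8 Hz); the wrench values are made up.
target = np.array([0.0, 0.0, -10.0, 0.0, 0.0, 0.0])
commands = upsample([target] * 20, 7, WrenchFilter())
```

With these settings the command ramps towards the target over a few ticks instead of jumping, and it settles on the target when the diffusion output stays constant.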

IV Experiment

To evaluate our proposed method, we designed a set of experiments to: (i) demonstrate the effectiveness of our proposed framework and validate its capability to generalize to novel tasks, (ii) provide a practical guideline for balancing inference ability and speed by evaluating the performance of models with varying sizes, and (iii) showcase the feasibility of our designed dynamic system-based filter to mitigate the frequency misalignment between diffusion model and real-time controller.

IV-A Experiment Setup

The experiment setup shown in Fig. 2 consists of a Franka Emika Panda robot with 5 tight-clearance insertion objects. The robot is controlled by a PC using Ubuntu 20.04 with Intel i9-10900K CPU and real-time kernel, and the diffusion module is implemented on the PyTorch framework. Training and inference are performed on another PC with NVIDIA RTX 3090 GPU and CUDA Toolkit.

IV-B Data Collection & Training

IV-B1 Data Collection

To train the diffusion model, we collect a comprehensive dataset comprising 1500 expert demonstrations of the assembly task, using the setup shown in Fig. 2. Demonstrations are generated by executing our previous method [31] to perform the insertion task (Cuboid) from various initial poses. The data is recorded at 1000 Hz, resulting in a 24-dimensional sequence, i.e., an 18-dimensional observation $\bm{o}$, which includes the external wrench, internal wrench, and EE’s speed (Fig. 3), paired with the corresponding 6-dimensional action $\bm{F}_{ff}$.
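The recorded layout can be turned into training pairs along the following lines. The column order inside the 24-dimensional sample and the choice of the immediately preceding 1000 Hz sample as the previous observation are assumptions, since the text does not specify them; `make_training_pairs` is a hypothetical helper.

```python
import numpy as np

OBS_DIM, ACT_DIM = 18, 6

def make_training_pairs(recording):
    """Turn one (N, 24) recording into model inputs and targets.

    Assumed column layout: [external wrench (6), internal wrench (6),
    EE speed (6), action F_ff (6)].
    """
    obs, act = recording[:, :OBS_DIM], recording[:, OBS_DIM:]
    x = np.concatenate([obs[:-1], obs[1:]], axis=1)  # [o_prev, o_curr], shape (N-1, 36)
    y = act[1:]                                      # action paired with o_curr
    return x, y

# Tiny synthetic recording with 5 samples, used only to check shapes
recording = np.arange(5 * (OBS_DIM + ACT_DIM), dtype=float).reshape(5, 24)
x, y = make_training_pairs(recording)
```

Each input row stacks two consecutive observations, which matches the use of both the current and the previous observation as conditioning.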

TABLE I: Hyperparameters for Training Diffusion Models

Hyperparameters	Value
Epoch	1500
Batch Size	4096
Learning Rate	$10^{-3}$
Diffusion Horizon ( $T$ )	50
Diffusion Weight ( $\beta_{\tau}$ )	increased from $10^{-4}$ to $10^{-2}$

IV-B2 Training

Selecting the optimal model involves a trade-off. Larger models offer stronger inference capabilities, but smaller models provide faster inference speeds that are better suited to our controller. Therefore, an appropriate size is crucial for balancing performance and real-time control requirements, especially in our scenario where computational efficiency is critical.

TABLE II: Details of four Diffusion Models

Model	Neurons (N)	Final Loss	Inference Frequency
$DF_{1}$	$128$	$0.2751$	$503.8$ Hz
$DF_{2}$	$256$	$0.1653$	$297.5$ Hz
$DF_{3}$	$512$	$0.0716$	$141.8$ Hz
$DF_{4}$	$1024$	$0.0288$	$51.2$ Hz

To address this problem, we train diffusion models with varying neuron numbers $N$ (highlighted in red in Fig. 1) to provide a practical guideline. 80% of the data is used as training data. Hyperparameters employed in this process are detailed in Table I. Moreover, all trained models were exported to the ONNX format to optimize the inference speed. Table II provides the details of each model. In addition, as shown by the corresponding learning curve in Fig. 4(a), all the candidate models successfully converge within 1,000,000 iteration steps. As the model size increases, there is a clear improvement in accuracy on the training dataset, evidenced by the decreasing final loss. However, larger models also require more computational resources, leading to an evident frequency drop from $503.8$ Hz to $51.2$ Hz.

IV-B3 Validation

The remaining 20% of the data is used for validation. The validation losses in Fig. 4(b) imply that models have successfully converged without overfitting. Fig. 5 provides an intuitive instance of the denoising process, where the diffusion model reconstructs actions by progressively removing noise from a random Gaussian sample ( $\tau=50$ ). After 25 backward diffusion steps ( $\tau=25$ ), the model’s output exhibits a tendency towards the ground truth. By the final step ( $\tau=1$ ), the model’s prediction closely matches the ground truth.

It is noteworthy that the diffusion model successfully inherits the self-adaptability of our previous method, selecting appropriate primitives based on the assembly state. The model performs a wiggle motion to align the object with the hole before $0.9$ s, and to resolve a stuck state from $1.2$ s to $4.2$ s. When the object is properly aligned, it applies a force to push the object into the insertion hole.

IV-C Real-World Experiment Performance

IV-C1 Performance Test

In this section, we validate the efficacy of our diffusion models using the experimental setup depicted in Fig. 2. Among all demonstrated policies, we select the best-performing one as our baseline. We evaluate not only the performance of the candidates on the training object but also emphasize their zero-shot transferability to four novel objects.

As depicted in Table III, a total of 25 test cases are created by combining the four diffusion models and the baseline with the five objects. For each case, the model is evaluated on the corresponding task with 50 random initial poses. At each pose, the robot performs two insertion trials to account for variability and reduce the influence of random occurrences. The resulting success rates and execution times are reported in Table III and Fig. 6, respectively.

TABLE III: Success Rate [%]

Model	trained	novel (zero-shot transfer)
	Cuboid	Key	Cyl-S	Cyl-L	Prism	Average
$DF_{1}$	$90.0$	$\bm{99.0}$	$86.0$	$85.0$	$40.0$	$77.5$
$DF_{2}$	$79.0$	$94.0$	$87.0$	$90.0$	$79.0$	$87.5$
$DF_{3}$	$\bm{98.0}$	$\bm{99.0}$	$\bm{97.0}$	$\bm{96.0}$	$91.0$	$\bm{95.7}$
$DF_{4}$	$73.0$	$85.0$	$90.0$	$66.0$	$85.0$	$81.5$
Baseline	$92.0$	$94.0$	$61.0$	$82.0$	$\bm{96.0}$	$83.3$

*The highest success rate for each task is highlighted in bold font. The detailed configuration of models $DF_{1}$ to $DF_{4}$ is provided in Table II.

According to the Common Industry Format for Usability Test Reports (ISO/IEC 25062:2006), the “core measure of efficiency” is the ratio of the task completion rate to the mean time per task [45]. We use this ratio as the performance metric to evaluate the compared models. The results, illustrated by the radar plots in Fig. 7, show that $DF_{3}$ outperforms the baseline on demonstrated tasks in terms of efficiency.
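The efficiency ratio is simple to compute, as the sketch below shows. The success rates are the novel-task averages from Table III, but the mean execution times are placeholders rather than measured values, because the text reports times only graphically in Fig. 6; the function names are hypothetical.

```python
def efficiency(success_rate_percent, mean_time_s):
    """Core measure of efficiency: completion rate divided by mean time per task."""
    if mean_time_s <= 0:
        raise ValueError("mean_time_s must be positive")
    return success_rate_percent / mean_time_s

def relative_improvement(candidate, baseline):
    """Relative change of candidate over baseline, in percent."""
    return 100.0 * (candidate - baseline) / baseline

# Placeholder times: both set to 10 s purely for illustration, NOT measured data.
df3 = efficiency(95.7, mean_time_s=10.0)
base = efficiency(83.3, mean_time_s=10.0)
gain = relative_improvement(df3, base)
```

Because the metric divides by time, a policy that is both more reliable and faster gains on both factors, so the placeholder example with equal times understates any benefit from shorter execution.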

Notably, for novel tasks, all diffusion models achieve over a 10% improvement in efficiency, showcasing excellent zero-shot transferability. Among these models, $DF_{3}$ stands out with the best comprehensive performance on novel tasks, achieving an average success rate of 95.7%.

IV-C2 Trade-off between model accuracy and inference speed

As the model size increases, the model better captures latent relationships within the data, which is reflected in the increasing overall success rate from $DF_{1}$ to $DF_{3}$, as shown in Table III. However, larger models also experience a significant reduction in inference frequency, which exacerbates the misalignment with the 1000 Hz control loop. As shown in Table II, $DF_{3}$ maintains an acceptable frequency of 141.8 Hz, whereas $DF_{4}$ suffers a dramatic drop to only 51.2 Hz. This extremely low output frequency limits the model’s deployment potential despite its strong inference capability, resulting in a significant overall performance drop. Consequently, $DF_{3}$ (with $N=512$) is the only model that outperforms the baseline on both demonstrated and novel tasks. It exhibits the most balanced and highest performance across all insertion tasks, achieving a 129.5% improvement in overall performance compared to the baseline.
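To make the misalignment concrete, the inference frequencies from Table II can be converted into the average number of 1000 Hz control ticks that pass per diffusion prediction. The short names in the dictionary and the helper `ticks_per_prediction` are assumptions for illustration.

```python
CONTROL_HZ = 1000.0
# Inference frequencies from Table II
INFERENCE_HZ = {"DF1": 503.8, "DF2": 297.5, "DF3": 141.8, "DF4": 51.2}

def ticks_per_prediction(inference_hz, control_hz=CONTROL_HZ):
    """Average number of control ticks that elapse per diffusion prediction."""
    return control_hz / inference_hz

ticks = {name: round(ticks_per_prediction(hz), 2) for name, hz in INFERENCE_HZ.items()}
# ticks == {"DF1": 1.98, "DF2": 3.36, "DF3": 7.05, "DF4": 19.53}
```

Under this reading, $DF_{3}$ leaves roughly 7 control ticks without a fresh prediction, while $DF_{4}$ leaves almost 20.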

IV-C3 Dynamic system-based filter

Our dynamic system-based filter is designed to address the frequency misalignment issue. To validate its effectiveness, we repeat the identical experiments in Sec. IV-C1 for the diffusion models while disabling the filter in the framework. To distinguish from the previous models ( $DF_{x}$ ), these models are represented as $DF_{xN}$ . For ease of comparison, the results are presented in the same figure. As illustrated in Fig. 8, the models with filter assistance achieve higher success rates in 16 out of 20 scenarios, with three unchanged and one decreasing by 6%. Overall, our dynamic system-based filter mitigates the effects of frequency misalignment, leading to a 9.15% increase in success rates.

Moreover, we compare the model’s performance on both demonstrated and novel objects, as illustrated in Fig. 10. The inclusion of the filter results in enhanced performance across both categories. A more concrete example is provided in Fig. 9, which illustrates the effect of our filter on the diffusion model outputs. The raw diffusion output, depicted by the black curves, exhibits higher variability and fluctuations in the force and torque components. In contrast, the filtered feed-forward force commands, indicated by the red curves, present a smoother profile at 1000 Hz. These results confirm that the filtering process mitigates the frequency misalignment issue.

V Conclusion

In this work, we present a novel framework leveraging diffusion models to generate a 6D wrench for tactile manipulation in high-precision robotic assembly tasks. Our approach, being the first force-domain diffusion policy, demonstrates improved zero-shot transferability compared to prior work, achieving an overall 95.7% success rate in zero-shot transfer in experimental evaluations. Additionally, we investigate the trade-off between accuracy and inference speed and provide a practical guideline for optimal model selection. Further, we address the frequency misalignment between the diffusion policy and the real-time control loop with a dynamic system-based filter, significantly improving the task success rate by 9.15%. The extensive experimental studies in our work underscore the effectiveness of our framework in real-world settings, showcasing a promising approach to high-precision tactile manipulation that learns diffusion-based transferable skills from expert policies containing primitive-switching logic. In future work, we will focus on extending the framework’s applicability to a broader range of high-precision assembly tasks and integrating additional sensing modalities to enhance system adaptability and robustness in real-time environments.

References

[1] D. E. Whitney, Mechanical assemblies: their design, manufacture, and role in product development. Oxford university press New York, 2004, vol. 1.
[2] K. Nottensteiner, A. Sachtler, and A. Albu-Schäffer, “Towards Autonomous Robotic Assembly: Using Combined Visual and Tactile Sensing for Adaptive Task Execution,” Journal of Intelligent & Robotic Systems, vol. 101, no. 3, p. 49, Mar. 2021.
[3] R. S. Johansson and Å. B. Vallbo, “Tactile sensory coding in the glabrous skin of the human hand,” Trends in neurosciences, vol. 6, pp. 27–32, 1983.
[4] I. Birznieks, P. Jenmalm, A. W. Goodwin, and R. S. Johansson, “Encoding of direction of fingertip forces by human tactile afferents,” Journal of Neuroscience, vol. 21, no. 20, pp. 8222–8237, 2001.
[5] R. Li and H. Qiao, “A survey of methods and strategies for high-precision robotic grasping and assembly tasks—some new trends,” IEEE/ASME Transactions on Mechatronics, vol. 24, no. 6, pp. 2718–2732, 2019.
[6] K. Nottensteiner, A. Sachtler, and A. Albu-Schäffer, “Towards autonomous robotic assembly: Using combined visual and tactile sensing for adaptive task execution,” Journal of Intelligent & Robotic Systems, vol. 101, no. 3, p. 49, 2021.
[7] I. Hirochika, “Force Feedback in Precise Assembly Tasks,” in AI Memos. MIT, 1974.
[8] H. Chen, G. Zhang, H. Zhang, and T. A. Fuhlbrigge, “Integrated robotic system for high precision assembly in a semi-structured environment,” Assembly Automation, vol. 27, no. 3, pp. 247–252, 2007.
[9] H. Chen, J. Wang, G. Zhang, T. Fuhlbrigge, and S. Kock, “High-precision assembly automation based on robot compliance,” The International Journal of Advanced Manufacturing Technology, vol. 45, no. 9, pp. 999–1006, 2009.
[10] T. Inoue, G. De Magistris, A. Munawar, T. Yokoya, and R. Tachibana, “Deep reinforcement learning for high precision assembly tasks,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 819–825.
[11] L. Johannsmeier, M. Gerchow, and S. Haddadin, “A framework for robot manipulation: Skill formalism, meta learning and adaptive control,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 5844–5850.
[12] J. Luo, E. Solowjow, C. Wen, J. A. Ojea, A. M. Agogino, A. Tamar, and P. Abbeel, “Reinforcement learning on variable impedance controller for high-precision robotic assembly,” in 2019 International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 3080–3087.
[13] C. C. Beltran-Hernandez, D. Petit, I. G. Ramirez-Alpizar, and K. Harada, “Variable compliance control for robotic peg-in-hole assembly: A deep-reinforcement-learning approach,” Applied Sciences, vol. 10, no. 19, p. 6923, 2020.
[14] S. Haddadin and E. Shahriari, “Unified force-impedance control,” The International Journal of Robotics Research, p. 02783649241249194, Jul. 2024.
[15] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu et al., “Palm-e: An embodied multimodal language model,” in International Conference on Machine Learning. PMLR, 2023, pp. 8469–8488.
[16] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn et al., “Rt-1: Robotics transformer for real-world control at scale,” arXiv preprint arXiv:2212.06817, 2022.
[17] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen et al., “RT-2: Vision-language-action models transfer web knowledge to robotic control,” arXiv preprint arXiv:2307.15818, 2023.
[18] M. Shridhar, L. Manuelli, and D. Fox, “Perceiver-Actor: A multi-task transformer for robotic manipulation,” in Conference on Robot Learning. PMLR, 2023, pp. 785–799.
[19] A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jain et al., “Open X-Embodiment: Robotic learning datasets and RT-X models: Open X-Embodiment collaboration,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 6892–6903.
[20] A. Goyal, V. Blukis, J. Xu, Y. Guo, Y.-W. Chao, and D. Fox, “RVT-2: Learning precise manipulation from few demonstrations,” in Robotics: Science and Systems, Jun. 2024.
[21] T. Pearce, T. Rashid, A. Kanervisto, D. Bignell, M. Sun, R. Georgescu, S. V. Macua, S. Z. Tan, I. Momennejad, K. Hofmann et al., “Imitating human behaviour with diffusion models,” in Deep Reinforcement Learning Workshop NeurIPS, 2022.
[22] J. Carvalho, A. T. Le, M. Baierl, D. Koert, and J. Peters, “Motion planning diffusion: Learning and planning of robot motions with diffusion models,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2023, pp. 1916–1923.
[23] I. Kapelyukh, V. Vosylius, and E. Johns, “DALL-E-Bot: Introducing web-scale diffusion models to robotics,” IEEE Robotics and Automation Letters, vol. 8, no. 7, pp. 3956–3963, 2023.
[24] U. A. Mishra, S. Xue, Y. Chen, and D. Xu, “Generative skill chaining: Long-horizon skill planning with diffusion models,” in Conference on Robot Learning. PMLR, 2023, pp. 2905–2925.
[25] K. Black, M. Nakamoto, P. Atreya, H. Walke, C. Finn, A. Kumar, and S. Levine, “Zero-shot robotic manipulation with pretrained image-editing diffusion models,” arXiv preprint arXiv:2310.10639, 2023.
[26] C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” Robotics: Science and Systems, 2023.
[27] M. Reuss, M. Li, X. Jia, and R. Lioutikov, “Goal-conditioned imitation learning using score-based diffusion policies,” Robotics: Science and Systems, 2023.
[28] U. A. Mishra and Y. Chen, “ReorientDiff: Diffusion model based reorientation for object manipulation,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 10 867–10 873.
[29] A. Sridhar, D. Shah, C. Glossop, and S. Levine, “NoMaD: Goal masked diffusion policies for navigation and exploration,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 63–70.
[30] P. Li, Z. Li, H. Zhang, and J. Bian, “On the generalization properties of diffusion models,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[31] Y. Wu, F. Wu, L. Chen, K. Chen, S. Schneider, L. Johannsmeier, Z. Bing, F. J. Abu-Dakka, A. Knoll, and S. Haddadin, “1 kHz behavior tree for self-adaptable tactile insertion,” in 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024, pp. 16 002–16 008.
[32] O. Spector and D. Di Castro, “InsertionNet - a scalable solution for insertion,” IEEE Robotics and Automation Letters, vol. 6, no. 3, pp. 5509–5516, 2021.
[33] O. Spector, V. Tchuiev, and D. Di Castro, “InsertionNet 2.0: Minimal contact multi-step insertion using multimodal multiview sensory input,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 6330–6336.
[34] G. Schoettler, A. Nair, J. A. Ojea, S. Levine, and E. Solowjow, “Meta-reinforcement learning for robotic industrial insertion tasks,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 9728–9735.
[35] T. Z. Zhao, J. Luo, O. Sushkov, R. Pevceviciute, N. Heess, J. Scholz, S. Schaal, and S. Levine, “Offline meta-reinforcement learning for industrial insertion,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 6386–6393.
[36] B. Tang, M. A. Lin, I. Akinola, A. Handa, G. S. Sukhatme, F. Ramos, D. Fox, and Y. Narang, “IndustReal: Transferring contact-rich assembly tasks from simulation to reality,” arXiv preprint arXiv:2305.17110, 2023.
[37] B. Tang, I. Akinola, J. Xu, B. Wen, A. Handa, K. Van Wyk, D. Fox, G. S. Sukhatme, F. Ramos, and Y. Narang, “AutoMate: Specialist and generalist assembly policies over diverse geometries,” arXiv preprint arXiv:2407.08028, 2024.
[38] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
[39] M. Kollovieh, A. F. Ansari, M. Bohlke-Schneider, J. Zschiegner, H. Wang, and Y. B. Wang, “Predict, refine, synthesize: Self-guiding diffusion models for probabilistic time series forecasting,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[40] A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilistic models,” in International Conference on Machine Learning. PMLR, 2021, pp. 8162–8171.
[41] Y. Yang, M. Jin, H. Wen, C. Zhang, Y. Liang, L. Ma, Y. Wang, C. Liu, B. Yang, Z. Xu et al., “A survey on diffusion models for time series and spatio-temporal data,” arXiv preprint arXiv:2404.18886, 2024.
[42] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14. Springer, 2016, pp. 630–645.
[43] Y. Li, N. Miao, L. Ma, F. Shuang, and X. Huang, “Transformer for object detection: Review and benchmark,” Engineering Applications of Artificial Intelligence, vol. 126, p. 107021, 2023.
[44] C. Yang, G. Ganesh, S. Haddadin, S. Parusel, A. Albu-Schaeffer, and E. Burdet, “Human-like adaptation of force and impedance in stable and unstable interactions,” IEEE Transactions on Robotics, vol. 27, no. 5, pp. 918–930, 2011.
[45] B. Albert and T. Tullis, Measuring the user experience: collecting, analyzing, and presenting usability metrics. Newnes, 2013.