Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

A Spatiotemporal Hand-Eye Calibration for Trajectory Alignment in
Visual(-Inertial) Odometry Evaluation

Zichao Shu1, Lijun Li1, Rui Wang2 and Zetao Chen1 Manuscript received: December 12, 2023; Revised: March 7, 2024; Accepted: March 30, 2024. This paper was recommended for publication by Editor Sven Behnke upon evaluation of the Associate Editor and Reviewers’ comments. 1Zichao Shu, Lijun Li and Zetao Chen are with the Advanced Display and Sensing Research Center, Yongjiang Laboratory, China. {zichao-shu; lijun-li; zetao-chen}@ylab.ac.cn2Rui Wang is with the Mixed Reality & AI Lab-Zürich, Microsoft, Switzerland. {wangr@microsoft.com}Digital Object Identifier (DOI): see top of this page.
Abstract

A common prerequisite for evaluating a visual(-inertial) odometry (VO/VIO) algorithm is to align the timestamps and the reference frame of its estimated trajectory with a reference ground-truth derived from a system of superior precision, such as a motion capture system. The trajectory-based alignment, typically modeled as a classic hand-eye calibration, significantly influences the accuracy of evaluation metrics. However, traditional calibration methods are susceptible to the quality of the input poses. Few studies have taken this into account when evaluating VO/VIO trajectories that usually suffer from noise and drift. To fill this gap, we propose a novel spatiotemporal hand-eye calibration algorithm that fully leverages multiple constraints from screw theory for enhanced accuracy and robustness. Experimental results show that our algorithm has better performance and is less noise-prone than state-of-the-art methods.

Index Terms:
Calibration and Identification, Performance Evaluation and Benchmarking, Visual-Inertial SLAM.

I INTRODUCTION

Visual(-inertial) odometry (VO/VIO) is known to provide state estimation of motion devices and has a wide range of application domains, such as robotics, extended reality, and autonomous driving. The performance evaluation is a fundamental task in VO/VIO research and application, where metrics are typically quantified by evaluating the estimated trajectory from VO/VIO with respect to the ground-truth. Commonly, the ground-truth trajectory can be obtained by tracking the motion device simultaneously with a system of superior precision, e.g., using a motion capture (MoCap) system, laser tracker, etc [1, 2, 3]. There are two main problems when comparing the estimated trajectory against the ground-truth: the trajectory pair is usually on different clock domains (thus with non-corresponding timestamps) and expressed in different global and local reference frames. While well-known methods such as the Umeyama algorithm [4] can align the global frames, the spatiotemporal alignment, which calculates the offsets of timestamps and local frames of the trajectory pair, still needs meticulous handling.

The spatiotemporal alignment problem above can be modeled as a classic hand-eye calibration problem: given the local frames of the ground-truth and estimated trajectory as the hand and the eye respectively, calculate the timestamp offset and estimate the homogeneous transformation between them. While the essence of hand-eye calibration problem has been well addressed in numerous studies [5, 6, 7, 8, 9, 10], the accuracy and robustness may still be compromised in practical applications. The error introduced in this step will affect the transformation of the ground-truth trajectory, thereby exerting a substantial influence on the subsequent evaluation metrics.

I-A Motivation

In this work, we consider the scenario in which only the trajectory information is available. This is common among commercial consumer devices, such as extended reality headsets or home robots, where the original raw sensor data used to derive the device trajectories are not accessible by users. Existing hand-eye calibration algorithms can be categorized into two distinct approaches: tightly-coupled and loosely-coupled. The former typically joints raw data from the sensors such as images with information of the calibration boards [6, 7, 8] or IMU measurements [1, 2], and optimizes the result in a maximum likelihood estimation (MLE) framework. The latter, on the other hand, directly calculates the offset between the hand and the eye based on their independently estimated poses [9, 10]. While tightly-coupled approaches can theoretically achieve higher accuracy and are used in well-known benchmarks such as EuRoC [1] and TUM-VI [2], they are not applicable in cases where only the trajectory information is available. Loosely-coupled approaches can perform calibration in the pose-only condition, but due to the ubiquitous noise and accumulated error in the VO/VIO estimation, the accuracy and robustness of existing methods are generally insufficient.

I-B Contribution

In this letter, we propose a novel loosely-coupled spatiotemporal hand-eye calibration method tailored for VO/VIO evaluation. This method demonstrates robustness against noise and accumulated error in the input trajectories. For time alignment, we improve the correlation analysis of the screw invariant and obtain synchronized trajectories. For spatial calibration, we construct linear equations using local relative poses based on rotational constraint to fully utilize the motion information, rather than the naive global or inter-frame strategies [9, 8, 10]. Additionally, we introduce a well-designed robust kernel based on the screw theory to stabilize the linear solution. These operations are iteratively completed within the random sample consensus (RANSAC) framework to recover inlier data. Finally, we design a nonlinear optimization tool to jointly refine the time offset and the linear extrinsic solution. To validate the effectiveness of our algorithm, we conduct experiments on public and simulated datasets, as well as our own datasets collected by a virtual reality (VR) headset with VIO capability and a MoCap system (see in Fig. 1).

Refer to caption
Figure 1: Our spatiotemporal hand-eye calibration platform and the convention of the reference frames. The global frame for MoCap trajectory is denoted as G, and the local frame is referenced to a specific tracking marker indicated as H. The global frame for VIO trajectory is denoted as W, and the local frame E coincides with the IMU body frame. During this process, dashed lines represent transformations that change over time, while solid lines indicate static offsets.

The rest of this letter is organized as follows. Section II reviews related work on spatiotemporal hand-eye calibration. The new method is described in Section III and its performance is evaluated in Section IV. Section V concludes the letter.

II RELATED WORK

Spatiotemporal hand-eye calibration based on different strategies is a widely studied area. In our review, we briefly discuss the related work that shares the same strategy as the proposed method, i.e., the loosely-coupled methods, and motivate the design adapted in our work.

II-A Spatial Hand-Eye Calibration

In our application scenario, the loosely-coupled hand-eye calibration can be formulated as 𝐀𝐗𝐀𝐗\mathbf{AX}bold_AX = 𝐗𝐁𝐗𝐁\mathbf{XB}bold_XB [11], where 𝐀𝐀\mathbf{A}bold_A and 𝐁𝐁\mathbf{B}bold_B are the hand and the eye poses between two frames respectively, and 𝐗𝐗\mathbf{X}bold_X is the unknown homogeneous transformation between the hand and the eye. The solution for 𝐗𝐗\mathbf{X}bold_X can be categorized into two approaches: either separately or simultaneously solving the rotation and translation parts for the transformation.

One of the earliest separated approaches was presented by Shiu and Ahmad [11]. They represented rotation using angle-axis and proposed a closed-form solution for the 𝐀𝐗𝐀𝐗\mathit{\mathbf{AX}}bold_AX = 𝐗𝐁𝐗𝐁\mathit{\mathbf{XB}}bold_XB formulation. Tsai and Lenz [12] proposed a similar but simplified method to improve computational efficiency, which has been widely used to this day (e.g. in OpenCV). Later methods focused on utilizing various parameterizations of the rotation, such as angle-axis [13], quaternion [14], Lie algebra [15], and Kronecker product [16], to achieve more efficient solution. Although separated methods are computationally efficient, they are error-prone due to the independence assumption between the rotation and translation, which are actually nonlinearly coupled [17].

In contrast, the simultaneous methods calibrate the rotation and translation parts jointly. Representative methods include screw motion [18], Kronecker product [19], dual quaternion [20] and dual tensor [21], which use alternative analytical parameterizations to express the complete homogeneous transformation. Additionally, there are also algorithms based on numerical optimization [22, 23, 24]. While the computational cost may increase, they are generally more accurate than the separated methods. In addition to improving accuracy, recent researchers have also focused on enhancing the robustness of algorithms across various applications [25, 10, 9, 26]. However, these efforts were mostly focused on scenarios characterized by comparatively high-precision pose data. This differs from our use case where the trajectories are substantially affected by noise or accumulated error. In this paper, we utilize a dual quaternion scheme similar to [20, 10] and construct a robust linear solving system using multiple constraints from screw theory to address the challenges from the VO/VIO trajectory.

II-B Temporal Alignment

Given that the hand and the eye sensors usually operate on different clocks, the temporal correspondence between 𝐀𝐀\mathbf{A}bold_A and 𝐁𝐁\mathbf{B}bold_B is usually unknown. The time alignment prior to spatial calibration is thus necessary. Kelly and Sukhatme [27] considered the problem to be a registration task and solved it by utilizing the iterative closest point algorithm. Based on the discrete Fourier transform (DFT) theory, the time alignment of trajectories can be converted to the correlation analysis between two invariant signals derived from screw motion. This simple and effective method has been widely used in data synchronization for hand-eye calibration [10, 28, 29, 30]. However, the precision of this method is limited by the temporal resolution of the correlation function, with low-frequency data resulting in reduced time alignment accuracy. Alternatively, the TUM-VI benchmark [2] achieves time alignment by utilizing information from an error function, which is calculated through a grid search between the motion invariants. Additionally, a parabolic fitting is applied to the error function to enhance the precision of the time alignment. Unlike the grid search method in the time domain employed in TUM-VI, our approach builds upon the more commonly used DFT-based approach. To overcome the limitation posed by data frequency, we adopt a similar technique inspired by TUM-VI.

In some studies of continuous-time state estimation, by parameterizing the state variables as continuous-time functions, it is possible to achieve simultaneous spatiotemporal multi-sensor calibration within a MLE framework [31, 32, 33]. These high-precision methods can be easily extended to the pose-only scenario, but they are sensitive to the initial condition. In our work, we use the result of our linear calibration as the initial guess and perform a further refinement.

III METHODOLOGY

To better illustrate our algorithm, we take our hand-eye calibration platform as an example (see in Fig. 1). The unit dual quaternions 𝐪^𝔻^𝐪𝔻\hat{\mathbf{q}}\in\mathbb{DQ}over^ start_ARG bold_q end_ARG ∈ blackboard_D blackboard_Q is used to represent the homogeneous transformation. A dual quaternion 𝐪^^𝐪\hat{\mathbf{q}}over^ start_ARG bold_q end_ARG has the form 𝐪^=𝐪+ϵ𝐪=(𝐪,𝐪)^𝐪𝐪italic-ϵsuperscript𝐪𝐪superscript𝐪\hat{\mathbf{q}}=\mathbf{q}+\epsilon\mathbf{q}^{\prime}=\left(\mathbf{q},% \mathbf{q}^{\prime}\right)over^ start_ARG bold_q end_ARG = bold_q + italic_ϵ bold_q start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = ( bold_q , bold_q start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ), where Hamiltonian quaternions 𝐪𝐪\mathbf{q}bold_q and 𝐪superscript𝐪\mathbf{q}^{\prime}bold_q start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT are the standard part and the dual part of 𝐪^^𝐪\hat{\mathbf{q}}over^ start_ARG bold_q end_ARG respectively, and ϵitalic-ϵ\epsilonitalic_ϵ is the infinitesimal unit satisfying ϵ2=0superscriptitalic-ϵ20\epsilon^{2}=0italic_ϵ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 0. Our goal is to estimate the time offset ΔtHEΔsubscript𝑡HE\Delta t_{\scriptscriptstyle\mathrm{HE}}roman_Δ italic_t start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT and the extrinsic 𝐪^HEsubscript^𝐪HE\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{HE}}over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT between the MoCap (hand) trajectory 𝐪^GHsubscript^𝐪GH\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{GH}}over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_GH end_POSTSUBSCRIPT and the estimated VIO (eye) trajectory 𝐪^WEsubscript^𝐪WE\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{WE}}over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_WE end_POSTSUBSCRIPT. Fig. 2 provides an overview of the proposed algorithm, which comprises three modules. The main steps of the algorithm will be described in the remaining of this section.

Refer to caption
Figure 2: Flowchart of the proposed spatiotemporal hand-eye calibration, where the green and blue parallelograms represent the inputs and outputs respectively, and the orange rectangles represent the critical processing steps.

III-A Time Alignment

Refer to caption
Figure 3: Illustration of time alignment. The time offset can be determined by performing a quadratic polynomial curve fitting around the maximum (highlighted in gray) of the correlation function and obtaining the index of the maximum. This method enables synchronization of angular velocity at a finer granularity.

In order to process poses from sensors with different clocks, the first crucial step is to align the two sets of timestamps. To achieve this, we can leverage the constraint from screw motion, i.e., based on the equality of the angular velocities ωGHsubscript𝜔GH\omega_{\scriptscriptstyle\mathrm{GH}}italic_ω start_POSTSUBSCRIPT roman_GH end_POSTSUBSCRIPT and ωWEsubscript𝜔WE\omega_{\scriptscriptstyle\mathrm{WE}}italic_ω start_POSTSUBSCRIPT roman_WE end_POSTSUBSCRIPT of trajectories 𝐪^GHsubscript^𝐪GH\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{GH}}over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_GH end_POSTSUBSCRIPT and 𝐪^WEsubscript^𝐪WE\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{WE}}over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_WE end_POSTSUBSCRIPT, which is independent of calibration parameters. This simplifies the time alignment to the synchronization of angular velocity signals. Based on theory of DFT, the correlation between two time domain signals is greater when they exhibit higher similarity. Therefore, the synchronization involves finding the time shift, τshiftsubscript𝜏shift\tau_{\mathrm{shift}}italic_τ start_POSTSUBSCRIPT roman_shift end_POSTSUBSCRIPT, where the correlation function reaches its maximum:

τshift=argmaxindex(Corr(ωGH,ωWE)),subscript𝜏shiftindexargmaxCorrsubscript𝜔GHsubscript𝜔WE\tau_{\mathrm{shift}}=\underset{\mathrm{index}}{\operatorname{argmax}}\left(% \operatorname{Corr}\left(\omega_{\scriptscriptstyle\mathrm{GH}},\omega_{% \scriptscriptstyle\mathrm{WE}}\right)\right),italic_τ start_POSTSUBSCRIPT roman_shift end_POSTSUBSCRIPT = underroman_index start_ARG roman_argmax end_ARG ( roman_Corr ( italic_ω start_POSTSUBSCRIPT roman_GH end_POSTSUBSCRIPT , italic_ω start_POSTSUBSCRIPT roman_WE end_POSTSUBSCRIPT ) ) , (1)

where Corr()Corr\operatorname{Corr}\left(\cdot\right)roman_Corr ( ⋅ ) is the correlation function.

Given the discrete character of the angular velocity signals, the correlation function also appears discrete in the time domain, with the precision of the obtained time shift depending on the temporal resolution of the function. Assuming the data around the maximum follows a quadratic polynomial distribution, we perform curve fitting to refine our time shift similar to [2]. As illustrated in Fig. 3, this approach allows us to accurately determine the index of maximum correlation and trace it back to the corresponding time offset, ΔtHEΔsubscript𝑡HE\Delta t_{\scriptscriptstyle\mathrm{HE}}roman_Δ italic_t start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT.

III-B Linear Calibration

Given the time aligned trajectories 𝐪^GHsubscript^𝐪GH\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{GH}}over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_GH end_POSTSUBSCRIPT and 𝐪^WEsubscript^𝐪WE\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{WE}}over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_WE end_POSTSUBSCRIPT, we can perform spatial hand-eye calibration. As shown in Fig. 1, for any hand-eye motion from i𝑖iitalic_i to j𝑗jitalic_j in the trajectories, we have:

𝐪^GHi𝐪^HE𝐪^WEi1=𝐪^GHj𝐪^HE𝐪^WEj1.subscript^𝐪subscriptGH𝑖subscript^𝐪HEsuperscriptsubscript^𝐪subscriptWE𝑖1subscript^𝐪subscriptGH𝑗subscript^𝐪HEsuperscriptsubscript^𝐪subscriptWE𝑗1\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{GH}_{i}}\hat{\mathbf{q}}_{% \scriptscriptstyle\mathrm{HE}}\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{WE}_% {i}}^{-1}=\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{GH}_{j}}\hat{\mathbf{q}}% _{\scriptscriptstyle\mathrm{HE}}\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{WE% }_{j}}^{-1}.over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_GH start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_WE start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT = over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_GH start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_WE start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT . (2)

Using the relative transformation between two poses, i.e., 𝐪^HiHj=𝐪^GHi1𝐪^GHjsubscript^𝐪subscriptH𝑖subscriptH𝑗superscriptsubscript^𝐪subscriptGH𝑖1subscript^𝐪subscriptGH𝑗\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{H}_{i}\scriptscriptstyle\mathrm{H}% _{j}}=\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{GH}_{i}}^{-1}\hat{\mathbf{q}% }_{\scriptscriptstyle\mathrm{GH}_{j}}over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT = over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_GH start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_GH start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT and 𝐪^EiEj=𝐪^WEi1𝐪^WEjsubscript^𝐪subscriptE𝑖subscriptE𝑗superscriptsubscript^𝐪subscriptWE𝑖1subscript^𝐪subscriptWE𝑗\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{E}_{i}\scriptscriptstyle\mathrm{E}% _{j}}=\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{WE}_{i}}^{-1}\hat{\mathbf{q}% }_{\scriptscriptstyle\mathrm{WE}_{j}}over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT = over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_WE start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_WE start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT, (2) can be rewritten in the form of 𝐀𝐗𝐀𝐗\mathbf{AX}bold_AX = 𝐗𝐁𝐗𝐁\mathbf{XB}bold_XB as 𝐪^HiHj𝐪^HE=𝐪^HE𝐪^EiEjsubscript^𝐪subscriptH𝑖subscriptH𝑗subscript^𝐪HEsubscript^𝐪HEsubscript^𝐪subscriptE𝑖subscriptE𝑗\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{H}_{i}\scriptscriptstyle\mathrm{H}% _{j}}\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{HE}}=\hat{\mathbf{q}}_{% \scriptscriptstyle\mathrm{HE}}\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{E}_{% i}\scriptscriptstyle\mathrm{E}_{j}}over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT = over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT. This fundamental equation can be divided into the standard and dual parts, yielding:

𝐪HiHj𝐪HE𝐪HE𝐪EiEjsubscript𝐪subscriptH𝑖subscriptH𝑗subscript𝐪HEsubscript𝐪HEsubscript𝐪subscriptE𝑖subscriptE𝑗\displaystyle{\mathbf{q}}_{\scriptscriptstyle\mathrm{H}_{i}\scriptscriptstyle% \mathrm{H}_{j}}{\mathbf{q}}_{\scriptscriptstyle\mathrm{HE}}-{\mathbf{q}}_{% \scriptscriptstyle\mathrm{HE}}{\mathbf{q}}_{\scriptscriptstyle\mathrm{E}_{i}% \scriptscriptstyle\mathrm{E}_{j}}bold_q start_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT bold_q start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT - bold_q start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT bold_q start_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT =0,absent0\displaystyle=0,= 0 , (3)
𝐪HiHj𝐪HE𝐪HE𝐪EiEj+𝐪HiHj𝐪HE𝐪HE𝐪EiEjsuperscriptsubscript𝐪subscriptH𝑖subscriptH𝑗subscript𝐪HEsubscript𝐪HEsuperscriptsubscript𝐪subscriptE𝑖subscriptE𝑗subscript𝐪subscriptH𝑖subscriptH𝑗superscriptsubscript𝐪HEsuperscriptsubscript𝐪HEsubscript𝐪subscriptE𝑖subscriptE𝑗\displaystyle{\mathbf{q}}_{\scriptscriptstyle\mathrm{H}_{i}\scriptscriptstyle% \mathrm{H}_{j}}^{\prime}{\mathbf{q}}_{\scriptscriptstyle\mathrm{HE}}-{\mathbf{% q}}_{\scriptscriptstyle\mathrm{HE}}{\mathbf{q}}_{\scriptscriptstyle\mathrm{E}_% {i}\scriptscriptstyle\mathrm{E}_{j}}^{\prime}+{\mathbf{q}}_{\scriptscriptstyle% \mathrm{H}_{i}\scriptscriptstyle\mathrm{H}_{j}}{\mathbf{q}}_{% \scriptscriptstyle\mathrm{HE}}^{\prime}-{\mathbf{q}}_{\scriptscriptstyle% \mathrm{HE}}^{\prime}{\mathbf{q}}_{\scriptscriptstyle\mathrm{E}_{i}% \scriptscriptstyle\mathrm{E}_{j}}bold_q start_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT bold_q start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT - bold_q start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT bold_q start_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT + bold_q start_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT bold_q start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT - bold_q start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT bold_q start_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT =0.absent0\displaystyle=0.= 0 .

Due to the redundancy of the scalar part of the dual quaternion, we set 𝐫=(𝐪HiHj)v𝐫subscriptsubscript𝐪subscriptH𝑖subscriptH𝑗𝑣\mathbf{r}=\bigl{(}{\mathbf{q}}_{\scriptscriptstyle\mathrm{H}_{i}% \scriptscriptstyle\mathrm{H}_{j}}\bigr{)}_{v}bold_r = ( bold_q start_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT, 𝐫=(𝐪HiHj)vsuperscript𝐫subscriptsuperscriptsubscript𝐪subscriptH𝑖subscriptH𝑗𝑣\mathbf{r}^{\prime}=\bigl{(}{\mathbf{q}}_{\scriptscriptstyle\mathrm{H}_{i}% \scriptscriptstyle\mathrm{H}_{j}}^{\prime}\bigr{)}_{v}bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = ( bold_q start_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT, 𝐬=(𝐪EiEj)v𝐬subscriptsubscript𝐪subscriptE𝑖subscriptE𝑗𝑣\mathbf{s}=\bigl{(}{\mathbf{q}}_{\scriptscriptstyle\mathrm{E}_{i}% \scriptscriptstyle\mathrm{E}_{j}}\bigr{)}_{v}bold_s = ( bold_q start_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT and 𝐬=(𝐪EiEj)vsuperscript𝐬subscriptsuperscriptsubscript𝐪subscriptE𝑖subscriptE𝑗𝑣\mathbf{s}^{\prime}=\bigl{(}{\mathbf{q}}_{\scriptscriptstyle\mathrm{E}_{i}% \scriptscriptstyle\mathrm{E}_{j}}^{\prime}\bigr{)}_{v}bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = ( bold_q start_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT, where ()vsubscript𝑣\left(\cdot\right)_{v}( ⋅ ) start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT denotes the vector part of the quaternion. The linear equation for extrinsic calibration derived from a single motion can be written as:

[𝐫𝐬(𝐫+𝐬)𝟎3×1𝟎3×3𝐫𝐬(𝐫+𝐬)𝐫𝐬(𝐫+𝐬)][𝐪HE𝐪HE]=0,matrix𝐫𝐬superscript𝐫𝐬subscript031subscript033superscript𝐫superscript𝐬superscriptsuperscript𝐫superscript𝐬𝐫𝐬superscript𝐫𝐬matrixsubscript𝐪HEsuperscriptsubscript𝐪HE0\begin{bmatrix}\mathbf{r}-\mathbf{s}&\left(\mathbf{r}+\mathbf{s}\right)^{% \wedge}&\mathbf{0}_{3\times 1}&\mathbf{0}_{3\times 3}\\ \mathbf{r}^{\prime}-\mathbf{s}^{\prime}&\left(\mathbf{r}^{\prime}+\mathbf{s}^{% \prime}\right)^{\wedge}&\mathbf{r}-\mathbf{s}&\left(\mathbf{r}+\mathbf{s}% \right)^{\wedge}\end{bmatrix}\begin{bmatrix}{\mathbf{q}}_{\scriptscriptstyle% \mathrm{HE}}\\ {\mathbf{q}}_{\scriptscriptstyle\mathrm{HE}}^{\prime}\end{bmatrix}=0,[ start_ARG start_ROW start_CELL bold_r - bold_s end_CELL start_CELL ( bold_r + bold_s ) start_POSTSUPERSCRIPT ∧ end_POSTSUPERSCRIPT end_CELL start_CELL bold_0 start_POSTSUBSCRIPT 3 × 1 end_POSTSUBSCRIPT end_CELL start_CELL bold_0 start_POSTSUBSCRIPT 3 × 3 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT - bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_CELL start_CELL ( bold_r start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT + bold_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT ∧ end_POSTSUPERSCRIPT end_CELL start_CELL bold_r - bold_s end_CELL start_CELL ( bold_r + bold_s ) start_POSTSUPERSCRIPT ∧ end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG ] [ start_ARG start_ROW start_CELL bold_q start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL bold_q start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG ] = 0 , (4)

where ()superscript\left(\cdot\right)^{\wedge}( ⋅ ) start_POSTSUPERSCRIPT ∧ end_POSTSUPERSCRIPT denotes the antisymmetric matrix of a vector, and the coefficient matrix with dimensions of 6×8686\times 86 × 8 will be denoted as 𝐒𝐒\mathbf{S}bold_S. With n2𝑛2n\geq 2italic_n ≥ 2 motions, we can stack 𝐒𝐒\mathbf{S}bold_S to obtain a 6n×86𝑛86n\times 86 italic_n × 8 matrix with rank 6 in the noise-free case as:

𝐌=[𝐒1T,𝐒2T,,𝐒nT]T.𝐌superscriptsuperscriptsubscript𝐒1Tsuperscriptsubscript𝐒2Tsuperscriptsubscript𝐒𝑛TT\mathbf{M}=\left[\mathbf{S}_{1}^{\mathrm{T}},\mathbf{S}_{2}^{\mathrm{T}},% \ldots,\mathbf{S}_{n}^{\mathrm{T}}\right]^{\mathrm{T}}.bold_M = [ bold_S start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_T end_POSTSUPERSCRIPT , bold_S start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_T end_POSTSUPERSCRIPT , … , bold_S start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_T end_POSTSUPERSCRIPT ] start_POSTSUPERSCRIPT roman_T end_POSTSUPERSCRIPT . (5)

The singular value decomposition (SVD) algorithm is then used to find the linear least squares solution of the extrinsic with the constraint of the unit dual quaternion. For more detailed information about the fundamental principles of dual quaternion-based hand-eye calibration, please refer to [20].

SVD is known to be sensitive to noise and outliers. In the following, we will propose three strategies to enhance the accuracy and robustness of the algorithm.

III-B1 Relative poses construction

As shown in (4), the equation entirely relies on the relative poses of the hand-eye motion, therefore, the relative poses construction method will significantly affect the quality of the solution. Conventional methods either use a global or an inter-frame strategy. The former fixes a certain frame and calculates the relative poses of the remaining frames with respect to it. However, this method is prone to coupling the trajectory drift in the VO/VIO scenario. The latter calculates the relative poses between two successive frames, but may suffer from the noise caused by insufficient motion. Moreover, for a relative pose with pure translation, the matrix in (4) will degenerate and cannot constrain the dual part of the extrinsic.

Refer to caption
Figure 4: Illustration of relative poses construction, different methods are represented by three different lines, and ours shown as the solid green line.

We use a rotationally constrained approach to select the hand-eye keyframes to construct the relative poses, aiming to mitigate the error coupling and solution degeneracy. Specifically, one can keep searching forward from a frame 𝐪^isubscript^𝐪𝑖\hat{\mathbf{q}}_{i}over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT in the hand or eye trajectory, and build the relative pose when a frame 𝐪^jsubscript^𝐪𝑗\hat{\mathbf{q}}_{j}over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT satisfies the constraint:

2arccos((𝐪^i1𝐪^j)w)η,2subscriptsuperscriptsubscript^𝐪𝑖1subscript^𝐪𝑗𝑤𝜂2\arccos\left(\left(\hat{\mathbf{q}}_{i}^{-1}\hat{\mathbf{q}}_{j}\right)_{w}% \right)\geq\eta,2 roman_arccos ( ( over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT ) ≥ italic_η , (6)

where ()wsubscript𝑤\left(\cdot\right)_{w}( ⋅ ) start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT denotes the scalar part of the standard component of the dual quaternion, and η𝜂\etaitalic_η is an adjustable threshold which we set to 5 degrees. Fig. 4 illustrates the construction of relative poses using both conventional methods and our proposed method for comparison.

III-B2 Robust kernel

Despite the effort to construct high-quality relative poses, they still contain varying degrees of noise. To quantify the weights of different relative poses, we propose a robust kernel in the linear system construction.

For a unit dual quaternion 𝐪^^𝐪\hat{\mathbf{q}}over^ start_ARG bold_q end_ARG, its scalar part is defined as:

Scalar(𝐪^)=(𝐪^+𝐪^1)2,Scalar^𝐪^𝐪superscript^𝐪12\operatorname{Scalar}\left(\hat{\mathbf{q}}\right)=\frac{\left(\hat{\mathbf{q}% }+\hat{\mathbf{q}}^{-1}\right)}{2},roman_Scalar ( over^ start_ARG bold_q end_ARG ) = divide start_ARG ( over^ start_ARG bold_q end_ARG + over^ start_ARG bold_q end_ARG start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ) end_ARG start_ARG 2 end_ARG , (7)

which can be expressed in the form of a vector, written as:

Scalar(𝐪^)Scalar^𝐪\displaystyle\operatorname{Scalar}\left(\hat{\mathbf{q}}\right)roman_Scalar ( over^ start_ARG bold_q end_ARG ) =[ω,𝟎1×3,ω,𝟎1×3]Tabsentsuperscript𝜔subscript013superscript𝜔subscript013T\displaystyle=\left[\omega,\mathbf{0}_{1\times 3},\omega^{\prime},\mathbf{0}_{% 1\times 3}\right]^{\mathrm{T}}= [ italic_ω , bold_0 start_POSTSUBSCRIPT 1 × 3 end_POSTSUBSCRIPT , italic_ω start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , bold_0 start_POSTSUBSCRIPT 1 × 3 end_POSTSUBSCRIPT ] start_POSTSUPERSCRIPT roman_T end_POSTSUPERSCRIPT (8)
=[cosθ2,𝟎1×3,d2sinθ2,𝟎1×3]T,absentsuperscript𝜃2subscript013𝑑2𝜃2subscript013T\displaystyle=\Bigl{[}\cos\frac{\theta}{2},\mathbf{0}_{1\times 3},-\frac{d}{2}% \sin\frac{\theta}{2},\mathbf{0}_{1\times 3}\Bigr{]}^{\mathrm{T}},= [ roman_cos divide start_ARG italic_θ end_ARG start_ARG 2 end_ARG , bold_0 start_POSTSUBSCRIPT 1 × 3 end_POSTSUBSCRIPT , - divide start_ARG italic_d end_ARG start_ARG 2 end_ARG roman_sin divide start_ARG italic_θ end_ARG start_ARG 2 end_ARG , bold_0 start_POSTSUBSCRIPT 1 × 3 end_POSTSUBSCRIPT ] start_POSTSUPERSCRIPT roman_T end_POSTSUPERSCRIPT ,

where ω𝜔\omegaitalic_ω and ωsuperscript𝜔\omega^{\prime}italic_ω start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT are the scalar parts of the standard part and the dual part of the dual quaternion respectively, θ𝜃\thetaitalic_θ denotes the rotation angle, and d𝑑ditalic_d represents the translation norm of the screw motion.

The screw concatenation between the hand and eye can be further transformed from the form of 𝐀𝐗𝐀𝐗\mathbf{AX}bold_AX = 𝐗𝐁𝐗𝐁\mathbf{XB}bold_XB to 𝐪^HiHj=𝐪^HE𝐪^EiEj𝐪^HE1subscript^𝐪subscriptH𝑖subscriptH𝑗subscript^𝐪HEsubscript^𝐪subscriptE𝑖subscriptE𝑗superscriptsubscript^𝐪HE1\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{H}_{i}\scriptscriptstyle\mathrm{H}% _{j}}=\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{HE}}\hat{\mathbf{q}}_{% \scriptscriptstyle\mathrm{E}_{i}\scriptscriptstyle\mathrm{E}_{j}}\hat{\mathbf{% q}}_{\scriptscriptstyle\mathrm{HE}}^{-1}over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT = over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT. Based on the definition of the scalar part in (7), we have:

Scalar(𝐪^HiHj)Scalarsubscript^𝐪subscriptH𝑖subscriptH𝑗\displaystyle\operatorname{Scalar}\left(\hat{\mathbf{q}}_{\scriptscriptstyle% \mathrm{H}_{i}\scriptscriptstyle\mathrm{H}_{j}}\right)roman_Scalar ( over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) =12(𝐪^HiHj+𝐪^HiHj1)absent12subscript^𝐪subscriptH𝑖subscriptH𝑗superscriptsubscript^𝐪subscriptH𝑖subscriptH𝑗1\displaystyle=\frac{1}{2}\left(\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{H}_% {i}\scriptscriptstyle\mathrm{H}_{j}}+\hat{\mathbf{q}}_{\scriptscriptstyle% \mathrm{H}_{i}\scriptscriptstyle\mathrm{H}_{j}}^{-1}\right)= divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT + over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ) (9)
=12(𝐪^HE𝐪^EiEj𝐪^HE1+𝐪^HE𝐪^EiEj1𝐪^HE1)absent12subscript^𝐪HEsubscript^𝐪subscriptE𝑖subscriptE𝑗superscriptsubscript^𝐪HE1subscript^𝐪HEsuperscriptsubscript^𝐪subscriptE𝑖subscriptE𝑗1superscriptsubscript^𝐪HE1\displaystyle=\frac{1}{2}\left(\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{HE}% }\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{E}_{i}\scriptscriptstyle\mathrm{E% }_{j}}\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{HE}}^{-1}+\hat{\mathbf{q}}_{% \scriptscriptstyle\mathrm{HE}}\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{E}_{% i}\scriptscriptstyle\mathrm{E}_{j}}^{-1}\hat{\mathbf{q}}_{\scriptscriptstyle% \mathrm{HE}}^{-1}\right)= divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT + over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT )
=12𝐪^HE(𝐪^EiEj+𝐪^EiEj1)𝐪^HE1absent12subscript^𝐪HEsubscript^𝐪subscriptE𝑖subscriptE𝑗superscriptsubscript^𝐪subscriptE𝑖subscriptE𝑗1superscriptsubscript^𝐪HE1\displaystyle=\frac{1}{2}\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{HE}}\left% (\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{E}_{i}\scriptscriptstyle\mathrm{E% }_{j}}+\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{E}_{i}\scriptscriptstyle% \mathrm{E}_{j}}^{-1}\right)\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{HE}}^{-1}= divide start_ARG 1 end_ARG start_ARG 2 end_ARG over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT ( over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT + over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ) over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT
=𝐪^HEScalar(𝐪^EiEj)𝐪^HE1absentsubscript^𝐪HEScalarsubscript^𝐪subscriptE𝑖subscriptE𝑗superscriptsubscript^𝐪HE1\displaystyle=\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{HE}}\operatorname{% Scalar}\left(\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{E}_{i}% \scriptscriptstyle\mathrm{E}_{j}}\right)\hat{\mathbf{q}}_{\scriptscriptstyle% \mathrm{HE}}^{-1}= over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT roman_Scalar ( over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT
=Scalar(𝐪^EiEj)𝐪^HE𝐪^HE1absentScalarsubscript^𝐪subscriptE𝑖subscriptE𝑗subscript^𝐪HEsuperscriptsubscript^𝐪HE1\displaystyle=\operatorname{Scalar}\left(\hat{\mathbf{q}}_{\scriptscriptstyle% \mathrm{E}_{i}\scriptscriptstyle\mathrm{E}_{j}}\right)\hat{\mathbf{q}}_{% \scriptscriptstyle\mathrm{HE}}\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{HE}}% ^{-1}= roman_Scalar ( over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT
=Scalar(𝐪^EiEj),absentScalarsubscript^𝐪subscriptE𝑖subscriptE𝑗\displaystyle=\operatorname{Scalar}\left(\hat{\mathbf{q}}_{\scriptscriptstyle% \mathrm{E}_{i}\scriptscriptstyle\mathrm{E}_{j}}\right),= roman_Scalar ( over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) ,

which demonstrates that the scalar parts of the hand-eye relative pose pair are completely equal in the absence of noise. According to (8), this constraint can be transformed into the equality of rotation angles in local frames and translation norms along the principal axes of rotation, known as the screw congruence theorem.

We design the function as (10) to evaluate the quality of the hand-eye relative poses based on the scalar part of the dual quaternion in (8). The function yields a result of 1 when the screw motion constraint in (9) is strictly satisfied, while deviating from 1 as the noise increases.

Ei(𝐪^HiHj,𝐪^EiEj)subscript𝐸𝑖subscript^𝐪subscriptH𝑖subscriptH𝑗subscript^𝐪subscriptE𝑖subscriptE𝑗\displaystyle E_{i}\left(\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{H}_{i}% \scriptscriptstyle\mathrm{H}_{j}},\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{% E}_{i}\scriptscriptstyle\mathrm{E}_{j}}\right)italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT , over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) (10)
=\displaystyle== Ei(Scalar(𝐪^HiHj),Scalar(𝐪^EiEj))superscriptsubscript𝐸𝑖Scalarsubscript^𝐪subscriptH𝑖subscriptH𝑗Scalarsubscript^𝐪subscriptE𝑖subscriptE𝑗\displaystyle E_{i}^{\prime}\left(\operatorname{Scalar}\left(\hat{\mathbf{q}}_% {\scriptscriptstyle\mathrm{H}_{i}\scriptscriptstyle\mathrm{H}_{j}}\right),% \operatorname{Scalar}\left(\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{E}_{i}% \scriptscriptstyle\mathrm{E}_{j}}\right)\right)italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( roman_Scalar ( over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) , roman_Scalar ( over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) )
=\displaystyle== 12(max(|ωHiHj|,|ωEiEj|)min(|ωHiHj|,|ωEiEj|)+max(|ωHiHj|,|ωEiEj|)min(|ωHiHj|,|ωEiEj|)).12subscript𝜔subscriptH𝑖subscriptH𝑗subscript𝜔subscriptE𝑖subscriptE𝑗subscript𝜔subscriptH𝑖subscriptH𝑗subscript𝜔subscriptE𝑖subscriptE𝑗superscriptsubscript𝜔subscriptH𝑖subscriptH𝑗superscriptsubscript𝜔subscriptE𝑖subscriptE𝑗superscriptsubscript𝜔subscriptH𝑖subscriptH𝑗superscriptsubscript𝜔subscriptE𝑖subscriptE𝑗\displaystyle\frac{1}{2}\left(\frac{\max\left(\left|\omega_{\scriptscriptstyle% \mathrm{H}_{i}\scriptscriptstyle\mathrm{H}_{j}}\right|,\left|\omega_{% \scriptscriptstyle\mathrm{E}_{i}\scriptscriptstyle\mathrm{E}_{j}}\right|\right% )}{\min\left(\left|\omega_{\scriptscriptstyle\mathrm{H}_{i}\scriptscriptstyle% \mathrm{H}_{j}}\right|,\left|\omega_{\scriptscriptstyle\mathrm{E}_{i}% \scriptscriptstyle\mathrm{E}_{j}}\right|\right)}+\frac{\max\bigl{(}\bigl{|}% \omega_{\scriptscriptstyle\mathrm{H}_{i}\scriptscriptstyle\mathrm{H}_{j}}^{% \prime}\bigr{|},\bigl{|}\omega_{\scriptscriptstyle\mathrm{E}_{i}% \scriptscriptstyle\mathrm{E}_{j}}^{\prime}\bigr{|}\bigr{)}}{\min\bigl{(}\bigl{% |}\omega_{\scriptscriptstyle\mathrm{H}_{i}\scriptscriptstyle\mathrm{H}_{j}}^{% \prime}\bigr{|},\bigl{|}\omega_{\scriptscriptstyle\mathrm{E}_{i}% \scriptscriptstyle\mathrm{E}_{j}}^{\prime}\bigr{|}\bigr{)}}\right).divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( divide start_ARG roman_max ( | italic_ω start_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT | , | italic_ω start_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT | ) end_ARG start_ARG roman_min ( | italic_ω start_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT | , | italic_ω start_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT | ) end_ARG + divide start_ARG roman_max ( | italic_ω start_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | , | italic_ω start_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | ) end_ARG start_ARG roman_min ( | italic_ω start_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | , | italic_ω start_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | ) end_ARG ) .

Based on (10), we can define the robust kernel W𝑊Witalic_W as:

Wi(𝐪^HiHj,𝐪^EiEj)=exp(μ(1Ei(𝐪^HiHj,𝐪^EiEj)2)),subscript𝑊𝑖subscript^𝐪subscriptH𝑖subscriptH𝑗subscript^𝐪subscriptE𝑖subscriptE𝑗𝜇1subscript𝐸𝑖superscriptsubscript^𝐪subscriptH𝑖subscriptH𝑗subscript^𝐪subscriptE𝑖subscriptE𝑗2W_{i}\left(\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{H}_{i}% \scriptscriptstyle\mathrm{H}_{j}},\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{% E}_{i}\scriptscriptstyle\mathrm{E}_{j}}\right)=\exp\left(\mu\left(1-E_{i}\left% (\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{H}_{i}\scriptscriptstyle\mathrm{H% }_{j}},\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{E}_{i}\scriptscriptstyle% \mathrm{E}_{j}}\right)^{2}\right)\right),italic_W start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT , over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) = roman_exp ( italic_μ ( 1 - italic_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT , over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) ) , (11)

where the parameter μ𝜇\muitalic_μ is an adjustable magnification factor, which we set to 5. Hence, the linear calibration matrix in (5) can be robustified for better numerical stability:

𝐌r=[W1𝐒1T,W2𝐒2T,,Wn𝐒nT]T.subscript𝐌rsuperscriptsubscript𝑊1superscriptsubscript𝐒1Tsubscript𝑊2superscriptsubscript𝐒2Tsubscript𝑊𝑛superscriptsubscript𝐒𝑛TT\mathbf{M}_{\mathrm{r}}=\left[W_{1}\mathbf{S}_{1}^{\mathrm{T}},W_{2}\mathbf{S}% _{2}^{\mathrm{T}},\ldots,W_{n}\mathbf{S}_{n}^{\mathrm{T}}\right]^{\mathrm{T}}.bold_M start_POSTSUBSCRIPT roman_r end_POSTSUBSCRIPT = [ italic_W start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT bold_S start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_T end_POSTSUPERSCRIPT , italic_W start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT bold_S start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_T end_POSTSUPERSCRIPT , … , italic_W start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT bold_S start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_T end_POSTSUPERSCRIPT ] start_POSTSUPERSCRIPT roman_T end_POSTSUPERSCRIPT . (12)

III-B3 Outlier elimination

During the calibration process, it is also important to identify and discard outliers that exhibit significant error. We incorporate the RANSAC algorithm which utilizes an iterative sampling strategy, and has the ability to recover the inliers from noisy data.

Note that at least two pairs of relative poses are required to construct the linear calibration system. For each iteration, we randomly sample a pose pair subset consisting of two members and determine inliers by using the quantitative criterion as:

e1subscript𝑒1\displaystyle e_{1}italic_e start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT =(𝐪^HE𝐪^EiEj𝐪^HE1𝐪^HiHj1)rot<φ,absentsubscriptsubscript^𝐪HEsubscript^𝐪subscriptE𝑖subscriptE𝑗superscriptsubscript^𝐪HE1superscriptsubscript^𝐪subscriptH𝑖subscriptH𝑗1rot𝜑\displaystyle=\left(\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{HE}}\hat{% \mathbf{q}}_{\scriptscriptstyle\mathrm{E}_{i}\scriptscriptstyle\mathrm{E}_{j}}% \hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{HE}}^{-1}\hat{\mathbf{q}}_{% \scriptscriptstyle\mathrm{H}_{i}\scriptscriptstyle\mathrm{H}_{j}}^{-1}\right)_% {\operatorname{rot}}<\varphi,= ( over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ) start_POSTSUBSCRIPT roman_rot end_POSTSUBSCRIPT < italic_φ , (13)
e2subscript𝑒2\displaystyle e_{2}italic_e start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT =(𝐪^HE𝐪^EiEj𝐪^HE1𝐪^HiHj1)trans<ψ,absentsubscriptsubscript^𝐪HEsubscript^𝐪subscriptE𝑖subscriptE𝑗superscriptsubscript^𝐪HE1superscriptsubscript^𝐪subscriptH𝑖subscriptH𝑗1trans𝜓\displaystyle=\left(\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{HE}}\hat{% \mathbf{q}}_{\scriptscriptstyle\mathrm{E}_{i}\scriptscriptstyle\mathrm{E}_{j}}% \hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{HE}}^{-1}\hat{\mathbf{q}}_{% \scriptscriptstyle\mathrm{H}_{i}\scriptscriptstyle\mathrm{H}_{j}}^{-1}\right)_% {\operatorname{trans}}<\psi,= ( over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ) start_POSTSUBSCRIPT roman_trans end_POSTSUBSCRIPT < italic_ψ ,

where ()rotsubscriptrot\left(\cdot\right)_{\operatorname{rot}}( ⋅ ) start_POSTSUBSCRIPT roman_rot end_POSTSUBSCRIPT and ()transsubscripttrans\left(\cdot\right)_{\operatorname{trans}}( ⋅ ) start_POSTSUBSCRIPT roman_trans end_POSTSUBSCRIPT are the rotation angle and translation norm of the dual quaternion, respectively. Inliers are required to have both indicators in (13) less than the thresholds φ𝜑\varphiitalic_φ and ψ𝜓\psiitalic_ψ, which we set to 0.5 degrees and 0.02 m experimentally.

Input: Time aligned hand-eye trajectories 𝐪^GHsubscript^𝐪GH\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{GH}}over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_GH end_POSTSUBSCRIPT, 𝐪^WEsubscript^𝐪WE\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{WE}}over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_WE end_POSTSUBSCRIPT.
Output: Hand-eye extrinsic 𝐪^HEsuperscriptsubscript^𝐪HE\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{HE}}^{\ast}over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT.
1 {𝐪^EiEj,𝐪^HiHj}𝒞subscript^𝐪subscriptE𝑖subscriptE𝑗subscript^𝐪subscriptH𝑖subscriptH𝑗𝒞absent\left\{\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{E}_{i}\scriptscriptstyle% \mathrm{E}_{j}},\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{H}_{i}% \scriptscriptstyle\mathrm{H}_{j}}\right\}\in\mathcal{C}\leftarrow{ over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_E start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT , over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT roman_H start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT } ∈ caligraphic_C ← Construct the relative poses based on the rotational constraint in (6);
2 while not reached the iteration limit do
3       Subset 𝒟𝒟absent\mathcal{D}\leftarrowcaligraphic_D ← Randomly sampling from 𝒞𝒞\mathcal{C}caligraphic_C;
4       Construct 𝐌12×8subscript𝐌128\mathbf{M}_{12\times 8}bold_M start_POSTSUBSCRIPT 12 × 8 end_POSTSUBSCRIPT based on 𝒟𝒟\mathcal{D}caligraphic_D using (5);
5       𝐪^HEinitsuperscriptsubscript^𝐪HEinitabsent\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{HE}}^{\mathrm{init}}\leftarrowover^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_init end_POSTSUPERSCRIPT ← Apply SVD to M;
6       foreach pose pair \in 𝒞𝒞\mathcal{C}caligraphic_C do
7             𝒢𝒢absent\mathcal{G}\leftarrowcaligraphic_G ← Select inliers using (13) with 𝐪^HEinitsuperscriptsubscript^𝐪HEinit\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{HE}}^{\mathrm{init}}over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_init end_POSTSUPERSCRIPT;
8            
9       end foreach
10      Construct 𝐌2n×8superscriptsubscript𝐌2𝑛8\mathbf{M}_{2n\times 8}^{\prime}bold_M start_POSTSUBSCRIPT 2 italic_n × 8 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT with robust kernel based on 𝒢𝒢\mathcal{G}caligraphic_G using (12);
11       𝐪^HErefinesuperscriptsubscript^𝐪HErefineabsent\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{HE}}^{\mathrm{refine}}\leftarrowover^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_refine end_POSTSUPERSCRIPT ← Apply SVD to 𝐌superscript𝐌\mathbf{M}^{\prime}bold_M start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT;
12       𝐪^HE𝐪^HErefinesuperscriptsubscript^𝐪HEsuperscriptsubscript^𝐪HErefine\hat{\mathbf{q}}_{\scriptscriptstyle\mathrm{HE}}^{\ast}\leftarrow\hat{\mathbf{% q}}_{\scriptscriptstyle\mathrm{HE}}^{\mathrm{refine}}over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ← over^ start_ARG bold_q end_ARG start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_refine end_POSTSUPERSCRIPT with the smallest σ7σ6subscript𝜎7subscript𝜎6\frac{\sigma_{7}}{\sigma_{6}}divide start_ARG italic_σ start_POSTSUBSCRIPT 7 end_POSTSUBSCRIPT end_ARG start_ARG italic_σ start_POSTSUBSCRIPT 6 end_POSTSUBSCRIPT end_ARG value;
13 end while
Algorithm 1 Robust linear hand-eye calibration

Additionally, we need to assess the quality of the solutions during the iteration. Given that the matrix 𝐌rsubscript𝐌r\mathbf{M}_{\mathrm{r}}bold_M start_POSTSUBSCRIPT roman_r end_POSTSUBSCRIPT in (12) possesses a two-dimensional null space, the singular values, σ7subscript𝜎7\sigma_{7}italic_σ start_POSTSUBSCRIPT 7 end_POSTSUBSCRIPT and σ8subscript𝜎8\sigma_{8}italic_σ start_POSTSUBSCRIPT 8 end_POSTSUBSCRIPT, which are the last two in the descending diagonal matrix ΣΣ\Sigmaroman_Σ from the SVD result 𝐌r=𝐔𝚺𝐕Tsubscript𝐌r𝐔𝚺superscript𝐕T\mathbf{M}_{\mathrm{r}}=\mathbf{U\Sigma V}^{\mathrm{T}}bold_M start_POSTSUBSCRIPT roman_r end_POSTSUBSCRIPT = bold_U bold_Σ bold_V start_POSTSUPERSCRIPT roman_T end_POSTSUPERSCRIPT, are expected to be zero in the absence of noise. However, when noise is present, these two singular values, corresponding to the noise, disproportionately increase compared to the remaining singular values. The ratio of σ7subscript𝜎7\sigma_{7}italic_σ start_POSTSUBSCRIPT 7 end_POSTSUBSCRIPT to the third-to-last singular value, σ6subscript𝜎6\sigma_{6}italic_σ start_POSTSUBSCRIPT 6 end_POSTSUBSCRIPT, can serve as an indicator of this disproportionality, and a lower ratio suggests a reduced impact of noise. Consequently, we can establish a direct metric for quantifying the quality of the solutions, rather than relying on traditional measures such as the number of inliers or the root mean square error (RMSE) of the solutions with respect to inliers. The specific implementation of the linear spatial hand-eye calibration within the RANSAC framework is detailed in Algorithm 1.

III-C Batch Estimation

Despite the fact that our linear calibration method achieves good accuracy and robustness in the experiments of Section IV, it can be further improved by introducing the correlation between the temporal and spatial calibration parameters. In this section we follow the continuous-time batch estimation methods proposed in [32, 33] and provide an estimator within the rigorous theoretical framework of MLE, to jointly optimize the time offset and the spatial transformation. Since the estimator is hard to converge globally, we used the results derived in Section III-A and III-B as the initial guesses.

Specifically, the original hand trajectory is parameterized using a B-spline functions 𝐓GH(t)subscript𝐓GH𝑡\mathbf{T}_{\mathrm{GH}}\left(t\right)bold_T start_POSTSUBSCRIPT roman_GH end_POSTSUBSCRIPT ( italic_t ), with the translation part represented by a B-spline in three-dimensional vector space, and the rotation part parameterized by a B-spline on SO(3)𝑆𝑂3SO\left(3\right)italic_S italic_O ( 3 ). For the B-spline 𝐪(t)𝐪𝑡\mathbf{q}\left(t\right)bold_q ( italic_t ) on SO(3)𝑆𝑂3SO\left(3\right)italic_S italic_O ( 3 ) with order ξ𝜉\xiitalic_ξ, knots {ti|i{1,2,3,,N}}conditional-setsubscript𝑡𝑖𝑖123𝑁\left\{t_{i}|i\in\left\{1,2,3,\dots,N\right\}\right\}{ italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_i ∈ { 1 , 2 , 3 , … , italic_N } }, and satisfying N2ξ𝑁2𝜉N\geq 2\xiitalic_N ≥ 2 italic_ξ, the function can be defined in each subinterval {t[ti,ti+1)|ξiNξ}conditional-set𝑡subscript𝑡𝑖subscript𝑡𝑖1𝜉𝑖𝑁𝜉\left\{t\in\left[t_{i},t_{i+1}\right)|\xi\leq i\leq N-\xi\right\}{ italic_t ∈ [ italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT ) | italic_ξ ≤ italic_i ≤ italic_N - italic_ξ } as:

𝐪(t)=𝐪l(i)j=1ξ1𝐄𝐗𝐏((k=ηifk)𝐋𝐎𝐆(𝐪η11𝐪η)),\mathbf{q}\left(t\right)=\mathbf{q}_{l\left(i\right)}\prod_{j=1}^{\xi-1}% \mathbf{EXP}\biggl{(}\Bigl{(}\sum_{k=\eta}^{i}f_{k}\Bigr{)}\mathbf{LOG}\left(% \mathbf{q}_{\eta-1}^{-1}\mathbf{q}_{\eta}\right)\biggl{)},bold_q ( italic_t ) = bold_q start_POSTSUBSCRIPT italic_l ( italic_i ) end_POSTSUBSCRIPT ∏ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_ξ - 1 end_POSTSUPERSCRIPT bold_EXP ( ( ∑ start_POSTSUBSCRIPT italic_k = italic_η end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) bold_LOG ( bold_q start_POSTSUBSCRIPT italic_η - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT bold_q start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT ) ) , (14)

where l(i)=iξ+1𝑙𝑖𝑖𝜉1l\left(i\right)=i-\xi+1italic_l ( italic_i ) = italic_i - italic_ξ + 1 and η=l(i)+j𝜂𝑙𝑖𝑗\eta=l\left(i\right)+jitalic_η = italic_l ( italic_i ) + italic_j, f𝑓fitalic_f and 𝐪𝐪\mathbf{q}bold_q represent the B-spline basis functions and the control vertices in unit quaternion form, respectively. 𝐄𝐗𝐏𝐄𝐗𝐏\mathbf{EXP}bold_EXP denotes the mapping from the Lie algebra to the Lie group, while 𝐋𝐎𝐆𝐋𝐎𝐆\mathbf{LOG}bold_LOG represents the inverse process.

The parameters determined by our estimator include 𝐓GH(t)subscript𝐓GH𝑡\mathbf{T}_{\scriptscriptstyle\mathrm{GH}}\left(t\right)bold_T start_POSTSUBSCRIPT roman_GH end_POSTSUBSCRIPT ( italic_t ), homogeneous transformation 𝐓HEsubscript𝐓HE\mathbf{T}_{\scriptscriptstyle\mathrm{HE}}bold_T start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT, and time offset ΔtHEΔsubscript𝑡HE\Delta t_{\scriptscriptstyle\mathrm{HE}}roman_Δ italic_t start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT. With the observations of hand-eye trajectories, we minimize the negative log-likelihood function as:

g𝑔\displaystyle gitalic_g =h=1H1ρ(d(𝐓GH(th),𝐓GH(th+1),𝐓GHh,𝐓GHh+1)ΣH2)absentsuperscriptsubscript1𝐻1𝜌superscriptsubscriptnorm𝑑subscript𝐓GHsubscript𝑡subscript𝐓GHsubscript𝑡1subscript𝐓subscriptGHsubscript𝐓subscriptGH1subscriptΣH2\displaystyle=\sum_{h=1}^{H-1}\rho\left(\left\|d\left(\mathbf{T}_{% \scriptscriptstyle\mathrm{GH}}\left(t_{h}\right),\mathbf{T}_{% \scriptscriptstyle\mathrm{GH}}\left(t_{h+1}\right),\mathbf{T}_{% \scriptscriptstyle\mathrm{GH}_{h}},\mathbf{T}_{\scriptscriptstyle\mathrm{GH}_{% h+1}}\right)\right\|_{\Sigma_{\scriptscriptstyle\mathrm{H}}}^{2}\right)= ∑ start_POSTSUBSCRIPT italic_h = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_H - 1 end_POSTSUPERSCRIPT italic_ρ ( ∥ italic_d ( bold_T start_POSTSUBSCRIPT roman_GH end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ) , bold_T start_POSTSUBSCRIPT roman_GH end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_h + 1 end_POSTSUBSCRIPT ) , bold_T start_POSTSUBSCRIPT roman_GH start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT end_POSTSUBSCRIPT , bold_T start_POSTSUBSCRIPT roman_GH start_POSTSUBSCRIPT italic_h + 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) ∥ start_POSTSUBSCRIPT roman_Σ start_POSTSUBSCRIPT roman_H end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) (15)
+e=1E1ρ(d(𝐓WE(te),𝐓WE(te+1),𝐓WEe,𝐓WEe+1)ΣE2),superscriptsubscript𝑒1𝐸1𝜌superscriptsubscriptnorm𝑑subscript𝐓WEsubscript𝑡𝑒subscript𝐓WEsubscript𝑡𝑒1subscript𝐓subscriptWE𝑒subscript𝐓subscriptWE𝑒1subscriptΣE2\displaystyle+\sum_{e=1}^{E-1}\rho\left(\left\|d\left(\mathbf{T}_{% \scriptscriptstyle\mathrm{WE}}\left(t_{e}\right),\mathbf{T}_{% \scriptscriptstyle\mathrm{WE}}\left(t_{e+1}\right),\mathbf{T}_{% \scriptscriptstyle\mathrm{WE}_{e}},\mathbf{T}_{\scriptscriptstyle\mathrm{WE}_{% e+1}}\right)\right\|_{\Sigma_{\scriptscriptstyle\mathrm{E}}}^{2}\right),+ ∑ start_POSTSUBSCRIPT italic_e = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_E - 1 end_POSTSUPERSCRIPT italic_ρ ( ∥ italic_d ( bold_T start_POSTSUBSCRIPT roman_WE end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT ) , bold_T start_POSTSUBSCRIPT roman_WE end_POSTSUBSCRIPT ( italic_t start_POSTSUBSCRIPT italic_e + 1 end_POSTSUBSCRIPT ) , bold_T start_POSTSUBSCRIPT roman_WE start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT end_POSTSUBSCRIPT , bold_T start_POSTSUBSCRIPT roman_WE start_POSTSUBSCRIPT italic_e + 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) ∥ start_POSTSUBSCRIPT roman_Σ start_POSTSUBSCRIPT roman_E end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) ,

where H𝐻Hitalic_H and E𝐸Eitalic_E denote the number of pose observations of the hand and the eye respectively, and ρ()𝜌\rho\left(\cdot\right)italic_ρ ( ⋅ ) is the Huber loss function. Due to the rigid connection between the hand and the eye, we have 𝐓WE(t)=𝐓WG𝐓GH(tΔtHE)𝐓HEsubscript𝐓WE𝑡subscript𝐓WGsubscript𝐓GH𝑡Δsubscript𝑡HEsubscript𝐓HE\mathbf{T}_{\scriptscriptstyle\mathrm{WE}}\left(t\right)=\mathbf{T}_{% \scriptscriptstyle\mathrm{WG}}\mathbf{T}_{\scriptscriptstyle\mathrm{GH}}\left(% t-\Delta t_{\scriptscriptstyle\mathrm{HE}}\right)\mathbf{T}_{% \scriptscriptstyle\mathrm{HE}}bold_T start_POSTSUBSCRIPT roman_WE end_POSTSUBSCRIPT ( italic_t ) = bold_T start_POSTSUBSCRIPT roman_WG end_POSTSUBSCRIPT bold_T start_POSTSUBSCRIPT roman_GH end_POSTSUBSCRIPT ( italic_t - roman_Δ italic_t start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT ) bold_T start_POSTSUBSCRIPT roman_HE end_POSTSUBSCRIPT. Meanwhile, we define the residual function in a relative form to eliminate the influence of the unknown 𝐓WGsubscript𝐓WG\mathbf{T}_{\scriptscriptstyle\mathrm{WG}}bold_T start_POSTSUBSCRIPT roman_WG end_POSTSUBSCRIPT as:

d(𝐓i,𝐓i+1,𝐓i,𝐓i+1)=𝐋𝐎𝐆(𝐓i1𝐓i+1(𝐓i1𝐓i+1)1).𝑑subscript𝐓𝑖subscript𝐓𝑖1superscriptsubscript𝐓𝑖superscriptsubscript𝐓𝑖1𝐋𝐎𝐆superscriptsubscript𝐓𝑖1subscript𝐓𝑖1superscriptsuperscriptsubscript𝐓𝑖1superscriptsubscript𝐓𝑖11d\left(\mathbf{T}_{i},\mathbf{T}_{i+1},\mathbf{T}_{i}^{\prime},\mathbf{T}_{i+1% }^{\prime}\right)=\mathbf{LOG}\Bigl{(}\mathbf{T}_{i}^{-1}\mathbf{T}_{i+1}\left% (\mathbf{T}_{i}^{\prime-1}\mathbf{T}_{i+1}^{\prime}\right)^{-1}\Bigr{)}.italic_d ( bold_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_T start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT , bold_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , bold_T start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) = bold_LOG ( bold_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT bold_T start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT ( bold_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ - 1 end_POSTSUPERSCRIPT bold_T start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ) . (16)

For the above cost function, the Levenberg–Marquardt algorithm is used to obtain the refined calibration result.

IV EXPERIMENTAL RESULTS

IV-A Ablation Studies

To validate the effectiveness of the key improvement strategies proposed in our methodology, we conduct comparative experiments using the simulated datasets. Specifically, to obtain the required data, we model a real motion trajectory using the B-spline. Subsequently, we extract two trajectories from the model at 100Hz and 20Hz to simulate MoCap and VO/VIO trajectories for hand-eye calibration. Throughout the process, we introduce frame-wise cumulative error into each eye pose to simulate the noise and drift typically present in the VO/VIO trajectory. The standard deviations for translation and rotation errors are divided into ten levels, ranging from 0 to 5 mm and 0 to 0.2 degrees, respectively.

We first test the time alignment algorithm proposed in Section III-A to verify the enhancement provided by the quadratic polynomial curve fitting in the correlation analysis compared to the baseline approach used in [10]. In this experiment, we introduce random delays into the eye trajectories derived from the B-spline to simulate the time offsets. To facilitate the correlation function computation, standardizing the frequencies of the trajectory pair is necessary. Specifically, we adjust the frequencies to match those of hand and eye trajectories, which are respectively 100Hz and 20Hz. As shown in Fig. 5, correlation analysis-based time alignment methods exhibit minimal sensitivity to trajectory noise. However, limited by the temporal resolution of the correlation function, time alignment error of the baseline method is essentially determined by the trajectory frequency. Acquiring the high-frequency trajectory is challenging and requires significant computational effort. By implementing the curve fitting strategy, we can determine the maximum index of the correlation function with greater accuracy, leading to higher precision in time alignment, even at low frequency.

Refer to caption
Figure 5: Performance comparison of different time alignment methods under various noise levels. We report the mean and standard deviation of the time alignment error.
Refer to caption
Figure 6: Performance comparison of different strategies used in linear calibration under various noise levels. We calculate the translation error and the rotation error separately, and report the mean and standard deviation.

Additionally, we evaluate the effectiveness of the strategies for spatial linear calibration proposed in Section III-B. The three relative poses construction methods, namely global, inter-frame, and rotationally constrained, are comparatively validated, and different thresholds in (6) are explored within the proposed rotationally constrained approach. Meanwhile, we test the impact of the robust kernel (RB). In this experiment, the calibration error is used as a quantitative metric for assessment. Fig. 6 indicates that the calibration errors for all scenarios trend to increase as noise levels rise. Nonetheless, our method, which constructs rotationally constrained relative poses yields smaller error due to its ability to derive the linear equation with higher signal-to-noise ratio. At the same time, the rotational constraint threshold also affects the accuracy of the solution, with 5 degrees proving to be optimal in our tests. Furthermore, our robust kernel strategy demonstrates superior robustness in the presence of increasing noise, as indicated by the relatively small rise in calibration error and standard deviation.

IV-B Overall Evaluation

We evaluate the performance of our hand-eye calibration algorithm using real-world datasets, including public VIO datasets, and datasets collected by our system (see in Fig. 1). For comparison, we also evaluate two other state-of-the-art (SOTA) linear hand-eye calibration algorithms based on dual quaternion and random sampling, as implemented in [10], namely RANSAC scalar-based inlier check (RS) and RANSAC classic (RC). We use the well-known VIO algorithm OpenVINS [34] to estimate the trajectory of the eye, while the trajectory of the hand is derived from the ground-truth system.

Refer to caption
Figure 7: Performance comparison of different linear calibration methods on public datasets. We select two sequences in each scenario and count the distribution of the translation error and the rotation error separately.

We first test on the TUM-VI room scenario [2] as well as machine hall (MH) and Vicon room (V) scenarios provided by EuRoC benchmark [1]. All three scenarios provide complete raw data of ground-truth trajectories, and achieve high-precision hand-eye calibration through the tightly-coupled method. We align the estimated VIO trajectories with the raw ground-truth trajectories using different hand-eye calibration algorithms. For evaluation metrics, we compare with the calibration results provided by the public datasets, and calculate the translation error and the rotation error. As shown in Fig. 7, we first evaluate our linear calibration (LC). The comparison results demonstrate that our algorithm achieves the highest accuracy and repeatability in all sequences. In challenging sequences such as MH_04 and V2_03, where the VIO algorithm struggles to deliver high-quality trajectories, traditional calibration algorithms are prone to fail, but our algorithm handles these situations effectively.

TABLE I: Detailed Performance of the Proposed Methods and the Trajectory Metrics on Public Datasets
Sequences Error of our LC Error of our BE Original metrics Our metrics
Time (ms) Trans (m) Rot (deg) Time (ms) Trans (m) Rot (deg) APE (m) ARE (deg) APE (m) ARE (deg)
Room_1 1.393 0.003 0.284 0.198 0.004 0.106 0.059 1.593 0.059 1.593
Room_6 1.266 0.024 0.675 0.296 0.016 0.199 0.085 1.686 0.085 1.683
MH_01 3.430 0.032 0.596 0.883 0.018 0.641 0.156 1.884 0.157 1.683
MH_04 5.404 0.031 0.927 3.691 0.018 0.749 0.161 0.950 0.157 0.561
V1_01 3.718 0.053 4.444 1.982 0.049 4.479 0.061 5.520 0.047 1.064
V2_03 3.092 0.015 0.280 2.135 0.007 0.118 0.095 1.157 0.095 1.139
Refer to caption
(a)
Refer to caption
(b)
Figure 8: AR visualization of the EuRoC V1_01 sequence, featuring a virtual box rendered in the scene. This visualization utilizes (a) transformed ground-truth poses provided by the EuRoC benchmark and (b) transformed ground-truth poses based on the raw pre-calibrated trajectory and our calibration result. The red dashed circle highlights the difference in consistency between the virtual and real elements when using different poses.

Table I details the performance of our algorithm in each sequence, including the time alignment errors and the results of our batch estimation (BE). Additionally, to analyze the impact of hand-eye calibration on VIO trajectory evaluation, we use our calibration results to transform the raw ground-truth trajectories and calculate the absolute positional error (APE) and absolute rotational error (ARE) [3], which are important metrics in VIO evaluation. As a comparison, we also calculate the original metrics using the transformed ground-truth provided by the public datasets. The obtained results demonstrate that our LC effectively accomplishes spatiotemporal hand-eye calibration, and our BE can further optimizes the result to achieve higher accuracy. In most of the sequences, our calibration algorithm provides accurate transformations for ground-truths with minimal impact on evaluation metrics. However, it is interesting to note that in some sequences, particularly V1_01, our algorithm exhibits some errors. This may stem from the inherent inaccuracies in the calibration results of the benchmarks, corroborated by a similar conclusion in [34]. To provide a more intuitive illustration, we render a virtual box on the images of the V1_01 sequence using two different transformed ground-truth trajectories. The first is provided by the EuRoC benchmark and the second is obtained based on our hand-eye calibration result. Inaccurate hand-eye calibration can lead to misalignment between the virtual and the real elements in this augmented reality (AR) application. As shown in the comparison results in Fig. 8, the virtual box is more consistent with the real world in our AR result, i.e., implying higher calibration accuracy.

Refer to caption
Figure 9: Experimental setup for evaluating the hand-eye calibration using a relative translation method. The VR headset remains stationary, and the tracking markers can be translated along the red arrow.
TABLE II: Performance Comparison of Different Calibration Methods on Self-Collected Datasets. We Count the Translation Error at Different Relative Translation Distances
Sequences
Error of
RS (m)
Error of
RC (m)
Error of
our_LC (m)
Error of
our_BE (m)
0.1m_easy 0.035 0.028 0.007 0.003
0.2m_easy 0.026 0.022 0.005 0.003
0.3m_easy 0.049 0.034 0.007 0.002
0.1m_difficult 0.041 0.045 0.007 0.004
0.2m_difficult 0.056 0.038 0.006 0.004
0.3m_difficult 0.062 0.047 0.010 0.004

In real-world system, evaluating hand-eye calibration is inherently difficult, as the ground-truth of offset is not available. To address this, we employ a relative approach using self-collected data. The experimental setup, as illustrated in Fig. 9, involved mounting a VR headset and tracking markers on the same object for motion. After obtaining a set of hand-eye trajectories, we translate the tracking markers in a specified direction to gather additional calibration data. We compute the norm of the relative translation using the extrinsics derived from the calibrations performed before and after the translation. Calibration error is then determined by comparing this norm against the high-precision, directly measured result.

Table II presents the errors obtained from different hand-eye calibration methods when the tracking markers are translated by 0.1 m, 0.2 m, and 0.3 m, respectively. We categorize the scenarios into easy (texture-rich and slow-moving) and difficult (texture-less and fast-moving) in the context of VIO, to test the robustness of the calibration algorithms. The experimental results show that our algorithm performs best in all sequences. Additionally, both our LC and BE achieve millimeter-level calibration accuracy and are less affected by trajectory error.

V CONCLUSIONS

In this letter, we propose an improved spatiotemporal hand-eye calibration algorithm for trajectory alignment in VO/VIO evaluation. Aiming to optimize for VO/VIO scenarios, we have designed multiple strategies based on screw theory to enhance both the accuracy and robustness of our proposed algorithm. The validation experiments demonstrate that our algorithm outperforms SOTA methods, exhibiting superior accuracy while effectively mitigating the influence of noise. Our method is well poised to be applied in the evaluation of modern VO/VIO algorithms. Nevertheless, our algorithm is less effective over an extended period. This limitation arises partly because our calibration algorithm processes the entire trajectory, leading to inefficiency with long sequences. Furthermore, our algorithm presupposes a constant time offset, an assumption unsuitable over a long time. In future work, we will focus on spatiotemporal hand-eye calibration tailored for long-duration trajectories. The aim is to develop an algorithm that can process extensive data efficiently and address the problem of time offset drift over an extended period.

References

  • [1] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, “The EE\mathrm{E}roman_EuRR\mathrm{R}roman_RoCC\mathrm{C}roman_C micro aerial vehicle datasets,” The International Journal of Robotics Research, vol. 35, no. 10, pp. 1157–1163, 2016.
  • [2] D. Schubert, T. Goll, N. Demmel, V. Usenko, J. Stückler, and D. Cremers, “The TUMVITUMVI\mathrm{TUM\ VI}roman_TUM roman_VI benchmark for evaluating visual-inertial odometry,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems.   IEEE, 2018, pp. 1680–1687.
  • [3] J. Li, B. Yang, D. Chen, N. Wang, G. Zhang, and H. Bao, “Survey and evaluation of monocular visual-inertial SLAMSLAM\mathrm{SLAM}roman_SLAM algorithms for augmented reality,” Virtual Reality & Intelligent Hardware, vol. 1, no. 4, pp. 386–410, 2019.
  • [4] S. Umeyama, “Least-squares estimation of transformation parameters between two point patterns,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 13, no. 4, pp. 376–380, 1991.
  • [5] J. Jiang, X. Luo, Q. Luo, L. Qiao, and M. Li, “An overview of hand-eye calibration,” The International Journal of Advanced Manufacturing Technology, vol. 119, no. 1-2, pp. 77–97, 2022.
  • [6] E. Pedrosa, M. Oliveira, N. Lau, and V. Santos, “A general approach to hand-eye calibration through the optimization of atomic transformations,” IEEE Transactions on Robotics, vol. 37, no. 5, pp. 1619–1633, 2021.
  • [7] K. Koide and E. Menegatti, “General hand-eye calibration based on reprojection error minimization,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1021–1028, 2019.
  • [8] I. Ali, O. Suominen, A. Gotchev, and E. R. Morales, “Methods for simultaneous robot-world-hand-eye calibration: A comparative study,” Sensors, vol. 19, no. 12, p. 2837, 2019.
  • [9] S. Sarabandi, J. M. Porta, and F. Thomas, “Hand-eye calibration made easy through a closed-form two-stage method,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 3679–3686, 2022.
  • [10] F. Furrer, M. Fehr, T. Novkovic, H. Sommer, I. Gilitschenski, and R. Siegwart, “Evaluation of combined time-offset estimation and hand-eye calibration on robotic datasets,” in Field and Service Robotics: Results of the 11th International Conference.   Springer, 2018, pp. 145–159.
  • [11] Y. C. Shiu and S. Ahmad, “Calibration of wrist-mounted robotic sensors by solving homogeneous transform equations of the form AXAX\mathrm{AX}roman_AX = XBXB\mathrm{XB}roman_XB,” IEEE Transactions on Robotics and Automation, vol. 5, no. 1, pp. 16–29, 1989.
  • [12] R. Y. Tsai and R. K. Lenz, “Real time versatile robotics hand/eye calibration using 3DD\mathrm{D}roman_D machine vision,” in 1988 IEEE International Conference on Robotics and Automation.   IEEE, 1988, pp. 554–561.
  • [13] C. C. Wang, “Extrinsic calibration of a vision sensor mounted on a robot,” IEEE Transactions on Robotics and Automation, vol. 8, no. 2, pp. 161–175, 1992.
  • [14] J. C. Chou and M. Kamel, “Finding the position and orientation of a sensor on a robot manipulator using quaternions,” The International Journal of Robotics Research, vol. 10, no. 3, pp. 240–254, 1991.
  • [15] F. C. Park and B. J. Martin, “Robot sensor calibration: Solving AXAX\mathrm{AX}roman_AX = XBXB\mathrm{XB}roman_XB on the EE\mathrm{E}roman_Euclidean group,” IEEE Transactions on Robotics and Automation, vol. 10, no. 5, pp. 717–721, 1994.
  • [16] R. Liang and J. Mao, “Hand-eye calibration with a new linear decomposition algorithm,” Journal of Zhejiang University-SCIENCE A, vol. 9, no. 10, pp. 1363–1368, 2008.
  • [17] H. Nguyen and Q. C. Pham, “On the covariance of XX\mathrm{X}roman_X in AXAX\mathrm{AX}roman_AX = XBXB\mathrm{XB}roman_XB,” IEEE Transactions on Robotics, vol. 34, no. 6, pp. 1651–1658, 2018.
  • [18] H. H. Chen, “A screw motion approach to uniqueness analysis of head-eye geometry,” in 1991 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.   IEEE Computer Society, 1991, pp. 145–146.
  • [19] N. Andreff, R. Horaud, and B. Espiau, “On-line hand-eye calibration,” in Second International Conference on 3-D Digital Imaging and Modeling.   IEEE, 1999, pp. 430–436.
  • [20] K. Daniilidis, “Hand-eye calibration using dual quaternions,” The International Journal of Robotics Research, vol. 18, no. 3, pp. 286–298, 1999.
  • [21] D. Condurache and A. Burlacu, “Orthogonal dual tensor method for solving the AXAX\mathrm{AX}roman_AX = XBXB\mathrm{XB}roman_XB sensor calibration problem,” Mechanism and Machine Theory, vol. 104, pp. 382–404, 2016.
  • [22] K. H. Strobl and G. Hirzinger, “Optimal hand-eye calibration,” in 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems.   IEEE, 2006, pp. 4647–4653.
  • [23] J. Heller, D. Henrion, and T. Pajdla, “Hand-eye and robot-world calibration by global polynomial optimization,” in 2014 IEEE International Conference on Robotics and Automation.   IEEE, 2014, pp. 3157–3164.
  • [24] Z. Zhao, “Simultaneous robot-world and hand-eye calibration by the alternative linear programming,” Pattern Recognition Letters, vol. 127, pp. 174–180, 2019.
  • [25] J. Schmidt, F. Vogt, and H. Niemann, “Robust hand-eye calibration of an endoscopic surgery robot using dual quaternions,” in Pattern Recognition: 25th DAGM Symposium.   Springer, 2003, pp. 548–556.
  • [26] J. Wu, Y. Sun, M. Wang, and M. Liu, “Hand-eye calibration: 4-DD\mathrm{D}roman_D procrustes analysis approach,” IEEE Transactions on Instrumentation and Measurement, vol. 69, no. 6, pp. 2966–2981, 2019.
  • [27] J. Kelly and G. S. Sukhatme, “A general framework for temporal calibration of multiple proprioceptive and exteroceptive sensors,” in Experimental Robotics: The 12th International Symposium on Experimental Robotics.   Springer, 2014, pp. 195–209.
  • [28] M. K. Ackerman, A. Cheng, B. Shiffman, E. Boctor, and G. Chirikjian, “Sensor calibration with unknown correspondence: Solving AXAX\mathrm{AX}roman_AX = XBXB\mathrm{XB}roman_XB using EE\mathrm{E}roman_Euclidean-group invariants,” in 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems.   IEEE, 2013, pp. 1308–1313.
  • [29] H. Li, Q. Ma, T. Wang, and G. S. Chirikjian, “Simultaneous hand-eye and robot-world calibration by solving the AXAX\mathrm{AX}roman_AX = YBYB\mathrm{YB}roman_YB problem without correspondence,” IEEE Robotics and Automation Letters, vol. 1, no. 1, pp. 145–152, 2015.
  • [30] K. Pachtrachai, F. Vasconcelos, G. Dwyer, V. Pawar, S. Hailes, and D. Stoyanov, “Chess—CC\mathrm{C}roman_Calibrating the hand-eye matrix with screw constraints and synchronization,” IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 2000–2007, 2018.
  • [31] P. Furgale, T. D. Barfoot, and G. Sibley, “Continuous-time batch estimation using temporal basis functions,” in 2012 IEEE International Conference on Robotics and Automation.   IEEE, 2012, pp. 2088–2095.
  • [32] J. Rehder, R. Siegwart, and P. Furgale, “A general approach to spatiotemporal calibration in multisensor systems,” IEEE Transactions on Robotics, vol. 32, no. 2, pp. 383–398, 2016.
  • [33] H. Sommer, J. R. Forbes, R. Siegwart, and P. Furgale, “Continuous-time estimation of attitude using BB\mathrm{B}roman_B-splines on LL\mathrm{L}roman_Lie groups,” Journal of Guidance, Control, and Dynamics, vol. 39, no. 2, pp. 242–261, 2016.
  • [34] P. Geneva, K. Eckenhoff, W. Lee, Y. Yang, and G. Huang, “OpenVINSVINS\mathrm{VINS}roman_VINS: A research platform for visual-inertial estimation,” in 2020 IEEE International Conference on Robotics and Automation.   IEEE, 2020, pp. 4666–4672.