The tracking thread runs in real time at a frequency higher than the camera frame rate. Camera data and wheel speed data are soft-synchronized, with the first frame serving as the temporal origin; the origin of the world coordinate system coincides with the origin of the robot coordinate system, making the first image frame the starting keyframe. First, the wheel speed measurements, which arrive at a rate higher than the frame rate, are pre-integrated to estimate the initial pose of the mobile robot. Each new image frame undergoes ORB feature extraction, and RANSAC [20,21] is used to remove outliers from the feature matches. The thread matches the features in the current frame against the reference keyframe and estimates initial map points using the feature pairs and the pre-integrated information between image frames. If the current frame meets the criteria for becoming a keyframe, it is added to the local map as a new reference keyframe.
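The outlier rejection step can be sketched with a minimal RANSAC example. This is an illustrative NumPy sketch only: it fits a pure 2D translation between matched points, whereas an actual ORB front end would fit an epipolar or homography model, and the function name, iteration count, and inlier threshold are assumptions made here for clarity.

```python
import numpy as np

def ransac_translation(pts_a, pts_b, iters=200, thresh=2.0, seed=0):
    """Estimate a 2D translation between matched points with RANSAC,
    returning the model and a boolean inlier mask.

    Simplified stand-in for outlier rejection on feature matches; real
    systems fit a fundamental matrix or homography instead."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(pts_a), dtype=bool)
    for _ in range(iters):
        i = rng.integers(len(pts_a))            # minimal sample: 1 match
        t = pts_b[i] - pts_a[i]                 # candidate translation
        residuals = np.linalg.norm(pts_b - (pts_a + t), axis=1)
        inliers = residuals < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # refit the model on all inliers of the best hypothesis
    t = (pts_b[best_inliers] - pts_a[best_inliers]).mean(axis=0)
    return t, best_inliers
```

The same sample-score-refit loop applies unchanged when the minimal model is, e.g., a fundamental matrix estimated from eight matches.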
The global optimization thread maintains all keyframes and map points within the entire map. Keyframes that have undergone local optimization enter the global optimization thread, where similar keyframes are identified from the entire map using the “bag of words” approach [23] as loop closure candidate frames. Once loop closure is confirmed, the current keyframe, the loop closure keyframe, and their covisible map points form a subgraph. Optimization of the poses of the two keyframes and map points leads to an inter-frame loop closure constraint between the poses of the two frames. Visual constraints, wheel speed pre-integration constraints, and loop closure constraints are then incorporated into the global map to optimize the pose of the mobile robot and global map points, thereby eliminating accumulated errors caused by local optimization.
2.1. Coordinate System and Parameter Representation
As shown in Figure 2, the perception problem for wheeled mobile robots primarily involves three coordinate systems: the world coordinate system w, the robot coordinate system b, and the camera coordinate system c. The robot coordinate system can be any frame of reference fixed to the robot body, but here it is defined as the odometry coordinate system: its origin is located at the midpoint of the line connecting the wheel axes, the positive x-axis points in the forward direction of the robot, the y-axis is aligned with the wheel axis line and positive to the left, and the positive z-axis points vertically upward from the origin. The camera coordinate system is centered at the camera’s optical center, with the positive z-axis pointing outward from the camera lens and the x and y axes aligned with the u and v directions of the pixel plane, respectively.
The Lie groups SO(3) and SE(3) are used to represent rotation R and pose T in three-dimensional space, respectively:
Since the Lie group space does not support addition, small iterative updates cannot be applied directly to the optimized variables; the Lie algebra is therefore introduced so that rotations and poses can be mapped as follows:
In Equation (3), the Lie algebra of the rotation matrix corresponds to the rotation vector, which can also be obtained using the Rodrigues formula:
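For illustration, the Rodrigues formula can be implemented directly from the rotation vector; this is a minimal NumPy sketch, and the function names are ours:

```python
import numpy as np

def hat(w):
    """Skew-symmetric matrix of a 3-vector (the ^ operator on so(3))."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_so3(phi):
    """Rodrigues formula: map a rotation vector phi in so(3) to R in SO(3).
    R = cos(t) I + (1 - cos(t)) a a^T + sin(t) a^, with t = |phi|, a = phi/t."""
    theta = np.linalg.norm(phi)
    if theta < 1e-12:
        return np.eye(3)                 # small-angle limit
    a = phi / theta                      # unit rotation axis
    return (np.cos(theta) * np.eye(3)
            + (1.0 - np.cos(theta)) * np.outer(a, a)
            + np.sin(theta) * hat(a))
```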
The left Jacobian is expressed as follows:
Similarly, the pose on SE(2) corresponding to a three-dimensional vector can be expressed as follows:
where
The correspondence between the Lie algebras se(3) and se(2) can be expressed as follows:
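This correspondence can be made concrete by lifting a planar SE(2) pose to a homogeneous SE(3) transform, under the usual assumption that the motion plane is normal to the z-axis (a minimal sketch; the function name is ours):

```python
import numpy as np

def se2_to_SE3(x, y, theta):
    """Lift a planar pose (x, y, theta) on SE(2) to a 4x4 homogeneous
    transform on SE(3): rotation about z, translation in the x-y plane."""
    c, s = np.cos(theta), np.sin(theta)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]   # planar rotation block
    T[:2, 3] = [x, y]               # planar translation
    return T
```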
In general, a transformation matrix represents the transformation from coordinate system B to coordinate system A and consists of a rotation matrix and a translation vector. The pose of a ground mobile robot is represented on SE(2), while the pose of the camera is represented on SE(3). The extrinsic calibration transformation matrix between the robot and camera coordinate systems is likewise represented on SE(3). A 3D map point is expressed in the world coordinate system, and the same point observed in the camera coordinate system is related to it through the following transformation:
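Under the stated convention that a transformation matrix maps coordinates from the second (subscript) frame into the first, the world-to-camera chain can be sketched as follows; the helper names are ours:

```python
import numpy as np

def transform(T, p):
    """Apply a 4x4 homogeneous transform to a 3D point."""
    return (T @ np.append(p, 1.0))[:3]

def world_to_camera(T_bw, T_cb, p_w):
    """Map a world point into the camera frame through the robot frame:
    first world -> robot (T_bw), then robot -> camera (T_cb)."""
    return transform(T_cb @ T_bw, p_w)
```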
2.2. Visual SLAM Perception Algorithm with Wheel Speed Fusion
Since optimization-based SLAM back ends can iteratively achieve more accurate state estimates, the perception problem of a wheeled mobile robot moving on an indoor plane is modeled here as a maximum a posteriori (MAP) estimation problem. Given the initial pose of the wheeled mobile robot, the camera intrinsic matrix K, and the extrinsic calibration matrix, and considering a series of continuous input images (i = 1, 2, ⋯) and wheel odometry speed measurements (j = 1, 2, ⋯), the goal is to estimate the series of poses (i = 1, 2, ⋯) of the wheeled mobile robot on SE(2) and the corresponding map points (k = 1, 2, ⋯) throughout the observation process. Next, we construct visual observation constraints, wheel speed pre-integration constraints, and loop closure observation constraints to form the final MAP problem.
2.3. Visual Observation Constraint
As shown in Figure 3, the camera uses a pinhole model to map a 3D point in the world coordinate system to 2D coordinates on the pixel plane. In the camera coordinate system, the relationship between a 3D point and the corresponding 2D point on the pixel plane is as follows:
In the formula, z is the z-axis component of the 3D point; the remaining parameters are the camera focal lengths along the x and y axes and the coordinates of the image center point on the pixel plane.
This can be abbreviated as the projection function:
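The projection function can be sketched directly from the intrinsic parameters; this is a minimal illustration, the function name is ours, and the intrinsic values used in the test are arbitrary:

```python
import numpy as np

def project(K, p_c):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates:
    u = fx * x / z + cx,  v = fy * y / z + cy."""
    x, y, z = p_c
    u = K[0, 0] * x / z + K[0, 2]
    v = K[1, 1] * y / z + K[1, 2]
    return np.array([u, v])
```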
Given that visual measurements may be affected by noise, under the assumption of a Gaussian model, the measurement equation can be expressed as follows:
where the measurement noise is assumed to follow a zero-mean Gaussian distribution.
To facilitate the nonlinear optimization in the SLAM back end, we differentiate the previous equation with respect to the 3D point in the camera coordinate system to obtain the Jacobian matrix of the pixel observation with respect to the camera-frame point [24]:
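The resulting 2×3 Jacobian has the standard pinhole form; the following minimal sketch writes it out explicitly (the function name is ours, and the block layout shown is the conventional one, which may differ from the paper’s ordering):

```python
import numpy as np

def projection_jacobian(K, p_c):
    """Analytical Jacobian of the pinhole projection with respect to the
    camera-frame point [x, y, z]:
        [[fx/z, 0,    -fx*x/z^2],
         [0,    fy/z, -fy*y/z^2]]"""
    x, y, z = p_c
    fx, fy = K[0, 0], K[1, 1]
    return np.array([[fx / z, 0.0, -fx * x / z**2],
                     [0.0, fy / z, -fy * y / z**2]])
```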
When the i-th keyframe observes a landmark point and the camera extrinsic rotation matrix and translation vector are known, the observation model in the camera coordinate system is given by
In the equation, the rotation and translation terms are the SE(3) expressions corresponding to the robot’s pose parameterized on SE(2), while the projection term denotes the camera projection equation.
Unlike the unconstrained state on SE(3), a mobile robot undergoing planar motion is parameterized as a pose on SE(2). Given the minor vibrations that occur during the motion of the robot [15], perturbations on SE(3) are added to the SE(2) pose of the robot to account for the existing visual measurement noise. With a translational perturbation and a rotational perturbation applied to the robot’s pose, the actual pose of the robot should be represented as
Therefore, the observation equation in Equation (16) is modified as follows:
From Equation (19), it is evident that the pixel observation is influenced by three noise sources: the visual measurement noise and the rotational and translational perturbations. To calculate the covariance of the camera observation, a first-order Taylor expansion is performed:
The Jacobian matrices of the camera observation with respect to the rotational and translational perturbations are given by
The Jacobian matrix of the camera projection equation appearing here is the one shown in Equation (15).
Therefore, the covariance propagation equation for camera observation information under perturbation can be derived as follows:
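The propagation rule is the standard first-order transform Σ′ = JΣJᵀ, summed over independent noise sources; a minimal sketch (function names ours):

```python
import numpy as np

def propagate_covariance(J, Sigma):
    """First-order covariance propagation: if y ~ f(x0) + J (x - x0),
    then Cov[y] = J Cov[x] J^T."""
    return J @ Sigma @ J.T

def combine_noise_sources(jacobians, covariances):
    """Sum the linearized contributions of independent noise sources,
    e.g. pixel noise plus the rotational and translational perturbations."""
    total = None
    for J, S in zip(jacobians, covariances):
        term = propagate_covariance(J, S)
        total = term if total is None else total + term
    return total
```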
Subsequently, under the aforementioned constraints, the camera’s reprojection observation error equation can be constructed as follows:
The Jacobian matrices of the reprojection observation error with respect to the rotational and translational components of the robot’s pose are, respectively,
The Jacobian matrix of the reprojection error with respect to the landmark point is:
At this point, we have derived the camera reprojection observation error, its corresponding covariance, and the Jacobian matrix of the error with respect to the variables to be optimized.
2.4. Wheel Speed Pre-Integration Constraint
In wheeled mobile robots, encoder information is typically generated at a higher frequency than visual data, so each frame of visual data corresponds to multiple encoder readings. Consequently, updating the camera pose at each iteration of back-end optimization alongside the robot poses represented by the multiple sets of encoder data can either lead to redundant computation or degrade the accuracy of perception. Therefore, the concept of pre-integration is employed [25], as shown in Figure 4, to construct constraints between frames of visual information using pre-integrated encoder data.
First, the linear velocity v and the angular velocity of a wheeled mobile robot can be derived from the left and right wheel speeds together with the robot’s intrinsic parameters.
In the equations, the linear and angular velocities are subject to Gaussian noise. To mitigate errors from manual measurement, accurate intrinsic parameters of the robot should be obtained through joint calibration with the camera [26].
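For a differential-drive platform, this relationship reduces to the familiar kinematic equations; the following is a minimal sketch, with the wheel base as the relevant intrinsic parameter (the function name is ours, and the example values are arbitrary):

```python
import numpy as np

def wheel_to_body_velocity(v_left, v_right, wheel_base):
    """Differential-drive kinematics: the body linear velocity is the mean
    of the wheel speeds; the angular velocity is their difference divided
    by the wheel base."""
    v = 0.5 * (v_right + v_left)
    omega = (v_right - v_left) / wheel_base
    return v, omega
```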
Then, the discrete integration model for the wheel speed at two consecutive time instants can be derived, with a sampling interval of T:
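A first-order Euler discretization of this model can be sketched as follows (the function name is ours; the paper’s exact discretization may differ in how the heading is sampled within the interval):

```python
import numpy as np

def integrate_odometry(pose, v, omega, dt):
    """One Euler step of the planar unicycle model:
    x += v*cos(theta)*dt, y += v*sin(theta)*dt, theta += omega*dt."""
    x, y, theta = pose
    return (x + v * np.cos(theta) * dt,
            y + v * np.sin(theta) * dt,
            theta + omega * dt)
```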
The integration model above can then be expressed in a compact form as
Performing a Taylor expansion on the equation yields the error state equation as follows:
where the coefficient matrices are defined by the expansion above.
The error state equation between two consecutive wheel speed instants can then be obtained as follows:
In this way, the covariance propagation equation for wheel speed observations at two consecutive time instants can be derived [27]:
The initial value of the covariance is generally set to a zero matrix, and the noise covariance is derived from the measurement noise of the encoder wheel speed:
Assuming there are n instances of wheel speed information between two consecutive keyframes, then, as indicated by Equation (33), the wheel speed pre-integration observation between the two keyframes is as follows:
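The accumulation of the n samples into a single relative motion, expressed in the frame of the earlier keyframe and hence independent of the absolute pose, can be sketched as follows (a simplified first-order version; the function name is ours):

```python
import numpy as np

def preintegrate_se2(measurements):
    """Accumulate n wheel-speed samples (v, omega, dt) into one relative
    SE(2) motion (dx, dy, dtheta) expressed in the frame of the previous
    keyframe, so the back end needs one constraint per keyframe pair."""
    dx = dy = dtheta = 0.0
    for v, omega, dt in measurements:
        dx += v * np.cos(dtheta) * dt   # advance in the keyframe's frame
        dy += v * np.sin(dtheta) * dt
        dtheta += omega * dt
    return dx, dy, dtheta
```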
Accordingly, the covariance matrix of the pre-integrated quantities between two visual keyframes can be obtained recursively from Equation (37).
Then, taking the pre-integrated quantities derived in Equation (39) as the observation, the wheel speed measurement error equation can be constructed as follows:
By dividing the residuals into rotational and translational components, we obtain
The Jacobian matrices of the wheel speed observation error with respect to the robot poses at two different time points are as follows:
The Jacobian matrix of the translational component of the error with respect to the rotational component of the i-th robot state is calculated using the right perturbation approach. Consequently, the observation error of the wheel speed pre-integration, its associated covariance, and the Jacobian matrix of the error with respect to the variables to be optimized have been obtained.
2.5. Loop Closure Co-Visibility Observation Constraint
In addition to the visual observation constraints and wheel speed pre-integration observation constraints introduced in the previous sections, there is also a loop closure constraint in the robot’s movement. This constraint is important for eliminating cumulative errors that arise during the robot’s state estimation process. When the robot’s pose at a certain time forms a loop closure with the pose at a previous time, there will be a large number of co-visible landmarks between the two frames. This allows the system to construct a loop closure constraint to correct the robot’s pose.
By taking the current frame, the loop closure frame, and the co-visible landmarks as nodes, and using the observation errors of the landmarks in the camera coordinate system as edges, a subgraph optimization problem can be constructed. For instance, consider the observations of the k-th landmark from the current frame and from the loop closure frame; the measurement error can be expressed as
The Jacobian matrix of the measurement error with respect to the optimization variable X is
By applying a first-order Taylor expansion, Equation (44) can be expressed as
Therefore, the error function to be optimized is
After the iteration converges, the optimized poses of the two frames are obtained.
In the next step, to obtain the covariance of the relative observation, we need to calculate the covariance of each pose individually. Under the assumption of a Gaussian distribution, let the current pose to be optimized and the loop closure pose be the pose variables, and let the shared 3D point be denoted as P. Their joint probability distribution can be represented as the product of a marginal probability and a conditional probability [28], expressed as follows:
After the iterative optimization, the Hessian matrix used to solve the optimization problem on the graph can be obtained; this matrix is denoted as H:
The matrix block in the top left corner corresponds to the current frame and loop closure frame, while the matrix block in the bottom right corner corresponds to the landmark points.
For the likelihood distribution, the information matrix is the expected Hessian matrix of the negative log-likelihood problem [29]. Given the information matrix of the joint distribution, the information matrix of the marginal probability can be derived through decomposition [30]. By marginalizing the landmark points using the Schur complement, the information matrix of the marginal probability can be obtained as follows:
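The Schur complement step can be sketched on the block-partitioned Hessian, assuming as above that the pose blocks occupy the leading rows and columns (the function name is ours):

```python
import numpy as np

def marginalize_landmarks(H, pose_dim):
    """Marginalize the landmark blocks out of the full Hessian via the
    Schur complement, H_pp - H_pl H_ll^{-1} H_lp, leaving the information
    matrix of the keyframe poses alone."""
    H_pp = H[:pose_dim, :pose_dim]     # pose-pose block
    H_pl = H[:pose_dim, pose_dim:]     # pose-landmark block
    H_ll = H[pose_dim:, pose_dim:]     # landmark-landmark block
    return H_pp - H_pl @ np.linalg.solve(H_ll, H_pl.T)
```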
Next, we rewrite this information matrix in the following form:
Assuming the current frame pose and the loop closure frame pose are mutually independent, their information matrices correspond to the respective blocks. Since the covariance matrix is the inverse of the information matrix, the covariance matrices of the current frame and the loop closure frame can be obtained as follows:
Consequently, the perturbed poses of the current frame and the loop closure frame can be represented as follows:
Here, small left and right perturbations are applied to the two poses. The left and right perturbations are equivalent; we choose the right perturbation for convenience in the subsequent calculations.
Therefore, the relative observation between the current frame and the loop closure frame is expressed as follows:
Introducing the relative observation perturbation, the perturbation scheme leads to the following equation:
The covariance of the relative observation (retaining terms up to second order) can then be derived using the Baker–Campbell–Hausdorff (BCH) formula:
where the remaining terms follow from the BCH expansion.
Therefore, the observation error between the loop closure frame and the current frame can be expressed as
Similar to Equation (40), the transformation can be expressed in terms of the rotation and translation components on SE(2) as follows:
The corresponding Jacobian matrices are similar to those in Equations (42) and (43) and are therefore not repeated here. Thus, the loop closure co-visibility observation error, its covariance matrix, and the Jacobian matrices with respect to the optimization variables have been obtained.
2.6. Formulation of the Maximum A Posteriori Estimation Problem
The essence of robot state estimation is maximum a posteriori estimation given known observations; when unknown prior information is ignored, this can be converted into maximum likelihood estimation. Under the assumption that the probability density functions of the state variables follow Gaussian distributions, the problem can be further transformed into a least squares estimation problem through a negative logarithmic transformation. Thus, by treating the mobile robot poses and map points as the state variables, and based on the aforementioned reprojection error in Equation (25), the wheel speed pre-integration error in Equation (40), and the loop closure co-visibility error in Equation (62), a nonlinear error function can be constructed as follows:
In the equation, the loop closure indicator takes a value of 0 or 1; it is set to 1 when loop closure and global optimization are required.
The optimal state vector is obtained by minimizing the objective function:
The error function can be linearized at the operating point using a second-order Taylor expansion, which yields
To minimize the objective function, it is differentiated with respect to the state perturbation and set to zero. This yields the optimal perturbation of the state vector as the solution to the following linear system:
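This is the standard Gauss–Newton normal equation; a minimal sketch of one solve, with the information matrix as an optional weighting (the function name is ours):

```python
import numpy as np

def gauss_newton_step(J, r, W=None):
    """Solve the normal equations J^T W J dx = -J^T W r for the optimal
    state perturbation; W is the information (inverse covariance) matrix
    weighting the residuals (identity if omitted)."""
    if W is None:
        W = np.eye(len(r))
    H = J.T @ W @ J          # Gauss-Newton approximation of the Hessian
    b = J.T @ W @ r          # gradient of the quadratic model
    return np.linalg.solve(H, -b)
```

For a linear residual the single step lands exactly on the least squares optimum; for the nonlinear SLAM cost it is iterated until convergence.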
The optimal solution for the state variables is the best estimate of the pose of the wheeled mobile robot and the observed 3D map points:
In this context, ⊞ represents the generalized addition operator used to apply the perturbation on the manifold.