1. Introduction
Tunnels are an essential component of modern transportation and urban infrastructure. Their environmental characteristics, such as low lighting, humidity, and dust, pose significant challenges for fire rescue. In tunnel applications, typical sensors such as LiDAR and optical cameras offer dense point clouds and rich semantic information, but they are easily hindered by environmental factors such as smoke and dust. Under such adverse conditions, these sensors cannot obtain continuous data, making it difficult to perceive the surrounding environment robustly. Millimeter-wave radar can overcome constraints posed by conditions such as rain and snow and can collect continuous data even in harsh environments, but it suffers from sparse point clouds and weak semantic information. Existing rail robots equipped with millimeter-wave radar and optical cameras move in a straight line along the tracks at a fixed observation angle. They use the millimeter-wave radar to gather continuous point-cloud data and the optical cameras to capture intermittent geometric features. Using millimeter-wave radar odometry, they perceive the environment and estimate their own position, but their trajectories struggle to form closed loops [1].
Existing radar-visual fusion SLAM systems typically integrate both a front end and a back end. Current SLAM front-end frameworks mainly fall into three categories: feature-matching-based, correlation-based, and point-cloud-registration-based methods. Representative front-end pose estimation algorithms such as CSM [2], as cited in [3], can estimate the sensor's own pose and map the surrounding environment. However, owing to radar observation errors and coordinate transformation errors during pose estimation, errors accumulate over prolonged measurement periods, causing the pose trajectory to drift gradually in the direction perpendicular to the range. Therefore, back-end optimization is necessary to correct the long-term cumulative errors produced by the front end.
In back-end optimization, most optimization-oriented SLAM techniques consist of two subsystems. The first subsystem uses sensor data to find matches between newly added observations and the map, establishing the constraints of the problem. The second subsystem computes or corrects the sensor's pose (as well as previous poses) and the map according to these constraints to achieve a consistent whole [4]. Existing fusion methods, such as those combining LiDAR and cameras or LiDAR and an IMU, largely rely on the loop closure detection of back-end optimization to reduce cumulative errors. These methods use the geometric features of LiDAR point clouds or the similarity of visual features to detect loop closures and correct poses. A LiDAR-visual SLAM back end is introduced in [5], which uses both LiDAR geometric features and visual features to achieve loop closure detection. It constructs a Bag-of-Words (BoW) model that describes visual similarity to aid loop closure detection, followed by point-cloud rematching to verify the loop closure and complete graph optimization. The back end of DVLSLAM [6] optimizes the pose graph using graph optimization techniques, detecting whether there is a loop between the current frame and previous keyframes; if a loop is detected, it is added as an edge to the pose graph. LIV-LAM [7] integrates LiDAR-based odometry with monocular-camera-based object detection and associates it with loop closure detection through pose graph optimization. LC-LVF [8] proposes a new error function that takes both scan and image data as constraints for pose graph optimization, uses g2o for further optimization, and ultimately employs a Bag-of-Words-based method for revisited-place detection. These loop-closure-based methods require the sensor's trajectory to close a loop and relatively continuous data, so they are difficult to apply in scenes with a linear motion trajectory and discontinuous visual geometry.
In the tunnel scenes addressed in this paper, track robots make only a single pass through the fire scene, which is not conducive to loop closure correction. Furthermore, cameras struggle to capture continuous geometric features because of obstructions such as smoke and dust. This paper therefore introduces a new back-end correction framework that incorporates discontinuous visual line features. These features are matched with line features in the occupancy grid map built by the millimeter-wave radar odometry, and the observed measurements are corrected to reduce the cumulative error of front-end pose estimation. However, given the sparsity of the millimeter-wave radar point cloud, the precision of matching line features in the occupancy grid map with visual line features is low, leading to mismatches and redundancies. Moreover, the radar is prone to producing clutter points during observation. These points often possess location information similar to that of the real point cloud, making it difficult to detect and identify radar line features accurately, which degrades the precision of their match with visual line features.
In response to the aforementioned challenges, this paper takes advantage of the stable information that line features provide in low-texture environments and their consistency across different viewpoints, and proposes a CSM pose estimation method that integrates visual and radar line features. The main contributions of this paper are as follows:
Addressing the cumulative error of a single radar odometer during extended measurement periods, this paper proposes a pose estimation framework that fuses visual and radar line features. By matching visual line features with global grid map line features, an error constraint function is constructed, and the Levenberg–Marquardt (LM) algorithm is employed to solve for the optimized pose.
Considering the significant matching error between the front-end global grid map and visual line features, making it challenging to construct a visual line feature set in the global coordinate system, this paper proposes an adaptive Hough transform line detection method based on prior radar grid map projections. Using the principle of homography, the global grid is mapped onto optical images, endowing the optical images with prior map distance information. Based on this prior information, we identify “areas of interest”, excluding non-matching Hough transform line detection results to enhance line feature matching accuracy.
Given the interference of radar clutter points with line feature points, traditional Gaussian mixture models (GMMs) that cluster on position information alone find it difficult to isolate line features from the global grid map, making it challenging to match them with the visual line feature set. Addressing this issue, this paper proposes an RCS-based GMM clustering method (R-GMM), enhancing the matching accuracy between visual line features and global grid map line features and thereby obtaining the parameters required for pose correction.
The second section of this paper introduces the coordinate system definition, related work, and a pose estimation framework that integrates visual and radar line features. The third section describes the principle of the adaptive Hough transform line detection method based on the projection of the radar’s prior grid map, establishing the visual line feature set. The fourth section presents the principle of the GMM clustering model based on RCS. The fifth section is dedicated to experiments and result analysis, where several comparative experiments demonstrate the effectiveness of the method. The final section offers a summary.
3. Adaptive Hough Transform Line Detection Method Based on Prior Radar Grid Map Projection
Commonly used line detection algorithms include the Hough transform [16], LSWMS [17], Edline [18], and LSD [19]. Most of these methods rely on image gradients or edge information, and different methods have different sensitivities to lines. The Hough transform is widely used in image processing and is particularly adept at detecting incomplete or noise-corrupted geometric shapes, with line detection being the most common case. Its basic idea is to transform the image-space representation into a parameter space; for line detection, this parameter space is typically polar coordinates. Points in the parameter space where the number of intersecting curves exceeds a certain threshold correspond to lines in the image. Through the Hough transform, lines can still be detected even if they are broken or partially occluded by other objects, making it suitable for tunnel scenes where continuous data are hard to obtain.
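To make the detection step concrete, the following minimal sketch (assuming OpenCV is available; the file name and thresholds are placeholders, not values from this paper) runs a probabilistic Hough transform on an edge map. The adaptive variant proposed later additionally restricts the search to projected "regions of interest".

```python
import cv2
import numpy as np

# Hypothetical tunnel frame; any grayscale image works for this sketch.
img = cv2.imread("tunnel_frame.png", cv2.IMREAD_GRAYSCALE)

# Edge map feeds the Hough accumulator (the rho/theta parameter space).
edges = cv2.Canny(img, 50, 150)

# Probabilistic Hough transform: accumulator cells whose vote count
# exceeds `threshold` are reported as line segments (x1, y1, x2, y2).
lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180,
                        threshold=60, minLineLength=40, maxLineGap=10)

if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        print(f"segment ({x1},{y1}) -> ({x2},{y2})")
```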
Because the LiDAR point cloud is dense, its point and line features are very distinctive; mapping the point cloud onto optical images yields a high match with detection algorithms sensitive to straight lines [20,21], such as the LSD and Edline algorithms. The millimeter-wave radar point cloud, however, is sparse, resulting in a low match between the global grid map line features and the results of Hough transform line detection, which easily causes mismatches and redundancies. Thus, this paper proposes an adaptive Hough transform line detection method based on the prior radar grid map projection, establishing a visual line feature set and excluding the interference of non-matching lines.
First, a homography transformation is used to construct the mapping between grid coordinates and pixel coordinates. The global grid map is then mapped onto the optical image corresponding to the submap frame as a prior distance constraint, and the "regions of interest" for Hough transform line detection are determined. This realizes adaptive Hough transform line detection and increases the match between the global grid map line features and the visual line features.
Figure 6 shows the flowchart of the A-Hough method:
3.1. Estimation of Mapping Parameters between Local and Global Coordinate Systems
The mapping parameters between the local and global coordinate systems are obtained through the front-end rigid transformation. Let the pose of the first frame be $\text{pose} = (t_x, t_y, \theta)$, where $t_x$ is the translation along the X-axis of the global coordinate system, $t_y$ is the translation along the Y-axis of the global coordinate system, and $\theta$ is the rotation angle. Let the rotation relationship between the point-cloud coordinates in the local coordinate system and the grid coordinates in the global coordinate system be $R$. The mapping relationship between the two coordinate systems is shown in Equation (2).
From this, we can determine the mapping relationship between the global and local coordinate systems. In other words, after the point-cloud coordinates of the first frame are captured in the local coordinate system, the SLAM front end estimates the pose of that initial frame, and the transformation between the global and local coordinate systems is established through a rigid transformation. As the number of frames increases, the local coordinate system of each subsequent frame is mapped to the global coordinate system according to its motion relationship; the frame number links the local and global coordinate systems.
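As a minimal numpy sketch of this rigid mapping, assuming Equation (2) has the standard 2D form $p_g = R\,p_l + t$ with $R$ built from $\theta$ (function and variable names are illustrative):

```python
import numpy as np

def local_to_global(points_local, tx, ty, theta):
    """Map Nx2 local point-cloud coordinates to the global frame
    using the frame pose (tx, ty, theta) estimated by the front end."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    t = np.array([tx, ty])
    return points_local @ R.T + t

# Example: first-frame points expressed in the global coordinate system.
pts = np.array([[1.0, 0.5], [2.0, -0.3]])
print(local_to_global(pts, tx=0.2, ty=0.1, theta=np.deg2rad(5)))
```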
3.2. Estimation of Mapping Parameters between Local Coordinate System and Pixel Coordinate System
The mapping parameters between the local coordinate system and the pixel coordinate system are obtained through the calibration of the millimeter-wave radar and the optical camera. Suppose that when the millimeter-wave radar performs target detection, the same target is observed as a point in the radar's local coordinate system and as a pixel in the image captured by the camera. The target detected by the radar lies on the radar acquisition plane, and the target detected by the camera lies on the pixel plane; thus an object simultaneously detected by both sensors has one position in the local coordinate system and a corresponding position in the pixel coordinate system. Based on the principle of homography, the transformation between the radar and the camera is the mapping between a target point in the local coordinate system and its corresponding point on the pixel plane. According to the principle of homography, the relationship between the local coordinate system and the pixel coordinate system is shown in Equation (3).
From this, a homography matrix representing the mapping relationship between the local coordinate system and the pixel coordinate system can be obtained. Here, S is a constant scaling factor that ensures equality on both sides of the equation.
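A sketch of how such a homography can be estimated from calibration data, assuming at least four radar/camera point correspondences on the two planes are available (cv2.findHomography is one common choice; the coordinate values below are placeholders):

```python
import cv2
import numpy as np

# Placeholder calibration pairs: radar targets in the local (metric) plane
# and the same targets located in the image (pixel plane).
radar_pts = np.array([[2.0, 0.5], [4.0, 0.5], [4.0, -0.5], [2.0, -0.5]],
                     dtype=np.float32)
pixel_pts = np.array([[320, 240], [360, 230], [365, 260], [325, 270]],
                     dtype=np.float32)

# Homography H maps local-plane coordinates to pixel coordinates up to a
# scale factor S, i.e. S * [u, v, 1]^T = H * [x, y, 1]^T.
H, _ = cv2.findHomography(radar_pts, pixel_pts)

def project_to_pixels(pt_local, H):
    """Apply the homography and divide out the scale factor."""
    x, y = pt_local
    u, v, s = H @ np.array([x, y, 1.0])
    return u / s, v / s

print(project_to_pixels((3.0, 0.0), H))
```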
3.3. Estimation of Mapping Parameters between the Global Coordinate System and the Pixel Coordinate System
The mapping parameters between the global coordinate system and the pixel coordinate system are obtained through the principle of homography transformation. The target point coordinates in the global coordinate system are extracted, and the extracted coordinates are paired with the target points in the local coordinate system to form a set of matching points. Through the homography transformation relationship between the local coordinate system and the pixel coordinate system, the pixel coordinates of the target points can be derived. Then, using the local coordinate system as a bridge, the mapping relationship between the global coordinate system and the pixel coordinate system is determined based on the principle of homography transformation, as shown in Equation (4).
Through the aforementioned steps, a mapping relationship between the global coordinate system and the pixel coordinate system can be constructed. This allows for the projection of the global grid map onto optical images, endowing the optical images with distance information.
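The chained mapping can be sketched as below: the front-end pose gives the global-to-local rigid transform, and the calibration homography then takes local coordinates to pixels. The names H_local_to_pixel and the pose values are assumed to come from the previous steps; this is a sketch, not the paper's exact Equation (4).

```python
import numpy as np

def global_to_pixel_homography(H_local_to_pixel, tx, ty, theta):
    """Compose the pixel homography with the inverse of the local-to-global
    rigid transform, yielding a single 3x3 mapping from global grid
    coordinates to pixel coordinates."""
    c, s = np.cos(theta), np.sin(theta)
    T_local_to_global = np.array([[c, -s, tx],
                                  [s,  c, ty],
                                  [0,  0,  1]])
    T_global_to_local = np.linalg.inv(T_local_to_global)
    return H_local_to_pixel @ T_global_to_local

def project_grid_cells(cells_xy, H_gp):
    """Project Nx2 occupied-cell coordinates (global frame) into the image."""
    ones = np.ones((cells_xy.shape[0], 1))
    uvw = (H_gp @ np.hstack([cells_xy, ones]).T).T
    return uvw[:, :2] / uvw[:, 2:3]

# Example with an identity homography and a small pose (illustrative only).
H_gp = global_to_pixel_homography(np.eye(3), tx=0.2, ty=0.1, theta=0.05)
print(project_grid_cells(np.array([[1.0, 0.0], [2.0, 0.0]]), H_gp))
```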
In a given frame, the pixel coordinates in the optical image mapped from the line features in the global grid map are denoted as $(u_n, v_n)$, where $u_n$ is the horizontal coordinate of the $n$th pixel point converted from grid coordinates into the pixel coordinate system, and $v_n$ is the corresponding vertical coordinate. By traversing all the mapped line feature points, the maximum and minimum coordinates along both the x-axis and the y-axis are obtained, as shown in Equations (5)–(8). Here, $u_{\min}$ and $u_{\max}$ are the minimum and maximum horizontal coordinates of the global grid map line feature points mapped to pixel points in the optical image, and $v_{\min}$ and $v_{\max}$ are the corresponding minimum and maximum vertical coordinates. Based on the distance information projected onto the optical image from the prior map, the "region of interest" for line detection can be identified as $[u_{\min}, u_{\max}] \times [v_{\min}, v_{\max}]$. Subsequently, the Hough transform is applied exclusively within this region. As time progresses, the detection range for each frame of the optical image is determined from this distance information, thus achieving adaptive Hough transform line detection.
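Putting these pieces together, a sketch of the per-frame region-of-interest selection (the projected points, margin, and Hough thresholds are assumed inputs, not values from the paper):

```python
import cv2
import numpy as np

def adaptive_hough(image_gray, projected_uv, margin=10):
    """Detect lines only inside the bounding box of the grid-map line
    features projected into the image (the 'region of interest')."""
    u_min, v_min = projected_uv.min(axis=0).astype(int) - margin
    u_max, v_max = projected_uv.max(axis=0).astype(int) + margin

    # Clamp the box to the image and crop the region of interest.
    h, w = image_gray.shape
    u_min, v_min = max(u_min, 0), max(v_min, 0)
    u_max, v_max = min(u_max, w - 1), min(v_max, h - 1)
    roi = image_gray[v_min:v_max, u_min:u_max]

    edges = cv2.Canny(roi, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=60,
                            minLineLength=40, maxLineGap=10)
    if lines is None:
        return np.empty((0, 4))
    # Shift segment endpoints back to full-image pixel coordinates.
    return lines[:, 0] + np.array([u_min, v_min, u_min, v_min])
```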
The collection of line features detected in the pixel coordinate system is recorded as a set of line segments, and the pixel coordinates of each line feature are extracted. Using the inverse homography transformation, the line features in the pixel coordinate system are converted into the global coordinate system, as shown in Equation (9), and the visual line feature set is thereby constructed, with each line feature represented by its coordinates in the global coordinate system.
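A sketch of this inverse mapping, assuming H_gp is the global-to-pixel homography composed in the previous subsection (the placeholder matrix and endpoints below are illustrative only):

```python
import numpy as np

def pixels_to_global(pixel_pts, H_global_to_pixel):
    """Map Nx2 pixel coordinates back to global coordinates by applying
    the inverse homography and normalizing the scale factor."""
    H_inv = np.linalg.inv(H_global_to_pixel)
    ones = np.ones((pixel_pts.shape[0], 1))
    xyw = (H_inv @ np.hstack([pixel_pts, ones]).T).T
    return xyw[:, :2] / xyw[:, 2:3]

# Example: endpoints of one detected segment and a placeholder homography.
segment_px = np.array([[320.0, 240.0], [365.0, 260.0]])
H_gp = np.array([[120.0, 0.0, 300.0],
                 [0.0, -120.0, 400.0],
                 [0.0, 0.0, 1.0]])
print(pixels_to_global(segment_px, H_gp))
```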
By projecting the global grid map with distance information onto the optical image and determining the “regions of interest”, adaptive Hough transformation is achieved. This enhances the matching accuracy between the line features of the global grid map and the detected line features, serving as a foundation for back-end optimization.
4. RCS-Based Gaussian Mixture Model Clustering Method
Millimeter-wave radar includes both a transmitter and a receiver. The transmitter can generate and emit high-frequency electromagnetic waves within a certain frequency range (typically between 30 and 300 GHz). When the emitted electromagnetic waves encounter an object, they are reflected by targets such as vehicles, walls, fences, etc. The size, shape, and material of the target all influence the characteristics of the reflected wave, with the most critical factor being the RCS, which describes the ability of a target to reflect radar waves. The receiver captures the reflected electromagnetic waves. By analyzing the time delay, frequency shift, and intensity of these echoes, it determines the target’s distance, speed, and size. However, radar echoes often experience clutter interference from the ground, buildings, etc., producing a large number of irrelevant data points in the scene, possessing location information similar to the “points of interest”. Particularly when slender linear features like fences exist in the scene, this clutter makes these features difficult to detect and recognize accurately. Clutter usually originates from various irregular reflectors, so their RCS values tend to be variable with a broad distribution. For instance, ground clutter might encompass reflectors ranging from pebbles to large buildings. However, genuine targets usually have similar RCS characteristics, such as fences, vehicles, airplanes, etc. This is because their shape, material, and size are typically consistent. For example, a metal fence, in a specific direction, will always have a relatively fixed RCS value.
Clustering is one of the important methods in data mining. By exploiting the similarity between data, it groups data objects with similar properties into the same class. Through clustering, point clouds can be classified, thereby isolating the line features that match the global grid map. The GMM can be used for clustering data in unsupervised learning. It is a statistical model that assumes an observed dataset is generated by a mixture of several Gaussian distributions. A GMM can be expressed as

$$p(x) = \sum_{i=1}^{M} \pi_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i),$$

where $x$ is the data point, $M$ is the number of Gaussian distributions, $\pi_i$ is the mixture weight of the $i$th distribution, satisfying $\sum_{i=1}^{M} \pi_i = 1$, and $\mathcal{N}(x \mid \mu_i, \Sigma_i)$ is the Gaussian density function with mean $\mu_i$ and covariance $\Sigma_i$.
In the application of GMM clustering, each Gaussian distribution can be viewed as a cluster within the data. The power of the GMM lies in its ability to describe the structure of data by combining multiple Gaussian distributions. One of the strengths of the GMM is its ability to capture clusters of various shapes, not just spherical ones but also elongated shapes, thanks to the covariance matrix of each Gaussian distribution. Compared to traditional K-means clustering, the GMM takes into account not only the center of the data but also its covariance structure, capturing a richer structural feature of the data. Its parameters include the mean (center) and covariance (which describes the shape and directional spread) of each component.
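For reference, a plain position-only GMM clustering of 2D radar points could look like the sketch below (scikit-learn's GaussianMixture; the synthetic data stands in for a real sweep). As discussed next, this baseline tends to let clutter dominate the fitted components.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic scene: a thin "fence-like" line of points plus diffuse clutter.
line = np.column_stack([np.linspace(0, 10, 60),
                        2.0 + 0.05 * rng.standard_normal(60)])
clutter = rng.uniform([0, 0], [10, 4], size=(200, 2))
points = np.vstack([line, clutter])

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
labels = gmm.fit_predict(points)
print(np.bincount(labels))   # cluster sizes
print(gmm.means_)            # fitted component centers
```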
However, in the global grid map, due to the clutter points having similar positional features to the “points of interest”, the GMM might treat these clutter points as significant data features, allocating one or more Gaussian distributions to fit these clutter points. This is because the GMM’s goal is to minimize the overall error, and the clutter may dominate much of the data. At the same time, the points forming line features might be sparse in comparison to the entire dataset. Thus, when the GMM attempts to fit the large number of clutter points with Gaussian distributions, it might “overshadow” the sparse line features.
Traditional clustering approaches struggle to address the issue of the influence of clutter points. Hence, this paper proposes the incorporation of RCS as a weight to distinguish between line feature points and clutter points, increasing the distance in the ellipsoidal feature space, ensuring points with similar structures cluster together. A schematic diagram is shown in
Figure 7. The blue box represents the clustering results after incorporating RCS, while the yellow box shows the GMM clustering results. Both algorithms cluster the same line.
During the clustering process, a weight is assigned to each point-cloud data point. This weight is based on the point's RCS value; specifically, data points with a larger RCS value are given a higher weight.
Let the set of data points be $X = \{x_1, x_2, \ldots, x_N\}$ and their corresponding RCS values be $\{\mathrm{RCS}_1, \mathrm{RCS}_2, \ldots, \mathrm{RCS}_N\}$. Each point's RCS value is normalized so that the sum of all weights is one, as shown in Equation (11), where $N$ is the total number of data points in the set and $w_k$ is the normalized RCS weight of the $k$th data point. After introducing the RCS, the GMM probability distribution with RCS weighting is shown in Equation (12).
When the RCS weight is introduced, each data point has an associated weight. However, these weights do not directly influence the probability distribution of individual data points; they mainly play a role in the likelihood function and parameter estimation.
The likelihood function is shown in Equation (14), where $X$ is the set of data points, $W$ is the set of RCS weights, $w_k$ is the RCS weight of the $k$th data point, $\pi_i$ is the mixing weight of the $i$th Gaussian distribution, $M$ is the number of Gaussian distributions, $N$ is the number of data points, and $\mathcal{N}(x \mid \mu_i, \Sigma_i)$ is the Gaussian density function with mean $\mu_i$ and covariance $\Sigma_i$.
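For concreteness, one plausible form of such an RCS-weighted likelihood, written in log form and following the standard weighted-likelihood pattern (the exact expression of Equation (14) may differ), is

$$\ln L(\Theta \mid X, W) \;=\; \sum_{k=1}^{N} w_k \,\ln\!\left( \sum_{i=1}^{M} \pi_i \, \mathcal{N}\!\left(x_k \mid \mu_i, \Sigma_i\right) \right).$$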
The likelihood function still serves the purpose of describing the probability of observing the entire dataset given the model parameters. However, the aforementioned likelihood function is influenced by the RCS weights. Specifically, data points with higher RCS weights will have a greater weight in the likelihood function. This means that the model will pay more attention to these data points, effectively filtering out the interference from clutter points and segregating point clouds with similar characteristics.
Since this paper mainly focuses on the region in the global grid map that matches with the visual line features, during the R-GMM clustering, this region is set to cluster into three categories. Centered around the line feature region, each side of the line feature is clustered into one category. A schematic diagram is shown in
Figure 8.
Next, the EM (expectation-maximization) algorithm is used to estimate the model parameters.
The EM algorithm is an iterative method used to find parameter estimates when there are hidden variables, aiming to maximize the likelihood function of the observed data. For the Gaussian mixture model, these hidden variables indicate which Gaussian distribution the data points come from. The EM algorithm mainly consists of two steps: the E-step (expectation step) and the M-step (maximization step).
E-step: Calculate the posterior probability that data point x belongs to the mth Gaussian distribution, as shown in Equation (15). In this step, the expected posterior probability of each data point belonging to each Gaussian distribution is computed, that is, the probability that the data point comes from a particular Gaussian distribution given the observed data and the current parameter estimates.
M-step: Based on the posterior probabilities obtained from the E-step, update the parameters of the Gaussian Mixture Model to maximize the data likelihood.
The updated mean is shown in Equation (16). Equation (16) calculates the mean of the mth Gaussian mixture component with the RCS weight included. The RCS weight adjusts the contribution of each data point's position: when a data point has a larger RCS weight, its contribution to the mean of mixture component m is greater.
The updated covariance matrix is shown in Equation (17).
Equation (17) computes the new covariance of the mth Gaussian mixture component. The RCS weight adjusts each data point's contribution to the component's shape and orientation: data points with a larger RCS weight have a greater influence on the covariance of the mth mixture component.
The updated mixture weights are shown in Equation (18).
The RCS weight assigns different importance to data points to eliminate the interference of clutter points. During the M-step update of GMM parameters, the RCS weight of a data point determines its significance in calculating the new weights, means, and covariances. Data points with a larger RCS weight will have a more significant impact on the update of model parameters. This allows the GMM to prioritize those more crucial data points (based on RCS weight) to adjust the model parameters.
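A compact numpy sketch of the RCS-weighted EM update described above (Equations (15)–(18)); the function names, initialization scheme, and regularization are illustrative, and the update formulas follow the standard weighted-EM pattern under the stated assumptions rather than reproducing the paper's exact equations:

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Multivariate normal density evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    cov_inv = np.linalg.inv(cov)
    norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(cov))
    expo = -0.5 * np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.exp(expo) / norm

def r_gmm(X, rcs, M=3, n_iter=50, seed=0):
    """RCS-weighted GMM clustering sketch (R-GMM)."""
    N, d = X.shape
    w = rcs / rcs.sum()                          # normalized RCS weights (Eq. (11))
    rng = np.random.default_rng(seed)
    means = X[rng.choice(N, M, replace=False)]   # random initial centers
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(M)])
    pis = np.full(M, 1.0 / M)

    for _ in range(n_iter):
        # E-step (Eq. (15)): posterior of each point under each component.
        dens = np.column_stack([pis[m] * gaussian_pdf(X, means[m], covs[m])
                                for m in range(M)])
        gamma = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)

        # M-step (Eqs. (16)-(18)): RCS weights scale each point's contribution.
        wg = gamma * w[:, None]                  # per-point, per-component weight
        Nm = wg.sum(axis=0)                      # effective weighted counts
        means = (wg.T @ X) / Nm[:, None]
        for m in range(M):
            diff = X - means[m]
            covs[m] = (wg[:, m, None] * diff).T @ diff / Nm[m] + 1e-6 * np.eye(d)
        pis = Nm / Nm.sum()

    labels = gamma.argmax(axis=1)
    return labels, means, covs, pis

# Usage (with points and rcs_values as Nx2 and N arrays, respectively):
# labels, means, covs, pis = r_gmm(points, rcs_values, M=3)
```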
Using the R-GMM method, both the global grid map and the visual line feature set are clustered, yielding two line feature classes together with their means and covariance matrices. Suppose the mean of the visual line feature class clustered in the global coordinate system is $\mu_v$ with covariance matrix $\Sigma_v$, and the mean of the line feature class of the global grid map is $\mu_g$ with covariance matrix $\Sigma_g$.
The covariance matrix provides information on the direction and magnitude of the cluster's shape. Singular value decomposition (SVD) is used to decompose this matrix and obtain the rotation matrix. Specifically, the column vectors of the rotation matrix are the basis vectors of the new coordinate system, which are the eigenvectors of the covariance matrix. These eigenvectors determine the class's orientation and typically represent the principal directions of the covariance matrix, i.e., the directions in which the data have the most significant variance. The covariance matrix of the visual line feature class is decomposed using SVD, $\Sigma_v = U D V^{\mathsf T}$, as shown in Equation (19).
$U$ is an orthogonal matrix whose columns are the eigenvectors of the covariance matrix $\Sigma_v$. These eigenvectors determine the primary directions, or directions of variation, of the line features. $D$ is a diagonal matrix whose diagonal elements are the eigenvalues of the covariance matrix $\Sigma_v$. These eigenvalues represent the amount of variation, or "spread", of the data along the corresponding eigenvectors: a large eigenvalue implies that the line feature points vary significantly in that direction, while a small eigenvalue indicates that their variation in that direction is minimal. $V^{\mathsf T}$ is the transpose of $V$. The rotation matrix of the visual line feature class is given by Equation (20). This rotation matrix represents the rotation of the visual line feature relative to the coordinate axes of the global coordinate system; it is an orthogonal matrix whose column vectors are unit vectors indicating directions in that coordinate system, and it describes the orientation of the visual line feature in the global coordinate system.
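A sketch of this decomposition step, assuming the rotation is taken from the left singular vectors of the clustered covariance (Equation (20) may define it differently); the covariance values and the final relative-rotation computation are illustrative:

```python
import numpy as np

def rotation_from_covariance(cov):
    """SVD of a 2x2 covariance matrix; the columns of U are the principal
    directions of the clustered line feature, used here as its rotation."""
    U, D, Vt = np.linalg.svd(cov)
    # Force a proper rotation (determinant +1) in case of a reflection.
    if np.linalg.det(U) < 0:
        U[:, -1] *= -1
    return U

# Placeholder covariances of the visual and grid-map line feature classes.
cov_visual = np.array([[4.0, 1.5], [1.5, 1.0]])
cov_grid = np.array([[3.8, 1.2], [1.2, 0.9]])

R_visual = rotation_from_covariance(cov_visual)
R_grid = rotation_from_covariance(cov_grid)

# One way a relative rotation between the two line feature classes could
# feed the back-end pose correction.
R_rel = R_grid @ R_visual.T
print(np.degrees(np.arctan2(R_rel[1, 0], R_rel[0, 0])))
```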