In this section, we evaluate our approach on the public SemanticKITTI dataset. Training was performed on a hardware platform with eight GeForce RTX™ 2080 Ti graphics processing units (GPUs) (11 GB of video memory each) for 80 epochs; the random sampling strategy was applied after the 50th epoch. Validation and testing were then performed with a single graphics card.
4.1. Dataset and Metrics
To fully evaluate the performance of our method and to benchmark it against other panoptic segmentation algorithms, the SemanticKITTI dataset was chosen in this study for training, validation, and testing. The SemanticKITTI dataset provides 43,442 frames of point cloud data (sequences 00–10 carry point-wise ground-truth annotations). Moreover, challenges for multiple segmentation tasks are hosted on the dataset’s official website, where participants’ methods are assessed with uniform metrics. In accordance with the official split used by other algorithms, 19,130 frames of point cloud data from sequences 00–07 and 09–10 were used as the training dataset, 4071 frames from sequence 08 were used as the validation dataset, and 20,351 frames from sequences 11–21 were used as the test dataset. For the semantic segmentation and panoptic segmentation of single-frame data, 19 categories need to be labeled. For the semantic segmentation of multiple scans, 25 categories need to be annotated (six additional moving categories). The evaluation metric for semantic segmentation was the mean intersection-over-union (mIoU):
$$\mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^{C}\frac{TP_c}{TP_c + FP_c + FN_c},$$
where $C$ denotes the number of categories, and $TP_c$, $FP_c$, and $FN_c$ denote the true positive, false positive, and false negative counts for category $c$, respectively.
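As a minimal sketch (not the benchmark’s evaluation code), mIoU can be computed from a per-class confusion matrix as follows; the function name and matrix convention are our own:

```python
import numpy as np

def mean_iou(conf_matrix: np.ndarray) -> float:
    """Compute mIoU from a C x C confusion matrix.

    conf_matrix[g, p] counts points of ground-truth class g predicted
    as class p, so for class c:
      TP_c = conf_matrix[c, c]
      FP_c = conf_matrix[:, c].sum() - TP_c
      FN_c = conf_matrix[c, :].sum() - TP_c
    """
    tp = np.diag(conf_matrix).astype(np.float64)
    fp = conf_matrix.sum(axis=0) - tp
    fn = conf_matrix.sum(axis=1) - tp
    denom = tp + fp + fn
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), 0.0)
    return float(iou.mean())
```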
The evaluation metrics for panoptic segmentation [6,40] were the panoptic quality, recognition quality, and semantic quality, denoted as PQ, RQ, and SQ, respectively, for all categories; PQ^Th, RQ^Th, and SQ^Th, respectively, for the “thing classes”; and PQ^St, RQ^St, and SQ^St, respectively, for the “stuff classes”. Additionally, PQ† replaces PQ with SQ for the “stuff classes”. PQ, RQ, and SQ were defined as follows:
$$\mathrm{SQ} = \frac{\sum_{(p,g)\in TP}\mathrm{IoU}(p,g)}{|TP|},\quad \mathrm{RQ} = \frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|},\quad \mathrm{PQ} = \mathrm{SQ}\times\mathrm{RQ},$$
where $TP$, $FP$, and $FN$ denote the sets of matched segment pairs, unmatched predicted segments, and unmatched ground-truth segments, respectively, and a predicted segment is matched to a ground-truth segment when their IoU exceeds 0.5.
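For illustration, the following sketch computes PQ, SQ, and RQ for a single class from the IoU values of matched segment pairs; it mirrors the definitions above but is not the official benchmark implementation, and the names are ours:

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ, SQ, RQ for one class.

    matched_ious: IoU values of predicted/ground-truth segment pairs
                  matched with IoU > 0.5 (the true positives).
    num_fp, num_fn: counts of unmatched predicted / ground-truth segments.
    """
    tp = len(matched_ious)
    if tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)
    return sq * rq, sq, rq
```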
4.2. Performance and Comparisons
Table 2 presents the quantitative evaluation results of the method proposed in this study on the test dataset. We took FPS = 10 Hz as the grouping criterion to compare the panoptic segmentation performance of each algorithm and to verify the effectiveness of the proposed method. Note that the results of the comparison methods were taken from the official competition, and 10 Hz corresponds to the data acquisition frequency of the sensor, which is commonly used as the threshold for real-time performance.
Panoptic-PolarNet uses cylindrical partitioning for instance segmentation with a UNet backbone; 4D-PLS fuses multi-scan data to achieve panoptic segmentation of single-scan data; and DS-Net applies 3D sparse convolution on a cylindrical partition together with a dynamic shifting clustering algorithm that requires multiple iterations. The method proposed in this study draws, to varying degrees, on these three panoptic segmentation methods. Compared with Panoptic-PolarNet, our approach improved PQ by 0.5% and FPS by 1 Hz. Compared with 4D-PLS, our approach improved PQ by 4.3%. The PQ obtained by DS-Net is 1.3% higher than ours, but DS-Net runs at merely 3.4 Hz, which does not allow for real-time panoptic segmentation.
Furthermore, we investigated the key components for performance improvements based on the experiment results. As discussed in the related work (
Section 2), robust panoptic segmentation can be achieved through a strong backbone, an effective clustering method, or a smart data fusion strategy. Firstly, DS-Net builds a dynamic shifting clustering method on top of the powerful Cylinder3D backbone to improve panoptic segmentation performance. Cylinder3D adopts 3D sparse convolution and obtains a 67.8% mIoU score in the single-scan semantic segmentation competition, which is why DS-Net can reach a 55.9% PQ score. Our backbone, by contrast, obtained less than a 60% mIoU score in the single-scan semantic segmentation competition; nevertheless, thanks to the “thing classes” data fusion, we could approach the performance of DS-Net while meeting real-time requirements. Secondly, 4D-PLS performs panoptic segmentation by fusing multi-scan data; however, although multiple scans are fused, the proportion of “thing classes” data remains unchanged. In our method, only the “thing classes” data were fused, which increased the proportion of data valuable for instance segmentation and thus improved panoptic segmentation performance. Finally, compared with Panoptic-PolarNet, our method adopted a stronger backbone, Polar-UNet3+, which aggregates features at full scales and achieves better performance while shortening the panoptic segmentation time.
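To make the “thing classes” fusion idea concrete, the following is a simplified sketch of appending only the thing-class points of a previous scan, transformed into the current frame, to the current scan. The class IDs, function name, and data layout are illustrative assumptions, not the exact implementation:

```python
import numpy as np

# Hypothetical IDs for the SemanticKITTI "thing" classes (car, bicycle,
# motorcycle, truck, other-vehicle, person, bicyclist, motorcyclist);
# the exact mapping depends on the label configuration used.
THING_IDS = {1, 2, 3, 4, 5, 6, 7, 8}

def fuse_thing_points(curr_xyz, prev_xyz, prev_labels, T_curr_from_prev):
    """Append only the 'thing classes' points of a previous scan to the
    current scan, after transforming them into the current sensor frame."""
    mask = np.isin(prev_labels, list(THING_IDS))
    pts = prev_xyz[mask]
    # homogeneous transform of the previous thing points into the
    # current frame using the relative pose T_curr_from_prev (4 x 4)
    pts_h = np.hstack([pts, np.ones((pts.shape[0], 1))])
    pts_curr = (T_curr_from_prev @ pts_h.T).T[:, :3]
    return np.vstack([curr_xyz, pts_curr])
```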
Table 3 shows the quantitative evaluation results obtained by our proposed method on the test dataset for each semantic category. The quantitative evaluation results demonstrate that this method achieved a balance between accuracy and efficiency in panoptic segmentation.
The visualizations of panoptic segmentation results on the validation split of SemanticKITTI are shown in
Figure 8. The figure consists of five subfigures, which correspond to the panoptic segmentation results of the five scanned frames sampled at 800-frame intervals in the validation split. Each subfigure consists of four parts: the ground truth of the panoptic segmentation, the semantic segmentation predictions, the instance segmentation predictions, and the panoptic segmentation predictions. Note that we used semantic-kitti-api [
44] to visualize the panoptic ground truths and predicted labels. In each visualization, the color assigned to each instance was chosen randomly; therefore, in
Figure 8, the color of the same instance in the ground truths and in the predicted labels does not match. The qualitative comparison results show that our method could accurately predict each instance and that the errors mainly came from the point-wise labels inside the instances, which will be a direction for future improvement.
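For reference, the panoptic labels handled by semantic-kitti-api pack the semantic class into the lower 16 bits and the instance ID into the upper 16 bits of each uint32 label; a minimal decoding sketch (our own helper, not part of the API) is:

```python
import numpy as np

def read_panoptic_labels(path):
    """Decode a SemanticKITTI .label file: the lower 16 bits of each
    uint32 store the semantic class, the upper 16 bits the instance ID."""
    raw = np.fromfile(path, dtype=np.uint32)
    semantic = raw & 0xFFFF
    instance = raw >> 16
    return semantic, instance
```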
The visualization results of panoptic segmentation on the test split of SemanticKITTI are shown in
Figure 9. The figure consists of eleven subfigures, which correspond to the panoptic segmentation results of the 500th scan of each sequence in the test split (since the 500th scan of sequence 16 contained only a few pedestrians and sequence 17 has only 491 scans, the 300th scan was selected for sequences 16 and 17). Each subfigure consists of four parts: the raw scan, the semantic segmentation predictions, the instance segmentation predictions, and the panoptic segmentation predictions. Because the test split does not provide ground truths for comparison, we did not use red rectangular boxes for identification.
4.3. Ablation Study
We present an ablation analysis of our approach, which considers the “thing classes” point cloud fusion, random sampling, Polar-UNet3+, and the grid size. All results were compared on the validation set (sequence 08). Firstly, we set the grid size to (480, 360, 32) and investigated the contribution of each key component of the method. The results are shown in
Table 4. As a key contributor to the performance improvement, Polar-UNet3+ boosted PQ by 0.8%; the “thing classes” point fusion achieved a further 0.9% improvement; and the random sampling module improved the stability of the method.
After analyzing the performance of each component, we explored the effect of voxelization size on segmentation time and performance, and the results are shown in
Table 5. For LiDAR point clouds, which are dense at close range and sparse at far range, finer voxels did not deliver significant performance gains but rather compromised segmentation efficiency.
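As an illustration of how the grid size enters the pipeline, the sketch below assigns points to a polar (radius, angle, height) grid; the radial and height bounds are illustrative placeholders rather than the exact values used in our experiments:

```python
import numpy as np

def polar_grid_indices(xyz, grid_size=(480, 360, 32),
                       max_radius=50.0, z_range=(-3.0, 1.5)):
    """Assign each point to a (radius, angle, height) voxel.

    grid_size follows the (480, 360, 32) setting from Table 5; the
    radial and height bounds here are illustrative placeholders.
    """
    rho = np.linalg.norm(xyz[:, :2], axis=1)
    phi = np.arctan2(xyz[:, 1], xyz[:, 0])          # [-pi, pi]
    r_idx = np.clip(rho / max_radius * grid_size[0], 0, grid_size[0] - 1)
    a_idx = np.clip((phi + np.pi) / (2 * np.pi) * grid_size[1],
                    0, grid_size[1] - 1)
    z_idx = np.clip((xyz[:, 2] - z_range[0]) / (z_range[1] - z_range[0])
                    * grid_size[2], 0, grid_size[2] - 1)
    return np.stack([r_idx, a_idx, z_idx], axis=1).astype(np.int64)
```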
4.4. Other Applications
The improvement in panoptic segmentation performance benefited from the “thing classes” data fusion and random sampling strategies. To fully evaluate the effectiveness and generalization of our method, we chose the more challenging multi-scan semantic segmentation task, i.e., moving object semantic segmentation, to verify our method. The semantic segmentation of moving objects takes multiple scans as input and labels the semantics of a single scan (25 categories of environmental elements need to be predicted, and elements of specific categories need to be distinguished as moving or static). Compared with moving object segmentation, moving object semantic segmentation not only distinguishes the state of environmental elements (moving or static) but also needs to accurately label all environmental elements. Compared with the semantic segmentation of a single scan, moving object semantic segmentation adds six categories: moving car, moving truck, moving other vehicle, moving person, moving bicyclist, and moving motorcyclist.
On the basis of ensuring real-time performance, we expanded Polar-UNet3+ for panoptic segmentation, which includes four encoders and three decoders, into MS-Polar-UNet3+, which includes five encoders and four decoders. The main reason for deepening the network is that semantic segmentation does not need to predict instance IDs, so the saved computation can be used to deepen the network structure, as shown in
Figure 10.
In the process of multiple-scan fusion, we defined the “thing classes” data as the moving-class data, including “moving car”, “moving truck”, “moving other vehicle”, “moving person”, “moving bicyclist”, and “moving motorcyclist”. In addition, “1 + P + M” was used to represent the multiple-scan fusion mode, in which P is the number of complete previous scans and M is the number of scans from which only moving-class points are fused. In the selection of P, due to the limitation of our hardware platform, it was difficult to fuse the past four scans, as in SemanticKITTI [
7], so we set P to 2. In the selection of M, because the points of the moving classes account for only a small proportion, we referred to the conclusion of Lidar-MOS [
45] and set M to 8. In the training process, following the training parameter settings of the panoptic segmentation task, the grid size was (480, 360, 32), the number of training epochs was 80, and the random sampling strategy was applied at the 50th epoch.
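A simplified sketch of the “1 + P + M” fusion mode is given below; the moving-class IDs and the assumption that all scans are already expressed in the current frame are illustrative simplifications, not the exact implementation:

```python
import numpy as np

# Illustrative moving-class IDs; the actual values follow the
# SemanticKITTI label mapping.
MOVING_IDS = {252, 253, 254, 255, 256, 257, 258, 259}

def fuse_scans(curr, prev_full, prev_moving_only, labels_moving_only,
               P=2, M=8):
    """'1 + P + M' fusion: the current scan, P complete previous scans,
    and M older scans from which only moving-class points are kept.
    All scans are assumed to be already transformed into the current frame."""
    fused = [curr] + list(prev_full[:P])
    for xyz, lab in zip(prev_moving_only[:M], labels_moving_only[:M]):
        fused.append(xyz[np.isin(lab, list(MOVING_IDS))])
    return np.vstack(fused)
```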
Table 6 shows the quantitative evaluation results of our method on the test split of SemanticKITTI. The moving object semantic segmentation results of Cylinder3D [25] come from the official evaluation website of SemanticKITTI [46], and we could not confirm how many scans it fuses. Note that the presence of moving objects increases the difficulty of semantic segmentation: Cylinder3D [25] obtains 67.8% mIoU in the single-scan semantic segmentation task but only 52.5% mIoU in the multiple-scan task. Our method obtained 52.9% mIoU while ensuring real-time performance.
The quantitative evaluation results of our proposed approach in panoptic segmentation and moving object semantic segmentation on the test split of SemanticKITTI show that our method achieved an effective balance between segmentation accuracy and efficiency and that the foreground data fusion and random sampling strategies can be generalized and applied to other LiDAR-based point cloud segmentation networks.
Furthermore, we combined the segmentation results of dynamic objects with SLAM to evaluate the effectiveness of our method. Moving objects in the environment produce wrong data associations, which degrade the pose estimation accuracy of SLAM algorithms; if moving objects can be removed accurately, the performance of SLAM algorithms will undoubtedly improve. We chose MULLS [
47] as the SLAM benchmark and used three kinds of input data for the experiments: the raw scan, the scan with dynamic objects filtered out according to the semantic ground truth (abbreviated as Dynamic Moving GT), and the scan with dynamic objects filtered out according to the semantic predictions obtained by our method (abbreviated as Dynamic Moving).
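Preparing the Dynamic Moving input amounts to dropping every point whose predicted semantic class is a moving class before passing the scan to MULLS; a minimal sketch, reusing the illustrative moving-class IDs from the fusion sketch above, is:

```python
import numpy as np

# Illustrative moving-class IDs (see the fusion sketch above); the real
# mapping follows the SemanticKITTI label configuration.
MOVING_IDS = np.array([252, 253, 254, 255, 256, 257, 258, 259])

def remove_moving_points(xyz, sem_pred):
    """Keep only points whose predicted semantic class is not a moving
    class, producing the 'Dynamic Moving' input for the SLAM benchmark."""
    keep = ~np.isin(sem_pred, MOVING_IDS)
    return xyz[keep]
```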
For the evaluation of the SLAM algorithm, the absolute pose error (APE) was used as the quantitative evaluation index, with Sim(3) Umeyama alignment applied in the calculation. We then used the evo tool [
48] to evaluate the estimated poses, reporting the root mean square error (RMSE), the mean error, the median error, and the standard deviation (Std.).
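Conceptually, the APE statistics are obtained by aligning the estimated translations to the ground truth with a Sim(3) Umeyama fit and computing error statistics over the residuals. The sketch below is our own illustration of that computation; the numbers reported in Table 7 were produced with the evo tool itself:

```python
import numpy as np

def umeyama_sim3(est, gt):
    """Sim(3) Umeyama alignment of estimated positions to ground truth.
    est, gt: (N, 3) arrays of corresponding translation components."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    E, G = est - mu_e, gt - mu_g
    cov = G.T @ E / len(est)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                      # keep a proper rotation
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / E.var(axis=0).sum()
    t = mu_g - s * R @ mu_e
    return s, R, t

def ape_stats(est, gt):
    """APE of the translation component after Sim(3) alignment."""
    s, R, t = umeyama_sim3(est, gt)
    aligned = (s * (R @ est.T)).T + t
    err = np.linalg.norm(aligned - gt, axis=1)
    return {"rmse": float(np.sqrt((err ** 2).mean())),
            "mean": float(err.mean()),
            "median": float(np.median(err)),
            "std": float(err.std())}
```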
Table 7 compares the APE of the translation component for different inputs based on MULLS.
Figure 11 shows the APE visualization results for different inputs. This figure consists of eleven subfigures, which correspond to sequences 00–10 of the SemanticKITTI dataset, respectively. According to
Table 7 and
Figure 11, it is obvious that filtering out dynamic objects could significantly improve the accuracy of pose estimation. Considering that most of the movable objects in the KITTI dataset are actually static in the environment, the experimental results strongly demonstrate the effectiveness of our method in segmenting dynamic objects, which improves the accuracy of pose estimation and enhances the performance of SLAM algorithms.