MambaMOS: LiDAR-based 3D Moving Object Segmentation with Motion-aware State Space Model

Kang Zeng1, Hao Shi2, Jiacheng Lin1, Siyu Li1, Jintao Cheng3, Kaiwei Wang2,
Zhiyong Li1,∗, and Kailun Yang1,∗
1Hunan University, 2Zhejiang University, 3South China Normal University
Abstract.

LiDAR-based Moving Object Segmentation (MOS) aims to locate and segment moving objects in point clouds of the current scan using motion information from previous scans. Despite the promising results achieved by previous MOS methods, several key issues, such as the weak coupling of temporal and spatial information, still need further study. In this paper, we propose a novel LiDAR-based 3D moving object segmentation framework with a Motion-aware State Space Model, termed MambaMOS. Firstly, we develop a novel embedding module, the Time Clue Bootstrapping Embedding (TCBE), to enhance the coupling of temporal and spatial information in point clouds and alleviate the issue of overlooked temporal clues. Secondly, we introduce the Motion-aware State Space Model (MSSM) to endow the model with the capacity to understand the temporal correlations of the same object across different time steps. Specifically, MSSM emphasizes the motion states of the same object at different time steps through two distinct temporal modeling and correlation steps. We utilize an improved state space model to represent these motion differences, thereby effectively modeling the motion states. Finally, extensive experiments on the SemanticKITTI-MOS and KITTI-Road benchmarks demonstrate that the proposed MambaMOS achieves state-of-the-art performance. The source code of this work will be made publicly available at https://github.com/Terminal-K/MambaMOS.

Keywords: Moving Object Segmentation; State Space Model; Spatio-Temporal Fusion

1. Introduction

The LiDAR-based Moving Object Segmentation (MOS) task is pivotal for accurately delineating moving entities such as cars or pedestrians within the current LiDAR scan, serving as a fundamental component of autonomous perception (Chen et al., 2021; Zhou et al., 2023). MOS contributes in two main ways. First, it ensures stable operation for autonomous driving systems by providing accurate 3D dynamic semantic scene understanding (Sun et al., 2022; Cheng et al., 2024). Second, it assists in removing the “ghost effect” caused by object motion during mapping in simultaneous localization and mapping, resulting in a clean static map (Chen et al., 2019; Li et al., 2023).

Chen et al. (Chen et al., 2021) propose a learning-based MOS method that projects point clouds onto a planar representation and utilizes a sequence of these representations to incorporate temporal information for MOS. Similar paradigms like (Sun et al., 2022; Kim et al., 2022; Cheng et al., 2024; Mohapatra et al., 2022; Zhou et al., 2023) achieve low latency but suffer from geometric loss introduced by projection, leaving room for improvement in terms of accuracy and generalization. Non-projection methods (Mersch et al., 2022; Wang et al., 2023) perform feature extraction directly in the 3D space and have achieved precise segmentation results and excellent generalization. However, these methods cannot sufficiently couple the temporal-spatial features of multi-scan point clouds and suffer from the issue of “weak coupling between temporal and spatial information”. Specifically, due to the changing spatial positions of moving objects over time, trailing artifacts will be formed in the aggregated point cloud. Without incorporating timestamp information to differentiate each scan in the aggregated point cloud, these artifacts may be confused with larger objects in terms of their similar appearance (e.g., moving cars and parked trucks). The evolution of timestamp information reflects the motion of objects, and the moving objects can also be identified through the evolution of their timestamp information.

Based on the above observations, we hypothesize that the temporal information of objects is the dominant information for determining their motion, and strengthening the coupling between the temporal and spatial information of objects will facilitate the segmentation of moving objects. However, the aforementioned methods (Mersch et al., 2022; Wang et al., 2023) directly concatenate the timestamp information of each point with the spatially occupied information to form a 4D point cloud that contains temporal-spatial features and employ a Convolutional Neural Network (CNN) to learn these temporal-spatial features, as shown in Figure 1 (a). Although they are effective, the neglect of the dominant role of timestamp information and the lack of deeper coupling between temporal and spatial information hinder further improvement in their segmentation performance.

In this work, we rethink the issue of effectively encoding shallow temporal and spatial features and facilitating sufficient interaction among deep temporal and spatial features. For cases where simply concatenating timestamp information with spatial information fails to highlight the importance of temporal information, we propose an effective embedding approach named Time Clue Bootstrapping Embedding (TCBE), which emphasizes the expressive power of temporal information through attention mechanisms and enhances the mutual coupling between temporal and spatial information by treating temporal information as an independent channel separate from spatial information.

Figure 1. A brief comparison of other non-projection methods (sub-figure (a)) with ours (sub-figure (b)). The previous methods treated temporal information $t$ and spatially occupied information $O$ equally, without deeply integrating them. In contrast, our method emphasizes the primacy of temporal information at each point more through our designed TCBE and achieves a deeper coupling of temporal and spatial information with MSSM, which aligns more closely with the fundamental principles of motion recognition.

Although TCBE can enhance the coupling between temporal and spatial information to some extent compared to the previous embedding approach, it can only be applied to shallow layers and cannot further deepen the coupling between temporal and spatial information. Recently, the work of Li and Zhuang (Li and Zhuang, 2023) introduced a projection-based method and for the first time incorporated the self-attention mechanism at the core of transformers (Vaswani et al., 2017) into MOS, achieving performance far surpassing other methods of the same paradigm. However, studies (Dao et al., 2022; Gu and Dao, 2023) have shown that the quadratic computational complexity of the transformer model in managing large input sequences presents challenges in achieving a balance between training cost and accuracy. Fortunately, the State Space Model (SSM) introduced by Mamba (Gu and Dao, 2023) offers a promising solution, providing us with the opportunity to achieve long-range context modeling capabilities comparable to the transformer (Vaswani et al., 2017) while maintaining linear time complexity. Inspired by this advancement, we proceed to develop the Motion-aware State Space Model (MSSM). In the designed MSSM, we decouple the aggregated point cloud features into multiple single-scan features and learn the appearance features expressed by single-scan features and the motion features expressed by the aggregated features separately. Then, by using the cross-product attention between these two features, we achieve spatial appearance interpretation of multiple-scan features from single-scan features and temporal information supplementation of single-scan features from multiple-scan features, thereby enabling the deep-level coupling between temporal and spatial information with the assistance of SSM and achieving linear complexity.

Through extensive experiments, we demonstrate that the combination of TCBE and MSSM can effectively achieve strong coupling between temporal and spatial information, and achieves state-of-the-art performance on the SemanticKITTI-MOS (Behley et al., 2020; Chen et al., 2021) and KITTI-Road (Geiger et al., 2013) benchmarks. Our contributions are summarized as follows:

  • We rethink the problem of the weak coupling between temporal and spatial information that existed in the previous methods and propose a novel LiDAR-based moving object segmentation framework, termed MambaMOS. To the best of our knowledge, this work represents the first attempt to utilize SSM in MOS, providing directions for future extensions of SSM in the MOS domain.

  • An effective Time Clue Bootstrapping Embedding method (TCBE) is introduced, which enhances the coupling capability of temporal and spatial information to some extent, improving the expressive power of moving-object features.

  • A novel temporal-spatial information coupling module based on SSM (MSSM) is proposed, which enables deep-level coupling between temporal and spatial features and enhances the perception of moving objects through the complementary nature of single-scan and multiple-scan features.

2. Related Work

Existing MOS methods can be divided into two categories: Projection-based methods (Chen et al., 2021; Sun et al., 2022; Cheng et al., 2024; Kim et al., 2022) and Non-Projection-based methods (Mersch et al., 2022; Wang et al., 2023; Li and Zhuang, 2023; Li et al., 2023; Mersch et al., 2023). Projection-based methods involve projecting a 3D point cloud onto a compact 2D plane as the model input, while Non-Projection-based methods operate directly within the 3D point cloud space.

2.1. Projection-based methods

Projection-based MOS methods can be divided into Range View (RV) methods (Chen et al., 2021; Sun et al., 2022; Cheng et al., 2024; Kim et al., 2022) and Bird’s-Eye View (BEV) ones (Mohapatra et al., 2022; Zhou et al., 2023). There has been extensive work in the field of object detection and segmentation in 3D LiDAR data using RV images (Fan et al., 2021; Cortinhal et al., 2020; Kong et al., 2023), which use the original single scan point cloud through the spherical projection (Milioto et al., 2019) to obtain 2D RV image as the model input. In motion perception tasks, the temporal information needed to perceive motion is usually provided by the residual images obtained from the residual processing of the RV images of the current scan and the past few scans (Kim et al., 2022; Chen et al., 2021; Cheng et al., 2024; Sun et al., 2022). Chen et al. (Chen et al., 2021) directly concatenate the RV images and the corresponding multi-scan residual images as input, whereas Sun et al. (Sun et al., 2022) proposed a dual-branch model structure, which used two encoders to extract features from the RV images and the multi-scan residual images respectively. Different from (Sun et al., 2022), Kim et al. (Kim et al., 2022) use a branch in its model to decouple the movable objects into moving objects and static objects using additional semantic labels, which enhances the model’s capacity to understand the dynamic scenario. Cheng et al. (Cheng et al., 2024) focus more on the feature extraction of motion features, which coincides harmoniously with our viewpoint and has achieved leading performance with additional semantic labels.

Unlike the above RV-based methods, the BEV methods present the point cloud features from a top-down perspective, which maintains the consistency of the object scale in the point cloud and makes it easier to understand and process features (Zhang et al., 2020). Mohapatra et al. (Mohapatra et al., 2022) first proposed moving object segmentation in BEV, which achieved faster running speed but lower accuracy than RV-based methods. Zhou et al. (Zhou et al., 2023) employed polar coordinates to transform point clouds into Bird’s-Eye View (BEV) representation. They utilized a dual-branch CNN to extract appearance and motion features from multiple BEV scans, resulting in improved accuracy and efficiency. Although the above projection-based methods are efficient, there is a loss of geometric information in the process of returning the final result to the 3D point cloud space, which limits the performance of such methods.

2.2. Non-projection-based methods

Non-projection-based methods, which directly operate on point clouds in 3D space, circumvent the loss of geometric information inherent in projection-based approaches. Consequently, these methods hold the theoretical advantage of achieving superior segmentation performance. 4DMOS (Mersch et al., 2022) inputs the voxelized point cloud superposition representation of multiple scans into a sparse 4D CNN and fuses the prediction results of multiple different scans of moving objects by a binary Bayesian filter as an additional post-process, which improved the confidence score of judging the moving object in the current scan and achieved excellent segmentation results. Similarly, InsMOS (Wang et al., 2023) also takes a 4D point cloud as input, but it assists in segmenting moving objects by fusing BEV representations containing object instance information at different resolutions. Li et al. (Li and Zhuang, 2023) proposed a dual-branch model that integrates 3D point clouds and 2D images, and employed Transformer (Vaswani et al., 2017) to fuse multi-scale point cloud and image features, aiming to enhance the coupling of temporal and spatial characteristics. Li et al. (Li et al., 2023) utilized cylindrical coordinates to voxelize the aggregated point cloud input and employed a CNN to obtain moving object segmentation results, and further applied MOS in the task of LiDAR-based localization to improve its robustness in dynamic scenes. MapMOS (Mersch et al., 2023) observes that selecting a fixed set of past scans leads to some moving objects not being perceived due to occlusion. It therefore proposes a strategy of moving target perception based on a local map constructed from past scans, and achieves state-of-the-art performance on the validation set of the SemanticKITTI-MOS benchmark.

In addition to the learning-based methods mentioned above, there are also many non-learning-based methods, including map-cleaning methods (Kim and Kim, 2020; Schauer and Nüchter, 2018; Lim et al., 2021, 2023; Arora et al., 2021) and map-based methods (Chen et al., 2022; Pfreundschuh et al., 2021; Schmid et al., 2023). Map-cleaning methods remove the moving objects offline using the geometric information of the target. Map-based methods, on the other hand, require a pre-built map to remove the objects that are moving throughout the mapping process.

In general, existing MOS methods have not thoroughly explored the coupling between temporal and spatial features, which limits their understanding of motion states. In contrast, our method achieves shallow coupling of temporal and spatial features during the embedding stage and deep coupling within each stage of the model. This deep coupling establishes a robust correlation between temporal and spatial clues, enhancing the model’s comprehension of motion scenes. Importantly, our method achieves state-of-the-art performance without any post-processing modules on the MOS task.

Figure 2. The overview of our proposed MambaMOS. The previous $F-1$ scans, after undergoing viewpoint transformation, are overlaid with the current scan to form a 4D point cloud. This 4D point cloud is then serialized to obtain a sequence as input. After passing through TCBE, the coupling degree between temporal and spatial information in the input is enhanced and fed into a symmetric encoder-decoder architecture (the pink box). Each stage of the encoder/decoder consists of a pooling/unpooling layer and $N$ blocks (the blue box). MSSM serves as the core of each block to achieve deep-level coupling of temporal and spatial features. Finally, the MOS result in the current scan can be obtained from the output of the decoder by a linear layer.

3. Method

3.1. Preliminaries

State Space Model. SSM (Gu, 2023) is a sequential model that can map a one-dimensional input sequence $x(t)\in\mathbb{R}$ to an output sequence $y(t)\in\mathbb{R}$. The process is represented by a series of continuous hidden states $h(t)\in\mathbb{R}^{N}$ of state size $N$. In general, the SSM of a continuous-time system can be represented by the following linear Ordinary Differential Equation (ODE) as depicted in Equation (1),

(1) $$\begin{gathered} h^{\prime}(t) = A h(t) + B x(t) \\ y(t) = C h(t) \end{gathered}$$

where the parameters $A\in\mathbb{R}^{N\times N}$, $B\in\mathbb{R}^{N\times 1}$ and $C\in\mathbb{R}^{1\times N}$ establish the correlation between the state and output variables.

Discretization. It is essential that the original SSM equations be transformed into a discrete form to fit the discretized data in the task. The discretized SSM can be written as Equation (2),

(2) $$\begin{gathered} h_{t} = \bar{A} h_{t-1} + \bar{B} x_{t} \\ y_{t} = C h_{t} \end{gathered}$$

The discretization parameters $\bar{A}$, $\bar{B}$ can be described by the Zero-Order Hold (ZOH) rule with timescale parameters $\Delta$ as Equation (3),

(3) $$\begin{gathered} \bar{A} = e^{\Delta A} \\ \bar{B} = (\Delta A)^{-1}\left(e^{\Delta A} - I\right)\cdot\Delta B \end{gathered}$$
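To make Equations (2) and (3) concrete, the following is a minimal sketch for the special case of a diagonal $A$, where the matrix exponential and inverse reduce to element-wise operations. The function names `zoh_discretize` and `ssm_scan` and the diagonal restriction are assumptions made for illustration; they are not taken from the MambaMOS code.

```python
import math

def zoh_discretize(a_diag, b, delta):
    """Zero-Order Hold discretization (Equation (3)) for a diagonal A.

    a_diag: diagonal entries of A (each must be non-zero)
    b:      entries of B, one per state dimension
    delta:  timescale parameter
    """
    a_bar = [math.exp(delta * a) for a in a_diag]
    # (delta*a)^-1 * (exp(delta*a) - 1) * delta*b, element by element
    b_bar = [(math.exp(delta * a) - 1.0) / (delta * a) * (delta * bn)
             for a, bn in zip(a_diag, b)]
    return a_bar, b_bar

def ssm_scan(a_bar, b_bar, c, xs):
    """Run the discrete recurrence of Equation (2) over a scalar input sequence."""
    h = [0.0] * len(a_bar)  # hidden state, initialised to zero
    ys = []
    for x in xs:
        h = [ab * hn + bb * x for ab, hn, bb in zip(a_bar, h, b_bar)]
        ys.append(sum(cn * hn for cn, hn in zip(c, h)))
    return ys

# One state dimension, constant input: the discrete outputs sample the
# continuous step response 1 - exp(-t) at t = delta, 2*delta, ...
a_bar, b_bar = zoh_discretize([-1.0], [1.0], delta=0.1)
ys = ssm_scan(a_bar, b_bar, [1.0], [1.0, 1.0, 1.0])
```

Because the input is held constant within each step, ZOH reproduces the continuous-time response exactly at the sample points for a piecewise-constant input, which is why it is a natural fit for discretely sampled sequences.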

Selective Scan Mechanism. Mamba (Gu and Dao, 2023) proposed a selective scan mechanism that effectively adjusts the parameters through the parameterized projection of the input sequence, enabling the SSM to selectively filter the input sequence features. This has advanced the research of SSM in the time-varying domain.

MambaMOS Principles. To address the weak coupling of temporal and spatial information in existing MOS methods, we try to adapt Mamba from Natural Language Processing (NLP) to the MOS task. An intriguing discovery emerged: the MOS task inherently involves selecting a moving subset of elements from an unordered set, akin to the selective copying mechanism in NLP (Gu and Dao, 2023; Arjovsky et al., 2016). Leveraging this insight, we introduced MambaMOS based on the selective copying mechanism. This enhancement equips Mamba to effectively address MOS tasks, enabling the model to adaptively select the moving target while reducing operational costs.

3.2. MambaMOS

Overview Architecture. The proposed MambaMOS leverages a U-Net (Ronneberger et al., 2015) style overall architecture as shown in Figure 2. Firstly, the 4D point cloud set as input will be transformed into an ordered sequence after the serialization process.

Simultaneously, they are encoded through the meticulously designed TCBE (Sec. 3.3). Next, the point cloud is sent into the encoder-decoder construct to model deep features. It includes the encoder with a 5-stage block depth of $[2,2,2,6,2]$ and the decoder with a 4-stage block depth of $[2,2,2,2]$. It should be noted that the point cloud pooling strategy is used in the encoder of all stages except for the first. The scale change factor of the point cloud passing the pooling layer is 2. Moreover, at the beginning of the block, an efficient position encoding block is leveraged to capture the local attention of the feature following the idea of most point transformer works (Lai et al., 2022; Wu et al., 2024; Yang et al., 2023).

The point cloud features after layer normalization will pass through the MSSM (Sec. 3.4), the core of the entire block, where the motion features of the objects will be enhanced. The output of the block is then produced by a further layer normalization followed by a multi-layer perceptron. Residual connections are extensively applied in each of our blocks to avoid vanishing gradients (He et al., 2016). Finally, the logits of each point are obtained by a linear layer, and the points are deserialized to extract the segmentation result.

Input Representation. At the current time ($t=0$), we are given a LiDAR scan $S_{t}=\left\{p_{i}\in\mathbb{R}^{4}\right\}_{i=0}^{N_{t}-1}$ with $N_{t}$ points $p_{i}=[x_{i},y_{i},z_{i},1]^{T}$ represented by homogeneous coordinates. The goal is to segment the moving points in the current scan $S_{0}$ using the continuous point cloud set $S=\left\{S_{t}\right\}_{t=0}^{F-1}$ including the current scan $S_{0}$ and its past $F-1$ scans. To aggregate the point cloud data of the $F$ scans into a 4D point cloud input containing temporal-spatial information and eliminate self-motion, we need to transform the past $F-1$ scans to the perspective of the current scan and convert the homogeneous coordinates to Cartesian coordinates separately. Given the pose transition matrix $T_{t}^{0}$ from scan $t$ to the current scan, the perspective transition from the point cloud at time $t$ to the current point cloud can be expressed as Equation (4).

(4) $$S_{t\rightarrow 0}=\left\{p_{i}^{\prime}=T_{t}^{0}\cdot p_{i}\mid p_{i}\in S_{t}\right\}_{i=0}^{N_{t}-1}$$

Thus, the 4D point cloud set $S^{\prime}=\left\{p_{i}^{\prime}\in\mathbb{R}^{4}\right\}_{i=0}^{N-1}$ with $N=\sum_{t=0}^{F-1}N_{t}$ points can be represented as Equation (5). To distinguish each scan in the 4D point cloud, we add the corresponding time step of each scan as an additional dimension of the point and obtain the spatio-temporal point representation $p_{i}^{\prime}=[x_{i},y_{i},z_{i},t_{i}]^{T}$.

(5) $$S^{\prime}=\left\{S_{0},S_{1\rightarrow 0},\ldots,S_{t\rightarrow 0}\right\}$$
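The aggregation described by Equations (4) and (5) can be sketched as follows. This is an illustrative sketch only: the helper names `transform_scan` and `aggregate_scans` are hypothetical, poses are assumed to be 4x4 row-major nested lists, and the scan index $t$ is used directly as the time value $t_i$, which may differ from the scaling used in the actual implementation.

```python
def transform_scan(points, pose):
    """Apply a 4x4 pose matrix to homogeneous points [x, y, z, 1] (Equation (4))."""
    return [[sum(pose[r][c] * p[c] for c in range(4)) for r in range(4)]
            for p in points]

def aggregate_scans(scans, poses_to_current):
    """Build the 4D point cloud of Equation (5).

    scans[t]            : list of homogeneous points of scan t (t = 0 is current)
    poses_to_current[t] : pose matrix T_t^0 mapping scan t into the current frame
    Returns a list of [x, y, z, t] points.
    """
    cloud = []
    for t, (scan, pose) in enumerate(zip(scans, poses_to_current)):
        for x, y, z, w in transform_scan(scan, pose):
            # Homogeneous -> Cartesian, then append the time step (assumed: scan index)
            cloud.append([x / w, y / w, z / w, float(t)])
    return cloud

identity = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]
shift_x = [[1.0, 0.0, 0.0, 1.0],   # translate by +1 along x
           [0.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0],
           [0.0, 0.0, 0.0, 1.0]]
cloud = aggregate_scans([[[0.0, 0.0, 0.0, 1.0]], [[2.0, 0.0, 0.0, 1.0]]],
                        [identity, shift_x])
```

After this step every point lives in the coordinate frame of the current scan, so any remaining displacement between scans is due to object motion rather than ego-motion.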

Serialization. The SSM, as the core of MambaMOS, typically takes in a sequence of data, such as natural language. Therefore, it is necessary to obtain the sequence $S_{o}^{\prime}$ from the unordered 4D point cloud set $S^{\prime}$ by serialization. The serialization can be understood as a projection function $\Psi$ that transforms the unordered set $S^{\prime}$ into the sequence $S_{o}^{\prime}$. Thus, the process of serialization and deserialization can be described as Equation (6), where $\Psi^{-1}$ is the inverse projection function. One approach to serialize point clouds is by sorting the coordinates of each point (Liu et al., 2024). However, this serialization method fails to adequately preserve the local spatial relationships of the objects, which may result in spatially close points being far apart in the final sequence.

Space-filling curves are mathematical curves that can project data in $N$-dimensional space to one-dimensional continuous space: $\mathbb{Z}^{N}\rightarrow\mathbb{Z}$, which have been applied in recent 3D scene understanding works (Wu et al., 2024; Wang, 2023). Inspired by them, our serialization process utilizes z-order curves (Morton, 1966) and Hilbert curves (Hilbert, 1935), which preserve neighborhood relationships in the original 3D point cloud effectively.

(6) $$\begin{gathered} S_{o}^{\prime}=\Psi(S^{\prime}) \\ S^{\prime}=\Psi^{-1}(S_{o}^{\prime}) \end{gathered}$$
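As an illustration of $\Psi$ and $\Psi^{-1}$ in Equation (6), the sketch below serializes points along a z-order (Morton) curve and restores the original order afterwards. It is a simplified sketch under stated assumptions: only the spatial coordinates are used to form the key, `grid_size` and `bits` are hypothetical parameters, and the Hilbert-curve variant is not shown.

```python
def morton_key(ix, iy, iz, bits=10):
    """Interleave the low `bits` bits of three non-negative grid indices."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (3 * b)
        key |= ((iy >> b) & 1) << (3 * b + 1)
        key |= ((iz >> b) & 1) << (3 * b + 2)
    return key

def serialize(points, grid_size=0.5, bits=10):
    """Psi: sort [x, y, z, t] points by the z-order key of their grid cell.

    Returns (ordered_points, order), where order[k] is the original index of
    the k-th point in the sequence. Cells beyond 2**bits per axis would wrap,
    so grid_size and bits must cover the extent of the scene.
    """
    mins = [min(p[d] for p in points) for d in range(3)]
    keys = []
    for p in points:
        cell = [int((p[d] - mins[d]) / grid_size) for d in range(3)]
        keys.append(morton_key(cell[0], cell[1], cell[2], bits))
    order = sorted(range(len(points)), key=lambda i: keys[i])
    return [points[i] for i in order], order

def deserialize(ordered_points, order):
    """Psi^-1: put each point back at its original index."""
    restored = [None] * len(order)
    for k, i in enumerate(order):
        restored[i] = ordered_points[k]
    return restored
```

Keeping the permutation `order` alongside the sequence is what makes the mapping invertible, so per-point predictions computed on the sequence can be scattered back to the original point set.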

3.3. Time Clue Bootstrapping Embedding

Previous methods (Mersch et al., 2022, 2023; Wang et al., 2023) have not effectively emphasized the dominance of temporal information for each point. This is evident in their equal treatment of spatial occupied information obtained from LiDAR and the corresponding timestamp information from the scan aggregation process. However, the direct overlaying for temporal and spatial information, which belong to different modalities, does not fully exploit the supervisory role of one modality on the other. Therefore, we propose the Time Clue Bootstrapping Embedding (TCBE). It emphasizes temporal information over spatial information based on the principle that time evolution drives the motion of objects, thereby enhancing the coupling of temporal and spatial information.

The structure of TCBE is illustrated in the bottom right of Figure 2. Specifically, TCBE embeds the spatial and temporal information of each point in the ordered point cloud sequence $S_{o}^{\prime}$ using 1D convolution to obtain the corresponding spatial feature $f_{s}$ and temporal feature $f_{t}$ in embedded dimension, both of which possess local characteristics. Firstly, the initial coupled temporal and spatial feature $f_{cou}$ is obtained by adding the temporal feature $f_{t}$ and the spatial feature $f_{s}$, which serves as an alternative implementation of previous embedding methods. Then, in order to emphasize the dominance of temporal information over spatial information, $f_{t}^{\prime}$, which reflects the local temporal evolution trends and is obtained through 1D convolution without changing its channels, is multiplied element-wise by $f_{s}$. Finally, the time-guided spatial feature $f_{TGS}$ is added to the initial coupled temporal and spatial feature $f_{cou}$, resulting in the enhanced temporal and spatial coupling information $f_{cou}^{\prime}$. After undergoing 1D convolution, batch normalization, and activation functions, $f_{cou}^{\prime}$ serves as the output of TCBE.
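The data flow of TCBE can be sketched in plain Python as follows. This is a rough sketch, not the authors' implementation: it assumes that $f_{TGS}$ is the element-wise product of $f_{t}^{\prime}$ and $f_{s}$ (as the description implies), uses ReLU as the activation and a batch normalization without learnable scale and shift, and leaves kernel sizes to the caller. The names `conv1d`, `tcbe_couple`, `tcbe` and the `params` dictionary keys are hypothetical.

```python
import math

def conv1d(seq, weight, bias):
    """1D convolution along the sequence with zero padding.

    seq: N rows of C_in values; weight: C_out x C_in x K (K odd); bias: C_out.
    """
    n, half = len(seq), len(weight[0][0]) // 2
    out = []
    for i in range(n):
        row = []
        for w_out, b_out in zip(weight, bias):
            acc = b_out
            for c, w_c in enumerate(w_out):
                for j, w in enumerate(w_c):
                    pos = i + j - half
                    if 0 <= pos < n:
                        acc += w * seq[pos][c]
            row.append(acc)
        out.append(row)
    return out

def tcbe_couple(f_s, f_t, f_t_prime):
    """f_cou' = (f_s + f_t) + f_t' * f_s, all element-wise."""
    return [[s + t + tp * s for s, t, tp in zip(rs, rt, rtp)]
            for rs, rt, rtp in zip(f_s, f_t, f_t_prime)]

def batch_norm_relu(feat, eps=1e-5):
    """Per-channel normalization over the sequence, then ReLU (assumed activation)."""
    n, channels = len(feat), len(feat[0])
    out = [[0.0] * channels for _ in range(n)]
    for ch in range(channels):
        col = [row[ch] for row in feat]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        for i, v in enumerate(col):
            out[i][ch] = max(0.0, (v - mean) / math.sqrt(var + eps))
    return out

def tcbe(points, params):
    """points: serialized [x, y, z, t] rows; params: name -> (weight, bias)."""
    xyz = [p[:3] for p in points]   # spatial channel
    t = [p[3:4] for p in points]    # temporal channel, kept separate
    f_s = conv1d(xyz, *params["spatial"])
    f_t = conv1d(t, *params["temporal"])
    f_t_prime = conv1d(f_t, *params["trend"])  # same channel count as f_t
    f_cou_prime = tcbe_couple(f_s, f_t, f_t_prime)
    return batch_norm_relu(conv1d(f_cou_prime, *params["out"]))
```

The multiplicative term is what distinguishes this embedding from a plain sum or concatenation: the temporal feature rescales the spatial feature per channel instead of merely sitting beside it.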

3.4. Motion-aware State Space Model

Although there are some similarities in form between the selective copy task (Arjovsky et al., 2016; Gu and Dao, 2023) in NLP and the MOS task as mentioned before, the direct application of Mamba (Gu and Dao, 2023) cannot effectively exploit the temporal features. This is attributed to the fact that the original Mamba (Gu and Dao, 2023) is designed for one-dimensional natural language with a certain causal relationship. However, the serialized multi-scan point cloud sequence cannot reflect strong causality. Thus, we propose MSSM to compensate for the shortcomings of Mamba (Gu and Dao, 2023) on MOS.

The main design idea of MSSM is to enhance the original Mamba’s perception of temporal features regarding moving objects by using cross-product attention between single-scan features and multi-scan features. As shown in the upper left of Figure 2, it is mainly composed of linear layers, an activation function $\sigma$, and an SSM with the selective scan mechanism. Let the input point cloud feature with batch size $B$, sequence length $N$, and number of channels $C$ be denoted by $f_I \in \mathbb{R}^{B \times N \times C}$, which goes through three branches. We derive the upper and middle branches of our MambaMOS from the main branch of Mamba (Gu and Dao, 2023). The upper branch extracts the appearance features of each object in the single-scan point cloud, while the middle branch focuses more on the temporal features of moving objects in the 4D point cloud. Since the MOS task only focuses on moving objects, we aim for the MSSM to assign lower attention to unmovable objects such as roads or tree trunks. Therefore, a feature weighting process is required. Inspired by the gated attention units (Hua et al., 2022), we employ a simple gating mechanism as the bottom branch of MSSM to allocate weights to features in each hidden state, thereby determining whether the features are expressed.

Specifically, to obtain the single-scan feature $f_S \in \mathbb{R}^{B' \times N_p \times C}$, where $B' = B \times F$, the upper branch first performs Reversed Aggregation (RA), which separates each scan of $S'$ and concatenates them as separate batch entries after zero-padding to $N_p$. Then the appearance features of the single scan, $f_A'$, are obtained through 1D convolution and single-scan aggregation. This process can be written as:

(7) $f_A^{\prime} = \sigma\left(\operatorname{Conv1d}\left(\operatorname{RA}\left(f_I\right)\right)\right)$
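To make the Reversed Aggregation step concrete, the following NumPy sketch splits one aggregated sequence into its scans and zero-pads each scan to a common length $N_p$. It is a minimal illustration under assumptions: a single aggregated sequence (so $B = 1$ and $B' = F$), a per-point scan index as the grouping key, and $N_p$ taken as the largest scan length. The function name `reversed_aggregation` and the returned validity mask are hypothetical conveniences, not part of the described method.

```python
import numpy as np

def reversed_aggregation(f_i, scan_idx, num_scans):
    """Split an aggregated sequence into per-scan sequences.

    f_i:      (N, C) features of one aggregated multi-scan sequence
    scan_idx: (N,) integer scan id of every point, in [0, num_scans)
    returns:  padded features (num_scans, N_p, C) and a validity mask (num_scans, N_p)
    """
    counts = np.bincount(scan_idx, minlength=num_scans)
    n_p = int(counts.max())                       # assumed padding length
    out = np.zeros((num_scans, n_p, f_i.shape[1]), dtype=f_i.dtype)
    mask = np.zeros((num_scans, n_p), dtype=bool)
    for s in range(num_scans):
        pts = f_i[scan_idx == s]                  # keeps the serialized order inside a scan
        out[s, :len(pts)] = pts
        mask[s, :len(pts)] = True                 # False marks zero-padding
    return out, mask

f_i = np.arange(10, dtype=float).reshape(5, 2)    # 5 points, 2 channels
scan_idx = np.array([0, 1, 0, 2, 1])
f_s, mask = reversed_aggregation(f_i, scan_idx, num_scans=3)
```

In this toy input, scan 2 holds a single point, so its second slot in `f_s` is zero-padding and is flagged as invalid in `mask`.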

The middle branch employs 1D convolution to obtain the temporal and appearance features of moving objects in multiple scans. The output of this process is denoted as $f_M'$. Subsequently, $f_M'$ is fused with the output of the upper branch, $f_A'$, through cross-product attention to obtain $f_{MG}$. The fusion process can be described as follows:

(8) $f_{MG} = \operatorname{Sigmoid}\left(f_M^{\prime}\right) \otimes f_A^{\prime} + f_M^{\prime}$
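Equation 8 is a sigmoid-gated product with a residual connection, which can be written directly in NumPy. The sketch below assumes that $f_M'$ and $f_A'$ have already been brought to the same shape; the function name `motion_guided_fusion` is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def motion_guided_fusion(f_m, f_a):
    """Equation 8: f_MG = Sigmoid(f_M') * f_A' + f_M' (element-wise)."""
    return sigmoid(f_m) * f_a + f_m

f_m = np.array([[0.0, 2.0], [-2.0, 0.0]])   # toy multi-scan features
f_a = np.ones_like(f_m)                     # toy single-scan appearance features
f_mg = motion_guided_fusion(f_m, f_a)
```

Because of the residual term, the multi-scan feature passes through unchanged wherever the appearance feature is zero, for example at padded positions.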

In the subsequent design, we follow the idea of the original Mamba, that is, the final output $f'$ of the block is obtained by element-wise multiplication of the result of the main branch and the result $f_G$ of the gated branch after a linear projection. This process is described as follows:

(9) $f^{\prime} = \operatorname{SSM}\left(\sigma\left(f_{MG}\right)\right) \otimes f_G$
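The gating in Equation 9 can be illustrated with a toy stand-in for the SSM. The sketch below is not Mamba's selective scan: `toy_ssm` is a hypothetical, non-selective diagonal recurrence $h_t = a h_{t-1} + b x_t$, $y_t = c h_t$ with fixed scalars, SiLU is assumed for $\sigma$, and $f_G$ is taken as a given input rather than computed from $f_I$.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def toy_ssm(x, a=0.9, b=1.0, c=1.0):
    """Non-selective linear recurrence over the sequence axis; x: (N, C) float array."""
    h = np.zeros(x.shape[1])
    y = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + b * x[t]        # state update
        y[t] = c * h                # read-out
    return y

def mssm_output(f_mg, f_g):
    """Equation 9: f' = SSM(sigma(f_MG)) * f_G (element-wise gate)."""
    return toy_ssm(silu(f_mg)) * f_g

f_mg = np.ones((3, 1))
f_g = np.array([[1.0], [0.0], [1.0]])   # a zero gate suppresses the second position
f_out = mssm_output(f_mg, f_g)
```

The gate only masks the read-out: the hidden state still accumulates the suppressed position, so the third output reflects all three inputs.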

3.5. Loss Function

Before performing the loss calculation, we first deserialize the obtained sequence segmentation results so that they correspond to the initial unordered point cloud set, as in Equation 6. Afterwards, following the majority of 3D segmentation methods, we adopt the combination of cross-entropy loss ($\mathcal{L}_{\mathrm{ce}}$) and Lovász-Softmax loss (Berman et al., 2018) ($\mathcal{L}_{\mathrm{ls}}$) as the joint loss $\mathcal{L} = \mathcal{L}_{\mathrm{ce}} + \mathcal{L}_{\mathrm{ls}}$ for supervised training.
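As a reference for how the two terms combine, the following NumPy sketch evaluates the joint loss on per-point class probabilities. It is a forward-only illustration with no gradients. The Lovász-Softmax part follows the sorted-error formulation of Berman et al. (2018) averaged over the classes present in the labels; the equal weighting of the two terms is taken from the equation above, and all function names are hypothetical.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean negative log-likelihood; probs: (N, K), labels: (N,) ints."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))

def lovasz_grad(gt_sorted):
    """Gradient of the Lovasz extension of the Jaccard loss w.r.t. sorted errors."""
    gts = gt_sorted.sum()
    intersection = gts - np.cumsum(gt_sorted)
    union = gts + np.cumsum(1.0 - gt_sorted)
    jaccard = 1.0 - intersection / union
    jaccard[1:] = jaccard[1:] - jaccard[:-1]
    return jaccard

def lovasz_softmax(probs, labels):
    losses = []
    for c in range(probs.shape[1]):
        fg = (labels == c).astype(float)
        if fg.sum() == 0:
            continue                              # skip classes absent from the labels
        errors = np.abs(fg - probs[:, c])
        order = np.argsort(-errors)               # largest errors first
        losses.append(float(np.dot(errors[order], lovasz_grad(fg[order]))))
    return float(np.mean(losses))

def joint_loss(probs, labels):
    return cross_entropy(probs, labels) + lovasz_softmax(probs, labels)

labels = np.array([0, 1, 1, 0])
perfect = np.eye(2)[labels]                       # one-hot on the correct class
wrong = np.eye(2)[1 - labels]                     # one-hot on the wrong class
uniform = np.full((4, 2), 0.5)
```

A perfect prediction drives both terms to zero, while a fully wrong one-hot prediction saturates the Lovász-Softmax term at 1.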

4. Experiment

4.1. Experiment Setups

We perform a variety of experiments to verify the proposed MambaMOS on the SemanticKITTI-MOS dataset (Behley et al., 2020; Chen et al., 2021). Sequences 00–07 and 09–10 are used as the training set, sequence 08 is used as the validation set, and sequences 11–21 are used as the test set, following the same division as previous MOS methods (Mersch et al., 2023; Cheng et al., 2024; Wang et al., 2023). The KITTI-Road dataset (Geiger et al., 2013) is also utilized in the experiments for comparison with other MOS methods, and the same partitioning approach as (Cheng et al., 2024) is maintained. The entire training is conducted on four NVIDIA RTX A6000 GPUs with 48 GB of VRAM for 50 epochs with a batch size of 4. AdamW (Loshchilov and Hutter, 2018) with a weight decay of 0.005 is used as the optimizer, and the learning rate is set to 0.00032. A grid size of 0.09 m is applied to voxelize the input aggregated point cloud, and $F = 8$ scans are used as input, the same as in (Chen et al., 2021; Sun et al., 2022; Cheng et al., 2024). Moreover, common point cloud data augmentation approaches such as random rotation and random flipping are applied during training to enhance the generalization capacity of MambaMOS. All ablation experiments are conducted on eight NVIDIA GeForce RTX 3090 GPUs, using a four-scan input ($F = 4$), a batch size of 8, and automatic mixed precision. We report the voxelized moving object IoU for ablations. Additionally, similar to (Chen et al., 2021; Kim et al., 2022; Cheng et al., 2024), we employ additional semantic labels for training. During the validation and testing stages, Intersection-over-Union (IoU) (Everingham et al., 2010) is adopted as the metric to evaluate the performance.
Following previous methods (Li et al., 2023; Li and Zhuang, 2023; Mersch et al., 2023), all experiments report the IoU of moving objects, $IoU_{MOS}$, as the main evaluation metric, defined in Equation 10 in terms of true positives $TP$, false positives $FP$, and false negatives $FN$:

(10) $IoU_{MOS} = \frac{TP}{TP + FP + FN}$
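Equation 10 translates into a short counting routine. The sketch below assumes per-point integer labels in which the moving class has id 1; the id and the function name `iou_mos` are hypothetical and do not reflect the benchmark's actual label mapping.

```python
def iou_mos(pred, label, moving=1):
    """IoU of the moving class from per-point predictions and ground truth."""
    pairs = list(zip(pred, label))
    tp = sum(p == moving and l == moving for p, l in pairs)   # moving predicted as moving
    fp = sum(p == moving and l != moving for p, l in pairs)   # static predicted as moving
    fn = sum(p != moving and l == moving for p, l in pairs)   # moving predicted as static
    denom = tp + fp + fn
    return tp / denom if denom else float("nan")

pred = [1, 1, 0, 0, 1, 0]
label = [1, 0, 0, 1, 1, 0]
score = iou_mos(pred, label)   # TP = 2, FP = 1, FN = 1
```

True negatives do not enter the metric, so the large number of static points in a scan cannot inflate the score.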

4.2. Moving Object Segmentation Performance

We analyze the comparison with other SoTA methods from both quantitative and qualitative perspectives.

Quantitative Analysis. Table 1 presents the quantitative comparison of MambaMOS with state-of-the-art methods on the MOS benchmark (Behley et al., 2020; Chen et al., 2021). All results reported for each method are the best-reported results in their respective papers. Because some methods (Sun et al., 2022; Wang et al., 2023; Cheng et al., 2024) use the KITTI-Road dataset (Geiger et al., 2013) as additional training data in their comparisons, we follow the principle of fair comparison by using consistent training data, and distinguish these results from the original comparison with the symbol † in the table.

Without using additional training data, MambaMOS outperforms almost all methods on the benchmark. Specifically, MambaMOS surpasses Two-streamMOS (Li and Zhuang, 2023), which integrates point clouds and images as input, by 4.4% and 2.6% on the validation set and hidden test set, respectively. We attribute this significant improvement largely to the geometric losses present in projection-based methods. In the comparison with non-projection-based methods, MambaMOS outperforms LiDAR-IMU-GNSS (Li et al., 2023), which includes ground optimization as additional pre-processing, by 3.3% on the validation set and by 0.7% on the test set, owing to the stronger coupling of temporal and spatial information in MambaMOS. Although MambaMOS uses a fixed number of scans as input, it still surpasses MapMOS (Mersch et al., 2023), which leverages a local map instead, by a significant margin of 9.6% on the hidden test set. After incorporating additional training data, MambaMOS, trained with the same settings as before, still outperforms other methods under comparable conditions. MambaMOS surpasses the state-of-the-art MF-MOS (Cheng et al., 2024) by 3.4% on the hidden test set, and outperforms InsMOS (Wang et al., 2023), which utilizes additional instance bounding boxes for determining moving instances, by 3.9% and 4.5% on the validation set and hidden test set, respectively.

Table 1. Comparison with state-of-the-art methods on the SemanticKITTI-MOS benchmark, reported as $IoU_{MOS}$ (%). † denotes using additional KITTI-Road for training.

| Method | Validation 08 | Test 11-21 |
| --- | --- | --- |
| LiMoSeg (Mohapatra et al., 2022) | 52.6 | - |
| LMNet (Chen et al., 2021) | 67.1 | 54.5 |
| SSF-MOS (Song et al., 2024) | 70.1 | - |
| MotionSeg3D (Sun et al., 2022) | 71.4 | 64.9 |
| RVMOS (Kim et al., 2022) | 71.2 | 74.7 |
| 4DMOS (Mersch et al., 2022) | 77.2 | 65.2 |
| InsMOS (Wang et al., 2023) | 73.2 | - |
| MF-MOS (Cheng et al., 2024) | 76.1 | - |
| MotionBEV (Zhou et al., 2023) | 76.5 | 69.7 |
| MapMOS (Mersch et al., 2023) | 86.1 | 66.0 |
| Two-streamMOS (Li and Zhuang, 2023) | 77.9 | 73.0 |
| LiDAR-IMU-GNSS (Li et al., 2023) | 79.0 | 74.9 |
| MambaMOS | 82.3 | 75.6 |
| LMNet† (Chen et al., 2021) | 63.8 | 60.5 |
| MotionSeg3D† (Sun et al., 2022) | 69.3 | 70.2 |
| MotionBEV† (Zhou et al., 2023) | 64.6 | 74.9 |
| InsMOS† (Wang et al., 2023) | 69.4 | 75.6 |
| MF-MOS† (Cheng et al., 2024) | - | 76.7 |
| MambaMOS† | 73.3 | 80.1 |

To further analyze the advantages brought by our method, we conduct a detailed comparison of the segmentation performance of existing methods on the SemanticKITTI-MOS validation set at different distances, as shown in Table 2. The metrics in Table 2 are either reported in the respective papers or computed using the publicly available weights. Other methods such as RVMOS (Kim et al., 2022), Two-streamMOS (Li and Zhuang, 2023), and LiDAR-IMU-GNSS (Li et al., 2023) neither release their weights nor report these metrics in their papers, and are therefore not included in the comparison.

Table 2. MOS performance on the SemanticKITTI-MOS validation set for points at different distances: Close (< 20 m), Medium (≥ 20 m and < 50 m), and Far (≥ 50 m). R denotes recall and P denotes precision.

| Method | Close $IoU_{MOS}$ | Close R | Close P | Medium $IoU_{MOS}$ | Medium R | Medium P | Far $IoU_{MOS}$ | Far R | Far P |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LMNet (Chen et al., 2021) | 70.72 | 76.89 | 89.80 | 43.88 | 54.30 | 69.56 | 0.00 | 0.00 | - |
| MotionSeg3D (Sun et al., 2022) | 71.66 | 79.97 | 87.35 | 52.21 | 59.27 | 81.40 | 4.99 | 4.99 | 100.00 |
| MotionBEV (Zhou et al., 2023) | 80.85 | 85.40 | 93.81 | 56.35 | 59.89 | 90.50 | 0.00 | 0.00 | - |
| 4DMOS (Mersch et al., 2022) | 78.43 | 82.11 | 94.59 | 68.71 | 72.62 | 92.74 | 41.00 | 41.00 | 100.00 |
| InsMOS (Wang et al., 2023) | 75.29 | 88.78 | 83.21 | 57.67 | 66.81 | 80.84 | 10.88 | 10.89 | 98.63 |
| MF-MOS (Cheng et al., 2024) | 79.31 | 84.98 | 92.23 | 54.67 | 64.10 | 78.81 | 47.97 | 50.08 | 91.94 |
| MambaMOS | 83.69 | 87.30 | 95.29 | 72.48 | 78.24 | 90.78 | 94.44 | 97.73 | 96.56 |

It is well known that the point cloud becomes sparser as the distance between an object and the LiDAR increases. As shown in Table 2, most MOS methods achieve satisfactory segmentation results at close distances. However, their segmentation performance declines sharply at distances between 20 m and 50 m. Furthermore, beyond 50 m, some projection-based methods such as LMNet (Chen et al., 2021), MotionSeg3D (Sun et al., 2022), and MotionBEV (Zhou et al., 2023) fail to discern the motion attributes of the objects. Although MF-MOS (Cheng et al., 2024), with its focus on motion features, surpasses non-projection-based methods like 4DMOS (Mersch et al., 2022) and InsMOS (Wang et al., 2023) in segmenting distant moving objects, it is still limited in recognizing the motion attributes of distant objects due to the geometric losses caused by the projection process, which prevent the strong coupling of spatial and temporal information for the objects. In contrast, MambaMOS segments moving objects precisely even in extremely sparse point clouds. This indirectly supports the viewpoint of our work: reinforcing temporal information can effectively improve MOS performance when the spatial features of the targets are not prominent.
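The three metrics in Table 2 are not independent. With $R = TP/(TP+FN)$ and $P = TP/(TP+FP)$, Equation 10 gives $1/IoU_{MOS} = 1/R + 1/P - 1$, so each IoU entry can be recomputed from its recall and precision. The short check below applies this identity to the MambaMOS row. It is only a consistency check on the rounded values printed in the table, and the function name is hypothetical.

```python
def iou_from_recall_precision(recall, precision):
    """IoU implied by recall and precision: 1/IoU = 1/R + 1/P - 1."""
    return 1.0 / (1.0 / recall + 1.0 / precision - 1.0)

# MambaMOS row of Table 2: (recall %, precision %, reported IoU %)
mambamos_rows = {
    "close": (87.30, 95.29, 83.69),
    "medium": (78.24, 90.78, 72.48),
    "far": (97.73, 96.56, 94.44),
}

recomputed = {
    name: 100.0 * iou_from_recall_precision(r / 100.0, p / 100.0)
    for name, (r, p, _) in mambamos_rows.items()
}
```

The recomputed values agree with the reported IoU to within rounding. The same identity explains why methods with 100.00 precision in the far range have an IoU equal to their recall.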

Qualitative Analysis. As shown in Figure 3, MambaMOS achieves significant performance improvements over other methods on the SemanticKITTI-MOS validation set (Behley et al., 2020; Chen et al., 2021). We analyze this phenomenon in detail using the cases shown in the figure. MF-MOS (Cheng et al., 2024), InsMOS (Wang et al., 2023), and 4DMOS (Mersch et al., 2022) all fail to correctly determine the motion attributes of the objects, resulting in a large number of false negatives. This can be attributed to the fact that they rely only on a weak coupling between spatial and temporal information for motion estimation. Slow-moving or distant moving objects do not exhibit obvious characteristics in terms of spatial information, so methods (Cheng et al., 2024; Wang et al., 2023; Mersch et al., 2022) that do not strengthen the temporal information estimate their motion poorly. MambaMOS, however, effectively addresses this problem by incorporating strong temporal information coupling. Furthermore, to reduce false positive predictions, we also categorize stationary vehicles as a movable class for training, following (Cheng et al., 2024; Kim et al., 2022; Chen et al., 2021), which further enhances the model’s understanding of motion scenes.

Figure 3. Visualization comparison results of MambaMOS with MF-MOS (Cheng et al., 2024), InsMOS (Wang et al., 2023), and 4DMOS (Mersch et al., 2022) on the SemanticKITTI validation set. We overlay their respective predictions for the current scan and the past seven scans to visually demonstrate the results of MOS.

4.3. Ablation Study

Since the SSM receives sequential features, different spatial serialization combinations affect overall performance. We explore how combinations of the $z$-curve (Morton, 1966) and the Hilbert curve (Hilbert, 1935), both of which have good spatial locality, influence MOS performance. As space-filling curves traverse the spatial points based on the order of $x$, $y$, and $z$, prioritizing $y$ yields a different serialization than prioritizing $x$. We denote this variant with $T$. As shown in Table 3, richer serialization methods yield better performance on the validation set. This is because multiple serialization methods capture different contextual relationships of sequences, reducing overfitting while enhancing the model’s understanding of dynamic objects.
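The effect of axis priority on a $z$-curve can be seen with a small bit-interleaving sketch. It assumes non-negative integer voxel coordinates and places the $x$ bit in the lowest position of each bit triple. That bit layout, the 10-bit coordinate width, and the function names `morton3d` and `morton3d_t` are illustrative assumptions rather than the serialization code used in the experiments, and the Hilbert curve is not covered here.

```python
def morton3d(x, y, z, bits=10):
    """Z-order (Morton) code: interleave the bits of x, y, z."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)        # x occupies the lowest bit of each triple
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

def morton3d_t(x, y, z, bits=10):
    """Transposed variant: swap the roles of x and y before interleaving."""
    return morton3d(y, x, z, bits)

voxels = [(1, 0, 0), (0, 1, 0), (1, 1, 0), (0, 0, 1)]
order_z = sorted(voxels, key=lambda v: morton3d(*v))
order_zt = sorted(voxels, key=lambda v: morton3d_t(*v))
```

The two orderings visit the same voxels but differ in which neighbour comes first, so a sequence model sees different local contexts under each pattern.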

Table 4 presents two methods proposed in our study to enhance the coupling of temporal and spatial information. We conducted ablation experiments to demonstrate that the proposed modules can enhance the model’s perception of moving objects. When MSSM is not used, we replaced it with the original Mamba block (Gu and Dao, 2023), and we employed a simple 3D convolution for information embedding when TCBE is not applied.

From the experimental results, it can be observed that applying only MSSM or only TCBE improves the performance over the baseline by 1.86% and 1.41%, respectively. This indicates that both MSSM and TCBE can enhance the coupling of temporal and spatial features and improve the model’s perception of motion features. Moreover, applying MSSM alone outperforms applying TCBE alone by 0.45%. This is because MSSM, based on the interaction between single-scan features and multi-scan features, focuses more on deep-level spatio-temporal information coupling and can learn the motion attributes of objects more comprehensively. Finally, adding TCBE on top of MSSM further strengthens the emphasis on temporal information during the embedding phase, which aligns with the fundamental logic of motion recognition and achieves the best performance, surpassing the baseline by 2.25%.
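The margins quoted above follow directly from the entries of Table 4, as the short check below shows. The dictionary keys encode (MSSM, TCBE) and are only a convenience for this illustration.

```python
# IoU_MOS (%) from Table 4, keyed by (uses MSSM, uses TCBE)
table4 = {
    (False, False): 75.21,   # baseline: original Mamba block, simple embedding
    (True, False): 77.07,    # MSSM only
    (False, True): 76.62,    # TCBE only
    (True, True): 77.46,     # full model
}

baseline = table4[(False, False)]
gains = {k: round(v - baseline, 2) for k, v in table4.items()}
mssm_vs_tcbe = round(table4[(True, False)] - table4[(False, True)], 2)
```

The combined gain (2.25) is smaller than the sum of the individual gains (1.86 + 1.41 = 3.27), which suggests that the two modules partly address the same weakness.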

Table 3. Ablation about the serialization combination on the SemanticKITTI-MOS validation set.

| Pattern | $IoU_{MOS}$ (%) |
| --- | --- |
| Z | 75.11 |
| Hilbert | 76.38 |
| Z + Z$^T$ | 76.57 |
| Hilbert + Hilbert$^T$ | 76.41 |
| Z + Z$^T$ + Hilbert + Hilbert$^T$ | 77.46 |

Table 4. Ablation about each module in MambaMOS on the SemanticKITTI-MOS validation set.

| MSSM | TCBE | $IoU_{MOS}$ (%) |
| --- | --- | --- |
| ✗ | ✗ | 75.21 |
| ✓ | ✗ | 77.07 |
| ✗ | ✓ | 76.62 |
| ✓ | ✓ | 77.46 |

4.4. Generalization Performance Analysis

Since the majority of the SemanticKITTI dataset (Behley et al., 2020) was collected in residential areas, we fine-tune MambaMOS on the KITTI-Road dataset (Geiger et al., 2013) to evaluate its generalizability to new environments. We follow the same data partitioning strategy as MF-MOS (Cheng et al., 2024), InsMOS (Wang et al., 2023), and MotionSeg3D (Sun et al., 2022), and compare MambaMOS with methods that use fixed scans as input and have publicly available weights. The original weights of all methods shown in Table 5 are open-source and trained exclusively on the SemanticKITTI-MOS (Behley et al., 2020; Chen et al., 2021) dataset. They are then fine-tuned for 10 epochs on the KITTI-Road training set (Geiger et al., 2013) to obtain the final results. The results indicate that even with a small amount of data and minimal fine-tuning, MambaMOS still achieves better results than previous methods, demonstrating its generalization capability for adapting to new environments.

Table 5. Comparison of fine-tuning performance with state-of-the-art methods on the KITTI-Road dataset.

| Method | $IoU_{MOS}$ (%) |
| --- | --- |
| LMNet (Chen et al., 2021) | 87.4 |
| MotionBEV (Zhou et al., 2023) | 80.5 |
| 4DMOS (Mersch et al., 2022) | 81.0 |
| InsMOS (Wang et al., 2023) | 83.9 |
| MF-MOS (Cheng et al., 2024) | 87.9 |
| MambaMOS | 89.4 |

5. Conclusion

This paper introduces MambaMOS, a novel framework for moving object segmentation, aiming to address the issue of weak spatio-temporal coupling in existing methods. Specifically, we introduce the Time Clue Bootstrapping Embedding to achieve the shallow coupling of temporal and spatial information of the objects. Furthermore, we underscore the importance of temporal information as the primary cue for recognizing motion attributes, thereby enhancing the model’s sensitivity to motion features. To achieve deeper spatio-temporal coupling, we propose the Motion-aware State Space Model, which facilitates interaction between single-scan and multi-scan features. Leveraging the SSM’s linear complexity and strong contextual modeling capability, the MSSM achieves strong spatio-temporal coupling of features. Extensive experiments validate the effectiveness of our method, demonstrating state-of-the-art performance on both SemanticKITTI-MOS and KITTI-Road datasets. Additionally, this paper marks the pioneering application of SSM to the MOS task, and establishes a significant connection between point cloud segmentation in 3D vision and natural language tasks, offering valuable insights for future research directions.

References

  • Arjovsky et al. (2016) Martin Arjovsky, Amar Shah, and Yoshua Bengio. 2016. Unitary evolution recurrent neural networks. In International Conference on Machine Learning (ICML).
  • Arora et al. (2021) Mehul Arora, Louis Wiesmann, Xieyuanli Chen, and Cyrill Stachniss. 2021. Mapping the static parts of dynamic scenes from 3D LiDAR point clouds exploiting ground segmentation. In 2021 European Conference on Mobile Robots (ECMR).
  • Behley et al. (2020) Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jürgen Gall. 2020. SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
  • Berman et al. (2018) Maxim Berman, Amal Rannen Triki, and Matthew B Blaschko. 2018. The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Chen et al. (2021) Xieyuanli Chen, Shijie Li, Benedikt Mersch, Louis Wiesmann, Jürgen Gall, Jens Behley, and Cyrill Stachniss. 2021. Moving object segmentation in 3D LiDAR data: A learning-based approach exploiting sequential data. IEEE Robotics and Automation Letters (RA-L) 6, 4 (2021), 6529–6536.
  • Chen et al. (2022) Xieyuanli Chen, Benedikt Mersch, Lucas Nunes, Rodrigo Marcuzzi, Ignacio Vizzo, Jens Behley, and Cyrill Stachniss. 2022. Automatic labeling to generate training data for online LiDAR-based moving object segmentation. IEEE Robotics and Automation Letters (RA-L) 7, 3 (2022), 6107–6114.
  • Chen et al. (2019) Xieyuanli Chen, Andres Milioto, Emanuele Palazzolo, Philippe Giguere, Jens Behley, and Cyrill Stachniss. 2019. SuMa++: Efficient LiDAR-based Semantic SLAM. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
  • Cheng et al. (2024) Jintao Cheng, Kang Zeng, Zhuoxu Huang, Xiaoyu Tang, Jin Wu, Chengxi Zhang, Xieyuanli Chen, and Rui Fan. 2024. MF-MOS: A Motion-Focused Model for Moving Object Segmentation. In 2024 IEEE International Conference on Robotics and Automation (ICRA).
  • Cortinhal et al. (2020) Tiago Cortinhal, George Tzelepis, and Eren Erdal Aksoy. 2020. SalsaNext: Fast, Uncertainty-Aware Semantic Segmentation of LiDAR Point Clouds. In International Symposium on Visual Computing (ISVC).
  • Dao et al. (2022) Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems (NeurIPS) (2022).
  • Everingham et al. (2010) Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2010. The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision (IJCV) 88 (2010), 303–338.
  • Fan et al. (2021) Lue Fan, Xuan Xiong, Feng Wang, Naiyan Wang, and Zhaoxiang Zhang. 2021. RangeDet: In Defense of Range View for LiDAR-based 3D Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
  • Geiger et al. (2013) Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. 2013. Vision meets Robotics: The KITTI Dataset. International Journal of Robotics Research (IJRR) 32, 11 (2013), 1231–1237.
  • Gu (2023) Albert Gu. 2023. Modeling Sequences with Structured State Spaces. Stanford University.
  • Gu and Dao (2023) Albert Gu and Tri Dao. 2023. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752 (2023).
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Hilbert (1935) David Hilbert. 1935. Dritter Band: Analysis· Grundlagen der Mathematik· Physik Verschiedenes: Nebst Einer Lebensgeschichte. (1935).
  • Hua et al. (2022) Weizhe Hua, Zihang Dai, Hanxiao Liu, and Quoc Le. 2022. Transformer quality in linear time. In International Conference on Machine Learning (ICML).
  • Kim and Kim (2020) Giseop Kim and Ayoung Kim. 2020. Remove, then revert: Static point cloud map construction using multiresolution range images. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
  • Kim et al. (2022) Jaeyeul Kim, Jungwan Woo, and Sunghoon Im. 2022. RVMOS: Range-View Moving Object Segmentation Leveraged by Semantic and Motion Features. IEEE Robotics and Automation Letters (RA-L) 7, 3 (2022), 8044–8051.
  • Kong et al. (2023) Lingdong Kong, Youquan Liu, Runnan Chen, Yuexin Ma, Xinge Zhu, Yikang Li, Yuenan Hou, Yu Qiao, and Ziwei Liu. 2023. Rethinking range view representation for LiDAR segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
  • Lai et al. (2022) Xin Lai, Jianhui Liu, Li Jiang, Liwei Wang, Hengshuang Zhao, Shu Liu, Xiaojuan Qi, and Jiaya Jia. 2022. Stratified Transformer for 3D Point Cloud Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Li and Zhuang (2023) Qipeng Li and Yuan Zhuang. 2023. An efficient image-guided-based 3D point cloud moving object segmentation with transformer-attention in autonomous driving. International Journal of Applied Earth Observation and Geoinformation 123 (2023), 103488.
  • Li et al. (2023) Qipeng Li, Yuan Zhuang, and Jianzhu Huai. 2023. Multi-sensor fusion for robust localization with moving object segmentation in complex dynamic 3D scenes. International Journal of Applied Earth Observation and Geoinformation (2023).
  • Lim et al. (2021) Hyungtae Lim, Sungwon Hwang, and Hyun Myung. 2021. ERASOR: Egocentric ratio of pseudo occupancy-based dynamic object removal for static 3D point cloud map building. IEEE Robotics and Automation Letters (RA-L) 6, 2 (2021), 2272–2279.
  • Lim et al. (2023) HyungTae Lim, Lucas Nunes, Benedikt Mersch, Xieyuanli Chen, Jens Behley, Hyun Myung, and Cyrill Stachniss. 2023. ERASOR2: Instance-aware robust 3D mapping of the static world in dynamic scenes. In Robotics: Science and Systems (RSS).
  • Liu et al. (2024) Jiuming Liu, Ruiji Yu, Yian Wang, Yu Zheng, Tianchen Deng, Weicai Ye, and Hesheng Wang. 2024. Point Mamba: A Novel Point Cloud Backbone Based on State Space Model with Octree-Based Ordering Strategy. arXiv preprint arXiv:2403.06467 (2024).
  • Loshchilov and Hutter (2018) Ilya Loshchilov and Frank Hutter. 2018. Decoupled Weight Decay Regularization. In International Conference on Learning Representations (ICLR).
  • Mersch et al. (2022) Benedikt Mersch, Xieyuanli Chen, Ignacio Vizzo, Lucas Nunes, Jens Behley, and Cyrill Stachniss. 2022. Receding Moving Object Segmentation in 3D LiDAR Data Using Sparse 4D Convolutions. IEEE Robotics and Automation Letters (RA-L) 7, 3 (2022), 7503–7510.
  • Mersch et al. (2023) Benedikt Mersch, Tiziano Guadagnino, Xieyuanli Chen, Ignacio Vizzo, Jens Behley, and Cyrill Stachniss. 2023. Building volumetric beliefs for dynamic environments exploiting map-based moving object segmentation. IEEE Robotics and Automation Letters (RA-L) 8, 8 (2023), 5180–5187.
  • Milioto et al. (2019) Andres Milioto, Ignacio Vizzo, Jens Behley, and Cyrill Stachniss. 2019. RangeNet++: Fast and Accurate LiDAR Semantic Segmentation. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
  • Mohapatra et al. (2022) Sambit Mohapatra, Mona Hodaei, Senthil Yogamani, Stefan Milz, Heinrich Gotzig, Martin Simon, Hazem Rashed, and Patrick Maeder. 2022. LiMoSeg: Real-time Bird’s Eye View based LiDAR Motion Segmentation. In International Conference on Computer Vision Theory and Applications (VISAPP).
  • Morton (1966) Guy M Morton. 1966. A computer oriented geodetic data base and a new technique in file sequencing. (1966).
  • Pfreundschuh et al. (2021) Patrick Pfreundschuh, Hubertus FC Hendrikx, Victor Reijgwart, Renaud Dubé, Roland Siegwart, and Andrei Cramariuc. 2021. Dynamic Object Aware LiDAR SLAM based on Automatic Generation of Training Data. In 2021 IEEE International Conference on Robotics and Automation (ICRA).
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI).
  • Schauer and Nüchter (2018) Johannes Schauer and Andreas Nüchter. 2018. The Peopleremover - Removing Dynamic Objects From 3-D Point Cloud Data by Traversing a Voxel Occupancy Grid. IEEE Robotics and Automation Letters (RA-L) 3, 3 (2018), 1679–1686.
  • Schmid et al. (2023) Lukas Schmid, Olov Andersson, Aurelio Sulser, Patrick Pfreundschuh, and Roland Siegwart. 2023. Dynablox: Real-Time Detection of Diverse Dynamic Objects in Complex Environments. IEEE Robotics and Automation Letters (RA-L) 8, 10 (2023), 6259–6266.
  • Song et al. (2024) Tao Song, Yunhao Liu, Ziying Yao, and Xinkai Wu. 2024. SSF-MOS: Semantic Scene Flow Assisted Moving Object Segmentation for Autonomous Vehicles. IEEE Transactions on Instrumentation and Measurement (TIM) 73 (2024), 1–12.
  • Sun et al. (2022) Jiadai Sun, Yuchao Dai, Xianjing Zhang, Jintao Xu, Rui Ai, Weihao Gu, and Xieyuanli Chen. 2022. Efficient Spatial-Temporal Information Fusion for LiDAR-Based 3D Moving Object Segmentation. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS).
  • Wang et al. (2023) Neng Wang, Chenghao Shi, Ruibin Guo, Huimin Lu, Zhiqiang Zheng, and Xieyuanli Chen. 2023. InsMOS: Instance-aware moving object segmentation in LiDAR data. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
  • Wang (2023) Peng-Shuai Wang. 2023. OctFormer: Octree-based Transformers for 3D Point Clouds. ACM Transactions on Graphics (TOG) 42, 4 (2023), 1–11.
  • Wu et al. (2024) Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. 2024. Point Transformer V3: Simpler, Faster, Stronger. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Yang et al. (2023) Yu-Qi Yang, Yu-Xiao Guo, Jian-Yu Xiong, Yang Liu, Hao Pan, Peng-Shuai Wang, Xin Tong, and Baining Guo. 2023. Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding. arXiv preprint arXiv:2304.06906 (2023).
  • Zhang et al. (2020) Yang Zhang, Zixiang Zhou, Philip David, Xiangyu Yue, Zerong Xi, Boqing Gong, and Hassan Foroosh. 2020. PolarNet: An Improved Grid Representation for Online LiDAR Point Clouds Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Zhou et al. (2023) Bo Zhou, Jiapeng Xie, Yan Pan, Jiajie Wu, and Chuanzhao Lu. 2023. MotionBEV: Attention-Aware Online LiDAR Moving Object Segmentation With Bird’s Eye View Based Appearance and Motion Features. IEEE Robotics and Automation Letters (RA-L) 8, 12 (2023), 8074–8081.