
FreePoint: Unsupervised Point Cloud Instance Segmentation

Zhikai Zhang1, Jian Ding1,2, Li Jiang3, Dengxin Dai4, Guisong Xia1
1Wuhan University  2KAUST  3CUHK-Shenzhen  4Huawei Zurich Research Center
Abstract

Instance segmentation of point clouds is a crucial task in the 3D field, with numerous applications that involve localizing and segmenting objects in a scene. However, achieving satisfactory results requires a large number of manual annotations, which are time-consuming and expensive to obtain. To alleviate the dependency on annotations, we propose a novel framework, FreePoint, for the underexplored task of unsupervised class-agnostic instance segmentation on point clouds. In detail, we represent the point features by combining coordinates, colors, and self-supervised deep features. Based on the point features, we run a bottom-up multicut algorithm to segment point clouds into coarse instance masks, which serve as pseudo labels for training a point cloud instance segmentation model. We propose an id-as-feature strategy at this stage to alleviate the randomness of the multicut algorithm and improve the pseudo labels’ quality. During training, we propose a weakly-supervised two-step training strategy and corresponding losses to overcome the inaccuracy of coarse masks. FreePoint achieves breakthroughs in unsupervised class-agnostic instance segmentation on point clouds, outperforming previous traditional methods by over 18.2% and the competitive concurrent work UnScene3D by 5.5% in AP. Additionally, when used as a pretext task and fine-tuned on S3DIS, FreePoint performs significantly better than existing self-supervised pre-training methods with limited annotations and surpasses CSC by 6.0% in AP with 10% annotation masks. Code will be released at https://github.com/zzk273/FreePoint.


1 Introduction

Instance segmentation on point clouds aims to segment and recognize objects in a 3D scene, serving as the foundation for a wide range of applications such as autonomous driving, virtual reality, and robot navigation. This task has received increasing attention [6, 13, 14, 15, 16, 19, 22, 23, 52, 37, 43] owing to the availability of large-scale point cloud datasets [4, 12, 28, 40]. Most previous works focus on fully-supervised point cloud segmentation, which requires a large number of bounding boxes and per-point annotations to achieve satisfactory results. However, annotating point clouds is labor-intensive. For example, labeling an average scene in ScanNet takes about 22.3 minutes [12].

Figure 1: We propose a novel framework for unsupervised point cloud instance segmentation. In detail, we cluster points based on coordinates, colors, and self-supervised deep features. Then we use the clustered pseudo masks to perform a step-training and improve the unsupervised segmentation quality further.

To relieve the annotation requirements, some weakly-supervised 3D segmentation methods [51, 24, 55, 56, 8] and semi-supervised 3D segmentation methods [7, 20] have been proposed. Besides, some works explore unsupervised pre-training methods for 3D point clouds [18, 50, 57], mainly focusing on data-efficient scene understanding and achieving satisfactory results when fine-tuned on downstream tasks with limited annotations. These works, however, still rely on a considerable number of box or point annotations, or a certain proportion of mask annotations, to achieve competitive results. A concurrent work, UnScene3D [34], explores unsupervised 3D class-agnostic instance segmentation for indoor scenes. It shows promising results but still leaves large room for improvement in accuracy.

In this work, we propose a novel framework, FreePoint, for unsupervised point cloud instance segmentation, which can be split into three parts: (1) preprocessing and point feature extraction; (2) pseudo mask label generation by point-feature-based graph partitioning; (3) step-training using the pseudo labels. We first apply a plane segmentation algorithm repeatedly to split a point cloud scene into foreground points and background points. Then, for foreground points, we use a self-supervised pre-trained backbone to generate deep-learning feature embeddings for each point. To enhance our feature representation, we add coordinates and colors as extra point features. Our main motivation is that geometry and color features are helpful for point cloud segmentation; this information has been widely adopted by traditional point-clustering methods [35, 36, 2, 31]. To generate pseudo mask labels, we solve a bottom-up multicut [11] problem based on the affinities of point features and the constructed point graphs. We propose an id-as-feature strategy at this stage to alleviate the randomness of the multicut algorithm and improve the pseudo labels’ quality. This strategy is, in essence, an ensemble of multiple runs of the multicut solver RAMA [1]. We also adopt down-sampling and up-sampling here to make the computation affordable. These pseudo masks are used to train an existing instance segmentation model. In our work, we choose Mask3D [37] for its efficiency and good performance. Since the pseudo masks are inaccurate and the training can be unstable, we propose a weakly-supervised two-step training strategy and corresponding losses to alleviate this problem. The overview of FreePoint is shown in Figure 1.

We evaluate our method on unsupervised class-agnostic instance segmentation. In this setting, our method shows surprising results without any annotations, surpassing previous SOTA by a large margin. Apart from directly acquiring the class-agnostic instance masks, our method can also be used for unsupervised pre-training on 3D point clouds. The learned parameters of the backbone can be used to initialize a supervised instance segmentation model and improve final results with limited annotations.

Our contributions in this paper are three-fold:

  • We propose a novel framework, FreePoint, for unsupervised point cloud instance segmentation with deep networks. FreePoint generates pseudo labels by solving a graph partitioning problem and then uses these pseudo labels to train a 3D instance segmentation model. Our work opens up possibilities for advancing the field.

  • We make great efforts to overcome many difficulties brought by the lack of manual annotations. To generate pseudo labels of higher quality, we first propose a hybrid feature representation for point affinity computation. Then we design an id-as-feature strategy to alleviate the randomness of the graph partitioning method. For better use of the noisy pseudo labels, we further propose a carefully designed two-step training strategy and corresponding losses to overcome pseudo labels’ noise.

  • We evaluate FreePoint’s performance on unsupervised class-agnostic point cloud instance segmentation. It surpasses traditional unsupervised segmentation methods by over 18.2%, and even outperforms the competitive concurrent work UnScene3D [34] by 5.5% in AP. We also evaluate FreePoint’s performance as a pretext task. For example, when fine-tuning on the S3DIS dataset with 10% labeled masks, FreePoint outperforms training from scratch by +8.2% AP and CSC by 5.8% AP.

2 Related work

Point cloud instance segmentation

Early works on point cloud instance segmentation focus on grouping points based on their affinities [44, 45, 13]. They use dense labels to train point feature encoders and segment point clouds by measuring the point affinities. 3D-SIS [17] and 3D-BoNet [52] extract bounding box proposals and classify them. Recent works prefer to group points based on predicted semantics and object centers [19, 14, 6, 23, 15]. Mask3D [37] is the first Transformer-based approach to tackle this task. We choose it as our step-training model for its high efficiency. The above works rely heavily on per-point labels to achieve good results. However, acquiring such labels is labor-intensive. Several 3D instance segmentation works have been proposed in recent years to alleviate the dependency on costly manual annotations. Some of them [51, 18, 7, 20, 24, 55, 56] assume that only a sparse set of points is annotated, and [8] uses only bounding box labels. However, they still rely on considerable annotations to achieve competitive results.

Unsupervised segmentation and detection

In 2D images, several works explore unsupervised object detection [10, 39, 48, 26, 38], instance segmentation [46, 47], and semantic segmentation [21, 9, 42]. In object detection, some works [48, 26, 38] use spectral methods to discover and segment the main objects in a scene. They first construct an adjacency matrix using spatial features, color features, or features from pre-trained backbones. Then the matrix’s eigenvectors and eigenvalues are computed to decompose the image. Recently, a few works [46, 47] have explored unsupervised instance segmentation for 2D images and achieved satisfactory results. UnScene3D [34] has explored unsupervised 3D instance segmentation for indoor scenes. It generates pseudo labels on the basis of a geometric oversegmentation and refines them through self-training, as many 2D works do. UnScene3D shows promising results but still leaves large room for improvement in accuracy. Our method differs from this work in two main ways: (1) we utilize only 3D color and geometric features instead of multimodal features from 2D and 3D pre-training backbones; (2) we design a two-step training strategy instead of a multi-round self-training strategy, which is very time-consuming.

3D feature representation

Traditional methods [2, 31] use features like coordinates, colors, and normals to describe each point in a scene. Following the trend of unsupervised pre-training in the 2D field, various works [3, 50, 18, 30, 53, 27, 57, 54] have recently been proposed to learn 3D feature representations, but they mostly focus on single-object classification tasks on ShapeNet [5] or ModelNet [49]. Only a few works [50, 18, 57] focus on large-scale indoor point cloud datasets, which are important for multi-object segmentation tasks and contain far more than one object per scene. [18] mainly explores how to address downstream tasks in a data-efficient semi-supervised way rather than using full annotations. As a result, many works on instance segmentation and semantic segmentation train their models from scratch and cannot benefit from 3D pre-training.

3 Method

Figure 2: Overview. For input point clouds, we first use plane segmentation to filter out backgrounds. Then we represent the features of points by combining self-supervised deep features and traditional features. After that, we construct a graph and compute the edge affinity costs between points. Based on the graph, we apply a multicut algorithm to segment point clouds into coarse instance masks. These masks are adopted as pseudo labels to train a 3D instance segmentation model with our proposed weakly-supervised loss and step-training strategy.

Our pipeline, as shown in Figure 2, can be split into three parts: (1) preprocessing and point feature extraction; (2) pseudo mask label generation by point-feature-based graph partitioning; (3) step-training using the pseudo labels. Concretely, we first apply plane segmentation to separate the foreground points and background points. Then, for foreground points, we combine both traditional features (i.e., coordinates and colors) and self-supervised deep-learning embeddings to represent their features. Based on these features, we construct an undirected graph $\mathbf{G}=(\mathbf{V},\mathbf{E},\mathbf{A})$, viewing the points as vertices $\mathbf{V}$ and their connections as edges $\mathbf{E}$. $\mathbf{A}$ is an affinity cost vector measured by the affinities between point features. After this, a multicut algorithm is adopted to decompose $\mathbf{G}$ into coarse instance masks. Finally, we use the coarse masks to train an instance segmentation model with our proposed weakly-supervised loss and step-training strategy.

3.1 Preprocessing and point feature extraction

Preprocessing

It is difficult to directly cluster the point clouds into instance masks and backgrounds in the unsupervised setting, since numerous inconspicuous objects blend into nearby backgrounds. However, we find that for indoor point cloud datasets, backgrounds include floors, walls, and ceilings, which are usually large, flat surfaces and thus can be easily removed. So we apply plane segmentation [58] to filter out major surfaces in a scene and consider them as backgrounds. In detail, we run a non-deep-learning plane segmentation algorithm several times for a scene. Each fitted plane is projected onto and compared with the corresponding surface of the whole indoor scene’s bounding box, and we compute the IoU between them. If the IoU is larger than a threshold, the plane is regarded as part of the background and removed from the scene. After this step, the original input point cloud $\mathbf{V}_{full}\in\mathbb{R}^{N\times 6}$, which contains coordinate and color information, is divided into two subsets: the foreground point cloud $\mathbf{V}_{fg}\in\mathbb{R}^{N_{fg}\times 6}$ and the background point cloud $\mathbf{V}_{bg}\in\mathbb{R}^{N_{bg}\times 6}$. Since segmenting backgrounds is not the goal of instance segmentation, we only use $\mathbf{V}_{fg}$ for the subsequent feature extraction and point cloud segmentation steps.
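The plane-versus-bounding-box comparison can be sketched as follows. This is only one possible reading of the step: the text does not say how the IoU is computed, so the occupancy-grid rasterization, the function name `face_iou`, the `cell` size, and the restriction to axis-aligned faces are all assumptions made for illustration. Because the bounding-box face fills the whole grid, the IoU reduces to the fraction of face cells covered by the projected plane points.

```python
import numpy as np

def face_iou(plane_pts, bbox_min, bbox_max, normal_axis, cell=0.1):
    """Grid-based IoU between projected plane inliers and one bounding-box face.

    plane_pts:   (M, 3) points assigned to one fitted plane (assumed given).
    normal_axis: index (0, 1 or 2) of the axis orthogonal to the face.
    """
    axes = [a for a in range(3) if a != normal_axis]
    lo = np.asarray(bbox_min, dtype=float)[axes]
    hi = np.asarray(bbox_max, dtype=float)[axes]
    # Number of grid cells along the two in-face axes.
    shape = np.maximum(np.ceil((hi - lo) / cell).astype(int), 1)
    # Project by dropping the normal axis, then rasterize onto the grid.
    ij = np.floor((plane_pts[:, axes] - lo) / cell).astype(int)
    ij = np.clip(ij, 0, shape - 1)
    occupied = np.zeros(shape, dtype=bool)
    occupied[ij[:, 0], ij[:, 1]] = True
    # The face covers every cell, so IoU = covered cells / all cells.
    return occupied.sum() / occupied.size

# Toy check: a "floor" covering the full face vs. one covering half of it.
g = np.array([0.125, 0.375, 0.625, 0.875])
full = np.array([[x, y, 0.0] for x in g for y in g])
half = np.array([[x, y, 0.0] for x in g[:2] for y in g])
iou_full = face_iou(full, [0, 0, 0], [1, 1, 1], normal_axis=2, cell=0.25)
iou_half = face_iou(half, [0, 0, 0], [1, 1, 1], normal_axis=2, cell=0.25)
```

A plane would then be treated as background when its IoU exceeds the chosen threshold; the threshold value itself is not given in this section.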

We then perform farthest point sampling [32] to down-sample $\mathbf{V}_{fg}$ into $\mathbf{V}_{sampled}\in\mathbb{R}^{N_{sampled}\times 6}$. This step is important for two reasons: (1) it reduces the computation cost of the following point cloud segmenting process and makes pseudo label generation on raw points affordable; (2) it makes the point distribution sparser. Because of the sparsity, the sampled points are farther from each other in feature space, which is beneficial for our point cloud segmenting method described in Section 3.2.
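For readers unfamiliar with farthest point sampling, a minimal NumPy sketch of the standard greedy procedure is given below. It is illustrative only: the function name, the random choice of the first point, and the use of the xyz columns alone for distances are assumptions, not details taken from the paper.

```python
import numpy as np

def farthest_point_sampling(points, n_samples, seed=0):
    """Greedy FPS on the xyz columns of an (N, >=3) array.

    Returns the row indices of the selected points (n_samples <= N assumed).
    """
    xyz = points[:, :3]
    n = xyz.shape[0]
    rng = np.random.default_rng(seed)
    selected = np.empty(n_samples, dtype=np.int64)
    selected[0] = rng.integers(n)  # arbitrary starting point
    # Squared distance from every point to its nearest selected point so far.
    min_dist = np.full(n, np.inf)
    for i in range(1, n_samples):
        diff = xyz - xyz[selected[i - 1]]
        min_dist = np.minimum(min_dist, np.einsum("ij,ij->i", diff, diff))
        # Pick the point that is farthest from the current selection.
        selected[i] = int(np.argmax(min_dist))
    return selected
```

Under these assumptions, `points[farthest_point_sampling(points, n_samples)]` plays the role of $\mathbf{V}_{sampled}$, with the color columns carried along unchanged.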

Feature extraction

Since our segmenting method is based on the affinity between the feature representation of each point, we should find a way to make points closer in feature embedding space if they belong to the same object and farther otherwise. We first use self-supervised pre-trained backbones to encode points. However, we find it difficult to encode points discriminatively using deep-learning features alone, which means even points belonging to different instances can be close to each other in the feature embedding space.

Before the era of deep learning, some methods [2, 31] used traditional features to cluster points. For example, Supervoxel [31] uses features like coordinates and colors to measure the affinities between points and clusters them accordingly. Inspired by this, we use both traditional features and deep-learning features to represent each sampled point and measure their affinities in our work.

3.2 Point cloud segmenting

Preliminary

The minimum-cost multicut problem [11] aims to decompose an undirected graph $\mathbf{G}=(\mathbf{V},\mathbf{E},\mathbf{A})$ into a set of vertex subsets $\{\mathbf{V}_1,\ldots,\mathbf{V}_k\}$, where $\mathbf{V}_1\cup\ldots\cup\mathbf{V}_k=\mathbf{V}$ and $\mathbf{V}_i\cap\mathbf{V}_j=\varnothing$ $\forall i\neq j$. The edges that straddle distinct clusters of this decomposition form the cut $\delta(\mathbf{V}_1,\ldots,\mathbf{V}_k)$. $\mathbf{A}\in\mathbb{R}^{\mathbf{E}}$ is an affinity cost vector: each edge $(u,v)\in\mathbf{E}$ has a cost $\mathbf{A}_{(u,v)}$. We need to find a cut of the undirected graph $\mathbf{G}$ that agrees as much as possible with the affinity cost vector, i.e., one that minimizes the total cost of the cut edges. Consequently, the more edge costs are negative, the more clusters $\mathbf{G}$ is generally decomposed into.
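To make the objective concrete, the following toy sketch evaluates the cost of a cut and finds the optimum by exhaustive search on a four-vertex path. The brute-force search is purely illustrative and is unrelated to how a practical solver works; the graph, the costs, and the function names are made up for this example.

```python
import itertools
import numpy as np

def cut_cost(edges, costs, labels):
    """Sum of the costs of edges whose endpoints carry different labels."""
    labels = np.asarray(labels)
    cut = labels[edges[:, 0]] != labels[edges[:, 1]]
    return float(costs[cut].sum())

def brute_force_multicut(n_nodes, edges, costs):
    """Try every labeling; feasible only for a handful of vertices."""
    best_labels, best_cost = None, np.inf
    for labels in itertools.product(range(n_nodes), repeat=n_nodes):
        c = cut_cost(edges, costs, labels)
        if c < best_cost:
            best_labels, best_cost = labels, c
    return best_labels, best_cost

# Path 0 - 1 - 2 - 3 with one repulsive (negative-cost) edge in the middle.
edges = np.array([[0, 1], [1, 2], [2, 3]])
costs = np.array([2.0, -1.0, 3.0])
best_labels, best_cost = brute_force_multicut(4, edges, costs)
```

Here the cheapest decomposition cuts only the negative edge, splitting the path into {0, 1} and {2, 3}, which matches the remark that negative costs drive the graph toward more clusters.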

Segmenting

In our work, we select RAMA [1], a rapid bottom-up multicut algorithm on GPU, to segment point clouds. Each $v_i\in\mathbf{V}$ is connected to its closest $k_1$ points $\{u_{i_1},\ldots,u_{i_{k_1}}\}\subset\mathbf{V}$ by edges $(v_i,u_{i_j})\in\mathbf{E}$, where $j\in\{1,\ldots,k_1\}$. The affinity cost vector $\mathbf{A}$ combines the affinities of both deep features and traditional features. For the deep-learning feature embeddings $\mathbf{F}\in\mathbb{R}^{N_{sampled}\times dim}$, we calculate their cosine similarities:

$$\mathbf{A}_{(i,j),emb}=\mathtt{Cos}(\mathbf{F}_i,\mathbf{F}_j). \tag{1}$$

We split $\mathbf{V}_{sampled}$ into point coordinates $\mathbf{P}\in\mathbb{R}^{N_{sampled}\times 3}$ and point colors $\mathbf{C}\in\mathbb{R}^{N_{sampled}\times 3}$. Then we compute the L2 distances in XYZ space and RGB space respectively, negated so that closer points have higher affinity:

$$\mathbf{A}_{(i,j),xyz}=-\|\mathbf{P}_i-\mathbf{P}_j\|_2,\quad \mathbf{A}_{(i,j),rgb}=-\|\mathbf{C}_i-\mathbf{C}_j\|_2. \tag{2}$$

These three affinities are all normalized to have a mean value of 0 and variance of 1. The total affinity can be written as:

$$\mathbf{A}=\alpha_1\mathbf{A}_{emb}+\alpha_2\mathbf{A}_{xyz}+\alpha_3\mathbf{A}_{rgb}, \tag{3}$$

where $\alpha_1$, $\alpha_2$, $\alpha_3$ are the weights that balance the importance of the different affinities. $\mathbf{G}$, with edge costs $\mathbf{A}$, is passed to RAMA [1], which outputs pseudo instance labels.
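The graph construction and Eqs. (1) to (3) can be sketched in NumPy as below. Several choices are assumptions made for illustration: neighbors are searched in XYZ space, the kNN edges are made undirected by deduplication, normalization statistics are taken over the edge set, and the default weights are placeholders rather than the paper's values.

```python
import numpy as np

def knn_edges(xyz, k):
    """Undirected kNN edge list (E, 2) with i < j, using dense pairwise distances."""
    d = np.linalg.norm(xyz[:, None, :] - xyz[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # exclude self-matches
    nbrs = np.argsort(d, axis=1)[:, :k]
    src = np.repeat(np.arange(len(xyz)), k)
    edges = np.stack([src, nbrs.ravel()], axis=1)
    return np.unique(np.sort(edges, axis=1), axis=0)

def standardize(x):
    """Shift and scale to zero mean and unit variance."""
    return (x - x.mean()) / (x.std() + 1e-8)

def edge_affinities(feats, xyz, rgb, edges, alphas=(1.0, 1.0, 1.0)):
    i, j = edges[:, 0], edges[:, 1]
    fi, fj = feats[i], feats[j]
    # Eq. (1): cosine similarity of the deep embeddings.
    a_emb = np.sum(fi * fj, axis=1) / (
        np.linalg.norm(fi, axis=1) * np.linalg.norm(fj, axis=1) + 1e-8)
    # Eq. (2): negative L2 distances in XYZ and RGB space.
    a_xyz = -np.linalg.norm(xyz[i] - xyz[j], axis=1)
    a_rgb = -np.linalg.norm(rgb[i] - rgb[j], axis=1)
    # Eq. (3): weighted sum of the normalized affinities.
    a1, a2, a3 = alphas
    return a1 * standardize(a_emb) + a2 * standardize(a_xyz) + a3 * standardize(a_rgb)
```

The dense distance matrix keeps the sketch short but scales quadratically with the number of points; a spatial index would be the usual replacement for larger scenes.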

However, due to the characteristics of this bottom-up segmenting method, the generated coarse masks are subject to randomness. We design an id-as-feature strategy to address this problem and improve the pseudo labels’ quality. This strategy is, in essence, an ensemble of multiple generation results. Concretely, each time $t$ we run RAMA, every point in $\mathbf{V}_{sampled}$ is assigned a pseudo instance label $id_t$. We run RAMA $T$ times and concatenate all $id_t$ to form a new feature for each point in $\mathbf{V}_{sampled}$. For these id-generated features $\mathbf{IDF}\in\mathbb{R}^{N_{sampled}\times T}$, the similarities are computed as:

$$\mathbf{A}_{(i,j),id}=\frac{1}{T}\sum_{t=1}^{T}\boldsymbol{I}\left[\mathbf{IDF}_i[t]=\mathbf{IDF}_j[t]\right]. \tag{4}$$
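Eq. (4) is simply the fraction of runs in which two points received the same label. A minimal sketch, assuming the labels of the $T$ runs are stacked column-wise into an integer array (the function name is hypothetical):

```python
import numpy as np

def id_affinity(idf, edges):
    """Eq. (4): share of the T runs in which both endpoints share a label.

    idf:   (N, T) integer array, column t holding the labels of run t.
    edges: (E, 2) integer array of point index pairs.
    """
    i, j = edges[:, 0], edges[:, 1]
    return (idf[i] == idf[j]).mean(axis=1)

# Three points, T = 3 runs (toy labels).
idf = np.array([[0, 0, 1],
                [0, 0, 0],
                [1, 2, 0]])
edges = np.array([[0, 1], [1, 2], [0, 2]])
a_id = id_affinity(idf, edges)  # 2/3, 1/3 and 0 for the three edges
```

Note that label values are only compared within the same run, so the arbitrary numbering of instances across runs does not matter.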

We run RAMA again based on $\mathbf{A}_{id}$, which yields the preliminary pseudo instance labels $\mathbf{L}_{sampled}\in\mathbb{R}^{N_{sampled}}$. To recover the original size, we use kNN to find, for each point in $\mathbf{V}_{fg}$, the closest $k_2$ points in $\mathbf{V}_{sampled}$ and their corresponding labels. By majority voting over the $k_2$ points, we obtain $\mathbf{L}_{fg}\in\mathbb{R}^{N_{fg}}$. Then we annotate the points in $\mathbf{V}_{bg}$ as background and concatenate these labels with $\mathbf{L}_{fg}$ to obtain the final pseudo labels $\mathbf{L}\in\mathbb{R}^{N}$. The pipeline of pseudo-label generation is shown in Figure 3.
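The kNN majority-vote up-sampling can be sketched as follows. The sketch assumes non-negative integer labels and breaks ties in favor of the smallest label; neither detail is specified in the text, and the function name is illustrative.

```python
import numpy as np

def upsample_labels(xyz_fg, xyz_sampled, labels_sampled, k2):
    """Assign each foreground point the majority label of its k2 nearest sampled points."""
    d = np.linalg.norm(xyz_fg[:, None, :] - xyz_sampled[None, :, :], axis=-1)
    nbrs = np.argsort(d, axis=1)[:, :k2]       # (N_fg, k2) neighbor indices
    votes = labels_sampled[nbrs]               # (N_fg, k2) neighbor labels
    out = np.empty(len(xyz_fg), dtype=labels_sampled.dtype)
    for n, row in enumerate(votes):
        out[n] = np.bincount(row).argmax()     # ties go to the smallest label
    return out

# Two well-separated groups of sampled points on the x-axis.
xyz_sampled = np.array([[0.0, 0, 0], [0.1, 0, 0],
                        [10.0, 0, 0], [10.1, 0, 0], [10.2, 0, 0]])
labels_sampled = np.array([0, 0, 1, 1, 1])
xyz_fg = np.array([[0.05, 0, 0], [10.05, 0, 0]])
labels_fg = upsample_labels(xyz_fg, xyz_sampled, labels_sampled, k2=3)
```

In the toy example the first query point has neighbors labeled 0, 0, 1 and the second has three neighbors labeled 1, so the votes resolve to labels 0 and 1 respectively.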

Figure 3: Pseudo-label Generation. In this figure, we show the complete pipeline of pseudo-label generation. For simplicity, we set $k_1=k_2=2$.

As mentioned before, RAMA generally segments the scene into more objects if more edge values are negative. When running RAMA based on $\mathbf{A}$, we add one of two offset hyper-parameters, $\sigma_{low}$ or $\sigma_{high}$, to the affinity:

$$\mathbf{A}_{final}=\mathbf{A}+\sigma,\quad \sigma\in\{\sigma_{low},\sigma_{high}\}. \tag{5}$$

By changing $\sigma$, we generate coarse masks at two different segmentation levels. The first level localizes and identifies most objects in the scene but fails to generate complete masks for the instances. We denote these masks as base masks. To compensate for the defects of the base masks, we also generate under-segmented masks with a relatively higher $\sigma$. They are put to good use in the next step, following our weakly-supervised two-step training design. It is worth mentioning that when running RAMA based on $\mathbf{A}_{id}$, $\sigma$ is automatically chosen to keep the number of generated instances approximately the same as the average number of instances over the $T$ runs of RAMA based on $\mathbf{A}_{final}$.
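The effect of the offset in Eq. (5) can be illustrated with a few lines of NumPy: raising $\sigma$ leaves fewer negative edge costs, so the solver has less incentive to cut and the result is coarser. The affinity values and the two offsets below are made up for the illustration and are not the paper's settings.

```python
import numpy as np

def offset_affinity(affinity, sigma):
    """Eq. (5): add a scalar offset sigma to every edge cost."""
    return np.asarray(affinity, dtype=float) + sigma

def negative_fraction(affinity):
    """Share of edges with negative cost, i.e. edges that are rewarding to cut."""
    return float(np.mean(np.asarray(affinity) < 0))

a = np.array([-1.0, -0.2, 0.3, 1.0])          # toy normalized affinities
base_like = offset_affinity(a, sigma=0.0)     # hypothetical sigma_low
under_seg = offset_affinity(a, sigma=0.5)     # hypothetical sigma_high
frac_low = negative_fraction(base_like)       # half of the edges are negative
frac_high = negative_fraction(under_seg)      # only a quarter remain negative
```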

3.3 Training with coarse masks

To further refine the coarse masks, we aim to train a point cloud instance segmenter using these masks as pseudo labels. In our work, we choose Mask3D [37], a Transformer-based model for semantic instance segmentation, for its good performance and efficiency. Coarse masks are often inaccurate, so directly using them to train an instance segmenter in a fully-supervised way will cause unsatisfactory results. Therefore we propose two designs to solve this problem, including a new weakly-supervised loss and a step-training strategy.

Loss for weakly-supervised training

The original implementation of Mask3D [37] uses both a dice loss $\mathcal{L}_{dice}$ and a binary cross-entropy loss $\mathcal{L}_{BCE}$ as the mask loss for training. However, our pseudo labels are inaccurate, so using such per-point losses directly may lead to sub-optimal results. We propose to use these coarse masks as a kind of weak annotation and design a weakly-supervised loss.

Inspired by [46, 8, 41], we believe mask centers and bounding boxes are important for weakly-supervised training. Mask centers can help to localize instances. We compute the mean value of the normalized coordinates in a predicted mask $\mathbf{m}$ and a target mask $\mathbf{m}^*$ along each axis to get the prediction center $c_{mean}=(x_c,y_c,z_c)$ and the target center $t_{mean}=(x_t,y_t,z_t)$. Our model is trained to minimize the Euclidean distance between $c_{mean}$ and $t_{mean}$:

$$\mathcal{L}_{mean}=\mathtt{Euclidean}(avg(\mathbf{m}),avg(\mathbf{m}^*)). \tag{6}$$

We further propose a bounding box loss. Bounding box supervision encourages predictions with the correct sizes and locations, which further improves performance. For implementation, we pick the maximum and minimum values along each axis for a predicted mask and a target mask to get two boundary point pairs $(c_{max},t_{max})$ and $(c_{min},t_{min})$. The Euclidean distances of the two pairs are summed to form our bounding-box loss, which can be written as:

$$\mathcal{L}_{box} = sum\big(\mathtt{Euclidean}(max(\mathbf{m}),\, max(\mathbf{m}^{*})),\ \mathtt{Euclidean}(min(\mathbf{m}),\, min(\mathbf{m}^{*}))\big). \tag{7}$$
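The bounding-box loss of Eq. (7) can be sketched in the same spirit. This is an illustrative NumPy version under stated assumptions (binary, non-empty masks; hypothetical function names `box_corners` and `box_loss`), not the authors' code.

```python
import numpy as np


def box_corners(coords: np.ndarray, mask: np.ndarray):
    """Axis-aligned bounding-box corners (min, max) of the points in a binary mask."""
    pts = coords[mask.astype(bool)]
    return pts.min(axis=0), pts.max(axis=0)


def box_loss(coords: np.ndarray, pred_mask: np.ndarray, target_mask: np.ndarray) -> float:
    """Sum of the Euclidean distances between the two max corners and the two min corners."""
    c_min, c_max = box_corners(coords, pred_mask)
    t_min, t_max = box_corners(coords, target_mask)
    return float(np.linalg.norm(c_max - t_max) + np.linalg.norm(c_min - t_min))


# Toy usage: the prediction covers only two of the four target points.
coords = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
pred = np.array([1, 1, 0, 0])
target = np.array([1, 1, 1, 1])
loss = box_loss(coords, pred, target)
```

Unlike the center term, this term also penalizes a prediction whose center is correct but whose extent is too small or too large.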

We compute the above losses directly on points without voxelization. Then the weighted sum of each term in weakly-supervised loss and fully-supervised loss will be our final loss, which can be written as:

$$\mathcal{L} = \lambda_{dice}\mathcal{L}_{dice} + \lambda_{BCE}\mathcal{L}_{BCE} + \lambda_{mean}\mathcal{L}_{mean} + \lambda_{box}\mathcal{L}_{box}, \tag{8}$$

where $\lambda_{dice}$, $\lambda_{BCE}$, $\lambda_{mean}$, and $\lambda_{box}$ are the weights to balance the importance of different loss terms.

Two-step training strategy

In Section 3.2, we observe that our segmenting method can generate masks of different segmenting levels. For base masks, the scene is generally split into object parts. More instances can be identified and localized in this setting, but they lack complete masks. In the under-segmented setting, the scene has fewer instance proposals, which means more masks cover a whole object. However, instances in this setting are often mistakenly merged with nearby instances, especially when they share similar features.

We wonder which kind of masks we should use to achieve better results. Both coarse masks have insurmountable defects if adopted as pseudo labels alone. Therefore, we explore a novel training strategy so that over-segmented and under-segmented masks can compensate for each other’s shortcomings and significantly improve final results. Concretely, we use base masks as pseudo labels for the first training step. At this stage, the model is trained to segment points of similar features, regardless of whether they belong to object parts or whole objects. For the second training step, we use under-segmented masks instead. With only a few epochs, the model learns to connect mistakenly segmented object parts into a whole object. This step can improve the results of the first step by a large margin with little time cost.

However, under-segmented masks that contain multiple objects may harm the model’s performance. We therefore propose an undersegmentation-ignore design to alleviate this problem. Concretely, during the bipartite matching stage of Mask3D training, we ignore a match if the matched pseudo mask contains more than a certain multiple of the number of points in the predicted mask. This design is based on the insight that, at this stage, the model can already predict approximately correct masks and does not need much refinement. It ensures that the model completes instance masks only within a reasonable range.
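The undersegmentation-ignore rule can be sketched as a filter applied to the matched (prediction, pseudo mask) pairs. This is a minimal sketch under assumptions: masks are binarized, the function name `filter_matches` is hypothetical, and the threshold `max_ratio` is a placeholder whose value is not specified in the text.

```python
import numpy as np


def filter_matches(pred_masks, pseudo_masks, matches, max_ratio=2.0):
    """Drop matches whose pseudo mask is much larger than the predicted mask.

    pred_masks, pseudo_masks: (K, N) and (M, N) binary arrays over N points.
    matches: list of (pred_index, pseudo_index) pairs from bipartite matching.
    max_ratio: hypothetical threshold; the value used in the paper is not given.
    """
    kept = []
    for p, t in matches:
        n_pred = int(pred_masks[p].sum())
        n_pseudo = int(pseudo_masks[t].sum())
        # Ignore the match if the pseudo mask has more than max_ratio times
        # the points of the prediction (likely a merged, under-segmented mask).
        if n_pseudo > max_ratio * n_pred:
            continue
        kept.append((p, t))
    return kept


# Toy usage: the second pseudo mask is 2.5x the size of its matched prediction.
pred_masks = np.array([[1, 1, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0]])
pseudo_masks = np.array([[1, 1, 1, 0, 0, 0], [0, 1, 1, 1, 1, 1]])
kept = filter_matches(pred_masks, pseudo_masks, [(0, 0), (1, 1)])
```

Only the kept pairs would contribute to the loss, so a prediction is never pulled toward a pseudo mask that is far larger than itself.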

The improvement in accuracy matches our intuition. The model is first trained to encode points and segment point clouds at a low level. Even though the pseudo labels we use in this step are over-segmented, the model can learn relatively good point feature representations and predict object parts. Then we use under-segmented masks to teach the model how to connect objects and predict complete instance masks.

4 Experiments

Implementation details

For point downsampling in preprocessing, we downsample the whole point cloud to half the number of original points and set $k_1 = k_2 = 4$.

Datasets

We evaluate our work on two publicly available indoor 3D instance segmentation datasets: ScanNet [12] and S3DIS [4]. The ScanNet dataset contains 1613 scans in total, divided into training, validation and testing sets of 1201, 312 and 100 scans respectively. We use the 20-class benchmark provided by the dataset. The S3DIS dataset contains 3D scans of 6 areas with 271 scenes in total. The dataset consists of 13 classes for instance segmentation evaluation. For unsupervised instance segmentation, we train on the training set. We report evaluation results on both the training set, following UnScene3D [34], and the validation set.

Evaluation Metrics

We use standard average precision as our evaluation metric. AP50 and AP25 denote the scores with IoU thresholds of 0.5 and 0.25 respectively. AP denotes the score averaged over IoU thresholds from 0.5 to 0.95 with a step size of 0.05. We evaluate only instance mask AP values without considering any semantic labels.
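To make the metric concrete, the following is a simplified sketch of class-agnostic mask AP. It is an assumption-laden illustration: predictions are greedily matched to ground-truth masks in order of confidence, and AP is computed as the un-interpolated mean of precision at each true positive. Official benchmark scripts may integrate the precision-recall curve differently, and the function names are hypothetical.

```python
import numpy as np


def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two binary point masks."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0


def average_precision(pred_masks, scores, gt_masks, iou_thr: float) -> float:
    """Class-agnostic AP at one IoU threshold with greedy one-to-one matching."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    matched = set()
    hits = []
    for i in order:
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gt_masks):
            if j in matched:
                continue
            iou = mask_iou(pred_masks[i], gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        hit = best_iou >= iou_thr
        if hit:
            matched.add(best_j)
        hits.append(float(hit))
    hits = np.array(hits)
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    # Mean of precision at each true positive, normalized by the number of GT masks.
    return float((precision * hits).sum() / max(len(gt_masks), 1))


def mean_ap(pred_masks, scores, gt_masks) -> float:
    """AP averaged over IoU thresholds 0.5, 0.55, ..., 0.95."""
    thresholds = np.linspace(0.5, 0.95, 10)
    return float(np.mean([average_precision(pred_masks, scores, gt_masks, t) for t in thresholds]))


# Toy usage: one perfect prediction and one with IoU 2/3.
gt = [np.array([1, 1, 1, 1, 0, 0, 0]), np.array([0, 0, 0, 0, 1, 1, 1])]
preds = [np.array([1, 1, 1, 1, 0, 0, 0]), np.array([0, 0, 0, 0, 1, 1, 0])]
scores = [0.9, 0.8]
ap50 = average_precision(preds, scores, gt, 0.5)
ap = mean_ap(preds, scores, gt)
```

Because no semantic labels are involved, every predicted mask competes for every ground-truth mask, which matches the class-agnostic setting described above.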

4.1 Main Results

Unsupervised instance segmentation

We mainly compare our work with a concurrent work [34], which is a recently proposed unsupervised instance segmentation method for indoor 3D scenes. It generates pseudo labels on the basis of geometric oversegmentation and, like many works, refines them through multi-round self-training. We also compare FreePoint with some traditional clustering methods including DBSCAN [33], HDBSCAN [25] and a method originally proposed for outdoor autonomous vehicles [29]. The visualization results are shown in Figure 4.

We report the results in Table 1. It is worth noting that UnScene3D utilizes both 3D and 2D pre-trained deep features, while we only use the former. For a fairer comparison, we also show the result of UnScene3D when it only uses 3D features from the same pre-training method, CSC [18], as FreePoint. Our method surpasses previous methods by a significant margin.

Figure 4: Qualitative results on ScanNet. FreePoint shows surprisingly good performance without any annotations.
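| Method | Train set AP | Train set AP50 | Val set AP | Val set AP50 |
| --- | --- | --- | --- | --- |
| DBSCAN [33] | 3.2 | 4.1 | 3.3 | 3.6 |
| HDBSCAN [25] | 1.6 | 5.5 | 1.9 | 5.4 |
| Nunes et al. [29] | 2.3 | 7.3 | 2.1 | 6.9 |
| UnScene3D [34] | 13.3 | - | - | - |
| UnScene3D* [34] | 15.9 | 32.2 | - | - |
| FreePoint (Ours) | 21.4 | 38.7 | 18.9 | 36.4 |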
Table 1: Unsupervised class-agnostic instance segmentation on ScanNet train split and validation split. We report average precision (AP) with different IoU thresholds. We mainly compare our method with some traditional clustering methods for point clouds and some recently proposed deep-learning-based methods. ’*’ means the method utilizes both 2D features and 3D features. ’-’ means the result is not provided by the original paper and we don’t have access to the code to evaluate it by ourselves. Our method improves significantly over baselines.

Fine-tuning on semantic instance segmentation

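| Annotations | Pre-train | AP | AP50 | AP25 |
| --- | --- | --- | --- | --- |
| 10% masks | Train from scratch | 34.7 | 47.6 | 56.3 |
| 10% masks | Supervised | 36.9 | 50.1 | 55.5 |
| 10% masks | PointContrast [50] | 36.1 (-0.8) | 49.4 (-0.7) | 56.8 (+1.3) |
| 10% masks | DepthContrast [57] | 36.8 (-0.1) | 49.0 (-1.1) | 57.3 (+1.8) |
| 10% masks | CSC [18] | 37.1 (+0.2) | 50.7 (+0.6) | 57.1 (+1.6) |
| 10% masks | FreePoint (Ours) | 42.9 (+6.0) | 54.6 (+4.5) | 61.1 (+5.6) |
| 20% masks | Train from scratch | 44.1 | 54.3 | 61.1 |
| 20% masks | Supervised | 45.7 | 55.2 | 61.4 |
| 20% masks | PointContrast [50] | 44.4 (-1.3) | 54.8 (-0.4) | 61.7 (+0.3) |
| 20% masks | DepthContrast [57] | 45.2 (-0.5) | 54.9 (-0.3) | 62.4 (+1.0) |
| 20% masks | CSC [18] | 46.3 (+0.6) | 56.4 (+1.2) | 61.5 (+0.1) |
| 20% masks | FreePoint (Ours) | 47.4 (+1.7) | 60.2 (+5.0) | 65.9 (+4.5) |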
Table 2: Supervised semantic instance segmentation with limited instance masks. “Supervised” denotes the process of fully-supervised pre-training on ScanNet, succeeded by fine-tuning on S3DIS. In contrast, other methods employ unsupervised pre-training. The numerical values in brackets indicate the relative performance changes of unsupervised pre-training compared to their supervised counterparts.
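| Annotations | Pre-train | AP | AP50 | AP25 |
| --- | --- | --- | --- | --- |
| 10% scenes | Train from scratch | 30.1 | 41.2 | 52.2 |
| 10% scenes | Supervised | 32.4 | 41.8 | 52.3 |
| 10% scenes | PointContrast [50] | 31.0 (-1.4) | 42.2 (+0.4) | 53.5 (+1.2) |
| 10% scenes | DepthContrast [57] | 32.2 (-0.2) | 41.5 (-0.3) | 53.7 (+1.4) |
| 10% scenes | CSC [18] | 32.7 (+0.3) | 42.7 (+0.9) | 54.4 (+2.1) |
| 10% scenes | FreePoint (Ours) | 37.2 (+4.8) | 48.1 (+6.3) | 59.3 (+7.0) |
| 20% scenes | Train from scratch | 42.1 | 49.5 | 58.3 |
| 20% scenes | Supervised | 44.8 | 51.7 | 59.6 |
| 20% scenes | PointContrast [50] | 43.7 (-1.1) | 50.8 (-0.9) | 60.5 (+0.9) |
| 20% scenes | DepthContrast [57] | 44.0 (-0.8) | 51.6 (-0.1) | 62.1 (+2.5) |
| 20% scenes | CSC [18] | 44.4 (-0.4) | 52.9 (+1.2) | 61.0 (+1.4) |
| 20% scenes | FreePoint (Ours) | 48.1 (+3.3) | 56.6 (+4.9) | 64.3 (+4.7) |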
Table 3: Supervised semantic instance segmentation with limited fully annotated point clouds. “Supervised” denotes the process of fully-supervised pre-training on ScanNet, succeeded by fine-tuning on S3DIS. In contrast, other methods employ unsupervised pre-training. The numerical values in brackets indicate the relative performance changes of unsupervised pre-training compared to their supervised counterparts.

Since our work is unsupervised, it can also be seen as a pre-training pretext task. Apart from unsupervised class-agnostic instance segmentation, we further evaluate our work’s performance as an unsupervised pre-training model. As shown in Table 2, FreePoint pre-training outperforms other unsupervised pre-training methods [50, 18, 57] by a large margin, and even surpasses supervised pre-training by 6.0% AP and 1.7% AP when using 10% and 20% of the training masks, respectively.

We also compare the pre-training methods with different amounts of full-scene annotations. As shown in Table 3, we conduct fine-tuning experiments with only limited scenes available. Our work can still achieve satisfactory results. FreePoint pre-training outperforms other unsupervised pre-training methods, and even surpasses supervised pre-training by 4.8% AP and 3.3% AP when using 10% and 20% full-scene annotations, respectively.

4.2 Ablation Study

In this part, we conduct ablation experiments to show the effectiveness of each designed component.

Different feature representations

We explore results on different kinds of point feature representations. For features generated by various self-supervised pre-training encoders, we compare their performance in generating coarse masks and final instance segmentation results. Then we combine the best performer with traditional features and find the accuracy can be further improved. The comparison between different feature representations is shown in Table 4.

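| Method | Base masks AP | Base masks AP50 | Final AP | Final AP50 |
| --- | --- | --- | --- | --- |
| Traditional | 6.3 | 10.4 | 10.3 | 21.6 |
| PointContrast [50] | 7.6 | 13.3 | 15.7 | 27.9 |
| CSC [18] | 7.9 | 13.4 | 16.5 | 30.8 |
| FreePoint (Ours) | 8.5 | 15.3 | 18.9 | 36.4 |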
Table 4: Different feature representation methods for generating base masks. We report the accuracy of both base masks (left block) and final results (right block). Our strategy has the best performance.

Segmenting methods

Given a relatively good feature representation, there are many existing ways to segment point clouds and generate coarse masks. We compare several methods, including Supervoxel [31], FreeMasks (a method proposed by [46]), and spectral methods [26]. We adapt each method to the ScanNet dataset and tune its parameters to achieve the best results we can.

We observe that FreeMasks and spectral methods, which have proven successful in unsupervised object detection or segmentation tasks in the 2D field, fail to transfer to point clouds, as shown in Figure 5. These two methods have two main defects due to their shared top-down mechanism. Firstly, they can only identify and localize partial objects in a crowded and cluttered 3D scene. Secondly, it is hard for these non-distance-based segmenting methods to distinguish different objects with the same semantic information, even if they are far away from each other. These two defects do not have much impact on 2D images, which generally contain only one or a few dominant objects. This is not the case for point cloud scenes, which usually contain many similar objects, leading to unsatisfactory results. RAMA’s bottom-up mechanism fundamentally alleviates these problems. We also explore the effectiveness of our id-as-feature strategy and the impact of the number of RAMA runs $\mathbf{T}$ when adopting this strategy. For each method, we report the accuracy of coarse masks and final predictions. Results are shown in Table 5.

Figure 5: Comparison with segmenting methods originally for 2D unsupervised instance segmentation. Recent methods [46, 26] for 2D unsupervised instance segmentation fail to deal with crowded and cluttered point cloud scenes due to their top-down mechanism.
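| Method | Base masks AP | Base masks AP50 | Final AP | Final AP50 |
| --- | --- | --- | --- | --- |
| Supervoxel [31] | 2.4 | 3.5 | 3.8 | 6.9 |
| FreeMasks_3D [46] | 2.9 | 3.2 | - | - |
| Spectral [26] | 2.3 | 4.8 | - | - |
| RAMA | 5.4 | 10.6 | 13.8 | 24.7 |
| RAMA-5 | 8.3 | 14.7 | 17.6 | 35.0 |
| RAMA-10 | 8.5 | 15.3 | 18.9 | 36.4 |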
Table 5: Segmenting methods. We report the accuracy of both base masks (left block) and final results (right block). ‘-’ means failing to converge. The number after RAMA is the number of runs $\mathbf{T}$ used to evaluate our id-as-feature strategy.

Weakly-supervised learning design

To validate the effectiveness of our weakly-supervised design, including the different loss terms and the undersegmentation-ignore method, we first evaluate the result of using the fully-supervised loss (i.e., Dice loss and BCE loss) alone, and find that directly adopting such a loss leads to unsatisfactory results. We also find that using only our proposed weakly-supervised loss terms is much worse than using only the fully-supervised loss terms. This may be because the weak-supervision terms contain too little information to match low-quality predictions with ground truth at the early training stage. Each loss term and the undersegmentation-ignore design are validated in Table 6.

| Method | AP | AP50 |
| --- | --- | --- |
| combination (default) | 18.9 | 36.4 |
| - w/o $\mathcal{L}_{mean}$ | 14.3 | 30.6 |
| - w/o $\mathcal{L}_{box}$ | 15.2 | 31.4 |
| - w/o $\mathcal{L}_{mean}$ and $\mathcal{L}_{box}$ | 14.0 | 28.5 |
| - w/o $\mathcal{L}_{dice}$ and $\mathcal{L}_{BCE}$ | 7.8 | 15.7 |
| - w/o undersegmentation-ignore | 16.8 | 36.3 |
Table 6: Weakly-supervised learning design. Each design contributes to the final results.

Training strategy

As mentioned in Section 3.2, we can generate coarse masks of different segmenting levels by changing parameters when running RAMA. Base masks are generally over-segmented but can identify and localize most objects in the scene. Therefore, after training with the base masks, we further train the model with under-segmented masks for only a few epochs. In Table 7, we report AP and AP50 to evaluate the effectiveness of our design.

| Method | AP | AP50 |
| --- | --- | --- |
| base masks | 8.5 | 15.3 |
| under-segmented masks | 9.1 | 12.5 |
| train with base masks | 14.2 | 30.5 |
| train with under-segmented masks | 6.4 | 13.8 |
| Ours | 18.9 | 36.4 |
Table 7: Training strategy. Our two-step training strategy significantly improves the accuracy.

5 Discussion and Conclusion

In this work, we propose FreePoint, an effective framework for unsupervised class-agnostic point cloud instance segmentation. FreePoint achieves satisfactory results compared with previous methods in this underexplored field, which suggests that this task is worthy of further exploration. In our experiments, we also find that the top-down segmenting methods proposed in previous 2D unsupervised instance segmentation works cannot be directly applied to point clouds, as shown in Figure 5. Developing a novel unsupervised segmenting method for cluttered 3D indoor scenes may be promising. We hope our work can provide insights for future work on unsupervised point cloud learning.

Acknowledgement

This work is supported by National Natural Science Foundation of China grants under contracts No. 62325111 and No. U22B2011. We would like to thank Ahmed Abbas for engaging in a discussion about the usage of the RAMA algorithm.

References

  • Abbas and Swoboda [2022] Ahmed Abbas and Paul Swoboda. Rama: A rapid multicut algorithm on gpu. In CVPR, pages 8193–8202, 2022.
  • Achanta et al. [2012] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. Slic superpixels compared to state-of-the-art superpixel methods. IEEE TPAMI, 34(11):2274–2282, 2012.
  • Afham et al. [2022] Mohamed Afham, Isuru Dissanayake, Dinithi Dissanayake, Amaya Dharmasiri, Kanchana Thilakarathna, and Ranga Rodrigo. Crosspoint: Self-supervised cross-modal contrastive learning for 3d point cloud understanding. In CVPR, pages 9902–9912, 2022.
  • Armeni et al. [2016] Iro Armeni, Ozan Sener, Amir R Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. In CVPR, pages 1534–1543, 2016.
  • Chang et al. [2015] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv:1512.03012, 2015.
  • Chen et al. [2021] Shaoyu Chen, Jiemin Fang, Qian Zhang, Wenyu Liu, and Xinggang Wang. Hierarchical aggregation for 3d instance segmentation. In ICCV, pages 15467–15476, 2021.
  • Cheng et al. [2021] Mingmei Cheng, Le Hui, Jin Xie, and Jian Yang. Sspc-net: Semi-supervised semantic 3d point cloud segmentation network. In AAAI, pages 1140–1147, 2021.
  • Chibane et al. [2022] Julian Chibane, Francis Engelmann, Tuan Anh Tran, and Gerard Pons-Moll. Box2mask: Weakly supervised 3d semantic instance segmentation using bounding boxes. In ECCV, pages 681–699. Springer, 2022.
  • Cho et al. [2021] Jang Hyun Cho, Utkarsh Mall, Kavita Bala, and Bharath Hariharan. Picie: Unsupervised semantic segmentation using invariance and equivariance in clustering. In CVPR, pages 16794–16804, 2021.
  • Cho et al. [2015] Minsu Cho, Suha Kwak, Cordelia Schmid, and Jean Ponce. Unsupervised object discovery and localization in the wild: Part-based matching with bottom-up region proposals. In CVPR, pages 1201–1210, 2015.
  • Chopra and Rao [1993] Sunil Chopra and Mendu R Rao. The partition problem. Mathematical programming, 59(1-3):87–115, 1993.
  • Dai et al. [2017] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, pages 5828–5839, 2017.
  • Elich et al. [2019] Cathrin Elich, Francis Engelmann, Theodora Kontogianni, and Bastian Leibe. 3d bird’s-eye-view instance segmentation. In GCPR, pages 48–61. Springer, 2019.
  • Engelmann et al. [2020] Francis Engelmann, Martin Bokeloh, Alireza Fathi, Bastian Leibe, and Matthias Nießner. 3d-mpa: Multi-proposal aggregation for 3d semantic instance segmentation. In CVPR, pages 9031–9040, 2020.
  • Han et al. [2020] Lei Han, Tian Zheng, Lan Xu, and Lu Fang. Occuseg: Occupancy-aware 3d instance segmentation. In CVPR, pages 2940–2949, 2020.
  • He et al. [2021] Tong He, Chunhua Shen, and Anton van den Hengel. Dyco3d: Robust instance segmentation of 3d point clouds through dynamic convolution. In CVPR, pages 354–363, 2021.
  • Hou et al. [2019] Ji Hou, Angela Dai, and Matthias Nießner. 3d-sis: 3d semantic instance segmentation of rgb-d scans. In CVPR, pages 4421–4430, 2019.
  • Hou et al. [2021] Ji Hou, Benjamin Graham, Matthias Nießner, and Saining Xie. Exploring data-efficient 3d scene understanding with contrastive scene contexts. In CVPR, pages 15587–15597, 2021.
  • Jiang et al. [2020] Li Jiang, Hengshuang Zhao, Shaoshuai Shi, Shu Liu, Chi-Wing Fu, and Jiaya Jia. Pointgroup: Dual-set point grouping for 3d instance segmentation. In CVPR, pages 4867–4876, 2020.
  • Jiang et al. [2021] Li Jiang, Shaoshuai Shi, Zhuotao Tian, Xin Lai, Shu Liu, Chi-Wing Fu, and Jiaya Jia. Guided point contrastive learning for semi-supervised point cloud semantic segmentation. In ICCV, pages 6423–6432, 2021.
  • Ke et al. [2022] Tsung-Wei Ke, Jyh-Jing Hwang, Yunhui Guo, Xudong Wang, and Stella X. Yu. Unsupervised hierarchical semantic segmentation with multiview cosegmentation and clustering transformers. In CVPR, pages 2571–2581, 2022.
  • Lahoud et al. [2019] Jean Lahoud, Bernard Ghanem, Marc Pollefeys, and Martin R Oswald. 3d instance segmentation via multi-task metric learning. In ICCV, pages 9256–9266, 2019.
  • Liang et al. [2021] Zhihao Liang, Zhihao Li, Songcen Xu, Mingkui Tan, and Kui Jia. Instance segmentation in 3d scenes using semantic superpoint tree networks. In ICCV, pages 2783–2792, 2021.
  • Liu et al. [2021] Zhengzhe Liu, Xiaojuan Qi, and Chi-Wing Fu. One thing one click: A self-training approach for weakly supervised 3d semantic segmentation. In CVPR, pages 1726–1736, 2021.
  • McInnes and Healy [2017] Leland McInnes and John Healy. Accelerated hierarchical density based clustering. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pages 33–42. IEEE, 2017.
  • Melas-Kyriazi et al. [2022] Luke Melas-Kyriazi, Christian Rupprecht, Iro Laina, and Andrea Vedaldi. Deep spectral methods: A surprisingly strong baseline for unsupervised semantic segmentation and localization. In CVPR, pages 8364–8375, 2022.
  • Min et al. [2022] Chen Min, Dawei Zhao, Liang Xiao, Yiming Nie, and Bin Dai. Voxel-mae: Masked autoencoders for pre-training large-scale point clouds. arXiv:2206.09900, 2022.
  • Mo et al. [2019] Kaichun Mo, Shilin Zhu, Angel X Chang, Li Yi, Subarna Tripathi, Leonidas J Guibas, and Hao Su. Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In CVPR, pages 909–918, 2019.
  • Nunes et al. [2022] Lucas Nunes, Xieyuanli Chen, Rodrigo Marcuzzi, Aljosa Osep, Laura Leal-Taixé, Cyrill Stachniss, and Jens Behley. Unsupervised class-agnostic instance segmentation of 3d lidar data for autonomous vehicles. IEEE Robotics and Automation Letters, 7(4):8713–8720, 2022.
  • Pang et al. [2022] Yatian Pang, Wenxiao Wang, Francis EH Tay, Wei Liu, Yonghong Tian, and Li Yuan. Masked autoencoders for point cloud self-supervised learning. In ECCV, pages 604–621. Springer, 2022.
  • Papon et al. [2013] Jeremie Papon, Alexey Abramov, Markus Schoeler, and Florentin Worgotter. Voxel cloud connectivity segmentation-supervoxels for point clouds. In CVPR, pages 2027–2034, 2013.
  • Qi et al. [2017] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. NeurIPS, 30, 2017.
  • Ram et al. [2010] Anant Ram, Sunita Jalal, Anand S Jalal, and Manoj Kumar. A density based algorithm for discovering density varied clusters in large spatial databases. International Journal of Computer Applications, 3(6):1–4, 2010.
  • Rozenberszki et al. [2023] David Rozenberszki, Or Litany, and Angela Dai. Unscene3d: Unsupervised 3d instance segmentation for indoor scenes. arXiv preprint arXiv:2303.14541, 2023.
  • Rusu [2010] Radu Bogdan Rusu. Semantic 3d object maps for everyday manipulation in human living environments. KI-Künstliche Intelligenz, pages 345–348, 2010.
  • Rusu and Cousins [2011] Radu Bogdan Rusu and Steve Cousins. 3d is here: Point cloud library (pcl). In ICRA, pages 1–4. IEEE, 2011.
  • Schult et al. [2022] Jonas Schult, Francis Engelmann, Alexander Hermans, Or Litany, Siyu Tang, and Bastian Leibe. Mask3d for 3d semantic instance segmentation. arXiv:2210.03105, 2022.
  • Shin et al. [2022] Gyungin Shin, Samuel Albanie, and Weidi Xie. Unsupervised salient object detection with spectral cluster voting. In CVPR, pages 3971–3980, 2022.
  • Siméoni et al. [2021] Oriane Siméoni, Gilles Puy, Huy V Vo, Simon Roburin, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Renaud Marlet, and Jean Ponce. Localizing objects with self-supervised transformers and no labels. arXiv:2109.14279, 2021.
  • Song et al. [2017] Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In CVPR, pages 1746–1754, 2017.
  • Tian et al. [2021] Zhi Tian, Chunhua Shen, Xinlong Wang, and Hao Chen. Boxinst: High-performance instance segmentation with box annotations. In CVPR, pages 5443–5452, 2021.
  • Van Gansbeke et al. [2021] Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, and Luc Van Gool. Unsupervised semantic segmentation by contrasting object mask proposals. In ICCV, pages 10052–10062, 2021.
  • Vu et al. [2022] Thang Vu, Kookhoi Kim, Tung M Luu, Thanh Nguyen, and Chang D Yoo. Softgroup for 3d instance segmentation on point clouds. In CVPR, pages 2708–2717, 2022.
  • Wang et al. [2018] Weiyue Wang, Ronald Yu, Qiangui Huang, and Ulrich Neumann. Sgpn: Similarity group proposal network for 3d point cloud instance segmentation. In CVPR, pages 2569–2578, 2018.
  • Wang et al. [2019] Xinlong Wang, Shu Liu, Xiaoyong Shen, Chunhua Shen, and Jiaya Jia. Associatively segmenting instances and semantics in point clouds. In CVPR, pages 4096–4105, 2019.
  • Wang et al. [2022a] Xinlong Wang, Zhiding Yu, Shalini De Mello, Jan Kautz, Anima Anandkumar, Chunhua Shen, and Jose M Alvarez. Freesolo: Learning to segment objects without annotations. In CVPR, pages 14176–14186, 2022a.
  • Wang et al. [2023] Xudong Wang, Rohit Girdhar, Stella X Yu, and Ishan Misra. Cut and learn for unsupervised object detection and instance segmentation. arXiv:2301.11320, 2023.
  • Wang et al. [2022b] Yangtao Wang, Xi Shen, Yuan Yuan, Yuming Du, Maomao Li, Shell Xu Hu, James L Crowley, and Dominique Vaufreydaz. Tokencut: Segmenting objects in images and videos with self-supervised transformer and normalized cut. arXiv:2209.00383, 2022b.
  • Wu et al. [2015] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In CVPR, pages 1912–1920, 2015.
  • Xie et al. [2020] Saining Xie, Jiatao Gu, Demi Guo, Charles R Qi, Leonidas Guibas, and Or Litany. Pointcontrast: Unsupervised pre-training for 3d point cloud understanding. In ECCV, pages 574–591. Springer, 2020.
  • Xu and Lee [2020] Xun Xu and Gim Hee Lee. Weakly supervised semantic point cloud segmentation: Towards 10x fewer labels. In CVPR, pages 13706–13715, 2020.
  • Yang et al. [2019] Bo Yang, Jianan Wang, Ronald Clark, Qingyong Hu, Sen Wang, Andrew Markham, and Niki Trigoni. Learning object bounding boxes for 3d instance segmentation on point clouds. NeurIPS, 32, 2019.
  • Yu et al. [2022] Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In CVPR, pages 19313–19322, 2022.
  • Zhang et al. [2022] Renrui Zhang, Ziyu Guo, Peng Gao, Rongyao Fang, Bin Zhao, Dong Wang, Yu Qiao, and Hongsheng Li. Point-m2ae: multi-scale masked autoencoders for hierarchical point cloud pre-training. arXiv:2205.14401, 2022.
  • Zhang et al. [2021a] Yachao Zhang, Zonghao Li, Yuan Xie, Yanyun Qu, Cuihua Li, and Tao Mei. Weakly supervised semantic segmentation for large-scale point cloud. In AAAI, pages 3421–3429, 2021a.
  • Zhang et al. [2021b] Yachao Zhang, Yanyun Qu, Yuan Xie, Zonghao Li, Shanshan Zheng, and Cuihua Li. Perturbed self-distillation: Weakly supervised large-scale point cloud semantic segmentation. In ICCV, pages 15520–15528, 2021b.
  • Zhang et al. [2021c] Zaiwei Zhang, Rohit Girdhar, Armand Joulin, and Ishan Misra. Self-supervised pretraining of 3d features on any point-cloud. In ICCV, pages 10252–10263, 2021c.
  • Zhou et al. [2018] Qian-Yi Zhou, Jaesik Park, and Vladlen Koltun. Open3d: A modern library for 3d data processing. arXiv:1801.09847, 2018.