Weakly Supervised Semantic Segmentation For Large-Scale Point Cloud
Yachao Zhang¹, Zonghao Li¹, Yuan Xie²*, Yanyun Qu¹, Cuihua Li¹, Tao Mei³
¹ School of Informatics, Xiamen University, Fujian, China
² School of Computer Science and Technology, East China Normal University, Shanghai, China
³ JD AI Research, Beijing, China
{yachaozhang,zonghaoli}@stu.xmu.edu.cn, yxie@cs.ecnu.edu.cn, {yyqu,chli}@xmu.edu.cn, tmei@jd.com
arXiv:2212.04744v1 [cs.CV] 9 Dec 2022
Abstract
Existing methods for large-scale point cloud semantic segmentation require expensive, tedious and error-prone manual point-wise annotations. Intuitively, weakly supervised training is a direct solution to reduce the cost of labeling. However, for weakly supervised large-scale point cloud semantic segmentation, too few annotations will inevitably lead to ineffective learning of the network. We propose an effective weakly supervised method containing two components to solve the above problem. Firstly, we construct a pretext task, i.e., point cloud colorization, with self-supervised learning to transfer the learned prior knowledge from a large amount of unlabeled point clouds to a weakly supervised network. In this way, the representation capability of the weakly supervised network can be improved by the guidance from a heterogeneous task. Besides, to generate pseudo labels for unlabeled data, a sparse label propagation mechanism is proposed with the help of generated class prototypes, which are used to measure the classification confidence of unlabeled points. Our method is evaluated on large-scale point cloud datasets with different scenarios including indoor and outdoor. The experimental results show a large gain over existing weakly supervised methods and results comparable to fully supervised methods.¹

* Corresponding Author
¹ Code based on MindSpore: https://github.com/dmcv-ecnu/MindSpore_ModelZoo/tree/main/WS3_MindSpore

Figure 1: Visualization of semantic segmentation results; misclassified points are marked in red. From left to right, the columns show the results with one labeled point, 1% labeled points, 10% labeled points, and the ground truth.

Introduction
3D scene understanding is required in numerous applications, in particular robotics, autonomous driving and virtual reality. Large-scale point cloud semantic segmentation, as a fundamental task, attracts more and more attention. The success of deep neural networks in point cloud semantic segmentation is attributed to their ability to scale up with more well-labeled training data (Qi et al. 2017a,b; Zhang, Hua, and Yeung 2019; Li, Chen, and Hee Lee 2018; Wang et al. 2019; Wu, Qi, and Fuxin 2019; Yang et al. 2019; Hu et al. 2020).
Fully supervised point cloud semantic segmentation methods need expensive, tedious and error-prone manual point-wise annotations. One direct solution is to achieve effective segmentation via weakly supervised learning, annotating only part of the points or only semantic categories, which is still in its infancy. Xu and Lee proposed a weakly supervised approach (Xu and Lee 2020) whose results are close to the previous fully supervised performance with 10× fewer labeled points. This approach only deals with an instance (ShapeNet-Part) or a block (1m × 1m on S3DIS) of point cloud at a small scale.
For large-scale point clouds, the network cannot effectively learn feature representations with a few labeled points. We consider two questions to address this problem: 1) What prior knowledge can be transferred to the segmentation network to improve feature representation in weakly supervised learning? 2) How can the labels of given points be propagated to unlabeled points effectively?
Firstly, transfer learning provides a promising way and is relatively successful in 2D vision tasks, such as image classification, segmentation, and so on. But point cloud tasks lack a well-annotated and category-extensive dataset like ImageNet (Deng et al. 2009) for pre-training. We consider using self-supervised learning for knowledge transfer. However, how to use enormous amounts of unlabeled point clouds to generate labels by themselves and learn a semantically related representation for subsequent tasks is very challenging. Secondly, propagating pseudo labels to unlabeled points is a common way for weakly supervised and semi-supervised tasks to learn effectively. Usually, this requires constructing a fully-connected graph with all points. However, for large-scale point clouds ($\sim 10^6$ points), the fully-connected graph is unachievable due to large GPU memory consumption.
To address the above difficulties, we present an effective weakly supervised method for large-scale point cloud semantic segmentation. Firstly, we notice that points of the same semantic class have similar color distributions, and point clouds with color are essentially free. We choose point cloud colorization as a pretext task for transfer learning. Specifically, we use color space transformation to construct a self-supervised network for color space completion. We further introduce a local perceptual regularization term to enhance the local representation, which is consistent with the goal of the segmentation task. As a result, the network can learn a prior-based initialization distribution. After that, we use the pre-trained knowledge to fine-tune the weakly supervised semantic segmentation network. Our learning scheme allows transferring the knowledge from the self-supervised task to the weakly supervised task and improves the effectiveness of feature representation. Moreover, we propose a sparse label propagation method induced by class prototypes which can gradually propagate pseudo labels to unlabeled points. It is computationally friendly and can be adapted to large-scale tasks.
Our contributions can be summarized as follows:
• A weakly supervised segmentation method is proposed for large-scale point clouds which only needs a very small number of labeled points.
• We adopt the heterogeneous transfer learning method and construct a self-supervised pretext task by point cloud colorization. It can learn a prior distribution and generalizes well to segmentation tasks.
• We propose an efficient sparse label propagation method which can propagate labels to unlabeled points. It expands the supervision information and has low computational complexity.
• Extensive experimental results demonstrate that our method achieves results comparable to, or even exceeding, those of fully supervised competitors.

Related Work
Semantic segmentation for large-scale point cloud. PointNet and PointNet++ (Qi et al. 2017a,b) are pioneering approaches for point clouds. While recent works have shown promising results on small point clouds, most of them cannot directly scale up to large-scale point clouds ($\sim 10^6$ points) due to high computational and memory costs (Hu et al. 2020).
SPG (Landrieu and Simonovsky 2018) processes large-scale point clouds through a graph convolution-based method. FCPN (Rethage et al. 2018) preprocesses large-scale point clouds into voxels. However, both the graph partitioning and voxelization are computationally expensive. Recently, RandLA-Net (Hu et al. 2020) provides an efficient and lightweight neural architecture for large-scale point cloud semantic segmentation. The state-of-the-art methods mentioned above are fully supervised and require a large number of point clouds with dense point-wise annotation. Obviously, this point-wise annotation is labor-intensive and time-consuming.
Self-supervised learning on point cloud. Self-supervised learning has recently sprung up in computer vision. It aims to learn good representations from unlabeled visual data, reducing or even eliminating the need for the costly collection of manual labels (Newell and Deng 2020). Recent self-supervised learning has achieved great success in feature representation, with results comparable to or better than those produced by supervised pre-training (Bachman, Hjelm, and Buchwalter 2019; He et al. 2020).
Unlike for 2D images, self-supervised learning is rarely used for point clouds. Previous works on unsupervised 3D representation learning (Achlioptas et al. 2018; Gadelha, Wang, and Maji 2018; Hassani and Haley 2019; Li, Chen, and Hee Lee 2018; Sauder and Sievers 2019; Yang et al. 2018) mainly focus on representing an instance (ShapeNet (Chang et al. 2015)). It is difficult to directly apply the feature representation learned on an instance to real-world large-scale point cloud tasks due to the large domain gaps (Xie et al. 2020). PointContrast (Xie et al. 2020) focuses on contrastive embeddings and proposes a pre-training task for 3D point cloud understanding. The core of this method is that different views of a point cloud should be mapped to similar embeddings for matched points. It achieves promising results on fully-supervised downstream tasks. Xu and Lee (Xu and Lee 2020) proposed a Siamese self-supervision branch by augmenting the training sample with a random in-plane rotation and flipping, and then made the original and augmented point-wise predictions consistent. This method (Xu and Lee 2020) treats self-supervision as a branch of multi-task learning and only the current training samples are used. Therefore, it cannot make full use of other massive point clouds to learn representations with strong generalization.
Weakly Supervised Point Cloud Semantic Segmentation. There is little research on weakly supervised point cloud semantic segmentation. Following the weakly supervised manner called incomplete supervision in (Zhou 2018), the recent work (Xu and Lee 2020) utilizes 10× fewer labeled points to achieve performance comparable to fully supervised methods in small-scale part segmentation and small-block ($\sim 10^3$ points) semantic segmentation. MPRM (Wei et al. 2020) introduces a multi-path region mining module to generate pseudo point-level labels from a classification network. The classification network is trained with subcloud-level weak labels. With scene-level input, the performance degrades. However, up to now, none of them generalizes well to large-scale problems.

Proposed Method
Overview
Our method aims to exploit knowledge transfer and label propagation to solve the problem of unstable and poor representations produced by the network under weakly supervised large-scale point cloud segmentation. We propose an effective weakly supervised large-scale method and depict the overview framework in Figure 2.
Figure 2: The framework of our method consists of three parts: i) A self-supervised pretext task learns prior knowledge. ii) The prior knowledge is used to fine-tune the weakly supervised semantic segmentation network. iii) Sparse label propagation generates pseudo labels for unlabeled data to improve the effectiveness of the weakly supervised task.

We take point cloud colorization as a self-supervised pretext task to learn a prior-based initialization distribution. A local perceptual regularization is proposed to learn the contextual information. Then, we use the pre-trained parameters of the encoder to initialize the weakly supervised network to improve the effectiveness of feature representation.
Furthermore, we leverage labeled points to directly supervise the network and fine-tune the network parameters. We also introduce a non-parametric label propagation method for weakly supervised semantic segmentation. Some unlabeled points are assigned pseudo labels through the similarity between class prototypes (Li et al. 2020; Qiu, Yao, and Mei 2017) and embeddings of unlabeled points. Therefore, more supervised information is introduced to improve the effectiveness of training. Considering computational and memory efficiency for large-scale point clouds, we choose RandLA-Net (Hu et al. 2020) as the backbone, which is an efficient and lightweight neural architecture for large-scale point cloud semantic segmentation. In the following, we describe the self-supervised pretext task and the sparse label propagation method.

Self-Supervised Pretext Task
In 2D vision tasks, colorization provides a powerful supervisory signal compared with training from scratch. Training data are easy to collect, since any point cloud with color can be used as training data. Due to the progress of point cloud acquisition equipment, we have access to enormous amounts of unlabeled point cloud data with color information. We investigate and implement self-supervised learning on point cloud colorization, which is treated as a pretext task. Point colorization aims to guide the self-supervised model to learn the feature representation.
It is recognized that the Lab color space is well suited to measuring perceptual distance (Zhang, Isola, and Efros 2016; Larsson, Maire, and Shakhnarovich 2017). Thus, we perform point cloud colorization by a, b completion in this color space. Given the lightness channel L, the network predicts the a and b color channels and the local Gaussian distribution for each point. Notably, the value in channel L is replicated three times for each point to keep the same dimension as the input of the segmentation task. Therefore, the input point cloud $X^s = [x_1, x_2, \ldots, x_{N^s}] \in \mathbb{R}^{N^s \times 6}$ consists of $N^s$ 3D points with the xyz coordinates and three copies of L, where $N^s$ is the number of points in one point cloud.
Moreover, we implement RandLA-Net on the self-supervised task by modifying the final output layer. That is, the output of the network is a 6-dimensional vector which contains the predicted $\hat{a}$, $\hat{b}$ and the corresponding local mean and variance.

Loss of Self-Supervised Task. The loss is inherited from the standard regression problem, which minimizes the L1 error between the prediction and the ground truth. Given a point cloud with coordinates and triplicate lightness values, the self-supervised pretext task learns a mapping $\hat{Y} = \mathcal{F}(X^s; \Theta)$, $\hat{Y} = \{\hat{a}, \hat{b}, \hat{\mu}^a, \hat{\sigma}^a, \hat{\mu}^b, \hat{\sigma}^b\}$, where $\hat{a}$, $\hat{b}$, $\hat{\mu}$ and $\hat{\sigma}$ denote the predicted a, b and the corresponding local mean and variance, respectively. The loss of the self-supervised task can be formulated as:

$$L_{ab} = \frac{1}{2N^s} \sum_{i=1}^{N^s} \left( \|a_i - \hat{a}_i\|_1 + \|b_i - \hat{b}_i\|_1 \right). \tag{1}$$

In addition, to learn the local color distribution of every point, we introduce a local perceptual regularization term. If the network can predict the color distribution (mean and variance) of the neighbors, it can embed the local information, which is consistent with the segmentation task using
local features for weakly supervised semantic segmentation. Given a point $x_i$ as the centroid, the local neighborhood $\mathcal{N}(x_i)$ is calculated by KNN according to the Euclidean distance. The ground truth $\mu_i^a$ and $\sigma_i^a$ of the a channel can be obtained by:

$$\mu_i^a = \frac{1}{K} \sum_{j=1}^{K} a_j, \quad \forall x_j \in \mathcal{N}(x_i), \tag{2}$$

$$\sigma_i^a = \sqrt{\frac{1}{K} \sum_{j=1}^{K} (a_j - \mu_i^a)^2 + \varepsilon}, \quad \forall x_j \in \mathcal{N}(x_i), \tag{3}$$

where $\varepsilon$ is a very small constant. $\mu_i^b$ and $\sigma_i^b$ can be obtained in the same way. We formulate the local perceptual regularization term as:

$$L_{local} = \frac{1}{4N^s} \sum_{i=1}^{N^s} \left( \|\mu_i^a - \hat{\mu}_i^a\|_1 + \|\sigma_i^a - \hat{\sigma}_i^a\|_1 + \|\mu_i^b - \hat{\mu}_i^b\|_1 + \|\sigma_i^b - \hat{\sigma}_i^b\|_1 \right). \tag{4}$$

The total loss $L_p$ of the self-supervised pretext task can be expressed as:

$$L_p = L_{ab} + L_{local}. \tag{5}$$

Discussion. Why is the knowledge learned from the pretext task beneficial for semantic segmentation?
• The pretext task learns feature distributions similar to those of the semantic segmentation task. Objects in the same category usually have similar color distributions; for example, vegetation is typically green and roads are black. The surface color texture of the scene provides ample cues for many categories.
• The pretext task embeds the local feature representation. We introduce a local perceptual regularization term to constrain the local color distribution to be consistent with the original distribution. Thus, it allows the network to embed more local information. Therefore, it may enhance the embedding of local features for the semantic segmentation task.

Sparse Label Propagation
Segmentation performance degrades significantly with few labeled points. The main reason is that the supervisory information provided by few labeled points cannot be propagated well to unlabeled points. Therefore, we use the labeled points to assign pseudo labels to unlabeled points, and thus provide additional supervised information to improve the representation of the weakly supervised network.
To achieve this goal, the following items need to be taken into account: 1) The computational complexity and memory consumption should be low. Large-scale point clouds usually contain $\sim 10^6$ points; using all the points as nodes to construct a fully-connected graph would consume a lot of memory and computing resources. 2) The anchor points should be sparse. Ambiguous points should not be given labels to train the network. 3) The propagated labels should be soft. The propagated labels should be related to their similarity: the higher the similarity, the more similar the label should be.
We design a sparse label propagation method. The overall framework is shown in Figure 3. It consists of three parts: class prototype generation, class assignment matrix construction, and sparse pseudo-label generation.

Figure 3: The framework of sparse label propagation. $\odot$ is the Hadamard product and Gather denotes the operator of getting elements by index.

Class prototype generation. In the last two layers of the network, we output the embedding $Z = [z_1^l, z_2^l, \ldots, z_M^l; z_1^u, z_2^u, \ldots, z_N^u] \in \mathbb{R}^{(M+N) \times d}$ and the corresponding prediction $Y = [y_1^l, y_2^l, \ldots, y_M^l; y_1^u, y_2^u, \ldots, y_N^u] \in \mathbb{R}^{(M+N) \times C}$, which come from M labeled points and N unlabeled points. We use $Z_l$ and $Z_u$ to represent the embeddings of labeled and unlabeled points, respectively. Firstly, we generate C prototypes to represent the C classes according to the labeled points. Specifically, we simply take the mean of the labeled point embeddings $Z_l$ for each class. For class c, the prototype $\rho_c$ is given by:

$$\rho_c = \frac{1}{|I_c|} \sum_{z_i^l \in I_c} z_i^l, \tag{6}$$

where $I_c$ denotes the set of embeddings of the labeled points of class c, $c \in \{1, 2, \ldots, C\}$.
Class assignment matrix construction. We leverage the embedding $Z_u$ of unlabeled points and the class prototypes to
construct a similarity matrix $W \in \mathbb{R}^{N \times C}$ by:

$$W_{ic} = \exp\left(-\frac{\|z_i^u - \rho_c\|_2}{\sigma}\right), \tag{7}$$

where $\sigma$ is a hyper-parameter. Each row of $W$ holds the similarities between one unlabeled point and the class prototypes. We use softmax to convert the similarity into a class assignment probability $S$ as:

$$S_{ic} = P(i \mapsto c) = \frac{\exp(W_{ic})}{\sum_{c'=1}^{C} \exp(W_{ic'})}. \tag{8}$$

Sparse pseudo label generation. There are some points with low similarity to every category. These points are not suitable for providing supervisory information to train the network. Specifically, for each class, according to the class assignment matrix $S$, we select the top-K unlabeled points and get the mask $M^k \in \{0, 1\}^{N \times C}$, where $m_{ic}^k = 1$ means that the embedding of the i-th point is among the K points most similar to class c. N is the number of unlabeled points. This is a label expansion method with a balanced number of points per category. It can alleviate the category imbalance to a certain extent. As an unlabeled point may belong to multiple categories, we choose the most similar category and generate a binary mask. According to $M^k$, we get the point mask $M^{pt} \in \{0, 1\}^N$, where $m_i^{pt} = 1$ denotes that the i-th point is assigned a pseudo label. We can get the sparse pseudo label $Y^p \in \mathbb{R}^{N \times C}$ by:

$$Y^p = M^{pt} \odot S, \tag{9}$$

where $\odot$ denotes the Hadamard product. $Y^p$ takes the form of a soft one-hot label. The label propagation loss $L_{sp}$ can be formulated as:

$$L_{sp} = -\frac{1}{\|M^{pt}\|_1} \sum_{i=1}^{N} m_i^{pt} \sum_{c=1}^{C} y_{ic}^p \log y_{ic}^u, \tag{10}$$

where $y_{ic}^u$ is the probability that the unlabeled point i is classified as category c, and $\|\cdot\|_1$ denotes the L1 norm.
Compared with the traditional fully-connected graph label propagation method, our method is computationally efficient. The complexity of our method is $O(NCd)$, while that of the fully-connected graph method is $O((N + M)^2 d)$. C is the number of categories and d denotes the dimension of $Z_u$. The magnitude of C is $\sim 10^1$, which is much smaller than N ($\sim 10^6$).

Loss of Weakly Supervised Task
The loss of weakly supervised semantic segmentation includes two terms, the segmentation loss and the label propagation loss:

$$L_{total} = L_{seg} + \lambda L_{sp}. \tag{11}$$

We utilize a softmax cross-entropy loss on the labeled points. For labeled points, the segmentation loss is formulated as:

$$L_{seg} = -\frac{1}{M} \sum_{i=1}^{M} \sum_{c=1}^{C} y_{ic} \log \frac{\exp(y_{ic}^l)}{\sum_{c'=1}^{C} \exp(y_{ic'}^l)}, \tag{12}$$

where $y_{ic}$ is the ground truth of labeled point i and M denotes the number of labeled points.
In the early stage of training, the embedding is unreliable. Class prototypes generated from the embedding cannot represent the classes well. Thus, the label propagation loss should not be introduced to optimize the network parameters. As the embedding gets better, the weight of the pseudo-label loss should increase. Therefore, we introduce a non-linear parameter $\lambda$ to balance the two losses. In this work, $\lambda$ is formulated as:

$$\lambda = \begin{cases} 0, & epoch < 30 \\ e^{\frac{epoch}{max\ epoch} - 1}, & \text{otherwise} \end{cases} \tag{13}$$

where epoch and max epoch denote the current training epoch and the total number of epochs, respectively.

Experiments and Analysis
In this section, we first introduce the experimental settings. Then, we evaluate the weakly supervised semantic segmentation performance on indoor and outdoor large-scale point clouds, respectively. Furthermore, we perform an ablation study to evaluate the importance of the main components.

Experiment Settings
Dataset Setting for Self-supervised Task. Previous works on unsupervised 3D representation learning (Achlioptas et al. 2018; Gadelha, Wang, and Maji 2018; Hassani and Haley 2019; Li, Chen, and Hee Lee 2018; Sauder and Sievers 2019; Yang et al. 2018) mainly focused on ShapeNet (Chang et al. 2015), which is a dataset of single-object CAD point cloud models. However, a real-world large-scale point cloud usually contains multiple models of different categories. Pre-training on ShapeNet scales poorly because of a large domain gap. We choose ScanNet (Dai et al. 2017) as the pre-training dataset, which is a big real-world point cloud dataset containing 2.5 million views in more than 1500 indoor scans.
To train the self-supervised pretext task, we convert the RGB space to the Lab space and split the 6 channels corresponding to the coordinates (x, y, z) and color (L, a, b) into the given channels (x, y, z, L) and the prediction channels (a, b).
Dataset Setting for Weakly Supervised Segmentation. In order to evaluate the performance of our network on weakly supervised semantic segmentation tasks, we experiment on the indoor scene datasets S3DIS (Armeni et al. 2016) and ScanNetv2 (Dai et al. 2017) and the outdoor dataset Semantic3D (Hackel et al. 2017). Note that our method is pre-trained on the ScanNet dataset. However, the pre-training task does not use the semantic labels.
Implementation Details. The following weakly supervised settings are studied. i) 1 point label (1pt): we assume that only one point of each category is labeled with ground truth in each scene. ii) x%: x percent of the points are labeled with ground truth. We set x = {1, 10} in our experiments. iii) The cost of manually labeling 1% of the points is much higher than that of 1pt. We consider a cheaper alternative: super-point (SPT), which annotates a local area instead of one point. In all settings, the annotated points are randomly selected.
Our pre-training and weakly supervised semantic segmentation networks are built on the efficient backbone of RandLA-Net.
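The sparse label propagation steps of Eqs. (6) to (10) can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names, the default `sigma`, and the tiny list-based data layout are assumptions made for illustration. In particular, the sketch reads the point mask $M^{pt}$ as the union over classes of the per-class top-K masks, which is only one possible interpretation of the text.

```python
import math


def class_prototypes(z_labeled, labels, num_classes):
    """Eq. (6): mean embedding of the labeled points of each class."""
    dim = len(z_labeled[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for z, c in zip(z_labeled, labels):
        counts[c] += 1
        for k in range(dim):
            sums[c][k] += z[k]
    # Assumption: every class has at least one labeled point.
    return [[s / counts[c] for s in sums[c]] for c in range(num_classes)]


def assignment_matrix(z_unlabeled, prototypes, sigma=1.0):
    """Eqs. (7)-(8): similarity to each prototype, then a row-wise softmax."""
    S = []
    for z in z_unlabeled:
        w = [math.exp(-math.dist(z, p) / sigma) for p in prototypes]
        e = [math.exp(v) for v in w]
        total = sum(e)
        S.append([v / total for v in e])
    return S


def sparse_pseudo_labels(S, top_k):
    """Eq. (9): keep only points that are in the top-K of some class."""
    n, num_classes = len(S), len(S[0])
    point_mask = [0] * n
    for c in range(num_classes):
        ranked = sorted(range(n), key=lambda i: S[i][c], reverse=True)
        for i in ranked[:top_k]:
            point_mask[i] = 1  # assumed reading: union of per-class masks
    Y_p = [[point_mask[i] * s for s in S[i]] for i in range(n)]
    return point_mask, Y_p


def propagation_loss(point_mask, Y_p, probs_unlabeled, eps=1e-12):
    """Eq. (10): masked soft cross-entropy, normalized by ||M_pt||_1."""
    selected = sum(point_mask)
    if selected == 0:
        return 0.0
    total = 0.0
    for m, y_p, y_u in zip(point_mask, Y_p, probs_unlabeled):
        if m:
            total -= sum(p * math.log(q + eps) for p, q in zip(y_p, y_u))
    return total / selected
```

A call sequence would be `class_prototypes`, then `assignment_matrix`, then `sparse_pseudo_labels`, with `propagation_loss` applied to the network's predicted class probabilities for the unlabeled points. The similarity step touches each of the N unlabeled points and C prototypes once, which mirrors the $O(NCd)$ cost stated in the text; the full sort used here for top-K selection is a simplification of the sketch.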
Setting        Method            Area5    6-fold
Fully          PointNet [’17]    41       47.6
Fully          DGCNN [’19]       -        56.1
Fully          RSNet [’18]       56.5     -
Fully          PointCNN [’18]    57.3     65.4
Fully          ShellNet [’19]    -        66.8
Fully          RandLA-Net [’20]  62.8*    70.0
1pt (0.2%)     Π Model [’16]     44.3     -
1pt (0.2%)     MT [’17]          44.4     -
1pt (0.2%)     Xu [’20]          44.5     -
0.2%           Ours              56.4     -
1pt (0.03%)    Baseline          40.4     -
1pt (0.03%)    Ours              45.8     -
1%             Ours              61.8     65.9
10%            Π Model [’16]     46.3     -
10%            MT [’17]          47.9     -
10%            Xu [’20]          48.0     -
10%            Ours              64.0     68.1

Setting    Method                  mIoU
Fully      PointNet++ [’17]        33.9
Fully      SPLATNet [’18]          39.3
Fully      PointCNN [’18]          45.8
Fully      KPConv [’19]            68.4
Weakly     MPRM (subcloud) [’20]   41.1
Weakly     Ours (SPT)              49.0
Weakly     Ours (1%)               51.1
Weakly     Ours (10%)              52.0

Table 2: Quantitative results on ScanNetv2 (mIoU %).

method.
The qualitative results on the S3DIS dataset under the 1pt, 1% and 10% settings are shown in Figure 1. It can be seen that, in the 1% setting, our method classifies points correctly except at challenging boundaries.

[Figure: two plots with vertical axis "Validation mIoU".]
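The tables and validation curves report mean intersection-over-union (mIoU). As a reminder of how this metric is commonly computed, here is a minimal plain-Python sketch. The function name and the choice to skip classes that appear in neither the prediction nor the ground truth are assumptions of this sketch; benchmark protocols may differ in such details.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes present in pred or target (assumed convention)."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            inter[p] += 1   # true positive for class p
            union[p] += 1
        else:
            union[p] += 1   # false positive for the predicted class
            union[t] += 1   # false negative for the true class
    ious = [inter[c] / union[c] for c in range(num_classes) if union[c] > 0]
    return sum(ious) / len(ious) if ious else 0.0
```

For each class, the union counts true positives, false positives and false negatives, so the per-class score is TP / (TP + FP + FN) and the reported value is the average over classes.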