Point Transformer V3
arXiv:2312.10035v2 [cs.CV] 25 Mar 2024

[Figure 1: overview of PTv3 results on indoor and outdoor benchmarks (ScanNet, ScanNet200, and S3DIS semantic segmentation, ScanNet instance segmentation, and ScanNet data efficiency with limited reconstructions and annotations), alongside inference latency: MinkUNet 48ms, PTv2 146ms, PTv3 44ms (3.3× faster).]
Figure 1. Overview of Point Transformer V3 (PTv3). Compared to its predecessor, PTv2 [90], our PTv3 shows superiority in the following aspects: 1. Stronger performance. PTv3 achieves state-of-the-art results across a variety of indoor and outdoor 3D perception tasks. 2. Wider receptive field. Benefiting from its simplicity and efficiency, PTv3 expands the receptive field from 16 to 1024 points. 3. Faster speed. PTv3 significantly increases processing speed, making it suitable for latency-sensitive applications. 4. Lower memory consumption. PTv3 reduces memory usage, enhancing accessibility for a broader range of situations.
Abstract

This paper is not motivated to seek innovation within the attention mechanism. Instead, it focuses on overcoming the existing trade-offs between accuracy and efficiency within the context of point cloud processing, leveraging the power of scale. Drawing inspiration from recent advances in 3D large-scale representation learning, we recognize that model performance is more influenced by scale than by intricate design. Therefore, we present Point Transformer V3 (PTv3), which prioritizes simplicity and efficiency over the accuracy of certain mechanisms that are minor to the overall performance after scaling, such as replacing the precise neighbor search by KNN with an efficient serialized neighbor mapping of point clouds organized with specific patterns. This principle enables significant scaling, expanding the receptive field from 16 to 1024 points while remaining efficient (a 3× increase in processing speed and a 10× improvement in memory efficiency compared with its predecessor, PTv2). PTv3 attains state-of-the-art results on over 20 downstream tasks that span both indoor and outdoor scenarios. Further enhanced with multi-dataset joint training, PTv3 pushes these results to a higher level.

1. Introduction

Deep learning models have experienced rapid advancements in various areas, such as 2D vision [24, 38, 78, 86] and natural language processing (NLP) [1, 37, 56, 79], with their progress often attributed to the effective utilization of scale, encompassing factors such as the size of datasets, the number of model parameters, the range of the effective receptive field, and the computing power allocated for training. However, in contrast to the progress made in 2D vision or NLP, the development of 3D backbones [16, 46, 62, 88] has been hindered in terms of scale, primarily due to the limited size and diversity of point cloud data available in separate domains [92]. Consequently, there exists a gap in applying the scaling principles that have driven advancements in other fields [37]. This absence of scale often leads to a limited trade-off between accuracy and speed on 3D backbones, particularly for models based on the transformer architecture [27, 106]. Typically, this trade-off involves sacrificing efficiency for accuracy. Such limited efficiency impedes some of these models' capacity to fully leverage the inherent strength of transformers in scaling the range of receptive fields, hindering their full potential in 3D data processing.

* Corresponding author.
A recent advancement [92] in 3D representation learning has made progress in overcoming the data scale limitation in point cloud processing by introducing a synergistic training approach spanning multiple 3D datasets. Coupled with this strategy, the efficient convolutional backbone [13] has effectively bridged the accuracy gap commonly associated with point cloud transformers [40, 90]. However, point cloud transformers themselves have not yet fully benefited from this privilege of scale, due to their efficiency gap compared to sparse convolution. This discovery shapes the initial motivation for our work: to re-weigh the design choices in point transformers through the lens of the scaling principle. We posit that model performance is more significantly influenced by scale than by intricate design.

Therefore, we introduce Point Transformer V3 (PTv3), which prioritizes simplicity and efficiency over the accuracy of certain mechanisms, thereby enabling scalability. Such adjustments have a negligible impact on overall performance after scaling. Specifically, PTv3 makes the following adaptations to achieve superior efficiency and scalability:
• Inspired by two recent advancements [51, 83] and recognizing the scalability benefits of structuring unstructured point clouds, PTv3 shifts away from the traditional spatial proximity defined by the K-Nearest Neighbors (KNN) query, which accounts for 28% of the forward time. Instead, it explores the potential of serialized neighborhoods in point clouds, organized according to specific patterns.
• PTv3 replaces more complex attention patch interaction mechanisms, such as shift-window (which impedes the fusion of attention operators) and the neighborhood mechanism (which causes high memory consumption), with a streamlined approach tailored for serialized point clouds.
• PTv3 eliminates the reliance on relative positional encoding, which accounts for 26% of the forward time, in favor of a simpler prepositive sparse convolutional layer.
We consider these designs intuitive choices driven by the scaling principles and advancements in existing point cloud transformers. Importantly, this paper underscores the critical importance of recognizing how scalability affects backbone design, rather than detailed module designs.

This principle significantly enhances scalability, overcoming traditional trade-offs between accuracy and efficiency (see Fig. 1). Compared to its predecessor, PTv3 achieves a 3.3× increase in inference speed and a 10.2× reduction in memory usage. More importantly, PTv3 capitalizes on its inherent ability to scale the range of perception, expanding its receptive field from 16 to 1024 points while maintaining efficiency. This scalability underpins its superior performance in real-world perception tasks, where PTv3 achieves state-of-the-art results across over 20 downstream tasks in both indoor and outdoor scenarios. Further augmenting its data scale with multi-dataset training [92], PTv3 elevates these results even more. We hope that our insights will inspire future research in this direction.

2. Related Work

3D understanding. Conventionally, deep neural architectures for understanding 3D point cloud data can be broadly classified into three categories based on their approach to modeling point clouds: projection-based, voxel-based, and point-based methods. Projection-based methods project 3D points onto various image planes and utilize 2D CNN-based backbones for feature extraction [8, 43, 45, 71]. Voxel-based approaches transform point clouds into regular voxel grids to facilitate 3D convolution operations [53, 70], with their efficiency subsequently enhanced by sparse convolution [13, 25, 84]. However, they often lack scalability in terms of kernel size. Point-based methods, by contrast, process point clouds directly [52, 62, 63, 77, 105] and have recently seen a shift towards transformer-based architectures [27, 65, 90, 101, 106]. While these methods are powerful, their efficiency is frequently constrained by the unstructured nature of point clouds, which poses challenges to scaling their designs.

Serialization-based methods. Recent works [7, 51, 83] have introduced approaches diverging from the traditional paradigms of point cloud processing, which we categorize as serialization-based. These methods structure point clouds by sorting them according to specific patterns, transforming unstructured, irregular point clouds into manageable sequences while preserving a degree of spatial proximity. OctFormer [83] inherits its order during octreelization, akin to z-order, offering scalability but remaining constrained by the octree structure itself. FlatFormer [51], on the other hand, employs a window-based sorting strategy for grouping point pillars, akin to window partitioning. However, this design lacks scalability in the receptive field and is better suited to pillar-based 3D object detectors. These pioneering works mark the inception of serialization-based methods. Our PTv3 builds on this foundation, defining and exploring the full potential of point cloud serialization.

3D representation learning. In contrast to 2D domains, where large-scale pre-training has become a standard approach for enhancing downstream tasks [6], 3D representation learning is still in a phase of exploration. Most studies still rely on training models from scratch using specific target datasets [94]. While major efforts in 3D representation learning have focused on individual objects [57, 68, 69, 87, 103], some recent advancements have redirected attention towards training on real-world scene-centric point clouds [30, 36, 91, 94, 107]. This shift signifies a major step forward in 3D scene understanding. Notably, Point Prompt Training (PPT) [92] introduces a new paradigm for large-scale representation learning through multi-dataset synergistic learning, emphasizing the importance of scale. This approach greatly influences our design philosophy and initial motivation for developing PTv3, and we incorporate this strategy in our final results.
Outdoor Efficiency (nuScenes)           Training              Inference
Methods               Params.    Latency    Memory     Latency    Memory
MinkUNet / 3 [13]     37.9M      163ms      3.3G       48ms       1.7G
MinkUNet / 5 [13]     170.3M     455ms      5.6G       145ms      2.1G
MinkUNet / 7 [13]     465.0M     1120ms     12.4G      337ms      2.8G
PTv2 / 16 [90]        12.8M      213ms      10.3G      146ms      12.3G
PTv2 / 24 [90]        12.8M      308ms      17.6G      180ms      15.2G
PTv2 / 32 [90]        12.8M      354ms      21.5G      213ms      19.4G
PTv3 / 256 (ours)     46.2M      120ms      3.3G       44ms       1.2G
PTv3 / 1024 (ours)    46.2M      119ms      3.3G       44ms       1.2G
PTv3 / 4096 (ours)    46.2M      125ms      3.3G       45ms       1.2G

Table 1. Model efficiency. We benchmark the training and inference efficiency of backbones with various scales of receptive field. The batch size is fixed to 1, and the number after "/" denotes the kernel size of sparse convolution and the patch size¹ of attention.
¹ Patch size refers to the number of neighboring points considered together for self-attention mechanisms.

[Figure 2: latency treemap of PTv2 components — KNN Query, Relative Positional Encoding, QKV Encoding, Relation & Weight Encoding, Value Aggregation, Grid Pool, Unpool, FFN.]
Figure 2. Latency treemap of each component of PTv2. We benchmark and visualize the proportion of the forward time of each component of PTv2. KNN Query and RPE occupy a total of 54% of the forward time.
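For context on how numbers like those in Tab. 1 are obtained, below is a minimal measurement sketch, assuming the protocol described later in the model-efficiency paragraph of Sec. 5.2 (one warm-up iteration excluded from timing, then averaged latency and peak GPU memory). The helper and its arguments are illustrative, not the official benchmarking code.

import time
import torch

@torch.no_grad()
def benchmark(model, batch, iters=100):
    # Warm up once (excluded from timing), then average latency over `iters`
    # forward passes and record peak GPU memory.
    model = model.cuda().eval()
    torch.cuda.reset_peak_memory_stats()
    model(batch)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(batch)
    torch.cuda.synchronize()
    latency_ms = (time.time() - start) / iters * 1000
    memory_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    return latency_ms, memory_gb

# Example with a stand-in module; a real run would pass a 3D backbone and a
# point cloud batch, and requires a CUDA device.
# latency_ms, memory_gb = benchmark(torch.nn.Linear(64, 64), torch.randn(100000, 64).cuda())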
3. Design Principle and Pilot Study

In this section, we introduce the scaling principle and pilot study, which guide the design of our model.

Scaling principle. Conventionally, the relationship between accuracy and efficiency in model performance is characterized as a "trade-off", with a typical preference for accuracy at the expense of efficiency. In pursuit of this, numerous methods have been proposed with cumbersome operations. Point Transformers [90, 106] prioritize accuracy and stability by substituting the matrix multiplication in the computation of attention weights with learnable layers and normalization, potentially compromising efficiency. Similarly, Stratified Transformer [40] and Swin3D [101] achieve improved accuracy by incorporating more complex forms of relative positional encoding, yet this often results in decreased computational speed.

Yet, the perceived trade-off between accuracy and efficiency is not absolute, with a notable counterexample emerging through the engagement with scaling strategies. Specifically, sparse convolution, known for its speed and memory efficiency, remains preferred in 3D large-scale pre-training. Utilizing multi-dataset joint training strategies [92], sparse convolution [13, 25] has shown significant performance improvements, increasing mIoU on ScanNet semantic segmentation from 72.2% to 77.0% [107]. This outperforms PTv2 trained from scratch by 1.6%, all while retaining superior efficiency. However, such advancements have not been fully extended to point transformers, primarily due to their efficiency limitations, which burden model training especially when computing resources are constrained.

This observation leads us to hypothesize that model performance may be more significantly influenced by scale than by complex design details. We consider the possibility of trading the accuracy of certain mechanisms for simplicity and efficiency, thereby enabling scalability. By leveraging the strength of scale, such sacrifices could have a negligible impact on overall performance. This concept forms the basis of our scaling principle for backbone design, and we practice it with our design.

Breaking the curse of permutation invariance. Despite the demonstrated efficiency of sparse convolution, the question arises about the need for a scalable point transformer. While multi-dataset joint training allows for data scaling, and the incorporation of more layers and channels contributes to model scaling, efficiently expanding the receptive field to enhance generalization capabilities remains a challenge for convolutional backbones (refer to Tab. 1). Attention, an operator that naturally adapts to kernel shape, has the potential to be universal.

However, current point transformers encounter challenges in scaling when adhering to the requirement of permutation invariance, stemming from the unstructured nature of point cloud data. In PTv1, the application of the K-Nearest Neighbors (KNN) algorithm to formulate local structures introduced computational complexities. PTv2 attempted to relieve this by halving the usage of KNN compared to PTv1. Despite this improvement, KNN still constitutes a significant computational burden, consuming 28% of the forward time (refer to Fig. 2). Additionally, while image Relative Positional Encoding (RPE) benefits from a grid layout that allows relative positions to be predefined, point cloud RPE must resort to computing pairwise Euclidean distances and employing learned layers or lookup tables to map such distances to embeddings, which proves to be another source of inefficiency, occupying 26% of the forward time (see Fig. 2). These extremely inefficient operations bring difficulties when scaling up the backbone.

Inspired by two recent advancements [51, 83], we move away from the traditional paradigm, which treats point clouds as unordered sets. Instead, we choose to "break" the constraints of permutation invariance by serializing point clouds into a structured format. This strategic transformation enables our method to leverage the efficiency benefits of structured data, at the cost of some accuracy in the locality-preserving property. We consider this trade-off an entry point of our design.
[Figure 3 panels: (a) Z-order; (b) Hilbert.]
Figure 3. Point cloud serialization. We show the four patterns of serialization with a triplet visualization. For each triplet, we show the space-filling curve for serialization (left), the point cloud sorting order derived from that space-filling curve (middle), and grouped patches of the serialized point cloud for local attention (right). Shifting across the four serialization patterns allows the attention mechanism to capture various spatial relationships and contexts, leading to an improvement in model accuracy and generalization capacity.
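To make serialization concrete, here is a minimal sketch, assuming non-negative integer voxel coordinates and a fixed bit depth, of computing a z-order code by bit interleaving and sorting the point cloud by it. The helper name and the 16-bit depth are illustrative rather than the official implementation; the Hilbert and transposed variants follow the same sort-by-code idea with different codes.

import numpy as np

def z_order_code(grid_coord, depth=16):
    # Interleave the bits of the x, y, z voxel coordinates into one integer code;
    # sorting by this code traverses the z-order space-filling curve.
    gc = grid_coord.astype(np.uint64)
    code = np.zeros(len(gc), dtype=np.uint64)
    for bit in range(depth):
        for axis in range(3):
            code |= ((gc[:, axis] >> np.uint64(bit)) & np.uint64(1)) << np.uint64(3 * bit + axis)
    return code

points = np.random.rand(1000, 3)                        # xyz coordinates in meters
grid_coord = np.floor(points / 0.02).astype(np.int64)   # voxelize (grid size 0.02 as in Tab. 16)
order = np.argsort(z_order_code(grid_coord))            # serialized order of the point cloud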
[Figure 4 panels: (a) Reordering; (b) Padding.]
Figure 4. Patch grouping. (a) Reordering the point cloud according to the order derived from a specific serialization pattern. (b) Padding the point cloud sequence by borrowing points from neighboring patches to ensure it is divisible by the designated patch size.

[Figure 5 panels: (a) Standard; (b) Shift Dilation; (c) Shift Patch; (d) Shift Order; (e) Shuffle Order.]
Figure 5. Patch interaction. (a) Standard patch grouping with a regular, non-shifted arrangement; (b) Shift Dilation, where points are grouped at regular intervals, creating a dilated effect; (c) Shift Patch, which applies a shifting mechanism similar to the shift-window approach; (d) Shift Order, where different serialization patterns are cyclically assigned to successive attention layers; (e) Shuffle Order, where the sequence of serialization patterns is randomized before being fed to attention layers.

4.2. Serialized Attention

Re-weighing options of the attention mechanism. Image transformers [20, 49, 50], benefiting from the structured and regular grid of pixel data, naturally prefer window [49] and dot-product [21, 80] attention mechanisms. These methods take advantage of the fixed spatial relationships inherent to image data, allowing for efficient and scalable localized processing. However, this advantage vanishes when confronting the unstructured nature of point clouds. To adapt, previous point transformers [90, 106] introduce neighborhood attention to construct even-size attention kernels and adopt vector attention to improve model convergence on point cloud data with more complex spatial relations.

In light of the structured nature of serialized point clouds, we choose to revisit and adopt the efficient window and dot-product attention mechanisms as our foundational approach. While the serialization strategy may temporarily yield lower performance than some neighborhood construction strategies like KNN, due to a reduction in precise spatial neighbor relationships, we will demonstrate that any initial accuracy gaps can be effectively bridged by harnessing the scalability potential inherent in serialization.

Evolving from window attention, we define patch attention, a mechanism that groups points into non-overlapping patches and performs attention within each individual patch. The effectiveness of patch attention relies on two major designs: patch grouping and patch interaction.

Patch grouping. Grouping points into patches within serialized point clouds has been well explored in recent advancements [51, 83]. This process is both natural and efficient, involving the simple grouping of points along the serialized order after padding. Our design for patch attention is also predicated on this strategy, as presented in Fig. 4. In practice, the processes of reordering and patch padding can be integrated into a single indexing operation (a minimal sketch of this step follows at the end of this subsection).

Furthermore, we illustrate the patch grouping patterns derived from the four serialization patterns on the right part of the triplets in Fig. 3. This grouping strategy, in tandem with our serialization patterns, is designed to effectively broaden the attention mechanism's receptive field in 3D space as the patch size increases, while still preserving spatial neighbor relationships to a feasible degree. Although this approach may sacrifice some neighbor-search accuracy compared with KNN, the trade-off is beneficial. Given the attention's re-weighting capacity over reference points, the gains in efficiency and scalability far outweigh the minor loss in neighborhood precision (scaling it up is all we need).

Patch interaction. The interaction between points from different patches is critical for the model to integrate information across the entire point cloud. This design element counters the limitations of a non-overlapping architecture and is pivotal in making patch attention functional. Building on this insight, we investigate various designs for patch interaction, as outlined below (also visualized in Fig. 5):
• In Shift Dilation [83], patch grouping is staggered by a specific step across the serialized point cloud, effectively extending the model's receptive field beyond the immediate neighboring points.
• In Shift Patch, the positions of patches are shifted across the serialized point cloud, drawing inspiration from the shift-window strategy in image transformers [49]. This method maximizes the interaction among patches.
• In Shift Order, the serialized order of the point cloud data is dynamically varied between attention blocks. This technique, which aligns seamlessly with our point cloud serialization method, serves to prevent the model from overfitting to a single pattern and promotes a more robust integration of features across the data.
• Shuffle Order*, building upon Shift Order, introduces a random shuffle to the permutations of serialized orders. This method ensures that the receptive field of each attention layer is not limited to a single pattern.
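Below is the patch-grouping sketch referenced above: a minimal, hedged illustration of turning a serialization code into non-overlapping patches via one sort plus padding, assuming points are borrowed from the preceding patch to fill the last one. Names and the borrowing scheme are illustrative, not the official implementation.

import torch

def group_patches(feat, code, patch_size):
    # feat: (N, C) point features; code: (N,) serialization codes (e.g. z-order).
    order = torch.argsort(code)                      # reordering is a single argsort
    n = feat.shape[0]
    pad = (patch_size - n % patch_size) % patch_size
    if pad > 0:
        # Borrow the last points of the preceding patch so the sequence length
        # becomes divisible by the patch size (cf. Fig. 4b).
        order = torch.cat([order, order[-patch_size:][:pad]])
    patches = feat[order].view(-1, patch_size, feat.shape[1])  # (num_patches, patch_size, C)
    return patches, order

feat = torch.randn(1000, 64)
code = torch.randint(0, 2 ** 30, (1000,))            # stand-in for a space-filling-curve code
patches, order = group_patches(feat, code, patch_size=16)
# Attention is then applied independently within each patch; Shift Order and
# Shuffle Order simply swap which serialization code is used between blocks.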
[Architecture overview: Point Cloud → Serialization → Embedding, followed by an encoder of S stages; each stage applies Grid Pool and N blocks consisting of xCPE, LayerNorm, Attention, LayerNorm, and MLP, with serialized orders shuffled between blocks.]

Patterns             S.O.          + S.D.        + S.P.        + Shuffle O.
Z                    74.3  54ms    75.5  89ms    75.8  86ms    74.3  54ms
H + TH               76.2  60ms    76.1  98ms    76.2  94ms    76.8  60ms
Z + TZ + H + TH      76.5  61ms    76.8  99ms    76.6  97ms    77.3  61ms

Table 2. Serialization patterns and patch interaction. Each cell reports performance and latency for a combination of serialization patterns (rows) and patch interaction designs (columns: Shift Order, Shift Dilation, Shift Patch, Shuffle Order).
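As a concrete reading of the architecture sketch above, the following is a hedged example of a PTv3-style pre-norm block: a prepositive conditional positional encoding (xCPE, approximated here with a plain linear layer instead of a sparse convolution), followed by pre-norm attention inside each patch and an MLP. Layer names, dimensions, and the use of PyTorch's built-in multi-head attention are assumptions for illustration only.

import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim, num_heads, mlp_ratio=4):
        super().__init__()
        self.xcpe = nn.Linear(dim, dim)        # stand-in for the sparse-conv xCPE
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x):
        # x: (num_patches, patch_size, dim); attention stays within each patch.
        x = x + self.xcpe(x)                   # prepositive positional encoding
        h = self.norm1(x)                      # pre-norm before attention
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))        # pre-norm MLP with residual
        return x

x = torch.randn(63, 16, 64)                    # e.g. patches from the grouping step
y = Block(dim=64, num_heads=4)(x)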
Indoor Sem. Seg.      ScanNet [17]     ScanNet200 [67]   S3DIS [2]
Methods               Val    Test      Val    Test       Area5   6-fold
◦ MinkUNet [13]       72.2   73.6      25.0   25.3       65.4    65.4
◦ ST [40]             74.3   73.7      -      -          72.0    -
◦ PointNeXt [64]      71.5   71.2      -      -          70.5    74.9
◦ OctFormer [83]      75.7   76.6      32.6   32.6       -       -
◦ Swin3D [101]        76.4   -         -      -          72.5    76.9
◦ PTv1 [106]          70.6   -         27.8   -          70.4    65.4
◦ PTv2 [90]           75.4   74.2      30.2   -          71.6    73.5
◦ PTv3 (Ours)         77.5   77.9      35.2   37.8       73.4    77.7
• PTv3 (Ours)         78.6   79.4      36.0   39.3       74.7    80.8
Table 5. Indoor semantic segmentation.

Outdoor Sem. Seg.     nuScenes [5]     Sem.KITTI [3]     Waymo Val [72]
Methods               Val    Test      Val    Test       mIoU    mAcc
◦ MinkUNet [13]       73.3   -         63.8   -          65.9    76.6
◦ SPVNAS [73]         77.4   -         64.7   66.4       -       -
◦ Cylinder3D [108]    76.1   77.2      64.3   67.8       -       -
◦ AF2S3Net [10]       62.2   78.0      74.2   70.8       -       -
◦ 2DPASS [98]         -      80.8      69.3   72.9       -       -
◦ SphereFormer [41]   78.4   81.9      67.8   74.8       69.9    -
◦ PTv2 [90]           80.2   82.6      70.3   72.6       70.6    80.2
◦ PTv3 (Ours)         80.4   82.7      70.8   74.2       71.3    80.5
• PTv3 (Ours)         81.2   83.0      72.3   75.5       72.1    81.3
Table 7. Outdoor semantic segmentation.

Method     Metric   Area1   Area2   Area3   Area4   Area5   Area6   6-Fold
◦ PTv2     allAcc   92.30   86.00   92.98   89.23   91.24   94.26   90.76
           mAcc     88.44   72.81   88.41   82.50   77.85   92.44   83.13
           mIoU     81.14   61.25   81.65   69.06   72.02   85.95   75.17
◦ PTv3     allAcc   93.22   86.26   94.56   90.72   91.67   94.98   91.53
           mAcc     89.92   74.44   94.45   81.11   78.92   93.55   85.31
           mIoU     83.01   63.42   86.66   71.34   73.43   87.31   77.70
• PTv3     allAcc   93.70   90.34   94.72   91.87   91.96   94.98   92.59
           mAcc     90.70   78.40   94.27   86.61   80.14   93.80   87.69
           mIoU     83.88   70.11   87.40   75.53   74.33   88.74   80.81
Table 6. S3DIS 6-fold cross-validation.

Indoor Ins. Seg.      ScanNet [17]               ScanNet200 [67]
PointGroup [35]       mAP25   mAP50   mAP        mAP25   mAP50   mAP
◦ MinkUNet [13]       72.8    56.9    36.0       32.2    24.5    15.8
◦ PTv2 [90]           76.3    60.0    38.3       39.6    31.9    21.4
◦ PTv3 (Ours)         77.5    61.7    40.9       40.1    33.2    23.1
• PTv3 (Ours)         78.9    63.5    42.1       40.8    34.1    24.0
Table 8. Indoor instance segmentation.

Data Efficient [30]   Limited Reconstruction       Limited Annotation
Methods               1%     5%     10%    20%     20     50     100    200
◦ MinkUNet [13]       26.0   47.8   56.7   62.9    41.9   53.9   62.2   65.5
◦ PTv2 [90]           24.8   48.1   59.8   66.3    58.4   66.1   70.3   71.2
◦ PTv3 (Ours)         25.8   48.9   61.0   67.0    60.1   67.9   71.4   72.7
• PTv3 (Ours)         31.3   52.6   63.3   68.2    62.4   69.1   74.3   75.5
Table 9. Data efficiency.

A single Shift Order cannot completely harness the potential offered by the four serialization patterns.

Patch interaction. In Tab. 2, we also assess the effectiveness of each alternative patch interaction design. The default setting enables Shift Order, but the first row represents the baseline scenario using a single serialization pattern, indicative of the vanilla configurations of Shift Patch and Shift Dilation (a single serialization order is not shiftable). The results indicate that while Shift Patch and Shift Dilation are indeed effective, their latency is somewhat hindered by the dependency on attention masks, which compromises efficiency. Conversely, Shift Order, which utilizes multiple serialization patterns, offers a simple and efficient alternative that achieves comparable results to these traditional methods. Notably, when combined with Shuffle Order and all four serialization patterns, our strategy not only shows further improvement but also retains its efficiency.

Positional encoding. In Tab. 3, we benchmark our proposed CPE+ against conventional positional encodings such as APE and RPE, as well as recent advanced solutions like cRPE and CPE. The results confirm that while RPE and cRPE are significantly more effective than APE, they also exhibit the inefficiencies previously discussed. Conversely, CPE and CPE+ emerge as superior alternatives. Although CPE+ incorporates slightly more parameters than CPE, it does not compromise our method's efficiency too much. Since CPEs operate prior to the attention phase rather than during it, they benefit from optimizations like flash attention [18, 19], which can be advantageous for our PTv3.

Patch size. In Tab. 4, we explore the scaling of the receptive field of attention by adjusting the patch size. Beginning with a patch size of 16, a standard in prior point transformers, we observe that increasing the patch size significantly enhances performance. Moreover, as indicated in Tab. 1 (benchmarked on the NuScenes dataset), with optimization techniques such as flash attention [18, 19], speed and memory efficiency remain well managed.

5.2. Results Comparison

We benchmark the performance of PTv3 against previous SOTA backbones and present the highest results obtained for each benchmark. In our tables, marker ◦ refers to a model trained from scratch, and • refers to a model trained with multi-dataset joint training (PPT [92]). An exhaustive comparison with earlier works is available in the Appendix.

Indoor semantic segmentation. In Tab. 5, we showcase the validation and test performance of PTv3 on the ScanNet v2 [17] and ScanNet200 [67] benchmarks, along with the Area 5 and 6-fold cross-validation [62] on S3DIS [2] (details in Tab. 6). We report the mean Intersection over Union (mIoU) percentages and benchmark these results against previous backbones.
Waymo Obj. Det.         Vehicle L2      Pedestrian L2   Cyclist L2      Mean L2
Methods             #   mAP    APH      mAP    APH      mAP    APH      mAPH
PointPillars [43]   1   63.6   63.1     62.8   50.3     61.9   59.9     57.8
CenterPoint [102]   1   66.7   66.2     68.3   62.6     68.7   67.6     65.5
SST [22]            1   64.8   64.4     71.7   63.0     68.0   66.9     64.8
SST-Center [22]     1   66.6   66.2     72.4   65.0     68.9   67.6     66.3
VoxSet [28]         1   66.0   65.6     72.5   65.4     69.0   67.7     66.2
PillarNet [26]      1   70.4   69.9     71.6   64.9     67.8   66.7     67.2
FlatFormer [51]     1   69.0   68.6     71.5   65.3     68.6   67.5     67.2
PTv3 (Ours)         1   71.2   70.8     76.3   70.4     71.5   70.4     70.5
CenterPoint [102]   2   67.7   67.2     71.0   67.5     71.5   70.5     68.4
PillarNet [26]      2   71.6   71.6     74.5   71.4     68.3   67.5     70.2
FlatFormer [51]     2   70.8   70.3     73.8   70.5     73.6   72.6     71.2
PTv3 (Ours)         2   72.5   72.1     77.6   74.5     71.0   70.1     72.2
CenterPoint++ [102] 3   71.8   71.4     73.5   70.8     73.7   72.8     71.6
SST [22]            3   66.5   66.1     76.2   72.3     73.6   72.8     70.4
FlatFormer [51]     3   71.4   71.0     74.5   71.3     74.7   73.7     72.0
PTv3 (Ours)         3   73.0   72.5     78.0   75.0     72.3   71.4     73.0
Table 10. Waymo object detection. The column with head name "#" denotes the number of input frames.

Indoor Efficiency (ScanNet)          Training              Inference
Methods               Params.   Latency    Memory    Latency    Memory
MinkUNet [13]         37.9M     267ms      4.9G      90ms       4.7G
OctFormer [83]        44.0M     264ms      12.9G     86ms       12.5G
Swin3D [101]          71.1M     602ms      13.6G     456ms      8.8G
PTv2 [90]             12.8M     312ms      13.4G     191ms      18.2G
PTv3 (ours)           46.2M     151ms      6.8G      61ms       5.2G
Table 11. Indoor model efficiency.

Even without pre-training, PTv3 outperforms PTv2 by 3.7% on the ScanNet test split and by 4.2% on the S3DIS 6-fold CV. The advantage of PTv3 becomes even more pronounced when scaling up the model with multi-dataset joint training [92], widening the margin to 5.2% on ScanNet and 7.3% on S3DIS.

Outdoor semantic segmentation. In Tab. 7, we detail the validation and test results of PTv3 for the nuScenes [5, 23] and SemanticKITTI [3] benchmarks and also include the validation results for the Waymo benchmark [72]. Performance metrics are presented as mIoU percentages by default, with a comparison to prior models. PTv3 demonstrates enhanced performance over the recent state-of-the-art model, SphereFormer, with a 2.0% improvement on nuScenes and a 3.0% increase on SemanticKITTI, both in the validation context. When pre-trained, PTv3's lead extends to 2.8% for nuScenes and 4.5% for SemanticKITTI.

Indoor instance segmentation. In Tab. 8, we present PTv3's validation results on the ScanNet v2 [17] and ScanNet200 [67] instance segmentation benchmarks. We present the performance metrics as mAP, mAP25, and mAP50 and compare them against several popular backbones. To ensure a fair comparison, we standardize the instance segmentation framework by employing PointGroup [35] across all tests, varying only the backbone. Our experiments reveal that integrating PTv3 as a backbone significantly enhances PointGroup, yielding a 4.9% increase in mAP over MinkUNet. Moreover, fine-tuning a PPT pre-trained PTv3 provides an additional gain of 1.2% mAP.

Indoor data efficiency. In Tab. 9, we evaluate the performance of PTv3 on the ScanNet data-efficient [30] benchmark. This benchmark tests models under constrained conditions with limited percentages of available reconstructions (scenes) and restricted numbers of annotated points. Across various settings, from 5% to 20% of reconstructions and from 20 to 200 annotations, PTv3 demonstrates strong performance. Moreover, the application of pre-training technologies further boosts PTv3's performance across all tasks.

Outdoor object detection. In Tab. 10, we benchmark PTv3 against leading single-stage 3D detectors on the Waymo Object Detection benchmark. All models are evaluated using either anchor-based or center-based detection heads [99, 102], with a separate comparison for varying numbers of input frames. Our PTv3, engaged with CenterPoint, consistently outperforms both sparse convolutional [26, 102] and transformer-based [22, 28] detectors, achieving significant gains even when compared with the recent state-of-the-art, FlatFormer [51]. Notably, PTv3 surpasses FlatFormer by 3.3% with a single frame as input and maintains a superiority of 1.0% in multi-frame settings.

Model efficiency. We evaluate model efficiency based on average latency and memory consumption across real-world datasets. Efficiency metrics are measured on a single RTX 4090, excluding the first iteration to ensure steady-state measurements. We compare our PTv3 with multiple previous SOTAs. Specifically, we use the NuScenes dataset to assess outdoor model efficiency (see Tab. 1) and the ScanNet dataset for indoor model efficiency (see Tab. 11). Our results demonstrate that PTv3 not only exhibits the lowest latency across all tested scenarios but also maintains reasonable memory consumption.

6. Conclusion and Discussion

This paper presents Point Transformer V3, a stride towards overcoming the traditional trade-offs between accuracy and efficiency in point cloud processing. Guided by a novel interpretation of the scaling principle in backbone design, we propose that model performance is more profoundly influenced by scale than by complex design intricacies. By prioritizing efficiency over the accuracy of less impactful mechanisms, we harness the power of scale, leading to enhanced performance. Simply put, by making the model simpler and faster, we enable it to become stronger.

We discuss limitations and broader impacts as follows:
• Attention mechanism. In prioritizing efficiency, PTv3 reverts to utilizing dot-product attention, which has been well optimized through engineering efforts. However, we do note a reduction in convergence speed and a
limitation in further scaling depth compared to vector attention. This issue, also observed in recent advancements in transformer technology [93], is attributed to "attention sinks" stemming from the dot-product and softmax operations. Consequently, our findings reinforce the need for continued exploration of attention mechanisms.
• Scaling parameters. PTv3 transcends the existing trade-offs between accuracy and efficiency, paving the way for investigating 3D transformers at larger parameter scales within given computational resources. While this exploration remains a topic for future work, current point cloud transformers already demonstrate an over-capacity for existing tasks. We advocate for a combined approach that scales up both the model parameters and the scope of data and tasks (e.g., learning from all available data, multi-task frameworks, and multi-modality tasks). Such an integrated strategy could fully unlock the potential of scaling in 3D representation learning.
• Multiple modalities. Point cloud serialization provides a robust methodology for transforming n-dimensional data into a structured 1D format, effectively preserving spatial proximity. This technique can similarly be applied to image data, enabling its conversion into a language-style 1D structure that PTv3 can efficiently encode. This capability opens new avenues for the development of multimodal models that bridge 2D and 3D spaces, fostering opportunities for large-scale, synergistic pre-training that integrates both image and point cloud data.

Acknowledgements

This work is supported in part by the National Natural Science Foundation of China (NO. 622014840), the National Key R&D Program of China (NO. 2022ZD0160101), HKU Startup Fund, and HKU Seed Fund for Basic Research.

Scratch                                    Joint Training [92]
Config            Value                    Config            Value
optimizer         AdamW                    optimizer         AdamW
scheduler         Cosine                   scheduler         Cosine
criteria          CrossEntropy (1)         criteria          CrossEntropy (1)
                  Lovasz [4] (1)                             Lovasz [4] (1)
learning rate     5e-3                     learning rate     5e-3
block lr scaler   0.1                      block lr scaler   0.1
weight decay      5e-2                     weight decay      5e-2
batch size        12                       batch size        24
datasets          ScanNet /                datasets          ScanNet (2)
                  S3DIS /                                    S3DIS (1)
                  Struct.3D                                  Struct.3D (4)
warmup epochs     40                       warmup iters      6k
epochs            800                      iters             120k
Table 12. Indoor semantic segmentation settings.

Scratch                                    Joint Training [92]
Config            Value                    Config            Value
optimizer         AdamW                    optimizer         AdamW
scheduler         Cosine                   scheduler         Cosine
criteria          CrossEntropy (1)         criteria          CrossEntropy (1)
                  Lovasz [4] (1)                             Lovasz [4] (1)
learning rate     2e-3                     learning rate     2e-3
block lr scaler   1e-1                     block lr scaler   1e-1
weight decay      5e-3                     weight decay      5e-3
batch size        12                       batch size        24
datasets          NuScenes /               datasets          NuScenes (1)
                  Sem.KITTI /                                Sem.KITTI (1)
                  Waymo                                      Waymo (1)
warmup epochs     2                        warmup iters      9k
epochs            50                       iters             180k
Table 13. Outdoor semantic segmentation settings.
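As a rough illustration of how the scratch-training recipe in Tab. 12 might be expressed programmatically, here is a hedged config sketch; the dictionary keys and type names are assumptions and do not follow the exact Pointcept configuration schema.

# Indoor scratch-training recipe from Tab. 12 as a plain config dict (illustrative keys only).
indoor_scratch_cfg = dict(
    optimizer=dict(type="AdamW", lr=5e-3, weight_decay=5e-2),
    scheduler=dict(type="Cosine", warmup_epochs=40),
    criteria=[dict(type="CrossEntropyLoss", weight=1.0),
              dict(type="LovaszLoss", weight=1.0)],   # Lovasz [4]
    block_lr_scaler=0.1,        # lower learning rate applied to transformer blocks
    batch_size=12,
    epochs=800,
    dataset="ScanNet",          # or S3DIS / Structured3D, one dataset per scratch run
)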
Config                   Value
serialization pattern    Z + TZ + H + TH
patch interaction        Shift Order + Shuffle Order
positional encoding      xCPE
embedding depth          2
embedding channels       32
encoder depth            [2, 2, 6, 2]
encoder channels         [64, 128, 256, 512]
encoder num heads        [4, 8, 16, 32]
encoder patch size       [1024, 1024, 1024, 1024]
decoder depth            [1, 1, 1, 1]
decoder channels         [64, 64, 128, 256]
decoder num heads        [4, 4, 8, 16]
decoder patch size       [1024, 1024, 1024, 1024]
down stride              [×2, ×2, ×2, ×2]
mlp ratio                4
qkv bias                 True
drop path                0.3
Table 15. Model settings.

Augmentations      Parameters                                    Indoor   Outdoor
random dropout     dropout ratio: 0.2, p: 0.2                    ✓        -
random rotate      axis: z, angle: [-1, 1], p: 0.5               ✓        ✓
                   axis: x, angle: [-1/64, 1/64], p: 0.5         ✓        -
                   axis: y, angle: [-1/64, 1/64], p: 0.5         ✓        -
random scale       scale: [0.9, 1.1]                             ✓        ✓
random flip        p: 0.5                                        ✓        ✓
random jitter      sigma: 0.005, clip: 0.02                      ✓        ✓
elastic distort    params: [[0.2, 0.4], [0.8, 1.6]]              ✓        -
auto contrast      p: 0.2                                        ✓        -
color jitter       std: 0.05; p: 0.95                            ✓        -
grid sampling      grid size: 0.02 (indoor), 0.05 (outdoor)      ✓        ✓
sphere crop        ratio: 0.8, max points: 128000                ✓        -
normalize color    p: 1                                          ✓        -
Table 16. Data augmentations.

Block      BN      LN      BN      LN
Pooling    BN      LN      LN      BN
Perf.      76.7    76.1    75.6    77.3
Table 17. Normalization layer.

Block      Traditional   Post-Norm   Pre-Norm
Perf.      76.6          72.3        77.3
Table 18. Block structure.

A.1. Training Settings

Indoor semantic segmentation. The settings for indoor semantic segmentation are outlined in Tab. 12. The two leftmost columns describe the parameters for training from scratch using a single dataset. To our knowledge, this represents the first initiative to standardize training settings across different indoor benchmarks with a unified approach. The two rightmost columns describe the parameters for multi-dataset joint training [92] with PTv3, maintaining similar settings to the scratch training but with an increased batch size. The numbers in brackets indicate the relative weight assigned to each dataset (criteria) in the mix.

Outdoor semantic segmentation. The configuration for outdoor semantic segmentation, presented in Tab. 13, follows a similar format to that of indoor. We also standardize the training settings across three outdoor datasets. Notably, PTv3 operates effectively without the need for point clipping within a specific range, a step that is typically essential in current models. Furthermore, we extend our methodology to multi-dataset joint training with PTv3, employing settings analogous to scratch training but with an augmented batch size. The numbers in brackets represent the proportional weight assigned to each dataset in the training mix.

Other downstream tasks. We outline our configurations for indoor instance segmentation and outdoor object detection in Tab. 14. For indoor instance segmentation, we use PointGroup [35] as our foundational framework, a popular choice in 3D representation learning [30, 91, 92, 94]. Our configuration primarily follows PointContrast [94], with necessary adjustments made for PTv3 compatibility. Regarding outdoor object detection, we adhere to the settings detailed in FlatFormer [51] and implement CenterPoint as our base framework to assess PTv3's effectiveness. It is important to note that PTv3 is versatile and can be integrated with various other frameworks due to its backbone nature.

A.2. Model Settings

As briefly described in Sec. 4.3, here we delve into the detailed model configurations of our PTv3, which are comprehensively listed in Tab. 15. This table serves as a blueprint for components within serialization-based point cloud transformers, encapsulating models like OctFormer [83] and FlatFormer [51] within the outlined frameworks, except for certain limitations discussed in Sec. 2. Specifically, OctFormer can be interpreted as utilizing a single z-order serialization with patch interaction enabled by Shift Dilation. Conversely, FlatFormer can be characterized by its window-based serialization approach, facilitating patch interaction through Shift Order.

A.3. Data Augmentations

The specific configurations of data augmentations implemented for PTv3 are detailed in Tab. 16. We unify the augmentation pipelines for indoor and outdoor scenarios separately, and the configurations are shared by all tasks within each domain. Notably, we observed that PTv3 does not depend on point clipping within a specific range, a process often crucial for existing models.
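To show how the augmentations in Tab. 16 compose into a pipeline, here is a hedged sketch of the indoor configuration as a list of transform configs; the type names and keys are illustrative and do not necessarily match the exact Pointcept transform registry.

indoor_augmentations = [
    dict(type="RandomDropout", dropout_ratio=0.2, p=0.2),
    dict(type="RandomRotate", axis="z", angle=[-1, 1], p=0.5),
    dict(type="RandomRotate", axis="x", angle=[-1 / 64, 1 / 64], p=0.5),
    dict(type="RandomRotate", axis="y", angle=[-1 / 64, 1 / 64], p=0.5),
    dict(type="RandomScale", scale=[0.9, 1.1]),
    dict(type="RandomFlip", p=0.5),
    dict(type="RandomJitter", sigma=0.005, clip=0.02),
    dict(type="ElasticDistortion", params=[[0.2, 0.4], [0.8, 1.6]]),
    dict(type="AutoContrast", p=0.2),
    dict(type="ColorJitter", std=0.05, p=0.95),
    dict(type="GridSample", grid_size=0.02),            # 0.05 for outdoor scenes
    dict(type="SphereCrop", ratio=0.8, max_points=128000),
    dict(type="NormalizeColor"),
]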
B. Additional Ablations

In this section, we present further ablation studies focusing on the macro designs of PTv3, previously discussed in Sec. 4.3.

B.1. Normalization Layer

Previous point transformers employ Batch Normalization (BN), which can lead to performance variability depending on the batch size. This variability becomes particularly problematic in scenarios with memory constraints that require small batch sizes, or in tasks demanding dynamic or varying batch sizes. To address this issue, we gradually transitioned to Layer Normalization (LN). Our final, empirically determined choice is to implement Layer Normalization in the attention blocks while retaining Batch Normalization in the pooling layers (see Tab. 17).

B.2. Block Structure

Previous point transformers use a traditional block structure that sequentially applies an operator, a normalization layer, and an activation function. While effective, this approach can sometimes complicate training deeper models due to issues like vanishing gradients or the need for careful initialization and learning-rate adjustments [95]. Consequently, we explored adopting a more modern block structure, such as pre-norm and post-norm. The pre-norm structure, where a normalization layer precedes the operator, can stabilize training by ensuring normalized inputs for each layer [12]. In contrast, the post-norm structure places a normalization layer right after the operator, potentially leading to faster convergence but with less stability [80]. Our experimental results (see Tab. 18) indicate that the pre-norm structure is more suitable for our PTv3, aligning with findings in recent transformer-based models [95].

C. Additional Comparison

In this section, we expand upon the combined results tables for semantic segmentation (Tab. 5 and Tab. 7) from our main paper, offering a more detailed breakdown of results alongside the respective publication years of previous works. These comprehensive result tables are designed to assist readers in tracking the progression of research efforts in 3D representation learning. Marker ◦ refers to the result from a model trained from scratch, and • refers to the result from a pre-trained model.

C.1. Indoor Semantic Segmentation

We conduct a detailed comparison of pre-training technologies and backbones on the ScanNet v2 [17] (see Tab. 19) and S3DIS [2] (see Tab. 20) datasets. ScanNet v2 comprises 1,513 room scans reconstructed from RGB-D frames, divided into 1,201 training scenes and 312 for validation. In this dataset, model input point clouds are sampled from the vertices of reconstructed meshes, with each point assigned a semantic label from 20 categories (e.g., wall, floor, table). The S3DIS dataset for semantic scene parsing includes 271 rooms across six areas from three buildings. Following common practice [63, 75, 106], we withhold Area 5 for testing and perform a 6-fold cross-validation. Different from ScanNet v2, S3DIS consists of densely sampled points on mesh surfaces, annotated into 13 categories. Consistent with standard practice [63], we employ the mean class-wise intersection over union (mIoU) as the primary evaluation metric for indoor semantic segmentation.

Methods                         Year   Val    Test
◦ PointNet++ [63]               2017   53.5   55.7
◦ 3DMV [16]                     2018   -      48.4
◦ PointCNN [46]                 2018   -      45.8
◦ SparseConvNet [25]            2018   69.3   72.5
◦ PanopticFusion [55]           2019   -      52.9
◦ PointConv [88]                2019   61.0   66.6
◦ JointPointBased [11]          2019   69.2   63.4
◦ KPConv [77]                   2019   69.2   68.6
◦ PointASNL [97]                2020   63.5   66.6
◦ SegGCN [44]                   2020   -      58.9
◦ RandLA-Net [32]               2020   -      64.5
◦ JSENet [33]                   2020   -      69.9
◦ FusionNet [104]               2020   -      68.8
◦ FastPointTransformer [58]     2022   72.4   -
◦ StratifiedTransformer [40]    2022   74.3   73.7
◦ PointNeXt [64]                2022   71.5   71.2
◦ LargeKernel3D [9]             2023   73.5   73.9
◦ PointMetaBase [47]            2023   72.8   71.4
◦ PointConvFormer [89]          2023   74.5   74.9
◦ OctFormer [83]                2023   75.7   76.6
◦ Swin3D [101]                  2023   77.5   77.9
• + Supervised [101]            2023   76.7   77.9
◦ MinkUNet [13]                 2019   72.2   73.6
• + PC [94]                     2020   74.1   -
• + CSC [30]                    2021   73.8   -
• + MSC [91]                    2023   75.5   -
• + GC [81]                     2024   75.7   -
• + PPT [92]                    2024   76.4   76.6
◦ OA-CNNs [60]                  2024   76.1   75.6
◦ PTv1 [106]                    2021   70.6   -
◦ PTv2 [90]                     2022   75.4   74.2
◦ PTv3 (Ours)                   2024   77.5   77.9
• + PPT [92]                    2024   78.6   79.4
Table 19. ScanNet V2 semantic segmentation.

Methods                         Year   Area5   6-fold
◦ PointNet [62]                 2017   41.1    47.6
◦ SegCloud [75]                 2017   48.9    -
◦ TanConv [74]                  2018   52.6    -
◦ PointCNN [46]                 2018   57.3    65.4
◦ ParamConv [85]                2018   58.3    -
◦ PointWeb [105]                2019   60.3    66.7
◦ HPEIN [34]                    2019   61.9    -
◦ KPConv [77]                   2019   67.1    70.6
◦ GACNet [82]                   2019   62.9    -
◦ PAT [100]                     2019   60.1    -
◦ SPGraph [42]                  2018   58.0    62.1
◦ SegGCN [44]                   2020   63.6    -
◦ PAConv [96]                   2021   66.6    -
◦ StratifiedTransformer [40]    2022   72.0    -
◦ PointNeXt [64]                2022   70.5    74.9
◦ SuperpointTransformer [65]    2023   68.9    76.0
◦ PointMetaBase [47]            2023   72.0    77.0
◦ Swin3D [101]                  2023   72.5    76.9
• + Supervised [101]            2023   74.5    79.8
◦ MinkUNet [13]                 2019   65.4    65.4
• + PC [94]                     2020   70.3    -
• + CSC [30]                    2021   72.2    -
• + MSC [91]                    2023   70.1    -
• + GC [81]                     2024   72.0    -
• + PPT [92]                    2024   72.7    78.1
◦ PTv1 [106]                    2021   70.4    65.4
◦ PTv2 [90]                     2022   71.6    73.5
◦ PTv3 (Ours)                   2024   73.4    77.7
• + PPT [92]                    2024   74.7    80.8
Table 20. S3DIS semantic segmentation.
C.2. Outdoor Semantic Segmentation

We extend our comprehensive evaluation of pre-training technologies and backbones to outdoor semantic segmentation tasks, focusing on the SemanticKITTI [3] (see Tab. 21) and NuScenes [5] (see Tab. 22) datasets. SemanticKITTI is derived from the KITTI Vision Benchmark Suite and consists of 22 sequences, with 19 for training and the remaining 3 for testing. It features richly annotated LiDAR scans, offering a diverse array of driving scenarios. Each point in this dataset is labeled with one of 28 semantic classes, encompassing various elements of urban driving environments. NuScenes, on the other hand, provides a large-scale dataset for autonomous driving, comprising 1,000 diverse urban driving scenes from Boston and Singapore. For outdoor semantic segmentation, we also employ the mean class-wise intersection over union (mIoU) as the primary evaluation metric.

Methods                 Year   Val    Test
◦ SPVNAS [73]           2020   64.7   66.4
◦ Cylinder3D [108]      2021   64.3   67.8
◦ PVKD [31]             2022   -      71.2
◦ 2DPASS [98]           2022   69.3   72.9
◦ WaffleIron [61]       2023   68.0   70.8
◦ SphereFormer [41]     2023   67.8   74.8
◦ RangeFormer [39]      2023   67.6   73.3
◦ MinkUNet [13]         2019   63.8   -
• + M3Net [48]          2024   69.9   -
• + PPT [92]            2024   71.4   -
◦ OA-CNNs [60]          2024   70.6   -
◦ PTv2 [90]             2022   70.3   72.6
◦ PTv3 (Ours)           2024   70.8   74.2
• + M3Net [48]          2024   72.0   75.1
• + PPT [92]            2024   72.3   75.5
Table 21. SemanticKITTI semantic segmentation.

Methods                 Year   Val    Test
◦ SPVNAS [73]           2020   77.4   -
◦ Cylinder3D [108]      2021   76.1   77.2
◦ PVKD [31]             2022   -      76.0
◦ 2DPASS [98]           2022   -      80.8
◦ SphereFormer [41]     2023   78.4   81.9
◦ RangeFormer [39]      2023   78.1   80.1
◦ MinkUNet [13]         2019   73.3   -
• + M3Net [48]          2024   79.0   -
• + PPT [92]            2024   78.6   -
◦ OA-CNNs [60]          2024   78.9   -
◦ PTv2 [90]             2022   80.2   82.6
◦ PTv3 (Ours)           2024   80.4   82.7
• + M3Net [48]          2024   80.9   83.1
• + PPT [92]            2024   81.2   83.0
Table 22. NuScenes semantic segmentation.

References

[1] Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q. Tran, Dara Bahri, Jianmo Ni, Jai Gupta, Kai Hui, Sebastian Ruder, and Donald Metzler. Ext5: Towards extreme multi-task scaling for transfer learning. In ICLR, 2022. 1
[2] Iro Armeni, Ozan Sener, Amir R. Zamir, Helen Jiang, Ioannis Brilakis, Martin Fischer, and Silvio Savarese. 3d semantic parsing of large-scale indoor spaces. In CVPR, 2016. 7, 11
[3] Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In ICCV, 2019. 7, 8, 12
[4] Maxim Berman, Amal Rannen Triki, and Matthew B Blaschko. The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In CVPR, 2018. 9
[5] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020. 7, 8, 12
[6] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In CVPR, 2021. 2
[7] Wanli Chen, Xinge Zhu, Guojin Chen, and Bei Yu. Efficient point cloud analysis using hilbert curve. In ECCV, 2022. 2
[8] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. In CVPR, 2017. 2
[9] Yukang Chen, Jianhui Liu, Xiangyu Zhang, Xiaojuan Qi, and Jiaya Jia. Largekernel3d: Scaling up kernels in 3d sparse cnns. In CVPR, 2023. 11
[10] Ran Cheng, Ryan Razani, Ehsan Taghavi, Enxu Li, and Bingbing Liu. (af)2-s3net: Attentive feature fusion with adaptive feature selection for sparse semantic segmentation network. In CVPR, 2021. 7
[11] Hung-Yueh Chiang, Yen-Liang Lin, Yueh-Cheng Liu, and Winston H Hsu. A unified point-based framework for 3d segmentation. In 3DV, 2019. 11
[12] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv:1904.10509, 2019. 6, 11
[13] Christopher Choy, JunYoung Gwak, and Silvio Savarese. [29] David Hilbert and David Hilbert. Über die stetige abbildung
4D spatio-temporal convnets: Minkowski convolutional einer linie auf ein flächenstück. Dritter Band: Analysis·
neural networks. In CVPR, 2019. 2, 3, 7, 8, 11, 12 Grundlagen der Mathematik· Physik Verschiedenes: Nebst
[14] Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, Einer Lebensgeschichte, 1935. 4
Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Con- [30] Ji Hou, Benjamin Graham, Matthias Nießner, and Saining
ditional positional encodings for vision transformers. Xie. Exploring data-efficient 3d scene understanding with
arXiv:2102.10882, 2021. 6 contrastive scene contexts. In CVPR, 2021. 2, 7, 8, 10, 11
[15] Pointcept Contributors. Pointcept: A codebase for point [31] Yuenan Hou, Xinge Zhu, Yuexin Ma, Chen Change Loy,
cloud perception research. https://github.com/ and Yikang Li. Point-to-voxel knowledge distillation for
Pointcept/Pointcept, 2023. 9 lidar semantic segmentation. In CVPR, 2022. 12
[16] Angela Dai and Matthias Nießner. 3dmv: Joint 3d-multi-
[32] Qingyong Hu, Bo Yang, Linhai Xie, Stefano Rosa, Yulan
view prediction for 3d semantic scene segmentation. In
Guo, Zhihua Wang, Niki Trigoni, and Andrew Markham.
ECCV, 2018. 1, 11
Randla-net: Efficient semantic segmentation of large-scale
[17] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Hal- point clouds. In CVPR, 2020. 11
ber, Thomas Funkhouser, and Matthias Nießner. ScanNet:
[33] Zeyu Hu, Mingmin Zhen, Xuyang Bai, Hongbo Fu, and
Richly-annotated 3d reconstructions of indoor scenes. In
Chiew-lan Tai. Jsenet: Joint semantic segmentation and
CVPR, 2017. 7, 8, 11
edge detection network for 3d point clouds. In ECCV, 2020.
[18] Tri Dao. Flashattention-2: Faster attention with better par-
11
allelism and work partitioning. arXiv:2307.08691, 2023.
7 [34] Li Jiang, Hengshuang Zhao, Shu Liu, Xiaoyong Shen, Chi-
[19] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Wing Fu, and Jiaya Jia. Hierarchical point-edge interaction
Christopher Ré. FlashAttention: Fast and memory-efficient network for point cloud semantic segmentation. In ICCV,
exact attention with IO-awareness. In NeurIPS, 2022. 7 2019. 11
[20] Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming [35] Li Jiang, Hengshuang Zhao, Shaoshuai Shi, Shu Liu, Chi-
Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Wing Fu, and Jiaya Jia. Pointgroup: Dual-set point group-
Guo. Cswin transformer: A general vision transformer ing for 3d instance segmentation. CVPR, 2020. 7, 8, 9,
backbone with cross-shaped windows. In CVPR, 2022. 5 10
[21] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, [36] Li Jiang, Zetong Yang, Shaoshuai Shi, Vladislav Golyanik,
Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Dengxin Dai, and Bernt Schiele. Self-supervised pre-
Mostafa Dehghani, Matthias Minderer, Georg Heigold, training with masked shape prediction for 3d scene under-
Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An im- standing. In CVPR, 2023. 2
age is worth 16x16 words: Transformers for image recog- [37] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B
nition at scale. ICLR, 2021. 5 Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec
[22] Lue Fan, Ziqi Pang, Tianyuan Zhang, Yu-Xiong Wang, Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for
Hang Zhao, Feng Wang, Naiyan Wang, and Zhaoxiang neural language models. arXiv:2001.08361, 2020. 1
Zhang. Embracing single stride 3d object detector with [38] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi
sparse transformer. In CVPR, 2022. 8 Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer
[23] Whye Kit Fong, Rohit Mohan, Juana Valeria Hurtado, Lub- Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár,
ing Zhou, Holger Caesar, Oscar Beijbom, and Abhinav Val- and Ross Girshick. Segment anything. In ICCV, 2023. 1
ada. Panoptic nuscenes: A large-scale benchmark for lidar
[39] Lingdong Kong, Youquan Liu, Runnan Chen, Yuexin Ma,
panoptic segmentation and tracking. RA-L, 2022. 8
Xinge Zhu, Yikang Li, Yuenan Hou, Yu Qiao, and Ziwei
[24] Priya Goyal, Mathilde Caron, Benjamin Lefaudeux, Min
Liu. Rethinking range view representation for lidar seg-
Xu, Pengchao Wang, Vivek Pai, Mannat Singh, Vitaliy
mentation. In ICCV, 2023. 12
Liptchinsky, Ishan Misra, Armand Joulin, et al. Self-
supervised pretraining of visual features in the wild. [40] Xin Lai, Jianhui Liu, Li Jiang, Liwei Wang, Hengshuang
arXiv:2103.01988, 2021. 1 Zhao, Shu Liu, Xiaojuan Qi, and Jiaya Jia. Stratified trans-
former for 3d point cloud segmentation. In CVPR, 2022. 2,
[25] Benjamin Graham, Martin Engelcke, and Laurens van der
3, 6, 7, 11
Maaten. 3d semantic segmentation with submanifold sparse
convolutional networks. In CVPR, 2018. 2, 3, 11 [41] Xin Lai, Yukang Chen, Fanbin Lu, Jianhui Liu, and Jiaya
[26] Chao Ma Guangsheng Shi, Ruifeng Li. Pillarnet: Real-time Jia. Spherical transformer for lidar-based 3d recognition. In
and high-performance pillar-based 3d object detection. In CVPR, 2023. 7, 12
ECCV, 2022. 8 [42] Loic Landrieu and Martin Simonovsky. Large-scale point
[27] Meng-Hao Guo, Jun-Xiong Cai, Zheng-Ning Liu, Tai-Jiang cloud semantic segmentation with superpoint graphs. In
Mu, Ralph R Martin, and Shi-Min Hu. Pct: Point cloud CVPR, 2018. 11
transformer. Computational Visual Media, 2021. 1, 2 [43] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou,
[28] Chenhang He, Ruihuang Li, Shuai Li, and Lei Zhang. Voxel Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders
set transformer: A set-to-set approach to 3d object detection for object detection from point clouds. In CVPR, 2019. 2,
from point clouds. In CVPR, 2022. 8 8
13
[44] Huan Lei, Naveed Akhtar, and Ajmal Mian. Seggcn: Efficient 3d point cloud segmentation with fuzzy spherical kernel. In CVPR, 2020. 11
[45] Bo Li, Tianlei Zhang, and Tian Xia. Vehicle detection from 3d lidar using fully convolutional network. In RSS, 2016. 2
[46] Yangyan Li, Rui Bu, Mingchao Sun, Wei Wu, Xinhan Di, and Baoquan Chen. Pointcnn: Convolution on x-transformed points. In NeurIPS, 2018. 1, 11
[47] Haojia Lin, Xiawu Zheng, Lijiang Li, Fei Chao, Shanshan Wang, Yan Wang, Yonghong Tian, and Rongrong Ji. Meta architecture for point cloud analysis. In CVPR, 2023. 11
[48] Youquan Liu, Lingdong Kong, Xiaoyang Wu, Runnan Chen, Xin Li, Liang Pan, Ziwei Liu, and Yuexin Ma. Multi-space alignments towards universal lidar segmentation. In CVPR, 2024. 12
[49] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021. 5
[50] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In CVPR, 2022. 5
[51] Zhijian Liu, Xinyu Yang, Haotian Tang, Shang Yang, and Song Han. Flatformer: Flattened window attention for efficient point cloud transformer. In CVPR, 2023. 2, 3, 5, 8, 10
[52] Xu Ma, Can Qin, Haoxuan You, Haoxi Ran, and Yun Fu. Rethinking network design and local geometry in point cloud: A simple residual mlp framework. In ICLR, 2022. 2
[53] Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In IROS, 2015. 2
[54] Guy M Morton. A computer oriented geodetic data base and a new technique in file sequencing. International Business Machines Company New York, 1966. 4
[55] Gaku Narita, Takashi Seno, Tomoya Ishikawa, and Yohsuke Kaji. Panopticfusion: Online volumetric semantic mapping at the level of stuff and things. In IROS, 2019. 11
[56] OpenAI. Gpt-4 technical report. arXiv:2303.08774, 2023. 1
[57] Yatian Pang, Wenxiao Wang, Francis EH Tay, Wei Liu, Yonghong Tian, and Li Yuan. Masked autoencoders for point cloud self-supervised learning. In ECCV, 2022. 2
[58] Chunghyun Park, Yoonwoo Jeong, Minsu Cho, and Jaesik Park. Fast point transformer. In CVPR, 2022. 11
[59] Giuseppe Peano. Sur une courbe, qui remplit toute une aire plane. Springer, 1990. 4
[60] Bohao Peng, Xiaoyang Wu, Li Jiang, Yukang Chen, Hengshuang Zhao, Zhuotao Tian, and Jiaya Jia. Oa-cnns: Omni-adaptive sparse cnns for 3d semantic segmentation. In CVPR, 2024. 11, 12
[61] Gilles Puy, Alexandre Boulch, and Renaud Marlet. Using a waffle iron for automotive point cloud semantic segmentation. In ICCV, 2023. 12
[62] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017. 1, 2, 7, 11
[63] Charles R Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS, 2017. 2, 11, 12
[64] Guocheng Qian, Yuchen Li, Houwen Peng, Jinjie Mai, Hasan Hammoud, Mohamed Elhoseiny, and Bernard Ghanem. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. In NeurIPS, 2022. 7, 11
[65] Damien Robert, Hugo Raguet, and Loic Landrieu. Efficient 3d semantic segmentation with superpoint transformer. In ICCV, 2023. 2, 11
[66] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015. 6
[67] David Rozenberszki, Or Litany, and Angela Dai. Language-grounded indoor 3d semantic segmentation in the wild. In ECCV, 2022. 7, 8
[68] Aditya Sanghi. Info3d: Representation learning on 3d objects using mutual information maximization and contrastive learning. In ECCV, 2020. 2
[69] Jonathan Sauder and Bjarne Sievers. Self-supervised deep learning on point clouds by reconstructing space. In NeurIPS, 2019. 2
[70] Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In CVPR, 2017. 2
[71] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik G. Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In ICCV, 2015. 2
[72] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, 2020. 7, 8
[73] Haotian Tang, Zhijian Liu, Shengyu Zhao, Yujun Lin, Ji Lin, Hanrui Wang, and Song Han. Searching efficient 3d architectures with sparse point-voxel convolution. In ECCV, 2020. 7, 12
[74] Maxim Tatarchenko, Jaesik Park, Vladlen Koltun, and Qian-Yi Zhou. Tangent convolutions for dense prediction in 3d. In CVPR, 2018. 11
[75] Lyne Tchapmi, Christopher Choy, Iro Armeni, JunYoung Gwak, and Silvio Savarese. Segcloud: Semantic segmentation of 3d point clouds. In 3DV, 2017. 11, 12
[76] OpenPCDet Development Team. Openpcdet: An open-source toolbox for 3d object detection from point clouds. https://github.com/open-mmlab/OpenPCDet, 2020. 9
[77] Hugues Thomas, Charles R Qi, Jean-Emmanuel Deschaud, Beatriz Marcotegui, François Goulette, and Leonidas J Guibas. Kpconv: Flexible and deformable convolution for point clouds. In ICCV, 2019. 2, 11
[78] Yonglong Tian, Olivier J Henaff, and Aäron van den Oord. Divide and contrast: Self-supervised learning from uncurated data. In CVPR, 2021. 1
[79] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv:2302.13971, 2023. 1
[80] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 5, 6, 11
[81] Chengyao Wang, Li Jiang, Xiaoyang Wu, Zhuotao Tian, Bohao Peng, Hengshuang Zhao, and Jiaya Jia. Groupcontrast: Semantic-aware self-supervised representation learning for 3d understanding. In CVPR, 2024. 11
[82] Lei Wang, Yuchun Huang, Yaolin Hou, Shenman Zhang, and Jie Shan. Graph attention convolution for point cloud semantic segmentation. In CVPR, 2019. 11
[83] Peng-Shuai Wang. Octformer: Octree-based transformers for 3D point clouds. In SIGGRAPH, 2023. 2, 3, 5, 6, 7, 8, 10, 11
[84] Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. O-CNN: Octree-based convolutional neural networks for 3D shape analysis. In SIGGRAPH, 2017. 2, 6
[85] Shenlong Wang, Simon Suo, Wei-Chiu Ma, Andrei Pokrovsky, and Raquel Urtasun. Deep parametric continuous convolutional neural networks. In CVPR, 2018. 11
[86] Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, and Tiejun Huang. Images speak in images: A generalist painter for in-context visual learning. In CVPR, 2023. 1
[87] Yue Wang and Justin M Solomon. Deep closest point: Learning representations for point cloud registration. In ICCV, 2019. 2
[88] Wenxuan Wu, Zhongang Qi, and Li Fuxin. Pointconv: Deep convolutional networks on 3d point clouds. In CVPR, 2019. 1, 11
[89] Wenxuan Wu, Li Fuxin, and Qi Shan. Pointconvformer: Revenge of the point-based convolution. In CVPR, 2023. 11
[90] Xiaoyang Wu, Yixing Lao, Li Jiang, Xihui Liu, and Hengshuang Zhao. Point transformer v2: Grouped vector attention and partition-based pooling. In NeurIPS, 2022. 1, 2, 3, 5, 6, 7, 8, 11, 12
[91] Xiaoyang Wu, Xin Wen, Xihui Liu, and Hengshuang Zhao. Masked scene contrast: A scalable framework for unsupervised 3d representation learning. In CVPR, 2023. 2, 10, 11
[92] Xiaoyang Wu, Zhuotao Tian, Xin Wen, Bohao Peng, Xihui Liu, Kaicheng Yu, and Hengshuang Zhao. Towards large-scale 3d representation learning with multi-dataset point prompt training. In CVPR, 2024. 1, 2, 3, 7, 8, 9, 10, 11, 12
[93] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv, 2023. 9
[94] Saining Xie, Jiatao Gu, Demi Guo, Charles R Qi, Leonidas Guibas, and Or Litany. Pointcontrast: Unsupervised pre-training for 3d point cloud understanding. In ECCV, 2020. 2, 10, 11
[95] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In ICML, 2020. 11
[96] Mutian Xu, Runyu Ding, Hengshuang Zhao, and Xiaojuan Qi. Paconv: Position adaptive convolution with dynamic kernel assembling on point clouds. In CVPR, 2021. 11
[97] Xu Yan, Chaoda Zheng, Zhen Li, Sheng Wang, and Shuguang Cui. Pointasnl: Robust point clouds processing using nonlocal neural networks with adaptive sampling. In CVPR, 2020. 11
[98] Xu Yan, Jiantao Gao, Chaoda Zheng, Chao Zheng, Ruimao Zhang, Shuguang Cui, and Zhen Li. 2dpass: 2d priors assisted semantic segmentation on lidar point clouds. In ECCV, 2022. 7, 12
[99] Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018. 8
[100] Jiancheng Yang, Qiang Zhang, Bingbing Ni, Linguo Li, Jinxian Liu, Mengdie Zhou, and Qi Tian. Modeling point clouds with self-attention and gumbel subset sampling. In CVPR, 2019. 11
[101] Yu-Qi Yang, Yu-Xiao Guo, Jian-Yu Xiong, Yang Liu, Hao Pan, Peng-Shuai Wang, Xin Tong, and Baining Guo. Swin3d: A pretrained transformer backbone for 3d indoor scene understanding. arXiv:2304.06906, 2023. 2, 3, 6, 7, 8, 11
[102] Tianwei Yin, Xingyi Zhou, and Philipp Krähenbühl. Center-based 3d object detection and tracking. In CVPR, 2021. 8, 9
[103] Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. Point-BERT: Pre-training 3D point cloud transformers with masked point modeling. In CVPR, 2022. 2
[104] Feihu Zhang, Jin Fang, Benjamin Wah, and Philip Torr. Deep fusionnet for point cloud semantic segmentation. In ECCV, 2020. 11
[105] Hengshuang Zhao, Li Jiang, Chi-Wing Fu, and Jiaya Jia. Pointweb: Enhancing local neighborhood features for point cloud processing. In CVPR, 2019. 2, 11
[106] Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip Torr, and Vladlen Koltun. Point transformer. In ICCV, 2021. 1, 2, 3, 5, 6, 7, 11, 12
[107] Haoyi Zhu, Honghui Yang, Xiaoyang Wu, Di Huang, Sha Zhang, Xianglong He, Tong He, Hengshuang Zhao, Chunhua Shen, Yu Qiao, et al. Ponderv2: Pave the way for 3d foundation model with a universal pre-training paradigm. arXiv:2310.08586, 2023. 2, 3
[108] Xinge Zhu, Hui Zhou, Tai Wang, Fangzhou Hong, Yuexin Ma, Wei Li, Hongsheng Li, and Dahua Lin. Cylindrical and asymmetrical 3d convolution networks for lidar segmentation. In CVPR, 2021. 7, 12