Boosting Cross-Domain Point Classification via Distilling Relational Priors from 2D Transformers

Longkun Zou, Wanru Zhu, Ke Chen🖂, Lihua Guo🖂, Kailing Guo, Kui Jia, and Yaowei Wang

This work is supported in part by the Guangdong Pearl River Talent Program (Introduction of Young Talent) under Grant No. 2019QN01X246, the Guangdong Basic and Applied Basic Research Foundation under Grant No. 2023A1515011104 and the Major Key Project of Peng Cheng Laboratory under Grant No. PCL2023A08. (Longkun Zou and Wanru Zhu contributed equally to this work.) (Corresponding author: Ke Chen; Lihua Guo.) L. Zou, W. Zhu, L. Guo and K. Guo are with the School of Electronic and Information Engineering, South China University of Technology, Guangzhou, 510641, China. L. Zou is an intern at the Peng Cheng Laboratory, Shenzhen 518000, China. K. Chen and Y. Wang are with the Peng Cheng Laboratory, Shenzhen 518000, China. K. Jia is with the Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), Shenzhen 518000, China.

Abstract

The semantic pattern of an object point cloud is determined by the topological configuration of its local geometries. Learning discriminative representations can be challenging due to large shape variations of point sets in local regions and incomplete surfaces from a global perspective, and these difficulties become even more severe in the context of unsupervised domain adaptation (UDA). Specifically, traditional 3D networks mainly focus on local geometric details and ignore the topological structure between local geometries, which greatly limits their cross-domain generalization. Recently, transformer-based models have achieved impressive performance gains in a range of image-based tasks, benefiting from their strong generalization capability and scalability, which stem from capturing long-range correlations across local patches. Inspired by such successes of visual transformers, we propose a novel Relational Priors Distillation (RPD) method to extract relational priors from transformers that are well trained on massive image collections, which can significantly empower cross-domain representations with consistent topological priors of objects. To this end, we establish a parameter-frozen pre-trained transformer module shared between the 2D teacher and 3D student models, complemented by an online knowledge distillation strategy for semantically regularizing the 3D student model. Furthermore, we introduce a novel self-supervised task centered on reconstructing masked point cloud patches using the corresponding masked multi-view image features, thereby enabling the model to incorporate 3D geometric information. Experiments on the PointDA-10 and the Sim-to-Real datasets verify that the proposed method consistently achieves state-of-the-art UDA performance for point cloud classification. The source code of this work is available at https://github.com/zou-longkun/RPD.git.

Index Terms:
unsupervised domain adaptation, point clouds, relational priors, cross-modal, knowledge distillation.

I Introduction

The point cloud is a popular 3D shape representation, with broad applications in robotics, drones, autonomous driving, etc. The semantic pattern of an object point cloud is determined by the topological configuration of its local geometries. Recent advances in point cloud semantic analysis [1, 2, 3, 4, 5, 6, 7] have been largely driven by synthetic point clouds generated from CAD models (such as those in ModelNet [8] and ShapeNet [9]), which typically have noise-free point-based surfaces in local regions and a complete topological structure. Real-world point cloud data generated from RGB-D scans captured by real-time depth sensors (such as ScanNet [10] and ScanObjectNN [11]) typically contain noise and occlusion, and therefore suffer from large shape variations of point sets in local regions and incomplete surfaces from a global perspective. Such geometric variations can cause performance degradation when a network is tested on a domain different from the one it was trained on. More often, labels in the test domain may be unavailable due to high annotation costs, which is the situation we are interested in and can be formulated as the problem of unsupervised domain adaptation (UDA).

Figure 1: Illustration of the proposed relational priors distillation (RPD) framework. We leverage the relational priors of a pretrained 2D Transformer model to boost the 3D Transformer encoder by sharing a parameter-frozen pretrained Transformer module and employing an online knowledge distillation strategy as semantic regularization for the 3D student model. An ensemble of the knowledge from the two modalities can effectively improve the generalization of point cloud representations and close the domain gap.

Unsupervised domain adaptation on point clouds has recently attracted increasing attention [12, 13, 14, 15, 16, 17], starting with the pioneering PointDAN [12]. In general, these point-based UDA methods can be categorized into two groups of algorithms for bridging the domain gap: domain adversarial training based [12] and self-supervised learning based [14, 13, 16, 15]. The former employs domain adversarial training to explicitly enforce indistinguishable features between point clouds from different domains using domain discriminators. Its main ideas are borrowed from image-based UDA [18, 19, 20, 21, 22]; such training can be unstable and risks damaging the intrinsic discriminative structure of the target data in feature space, resulting in suboptimal adaptation.

The latter mechanism achieves implicit domain alignment by incorporating self-supervised regularization pretext tasks aimed at capturing domain-invariant geometric patterns alongside semantic representation learning. The underlying motivation is that well-designed self-supervised tasks shared across domains can facilitate the learning of features with similar properties, which typically have a certain degree of cross-domain invariance. A diverse set of well-designed self-supervised tasks has been proposed, such as rotation angle classification and deformation location [14], deformation reconstruction [13], scaling-up-down prediction and 3D-2D-3D projection reconstruction [16], and global implicit fields learning [15]. PDG [23] utilizes DGCNN [3] or PointNet [1] to encode part-level features, which are used as a dictionary to describe other features from local parts with a linear weighting strategy. However, existing point-based UDA algorithms often prioritize feature alignment while overlooking the topological structure between local geometries, which greatly limits their cross-domain generalization capabilities.

Recently, transformer-based models have demonstrated remarkable success across various image-based tasks, following the “pretrain-and-finetune” paradigm, which can be attributed to their robust generalization capability and scalability, stemming from their ability to capture long-range correlations across local patches. Nonetheless, achieving proficiency in discerning topological relationships among local parts necessitates pre-training on extensive datasets. Mainstream point cloud networks, constrained by limited training data, resort to shallow architectures to avoid over-fitting, but this compromises their scalability and hampers their capacity to learn features that generalize robustly. Consequently, these networks struggle to effectively implement the “pretrain-and-finetune” paradigm and typically require training from scratch. While certain approaches, such as PCT [24] and Point Transformer [25], integrate the typical Transformer architecture into the 3D domain to deepen networks and enhance scalability, their efficacy remains contingent upon access to substantial labeled 3D data. In contrast, acquiring and annotating 2D data is comparatively straightforward, with vast datasets readily available online, numbering in the millions or even billions of samples (e.g., ImageNet [26], COCO [27], CLIP [28]). Leveraging these extensive 2D datasets, 2D transformer-based networks exhibit superior aptitude in capturing topological relationships among local parts. This prompts a pivotal question: Can we harness the abundant relational priors ingrained in pre-trained 2D Transformer-based models to bolster the generalization capabilities of 3D models and mitigate domain shift? An affirmative answer to this question would not only bridge the 2D and 3D modalities but also diminish the heavy reliance on expensive collection and annotation of 3D data for model pre-training.

To harness the rich relational priors ingrained in pre-trained 2D Transformer-based models, we propose a simple yet effective knowledge distillation scheme with the standard teacher-student distillation workflow, whose concept is depicted in Fig. 1. First, the teacher and student models share a standard Transformer module in which the parameters of most block layers are frozen and only the last few block layers are fine-tuned. Moreover, we adopt an online knowledge distillation strategy, alternating between training the teacher and student models throughout the training process. We employ the KL-divergence loss function to align the predicted logits of the teacher and student models, enhancing cross-modal knowledge transfer and serving as semantic regularization for the 3D student model. Additionally, recognizing that sole reliance on 2D knowledge might inadequately capture 3D geometric information, we introduce a self-supervised task of reconstructing masked point clouds from projected multi-view images. In this way, the model’s ability to capture geometric information is enhanced. During inference, we ensemble predictions from both modalities. Our method achieves state-of-the-art performance on two public benchmark datasets (i.e. PointDA-10 [12] and Sim-to-Real [29]), which validates its effectiveness. In summary, our approach bridges the gap between the 2D and 3D domains by leveraging the strength of Transformer-based attention mechanisms, which excel in modeling the relationships between local parts. This not only improves the robustness and generalization of 3D networks but also provides a practical solution to the data scarcity challenge in the 3D domain. Our main contributions in this study are as follows:

  • This paper proposes a novel scheme for unsupervised domain adaptation on object point cloud classification, which bridges the domain gap by distilling relational priors from well-learned 2D transformers into the 3D domain to enhance 3D feature representation.

  • Technically, we propose a simple but effective cross-modal knowledge transfer method, in which a parameter-frozen pretrained transformer module is shared between the 2D teacher and 3D student models and an online knowledge distillation strategy is adopted as semantic regularization for the 3D student model.

  • Meanwhile, we design a novel self-supervision task that reconstructs masked point cloud patches with corresponding masked multi-view image features to enhance the model’s ability to capture geometric information.

  • Experiments on two public UDA benchmarks verify that the proposed method consistently achieves the best performance of UDA for point cloud classification.
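
The logit alignment between teacher and student mentioned above can be illustrated with a minimal sketch in plain Python. The direction of the divergence (teacher as target, student as approximation), the `temperature` parameter, and all function names are assumptions made for illustration only; the exact loss used by RPD may differ.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) between the two predicted class distributions.

    Assumption: the teacher distribution is the target and would be treated
    as a constant when the student is updated.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    eps = 1e-12  # guards against log(0) if a probability underflows
    return sum(pi * (math.log(max(pi, eps)) - math.log(max(qi, eps)))
               for pi, qi in zip(p, q))

teacher = [2.0, 0.5, -1.0]  # hypothetical logits from the 2D teacher
student = [1.0, 1.0, -0.5]  # hypothetical logits from the 3D student
loss = kl_distillation_loss(teacher, student)
```

The loss is zero when both branches predict the same distribution and grows as the student's prediction drifts away from the teacher's, which is what makes it usable as a semantic regularizer.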

II Related Works

Figure 2: Overview of our proposed relational priors distillation framework, which adheres to a standard teacher-student distillation workflow. Both the 2D teacher model and the 3D student model include Patchify, Tokenizer, and several Transformer encoder layers. For the 2D teacher model, we project the point cloud into 10 single-channel depth maps via the Realistic Projection Pipeline introduced by PointCLIP V2 [30], and then “patchify” these depth maps into $10\times 14\times 14$ image patches as input to the 2D Tokenizer (i.e. Conv2D). Tokens from the 2D Tokenizer and a [CLS] token are fed into the Transformer encoder. For the 3D student model, we “patchify” the point cloud into 27 groups via Farthest Point Sampling (FPS) as input to the 3D Tokenizer (i.e. DGCNN [3]). Tokens from the 3D Tokenizer and a [CLS] token are fed into the Transformer encoder. The two modalities are processed independently by a siamese Transformer encoder parametrized by an MAE [31] pre-trained ViT [32]. During training, we randomly mask pairs of point cloud token features and image token features with a large ratio of 0.85. The decoder consists of a sequence of multi-head cross-attention (MCA) and multi-head self-attention (MSA) layers and predicts missing patches in the point cloud from unmasked image token features. PE denotes the position encoding. Gray boxes indicate that parameters are frozen, while blue, green and orange boxes indicate that parameters can be updated. (Best viewed in color).
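
The two sampling steps named in the caption, choosing 27 group centers by Farthest Point Sampling and masking tokens at a ratio of 0.85, can be sketched as follows. This is a dependency-free illustration under stated assumptions: the greedy FPS variant, the fixed seeds, the random test cloud, and the helper names are hypothetical, and the grouping of neighbouring points around each center (as well as the paired masking of image tokens) is omitted.

```python
import random

def farthest_point_sampling(points, num_centers, seed=0):
    """Greedy FPS: return indices of `num_centers` mutually distant points."""
    rng = random.Random(seed)
    n = len(points)
    chosen = [rng.randrange(n)]
    # Squared distance from every point to its nearest chosen center so far.
    min_dist = [float("inf")] * n
    while len(chosen) < num_centers:
        last = points[chosen[-1]]
        for i, p in enumerate(points):
            d = sum((a - b) ** 2 for a, b in zip(p, last))
            if d < min_dist[i]:
                min_dist[i] = d
        # The next center is the point farthest from all chosen centers.
        chosen.append(max(range(n), key=min_dist.__getitem__))
    return chosen

def random_mask(num_tokens, mask_ratio, seed=0):
    """Return a list of booleans where True marks a masked token."""
    rng = random.Random(seed)
    num_masked = int(round(num_tokens * mask_ratio))
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]

rng = random.Random(42)
cloud = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1))
         for _ in range(1024)]  # hypothetical input cloud
centers = farthest_point_sampling(cloud, num_centers=27)
mask = random_mask(num_tokens=27, mask_ratio=0.85)
```

With 27 point tokens and a ratio of 0.85, this rounding rule masks 23 tokens and leaves 4 visible, which gives a sense of how little of the point cloud the decoder sees directly.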

II-A Deep Networks for Point Clouds

In recent years, deep neural network architectures for point clouds have been extensively studied. Existing methods can be roughly divided into three major categories: view-based [33, 34, 35, 36, 37, 38, 39], voxel-based [40, 41, 42], and point-based [3, 6, 1, 2] point cloud processing methods.

View-based methods project the point cloud into images of multiple views and process them with various variants of 2D CNNs. The pioneering work MVCNN [34] consumes the multi-view images rendered from multiple virtual camera poses and obtains global shape features through cross-view max-pooling. GVCNN [35] proposes a three-level hierarchical correlation modeling framework, which adaptively groups multi-view feature embeddings into separate clusters. RotationNet [43] treats viewpoint indices as learnable latent variables and jointly estimates object poses and semantic categories. MVTN [37] introduces differentiable rendering techniques to implement adaptive regression of optimal camera poses in an end-to-end trainable manner. SimpleView [38] simply projects raw points onto image planes and sets their pixel values according to the vertical distance. MvNet [39] proposes a multi-view vision-prompt to bridge the gap between 3D data and 2D pretrained models. Although view-based methods have shown dominant performance in various shape recognition tasks [2], [25], [26], acquiring views requires costly shape rendering and inevitably loses the internal geometric structure and spatial information.

Voxel-based methods require first preprocessing a given point cloud into voxels. Then, a voxel-based convolutional neural network is applied to extract features. Such methods can easily overcome point cloud density variations but are hampered by training costs that grow exponentially with voxel resolution. Typical works include VoxelNet [42] and Minkowski Engine [40]. These methods designed octree-based convolution and sparse convolution to extract local representations of point clouds, effectively reducing the consumption of GPU memory and computing costs.

Point-based methods, which directly take point clouds as input and process them in an unstructured format, have attracted increasing attention due to the absence of information loss and high training efficiency. PointNet [1] is a pioneering work, which proposes to model the permutation invariance of points by max-pooling point-wise features. PointNet++ [2] improves PointNet by further gathering local features in a hierarchical way. DGCNN [3] considers a point cloud as a graph and dynamically updates the graph to aggregate features. Recently, Transformer [44] based methods have been proposed as a new paradigm for processing point clouds [24, 45, 25, 46].

In this work, we combine point-based methods and view-based methods to achieve cross-modal information fusion.

II-B Unsupervised Domain Adaptation

Unsupervised domain adaptation (UDA) has been extensively explored on images [19, 21, 20, 47, 48, 49, 50, 51, 22]; it aims to mitigate the domain gap between a source domain containing labeled data and a target domain containing unlabeled data. These methods can generally be grouped into three categories. 1) Adversarial training [19, 21, 20, 52, 53, 54], playing minimax games at the domain level between a discriminator and a generator. 2) Style transfer [50], wherein the translation from the source domain to the target domain is directly learned using Generative Adversarial Networks [55]. 3) Self-training with pseudo-labels [47, 49, 48, 56, 57], where partial supervision is provided to learn the distributions of the target domain. Despite the extensive research on UDA for 2D images, UDA for 3D point clouds is still in its nascent stages, with some methods borrowed from image-based UDA. For instance, PointDAN [12] is a pioneering work addressing UDA in point cloud classification by explicitly aligning local and global features across domains through domain adversarial training. ALSDA [58] introduces an automated loss function search method to address the issues of domain discriminator degeneration and cross-domain semantic mismatches in adversarial domain adaptation. GAST [14] employs a self-training method equipped with self-paced learning [59] for point cloud UDA. GLRV [16] proposes a reliable voting-based method for pseudo-label generation, while SD [60] employs Graph Neural Networks (GNNs) [55] to refine pseudo-labels online during self-training. Chen et al. [61] propose quasi-balanced self-training, dynamically adjusting the threshold to balance the proportion of pseudo-label samples for each category, thereby improving the quality of pseudo-labels.
In addition to the mainstream methods of UDA for 2D images, recent works on UDA for point clouds primarily focus on designing suitable self-supervised pretext tasks to facilitate the learning of domain-invariant features. For example, GAST [14] proposes rotation classification and distortion localization as self-supervised tasks to align features at both local and global levels. DefRec [13] introduces deformation-reconstruction, and Learnable-Defrec [62] extends it into a learnable deformation task to further enhance performance. RS [63] shuffles and restores the input point cloud to improve discrimination. GLRV [16] proposes two self-supervised auxiliary tasks: scaling-up-down prediction and 3D-2D-3D projection reconstruction, along with a reliable pseudo-label voting strategy to further enhance domain adaptation. GAI [15] employs a self-supervised task of learning geometry-aware global implicit representations for domain adaptation on point clouds. In contrast to the above single-modal self-supervised methods, we propose a cross-modal self-supervised task that uses 2D images to reconstruct 3D point clouds, thereby empowering the network with the ability to extract 3D geometric information from 2D images.

II-C 2D-to-3D Knowledge Transferring

The concept of model compression was originally introduced by Bucila et al. [64], with the aim of transferring knowledge from a large model to a smaller one without significant performance degradation. Hinton et al. [65] systematically summarized existing knowledge distillation techniques, showcasing the effectiveness of the student-teacher strategy and response-based knowledge distillation. Recently, the transfer of 2D knowledge to 3D using view-based methods has garnered considerable attention among researchers. For instance, PointCLIP [66] directly utilized the pretrained CLIP [28] model for zero-shot point cloud classification via image projection. The subsequent version, PointCLIP V2 [30], refined the projection strategy, resulting in a significant performance boost. ULIP [67, 68] employs large multimodal models to generate detailed language descriptions of 3D objects, addressing limitations in existing 3D object datasets regarding the quality and scalability of language descriptions. PointCMD [69] explores the transfer of cross-modal knowledge from multi-view 2D visual modeling to 3D geometric modeling to facilitate the understanding of the shape of the 3D point cloud. PointVST [70] introduces a self-supervised task that utilizes projected multi-view 2D images as self-supervised signals, enhancing the representation capabilities of point-based networks. I2P-MAE [71] proposes a pre-training framework that leverages 2D pre-trained models to guide the learning of 3D representations. More advanced methods exploit point-pixel correspondences [72, 73, 74, 75] between point clouds and multi-view projected images. Image2Point [76] presents a kernel inflation technique that expands kernels of a 2D CNN into 3D kernels and applies them to voxel-based point cloud understanding. There is a growing interest in utilizing pre-trained Transformers for point cloud processing. 
PCExpert [77] and EPCL [78] directly train high-quality point cloud models using pre-trained Transformer models as encoders. Although the Transformer pre-trained on large-scale 2D image data possesses powerful semantic representation capabilities, it lacks the ability to capture 3D information. Therefore, in this work, we follow the approach of PCExpert and EPCL, maintaining a Transformer pretrained on ImageNet [26] as an encoder for 3D point clouds, while also designing a self-supervised training task to reconstruct masked 3D point clouds using masked 2D images.

III Proposed Methods

This section introduces the overall working mechanism and specific technical implementations of the proposed RPD. We first introduce and formulate the unsupervised domain adaptation problem on point clouds in Sec. III-A. Then we present general formulations of deep image encoders and deep point encoders in Sec. III-B and Sec. III-C, respectively, based on which we construct a unified online cross-modal knowledge distillation workflow in Sec. III-D. Furthermore, we introduce a novel self-supervised task to reconstruct masked point cloud patches with masked multi-view images in Sec. III-E. After that, the self-training strategy is described in detail in Sec. III-F. Finally, we summarize the overall loss function and training strategy in Sec. III-G.

III-A Problem Definition

Given a source domain $\mathcal{S}=\{\mathcal{P}_{i}^{s},\mathcal{I}_{i}^{s},y_{i}^{s}\}_{i=1}^{n_{s}}$ with $n_{s}$ labeled synthetic samples and a target domain $\mathcal{T}=\{\mathcal{P}_{i}^{t},\mathcal{I}_{i}^{t}\}_{i=1}^{n_{t}}$ with $n_{t}$ unlabeled real samples, a semantic label space $\mathcal{Y}$ is shared between $\mathcal{S}$ and $\mathcal{T}$ (i.e. $\mathcal{Y}^{s}=\mathcal{Y}^{t}$), where $\mathcal{P}\in\mathbb{R}^{N\times 3}$ represents a point cloud consisting of $N$ three-dimensional spatial coordinate points $(x,y,z)$, $\mathcal{I}\in\mathbb{R}^{V\times W\times H}$ represents $V$ views of 2D point-based projected images with a resolution of $W\times H$, and the superscripts $s$ and $t$ denote the source and target domains, respectively. Let the input space be $\mathcal{X}=\{\mathcal{P},\mathcal{I}\}$. Our goal is to learn a domain-adapted mapping function $\Phi:\mathcal{X}\rightarrow\mathcal{Y}$ that can correctly classify target samples, given access to the labeled source domain and the unlabeled target domain. The mapping function $\Phi=\Phi^{\mathcal{P}}\oplus\Phi^{\mathcal{I}}$ can be formulated as a cascade of a feature encoder $\Phi_{\text{fea}}:\mathcal{X}\rightarrow\mathbb{R}^{d}$ for any input $\{\mathcal{P},\mathcal{I}\}$ and a classifier $\Phi_{\text{cls}}:\mathbb{R}^{d}\rightarrow[0,1]^{c}$, typically using fully-connected layers, as follows:

$$
\begin{aligned}
\Phi^{\mathcal{P}}(\mathcal{P}) &= \Phi_{\text{cls}}^{\mathcal{P}}(\bm{z}^{\mathcal{P}})\circ\Phi_{\text{fea}}^{\mathcal{P}}(\mathcal{P}),\\
\Phi^{\mathcal{I}}(\mathcal{I}) &= \Phi_{\text{cls}}^{\mathcal{I}}(\bm{z}^{\mathcal{I}})\circ\Phi_{\text{fea}}^{\mathcal{I}}(\mathcal{I}),\\
{\rm logit} &= \Phi^{\mathcal{P}}(\mathcal{P})\oplus\Phi^{\mathcal{I}}(\mathcal{I}),
\end{aligned}
\tag{1}
$$

where $\oplus$ denotes cross-modal ensemble, $d$ denotes the dimension of the feature representation output $\bm{z}\in\mathbb{R}^{d}$ of $\Phi_{\text{fea}}(\cdot)$, $c$ denotes the number of shared classes and the superscripts $\mathcal{P}$ and $\mathcal{I}$ denote the point and image modalities, respectively.
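
The cross-modal ensemble in the last line of Eq. (1) is left abstract in the text. The sketch below assumes one simple instantiation, a weighted mean of the per-class scores from the two branches; the averaging rule, the `point_weight` parameter, and the example scores are illustrative assumptions and are not taken from the paper.

```python
def ensemble_predict(point_scores, image_scores, point_weight=0.5):
    """Fuse per-class scores of the point and image branches by a weighted mean.

    Returns the fused scores and the index of the predicted class.
    """
    if len(point_scores) != len(image_scores):
        raise ValueError("both branches must score the same set of classes")
    fused = [point_weight * p + (1.0 - point_weight) * q
             for p, q in zip(point_scores, image_scores)]
    label = max(range(len(fused)), key=fused.__getitem__)
    return fused, label

point_scores = [0.2, 0.5, 0.3]  # hypothetical output of the point branch
image_scores = [0.7, 0.2, 0.1]  # hypothetical output of the image branch
fused, label = ensemble_predict(point_scores, image_scores)
```

In this toy case the point branch alone would pick class 1, while the fused scores favour class 0, showing how a confident prediction from one modality can override a weaker one from the other.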

III-B Teacher Network for Image Modeling

Owing to the maturity of deep convolutional architectures, we can directly resort to powerful 2D models of different architectures (ResNet [79], ViT [32], CLIP [28], [80]) for image feature fusion and extraction. Benefiting from the common practice of large-scale pretraining (e.g., on ImageNet [26] and Conceptual Captions [28]), the resulting 2D deep feature encoder demonstrates strong generalization ability when fine-tuned on downstream visual recognition tasks. This excellent property makes the pre-trained 2D model suitable as a teacher model for image feature extraction. To align the input modality for 2D models, we project the input point cloud onto multiple image planes, and then encode them into multi-view 2D representations. Specifically, given a point cloud $\mathcal{P}$, we first project it into multiple single-channel depth maps $\{\mathcal{I}_{v}\}_{v=1}^{V}\in\mathbb{R}^{V\times H\times W}$ via the Realistic Projection Pipeline introduced by PointCLIP V2 [30], where $V$ and $(H,W)$ denote the number of view images and the image size, respectively. Then the teacher image encoder takes the multi-view images $\{\mathcal{I}_{v}\}_{v=1}^{V}$ in parallel as input to extract image features.
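
To make the projection step concrete, the sketch below renders a point cloud into $V$ single-channel depth maps with a plain orthographic projection. It is not the Realistic Projection Pipeline of PointCLIP V2; the rotation about the y-axis, the assumption that coordinates lie roughly in $[-1,1]^3$, the depth encoding, and all function names are simplifications chosen for illustration.

```python
import math

def rotate_y(point, angle):
    """Rotate a 3D point about the y-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)

def project_depth_map(points, height, width, angle=0.0):
    """Orthographic projection of points (roughly in [-1, 1]^3) to an H x W map.

    The camera looks along +z. Empty pixels hold 0.0; occupied pixels hold a
    value in [0.1, 1.0] that is larger for points nearer to the camera, and
    the nearest point wins when several points fall into one pixel.
    """
    depth = [[0.0] * width for _ in range(height)]
    for point in points:
        x, y, z = rotate_y(point, angle)
        col = min(width - 1, max(0, int((x + 1.0) / 2.0 * width)))
        row = min(height - 1, max(0, int((1.0 - y) / 2.0 * height)))
        nearness = min(1.0, max(0.0, (1.0 - z) / 2.0))
        value = 0.1 + 0.9 * nearness
        if value > depth[row][col]:
            depth[row][col] = value
    return depth

def multi_view_depth_maps(points, num_views, height, width):
    """Render `num_views` depth maps from viewpoints evenly spaced around y."""
    return [project_depth_map(points, height, width, 2.0 * math.pi * v / num_views)
            for v in range(num_views)]

cloud = [(0.0, 0.0, -1.0), (0.0, 0.0, 1.0), (0.5, -0.5, 0.0)]  # toy cloud
views = multi_view_depth_maps(cloud, num_views=10, height=8, width=8)
```

The result is a nested list of shape $V\times H\times W$, matching the layout of $\{\mathcal{I}_{v}\}_{v=1}^{V}$ in the text. The tiny $8\times 8$ resolution is only for readability.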

In this paper, we employ an MAE [31] pre-trained ViT [32] to encode image features. Formally, given a single-channel depth image $\mathcal{I}_{v}\in\mathbb{R}^{H\times W}$, the ViT divides the image into a sequence of flattened local image patches $\{\bm{x}^{\mathcal{I}}_{v,i}\}_{i=1}^{N_{\mathcal{I}}}\in\mathbb{R}^{N_{\mathcal{I}}\times P^{2}}$ and uses a tokenizer $\Phi_{\text{emb}}^{\mathcal{I}}$ (i.e. Conv2D) to convert these patches into a sequence of 1-D visual token embeddings:

$$
\{\bm{z}_{v,i}^{\mathcal{I}}\}_{i=1}^{N_{\mathcal{I}}}=\Phi_{\text{emb}}^{\mathcal{I}}\big(\{\bm{x}^{\mathcal{I}}_{v,i}\}_{i=1}^{N_{\mathcal{I}}}\big),
\tag{2}
$$

where $\{\bm{z}_{v,i}^{\mathcal{I}}\}_{i=1}^{N_{\mathcal{I}}}\in\mathbb{R}^{N_{\mathcal{I}}\times D_{1}}$, $N_{\mathcal{I}}=HW/P^{2}$ denotes the number of tokens, $(P,P)$ denotes the resolution of each image patch, and $D_{1}$ is the dimension of each image token embedding. A learnable class token embedding $\bm{z}_{\text{cls}}^{\mathcal{I}}$ is prepended to the sequence of patch embeddings. Then, the final image input representation $\mathcal{H}_{v}^{\mathcal{I}}\in\mathbb{R}^{(N_{\mathcal{I}}+1)\times D_{1}}$ is calculated by summing the token embeddings with the image position embeddings $\mathcal{Z}_{\text{pos},v}^{\mathcal{I}}\in\mathbb{R}^{(N_{\mathcal{I}}+1)\times D_{1}}$:

$$\mathcal{H}_{v}^{\mathcal{I}}=[\bm{z}_{\text{cls}}^{\mathcal{I}},\bm{z}_{v,1}^{\mathcal{I}},\ldots,\bm{z}_{v,N_{\mathcal{I}}}^{\mathcal{I}}]+\mathcal{Z}_{\text{pos},v}^{\mathcal{I}} \tag{3}$$

Formally, the behaviour of the 2D teacher transformer module $\mathcal{M}_{t}$ can be formulated as follows:

$$\begin{aligned}
\{\widehat{\mathcal{Z}}_{v}^{\mathcal{I}}\}_{v=1}^{V}&=\mathcal{M}_{t}\big(\{\mathcal{H}_{v}^{\mathcal{I}}\}_{v=1}^{V}\big),\\
\bm{\hat{z}}^{\mathcal{I}}&=Concat(\{\bm{\hat{z}}_{v,0}^{\mathcal{I}}\}_{v=1}^{V}),\\
\bm{z}^{\mathcal{I}}&=Proj(\bm{\hat{z}}^{\mathcal{I}}),
\end{aligned} \tag{4}$$

where $\widehat{\mathcal{Z}}_{v}^{\mathcal{I}}=\{\bm{\hat{z}}_{v,i}^{\mathcal{I}}\}_{i=0}^{N_{\mathcal{I}}}\in\mathbb{R}^{(N_{\mathcal{I}}+1)\times D_{2}}$ represents the set of view-specific image token features extracted from image $\mathcal{I}_{v}$, $\bm{\hat{z}}_{v,0}^{\mathcal{I}}$ denotes the view-specific class token feature, $\bm{\hat{z}}^{\mathcal{I}}\in\mathbb{R}^{VD_{2}}$ denotes the concatenation of all view-specific class token features, $Proj$ denotes a projector based on a multi-layer perceptron (MLP) with three fully connected layers, and $\bm{z}^{\mathcal{I}}\in\mathbb{R}^{d}$ denotes the final feature representation of the image modality input. By default, $V=10$, $P=16$, $H=W=224$, $N_{\mathcal{I}}=196$, $D_{1}=768$, and $D_{2}=512$.
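To make the shape bookkeeping of Eqs. (2)-(4) concrete, the following minimal Python sketch splits a toy image into flattened patches and tabulates the tensor shapes implied by the default settings. It is an illustration only: the helper names `patchify` and `teacher_shapes` are hypothetical and are not taken from the released code, and the Conv2D tokenizer and transformer are represented by their output shapes rather than implemented.

```python
def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch x patch blocks."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "P must divide H and W"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Row-major flattening of one P x P block into a vector of length P^2.
            block = [image[top + dy][left + dx]
                     for dy in range(patch) for dx in range(patch)]
            patches.append(block)
    return patches


def teacher_shapes(h=224, w=224, p=16, v=10, d1=768, d2=512):
    """Shape bookkeeping for Eqs. (2)-(4) under the stated default settings."""
    n_tokens = (h * w) // (p * p)          # N_I = HW / P^2
    return {
        "patches": (n_tokens, p * p),      # flattened patches of one view
        "tokens": (n_tokens, d1),          # after the tokenizer, Eq. (2)
        "input": (n_tokens + 1, d1),       # class token prepended, Eq. (3)
        "output": (n_tokens + 1, d2),      # per-view transformer output, Eq. (4)
        "concat_cls": (v * d2,),           # concatenated view-specific class tokens
    }


# Toy 4 x 4 "depth image" with P = 2: four patches, each of length 4.
toy = [[r * 4 + c for c in range(4)] for r in range(4)]
toy_patches = patchify(toy, 2)
shapes = teacher_shapes()
```

With the defaults, `shapes["tokens"]` is `(196, 768)` and the concatenated class-token vector fed to $Proj$ has length $VD_{2}=5120$.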

III-C Student Network for 3D Point Cloud Modeling

Collecting and labeling 3D shape models is costly and time-consuming, so the 3D community still lacks large-scale, richly annotated datasets comparable to those in the 2D field (e.g., [26, 27]). Limited by insufficient training data, mainstream point cloud networks (e.g., [4, 1, 2, 3]) are kept small in parameter count to alleviate overfitting. This makes these networks poorly scalable and unsuitable for the “pretrain-and-finetune” paradigm. We believe that Transformer-based models are inherently well-suited for learning robust semantic patterns in point clouds due to their ability to capture the topological configurations of local geometries. Before the standard transformer was applied to the point cloud field, several transformer layers ([24, 25]) were designed specifically for point cloud processing. Pioneered by PointBERT [45], the standard transformer has since been applied to point cloud tasks.

Following [45], we sample $N_{\mathcal{P}}$ centroids using Furthest Point Sampling (FPS). To each of these centroids, we assign $k$ neighbouring points by conducting a $k$-Nearest Neighbour (KNN) search. We thereby obtain $N_{\mathcal{P}}$ local geometric patches $\{\bm{x}^{\mathcal{P}}_{i}\}_{i=1}^{N_{\mathcal{P}}}\in\mathbb{R}^{N_{\mathcal{P}}\times(k+1)\times 3}$, where each geometric patch $\bm{x}^{\mathcal{P}}_{i}$ consists of a centroid $\bm{x}^{\mathcal{P}}_{i,0}$ and its $k$ neighbouring points $\{\bm{x}^{\mathcal{P}}_{i,j}\}_{j=1}^{k}$, i.e., $\bm{x}^{\mathcal{P}}_{i}=\{\bm{x}^{\mathcal{P}}_{i,j}\}_{j=0}^{k}$. These patches are subsequently fed into the tokenizer $\Phi_{\text{emb}}^{\mathcal{P}}$ (a mini-DGCNN [3]) to obtain patch token embeddings:

$$\{\bm{z}_{i}^{\mathcal{P}}\}_{i=1}^{N_{\mathcal{P}}}=\Phi_{\text{emb}}^{\mathcal{P}}\big(\{\bm{x}^{\mathcal{P}}_{i}\}_{i=1}^{N_{\mathcal{P}}}\big), \tag{5}$$

where $\{\bm{z}_{i}^{\mathcal{P}}\}_{i=1}^{N_{\mathcal{P}}}\in\mathbb{R}^{N_{\mathcal{P}}\times D_{1}}$, $N_{\mathcal{P}}$ denotes the number of geometric tokens, and $D_{1}$ denotes the feature dimension. Similarly, a learnable class token embedding $\bm{z}_{\text{cls}}^{\mathcal{P}}$ is prepended to the sequence of patch embeddings. Then, the final point cloud input representation $\mathcal{H}^{\mathcal{P}}\in\mathbb{R}^{(N_{\mathcal{P}}+1)\times D_{1}}$ is calculated by summing the token embeddings with the position embeddings $\mathcal{Z}_{\text{pos}}^{\mathcal{P}}\in\mathbb{R}^{(N_{\mathcal{P}}+1)\times D_{1}}$:

$$\mathcal{H}^{\mathcal{P}}=[\bm{z}_{\text{cls}}^{\mathcal{P}},\bm{z}_{1}^{\mathcal{P}},\ldots,\bm{z}_{N_{\mathcal{P}}}^{\mathcal{P}}]+\mathcal{Z}_{\text{pos}}^{\mathcal{P}} \tag{6}$$

Formally, the 3D student transformer module $\mathcal{M}_{s}$ consumes $\mathcal{H}^{\mathcal{P}}$ and outputs the high-dimensional feature representation $\widehat{\mathcal{Z}}^{\mathcal{P}}$, which can be described as:

$$\begin{aligned}
\widehat{\mathcal{Z}}^{\mathcal{P}}&=\mathcal{M}_{s}\big(\mathcal{H}^{\mathcal{P}}\big),\\
\bm{z}^{\mathcal{P}}&=Proj(\bm{\hat{z}}_{0}^{\mathcal{P}}),
\end{aligned} \tag{7}$$

where $\widehat{\mathcal{Z}}^{\mathcal{P}}=\{\bm{\hat{z}}_{i}^{\mathcal{P}}\}_{i=0}^{N_{\mathcal{P}}}\in\mathbb{R}^{(N_{\mathcal{P}}+1)\times D_{2}}$ denotes the embedded point cloud token features, $\bm{\hat{z}}_{0}^{\mathcal{P}}$ denotes the embedded class token feature, $Proj$ is a three-layer MLP, and $\bm{z}^{\mathcal{P}}\in\mathbb{R}^{d}$ denotes the final feature representation of the point cloud modality input. By default, $N_{\mathcal{P}}=27$, $k=128$, $D_{1}=768$, and $D_{2}=512$.
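The patch grouping step that precedes Eq. (5) can be sketched as follows. This is a brute-force, pure-Python illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, FPS is started from index 0 rather than a random point, and the centroid is excluded from its own neighbour list so that each patch holds exactly $k+1$ points as in the text.

```python
import math


def farthest_point_sampling(points, num_centroids, start=0):
    """Greedy FPS: repeatedly pick the point farthest from the chosen set."""
    chosen = [start]
    # Distance from every point to its nearest already-chosen centroid.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(chosen) < num_centroids:
        idx = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(idx)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[idx]))
    return chosen


def group_patches(points, num_centroids, k):
    """Return patches of k + 1 points: the centroid first, then its k neighbours."""
    patches = []
    for c in farthest_point_sampling(points, num_centroids):
        others = [i for i in range(len(points)) if i != c]
        others.sort(key=lambda i: math.dist(points[i], points[c]))
        patches.append([points[c]] + [points[i] for i in others[:k]])
    return patches


# Toy cloud: two well-separated clusters along the x-axis.
cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0),
         (5.0, 0.0, 0.0), (5.1, 0.0, 0.0), (5.2, 0.0, 0.0)]
patches = group_patches(cloud, num_centroids=2, k=2)
```

On the toy cloud, FPS selects one centroid per cluster, so each resulting patch stays within a single cluster. With the default settings the same procedure would yield $N_{\mathcal{P}}=27$ patches of $k+1=129$ points each.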

III-D Online Cross-Modal Knowledge Distillation

Here, we explore how the knowledge from pre-trained 2D Transformer models can be utilized for 3D feature representation learning. On the one hand, a 2D teacher model pre-trained on large-scale datasets (e.g., ImageNet [26]) has a strong capability to learn high-quality representations, i.e., robust and generalizable features, stemming from its ability to capture long-range correlations across local patches. This prior knowledge of modeling the relationships between local parts is ideal for guiding 3D models to capture the topology of local geometries, eliminating the need for pre-training on large 3D geometry datasets. On the other hand, the transformer modules of the teacher model ($\mathcal{M}_{t}$) and the student model ($\mathcal{M}_{s}$) are structurally identical, each consisting of a series of layer normalization (LN), multi-head self-attention (MSA) and multi-layer perceptron (MLP) layers. The only difference lies in the tokenizer used during feature extraction. Therefore, distilling relational priors from a 2D pre-trained model to a 3D model is highly feasible without requiring additional complex designs.

To harness the relational priors ingrained in the pre-trained 2D teacher model for 3D representation learning, we propose a strategy of parameter sharing and online knowledge distillation for 2D-to-3D knowledge transfer. First, we share a parameter-frozen pre-trained transformer module between the 2D teacher model ($\mathcal{M}_{t}$) and the 3D student model ($\mathcal{M}_{s}$), while keeping the image tokenizer parameters ($\Phi_{\text{emb}}^{\mathcal{I}}$) in the 2D teacher model frozen during training. Second, we distill the teacher model’s semantic knowledge into the student model by imposing the following cross-modal alignment constraint:

$$\mathcal{L}_{\text{kd}}=D_{\text{KL}}\big(\Phi_{\text{cls}}^{\mathcal{P}}(\bm{z}^{\mathcal{P}})\,\|\,\Phi_{\text{cls}}^{\mathcal{I}}(\bm{z}^{\mathcal{I}})\big), \tag{8}$$

where $D_{\text{KL}}$ denotes the KL-divergence loss function, and $\Phi_{\text{cls}}^{\mathcal{P}}$ and $\Phi_{\text{cls}}^{\mathcal{I}}$ represent the classifiers of the 3D student model and the 2D teacher model, respectively. More details about the online distillation process are given in Algorithm 1.
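As a worked illustration of Eq. (8), the sketch below evaluates the KL divergence between two class distributions in plain Python. It assumes, which the text does not state explicitly, that both classifiers emit logits that are turned into probabilities by a softmax; the function names are hypothetical. The argument order follows Eq. (8) as written, with the student prediction first and the teacher prediction second.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) = sum_c p_c * log(p_c / q_c) for discrete distributions."""
    return sum(pc * math.log((pc + eps) / (qc + eps))
               for pc, qc in zip(p, q) if pc > 0)


def distillation_loss(student_logits, teacher_logits):
    """Eq. (8) argument order: student prediction first, teacher second."""
    return kl_divergence(softmax(student_logits), softmax(teacher_logits))


# Identical predictions give zero loss; disagreeing predictions give a positive loss.
same = distillation_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
different = distillation_loss([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0])
```

Because the KL divergence is asymmetric, swapping the two arguments generally changes the loss value, so the order should be matched to the equation when reimplementing it.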

Input:
    labeled source data $\mathcal{S}=\{\mathcal{P}_{i}^{s},\mathcal{I}_{i}^{s},y_{i}^{s}\}_{i=1}^{n_{s}}$;
    unlabeled target data $\mathcal{T}=\{\mathcal{P}_{i}^{t},\mathcal{I}_{i}^{t}\}_{i=1}^{n_{t}}$;
    student network $\Phi^{\mathcal{P}}(\mathcal{P})=\Phi_{\text{cls}}^{\mathcal{P}}(\bm{z}^{\mathcal{P}})\circ\Phi_{\text{fea}}^{\mathcal{P}}(\mathcal{P})$;
    teacher network $\Phi^{\mathcal{I}}(\mathcal{I})=\Phi_{\text{cls}}^{\mathcal{I}}(\bm{z}^{\mathcal{I}})\circ\Phi_{\text{fea}}^{\mathcal{I}}(\mathcal{I})$;
    decoder $\Phi_{\text{dec}}^{\mathcal{P}}$;
    number of epochs $E$;
Output:
    $\Phi^{\mathcal{P}}$ and $\Phi^{\mathcal{I}}$
Initialization:
    initialize $\mathcal{M}_{t}$ and $\mathcal{M}_{s}$ with the pre-trained ViT and fix the parameters of the first nine blocks;

for $e\leftarrow 1$ to $E$ do
    for $(\mathcal{P}_{i}^{s},\mathcal{I}_{i}^{s},y_{i}^{s}),(\mathcal{P}_{i}^{t},\mathcal{I}_{i}^{t})$ in $(\mathcal{S},\mathcal{T})$ do
        if $e\ \%\ 10<5$ then
            $\min_{\Phi^{\mathcal{P}},\Phi^{\mathcal{I}}}\mathcal{L}_{\text{cls}}^{s}$ with $(\mathcal{P}_{i}^{s},y_{i}^{s})$;
            $\min_{\Phi^{\mathcal{P}},\Phi^{\mathcal{I}}}\mathcal{L}_{\text{kd}}$ with $\mathcal{P}_{i}^{s}$ and $\mathcal{P}_{i}^{t}$;
            $\min_{\Phi_{\text{dec}}^{\mathcal{P}}}\mathcal{L}_{\text{emd}}$ with $\mathcal{P}_{i}^{s}$ and $\mathcal{P}_{i}^{t}$;
        else
            $\min_{\Phi^{\mathcal{P}}}\mathcal{L}_{\text{cls}}^{s}$ with $(\mathcal{P}_{i}^{s},y_{i}^{s})$;
            $\min_{\Phi^{\mathcal{P}}}\mathcal{L}_{\text{kd}}$ with $\mathcal{P}_{i}^{s}$ and $\mathcal{P}_{i}^{t}$;
            $\min_{\Phi_{\text{dec}}^{\mathcal{P}}}\mathcal{L}_{\text{emd}}$ with $\mathcal{P}_{i}^{s}$ and $\mathcal{P}_{i}^{t}$;
        end if
    end for
end for
Algorithm 1 Online Distillation Process
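The alternating schedule in Algorithm 1 is governed solely by the test $e\ \%\ 10<5$. The short Python sketch below tabulates which modules each loss updates in a given epoch; the string labels and the function name `updated_modules` are illustrative placeholders, and no actual optimisation is performed.

```python
def updated_modules(epoch):
    """Modules optimised by each loss in a given epoch, per the branch in Algorithm 1."""
    if epoch % 10 < 5:
        # Joint phase: classification and distillation losses update both networks.
        return {"cls": ("student", "teacher"),
                "kd": ("student", "teacher"),
                "emd": ("decoder",)}
    # Student-only phase: cls and kd leave the teacher unchanged.
    return {"cls": ("student",), "kd": ("student",), "emd": ("decoder",)}


# Epochs (counted from 1, as in the algorithm) in which the teacher is also updated.
joint_epochs = [e for e in range(1, 21) if "teacher" in updated_modules(e)["kd"]]
```

Counting epochs from 1, the first joint phase covers only epochs 1 to 4, after which student-only and joint phases alternate in blocks of five epochs (5 to 9, 10 to 14, and so on).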

III-E Masked Point Cloud Reconstruction

Transferring the knowledge of 2D pre-trained models to 3D feature representation learning lacks awareness of 3D geometric information. Motivated by SiamMAE [81], we design a self-supervised task that reconstructs masked point cloud patches from the corresponding masked multi-view image features to capture the 3D geometric information of point clouds. Specifically, given a sequence of $N_{\mathcal{P}}$ token embeddings of point cloud local patches $\{\bm{\hat{z}}_i^{\mathcal{P}}\}_{i=1}^{N_{\mathcal{P}}}$, we randomly mask these token embeddings with a high mask ratio ($85\%$). A set of learnable mask embeddings $\{\bm{m}_i^{\mathcal{P}}\}_{i=1}^{M_{\mathcal{P}}}$, where $M_{\mathcal{P}}=\lfloor 0.85\times N_{\mathcal{P}}\rfloor$, initialized from the Gaussian distribution $N(0,0.02)$, replaces the masked positions and is set as the query input of the joint decoder $\Phi_{\text{dec}}$. The unmasked token embeddings of point cloud patches are denoted as $\{\bm{r}_i^{\mathcal{P}}\}_{i=1}^{R_{\mathcal{P}}}$, where $R_{\mathcal{P}}=N_{\mathcal{P}}-M_{\mathcal{P}}$. Then, the corresponding $N_{\mathcal{I}}\times V$ image token embeddings $\{\widehat{\mathcal{Z}}_v^{\mathcal{I}}-\widehat{\mathcal{Z}}_{v,0}^{\mathcal{I}}\}_{v=1}^{V}$ are set as the key and value inputs of the joint decoder to reconstruct the masked point cloud patches, where $\widehat{\mathcal{Z}}_{v,0}^{\mathcal{I}}=\{\bm{\hat{z}}_{v,0}^{\mathcal{I}}\}$ denotes the view-specific class token feature. Considering redundant information and computational efficiency, we randomly drop the image token embeddings with a high drop ratio ($85\%$); the remaining image token embeddings are represented as $\{\bm{r}_i^{\mathcal{I}}\}_{i=1}^{R_{\mathcal{I}}}$, where $R_{\mathcal{I}}=\lfloor 0.15\times N_{\mathcal{I}}\times V\rfloor$. We believe that asymmetric masking/dropping creates a challenging self-supervised learning task while encouraging the network to learn 3D geometric information.
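The token bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under assumed sizes: the helper `random_partition` and the values chosen for the number of point tokens, image tokens per view, and views are hypothetical and are not taken from the paper.

```python
import math
import random


def random_partition(num_tokens, num_first, rng):
    """Randomly split range(num_tokens) into two sorted index lists.

    The first list holds `num_first` indices, the second holds the rest.
    """
    order = list(range(num_tokens))
    rng.shuffle(order)
    return sorted(order[:num_first]), sorted(order[num_first:])


rng = random.Random(0)

# Hypothetical sizes, chosen only for illustration.
n_point_tokens = 64   # N_P
n_image_tokens = 196  # N_I (per view)
n_views = 4           # V

# Point branch: M_P = floor(0.85 * N_P) tokens are masked, R_P = N_P - M_P stay visible.
num_masked = math.floor(0.85 * n_point_tokens)
masked_point_ids, visible_point_ids = random_partition(n_point_tokens, num_masked, rng)

# Image branch: R_I = floor(0.15 * N_I * V) tokens are kept, the rest are dropped.
num_kept = math.floor(0.15 * n_image_tokens * n_views)
kept_image_ids, dropped_image_ids = random_partition(n_image_tokens * n_views, num_kept, rng)

print(len(masked_point_ids), len(visible_point_ids), len(kept_image_ids))
```

Note that the two branches are counted differently in the text: the point branch fixes the number of masked tokens, whereas the image branch fixes the number of kept tokens, so the two floor operations are applied to different ratios.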

The joint decoder has two layers, each consisting of a multi-head cross-attention (MCA) layer and a multi-head self-attention (MSA) layer. A fully connected linear layer (FCL) projects the output of the decoder to the reconstructed point cloud. Formally, the behaviour of the decoder $\Phi_{\text{dec}}$ can be formulated as follows:

$$
\begin{aligned}
\mathcal{F}_0 &= \{\bm{m}_i^{\mathcal{P}}\}_{i=1}^{M_{\mathcal{P}}} \cup \{\bm{r}_i^{\mathcal{P}}\}_{i=1}^{R_{\mathcal{P}}}, \\
\mathcal{F}_1 &= \text{MSA}\big(\text{MCA}\big(\mathcal{F}_0, \{\bm{r}_i^{\mathcal{I}}\}_{i=1}^{R_{\mathcal{I}}}\big)\big), \\
\mathcal{F}_2 &= \text{MSA}\big(\text{MCA}\big(\mathcal{F}_1, \{\bm{r}_i^{\mathcal{I}}\}_{i=1}^{R_{\mathcal{I}}}\big)\big), \\
\mathcal{R} &= \text{FCL}(\mathcal{F}_2),
\end{aligned}
\tag{9}
$$
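To make the data flow of Eq. (9) concrete, the following pure-Python sketch runs two decoder layers on toy tokens. It is a deliberately simplified illustration, not the authors' implementation: it uses single-head attention without learned projections, normalization, or residual connections, and all token values and the projection matrix are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention; every argument is a list of vectors."""
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def decoder_layer(point_tokens, image_tokens):
    """One layer of Eq. (9): cross-attention (MCA), then self-attention (MSA)."""
    crossed = attention(point_tokens, image_tokens, image_tokens)  # MCA
    return attention(crossed, crossed, crossed)                    # MSA


def linear(tokens, weight):
    """Bias-free linear map; `weight` has one row per output dimension."""
    return [[sum(w * x for w, x in zip(row, t)) for row in weight] for t in tokens]


# Hypothetical toy inputs: 2 mask tokens, 1 visible point token, 2 kept image tokens.
mask_tokens = [[0.0, 0.0], [0.0, 0.0]]
visible_point_tokens = [[1.0, -1.0]]
kept_image_tokens = [[0.5, 0.2], [-0.3, 0.8]]

f0 = mask_tokens + visible_point_tokens      # F_0: mask and visible tokens together
f1 = decoder_layer(f0, kept_image_tokens)    # F_1
f2 = decoder_layer(f1, kept_image_tokens)    # F_2

# Hypothetical 3x2 projection (the FCL) mapping each token to one xyz coordinate.
projection = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
reconstruction = linear(f2, projection)      # R
print(reconstruction)
```

In this stripped-down form every output token is a convex combination of the kept image tokens, which makes explicit that the reconstruction can only draw on the image features supplied as keys and values. A practical decoder would add learned query/key/value projections, multiple heads, and a projection that emits all points of a patch rather than a single coordinate.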

where $\mathcal{R}$ denotes the reconstructed point cloud. The distance between $\mathcal{R}$ and the original point cloud $\mathcal{P}$ is measured by the Earth Mover's Distance (EMD). The loss function for the reconstruction task is thus defined as:

$$
\mathcal{L}_{\text{emd}} = D_{\text{EMD}}(\mathcal{R}\,\|\,\mathcal{P}), \tag{10}
$$

where $D_{\text{EMD}}$ denotes the EMD measure function.
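For intuition about the EMD used in Eq. (10), the sketch below computes it exactly for two tiny, equal-size point sets by trying every one-to-one matching. This brute-force search is an illustrative stand-in and is only feasible for a handful of points; the function name `emd_bruteforce` and the normalization by the number of points are assumptions, and practical pipelines typically rely on an approximate solver instead.

```python
import itertools
import math


def emd_bruteforce(pred, target):
    """Mean matching cost of the best bijection between two equal-size point sets."""
    if len(pred) != len(target):
        raise ValueError("point sets must have the same size")
    n = len(pred)
    best = math.inf
    for perm in itertools.permutations(range(n)):
        # Total Euclidean cost of matching pred[i] to target[perm[i]].
        cost = sum(math.dist(pred[i], target[j]) for i, j in enumerate(perm))
        best = min(best, cost)
    return best / n


original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
shuffled = [original[2], original[0], original[3], original[1]]
shifted = [(x + 0.5, y, z) for x, y, z in original]

emd_same = emd_bruteforce(shuffled, original)
emd_shift = emd_bruteforce(shifted, original)
print(emd_same, emd_shift)
```

Because the distance is defined over matchings, it is insensitive to the order in which points are listed, which is why the shuffled copy has zero distance while the translated copy does not.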

III-F Self-Training

Before adaptation, both the 2D teacher model and the 3D student model take labeled source domain data (i.e. $\{\mathcal{P}_i^s,\mathcal{I}_i^s,y_i^s\}_{i=1}^{n_s}$) as input for supervised learning:

$$
\mathcal{L}_{\text{cls}}^{s} = -\frac{1}{n_s}\sum_{i=1}^{n_s}\sum_{c=1}^{C} \mathrm{I}[c=y_i^s]\,\log\big(\Phi^{\mathcal{P}}(\mathcal{P}_i^s)_c\,\Phi^{\mathcal{I}}(\mathcal{I}_i^s)_c\big), \tag{11}
$$

where $\Phi^{\mathcal{P}}(\mathcal{P}_i^s)_c$ and $\Phi^{\mathcal{I}}(\mathcal{I}_i^s)_c$ denote the predicted probabilities of the $c$-th class from the 3D student model and the 2D teacher model, respectively, and $\mathrm{I}[\cdot]$ is an indicator function.
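Since the logarithm of a product is the sum of logarithms, the source loss in Eq. (11) equals the sum of the cross-entropy losses of the point cloud branch and the image branch. The short check below illustrates this with made-up probabilities; the function names and all numbers are hypothetical.

```python
import math


def joint_source_loss(point_probs, image_probs, labels):
    """Eq. (11): negative mean log of the product of the two true-class probabilities."""
    total = 0.0
    for p_point, p_image, y in zip(point_probs, image_probs, labels):
        total += math.log(p_point[y] * p_image[y])
    return -total / len(labels)


def cross_entropy(probs, labels):
    """Standard mean cross-entropy for a single branch."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


# Hypothetical predicted class probabilities for two source samples.
point_probs = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]
image_probs = [[0.8, 0.1, 0.1], [0.2, 0.5, 0.3]]
labels = [0, 1]

joint = joint_source_loss(point_probs, image_probs, labels)
separate = cross_entropy(point_probs, labels) + cross_entropy(image_probs, labels)
print(joint, separate)
```

In practice one would work with log-probabilities directly to avoid taking the logarithm of very small products.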

For adaptation, self-paced self-training (SPST) is a popular strategy to align the two domains by generating pseudo-labels in the target domain according to highly confident predictions. Following [14, 16, 15, 61], we also utilize the SPST strategy to further reduce domain shift. The objective of self-paced learning based self-training is:

$$
\mathcal{L}_{\text{cls}}^{t} = -\frac{1}{\widehat{n}_t}\sum_{i=1}^{\widehat{n}_t}\left(\sum_{c=1}^{C}\widehat{y}_{i,c}^{t}\,\log\big(\Phi^{\mathcal{P}}(\mathcal{P}_i^t)_c\,\Phi^{\mathcal{I}}(\mathcal{I}_i^t)_c\big)+\gamma\,|\widehat{\bm{y}}_i^t|_1\right), \tag{12}
$$

where $\widehat{n}_t$ denotes the number of pseudo-labeled samples in the target domain, $\widehat{\bm{y}}_i^t$ is the predicted one-hot pseudo-label vector for a target instance $\mathcal{P}_i^t$, $\widehat{y}_{i,c}^t$ is its $c$-th element, and $\gamma$ is a hyper-parameter that controls the number of selected target samples, i.e. the larger $\gamma$, the more samples are selected. We can simply convert $\gamma$ into the prediction confidence threshold $\theta=\exp(-\gamma)$. The generic pseudo-label generation strategy can be simplified to the following form when all network parameters are fixed:

$$
\widehat{y}^{t}_{i,c} =
\begin{cases}
1, & \text{if } c=\arg\max_{c} p(c|\mathrm{logit}_i) \ \&\ p(c|\mathrm{logit}_i)>\theta, \\
0, & \text{otherwise},
\end{cases}
\tag{13}
$$

where $\mathrm{logit}_i=Avg(\Phi^{\mathcal{P}}(\mathcal{P}_i^t),\Phi^{\mathcal{I}}(\mathcal{I}_i^t))$. We adopt a threshold $\theta$ that gradually increases as the self-paced rounds proceed, i.e. it increases by a constant $\epsilon$ in each round.
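The selection rule of Eq. (13) can be sketched as follows. The sketch assumes that the two branch outputs are logits that are averaged and then passed through a softmax to obtain $p(c|\mathrm{logit}_i)$; the text does not spell out this detail, so treat it as one possible reading. The function names and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(point_logits, image_logits, theta):
    """Return a one-hot pseudo label, or all zeros if the sample is not confident."""
    averaged = [(a + b) / 2.0 for a, b in zip(point_logits, image_logits)]
    probs = softmax(averaged)
    best = max(range(len(probs)), key=probs.__getitem__)
    one_hot = [0] * len(probs)
    if probs[best] > theta:
        one_hot[best] = 1
    return one_hot


# theta = exp(-gamma): a larger gamma lowers the threshold and admits more samples.
gamma = -math.log(0.8)
theta = math.exp(-gamma)

confident = pseudo_label([4.0, 0.0, 0.0], [6.0, 0.0, 0.0], theta)
uncertain = pseudo_label([1.0, 0.5, 0.0], [1.0, 0.5, 0.0], theta)
print(confident, uncertain)
```

A target sample whose label vector is all zeros contributes nothing to Eq. (12) and is effectively skipped for that round.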

III-G Overall Loss

The framework of our approach is illustrated in Fig. 2. The overall training loss of our method is:

$$
\mathcal{L} = \mathcal{L}_{\text{kd}} + \alpha\,\mathcal{L}_{\text{emd}} + \beta\,\mathcal{L}_{\text{cls}}^{s} + \eta\,\mathcal{L}_{\text{cls}}^{t}, \tag{14}
$$

where $\alpha$, $\beta$ and $\eta$ are hyper-parameters used to balance the loss terms. We follow [15, 14, 16, 61] and apply a two-stage optimization for training the models. During the first stage of model training, we mainly rely on the first three loss terms to ensure better completion of the adaptation process. Once the initial training is completed, we use the trained teacher and student models together to generate pseudo labels for the target domain samples and perform the self-training.
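A minimal sketch of how Eq. (14) and the two-stage schedule might be combined is given below. It assumes that the target self-training term is switched off entirely in the first stage and that all four terms are active in the second stage, which is one reading of "mainly rely on the first three loss terms". The function `total_loss` and the loss values are hypothetical.

```python
def total_loss(l_kd, l_emd, l_cls_s, l_cls_t, alpha=1.0, beta=1.0, eta=1.0, stage=1):
    """Weighted sum of Eq. (14); the target term is only added in stage 2 (assumption)."""
    loss = l_kd + alpha * l_emd + beta * l_cls_s
    if stage == 2:
        loss += eta * l_cls_t
    return loss


# Hypothetical per-batch loss values.
stage1 = total_loss(0.5, 0.2, 0.3, 0.4, stage=1)
stage2 = total_loss(0.5, 0.2, 0.3, 0.4, stage=2)
print(stage1, stage2)
```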

IV Experiments

IV-A Datasets

PointDA-10. PointDA-10 [12] is a popular UDA dataset designed for point cloud classification, which consists of subsets of three datasets: ShapeNet, ModelNet40 and ScanNet. These sub-datasets share the same ten categories, such as bathtub, bed, and bookshelf. In particular, ShapeNet-10 (S) is a subset of the ShapeNet dataset and contains 17,378 training and 2,492 testing point clouds extracted from synthetic 3D CAD models. Similarly, ModelNet-10 (M) consists of 4,183 training and 856 testing samples taken from the synthetic dataset ModelNet40, but its point clouds differ in shape from the same-class samples in ShapeNet. ScanNet-10 (S*) is sampled from ScanNet and contains 6,110 training samples and 1,769 testing samples. It is the only real dataset, consisting of scanned real-world indoor scenes. Due to errors in the registration process and occlusions, the point clouds in ScanNet-10 suffer from noise and sparseness, making classification more challenging. With the three sub-datasets, we can evaluate our method in six different UDA settings covering Simulation-to-Reality, Reality-to-Simulation and Simulation-to-Simulation scenarios.

Sim-to-Real. The Sim-to-Real [29] dataset is a fairly new benchmark for the problem of 3D domain generalization (3DDG), which collects object point clouds of 11 shared classes from ModelNet40 [8] and ScanObjectNN [11], and 9 shared classes from ShapeNet [9] and ScanObjectNN [11]. This benchmark consists of four subsets: ModelNet-11 (M11), ScanObjectNN-11 (SO*11), ShapeNet-9 (S9) and ScanObjectNN-9 (SO*9). Among them, M11 consists of 4,844 training and 972 testing point clouds, SO*11 includes 1,915 training and 475 testing point clouds, S9 consists of 19,904 training and 1,995 testing point clouds, and SO*9 includes 1,602 training and 400 testing point clouds. Following [16], we conduct two types of Simulation-to-Reality adaptation scenarios: M11 → SO*11 and S9 → SO*9.

IV-B Implementation Details

For our RPD, we adopt mini-DGCNN [3], a standard DGCNN with half the number of layers, as the 3D Tokenizer $\Phi_{\text{emb}}^{\mathcal{P}}$. The 2D Tokenizer $\Phi_{\text{emb}}^{\mathcal{I}}$ is a 2D convolution layer with a convolution kernel size equal to the image patch size. We adopt a standard vision transformer as the backbone to extract relationships across patch tokens from images and point clouds. The transformer module is initialized by MAE [31] pre-trained ViT-B/16 [32] and we only train the last three blocks to balance accuracy and efficiency. The Category Classifiers $\Phi_{\text{cls}}^{\mathcal{I}}$ and $\Phi_{\text{cls}}^{\mathcal{P}}$ are based on a multi-layer perceptron (MLP) with three fully connected layers. The Joint Decoder $\Phi_{\text{dec}}$ for self-supervised reconstruction has two layers, each consisting of a multi-head cross-attention (MCA) layer and a multi-head self-attention (MSA) layer, followed by a fully connected linear (FCL) projection layer. By default, the hyper-parameters $\alpha$, $\beta$ and $\eta$ are empirically set to 1, 1 and 1, respectively. During training, the Adam optimizer [82] is used with an initial learning rate of 0.0001 and an epoch-wise cosine annealing learning rate scheduler. Dropout of 0.5 and batch normalization are adaptively applied after the convolution layers and the hidden layers. The training batch size is set to 32. More training details are provided in Table I. During self-paced self-training (SPST), the initial threshold $\theta$ and the increment constant $\epsilon$ are empirically set to 0.8 and 0.05, and the training contains 10 rounds with 5 epochs in each round. For simulation-to-reality scenarios, some specific data augmentation strategies are adopted, such as jittering, randomly dropping holes and rotation.
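The epoch-wise cosine annealing schedule mentioned in the implementation details can be written in closed form. The sketch below uses the stated base learning rate of 1e-4 and one of the epoch budgets from Table I (200); the minimum learning rate of 0 and the function name are assumptions, since the text does not specify them.

```python
import math


def cosine_annealed_lr(epoch, total_epochs, base_lr=1e-4, min_lr=0.0):
    """Learning rate at `epoch` under cosine annealing without restarts."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total_epochs = 200
schedule = [cosine_annealed_lr(e, total_epochs) for e in range(total_epochs + 1)]
print(schedule[0], schedule[total_epochs // 2], schedule[-1])
```

The rate starts at the base value, passes through half of it at the midpoint, and decays smoothly towards the minimum at the final epoch.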

Transformer Configurations: We extract relationships between image and point cloud tokens using the standard ViT [32] architecture, which comprises 12 layers with 12 attention heads each and an embedding dimension of 768. Only the last three layers are trained to balance accuracy and efficiency. The decoder network has 2 layers, each equipped with a multi-head cross-attention (MCA) layer and a multi-head self-attention (MSA) layer. Its number of attention heads and embedding dimension are set to 16 and 512, respectively.
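The stated sizes can be collected in a small configuration sketch, which also makes the implied per-head dimensions explicit. The class `AttentionConfig`, its field names, and the zero-based block indexing are illustrative assumptions and do not come from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionConfig:
    depth: int       # number of layers
    num_heads: int   # attention heads per layer
    embed_dim: int   # token embedding dimension

    @property
    def head_dim(self):
        """Per-head dimension; the embedding must split evenly across heads."""
        if self.embed_dim % self.num_heads != 0:
            raise ValueError("embed_dim must be divisible by num_heads")
        return self.embed_dim // self.num_heads


encoder = AttentionConfig(depth=12, num_heads=12, embed_dim=768)  # ViT backbone
decoder = AttentionConfig(depth=2, num_heads=16, embed_dim=512)   # joint decoder

# Only the last three encoder blocks are trained (zero-based indices).
trainable_blocks = list(range(encoder.depth - 3, encoder.depth))
print(encoder.head_dim, decoder.head_dim, trainable_blocks)
```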

TABLE I: Training configurations for 6 settings in PointDA-10 [12] and 2 settings in Sim-to-Real [29]. R, J and D in the augmentation row denote rotation, jittering and randomly dropping holes, respectively.

| Config | M→S | M→S* | S→M | S→S* | S*→M | S*→S | S9→SO*9 | M11→SO*11 |
|---|---|---|---|---|---|---|---|---|
| optimizer | Adam | Adam | Adam | Adam | Adam | Adam | Adam | Adam |
| base learning rate | 1e-4 | 1e-4 | 1e-4 | 1e-4 | 1e-4 | 1e-4 | 1e-4 | 1e-4 |
| weight decay | 5e-5 | 5e-5 | 5e-4 | 5e-4 | 5e-5 | 5e-5 | 5e-5 | 5e-5 |
| dropout | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| training epochs | 400 | 400 | 200 | 200 | 200 | 200 | 400 | 400 |
| label smoothing | 0 | 0 | 0.3 | 0.3 | 0 | 0 | 0 | 0 |
| augmentation | R | R, J, D | R | R, J, D | R, J | R, J | R, J, D | R, J, D |

TABLE II: Classification accuracy (%) averaged over 3 seeds (± SEM) on the PointDA-10 dataset. M: ModelNet-10; S: ShapeNet-10; S*: ScanNet-10. We compare with the state-of-the-art 3D UDA methods and our method achieves the best performance. † denotes experiments without using 3 seeds. The best performance is highlighted in bold.

| Methods | SPST | M→S | M→S* | S→M | S→S* | S*→M | S*→S | Avg |
|---|---|---|---|---|---|---|---|---|
| w/o Adapt | | 83.3±0.7 | 43.8±2.3 | 75.5±1.8 | 42.5±1.4 | 63.8±3.9 | 64.2±0.8 | 62.2±1.8 |
| PointDAN [12] | | 83.9±0.3 | 44.8±1.4 | 63.3±1.1 | 45.7±0.7 | 43.6±2.0 | 56.4±1.5 | 56.3±1.2 |
| RS [63] | | 79.9±0.8 | 46.7±4.8 | 75.2±2.0 | 51.4±3.9 | 71.8±2.3 | 71.2±2.8 | 66.0±1.6 |
| DefRec+PCM [13] | | 81.7±0.6 | 51.8±0.3 | 78.6±0.7 | 54.5±0.3 | 73.7±1.6 | 71.1±1.4 | 68.6±0.8 |
| Learnable-DefRec [62] | | 82.8±0.0 | 56.3±0.0 | 81.7±0.0 | 54.8±0.0 | 72.9±0.0 | 71.7±0.0 | 70.0±0.0 |
| GLRV [16] | | 85.4±0.4 | 60.4±0.4 | 78.8±0.6 | 57.7±0.4 | 77.8±1.1 | 76.2±0.6 | 72.7±0.6 |
| GAST [14] | | 83.9±0.2 | 56.7±0.3 | 76.4±0.2 | 55.0±0.2 | 73.4±0.3 | 72.2±0.2 | 69.5±0.2 |
| GAST [14] | ✓ | 84.8±0.1 | 59.8±0.2 | 80.8±0.6 | 56.7±0.2 | 81.1±0.8 | 74.9±0.5 | 73.0±0.4 |
| GAI [15] | | 85.8±0.3 | 55.3±0.3 | 77.2±0.4 | 55.4±0.5 | 73.8±0.6 | 72.4±1.0 | 70.0±0.5 |
| GAI [15] | ✓ | 86.2±0.2 | 58.6±0.1 | 81.4±0.4 | 56.9±0.2 | 81.5±0.5 | 74.4±0.6 | 73.2±0.3 |
| SD [60] | | 83.9±0.0 | 61.1±0.0 | 80.3±0.0 | 58.9±0.0 | 85.5±0.0 | 80.9±0.0 | 75.1±0.0 |
| Ours | | 81.9±0.3 | 64.4±0.5 | 82.8±0.4 | 59.0±0.3 | 77.1±0.8 | 76.4±0.6 | 73.6±0.5 |
| Ours | ✓ | **86.3±0.3** | **64.9±0.2** | **88.7±0.1** | **61.1±0.1** | **86.2±0.9** | **81.2±0.3** | **78.0±0.3** |

IV-C Comparison with the State-of-the-art Methods

We compare our RPD with recent state-of-the-art point-based UDA methods, including Domain Adversarial Neural Network (PointDAN) [12], Reconstruction Space Network (RS) [63], Deformation Reconstruction Network with Point Cloud Mixup (DefRec+PCM) [13], Learnable Deformation Reconstruction Network (Learnable-DefRec) [62], Global-Local structure modeling and Reliable Voted pseudo label method (GLRV) [16], Geometry-Aware Self-Training (GAST) [14], Geometry-Aware Implicits (GAI) [15], and Self-Distillation (SD) [60]. The w/o Adapt baseline trains the DGCNN network with only labeled source samples and serves as a lower-bound reference.

We report in Tab. II the comparisons between our proposed RPD and other UDA methods on PointDA-10. As can be seen, our method with SPST achieves the best accuracy in all six settings. The average classification accuracy of RPD outperforms the current SOTA method SD [60] by 2.9%. RPD also achieves a remarkable improvement over SD in the Simulation-to-Reality settings of M→S* (+3.8%) and S→S* (+2.2%), which are the most challenging yet realistic tasks. These observations verify the capability of our RPD to effectively capture semantic information from point clouds.

For the Sim-to-Real dataset, we compare our method with a meta-learning method, i.e. MetaSets [29], and point-based domain adaptation methods, i.e. PointDAN [12] and GLRV [16]. We report the mean accuracy and standard error over three seeds in Table IV. Our method outperforms both the point-based domain adaptation and meta-learning methods, achieving a new state of the art.

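TABLE III: Ablation study on each component of our method. Experiments are conducted on the PointDA-10 dataset.

| | OCKD | MPCR | SPST | M→S | M→S* | S→M | S→S* | S*→M | S*→S | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| PointNet [1] | | | | 80.5 | 41.6 | 75.8 | 40.0 | 60.5 | 63.6 | 60.3 |
| DGCNN [3] | | | | 83.3 | 43.8 | 75.5 | 42.5 | 63.8 | 64.2 | 62.2 |
| Ours | | | | 82.1 | 58.7 | 74.2 | 52.8 | 72.7 | 70.7 | 68.5 |
| Ours | ✓ | | | 82.0 | 62.6 | 75.2 | 58.3 | 74.1 | 71.0 | 70.5 |
| Ours | | ✓ | | 82.5 | 62.2 | 77.2 | 55.1 | 73.7 | 73.9 | 70.8 |
| Ours | ✓ | ✓ | | 81.9 | 64.4 | 82.8 | 59.0 | 77.1 | 76.4 | 73.6 |
| Ours | | | ✓ | 82.4 | 59.0 | 82.0 | 56.5 | 80.0 | 78.9 | 73.1 |
| Ours | ✓ | | ✓ | 83.7 | 62.9 | 85.9 | 61.1 | 84.2 | 79.3 | 76.2 |
| Ours | | ✓ | ✓ | 85.5 | 63.2 | 82.2 | 57.7 | 80.7 | 79.8 | 74.9 |
| Ours | ✓ | ✓ | ✓ | 86.3 | 64.9 | 88.7 | 61.1 | 86.2 | 81.2 | 78.0 |
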
TABLE IV: Classification accuracy (%) averaged over 3 seeds (± SEM) on the Sim-to-Real dataset. M11: ModelNet-11; SO*11: ScanObjectNN-11; S9: ShapeNet-9; SO*9: ScanObjectNN-9.

| Methods | SPST | M11→SO*11 | S9→SO*9 |
|---|---|---|---|
| w/o Adaptation | | 61.68±1.26 | 57.42±1.01 |
| PointDAN [12] | | 63.32±0.85 | 54.95±0.87 |
| MetaSets [29] | | 72.42±0.21 | 60.92±0.76 |
| GLRV [16] | | 75.16±0.34 | 62.46±0.55 |
| Ours | | 74.43±0.54 | 63.25±0.50 |
| Ours | ✓ | 77.05±0.42 | 67.50±0.50 |

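TABLE V: Ablation study on each component of our method. Experiments are conducted on the Sim-to-Real dataset.

| OCKD | MPCR | SPST | M11→SO*11 | S9→SO*9 |
|---|---|---|---|---|
| | | | 69.12 | 60.25 |
| ✓ | | | 71.24 | 61.50 |
| | ✓ | | 70.53 | 60.75 |
| ✓ | ✓ | | 73.47 | 63.25 |
| | | ✓ | 72.14 | 61.75 |
| ✓ | | ✓ | 74.19 | 64.50 |
| | ✓ | ✓ | 73.32 | 63.50 |
| ✓ | ✓ | ✓ | 77.05 | 67.50 |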

IV-D Ablation Studies

To validate the effectiveness of our proposed method, we conducted ablation studies on the six settings of PointDA-10 and the two settings of Sim-to-Real. We utilized an MAE pre-trained Vision Transformer to extract features and introduced three key components for adaptation: an online cross-model knowledge distillation method (OCKD), a masked point cloud reconstruction component (MPCR), and a self-paced self-training strategy (SPST). The results are summarized in Tab. III and Tab. V.

For PointDA-10, the first three rows in Tab. III respectively show the results of using PointNet [1], DGCNN [3], and our proposed method as the backbone network without adaptation. Our baseline performs significantly better than PointNet [1] and DGCNN [3] in 4 out of 6 settings, highlighting the superior generalization of transformer models pre-trained on large-scale image datasets over traditional 3D networks. Comparing the fourth row with the third row in Tab. III, we observe that OCKD improves over the baseline in five of the six settings (all except M→S, where it is 0.1% lower) and on average, indicating that the point cloud branch has acquired abundant semantic information, which enhances its generalization capability. Furthermore, the fifth row shows that the inclusion of the masked point cloud reconstruction module improves the model's ability to capture geometric information, resulting in better classification accuracy. Moreover, a significant improvement is observed on average when the OCKD and MPCR components are used together. The Simulation-to-Reality settings achieve competitive results even without SPST, surpassing the performance of the previous SOTA model. Additionally, accuracy improves in all six settings after adding the SPST method, indicating its effectiveness across all datasets. Finally, in the last row, we report the results obtained by combining all components, where our method achieves the best result and outperforms the recent SOTA method SD [60]. For Sim-to-Real, the results shown in Tab. V lead to similar conclusions, which again verifies the effectiveness of our proposed method.

Figure 3: Confusion matrices of classifying testing samples on the target domain under four simulation-to-reality scenarios of M → S*, S → S*, M11 → SO*11, and S9 → SO*9. Panels: (a) w/o Adapt: M → S*; (b) Ours: M → S*; (c) w/o Adapt: S → S*; (d) Ours: S → S*; (e) w/o Adapt: M11 → SO*11; (f) Ours: M11 → SO*11; (g) w/o Adapt: S9 → SO*9; (h) Ours: S9 → SO*9.

Figure 4: Illustration of cross-modal knowledge fusion.

We also investigate the influence of the cross-modal knowledge fusion strategy. For shape classification, we directly fuse the predictions by linear interpolation, namely, by adding the classification logits of the 2D teacher and 3D student models element-wise. This simple yet effective design produces an ensemble of two types of knowledge: the 3D geometric information captured by the self-supervised masked point cloud reconstruction, and the robust semantics from the trained 2D Transformer-based models. We believe that these two kinds of knowledge are complementary. As shown in Fig. 4, the cross-modal knowledge fusion strategy consistently improves the cross-domain generalization on all settings of PointDA-10 and Sim-to-Real.
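The fusion rule described above amounts to an element-wise sum of the two logit vectors followed by an arg-max. The sketch below illustrates it with made-up logits for a three-class problem; the function name and the numbers are hypothetical.

```python
def fuse_and_predict(teacher_logits, student_logits):
    """Add the two logit vectors element-wise and return (predicted class, fused logits)."""
    fused = [t + s for t, s in zip(teacher_logits, student_logits)]
    predicted = max(range(len(fused)), key=fused.__getitem__)
    return predicted, fused


# Hypothetical logits where the two branches disagree on the top class.
teacher_logits = [2.0, 1.5, 0.1]   # 2D teacher prefers class 0
student_logits = [0.2, 1.4, 0.3]   # 3D student prefers class 1

predicted, fused = fuse_and_predict(teacher_logits, student_logits)
print(predicted, fused)
```

In this toy case the fused prediction follows the class that both branches rate highly, even though it is not the teacher's top choice, which is the kind of complementarity the fusion is meant to exploit.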

Figure 5: Visualization of reconstructed point cloud samples with random masking in the target domain of PointDA-10. Panels: (a) M → S; (b) M → S*; (c) S → M; (d) S → S*; (e) S* → M; (f) S* → S.

It is noteworthy that for the challenging yet realistically significant Simulation-to-Reality scenarios (i.e. M → S*, S → S*, M11 → SO*11, and S9 → SO*9), our proposed RPD improves over w/o Adapt by 21.1%, 18.6%, 15.37%, and 10.08%, respectively. Confusion matrices showing the class-wise classification accuracy achieved by w/o Adapt and our RPD on the four Simulation-to-Reality UDA tasks are given in Fig. 3.

IV-E Visualization

We visualize the input point clouds, the randomly masked inputs, and the reconstructed 3D coordinates in Fig. 5. We believe that reconstructing masked point clouds from masked 2D image tokens creates a challenging self-supervised learning task that encourages the network to learn 3D geometric information. We use the saliency map analysis method of [83] to visualize the features of the compared methods under the M → S* setting on PointDA-10 [12]. As shown in Fig. 6, the proposed RPD uses the prior knowledge in pre-trained Vision Transformers (ViT) to model the relationships between local patches. It attends to the global structure and effectively captures the local relationships within point cloud data, enabling our method to extract robust and invariant 3D semantic representations. In contrast, most of the other compared methods use DGCNN as the backbone for feature extraction, which tends to focus only on local high-frequency areas, such as edges, thereby limiting generalization.
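The random masking step can be sketched as follows. This is only an illustration under stated assumptions: the masking ratio of 0.6, the 32 points per patch, and the helper name are hypothetical, and the patch count of 27 is borrowed from the discussion in Section V-A.

```python
import numpy as np


def random_patch_mask(num_patches, mask_ratio, rng):
    """Boolean mask with True marking the patches hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    mask[masked_idx] = True
    return mask


rng = np.random.default_rng(0)
# 27 patches of 32 points each with xyz coordinates (sizes are illustrative).
patches = rng.normal(size=(27, 32, 3))
mask = random_patch_mask(len(patches), mask_ratio=0.6, rng=rng)

visible_patches = patches[~mask]  # what the network is allowed to see
target_patches = patches[mask]    # coordinates the network must reconstruct
```

The masked patches serve as reconstruction targets, so the network only receives a training signal for geometry it could not observe directly.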

Figure 6: Saliency map visualization of various comparison methods under the setting of M → S* on PointDA-10. Panels: (a) PointDAN, (b) DefRec, (c) SD, (d) RPD.

V Discussion

V-A Scalability

As shown in Table VI, we experimented with three scales of pre-trained ViT models: ViT-S, ViT-B, and ViT-L. Although the largest model, ViT-L, gives the highest source-domain accuracy (91.3% on average), it does not improve target-domain accuracy: its 72.1% average is below the 73.6% obtained with ViT-B.

TABLE VI: Results on the effect of using different pre-trained 2D Transformer-based models on PointDA-10.
Domain  Methods  M→S   M→S*  S→M   S→S*  S*→M  S*→S  Avg
Source  RPD-S    98.7  98.8  94.5  94.3  76.5  78.0  90.3
Source  RPD-B    98.7  98.9  94.0  94.7  77.4  79.3  90.7
Source  RPD-L    99.0  99.3  95.4  95.7  78.1  80.0  91.3
Target  RPD-S    80.4  63.5  81.7  58.4  74.5  73.5  72.0
Target  RPD-B    81.9  64.4  82.8  59.0  77.1  76.4  73.6
Target  RPD-L    80.9  61.7  80.4  58.9  75.0  75.9  72.1

We attribute this to several factors. First, our point cloud datasets are relatively small and prone to overfitting, which a larger model such as ViT-L exacerbates. Second, we use only 27 patches, compared to the 196 used in ViT-B; with so few patches, the extra capacity of ViT-L is not needed to model patch relationships effectively. Finally, exploiting a larger model such as ViT-L would require more patches, which could leave too little information in each patch and reduce the effectiveness of local relationship modeling.
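To make the patch-count comparison concrete, the sketch below computes the token count of a ViT and the number of pairwise attention entries per head. It assumes the common ViT-B/16 configuration (224×224 input, 16×16 patches), which is where the figure of 196 comes from, and it ignores the class token; the function names are hypothetical.

```python
def vit_patch_count(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def attention_pairs(num_tokens):
    """Entries in one self-attention matrix (every token attends to every token)."""
    return num_tokens * num_tokens


vit_b_tokens = vit_patch_count(224, 16)  # 14 * 14 = 196 (assumed ViT-B/16 setup)
rpd_tokens = 27                          # patch count stated in the text

pair_ratio = attention_pairs(vit_b_tokens) / attention_pairs(rpd_tokens)
print(f"{vit_b_tokens} vs {rpd_tokens} tokens, {pair_ratio:.1f}x more attention pairs")
```

With 27 tokens there are far fewer pairwise relationships to model than in the image setting the ViT was pre-trained on, which is consistent with the argument that additional capacity brings little benefit here.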

V-B Limitation

While our RPD approach demonstrates considerable promise, it has certain limitations that are worth investigating:

Computational Complexity: As shown in Table VII, the use of a pre-trained ViT model leads to increased computational requirements, especially when dealing with very large datasets or high-resolution point clouds. Future work could focus on optimizing the model for efficiency or exploring more lightweight architectures that retain the robustness of the pre-trained ViT model.

Effects of Open-set Data: The pre-trained ViT model is based on 2D data, which might influence our results. If the 2D pre-training data does not include samples of the semantic categories in the PointDA-10 [12] and Sim-to-Real [29] point cloud datasets, our method’s performance could be adversely affected. The absence of relevant categories in the pre-training dataset can result in suboptimal feature extraction and limited generalization.

Effects of Pre-training Approaches: Different pre-training approaches for ViT [32], such as DINO [84] and MAE [31], can have varying impacts on the performance of our method. We have observed that MAE pre-trained ViT tends to have an advantage in our experiments. The pre-training method affects the quality of learned representations and the model’s effectiveness in downstream tasks.

Extension to Other 3D Data Types: While our current focus is on point clouds, extending the proposed RPD to other types of 3D data, such as voxel grids or mesh data, could enhance its applicability. Adapting and optimizing the approach for these data types is an interesting area for future work.

TABLE VII: Comparative analysis of training costs of different methods.
Methods        Parameters (M)  FLOPs (G)
DefRec [13]    2.08            2.77
PointDAN [12]  2.84            0.94
SD [60]        3.47            0.92
GAI [15]       22.68           3.58
GAST [14]      23.60           2.17
RPD (ours)     62.27           23.29
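The overhead reported in Table VII can be expressed as ratios relative to each baseline. The short sketch below only copies the numbers from the table; the dictionary and function names are hypothetical.

```python
# (parameters in millions, FLOPs in giga), copied from Table VII.
TRAINING_COSTS = {
    "DefRec": (2.08, 2.77),
    "PointDAN": (2.84, 0.94),
    "SD": (3.47, 0.92),
    "GAI": (22.68, 3.58),
    "GAST": (23.60, 2.17),
    "RPD": (62.27, 23.29),
}


def cost_ratio(method, baseline):
    """Return (parameter ratio, FLOPs ratio) of `method` relative to `baseline`."""
    m_params, m_flops = TRAINING_COSTS[method]
    b_params, b_flops = TRAINING_COSTS[baseline]
    return m_params / b_params, m_flops / b_flops


for name in TRAINING_COSTS:
    if name != "RPD":
        params_x, flops_x = cost_ratio("RPD", name)
        print(f"RPD vs {name}: {params_x:.1f}x parameters, {flops_x:.1f}x FLOPs")
```

Reading the costs as ratios makes it easier to weigh the accuracy gains against the additional compute when choosing a method for a given budget.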

VI Conclusion

This paper proposes a novel scheme for unsupervised domain adaptation on object point cloud classification, aiming to alleviate domain shift by distilling relational priors from pre-trained 2D transformers. We illustrate how the relational priors learned by a well-trained 2D Transformer model can be transferred to the 3D domain, thereby enhancing the generalization of 3D features. Our method adopts a standard teacher-student distillation framework in which the parameter-frozen pre-trained Transformer module is shared between the 2D teacher model and the 3D student model. Additionally, we employ an online knowledge distillation strategy to further semantically regularize the 3D student model. Moreover, to strengthen the model’s capacity to capture 3D geometric information, we introduce a novel self-supervised task that reconstructs masked point cloud patches from the corresponding masked multi-view image features. Experiments conducted on two public benchmarks validate the efficacy of our approach, demonstrating new state-of-the-art performance.

References

  • [1] C. Qi, H. Su, K. Mo, and L. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 652–660.
  • [2] C. Qi, L. Yi, H. Su, and L. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2017, pp. 1–16.
  • [3] Y. Wang, Y. Sun, Z. Liu, S. Sarma, M. Bronstein, and J. Solomon, “Dynamic graph cnn for learning on point clouds,” ACM Trans. Graph., vol. 38, no. 5, pp. 1–12, 2019.
  • [4] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen, “Pointcnn: Convolution on x-transformed points,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 31, 2018.
  • [5] W. Wu, Z. Qi, and L. Fuxin, “Pointconv: Deep convolutional networks on 3d point clouds,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 9621–9630.
  • [6] X. Ma, C. Qin, H. You, H. Ran, and Y. Fu, “Rethinking network design and local geometry in point cloud: A simple residual mlp framework,” arXiv preprint arXiv:2202.07123, 2022.
  • [7] H. Thomas, C. Qi, J. Deschaud, B. Marcotegui, F. Goulette, and L. Guibas, “Kpconv: Flexible and deformable convolution for point clouds,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2019, pp. 6411–6420.
  • [8] K. V. Vishwanath, D. Gupta, A. Vahdat, and K. Yocum, “Modelnet: Towards a datacenter emulation environment,” in Proc. IEEE 9th Int. Conf. Peer Peer Comput., 2009, pp. 81–82.
  • [9] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su et al., “Shapenet: An information-rich 3d model repository,” ArXiv, vol. 1512.03012, 2015.
  • [10] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 5828–5839.
  • [11] M. A. Uy, Q. Pham, B. Hua, T. Nguyen, and S. Yeung, “Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 1588–1597.
  • [12] C. Qin, H. You, L. Wang, C. Kuo, and Y. Fu, “Pointdan: A multi-scale 3d domain adaption network for point cloud representation,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 32, 2019.
  • [13] I. Achituve, H. Maron, and G. Chechik, “Self-supervised learning for domain adaptation on point clouds,” in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), 2021, pp. 123–133.
  • [14] L. Zou, H. Tang, K. Chen, and K. Jia, “Geometry-aware self-training for unsupervised domain adaptation on object point clouds,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021, pp. 6403–6412.
  • [15] Y. Shen, Y. Yang, M. Yan, H. Wang, Y. Zheng, and L. Guibas, “Domain adaptation on point clouds via geometry-aware implicits,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 7223–7232.
  • [16] H. Fan, X. Chang, W. Zhang, Y. Cheng, Y. Sun, and M. Kankanhalli, “Self-supervised global-local structure modeling for point cloud domain adaptation with reliable voted pseudo labels,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 6377–6386.
  • [17] Y. Chen, Z. Wang, L. Zou, K. Chen, and K. Jia, “Quasi-balanced self-training on noise-aware synthesis of object point clouds for closing domain gap,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2022, pp. 728–745.
  • [18] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” J. Mach. Learn. Res., vol. 17, pp. 2096–2030, 2016.
  • [19] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 7167–7176.
  • [20] J. Li, E. Chen, Z. Ding, L. Zhu, K. Lu, and H. Shen, “Maximum density divergence for domain adaptation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 11, pp. 3918–3930, 2020.
  • [21] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada, “Maximum classifier discrepancy for unsupervised domain adaptation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 3723–3732.
  • [22] F. Zhao, S. Liao, G. Xie, J. Zhao, K. Zhang, and L. Shao, “Unsupervised domain adaptation with noise resistible mutual-training for person re-identification,” in Proc. Eur. Conf. Comput. Vis. (ECCV). Springer, 2020, pp. 526–544.
  • [23] X. Wei, X. Gu, and J. Sun, “Learning generalizable part-based feature representation for 3d point clouds,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2022.
  • [24] M. Guo, J. Cai, Z. Liu, T. Mu, R. Martin, and S. Hu, “Pct: Point cloud transformer,” Computational Visual Media, vol. 7, pp. 187–199, 2021.
  • [25] H. Zhao, L. Jiang, J. Jia, P. Torr, and V. Koltun, “Point transformer,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021, pp. 16 259–16 268.
  • [26] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. IEEE, 2009, pp. 248–255.
  • [27] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proc. Eur. Conf. Comput. Vis. (ECCV). Springer, 2014, pp. 740–755.
  • [28] A. Radford, J. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in Proc. ICML. PMLR, 2021, pp. 8748–8763.
  • [29] C. Huang, Z. Cao, Y. Wang, J. Wang, and M. Long, “Metasets: Meta-learning on point sets for generalizable representations,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2021, pp. 8863–8872.
  • [30] X. Zhu, R. Zhang, B. He, Z. Guo, Z. Zeng, Z. Qin, S. Zhang, and P. Gao, “Pointclip v2: Prompting clip and gpt for powerful 3d open-world learning,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2023, pp. 2639–2650.
  • [31] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 16 000–16 009.
  • [32] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [33] E. Kalogerakis, M. Averkiou, S. Maji, and S. Chaudhuri, “3d shape segmentation with projective convolutional networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 3779–3788.
  • [34] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, “Multi-view convolutional neural networks for 3d shape recognition,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2015, pp. 945–953.
  • [35] Y. Feng, Z. Zhang, X. Zhao, R. Ji, and Y. Gao, “Gvcnn: Group-view convolutional neural networks for 3d shape recognition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 264–272.
  • [36] S. Tulsiani, A. Efros, and J. Malik, “Multi-view consistency as supervisory signal for learning shape and pose prediction,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 2897–2905.
  • [37] A. Hamdi, S. Giancola, and B. Ghanem, “Mvtn: Multi-view transformation network for 3d shape recognition,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021, pp. 1–11.
  • [38] A. Goyal, H. Law, B. Liu, A. Newell, and J. Deng, “Revisiting point cloud shape classification with a simple and effective baseline,” in Proc. ICML. PMLR, 2021, pp. 3809–3820.
  • [39] H. Peng, B. Li, B. Zhang, X. Chen, T. Chen, and H. Zhu, “Multi-view vision fusion network: Can 2d pre-trained model boost 3d point cloud data-scarce learning?” IEEE Trans. Circuits Syst. Video Technol., 2023.
  • [40] C. Choy, J. Gwak, and S. Savarese, “4d spatio-temporal convnets: Minkowski convolutional neural networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 3075–3084.
  • [41] B. Graham, M. Engelcke, and L. Van Der Maaten, “3d semantic segmentation with submanifold sparse convolutional networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 9224–9232.
  • [42] D. Maturana and S. Scherer, “Voxnet: A 3d convolutional neural network for real-time object recognition,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS). IEEE, 2015, pp. 922–928.
  • [43] A. Kanezaki, Y. Matsushita, and Y. Nishida, “Rotationnet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 5010–5019.
  • [44] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 30, 2017.
  • [45] X. Yu, L. Tang, Y. Rao, T. Huang, J. Zhou, and J. Lu, “Point-bert: Pre-training 3d point cloud transformers with masked point modeling,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 19 313–19 322.
  • [46] Z. Huang, Z. Zhao, B. Li, and J. Han, “Lcpformer: Towards effective 3d point cloud analysis via local context propagation in transformers,” IEEE Trans. Circuits Syst. Video Technol., 2023.
  • [47] Y. Zou, Z. Yu, B. Kumar, and J. Wang, “Unsupervised domain adaptation for semantic segmentation via class-balanced self-training,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 289–305.
  • [48] L. Chen, Q. Du, Y. Lou, J. He, T. Bai, and M. Deng, “Mutual nearest neighbor contrast and hybrid prototype self-training for universal domain adaptation,” in Proc. AAAI, vol. 36, no. 6, 2022, pp. 6248–6257.
  • [49] H. Liu, J. Wang, and M. Long, “Cycle self-training for domain adaptation,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 34, 2021, pp. 22 968–22 981.
  • [50] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell, “Cycada: Cycle-consistent adversarial domain adaptation,” in Proc. ICML. PMLR, 2018, pp. 1989–1998.
  • [51] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan, “Unsupervised pixel-level domain adaptation with generative adversarial networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 3722–3731.
  • [52] H. Li, N. Dong, Z. Yu, D. Tao, and G. Qi, “Triple adversarial learning and multi-view imaginative reasoning for unsupervised domain adaptation person re-identification,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 5, pp. 2814–2830, 2021.
  • [53] Q. Tian, Y. Zhu, H. Sun, S. Chen, and H. Yin, “Unsupervised domain adaptation through dynamically aligning both the feature and label spaces,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 12, pp. 8562–8573, 2022.
  • [54] B. Zhang, T. Chen, B. Wang, X. Wu, L. Zhang, and J. Fan, “Densely semantic enhancement for domain adaptive region-free detectors,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 3, pp. 1339–1352, 2021.
  • [55] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 27, 2014.
  • [56] X. Gu, C. Zhang, Q. Shen, J. Han, P. Angelov, and P. Atkinson, “A self-training hierarchical prototype-based ensemble framework for remote sensing scene classification,” Information Fusion, vol. 80, pp. 179–204, 2022.
  • [57] Y. Ding, H. Fan, M. Xu, and Y. Yang, “Adaptive exploration for unsupervised person re-identification,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 16, no. 1, pp. 1–19, 2020.
  • [58] Z. Mei, P. Ye, H. Ye, B. Li, J. Guo, T. Chen, and W. Ouyang, “Automatic loss function search for adversarial unsupervised domain adaptation,” IEEE Trans. Circuits Syst. Video Technol., 2023.
  • [59] M. Kumar, B. Packer, and D. Koller, “Self-paced learning for latent variable models,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 23, 2010.
  • [60] A. Cardace, R. Spezialetti, P. Ramirez, S. Salti, and L. Di Stefano, “Self-distillation for unsupervised 3d domain adaptation,” in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), 2023, pp. 4166–4177.
  • [61] Y. Chen, Z. Wang, L. Zou, K. Chen, and K. Jia, “Quasi-balanced self-training on noise-aware synthesis of object point clouds for closing domain gap,” in Proc. Eur. Conf. Comput. Vis. (ECCV). Springer, 2022, pp. 728–745.
  • [62] X. Luo, S. Liu, K. Fu, M. Wang, and Z. Song, “A learnable self-supervised task for unsupervised domain adaptation on point clouds,” arXiv preprint arXiv:2104.05164, 2021.
  • [63] J. Sauder and B. Sievers, “Self-supervised deep learning on point clouds by reconstructing space,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 32, 2019.
  • [64] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil, “Model compression,” in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2006, pp. 535–541.
  • [65] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
  • [66] R. Zhang, Z. Guo, W. Zhang, K. Li, X. Miao, B. Cui, Y. Qiao, P. Gao, and H. Li, “Pointclip: Point cloud understanding by clip,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 8552–8562.
  • [67] L. Xue, M. Gao, C. Xing, R. Martín-Martín, J. Wu, C. Xiong, R. Xu, J. C. Niebles, and S. Savarese, “Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2023, pp. 1179–1189.
  • [68] T. Huang, B. Dong, Y. Yang, X. Huang, R. Lau, W. Ouyang, and W. Zuo, “Clip2point: Transfer clip to point cloud classification with image-depth pre-training,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2023, pp. 22 157–22 167.
  • [69] Q. Zhang, J. Hou, and Y. Qian, “Pointmcd: Boosting deep point cloud encoders via multi-view cross-modal distillation for 3d shape recognition,” IEEE Trans. Multimedia, 2023.
  • [70] Q. Zhang and J. Hou, “Pointvst: Self-supervised pre-training for 3d point clouds via view-specific point-to-image translation,” IEEE Transactions on Visualization and Computer Graphics, 2023.
  • [71] R. Zhang, L. Wang, Y. Qiao, P. Gao, and H. Li, “Learning 3d representations from 2d pre-trained models via image-to-point masked autoencoders,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2023, pp. 21 769–21 780.
  • [72] A. Hamdi, B. Ghanem, and M. Nießner, “Sparf: Large-scale learning of 3d sparse radiance fields from few input images,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2023, pp. 2930–2940.
  • [73] A. Hamdi, S. Giancola, and B. Ghanem, “Voint cloud: Multi-view point cloud representation for 3d understanding,” arXiv preprint arXiv:2111.15363, 2021.
  • [74] Z. Liu, X. Qi, and C. Fu, “3d-to-2d distillation for indoor scene parsing,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2021, pp. 4462–4472.
  • [75] Z. Wang, X. Yu, Y. Rao, J. Zhou, and J. Lu, “P2p: Tuning pre-trained image models for point cloud analysis with point-to-pixel prompting,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 35, 2022, pp. 14 388–14 402.
  • [76] C. Xu, S. Yang, T. Galanti, B. Wu, X. Yue, B. Zhai, W. Zhan, P. Vajda, K. Keutzer, and M. Tomizuka, “Image2point: 3d point-cloud understanding with 2d image pretrained models,” arXiv preprint arXiv:2106.04180, 2021.
  • [77] J. Kang, W. Jia, X. He, and K. Lam, “Point clouds are specialized images: A knowledge transfer approach for 3d understanding,” arXiv preprint arXiv:2307.15569, 2023.
  • [78] X. Huang, S. Li, W. Qu, T. He, Y. Zuo, and W. Ouyang, “Frozen clip model is efficient point cloud backbone,” arXiv preprint arXiv:2212.04098, 2022.
  • [79] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 770–778.
  • [80] Y. Gong, X. Yu, Y. Ding, X. Peng, J. Zhao, and Z. Han, “Effective fusion factor in fpn for tiny object detection,” in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), 2021, pp. 1160–1168.
  • [81] A. Gupta, J. Wu, J. Deng, and F. Li, “Siamese masked autoencoders,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 36, 2024.
  • [82] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” ArXiv, vol. 1412.6980, 2014.
  • [83] T. Zheng, C. Chen, J. Yuan, B. Li, and K. Ren, “Pointcloud saliency maps,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2019, pp. 1598–1606.
  • [84] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021, pp. 9650–9660.