Boosting Cross-Domain Point Classification via Distilling Relational Priors from 2D Transformers

Longkun Zou^∗, Wanru Zhu^∗, Ke Chen^🖂, Lihua Guo^🖂, Kailing Guo, Kui Jia, and Yaowei Wang

This work is supported in part by the Guangdong Pearl River Talent Program (Introduction of Young Talent) under Grant No. 2019QN01X246, the Guangdong Basic and Applied Basic Research Foundation under Grant No. 2023A1515011104 and the Major Key Project of Peng Cheng Laboratory under Grant No. PCL2023A08. (Longkun Zou and Wanru Zhu contributed equally to this work.) (Corresponding author: Ke Chen; Lihua Guo.) L. Zou, W. Zhu, L. Guo and K. Guo are with the School of Electronic and Information Engineering, South China University of Technology, Guangzhou, 510641, China. L. Zou is an intern at the Peng Cheng Laboratory, Shenzhen 518000, China. K. Chen and Y. Wang are with the Peng Cheng Laboratory, Shenzhen 518000, China. K. Jia is with the Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), Shenzhen 518000, China.

Abstract

The semantic pattern of an object point cloud is determined by the topological configuration of its local geometries. Learning discriminative representations can be challenging due to large shape variations of point sets in local regions and incomplete surfaces from a global perspective, and these difficulties become more severe in the context of unsupervised domain adaptation (UDA). Specifically, traditional 3D networks mainly focus on local geometric details and ignore the topological structure between local geometries, which greatly limits their cross-domain generalization. Recently, transformer-based models have achieved impressive performance gains in a range of image-based tasks, benefiting from their strong generalization capability and scalability, which stem from capturing long-range correlations across local patches. Inspired by such successes of visual transformers, we propose a novel Relational Priors Distillation (RPD) method to extract relational priors from transformers well-trained on massive images, which can significantly empower cross-domain representations with consistent topological priors of objects. To this end, we establish a parameter-frozen pre-trained transformer module shared between the 2D teacher and 3D student models, complemented by an online knowledge distillation strategy for semantically regularizing the 3D student model. Furthermore, we introduce a novel self-supervised task centered on reconstructing masked point cloud patches using corresponding masked multi-view image features, thereby enabling the model to incorporate 3D geometric information. Experiments on the PointDA-10 and the Sim-to-Real datasets verify that the proposed method consistently achieves state-of-the-art UDA performance for point cloud classification. The source code of this work is available at https://github.com/zou-longkun/RPD.git.

Index Terms:

unsupervised domain adaptation, point clouds, relational priors, cross-modal, knowledge distillation.

I Introduction

The point cloud is one of the popular 3D shape representations, with broad applications in robotics, drones, autonomous driving, etc. The semantic pattern of an object point cloud is determined by the topological configuration of its local geometries. Recent advances in point cloud semantic analysis [1, 2, 3, 4, 5, 6, 7] have been largely driven by synthetic point clouds generated from CAD models (such as those in ModelNet [8] and ShapeNet [9]), which typically have noise-free point-based surfaces in local regions and a complete topological structure. Real-world point cloud data generated from RGB-D scans captured by real-time depth sensors (such as ScanNet [10] and ScanObjectNN [11]) typically contain noise and occlusion, and therefore suffer from large shape variations of point sets in local regions and incomplete surfaces from a global perspective. Such geometric variations can cause performance degradation when testing the network on a domain different from the training one. More often, labels in the test domain may be unavailable due to high annotation costs, which is the situation we are interested in and can be formulated as the problem of unsupervised domain adaptation (UDA).

Figure 1: Illustration of the proposed relational prior distillation (RPD) framework. We leverage the relational priors of a pretrained 2D Transformer model to boost the 3D Transformer encoder by sharing a parameter-frozen pretrained Transformer module and employing an online knowledge distillation strategy as semantic regularization for the 3D student model. An ensemble of the knowledge from the two modalities can effectively improve the generalization of point cloud representations to close the domain gap.

Unsupervised domain adaptation on point clouds has recently attracted increasing attention [12, 13, 14, 15, 16, 17], starting from the pioneering PointDAN [12]. In general, these point-based UDA methods can be categorized into two groups of algorithms for bridging the domain gap: domain adversarial training based [12] and self-supervised learning based [14, 13, 16, 15]. The former employs domain adversarial training to explicitly enforce indistinguishable features between point clouds from different domains using domain discriminators. Its main ideas are borrowed from image-based UDA [18, 19, 20, 21, 22]; such training can be unstable and risks damaging the intrinsic discriminative structure of the target data in feature space, resulting in suboptimal adaptation.

The latter mechanism achieves implicit domain alignment by incorporating self-supervised regularization pretext tasks aimed at capturing domain-invariant geometric patterns alongside semantic representation learning. The underlying motivation is that well-designed self-supervised tasks shared across domains can facilitate the learning of features with similar properties, which typically have a certain degree of cross-domain invariance. A diverse set of self-supervised tasks has been proposed, such as rotation angle classification and deformation localization [14], deformation reconstruction [13], scaling-up-down prediction and 3D-2D-3D projection reconstruction [16], and global implicit fields learning [15]. The PDG [23] utilized the DGCNN [3] or the PointNet [1] to encode part-level features, which are used as a dictionary to describe other features from local parts with a linear weighting strategy. However, existing point-based UDA algorithms often prioritize feature alignment while overlooking the topological structure between local geometries, which greatly limits their cross-domain generalization capabilities.

Recently, transformer-based models have demonstrated remarkable success across various image-based tasks, following the “pretrain-and-finetune” paradigm, which can be attributed to their robust generalization capability and scalability, stemming from their ability to capture long-range correlations across local patches. Nonetheless, achieving proficiency in discerning topological relationships among local parts necessitates pre-training on extensive datasets. Mainstream point cloud networks, constrained by limited training data, resort to shallow architectures to avoid over-fitting, but this compromises their scalability and hampers their capacity to learn features that generalize robustly. Consequently, these networks struggle to effectively implement the “pretrain-and-finetune” paradigm and typically require training from scratch. While certain approaches, such as the PCT [24] and the Point Transformer [25], integrate the typical Transformer architecture into the 3D domain to deepen networks and enhance scalability, their efficacy remains contingent upon access to substantial labeled 3D data. In contrast, acquiring and annotating 2D data is comparatively straightforward, with vast datasets readily available online, numbering in the millions or even billions (e.g., the ImageNet [26], the COCO [27], the CLIP [28]). Leveraging these extensive 2D datasets, 2D transformer-based networks exhibit superior aptitude in capturing topological relationships among local parts. This prompts a pivotal question: Can we harness the abundant relational priors ingrained in pre-trained 2D Transformer-based models to bolster the generalization capabilities of 3D models and mitigate domain shift? An affirmative answer to this question would not only bridge the 2D and 3D modalities but also diminish the heavy reliance on expensive collection and annotation of 3D data for model pre-training.

To harness the rich relational priors ingrained in pre-trained 2D Transformer-based models, we propose a simple yet effective knowledge distillation scheme with the standard teacher-student distillation workflow, whose concept is depicted in Fig. 1. First, the teacher and student models share a standard Transformer module in which the parameters of most blocks are frozen and only the last few blocks are fine-tuned. Moreover, we adopt an online knowledge distillation strategy, alternating between training the teacher and student models throughout the training process. We employ the KL-divergence loss function to align the predicted logits of the teacher and student models, enhancing cross-modal knowledge transfer and serving as semantic regularization for the 3D student model. Additionally, recognizing that sole reliance on 2D knowledge might inadequately capture 3D geometric information, we introduce a self-supervised task of reconstructing masked point clouds from projected multi-view images. In this way, the model’s ability to capture geometric information is enhanced. During inference, we ensemble predictions from both modalities. Our method achieves state-of-the-art performance on two public benchmark datasets (i.e., PointDA-10 [12] and Sim-to-Real [29]), which validates its effectiveness. In summary, our approach bridges the gap between 2D and 3D domains by leveraging the strength of Transformer-based attention mechanisms, which excel in modeling the relationships between local parts. This not only improves the robustness and generalization of 3D networks but also provides a practical solution to the data scarcity challenge in the 3D domain. Our main contributions in this study are as follows:

• This paper proposes a novel scheme for unsupervised domain adaptation on object point cloud classification, which bridges the domain gap by distilling relational priors from well-learned 2D transformers into 3D domains to enhance 3D feature representation.

• Technically, we propose a simple but effective cross-modal knowledge transfer method, in which a parameter-frozen pretrained transformer module is shared between the 2D teacher and 3D student models and an online knowledge distillation strategy is adopted as a semantic regularization for the 3D student model.

• Meanwhile, we design a novel self-supervision task that reconstructs masked point cloud patches with corresponding masked multi-view image features to enhance the model’s ability to capture geometric information.

• Experiments on two public UDA benchmarks verify that the proposed method consistently achieves the best performance of UDA for point cloud classification.

II Related Works

II-A Deep Networks for Point Clouds

In recent years, deep neural network architectures for point clouds have been extensively studied. Existing methods can be roughly divided into three major categories: view-based [33, 34, 35, 36, 37, 38, 39], voxel-based [40, 41, 42], and point-based point cloud processing methods [3, 6, 1, 2].

View-based methods project the point cloud into images of multiple views and process them with various variants of 2D CNNs. The pioneering work MVCNN [34] consumes the multi-view images rendered from multiple virtual camera poses and obtains global shape features through cross-view max-pooling. GVCNN [35] proposes a three-level hierarchical correlation modeling framework, which adaptively groups multi-view feature embeddings into separate clusters. RotationNet [43] treats viewpoint indices as learnable latent variables and jointly estimates object poses and semantic categories. MVTN [37] introduces differentiable rendering techniques to implement adaptive regression of optimal camera poses in an end-to-end trainable manner. SimpleView [38] simply projects raw points onto image planes and sets their pixel values according to the vertical distance. MvNet [39] proposes a multi-view vision-prompt to bridge the gap between 3D data and 2D pretrained models. Although view-based methods have shown dominant performance in various shape recognition tasks [2], [25], [26], acquiring views requires costly shape rendering and inevitably loses the internal geometric structure and spatial information.

Voxel-based methods first preprocess a given point cloud into voxels. Then, a voxel-based convolutional neural network is applied to extract features. Such methods can easily overcome point cloud density variations but are hampered by training costs that grow cubically with voxel resolution. Typical works include VoxelNet [42] and Minkowski Engine [40]. These methods designed octree-based convolution and sparse convolution to extract local representations of point clouds, effectively reducing the consumption of GPU memory and computing costs.

Point-based methods, which directly take point clouds as input and process them in an unstructured format, have attracted increasing attention because they avoid information loss and train efficiently. PointNet [1] is a pioneering work, which proposes to model the permutation invariance of points by max-pooling point-wise features. PointNet++ [2] improves PointNet by further gathering local features in a hierarchical way. DGCNN [3] considers a point cloud as a graph and dynamically updates the graph to aggregate features. Recently, Transformer [44] based methods have been proposed as a new paradigm for processing point clouds [24, 45, 25, 46].

In this work, we combine point-based methods and view-based methods to achieve cross-modal information fusion.

II-B Unsupervised Domain Adaptation

Unsupervised domain adaptation (UDA) has been extensively explored on images [19, 21, 20, 47, 48, 49, 50, 51, 22]; it aims to mitigate the domain gap between a source domain containing labeled data and a target domain containing unlabeled data. These methods can generally be grouped into three categories. 1) Adversarial training [19, 21, 20, 52, 53, 54], playing minimax games at the domain level between a discriminator and a generator. 2) Style transfer [50], wherein the translation from the source domain to the target domain is directly learned using Generative Adversarial Networks [55]. 3) Self-training with pseudo-labels [47, 49, 48, 56, 57], where partial supervision is provided to learn the distributions of the target domain. Despite the extensive research on UDA for 2D images, UDA for 3D point clouds is still in its nascent stages, with some methods borrowed from image-based UDA. For instance, PointDAN [12] is a pioneering work addressing UDA in point cloud classification by explicitly aligning local and global features across domains through domain adversarial training. ALSDA [58] introduces an automated loss function search method to address the issues of domain discriminator degeneration and cross-domain semantic mismatches in adversarial domain adaptation. GAST [14] employs a self-training method equipped with self-paced learning [59] for point cloud UDA. GLRV [16] proposes a reliable voting-based method for pseudo label generation, while SD [60] employs Graph Neural Networks (GNNs) [55] to refine pseudo-labels online during self-training. Chen et al. [61] propose quasi-balanced self-training, dynamically adjusting the threshold to balance the proportion of pseudo-label samples for each category, thereby improving the quality of pseudo-labels.
In addition to the mainstream methods of UDA for 2D images, recent works on UDA for point clouds primarily focus on designing suitable self-supervised pretext tasks to facilitate the learning of domain-invariant features. For example, GAST [14] proposes rotation classification and distortion localization as self-supervised tasks to align features at both local and global levels. DefRec [13] introduces deformation-reconstruction, and Learnable-Defrec [62] extends it into a learnable deformation task to further enhance performance. RS [63] shuffles and restores the input point cloud to improve discrimination. GLRV [16] proposes two self-supervised auxiliary tasks: scaling-up-down prediction and 3D-2D-3D projection reconstruction, along with a reliable pseudo-label voting strategy to further enhance domain adaptation. GAI [15] employs a self-supervised task of learning geometry-aware global implicit representations for domain adaptation on point clouds. Different from the above single-modal self-supervised methods, we propose a cross-modal self-supervised task that uses 2D images to reconstruct 3D point clouds, thereby empowering the network with the ability to extract 3D geometric information from 2D images.

II-C 2D-to-3D Knowledge Transferring

The concept of model compression was originally introduced by Bucila et al. [64], with the aim of transferring knowledge from a large model to a smaller one without significant performance degradation. Hinton et al. [65] systematically summarized existing knowledge distillation techniques, showcasing the effectiveness of the student-teacher strategy and response-based knowledge distillation. Recently, the transfer of 2D knowledge to 3D using view-based methods has garnered considerable attention among researchers. For instance, PointCLIP [66] directly utilized the pretrained CLIP [28] model for zero-shot point cloud classification via image projection. The subsequent version, PointCLIP V2 [30], refined the projection strategy, resulting in a significant performance boost. ULIP [67, 68] employs large multimodal models to generate detailed language descriptions of 3D objects, addressing limitations in existing 3D object datasets regarding the quality and scalability of language descriptions. PointCMD [69] explores the transfer of cross-modal knowledge from multi-view 2D visual modeling to 3D geometric modeling to facilitate the understanding of the shape of the 3D point cloud. PointVST [70] introduces a self-supervised task that utilizes projected multi-view 2D images as self-supervised signals, enhancing the representation capabilities of point-based networks. I2P-MAE [71] proposes a pre-training framework that leverages 2D pre-trained models to guide the learning of 3D representations. More advanced methods exploit point-pixel correspondences [72, 73, 74, 75] between point clouds and multi-view projected images. Image2Point [76] presents a kernel inflation technique that expands kernels of a 2D CNN into 3D kernels and applies them to voxel-based point cloud understanding. There is a growing interest in utilizing pre-trained Transformers for point cloud processing. 
PCExpert [77] and EPCL [78] directly train high-quality point cloud models using pre-trained Transformer models as encoders. Although the Transformer pre-trained on large-scale 2D image data possesses powerful semantic representation capabilities, it lacks the ability to capture 3D information. Therefore, in this work, we follow the approach of PCExpert and EPCL, maintaining a Transformer pretrained on ImageNet [26] as an encoder for 3D point clouds, while also designing a self-supervised training task to reconstruct masked 3D point clouds using masked 2D images.

III Proposed Methods

This section introduces the overall working mechanism and specific technical implementations of the proposed RPD. We first introduce and formulate the unsupervised domain adaptation problem on point clouds in Sec. III-A. Then we present general formulations of deep image encoders and deep point encoders in Sec. III-B and Sec. III-C, respectively, based on which we construct a unified online cross-modal knowledge distillation workflow in Sec. III-D. Furthermore, we introduce a novel self-supervised task to reconstruct masked point cloud patches with masked multi-view images in Sec. III-E. After that, the self-training strategy is described in detail in Sec. III-F. Finally, we summarize the overall loss function and training strategy in Sec. III-G.

III-A Problem Definition

Given a source domain $\mathcal{S}=\{\mathcal{P}_{i}^{s},\mathcal{I}_{i}^{s},y_{i}^{s}\}_{i=1}^{n_{s}}$ with $n_{s}$ labeled synthetic samples and a target domain $\mathcal{T}=\{\mathcal{P}_{i}^{t},\mathcal{I}_{i}^{t}\}_{i=1}^{n_{t}}$ with $n_{t}$ unlabeled real samples, a semantic label space $\mathcal{Y}$ is shared between $\mathcal{S}$ and $\mathcal{T}$ (i.e. $\mathcal{Y}^{s}=\mathcal{Y}^{t}$ ), where $\mathcal{P}\in\mathbb{R}^{N\times 3}$ represents a point cloud consisting of $N$ three-dimensional spatial coordinate points $(x,y,z)$ , $\mathcal{I}\in\mathbb{R}^{V\times W\times H}$ represents $V$ views of 2D images projected from the points, each with a resolution of $W\times H$ , and the superscripts $s$ and $t$ denote the source and target domains, respectively. Letting the input space be $\mathcal{X}=\{\mathcal{P},\mathcal{I}\}$ , our goal is to learn a domain-adapted mapping function $\Phi:\mathcal{X}\rightarrow\mathcal{Y}$ that can correctly classify target samples given access to the labeled source domain and the unlabeled target domain. The mapping function $\Phi=\Phi^{\mathcal{P}}\oplus\Phi^{\mathcal{I}}$ can be formulated as a cascade of a feature encoder $\Phi_{\text{fea}}:\mathcal{X}\rightarrow\mathbb{R}^{d}$ for any input $\{\mathcal{P},\mathcal{I}\}$ and a classifier $\Phi_{\text{cls}}:\mathbb{R}^{d}\rightarrow[0,1]^{c}$ typically using fully-connected layers as follows:

		$\displaystyle\Phi^{\mathcal{P}}(\mathcal{P})=\Phi_{\text{cls}}^{\mathcal{P}}(\bm{z}^{\mathcal{P}})\circ\Phi_{\text{fea}}^{\mathcal{P}}(\mathcal{P}),$		(1)
		$\displaystyle\Phi^{\mathcal{I}}(\mathcal{I})=\Phi_{\text{cls}}^{\mathcal{I}}(\bm{z}^{\mathcal{I}})\circ\Phi_{\text{fea}}^{\mathcal{I}}(\mathcal{I}),$
		$\displaystyle{\rm logit}=\Phi^{\mathcal{P}}(\mathcal{P})\oplus\Phi^{\mathcal{I}}(\mathcal{I}),$

where $\oplus$ denotes cross-modal ensemble, $d$ denotes the dimension of the feature representation output $\bm{z}\in\mathbb{R}^{d}$ of $\Phi_{\text{fea}}(\cdot)$ , $c$ denotes the number of shared classes and the superscripts $\mathcal{P}$ and $\mathcal{I}$ denote the point and image modalities, respectively.
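The exact form of the cross-modal ensemble $\oplus$ is not spelled out at this point in the text. As one plausible reading, the minimal Python sketch below assumes a weighted average of the class probabilities produced by the two classifiers; the function name `ensemble_predict` and the parameter `point_weight` are illustrative assumptions, not the authors' implementation.

```python
def ensemble_predict(point_probs, image_probs, point_weight=0.5):
    """Fuse per-class probabilities from the point and image branches.

    ASSUMPTION: the ensemble is a convex combination of the two probability
    vectors; the source text only states that the predictions are ensembled.
    """
    if len(point_probs) != len(image_probs):
        raise ValueError("both branches must predict over the same classes")
    fused = [
        point_weight * p + (1.0 - point_weight) * q
        for p, q in zip(point_probs, image_probs)
    ]
    # The predicted label is the index of the largest fused probability.
    predicted = max(range(len(fused)), key=fused.__getitem__)
    return fused, predicted


# Toy example with c = 3 shared classes.
fused, label = ensemble_predict([0.6, 0.3, 0.1], [0.2, 0.7, 0.1])
```

With equal weights, a confident image branch can overturn a less confident point branch, which is the kind of complementary behaviour an ensemble of two modalities is meant to exploit.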

III-B Teacher Network for Image Modeling

Owing to the maturity of deep convolutional architectures, we can directly resort to powerful 2D models of different architectures (ResNet [79], ViT [32], CLIP [28], [80]) for image feature fusion and extraction. Benefiting from the common practice of large-scale pretraining (e.g., on ImageNet [26] and Conceptual Captions [28]), the resulting 2D deep feature encoder demonstrates strong generalization ability when fine-tuned on downstream visual recognition tasks. This excellent property makes the pre-trained 2D model suitable as a teacher model for image feature extraction. To align the input modality for 2D models, we project the input point cloud onto multiple image planes, and then encode them into multi-view 2D representations. Specifically, given a point cloud $\mathcal{P}$ , we first project it into multiple single-channel depth maps $\{\mathcal{I}_{v}\}_{v=1}^{V}\in\mathbb{R}^{V\times H\times W}$ via the Realistic Projection Pipeline introduced by PointCLIP V2 [30], where $V$ and $(H,W)$ denote the number of view-images and image size, respectively. Then the teacher image encoder takes the multi-view images $\{\mathcal{I}_{v}\}_{v=1}^{V}$ in parallel as input to extract image features.

In this paper, we employ a MAE [31] pre-trained ViT [32] to encode image features. Formally, given a single-channel depth image $\mathcal{I}_{v}\in\mathbb{R}^{H\times W}$ , the ViT divides the image into a sequence of flattened local image patches $\{\bm{x}^{\mathcal{I}}_{v,i}\}_{i=1}^{N_{\mathcal{I}}}\in\mathbb{R}^{N_{\mathcal{I}}\times P^{2}}$ and uses a tokenizer $\Phi_{\text{emb}}^{\mathcal{I}}$ (i.e. Conv2D) to convert these patches into a sequence of 1-D visual token embeddings:

		$\displaystyle\{\bm{z}_{v,i}^{\mathcal{I}}\}_{i=1}^{N_{\mathcal{I}}}=\Phi_{\text{emb}}^{\mathcal{I}}\big(\{\bm{x}^{\mathcal{I}}_{v,i}\}_{i=1}^{N_{\mathcal{I}}}\big),$		(2)

where $\{\bm{z}_{v,i}^{\mathcal{I}}\}_{i=1}^{N_{\mathcal{I}}}\in\mathbb{R}^{N_{\mathcal{I}}\times D_{1}}$ , $N_{\mathcal{I}}=HW/P^{2}$ denotes the number of tokens, $(P,P)$ denotes the resolution of image patches, and $D_{1}$ is the dimension of each image token embedding. A learnable class token embedding $\bm{z}_{\text{cls}}^{\mathcal{I}}$ is prepended to the sequence of the patch embeddings. Then, the final image input representation $\mathcal{H}_{v}^{\mathcal{I}}\in\mathbb{R}^{(N_{\mathcal{I}}+1)\times D_{1}}$ is calculated by summing the image patch embeddings with the image position embeddings $\mathcal{Z}_{\text{pos},v}^{\mathcal{I}}\in\mathbb{R}^{(N_{\mathcal{I}}+1)\times D_{1}}$ :

		$\displaystyle\mathcal{H}_{v}^{\mathcal{I}}=[\bm{z}_{\text{cls}}^{\mathcal{I}},\bm{z}_{v,1}^{\mathcal{I}},...,\bm{z}_{v,N_{\mathcal{I}}}^{\mathcal{I}}]+\mathcal{Z}_{\text{pos},v}^{\mathcal{I}}$		(3)

Formally, the behaviours of the 2D teacher transformer module $\mathcal{M}_{t}$ can be formulated as follows:

		$\displaystyle\{\widehat{\mathcal{Z}}_{v}^{\mathcal{I}}\}_{v=1}^{V}=\mathcal{M}_{t}\big(\{\mathcal{H}_{v}^{\mathcal{I}}\}_{v=1}^{V}\big),$		(4)
		$\displaystyle\bm{\hat{z}}^{\mathcal{I}}=Concat(\{\bm{\hat{z}}_{v,0}^{\mathcal{I}}\}_{v=1}^{V}),$
		$\displaystyle\bm{z}^{\mathcal{I}}=Proj(\bm{\hat{z}}^{\mathcal{I}}),$

where $\widehat{\mathcal{Z}}_{v}^{\mathcal{I}}=\{\bm{\hat{z}}_{v,i}^{\mathcal{I}}\}_{i=0}^{N_{\mathcal{I}}}\in\mathbb{R}^{(N_{\mathcal{I}}+1)\times D_{2}}$ with subscript $v$ represents a set of view-specific image token features extracted from image $\mathcal{I}_{v}$ , $\bm{\hat{z}}_{v,0}^{\mathcal{I}}$ denotes a view-specific class token feature, $\bm{\hat{z}}^{\mathcal{I}}\in\mathbb{R}^{VD_{2}}$ denotes the concatenation of all view-specific class token features, $Proj$ denotes a projector based on a multi-layer perceptron (MLP) with three fully connected layers, and $\bm{z}^{\mathcal{I}}\in\mathbb{R}^{d}$ denotes the final feature representation of the image modality input. By default $V=10,P=16,H=W=224,N_{\mathcal{I}}=196,D_{1}=768,D_{2}=512$ .
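To make the token bookkeeping above concrete, the following minimal Python sketch splits an $H\times W$ depth map into flattened $P\times P$ patches and checks that the default setting yields $N_{\mathcal{I}}=HW/P^{2}=196$ patches, i.e. 197 tokens once the class token is prepended. It illustrates only the patch-splitting step; the learned Conv2D tokenizer and the position embeddings are omitted, and the function name `patchify` is an illustrative assumption.

```python
def patchify(image, patch_size):
    """Split an H x W image (a list of rows) into flattened P x P patches.

    Patches are returned in row-major order, each flattened to P*P values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Default setting from the text: H = W = 224 and P = 16.
H = W = 224
P = 16
depth_map = [[0.0] * W for _ in range(H)]  # placeholder single-channel depth image
patches = patchify(depth_map, P)
num_tokens = len(patches)          # N_I = H * W / P**2
sequence_length = num_tokens + 1   # one class token is prepended
```

Each of the $V$ views is tokenized independently in this way, so the per-view sequence length does not depend on the number of views.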

III-C Student Network for 3D Point Cloud Modeling

Collecting and labeling 3D shape models is costly and time-consuming, so the 3D community still lacks large-scale and richly-annotated datasets comparable to those in the 2D field (e.g. [26, 27]). Limited by the insufficiency of training data, mainstream point cloud networks (e.g. [4, 1, 2, 3]) keep their parameter counts small to alleviate overfitting. This makes these point cloud networks poorly scalable and unsuitable for “pretrain-and-finetune”. We believe that Transformer-based models are inherently well-suited for learning robust semantic patterns in point clouds due to their ability to capture the topological configurations of local geometries. Before the standard transformer was applied to the point cloud field, some transformer layers ([24, 25]) were specifically designed for point cloud processing. Pioneered by PointBERT [45], the standard transformer has since been applied to point cloud tasks.

Following [45], we sample $N_{\mathcal{P}}$ centroids using Furthest Point Sampling (FPS). To each of these centroids, we assign $k$ neighbouring points by conducting a $k$ -Nearest Neighbour (KNN) search. Thereby, we obtain $N_{\mathcal{P}}$ local geometric patches $\{\bm{x}^{\mathcal{P}}_{i}\}_{i=1}^{N_{\mathcal{P}}}\in\mathbb{R}^{N_{\mathcal{P}}\times(k+1)\times 3}$ , where each geometric patch $\bm{x}^{\mathcal{P}}_{i}$ consists of a centroid $\bm{x}^{\mathcal{P}}_{i,0}$ and its $k$ neighboring points $\{\bm{x}^{\mathcal{P}}_{i,j}\}_{j=1}^{k}$ , i.e. $\bm{x}^{\mathcal{P}}_{i}=\{\bm{x}^{\mathcal{P}}_{i,j}\}_{j=0}^{k}$ . These patches are subsequently fed into the tokenizer $\Phi_{\text{emb}}^{\mathcal{P}}$ (mini-DGCNN [3]) to obtain patch token embeddings:

		$\displaystyle\{\bm{z}_{i}^{\mathcal{P}}\}_{i=1}^{N_{\mathcal{P}}}=\Phi_{\text{emb}}^{\mathcal{P}}\big(\{\bm{x}^{\mathcal{P}}_{i}\}_{i=1}^{N_{\mathcal{P}}}\big),$		(5)

where $\{\bm{z}_{i}^{\mathcal{P}}\}_{i=1}^{N_{\mathcal{P}}}\in\mathbb{R}^{N_{\mathcal{P}}\times D_{1}}$ , $N_{\mathcal{P}}$ denotes the number of geometric tokens and $D_{1}$ denotes the feature dimension. Similarly, a learnable class token embedding $\bm{z}_{\text{cls}}^{\mathcal{P}}$ is prepended to the sequence of the patch embeddings. Then, the final point cloud input representation $\mathcal{H}^{\mathcal{P}}\in\mathbb{R}^{(N_{\mathcal{P}}+1)\times D_{1}}$ is calculated by summing the geometric patch embeddings with the position embeddings $\mathcal{Z}_{\text{pos}}^{\mathcal{P}}\in\mathbb{R}^{(N_{\mathcal{P}}+1)\times D_{1}}$ :

		$\displaystyle\mathcal{H}^{\mathcal{P}}=[\bm{z}_{\text{cls}}^{\mathcal{P}},\bm{z}_{1}^{\mathcal{P}},...,\bm{z}_{N_{\mathcal{P}}}^{\mathcal{P}}]+\mathcal{Z}_{\text{pos}}^{\mathcal{P}}$		(6)

Formally, the 3D student transformer module $\mathcal{M}_{s}$ consumes $\mathcal{H}^{\mathcal{P}}$ and outputs the high-dimensional token features $\widehat{\mathcal{Z}}^{\mathcal{P}}$ , which can be described as:

		$\displaystyle\widehat{\mathcal{Z}}^{\mathcal{P}}=\mathcal{M}_{s}\big(\mathcal{H}^{\mathcal{P}}\big),$		(7)
		$\displaystyle\bm{z}^{\mathcal{P}}=Proj(\bm{\hat{z}}_{0}^{\mathcal{P}}),$

where $\widehat{\mathcal{Z}}^{\mathcal{P}}=\{\bm{\hat{z}}_{i}^{\mathcal{P}}\}_{i=0}^{N_{\mathcal{P}}}\in\mathbb{R}^{(N_{\mathcal{P}}+1)\times D_{2}}$ denotes the embedded point cloud token features, $\bm{\hat{z}}_{0}^{\mathcal{P}}$ denotes the embedded class token feature, $Proj$ is a three-layer MLP, and $\bm{z}^{\mathcal{P}}\in\mathbb{R}^{d}$ denotes the final feature representation of the point cloud modality input. By default $N_{\mathcal{P}}=27,k=128,D_{1}=768,D_{2}=512$ .
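The FPS-plus-KNN grouping that produces the geometric patches can be sketched in a few lines of plain Python. This is a minimal, unoptimized illustration under stated assumptions (Euclidean distance, a fixed starting index for FPS, and a brute-force neighbour search); the function names are illustrative and the mini-DGCNN tokenizer that follows the grouping is not shown.

```python
import math
import random


def farthest_point_sampling(points, num_centroids, start_index=0):
    """Greedily pick indices of points that are far apart from each other."""
    selected = [start_index]
    # Distance from every point to its nearest already-selected centroid.
    min_dist = [math.dist(p, points[start_index]) for p in points]
    while len(selected) < num_centroids:
        next_index = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(next_index)
        for i, p in enumerate(points):
            d = math.dist(p, points[next_index])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


def group_patches(points, num_centroids, k):
    """Return num_centroids patches, each a centroid followed by its k nearest points."""
    patches = []
    for cid in farthest_point_sampling(points, num_centroids):
        order = sorted(
            range(len(points)),
            key=lambda i: math.dist(points[i], points[cid]),
        )
        neighbours = [i for i in order if i != cid][:k]
        # Index 0 of each patch is the centroid, matching x_{i,0} in the text.
        patches.append([points[cid]] + [points[i] for i in neighbours])
    return patches


# Small toy cloud; the text's defaults are N_P = 27 centroids and k = 128.
random.seed(0)
points = [tuple(random.uniform(-1.0, 1.0) for _ in range(3)) for _ in range(256)]
patches = group_patches(points, num_centroids=8, k=16)
```

Each patch has $k+1$ points, so stacking the patches gives the $N_{\mathcal{P}}\times(k+1)\times 3$ layout described above. Practical implementations typically run both steps on the GPU in batched form.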

III-D Online Cross-Modal Knowledge Distillation

Here, we aim to explore how the knowledge from pre-trained 2D Transformer models can be utilized for 3D feature representation learning. On the one hand, the 2D teacher model pre-trained on large-scale datasets (e.g. ImageNet [26]) has a strong capability to learn high-quality representations, i.e. robust and generalizable features, stemming from its ability to capture long-range correlations across local patches. This prior knowledge of modeling the relationships between local parts is ideal for guiding 3D models to capture the topology of local geometries, eliminating the need for pre-training on large 3D geometry datasets. On the other hand, the transformer modules of the teacher model ( $\mathcal{M}_{t}$ ) and the student model ( $\mathcal{M}_{s}$ ) are structurally identical, consisting of a series of layer normalization (LN), multi-head self-attention (MSA) and multi-layer perceptron (MLP) layers. The only difference lies in the tokenizer during feature extraction. Therefore, distilling relational priors from a 2D pre-trained model to a 3D model is highly feasible without requiring additional complex designs.

To harness the relational priors ingrained in the pre-trained 2D teacher model for 3D representation learning, we propose a strategy of parameter sharing and online knowledge distillation for 2D-to-3D knowledge transfer. First, we share a parameter-frozen pre-trained transformer module between the 2D teacher model ( $\mathcal{M}_{t}$ ) and the 3D student model ( $\mathcal{M}_{s}$ ), while keeping the image tokenizer parameters ( $\Phi_{\text{emb}}^{\mathcal{I}}$ ) in the 2D teacher model frozen during training. Second, we distill the teacher model’s semantic knowledge into the student model by imposing the following cross-modal alignment constraint:

\displaystyle\mathcal{L}_{\text{kd}}=D_{\text{KL}}\big(\Phi_{\text{cls}}^{\mathcal{P}}(\bm{z}^{\mathcal{P}})\,\|\,\Phi_{\text{cls}}^{\mathcal{I}}(\bm{z}^{\mathcal{I}})\big),

(8)

where $D_{\text{KL}}$ denotes the KL-divergence loss function, and $\Phi_{\text{cls}}^{\mathcal{P}}$ and $\Phi_{\text{cls}}^{\mathcal{I}}$ represent the classifiers of the 3D student model and the 2D teacher model, respectively. More details about the online distillation process are given in Algorithm 1.
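As a minimal sketch of Eq. (8), the snippet below computes the KL divergence between the student and teacher class distributions for a single sample in pure Python. The function names (`softmax`, `kd_loss`), the smoothing constant `eps` and the example logits are illustrative assumptions rather than the released implementation, which would operate on batched tensors.

```python
import math

def softmax(logits):
    """Convert a list of logits into class probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, eps=1e-12):
    """KL(student || teacher), following the argument order written in Eq. (8)."""
    p = softmax(student_logits)   # student (point cloud) prediction
    q = softmax(teacher_logits)   # teacher (image) prediction
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical logits for a 3-class toy problem.
same = kd_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])   # identical predictions
diff = kd_loss([2.0, 0.5, -1.0], [0.0, 1.5, 0.5])    # disagreeing predictions
print(same, diff)
```

The loss vanishes when both branches predict the same distribution and grows as the student departs from the teacher, which is the alignment behaviour the constraint is meant to enforce.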

Algorithm 1: Online Distillation Process

Input: labeled source data $\mathcal{S}=\{\mathcal{P}_{i}^{s},\mathcal{I}_{i}^{s},y_{i}^{s}\}_{i=1}^{n_{s}}$; unlabeled target data $\mathcal{T}=\{\mathcal{P}_{i}^{t},\mathcal{I}_{i}^{t}\}_{i=1}^{n_{t}}$; student network $\Phi^{\mathcal{P}}(\mathcal{P})=\Phi_{\text{cls}}^{\mathcal{P}}(\bm{z}^{\mathcal{P}})\circ\Phi_{\text{fea}}^{\mathcal{P}}(\mathcal{P})$; teacher network $\Phi^{\mathcal{I}}(\mathcal{I})=\Phi_{\text{cls}}^{\mathcal{I}}(\bm{z}^{\mathcal{I}})\circ\Phi_{\text{fea}}^{\mathcal{I}}(\mathcal{I})$; decoder $\Phi_{\text{dec}}^{\mathcal{P}}$; number of epochs $E$.

Output: $\Phi^{\mathcal{P}}$ and $\Phi^{\mathcal{I}}$.

Initialization: initialize $\mathcal{M}_{t}$ and $\mathcal{M}_{s}$ with the pre-trained ViT and fix the parameters of the first nine blocks.

for $e\leftarrow 1$ to $E$ do
    for $(\mathcal{P}_{i}^{s},\mathcal{I}_{i}^{s},y_{i}^{s}),(\mathcal{P}_{i}^{t},\mathcal{I}_{i}^{t})$ in $(\mathcal{S},\mathcal{T})$ do
        if $e\ \%\ 10<5$ then
            $\min_{\Phi^{\mathcal{P}},\Phi^{\mathcal{I}}}\mathcal{L}_{\text{cls}}^{s}$ with $(\mathcal{P}_{i}^{s},y_{i}^{s})$;
            $\min_{\Phi^{\mathcal{P}},\Phi^{\mathcal{I}}}\mathcal{L}_{\text{kd}}$ with $\mathcal{P}_{i}^{s}$ and $\mathcal{P}_{i}^{t}$;
            $\min_{\Phi_{\text{dec}}^{\mathcal{P}}}\mathcal{L}_{\text{emd}}$ with $\mathcal{P}_{i}^{s}$ and $\mathcal{P}_{i}^{t}$;
        else
            $\min_{\Phi^{\mathcal{P}}}\mathcal{L}_{\text{cls}}^{s}$ with $(\mathcal{P}_{i}^{s},y_{i}^{s})$;
            $\min_{\Phi^{\mathcal{P}}}\mathcal{L}_{\text{kd}}$ with $\mathcal{P}_{i}^{s}$ and $\mathcal{P}_{i}^{t}$;
            $\min_{\Phi_{\text{dec}}^{\mathcal{P}}}\mathcal{L}_{\text{emd}}$ with $\mathcal{P}_{i}^{s}$ and $\mathcal{P}_{i}^{t}$;
        end if
    end for
end for
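The alternating update schedule of Algorithm 1 can be sketched as a small helper that reports which networks are optimised in a given epoch. The function name `updated_networks` and its string labels are hypothetical; only the branch condition $e\ \%\ 10<5$ is taken from the algorithm.

```python
def updated_networks(epoch, period=10, joint_epochs=5):
    """Return the networks whose parameters are optimised in a given epoch.

    Mirrors the branch `e % 10 < 5` of Algorithm 1: when the condition holds,
    both the student and the teacher are updated; otherwise only the student
    is updated. The reconstruction decoder is updated in both branches.
    """
    if epoch % period < joint_epochs:
        return {"student", "teacher", "decoder"}
    return {"student", "decoder"}

# Epochs are 1-indexed, as in Algorithm 1.
for e in range(1, 11):
    print(e, sorted(updated_networks(e)))
```

Under this reading, the teacher is updated in epochs 1 to 4 and again at epoch 10 of each ten-epoch cycle, while epochs 5 to 9 update the student and the decoder only.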

III-E Masked Point Cloud Reconstruction

Transferring knowledge from 2D pre-trained models alone leaves the learned 3D feature representation unaware of 3D geometric information. Motivated by SiamMAE [81], we design a self-supervised task that reconstructs masked point cloud patches from the corresponding masked multi-view image features to capture the 3D geometric information of point clouds. Specifically, given a sequence of $N_{\mathcal{P}}$ token embeddings of point cloud local patches $\{\bm{\hat{z}}_{i}^{\mathcal{P}}\}_{i=1}^{N_{\mathcal{P}}}$ , we randomly mask these token embeddings with a high mask ratio ( $85\%$ ). A set of learnable mask embeddings $\{\bm{m}_{i}^{\mathcal{P}}\}_{i=1}^{M_{\mathcal{P}}}$ , where $M_{\mathcal{P}}=\lfloor 0.85\times N_{\mathcal{P}}\rfloor$ , is initialized from the Gaussian distribution $N(0,0.02)$ , replaces the masked positions and serves as the query input of the joint decoder $\Phi_{\text{dec}}$ . The unmasked token embeddings of point cloud patches are denoted as $\{\bm{r}_{i}^{\mathcal{P}}\}_{i=1}^{R_{\mathcal{P}}}$ , where $R_{\mathcal{P}}=N_{\mathcal{P}}-M_{\mathcal{P}}$ . Then, the corresponding $N_{\mathcal{I}}\times V$ image token embeddings $\{\widehat{\mathcal{Z}}_{v}^{\mathcal{I}}-\widehat{\mathcal{Z}}_{v,0}^{\mathcal{I}}\}_{v=1}^{V}$ are set as the key and value inputs of the joint decoder to reconstruct the masked point cloud patches, where $\widehat{\mathcal{Z}}_{v,0}^{\mathcal{I}}=\{\bm{\hat{z}}_{v,0}^{\mathcal{I}}\}$ denotes the view-specific class token feature. Considering redundant information and computational efficiency, we randomly drop the image token embeddings with a high drop ratio ( $85\%$ ); the remaining image token embeddings are represented as $\{\bm{r}_{i}^{\mathcal{I}}\}_{i=1}^{R_{\mathcal{I}}}$ , where $R_{\mathcal{I}}=\lfloor 0.15\times N_{\mathcal{I}}\times V\rfloor$ . We believe that this asymmetric masking/dropping creates a challenging self-supervised learning task while encouraging the network to learn 3D geometric information.
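The token bookkeeping described above can be sketched as follows. The helper `random_split` is hypothetical, and the values $N_{\mathcal{I}}=196$ and $V=6$ are assumptions chosen only for illustration (196 corresponds to a 14×14 patch grid); only $N_{\mathcal{P}}=27$ and the 85% ratios come from the text.

```python
import math
import random

def random_split(num_tokens, num_selected, rng):
    """Return (selected, remaining) index lists from a uniform random split."""
    selected = set(rng.sample(range(num_tokens), num_selected))
    remaining = [i for i in range(num_tokens) if i not in selected]
    return sorted(selected), remaining

rng = random.Random(0)  # fixed seed only to make the sketch reproducible

# Point cloud branch: mask M_P = floor(0.85 * N_P) of the N_P patch tokens.
n_p = 27
m_p = math.floor(0.85 * n_p)
masked_idx, visible_idx = random_split(n_p, m_p, rng)

# Image branch: keep R_I = floor(0.15 * N_I * V) of the N_I * V patch tokens.
n_i, num_views = 196, 6  # assumed values, not specified in this section
r_i = math.floor(0.15 * n_i * num_views)
kept_img_idx, dropped_img_idx = random_split(n_i * num_views, r_i, rng)

print(len(masked_idx), len(visible_idx))  # masked / unmasked point tokens
print(len(kept_img_idx))                  # image tokens passed to the decoder
```

With the default $N_{\mathcal{P}}=27$, this gives 22 masked and 5 visible point tokens, so the decoder has to recover most of the shape from very few geometric cues plus the retained image tokens.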

The joint decoder has two layers and each layer consists of a multi-head cross-attention (MCA) and a multi-head self-attention layer (MSA). A fully connected linear layer (FCL) is used to project the output of the decoder to the reconstructed point cloud. Formally, the behaviours of the decoder $\Phi_{\text{dec}}$ can be formulated as follows:

$\displaystyle\mathcal{F}_{0}=\{\bm{m}_{i}^{\mathcal{P}}\}_{i=1}^{M_{\mathcal{P}}}\cup\{\bm{r}_{i}^{\mathcal{P}}\}_{i=1}^{R_{\mathcal{P}}},$ (9)
$\displaystyle\mathcal{F}_{1}=\text{MSA}\big(\text{MCA}\big(\mathcal{F}_{0},\{\bm{r}_{i}^{\mathcal{I}}\}_{i=1}^{R_{\mathcal{I}}}\big)\big),$
$\displaystyle\mathcal{F}_{2}=\text{MSA}\big(\text{MCA}\big(\mathcal{F}_{1},\{\bm{r}_{i}^{\mathcal{I}}\}_{i=1}^{R_{\mathcal{I}}}\big)\big),$
$\displaystyle\mathcal{R}=\text{FCL}(\mathcal{F}_{2}),$

where $\mathcal{R}$ denotes the reconstructed point cloud. The distance between $\mathcal{R}$ and the original point cloud $\mathcal{P}$ is measured by the Earth Mover’s Distance (EMD). The loss function for the reconstruction task is therefore defined as:

\displaystyle\mathcal{L}_{\text{emd}}=D_{\text{EMD}}(\mathcal{R}||\mathcal{P}),

(10)

where $D_{\text{EMD}}$ denotes the EMD distance measure function.
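To make the EMD in Eq. (10) concrete, the sketch below evaluates it exactly for two tiny, equal-size point sets by searching all one-to-one matchings. This brute-force search is an illustrative stand-in, not the solver used in the paper: its cost grows factorially, and practical implementations rely on approximate matching. The function name `emd_bruteforce` and the choice to average (rather than sum) the matched distances are assumptions.

```python
import itertools
import math

def emd_bruteforce(points_a, points_b):
    """Exact EMD between two equal-size point sets with uniform weights.

    Tries every bijection between the sets and returns the smallest mean
    Euclidean distance. Only usable for a handful of points.
    """
    if len(points_a) != len(points_b) or not points_a:
        raise ValueError("this sketch assumes non-empty, equal-size point sets")
    n = len(points_a)
    best = math.inf
    for perm in itertools.permutations(range(n)):
        cost = sum(math.dist(points_a[i], points_b[j]) for i, j in enumerate(perm))
        best = min(best, cost)
    return best / n

original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
shifted = [(x + 1.0, y, z) for x, y, z in original]  # rigid shift along x
print(emd_bruteforce(original, original))
print(emd_bruteforce(original, shifted))
```

A perfect reconstruction yields zero, and a rigid translation yields the length of the translation vector, which shows that the measure compares the two sets as wholes rather than point by point in a fixed order.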

III-F Self-Training

Before adaptation, both the 2D teacher model and the 3D student model take labeled source domain data (i.e. $\{\mathcal{P}_{i}^{s},\mathcal{I}_{i}^{s},y_{i}^{s}\}_{i=1}^{n_{s}}$ ) as input for supervised learning:

\displaystyle\mathcal{L}_{\text{cls}}^{s}=-\frac{1}{n_{s}}\sum_{i=1}^{n_{s}}\sum_{c=1}^{C}{\rm I}[c=y_{i}^{s}]\log\big(\Phi^{\mathcal{P}}(\mathcal{P}_{i}^{s})_{c}\Phi^{\mathcal{I}}(\mathcal{I}_{i}^{s})_{c}\big),

(11)

where $\Phi^{\mathcal{P}}(\mathcal{P}_{i}^{s})_{c}$ and $\Phi^{\mathcal{I}}(\mathcal{I}_{i}^{s})_{c}$ denote the predicted probabilities of the $c$ -th class from the student model and the teacher model, respectively, and ${\rm I}[\cdot]$ is an indicator function.

For adaptation, self-paced self-training (SPST) is a popular strategy to align the two domains by generating pseudo-labels in the target domain according to highly confident predictions. Following prior works [14, 16, 15, 61], we also utilize the SPST strategy to further reduce domain shift. The objective of self-paced learning based self-training is formulated as:

\displaystyle\begin{aligned} \mathcal{L}_{\text{cls}}^{t}=-\frac{1}{\widehat{n}_{t}}\sum_{i=1}^{\widehat{n}_{t}}\left(\sum_{c=1}^{C}\widehat{y}_{i,c}^{t}\log\big(\Phi^{\mathcal{P}}(\mathcal{P}_{i}^{t})_{c}\Phi^{\mathcal{I}}(\mathcal{I}_{i}^{t})_{c}\big)+\gamma|\widehat{\bm{y}}_{i}^{t}|_{1}\right),\end{aligned}

(12)

where $\widehat{n}_{t}$ denotes the number of pseudo-labeled samples in the target domain, $\widehat{\bm{y}}_{i}^{t}$ is the predicted one-hot pseudo-label vector for a target instance $\mathcal{P}_{i}^{t}$ , $\widehat{y}_{i,c}^{t}$ is its $c$ -th element, and $\gamma$ is a hyper-parameter that controls the number of selected target samples, i.e. the larger $\gamma$ , the more samples are selected. We can simply convert $\gamma$ into the prediction confidence threshold $\theta=\exp(-\gamma)$ . The generic pseudo-label generation strategy can be simplified to the following form when all network parameters are fixed:

\displaystyle\widehat{y}^{t}_{i,c}=\left\{\begin{aligned} &1,\>\>{\rm if}\>c=\arg\max_{c}p(c|{\rm logit}_{i})\>\text{\&}\ p(c|{\rm logit}_{i})>\theta,\\ &0,\>\>{\rm otherwise},\end{aligned}\right.

(13)

where ${\rm logit}_{i}=Avg(\Phi^{\mathcal{P}}(\mathcal{P}_{i}^{t}),\Phi^{\mathcal{I}}(\mathcal{I}_{i}^{t}))$ . We adopt a threshold $\theta$ that gradually increases as the self-paced rounds proceed, i.e. it increases by a constant $\epsilon$ in each round.
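The selection rule of Eq. (13) can be sketched for a single target sample as below. The sketch assumes that both branches output class probabilities and that $Avg$ is a plain arithmetic mean; the function names and the example predictions are hypothetical.

```python
import math

def pseudo_label(student_probs, teacher_probs, theta):
    """Return the pseudo-label index for one target sample, or None if rejected.

    Follows Eq. (13): average the two predictions, take the arg-max class,
    and keep it only if its averaged confidence exceeds the threshold theta.
    """
    avg = [(s + t) / 2.0 for s, t in zip(student_probs, teacher_probs)]
    best = max(range(len(avg)), key=avg.__getitem__)
    return best if avg[best] > theta else None

def threshold_from_gamma(gamma):
    """Convert the self-paced weight gamma into a confidence threshold."""
    return math.exp(-gamma)

theta = 0.8  # example threshold value
confident = pseudo_label([0.90, 0.05, 0.05], [0.80, 0.10, 0.10], theta)
uncertain = pseudo_label([0.60, 0.30, 0.10], [0.50, 0.40, 0.10], theta)
print(confident, uncertain)
```

The first sample is assigned class 0 because its averaged confidence exceeds the threshold, whereas the second is left unlabeled. Raising $\theta$ (equivalently, lowering $\gamma$) admits fewer target samples.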

III-G Overall Loss

The framework of our approach is illustrated in Fig. 2. The overall training loss of our method is:

\displaystyle\mathcal{L}=\mathcal{L}_{\text{kd}}+\alpha\mathcal{L}_{\text{emd}}+\beta\mathcal{L}_{\text{cls}}^{s}+\eta\mathcal{L}_{\text{cls}}^{t},

(14)

where $\alpha$ , $\beta$ and $\eta$ are hyper-parameters used to balance the weights of the loss terms. We follow [15, 14, 16, 61] and apply a two-stage optimization for training the models. During the first stage of model training, we mainly rely on the first three loss terms to ensure better completion of the adaptation process. Once the initial training is completed, we use the trained teacher and student models together to generate pseudo labels for the target domain samples and perform the self-training.
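A minimal sketch of how Eq. (14) could be evaluated under the two-stage schedule is given below. It assumes that the first stage drops the self-training term entirely and that the second stage uses all four terms; the text only says the first stage "mainly" relies on the first three terms, so this split, the function name `overall_loss` and the example loss values are assumptions.

```python
def overall_loss(l_kd, l_emd, l_cls_s, l_cls_t=0.0,
                 alpha=1.0, beta=1.0, eta=1.0, stage=1):
    """Weighted sum of the loss terms in Eq. (14).

    stage 1: distillation, reconstruction and source classification terms.
    stage 2: additionally includes the self-training term on pseudo labels.
    """
    loss = l_kd + alpha * l_emd + beta * l_cls_s
    if stage == 2:
        loss += eta * l_cls_t
    return loss

# Hypothetical per-batch loss values.
print(overall_loss(0.5, 0.25, 1.0, l_cls_t=2.0, stage=1))
print(overall_loss(0.5, 0.25, 1.0, l_cls_t=2.0, stage=2))
```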

IV Experiments

IV-A Datasets

PointDA-10. PointDA-10 [12] is a popular UDA dataset designed for point cloud classification, which consists of subsets of three datasets: ShapeNet, ModelNet40 and ScanNet. These sub-datasets share the same ten categories, such as bathtub, bed, and bookshelf. In particular, ShapeNet-10 (S) is a subset of the ShapeNet dataset and contains 17,378 training and 2,492 testing point clouds extracted from synthetic 3D CAD models. Similarly, ModelNet-10 (M) consists of 4,183 training and 856 testing samples taken from the synthetic dataset ModelNet40, but the shapes of its point clouds differ from those of same-class samples in ShapeNet. ScanNet-10 (S*) is sampled from ScanNet and contains 6,110 training samples and 1,769 testing samples. It is the only real dataset, consisting of scanned real-world indoor scenes. Due to errors in the registration process and occlusions, the point clouds in ScanNet-10 suffer from noise and sparseness, making classification more challenging. With the three sub-datasets, we can evaluate our method in six different UDA settings including Simulation-to-Reality, Reality-to-Simulation and Simulation-to-Simulation scenarios.

Sim-to-Real. The Sim-to-Real [29] dataset is a fairly new benchmark for the problem of 3D domain generalization (3DDG), which collects object point clouds of 11 shared classes from ModelNet40 [8] and ScanObjectNN [11], and 9 shared classes from ShapeNet [9] and ScanObjectNN [11]. This benchmark consists of four subsets: ModelNet-11 (M11), ScanObjectNN-11 (SO*11), ShapeNet-9 (S9) and ScanObjectNN-9 (SO*9). Among them, M11 consists of 4,844 training and 972 testing point clouds, SO*11 includes 1,915 training and 475 testing point clouds, S9 consists of 19,904 training and 1,995 testing point clouds, and SO*9 includes 1,602 training and 400 testing point clouds. Following [16], we conduct two types of Simulation-to-Reality adaptation scenarios: M11 $\rightarrow$ SO*11 and S9 $\rightarrow$ SO*9.

IV-B Implementation Details

For our RPD, we adopt mini-DGCNN [3], a standard DGCNN with half the number of layers, as the 3D Tokenizer $\Phi_{\text{emb}}^{\mathcal{P}}$ . The 2D Tokenizer $\Phi_{\text{emb}}^{\mathcal{I}}$ is a 2D convolution layer with a kernel size equal to the image patch size. We adopt a standard vision transformer as the backbone to extract relationships across patch tokens from images and point clouds. The transformer module is initialized with the MAE [31] pre-trained ViT-B/16 [32], and we only train the last three blocks to balance accuracy and efficiency. The Category Classifiers $\Phi_{\text{cls}}^{\mathcal{I}}$ and $\Phi_{\text{cls}}^{\mathcal{P}}$ are multi-layer perceptrons (MLPs) with three fully connected layers. The Joint Decoder $\Phi_{\text{dec}}$ for self-supervised reconstruction has two layers, each consisting of a multi-head cross-attention (MCA) layer and a multi-head self-attention (MSA) layer, followed by a fully connected linear (FCL) projection layer. By default, the hyper-parameters $\alpha$ , $\beta$ and $\eta$ are all empirically set to 1. During training, the Adam optimizer [82] is used with an initial learning rate of 0.0001 and an epoch-wise cosine annealing learning rate scheduler. Dropout of 0.5 and batch normalization are adaptively applied after the convolution layers and the hidden layers. The training batch size is set to 32. More training details are provided in Table I. During self-paced self-training (SPST), the initial threshold $\theta$ and the increment constant $\epsilon$ are empirically set to 0.8 and 0.05, respectively, and the training contains 10 rounds with 5 epochs in each round. For simulation-to-reality scenarios, specific data augmentation strategies are adopted, such as jittering, randomly dropping holes and rotation.
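For reference, an epoch-wise cosine annealing schedule can be sketched with the standard formula below. The minimum learning rate of zero and the function name `cosine_annealed_lr` are assumptions, as the text only specifies the initial rate of 0.0001 and the scheduler type.

```python
import math

def cosine_annealed_lr(epoch, total_epochs, base_lr=1e-4, min_lr=0.0):
    """Learning rate at a given epoch under epoch-wise cosine annealing."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Example with 200 training epochs, one of the settings listed in Table I.
for e in (0, 100, 200):
    print(e, cosine_annealed_lr(e, total_epochs=200))
```

The rate starts at the base value, reaches half of it at the midpoint, and decays smoothly to the assumed minimum by the final epoch.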

Transformer Configurations: We extract relationships between image and point cloud tokens using the standard ViT [32] architecture, which comprises 12 layers with 12 attention heads each and an embedding dimension of 768. Only the last three layers are trained to balance accuracy and efficiency. The decoder network has 2 layers, each equipped with a multi-head cross-attention (MCA) layer and a multi-head self-attention (MSA) layer. The number of attention heads and the embedding dimension of the decoder are set to 16 and 512, respectively.
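The configuration above can be summarised in a small sketch. The dataclass, its field names and the helper `trainable_blocks` are hypothetical; the numeric values are those stated in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TransformerConfig:
    depth: int        # number of transformer layers
    num_heads: int    # attention heads per layer
    embed_dim: int    # token embedding dimension

    @property
    def head_dim(self) -> int:
        """Per-head feature dimension."""
        return self.embed_dim // self.num_heads

ENCODER = TransformerConfig(depth=12, num_heads=12, embed_dim=768)  # ViT backbone
DECODER = TransformerConfig(depth=2, num_heads=16, embed_dim=512)   # joint decoder

def trainable_blocks(cfg, num_trainable=3):
    """0-based indices of the layers left unfrozen: the last `num_trainable`."""
    return list(range(cfg.depth - num_trainable, cfg.depth))

print(trainable_blocks(ENCODER))           # layers updated during training
print(ENCODER.head_dim, DECODER.head_dim)  # per-head dimensions
```

Freezing the first nine layers keeps most of the pre-trained relational prior intact, while the last three layers remain free to adapt to the point cloud tokens.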

TABLE I: Training configurations for 6 settings in PointDA-10 [12] and 2 settings in Sim-to-Real [29]. The R, J, D in augmentation denote rotation, jittering and randomly dropping holes respectively.

Config	M $\rightarrow$ S	M $\rightarrow$ S*	S $\rightarrow$ M	S $\rightarrow$ S*	S* $\rightarrow$ M	S* $\rightarrow$ S	S9 $\rightarrow$ SO*9	M11 $\rightarrow$ SO*11
optimizer	Adam	Adam	Adam	Adam	Adam	Adam	Adam	Adam
base learning rate	1e-4	1e-4	1e-4	1e-4	1e-4	1e-4	1e-4	1e-4
weight decay	5e-5	5e-5	5e-4	5e-4	5e-5	5e-5	5e-5	5e-5
dropout	0.5	0.5	0.5	0.5	0.5	0.5	0.5	0.5
training epochs	400	400	200	200	200	200	400	400
label smoothing	0	0	0.3	0.3	0	0	0	0
augmentation	R	R, J, D	R	R, J, D	R, J	R, J	R, J, D	R, J, D

TABLE II: Classification accuracy (%) averaged over 3 seeds ( $\pm$ SEM) on the PointDA-10 dataset. M: ModelNet-10; S: ShapeNet-10; S*: ScanNet-10. We compare with state-of-the-art 3D UDA methods, and our method achieves the best performance. $\dagger$ denotes experiments without using 3 seeds. The best performance is highlighted in bold.

Methods	SPST	M $\rightarrow$ S	M $\rightarrow$ S*	S $\rightarrow$ M	S $\rightarrow$ S*	S* $\rightarrow$ M	S* $\rightarrow$ S	Avg
w/o Adapt		83.3 $\pm$ 0.7	43.8 $\pm$ 2.3	75.5 $\pm$ 1.8	42.5 $\pm$ 1.4	63.8 $\pm$ 3.9	64.2 $\pm$ 0.8	62.2 $\pm$ 1.8
PointDAN [12]		83.9 $\pm$ 0.3	44.8 $\pm$ 1.4	63.3 $\pm$ 1.1	45.7 $\pm$ 0.7	43.6 $\pm$ 2.0	56.4 $\pm$ 1.5	56.3 $\pm$ 1.2
RS [63]		79.9 $\pm$ 0.8	46.7 $\pm$ 4.8	75.2 $\pm$ 2.0	51.4 $\pm$ 3.9	71.8 $\pm$ 2.3	71.2 $\pm$ 2.8	66.0 $\pm$ 1.6
DefRec+PCM [13]		81.7 $\pm$ 0.6	51.8 $\pm$ 0.3	78.6 $\pm$ 0.7	54.5 $\pm$ 0.3	73.7 $\pm$ 1.6	71.1 $\pm$ 1.4	68.6 $\pm$ 0.8
Learnable-DefRec^† [62]		82.8 $\pm$ 0.0	56.3 $\pm$ 0.0	81.7 $\pm$ 0.0	54.8 $\pm$ 0.0	72.9 $\pm$ 0.0	71.7 $\pm$ 0.0	70.0 $\pm$ 0.0
GLRV [16]	✓	85.4 $\pm$ 0.4	60.4 $\pm$ 0.4	78.8 $\pm$ 0.6	57.7 $\pm$ 0.4	77.8 $\pm$ 1.1	76.2 $\pm$ 0.6	72.7 $\pm$ 0.6
GAST [14]		83.9 $\pm$ 0.2	56.7 $\pm$ 0.3	76.4 $\pm$ 0.2	55.0 $\pm$ 0.2	73.4 $\pm$ 0.3	72.2 $\pm$ 0.2	69.5 $\pm$ 0.2
GAST [14]	✓	84.8 $\pm$ 0.1	59.8 $\pm$ 0.2	80.8 $\pm$ 0.6	56.7 $\pm$ 0.2	81.1 $\pm$ 0.8	74.9 $\pm$ 0.5	73.0 $\pm$ 0.4
GAI [15]		85.8 $\pm$ 0.3	55.3 $\pm$ 0.3	77.2 $\pm$ 0.4	55.4 $\pm$ 0.5	73.8 $\pm$ 0.6	72.4 $\pm$ 1.0	70.0 $\pm$ 0.5
GAI [15]	✓	86.2 $\pm$ 0.2	58.6 $\pm$ 0.1	81.4 $\pm$ 0.4	56.9 $\pm$ 0.2	81.5 $\pm$ 0.5	74.4 $\pm$ 0.6	73.2 $\pm$ 0.3
SD^† [60]	✓	83.9 $\pm$ 0.0	61.1 $\pm$ 0.0	80.3 $\pm$ 0.0	58.9 $\pm$ 0.0	85.5 $\pm$ 0.0	80.9 $\pm$ 0.0	75.1 $\pm$ 0.0
Ours		81.9 $\pm$ 0.3	64.4 $\pm$ 0.5	82.8 $\pm$ 0.4	59.0 $\pm$ 0.3	77.1 $\pm$ 0.8	76.4 $\pm$ 0.6	73.6 $\pm$ 0.5
Ours	✓	86.3 $\pm$ 0.3	64.9 $\pm$ 0.2	88.7 $\pm$ 0.1	61.1 $\pm$ 0.1	86.2 $\pm$ 0.9	81.2 $\pm$ 0.3	78.0 $\pm$ 0.3

IV-C Comparison with the State-of-the-art Methods

We compare our RPD with recent state-of-the-art point-based UDA methods including Domain Adversarial Neural Network (PointDAN) [12], Reconstruction Space Network (RS) [63], Deformation Reconstruction Network with Point Cloud Mixup (DefRec+PCM) [13], Learnable Deformation Reconstruction Network (Learnable-DefRec) [62], Global-Local structure modeling and Reliable Voted pseudo label method (GLRV) [16], Geometry-Aware Self-Training (GAST) [14], Geometry-Aware Implicits (GAI) [15], and Self-Distillation (SD) [60]. The w/o Adapt method trains the DGCNN network with only labeled source samples and serves as a lower-bound reference.

We report in Tab. II the comparisons between our proposed RPD and other UDA methods on PointDA-10. As can be seen, our method achieves the best accuracy in all six settings. The average classification accuracy of RPD outperforms the current SOTA method SD [60] by 2.9%. RPD also achieves a remarkable improvement over SD in the Simulation-to-Reality settings of M $\rightarrow$ S* (+3.8%) and S $\rightarrow$ S* (+2.2%), which are the most challenging yet realistic tasks. These observations verify the capability of our RPD to effectively capture semantic information from point clouds.
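The margins quoted above follow directly from the SPST rows of Tab. II, as the short check below illustrates. The dictionary layout and key names are hypothetical; the accuracies are copied from the table.

```python
# Accuracies (%) with SPST, copied from Table II.
ours = {"M->S*": 64.9, "S->S*": 61.1, "Avg": 78.0}
sd = {"M->S*": 61.1, "S->S*": 58.9, "Avg": 75.1}

# Gain of RPD over SD, rounded to one decimal place.
gains = {k: round(ours[k] - sd[k], 1) for k in ours}
print(gains)
```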

For the Sim-to-Real dataset, we compare our method with a meta-learning method, i.e. MetaSets [29], and point-based domain adaptation methods, i.e. PointDAN [12] and GLRV [16]. We report the mean accuracy and standard error with three seeds in Table IV. Our method outperforms both the point-based domain adaptation and meta-learning methods, achieving a new state of the art.

TABLE III: Ablation study on each component of our method. Experiments are conducted on PointDA-10 dataset.

	OCKD	MPCR	SPST	M $\rightarrow$ S	M $\rightarrow$ S*	S $\rightarrow$ M	S $\rightarrow$ S*	S* $\rightarrow$ M	S* $\rightarrow$ S	Avg
PointNet [1]				80.5	41.6	75.8	40.0	60.5	63.6	60.3
DGCNN [3]				83.3	43.8	75.5	42.5	63.8	64.2	62.2
Ours				82.1	58.7	74.2	52.8	72.7	70.7	68.5
	✓			82.0	62.6	75.2	58.3	74.1	71.0	70.5
		✓		82.5	62.2	77.2	55.1	73.7	73.9	70.8
	✓	✓		81.9	64.4	82.8	59.0	77.1	76.4	73.6
			✓	82.4	59.0	82.0	56.5	80.0	78.9	73.1
	✓		✓	83.7	62.9	85.9	61.1	84.2	79.3	76.2
		✓	✓	85.5	63.2	82.2	57.7	80.7	79.8	74.9
	✓	✓	✓	86.3	64.9	88.7	61.1	86.2	81.2	78.0

TABLE IV: Classification accuracy (%) averaged over 3 seeds ( $\pm$ SEM) on the Sim-to-Real dataset. M11: ModelNet-11; SO*11: ScanObjectNN-11; S9: ShapeNet-9; SO*9: ScanObjectNN-9.

Methods	SPST	M11 $\rightarrow$ SO*11	S9 $\rightarrow$ SO*9
w/o Adaptation		61.68 $\pm$ 1.26	57.42 $\pm$ 1.01
PointDAN [12]		63.32 $\pm$ 0.85	54.95 $\pm$ 0.87
MetaSets [29]		72.42 $\pm$ 0.21	60.92 $\pm$ 0.76
GLRV [16]	✓	75.16 $\pm$ 0.34	62.46 $\pm$ 0.55
Ours		74.43 $\pm$ 0.54	63.25 $\pm$ 0.50
Ours	✓	77.05 $\pm$ 0.42	67.50 $\pm$ 0.50

TABLE V: Ablation study on each component of our method. Experiments are conducted on Sim-to-Real dataset.

OCKD	MPCR	SPST	M11 $\rightarrow$ SO*11	S9 $\rightarrow$ SO*9
			69.12	60.25
✓			71.24	61.50
	✓		70.53	60.75
✓	✓		73.47	63.25
		✓	72.14	61.75
✓		✓	74.19	64.50
	✓	✓	73.32	63.50
✓	✓	✓	77.05	67.50

IV-D Ablation Studies

To validate the effectiveness of our proposed method, we conducted various ablation studies on the six settings of PointDA-10 and the two settings of Sim-to-Real. We utilized an MAE pre-trained Vision Transformer to extract features and introduced three key components for adaptation: an online cross-modal knowledge distillation method (OCKD), a masked point cloud reconstruction component (MPCR), and a self-paced self-training strategy (SPST). The results are summarized in Tab. III and Tab. V.

For PointDA-10, the first three rows in Tab. III respectively show the results of using PointNet [1], DGCNN [3], and our proposed method as the backbone network without adaptation. Our baseline performs significantly better than PointNet [1] and DGCNN [3] in 4 out of 6 settings, highlighting the superior generalization of transformer models pre-trained on large-scale image datasets over traditional 3D networks. Comparing the fourth row with the third row in Tab. III, we observe that OCKD improves over the baseline in five of the six settings (the exception is M $\rightarrow$ S, at 82.0 versus 82.1) and raises the average accuracy from 68.5 to 70.5, indicating that the point cloud branch has acquired abundant semantic information, which enhances its generalization capability. Furthermore, the fifth row shows that the inclusion of the masked point cloud reconstruction module improves the model’s ability to capture geometric information, resulting in better classification accuracy. Moreover, a significant improvement is observed on average by using the OCKD and MPCR components together. The Simulation-to-Reality settings achieve competitive results even without SPST, surpassing the performance of the previous SOTA model. Additionally, accuracy improves in all six settings after adding the SPST method, indicating its effectiveness across all datasets. Finally, in the last row, we report the results obtained by combining all components, with which our method outperforms the recent SOTA method SD [60]. For Sim-to-Real, the results in Tab. V lead to similar conclusions, which again verifies the effectiveness of our proposed method.

We also investigate the influence of the cross-modal knowledge fusion strategy. For shape classification, we directly fuse the predictions by linear interpolation, namely, adding the classification logits of the 2D teacher and 3D student models element-wise. This simple yet effective design produces an ensemble of two types of knowledge: the 3D geometric information captured by self-supervised masked point cloud reconstruction, and the robust semantics from the trained 2D Transformer-based models. We believe that these two kinds of knowledge are complementary to some extent. As shown in Fig. 4, the cross-modal knowledge fusion strategy consistently improves cross-domain generalization on all settings of PointDA-10 and Sim-to-Real.

It is noteworthy that for the challenging yet realistically significant Simulation-to-Reality scenarios (i.e. M $\rightarrow$ S*, S $\rightarrow$ S*, M11 $\rightarrow$ SO*11, and S9 $\rightarrow$ SO*9), our proposed RPD achieves a remarkable improvement over w/o Adapt of 21.1%, 18.6%, 15.37%, and 10.08%, respectively. Confusion matrices of the class-wise classification accuracy achieved by w/o Adapt and our RPD on the four Simulation-to-Reality UDA tasks are shown in Fig. 3.

IV-E Visualization

We visualize the input point clouds, random masking, and the reconstructed 3D coordinates in Fig. 5. We believe that reconstructing masked point clouds from masked 2D image tokens creates a challenging self-supervised learning task that encourages the network to learn 3D geometric information. We use the saliency map analysis method of [83] to visualize the features of the compared methods under the M $\rightarrow$ S* setting on PointDA-10 [12]. As shown in Fig. 6, the proposed RPD method uses the prior knowledge of pre-trained Vision Transformers (ViT) to model the relationships between local patches. RPD focuses on the global structure and effectively captures the local relationships within point cloud data, enabling our method to extract robust and invariant 3D semantic representations. In contrast, most of the compared methods use DGCNN as the backbone for feature extraction, which tends to focus only on local high-frequency areas, such as edges, thereby limiting generalization.

V Discussion

V-A Scalability

As shown in Table VI, we experimented with three different scales of pre-trained ViT models: ViT-S, ViT-B and ViT-L. However, we observed that using the larger-scale pre-trained model did not lead to performance improvements.

TABLE VI: Results on the effect of using different pre-trained 2D Transformer-based models on PointDA-10.

Domain	Methods	M $\rightarrow$ S	M $\rightarrow$ S*	S $\rightarrow$ M	S $\rightarrow$ S*	S* $\rightarrow$ M	S* $\rightarrow$ S	Avg
Source	RPD-S	98.7	98.8	94.5	94.3	76.5	78.0	90.3
	RPD-B	98.7	98.9	94.0	94.7	77.4	79.3	90.7
	RPD-L	99.0	99.3	95.4	95.7	78.1	80.0	91.3
Target	RPD-S	80.4	63.5	81.7	58.4	74.5	73.5	72.0
	RPD-B	81.9	64.4	82.8	59.0	77.1	76.4	73.6
	RPD-L	80.9	61.7	80.4	58.9	75.0	75.9	72.1

We attribute this to several factors: First, our relatively small point cloud dataset is prone to overfitting, which is exacerbated by larger ViT-L models. Second, we use only 27 patches compared to the 196 used in ViT-B. This smaller number of patches means that larger ViT-L models are not needed for effective patch relationship modeling. Finally, using larger ViT-L models would require more patches, potentially leading to insufficient information in each patch and reducing the effectiveness of local relationship modeling.

V-B Limitation

While our RPD approach demonstrates considerable promise, there are certain limitations worth investigating:

Computational Complexity: As shown in Table VII, the use of a pre-trained ViT model leads to increased computational requirements, especially when dealing with very large datasets or high-resolution point clouds. Future work could focus on optimizing the model for efficiency or exploring more lightweight architectures that retain the robustness of the pre-trained ViT model.

Effects of Openset Data: The pre-trained ViT model is based on 2D data, which might influence our results. If the 2D pre-training data does not include any sample of semantic categories of the PointDA-10 [12] and Sim-to-Real [29] point cloud datasets, our method’s performance could be adversely affected. The absence of relevant categories in the pre-training dataset can result in suboptimal feature extraction and limited generalization.

Effects of Pre-training Approaches: Different pre-training approaches for ViT [32], such as DINO [84] and MAE [31], can have varying impacts on the performance of our method. We have observed that MAE pre-trained ViT tends to have an advantage in our experiments. The pre-training method affects the quality of learned representations and the model’s effectiveness in downstream tasks.

Extension to Other 3D Data Types: While our current focus is on point clouds, extending the proposed RPD to other types of 3D data, such as voxel grids or mesh data, could enhance its applicability. Adapting and optimizing the approach for these data types is an interesting area for future work.

TABLE VII: Comparative analysis of training costs of different methods

Methods	Parameters (M)	FLOPs (G)
DefRec [13]	2.08	2.77
PointDAN [12]	2.84	0.94
SD [60]	3.47	0.92
GAI [15]	22.68	3.58
GAST [14]	23.60	2.17
RPD (ours)	62.27	23.29

VI Conclusion

This paper proposes a novel scheme for unsupervised domain adaptation on object point cloud classification, aiming to alleviate domain shift by distilling relational priors from pre-trained 2D transformers. We illustrate how the relational priors learned by a proficient 2D Transformer model can be transferred to the 3D domain, thereby enhancing the generalization of 3D features. Our methodology involves adopting a standard teacher-student distillation framework, where the parameter-frozen pre-trained Transformer module is shared between the 2D teacher model and the 3D student model. Additionally, we employ an online knowledge distillation strategy to further semantically regularize the 3D student model. Moreover, to empower the model’s capacity to capture 3D geometric information, we introduce a novel self-supervised task involving the reconstruction of masked point cloud patches using corresponding masked multi-view image features. Experiments conducted on two public benchmarks validate the efficacy of our approach, demonstrating new state-of-the-art performance.

References

[1] C. Qi, H. Su, K. Mo, and L. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 652–660.
[2] C. Qi, L. Yi, H. Su, and L. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2017, pp. 1–16.
[3] Y. Wang, Y. Sun, Z. Liu, S. Sarma, M. Bronstein, and J. Solomon, “Dynamic graph cnn for learning on point clouds,” ACM Trans. Graph., vol. 38, no. 5, pp. 1–12, 2019.
[4] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen, “Pointcnn: Convolution on x-transformed points,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 31, 2018.
[5] W. Wu, Z. Qi, and L. Fuxin, “Pointconv: Deep convolutional networks on 3d point clouds,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 9621–9630.
[6] X. Ma, C. Qin, H. You, H. Ran, and Y. Fu, “Rethinking network design and local geometry in point cloud: A simple residual mlp framework,” arXiv preprint arXiv:2202.07123, 2022.
[7] H. Thomas, C. Qi, J. Deschaud, B. Marcotegui, F. Goulette, and L. Guibas, “Kpconv: Flexible and deformable convolution for point clouds,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2019, pp. 6411–6420.
[8] K. V. Vishwanath, D. Gupta, A. Vahdat, and K. Yocum, “Modelnet: Towards a datacenter emulation environment,” in Proc. IEEE 9th Int. Conf. Peer Peer Comput., 2009, pp. 81–82.
[9] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su et al., “Shapenet: An information-rich 3d model repository,” ArXiv, vol. 1512.03012, 2015.
[10] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “Scannet: Richly-annotated 3d reconstructions of indoor scenes,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 5828–5839.
[11] M. A. Uy, Q. Pham, B. Hua, T. Nguyen, and S. Yeung, “Revisiting point cloud classification: A new benchmark dataset and classification model on real-world data,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 1588–1597.
[12] C. Qin, H. You, L. Wang, C. Kuo, and Y. Fu, “Pointdan: A multi-scale 3d domain adaption network for point cloud representation,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 32, 2019.
[13] I. Achituve, H. Maron, and G. Chechik, “Self-supervised learning for domain adaptation on point clouds,” in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), 2021, pp. 123–133.
[14] L. Zou, H. Tang, K. Chen, and K. Jia, “Geometry-aware self-training for unsupervised domain adaptation on object point clouds,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021, pp. 6403–6412.
[15] Y. Shen, Y. Yang, M. Yan, H. Wang, Y. Zheng, and L. Guibas, “Domain adaptation on point clouds via geometry-aware implicits,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 7223–7232.
[16] H. Fan, X. Chang, W. Zhang, Y. Cheng, Y. Sun, and M. Kankanhalli, “Self-supervised global-local structure modeling for point cloud domain adaptation with reliable voted pseudo labels,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 6377–6386.
[17] Y. Chen, Z. Wang, L. Zou, K. Chen, and K. Jia, “Quasi-balanced self-training on noise-aware synthesis of object point clouds for closing domain gap,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2022, pp. 728–745.
[18] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” J. Mach. Learn. Res., vol. 17, pp. 2096–2030, 2016.
[19] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 7167–7176.
[20] J. Li, E. Chen, Z. Ding, L. Zhu, K. Lu, and H. Shen, “Maximum density divergence for domain adaptation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 11, pp. 3918–3930, 2020.
[21] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada, “Maximum classifier discrepancy for unsupervised domain adaptation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 3723–3732.
[22] F. Zhao, S. Liao, G. Xie, J. Zhao, K. Zhang, and L. Shao, “Unsupervised domain adaptation with noise resistible mutual-training for person re-identification,” in Proc. Eur. Conf. Comput. Vis. (ECCV). Springer, 2020, pp. 526–544.
[23] X. Wei, X. Gu, and J. Sun, “Learning generalizable part-based feature representation for 3d point clouds,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), 2022.
[24] M. Guo, J. Cai, Z. Liu, T. Mu, R. Martin, and S. Hu, “Pct: Point cloud transformer,” Computational Visual Media, vol. 7, pp. 187–199, 2021.
[25] H. Zhao, L. Jiang, J. Jia, P. Torr, and V. Koltun, “Point transformer,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021, pp. 16 259–16 268.
[26] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR). IEEE, 2009, pp. 248–255.
[27] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proc. Eur. Conf. Comput. Vis. (ECCV). Springer, 2014, pp. 740–755.
[28] A. Radford, J. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in Proc. ICML. PMLR, 2021, pp. 8748–8763.
[29] C. Huang, Z. Cao, Y. Wang, J. Wang, and M. Long, “Metasets: Meta-learning on point sets for generalizable representations,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2021, pp. 8863–8872.
[30] X. Zhu, R. Zhang, B. He, Z. Guo, Z. Zeng, Z. Qin, S. Zhang, and P. Gao, “Pointclip v2: Prompting clip and gpt for powerful 3d open-world learning,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2023, pp. 2639–2650.
[31] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 16 000–16 009.
[32] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[33] E. Kalogerakis, M. Averkiou, S. Maji, and S. Chaudhuri, “3d shape segmentation with projective convolutional networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 3779–3788.
[34] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, “Multi-view convolutional neural networks for 3d shape recognition,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2015, pp. 945–953.
[35] Y. Feng, Z. Zhang, X. Zhao, R. Ji, and Y. Gao, “Gvcnn: Group-view convolutional neural networks for 3d shape recognition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 264–272.
[36] S. Tulsiani, A. Efros, and J. Malik, “Multi-view consistency as supervisory signal for learning shape and pose prediction,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 2897–2905.
[37] A. Hamdi, S. Giancola, and B. Ghanem, “Mvtn: Multi-view transformation network for 3d shape recognition,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021, pp. 1–11.
[38] A. Goyal, H. Law, B. Liu, A. Newell, and J. Deng, “Revisiting point cloud shape classification with a simple and effective baseline,” in Proc. ICML. PMLR, 2021, pp. 3809–3820.
[39] H. Peng, B. Li, B. Zhang, X. Chen, T. Chen, and H. Zhu, “Multi-view vision fusion network: Can 2d pre-trained model boost 3d point cloud data-scarce learning?” IEEE Trans. Circuits Syst. Video Technol., 2023.
[40] C. Choy, J. Gwak, and S. Savarese, “4d spatio-temporal convnets: Minkowski convolutional neural networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 3075–3084.
[41] B. Graham, M. Engelcke, and L. Van Der Maaten, “3d semantic segmentation with submanifold sparse convolutional networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 9224–9232.
[42] D. Maturana and S. Scherer, “Voxnet: A 3d convolutional neural network for real-time object recognition,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS). IEEE, 2015, pp. 922–928.
[43] A. Kanezaki, Y. Matsushita, and Y. Nishida, “Rotationnet: Joint object categorization and pose estimation using multiviews from unsupervised viewpoints,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 5010–5019.
[44] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 30, 2017.
[45] X. Yu, L. Tang, Y. Rao, T. Huang, J. Zhou, and J. Lu, “Point-bert: Pre-training 3d point cloud transformers with masked point modeling,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 19 313–19 322.
[46] Z. Huang, Z. Zhao, B. Li, and J. Han, “Lcpformer: Towards effective 3d point cloud analysis via local context propagation in transformers,” IEEE Trans. Circuits Syst. Video Technol., 2023.
[47] Y. Zou, Z. Yu, B. Kumar, and J. Wang, “Unsupervised domain adaptation for semantic segmentation via class-balanced self-training,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 289–305.
[48] L. Chen, Q. Du, Y. Lou, J. He, T. Bai, and M. Deng, “Mutual nearest neighbor contrast and hybrid prototype self-training for universal domain adaptation,” in Proc. AAAI, vol. 36, no. 6, 2022, pp. 6248–6257.
[49] H. Liu, J. Wang, and M. Long, “Cycle self-training for domain adaptation,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 34, 2021, pp. 22 968–22 981.
[50] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell, “Cycada: Cycle-consistent adversarial domain adaptation,” in Proc. ICML. PMLR, 2018, pp. 1989–1998.
[51] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan, “Unsupervised pixel-level domain adaptation with generative adversarial networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 3722–3731.
[52] H. Li, N. Dong, Z. Yu, D. Tao, and G. Qi, “Triple adversarial learning and multi-view imaginative reasoning for unsupervised domain adaptation person re-identification,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 5, pp. 2814–2830, 2021.
[53] Q. Tian, Y. Zhu, H. Sun, S. Chen, and H. Yin, “Unsupervised domain adaptation through dynamically aligning both the feature and label spaces,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 12, pp. 8562–8573, 2022.
[54] B. Zhang, T. Chen, B. Wang, X. Wu, L. Zhang, and J. Fan, “Densely semantic enhancement for domain adaptive region-free detectors,” IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 3, pp. 1339–1352, 2021.
[55] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 27, 2014.
[56] X. Gu, C. Zhang, Q. Shen, J. Han, P. Angelov, and P. Atkinson, “A self-training hierarchical prototype-based ensemble framework for remote sensing scene classification,” Information Fusion, vol. 80, pp. 179–204, 2022.
[57] Y. Ding, H. Fan, M. Xu, and Y. Yang, “Adaptive exploration for unsupervised person re-identification,” ACM Trans. Multimedia Comput. Commun. Appl., vol. 16, no. 1, pp. 1–19, 2020.
[58] Z. Mei, P. Ye, H. Ye, B. Li, J. Guo, T. Chen, and W. Ouyang, “Automatic loss function search for adversarial unsupervised domain adaptation,” IEEE Trans. Circuits Syst. Video Technol., 2023.
[59] M. Kumar, B. Packer, and D. Koller, “Self-paced learning for latent variable models,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 23, 2010.
[60] A. Cardace, R. Spezialetti, P. Ramirez, S. Salti, and L. Di Stefano, “Self-distillation for unsupervised 3d domain adaptation,” in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), 2023, pp. 4166–4177.
[61] Y. Chen, Z. Wang, L. Zou, K. Chen, and K. Jia, “Quasi-balanced self-training on noise-aware synthesis of object point clouds for closing domain gap,” in Proc. Eur. Conf. Comput. Vis. (ECCV). Springer, 2022, pp. 728–745.
[62] X. Luo, S. Liu, K. Fu, M. Wang, and Z. Song, “A learnable self-supervised task for unsupervised domain adaptation on point clouds,” arXiv preprint arXiv:2104.05164, 2021.
[63] J. Sauder and B. Sievers, “Self-supervised deep learning on point clouds by reconstructing space,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 32, 2019.
[64] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil, “Model compression,” in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2006, pp. 535–541.
[65] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
[66] R. Zhang, Z. Guo, W. Zhang, K. Li, X. Miao, B. Cui, Y. Qiao, P. Gao, and H. Li, “Pointclip: Point cloud understanding by clip,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 8552–8562.
[67] L. Xue, M. Gao, C. Xing, R. Martín-Martín, J. Wu, C. Xiong, R. Xu, J. C. Niebles, and S. Savarese, “Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2023, pp. 1179–1189.
[68] T. Huang, B. Dong, Y. Yang, X. Huang, R. Lau, W. Ouyang, and W. Zuo, “Clip2point: Transfer clip to point cloud classification with image-depth pre-training,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2023, pp. 22 157–22 167.
[69] Q. Zhang, J. Hou, and Y. Qian, “Pointmcd: Boosting deep point cloud encoders via multi-view cross-modal distillation for 3d shape recognition,” IEEE Trans. Multimedia, 2023.
[70] Q. Zhang and J. Hou, “Pointvst: Self-supervised pre-training for 3d point clouds via view-specific point-to-image translation,” IEEE Trans. Vis. Comput. Graph., 2023.
[71] R. Zhang, L. Wang, Y. Qiao, P. Gao, and H. Li, “Learning 3d representations from 2d pre-trained models via image-to-point masked autoencoders,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2023, pp. 21 769–21 780.
[72] A. Hamdi, B. Ghanem, and M. Nießner, “Sparf: Large-scale learning of 3d sparse radiance fields from few input images,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2023, pp. 2930–2940.
[73] A. Hamdi, S. Giancola, and B. Ghanem, “Voint cloud: Multi-view point cloud representation for 3d understanding,” arXiv preprint arXiv:2111.15363, 2021.
[74] Z. Liu, X. Qi, and C. Fu, “3d-to-2d distillation for indoor scene parsing,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2021, pp. 4462–4472.
[75] Z. Wang, X. Yu, Y. Rao, J. Zhou, and J. Lu, “P2p: Tuning pre-trained image models for point cloud analysis with point-to-pixel prompting,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 35, 2022, pp. 14 388–14 402.
[76] C. Xu, S. Yang, T. Galanti, B. Wu, X. Yue, B. Zhai, W. Zhan, P. Vajda, K. Keutzer, and M. Tomizuka, “Image2point: 3d point-cloud understanding with 2d image pretrained models,” arXiv preprint arXiv:2106.04180, 2021.
[77] J. Kang, W. Jia, X. He, and K. Lam, “Point clouds are specialized images: A knowledge transfer approach for 3d understanding,” arXiv preprint arXiv:2307.15569, 2023.
[78] X. Huang, S. Li, W. Qu, T. He, Y. Zuo, and W. Ouyang, “Frozen clip model is efficient point cloud backbone,” arXiv preprint arXiv:2212.04098, 2022.
[79] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 770–778.
[80] Y. Gong, X. Yu, Y. Ding, X. Peng, J. Zhao, and Z. Han, “Effective fusion factor in fpn for tiny object detection,” in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), 2021, pp. 1160–1168.
[81] A. Gupta, J. Wu, J. Deng, and L. Fei-Fei, “Siamese masked autoencoders,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 36, 2024.
[82] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[83] T. Zheng, C. Chen, J. Yuan, B. Li, and K. Ren, “Pointcloud saliency maps,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2019, pp. 1598–1606.
[84] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021, pp. 9650–9660.