License: arXiv.org perpetual non-exclusive license
arXiv:2305.17770v2 [cs.CV] 16 Dec 2023

Point Cloud Completion Guided by Prior Knowledge via Causal Inference

Songxue Gao, Chuanqi Jiao, Ruidong Chen, Weijie Wang*, Weizhi Nie
Abstract

Point cloud completion aims to recover complete shapes from the partial point clouds that scanners capture under occlusion and limited view angles. Existing approaches typically predict the missing parts from a global feature extracted from the partial input, which makes it hard to recover details because the global feature is unlikely to capture the full details of all missing parts. In this paper, we propose a novel approach to point cloud completion, called Point-PC, which uses a memory network to retrieve shape priors and a causal inference model to select the missing shape information that serves as supplemental geometric information for completion. Specifically, we propose a memory operating mechanism in which the complete shape features and the corresponding shapes are stored as “key-value” pairs. To retrieve similar shapes from the partial input, we also apply a contrastive learning-based pre-training scheme that transfers the features of incomplete shapes into the domain of complete shape features. Experimental results on the ShapeNet-55, PCN, and KITTI datasets demonstrate that Point-PC outperforms the state-of-the-art methods.

Index Terms:
Point cloud completion, Memory network, Causal inference, Contrastive alignment.

I Introduction

With more people using 3D scanners and RGB-D cameras, 3D vision has become one of the most popular research topics in recent years [1, 2, 3, 4]. Among all the 3D descriptors [5, 6, 7], the point cloud stands out because of its remarkable ability to render spatial structure at a lower computational cost. However, due to occlusion, view angles, and limited sensor resolution, raw point clouds are usually sparse and defective [8, 9, 10]. Consequently, point cloud completion becomes essential.

Figure 1: Point-PC is proposed for point cloud completion. Point-PC proposes a novel paradigm that finds similar shape information as prior knowledge to help the model handle the point cloud completion problem. Furthermore, our approach also selects geometric information from shape priors (blue, red, and yellow points) guided by causal inference.

Benefiting from large-scale point cloud datasets [11, 12, 13], many efficient learning-based methods for point cloud completion have emerged. The pioneering work is PCN [12], which encoded the input shape into a global feature and decoded it using a folding operation. Several methods following this encoder-decoder pattern, such as NSFA [14] and GRNet [15], appeared afterwards. Subsequent research has prioritized the decoding stage in order to generate point clouds with detailed geometric structures. SA-Net [9] and PFNet [16] increased the density of point clouds hierarchically. Such a coarse-to-fine pattern achieves better performance since more constraints are imposed on the generation process.

Most recent methods incorporate geometry-aware modules into a transformer-based structure. PoinTr [17] employed a KNN model to help transformers exploit the inductive bias of 3D geometric structures. SeedFormer [18] aggregated neighboring points with its proposed Upsample Transformer to incorporate valuable local information into the generation process. These two methods formulate point cloud completion as a set-to-set translation task in which complex dependencies are learned among the point groups. Although studies in this field have reported strong performance, most existing methods [19, 20] remain within a paradigm in which query points are first generated from a global feature vector extracted from the partial input, and an upsampling module then increases the density of the output. Many approaches use this same framework to handle the point cloud completion problem [21, 22]. However, the paradigm has two drawbacks: 1) an incomplete shape makes it hard to learn detailed structural information and to build a clear relationship with the complete point cloud; 2) such a global feature is diffuse and retains little fine-grained information for the upsampling phase. As a result, geometry-aware models cannot learn complex structures when they receive too little geometric information.

To tackle these challenges, we draw on human memory as inspiration for recovering detailed geometry. People often use fragmentary cues to retrieve similar 3D shapes from memory and let these shape priors guide the reconstruction. Based on this, we propose a new memory-based framework for point cloud completion, Point-PC. The framework uses a memory network to retrieve shape priors and a causal inference model to select missing shape information as additional geometric information for completion. It consists of four components: a partial shape encoder, the memory network, the prior knowledge selection module, and the shape decoder. Compared with the above-mentioned methods, our method goes one step further by explicitly introducing shape priors from an external knowledge base. In this way, the geometry-aware structure can precisely capture fine-grained details and geometric relations between neighboring points. Even when a large proportion of the points is unseen, the complete shapes generated by our method still exhibit smooth surfaces, neat edges, and sharp corners.

First, we construct an operating strategy to store, write, and read the memory. Specifically, we store the memory as “key-value” pairs. The key can be updated according to the similarity between the value and the corresponding ground truth. This writing strategy associates the value with incomplete inputs from multiple perspectives and enhances the memory network’s geometric perception of incomplete shapes. To retrieve precise shapes, we design a partial-complete pre-training scheme based on contrastive learning, with two learning mechanisms that overcome the distribution bias between the two modalities of point clouds: (i) cross-modality learning, which aligns partial features with the complete domain, and (ii) intra-modality learning in the partial modality, which eliminates the interference of varying view angles and incompleteness ratios. The pre-training aligns partial features with complete features and encourages partial feature extraction to be invariant to viewpoint, thus significantly improving the accuracy of shape prior retrieval. With the help of the pre-training scheme and the backdoor adjustment, Point-PC explicitly leverages shape priors in the memory network to recover heavily flawed point clouds in complex cases.

Because the retrieved shapes should not change the original structure of the input partial point cloud, we use causal inference to extract only the useful information from them. We construct a causal graph to analyze the unrelated shape information in the prior shapes. Specifically, we select the shape of prior features as the confounder and stratify it with do-calculus. After cutting off the backdoor path, the features extracted from the shape priors are optimized and redundant information is removed. To leverage information from both the partial points and the shape priors, the input partial features are concatenated with the output features of the backdoor adjustment module and then fed into a decoder to predict the final complete 3D shape. The main contributions of our work are as follows:

  • We propose a novel memory-based 3D point cloud completion network, Point-PC, to supplement geometric information from prior knowledge explicitly.

  • We introduce causal inference to further refine the shape prior, so as to eliminate the distraction of irrelevant information.

  • We conduct qualitative and quantitative experiments on the ShapeNet-55, PCN, and KITTI datasets, which show that Point-PC improves the accuracy and plausibility of point cloud completion and outperforms previous SOTA methods.

The following provides an overview of the remaining content in this paper: Section 2 covers related work, Section 3 presents the proposed solution, Section 4 describes the conducted experiments along with a summary of the results, and finally, Section 5 concludes this work and outlines potential avenues for future research.

II Related Work

II-A Point Cloud Completion

The pioneering deep learning-based approach, PCN [12], generates a preliminary completion from a learned global feature and subsequently applies upsampling, assuming that a 3D object resides on a 2D manifold. Later research focuses on migrating mature learning-based structures. Some previous methods [23, 24] voxelized the point cloud into binary voxels in order to apply 3D convolutions, which increases the computational cost cubically, whereas other methods [16, 25] process coordinates directly with Multi-Layer Perceptrons yet lose geometric information through pooling-based aggregation. Both kinds of completion methods ignore relations and context across points and thus fail to preserve the regional information of local patterns. To solve this problem, TopNet [26] constrains the completion process as the growth of a hierarchical rooted tree in which a parent point projects several child points in a feature expansion layer. SnowflakeNet [27], in turn, models the completion procedure as the generation of a snowflake. Most recent state-of-the-art completion methods focus on recovering fine details in the decoding process rather than providing sufficient geometric guidance from partial inputs in the encoding process. By breaking the point cloud into several sequential patches, transformer-based methods [28, 17, 18] have been shown to handle large-scale point clouds efficiently and to enhance relations between neighboring points, and they now dominate the field. Nevertheless, the upsampling and expansion modules in the aforementioned methods rely on a global feature vector for simplicity, which prevents them from precisely capturing the detailed geometries and structures of 3D shapes; as a result, these methods cannot arrange well-structured point splitting in local regions. To address this problem by integrating more geometric information explicitly, we utilize a memory network to provide rich structural details and enhance neighboring relations to recover local regions.

[Figure 2 diagram labels: memory keys $K_{1},\dots,K_{|\mathcal{M}|}$ and values $V_{1},\dots,V_{|\mathcal{M}|}$; values $V_{n1}$, $V_{n2}$, $V_{n3}$; $\mathcal{I}$ ($N\times 3$); $\mathcal{P}$ ($N^{\prime}\times 3$).]
Figure 2: The overall architecture of Point-PC, which consists of four main modules: (i) pre-trained partial shape encoder, (ii) memory network, (iii) prior knowledge selection module, and (iv) shape decoder. The pre-trained encoder extracts features from the partial input, which is then fed into memory. The memory network retrieves shape priors with sufficient geometric information. Moreover, the prior knowledge selection module selects useful information from the prior shapes. The shape decoder takes the concatenation of the partial shape feature and the shape prior feature to generate the complete point cloud.

II-B Memory Network

The Memory Network [29] was initially presented in dialog systems to save scene information and realize the functionality of long-term memory. However, the original design of the Memory Network just vectorizes and saves the original text without proper modification, thus limiting the promotion of the model. Further works [30] reinforce the Memory Network so that it can be trained in an end-to-end way. Hierarchical Memory Network [31] stores and searches memory in a hierarchical structure to speed up calculations when implementing large-scale memory. Key-Value Memory Network [32] stores memory slots in a “key-value” pair where the key module is responsible for scoring the degree of correlation between memory and queries, while the value module is responsible for weighting and summing the values of the memory to obtain the output. In our work, we further extend the application of “key-value” structured memory into point cloud completion and reveal its ability to preserve high-quality geometry details through a well-designed pre-training method.
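
As a rough illustration of the key-value read described above, the following minimal sketch scores a query against the keys, normalizes the scores with a softmax, and returns the weighted sum of the values. It is a simplification under stated assumptions: the learned embedding matrices of [32] are omitted, and the function name `kv_memory_read` and the toy arrays are hypothetical.

```python
import numpy as np

def kv_memory_read(query, keys, values):
    """Simplified key-value read (illustrative only, no learned embeddings).

    query:  shape (C,)
    keys:   shape (M, C)
    values: shape (M, D)
    """
    scores = keys @ query                      # key module: relevance of each slot
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    output = weights @ values                  # value module: weighted sum of values
    return output, weights

# Toy memory with two slots (hypothetical data).
keys = np.array([[1.0, 0.0], [0.0, 1.0]])
values = np.array([[1.0, 2.0], [3.0, 4.0]])
output, weights = kv_memory_read(np.array([10.0, 0.0]), keys, values)
```

In this toy case the query matches the first key far more strongly, so the output is dominated by the first value.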

II-C Self-supervised Pre-training

To learn informative representations from invariant 3D features, many previous studies have utilized self-supervised pre-training techniques, incorporating views of the same scene, diverse modalities such as depth and RGB images, or even distinct formats like point clouds and voxels. Contrastive Scene Contexts [33] divides the point cloud of a scene into multiple regions and applies contrastive learning individually within each region. This approach effectively utilizes both point-level correspondences and spatial contexts within the scene. DepthContrast [34] works with single-view depth scans for self-supervised learning and extends the successful MoCo [35] pipeline to jointly train with point and voxel input formats. CLIP2POINT [36] proposes an image-depth pre-training method to align multi-view depth features with CLIP visual features so as to transfer CLIP knowledge to 3D category-level discrimination and adapt it to point cloud classification. We design a partial-complete pre-training scheme based on contrastive learning to bridge the modality difference between partial and complete point clouds. By this means, Point-PC learns fine-grained information association and retrieves similar complete shapes based on the partial point cloud. In addition, we leverage causal inference to remove redundant information from the shape priors.

II-D Causal Inference

Causal inference was first introduced in [37]. Recent research [38] has shown that causal inference is beneficial to various fields in computer vision. VC R-CNN [39] argues that observational bias leads the model to make predictions based on co-occurrence information while ignoring some common-sense causal relationships, and attempts to extract a visual feature that contains common sense through causal intervention. CONTA [40] ascribes the uncertain boundaries of pseudo-masks to confounding context and employs backdoor adjustment to remove the confounding factors. This approach generates improved pixel-level pseudo masks using only image-level labels. IFSL [41] posits that pre-trained knowledge constitutes a confounding factor leading to spurious correlations between sample features and class labels in the support set and employs backdoor adjustment to eliminate this bias. To the best of our knowledge, we are the first to introduce causal inference to point cloud completion. Our approach incorporates a causal feature fusion strategy aimed at mitigating the confounding effect in shape priors. This strategy directs the decoder to focus on causal features, leading to greater robustness in the memory network.

III Our Approach

Figure 2 shows the framework of Point-PC, which includes four main parts: 1) Pre-trained encoder: extracts the feature vector of the partial shape; 2) Memory network: learns the similarity between a partial shape and the corresponding complete shape; 3) Prior knowledge selection module: selects the useful information from the prior complete shapes to help supply the structural information missing from the partial shape; 4) Shape decoder: accepts the concatenated features and predicts the complete point cloud. We detail each of these designs in the following subsections.

III-A Memory Network

The memory network aims to learn the dependency between partial and complete shapes in feature space and to produce the prior shapes. Denote the input set of partial point clouds as $S=\left\{\mathcal{I}_{i}\right\}_{i=1}^{|S|}$, where $\mathcal{I}_{i}\in\mathbb{R}^{N\times 3}$ stores the 3D coordinates of the points of one object, and $N$ is the number of points in a shape. We construct the memory network in a “key-value-age” formation. The “key” and “value” represent complete shape features and the corresponding 3D shapes, respectively. The “age” indicates how long the corresponding “key-value” pair has been established. Therefore, the memory is denoted as $\mathcal{M}=\left\{(K_{i},V_{i},A_{i})\right\}_{i=1}^{|\mathcal{M}|}$, where $|\mathcal{M}|$ is the size of the memory.

Compared with other methods, the memory network utilizes the “key” and “value” to improve the effectiveness of prior shapes. Meanwhile, the “key” and “value” can also be updated by the training data and improve the relevance of obtaining prior information. Next, we will introduce the model update and retrieval process in two parts.

III-A1 Update Strategy

$K_{i}$ is extracted through the pre-trained complete shape encoder from $V_{i}$, which can be denoted as $F^{V_{i}}$. It is worth noting that the updating strategy only works during training because we take the training set as our external knowledge base, which is not available during testing.

We compute the cosine similarity between $F^{\mathcal{I}}$ and $F^{V_{i}}$ to match a “key-value” pair as follows:

$$Sim_{key}\left(F^{\mathcal{I}},F^{V_{i}}\right)=\frac{F^{\mathcal{I}}\cdot F^{V_{i}}}{\left\|F^{\mathcal{I}}\right\|\left\|F^{V_{i}}\right\|}. \tag{1}$$
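
Eq. (1) can be evaluated for all memory slots at once. The sketch below is a minimal NumPy illustration rather than the authors' implementation; the function name, the `eps` guard against zero-norm vectors, and the toy features are assumptions introduced for illustration.

```python
import numpy as np

def cosine_similarity_to_keys(query, keys, eps=1e-12):
    """Cosine similarity (Eq. 1) between one query feature and every key.

    query: shape (C,)   -- partial-shape feature F^I
    keys:  shape (M, C) -- memory keys F^{V_i}
    returns: shape (M,) -- one similarity per memory slot
    """
    query = np.asarray(query, dtype=np.float64)
    keys = np.asarray(keys, dtype=np.float64)
    dots = keys @ query
    norms = np.linalg.norm(keys, axis=1) * np.linalg.norm(query)
    return dots / np.maximum(norms, eps)  # eps avoids division by zero

# Toy example with three 2-D keys (hypothetical data).
keys = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
sims = cosine_similarity_to_keys(np.array([1.0, 1.0]), keys)
best_slot = int(np.argmax(sims))  # index of the best-matching key
```

Because cosine similarity ignores vector length, the third key, which is parallel to the query, receives the highest score regardless of its magnitude.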

To measure whether it is a valid match, we adopt the Chamfer distance [12] as the similarity measurement between the corresponding ground truth $\mathcal{V}$ and the value $V_{i}$ in 3D space. If the Chamfer distance $Sim_{value}$ is lower than a threshold $\delta$ (discussed in Section 4.6), the match is positive; otherwise, it is negative. For a positive match, the value $V_{n_{0}}$ stays unchanged, while the key $F^{V_{n_{0}}}$ is updated as below:

$$F^{V_{n_{0}}}=\frac{F^{\mathcal{I}}+F^{V_{n_{0}}}}{\left\|F^{\mathcal{I}}+F^{V_{n_{0}}}\right\|}, \tag{2}$$

where $n_{0}=\arg\max_{i}Sim_{key}\left(F^{\mathcal{I}},F^{V_{i}}\right)$. Meanwhile, the corresponding age $A_{n_{0}}$ is reset to zero and all the other ages are increased by one. For a negative match, $\mathcal{V}$ is written into the memory and overwrites the oldest slot as follows:

$$K_{n_{1}}=F^{\mathcal{I}},\quad V_{n_{1}}=\mathcal{V}, \tag{3}$$

where $n_{1}=\arg\max_{i}\left(A_{i}\right)$. The ages are updated in the same way as above. In this way, the memory network reinforces its reception ability with similar shapes, saves the unknown shapes, and refreshes the oldest memory slot.
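
The update rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the class name `ShapeMemory`, the squared-distance form of the Chamfer distance, and the toy data are hypothetical, and the positive-match branch normalizes the merged key to unit length.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Assumption: mean squared nearest-neighbor distance in both directions.
    """
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # (N, M) pairwise
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

class ShapeMemory:
    """Hypothetical "key-value-age" memory following the update strategy."""

    def __init__(self, keys, values):
        self.keys = [np.asarray(k, dtype=np.float64) for k in keys]
        self.values = list(values)
        self.ages = [0] * len(self.keys)

    def update(self, feat, gt_shape, delta):
        feat = np.asarray(feat, dtype=np.float64)
        sims = [float(k @ feat) / (np.linalg.norm(k) * np.linalg.norm(feat))
                for k in self.keys]
        n0 = int(np.argmax(sims))  # best-matching key, as in Eq. (1)
        if chamfer_distance(gt_shape, self.values[n0]) < delta:
            # Positive match: merge the feature into the key, keep the value.
            merged = feat + self.keys[n0]
            self.keys[n0] = merged / np.linalg.norm(merged)
            slot = n0
        else:
            # Negative match: overwrite the oldest slot, as in Eq. (3).
            slot = int(np.argmax(self.ages))
            self.keys[slot] = feat
            self.values[slot] = gt_shape
        # Reset the touched slot's age and increment all others.
        self.ages = [0 if i == slot else a + 1
                     for i, a in enumerate(self.ages)]
        return slot

# Toy usage with two slots (hypothetical data).
rng = np.random.default_rng(0)
shape_a = rng.normal(size=(16, 3))
shape_b = shape_a + 10.0
mem = ShapeMemory([np.array([1.0, 0.0]), np.array([0.0, 1.0])],
                  [shape_a, shape_b])
first = mem.update(np.array([0.9, 0.1]), shape_a, delta=0.1)
ages_after_first = list(mem.ages)
new_shape = shape_a + 100.0
second = mem.update(np.array([0.9, 0.1]), new_shape, delta=0.1)
```

In the toy run, the first call is a positive match that refines slot 0, while the second call presents an unseen shape and therefore overwrites the oldest slot.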

Algorithm 1 Update and Query Strategy

Input: partial point cloud feature $F^{\mathcal{I}}$; ground-truth complete shape $\mathcal{V}$ (training only)
Hyper-parameter: similarity threshold $\delta$
Output: shape priors $V_{n_{i}}$

for $i=0$ to $|\mathcal{M}|-1$ do
   Compute $Sim_{key}\left(F^{\mathcal{I}},F^{V_{i}}\right)$ by Eq. 1.
end for
Let $n_{0}=\arg\max_{i}Sim_{key}\left(F^{\mathcal{I}},F^{V_{i}}\right)$.
if $Sim_{value}\left(\mathcal{V},V_{n_{0}}\right)<\delta$ then
   Update $K_{n_{0}}$ by Eq. 2,
   Set $A_{n_{0}}=0$, $A_{i}=A_{i}+1$ $\left(i\neq n_{0}\right)$.
else
   Let $n_{1}=\arg\max_{i}\left(A_{i}\right)$.
   Update $K_{n_{1}}$ and $V_{n_{1}}$ by Eq. 3,
   Set $A_{n_{1}}=0$, $A_{i}=A_{i}+1$ $\left(i\neq n_{1}\right)$.
end if
return $V_{n_{i}}$ by Eq. 4.

III-A2 Query Strategy

We propose a query strategy for obtaining shape priors that are rich in geometric information for completion and very similar to the partial input. These shape priors are the values in the memory, which are complete point clouds. To fix the number of shape priors fed forward, we retrieve $\hat{k}$ shapes through the top-$\hat{k}$ keys with the largest similarity for convenience, which can be formulated as:

$$V=\left[V_{n_{i}}\,\middle|\,n_{i}=\arg\max_{i}Sim_{key}\left(F^{\mathcal{I}},F^{V_{i}}\right)\right]. \tag{4}$$

For a more organized elaboration of the update and query process, we describe the simplified procedure of the memory network in Algorithm 1.
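
The query step can be sketched in a few lines. The following is an illustrative sketch only: the function name `query_shape_priors`, the use of `np.argsort` to obtain the top-$\hat{k}$ slots, and the placeholder shapes are assumptions rather than details given in the paper.

```python
import numpy as np

def query_shape_priors(feat, keys, values, k_hat):
    """Return the k_hat values whose keys are most similar to feat.

    feat:   shape (C,)   -- partial-shape feature
    keys:   shape (M, C) -- memory keys
    values: list of M complete point clouds
    """
    feat = np.asarray(feat, dtype=np.float64)
    keys = np.asarray(keys, dtype=np.float64)
    sims = keys @ feat / (np.linalg.norm(keys, axis=1) * np.linalg.norm(feat))
    top = np.argsort(-sims)[:k_hat]          # slots sorted by similarity
    return [values[i] for i in top], top

# Toy memory: four keys, each value is a placeholder (4, 3) point cloud
# filled with its slot index (hypothetical data).
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
values = [np.full((4, 3), float(i)) for i in range(4)]
priors, slots = query_shape_priors(np.array([1.0, 0.2]), keys, values, k_hat=2)
```

Fixing `k_hat` keeps the number of shape priors passed to the later modules constant, which matches the motivation stated for the top-$\hat{k}$ retrieval.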

III-B Pre-training Scheme

The pre-training scheme aims to minimize the distance between partial point clouds and complete point clouds, as well as enhance the consistency of partial shape features. Given the complete shape denoted as $\mathcal{S}_{i}\in\mathbb{R}^{N\times 3}$, where $N$ is the number of points, we render the corresponding partial ones $\mathcal{I}_{i,n_{1}}$ and $\mathcal{I}_{i,n_{2}}$ in different viewpoints and crop different numbers of $n_{1}$ and $n_{2}$ points. The overall architecture of the contrastive learning-based pre-training scheme is illustrated in Figure 3.
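
One possible way to produce two partial views of the same complete shape is sketched below. The paper does not specify the cropping procedure at this point, so everything here is an assumption for illustration: the function name `crop_partial`, the choice of a random viewpoint on the unit sphere, and the rule of removing the points nearest to that viewpoint.

```python
import numpy as np

def crop_partial(points, n_crop, rng):
    """Remove the n_crop points closest to a random viewpoint (assumed rule).

    points: shape (N, 3) complete point cloud
    returns: shape (N - n_crop, 3) partial point cloud
    """
    view = rng.normal(size=3)
    view /= np.linalg.norm(view)                 # random point on unit sphere
    dist = np.linalg.norm(points - view, axis=1)
    keep = np.argsort(dist)[n_crop:]             # drop the nearest n_crop points
    return points[keep]

rng = np.random.default_rng(0)
complete = rng.normal(size=(2048, 3))            # hypothetical complete shape
partial_1 = crop_partial(complete, 512, rng)     # first view, 512 points removed
partial_2 = crop_partial(complete, 1024, rng)    # second view, 1024 points removed
```

The two outputs come from the same shape but differ in viewpoint and in the number of removed points, which is the kind of positive pair the intra-modality loss operates on.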

III-B1 Intra-modality Learning

Suppose that i,n1subscript𝑖subscript𝑛1\mathcal{I}_{i,n_{1}}caligraphic_I start_POSTSUBSCRIPT italic_i , italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT and i,n2subscript𝑖subscript𝑛2\mathcal{I}_{i,n_{2}}caligraphic_I start_POSTSUBSCRIPT italic_i , italic_n start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT are fed into the partial shape encoder EKsubscript𝐸𝐾E_{K}italic_E start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT to extract features Fi,n1K,Fi,n2K1×Csuperscriptsubscript𝐹𝑖subscript𝑛1𝐾superscriptsubscript𝐹𝑖subscript𝑛2𝐾superscript1𝐶F_{i,n_{1}}^{K},F_{i,n_{2}}^{K}\in\mathbb{R}^{1\times C}italic_F start_POSTSUBSCRIPT italic_i , italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT , italic_F start_POSTSUBSCRIPT italic_i , italic_n start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT 1 × italic_C end_POSTSUPERSCRIPT, where C𝐶Citalic_C is the feature dimension. Following the NT-Xent loss in SimCLR [42], given a positive pair (Fi,n1Ksuperscriptsubscript𝐹𝑖subscript𝑛1𝐾F_{i,n_{1}}^{K}italic_F start_POSTSUBSCRIPT italic_i , italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT, Fi,n2Ksuperscriptsubscript𝐹𝑖subscript𝑛2𝐾F_{i,n_{2}}^{K}italic_F start_POSTSUBSCRIPT italic_i , italic_n start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT), we treat the other 2(N1)2𝑁12(N-1)2 ( italic_N - 1 ) examples within a minibatch as negative examples, where N𝑁Nitalic_N is the size of the minibatch. The intra-modality contrastive loss intrasubscript𝑖𝑛𝑡𝑟𝑎\mathcal{L}_{intra}caligraphic_L start_POSTSUBSCRIPT italic_i italic_n italic_t italic_r italic_a end_POSTSUBSCRIPT can be formulated as:

$$l_{\text{intra}}(i; n_1, n_2) = -\log \frac{Sim_{pos}(i; n_1, n_2)}{Sim_{neg}(i; n_1, n_2)}, \tag{5}$$
$$\mathcal{L}_{\text{intra}} = \frac{1}{2N} \sum_{i=1}^{N} \left( l_{\text{intra}}(i; n_1, n_2) + l_{\text{intra}}(i; n_2, n_1) \right), \tag{6}$$

where $Sim_{pos}(i; n_1, n_2)$ and $Sim_{neg}(i; n_1, n_2)$ are the positive and negative similarity terms, built from the cosine similarity $sim(\cdot, \cdot)$ between features of partial inputs with different incomplete patterns. They are defined as follows:

$$\begin{aligned} Sim_{pos}(i; n_1, n_2) &= \exp\left(sim\left(F_{i,n_1}^{K}, F_{i,n_2}^{K}\right)/\tau\right), \\ Sim_{neg}(i; n_1, n_2) &= \sum_{j=1}^{N} \mathbb{I}_{[j \neq i]} \exp\left(sim\left(F_{i,n_1}^{K}, F_{j,n_1}^{K}\right)/\tau\right) \\ &\quad + \sum_{j=1}^{N} \exp\left(sim\left(F_{i,n_1}^{K}, F_{j,n_2}^{K}\right)/\tau\right), \end{aligned} \tag{7}$$

where $\mathbb{I}_{[j \neq i]} \in \{0, 1\}$ is an indicator function that evaluates to 1 if $j \neq i$, and $\tau$ is the temperature parameter, which we set to 0.1.
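As an illustration, Eqs. 5 to 7 can be sketched in NumPy as follows. This is a minimal sketch, not the authors' implementation: the function names, the random features, and the batch size of 8 are assumptions made for the example, and only $\tau = 0.1$ is taken from the text.

```python
import numpy as np

def cosine_sim_matrix(a, b):
    """Pairwise cosine similarity between the rows of a (N, C) and b (N, C)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def l_intra(f_a, f_b, tau=0.1):
    """Per-sample loss l_intra(i; a, b) of Eq. 5 for every i, shape (N,)."""
    n = f_a.shape[0]
    same = np.exp(cosine_sim_matrix(f_a, f_a) / tau)   # exp(sim(F_{i,a}, F_{j,a}) / tau)
    cross = np.exp(cosine_sim_matrix(f_a, f_b) / tau)  # exp(sim(F_{i,a}, F_{j,b}) / tau)
    sim_pos = np.diag(cross)                           # Eq. 7, first line
    off_diag = 1.0 - np.eye(n)                         # indicator [j != i]
    sim_neg = (same * off_diag).sum(axis=1) + cross.sum(axis=1)  # Eq. 7, second line
    return -np.log(sim_pos / sim_neg)

def intra_loss(f_n1, f_n2, tau=0.1):
    """Symmetrised batch loss of Eq. 6."""
    n = f_n1.shape[0]
    return (l_intra(f_n1, f_n2, tau) + l_intra(f_n2, f_n1, tau)).sum() / (2 * n)

# Hypothetical features: two views of the same 8 shapes, 16-dimensional.
rng = np.random.default_rng(0)
f1 = rng.normal(size=(8, 16))
f2 = f1 + 0.05 * rng.normal(size=(8, 16))
aligned = intra_loss(f1, f2)                 # views paired correctly
shuffled = intra_loss(f1, f2[::-1].copy())   # views paired with the wrong shape
```

With correctly paired views the loss is small, and it grows when the pairing is broken, which is the behavior the intra-modality term is meant to enforce.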

III-B2 Cross-modality Learning

Considering that the partial shape features should remain consistent with the complete shape features, for each $\mathcal{S}_i$ we extract features $F_i^{V} \in \mathbb{R}^{1 \times C}$ with the complete shape encoder $E_V$. Together with the partial shape features $F_i^{K}$, the cross-modality contrastive loss $\mathcal{L}_{cross}$ is defined as follows:

$$l_{\text{cross}}(i; K, V) = -\log \frac{Sim_{pos}(i; K, V)}{Sim_{neg}(i; K, V)}, \tag{8}$$
$$\mathcal{L}_{\text{cross}} = \frac{1}{2N} \sum_{i=1}^{N} \left( l_{\text{cross}}(i; K, V) + l_{\text{cross}}(i; V, K) \right), \tag{9}$$

where $Sim_{pos}(i; K, V)$ and $Sim_{neg}(i; K, V)$ represent the positive and negative similarity terms between the partial and complete shape features. The overall pre-training loss $\mathcal{L}_{pre}$ is the sum of the intra-modality and cross-modality losses: $\mathcal{L}_{pre} = \mathcal{L}_{intra} + \mathcal{L}_{cross}$.
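The cross-modality term of Eqs. 8 and 9 can be sketched in the same way. The text does not spell out $Sim_{neg}(i; K, V)$, so the sketch below assumes the usual NT-Xent denominator (the sum over all complete-shape features in the minibatch) and, following the caption of Figure 3, assumes that $F^{K}$ is the average of the two partial features. All names and data are hypothetical.

```python
import numpy as np

def _normalize(x):
    """Scale every row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def l_cross(f_a, f_b, tau=0.1):
    """l_cross(i; a, b) of Eq. 8 for every i.

    Assumption: Sim_neg is the sum over all j of exp(sim(F_i^a, F_j^b) / tau).
    """
    sim = np.exp(_normalize(f_a) @ _normalize(f_b).T / tau)
    return -np.log(np.diag(sim) / sim.sum(axis=1))

def cross_loss(f_k, f_v, tau=0.1):
    """Symmetrised batch loss of Eq. 9."""
    n = f_k.shape[0]
    return (l_cross(f_k, f_v, tau) + l_cross(f_v, f_k, tau)).sum() / (2 * n)

# Hypothetical features for 8 shapes with 16 dimensions each.
rng = np.random.default_rng(0)
f_n1 = rng.normal(size=(8, 16))
f_n2 = f_n1 + 0.05 * rng.normal(size=(8, 16))
f_k = 0.5 * (f_n1 + f_n2)                    # averaged partial feature (assumed)
f_v = f_k + 0.05 * rng.normal(size=(8, 16))  # complete-shape feature

matched = cross_loss(f_k, f_v)
mismatched = cross_loss(f_k, f_v[::-1].copy())
```

Under these assumptions the pre-training objective would be obtained by adding this value to the intra-modality loss.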

[Figure 3 image; overlaid labels: $F^{V}$, $F^{K}_{n1}$, $F^{K}_{n2}$, $F^{K}$]
Figure 3: Architecture of contrastive learning-based pre-training scheme. We adopt one NT-Xent loss between pairs of partial shape features extracted from the partial shape encoder and the other between complete shape features and average partial shape features.

III-C Prior Knowledge Selection Module

We exploit causal theory [43] to uncover the true causal relationships between the extracted features and the generated 3D shapes. The causal graph is shown in Figure 4.

We list the following explanations for the causalities among the four variables shown in Figure 4:

  • $M \rightarrow I$. Since the retrieved shapes share the same semantic structures as the partial inputs, this causal effect is naturally established.

  • $I \rightarrow C \leftarrow M$. The variable $C$ denotes the causal feature that is truly responsible for the completion result. We not only keep the original part $I$ but also add $M$ as supplementary information.

  • $C \rightarrow Y$. This causality reflects the intrinsic association between the feature space and the 3D coordinate space.

Investigating the causal graph above, we recognize a backdoor path between $M$ and $I$, i.e., $M \rightarrow I$, wherein $M$ plays the role of a confounder between $I$ and $C$. This backdoor path causes $I$ to form a spurious correlation with $Y$ even though $I$ is not the only variable linked to $Y$, which results in low-quality generated shapes. Therefore, it is vital to cut off the backdoor path.

Figure 4: Causal graph for Causal Feature Selection Module. Circles represent variables, and arrows represent causal relationships from one variable to another.

III-C1 Backdoor Adjustment

Instead of modeling the confounded $P(Y|I)$ in Figure 4, we need to eliminate the backdoor path. According to causal theory, we utilize the do-calculus technique on variable $M$ to eliminate the backdoor path by estimating $P_B(Y|I) = P(Y|do(I))$, which stratifies the confounder $M$. We then obtain the following derivations:

  • The features extracted from memory are not affected by cutting off the backdoor path. Thus, $P(m) = P_B(m)$.

  • $C$ has nothing to do with the causal effect between the variables $M$ and $I$, so we get $P_B(C|I,m) = P(C|I,m)$.

  • After the causal intervention, the variable $m$ is independent of $I$, so we have $P_B(m) = P_B(m|I)$.

$B$ refers to the case when the backdoor path is cut off, and $m \in M$ denotes the confounder sets. Driven by the derivations above, the backdoor adjustment for Figure 4 can be written as:

$$\begin{aligned} P(Y \mid do(I)) &= \sum_{m \in M} P_B(Y|I,m) P_B(m|I) \\ &= \sum_{m \in M} P_B(Y|I,m) P_B(m) \\ &= \sum_{m \in M} P(Y|I,m) P(m), \end{aligned} \tag{10}$$

where $P(Y|I,m)$ represents the conditional probability given the partial shape feature and the confounder, and $P(m)$ is the prior probability of the confounder.
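To make the difference between the confounded $P(Y|I)$ and the adjusted $P(Y|do(I))$ concrete, consider a toy example with a binary confounder. The probabilities below are invented purely for illustration and have no connection to the experiments. The observational estimate weights each stratum by $P(m|I)$, whereas the backdoor adjustment of Eq. 10 weights it by the prior $P(m)$.

```python
import numpy as np

# Hypothetical distribution over a binary confounder m in {0, 1}.
p_m = np.array([0.5, 0.5])               # P(m)
p_i1_given_m = np.array([0.9, 0.1])      # P(I=1 | m): m strongly influences I
p_y1_given_i1_m = np.array([0.8, 0.2])   # P(Y=1 | I=1, m)

# Observational estimate: strata weighted by P(m | I=1), via Bayes' rule.
p_m_given_i1 = p_i1_given_m * p_m / (p_i1_given_m * p_m).sum()
p_obs = float((p_y1_given_i1_m * p_m_given_i1).sum())

# Backdoor adjustment (Eq. 10): strata weighted by the prior P(m).
p_do = float((p_y1_given_i1_m * p_m).sum())

print(f"P(Y=1 | I=1)     = {p_obs:.2f}")
print(f"P(Y=1 | do(I=1)) = {p_do:.2f}")
```

In this toy setting the observational estimate is 0.74 while the adjusted one is 0.50; the gap is the spurious correlation introduced through the path $M \rightarrow I$.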

III-C2 Module Design

Driven by Eq. 10, we design the causal feature selection module to alleviate the confounding effect in shape priors. Our idea is to stratify the confounder and pair the partial shape feature with every stratum. To achieve this, we perform an implicit intervention through feature-wise stratification. Let $\mathcal{H}$ be the index set comprising the dimensions of the concatenated shape prior feature obtained from the final layer of the shape prior encoder. We divide $\mathcal{H}$ into $n$ disjoint subsets of equal size. For instance, if the output feature dimension of the shape prior encoder is 384 and we choose the top-3 shape priors with $n = 6$, the concatenated feature has $3 \times 384 = 1152$ dimensions, so each subset is a feature dimension index set of size $1152 / 6 = 192$. In other words, $\mathcal{H}_m$ can be represented as $\{192(m-1)+1, \ldots, 192m\}$.

  • $P(Y|I,m) = P_\phi(Y|cat(F_I, [F_V]_c))$, where $F_I$ and $F_V$ are the partial shape feature and the concatenated shape prior feature, respectively. $[F_V]_c$ refers to a feature selector that extracts the dimensions of $F_V$ indexed by the set $c$. Note that $c$ is defined as $c = \{k \mid k \in \mathcal{H}_m \cap \mathcal{S}_t\}$, where $\mathcal{S}_t$ is the index set of the dimensions of $F_V$ whose absolute values exceed the threshold $t$. Additionally, $\phi$ denotes the parameters of the shape decoder.

  • $P(m) = 1/n$, where we assume that each adjusted feature has an equal prior probability, calculated as the reciprocal of the number of confounder sets $n$.

Thus, the overall feature-wise adjustment is:

$$P(Y \mid do(I)) = \frac{1}{n} \sum_{m \in M} P_\phi(Y|cat(F_I, [F_V]_c)). \tag{11}$$
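The feature-wise stratification can be sketched as follows. This is a minimal illustration under stated assumptions: indices are 0-based (the text uses 1-based indices), the threshold $t = 0.5$, the 384-dimensional partial feature, and the random features are hypothetical, and the function names are not taken from the paper.

```python
import numpy as np

def stratified_index_sets(f_v, n, t):
    """Return the n index sets c = H_m ∩ S_t (0-based indices)."""
    d = f_v.shape[0]
    if d % n != 0:
        raise ValueError("feature dimension must be divisible by n")
    strata = np.arange(d).reshape(n, d // n)   # H_1, ..., H_n: equal-size, disjoint
    above = np.abs(f_v) > t                    # membership mask of S_t
    return [h[above[h]] for h in strata]

def adjusted_inputs(f_i, f_v, n, t):
    """Build cat(F_I, [F_V]_c) for every stratum m."""
    return [np.concatenate([f_i, f_v[c]])
            for c in stratified_index_sets(f_v, n, t)]

rng = np.random.default_rng(0)
f_i = rng.normal(size=384)        # partial shape feature (dimension assumed)
f_v = rng.normal(size=3 * 384)    # top-3 shape priors, concatenated: 1152 dims
sets = stratified_index_sets(f_v, n=6, t=0.5)
inputs = adjusted_inputs(f_i, f_v, n=6, t=0.5)
```

Each of the six strata covers 192 consecutive dimensions, and only the dimensions whose magnitude exceeds $t$ are kept. Note that the selected vectors have varying lengths in this sketch; a decoder with a fixed input size would need an equivalent masking step instead, which is a design choice the text does not specify.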

To optimize $\phi$ in Eq. 11, we propose a slightly modified L1 Chamfer Distance loss guided by the backdoor adjustment. Let $\mathcal{G}$ denote the high-resolution ground truth and $\mathcal{P}$ denote the completed prediction. The loss $\mathcal{L}_{caus}$ can be written as:

$$\mathcal{P} = \Phi(cat(F_I, [F_V]_c)), \tag{12}$$
$$\mathcal{L}_{caus} = \frac{1}{n} \sum_{m \in M} \left( \mathrm{CD}\text{-}\ell_1(\mathcal{P}, \mathcal{G}) \right), \tag{13}$$

where $\Phi$ represents the shape decoder, and $cat(\cdot, \cdot)$ denotes the concatenation operation. Eq. 13 encourages the model to generate predictions for the intervened partial-complete probability that remain consistent and stable across the stratification groups, owing to the shared causal features.

Following existing work [17], we use the L1 Chamfer Distance [44] as a quantitative measure of output quality. Apart from generating $\mathcal{P}$, Point-PC also predicts local centers $\mathcal{C}$ of the completed point cloud. For each prediction, the L1 Chamfer Distance loss between the center point set and the ground truth $\mathcal{G}$ is calculated as:

$$\mathcal{L}_{recon} = \frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} \min_{g \in \mathcal{G}} \|c - g\|_1 + \frac{1}{|\mathcal{G}|} \sum_{g \in \mathcal{G}} \min_{c \in \mathcal{C}} \|g - c\|_1. \tag{14}$$

The final objective function is defined as a weighted sum of the two losses: $\mathcal{L} = \lambda \mathcal{L}_{caus} + (1 - \lambda) \mathcal{L}_{recon}$, where $\lambda$ is a hyperparameter that controls the contribution of each loss to the optimization process.
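A brute-force sketch of Eqs. 13 and 14 and of the final objective is given below. It follows Eq. 14 literally, taking the $\ell_1$ norm of each point difference; the value $\lambda = 0.5$ and all function names are assumptions for illustration, and the dense distance matrix is only practical for small point sets.

```python
import numpy as np

def chamfer_l1(a, b):
    """Eq. 14 for point sets a (Na, 3) and b (Nb, 3), with the l1 norm per pair."""
    dist = np.abs(a[:, None, :] - b[None, :, :]).sum(axis=-1)   # (Na, Nb)
    return dist.min(axis=1).mean() + dist.min(axis=0).mean()

def total_loss(preds, centers, gt, lam=0.5):
    """lam * L_caus + (1 - lam) * L_recon; lam = 0.5 is an assumed value.

    preds   : list of n predictions, one per stratum m (Eq. 12)
    centers : predicted local centers C
    gt      : ground-truth point cloud G
    """
    l_caus = np.mean([chamfer_l1(p, gt) for p in preds])   # Eq. 13
    l_recon = chamfer_l1(centers, gt)                      # Eq. 14
    return lam * l_caus + (1.0 - lam) * l_recon

# Tiny hand-checkable case: one point each, l1 distance |1| + |2| + |3| = 6.
a = np.array([[0.0, 0.0, 0.0]])
b = np.array([[1.0, 2.0, 3.0]])
d_ab = chamfer_l1(a, b)   # 6 in each direction, 12 in total
```

Averaging the per-stratum distances in `l_caus` mirrors the factor $1/n$ in Eq. 13, which corresponds to the uniform prior $P(m) = 1/n$.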

IV Experiment

In this section, we first introduce the implementation details. Then we discuss the evaluation experiments on mainstream benchmarks. We also visualize and analyze the results of both our method and several baseline methods. Finally, we provide the ablation study of our method.

IV-A Results on ShapeNet-55

IV-A1 Data.

ShapeNet-55 contains 55 categories of synthetic objects derived from the ShapeNet dataset. It was first released with PoinTr [17] to improve sample diversity and breadth. The complete point clouds are obtained by randomly sampling 8,192 points, with 41,952 point cloud models used for training and the remaining 10,518 models for testing. The partial point clouds are generated by removing the points farthest from a pre-fixed viewpoint and keeping 2,048 points through farthest point sampling (FPS).
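The partial-input generation described above can be sketched as follows. This is an illustrative reading of the procedure rather than the benchmark's reference code: the viewpoint, the number of removed points, and the random point cloud are hypothetical, and the FPS routine is a simple greedy version.

```python
import numpy as np

def remove_farthest(points, viewpoint, n_remove):
    """Drop the n_remove points that lie farthest from the viewpoint."""
    d = np.linalg.norm(points - viewpoint, axis=1)
    keep = np.argsort(d)[: points.shape[0] - n_remove]
    return points[keep]

def farthest_point_sample(points, k, start=0):
    """Greedy FPS: repeatedly pick the point farthest from those already chosen."""
    chosen = np.empty(k, dtype=int)
    chosen[0] = start
    min_d = np.linalg.norm(points - points[start], axis=1)
    for s in range(1, k):
        chosen[s] = int(np.argmax(min_d))
        new_d = np.linalg.norm(points - points[chosen[s]], axis=1)
        min_d = np.minimum(min_d, new_d)   # distance to the nearest chosen point
    return points[chosen]

rng = np.random.default_rng(0)
complete = rng.uniform(-1.0, 1.0, size=(8192, 3))   # stand-in for a complete shape
viewpoint = np.array([1.0, 1.0, 1.0])               # hypothetical viewpoint
cropped = remove_farthest(complete, viewpoint, n_remove=4096)
partial = farthest_point_sample(cropped, k=2048)
```

FPS spreads the 2,048 retained points evenly over the remaining surface, so the network input has a fixed size regardless of how many points were removed.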

IV-A2 Quantitative Evaluation.

Following the evaluation setting in [17], 8 specific viewpoints are chosen, and the partial point cloud is configured to contain 2,048, 4,096, or 6,144 points, which correspond to 25%, 50%, and 75% of the total points in the entire point cloud, respectively. In this way, we divide the testing stage into three difficulty levels: simple, moderate, and hard (denoted as CD-S, CD-M, and CD-H). As shown in Table I, Point-PC achieves an average CD-$\ell_2$ of 0.83 (multiplied by 1000) and an F-Score@1% of 0.479 on ShapeNet-55, outperforming the compared SOTA methods. Point-PC achieves CD-$\ell_2$ values of 0.45, 0.72, and 1.32 under the three difficulty settings, respectively. The increments of CD-$\ell_2$ relative to CD-S under the CD-M (+0.27) and CD-H (+0.87) settings also show that Point-PC deals better with diverse incompleteness levels and diverse incomplete patterns. Furthermore, we report the results on categories with sufficient (first 5 columns) and insufficient (following 5 columns) training samples. Point-PC performs better than all previous SOTA methods on these categories despite the training sample imbalance. The quantitative results on ShapeNet-55 demonstrate that Point-PC is capable of generating complete point clouds under diverse settings.
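The derived numbers quoted for Point-PC can be reproduced from its three per-setting scores. The short check below assumes that CD-Avg is the plain mean of CD-S, CD-M, and CD-H, which is consistent with the Point-PC row of Table I.

```python
# Point-PC scores from Table I (CD-l2, multiplied by 1000).
cd_s, cd_m, cd_h = 0.45, 0.72, 1.32

inc_m = round(cd_m - cd_s, 2)                 # increment of CD-M over CD-S
inc_h = round(cd_h - cd_s, 2)                 # increment of CD-H over CD-S
cd_avg = round((cd_s + cd_m + cd_h) / 3, 2)   # assumed: mean of the three settings

print(inc_m, inc_h, cd_avg)
```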

Methods Table Chair Airplane Car Sofa Birdhouse Bag Remote Keyboard Rocket CD-S CD-M CD-H CD-Avg ↓ F1 ↑
FoldingNet 2.53 2.81 1.43 1.98 2.48 4.71 2.79 1.44 1.24 1.48 2.67 2.66(-0.01) 4.05(+1.38) 3.12 0.082
PCN 2.13 2.29 1.02 1.85 2.06 4.5 2.86 1.33 0.89 1.32 1.94 1.96(+0.02) 4.08(+2.14) 2.66 0.133
TopNet 2.21 2.53 1.14 2.18 2.36 4.83 2.93 1.49 0.95 1.32 2.26 2.16(-0.10) 4.3(+2.04) 2.91 0.126
PFNet 3.95 4.24 1.81 2.53 3.34 6.21 4.96 2.91 1.29 2.36 3.83 3.87(+0.04) 7.97(+4.14) 5.22 0.339
GRNet 1.63 1.88 1.02 1.64 1.72 2.97 2.06 1.09 0.89 1.03 1.35 1.71(+0.36) 2.85(+1.5) 1.97 0.238
PoinTr 0.81 0.95 0.44 0.91 0.79 1.86 0.93 0.53 0.38 0.57 0.58 0.88(+0.30) 1.79(+1.21) 1.09 0.464
SeedFormer 0.72 0.81 0.4 0.89 0.71 1.51 0.79 0.46 0.36 0.5 0.5 0.77(+0.27) 1.49(+0.99) 0.92 0.472
Point-PC 0.69 0.77 0.38 0.84 0.64 1.37 0.7 0.42 0.31 0.46 0.45 0.72(+0.27) 1.32(+0.87) 0.83 0.479
TABLE I: Quantitative results of our method and several baselines on ShapeNet-55. Detailed results for each method on 10 selected categories are reported, as well as the overall results on 55 categories. CD-S, CD-M, and CD-H represent the CD-$\ell_2$ results under the simple, moderate, and hard settings, respectively. Numbers in parentheses represent increments of the CD-$\ell_2$ results compared to the results under the CD-S setting. We show the best results in bold.
Figure 5: Qualitative results on ShapeNet-55 benchmark. Each of the methods mentioned above uses the examples in the first row as input to create complete point clouds. Our approach, however, fills in these inputs with a more uniform outline, which distinctly demonstrates the efficacy of our approach.

IV-A3 Qualitative Evaluation.

The qualitative comparison results are shown in Figure 5. The proposed Point-PC recovers fine details better than the other methods. For example, in the bottle category, Point-PC predicts a smoother and more regular structure along the bottle edges. Moreover, Point-PC retains the original details of the partial shapes. In the fifth column of Figure 5, Point-PC not only completes the incomplete lamp bracket with a clear structure but also keeps the texture of the lamp shade, which makes its completion more plausible than those of the other methods. Consequently, Point-PC can effectively learn the geometric information from the existing partial shape, retrieve similar shape priors based on the learned information, and reconstruct complete shapes with more regular arrangements and smoother surfaces.

IV-B Results on ShapeNet-34

IV-B1 Data.

ShapeNet-34 uses a subset of 34 categories for training and reserves the remaining 21 categories, unseen during training, for testing. We use this setup to assess how well Point-PC generalizes to novel objects from categories that were not encountered during the training stage.

IV-B2 Results.

In Table II, we present the metrics for the 34 seen categories and the 21 unseen categories under three levels of difficulty, evaluated as CD-$\ell_2$ (multiplied by 1000) and F-Score@1%. For Point-PC, we observe relatively small gaps between the results on the 34 seen categories and the 21 unseen categories under each difficulty setting, which demonstrates the benefit of the shape priors offered by the memory network. We also provide a visual comparison with GRNet in Figure 6 to show the effectiveness of Point-PC on the unseen categories.

Methods 34 seen categories 21 unseen categories
CD-S CD-M CD-H CD-Avg ↓ F1 ↑ CD-S CD-M CD-H CD-Avg ↓ F1 ↑
FoldingNet 1.86 1.81 3.38 2.35 0.139 2.76 2.74 5.36 3.62 0.095
PCN 1.87 1.81 2.97 2.22 0.154 3.17 3.08 5.29 3.85 0.101
TopNet 1.77 1.61 3.54 2.31 0.171 2.62 2.43 5.44 3.5 0.121
PFNet 3.16 3.19 7.71 4.68 0.347 5.29 5.87 13.33 8.16 0.322
GRNet 1.26 1.39 2.57 1.74 0.251 1.85 2.25 4.87 2.99 0.216
PoinTr 0.76 1.05 1.88 1.23 0.421 1.04 1.67 3.44 2.05 0.384
SeedFormer 0.48 0.7 1.3 0.83 0.452 0.61 1.07 2.35 1.34 0.402
Point-PC 0.42 0.55 1.16 0.71 0.464 0.57 0.92 2.09 1.19 0.418
TABLE II: Quantitative results on ShapeNet-34 evaluated as CD-$\ell_2$ (multiplied by 1000) and F-Score@1%.
Figure 6: Qualitative results on objects belonging to new categories that were not included in the training dataset. The input partial point cloud, the ground truth, and the predictions made by GRNet and Point-PC are all illustrated.

IV-C Results on PCN

IV-C1 Data.

The PCN dataset, a prominent benchmark for point cloud completion, encompasses shapes from 8 different categories. Each shape’s ground truth consists of 16,384 points that are uniformly sampled from the mesh surface, while the partial input is formed from 2,048 points derived from 8 randomly chosen viewpoints. To evaluate our method against established benchmarks and to conduct a comparative analysis with leading-edge techniques, we performed experiments on the PCN dataset. We adhered to the standard procedures and evaluation criteria as outlined in [45, 46, 15] for our model analysis and ablation studies.

Methods Air Cab Car Cha Lam Sof Tab Wat Avg ↓
FoldingNet 9.49 15.8 12.61 15.55 16.41 15.97 13.65 14.99 14.31
AtlasNet 6.37 11.94 10.1 12.06 12.37 12.99 10.33 10.61 10.85
PCN 5.5 22.7 10.63 8.7 11 11.34 11.68 8.59 9.64
TopNet 7.61 13.31 10.9 13.82 14.44 14.78 11.22 11.12 12.15
MSN 5.6 11.9 10.3 10.2 10.7 11.6 9.6 9.9 10
GRNet 6.45 10.37 9.45 9.41 7.96 10.51 8.44 8.04 8.83
CRN 4.79 9.97 8.31 9.49 8.94 10.69 7.81 8.05 8.51
NSFA 4.76 10.18 8.63 8.53 7.03 10.53 7.35 7.48 8.06
PMP-Net 5.65 11.24 9.64 9.51 6.95 10.83 8.72 7.25 8.73
PoinTr 4.75 10.47 8.68 9.39 7.75 10.93 7.78 7.29 8.38
SnowflakeNet 4.29 9.16 8.08 7.89 6.07 9.23 6.55 6.4 7.21
SeedFormer 3.85 9.05 8.06 7.06 5.21 8.85 6.05 5.85 6.74
Point-PC 3.73 8.97 7.79 6.89 5.01 8.45 5.82 5.64 6.62
TABLE III: Quantitative results on the PCN dataset. We report detailed results on each category and the average results under the CD-$\ell_1$ (multiplied by 1000) metric.

IV-C2 Results.

We evaluate Point-PC on the PCN dataset together with several SOTA methods. The CD-$\ell_1$ metric between the completed shapes and the ground truth is reported in Table III. Our method achieves the best result in every category. In terms of average CD-$\ell_1$, Point-PC achieves the best score of 6.62, outperforming the SOTA competitors.

IV-D Results on KITTI Benchmark

IV-D1 Data.

The KITTI dataset [13] consists of a series of scans from real-world environments, from which 2,401 partial car models have been extracted using 3D bounding boxes. Note that these partial point clouds do not have corresponding complete point clouds to serve as ground truth. We follow the standard protocol to fine-tune our model on ShapeNetCars [12] and evaluate it on KITTI with the metrics of Fidelity Distance and MMD (Minimal Matching Distance). We use KITTI to mimic a real-world situation where point clouds are sparse and irregular.

IV-D2 Results.

In Table IV, we present the results for both the Fidelity and MMD metrics. Fidelity quantifies the mean distance from points in the input to their closest counterparts in the output, reflecting the degree to which the input has been preserved in the completed model. MMD is the Chamfer Distance between the completed output and the nearest ground truth within the ShapeNetCars dataset, and measures the similarity of the reconstructed model to a prototypical car shape. As shown in Table IV, Point-PC achieves a Fidelity of 0.136 and an MMD of 0.509, which is the lowest MMD among the compared methods and a Fidelity second only to the 0 reported for PoinTr, indicating better generalization ability than previous methods. Qualitative results are shown in Figure 7, which illustrates that Point-PC predicts general structures even if the input is severely sparse and supports the necessity of prior knowledge for guiding point cloud completion in realistic scenarios.
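The two metrics can be sketched directly from the definitions above. The sketch assumes unsquared Euclidean distances and a symmetric Chamfer Distance given by the sum of the two directional means; the benchmark code may differ in these conventions, and the data below is random and purely illustrative.

```python
import numpy as np

def _pairwise(a, b):
    """Euclidean distances between every point of a (Na, 3) and b (Nb, 3)."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def fidelity(partial_input, output):
    """Mean distance from each input point to its nearest output point."""
    return _pairwise(partial_input, output).min(axis=1).mean()

def chamfer(a, b):
    """Symmetric Chamfer Distance (sum of both directional means; assumed form)."""
    d = _pairwise(a, b)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def mmd(output, reference_shapes):
    """Chamfer Distance to the closest shape in a reference set."""
    return min(chamfer(output, ref) for ref in reference_shapes)

rng = np.random.default_rng(0)
inp = rng.normal(size=(32, 3))           # hypothetical partial scan
extra = rng.normal(size=(64, 3))         # hypothetical generated points
out = np.concatenate([inp, extra])       # output that keeps the input verbatim

fid_kept = fidelity(inp, out)            # input fully preserved
fid_dropped = fidelity(inp, extra)       # input not preserved
```

An output that contains the input points unchanged has a Fidelity of exactly zero under this definition, which is one way a method can reach a Fidelity of 0.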

Metric       FoldingNet  AtlasNet  PCN    TopNet  MSN    PFNet  GRNet  PoinTr  SeedFormer  Point-PC
Fidelity ↓   7.467       1.759     2.235  5.354   0.434  1.137  0.816  0.000   0.151       0.136
MMD ↓        0.537       2.108     1.366  0.636   2.259  0.792  0.568  0.526   0.516       0.509
TABLE IV: Quantitative results on the KITTI dataset using the metrics of Fidelity Distance and MMD (Minimal Matching Distance), where lower values indicate better performance.
Figure 7: Qualitative results on the KITTI dataset, showing the input partial point cloud and the predictions made by GRNet and Point-PC. To better illustrate the car's shape, two distinct views are provided for each object.

IV-E Model Design Analysis

To examine the effectiveness of our designs, we conduct detailed ablation studies. Comparing settings A and B, we observe that the memory network brings a significant performance boost, demonstrating that the prior knowledge effectively compensates for the missing structural information and improves the reconstruction results. Setting C only utilizes the pre-training scheme to learn the partial shape encoder; without the prior information introduced in our paper, the improvement is limited. However, the pre-training scheme helps the memory network retrieve more similar prior shapes, which leads to a boost in setting D compared to setting B. The causal feature selection module extracts useful structural information from the prior shapes provided by the memory network, so it must be used together with the memory network, as in setting E. Comparing settings D and F shows that the causal feature selection module further improves reconstruction accuracy. As shown in Table V, the memory network improves CD-$\ell_1$ and F-Score@1% by 4.84 and 0.432, respectively (setting B vs. setting A), while the causal feature selection module improves CD-$\ell_1$ and F-Score@1% by 1.37 and 0.132, respectively (setting E vs. setting B). The pre-training scheme enhances the partial shape encoder so that the memory network retrieves more similar prior shapes; as an auxiliary design, however, its gain on its own is limited. The ablation study demonstrates the effectiveness of the key components in Point-PC.

Setting  Memory Network  Pretrain Scheme  Causal Feature Selection  CD-$\ell_1$ ↓  F-Score@1% ↑
A        -               -                -                         15.37          0.109
B        ✓               -                -                         10.53          0.541
C        -               ✓                -                         14.01          0.145
D        ✓               ✓                -                         8.22           0.723
E        ✓               -                ✓                         9.16           0.673
F        ✓               ✓                ✓                         6.62           0.759
TABLE V: Ablation study on the PCN dataset. We investigate different designs including the Memory Network, the pre-training scheme, and the causal feature selection module.
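As a quick arithmetic check on the gains quoted in the ablation discussion, the following sketch recomputes them from the Table V numbers. The pairing of settings (B vs. A for the memory network, E vs. B for the causal feature selection module) is inferred from which differences reproduce the quoted values; the helper name `gain` is hypothetical.

```python
# (CD-l1, F-Score@1%) for each ablation setting, copied from Table V.
RESULTS = {
    "A": (15.37, 0.109),
    "B": (10.53, 0.541),
    "C": (14.01, 0.145),
    "D": (8.22, 0.723),
    "E": (9.16, 0.673),
    "F": (6.62, 0.759),
}

def gain(base, variant):
    """Return (CD reduction, F-Score increase) when moving from base to variant."""
    cd_base, f_base = RESULTS[base]
    cd_var, f_var = RESULTS[variant]
    return round(cd_base - cd_var, 2), round(f_var - f_base, 3)

print(gain("A", "B"))  # memory network alone
print(gain("B", "E"))  # adding causal feature selection to the memory network
print(gain("D", "F"))  # adding causal feature selection on top of pre-training
```

The first two calls reproduce the 4.84 / 0.432 and 1.37 / 0.132 figures quoted in the text.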
Shape Prior Num. 0 1 2 3 4 5
CD-AVG 15.37 9.7 7.64 6.62 7.57 10.12
F-Score@1% 0.109 0.647 0.732 0.759 0.740 0.556
TABLE VI: Ablation study on the number of the shape priors.
Threshold 0 0.5 1 1.5 2 2.5
CD-AVG 10.94 10.87 7.29 6.62 7.06 9.23
F-Score@1% 0.531 0.534 0.745 0.759 0.752 0.65
TABLE VII: Ablation study on the similarity threshold $\delta$ (listed values are to be multiplied by 0.001).

We also present an ablation experiment on the number of shape priors in Table VI. Performance improves as the number of shape priors increases from zero, but degrades once the number exceeds a certain amount. This is mainly because the shape priors explicitly compensate for the deficiency of the partial input by representing the geometry of the missing parts, yet excessive shape priors introduce too many confounding structures, which are beyond the capability of the causal feature selection module. In particular, Point-PC achieves the best score using 3 shape priors.
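To make the role of the "number of shape priors" concrete, here is a minimal sketch of top-k retrieval from a key-value memory. It assumes that keys are feature vectors of complete shapes, that values are the shapes themselves, and that cosine similarity is used for ranking; the similarity measure and the names `cosine_similarity` and `retrieve_priors` are illustrative assumptions, not the paper's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_priors(query_feature, memory, k=3):
    """Return the values of the k memory entries whose keys best match the query.

    memory is a list of (key_feature, value_shape) pairs.
    """
    ranked = sorted(
        memory,
        key=lambda pair: cosine_similarity(query_feature, pair[0]),
        reverse=True,
    )
    return [value for _, value in ranked[:k]]

# Toy memory with 2-D features and shape names standing in for point clouds.
memory = [
    ([1.0, 0.0], "chair_a"),
    ([0.0, 1.0], "lamp"),
    ([0.9, 0.1], "chair_b"),
    ([-1.0, 0.0], "table"),
]
print(retrieve_priors([1.0, 0.0], memory, k=3))
```

With a larger k, less similar entries (such as "lamp" here) enter the retrieved set, which mirrors the observation that too many priors introduce confounding structures.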

In Section 3.1, we adopt the Chamfer Distance threshold $\delta$ to determine whether the partial point cloud and the "key-value" pair constitute a valid positive match. We vary $\delta$ within a certain range and report the results in Table VII. Point-PC achieves the best average CD-$\ell_1$ and F-Score@1% when the threshold $\delta$ is set to 1.5 (i.e., 0.0015 after multiplying by 0.001).

IV-F Detailed results on ShapeNet-55

A comprehensive performance analysis of TopNet, PCN, GRNet, PoinTr, SeedFormer, and our proposed method on ShapeNet-55 is presented in Table X. Each row in the table represents a distinct object category, and each method is assessed under three settings: simple, moderate, and hard.

IV-G Detailed results on ShapeNet-34

We provide a comprehensive report of the results obtained for novel objects across 21 categories in ShapeNet-34, as shown in Table IX. Each row in the table represents a specific object category, and we have evaluated each method under three different settings: simple, moderate, and hard.

IV-H Complexity Analysis

Table VIII compares the complexity of our method, in terms of computation cost (FLOPs) and parameter count, with six other methods. As references, we also report the average Chamfer Distance over all categories in ShapeNet-55 and over the unseen categories in ShapeNet-34. Our method achieves the best Chamfer Distance on both benchmarks while keeping FLOPs and parameter count relatively low, which indicates a favorable balance between cost and performance.

Models FoldingNet PCN TopNet PFNet GRNet PoinTr Point-PC
Params(M) 2.3 5.04 5.76 73.05 73.15 30.9 10.29
FLOPs(G) 27.58 15.25 6.72 4.96 40.44 10.41 14.88
CD55 3.12 2.66 2.91 5.22 1.97 1.07 0.83
CD34 3.62 3.85 3.5 8.16 2.99 2.05 1.19
TABLE VIII: Complexity analysis of our method and six other methods, reporting the parameter count (Params) and computation cost (FLOPs). We also report the average CD-$\ell_2$ on ShapeNet-55 (CD55) and on the unseen categories of ShapeNet-34 (CD34) as a basis for comparison.
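The CD55 and CD34 entries for Point-PC are consistent with a plain average of the simple, moderate, and hard means reported in the "mean" rows of Tables X and IX. The short check below illustrates this; the helper name `average_cd` is hypothetical, and rounding to two decimals is an assumption about how the table values were produced.

```python
def average_cd(simple, moderate, hard):
    """Average CD over the three difficulty levels, rounded to two decimals."""
    return round((simple + moderate + hard) / 3, 2)

# Point-PC mean row of Table X (ShapeNet-55): 0.45, 0.72, 1.32
print(average_cd(0.45, 0.72, 1.32))

# Point-PC mean row of Table IX (novel categories of ShapeNet-34): 0.57, 0.92, 2.09
print(average_cd(0.57, 0.92, 2.09))
```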
CD-$\ell_2$ (× 1000) TopNet PCN GRNet PoinTr SeedFormer Point-PC
S. M. H. S. M. H. S. M. H. S. M. H. S. M. H. S. M. H.
bag 2.08 1.95 4.36 2.48 2.46 3.94 1.47 1.88 3.45 0.96 1.34 2.08 0.49 0.82 1.45 0.61 1.15 1.9
basket 2.46 2.11 5.18 2.79 2.51 4.78 1.78 1.94 4.18 1.04 1.4 2.9 0.6 0.85 1.98 0.64 1.12 2.23
birdhouse 3.17 2.97 5.89 3.53 3.47 5.31 1.89 2.34 5.16 1.22 1.79 3.45 0.72 1.19 2.31 0.89 1.22 2.32
bowl 2.46 2.16 4.84 2.66 2.35 3.97 1.77 1.97 3.9 1.05 1.32 2.4 0.6 0.77 1.5 0.53 1.06 1.73
camera 4.24 4.43 8.11 4.84 5.3 8.03 2.31 3.38 7.2 1.63 2.67 4.97 0.89 1.77 3.75 0.8 1.4 2.32
can 2.02 1.7 5.82 1.95 1.89 5.21 1.53 1.8 3.08 0.8 1.17 2.85 0.56 0.89 1.57 0.58 0.81 1.75
cap 4.68 4.23 9.7 7.21 7.4 10.94 3.29 4.87 13.02 1.4 2.74 8.35 0.5 1.34 5.19 0.6 1.25 3.81
keyboard 0.79 0.77 1.55 1.07 1 1.23 0.73 0.77 1.11 0.43 0.45 0.63 0.32 0.41 0.6 0.29 0.44 0.66
dishwasher 2.51 1.77 4.72 2.45 2.09 3.53 1.79 1.7 3.27 0.93 1.05 2.04 0.63 0.78 1.44 0.53 0.81 1.74
earphone 5.33 4.83 11.67 7.88 6.59 16.53 4.29 4.16 10.3 2.03 5.1 10.69 1.18 2.78 6.71 0.95 1.24 4.82
helmet 4.89 4.86 8.73 6.15 6.41 9.16 3.06 4.38 10.27 1.86 3.3 6.96 1.1 2.27 4.78 0.79 0.68 2.56
mailbox 2.35 2.2 4.91 2.74 2.68 4.31 1.2 1.9 4.33 1.03 1.47 3.34 0.56 0.99 2.06 0.44 1.01 2.45
microphone 3.03 3.2 7.15 4.36 4.65 8.46 2.29 3.23 8.4 1.25 2.27 5.47 0.8 1.61 4.21 0.64 0.61 3.59
microwaves 2.67 2.12 5.41 2.59 2.35 4.47 1.74 1.81 3.82 1.01 1.18 2.14 0.64 0.83 1.69 0.45 0.91 1.92
pillow 2.08 2.05 4.01 2.09 2.16 3.54 1.43 1.69 3.43 0.92 1.24 2.39 0.43 0.66 1.45 0.66 0.79 2.26
printer 2.9 2.96 6.07 3.28 3.6 5.56 1.82 2.41 5.09 1.18 1.76 3.1 0.69 1.25 2.33 0.74 1.37 2.44
remote 0.89 0.89 2.28 0.95 1.08 1.58 0.82 1.02 1.29 0.44 0.58 0.78 0.27 0.42 0.61 0.3 0.39 0.54
rocket 1.14 0.96 2.03 1.39 1.22 2.01 0.97 0.79 1.6 0.39 0.72 1.39 0.28 0.51 1.02 0.32 0.56 1.56
skateboard 1.23 1.2 2.01 1.97 1.78 2.45 0.93 1.07 1.83 0.52 0.8 1.31 0.35 0.56 0.92 0.34 0.44 0.94
tower 2.2 2.17 5.47 2.37 2.4 4.35 1.35 1.8 3.85 0.82 1.35 2.48 0.51 0.92 1.87 0.64 0.93 1.8
washer 2.63 2.14 6.57 2.77 2.52 4.64 1.83 1.97 5.28 1.04 1.39 2.73 0.61 0.87 1.94 0.47 1.11 1.86
mean 2.65 2.46 5.52 3.22 3.13 5.43 1.84 2.23 4.95 1.05 1.67 3.45 0.61 1.07 2.35 0.57 0.92 2.09
TABLE IX: Detailed quantitative results for the novel categories on ShapeNet-34. S., M. and H. represent the simple, moderate and hard settings.
CD-$\ell_2$ (× 1000) TopNet PCN GRNet PoinTr SeedFormer Point-PC
S. M. H. S. M. H. S. M. H. S. M. H. S. M. H. S. M. H.
airplane 1.02 0.99 1.48 0.9 0.89 1.32 0.87 0.87 1.27 0.27 0.38 0.69 0.23 0.35 0.61 0.28 0.31 0.54
trash bin 2.51 2.32 5.03 2.16 2.18 5.15 1.69 2.01 3.48 0.8 1.15 2.15 0.73 1.08 1.94 0.65 0.93 1.74
bag 2.36 2.23 4.21 2.11 2.04 4.44 1.41 1.7 2.97 0.53 0.74 1.51 0.43 0.67 1.28 0.41 0.78 1.22
basket 2.62 2.43 5.71 2.21 2.1 4.55 1.65 1.84 3.15 0.73 0.88 1.82 0.65 0.83 1.54 0.55 0.84 1.32
bathtub 2.49 2.25 4.33 2.11 2.09 3.94 1.46 1.73 2.73 0.64 0.94 1.68 0.52 0.82 1.45 0.47 0.81 1.39
bed 3.13 3.1 5.71 2.86 3.07 5.54 1.64 2.03 3.7 0.76 1.1 2.26 0.63 0.91 1.89 0.54 0.81 1.94
bench 1.56 1.39 2.4 1.31 1.24 2.14 1.03 1.09 1.71 0.38 0.52 0.94 0.32 0.42 0.84 0.33 0.55 0.94
birdhouse 3.73 3.98 6.8 3.29 3.53 6.69 1.87 2.4 4.71 0.98 1.49 3.13 0.76 1.3 2.46 0.66 1.17 2.29
bookshelf 3.11 2.87 4.87 2.7 2.7 4.61 1.42 1.71 2.78 0.71 1.06 1.93 0.57 0.84 1.57 0.51 0.97 1.68
bottle 1.56 1.66 4.02 1.25 1.43 4.61 1.05 1.44 2.67 0.37 0.74 1.5 0.31 0.63 1.21 0.34 0.58 1.06
bowl 2.33 1.98 4.82 2.05 1.83 3.66 1.6 1.77 2.99 0.68 0.78 1.44 0.56 0.65 1.18 0.47 0.86 1.24
bus 1.32 1.21 2.29 1.2 1.14 2.08 1.06 1.16 1.48 0.42 0.55 0.79 0.42 0.55 0.73 0.39 0.62 0.77
cabinet 1.91 1.65 3.36 1.6 1.49 3.47 1.27 1.41 2.09 0.55 0.66 1.16 0.57 0.69 1.05 0.57 0.68 1.03
camera 4.75 4.98 9.24 4.05 4.54 8.27 2.14 3.15 6.09 1.1 2.03 4.34 0.83 1.68 3.45 0.75 1.29 2.24
can 2.67 2.4 5.5 2.02 2.28 6.48 1.58 2.11 3.81 0.68 1.19 2.14 0.58 1.03 1.79 0.44 1.18 1.76
cap 3 2.69 5.59 1.82 1.76 4.2 1.17 1.37 3.05 0.46 0.62 1.64 0.33 0.45 1.18 0.41 0.69 1.41
car 1.71 1.65 3.17 1.48 1.47 2.6 1.29 1.48 2.14 0.64 0.86 1.25 0.65 0.86 1.17 0.55 0.91 1.05
cellphone 1.01 0.96 1.8 0.8 0.79 1.71 0.82 0.91 1.18 0.32 0.39 0.6 0.31 0.4 0.54 0.37 0.34 0.59
chair 1.97 2.04 3.59 1.7 1.81 3.34 1.24 1.56 2.73 0.49 0.74 1.63 0.41 0.65 1.38 0.32 0.51 0.98
clock 2.48 2.16 4.03 2.1 2.01 3.98 1.46 1.66 2.67 0.62 0.84 1.65 0.53 0.74 1.35 0.39 0.66 1.25
keyboard 0.88 0.83 1.5 0.82 0.82 1.04 0.74 0.81 1.09 0.3 0.39 0.45 0.28 0.36 0.45 0.32 0.39 0.44
dishwasher 2.43 1.74 4.64 1.93 1.66 4.39 1.43 1.59 2.53 0.55 0.69 1.42 0.56 0.69 1.3 0.44 0.58 1.27
display 1.84 1.85 3.48 1.56 1.66 3.26 1.13 1.38 2.29 0.48 0.67 1.33 0.39 0.59 1.1 0.29 0.41 0.95
earphone 4.36 4.47 8.36 3.13 2.94 7.56 1.78 2.18 5.33 0.81 1.38 3.78 0.64 1.04 2.75 0.76 0.88 1.79
faucet 3.61 3.59 7.25 3.21 3.48 7.52 1.81 2.32 4.91 0.71 1.42 3.49 0.55 1.15 2.63 0.55 1.02 1.94
filecabinet 2.41 2.12 4.12 2.02 1.97 4.14 1.46 1.71 2.89 0.63 0.84 1.69 0.63 0.84 1.49 0.68 0.97 1.35
guitar 0.57 0.47 1.42 0.42 0.38 1.23 0.44 0.48 0.76 0.14 0.21 0.42 0.13 0.19 0.32 0.27 0.24 0.36
helmet 4.36 4.55 7.73 3.76 4.18 7.53 2.33 3.18 6.03 0.99 1.93 4.22 0.79 1.52 3.61 0.68 1.14 2.22
jar 3.03 3.17 7.03 2.57 2.82 6.01 1.72 2.37 4.37 0.77 1.33 2.87 0.63 1.13 2.36 0.57 1.19 2.08
knife 0.84 0.68 1.44 0.94 0.62 1.37 0.72 0.66 0.96 0.2 0.33 0.56 0.15 0.28 0.45 0.15 0.33 0.42
lamp 3.03 3.39 8.15 3.1 3.45 7.02 1.68 2.43 5.17 0.64 1.4 3.58 0.45 1.06 2.67 0.2 0.98 1.95
laptop 0.8 0.85 1.66 0.75 0.79 1.59 0.83 0.87 1.28 0.32 0.34 0.6 0.32 0.37 0.55 0.35 0.37 0.68
loudspeaker 3.1 2.76 5.32 2.5 2.45 5.08 1.75 2.08 3.45 0.78 1.16 2.17 0.67 1.01 1.8 0.52 1.29 1.95
mailbox 2.16 2.1 5.1 1.66 1.74 5.18 1.15 1.59 3.42 0.39 0.78 2.56 0.3 0.67 2.04 0.29 0.43 1.35
microphone 2.83 3.49 6.87 3.4 3.9 8.52 2.09 2.76 5.7 0.7 1.66 4.48 0.62 1.61 3.66 0.48 1.21 2.85
microwaves 2.65 2.15 5.07 2.2 2.01 4.65 1.51 1.72 2.76 0.67 0.83 1.82 0.63 0.79 1.47 0.41 0.59 1.38
motorbike 2.29 2.25 3.54 2.03 2.01 3.13 1.38 1.52 2.26 0.75 1.1 1.92 0.68 0.96 1.44 0.65 0.93 1.48
mug 2.89 2.56 5.43 2.45 2.48 5.17 1.75 2.16 3.79 0.91 1.17 2.35 0.79 1.03 2.06 0.61 0.81 1.72
piano 2.99 2.89 5.64 2.64 2.74 4.83 1.53 1.82 3.21 0.76 1.06 2.23 0.62 0.87 1.79 0.53 0.77 1.7
pillow 2.31 2.26 4.19 1.85 1.81 3.68 1.42 1.67 3.04 0.61 0.82 1.56 0.48 0.75 1.41 0.32 0.73 1.36
pistol 1.5 1.3 2.62 1.25 1.17 2.65 1.1 1.06 1.76 0.43 0.66 1.3 0.37 0.56 0.96 0.29 0.54 0.79
flowerpot 3.61 3.45 6.28 3.32 3.39 6.04 2.02 2.48 4.19 1.01 1.51 2.77 0.93 1.3 2.32 1.09 1.36 2.53
printer 3.04 3.19 5.84 2.9 3.19 5.84 1.56 2.38 4.24 0.73 1.21 2.47 0.58 1.11 2.13 0.54 0.83 2.18
remote 1.14 1.17 2.16 0.99 0.97 2.04 0.89 1.05 1.29 0.36 0.53 0.71 0.29 0.46 0.62 0.26 0.46 0.68
rifle 0.98 0.86 1.46 0.98 0.8 1.31 0.83 0.77 1.16 0.3 0.45 0.79 0.27 0.41 0.66 0.24 0.44 0.63
rocket 1.04 1 1.93 1.05 1.04 1.87 0.78 0.92 1.44 0.23 0.48 0.99 0.21 0.46 0.83 0.26 0.55 0.75
skateboard 1.08 1.05 1.84 1.04 0.94 1.68 0.82 0.87 1.24 0.28 0.38 0.62 0.23 0.32 0.62 0.21 0.31 0.59
sofa 1.93 1.76 3.39 1.65 1.61 2.92 1.35 1.45 2.32 0.56 0.67 1.14 0.5 0.62 1.02 0.42 0.58 0.96
stove 2.44 2.16 4.84 2.07 2.02 4.72 1.46 1.72 3.22 0.63 0.92 1.73 0.59 0.87 1.49 0.46 0.56 1.23
table 1.78 1.65 3.21 1.56 1.5 3.36 1.15 1.33 2.33 0.46 0.64 1.31 0.41 0.58 1.18 0.36 0.51 1.09
telephone 1.02 0.95 1.78 0.8 0.8 1.67 0.81 0.89 1.18 0.31 0.38 0.59 0.31 0.39 0.55 0.32 0.35 0.54
tower 2.15 2.05 4.51 1.91 1.97 4.47 1.26 1.69 3.06 0.55 0.9 1.95 0.47 0.84 1.65 0.35 0.67 1.31
train 1.59 1.44 2.51 1.5 1.41 2.37 1.09 1.14 1.61 0.5 0.7 1.12 0.51 0.66 1.01 0.59 0.63 1.05
watercraft 1.53 1.42 2.67 1.46 1.39 2.4 1.09 1.12 1.65 0.41 0.62 1.07 0.35 0.56 0.92 0.32 0.55 0.89
washer 2.92 2.53 6.53 2.42 2.31 6.08 1.72 2.05 4.19 0.75 1.06 2.44 0.64 0.91 2.04 0.38 0.73 1.85
mean 2.26 2.17 4.31 1.96 1.98 4.09 1.35 1.63 2.86 0.58 0.88 1.8 0.5 0.77 1.49 0.45 0.72 1.32
TABLE X: Detailed quantitative results on the ShapeNet-55 dataset. S., M., and H. represent simple, moderate, and hard settings.

IV-I Implementation Details

We employ a geometry-aware transformer encoder [17] to extract the point cloud features. In all of our experiments, we use a uniform configuration for all transformer encoders: 6 attention heads, a depth of 8 blocks, and a hidden dimension of 384. In the geometry-aware modules, we set k of the kNN algorithm to 8, and to 16 in the DGCNN [47] feature extractor. Each partial shape of 2048 points is represented by 32 patches. On the ShapeNet-55/34 datasets, each complete shape of 8192 points is represented by 128 neighboring patches, while on PCN each complete shape of 16384 points is represented by 256 patches. The threshold $\delta$ mentioned above is set to 0.0015, and we select the top-3 shape priors from the memory. During evaluation on the ShapeNet-55 and ShapeNet-34/21 benchmarks, we set 8 fixed viewpoints and select 2048, 4096, or 6144 points (25%, 50%, or 75% of the entire point cloud) for ease of evaluation. Accordingly, we classify the test samples into three difficulty levels: simple, moderate, and hard. We also report the overall performance (Avg) as the average over the three difficulty levels. We implement our networks in PyTorch and train them with the AdamW optimizer [48]. The initial learning rate is set to 5e-4 and decays by 0.76 every 20 epochs. With a batch size of 64, we train the model for 200 epochs on two NVIDIA RTX 3090Ti GPUs.
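The learning-rate schedule described above can be written as a small closed-form function. This sketch assumes that "decays by 0.76 every 20 epochs" means the rate is multiplied by 0.76 at each 20-epoch boundary; the function name `step_decay_lr` is hypothetical. Under that reading, the same schedule would correspond to PyTorch's `torch.optim.lr_scheduler.StepLR` with `step_size=20` and `gamma=0.76`.

```python
def step_decay_lr(epoch, base_lr=5e-4, gamma=0.76, step_size=20):
    """Learning rate at a given epoch under multiplicative step decay.

    Assumes the rate is multiplied by gamma once every step_size epochs.
    """
    return base_lr * gamma ** (epoch // step_size)

# Print the learning rate at the start of each 20-epoch stage of a 200-epoch run.
for epoch in range(0, 200, 20):
    print(epoch, step_decay_lr(epoch))
```

Over 200 epochs this gives ten stages, so the final stage runs at the initial rate multiplied by 0.76 nine times.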

V Conclusion

In this paper, we emphasize that existing methods have two drawbacks: 1) an incomplete point cloud cannot provide the missing structural information, because that information simply is not present in the input; 2) the global features are diluted during coarse-to-fine up-sampling operations, which limits their ability to capture geometric information. To address these issues, we design a memory network that searches for relevant complete shapes (prior shapes) corresponding to the partial point cloud input. These prior shapes make up for the missing structural information and guide the generative model to fill in more accurate details. Note that only part of the structure of these prior shapes helps generate the missing structure, while the remaining structure can be regarded as redundant. Therefore, we utilize causal inference to mitigate the confounding effect of shape priors and to encourage the decoder to pay more attention to the causally selected features that truly contribute to the accuracy of completion. To the best of our knowledge, this is the first work to introduce a causal graph into the point cloud completion task; it effectively filters shape information from the prior shapes and preserves the missing shape information, improving the integrity and ultimate performance of the fused representation. Comprehensive experiments show the effectiveness and superiority of Point-PC compared to state-of-the-art competitors.

References

  • [1] Z. Han, C. Chen, Y.-S. Liu, and M. Zwicker, “Shapecaptioner: Generative caption network for 3d shapes by learning a mapping from parts detected in multiple views to sentences,” Proceedings of the 28th ACM International Conference on Multimedia, 2019.
  • [2] Z. Han, Z. Liu, C.-M. Vong, Y.-S. Liu, S. Bu, J. Han, and C. L. P. Chen, “Deep spatiality: Unsupervised learning of spatially-enhanced global and local 3d features by deep neural network with coupled softmax,” IEEE Transactions on Image Processing, vol. 27, pp. 3049–3063, 2018.
  • [3] L. Tan, X. Lin, D. Niu, D. Wang, M. Yin, and X. Zhao, “Projected generative adversarial network for point cloud completion,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 2, pp. 771–781, 2022.
  • [4] W. Nie, W. Wang, A. Liu, J. Nie, and Y. Su, “Hgan: Holistic generative adversarial networks for two-dimensional image-based three-dimensional object retrieval,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 15, no. 4, pp. 1–24, 2019.
  • [5] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang, “Pixel2mesh: Generating 3d mesh models from single rgb images,” in European Conference on Computer Vision, 2018.
  • [6] H. Xie, H. Yao, S. Zhang, S. Zhou, and W. Sun, “Pix2vox++: Multi-scale context-aware 3d object reconstruction from single and multiple images,” International Journal of Computer Vision, vol. 128, pp. 2919 – 2935, 2020.
  • [7] C. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 77–85, 2017.
  • [8] X. Wen, Z. Han, Y.-P. Cao, P. Wan, W. Zheng, and Y.-S. Liu, “Cycle4completion: Unpaired point cloud completion using cycle transformation with missing region coding,” 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13075–13084, 2021.
  • [9] X. Wen, T. Li, Z. Han, and Y.-S. Liu, “Point cloud completion by skip-attention network with hierarchical folding,” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1936–1945, 2020.
  • [10] X. Wen, P. Xiang, Y. Cao, P. Wan, W. Zheng, and Y.-S. Liu, “Pmp-net++: Point cloud completion by transformer-enhanced multi-step point moving paths,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, pp. 852–867, 2022.
  • [11] A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu, “Shapenet: An information-rich 3d model repository,” ArXiv, vol. abs/1512.03012, 2015.
  • [12] W. Yuan, T. Khot, D. Held, C. Mertz, and M. Hebert, “Pcn: Point completion network,” 2018 International Conference on 3D Vision (3DV), pp. 728–737, 2018.
  • [13] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, pp. 1231 – 1237, 2013.
  • [14] W. Zhang, Q. Yan, and C. Xiao, “Detail preserved point cloud completion via separated feature aggregation,” ArXiv, vol. abs/2007.02374, 2020.
  • [15] H. Xie, H. Yao, S. Zhou, J. Mao, S. Zhang, and W. Sun, “Grnet: Gridding residual network for dense point cloud completion,” ArXiv, vol. abs/2006.03761, 2020.
  • [16] Z. Huang, Y. Yu, J. Xu, F. Ni, and X. Le, “Pf-net: Point fractal network for 3d point cloud completion,” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7659–7667, 2020.
  • [17] X. Yu, Y. Rao, Z. Wang, Z. Liu, J. Lu, and J. Zhou, “Pointr: Diverse point cloud completion with geometry-aware transformers,” 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 12478–12487, 2021.
  • [18] H. Zhou, Y. Cao, W. Chu, J. Zhu, T. Lu, Y. Tai, and C. Wang, “Seedformer: Patch seeds based point cloud completion with upsample transformer,” in European Conference on Computer Vision, 2022.
  • [19] J. Wang, Y. Cui, D. Guo, J. Li, Q. Liu, and C. Shen, “Pointattn: You only need attention for point cloud completion,” ArXiv, vol. abs/2203.08485, 2022.
  • [20] X. Yan, H. Yan, J. Wang, H. Du, Z. Wu, D. Xie, S. Pu, and L. Lu, “Fbnet: Feedback network for point cloud completion,” in European Conference on Computer Vision, 2022.
  • [21] Z. Zhang, Y. Yu, and F. peng Da, “Partial-to-partial point generation network for point cloud completion,” IEEE Robotics and Automation Letters, vol. 7, pp. 11990–11997, 2022.
  • [22] R. Cao, K. Zhang, Y. Chen, X. Yang, and C. Jin, “Point cloud completion via multi-scale edge convolution and attention,” Proceedings of the 30th ACM International Conference on Multimedia, 2022.
  • [23] Z. Liu, H. Tang, Y. Lin, and S. Han, “Point-voxel cnn for efficient 3d deep learning,” ArXiv, vol. abs/1907.03739, 2019.
  • [24] Y. Liu, B. Fan, S. Xiang, and C. Pan, “Relation-shape convolutional neural network for point cloud analysis,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8887–8896, 2019.
  • [25] P. Mandikal and R. V. Babu, “Dense 3d point cloud reconstruction using a deep pyramid network,” 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1052–1060, 2019.
  • [26] L. P. Tchapmi, V. Kosaraju, H. Rezatofighi, I. D. Reid, and S. Savarese, “Topnet: Structural point cloud decoder,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 383–392, 2019.
  • [27] P. Xiang, X. Wen, Y.-S. Liu, Y.-P. Cao, P. Wan, W. Zheng, and Z. Han, “Snowflakenet: Point cloud completion by snowflake point deconvolution with skip-transformer,” 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 5479–5489, 2021.
  • [28] M.-H. Guo, J. Cai, Z.-N. Liu, T.-J. Mu, R. R. Martin, and S. Hu, “Pct: Point cloud transformer,” Comput. Vis. Media, vol. 7, pp. 187–199, 2021.
  • [29] J. Weston, S. Chopra, and A. Bordes, “Memory networks,” CoRR, vol. abs/1410.3916, 2015.
  • [30] F. Liu and J. Perez, “Gated end-to-end memory networks,” in Conference of the European Chapter of the Association for Computational Linguistics, 2017.
  • [31] A. P. S. Chandar, S. Ahn, H. Larochelle, P. Vincent, G. Tesauro, and Y. Bengio, “Hierarchical memory networks,” ArXiv, vol. abs/1605.07427, 2016.
  • [32] A. H. Miller, A. Fisch, J. Dodge, A.-H. Karimi, A. Bordes, and J. Weston, “Key-value memory networks for directly reading documents,” ArXiv, vol. abs/1606.03126, 2016.
  • [33] J. Hou, B. Graham, M. Nießner, and S. Xie, “Exploring data-efficient 3d scene understanding with contrastive scene contexts,” 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15582–15592, 2020.
  • [34] Z. Zhang, R. Girdhar, A. Joulin, and I. Misra, “Self-supervised pretraining of 3d features on any point-cloud,” 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10232–10243, 2021.
  • [35] X. Chen, H. Fan, R. B. Girshick, and K. He, “Improved baselines with momentum contrastive learning,” ArXiv, vol. abs/2003.04297, 2020.
  • [36] T. Huang, B. Dong, Y. Yang, X. Huang, R. W. H. Lau, W. Ouyang, and W. Zuo, “Clip2point: Transfer clip to point cloud classification with image-depth pre-training,” ArXiv, vol. abs/2210.01055, 2022.
  • [37] J. Pearl, “Causality: Models, reasoning and inference,” 2000.
  • [38] X. Hu, K. Tang, C. Miao, X. Hua, and H. Zhang, “Distilling causal effect of data in class-incremental learning,” 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3956–3965, 2021.
  • [39] T. Wang, J. Huang, H. Zhang, and Q. Sun, “Visual commonsense r-cnn,” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10757–10767, 2020.
  • [40] D. Zhang, H. Zhang, J. Tang, X. Hua, and Q. Sun, “Causal intervention for weakly-supervised semantic segmentation,” ArXiv, vol. abs/2009.12547, 2020.
  • [41] Z. Yue, H. Zhang, Q. Sun, and X. Hua, “Interventional few-shot learning,” ArXiv, vol. abs/2009.13000, 2020.
  • [42] T. Chen, S. Kornblith, M. Norouzi, and G. E. Hinton, “A simple framework for contrastive learning of visual representations,” ArXiv, vol. abs/2002.05709, 2020.
  • [43] J. Pearl, “Interpretation and identification of causal mediation,” ERN: Other Econometrics: Econometric Model Construction, 2013.
  • [44] H. Fan, H. Su, and L. J. Guibas, “A point set generation network for 3d object reconstruction from a single image,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2463–2471, 2016.
  • [45] X. Wang, M. H. Ang, and G. H. Lee, “Cascaded refinement network for point cloud completion,” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 787–796, 2020.
  • [46] X. Wen, P. Xiang, Z. Han, Y.-P. Cao, P. Wan, W. Zheng, and Y.-S. Liu, “Pmp-net: Point cloud completion by learning multi-step point moving paths,” 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7439–7448, 2020.
  • [47] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, “Dynamic graph cnn for learning on point clouds,” ACM Transactions on Graphics (TOG), vol. 38, pp. 1 – 12, 2018.
  • [48] I. Loshchilov and F. Hutter, “Fixing weight decay regularization in adam,” ArXiv, vol. abs/1711.05101, 2017.