Dynamic 3D Point Cloud Sequences as 2D Videos

Yiming Zeng, Junhui Hou, Senior Member, IEEE, Qijian Zhang, Siyu Ren, and Wenping Wang, Fellow, IEEE

This work was supported by the Hong Kong Research Grants Council under Grant 11202320, Grant 11219422, and Grant 11218121. Corresponding author: Junhui Hou. Y. Zeng, J. Hou, Q. Zhang, and S. Ren are with the Department of Computer Science, City University of Hong Kong, Hong Kong SAR. Email: jh.hou@cityu.edu.hk. W. Wang is with the Department of Computer Science & Engineering, Texas A&M University, USA. Email: wenping@tamu.edu

Abstract

Dynamic 3D point cloud sequences serve as one of the most common and practical representation modalities of dynamic real-world environments. However, their unstructured nature in both spatial and temporal domains poses significant challenges to effective and efficient processing. Existing deep point cloud sequence modeling approaches imitate the mature 2D video learning mechanisms by developing complex spatio-temporal point neighbor grouping and feature aggregation schemes, often resulting in methods lacking effectiveness, efficiency, and expressive power. In this paper, we propose a novel generic representation called Structured Point Cloud Videos (SPCVs). Intuitively, by leveraging the fact that 3D geometric shapes are essentially 2D manifolds, SPCV re-organizes a point cloud sequence as a 2D video with spatial smoothness and temporal consistency, where the pixel values correspond to the 3D coordinates of points. The structured nature of our SPCV representation allows for the seamless adaptation of well-established 2D image/video techniques, enabling efficient and effective processing and analysis of 3D point cloud sequences. To achieve such re-organization, we design a self-supervised learning pipeline that is geometrically regularized and driven by self-reconstructive and deformation field learning objectives. Additionally, we construct SPCV-based frameworks for both low-level and high-level 3D point cloud sequence processing and analysis tasks, including action recognition, temporal interpolation, and compression. Extensive experiments demonstrate the versatility and superiority of the proposed SPCV, which has the potential to offer new possibilities for deep learning on unstructured 3D point cloud sequences. Code will be released at https://github.com/ZENGYIMING-EAMON/SPCV.

Index Terms:
3D point cloud sequence, feature representation, geometric modeling, self-supervised learning, correspondence, compression, efficiency.

I Introduction

A dynamic 3D point cloud sequence comprises multiple frames of static 3D point clouds captured at consecutive time steps, providing a depiction of geometric changes in objects/scenes. This type of data finds extensive applications in areas such as autonomous driving, robotics navigation, virtual/augmented reality, and immersive telecommunication. As the demand for 3D data processing continues to grow, fueled by the remarkable success of deep learning in handling 2D images/videos, there is an urgent need to develop effective and efficient learning methods for processing dynamic 3D point cloud sequences.

Figure 1: Visualization of structuring of a single 3D point cloud frame. Each line indicates the corresponding 2D pixel and 3D point that are displayed with an identical color. The black points correspond to the pixels located at the 2D image’s boundaries. Please use Adobe Acrobat to display the video.
Figure 2: Illustration of the proposed SPCV representations for dynamic 3D point cloud sequences. (a) A dynamic 3D point cloud sequence with point-wise correspondences unknown. (b) The SPCV representation of the above sequence. (c) The dynamic 3D point cloud sequence reshaped/reprojected from the above SPCV by taking the value of a pixel as the coordinate of a 3D point. Note that for each frame, the points are rendered in the same colors as the corresponding pixels to illustrate the correspondence between 2D and 3D domains.

However, deep modeling of 3D point clouds, characterized by irregularity and lack of structure, is not as straightforward as applying standard 2D convolutional operators [1, 2, 3, 4] on well-structured 2D image/video signals defined on regular 2D grids. In recent years, researchers have devoted significant efforts to exploring numerous deep set architectures [5, 6, 7, 8, 9, 10] that operate on static 3D point cloud inputs, by developing various complicated learning mechanisms and highly specialized feature extraction operators. Moreover, the processing and modeling challenges become even more pronounced when transitioning from a static point cloud to a dynamic sequence of consecutive point cloud frames. This is due to the irregular spatial structure of each frame and the lack of temporal correspondence across frames. Existing dynamic 3D point cloud sequence learning networks [11, 12] typically rely on extensive spatio-temporal point neighbor grouping and multi-scale feature aggregation operations, which are less expressive, memory-consuming, and computationally inefficient, particularly with an increasing number of input points. The substantial computational complexity of these time-consuming operations hinders the development of deeper or more sophisticated architectural models for 3D point cloud sequences, thereby limiting performance on downstream tasks. In addition to the issues of network structures, in various reconstruction/generation scenarios, the inefficiency of commonly used loss functions, e.g., Chamfer Distance (CD) and Earth Mover’s Distance (EMD) [13], can also significantly degrade the overall learning effects. This stands in stark contrast to the 2D image/video domain, where pixel-wise errors can be directly computed at the same 2D grid coordinates.

In this paper, we address the aforementioned challenges from a novel and more fundamental perspective by structuring 3D point cloud sequences. By recognizing that 3D geometric shapes are essentially 2D manifolds, we propose to structure dynamic 3D point cloud sequences with a 2D video-like representation, namely structured point cloud video (SPCV). As depicted in Fig. 2, the proposed SPCV representation exhibits the following key features: (i) the pixel values of an SPCV correspond to the coordinates of 3D points; (ii) the adjacent pixels within a frame generally represent neighboring points in 3D space; (iii) the pixels with identical locations across different frames of an SPCV correspond to the consistent position of the object/scene. To achieve such a structured representation, we design a geometrically regularized self-supervised learning pipeline comprising two stages: frame-wise structurization and sequence-wise structurization. These stages are driven by the learning objectives of intra-frame self-reconstruction and inter-frame deformation fields, respectively. Furthermore, for a better understanding of the structurization process, we direct readers to the video demonstrations presented in Fig. 1 and Fig. 3.

Figure 3: Visualization of our structurization of eight frames of a point cloud sequence named Crane from a typical viewpoint. Note that pixels of all frames are projected to 3D space following the same order to illustrate the temporal consistency across different frames. Please use Adobe Acrobat to view the video.

Essentially, the process of structuring a dynamic 3D point cloud sequence to its SPCV representation involves explicitly learning the indexing cues of the geometric coordinates, resulting in the re-organization of original 3D spatial points into regular 2D grids. The uniform structure of SPCV relieves the challenge faced by neural networks in feature learning, thereby potentially generating more expressive features. Decoupling the structurization from downstream processing can significantly enhance the efficiency of downstream applications, allowing for the integration of advanced techniques to improve performance. The SPCV representation possesses spatial smoothness and temporal consistency properties, enabling the direct application of 2D image/video processing techniques for seamless processing of dynamic 3D point cloud sequences. By leveraging well-established 2D image/video techniques, we further construct SPCV-based frameworks for a range of high-level and low-level 3D point cloud sequence processing and analysis tasks, including action recognition, temporal interpolation, and geometry compression. Experimental results demonstrate the versatility and superiority of our approaches.

In summary, the main contributions of this paper are:

  1. We propose a new generic representation modality for dynamic 3D point cloud sequences, offering numerous benefits for point cloud sequence processing and analysis.

  2. We propose a novel self-supervised learning framework that efficiently and effectively represents any point cloud sequences as SPCVs.

  3. We construct diverse SPCV-based frameworks for dynamic 3D point sequence analysis and processing, achieving state-of-the-art performances.

The remainder of this paper is organized as follows: Section II reviews related works in the field. Section III details the methodology behind our SPCV representation. Section IV provides a detailed introduction to the SPCV-based learning network employed for various downstream tasks. Section V discusses the experimental setup and benchmarks our approach against existing methods. Finally, Section VI concludes the paper and suggests potential directions for future research.

II Related Work

In this section, we begin with a comprehensive summarization of representative deep set architectures for learning from static 3D point clouds. Then, we focus on the more challenging problem of learning from dynamic 3D point cloud sequences, which is directly related to the scope of our work.

II-A Deep Learning on Static 3D Point Clouds

Recent years have witnessed the proliferation of various deep set architectures that directly operate on unstructured 3D point sets without any pre-processing procedures, as pioneered by [5, 6]. Follow-up research has further investigated rich varieties of convolution-based [14, 15, 16, 17, 18, 19, 20, 9, 10], graph-based [21, 7, 22], and transformer-based [23, 8, 24, 25] backbone learning architectures.

More specifically, PointNet [5] performs point-wise embedding and global aggregation via shared multi-layer perceptrons (MLPs) and channel max-pooling. PointNet++ [6] further incorporates multi-scale hierarchical feature abstraction mechanisms with farthest point sampling (FPS) and neighborhood interpolation. SO-Net [14] explicitly utilizes the spatial distribution of point clouds by building self-organizing maps. PointCNN [15] applies learned transformations for adaptively weighting and permuting point features in potentially canonical orders. FeaStNet [21] dynamically builds correspondences between filter weights and graph nodes. KPConv [16] learns to continuously locate convolution weights in Euclidean space through kernel points and further explores spatially deformable convolution operators. RS-CNN [17] focuses on capturing the high-level relations among local point sets via learning from predefined geometric priors. DGCNN [7] queries neighbors globally in the feature space for constructing graph edges and performing dynamic feature aggregation. PointASNL [18] adopts an adaptive sampling scheme together with local-nonlocal cells to improve its model robustness when dealing with noise and outliers. CurveNet [20] aggregates hypothetical curves initially grouped through guided walks to augment point-wise features. PointMLP [26] builds a pure residual MLP network equipped with lightweight geometric affine modules, which can perform competitively without any sophisticated learning mechanisms. PointNeXt [9] systematically revisits the classic PointNet++ processing pipeline and then improves model training and scaling strategies. PT [8] investigates highly-expressive self-attention layers for point clouds. FastPT [24] incorporates lightweight local self-attention with voxel hashing architectures to encode continuous 3D positional information to greatly enhance computational efficiency.
In summary, the above approaches are specialized for learning from a single 3D point cloud input to extract geometric feature representations in the spatial domain. The adaptation to dynamic point cloud learning for achieving sequence processing and analysis is highly non-trivial, due to the complexity of joint spatio-temporal modeling.

Besides, another line of work resorts to a different perspective for overcoming the irregularity and unstructuredness of 3D point clouds by creating regular 2D geometry image (GI [27]) or GI-like representation structures, on which numerous mature 2D modeling architectures (e.g., convolutional neural networks (CNNs)) can be seamlessly applied, or adapted with minimal modifications, to achieve various point cloud learning tasks. Representatively, [28, 29, 30, 31] exploit traditional mesh parameterization techniques to implement the process of surface-to-plane mapping, while [32, 33] propose to directly learn deep 2D regular representations from 3D point clouds in an unsupervised manner. These works are also related to the previous folding-based methods [34, 35, 36]. However, these approaches are still limited to static geometric representations, and are thus inapplicable to dynamic point cloud sequences with requirements of spatio-temporal structurization.

II-B Deep Learning on 3D Point Cloud Sequences

DPMix [37] employs 2D depth video methods to enhance the understanding of point cloud videos. Other methods explore the temporal interpolation of the 3D point cloud sequences [38, 39, 40]. These advancements demonstrate the growing interest and diversification in approaches to effectively model the dynamic nature of 3D point cloud sequences.

Specifically, PSTNet [41] introduces the point spatio-temporal (PST) convolution, leveraging a ‘point tube’ concept to aggregate local spatio-temporal features from 3D point cloud sequences. Building upon this, PSTNet2 [42] merges spatial features into a unified spatio-temporal representation. MaST-Pre [43] advances this concept with a masked point tube for pre-training on point cloud videos, incorporating the spatiotemporal masked autoencoder idea from [44]. P4Transformer [45] boosts the performance of PSTNet by employing transformers to bypass the need for point tracking while capturing spatio-temporal correlations across point cloud sequences. Similarly, PST$^{2}$ [46] introduces a spatio-temporal self-attention (STSA) module to grasp contextual information across adjacent frames. This concept is further enhanced by PST-Transformer [12] with spatio-temporal encoding. These architectures primarily follow an autoencoder-like framework to process dynamic point cloud sequences, akin to the role of PointNet++ [6] in handling static point clouds. While these methods introduce specially designed spatio-temporal feature aggregation structures to adapt to dynamic point cloud learning challenges, they do not fundamentally address the inherent irregularities of spatio-temporal characteristics in 3D point cloud sequences. In essence, the irregular data modality of 3D point cloud sequences remains unaltered.

Another line of research explores implicit representation methods. OFlow [47] adopts implicit strategies for 4D reconstruction, utilizing ground truth occupancy and trajectory information. CaSPR [48] introduces spatio-temporal representations for objects within normalized coordinate space. Similarly, CaDeX [49] tackles inter-frame deformations by employing continuous bijective canonical maps, focusing on a canonical shape. Notably, these methodologies primarily utilize mesh, voxel, or occupancy data, diverging from pure point cloud sequences. Our work, in contrast, concentrates on self-supervised explicit representations. Specifically, we structure the irregular 3D point cloud sequences as structured point cloud videos, aiming to achieve a compact representation that effectively minimizes both distortion and storage requirements.

A variety of methods have been developed for specific 3D point cloud sequence processing tasks, as well as related technologies. These technologies include flow-based [50, 51, 52], depth-based [53], GAN-based [54], kinematics-inspired [55], contrastive learning [56], and correspondence approaches [57, 58]. There are also some task-specific sequence processing methods, notably compression [59, 60] and interpolation [38, 39, 40].

For example, PV-RAFT [50] proposes a point-voxel recurrent all-pairs field transforms method to estimate scene flow from point clouds. NSFP [51] revisits the scene flow problem, emphasizing runtime optimization and regularization of the scene flow. SSFE [52] aims to predict 3D scene flow for all pairs of point clouds in a given sequence. 3DV [53] utilizes the dynamic 3D voxel for depth-based 3D action recognition. TPU-GAN [54] proposes a super-resolution generative adversarial network for upsampling dynamic point cloud sequences. Kinet [55] explores kinematics-inspired neural networks to capture 3D motions without explicit tracking. These methods employed various approaches, such as utilizing voxel and depth information for feature extraction, utilizing GAN-based methods to avoid the need for explicit point correspondence annotation, or relying on explicit scene flow annotation datasets. Overall, these methods did not directly address the underlying irregularity issues of 3D point cloud data. Furthermore, they lacked powerful insights into the explicit geometric aspects of 3D point clouds.

III Proposed Method

III-A Problem Statement

As mentioned earlier, learning from irregular and unordered point sets presents a significant challenge compared to feature extraction from 2D images or videos organized in a regular grid. Essentially, an RGB image of size $H\times W\times 3$ can be regarded as a set of 5-dimensional vectors denoted as $(r,g,b,h,w)$, where the color information (or pixel value) $(r,g,b)$ to be processed is indexed by the 2D coordinate $(h,w)$ uniformly distributed on a regular 2D grid of size $H\times W$. For point cloud data, in contrast, the spatial coordinates $(x,y,z)$ serve as both the geometry information to be processed and the indirect indexing cues to be used for determining inter-point relationships such as proximity and correspondence. Consequently, existing point-based deep learning architectures often fall short in efficiency when processing large-scale point data due to their involvement in complex learning mechanisms, including cumbersome pre-processing, extensive neighbor querying, and costly sampling, matching, and multi-scale abstraction processes. The significant computational complexity of the aforementioned time-consuming operations in turn hinders the development of deeper or more sophisticated architectural models for 3D point cloud sequences, thus limiting performance on downstream tasks. Additionally, these operations increase memory consumption, particularly when processing denser 3D point cloud sequences. Furthermore, measuring the discrepancy between two point clouds poses a significant challenge due to their irregular and unordered nature, making standard methods like CD and EMD either ineffective or inefficient [61].
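To make the contrast between set-based and grid-based discrepancy measures concrete, the following is a minimal NumPy sketch, not the implementation used in this work. It assumes one common variant of CD (squared Euclidean distances, averaged in both directions); the function names `chamfer_distance` and `grid_l1` and the toy data are illustrative assumptions.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between point sets a (N, 3) and b (M, 3).

    Assumed variant: squared Euclidean distances, brute-force O(N * M) matching.
    """
    # Pairwise squared distances, shape (N, M).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    # Average nearest-neighbor distance in both directions.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def grid_l1(g_a, g_b):
    """Mean absolute error between two (U, V, 3) grids, compared pixel by pixel."""
    return np.abs(g_a - g_b).mean()

rng = np.random.default_rng(0)
grid = rng.normal(size=(16, 16, 3))      # toy (U, V, 3) grid of 3D coordinates
shifted = grid + 0.1                     # same pixel layout, every coordinate shifted

points = grid.reshape(-1, 3)
shuffled = rng.permutation(points)       # same point set, rows in a different order

cd_same_set = chamfer_distance(points, shuffled)   # ignores point order
l1_shifted = grid_l1(grid, shifted)                # relies on a shared pixel order
```

The Chamfer term needs an N x M distance matrix because the correspondence between the two sets is unknown, whereas the grid loss is a single element-wise pass that only makes sense once both inputs share the same pixel layout.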

The shift from static point cloud to 3D point cloud sequence modeling introduces additional challenges, exacerbated by the inherent lack of structure in both spatial and temporal dimensions. Current approaches rely heavily on intensive spatial neighborhood searches and the establishment of temporal correspondences, necessitating complex spatio-temporal feature processing. This intricate methodology is not just laborious and costly—it also substantially obstructs the creation of streamlined and efficient learning frameworks for 3D point cloud sequences.

III-B Our Objective

Denote by $\mathcal{P}=\{\mathbf{P}_{t}\in\mathbb{R}^{N_{t}\times 3}\}_{t=0}^{T-1}$ a dynamic 3D point cloud sequence with $T$ point cloud frames, where $N_{t}$ refers to the number of points contained in the $t$-th frame (Footnote 1: without loss of generality, we assign a fixed number of points to all frames in practice). Note that the point-wise correspondence across the frames of $\mathcal{P}$ is unknown. Inspired by the above-mentioned structure of 2D images/videos and the fact that 3D geometric shapes are essentially 2D manifolds, we aim to structure dynamic 3D point cloud sequences to address the aforementioned challenges.

Generally, for each point of a typical point cloud frame $\mathbf{P}_{t}$, we plan to obtain a unique 2D coordinate $(u,v)$, thereby forming a 5-dimensional signal $(x,y,z,u,v)$. Meanwhile, points that are nearest neighbors in 3D space should have close 2D coordinates. When the $(u,v)$ coordinates are uniformly distributed across the regular 2D grid, the structure effectively resembles that of an RGB image. In other words, we can fill $[x,~y,~z]$ into the corresponding $(u,v)$ location, constructing an RGB image-like representation of dimensions $U\times V\times 3$ ($U\times V=N_{t}$). After processing all point cloud frames of $\mathcal{P}$ in this manner, we can derive a 2D video-like representation of $\mathcal{P}$, denoted as $\mathcal{G}=\{\mathbf{G}_{t}\in\mathbb{R}^{U\times V\times 3}\}_{t=0}^{T-1}$, named Structured Point Cloud Video (SPCV). Moreover, we constrain that the encoded points at an identical location across different frames of $\mathcal{G}$ correspond to the same or approximately the same position of the 3D object/scene, a form of temporal correspondence that traditional RGB videos do not provide.
In the following, we also call an element of $\mathbf{G}_{t}$ a pixel for convenience. Note that $\mathbf{G}_{t}$ can be directly reshaped/reprojected back to a point cloud without any additional computation, i.e., the value of a pixel is taken as a 3D point.
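As an illustration of this reshaping, the sketch below treats an SPCV as a NumPy array of shape (T, U, V, 3). The array contents, the sizes, and the row-major pixel ordering are assumptions made for the example only.

```python
import numpy as np

T, U, V = 4, 8, 8                        # hypothetical sequence length and grid size
rng = np.random.default_rng(0)
spcv = rng.normal(size=(T, U, V, 3))     # random stand-in for a learned SPCV

# Reprojection: each (U, V, 3) frame becomes a (U * V, 3) point cloud.
point_clouds = spcv.reshape(T, U * V, 3)

# With row-major ordering, pixel (u, v) maps to row u * V + v of the point cloud.
t, u, v = 1, 2, 5
assert np.array_equal(point_clouds[t, u * V + v], spcv[t, u, v])

# Reading the same pixel across all frames gives a (T, 3) sequence of positions,
# which is a point trajectory whenever the temporal constraint above holds.
trajectory = spcv[:, u, v, :]
```

The reshape is a view of the same data rather than a copy, which matches the statement that no additional computation is required.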

Figure 4: Illustration of the local spatial smoothness and temporal consistency properties of our proposed representation, named SPCV. Here, we take two frames of a dynamic point cloud sequence as an example.

In summary, we anticipate that the resulting SPCV should exhibit the following two characteristics, as illustrated in Fig. 4:

  1. Spatial Smoothness: the pixels in a local patch of a typical frame of an SPCV correspond to a set of neighboring points in 3D space;

  2. Temporal Consistency: the pixels with identical locations across the $T$ frames of an SPCV correspond to the consistent position of the object/scene.

The proposed representation modality is able to offer the following advantages:

  1. Improved Efficiency and Performance of Downstream Applications. The SPCV framework streamlines downstream processing and analysis tasks by replacing complex and time-consuming operations like FPS and K-NN aggregation with simpler and highly efficient pixel indexing and sliding window mechanisms. Meanwhile, the uniform structure of SPCV relieves the challenge faced by neural networks in feature learning, thereby potentially generating more expressive features. Moreover, SPCV can seamlessly integrate with established learning techniques used in 2D image/video network designs, enhancing overall performance.

  2. Reduction of Memory Consumption. SPCV efficiently reduces memory usage by indexing frames using variable pixel locations, in contrast to traditional methods that necessitate storing multiple scales of point clouds for feature aggregation. This advantage becomes particularly evident when processing dense or large-scale point clouds.

  3. Simplified Design and Implementation. Adopting SPCV simplifies the design and implementation complexities in point cloud applications. For instance, encoders with specialized spatio-temporal feature extraction can be replaced with encoders that are already commonly used in the 2D image/video community.

  4. Free of the Limitations of CD and EMD. Traditional models often struggle with inefficiencies from point cloud loss functions like CD and EMD. The SPCV framework, however, enables more efficient and straightforward optimization using grid-based losses like the Frobenius norm or $\mathcal{L}_{1}$, thanks to its inherent structure.
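The first advantage can be illustrated with a small NumPy sketch that compares brute-force K-NN grouping with sliding-window indexing. It is not the pipeline used in this work: the helper names are hypothetical, and the flat toy surface is chosen so that the two neighborhoods coincide exactly.

```python
import numpy as np

def knn_indices(points, k):
    """Brute-force k nearest neighbors of every point; O(N^2) time and memory."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    # For distinct points, column 0 of the sorted order is the point itself.
    return np.argsort(d2, axis=1)[:, 1:k + 1]

def grid_patches(grid, size=3):
    """All size x size patches of a (U, V, 3) grid, obtained by indexing only."""
    windows = np.lib.stride_tricks.sliding_window_view(grid, (size, size), axis=(0, 1))
    # windows: (U - size + 1, V - size + 1, 3, size, size); move channels last.
    return np.moveaxis(windows, 2, -1)

U, V = 6, 7
uu, vv = np.meshgrid(np.arange(U), np.arange(V), indexing="ij")
grid = np.stack([uu, vv, np.zeros_like(uu)], axis=-1).astype(float)  # flat toy surface
points = grid.reshape(-1, 3)

neighbors = knn_indices(points, k=8)     # (U * V, 8) indices into points
patches = grid_patches(grid, size=3)     # (U - 2, V - 2, 3, 3, 3)

# For an interior pixel (u, v), the patch centered on it is patches[u - 1, v - 1].
u, v = 3, 4
knn_set = {tuple(p) for p in points[neighbors[u * V + v]]}
patch_set = {tuple(p) for p in patches[u - 1, v - 1].reshape(-1, 3)} - {tuple(grid[u, v])}
```

On this spatially smooth toy grid, `knn_set` and `patch_set` contain the same eight points, but the patch is obtained without computing any distances. On a learned SPCV the two neighborhoods would agree only approximately, to the extent that spatial smoothness holds.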

Figure 5: Flowchart of the SPCV representation for 3D point cloud sequences.

III-C Our Pipeline

Technical Intuition. Intuitively, to achieve the objective of re-organizing $\mathcal{P}$ into $\mathcal{G}$ through a learning-based method, we can construct a neural network, which takes $\{(x,y,z)\}$ as input and outputs $\{(u,v)\}$, and train it by using $(x,y,z)$ and $(u,v)$ paired training data. However, adopting a supervised learning approach presents significant challenges. Even for static point clouds exhibiting arbitrary and non-manifold geometric structures, acquiring the ground-truth $(u,v)$ poses a formidable task due to the complex parameterization problem [62]. This challenge becomes even more pronounced when considering dynamic point cloud sequences, as obtaining additional accurate point-wise correspondences across frames as ground-truth remains a persistent and challenging problem.

To overcome this challenge, we consider constructing a self-supervised pipeline without relying on ground-truth 2D locations and temporal correspondences. Specifically, for a single point cloud, we first pre-define a set of 2D locations uniformly distributed on a regular 2D grid of size $U\times V$, each location storing/encoding $(u,v,0)$. We then feed $\{(u,v,0)\}$ into a learnable neural network to generate $\{(\hat{x},\hat{y},\hat{z})\}$, which are further assigned to the corresponding locations of their inputs. We can utilize the original point cloud as supervision to drive the learning of the network, ensuring that the generated $\{(\hat{x},\hat{y},\hat{z})\}$ closely resemble $\{(x,y,z)\}$. Moreover, owing to the inherent spectral bias property of neural networks [63], the generated data from spatially smooth inputs are expected to preserve spatial smoothness.
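The pre-defined input grid can be sketched in a few lines of NumPy. This is an illustrative construction only: the function name `make_init_grid`, the unit-step integer sampling, and the `float32` dtype are assumptions, and any normalization applied in practice is not covered here.

```python
import numpy as np

def make_init_grid(U, V):
    """Return a (U, V, 3) array whose pixel (u, v) stores the vector (u, v, 0)."""
    # indexing="ij" makes the first axis follow u and the second follow v.
    u, v = np.meshgrid(np.arange(U), np.arange(V), indexing="ij")
    return np.stack([u, v, np.zeros_like(u)], axis=-1).astype(np.float32)

g_init = make_init_grid(4, 5)            # hypothetical grid size
```

A network that maps this array to another (U, V, 3) array keeps the pixel layout fixed, so each generated 3D point is already attached to a 2D location.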

Furthermore, observing that a dynamic 3D point cloud sequence recording the motion of an object can be essentially thought of as a set of 3D points with a fixed topology deforming/evolving over time, we can adopt the following procedure to structure it. First, we represent $\mathbf{P}_{0}$ through the above-mentioned self-supervised learning approach for single point clouds, producing $\mathbf{G}_{0}$. We then obtain the representations of the remaining $T-1$ point cloud frames $\{\mathbf{P}_{t}\}_{t=1}^{T-1}$ by learning the deformation fields that deform $\mathbf{G}_{t-1}$ to $\mathbf{G}_{t}$ ($t\in[1,T-1]$) in a recursive fashion, i.e., $\mathbf{G}_{t}=\mathbf{G}_{t-1}+\bm{\Delta}_{\mathbf{G}_{t}}$, where $\bm{\Delta}_{\mathbf{G}_{t}}\in\mathbb{R}^{U\times V\times 3}$ stands for the deformation field for $\mathbf{G}_{t}$.
To be specific, we can minimize the discrepancy between $\mathbf{P}_{t}$ and $\mathbf{G}_{t}$ (Footnote 2: note that we reshape $\mathbf{G}_{t}$ into a point cloud and compute the discrepancy in 3D space) to estimate $\bm{\Delta}_{\mathbf{G}_{t}}$. Such a procedure aligns the resulting structures of $\{\mathbf{G}_{t}\}_{t=1}^{T-1}$ with that of $\mathbf{G}_{0}$, leading to the spatial smoothness of all frames. Also, the deformation mechanism establishes a fixed topology across all frames, thus naturally preserving temporal consistency across all frames.
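The recursion can be written out as a short NumPy sketch. Here the deformation fields are random placeholders rather than learned quantities, and `accumulate_frames` is a hypothetical helper name; the sketch only shows how the frames are composed once the fields are available.

```python
import numpy as np

def accumulate_frames(g0, deltas):
    """Apply G_t = G_{t-1} + Delta_t for t = 1..T-1 and return all T frames.

    g0:     (U, V, 3) first structured frame.
    deltas: (T - 1, U, V, 3) per-frame deformation fields.
    """
    frames = [g0]
    for delta in deltas:
        frames.append(frames[-1] + delta)
    return np.stack(frames)              # (T, U, V, 3)

U, V, T = 4, 4, 5                        # hypothetical sizes
rng = np.random.default_rng(0)
g0 = rng.normal(size=(U, V, 3))
deltas = 0.01 * rng.normal(size=(T - 1, U, V, 3))   # placeholder deformation fields
spcv = accumulate_frames(g0, deltas)
```

Because each step changes only the values stored at the pixels and never their indices, pixel $(u,v)$ refers to the same surface element in every frame, which is how the fixed topology carries over from $\mathbf{G}_{0}$.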

Technical Details. Based on the above analyses, we construct the self-supervised learning pipeline shown in Fig. 5 for representing $\mathcal{P}$ into $\mathcal{G}$. It consists of two main stages: (1) frame-wise structurization and (2) sequence-wise structurization, which are detailed as follows.

III-C1 Frame-wise Structurization

This stage only overfits the first point cloud frame of $\mathcal{P}$ (i.e., $\mathbf{P}_0$) into the first frame of $\mathcal{G}$ (i.e., $\mathbf{G}_0$) through a self-supervised neural network regularized with additional geometrically meaningful constraints. Specifically, we first construct a pre-defined 2D grid $\mathbf{G}_{\rm init}\in\mathbb{R}^{U\times V\times 3}$, with each pixel filled with $(u,v,0)$, where $u$ and $v$ are sampled regularly from $[0,U-1]$ and $[0,V-1]$, respectively. We then feed $\mathbf{G}_{\rm init}$ into a network composed of 2D convolutional (Conv2D) layers, denoted as $f_{\bm{\theta}}(\cdot)$ with $\bm{\theta}$ being the network parameters. Owing to the inherent spectral bias property of neural networks [63], the generated image $\mathbf{G}_0=f_{\bm{\theta}}(\mathbf{G}_{\rm init})\in\mathbb{R}^{U\times V\times 3}$ is expected to be spatially smooth.
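As a small illustration of the pre-defined grid, the following sketch builds $\mathbf{G}_{\rm init}$ with NumPy. It assumes unit-step integer sampling of $u$ and $v$; the helper name `make_init_grid` is hypothetical and not taken from the released code.

```python
import numpy as np

def make_init_grid(U, V):
    """Return a U x V x 3 array whose pixel (u, v) stores (u, v, 0)."""
    u, v = np.meshgrid(np.arange(U, dtype=np.float32),
                       np.arange(V, dtype=np.float32), indexing="ij")
    return np.stack([u, v, np.zeros_like(u)], axis=-1)

# Example with a small 4 x 6 grid (sizes chosen only for illustration).
G_init = make_init_grid(4, 6)
```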
Let $\widehat{\mathbf{P}}_0$ be the point cloud form reprojected from $\mathbf{G}_0$. We minimize the discrepancy between $\widehat{\mathbf{P}}_0$ and $\mathbf{P}_0$ to enforce the shape represented by $\widehat{\mathbf{P}}_0$ to approximate $\mathbf{P}_0$. Thus, we drive the learning of $f_{\bm{\theta}}(\cdot)$ by optimizing the following loss function:

$$\min_{\bm{\theta}}\ \texttt{D}(\widehat{\mathbf{P}}_0,\mathbf{P}_0)+\texttt{R}_{\rm geo}(\mathbf{G}_0), \tag{1}$$

where $\texttt{D}(\cdot,\cdot)$ stands for a typical distance metric for 3D point clouds [64], and $\texttt{R}_{\rm geo}(\cdot)$ is a geometry-aware regularization term further promoting the spatial smoothness characteristic of $\mathbf{G}_0$. Specifically, we explicitly regularize the generated pixel values of $\mathbf{G}_0$ to promote its spatial smoothness, i.e., the value of a typical pixel of $\mathbf{G}_0$ should be close to the average of those of its neighboring pixels. In addition, based on the fact that the normal of a typical point in a locally smooth region of a 3D surface is generally close to the average of the normals of its neighboring points, we also regularize the geometry attribute of the generated pixels of $\mathbf{G}_0$, i.e., the normal of a typical pixel of $\mathbf{G}_0$ should be close to the average of those of its neighboring pixels. As $f_{\bm{\theta}}(\cdot)$ essentially models the process of 2D-to-3D mapping, we can compute the normal of a pixel located at $(u,v)$ through the cross product of $\mathbf{G}_0$'s partial derivatives with respect to $(u,v)$:

$$\mathbf{V}_0^{(u,v)}=\frac{\partial\mathbf{G}_0^{(u,v)}}{\partial u}\times\frac{\partial\mathbf{G}_0^{(u,v)}}{\partial v}, \tag{2}$$

where the partial derivatives can be derived through the back-propagation of $f_{\bm{\theta}}(\cdot)$. In other words, such regularization constrains the network behavior. In all, we explicitly write $\texttt{R}_{\rm geo}(\mathbf{G}_0)$ as

$$\begin{aligned}\texttt{R}_{\rm geo}(\mathbf{G}_0)=\sum_{(u,v)}\Big[&\lambda_{\rm s}\Big(\mathbf{G}_0^{(u,v)}-\frac{1}{L^2}\sum_{(u',v')\in\mathcal{N}_L(u,v)}\mathbf{G}_0^{(u',v')}\Big)^2\\ &+\lambda_{\rm n}\Big(\mathbf{V}_0^{(u,v)}-\frac{1}{L^2}\sum_{(u',v')\in\mathcal{N}_L(u,v)}\mathbf{V}_0^{(u',v')}\Big)^2\Big],\end{aligned} \tag{3}$$

where $\lambda_{\rm s}$ and $\lambda_{\rm n}$ are two weights balancing the spatial and normal regularization terms, and $\mathcal{N}_L(\cdot,\cdot)$ represents the $L\times L$ window around the center pixel.
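A rough NumPy sketch of the regularizer in Eq. (3) is given below. It departs from the text in one respect that should be read as an assumption: the partial derivatives in Eq. (2) are approximated with finite differences (`np.gradient`) instead of being obtained by back-propagation through $f_{\bm{\theta}}(\cdot)$. The squared term is interpreted as a squared Euclidean norm over the three channels, borders are handled by edge replication, and $L$ is assumed odd; the helper names are hypothetical.

```python
import numpy as np

def box_mean(X, L):
    """Mean over the L x L window around each pixel (edge-replicated borders, odd L)."""
    r = L // 2
    Xp = np.pad(X, ((r, r), (r, r), (0, 0)), mode="edge")
    U, V, _ = X.shape
    out = np.zeros_like(X)
    for du in range(L):
        for dv in range(L):
            out += Xp[du:du + U, dv:dv + V]
    return out / (L * L)

def normals(G):
    """Unnormalized normals: cross product of finite-difference dG/du and dG/dv."""
    dG_du = np.gradient(G, axis=0)
    dG_dv = np.gradient(G, axis=1)
    return np.cross(dG_du, dG_dv)

def r_geo(G, L=3, lam_s=1.0, lam_n=1.0):
    """Spatial plus normal smoothness penalty, summed over all pixels."""
    Vn = normals(G)
    spatial = np.sum((G - box_mean(G, L)) ** 2)
    normal = np.sum((Vn - box_mean(Vn, L)) ** 2)
    return lam_s * spatial + lam_n * normal

# Toy inputs (assumption): a flat grid, a constant image, and a noisy grid.
u, v = np.meshgrid(np.arange(8.0), np.arange(8.0), indexing="ij")
plane = np.stack([u, v, np.zeros_like(u)], axis=-1)
const = np.full((8, 8, 3), 2.0)
noisy = plane + 0.5 * np.random.default_rng(0).normal(size=plane.shape)
```

On these toy inputs the penalty is zero for the constant image and grows when noise is added to the flat grid, which matches the intended role of the term.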

Figure 6: Flowchart of the action recognition model based on (a) SPCV and (b) PST-Transformer [12]. (a) and (b) adopt decoders with the same architecture.

III-C2 Sequence-wise Structurization

This stage represents the remaining $(T-1)$ point cloud frames of $\mathcal{P}$ (i.e., $\{\mathbf{P}_t\}_{t=1}^{T-1}$) as the frames of $\mathcal{G}$ (i.e., $\{\mathbf{G}_t\}_{t=1}^{T-1}$) through an optimization scheme, which predicts the deformation field between two adjacent frames to deform $\mathbf{G}_{t-1}$ to $\mathbf{G}_t$ recursively, based on $\mathbf{G}_0$, i.e., $\mathbf{G}_t=\mathbf{G}_{t-1}+\bm{\Delta}_{\mathbf{G}_t}$ ($t\in[1,T-1]$), where $\bm{\Delta}_{\mathbf{G}_t}\in\mathbb{R}^{U\times V\times 3}$ stands for the deformation field at the $t$-th step. By establishing inter-frame relationships within $\mathcal{G}$ through deformation fields, we can achieve temporal consistency effectively.
To be specific, we formulate the following optimization problem to obtain $\bm{\Delta}_{\mathbf{G}_t}$:

$$\min_{\bm{\Delta}_{\mathbf{P}_t}}\ \texttt{D}(\widehat{\mathbf{P}}_{t-1}+\bm{\Delta}_{\mathbf{P}_t},\mathbf{P}_t)+\lambda\,\texttt{R}_{\rm smooth}(\bm{\Delta}_{\mathbf{P}_t}), \tag{4}$$

where $\widehat{\mathbf{P}}_{t-1}\in\mathbb{R}^{UV\times 3}$ and $\bm{\Delta}_{\mathbf{P}_t}\in\mathbb{R}^{UV\times 3}$ are the point cloud forms reshaped from $\mathbf{G}_{t-1}$ and $\bm{\Delta}_{\mathbf{G}_t}$, respectively, $\lambda>0$ balances the two terms, and $\texttt{R}_{\rm smooth}(\cdot)$ regularizes $\bm{\Delta}_{\mathbf{P}_t}$ (or $\bm{\Delta}_{\mathbf{G}_t}$) to be locally smooth/constant, based on the fact that neighboring points on $\widehat{\mathbf{P}}_{t-1}$ exhibit similar deformation. Such regularization also propagates the spatial smoothness of $\mathbf{G}_{t-1}$ to $\mathbf{G}_t$. It is defined as

$$\texttt{R}_{\rm smooth}(\bm{\Delta}_{\mathbf{P}_t})=\frac{1}{N}\sum_{i=1}^{N}\frac{\sum_{j\in\mathcal{N}_K(i)}w(i,j)\,\|\bm{\Delta}_{\mathbf{P}_t}^{(i)}-\bm{\Delta}_{\mathbf{P}_t}^{(j)}\|^2}{\sum_{j\in\mathcal{N}_K(i)}w(i,j)}, \tag{5}$$

where $\mathcal{N}_K(\cdot)$ returns the indices of the $K$-NN points, and $w(i,j)=1/\|\mathbf{P}_{t-1}^{(i)}-\mathbf{P}_{t-1}^{(j)}\|^2$. We solve Eq. (4) using the Adam optimizer [65], where $\bm{\Delta}_{\mathbf{P}_t}$ is initialized with the values from the point cloud form of $\mathbf{G}_{\rm init}$.
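The smoothness term of Eq. (5) can be sketched with NumPy as follows. This covers only the regularizer, not the Adam-based solution of Eq. (4). The brute-force neighbor search, the small `eps` added to avoid division by zero, and the function names are assumptions made for this illustration.

```python
import numpy as np

def knn_indices(P, K):
    """Indices and squared distances of the K nearest neighbours (self excluded)."""
    d2 = np.sum((P[:, None, :] - P[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)
    idx = np.argsort(d2, axis=1)[:, :K]
    return idx, np.take_along_axis(d2, idx, axis=1)

def r_smooth(delta, P_prev, K=4, eps=1e-12):
    """Distance-weighted average of squared deformation differences over K-NN."""
    idx, d2 = knn_indices(P_prev, K)
    w = 1.0 / (d2 + eps)                       # w(i, j) = 1 / ||P_i - P_j||^2
    diff2 = np.sum((delta[:, None, :] - delta[idx]) ** 2, axis=-1)  # shape (N, K)
    return np.mean(np.sum(w * diff2, axis=1) / np.sum(w, axis=1))

# Toy data (assumption): random points, a random field, and a pure translation.
rng = np.random.default_rng(0)
P_prev = rng.normal(size=(50, 3))
delta_random = rng.normal(size=(50, 3))
delta_translation = np.ones((50, 3))
```

A pure translation incurs no penalty, while an incoherent random field does, so the term favors deformations that move neighboring points together.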

It is also worth noting that, as a trade-off between efficiency and effectiveness, we implement the second stage with an optimization-based approach. It is straightforward to substitute it with a self-supervised learning-based approach, in which a neural network is trained to predict the deformation field for each subsequent frame, when higher performance is desired.

III-D Discussion

Generally, the proposed SPCV shares a similar concept with previous parameterization-based geometry image/video representation for 3D triangle meshes/dynamic mesh sequences [27, 66, 67, 68], i.e., representing 3D geometry data with a 2D RGB image/video-like structure. However, our work significantly distinguishes itself from previous methods in terms of motivation, technical implementation, and application scenarios. First, previous methods map 3D vertices to the 2D domain through traditional parameterization algorithms optimizing the area and/or angle distortions of triangles. Due to the fundamental challenges of parameterization, these methods impose high requirements on the topological structures of 3D meshes to be processed, along with the requirement of correspondence across frames to model temporal information. Given that 3D point cloud sequences lack connectivity information and temporal correspondence and may have arbitrarily complex topological structures, these methods are largely inapplicable to our scenario. Second, driven by the objective of separating the indexing cues from the geometry information for improving efficiency, our pipeline re-organizes neighboring 3D points determined in the sense of the Euclidean distance into adjacent 2D locations on a regular 2D grid. The manner of determining a neighborhood is consistent with that of mainstream deep point-based architectures. In contrast, the parameterization-based methods essentially achieve the mapping in a geodesic distance-aware fashion. Accordingly, the theorems they employed are not suitable for our method. Finally, it is also worth noting that our method mimics the behavior of traditional parameterization when handling 3D objects with low curvatures, as demonstrated in Fig. 7.

In summary, unlike the previous representation mechanism pursuing accurate parameterization, the proposed learning-based pipeline leverages the impressive capability of deep learning to offer a practical and promising solution for real-world applications. The advantages and effectiveness of our proposed SPCV are extensively demonstrated through experiments on various downstream tasks in Section IV.

Figure 7: Example of representing a sphere point cloud (left) as an SPCV frame (right) with the checkerboard texture.

IV SPCV-based Applications

To verify the practical advantages of representing 3D point cloud sequences as SPCVs, we construct task-specific processing frameworks that directly consume SPCVs as inputs to achieve the corresponding dynamic point cloud applications. As aforementioned, the 2D video-like structure of our SPCV representation enables the adoption of various existing well-established 2D image/video deep learning techniques to construct task-specific frameworks.

Essentially, SPCV is designed to be a generic representation structure, meaning that there are no restrictions on the downstream task scenarios. In principle, the downstream tasks should be sensitive to spatial smoothness and temporal consistency of input SPCVs, such that the resulting task performances can better indicate the representation quality of SPCVs. And the downstream tasks should cover both high-level semantic understanding and low-level geometry processing applications. Additionally, both learning-based and non-learning-based tasks are necessary to demonstrate the versatility of our SPCV representations. According to the above principles, we consider three targeted downstream tasks: 1) learning-based 3D action recognition; 2) learning-based temporal interpolation of point cloud sequences; and 3) non-learning-based 3D point cloud data compression. Specifically, action recognition evaluates the discriminative and expressive capabilities of high-level features learned through the SPCV representation. The effectiveness of temporal interpolation and compression serves as a measure of SPCV’s representational quality in both spatial and temporal domains.

Figure 8: Flowchart of the proposed SPCV-based temporal interpolation for 3D point cloud sequences.

IV-A 3D Action Recognition

As a common sequence-level task in point cloud sequence processing, this task evaluates the model’s ability to extract discriminative features from each 3D point cloud sequence, thereby determining the most probable label for the sequence. Existing methods [69, 42, 45, 12] typically employ point-based spatio-temporal convolution operations, e.g., PST convolution [69, 42], for learning feature encodings across spatio-temporal dimensions, or transformers [45, 12] for spatio-temporal context interaction. However, due to the insufficient design to cope with the unstructured characteristic of 3D point cloud data, their capability of extracting discriminative features efficiently and effectively is still relatively weak, thus limiting their accuracy. With the video-like representation, our approach enables the use of mature 2D convolutional operations to obtain more discriminative and expressive features. We modify the feature embedding process, common in approaches like PST-Transformer [12], by employing an efficient convolutional encoder. This modification efficiently generates more expressive feature tokens for the transformer.

Specifically, as illustrated in Fig. 6 (a), the proposed SPCV-based framework first spatially downsamples an input SPCV to a quarter of its original scale and then passes it into a video-based encoder, an 18-layer ‘UNet3D’ architecture styled after ResNet [2], a design also used in conventional video-based methods [70, 71]. In the network’s bottleneck, the scaled SPCV frames, along with features scaled down to a quarter of their original size, are converted into sequences of point clouds and their corresponding features. These sequences are then fed into a transformer-based decoder to enhance feature space interactions. Inside the decoder, the transformer’s output features undergo two layers of max-pooling to produce batches of feature vectors. These vectors are subsequently processed through an MLP head to generate the final scores, indicating the probabilities of various classes for the input sequence.

IV-B Temporal Interpolation of 3D Point Cloud Sequences

This task involves reconstructing a high temporal resolution 3D point cloud sequence from a low-resolution one. It is the counterpart of 2D video frame interpolation, which aims to increase the frame rate of a video by generating the in-between frames. The unstructured nature of individual frames and the lack of correspondence across frames compound the complexity of this task, especially for sequences with large non-rigid deformation. PointINet [38] addresses this task by adopting a scene flow estimator for interpolation. IDEA-Net [39] estimates point-wise trajectories for coarse interpolation and trajectory compensation. These methods suffer from inefficient and ineffective feature extraction and rely on distribution-based loss functions to interpolate point cloud frames with restricted fidelity. To address these issues, we develop an interpolation model based on an image encoder. Leveraging the structured nature of our representation, the mature image encoder efficiently generates more effective feature embedding tokens, which are then enhanced through downstream transformers in both temporal and spatial domains for feature interaction. Moreover, our model can be driven by common 2D loss functions such as the Frobenius norm, enabling efficient interpolation of higher-quality point cloud frames.

Fig. 8 illustrates the overall learning pipeline of our proposed SPCV-based point cloud frame interpolation. Specifically, for an input pair of consecutive frames at time steps $t_1$ and $t_2$, we consider interpolating a new frame at time step $t$, where $t_1<t<t_2$. To implement this, we feed the two point cloud frames into a shared 2D image-based encoder consisting of 4 Conv2D layers to produce the corresponding feature maps $\mathbf{F}_1$ and $\mathbf{F}_2$. Then we can deduce an interpolated feature map, i.e., $\frac{t_2-t_1-t}{t_2-t_1}\mathbf{F}_1+\frac{t}{t_2-t_1}\mathbf{F}_2$, which is further passed into a residual decoder with 5 Conv1D layers. Finally, we add the generated residuals to the input frame at time step $t_1$ to obtain the desired interpolation result.
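The feature blending and residual addition described above can be illustrated with a short NumPy sketch. It assumes that the $t$ appearing in the blending weights is measured as an offset from $t_1$, so that the two weights are non-negative and sum to one. The encoder and decoder are not modeled: the feature maps and the residual are placeholder arrays, and the function names are hypothetical.

```python
import numpy as np

def blend_features(F1, F2, t1, t2, t):
    """Linearly blend two feature maps; tau is the offset of t from t1 (assumption)."""
    span = t2 - t1
    tau = t - t1
    return (span - tau) / span * F1 + tau / span * F2

def add_residual(G_t1, residual):
    """Interpolated frame = input frame at t1 plus the decoded residual."""
    return G_t1 + residual

# Placeholder feature maps and frame (assumption): random arrays of small size.
rng = np.random.default_rng(0)
F1 = rng.normal(size=(8, 8, 16))
F2 = rng.normal(size=(8, 8, 16))
F_mid = blend_features(F1, F2, t1=0.0, t2=1.0, t=0.5)

G_t1 = rng.normal(size=(8, 8, 3))
G_interp = add_residual(G_t1, np.zeros_like(G_t1))  # zero residual as a stand-in
```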

IV-C 3D Point Cloud Data Compression

With the rapid development of 3D sensing technologies and the broad deployment of consumer-level devices (e.g., depth cameras and LiDAR), there is an urgent need to develop efficient 3D point cloud codecs to compress the corresponding huge data volume and effectively save transmission bandwidth and storage space. However, unlike its mature 2D counterpart, i.e., image/video compression, which has been investigated and developed for a long time, 3D point cloud compression is still a relatively young topic whose actual coding performance is far from satisfactory. Therefore, there exists an obvious contradiction between the ever-growing 3D point cloud data volume and immature compression techniques. Compared to 2D image/video compression, 3D point cloud data compression (here we consider compression of the geometry information of 3D point clouds, rather than the attributes) poses a greater challenge due to the unstructured nature of spatial and temporal domains, i.e., the points of each frame are irregularly distributed, the number of points may change from one frame to another, and there is no consistent point-to-point correspondence between successive frames.

The Moving Picture Experts Group (MPEG) introduced two point cloud compression (PCC) standards: Geometry-Based PCC (G-PCC) and Video-Based PCC (V-PCC) for compressing static and dynamic point cloud data, respectively. We refer the readers to [72, 73, 60] for detailed introductions of these methods and [74] for a survey on deep learning-based point cloud compression.

Like [75, 76], the proposed SPCV naturally bridges the gap between dynamic 3D point cloud sequences and mature 2D image/video compression techniques, paving the way for efficient and effective compression pipelines. Specifically, given a 3D point cloud sequence, we represent it as an SPCV, which is further fed into the Versatile Video Coding (VVC) Test Model, the latest H.266/VVC video compression standard [77], to achieve compression. This pipeline is fully compatible with existing infrastructures for 2D image/video compression and transmission.
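Video codecs operate on integer sample values, so an SPCV with floating-point coordinates has to be quantized before encoding. The text does not specify how this is done; the sketch below shows one plausible choice, uniform quantization with a global range and a 10-bit depth, purely as an assumption. The function names are hypothetical, and the codec invocation itself is not shown.

```python
import numpy as np

def quantize(video, bits=10):
    """Map float coordinates to integers in [0, 2^bits - 1] using a global range."""
    lo, hi = float(video.min()), float(video.max())
    scale = (2 ** bits - 1) / (hi - lo)
    q = np.round((video - lo) * scale).astype(np.uint16)
    return q, lo, scale

def dequantize(q, lo, scale):
    """Invert the mapping; the round-trip error is at most half a quantization step."""
    return q.astype(np.float64) / scale + lo

# Placeholder SPCV (assumption): T x U x V x 3 random coordinates.
rng = np.random.default_rng(0)
video = rng.normal(size=(2, 4, 4, 3))
q, lo, scale = quantize(video, bits=10)
recovered = dequantize(q, lo, scale)
max_error = np.abs(recovered - video).max()
```

The three coordinate channels can then be treated like the color channels of an ordinary video frame, which is what makes the representation compatible with existing 2D codecs.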

Figure 9: Row 1: Original static 3D point clouds. Rows 2-4: 2D representations of the structured point clouds through Flattening-Net [33], RegGeoNet [32], and our method, respectively. Rows 5-8: Surfaces reconstructed from the structured point clouds by Flattening-Net [33], RegGeoNet [32], and our method, and the original points, respectively.

V Experiments

In Section V-A, we quantitatively and qualitatively evaluated the representation capability of our framework by quantifying the spatial smoothness, temporal consistency, and geometric fidelity of SPCVs. Sections V-B to V-D evaluate the three SPCV-based point cloud analysis and processing tasks. Section V-E presents in-depth ablation studies focusing on key components within our SPCV representation framework.

TABLE I: Quantitative comparisons of geometric fidelity of the structured static point clouds by different methods in terms of CD ($\times 10^{-4}$), HD ($\times 10^{-2}$), and mNUC ($\times 10^{-3}$). For all three metrics, the smaller, the better.

| Method | Dancer CD | Dancer HD | Dancer mNUC | Chair CD | Chair HD | Chair mNUC | Church CD | Church HD | Church mNUC |
|---|---|---|---|---|---|---|---|---|---|
| Flattening-Net [33] | 1.790 | 6.56 | 2.9 | 2.881 | 8.49 | 2.9 | 1.513 | 6.49 | 3.0 |
| RegGeoNet [32] | 0.659 | 6.02 | 3.0 | 1.043 | 4.56 | 2.7 | 1.103 | 4.85 | 3.0 |
| Ours | 0.034 | 0.65 | 1.0 | 0.086 | 1.13 | 0.7 | 0.090 | 0.97 | 1.1 |

| Method | Buddha CD | Buddha HD | Buddha mNUC | Wheel CD | Wheel HD | Wheel mNUC | Erato CD | Erato HD | Erato mNUC |
|---|---|---|---|---|---|---|---|---|---|
| Flattening-Net [33] | 1.493 | 4.90 | 3.1 | 1.934 | 6.94 | 2.5 | 1.528 | 7.88 | 2.9 |
| RegGeoNet [32] | 0.895 | 4.87 | 3.2 | 1.595 | 6.26 | 2.9 | 0.858 | 5.04 | 3.0 |
| Ours | 0.064 | 1.03 | 1.4 | 0.160 | 1.42 | 1.3 | 0.063 | 0.88 | 1.2 |

V-A Evaluation of Representation Quality

Figure 17: Visual results of representing a sequence from the KITTI dataset as SPCV.
TABLE II: Quantitative results of geometric fidelity of structured point cloud sequences by our method. The values of CD ($\times 10^{-4}$), HD ($\times 10^{-2}$), and mNUC ($\times 10^{-3}$) refer to the average CD, HD, and mNUC of all frames.

| Sequence | CD | HD | mNUC |
|---|---|---|---|
| Camel | 0.024 | 0.66 | 1.3 |
| Elephant | 0.041 | 0.78 | 1.2 |
| Horse | 0.032 | 0.68 | 1.1 |
| Crane | 0.034 | 0.68 | 1.0 |
| Swing | 0.029 | 0.65 | 1.0 |
| Bouncing | 0.032 | 0.68 | 1.2 |

V-A1 Results of static 3D point clouds

We quantitatively evaluated the representation capability of our framework on static 3D point clouds by computing the spatial smoothness and geometric fidelity of the resulting 2D image representations. Under this scenario, only the first stage of our framework, i.e., frame-wise structurization, is used. Specifically, we quantify the spatial smoothness by examining the proportion of pixels within a $k\times k$ window on the resulting 2D representation that are the $(k^2-1)$-nearest neighbors of the window’s central pixel after reprojecting the pixels into 3D space. The higher the proportion, the better. We utilized CD, Hausdorff Distance (HD), and Mean Normalized Uniformity Coefficients (mNUC) [79] between structured and original point clouds to comprehensively assess the geometric fidelity. We employed six point clouds with various topological structures, as shown in the top row of Fig. 9, and each point cloud contains $10^4$ points that are uniformly distributed. We also benchmarked two recent methods for representing static point clouds as 2D images: Flattening-Net [33] and RegGeoNet [32].
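One possible reading of the spatial smoothness measure is sketched below in NumPy. Several details are assumptions made for illustration: only odd $k$ is handled, pixels whose window would cross the image border are skipped, the ratio is pooled over all window pixels, and nearest neighbors are found by brute force (quadratic memory in the number of pixels). The function name `smoothness_ratio` is hypothetical.

```python
import numpy as np

def smoothness_ratio(G, k=3):
    """Fraction of window pixels that are among the (k^2 - 1) nearest 3D
    neighbours of the window's central pixel (odd k, interior pixels only)."""
    U, V, _ = G.shape
    r = k // 2
    P = G.reshape(-1, 3)
    d2 = np.sum((P[:, None, :] - P[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)
    nn = np.argsort(d2, axis=1)[:, :k * k - 1]
    hits, total = 0, 0
    for u in range(r, U - r):
        for v in range(r, V - r):
            c = u * V + v
            uu, vv = np.meshgrid(np.arange(u - r, u + r + 1),
                                 np.arange(v - r, v + r + 1), indexing="ij")
            win = (uu * V + vv).ravel()
            win = win[win != c]                 # exclude the central pixel
            hits += int(np.isin(win, nn[c]).sum())
            total += win.size
    return hits / total

# Toy inputs (assumption): a regular planar grid and a randomly shuffled copy.
u, v = np.meshgrid(np.arange(6.0), np.arange(6.0), indexing="ij")
grid = np.stack([u, v, np.zeros_like(u)], axis=-1)
perm = np.random.default_rng(0).permutation(36)
shuffled = grid.reshape(-1, 3)[perm].reshape(6, 6, 3)
ratio_grid = smoothness_ratio(grid, k=3)
ratio_shuffled = smoothness_ratio(shuffled, k=3)
```

A perfectly regular planar grid reaches the maximum ratio, whereas shuffling the pixel order destroys the agreement between 2D adjacency and 3D proximity.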

From Fig. LABEL:reconstruction_smoothness, it can be observed that our method achieves a smoothness ratio ranging roughly from 35% to 100% for window sizes from $3\times 3$ to $12\times 12$ among various types of 3D point clouds, which is consistently higher than those of the two compared methods. Such superiority is also verified by the visual results of the resulting 2D representations shown in Fig. 9, where our results show smooth pixel distributions while the results of Flattening-Net and RegGeoNet show obvious block effects.

In terms of geometric fidelity, as compared in Table I, our method significantly outperforms the competing state-of-the-art methods, achieving about ten times smaller CD and at least two times smaller HD and mNUC. The advantages reflected in these metrics demonstrate that our representation preserves the original 3D geometry more accurately and captures the point distribution more uniformly. As shown in Fig. 9, benefiting from our geometric accuracy and distribution uniformity, the 3D surfaces reconstructed from our structured representations are closer to the ground truths, while the results produced by Flattening-Net and RegGeoNet degrade to a large extent.
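
For reference, CD and HD between two point sets can be computed as in the minimal sketch below. It assumes one common convention (CD as the sum of the two mean squared nearest-neighbor distances, HD as the symmetric maximum of nearest-neighbor distances); the exact normalization used for the tables is not specified here, and mNUC is omitted. The function names are illustrative.

```python
import numpy as np

def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between rows of a (N, 3) and b (M, 3)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def chamfer_distance(a, b):
    # Assumed convention: mean squared nearest-neighbour distance, both directions.
    d = pairwise_sq_dists(a, b)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def hausdorff_distance(a, b):
    # Symmetric Hausdorff distance: worst-case nearest-neighbour distance.
    d = np.sqrt(pairwise_sq_dists(a, b))
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```

The dense distance matrix needs O(NM) memory, so for frames with $10^{4}$ points or more a chunked computation or a spatial index would be preferable.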

Overall, when evaluating the representation quality, the metrics of both spatial smoothness and geometric fidelity should be jointly taken into account. Although the spatial smoothness ratios of Flattening-Net and RegGeoNet can be relatively close to ours in some cases, e.g., Figs. LABEL:reconstruction_smoothness (b), (e), and (f), this does not mean that their representations achieve satisfactory quality, because their point distributions are non-uniform, as also verified by their inferior surface reconstruction results shown in Fig. 9.

V-A2 Results of dynamic 3D point cloud sequences

We quantitatively evaluated the representation ability of our method on dynamic 3D point cloud sequences by computing the mean spatial smoothness and geometric fidelity over all frames of the resulting SPCV representations. Additionally, we defined the correspondence ratio to quantify the temporal consistency of SPCVs, i.e., for each pair of adjacent frames of the SPCV, we calculated the corresponding proportion within the $k$-nearest neighborhood of each point and proposed an algorithm to calculate the correspondence ratio as a percentage, with higher values indicating better performance. This metric takes into account both reconstruction and correspondence accuracy. We refer readers to the Supplementary Material for detailed descriptions of the calculation process.

Here, we employed six 3D point cloud sequences, including three human motion sequences [80] and three animal motion sequences [81]. For the human sequences, we utilized 46 frames from the swing, 31 frames from the bouncing and crane, while for the animal sequences, we employed 10 frames from the horse, 48 frames from the camel and elephant. Each point cloud frame of the sequences contains 10K points.

Our method showcases superior representation quality, as evidenced in Fig. LABEL:quant_temporal_consistency_smooth_regular and Table II. Specifically, as observed in Fig. LABEL:quant_temporal_consistency_smooth_regular, our representation maintains over 60% temporal consistency and a mean spatial smoothness ratio of over 35%, indicating its capability to handle various types of sequences. Additionally, Table II reveals that our method consistently achieves robust geometric fidelity across different sequence types, as indicated by stable CD, HD, and mNUC metrics. Besides, Fig. LABEL:hum_anim_seq_vis_ours visualizes the SPCV representations on the human and animal sequences, further verifying the effectiveness of the proposed framework in terms of spatial smoothness, temporal consistency, and geometric fidelity. As shown in Fig. 17, we cropped a patch from the first ten frames of KITTI and generated the SPCV for better illustration. We selected 3 frames to visualize a typical situation where a tree appears in the first two frames but vanishes in the last frame. The tree is mapped to a similar pixel area, as shown in the red box on the 2D grid. In the last frame, these pixel positions are replaced by the coordinates of the ground. These results demonstrate the robustness and versatility of the SPCV representation even for scenarios involving scanned LiDAR point cloud sequences.

Moreover, we also compared the temporal consistency learned by our framework with CorrNet3D [58], a representative unsupervised method for building dense correspondence between two point clouds, following its evaluation setups. As shown in Fig. LABEL:quant_temporal_consistency_diff_methods, our method outperforms CorrNet3D, achieving over 95% correspondence ratios for varying neighborhood sizes (i.e., values of $K$). The visual results in Fig. LABEL:correspondence_results_representation demonstrate that our representation maintains more consistent patterns across SPCV frames and ensures more accurate correspondences between adjacent frames.

V-B Results of Action Recognition

TABLE III: Quantitative comparisons of recognition accuracy (%), GPU memory cost (MB), and forward pass time (seconds) on DeformingThings4D [82].
Methods Accuracy Memory Time
P4Transformer [45] 80.95 4020 0.119
PSTNet2 [42] 60.90 10816 0.216
PSTTransformer [12] 76.19 2044 0.010
Ours 90.48 2026 0.006

We compared our SPCV-based action recognition framework with existing state-of-the-art 3D sequence processing methods [45, 42, 12] on the challenging DeformingThings4D dataset [82], which consists of 1,972 animated 3D point cloud sequences across 31 categories of humanoids and animals with large non-rigid deformations. Each frame of the point cloud sequences contains 2,048 points. We segmented the whole sequence into multiple fixed-frame clips, which are fed into all competing methods. To adapt our method, we converted all 3D point cloud sequences to SPCVs with the height and width of each frame equal to 64 and 32, respectively. Following [45], during inference, the prediction result of the whole sequence is determined by averaging the probabilities of all clips.
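
The clip-based inference protocol can be illustrated with the following minimal sketch. The helper names, the clip length, and the choice to drop trailing frames that do not fill a clip are assumptions made for illustration; the classifier producing the per-clip logits is not shown.

```python
import numpy as np

def split_into_clips(spcv, clip_len):
    """Split an SPCV of shape (T, H, W, 3) into non-overlapping clips.

    Trailing frames that do not fill a whole clip are dropped (an assumption).
    """
    n = spcv.shape[0] // clip_len
    return spcv[: n * clip_len].reshape(n, clip_len, *spcv.shape[1:])

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_sequence(clip_logits):
    """Average per-clip class probabilities and return the winning class index."""
    probs = softmax(np.asarray(clip_logits, dtype=float))
    return int(probs.mean(axis=0).argmax())
```

A $64\times 32$ frame holds exactly 2,048 pixels, so each SPCV frame stores one 3D coordinate per input point.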

As compared in Table III, our method outperforms existing state-of-the-art methods with the highest recognition accuracy, which reveals the enhanced feature extraction capability of our SPCV-based learning network. Moreover, our framework turns out to be much more efficient in terms of the GPU memory cost and inference speed, benefiting from the structured nature of SPCVs that can circumvent many expensive operations of indexing, sampling, and grouping.

V-C Results of Temporal Interpolation

TABLE IV: Quantitative comparison of temporal interpolation by different methods on the DHB dataset [78]. The values of CD and EMD are scaled by $\times 10^{3}$.
Methods EMD (Swing) CD (Swing) EMD (Longdress) CD (Longdress) Memory (MB) Time (seconds)
PointINet [38] 15.03 1.70 10.09 0.95 6488 0.143
PSTNet2 [42] 27.12 2.47 39.60 3.43 3009 0.142
P4Transformer [45] 38.35 3.24 105.68 8.88 3008 0.089
PST-Transformer [12] 32.43 3.02 116.26 13.83 3006 0.058
IDEA-Net [78] 7.07 1.24 5.92 0.88 11428 0.106
Ours 2.74 0.86 2.16 0.85 2708 0.014

Following the experimental protocols in [78], we performed temporal interpolation on the DHB dataset and made comparisons with existing state-of-the-art 3D point cloud sequence processing approaches [38, 42, 45, 78, 12], where we adopted CD and EMD as quantitative metrics.

In our implementation, we benchmarked our model against the original learning framework of [78], along with the point-level task models outlined in [42, 45, 12]. Similarly to [78], we utilized EMD as the training loss function for the compared methods. All the competing methods share the same training and testing data as prepared in [78]. As depicted in Fig. 8, we constructed an image-based learning framework to fully leverage the capabilities of our regular grid representation for temporal interpolation. Thanks to the structured nature of our SPCV representation, we can directly utilize the pixel-wise Frobenius norm as the loss function.
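
Because corresponding pixels of a predicted and a ground-truth SPCV frame refer to the same point, such a loss needs no nearest-neighbor search or optimal assignment. A minimal sketch, assuming frames stored as (H, W, 3) arrays and the Frobenius norm taken over the whole frame (the exact reduction and any normalization are assumptions; the function name is illustrative):

```python
import numpy as np

def frobenius_loss(pred, target):
    """Frobenius norm of the pixel-wise difference between two SPCV frames.

    pred, target: (H, W, 3) arrays; pixel (i, j) in both is assumed to
    describe the same surface point.
    """
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))
```

In contrast, EMD requires solving an assignment problem between unordered point sets at every training step.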

As shown in Table IV, our approach demonstrates significant improvements over the existing state-of-the-art methods, as indicated by its lower CD and EMD metrics. Besides, our learning framework requires less GPU memory and less forward pass time. As visualized in Fig. LABEL:intpl_swing, our method distinctly outperforms the others in preserving geometric fidelity within the interpolated frames. Compared with previous methods, our SPCV representation has explicit temporal consistency, which greatly reduces the difficulty of frame interpolation for the model, thereby achieving a lower geometric error. Thanks to the regular structure of SPCVs, our method can directly use sliding window convolution operations without the need for specially designed grouping and feature aggregation methods, thus improving the processing speed. Moreover, when extracting multi-scale features, due to the regularity of each frame, we do not need to apply FPS or other sampling methods beforehand to extract key points, which increases the speed and reduces the GPU memory usage.
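
To illustrate why no explicit key-point sampling is needed, the sketch below builds a coarse-to-fine pyramid from a single SPCV frame by strided slicing of the pixel grid. This is a hedged illustration of the general idea rather than the network's actual multi-scale design; `frame_pyramid` and the chosen strides are hypothetical.

```python
import numpy as np

def frame_pyramid(frame, strides=(1, 2, 4)):
    """Return progressively coarser versions of an (H, W, 3) SPCV frame.

    Each level keeps every s-th pixel along both axes, which selects a
    regular subset of the points without any sampling algorithm.
    """
    return [frame[::s, ::s] for s in strides]
```

For a $64\times 32$ frame this yields levels with 2,048, 512, and 128 points, each a subset of the previous level.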

Figure 20: Comparison of different methods for compressing static point clouds. Here, the results refer to a typical frame of the sequence Exercise from the MPEG dataset [83].
Figure 21: Visual comparisons of reconstructed surfaces from 3D point clouds compressed by various methods with the same compression ratio. (a) Original point clouds, (b) Flattening-Net [33], (c) RegGeoNet [32], (d) G-PCC, and (e) Our approach. The results refer to the shapes from the MPEG dataset [83].
Figure 22: Comparison of our method and V-PCC for compressing point cloud sequences. Here, the results refer to a 5-frame sequence of mixed shapes from the MPEG dataset [83].

V-D Results of Compression

We adopted the popular MPEG [83] and MITAMA [84] datasets and benchmarked our SPCV-based point cloud compression framework for compressing both static and dynamic point clouds. For experiments on static point cloud compression where each point cloud uniformly contains 10K points, we made comparisons with the latest version of G-PCC (TMC13 v23.0-rc2). Besides, we also included Flattening-Net [33] and RegGeoNet [32] for comparison by feeding their generated 2D representations into the H.266/VVC video encoder. For experiments on dynamic point cloud compression where each sequence consists of 5 frames and each frame contains 250K points, we made comparisons with the latest version of V-PCC (TMC2 Release 18.0).

The quantitative results depicted in Fig. 20 indicate that our SPCV-based compression framework outperforms both G-PCC and recent point cloud structurization methods. This is substantiated by achieving lower CD and Point-to-Face (P2F) metrics at the same compression ratio. The enhanced performance can be attributed to the structured nature of the SPCV representation that is conducive to the utilization of the advanced video compression techniques, and the intrinsic spatial smoothness of SPCV that further boosts the compression performance. Visual comparisons in Fig. 21 show that our compression method more accurately preserves the geometric fidelity of the surfaces reconstructed from the decoded 3D point clouds.

For point cloud sequence compression, the results in Fig. 22 demonstrate that our SPCV-based compression framework significantly outperforms V-PCC in compression efficiency, as evidenced by much lower CD and P2F metrics at the same compression ratio (or much smaller compressed file size at the same CD and P2F metrics). The SPCV’s temporal consistency enhances its inter-frame compression capabilities, optimizing the overall performance. Visual comparisons in Fig. LABEL:vpcc_vis demonstrate that our SPCV more faithfully preserves the geometric fidelity in the reconstructed surfaces.

V-E Ablation Studies

We conducted ablation studies to facilitate analyzing and understanding the proposed learning framework for structuring an arbitrary 3D point cloud sequence into a 2D video.

Variants CD ($\times 10^{-4}$) HD ($\times 10^{-2}$) mNUC ($\times 10^{-3}$)
(a) 0.043 0.70 1.1
(b) 0.037 0.70 1.2
(c) 0.029 0.65 1.0
(d) 0.059 0.74 1.3
(e) 0.258 1.76 2.3
(f) 0.029 0.65 1.0

Figure 25: Ablative analyses of spatial smoothness (top left), temporal consistency (top right), and geometric fidelity (bottom) for variants of our method. (a) without sequential structuring, (b) without normal constraints, (c) without spatial constraints, (d) with EMD, (e) with CD, and (f) full model.

V-E1 Effectiveness of sequence-wise structurization

In the framework depicted in Fig. 5, we introduced a sequence-wise stage representing the remaining point cloud frames. To validate the importance of this stage in maintaining temporal consistency, we removed the sequence-wise processing stage and instead processed each frame independently through the frame-wise stage to obtain the SPCV. As evident from Fig. 25 (a), this setting ensures a considerable level of intra-frame smoothness but fails to maintain temporal consistency across the entire frame sequence.

V-E2 Effectiveness of geometric regularization in Eqs. (3) and (5)

We explored the effects of the geometric regularization terms used in the SPCV framework. As shown in Figs. 25 (b) and (c), the overall performance degrades when removing each individual constraint. Specifically, the removal of the spatial regularization, i.e., $\lambda_{\rm s}=0$, impacts the model’s ability to maintain spatial smoothness within individual frames. Besides, the absence of normal regularization, i.e., $\lambda_{\rm n}=0$, leads to a decline in both the spatial smoothness and temporal consistency, highlighting its critical role in preserving the geometric integrity of the model. Collectively, these findings underscore the importance of geometric constraints in the overall efficacy and quality of the SPCV framework.

V-E3 Impact of distance metrics

We evaluated the impact of alternative distance metrics (EMD and CD) on our framework’s performance in comparison to the standard setting. As shown in Figs. 25 (d) and (e), neither EMD nor CD matches the efficacy of our original configuration. EMD prioritizes global shape alignment over maintaining point proximity during point cloud reconstruction; however, searching for such a global alignment cannot guarantee spatial smoothness and temporal consistency, even with the corresponding constraints. On the other hand, although CD is more effective than EMD in preserving smoothness, the shapes reconstructed under its alignment are far less accurate than those obtained with EMD or our selected distance metric.

VI Conclusion

This paper presents a novel generic representation for structuring dynamic 3D point cloud sequences as 2D videos. We constructed a self-supervised learning framework together with geometrically meaningful constraints for achieving spatial smoothness and temporal consistency. The generated SPCVs show satisfactory representation quality. To demonstrate the practical value, we conducted a variety of downstream tasks, including action recognition, temporal interpolation, and compression, where our SPCV-based learning frameworks outperform existing state-of-the-art approaches. Through comprehensive experimental evaluations, we demonstrated the universality and potential of the proposed SPCV representations. We believe that this study opens up many new possibilities for research on dynamic point cloud processing and learning.

References

  • [1] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox, “Flownet: Learning optical flow with convolutional networks,” in Int. Conf. Comput. Vis., 2015, pp. 2758–2766.
  • [2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 770–778.
  • [3] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “Flownet 2.0: Evolution of optical flow estimation with deep networks,” in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 2462–2470.
  • [4] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” Adv. Neural Inform. Process. Syst., vol. 27, 2014.
  • [5] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 652–660.
  • [6] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” in Adv. Neural Inform. Process. Syst., 2017, pp. 5099–5108.
  • [7] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, “Dynamic graph cnn for learning on point clouds,” ACM Trans. Graph., vol. 38, no. 5, pp. 1–12, 2019.
  • [8] H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V. Koltun, “Point transformer,” in Int. Conf. Comput. Vis., 2021, pp. 16 259–16 268.
  • [9] G. Qian, Y. Li, H. Peng, J. Mai, H. A. A. K. Hammoud, M. Elhoseiny, and B. Ghanem, “Pointnext: Revisiting pointnet++ with improved training and scaling strategies,” in Proc. NeurIPS, 2022.
  • [10] H. Lin, X. Zheng, L. Li, F. Chao, S. Wang, Y. Wang, Y. Tian, and R. Ji, “Meta architecture for point cloud analysis,” in Proc. CVPR, 2023, pp. 17 682–17 691.
  • [11] X. Liu, M. Yan, and J. Bohg, “Meteornet: Deep learning on dynamic 3d point cloud sequences,” in Int. Conf. Comput. Vis., 2019, pp. 9246–9255.
  • [12] H. Fan, Y. Yang, and M. Kankanhalli, “Point spatio-temporal transformer networks for point cloud video modeling,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 2, pp. 2181–2192, 2022.
  • [13] H. Fan, H. Su, and L. J. Guibas, “A point set generation network for 3d object reconstruction from a single image,” in Proc. CVPR, 2017, pp. 605–613.
  • [14] J. Li, B. M. Chen, and G. H. Lee, “So-net: Self-organizing network for point cloud analysis,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 9397–9406.
  • [15] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen, “Pointcnn: Convolution on χ𝜒\chiitalic_χ-transformed points,” in Adv. Neural Inform. Process. Syst., 2018, pp. 828–838.
  • [16] H. Thomas, C. R. Qi, J.-E. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas, “Kpconv: Flexible and deformable convolution for point clouds,” in Int. Conf. Comput. Vis., 2019, pp. 6411–6420.
  • [17] Y. Liu, B. Fan, S. Xiang, and C. Pan, “Relation-shape convolutional neural network for point cloud analysis,” in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 8895–8904.
  • [18] X. Yan, C. Zheng, Z. Li, S. Wang, and S. Cui, “Pointasnl: Robust point clouds processing using nonlocal neural networks with adaptive sampling,” in IEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 5589–5598.
  • [19] M. Xu, R. Ding, H. Zhao, and X. Qi, “Paconv: Position adaptive convolution with dynamic kernel assembling on point clouds,” in IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 3173–3182.
  • [20] T. Xiang, C. Zhang, Y. Song, J. Yu, and W. Cai, “Walk in the cloud: Learning curves for point clouds shape analysis,” in Int. Conf. Comput. Vis., 2021, pp. 915–924.
  • [21] N. Verma, E. Boyer, and J. Verbeek, “Feastnet: Feature-steered graph convolutions for 3d shape analysis,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 2598–2606.
  • [22] Q. Xu, X. Sun, C.-Y. Wu, P. Wang, and U. Neumann, “Grid-gcn for fast and scalable point cloud learning,” in IEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 5661–5670.
  • [23] M.-H. Guo, J.-X. Cai, Z.-N. Liu, T.-J. Mu, R. R. Martin, and S.-M. Hu, “Pct: Point cloud transformer,” Computational Visual Media, vol. 7, no. 2, pp. 187–199, 2021.
  • [24] C. Park, Y. Jeong, M. Cho, and J. Park, “Fast Point Transformer,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 16 949–16 958.
  • [25] X. Yu, L. Tang, Y. Rao, T. Huang, J. Zhou, and J. Lu, “Point-bert: Pre-training 3d point cloud transformers with masked point modeling,” in Proc. CVPR, 2022, pp. 19 313–19 322.
  • [26] X. Ma, C. Qin, H. You, H. Ran, and Y. Fu, “Rethinking network design and local geometry in point cloud: A simple residual MLP framework,” in Proc. ICLR, 2022.
  • [27] X. Gu, S. J. Gortler, and H. Hoppe, “Geometry images,” in Proceedings of the 29th annual conference on Computer graphics and interactive techniques, 2002, pp. 355–361.
  • [28] A. Sinha, J. Bai, and K. Ramani, “Deep learning 3d shape surfaces using geometry images,” in Eur. Conf. Comput. Vis., 2016, pp. 223–240.
  • [29] A. Sinha, A. Unmesh, Q. Huang, and K. Ramani, “Surfnet: Generating 3d shape surfaces using deep residual networks,” in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 6040–6049.
  • [30] H. Maron, M. Galun, N. Aigerman, M. Trope, N. Dym, E. Yumer, V. G. Kim, and Y. Lipman, “Convolutional neural networks on surfaces via seamless toric covers,” ACM Trans. Graph., vol. 36, no. 4, pp. 71–1, 2017.
  • [31] N. Haim, N. Segol, H. Ben-Hamu, H. Maron, and Y. Lipman, “Surface networks via general covers,” in Int. Conf. Comput. Vis., 2019, pp. 632–641.
  • [32] Q. Zhang, J. Hou, Y. Qian, A. B. Chan, J. Zhang, and Y. He, “Reggeonet: Learning regular representations for large-scale 3d point clouds,” Int. J. Comput. Vis., vol. 130, no. 12, pp. 3100–3122, 2022.
  • [33] Q. Zhang, J. Hou, Y. Qian, Y. Zeng, J. Zhang, and Y. He, “Flattening-net: Deep regular 2d representation for 3d point cloud analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 8, pp. 9726–9742, 2023.
  • [34] Y. Yang, C. Feng, Y. Shen, and D. Tian, “Foldingnet: Point cloud auto-encoder via deep grid deformation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 206–215.
  • [35] T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry, “A papier-mâché approach to learning 3d surface generation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 216–224.
  • [36] J. Pang, D. Li, and D. Tian, “Tearingnet: Point cloud autoencoder to learn topology-friendly representations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7453–7462.
  • [37] Y. Zhang, H. Fan, Y. Yang, and M. Kankanhalli, “Dpmix: Mixture of depth and point cloud video experts for 4d action segmentation,” arXiv preprint arXiv:2307.16803, 2023.
  • [38] F. Lu, G. Chen, S. Qu, Z. Li, Y. Liu, and A. Knoll, “Pointinet: Point cloud frame interpolation network,” in AAAI, 2021.
  • [39] Y. Zeng, Y. Qian, Q. Zhang, J. Hou, Y. Yuan, and Y. He, “Idea-net: Dynamic 3d point cloud interpolation via deep embedding alignment,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 6338–6347.
  • [40] Z. Zheng, D. Wu, R. Lu, F. Lu, G. Chen, and C. Jiang, “Neuralpci: Spatio-temporal neural field for 3d point cloud multi-frame non-linear interpolation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 909–918.
  • [41] H. Fan, X. Yu, Y. Ding, Y. Yang, and M. S. Kankanhalli, “Pstnet: Point spatio-temporal convolution on point cloud sequences,” in Int. Conf. Learn. Represent., 2021.
  • [42] H. Fan, X. Yu, Y. Yang, and M. Kankanhalli, “Deep hierarchical representation of point cloud videos via spatio-temporal decomposition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 12, pp. 9918–9930, 2021.
  • [43] Z. Shen, X. Sheng, H. Fan, L. Wang, Y. Guo, Q. Liu, H. Wen, and X. Zhou, “Masked spatio-temporal structure prediction for self-supervised learning on point cloud videos,” in Int. Conf. Comput. Vis., 2023, pp. 16 580–16 589.
  • [44] C. Feichtenhofer, Y. Li, K. He et al., “Masked autoencoders as spatiotemporal learners,” Adv. Neural Inform. Process. Syst., vol. 35, pp. 35 946–35 958, 2022.
  • [45] H. Fan, Y. Yang, and M. S. Kankanhalli, “Point 4d transformer networks for spatio-temporal modeling in point cloud videos,” in IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 14 204–14 213.
  • [46] Y. Wei, H. Liu, T. Xie, Q. Ke, and Y. Guo, “Spatial-temporal transformer for 3d point cloud sequences,” in IEEE Winter Conf. on App. of Comput. Vis., 2022, pp. 1171–1180.
  • [47] M. Niemeyer, L. Mescheder, M. Oechsle, and A. Geiger, “Occupancy flow: 4d reconstruction by learning particle dynamics,” in Int. Conf. Comput. Vis., 2019, pp. 5379–5389.
  • [48] D. Rempe, T. Birdal, Y. Zhao, Z. Gojcic, S. Sridhar, and L. J. Guibas, “Caspr: Learning canonical spatiotemporal point cloud representations,” in Adv. Neural Inform. Process. Syst., 2020, pp. 13 688–13 701.
  • [49] J. Lei and K. Daniilidis, “Cadex: Learning canonical deformation coordinate space for dynamic surface representation via neural homeomorphism,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 6624–6634.
  • [50] Y. Wei, Z. Wang, Y. Rao, J. Lu, and J. Zhou, “Pv-raft: Point-voxel correlation fields for scene flow estimation of point clouds,” in IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 6954–6963.
  • [51] X. Li, J. Kaesemodel Pontes, and S. Lucey, “Neural scene flow prior,” Adv. Neural Inform. Process. Syst., vol. 34, pp. 7838–7851, 2021.
  • [52] P. He, P. Emami, S. Ranka, and A. Rangarajan, “Learning scene dynamics from point cloud sequences,” Int. J. Comput. Vis., vol. 130, no. 3, pp. 669–695, 2022.
  • [53] Y. Wang, Y. Xiao, F. Xiong, W. Jiang, Z. Cao, J. T. Zhou, and J. Yuan, “3dv: 3d dynamic voxel for action recognition in depth video,” in IEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 511–520.
  • [54] Z. Li, T. Li, and A. B. Farimani, “Tpu-gan: Learning temporal coherence from dynamic point cloud sequences,” in Int. Conf. Learn. Represent., 2021.
  • [55] J.-X. Zhong, K. Zhou, Q. Hu, B. Wang, N. Trigoni, and A. Markham, “No pain, big gain: classify dynamic point cloud sequences with static models by fitting feature-level space-time surfaces,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 8510–8520.
  • [56] X. Sheng, Z. Shen, G. Xiao, L. Wang, Y. Guo, and H. Fan, “Point contrastive prediction with semantic clustering for self-supervised learning on point cloud videos,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 16 515–16 524.
  • [57] M. Eisenberger, D. Novotny, G. Kerchenbaum, P. Labatut, N. Neverova, D. Cremers, and A. Vedaldi, “Neuromorph: Unsupervised shape interpolation and correspondence in one go,” in IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 7473–7483.
  • [58] Y. Zeng, Y. Qian, Z. Zhu, J. Hou, H. Yuan, and Y. He, “Corrnet3d: Unsupervised end-to-end learning of dense correspondence for 3d point clouds,” in IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 6052–6061.
  • [59] M. Quach, G. Valenzise, and F. Dufaux, “Folding-based compression of point cloud attributes,” in IEEE Int. Conf. Image Process., 2020, pp. 3309–3313.
  • [60] D. Graziosi, O. Nakagami, S. Kuma, A. Zaghetto, T. Suzuki, and A. Tabatabai, “An overview of ongoing point cloud compression standardization activities: Video-based (v-pcc) and geometry-based (g-pcc),” APSIPA Transactions on Signal and Information Processing, vol. 9, p. e13, 2020.
  • [61] T. Nguyen, Q.-H. Pham, T. Le, T. Pham, N. Ho, and B.-S. Hua, “Point-set distances for learning representations of 3d point clouds,” in Int. Conf. Comput. Vis., 2021, pp. 10 478–10 487.
  • [62] A. Sheffer, E. Praun, and K. Rose, “Mesh parameterization methods and their applications,” Found. Trends Comput. Graph. Vis., vol. 2, no. 2, pp. 105–171, 2006.
  • [63] N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, and A. Courville, “On the spectral bias of neural networks,” in Int. Conf. on ML., 2019, pp. 5301–5310.
  • [64] S. Ren and J. Hou, “Unleash the potential of 3d point cloud modeling with a calibrated local geometry-driven distance metric,” arXiv preprint arXiv:2306.00552, 2023.
  • [65] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [66] H. M. Briceno, P. V. Sander, L. McMillan, S. Gortler, and H. Hoppe, “Geometry videos,” in Eurographics/SIGGRAPH symposium on computer animation (SCA), 2003.
  • [67] J. Xia, Y. He, D. P. Quynh, X. Chen, and S. C. Hoi, “Modeling 3d facial expressions using geometry videos,” in Proceedings of the 18th ACM international conference on Multimedia, 2010, pp. 591–600.
  • [68] D. T. Quynh, Y. He, X. Chen, J. Xia, Q. Sun, and S. C. Hoi, “Modeling 3d articulated motions with conformal geometry videos (cgvs),” in Proceedings of the 19th ACM international conference on Multimedia, 2011, pp. 383–392.
  • [69] H. Fan, X. Yu, Y. Ding, Y. Yang, and M. Kankanhalli, “Pstnet: Point spatio-temporal convolution on point cloud sequences,” in Int. Conf. Learn. Represent., 2021.
  • [70] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 6450–6459.
  • [71] T. Kalluri, D. Pathak, M. Chandraker, and D. Tran, “Flavr: Flow-agnostic video representations for fast frame interpolation,” in IEEE Winter Conf. on App. of Comput. Vis., 2023, pp. 2071–2082.
  • [72] S. Schwarz, M. Preda, V. Baroncini, M. Budagavi, P. Cesar, P. A. Chou, R. A. Cohen, M. Krivokuća, S. Lasserre, Z. Li et al., “Emerging mpeg standards for point cloud compression,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 1, pp. 133–148, 2018.
  • [73] H. Liu, H. Yuan, Q. Liu, J. Hou, and J. Liu, “A comprehensive study and comparison of core technologies for mpeg 3-d point cloud compression,” IEEE Transactions on Broadcasting, vol. 66, no. 3, pp. 701–717, 2019.
  • [74] M. Quach, J. Pang, D. Tian, G. Valenzise, and F. Dufaux, “Survey on deep learning-based point cloud compression,” Frontiers in Signal Processing, vol. 2, p. 846972, 2022.
  • [75] J. Hou, L.-P. Chau, N. Magnenat-Thalmann, and Y. He, “Compressing 3-d human motions via keyframe-based geometry videos,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 1, pp. 51–62, 2014.
  • [76] J. Hou, L.-P. Chau, M. Zhang, N. Magnenat-Thalmann, and Y. He, “A highly efficient compression framework for time-varying 3-d facial expressions,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 9, pp. 1541–1553, 2014.
  • [77] International Telecommunication Union, “H.266: Versatile video coding,” 2021, archived from the original on 21 June 2021. Retrieved on 21 June 2021. [Online]. Available: https://www.itu.int/rec/T-REC-H.266
  • [78] Y. Zeng, Y. Qian, Q. Zhang, J. Hou, Y. Yuan, and Y. He, “Idea-net: Dynamic 3d point cloud interpolation via deep embedding alignment,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 6338–6347.
  • [79] L. Yu, X. Li, C.-W. Fu, D. Cohen-Or, and P.-A. Heng, “Pu-net: Point cloud upsampling network,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 2790–2799.
  • [80] D. Vlasic, I. Baran, W. Matusik, and J. Popovic, “Articulated mesh animation from multi-view silhouettes (data link),” https://people.csail.mit.edu/drdaniel/mesh_animation/#data, 2023, accessed: 2023-11-2.
  • [81] R. W. Sumner and J. Popović, “Deformation transfer for triangle meshes,” ACM Trans. Graph., vol. 23, no. 3, pp. 399–405, 2004.
  • [82] Y. Li, H. Takehara, T. Taketomi, B. Zheng, and M. Nießner, “4dcomplete: Non-rigid motion estimation beyond the observable surface,” in Int. Conf. Comput. Vis., 2021, pp. 12 706–12 716.
  • [83] Y. Xu, Y. Lu, and Z. Wen, “Owlii Dynamic human mesh sequence dataset,” ISO/IEC JTC1/SC29/WG11 m41658, 120th MPEG Meeting, Macau, Oct 2017.
  • [84] D. Vlasic, I. Baran, W. Matusik, and J. Popović, “Articulated mesh animation from multi-view silhouettes,” in SIGGRAPH, 2008.
Yiming Zeng received his B.S. degree in Automation from the South China University of Technology, Guangzhou, China, in 2019, and his Ph.D. degree in Computer Science from the City University of Hong Kong in 2024. His research interests include deep learning processing and the development of geometrically meaningful representations for both static and dynamic 3D point clouds.
Junhui Hou (Senior Member, IEEE) is an Associate Professor with the Department of Computer Science, City University of Hong Kong. He holds a B.Eng. degree in information engineering (Talented Students Program) from the South China University of Technology, Guangzhou, China (2009), an M.Eng. degree in signal and information processing from Northwestern Polytechnical University, Xi’an, China (2012), and a Ph.D. degree from the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore (2016). His research interests lie in multi-dimensional visual computing. Dr. Hou received the Early Career Award (3/381) from the Hong Kong Research Grants Council in 2018. He is an elected member of IEEE MSATC, VSPC-TC, and MMSP-TC. He has served or is serving as an Associate Editor for IEEE Transactions on Visualization and Computer Graphics, IEEE Transactions on Image Processing, IEEE Transactions on Circuits and Systems for Video Technology, Signal Processing: Image Communication, and The Visual Computer.
Qijian Zhang received the B.S. degree in Electronic Information Science and Technology from Beijing Normal University, Beijing, China, in 2019. He is currently a Ph.D. student (2020-present) in the Department of Computer Science, City University of Hong Kong, Hong Kong SAR. His research interests include geometry processing, 3D computer vision, computer graphics, and cross-modal learning.
Siyu Ren received the B.S. degree in Optoelectronic Information Science and Engineering from Tianjin University, Tianjin, China, in 2018. He is currently pursuing the Ph.D. degree in Computer Science at the City University of Hong Kong and in Optical Engineering at Tianjin University. His research interests include deep learning and 3D point cloud processing.
Wenping Wang (Fellow, IEEE) received the Ph.D. degree in computer science from the University of Alberta in 1992. He is a Professor of computer science at Texas A&M University. His research interests include computer graphics, computer visualization, computer vision, robotics, medical image processing, and geometric computing, and he has published over 180 journal papers in these fields. He is a journal associate editor of Computer Aided Geometric Design (CAGD), Computer Graphics Forum (CGF), and IEEE Transactions on Visualization and Computer Graphics. He has chaired a number of international conferences, including Pacific Graphics 2012, ACM Symposium on Physical and Solid Modeling (SPM) 2013, SIGGRAPH Asia 2013, and Geometry Summit 2019. Prof. Wang received the John Gregory Memorial Award for his contributions to geometric modeling. He is an ACM Fellow.