Dynamic 3D Point Cloud Sequences as 2D Videos
Abstract
Dynamic 3D point cloud sequences serve as one of the most common and practical representation modalities of dynamic real-world environments. However, their unstructured nature in both spatial and temporal domains poses significant challenges to effective and efficient processing. Existing deep point cloud sequence modeling approaches imitate the mature 2D video learning mechanisms by developing complex spatio-temporal point neighbor grouping and feature aggregation schemes, often resulting in methods lacking effectiveness, efficiency, and expressive power. In this paper, we propose a novel generic representation called Structured Point Cloud Videos (SPCVs). Intuitively, by leveraging the fact that 3D geometric shapes are essentially 2D manifolds, SPCV re-organizes a point cloud sequence as a 2D video with spatial smoothness and temporal consistency, where the pixel values correspond to the 3D coordinates of points. The structured nature of our SPCV representation allows for the seamless adaptation of well-established 2D image/video techniques, enabling efficient and effective processing and analysis of 3D point cloud sequences. To achieve such re-organization, we design a self-supervised learning pipeline that is geometrically regularized and driven by self-reconstructive and deformation field learning objectives. Additionally, we construct SPCV-based frameworks for both low-level and high-level 3D point cloud sequence processing and analysis tasks, including action recognition, temporal interpolation, and compression. Extensive experiments demonstrate the versatility and superiority of the proposed SPCV, which has the potential to offer new possibilities for deep learning on unstructured 3D point cloud sequences. Code will be released at https://github.com/ZENGYIMING-EAMON/SPCV.
Index Terms:
3D point cloud sequence, feature representation, geometric modeling, self-supervised learning, correspondence, compression, efficiency.
I Introduction
A dynamic 3D point cloud sequence comprises multiple frames of static 3D point clouds captured at consecutive time steps, providing a depiction of geometric changes in objects/scenes. This type of data finds extensive applications in areas such as autonomous driving, robotics navigation, virtual/augmented reality, and immersive telecommunication. As the demand for 3D data processing continues to grow, fueled by the remarkable success of deep learning in handling 2D images/videos, there is an urgent need to develop effective and efficient learning methods for processing dynamic 3D point cloud sequences.
However, deep modeling of 3D point clouds, characterized by irregularity and lack of structure, is not as straightforward as applying standard 2D convolutional operators [1, 2, 3, 4] on well-structured 2D image/video signals defined on regular 2D grids. In recent years, researchers have devoted significant efforts to exploring numerous deep set architectures [5, 6, 7, 8, 9, 10] that operate on static 3D point cloud inputs, by developing various complicated learning mechanisms and highly specialized feature extraction operators. The processing and modeling challenges become even more pronounced when transitioning from a static point cloud to a dynamic sequence of consecutive point cloud frames. This is due to the irregular spatial structure of each frame and the lack of temporal correspondence across frames. Existing dynamic 3D point cloud sequence learning networks [11, 12] typically rely on extensive spatio-temporal point neighbor grouping and multi-scale feature aggregation operations, which are less expressive, memory-consuming, and computationally inefficient, particularly with an increasing number of input points. The substantial computational complexity of these time-consuming operations hinders the development of deeper or more sophisticated architectural models for 3D point cloud sequences, thereby limiting performance on downstream tasks. In addition to the issues of network structures, in various reconstruction/generation scenarios, the inefficiency of commonly-used loss functions, e.g., Chamfer Distance (CD) and Earth Mover’s Distance (EMD) [13], can also significantly degrade the overall learning effects. This stands in stark contrast to the 2D image/video domain, where pixel-wise errors can be directly computed at the same 2D grid coordinates.
In this paper, we address the aforementioned challenges from a novel and more fundamental perspective by structuring 3D point cloud sequences. By recognizing that 3D geometric shapes are essentially 2D manifolds, we propose to structure dynamic 3D point cloud sequences with a 2D video-like representation, namely structured point cloud video (SPCV). As depicted in Fig. 2, the proposed SPCV representation exhibits the following key features: (i) the pixel values of an SPCV correspond to the coordinates of 3D points; (ii) the adjacent pixels within a frame generally represent neighboring points in 3D space; (iii) the pixels with identical locations across different frames of an SPCV correspond to the consistent position of the object/scene. To achieve such a structured representation, we design a geometrically regularized self-supervised learning pipeline comprising two stages: frame-wise structurization and sequence-wise structurization. These stages are driven by the learning objectives of intra-frame self-reconstruction and inter-frame deformation fields, respectively. Furthermore, for a better understanding of the structurization process, we direct readers to the video demonstrations presented in Fig. 1 and Fig. 3.
Essentially, the process of structuring a dynamic 3D point cloud sequence to its SPCV representation involves explicitly learning the indexing cues of the geometric coordinates, resulting in the re-organization of original 3D spatial points into regular 2D grids. The uniform structure of SPCV relieves the challenge faced by neural networks in feature learning, thereby potentially generating more expressive features. The decoupled manner can significantly enhance the efficiency of downstream applications, allowing for the integration of advanced techniques to improve performance. The SPCV representation possesses spatial smoothness and temporal consistency properties, enabling the direct application of 2D image/video processing techniques for seamless processing of dynamic 3D point cloud sequences. By leveraging well-established 2D image/video techniques, we further construct SPCV-based frameworks for a range of high-level and low-level 3D point cloud sequence processing and analysis tasks, including action recognition, temporal interpolation, and geometry compression. Experimental results demonstrate the versatility and superiority of our approaches.
In summary, the main contributions of this paper are:
1. We propose a new generic representation modality for dynamic 3D point cloud sequences, offering numerous benefits for point cloud sequence processing and analysis.
2. We propose a novel self-supervised learning framework that efficiently and effectively represents any point cloud sequences as SPCVs.
3. We construct diverse SPCV-based frameworks for dynamic 3D point sequence analysis and processing, achieving state-of-the-art performances.
The remainder of this paper is organized as follows: Section II reviews related works in the field. Section III details the methodology behind our SPCV representation. Section IV provides a detailed introduction to the SPCV-based learning network employed for various downstream tasks. Section V discusses the experimental setup and benchmarks our approach against existing methods. Finally, Section VI concludes the paper and suggests potential directions for future research.
II Related Work
In this section, we begin with a comprehensive summarization of representative deep set architectures for learning from static 3D point clouds. Then, we focus on the more challenging problem of learning from dynamic 3D point cloud sequences, which is directly related to the scope of our work.
II-A Deep Learning on Static 3D Point Clouds
Recent years have witnessed the proliferation of various deep set architectures that directly operate on unstructured 3D point sets without any pre-processing procedures, as pioneered by [5, 6]. Follow-up research has further investigated a rich variety of convolution-based [14, 15, 16, 17, 18, 19, 20, 9, 10], graph-based [21, 7, 22], and transformer-based [23, 8, 24, 25] backbone learning architectures.
More specifically, PointNet [5] performs point-wise embedding and global aggregation via shared multi-layer perceptrons (MLPs) and channel max-pooling. PointNet++ [6] further incorporates multi-scale hierarchical feature abstraction mechanisms with farthest point sampling (FPS) and neighborhood interpolation. SO-Net [14] explicitly utilizes the spatial distribution of point clouds by building self-organizing maps. PointCNN [15] applies learned transformations for adaptively weighting and permuting point features in potentially canonical orders. FeaStNet [21] dynamically builds correspondences between filter weights and graph nodes. KPConv [16] learns to continuously locate convolution weights in Euclidean space through kernel points and further explores spatially deformable convolution operators. RS-CNN [17] focuses on capturing the high-level relations among local point sets via learning from predefined geometric priors. DGCNN [7] queries neighbors globally in the feature space for constructing graph edges and performing dynamic feature aggregation. PointASNL [18] adopts an adaptive sampling scheme together with local-nonlocal cells to improve its model robustness when dealing with noise and outliers. CurveNet [20] aggregates hypothetical curves initially grouped through guided walks to augment point-wise features. PointMLP [26] builds a pure residual MLP network equipped with lightweight geometric affine modules, which can perform competitively without any sophisticated learning mechanisms. PointNeXt [9] systematically revisits the classic PointNet++ processing pipeline and then improves model training and scaling strategies. PT [8] investigates highly-expressive self-attention layers for point clouds. FastPT [24] incorporates lightweight local self-attention with voxel hashing architectures to encode continuous 3D positional information, greatly enhancing computational efficiency.
In conclusion, the approaches above are specialized for learning from a single 3D point cloud input to extract geometric feature representations in the spatial domain. The adaptation to dynamic point cloud learning for achieving sequence processing and analysis is highly non-trivial, due to the complexity of joint spatio-temporal modeling.
Besides, another line of work resorts to a different perspective for overcoming the irregularity and unstructuredness of 3D point clouds by creating regular 2D geometry image (GI [27]) or GI-like representation structures, on which numerous mature 2D modeling architectures (e.g., convolutional neural networks (CNNs)) can be seamlessly applied, or adapted with minimal modifications, to achieve various point cloud learning tasks. Representatively, [28, 29, 30, 31] exploit traditional mesh parameterization techniques to implement the process of surface-to-plane mapping, while [32, 33] propose to directly learn deep 2D regular representations from 3D point clouds in an unsupervised manner. These works are also related to the previous folding-based methods [34, 35, 36]. However, these approaches are still limited to static geometric representations, and are thus inapplicable to dynamic point cloud sequences with requirements of spatio-temporal structurization.
II-B Deep Learning on 3D Point Cloud Sequences
DPMix [37] employs 2D depth video methods to enhance the understanding of point cloud videos. Other methods explore the temporal interpolation of the 3D point cloud sequences [38, 39, 40]. These advancements demonstrate the growing interest and diversification in approaches to effectively model the dynamic nature of 3D point cloud sequences.
Specifically, PSTNet [41] introduces the point spatio-temporal (PST) convolution, leveraging a ‘point tube’ concept to aggregate local spatio-temporal features from 3D point cloud sequences. Building upon this, PSTNet2 [42] merges spatial features into a unified spatio-temporal representation. MaST-Pre [43] advances this concept with a masked point tube for pre-training on point cloud videos, incorporating the spatiotemporal masked autoencoder idea from [44]. P4Transformer [45] boosts the performance of PSTNet by employing transformers to bypass the need for point tracking while capturing spatio-temporal correlations across point cloud sequences. Similarly, [46] introduces a spatio-temporal self-attention (STSA) module to grasp contextual information across adjacent frames. This concept is further enhanced by PST-Transformer [12] with spatio-temporal encoding. These architectures primarily follow an autoencoder-like framework to process dynamic point cloud sequences, akin to the role of PointNet++ [6] in handling static point clouds. While these methods introduce specially designed spatio-temporal feature aggregation structures to adapt to dynamic point cloud learning challenges, they do not fundamentally address the inherent irregularities of spatio-temporal characteristics in 3D point cloud sequences. In essence, the irregular data modality of 3D point cloud sequences remains unaltered.
Another line of research explores implicit representation methods. OFlow [47] adopts implicit strategies for 4D reconstruction, utilizing ground truth occupancy and trajectory information. CaSPR [48] introduces spatio-temporal representations for objects within normalized coordinate space. Similarly, CaDeX [49] tackles inter-frame deformations by employing continuous bijective canonical maps, focusing on a canonical shape. Notably, these methodologies primarily utilize mesh, voxel, or occupancy data, diverging from pure point cloud sequences. Our work, in contrast, concentrates on self-supervised explicit representations. Specifically, we structure the irregular 3D point cloud sequences as structured point cloud videos, aiming to achieve a compact representation that effectively minimizes both distortion and storage requirements.
A variety of methods have been developed for specific 3D point cloud sequence processing tasks, as well as related technologies. These technologies include flow-based [50, 51, 52], depth-based [53], GAN-based [54], kinematics-inspired [55], contrastive learning [56], and correspondence approaches [57, 58]. There are also some task-specific sequence processing methods, notably compression [59, 60] and interpolation [38, 39, 40].
For example, PV-RAFT [50] proposes a point-voxel recurrent all-pairs field transforms method to estimate scene flow from point clouds. NSFP [51] revisits the scene flow problem, emphasizing runtime optimization and regularization of the scene flow. SSFE [52] aims to predict 3D scene flow for all pairs of point clouds in a given sequence. 3DV [53] utilizes the dynamic 3D voxel for depth-based 3D action recognition. TPU-GAN [54] proposes a super-resolution generative adversarial network for upsampling dynamic point cloud sequences. Kinet [55] explores kinematics-inspired neural networks to capture 3D motions without explicit tracking. These methods employed various approaches, such as utilizing voxel and depth information for feature extraction, utilizing GAN-based methods to avoid the need for explicit point correspondence annotation, or relying on explicit scene flow annotation datasets. Overall, these methods did not directly address the underlying irregularity issues of 3D point cloud data. Furthermore, they lacked powerful insights into the explicit geometric aspects of 3D point clouds.
III Proposed Method
III-A Problem Statement
As mentioned earlier, learning from irregular and unordered point sets presents a significant challenge compared to feature extraction from 2D images or videos organized in a regular grid. Essentially, an RGB image of sizes can be regarded as a set of 5-dimensional vectors denoted as , where the color information (or pixel value) to be processed is indexed by the 2D coordinate uniformly distributed on a regular 2D grid of sizes , while for point cloud data, the spatial coordinates serve as both the geometry information to be processed and the indirect indexing cues to be used for determining inter-point relationships such as proximity and correspondence. Consequently, existing point-based deep learning architectures often fall short in efficiency when processing large-scale point data due to their involvement in complex learning mechanisms, including cumbersome pre-processing, extensive neighbor querying, and costly sampling, matching, and multi-scale abstraction processes. The significant computational complexity of the aforementioned time-consuming operations in turn hinders the development of deeper or more sophisticated architectural models for 3D point cloud sequences, thus limiting performance on downstream tasks. Additionally, these operations increase memory consumption, particularly when processing denser 3D point cloud sequences. Furthermore, measuring the discrepancy between two point clouds poses a significant challenge due to their irregular and unordered nature, making standard methods like CD and EMD either ineffective or inefficient [61].
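To make the image analogy concrete, the following minimal NumPy sketch flattens a small RGB image into a set of 5-dimensional vectors, each holding a 2D grid coordinate followed by a color. The function name `image_as_vectors` and the (row, column, r, g, b) column order are illustrative assumptions and do not come from the paper's notation.

```python
import numpy as np

def image_as_vectors(image):
    """Flatten an (H, W, 3) RGB image into (H*W, 5) rows of (row, col, r, g, b).

    Hypothetical helper: the first two columns are the regular-grid index,
    the last three are the signal that is actually processed.
    """
    h, w, _ = image.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([rows, cols], axis=-1).reshape(-1, 2)
    colors = image.reshape(-1, 3)
    return np.concatenate([coords, colors], axis=1)

# Toy 2 x 3 image whose color values are simply 0..17.
image = np.arange(2 * 3 * 3, dtype=float).reshape(2, 3, 3)
vectors = image_as_vectors(image)

# On a grid, the right-hand neighbor of pixel (r, c) is found by index arithmetic.
right_neighbor_of_0_0 = image[0, 1]
```

For a point cloud there is no such separate index: the (x, y, z) values themselves must be searched to find neighbors, which is the cost the paragraph above describes.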
The shift from static point cloud to 3D point cloud sequence modeling introduces additional challenges, exacerbated by the inherent lack of structure in both spatial and temporal dimensions. Current approaches rely heavily on intensive spatial neighborhood searches and the establishment of temporal correspondences, necessitating complex spatio-temporal feature processing. This intricate methodology is not just laborious and costly—it also substantially obstructs the creation of streamlined and efficient learning frameworks for 3D point cloud sequences.
III-B Our Objective
Denote by a dynamic 3D point cloud sequence with point cloud frames, where refers to the number of points contained in the -th frame (footnote 1: without loss of generality, we assign a fixed number of points to all frames in practice). Note that the point-wise correspondence across the frames of is unknown. Inspired by the above-mentioned structure of 2D images/videos and the fact that 3D geometric shapes are essentially 2D manifolds, we aim to structure dynamic 3D point cloud sequences to address the aforementioned challenges.
Generally, for each point of a typical point cloud frame , we plan to obtain a unique 2D coordinate , thereby forming a 5-dimensional signal . Meanwhile, points that are nearest neighbors in 3D space should have close 2D coordinates. When the coordinates are uniformly distributed across the regular 2D grid, the structure effectively resembles that of an RGB image. In other words, we can fill into the corresponding location, constructing an RGB image-like representation of dimensions (). After processing all point cloud frames of in this manner, we can derive a 2D video-like representation of , denoted as , named Structured Point Cloud Video (SPCV). Moreover, we constrain the encoded points at an identical location across different frames of to correspond to the same or approximately the same position of the 3D object/scene, which goes beyond what traditional RGB videos provide. In the following, we also call an element of a pixel for convenience. Note that can be directly reshaped/reprojected back to a point cloud without any additional computation, i.e., the value of a pixel is taken as a 3D point.
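The reshaping described above can be sketched in a few lines of NumPy. The array layout (frames, height, width, 3) and the helper names `spcv_to_point_clouds` and `pixel_trajectory` are assumptions made for illustration; the toy data is a flat patch that rises along z, not data from the paper.

```python
import numpy as np

def spcv_to_point_clouds(spcv):
    """Reshape an SPCV of shape (T, H, W, 3) into T point clouds of shape (T, H*W, 3).

    Each pixel value is taken directly as one (x, y, z) point, so this is a
    pure reshape with no computation on the coordinates.
    """
    t, h, w, c = spcv.shape
    if c != 3:
        raise ValueError("each pixel must hold one (x, y, z) coordinate")
    return spcv.reshape(t, h * w, 3)

def pixel_trajectory(spcv, row, col):
    """Return the (T, 3) path of the point stored at one fixed pixel location."""
    return spcv[:, row, col, :]

# Toy SPCV: a flat 4 x 4 patch whose z coordinate grows by 0.1 per frame.
T, H, W = 3, 4, 4
u, v = np.meshgrid(np.linspace(0, 1, H), np.linspace(0, 1, W), indexing="ij")
spcv = np.stack(
    [np.stack([u, v, np.full_like(u, 0.1 * t)], axis=-1) for t in range(T)]
)

clouds = spcv_to_point_clouds(spcv)   # one unordered point set per frame
track = pixel_trajectory(spcv, 1, 2)  # same surface position over time
```

When temporal consistency holds, reading a fixed pixel across frames yields a correspondence track directly, without any matching step.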
In summary, we anticipate that the resulting SPCV should exhibit the following two characteristics, as illustrated in Fig. 4:
1. Spatial Smoothness: the pixels in a local patch of a typical frame of an SPCV correspond to a set of neighboring points in 3D space;
2. Temporal Consistency: the pixels with identical locations across the frames of an SPCV correspond to the consistent position of the object/scene.
The proposed representation modality is able to offer the following advantages:
1. Improved Efficiency and Performance of Downstream Applications. The SPCV framework streamlines downstream processing and analysis tasks by replacing complex and time-consuming operations like FPS and K-NN aggregation with simpler and highly efficient pixel indexing and sliding window mechanisms. Meanwhile, the uniform structure of SPCV relieves the challenge faced by neural networks in feature learning, thereby potentially generating more expressive features. Moreover, SPCV can seamlessly integrate with established learning techniques used in 2D image/video network designs, enhancing overall performance.
2. Reduction of Memory Consumption. SPCV efficiently reduces memory usage by indexing frames using variable pixel locations, in contrast to traditional methods that necessitate storing multiple scales of point clouds for feature aggregation. This advantage becomes particularly evident when processing dense or large-scale point clouds.
3. Simplified Design and Implementation. Adopting SPCV simplifies the design and implementation complexities in point cloud applications. For instance, encoders with specialized spatio-temporal feature extraction can be replaced with encoders that are already commonly used in the 2D image/video community.
4. Free of the Limitations of CD and EMD. Traditional models often struggle with inefficiencies from point cloud loss functions like CD and EMD. The SPCV framework, however, enables more efficient and straightforward optimization using grid-based losses like the Frobenius norm or , thanks to its inherent structure.
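The contrast between set-based and grid-based losses can be illustrated with a small NumPy sketch. The Chamfer variant used here (mean of squared nearest-neighbor distances in both directions) is one common definition and is an assumption, as are the function names; the paper does not specify its exact form in this section.

```python
import numpy as np

def chamfer_distance(p, q):
    """One common Chamfer variant between point sets p (N, 3) and q (M, 3).

    Builds the full N x M squared-distance matrix, so time and memory
    grow with N * M.
    """
    d2 = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def grid_l1_loss(a, b):
    """Mean absolute pixel-wise error between two frames of equal shape."""
    return np.abs(a - b).mean()

def grid_frobenius_loss(a, b):
    """Frobenius norm of the pixel-wise difference between two frames."""
    return np.sqrt(np.sum((a - b) ** 2))

rng = np.random.default_rng(0)
frame = rng.random((8, 8, 3))                       # toy SPCV frame
points = frame.reshape(-1, 3)
points_shuffled = rng.permutation(points)           # same set, new order
frame_shuffled = points_shuffled.reshape(8, 8, 3)

cd_same_set = chamfer_distance(points, points_shuffled)  # order-invariant
l1_shuffled = grid_l1_loss(frame, frame_shuffled)        # order-sensitive
coarse = frame[::2, ::2]                                 # strided pixel indexing
```

The grid losses cost one pass over the pixels, but they are only meaningful because pixel locations carry correspondence; the Chamfer distance needs no correspondence and pays for that with the pairwise search. The strided slice on the last line stands in for the sampling step that point-based pipelines would perform with FPS.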
III-C Our Pipeline
Technical Intuition. Intuitively, to achieve the objective of re-organizing into through a learning-based method, we can construct a neural network, which takes as input and outputs , and train it by using and paired training data. However, adopting a supervised learning approach presents significant challenges. Even for static point clouds exhibiting arbitrary and non-manifold geometric structures, acquiring the ground-truth poses a formidable task due to the complex parameterization problem [62]. This challenge becomes even more pronounced when considering dynamic point cloud sequences, as obtaining additional accurate point-wise correspondences across frames as ground-truth remains a persistent and challenging problem.
To overcome this challenge, we consider constructing a self-supervised pipeline without relying on ground-truth 2D locations and temporal correspondences. Specifically, for a single point cloud, we first pre-define a set of 2D locations uniformly distributed on a regular 2D grid of sizes , each location storing/encoding . We then feed into a learnable neural network to generate further assigned to the corresponding locations of their inputs. We can utilize the original point cloud as supervision to drive the learning of the network, ensuring that the generated closely resemble . Moreover, owing to the inherent spectral bias property of neural networks [63], the generated data from spatially smooth inputs are expected to preserve spatial smoothness.
Furthermore, observing that a dynamic 3D point cloud sequence recording the motion of an object can be essentially thought of as a set of 3D points with a fixed topology deforming/evolving over time, we can adopt the following procedure to structure it. First, we represent through the above-mentioned self-supervised learning approach for single point clouds, producing . We then obtain the representations of the remaining point cloud frames by learning the deformation fields that deform to () in a recursive fashion, i.e., , where stands for the deformation field for . To be specific, we can minimize the discrepancy between and (footnote 2: we reshape into a point cloud and compute the discrepancy in 3D space) to estimate . Such a procedure aligns the resulting structures of with that of , leading to the spatial smoothness of all frames. Also, the deformation mechanism establishes a fixed topology across all frames, thus naturally preserving temporal consistency across all frames.
Technical Details. Based on the above analyses, we construct the self-supervised learning pipeline shown in Fig. 5 for representing into . It consists of two main stages: (1) frame-wise structurization and (2) sequence-wise structurization, which are detailed as follows.
III-C1 Frame-wise Structurization
This stage only overfits the first point cloud frame of (i.e., ) into the first frame of (i.e., ) through a self-supervised neural network regularized with additional geometrically meaningful constraints. Specifically, we first construct a pre-defined 2D grid , with each pixel filled with where and is sampled regularly from and , respectively. We then feed into a network composed of 2D convolutional (Conv2D) layers, denoted as with being network parameters. Owing to the inherent spectral bias property of neural networks [63], the generated image is expected to be spatially smooth. Let be the point cloud form reprojected from . We minimize the discrepancy between and to enforce the shape represented by to approximate . Thus, we drive the learning of by optimizing the following loss function:
(1) |
where stands for a typical distance metric for 3D point clouds [64], and is a geometry-aware regularization term further promoting the spatial smoothness characteristic of . Specifically, we explicitly regularize the generated pixel values of to promote its spatial smoothness, i.e., the value of a typical pixel of should be close to the average of those of its neighboring pixels. In addition, based on the fact that the normal of a typical point in a locally smooth region of a 3D surface is generally close to the average of the normals of its neighboring points, we also regularize the geometry attribute of the generated pixels of , i.e., the normal of a typical pixel of should be close to the average of those of its neighboring pixels. As essentially models the process of 2D-to-3D mapping, we can compute the normal of a pixel located at through the cross product of ’s partial derivatives with respect to :
(2) |
where the partial derivatives can be derived through the back-propagation of . In other words, this regularization constrains the network's behavior. In all, the geometry-aware regularization term is explicitly written as
(3) | ||||
where and are two weights balancing the spatial and normal regularization terms, and represents the window around the center pixel.
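A minimal NumPy sketch of the two regularization terms described in this subsection is given below. Several choices are assumptions made only for illustration: the grid is sampled from [0, 1], the window is 3 x 3 with edge-replicated borders, the penalties are mean Euclidean norms, and the partial derivatives are approximated by finite differences instead of being obtained through back-propagation as in the text. All function and parameter names are hypothetical.

```python
import numpy as np

def make_uv_grid(h, w, lo=0.0, hi=1.0):
    """Regular (h, w, 2) grid of 2D coordinates; the range [lo, hi] is assumed."""
    u = np.linspace(lo, hi, h)
    v = np.linspace(lo, hi, w)
    uu, vv = np.meshgrid(u, v, indexing="ij")
    return np.stack([uu, vv], axis=-1)

def neighbor_mean(img, k=3):
    """Average of the k x k window around each pixel, excluding the pixel itself."""
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    h, w, _ = img.shape
    total = np.zeros_like(img, dtype=float)
    for di in range(k):
        for dj in range(k):
            total += padded[di:di + h, dj:dj + w]
    return (total - img) / (k * k - 1)

def pixel_normals(frame, eps=1e-12):
    """Unit normals from the cross product of the two partial derivatives.

    Finite differences over the pixel grid stand in for autograd here.
    """
    du = np.gradient(frame, axis=0)
    dv = np.gradient(frame, axis=1)
    n = np.cross(du, dv)
    return n / (np.linalg.norm(n, axis=-1, keepdims=True) + eps)

def geometry_regularizer(frame, w_spatial=1.0, w_normal=1.0):
    """Weighted sum of the coordinate-smoothness and normal-smoothness terms."""
    spatial = np.linalg.norm(frame - neighbor_mean(frame), axis=-1).mean()
    normals = pixel_normals(frame)
    normal = np.linalg.norm(normals - neighbor_mean(normals), axis=-1).mean()
    return w_spatial * spatial + w_normal * normal

grid = make_uv_grid(16, 16)
flat = np.concatenate([grid, np.zeros((16, 16, 1))], axis=-1)  # planar patch
rng = np.random.default_rng(0)
noisy = flat + 0.05 * rng.standard_normal(flat.shape)          # perturbed patch
```

On the planar patch every normal points along +z and the normal term vanishes, while the perturbed patch is penalized by both terms, which is the behavior the regularizer is meant to encourage.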
III-C2 Sequence-wise Structurization
This stage represents the remaining point cloud frames of (i.e., ) as the frames of (i.e., {}) through an optimization scheme, which predicts the deformation field between two adjacent frames to deform to recursively, based on , i.e., (), where stands for the deformation fields at the -th step. By establishing inter-frame relationships within through deformation fields, we can achieve temporal consistency effectively. To be specific, we formulate the following optimization problem to obtain :
(4) |
where and are the point cloud forms, respectively reshaped from and , balances the two terms, and regularizes (or ) to be locally smooth/constant, based on the fact that neighboring points on exhibit similar deformation. Such regularization also propagates the spatial smoothness of to , defined as
(5) |
where returns the index of -NN points, and . We solve Eq. (4) using the Adam optimizer [65], where is initialized with the values from the point cloud form of .
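The structure of this objective (a discrepancy term plus a neighborhood smoothness term on the deformation) can be sketched as follows. This is only an evaluation of the objective under stated assumptions, not the optimization itself, which the text solves with Adam. The squared-difference penalty, the Chamfer variant, the values of `k` and `lam`, and all function names are illustrative assumptions.

```python
import numpy as np

def chamfer_distance(p, q):
    """One common Chamfer variant (squared distances, both directions)."""
    d2 = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def knn_indices(points, k):
    """Indices of the k nearest neighbors of each point, excluding itself."""
    d2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]

def deformation_smoothness(delta, nbr_idx):
    """Mean squared difference between each point's offset and its neighbors'."""
    diff = delta[:, None, :] - delta[nbr_idx]  # (N, k, 3)
    return np.mean(np.sum(diff ** 2, axis=-1))

def objective(delta, prev_points, next_points, nbr_idx, lam=1.0):
    """Discrepancy of the deformed previous frame to the next frame, plus smoothness."""
    deformed = prev_points + delta
    return chamfer_distance(deformed, next_points) + lam * deformation_smoothness(
        delta, nbr_idx
    )

rng = np.random.default_rng(0)
prev_points = rng.random((64, 3))          # point cloud form of the previous frame
shift = np.array([0.1, 0.0, 0.0])
next_points = prev_points + shift          # toy target: a rigid translation
nbr_idx = knn_indices(prev_points, k=4)

rigid = np.tile(shift, (64, 1))            # the true, perfectly smooth offsets
jitter = 0.1 * rng.standard_normal((64, 3))
no_motion = np.zeros((64, 3))
```

In this toy case the rigid offset field attains a zero objective, whereas zero motion leaves a residual discrepancy and random jitter incurs a smoothness penalty. Because the offsets are stored per point (equivalently per pixel), the deformed result keeps the grid layout of the previous frame.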
It is also worth noting that as a trade-off between efficiency and effectiveness, we achieve the second stage with an optimization-based approach, and it is straightforward to substitute it with a self-supervised learning-based approach, where a neural network is trained to automatically learn and predict the deformation fields for each subsequent frame, for pursuing high performance.
III-D Discussion
Generally, the proposed SPCV shares a similar concept with previous parameterization-based geometry image/video representation for 3D triangle meshes/dynamic mesh sequences [27, 66, 67, 68], i.e., representing 3D geometry data with a 2D RGB image/video-like structure. However, our work significantly distinguishes itself from previous methods in terms of motivation, technical implementation, and application scenarios. First, previous methods map 3D vertices to the 2D domain through traditional parameterization algorithms optimizing the area and/or angle distortions of triangles. Due to the fundamental challenges of parameterization, these methods impose high requirements on the topological structures of 3D meshes to be processed, along with the requirement of correspondence across frames to model temporal information. Given that 3D point cloud sequences lack connectivity information and temporal correspondence and may have arbitrarily complex topological structures, these methods are largely inapplicable to our scenario. Second, driven by the objective of separating the indexing cues from the geometry information for improving efficiency, our pipeline re-organizes neighboring 3D points determined in the sense of the Euclidean distance into adjacent 2D locations on a regular 2D grid. The manner of determining a neighborhood is consistent with that of mainstream deep point-based architectures. In contrast, the parameterization-based methods essentially achieve the mapping in a geodesic distance-aware fashion. Accordingly, the theorems they employed are not suitable for our method. Finally, it is also worth noting that our method mimics the behavior of traditional parameterization when handling 3D objects with low curvatures, as demonstrated in Fig. 7.
In summary, unlike the previous representation mechanism pursuing accurate parameterization, the proposed learning-based pipeline leverages the impressive capability of deep learning to offer a practical and promising solution for real-world applications. The advantages and effectiveness of our proposed SPCV are extensively demonstrated through experiments on various downstream tasks in Section IV.
IV SPCV-based Applications
To verify the practical advantages of representing 3D point cloud sequences as SPCVs, we construct task-specific processing frameworks that directly consume SPCVs as inputs to achieve the corresponding dynamic point cloud applications. As aforementioned, the 2D video-like structure of our SPCV representation enables the adoption of various existing well-established 2D image/video deep learning techniques to construct task-specific frameworks.
Essentially, SPCV is designed to be a generic representation structure, meaning that there are no restrictions on the downstream task scenarios. In principle, the downstream tasks should be sensitive to spatial smoothness and temporal consistency of input SPCVs, such that the resulting task performances can better indicate the representation quality of SPCVs. The downstream tasks should also cover both high-level semantic understanding and low-level geometry processing applications. Additionally, both learning-based and non-learning-based tasks are necessary to demonstrate the versatility of our SPCV representations. According to the above principles, we consider three targeted downstream tasks: 1) learning-based 3D action recognition; 2) learning-based temporal interpolation of point cloud sequences; and 3) non-learning-based 3D point cloud data compression. Specifically, action recognition evaluates the discriminative and expressive capabilities of high-level features learned through the SPCV representation. The effectiveness of temporal interpolation and compression serves as a measure of SPCV’s representational quality in both spatial and temporal domains.
IV-A 3D Action Recognition
As a common sequence-level task in point cloud sequence processing, this task evaluates the model’s ability to extract discriminative features from each 3D point cloud sequence, thereby determining the most probable label for the sequence. Existing methods [69, 42, 45, 12] typically employ point-based spatio-temporal convolution operations, e.g., PST convolution [69, 42], for learning feature encodings across spatio-temporal dimensions, or transformers [45, 12] for spatio-temporal context interaction. However, because these designs cope poorly with the unstructured nature of 3D point cloud data, their ability to extract discriminative features efficiently and effectively remains relatively weak, which limits their accuracy. With the video-like representation, our approach enables the use of mature 2D convolutional operations to obtain more discriminative and expressive features. We modify the feature embedding process, common in approaches like PST-Transformer [12], by employing an efficient convolutional encoder. This modification efficiently generates more expressive feature tokens for the transformer.
Specifically, as illustrated in Fig. 6 (a), the proposed SPCV-based framework first spatially downsamples an input SPCV to a quarter of its original scale and then passes it into a video-based encoder, an 18-layer ‘UNet3D’ architecture styled after ResNet [2], a design also used in conventional video-based methods [70, 71]. In the network’s bottleneck, the scaled SPCV frames, along with features scaled down to a quarter of their original size, are converted into sequences of point clouds and their corresponding features. These sequences are then fed into a transformer-based decoder to enhance feature space interactions. Inside the decoder, the transformer’s output features undergo two layers of max-pooling to produce batches of feature vectors. These vectors are subsequently processed through an MLP head layer to generate the final scores, indicating the probabilities of various classes for the input sequence.
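The two data-layout steps in this pipeline, spatial downsampling of the SPCV and conversion of its frames into point cloud sequences, can be illustrated with a minimal NumPy sketch. It is not the released implementation: the function names are hypothetical, strided subsampling is only one possible downsampling operator, and "a quarter of its original scale" is read here as a quarter of the pixel count (stride 2 along each axis), which is an assumption.

```python
import numpy as np

def downsample_spcv(spcv: np.ndarray, stride: int = 2) -> np.ndarray:
    """Keep every `stride`-th pixel of an SPCV of shape (T, H, W, 3).

    Assumption: plain strided subsampling; the paper does not specify the operator.
    """
    return spcv[:, ::stride, ::stride, :]

def spcv_to_point_sequence(spcv: np.ndarray) -> np.ndarray:
    """Flatten each (H, W, 3) frame into an (H*W, 3) point cloud."""
    t, h, w, c = spcv.shape
    return spcv.reshape(t, h * w, c)

# Toy SPCV: 4 frames of 64x32 pixels, each pixel storing an (x, y, z) coordinate.
rng = np.random.default_rng(0)
spcv = rng.standard_normal((4, 64, 32, 3))

small = downsample_spcv(spcv, stride=2)    # shape (4, 32, 16, 3)
points = spcv_to_point_sequence(small)     # shape (4, 512, 3)
```

Because the pixel grid is regular, both steps reduce to slicing and reshaping, with no neighbor search or sampling heuristics.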
IV-B Temporal Interpolation of 3D Point Cloud Sequences
This task involves reconstructing a high temporal resolution 3D point cloud sequence from a low-resolution one. It is the counterpart of 2D video frame interpolation, which aims to increase the frame rate of a video by generating the in-between frames. The unstructured nature of individual frames and the lack of correspondence across frames compound the complexity of this task, especially for sequences with large non-rigid deformation. PointINet [38] addresses this task by adopting a scene flow estimator for interpolation. IDEA-Net [39] estimates point-wise trajectories for coarse interpolation and trajectory compensation. These methods suffer from inefficient and ineffective feature extraction and rely on distribution-based loss functions, which restricts the fidelity of the interpolated point cloud frames. To address these issues, we develop an interpolation model based on an image encoder. Leveraging the structured nature of our representation, the mature image encoder efficiently generates more effective feature embedding tokens, which are then enhanced through downstream transformers in both temporal and spatial domains for feature interaction. Moreover, our model can be driven by common 2D loss functions like the Frobenius norm, enabling efficient interpolation of higher-quality point cloud frames.
Fig. 8 illustrates the overall learning pipeline of our proposed SPCV-based point cloud frame interpolation. Specifically, for an input pair of consecutive frames at time step and , we consider interpolating a new frame at time step , where . To implement this, we feed the two point cloud frames into a shared 2D image-based encoder consisting of Conv2D layers to produce the corresponding feature maps and . Then we can deduce an interpolated feature map, i.e., , which is further passed into a residual decoder with Conv1D layers. Finally, we add the generated residuals to the input frame at time step to obtain the desired interpolation result.
IV-C 3D Point Cloud Data Compression
With the rapid development of 3D sensing technologies and the broad deployment of consumer-level devices (e.g., depth camera, LiDAR, etc.), there is an urgent need to develop efficient 3D point cloud codecs to compress the corresponding huge data volume and effectively save transmission bandwidth and storage space. However, unlike its mature 2D counterpart, i.e., image/video compression, which has been investigated and developed for a long time, 3D point cloud compression is still a relatively young topic whose actual coding performance is far from satisfactory. Therefore, there exists an obvious contradiction between ever-growing 3D point cloud data volume and immature compression techniques. Compared to 2D image/video compression, 3D point cloud data compression (note that we consider compression of the geometry information of 3D point clouds, rather than the attributes) poses a greater challenge due to the unstructured nature of spatial and temporal domains, i.e., the points of each frame are irregularly distributed, the number of points may change from one frame to another, and there is no consistent point-to-point correspondence between successive frames.
The Moving Picture Experts Group (MPEG) introduced two point cloud compression (PCC) standards: Geometry-Based PCC (G-PCC) and Video-Based PCC (V-PCC) for compressing static and dynamic point cloud data, respectively. We refer the readers to [72, 73, 60] for detailed introductions of these methods and [74] for a survey on deep learning-based point cloud compression.
Like [75, 76], the proposed SPCV naturally bridges the gap between dynamic 3D point cloud sequences and mature 2D image/video compression techniques, paving the way for efficient and effective compression pipelines. Specifically, given a 3D point cloud sequence, we represent it as an SPCV, which is then fed into the Versatile Video Coding (VVC) Test Model, the reference software of the latest H.266/VVC video compression standard [77], to achieve compression. This pipeline is fully compatible with existing infrastructures for 2D image/video compression and transmission.
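Before a video codec can consume an SPCV, its floating-point coordinates have to be mapped to integer sample values. The text does not describe this step, so the following is only a sketch under stated assumptions: uniform min-max quantization over the whole sequence, a 10-bit sample depth, and hypothetical function names. The encoder invocation itself is omitted.

```python
import numpy as np

def quantize_spcv(spcv: np.ndarray, bit_depth: int = 10):
    """Map float coordinates of shape (T, H, W, 3) to integer video samples.

    Assumption: one global min/max for the whole sequence and uniform steps.
    Returns the integer frames plus the side information needed to invert.
    """
    lo = float(spcv.min())
    hi = float(spcv.max())
    levels = (1 << bit_depth) - 1
    scale = levels / (hi - lo) if hi > lo else 1.0
    frames = np.round((spcv - lo) * scale).astype(np.uint16)
    return frames, lo, scale

def dequantize_spcv(frames: np.ndarray, lo: float, scale: float) -> np.ndarray:
    """Recover approximate float coordinates from decoded integer frames."""
    return frames.astype(np.float64) / scale + lo

rng = np.random.default_rng(0)
spcv = rng.uniform(-1.0, 1.0, size=(2, 8, 8, 3))

frames, lo, scale = quantize_spcv(spcv, bit_depth=10)
restored = dequantize_spcv(frames, lo, scale)
max_error = float(np.abs(restored - spcv).max())  # bounded by half a quantization step
```

In this sketch the three coordinate channels simply play the role of three image planes; how they would be mapped to the codec's color format is left open.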
V Experiments
In Section V-A, we quantitatively and qualitatively evaluate the representation capability of our framework by quantifying the spatial smoothness, temporal consistency, and geometric fidelity of SPCVs. Sections V-B to V-D evaluate the three SPCV-based point cloud analysis and processing tasks. Section V-E presents in-depth ablation studies focusing on key components within our SPCV representation framework.
V-A Evaluation of Representation Quality
Sequence | CD | HD | mNUC |
Camel | 0.024 | 0.66 | 1.3 |
Elephant | 0.041 | 0.78 | 1.2 |
Horse | 0.032 | 0.68 | 1.1 |
Crane | 0.034 | 0.68 | 1.0 |
Swing | 0.029 | 0.65 | 1.0 |
Bouncing | 0.032 | 0.68 | 1.2 |
V-A1 Results of static 3D point clouds
We quantitatively evaluated the representation capability of our framework on static 3D point clouds by computing the spatial smoothness and geometric fidelity of resulting 2D image representations. Under this scenario, only the first stage of our framework, i.e., frame-wise structurization, is used. Specifically, we quantify the spatial smoothness by examining the proportion of pixels with a window on the resulting 2D representation that are the -nearest neighbors of the window’s central pixel after reprojecting the pixels into 3D space. The higher the proportion, the better. We utilized CD, Hausdorff Distance (HD), and Mean Normalized Uniformity Coefficients (mNUC) [79] between structured and original point clouds to comprehensively assess the geometric fidelity. We employed six point clouds with various topological structures, as shown at the top of Fig. 9, and each point cloud contains points that are uniformly distributed. We also benchmarked two recent methods for representing static point clouds as 2D images: Flattening-Net [33] and RegGeoNet [32].
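The spatial smoothness ratio described above can be sketched as follows. This is an illustrative reading of the definition, not the evaluation code used for the reported numbers: the function name, the default window size of 3, the default of 8 neighbors, and the choice to skip border pixels whose window does not fit are all assumptions. The brute-force pairwise distance matrix is only practical for small images.

```python
import numpy as np

def smoothness_ratio(image: np.ndarray, window: int = 3, k: int = 8) -> float:
    """Fraction of in-window pixels that are among the k nearest 3D
    neighbors of the window's central pixel.

    image: (H, W, 3) array whose pixel values are 3D coordinates.
    """
    h, w, _ = image.shape
    points = image.reshape(-1, 3)
    diff = points[:, None, :] - points[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    # Column 0 of the sorted indices is the point itself (distance 0), so skip it.
    knn = np.argsort(dist2, axis=1, kind="stable")[:, 1:k + 1]

    r = window // 2
    hits = 0
    total = 0
    for i in range(r, h - r):          # border pixels are skipped (assumption)
        for j in range(r, w - r):
            neighbours = set(knn[i * w + j].tolist())
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    if di == 0 and dj == 0:
                        continue
                    total += 1
                    if (i + di) * w + (j + dj) in neighbours:
                        hits += 1
    return hits / total

# A flat 6x6 grid is perfectly smooth; randomly permuting its pixels is not.
ys, xs = np.meshgrid(np.arange(6), np.arange(6), indexing="ij")
grid = np.stack([xs, ys, np.zeros_like(xs)], axis=-1).astype(np.float64)
rng = np.random.default_rng(0)
shuffled = rng.permutation(grid.reshape(-1, 3)).reshape(6, 6, 3)

ratio_grid = smoothness_ratio(grid)
ratio_shuffled = smoothness_ratio(shuffled)
```

The two toy inputs contain exactly the same set of points, so the difference in their ratios comes only from how the points are arranged on the 2D grid, which is the property the metric is meant to capture.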
From Fig. LABEL:reconstruction_smoothness, it can be observed that our method achieves a smoothness ratio ranging roughly from 35% to 100% for window sizes from to among various types of 3D point clouds, which are consistently higher than those of the two compared methods. Such a superiority is also verified by the visual results of the resulting 2D representations shown in Fig. 9, where our results show smooth pixel distributions while the results of Flattening-Net and RegGeoNet show obvious block effects.
In terms of geometric fidelity, as compared in Table I, our method significantly outperforms the competing state-of-the-art, achieving about ten times smaller CD and at least two times smaller HD and mNUC. The advantages reflected in these metrics demonstrate that our representation preserves the original 3D geometry more accurately and captures the point distribution more uniformly. As shown in Fig. 9, benefiting from our geometry accuracy and distribution uniformity, the 3D surfaces reconstructed from our structured representations are closer to the ground truths, while the results produced by Flattening-Net and RegGeoNet degrade to a large extent.
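For reference, the two distance-based fidelity metrics in this comparison can be sketched in a few lines. Several variants of CD are in common use (squared or unsquared distances, summed or averaged directions), and the text does not state which one is adopted, so the unsquared, mean-based form below is an assumption. mNUC is not reproduced here.

```python
import numpy as np

def pairwise_dist(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Euclidean distances between every point of a (N, 3) and b (M, 3)."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance: mean nearest-neighbor distance in both directions."""
    d = pairwise_dist(a, b)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

def hausdorff_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Hausdorff distance: worst-case nearest-neighbor distance."""
    d = pairwise_dist(a, b)
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))

a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 2.0, 0.0]])

cd = chamfer_distance(a, b)    # only the extra point in b contributes: 2/3
hd = hausdorff_distance(a, b)  # the extra point is 2 units from its nearest match
```

CD averages over all points and therefore reflects overall fidelity, whereas HD is driven by the single worst outlier, which is why the two are reported together.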
Overall, when evaluating the representation quality, the metrics of both spatial smoothness and geometry fidelity should be jointly taken into account. Although the spatial smoothness ratios of Flattening-Net and RegGeoNet can be relatively close to ours in some cases, e.g., Figs. LABEL:reconstruction_smoothness (b), (e), and (f), it does not mean their representations achieve sufficiently satisfactory quality because of their non-uniform point distribution, which can also be verified by their inferior surface reconstruction results as shown in Fig. 9.
V-A2 Results of dynamic 3D point cloud sequences
We quantitatively evaluated the representation ability of our method on dynamic 3D point cloud sequences by computing the mean spatial smoothness and geometric fidelity over all frames of the resulting SPCV representations. Additionally, we defined the correspondence ratio to quantify the temporal consistency of SPCVs, i.e., for each pair of adjacent frames of the SPCV, we calculated the corresponding proportion of the -nearest neighborhood of each point and proposed an algorithm to calculate the correspondence ratio in percentage, with higher values indicating better performance. This metric takes into account both reconstruction and correspondence accuracy. We refer readers to the Supplementary Material for detailed descriptions of the calculation process.
Here, we employed six 3D point cloud sequences, including three human motion sequences [80] and three animal motion sequences [81]. For the human sequences, we utilized 46 frames from the swing, 31 frames from the bouncing and crane, while for the animal sequences, we employed 10 frames from the horse, 48 frames from the camel and elephant. Each point cloud frame of the sequences contains 10K points.
Our method showcases superior representation quality, as evidenced in Fig. LABEL:quant_temporal_consistency_smooth_regular and Table II. Specifically, as observed in Fig. LABEL:quant_temporal_consistency_smooth_regular, our representation maintains over 60% temporal consistency and a mean spatial smoothness ratio of over 35%, indicating its capability to handle various types of sequences. Additionally, Table II reveals that our method consistently achieves robust geometric fidelity across different sequence types, as indicated by stable CD, HD, and mNUC metrics. Besides, Fig. LABEL:hum_anim_seq_vis_ours visualizes the SPCV representations on the human and animal sequences, further verifying the effectiveness of the proposed framework in terms of spatial smoothness, temporal consistency, and geometric fidelity. As shown in Fig. 17, we cropped a patch from the first ten frames of KITTI and generated the SPCV for better illustration. We selected 3 frames to visualize a typical situation where a tree appears in the first two frames but vanishes in the last frame. The tree is mapped to a similar pixel area, as shown in the red box on the 2D grid. In the last frame, these pixel positions are replaced by the coordinates of the ground. These results demonstrate the robustness and versatility of the SPCV representation even for scenarios involving scanned LiDAR point cloud sequences.
Moreover, we also compared the learned temporal consistency by our framework with CorrNet3D [58], a representative unsupervised method for building dense correspondence between two point clouds, following its evaluation setups. As shown in Fig. LABEL:quant_temporal_consistency_diff_methods, our method outperforms CorrNet3D, achieving over 95% correspondence ratios for varying neighborhood sizes (i.e., values of ). The visual results in Fig. LABEL:correspondence_results_representation demonstrate that our representation maintains more consistent patterns across SPCV frames and ensures more accurate correspondences between adjacent frames.
V-B Results of Action Recognition
We compared our SPCV-based action recognition framework with existing state-of-the-art 3D sequence processing methods [45, 42, 12] on the challenging DeformingThings4D dataset [82], which consists of 1,972 animated 3D point cloud sequences across 31 categories of humanoids and animals with large non-rigid deformations. Each frame of the point cloud sequences contains 2,048 points. We segmented the whole sequence into multiple fixed-frame clips, which are fed into all competing methods. To adapt our method, we converted all 3D point cloud sequences to SPCVs with the height and width of each frame equal to 64 and 32, respectively. Following [45], during inference, the prediction result of the whole sequence is determined by averaging the probabilities of all clips.
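The clip-based protocol can be summarized with a short sketch. The clip length and stride are hypothetical placeholders (the text only states that clips have a fixed number of frames), and the per-clip probabilities stand in for the output of any of the compared classifiers.

```python
import numpy as np

def split_into_clips(sequence, clip_len: int, stride: int):
    """Cut a frame sequence into fixed-length clips (hypothetical length and stride)."""
    last_start = len(sequence) - clip_len
    return [sequence[s:s + clip_len] for s in range(0, last_start + 1, stride)]

def predict_sequence(clip_probs):
    """Average per-clip class probabilities and return (label, mean probabilities)."""
    clip_probs = np.asarray(clip_probs, dtype=np.float64)
    mean_probs = clip_probs.mean(axis=0)
    return int(mean_probs.argmax()), mean_probs

frames = list(range(10))  # stand-in for 10 SPCV frames
clips = split_into_clips(frames, clip_len=4, stride=3)

# Stand-in classifier outputs for the three clips over two classes.
clip_probs = [[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]
label, mean_probs = predict_sequence(clip_probs)
```

Averaging probabilities rather than voting on per-clip labels lets a confident clip outweigh an uncertain one, as in this toy case where the first clip alone would have predicted class 0.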
As compared in Table III, our method outperforms existing state-of-the-art methods with the highest recognition accuracy, which reveals the enhanced feature extraction capability of our SPCV-based learning network. Moreover, our framework turns out to be much more efficient in terms of the GPU memory cost and inference speed, benefiting from the structured nature of SPCVs that can circumvent many expensive operations of indexing, sampling, and grouping.
V-C Results of Temporal Interpolation
Methods | Swing EMD | Swing CD | Longdress EMD | Longdress CD | Memory (MB) | Time (seconds) |
PointINet [38] | 15.03 | 1.70 | 10.09 | 0.95 | 6488 | 0.143 |
PSTNet2 [42] | 27.12 | 2.47 | 39.60 | 3.43 | 3009 | 0.142 |
P4Transformer [45] | 38.35 | 3.24 | 105.68 | 8.88 | 3008 | 0.089 |
PST-Transformer [12] | 32.43 | 3.02 | 116.26 | 13.83 | 3006 | 0.058 |
IDEA-Net [78] | 7.07 | 1.24 | 5.92 | 0.88 | 11428 | 0.106 |
Ours |
Following the experimental protocols in [78], we performed temporal interpolation on the DHB dataset and made comparisons with existing state-of-the-art 3D point cloud sequence processing approaches [38, 42, 45, 78, 12], where we adopted CD and EMD as quantitative metrics.
In our implementation, we benchmarked our model against the original learning framework of [78], along with the point-level task models outlined in [42, 45, 12]. Similarly to [78], we utilized EMD as the training loss function for the compared methods. All the competing methods share the same training and testing data as prepared in [78]. As depicted in Fig. 8, we constructed an image-based learning framework to fully leverage the capabilities of our regular grid representation for temporal interpolation. Thanks to the structured nature of our SPCV representation, we can directly utilize the pixel-wise Frobenius norm as the loss function.
As shown in Table IV, our approach demonstrates significant improvements over the existing state-of-the-art methods, as indicated by its lower CD and EMD metrics. Besides, our learning framework shows less GPU memory consumption and forward pass time cost. As visualized in Fig. LABEL:intpl_swing, our method distinctly outperforms the others in preserving geometric fidelity within the interpolated frames. Compared with previous methods, our SPCV representation has explicit temporal consistency, which greatly reduces the difficulty of frame interpolation for the model, thereby achieving a lower geometric error. Thanks to the regular structure of SPCVs, our method can directly use sliding window convolution operations without the need for specially designed grouping and feature aggregation methods, thus improving the processing speed. Moreover, when extracting multi-scale features, due to the regularity of each frame, we do not need to apply farthest point sampling (FPS) or other sampling methods beforehand to extract key points, which improves the speed and reduces GPU memory usage.
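A pixel-wise Frobenius-norm loss between a predicted and a ground-truth SPCV frame can be sketched as below. The function name is hypothetical, and the sketch uses NumPy rather than a deep learning framework, so it shows the quantity being minimized and not the training code itself.

```python
import numpy as np

def frobenius_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Frobenius norm of the pixel-wise difference between two (H, W, 3) frames."""
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    return float(np.sqrt((diff ** 2).sum()))

pred = np.zeros((2, 2, 3))
target = np.ones((2, 2, 3))
loss_offset = frobenius_loss(pred, target)      # sqrt(12): every entry differs by 1

# The same set of points arranged differently on the grid gives a non-zero loss.
frame = np.arange(12, dtype=np.float64).reshape(2, 2, 3)
loss_permuted = frobenius_loss(frame, frame[::-1])
```

The second case shows why this loss requires a structured representation: it compares pixels at the same grid position, so it is meaningful only when corresponding points occupy corresponding pixels, which is what the temporal consistency of SPCVs provides. Distribution-based losses such as EMD or CD are instead invariant to point order.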
V-D Results of Compression
We adopted the popular MPEG [83] and MITAMA [84] datasets and benchmarked our SPCV-based point cloud compression framework for compressing both static and dynamic point clouds. For experiments on static point cloud compression where each point cloud uniformly contains 10K points, we made comparisons with the latest version of G-PCC (TMC13 v23.0-rc2). Besides, we also included Flattening-Net [33] and RegGeoNet [32] for comparison by feeding their generated 2D representations into the H.266/VVC video encoder. For experiments on dynamic point cloud compression where each sequence consists of 5 frames and each frame contains 250K points, we made comparisons with the latest version of V-PCC (TMC2 Release 18.0).
The quantitative results depicted in Fig. 20 indicate that our SPCV-based compression framework outperforms both G-PCC and recent point cloud structurization methods. This is substantiated by achieving lower CD and Point-to-Face (P2F) metrics at the same compression ratio. The enhanced performance can be attributed to the structured nature of the SPCV representation that is conducive to the utilization of the advanced video compression techniques, and the intrinsic spatial smoothness of SPCV that further boosts the compression performance. Visual comparisons in Fig. 21 show that our compression method more accurately preserves the geometric fidelity of the surfaces reconstructed from the decoded 3D point clouds.
For point cloud sequence compression, the results in Fig. 22 demonstrate that our SPCV-based compression framework significantly outperforms V-PCC in compression efficiency, as evidenced by much lower CD and P2F metrics at the same compression ratio (or much smaller compressed file size at the same CD and P2F metrics). The SPCV’s temporal consistency enhances its inter-frame compression capabilities, improving the overall performance. Visual comparisons in Fig. LABEL:vpcc_vis demonstrate that our SPCV more faithfully preserves the geometric fidelity in the reconstructed surfaces.
V-E Ablation Studies
We conducted ablation studies to facilitate analyzing and understanding the proposed learning framework for structuring an arbitrary 3D point cloud sequence into a 2D video.
V-E1 Effectiveness of sequence-wise structurization
In the framework depicted in Fig. 5, we introduced a sequence-wise stage representing the remaining point cloud frames. To validate the importance of this stage in maintaining temporal consistency, we removed the sequence-wise processing stage and instead processed each frame independently through the frame-wise stage to obtain the SPCV. As evident from Fig. 25 (a), this setting ensures a considerable level of intra-frame smoothness but fails to maintain temporal consistency across the entire frame sequence.
V-E2 Effectiveness of geometric regularization in Eqs. (3) and (5)
We explored the effects of the geometric regularization terms used in the SPCV framework. As shown in Figs. 25 (b) and (c), the overall performance degrades when removing each individual constraint. Specifically, the removal of the spatial regularization, i.e., , impacts the model’s ability to maintain spatial smoothness within individual frames. Besides, the absence of normal regularization, i.e., , leads to a decline in both the spatial smoothness and temporal consistency, highlighting its critical role in preserving the geometric integrity of the model. Collectively, these findings underscore the importance of geometric constraints in the overall efficacy and quality of the SPCV framework.
V-E3 Impact of distance metrics
We evaluated the impact of alternative distance metrics (EMD and CD) on our framework’s performance in comparison to standard settings. As shown in Figs. 25 (d) and (e), neither EMD nor CD matches the efficacy of our original configuration. EMD prioritizes global shape alignment over maintaining point proximity during point cloud reconstruction; however, searching for such a global alignment cannot guarantee spatial smoothness and temporal consistency, even with the corresponding constraints. On the other hand, although CD is more effective than EMD in preserving smoothness, the shapes reconstructed under its alignment are far less accurate than those obtained with EMD and our selected distance metric.
VI Conclusion
This paper presents a novel generic representation for structuring dynamic 3D point cloud sequences as 2D videos. We constructed a self-supervised learning framework together with geometrically meaningful constraints for achieving spatial smoothness and temporal consistency. The generated SPCVs show satisfactory representation quality. To demonstrate the practical value, we conducted a variety of downstream tasks, including action recognition, temporal interpolation, and compression, where our SPCV-based learning frameworks outperform existing state-of-the-art approaches. Through comprehensive experimental evaluations, we demonstrated the universality and potential of the proposed SPCV representations. We believe that this study opens up many new possibilities for research on dynamic point cloud processing and learning.
References
- [1] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox, “Flownet: Learning optical flow with convolutional networks,” in Int. Conf. Comput. Vis., 2015, pp. 2758–2766.
- [2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 770–778.
- [3] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “Flownet 2.0: Evolution of optical flow estimation with deep networks,” in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 2462–2470.
- [4] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” Adv. Neural Inform. Process. Syst., vol. 27, 2014.
- [5] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 652–660.
- [6] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” in Adv. Neural Inform. Process. Syst., 2017, pp. 5099–5108.
- [7] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon, “Dynamic graph cnn for learning on point clouds,” ACM Trans. Graph., vol. 38, no. 5, pp. 1–12, 2019.
- [8] H. Zhao, L. Jiang, J. Jia, P. H. Torr, and V. Koltun, “Point transformer,” in Int. Conf. Comput. Vis., 2021, pp. 16 259–16 268.
- [9] G. Qian, Y. Li, H. Peng, J. Mai, H. A. A. K. Hammoud, M. Elhoseiny, and B. Ghanem, “Pointnext: Revisiting pointnet++ with improved training and scaling strategies,” in Adv. Neural Inform. Process. Syst., 2022.
- [10] H. Lin, X. Zheng, L. Li, F. Chao, S. Wang, Y. Wang, Y. Tian, and R. Ji, “Meta architecture for point cloud analysis,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 17 682–17 691.
- [11] X. Liu, M. Yan, and J. Bohg, “Meteornet: Deep learning on dynamic 3d point cloud sequences,” in Int. Conf. Comput. Vis., 2019, pp. 9246–9255.
- [12] H. Fan, Y. Yang, and M. Kankanhalli, “Point spatio-temporal transformer networks for point cloud video modeling,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 2, pp. 2181–2192, 2022.
- [13] H. Fan, H. Su, and L. J. Guibas, “A point set generation network for 3d object reconstruction from a single image,” in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 605–613.
- [14] J. Li, B. M. Chen, and G. H. Lee, “So-net: Self-organizing network for point cloud analysis,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 9397–9406.
- [15] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen, “Pointcnn: Convolution on X-transformed points,” in Adv. Neural Inform. Process. Syst., 2018, pp. 828–838.
- [16] H. Thomas, C. R. Qi, J.-E. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas, “Kpconv: Flexible and deformable convolution for point clouds,” in Int. Conf. Comput. Vis., 2019, pp. 6411–6420.
- [17] Y. Liu, B. Fan, S. Xiang, and C. Pan, “Relation-shape convolutional neural network for point cloud analysis,” in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 8895–8904.
- [18] X. Yan, C. Zheng, Z. Li, S. Wang, and S. Cui, “Pointasnl: Robust point clouds processing using nonlocal neural networks with adaptive sampling,” in IEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 5589–5598.
- [19] M. Xu, R. Ding, H. Zhao, and X. Qi, “Paconv: Position adaptive convolution with dynamic kernel assembling on point clouds,” in IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 3173–3182.
- [20] T. Xiang, C. Zhang, Y. Song, J. Yu, and W. Cai, “Walk in the cloud: Learning curves for point clouds shape analysis,” in Int. Conf. Comput. Vis., 2021, pp. 915–924.
- [21] N. Verma, E. Boyer, and J. Verbeek, “Feastnet: Feature-steered graph convolutions for 3d shape analysis,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 2598–2606.
- [22] Q. Xu, X. Sun, C.-Y. Wu, P. Wang, and U. Neumann, “Grid-gcn for fast and scalable point cloud learning,” in IEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 5661–5670.
- [23] M.-H. Guo, J.-X. Cai, Z.-N. Liu, T.-J. Mu, R. R. Martin, and S.-M. Hu, “Pct: Point cloud transformer,” Computational Visual Media, vol. 7, no. 2, pp. 187–199, 2021.
- [24] C. Park, Y. Jeong, M. Cho, and J. Park, “Fast Point Transformer,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 16 949–16 958.
- [25] X. Yu, L. Tang, Y. Rao, T. Huang, J. Zhou, and J. Lu, “Point-bert: Pre-training 3d point cloud transformers with masked point modeling,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 19 313–19 322.
- [26] X. Ma, C. Qin, H. You, H. Ran, and Y. Fu, “Rethinking network design and local geometry in point cloud: A simple residual MLP framework,” in Int. Conf. Learn. Represent., 2022.
- [27] X. Gu, S. J. Gortler, and H. Hoppe, “Geometry images,” in Proceedings of the 29th annual conference on Computer graphics and interactive techniques, 2002, pp. 355–361.
- [28] A. Sinha, J. Bai, and K. Ramani, “Deep learning 3d shape surfaces using geometry images,” in Eur. Conf. Comput. Vis., 2016, pp. 223–240.
- [29] A. Sinha, A. Unmesh, Q. Huang, and K. Ramani, “Surfnet: Generating 3d shape surfaces using deep residual networks,” in IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 6040–6049.
- [30] H. Maron, M. Galun, N. Aigerman, M. Trope, N. Dym, E. Yumer, V. G. Kim, and Y. Lipman, “Convolutional neural networks on surfaces via seamless toric covers,” ACM Trans. Graph., vol. 36, no. 4, pp. 71–1, 2017.
- [31] N. Haim, N. Segol, H. Ben-Hamu, H. Maron, and Y. Lipman, “Surface networks via general covers,” in Int. Conf. Comput. Vis., 2019, pp. 632–641.
- [32] Q. Zhang, J. Hou, Y. Qian, A. B. Chan, J. Zhang, and Y. He, “Reggeonet: Learning regular representations for large-scale 3d point clouds,” Int. J. Comput. Vis., vol. 130, no. 12, pp. 3100–3122, 2022.
- [33] Q. Zhang, J. Hou, Y. Qian, Y. Zeng, J. Zhang, and Y. He, “Flattening-net: Deep regular 2d representation for 3d point cloud analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 8, pp. 9726–9742, 2023.
- [34] Y. Yang, C. Feng, Y. Shen, and D. Tian, “Foldingnet: Point cloud auto-encoder via deep grid deformation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 206–215.
- [35] T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry, “A papier-mâché approach to learning 3d surface generation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 216–224.
- [36] J. Pang, D. Li, and D. Tian, “Tearingnet: Point cloud autoencoder to learn topology-friendly representations,” in IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 7453–7462.
- [37] Y. Zhang, H. Fan, Y. Yang, and M. Kankanhalli, “Dpmix: Mixture of depth and point cloud video experts for 4d action segmentation,” arXiv preprint arXiv:2307.16803, 2023.
- [38] F. Lu, G. Chen, S. Qu, Z. Li, Y. Liu, and A. Knoll, “Pointinet: Point cloud frame interpolation network,” in AAAI, 2021.
- [39] Y. Zeng, Y. Qian, Q. Zhang, J. Hou, Y. Yuan, and Y. He, “Idea-net: Dynamic 3d point cloud interpolation via deep embedding alignment,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 6338–6347.
- [40] Z. Zheng, D. Wu, R. Lu, F. Lu, G. Chen, and C. Jiang, “Neuralpci: Spatio-temporal neural field for 3d point cloud multi-frame non-linear interpolation,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 909–918.
- [41] H. Fan, X. Yu, Y. Ding, Y. Yang, and M. S. Kankanhalli, “Pstnet: Point spatio-temporal convolution on point cloud sequences,” in Int. Conf. Learn. Represent., 2021.
- [42] H. Fan, X. Yu, Y. Yang, and M. Kankanhalli, “Deep hierarchical representation of point cloud videos via spatio-temporal decomposition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 12, pp. 9918–9930, 2021.
- [43] Z. Shen, X. Sheng, H. Fan, L. Wang, Y. Guo, Q. Liu, H. Wen, and X. Zhou, “Masked spatio-temporal structure prediction for self-supervised learning on point cloud videos,” in Int. Conf. Comput. Vis., 2023, pp. 16 580–16 589.
- [44] C. Feichtenhofer, Y. Li, K. He et al., “Masked autoencoders as spatiotemporal learners,” Adv. Neural Inform. Process. Syst., vol. 35, pp. 35 946–35 958, 2022.
- [45] H. Fan, Y. Yang, and M. S. Kankanhalli, “Point 4d transformer networks for spatio-temporal modeling in point cloud videos,” in IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 14 204–14 213.
- [46] Y. Wei, H. Liu, T. Xie, Q. Ke, and Y. Guo, “Spatial-temporal transformer for 3d point cloud sequences,” in IEEE Winter Conf. on App. of Comput. Vis., 2022, pp. 1171–1180.
- [47] M. Niemeyer, L. Mescheder, M. Oechsle, and A. Geiger, “Occupancy flow: 4d reconstruction by learning particle dynamics,” in Int. Conf. Comput. Vis., 2019, pp. 5379–5389.
- [48] D. Rempe, T. Birdal, Y. Zhao, Z. Gojcic, S. Sridhar, and L. J. Guibas, “Caspr: Learning canonical spatiotemporal point cloud representations,” in Adv. Neural Inform. Process. Syst., 2020, pp. 13 688–13 701.
- [49] J. Lei and K. Daniilidis, “Cadex: Learning canonical deformation coordinate space for dynamic surface representation via neural homeomorphism,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 6624–6634.
- [50] Y. Wei, Z. Wang, Y. Rao, J. Lu, and J. Zhou, “Pv-raft: Point-voxel correlation fields for scene flow estimation of point clouds,” in IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 6954–6963.
- [51] X. Li, J. Kaesemodel Pontes, and S. Lucey, “Neural scene flow prior,” Adv. Neural Inform. Process. Syst., vol. 34, pp. 7838–7851, 2021.
- [52] P. He, P. Emami, S. Ranka, and A. Rangarajan, “Learning scene dynamics from point cloud sequences,” Int. J. Comput. Vis., vol. 130, no. 3, pp. 669–695, 2022.
- [53] Y. Wang, Y. Xiao, F. Xiong, W. Jiang, Z. Cao, J. T. Zhou, and J. Yuan, “3dv: 3d dynamic voxel for action recognition in depth video,” in IEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 511–520.
- [54] Z. Li, T. Li, and A. B. Farimani, “Tpu-gan: Learning temporal coherence from dynamic point cloud sequences,” in Int. Conf. Learn. Represent., 2021.
- [55] J.-X. Zhong, K. Zhou, Q. Hu, B. Wang, N. Trigoni, and A. Markham, “No pain, big gain: classify dynamic point cloud sequences with static models by fitting feature-level space-time surfaces,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 8510–8520.
- [56] X. Sheng, Z. Shen, G. Xiao, L. Wang, Y. Guo, and H. Fan, “Point contrastive prediction with semantic clustering for self-supervised learning on point cloud videos,” in Int. Conf. Comput. Vis., 2023, pp. 16 515–16 524.
- [57] M. Eisenberger, D. Novotny, G. Kerchenbaum, P. Labatut, N. Neverova, D. Cremers, and A. Vedaldi, “Neuromorph: Unsupervised shape interpolation and correspondence in one go,” in IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 7473–7483.
- [58] Y. Zeng, Y. Qian, Z. Zhu, J. Hou, H. Yuan, and Y. He, “Corrnet3d: Unsupervised end-to-end learning of dense correspondence for 3d point clouds,” in IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 6052–6061.
- [59] M. Quach, G. Valenzise, and F. Dufaux, “Folding-based compression of point cloud attributes,” in IEEE Int. Conf. Image Process., 2020, pp. 3309–3313.
- [60] D. Graziosi, O. Nakagami, S. Kuma, A. Zaghetto, T. Suzuki, and A. Tabatabai, “An overview of ongoing point cloud compression standardization activities: Video-based (v-pcc) and geometry-based (g-pcc),” APSIPA Transactions on Signal and Information Processing, vol. 9, p. e13, 2020.
- [61] T. Nguyen, Q.-H. Pham, T. Le, T. Pham, N. Ho, and B.-S. Hua, “Point-set distances for learning representations of 3d point clouds,” in Int. Conf. Comput. Vis., 2021, pp. 10 478–10 487.
- [62] A. Sheffer, E. Praun, and K. Rose, “Mesh parameterization methods and their applications,” Found. Trends Comput. Graph. Vis., vol. 2, no. 2, pp. 105–171, 2006.
- [63] N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, and A. Courville, “On the spectral bias of neural networks,” in Int. Conf. Mach. Learn., 2019, pp. 5301–5310.
- [64] S. Ren and J. Hou, “Unleash the potential of 3d point cloud modeling with a calibrated local geometry-driven distance metric,” arXiv preprint arXiv:2306.00552, 2023.
- [65] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [66] H. M. Briceno, P. V. Sander, L. McMillan, S. Gortler, and H. Hoppe, “Geometry videos,” in Eurographics/SIGGRAPH symposium on computer animation (SCA), 2003.
- [67] J. Xia, Y. He, D. P. Quynh, X. Chen, and S. C. Hoi, “Modeling 3d facial expressions using geometry videos,” in Proceedings of the 18th ACM international conference on Multimedia, 2010, pp. 591–600.
- [68] D. T. Quynh, Y. He, X. Chen, J. Xia, Q. Sun, and S. C. Hoi, “Modeling 3d articulated motions with conformal geometry videos (cgvs),” in Proceedings of the 19th ACM international conference on Multimedia, 2011, pp. 383–392.
- [69] H. Fan, X. Yu, Y. Ding, Y. Yang, and M. Kankanhalli, “Pstnet: Point spatio-temporal convolution on point cloud sequences,” in Int. Conf. Learn. Represent., 2021.
- [70] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 6450–6459.
- [71] T. Kalluri, D. Pathak, M. Chandraker, and D. Tran, “Flavr: Flow-agnostic video representations for fast frame interpolation,” in IEEE Winter Conf. on App. of Comput. Vis., 2023, pp. 2071–2082.
- [72] S. Schwarz, M. Preda, V. Baroncini, M. Budagavi, P. Cesar, P. A. Chou, R. A. Cohen, M. Krivokuća, S. Lasserre, Z. Li et al., “Emerging mpeg standards for point cloud compression,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 1, pp. 133–148, 2018.
- [73] H. Liu, H. Yuan, Q. Liu, J. Hou, and J. Liu, “A comprehensive study and comparison of core technologies for mpeg 3-d point cloud compression,” IEEE Transactions on Broadcasting, vol. 66, no. 3, pp. 701–717, 2019.
- [74] M. Quach, J. Pang, D. Tian, G. Valenzise, and F. Dufaux, “Survey on deep learning-based point cloud compression,” Frontiers in Signal Processing, vol. 2, p. 846972, 2022.
- [75] J. Hou, L.-P. Chau, N. Magnenat-Thalmann, and Y. He, “Compressing 3-d human motions via keyframe-based geometry videos,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 1, pp. 51–62, 2014.
- [76] J. Hou, L.-P. Chau, M. Zhang, N. Magnenat-Thalmann, and Y. He, “A highly efficient compression framework for time-varying 3-d facial expressions,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 9, pp. 1541–1553, 2014.
- [77] International Telecommunication Union, “H.266: Versatile video coding,” 2021, archived from the original on 21 June 2021. Retrieved on 21 June 2021. [Online]. Available: https://www.itu.int/rec/T-REC-H.266
- [78] Y. Zeng, Y. Qian, Q. Zhang, J. Hou, Y. Yuan, and Y. He, “Idea-net: Dynamic 3d point cloud interpolation via deep embedding alignment,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 6338–6347.
- [79] L. Yu, X. Li, C.-W. Fu, D. Cohen-Or, and P.-A. Heng, “Pu-net: Point cloud upsampling network,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 2790–2799.
- [80] D. Vlasic, I. Baran, W. Matusik, and J. Popovic, “Articulated mesh animation from multi-view silhouettes (data link),” https://people.csail.mit.edu/drdaniel/mesh_animation/#data, 2023, accessed: 2023-11-2.
- [81] R. W. Sumner and J. Popović, “Deformation transfer for triangle meshes,” ACM Trans. Graph., vol. 23, no. 3, pp. 399–405, 2004.
- [82] Y. Li, H. Takehara, T. Taketomi, B. Zheng, and M. Nießner, “4dcomplete: Non-rigid motion estimation beyond the observable surface,” in Int. Conf. Comput. Vis., 2021, pp. 12 706–12 716.
- [83] Y. Xu, Y. Lu, and Z. Wen, “Owlii Dynamic human mesh sequence dataset,” ISO/IEC JTC1/SC29/WG11 m41658, 120th MPEG Meeting, Macau, Oct 2017.
- [84] D. Vlasic, I. Baran, W. Matusik, and J. Popovic, “Articulated mesh animation from multi-view silhouettes,” in SIGGRAPH, 2008.
Yiming Zeng received his B.S. degree in Automation from the South China University of Technology, Guangzhou, China, in 2019, and his Ph.D. degree in Computer Science from the City University of Hong Kong in 2024. His research interests include deep learning processing and the development of geometrically meaningful representations for both static and dynamic 3D point clouds.
Junhui Hou (Senior Member, IEEE) is an Associate Professor with the Department of Computer Science, City University of Hong Kong. He holds a B.Eng. degree in information engineering (Talented Students Program) from the South China University of Technology, Guangzhou, China (2009), an M.Eng. degree in signal and information processing from Northwestern Polytechnical University, Xi’an, China (2012), and a Ph.D. degree from the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore (2016). His research interests are multi-dimensional visual computing. Dr. Hou received the Early Career Award (3/381) from the Hong Kong Research Grants Council in 2018. He is an elected member of IEEE MSATC, VSPC-TC, and MMSP-TC. He has served or is serving as an Associate Editor for IEEE Transactions on Visualization and Computer Graphics, IEEE Transactions on Image Processing, IEEE Transactions on Circuits and Systems for Video Technology, Signal Processing: Image Communication, and The Visual Computer.
Qijian Zhang received the B.S. degree in Electronic Information Science and Technology from Beijing Normal University, Beijing, China, in 2019. He is currently a Ph.D. student (2020-present) in the Department of Computer Science, City University of Hong Kong, HKSAR. His research interests include geometry processing, 3D computer vision, computer graphics, and cross-modal learning.
Siyu Ren received the B.S. degree in Optoelectronic Information Science and Engineering from Tianjin University, Tianjin, China, in 2018. He is currently pursuing the Ph.D. degree in Computer Science at the City University of Hong Kong and in Optical Engineering at Tianjin University. His research interests include deep learning and 3D point cloud processing.
Wenping Wang (Fellow, IEEE) received the Ph.D. degree in computer science from the University of Alberta in 1992. He is a Professor of computer science at Texas A&M University. His research interests include computer graphics, computer visualization, computer vision, robotics, medical image processing, and geometric computing, and he has published over 180 journal papers in these fields. He is a journal associate editor of Computer Aided Geometric Design (CAGD), Computer Graphics Forum (CGF), and IEEE Transactions on Visualization and Computer Graphics. He has chaired a number of international conferences, including Pacific Graphics 2012, ACM Symposium on Physical and Solid Modeling (SPM) 2013, SIGGRAPH Asia 2013, and Geometry Summit 2019. Prof. Wang received the John Gregory Memorial Award for his contributions to geometric modeling. He is an ACM Fellow.