Gaussian-Informed Continuum for Physical Property Identification and Simulation

Junhao Cai¹∗ Yuji Yang²∗ Weihao Yuan³ Yisheng He³
Zilong Dong³ Liefeng Bo³ Hui Cheng² Qifeng Chen¹
¹The Hong Kong University of Science and Technology, ²Sun Yat-sen University, ³Alibaba Group
∗ Equal contribution, order determined by coin toss

Abstract

This paper studies the problem of estimating physical properties (system identification) through visual observations. To facilitate geometry-aware guidance in physical property estimation, we introduce a novel hybrid framework that leverages 3D Gaussian representation to not only capture explicit shapes but also enable the simulated continuum to deduce implicit shapes during training. We propose a new dynamic 3D Gaussian framework based on motion factorization to recover the object as 3D Gaussian point sets across different time states. Furthermore, we develop a coarse-to-fine filling strategy to generate the density fields of the object from the Gaussian reconstruction, allowing for the extraction of object continuums along with their surfaces and the integration of Gaussian attributes into these continuums. In addition to the extracted object surfaces, the Gaussian-informed continuum also enables the rendering of object masks during simulations, serving as implicit shape guidance for physical property estimation. Extensive experimental evaluations demonstrate that our pipeline achieves state-of-the-art performance across multiple benchmarks and metrics. Additionally, we illustrate the effectiveness of the proposed method through real-world demonstrations, showcasing its practical utility. Our project page is at https://jukgei.github.io/project/gic.

1 Introduction

Identifying the physical properties of objects (i.e., system identification) is essential for numerous applications such as games, digital twins, and robotic manipulation [1, 2, 3]. Although humans can intuitively deduce the underlying physical properties with a single glance when the object undergoes deformation, estimating the properties with only visual observations remains challenging for computational perceptual algorithms.

To tackle this challenge, many established methods [4, 5, 6] adopt the assumption of elastic material [7] and perform physics-based modeling based on mass-spring systems (MSS) or the finite element method (FEM) to model and simulate the dynamics of the objects. Such an assumption inevitably restricts the ability to simulate more general material types beyond elastic ones, such as fluids or granular media. Another limitation is that many previous methods [8, 9, 10] require full ground-truth knowledge of the object geometry for identification, which limits their practicality. Some subsequent methods [5, 4] instead recover the geometries and physical properties from observations in a decoupled manner. Specifically, these methods first extract object geometries from stereo observations or by dynamic neural reconstruction [11] from RGB video sequences, and then perform simulation either directly on the point clouds or after conversion to a tetrahedral mesh. While these methods introduce explicit geometries to guide the estimation of physical properties, the noisy reconstruction results usually lead to degraded system identification performance.

Recently, PAC-NeRF [12] integrates neural radiance fields (NeRF) [13] with a continuum dynamic model to tackle the above problems. The object geometries and physical properties are captured in a unified framework. Despite its effectiveness, this method possesses two limitations. Firstly, the implicit shapes represented by NeRF often lead to inferior geometries, which might cause inaccurate trajectories during simulation. Secondly, PAC-NeRF renders the novel views of deformed objects based on the appearance radiance field reconstructed from the static scene, which might introduce texture distortion, particularly when objects undergo significant deformations, resulting in discrepancies between the rendered and the observed images [14].

To address these limitations, this paper proposes a novel hybrid solution based on 3D Gaussians [15, 16] and material point method (MPM) [17, 18]. The core strength of this work is that we make use of both explicit shapes from dynamic 3D Gaussian reconstruction and implicit shapes rendered by the Gaussian-informed continuum for physical property estimation.

To generate more precise shapes for reasoning about physical properties, we first propose a motion-factorized dynamic 3D Gaussian network to conduct dynamic scene reconstruction. We then extract the continuum from the recovered 3D Gaussians at each frame by leveraging a coarse-to-fine filling strategy that progressively generates the density field of the object. The resulting density fields can be used to sample continuum particles for simulation and to extract object surfaces as explicit-shape supervision in physical property estimation. To eliminate the appearance distortion caused by large deformation in PAC-NeRF, we further assign Gaussian attributes to the continuum particles, where the opacity and scale attributes are evaluated from the density field. Such a Gaussian-informed continuum is able to render object masks during simulation, which can be regarded as an implicit-shape representation that guides the estimation and avoids using inferior rendering results for learning physical properties.

To demonstrate the superiority of the proposed method over other baselines, we conduct three types of experiments, including evaluations of physical properties, dynamic reconstruction, and future state simulation. We also demonstrate a real-world application in digital twins and robotic manipulation, showing the applicability of the proposed method in real-world scenarios.

Our contributions are summarized as follows.

• We propose a novel hybrid pipeline that takes advantage of the 3D Gaussian representation of the object to both acquire explicit shapes and empower the simulated continuum to infer implicit shapes for physical property estimation.
• We propose a novel dynamic 3D Gaussian framework with motion factorization to achieve more precise dynamic reconstruction. We also propose a coarse-to-fine filling strategy to generate the density field of the object, which can be utilized to extract object surfaces and obtain Gaussian-informed continuum particles.
• Extensive experiments show that our pipeline attains state-of-the-art performance on existing benchmarks with a wide range of metrics. We also present a real-world demonstration to show the efficiency of the proposed method.

2 Related Work

Dynamic reconstruction. Reconstructing dynamic scenes from monocular or multi-view video(s) is a long-standing problem in the computer vision community [19, 20]. Previous works exploit neural implicit representations [21, 22] for non-rigid reconstruction. These methods either reconstruct the scene in a frame-wise manner [23, 24] or maintain a canonical shape and model the deformation with a neural network [25, 26, 11, 27]. While effective for novel view synthesis, these methods often require extensive training time and can result in noisy deformations owing to the implicit representation, which may compromise the utility of the recovered geometries for physical property estimation [12]. Recently, 3D Gaussian Splatting (3DGS) [15] has become a prevalent method for 3D reconstruction and novel view synthesis because of its explicit shape modeling and extremely fast view rendering. Similar to non-rigid NeRF, many follow-up works extend 3DGS into 4D by treating each frame separately [28] or by decomposing a scene into a canonical 3D Gaussian point cloud and a deformation model that warps the canonical shape into a specific scene [16, 29, 30]. In this paper, we draw upon these prior studies [16, 29] and propose a novel motion-factorized dynamic 3D Gaussian network to achieve better performance on reconstruction and novel view synthesis.

System identification. Understanding the physical laws of the 3D world is beneficial for simulation [31, 32, 6] and manipulation [2, 3, 33]. However, unveiling these properties from visual information is an extremely difficult task due to the ambiguity introduced by incomplete observation and the high degrees of freedom of the scene. Early works [34, 35] study the problem by learning physical properties via interactions. With recent improvements in differentiable physics simulation [17, 18, 36, 37, 38, 39, 40], many methods turn to estimating the physical properties by comparing the rendering results with 2D ground truth, given prior knowledge of the object geometry. VEO [5] presents a differentiable simulator to learn patterns from 4D reconstruction and force-displacement measurements. Another approach [4] eliminates the dependence on captured forces by proposing a framework that iterates between deformation tracking and parameter optimization. While these methods demonstrate promising results, the inferior reconstruction might lead to degraded performance, and the assumption of elastic material restricts the applicability. PAC-NeRF [12] instead proposes a single framework to recover both the unknown geometry and physical properties of deformable objects from multi-view video sequences. However, the inferior geometries and blurry rendered images might have detrimental effects on physical property reasoning. In this work, we adopt MPM as our simulation framework, following the approach used in PAC-NeRF, due to its ability to simulate a variety of materials [6, 41, 42, 43]. Unlike previous approaches, we utilize dynamic 3D Gaussians to reconstruct explicit 3D geometries and generate simulatable continuum particles. Furthermore, we enhance the particles with Gaussian attributes, which facilitates the rendering of implicit 2D shapes and thereby improves physical parameter estimation.

Figure 1: Overview. (a) Continuum Generation: Given a series of multi-view images capturing a moving object, the motion-factorized dynamic 3D Gaussian network is trained to reconstruct the dynamic object as 3D Gaussian point sets across different time states. From the reconstructed results, we employ a coarse-to-fine strategy to generate density fields to recover the continuums and extract object surfaces. The continuum is endowed with Gaussian attributes to allow mask rendering. (b) Identification: The MPM simulates the trajectory with the initial continuum $\mathbb{P}(0)$ and the physical parameters $\Theta$ . The simulated object surfaces and the rendered masks are then compared against the previously extracted surfaces (colored in blue) and the corresponding masks from the dataset. The differences are quantified to guide the parameter estimation process. (c) Simulation: Digital twin demonstrations are displayed. Simulated objects (colored by stress increasing from blue to red), characterized by the properties estimated from observation, exhibit behavior consistent with real-world objects.

3 Preliminary

In this section, we briefly review the core idea of 3D Gaussian Splatting (3DGS) [15] and introduce its point-based alpha blending to render depth maps and foreground masks. Typically, 3DGS utilizes 3D Gaussians, each defined by a central point $\mu_{0}$ , a covariance matrix $\Sigma_{0}$ , a density value $\sigma$ , and a color attribute $c$ , to efficiently render images from specific viewpoints. Each point is denoted as

G(x)=\exp(-\frac{1}{2}(x-\mu_{0})^{T}\Sigma_{0}^{-1}(x-\mu_{0})),

(1)

where $\Sigma_{0}$ can be factorized as $\Sigma_{0}=R_{0}S_{0}S_{0}^{T}R_{0}^{T}$ , in which $R_{0}$ is a rotation matrix represented by a quaternion vector $r_{0}\in\mathbb{R}^{4}$ , and $S_{0}$ is a diagonal scaling matrix characterized by a 3D vector $s_{0}\in\mathbb{R}^{3}$ . If we consider an isotropic Gaussian representation, the scaling matrix can be written as $s_{0}I$ , where $s_{0}$ is a scalar and $I$ is the identity matrix. When performing splatting, the 3D Gaussians are projected into 2D with the covariance matrix defined as $\Sigma_{0}^{\prime}=JW{\Sigma_{0}}W^{T}J^{T}$ , where $J$ is the Jacobian of the affine approximation of the projective transformation [44], and $W$ is the viewing transformation matrix. The rendered color $I(u)$ and its foreground mask $A(u)$ at pixel $u$ are then evaluated by integrating $N$ ordered splatted Gaussians via point-based alpha blending. Since the depth of each Gaussian point at a specific view can be obtained from its transformation matrix, we can further render the depth map $D$ using the same blending method [16, 45], as

I(u)=\sum_{i\in N}T_{i}\alpha_{i}c_{i},\qquad A(u)=\sum_{i\in N}T_{i}\alpha_{i},\qquad D(u)=\sum_{i\in N}T_{i}\alpha_{i}d_{i},

(2)

where $T_{i}=\prod_{j=1}^{i-1}(1-\alpha_{j})$ is the accumulated transmittance, $\alpha_{i}$ is the probability of termination at point $i$ , and $d_{i}$ is the depth of the Gaussian point at the specific view.
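The blending in Eqn. 2 can be illustrated for a single pixel with a short front-to-back loop. This is only a minimal sketch: the function name `alpha_blend`, the single-channel colors, and the toy input values are assumptions made for illustration, and an actual 3DGS rasterizer performs this step per tile on the GPU.

```python
def alpha_blend(alphas, colors, depths):
    """Front-to-back point-based alpha blending for one pixel (Eqn. 2).

    alphas, colors, depths hold one value per splatted Gaussian, already
    sorted from nearest to farthest. Colors are scalars (one channel) to
    keep the sketch short.
    """
    transmittance = 1.0  # T_1 is an empty product, hence 1
    color = mask = depth = 0.0
    for a, c, d in zip(alphas, colors, depths):
        weight = transmittance * a  # T_i * alpha_i
        color += weight * c         # contribution to I(u)
        mask += weight              # contribution to A(u)
        depth += weight * d         # contribution to D(u)
        transmittance *= 1.0 - a    # T_{i+1} = T_i * (1 - alpha_i)
    return color, mask, depth


# Hypothetical pixel: a semi-transparent splat at depth 1 in front of an opaque one at depth 3.
color, mask, depth = alpha_blend(alphas=[0.5, 1.0], colors=[1.0, 0.0], depths=[1.0, 3.0])
```

In this toy case the blended depth falls between the two splats rather than on the nearest one, which is the accumulation effect that makes rendered depth maps only an approximate description of the object surface.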

4 Method

4.1 Problem Definition and Overview

In this work, we aim to reconstruct the geometries and the physical properties of various object types from multi-view videos. Formally, given a set of video sequences $\{V_{i}|i=1...n\}$ of a moving object and the corresponding camera extrinsic and intrinsic parameters $\{(T_{i},K_{i})|i=1...n\}$ , the goal of this task is to recover the explicit geometries of the object represented by continuum particles $P(t)$ and its corresponding physical parameters $\Theta$ (e.g., Young’s modulus $E$ and Poisson’s ratio $\nu$ for elastic objects). We follow the assumption in PAC-NeRF and PhysGaussian [12, 46] that the object types (e.g., elastic, granular, Newtonian/non-Newtonian, plastic) are known and that the physical phenomenon follows continuum mechanics [17, 47].

The overview of the proposed pipeline is illustrated in Fig. 1, which consists of three modules: a motion-factorized dynamic 3D Gaussian network (Sec. 4.2) for 4D reconstruction of the object, a coarse-to-fine density field generation strategy (Sec. 4.3) for continuum generation, surface extraction, and Gaussian attribute assignment, and a procedure (Sec. 4.4) showing how we leverage Gaussian-informed continuum and extracted surfaces to estimate physical properties.

4.2 Motion-factorized Dynamic 3D Gaussian Network

Our dynamic 3D Gaussian network follows existing frameworks [16, 29, 30] that simultaneously maintain a canonical 3D Gaussian set and a deformation field modeled by a neural network to warp the canonical shape into object states at specific times. The core idea of this pipeline, presented in Fig. 2, is that the motion of every point in the object can be decomposed into a small number of motion bases.

Architecture. We first factorize the entire motion into $N_{m}$ bases that are modeled by a fully connected neural network, where every basis shares a common backbone except the final layer. The output of each basis consists of the deformations at position $d\mu_{i}(t)\in\mathbb{R}^{3}$ and at scale $ds_{i}(t)\in\mathbb{R}$ . To model the exact deformation for each position, we next propose a lightweight coefficient network that maps the positions at canonical space with specific time to their corresponding motion coefficients $w(\mu_{0},t)\in\mathbb{R}^{N_{m}}$ . Therefore, the deformed position and the scale for each Gaussian point are evaluated by the linear combination of the motion basis according to the motion coefficients:

\mu(t)=\mu_{0}+\sum_{i=1}^{N_{m}}{w_{i}(\mu_{0},t)d\mu_{i}(t)},\qquad s(t)=s_{0}+\sum_{i=1}^{N_{m}}{w_{i}(\mu_{0},t)ds_{i}(t)}.

(3)

In this work, we regard all the Gaussians as isotropic kernels, which has been demonstrated to be an efficient way to simplify the model and better reconstruct the scene [6, 48]. Note that although previous works [29, 49] also perform motion decomposition, our pipeline differs in two major ways: 1) instead of modeling each basis with an independent neural network, our module shares a common backbone. Our key observation is that, when reconstructing a dynamic object, all points on the object should follow a similar moving tendency, and the final heads of the neural network are sufficient to model the details of different parts of the object; 2) to better fit the high-rank motion of the dynamic scene [16], we model the motion coefficients as time-variant variables rather than constant Gaussian attributes [29].
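The linear combination in Eqn. 3 reduces to two matrix products once the network outputs are available. The sketch below assumes, following the notation of Eqn. 3, that the basis outputs depend on time only and that the coefficients are given per Gaussian; the function name `deform_gaussians` and the hand-written arrays standing in for the basis and coefficient networks are hypothetical.

```python
import numpy as np


def deform_gaussians(mu0, s0, w, dmu, ds):
    """Apply Eqn. 3 to N isotropic Gaussians at one time step.

    mu0: (N, 3) canonical centers      s0: (N,) canonical scales
    w:   (N, N_m) motion coefficients  w_i(mu0, t)
    dmu: (N_m, 3) basis position offsets d mu_i(t)
    ds:  (N_m,) basis scale offsets      d s_i(t)
    """
    mu_t = mu0 + w @ dmu  # sum_i w_i * d mu_i
    s_t = s0 + w @ ds     # sum_i w_i * d s_i
    return mu_t, s_t


# Two Gaussians and two bases; the values stand in for network outputs.
mu0 = np.zeros((2, 3))
s0 = np.array([0.10, 0.10])
w = np.array([[1.0, 0.0],   # first Gaussian follows basis 1 only
              [0.5, 0.5]])  # second Gaussian mixes both bases
dmu = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0]])
ds = np.array([0.02, -0.02])
mu_t, s_t = deform_gaussians(mu0, s0, w, dmu, ds)
```

Because the coefficients are evaluated per position and per time step, two Gaussians can share the same bases yet move differently, which is the flexibility the time-variant coefficients are meant to provide.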

Optimization. We employ the same setting as in [16] to train our pipeline. Concretely, the canonical 3D Gaussians are initialized with points randomly sampled from the given bounding box of the scene. We start training the deformation network after 3,000 iterations of warm-up for the 3D Gaussians. Similar to previous works [16, 29], we optimize the pipeline by computing the L1 norm and Structural Similarity Index Measure (SSIM) between the rendered image $I$ and the ground truth image $\tilde{I}$ . Moreover, since large scales may lead to inaccurate reconstructed shapes [50], we apply an L1 penalty to the scale attributes of all the points to recover more fine-grained shapes of the object. The overall loss function is defined as:

\mathcal{L}_{gs}=\mathcal{L}_{1}(I,\tilde{I})+\lambda_{1}\mathcal{L}_{ssim}(I,\tilde{I})+\lambda_{2}\mathcal{L}_{1}(s(t)),

(4)

where $\lambda_{1}$ and $\lambda_{2}$ are balancing hyperparameters. A more in-depth analysis of the proposed pipeline, including implementation details and the effects of scale regularization, is presented in Appendix A.1.
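The structure of Eqn. 4 can be sketched as follows. Several choices here are assumptions rather than details taken from the paper: the SSIM term is taken as $1-\mathrm{SSIM}$, SSIM is computed over the whole image as a single window instead of with local windows, the scale penalty is averaged over points, and the weights `lam1` and `lam2` are placeholder values.

```python
import numpy as np


def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM over the whole image as one window (simplification; images in [0, 1])."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))


def gs_loss(rendered, target, scales, lam1=0.2, lam2=0.01):
    """L1 image term + lam1 * SSIM term + lam2 * L1 penalty on Gaussian scales."""
    l1 = np.abs(rendered - target).mean()
    l_ssim = 1.0 - global_ssim(rendered, target)  # assumed form of L_ssim
    l_scale = np.abs(scales).mean()               # discourages overly large Gaussians
    return l1 + lam1 * l_ssim + lam2 * l_scale


img = np.linspace(0.0, 1.0, 16).reshape(4, 4)  # toy grayscale image
scales = np.full(5, 0.1)                       # toy per-Gaussian scales
loss_same = gs_loss(img, img, scales)          # only the scale penalty remains
loss_diff = gs_loss(img, 1.0 - img, scales)    # image terms now contribute
```

When the rendering matches the target, only the scale regularizer is left, so shrinking the Gaussians is the one remaining way to reduce the loss.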

4.3 Gaussian-informed Continuum Generation

Coarse-to-fine density field generation. Since the reconstructed Gaussian particles serve rendering only and are therefore not evenly distributed over the object, they cannot be directly used for simulation [46]. Therefore, we propose a novel coarse-to-fine filling strategy to iteratively generate density fields of the object based on the reconstructed Gaussian particles from Eqn. 3 and the internal particles filtered by the rendered depth maps. The proposed strategy is presented in Alg. 1. The implementation details and visual results are illustrated in Appendix A.2.

Concretely, the internal particles, initialized by uniform sampling from the bounding box of Gaussian particles, are filtered by projecting the particles to various images to compare the projected depth with rendered depth values (lines 1-6 in Alg. 1). The resulting particles can roughly represent the shape of the object. However, as denoted in Eqn. 2, the rendered depth maps are evaluated in an accumulated manner, making them less precise in representing the object surface.
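The depth-based filtering for a single view can be sketched as below. The sketch assumes a pinhole camera with a world-to-camera extrinsic matrix, nearest-pixel rounding, and that particles projecting outside the image are discarded; the function name `filter_internal_particles` is hypothetical, and background handling (for example via the object mask) is left out.

```python
import numpy as np


def filter_internal_particles(points, depth_map, T, K):
    """Keep particles lying at or behind the rendered surface in one view.

    points: (N, 3) world coordinates   depth_map: (H, W) rendered depth
    T: (4, 4) world-to-camera matrix   K: (3, 3) intrinsic matrix
    """
    cam = points @ T[:3, :3].T + T[:3, 3]    # world -> camera frame
    d_in = cam[:, 2]                          # particle depth in this view
    safe_d = np.where(d_in > 0, d_in, 1.0)   # avoid dividing by non-positive depth
    uv = cam @ K.T                            # homogeneous pixel coordinates
    u = np.round(uv[:, 0] / safe_d).astype(int)
    v = np.round(uv[:, 1] / safe_d).astype(int)
    h, w = depth_map.shape
    visible = (d_in > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    keep = np.zeros(len(points), dtype=bool)
    # A particle is kept when the rendered depth is not larger than its own depth.
    keep[visible] = depth_map[v[visible], u[visible]] <= d_in[visible]
    return points[keep]


K = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]])  # toy intrinsics
T = np.eye(4)                                                      # camera at the origin
depth_map = np.full((3, 3), 2.0)                                   # flat surface at depth 2
points = np.array([[0.0, 0.0, 1.0],   # in front of the surface: removed
                   [0.0, 0.0, 3.0],   # behind the surface: kept
                   [0.0, 0.0, 2.0]])  # on the surface: kept
kept = filter_internal_particles(points, depth_map, T, K)
```

Applying the function once per camera and feeding the survivors into the next view reproduces the loop structure of lines 2-6 in Alg. 1.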

Therefore, we employ a coarse-to-fine filling strategy that iteratively upsamples the density field and reassigns the densities at the indices computed from both the Gaussian and internal particles (lines 8-15 in Alg. 1). Fig. 3 provides a sketch illustration of the proposed strategy. Specifically, due to the large grid size at the initial stage, the object is completely inside the voxels with high densities. Next, we sequentially perform upsampling (line 10), mean filtering (line 13), and reassignment of the field (line 14) at each iteration. The first two operations produce more fine-grained shapes, and the reassignment ensures high densities at the surface to avoid over-erosion caused by the first two steps. Finally, the continuum particles with the corresponding object surfaces can be extracted by thresholding the density field (lines 16-17 in Alg. 1).

Algorithm 1 Pseudo code for coarse-to-fine filling

Input:
   Gaussian particles at time $t$ : $\mathbb{P}_{G}(t)=\{(\mu(t),s(t),\sigma,c)\}$ ;
   $n$ pairs of camera extrinsic and intrinsic parameters: $\{(T_{i},K_{i})|i=1...n\}$ ;
   parameters: grid size $\Delta x$ ; number of upsampling steps $n_{u}$ ; thresholds $th_{min}$ , $th_{max}$ ;
Output:
   Continuum particles $\tilde{P}(t)$ and the corresponding surface $\tilde{S}(t)$ ;

1: Randomly sample an initial particle set $P_{in}$ from the bounding box of $\{\mu(t)\}$ ;
2: for $i\leftarrow 1,n$ do
3:    $\tilde{D}_{i}=GaussianSplatting(\mathbb{P}_{G}(t),T_{i},K_{i})$ ; $\triangleright$ render depth map at view $i$
4:    $(u_{in},v_{in}),d_{in}\leftarrow Proj(P_{in},T_{i},K_{i})$ ; $\triangleright$ obtain image indices and depths of $P_{in}$ at view $i$
5:    $P_{in}\leftarrow P_{in}[\tilde{D}_{i}(u_{in},v_{in})\leq d_{in}]$ ; $\triangleright$ filter out particles that are outside the object
6: end for
7: Initialize the zero-value density field $F(t)$ with $\Delta x$ and the bounding box of $\{\mu(t)\}$ ;
8: for $j\leftarrow 1,n_{u}$ do
9:    if $j\neq 1$ then
10:       $F(t)\leftarrow TrilinearInterpolation(F(t),2)$ ; $\triangleright$ upsample $F(t)$ with scale factor 2
11:       $F(t)[p,q,r]=1$ , where $p,q,r\leftarrow Discretize(P_{in}\cup\{\mu(t)\})$ ;
12:    end if
13:    $F(t)\leftarrow MeanFiltering(F(t))$ ;
14:    $F(t)[p,q,r]=1$ , where $p,q,r\leftarrow Discretize(P_{in}\cup\{\mu(t)\})$ ;
15: end for
16: $\tilde{P}(t)\leftarrow GetPosition(th_{min}\leq F(t))$ ;
17: $\tilde{S}(t)\leftarrow GetPosition(th_{min}\leq F(t)\leq th_{max})$ ;
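Lines 7-17 of Alg. 1 can be prototyped on a dense NumPy grid as shown below. This is a rough sketch under stated assumptions, not the authors' implementation: nearest-neighbour upsampling is swapped in for trilinear interpolation, the mean filter is a 3x3x3 box with zero padding, positions are taken at cell centers, and all function names and the toy inputs are hypothetical.

```python
import numpy as np


def mean_filter3(f):
    """3x3x3 box filter with zero padding."""
    p = np.pad(f, 1)
    out = np.zeros_like(f)
    nx, ny, nz = f.shape
    for dx in range(3):
        for dy in range(3):
            for dz in range(3):
                out += p[dx:dx + nx, dy:dy + ny, dz:dz + nz]
    return out / 27.0


def discretize(points, lo, cell, shape):
    """Map positions to integer grid indices, clipped to the grid."""
    idx = np.floor((points - lo) / cell).astype(int)
    idx = np.clip(idx, 0, np.array(shape) - 1)
    return idx[:, 0], idx[:, 1], idx[:, 2]


def coarse_to_fine_fill(points, lo, hi, dx, n_up):
    """points: Gaussian centers and filtered internal particles stacked, (N, 3)."""
    shape = tuple(np.maximum(np.ceil((hi - lo) / dx).astype(int), 1))
    field, cell = np.zeros(shape), dx
    for j in range(1, n_up + 1):
        if j != 1:
            # Nearest-neighbour upsampling by 2 (stand-in for trilinear interpolation).
            field = field.repeat(2, axis=0).repeat(2, axis=1).repeat(2, axis=2)
            cell /= 2.0
            field[discretize(points, lo, cell, field.shape)] = 1.0
        field = mean_filter3(field)                               # smooth and erode
        field[discretize(points, lo, cell, field.shape)] = 1.0   # restore occupied cells
    return field, cell


def extract(field, lo, cell, th_min, th_max):
    """Return continuum particles and surface points as cell centers."""
    def centers(mask):
        return lo + (np.argwhere(mask) + 0.5) * cell
    return centers(field >= th_min), centers((field >= th_min) & (field <= th_max))


rng = np.random.default_rng(0)
pts = rng.uniform(0.25, 0.75, size=(200, 3))  # toy particles inside a unit box
lo, hi = np.zeros(3), np.ones(3)
field, cell = coarse_to_fine_fill(pts, lo, hi, dx=0.25, n_up=3)
particles, surface = extract(field, lo, cell, th_min=0.5, th_max=0.9)
```

Resetting occupied cells to one after every filtering pass is what keeps the smoothed field from eroding the regions that are known to contain particles.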

Gaussian-informed continuum. In PAC-NeRF, the particles are equipped with appearance features to enable image rendering for the continuum at different states. We can also achieve this function by treating the particles as Gaussian kernels and re-training them using the visual data. However, this process is cumbersome and faces the same issue as PAC-NeRF, where distorted RGB images are rendered when large deformation occurs. Therefore, instead of injecting appearance attributes, we opt to assign density and scale attributes to the particles, where the densities originate from the density field and the scale attributes are obtained directly from the field grid size. The Gaussian-informed continuum is defined as a set of triplets:

\mathbb{P}_{\tilde{P}}=\{(\tilde{p},s_{\Delta x},\sigma_{F})\},

(5)

where $\tilde{p}\in\tilde{P}$ , $s_{\Delta x}=\Delta x/2^{n_{u}}$ , and $\sigma_{F}=F[Discretize(\tilde{p})]$ (we neglect $t$ in the notation for simplicity). Therefore, we only render object masks as 2D shape surrogates for supervision.
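Assembling the triplets of Eqn. 5 amounts to one lookup and one constant per particle. In the sketch below, the function name `gaussian_informed_continuum`, the floor-based discretization, and the toy field are assumptions made for illustration.

```python
import numpy as np


def gaussian_informed_continuum(particles, field, lo, cell, dx, n_up):
    """Attach scale and density attributes to continuum particles (Eqn. 5).

    particles: (N, 3) positions   field: dense density grid F
    lo: grid origin               cell: width of one cell of `field`
    dx: initial grid size         n_up: number of upsampling steps n_u
    """
    idx = np.floor((particles - lo) / cell).astype(int)
    idx = np.clip(idx, 0, np.array(field.shape) - 1)
    sigma = field[idx[:, 0], idx[:, 1], idx[:, 2]]     # sigma_F = F[Discretize(p)]
    scale = np.full(len(particles), dx / 2 ** n_up)    # s = dx / 2^{n_u}
    return particles, scale, sigma


# Toy 2x2x2 field over the unit cube with one dense cell.
field = np.zeros((2, 2, 2))
field[1, 1, 1] = 0.8
particles = np.array([[0.75, 0.75, 0.75]])
p, scale, sigma = gaussian_informed_continuum(
    particles, field, lo=np.zeros(3), cell=0.5, dx=1.0, n_up=2)
```

No color attribute is stored, so particles carrying these triplets can be splatted into masks only, which matches the choice to supervise with 2D shape surrogates.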

4.4 Geometry-aware Physical Property Estimation

With the Gaussian-informed continuum at initial state $\mathbb{P}_{\tilde{P}}(0)$ and the extracted surfaces $\tilde{S}(t)$ in place, we can employ MPM to perform simulation on the continuum and evaluate the difference in terms of both the explicit and implicit shapes. Concretely, after a rollout by MPM given the current estimation of physical parameters, we obtain a trajectory $P(t)$ with corresponding object surfaces $S(t)$ . We thus can render object masks over the trajectory. Then the loss of the current rollout can be computed as:

\mathcal{L}_{ppe}=\frac{1}{m}\sum_{i=1}^{m}[\mathcal{L}_{CD}(S(t_{i}),\tilde{S}(t_{i}))+\frac{1}{n}\sum_{j=1}^{n}\mathcal{L}_{1}(A_{j}(t_{i}),\tilde{A}_{j}(t_{i}))],

(6)

where $\mathcal{L}_{CD}$ and $\mathcal{L}_{1}$ are the chamfer distance and the L1 norm, respectively, $S(t_{i})$ denotes the simulated surface at time $t_{i}$ , $A_{j}(t_{i})$ is the rendered mask at view $j$ , and $\tilde{A}_{j}(t_{i})$ represents the object mask of the image extracted from video $V_{j}$ at time $t_{i}$ . Because the simulator is differentiable, the gradient of this loss can be propagated through the rollout to optimize the target physical parameters $\Theta$ .
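The forward evaluation of Eqn. 6 can be sketched in NumPy as follows. The chamfer distance is written in one common convention (mean squared nearest-neighbour distance in both directions) and the mask term is averaged over pixels; both conventions, the function names, and the toy inputs are assumptions. Plain NumPy also provides no gradients, so an actual pipeline would evaluate the same expression inside a differentiable framework.

```python
import numpy as np


def chamfer_distance(a, b):
    """Symmetric chamfer distance between point sets a (N, 3) and b (M, 3)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()


def ppe_loss(sim_surfaces, ref_surfaces, sim_masks, ref_masks):
    """Eqn. 6: mean over m time steps of chamfer term plus mean mask L1 over n views.

    sim_surfaces, ref_surfaces: lists of m point sets
    sim_masks, ref_masks: arrays of shape (m, n, H, W)
    """
    m, n = sim_masks.shape[:2]
    total = 0.0
    for i in range(m):
        cd = chamfer_distance(sim_surfaces[i], ref_surfaces[i])
        l1 = np.mean([np.abs(sim_masks[i, j] - ref_masks[i, j]).mean() for j in range(n)])
        total += cd + l1
    return total / m


# One time step, two views: surface shifted by one unit, masks fully mismatched.
surf = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
ref = surf + np.array([0.0, 0.0, 1.0])
sim_masks = np.zeros((1, 2, 2, 2))
ref_masks = np.ones((1, 2, 2, 2))
loss = ppe_loss([surf], [ref], sim_masks, ref_masks)
loss_zero = ppe_loss([surf], [surf], sim_masks, sim_masks)
```

The two terms are complementary: the chamfer term compares explicit 3D surfaces, while the mask term compares the 2D silhouettes that the simulated continuum produces in each view.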

5 Experiments

Datasets. To thoroughly assess our proposed method, we employ two sources of data introduced by PAC-NeRF [12] and Spring-Gaus [6]. Concretely, PAC-NeRF contributes two synthetic datasets generated by MLS-MPM framework [18]. Each object in both datasets includes RGB images from 11 distinct viewpoints, with approximately 14 frames per viewpoint. The datasets feature a range of materials, including elastic and plastic objects, granular media, and both Newtonian and non-Newtonian fluids. The first dataset contains 9 objects with different shapes, while the second one consists of 45 cross-shape objects with different initial conditions and ground-truth values of physical properties. The interpretation of the physical parameters is listed in Appendix A.6 and A.7. Spring-Gaus generates a synthetic dataset of elastic objects and collects a real-world dataset containing both static and dynamic scenes. The synthetic data contains 30 frames in each of 10 viewpoints. While the real-world data only contains 3 viewpoints for each object in the dynamic scene, it captures 50-70 images from various viewpoints for the static scene. Moreover, we follow previous works [12, 6] and use the off-the-shelf matting [51] or segmentation [52] techniques to obtain object masks.

Table 1: System Identification Performance on PAC-NeRF Dataset

	PAC-NeRF [12]	Ours	Ground Truth
Droplet	$\mu=2.09\times 10^{2}$ , $\kappa=\bm{1.08\times 10^{5}}$	$\mu=\bm{2.01\times 10^{2}}$ , $\kappa=0.18\times 10^{5}$	$\mu=200$ , $\kappa=10^{5}$
Letter	$\mu=83.85$ , $\kappa=1.35\times 10^{5}$	$\mu=\bm{95.05}$ , $\kappa=\bm{1.00\times 10^{5}}$	$\mu=100$ , $\kappa=10^{5}$
Cream	$\mu=1.21\times 10^{5}$ , $\kappa=1.57\times 10^{6}$ ,	$\mu=\bm{1.03\times 10^{4}}$ , $\kappa=\bm{1.48\times 10^{6}}$ ,	$\mu=10^{4}$ , $\kappa=10^{6}$ ,
	$\tau_{Y}=3.16\times 10^{3}$ , $\eta=5.6$	$\tau_{Y}=\bm{2.98\times 10^{3}}$ , $\eta=\bm{6.6}$	$\tau_{Y}=3\times 10^{3}$ , $\eta=10$
Toothpaste	$\mu=6.51\times 10^{3}$ , $\kappa=2.22\times 10^{5}$ ,	$\mu=\bm{4.19\times 10^{3}}$ , $\kappa=\bm{9.24\times 10^{4}}$ ,	$\mu=5\times 10^{3}$ , $\kappa=10^{5}$ ,
	$\tau_{Y}=228$ , $\eta=\bm{9.77}$	$\tau_{Y}=\bm{226}$ , $\eta=9.1$	$\tau_{Y}=200$ , $\eta=10$
Torus	$E=1.04\times 10^{6}$ , $\nu=0.322$	$E=\bm{0.99\times 10^{6}}$ , $\nu=\bm{0.295}$	$E=10^{6}$ , $\nu=0.3$
Bird	$E=2.78\times 10^{5}$ , $\nu=0.273$	$E=\bm{3.08\times 10^{5}}$ , $\nu=\bm{0.284}$	$E=3\times 10^{5}$ , $\nu=0.3$
Playdoh	$E=3.84\times 10^{6}$ , $\nu=0.272$ , $\tau_{Y}=1.69\times 10^{4}$	$E=\bm{1.58\times 10^{6}}$ , $\nu=\bm{0.322}$ , $\tau_{Y}=\bm{1.56\times 10^{4}}$	$E=2\times 10^{6}$ , $\nu=0.3$ , $\tau_{Y}=1.54\times 10^{4}$
Cat	$E=1.61\times 10^{5}$ , $\nu=0.293$ , $\tau_{Y}=3.57\times 10^{3}$	$E=\bm{0.98\times 10^{6}}$ , $\nu=\bm{0.296}$ , $\tau_{Y}=\bm{3.76\times 10^{3}}$	$E=10^{6}$ , $\nu=0.3$ , $\tau_{Y}=3.85\times 10^{3}$
Trophy	$\theta_{fric}^{0}=36.1^{\circ}$	$\theta_{fric}^{0}=\bm{38.0^{\circ}}$	$\theta_{fric}^{0}=40^{\circ}$

Table 2: Dynamic Reconstruction on PAC-NeRF Dataset

Metrics	CD $\downarrow$			EMD $\downarrow$
Methods	PAC-NeRF [12]	DefGS [16]	Ours	PAC-NeRF [12]	DefGS [16]	Ours
Newtonian	0.277	0.269	0.243	0.027	0.027	0.025
Non-Newtonian	0.236	0.216	0.195	0.025	0.024	0.022
Elasticity	0.238	0.191	0.178	0.025	0.022	0.02
Plasticine	0.429	0.213	0.196	0.029	0.024	0.022
Sand	0.212	0.281	0.25	0.025	0.028	0.025
Mean	0.278	0.234	0.212	0.026	0.025	0.023

Baselines. For dynamic reconstruction, we compare with PAC-NeRF and the current state-of-the-art deformable 3D Gaussian method DefGS [16] on the PAC-NeRF synthetic dataset. More comparison of our dynamic 3D Gaussian pipeline on other widely-used datasets such as D-NeRF [25] is presented in Appendix A.1.3. For system identification, we employ PAC-NeRF as the baseline and evaluate the performance using the two datasets introduced in PAC-NeRF. To further demonstrate the precision of the proposed method in terms of geometry recovery and future prediction, we perform experiments on the Spring-Gaus synthetic dataset and compare the results with PAC-NeRF and Spring-Gaus.

Metrics. The evaluation metrics in the experiments include 1) Chamfer Distance (CD), with units expressed in $10^{3}mm^{2}$ ; 2) Earth Mover’s Distance (EMD); 3) Peak Signal-to-Noise Ratio (PSNR); 4) Structural Similarity Index Metric (SSIM) [53]; and 5) Mean Absolute Error (MAE), with values scaled by a factor of $100$ . The first two metrics are used to evaluate discrepancies between the reconstructed and ground-truth point clouds. PSNR and SSIM are leveraged on the Spring-Gaus dataset to validate the precision of future state prediction. We compute the mean absolute error for the evaluation of physical property estimation.
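Two of these metrics are simple enough to state in a few lines. The sketch below uses the standard PSNR definition for images with values in $[0,1]$; for MAE it assumes that errors are taken on $\log_{10}$ values, as the log-scaled entries of Tab. 3 suggest, and then multiplied by 100. The function names and the sample numbers are hypothetical.

```python
import math


def psnr(img, ref, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(img, ref)) / len(img)
    return math.inf if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)


def scaled_log_mae(estimates, truths, scale=100.0):
    """Mean absolute error between log10 values, multiplied by `scale`."""
    errs = [abs(math.log10(e) - math.log10(t)) for e, t in zip(estimates, truths)]
    return scale * sum(errs) / len(errs)


# A uniform error of 0.1 per pixel gives an MSE of 0.01, i.e. about 20 dB.
psnr_db = psnr([0.5, 0.5, 0.5, 0.5], [0.6, 0.6, 0.6, 0.6])
# One estimate off by a factor of 10 and one exact estimate.
mae = scaled_log_mae([100.0, 1000.0], [1000.0, 1000.0])
```

Working in log space makes an error of one order of magnitude count the same whether the parameter is near $10^{2}$ or $10^{6}$, which is convenient when parameters span several decades.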

5.1 Evaluation on PAC-NeRF Synthetic Dataset

Table 3: System identification performance on PAC-NeRF cross-shaped object Dataset

Type	Parameters	PAC-NeRF	Ours*	Ours
Newtonian	$\log_{10}(\mu)$	$11.6\pm 6.60$	$1.53\pm 1.45$	$1.53\pm 1.31$
	$\log_{10}(\kappa)$	$16.7\pm 5.37$	$16.0\pm 22.4$	$14.8\pm 19.2$
	$v$	$0.86\pm 1.45$	$0.20\pm 0.08$	$0.20\pm 0.07$
Non-Newtonian	$\log_{10}(\mu)$	$24.1\pm 21.9$	$32.9\pm 44.6$	$13.5\pm 18.2$
	$\log_{10}(\kappa)$	$44.0\pm 26.3$	$17.7\pm 20.2$	$12.9\pm 16.8$
	$\log_{10}(\tau_{Y})$	$5.09\pm 7.41$	$3.74\pm 3.72$	$4.80\pm 3.92$
	$\log_{10}(\eta)$	$28.7\pm 23.3$	$34.9\pm 24.1$	$40.7\pm 24.6$
	$v$	$0.29\pm 0.13$	$0.68\pm 0.28$	$0.19\pm 0.09$
Elasticity	$\log_{10}(E)$	$3.02\pm 3.72$	$3.27\pm 4.13$	$2.43\pm 3.29$
	$\nu$	$4.35\pm 5.08$	$3.10\pm 2.00$	$2.52\pm 2.03$
	$v$	$0.50\pm 0.23$	$0.78\pm 0.26$	$0.82\pm 0.32$
Plasticine	$\log_{10}(E)$	$83.8\pm 68.4$	$28.1\pm 24.4$	$25.6\pm 29.4$
	$\log_{10}(\tau_{Y})$	$11.2\pm 14.5$	$1.24\pm 0.90$	$1.67\pm 1.21$
	$\nu$	$18.9\pm 15.7$	$10.2\pm 5.34$	$9.59\pm 5.00$
	$v$	$0.56\pm 0.17$	$0.13\pm 0.04$	$0.22\pm 0.10$
Sand	$\theta_{fric}$	$4.89\pm 1.10$	$4.21\pm 0.08$	$4.18\pm 0.52$
	$v$	$0.21\pm 0.08$	$0.24\pm 0.08$	$0.17\pm 0.05$

Comparison on dynamic reconstruction. In this experiment, we first perform dynamic Gaussian reconstruction on the cross-shaped object dataset using DefGS and our proposed method, respectively. We then apply the same filling strategy to the reconstructed Gaussians at each time state to generate the continuum, which is regarded as the final recovered geometry of the object and is compared with the oracle shape to compute CD and EMD. Since PAC-NeRF jointly recovers both geometries and physical parameters, we use its final estimated results to generate the trajectory for evaluation. The results, reported in Tab. 2, show that our method outperforms the baselines on both metrics on average and achieves more precise reconstruction on most object types. Specifically, we find that the NeRF representation used by PAC-NeRF usually leads to overly large shapes. While DefGS performs well on elastic objects, its performance degrades when modeling objects with large deformations, such as granular media and fluids. Our method can better handle these objects due to the flexibility of its trajectory representation.

Table 4: Future State Simulation on Spring-Gaus Synthetic Dataset

		torus	cross	cream	apple	paste	chess	banana	Mean
CD $\downarrow$	Spring-Gaus [6]	2.38	1.57	2.22	1.87	7.03	2.59	18.48	5.16
	PAC-NeRF [12]	2.47	3.87	2.21	4.69	37.7	8.2	66.43	17.94
	Ours	0.75	1.09	0.94	0.22	2.79	0.77	0.12	0.95
EMD $\downarrow$	Spring-Gaus [6]	0.087	0.051	0.094	0.076	0.126	0.095	0.135	0.095
	PAC-NeRF [12]	0.055	0.111	0.083	0.108	0.192	0.155	0.234	0.134
	Ours	0.034	0.058	0.050	0.030	0.096	0.059	0.017	0.049
PSNR $\uparrow$	Spring-Gaus [6]	16.83	16.93	15.42	21.55	14.71	16.08	17.89	17.06
	PAC-NeRF [12]	17.46	14.15	15.37	19.94	12.32	15.08	16.04	15.77
	Ours	20.24	30.51	19.15	26.89	16.31	18.44	29.29	22.98
SSIM $\uparrow$	Spring-Gaus [6]	0.919	0.940	0.862	0.902	0.872	0.881	0.904	0.897
	PAC-NeRF [12]	0.913	0.906	0.858	0.878	0.819	0.848	0.886	0.870
	Ours	0.942	0.939	0.909	0.948	0.894	0.912	0.964	0.930

Comparison on system identification. We evaluate system identification performance on the two datasets proposed by PAC-NeRF. For the first dataset, we run our method 10 times with different random seeds for each object instance and report the mean of the estimated values. For the second dataset, we compute the MAE of the parameters for each type of object. To demonstrate the effectiveness of the implicit shape representation, we also conduct experiments on the second dataset using only masks for supervision in our method, denoted “Ours*”. The training details are illustrated in Appendix A.3.

The results, reported in Tab. 1 and Tab. 3, show that the proposed hybrid pipeline achieves more accurate estimation over a wide range of entries and objects, which demonstrates the effectiveness of the geometry-aware guidance. Fig. 4 visualizes the RGB images rendered by PAC-NeRF and the masks rendered by our method. When large deformation occurs, the rendered RGB image becomes distorted, whereas the rendered mask largely avoids this distortion and yields better performance. By leveraging both explicit and implicit shape guidance, our method obtains the best results on most entries. More qualitative results are available in the supplementary video.

5.2 Evaluation on Spring-Gaus Synthetic Dataset

Comparison on future state simulation. To further demonstrate the performance of our proposed method, we follow the setting in Spring-Gaus [6], which uses the first 20 frames as training data and the subsequent 10 frames for evaluation. Concretely, we first perform system identification with our method and then use the estimated physical parameters and the continuum to simulate a trajectory covering all 30 frames. This allows us to compute CD and EMD between the simulated continuum and the ground-truth point cloud. Since we know the exact position of the continuum at each time state after estimation, we can assign time-invariant Gaussian attributes by training Gaussians on the continuum using the first 20 frames of RGB images, which enables image rendering at novel views and states. Consequently, we can also compute PSNR and SSIM at any time state.
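The evaluation protocol above can be sketched as follows. This is only an illustration: the frame indexing is zero-based by assumption, `psnr` is a hypothetical helper operating on flattened images with intensities in $[0, 1]$, and the actual rendering pipeline is not reproduced here.

```python
import math

# Assumed zero-based frame indices following the 20/10 split described above.
TRAIN_FRAMES = range(0, 20)
EVAL_FRAMES = range(20, 30)


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio (dB) between two flattened images."""
    if len(rendered) != len(reference):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - g) ** 2 for r, g in zip(rendered, reference)) / len(rendered)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy two-pixel example; real inputs would be rendered and captured frames.
example_psnr = psnr([0.0, 0.5], [0.1, 0.5])
```

In practice the metric would be averaged over the frames in `EVAL_FRAMES` for future state prediction, and over `TRAIN_FRAMES` for reconstruction on the training states.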

The results of future state prediction are presented in Tab. 4, and the results of reconstruction on the training states are reported in Appendix A.4. We observe that our method significantly outperforms the baselines on the CD and EMD metrics over almost all object instances, which shows the superiority of our method for both geometry recovery and system identification. The PSNR and SSIM results show that leveraging dynamic visual data to train the Gaussian attributes on the continuum improves rendering quality. This further indicates that the generated trajectories are precise enough for each particle to contribute consistently to the rendering of the same object region across time states.

5.3 Real-world Application: Digital Twins in Robotic Grasping Scenario

To demonstrate the efficacy of the proposed method in real-world scenarios, we perform system identification on the real-world dataset collected by Spring-Gaus [6], as shown in Fig. 5. Since the real-world dataset consists of static and dynamic scenes for each object, we follow the procedure introduced by Spring-Gaus to progressively 1) reconstruct a Gaussian set of the object from the static scene, 2) transform the static Gaussian set to the initial state of the dynamic scene based on a registration network similar to iNeRF [6, 54], and 3) perform system identification from the dynamic observation with the mask-only variant “Ours*”, because the dynamic scenes do not provide enough images for dynamic reconstruction. Subsequently, we establish robotic platforms in both simulated and real-world environments, each equipped with identically configured UR10 robot arms. We then execute grasp attempts on both the reconstructed objects with the estimated properties in simulation and the corresponding real-world objects under the same configuration. Results on more objects and further details of the training and experimental settings are presented in Appendix A.5. The results in Fig. 5 show that our method effectively models the deformation experienced by the objects upon impact with a surface. Furthermore, by applying identical gripper forces to both the simulated and real-world versions of the objects, we observe similar deformation behaviors. This consistency in deformation under identical conditions supports that the estimated physical parameters closely mirror the real-world properties of the objects.

6 Conclusion and Limitations

This paper proposes a novel solution that leverages the 3D Gaussian representation of objects to acquire explicit shapes while concurrently enabling the simulated continuum to infer implicit shapes to facilitate the estimation of physical properties. A novel motion-factorized dynamic 3D Gaussian framework is proposed to reconstruct precise dynamic scenes. Object surfaces and Gaussian-informed continuum are obtained by utilizing the proposed coarse-to-fine density field generation strategy. Extensive experiments demonstrate the efficacy and applicability of our method.

Despite the performance we achieve, this method still suffers from limitations, such as the assumption of continuum mechanics, the requirement of multi-view images with known camera poses, and the need for prior knowledge of object constitutive models. Integrating a pose-free method [55] or a generalized constitutive model [56] with our method will be an interesting direction for future work.

References

[1] Hang Yin, Anastasia Varava, and Danica Kragic. Modeling, learning, perception, and control methods for deformable object manipulation. Science Robotics, 6(54):eabd8803, 2021.
[2] Haochen Shi, Huazhe Xu, Samuel Clarke, Yunzhu Li, and Jiajun Wu. Robocook: Long-horizon elasto-plastic object manipulation with diverse tools. In Proceedings of the Conference on Robot Learning (CoRL), pages 642–660. PMLR, 2023.
[3] Haochen Shi, Huazhe Xu, Zhiao Huang, Yunzhu Li, and Jiajun Wu. Robocraft: Learning to see, simulate, and shape elasto-plastic objects in 3d with graph networks. The International Journal of Robotics Research (IJRR), 43(4):533–549, 2024.
[4] Bin Wang, Longhua Wu, KangKang Yin, Uri Ascher, Libin Liu, and Hui Huang. Deformation capture and modeling of soft objects. ACM Transactions on Graphics (TOG), 34(4):1–12, 2015.
[5] Hsiao-yu Chen, Edith Tretschk, Tuur Stuyck, Petr Kadlecek, Ladislav Kavan, Etienne Vouga, and Christoph Lassner. Virtual elastic objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15827–15837, 2022.
[6] Licheng Zhong, Hong-Xing Yu, Jiajun Wu, and Yunzhu Li. Reconstruction and simulation of elastic objects with spring-mass 3d gaussians. arXiv preprint arXiv:2403.09434, 2024.
[7] Matthias Müller and Markus H Gross. Interactive virtual materials. In Graphics interface, volume 2004, pages 239–246, 2004.
[8] Miguel Jaques, Michael Burke, and Timothy Hospedales. Physics-as-inverse-graphics: Unsupervised physical parameter estimation from video. In Proceedings of the International Conference on Learning Representations (ICLR), 2020.
[9] Pingchuan Ma, Tao Du, Joshua B Tenenbaum, Wojciech Matusik, and Chuang Gan. Risp: Rendering-invariant state predictor with differentiable simulation and rendering for cross-domain parameter estimation. In Proceedings of the International Conference on Learning Representations (ICLR), 2021.
[10] Pingchuan Ma, Peter Yichen Chen, Bolei Deng, Joshua B Tenenbaum, Tao Du, Chuang Gan, and Wojciech Matusik. Learning neural constitutive laws from motion observations for generalizable pde dynamics. In International Conference on Machine Learning (ICML), pages 23279–23300. PMLR, 2023.
[11] Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollhöfer, Christoph Lassner, and Christian Theobalt. Non-rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12959–12970, 2021.
[12] Xuan Li, Yi-Ling Qiao, Peter Yichen Chen, Krishna Murthy Jatavallabhula, Ming Lin, Chenfanfu Jiang, and Chuang Gan. Pac-nerf: Physics augmented continuum neural radiance fields for geometry-agnostic system identification. In Proceedings of the International Conference on Learning Representations (ICLR), 2022.
[13] Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5459–5469, 2022.
[14] Yutao Feng, Yintong Shang, Xuan Li, Tianjia Shao, Chenfanfu Jiang, and Yin Yang. Pie-nerf: Physics-based interactive elastodynamics with nerf. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[15] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (TOG), 42(4):1–14, 2023.
[16] Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[17] Chenfanfu Jiang, Craig Schroeder, Joseph Teran, Alexey Stomakhin, and Andrew Selle. The material point method for simulating continuum materials. In Acm siggraph 2016 courses, pages 1–52. ACM New York, NY, USA, 2016.
[18] Yuanming Hu, Yu Fang, Ziheng Ge, Ziyin Qu, Yixin Zhu, Andre Pradhana, and Chenfanfu Jiang. A moving least squares material point method with displacement discontinuity and two-way rigid body coupling. ACM Transactions on Graphics (TOG), 37(4):1–14, 2018.
[19] Li Zhang, Brian Curless, and Steven M Seitz. Spacetime stereo: Shape recovery for dynamic scenes. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages II–367. IEEE, 2003.
[20] Richard A Newcombe, Dieter Fox, and Steven M Seitz. Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 343–352, 2015.
[21] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
[22] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. Advances in Neural Information Processing Systems (NeurIPS), 34:27171–27183, 2021.
[23] Tianye Li, Mira Slavcheva, Michael Zollhoefer, Simon Green, Christoph Lassner, Changil Kim, Tanner Schmidt, Steven Lovegrove, Michael Goesele, Richard Newcombe, et al. Neural 3d video synthesis from multi-view video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5521–5531, 2022.
[24] Yiming Wang, Qin Han, Marc Habermann, Kostas Daniilidis, Christian Theobalt, and Lingjie Liu. Neus2: Fast learning of neural implicit surfaces for multi-view reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3295–3306, 2023.
[25] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10318–10327, 2021.
[26] Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin-Brualla, and Steven M Seitz. Hypernerf: a higher-dimensional representation for topologically varying neural radiance fields. ACM Transactions on Graphics (TOG), 40(6):1–12, 2021.
[27] Ang Cao and Justin Johnson. Hexplane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 130–141, 2023.
[28] Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. In Proceedings of the International Conference on 3D Vision (3DV), 2024.
[29] Agelos Kratimenos, Jiahui Lei, and Kostas Daniilidis. Dynmf: Neural motion factorization for real-time dynamic view synthesis with 3d gaussian splatting. arXiv preprint arXiv:2312.00112, 2023.
[30] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Wang Xinggang. 4d gaussian splatting for real-time dynamic scene rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[31] Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational physics, 378:686–707, 2019.
[32] Jinxi Li, Ziyang Song, and Bo Yang. Nvfi: Neural velocity fields for 3d physics learning from dynamic videos. Advances in Neural Information Processing Systems (NeurIPS), 36, 2024.
[33] Yi-Ling Qiao, Alexander Gao, and Ming Lin. Neuphysics: Editable neural geometry and physics from monocular videos. Advances in Neural Information Processing Systems (NeurIPS), 35:12841–12854, 2022.
[34] Barbara Frank, Rüdiger Schmedding, Cyrill Stachniss, Matthias Teschner, and Wolfram Burgard. Learning the elasticity parameters of deformable objects with a manipulation robot. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1877–1883. IEEE, 2010.
[35] Zhenjia Xu, Jiajun Wu, Andy Zeng, Joshua B Tenenbaum, and Shuran Song. Densephysnet: Learning dense physical object representations via multi-step dynamic interactions. In Proceedings of the Robotics: Science and Systems, 2019.
[36] J Krishna Murthy, Miles Macklin, Florian Golemo, Vikram Voleti, Linda Petrini, Martin Weiss, Breandan Considine, Jérôme Parent-Lévesque, Kevin Xie, Kenny Erleben, et al. gradsim: Differentiable simulation for system identification and visuomotor control. In Proceedings of the International Conference on Learning Representations (ICLR), 2020.
[37] Moritz Geilinger, David Hahn, Jonas Zehnder, Moritz Bächer, Bernhard Thomaszewski, and Stelian Coros. Add: Analytically differentiable dynamics for multi-body systems with frictional contact. ACM Transactions on Graphics (TOG), 39(6):1–15, 2020.
[38] Eric Heiden, Miles Macklin, Yashraj Narang, Dieter Fox, Animesh Garg, and Fabio Ramos. Disect: A differentiable simulation engine for autonomous robotic cutting. In Proceedings of the Robotics: Science and Systems, 2021.
[39] Tao Du, Kui Wu, Pingchuan Ma, Sebastien Wah, Andrew Spielberg, Daniela Rus, and Wojciech Matusik. Diffpd: Differentiable projective dynamics. ACM Transactions on Graphics (TOG), 41(2):1–21, 2021.
[40] Yiling Qiao, Junbang Liang, Vladlen Koltun, and Ming Lin. Differentiable simulation of soft multi-body systems. Advances in Neural Information Processing Systems (NeurIPS), 34:17123–17135, 2021.
[41] Chenfanfu Jiang, Craig Schroeder, Andrew Selle, Joseph Teran, and Alexey Stomakhin. The affine particle-in-cell method. ACM Transactions on Graphics (TOG), 34(4):1–10, 2015.
[42] Yonghao Yue, Breannan Smith, Christopher Batty, Changxi Zheng, and Eitan Grinspun. Continuum foam: A material point method for shear-dependent flows. ACM Transactions on Graphics (TOG), 34(5):1–20, 2015.
[43] Gergely Klár, Theodore Gast, Andre Pradhana, Chuyuan Fu, Craig Schroeder, Chenfanfu Jiang, and Joseph Teran. Drucker-prager elastoplasticity for sand animation. ACM Transactions on Graphics (TOG), 35(4):1–12, 2016.
[44] Matthias Zwicker, Hanspeter Pfister, Jeroen Van Baar, and Markus Gross. Surface splatting. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 371–378, 2001.
[45] Youtian Lin, Zuozhuo Dai, Siyu Zhu, and Yao Yao. Gaussian-flow: 4d reconstruction with dynamic 3d gaussian particle. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[46] Tianyi Xie, Zeshun Zong, Yuxin Qiu, Xuan Li, Yutao Feng, Yin Yang, and Chenfanfu Jiang. Physgaussian: Physics-integrated 3d gaussians for generative dynamics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[47] Eduardo WV Chaves. Notes on continuum mechanics. Springer Science & Business Media, 2013.
[48] Vladimir Yugay, Yue Li, Theo Gevers, and Martin R Oswald. Gaussian-slam: Photo-realistic dense slam with gaussian splatting. arXiv preprint arXiv:2312.10070, 2023.
[49] Kai Katsumata, Duc Minh Vo, and Hideki Nakayama. An efficient 3d gaussian representation for monocular/multi-view dynamic scenes. arXiv preprint arXiv:2311.12897, 2023.
[50] Hanlin Chen, Chen Li, and Gim Hee Lee. Neusg: Neural implicit surface reconstruction with 3d gaussian splatting guidance. arXiv preprint arXiv:2312.00846, 2023.
[51] Shanchuan Lin, Andrey Ryabtsev, Soumyadip Sengupta, Brian L Curless, Steven M Seitz, and Ira Kemelmacher-Shlizerman. Real-time high-resolution background matting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8762–8771, 2021.
[52] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4015–4026, 2023.
[53] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
[54] Lin Yen-Chen, Pete Florence, Jonathan T Barron, Alberto Rodriguez, Phillip Isola, and Tsung-Yi Lin. inerf: Inverting neural radiance fields for pose estimation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1323–1330. IEEE, 2021.
[55] Yang Fu, Sifei Liu, Amey Kulkarni, Jan Kautz, Alexei A Efros, and Xiaolong Wang. Colmap-free 3d gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[56] Haozhe Su, Xuan Li, Tao Xue, Chenfanfu Jiang, and Mridul Aanjaneya. A generalized constitutive model for versatile mpm simulation and inverse learning with differentiable physics. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 6(3):1–20, 2023.
[57] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
[58] Ruizhi Shao, Zerong Zheng, Hanzhang Tu, Boning Liu, Hongwen Zhang, and Yebin Liu. Tensor4d: Efficient neural 4d decomposition for high-fidelity dynamic reconstruction and rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16632–16642, 2023.
[59] Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K-planes: Explicit radiance fields in space, time, and appearance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12479–12488, 2023.
[60] Jiemin Fang, Taoran Yi, Xinggang Wang, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Matthias Nießner, and Qi Tian. Fast dynamic radiance fields with time-aware neural voxels. In SIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022.
[61] Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021.
[62] Isabella Huang, Yashraj Narang, Clemens Eppner, Balakumar Sundaralingam, Miles Macklin, Ruzena Bajcsy, Tucker Hermans, and Dieter Fox. Defgraspsim: Physics-based simulation of grasp outcomes for 3d deformable objects. IEEE Robotics and Automation Letters, 7(3):6274–6281, 2022.
[63] William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3d surface construction algorithm. ACM SIGGRAPH Computer Graphics, page 163–169, 1998.
[64] Yixin Hu, Teseo Schneider, Bolun Wang, Denis Zorin, and Daniele Panozzo. Fast tetrahedral meshing in the wild. ACM Transactions on Graphics (TOG), 39(4):117–1, 2020.

Appendix A Appendix

A.1 Motion-factorized Dynamic 3D Gaussian Network

A.1.1 Implementation details

All the modules within the proposed network are composed of fully connected layers. The intermediate layers are uniformly designed, with both input and output channels set to 256, and employ ReLU activation. For training, we adhere to the protocol established in [16], utilizing the Adam optimizer [57] with the same learning rate as specified in [16]. The total number of iterations is set to 40,000, with densification and pruning operations conducted every 500 steps until reaching 15,000 iterations. Additionally, the number of motions $N_{m}$ is set to 8 for all objects in our network. $\lambda_{1}$ and $\lambda_{2}$ in Eqn. 4 are both set to $1$. All the experiments are conducted on a single A10 GPU.
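The training schedule described above can be summarized in a short sketch. The constant and function names are hypothetical, and two details are assumptions rather than statements from the text: iterations are counted from 1, and the 15,000-iteration boundary is treated as inclusive.

```python
TOTAL_ITERATIONS = 40_000   # total training iterations
DENSIFY_INTERVAL = 500      # densification and pruning every 500 steps
DENSIFY_UNTIL = 15_000      # no densification after this iteration (assumed inclusive)
NUM_MOTIONS = 8             # N_m, shared by all objects
HIDDEN_CHANNELS = 256       # width of the intermediate fully connected layers


def should_densify(iteration):
    """Return True on iterations where densification and pruning would run."""
    return 0 < iteration <= DENSIFY_UNTIL and iteration % DENSIFY_INTERVAL == 0


densify_steps = [it for it in range(1, TOTAL_ITERATIONS + 1) if should_densify(it)]
```

Under these assumptions densification runs 30 times in total, after which the remaining iterations only refine the existing Gaussians.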

A.1.2 Effects of scale regularization

When addressing the deformation of objects such as fluids or granular media, the network may struggle to fit transformations accurately due to significant discrepancies between the canonical and target shapes. As a compensatory mechanism, the network may employ Gaussians with enlarged scales to mitigate shape distortions during image rendering. This effect is visualized in the top row of Fig. 6. To rectify this issue, we implement scale regularization during network training, which enforces Gaussian kernels to maintain smaller scales. The efficacy of this operation is demonstrated in the second row of Fig. 6, where it is evident that scale regularization enables the reconstruction of more precise shapes for rendering.

A.1.3 Evaluation on D-NeRF Dataset

To further evaluate the performance of our method in terms of novel view synthesis, we conduct an experiment on the D-NeRF [25] dataset, a widely used benchmark consisting of moving objects captured by a monocular camera. We compute PSNR on the D-NeRF test set and compare our method with previous dynamic approaches, including Tensor4D [58], K-Planes [59], TiNeuVox [60], and DefGS [16]. The results, reported in Tab. 5, show that the proposed dynamic 3D Gaussian pipeline also achieves strong rendering performance, with the highest PSNR on six of the seven scenes and on the mean.

Table 5: Results of PSNR ($\uparrow$) on D-NeRF [25] Dataset

Method	Hell Warrior	Mutant	Hook	Bouncing Balls	T-Rex	Stand Up	Jumping Jacks	Mean
Tensor4D [58]	31.26	29.11	28.63	24.47	23.86	30.56	24.2	27.44
K-Planes [59]	24.58	32.5	28.12	40.05	30.43	33.1	31.11	31.41
TiNeuVox [60]	27.1	31.87	30.61	40.23	31.25	34.61	33.49	32.74
DefGS [16]	41.54	42.63	37.42	41.01	38.1	44.62	37.72	40.43
Ours	41.97	42.93	38.04	41.26	37.54	45.32	38.86	40.85

A.2 Gaussian-informed Continuum Generation

A.2.1 Implementation details

In Alg. 1, the number of iterations, denoted as $n_{u}$, is uniformly set to 4 for all objects. We set the initial grid size $\Delta x$ according to the volume of the object. For most objects, $\Delta x=0.1$, while for small items such as the toothpaste in the PAC-NeRF dataset, $\Delta x=0.01$. The parameters $th_{min}$ and $th_{max}$ are set to 0.5 and 0.8, respectively. The resulting particle count ranges from approximately 50,000 to 100,000.

A.2.2 Visualization of coarse-to-fine filling

Fig. 7 visualizes the filling results of our proposed coarse-to-fine strategy with different numbers of iterations, along with the results from PAC-NeRF and the ground-truth shapes. The qualitative results show that our method generates more accurate shapes than PAC-NeRF, which tends to recover overly large shapes. Note that we could not reproduce the cat-shaped object recovery reported in [12], even though we use the official PAC-NeRF code without any modification.

A.3 Training details on PAC-NeRF Dataset

The training process is divided into two stages: we first estimate the initial velocity of the object using the first three frames of data and then perform system identification. Both stages use the Adam optimizer [57] to tune the parameters.

A.4 More Experiments on Spring-Gaus Synthetic Dataset

Besides evaluating the simulated future states in Sec. 5.2, we also evaluate CD and EMD on the states present in the training data, and the results are reported in Tab. 6. Our method outperforms the baselines on every entry except EMD on the paste object, and by a large margin on average, which further demonstrates its performance in terms of reconstruction and identification.

Table 6: Dynamic Reconstruction on Spring-Gaus Synthetic Dataset

		torus	cross	cream	apple	paste	chess	banana	Mean
CD $\downarrow$	Spring-Gaus [6]	0.17	0.48	0.36	0.38	0.19	1.80	2.60	0.85
	PAC-NeRF [12]	4.92	1.10	0.77	1.11	3.14	0.96	2.77	2.11
	Ours	0.13	0.13	0.14	0.15	0.17	0.41	0.03	0.17
EMD $\downarrow$	Spring-Gaus [6]	0.040	0.037	0.031	0.033	0.022	0.063	0.052	0.040
	PAC-NeRF [12]	0.056	0.052	0.041	0.045	0.054	0.052	0.062	0.052
	Ours	0.020	0.020	0.019	0.020	0.025	0.036	0.011	0.022

A.5 Experiment Setting for Spring-Gaus Real-world Dataset

Training details. The dynamic scenes in Spring-Gaus [6] contain only three viewpoints, which are insufficient for dynamic 3D Gaussian reconstruction. Conversely, the static scenes incorporate 50 to 70 images captured from various viewpoints. Following the protocol established in Spring-Gaus, we reconstruct 3D Gaussian points from the static scenes using the traditional 3D Gaussian Splatting (3DGS) technique [15]. Subsequently, we transform the static Gaussian set to the initial configuration of the dynamic scene, guided by the relative pose between the two scenes. The pose is estimated iteratively based on the discrepancies observed between the rendered images and the actual images at the initial state of the dynamic scene. After pose estimation, we implement our methodology, which leverages only implicit shape guidance, to conduct system identification.

Experimental setting. We conducted grasping experiments using the UR10 robotic arm equipped with the Robotiq140 dexterous gripper in both simulated and real-world settings, ensuring consistency in the mass of the objects and their grasping poses across both environments. For the simulations, we employed the FEM-based Isaac Gym simulator [61] for its advanced capabilities in realistically simulating deformable objects [62]. To facilitate the simulation of deformable objects, we apply the Marching Cubes algorithm [63] to the generated density fields to derive the object meshes. Subsequently, we utilize fTetWild [64] for the tetrahedralization of these meshes.

More results. Qualitative results of grasp demonstrations on pig and dog objects are shown in Fig. 8.

A.6 Physical Properties

In this work, we simulate five types of materials, including elasticity, plasticine, granular media, Newtonian fluids, and non-Newtonian fluids. Each material exhibits distinct physical properties. We provide a brief introduction to the properties of each material.

Elasticity: The Young’s modulus ( $E$ ) is a measure of the stiffness of a solid material, quantifying the relationship between stress and strain in a material under elastic deformation. The Poisson’s ratio ( $\nu$ ) describes the tendency of a material to expand or contract along its width when it is stretched or compressed along its length.

Plasticine: The yield stress ( $\tau_{Y}$ ) is the minimum stress that a material requires to transition from elastic deformation to plastic deformation, marking the onset of permanent deformation. Both Young’s modulus ( $E$ ) and Poisson’s ratio ( $\nu$ ) exhibit characteristics similar to those of elastic materials.

Granular Media: The friction angle ( $\theta_{fric}$ ) is a measure of the inherent resistance of a granular material to sliding or shearing, directly related to the angle at which a material can be piled without slumping.

Newtonian fluids: The bulk modulus ( $\kappa$ ) is a measure of a material’s resistance to uniform compression, quantifying how much it compresses under a given amount of external pressure. Fluid viscosity ( $\mu$ ) describes a fluid’s resistance to flow, quantifying how much it resists deformation at a given rate.

Non-Newtonian fluids: The plasticity viscosity ( $\eta$ ) refers to the measure of a viscoplastic material’s resistance to deformation, which defines how it behaves under stress beyond its yield point. The bulk modulus ( $\kappa$ ) and fluid viscosity ( $\mu$ ) are comparable to those of Newtonian fluids, while the yield stress ( $\tau_{Y}$ ) is akin to that of plasticine.

A.7 Constitutive Models

A constitutive model describes how a material responds to stress, strain, or other external forces. It defines the material’s behavior by relating stress and strain through constitutive equations, which can capture complex behaviors such as elasticity, plasticity, and fracture. The MPM simulator is capable of modeling a diverse range of materials by employing various constitutive models. In this work, we have implemented simulations for five distinct types of materials: elasticity, plasticine, granular, Newtonian fluids, and non-Newtonian fluids.

Elasticity. We use the Neo-Hookean model, which is a common nonlinear hyperelastic model, to simulate the elasticity of materials and predict deformations. The Cauchy stress for this model is defined by

$$J\bm{\sigma}=\mu\,\mathbf{F}\mathbf{F}^{\intercal}+\left[\lambda\log(J)-\mu\right]\mathbf{I}, \tag{7}$$

where $\mathbf{F}$ is the deformation gradient, $J=\det(\mathbf{F})$, and $\mu,\lambda$ are the Lamé parameters, which are related to the material properties of Young’s modulus ($E$) and Poisson’s ratio ($\nu$) as:

$$\mu=\frac{E}{2(1+\nu)},\qquad\lambda=\frac{E\nu}{(1+\nu)(1-2\nu)}. \tag{8}$$
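As a worked illustration of Eqs. (7) and (8), the sketch below converts $(E, \nu)$ to the Lamé parameters and evaluates the Neo-Hookean Kirchhoff stress $J\bm{\sigma}$ for a $3\times 3$ deformation gradient. It uses plain Python lists rather than a simulator's tensor types, the function names are hypothetical, and it is not the MPM implementation used in the experiments.

```python
import math


def lame_parameters(E, nu):
    """Eq. (8): Lamé parameters (mu, lambda) from Young's modulus and Poisson's ratio."""
    mu = E / (2.0 * (1.0 + nu))
    lam = E * nu / ((1.0 + nu) * (1.0 - 2.0 * nu))
    return mu, lam


def det3(F):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (F[0][0] * (F[1][1] * F[2][2] - F[1][2] * F[2][1])
            - F[0][1] * (F[1][0] * F[2][2] - F[1][2] * F[2][0])
            + F[0][2] * (F[1][0] * F[2][1] - F[1][1] * F[2][0]))


def neo_hookean_kirchhoff_stress(F, mu, lam):
    """Eq. (7): J*sigma = mu F F^T + (lam log J - mu) I, assuming det(F) > 0."""
    J = det3(F)
    iso = lam * math.log(J) - mu
    return [[mu * sum(F[i][k] * F[j][k] for k in range(3)) + (iso if i == j else 0.0)
             for j in range(3)]
            for i in range(3)]


# Illustrative values only; these are not parameters from the paper.
mu, lam = lame_parameters(E=1.0e5, nu=0.3)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rest_stress = neo_hookean_kirchhoff_stress(identity, mu, lam)
```

A useful sanity check is that the stress vanishes in the undeformed state $\mathbf{F}=\mathbf{I}$, since $\mu\mathbf{I}+(\lambda\log 1-\mu)\mathbf{I}=\mathbf{0}$.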

Plasticine. We use the Saint Venant-Kirchhoff (StVK) model together with the von Mises yield criterion to simulate plasticine. For this model, the stress is defined as:

$$J\bm{\sigma}=\mathbf{F}\left[2\mu\mathbf{G}+\lambda\,\text{Tr}(\mathbf{G})\mathbf{I}\right]\mathbf{F}^{\intercal}, \tag{9}$$

where $\mathbf{G}=\frac{1}{2}\left(\mathbf{F}^{\intercal}\mathbf{F}-\mathbf{I}\right)$ is the Green strain. The von Mises yield criterion serves as a tool to assess whether the deformation exceeds the recoverable limit. The deformation gradient will be mapped back onto the boundary of elastic region using the following projection:

$$\mathcal{Z}(\mathbf{F})=\begin{cases}\mathbf{F}&\delta\gamma\leq 0\\ \mathbf{U}\exp\left(\bm{\epsilon}-\delta\gamma\frac{\hat{\bm{\epsilon}}}{\|\hat{\bm{\epsilon}}\|}\right)\mathbf{V}^{\intercal}&\text{otherwise}\end{cases}, \tag{10}$$

where $\delta\gamma=\|\hat{\bm{\epsilon}}\|-\frac{\tau_{Y}}{2\mu}$, $\bm{\epsilon}=\log(\bm{\Sigma})$ is the Hencky strain, and $\hat{\bm{\epsilon}}$ is the normalized Hencky strain. $\mathbf{U},\bm{\Sigma}$ and $\mathbf{V}$ are obtained by performing Singular Value Decomposition (SVD) on the deformation gradient $\mathbf{F}$.
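The von Mises projection in Eq. (10) acts only on the singular values of $\mathbf{F}$, so it can be sketched in principal space without an SVD routine. The snippet below makes one assumption that is not spelled out above: $\hat{\bm{\epsilon}}$ is taken to be the deviatoric part of the Hencky strain, $\hat{\bm{\epsilon}}=\bm{\epsilon}-\frac{1}{d}\text{Tr}(\bm{\epsilon})\mathbf{I}$. The function name is hypothetical.

```python
import math


def von_mises_return_mapping(sigma, mu, tau_y):
    """Project singular values `sigma` of F back onto the von Mises yield surface.

    Returns the new singular values; Z(F) = U diag(result) V^T.
    Assumption: eps_hat is the deviatoric part of the Hencky strain log(sigma).
    """
    eps = [math.log(s) for s in sigma]
    mean = sum(eps) / len(eps)
    eps_hat = [e - mean for e in eps]
    norm = math.sqrt(sum(e * e for e in eps_hat))
    delta_gamma = norm - tau_y / (2.0 * mu)
    if delta_gamma <= 0.0:
        return list(sigma)  # still inside the elastic region
    # Shrink the deviatoric strain along its own direction by delta_gamma.
    return [math.exp(e - delta_gamma * eh / norm) for e, eh in zip(eps, eps_hat)]


# Illustrative values only: a strongly sheared state that exceeds the yield limit.
yielded = von_mises_return_mapping([1.5, 1.0, 0.8], mu=1.0, tau_y=0.2)
```

Under this assumption the projection leaves $\text{Tr}(\bm{\epsilon})$, and therefore the volume ratio $J$, unchanged, and the projected deviatoric norm equals $\tau_{Y}/(2\mu)$.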

Granular Media. Similar to plasticine, the StVK constitutive model is used to simulate granular media. The Drucker-Prager yield criterion [43] is selected as the yielding condition. It is defined as follows:

$$\text{Tr}(\bm{\epsilon})>0,\quad\text{or}\quad\delta\gamma=\|\hat{\bm{\epsilon}}\|_{F}+\alpha\frac{(d\lambda+2\mu)\text{Tr}(\bm{\epsilon})}{2\mu}>0, \tag{11}$$

where $d$ is the spatial dimension, $\alpha=\sqrt{\frac{2}{3}}\frac{2\sin\theta_{fric}}{3-\sin\theta_{fric}}$ and $\theta_{fric}$ is the friction angle. The deformation gradient return mapping is defined by

$$\mathcal{Z}(\mathbf{F})=\begin{cases}\mathbf{U}\mathbf{V}^{\intercal}&\text{Tr}(\bm{\epsilon})>0\\ \mathbf{F}&\delta\gamma\leq 0,\ \text{Tr}(\bm{\epsilon})\leq 0\\ \mathbf{U}\exp\left(\bm{\epsilon}-\delta\gamma\frac{\hat{\bm{\epsilon}}}{\|\hat{\bm{\epsilon}}\|}\right)\mathbf{V}^{\intercal}&\text{otherwise}\end{cases}. \tag{12}$$
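The three cases of Eq. (12) can likewise be sketched on the singular values of $\mathbf{F}$. As before, the snippet assumes that $\hat{\bm{\epsilon}}$ is the deviatoric part of the Hencky strain and that the friction angle is given in radians; the function name is hypothetical.

```python
import math


def drucker_prager_return_mapping(sigma, mu, lam, friction_angle):
    """Drucker-Prager projection of the singular values `sigma` of F (Eqs. 11-12).

    Returns the new singular values; Z(F) = U diag(result) V^T.
    Assumptions: eps_hat is the deviatoric Hencky strain; angle in radians.
    """
    d = len(sigma)
    eps = [math.log(s) for s in sigma]
    trace = sum(eps)
    if trace > 0.0:
        return [1.0] * d  # expansion: no resistance, Z(F) = U V^T
    eps_hat = [e - trace / d for e in eps]
    norm = math.sqrt(sum(e * e for e in eps_hat))
    sin_phi = math.sin(friction_angle)
    alpha = math.sqrt(2.0 / 3.0) * 2.0 * sin_phi / (3.0 - sin_phi)
    delta_gamma = norm + alpha * (d * lam + 2.0 * mu) * trace / (2.0 * mu)
    if delta_gamma <= 0.0:
        return list(sigma)  # inside the yield surface: F is unchanged
    # Otherwise shrink the deviatoric strain by delta_gamma along its direction.
    return [math.exp(e - delta_gamma * eh / norm) for e, eh in zip(eps, eps_hat)]


# Illustrative values only: a volume-reducing sheared state.
sheared = drucker_prager_return_mapping([1.2, 0.8, 1.0], mu=1.0, lam=1.0,
                                        friction_angle=math.radians(30.0))
```

Because $\delta\gamma\leq 0$ whenever the deviatoric norm is zero and $\text{Tr}(\bm{\epsilon})\leq 0$, the last branch never divides by zero for non-negative $\alpha$ and positive $d\lambda+2\mu$.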

Newtonian Fluid. We adopt the approach used in PAC-NeRF [12], which employs a J-based fluid model combined with a viscosity term to simulate Newtonian fluids. The stress for this model is defined by

$$J\bm{\sigma}=\frac{1}{2}\mu\left(\nabla\mathbf{v}+\nabla\mathbf{v}^{\intercal}\right)+\kappa\left(J-\frac{1}{J^{6}}\right)\mathbf{I}, \tag{13}$$

where $\mu$ and $\kappa$ represent the fluid viscosity and the bulk modulus, respectively.
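A minimal sketch of the Newtonian fluid stress is given below for a $3\times 3$ velocity gradient. The volumetric term $\kappa(J-J^{-6})$ is applied isotropically, i.e. multiplied by the identity, which is an assumption made here so that both terms are tensors of the same shape. The function name is hypothetical.

```python
def newtonian_fluid_kirchhoff_stress(grad_v, J, mu, kappa):
    """J*sigma = 0.5 * mu * (grad_v + grad_v^T) + kappa * (J - 1 / J^6) * I.

    grad_v: 3x3 velocity gradient as nested lists; J: volume ratio det(F) > 0.
    Assumption: the volumetric term acts on the diagonal only.
    """
    volumetric = kappa * (J - 1.0 / J ** 6)
    return [[0.5 * mu * (grad_v[i][j] + grad_v[j][i]) + (volumetric if i == j else 0.0)
             for j in range(3)]
            for i in range(3)]


# Illustrative values only: simple shear flow with no volume change.
shear_flow = [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
fluid_stress = newtonian_fluid_kirchhoff_stress(shear_flow, J=1.0, mu=0.1, kappa=1.0e4)
```

At $J=1$ the volumetric term vanishes, so only the symmetric viscous part remains; under compression ($J<1$) the $J^{-6}$ term grows rapidly and dominates.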

Non-Newtonian Fluid. We employ the viscoplastic model [42] to simulate non-Newtonian fluids. Although we continue to use the von Mises criterion to delineate the elastic region, the presence of viscoplasticity implies that the deformation is not immediately projected back onto the yield surface. The return mapping is defined as follows:

$$\mathcal{Z}(\mathbf{F})=\begin{cases}\mathbf{F}&\delta\gamma\leq 0\\ \mathbf{U}\exp\left(\frac{\hat{s}}{2\mu}\hat{\bm{\epsilon}}+\frac{1}{d}\text{Tr}(\bm{\epsilon})\mathbf{I}\right)\mathbf{V}^{\intercal}&\text{otherwise}\end{cases}, \tag{14}$$

$$\begin{aligned}\hat{\mu}&=\frac{\mu}{d}\text{Tr}(\bm{\Sigma}^{2}),\\ \bm{s}&=2\mu\hat{\bm{\epsilon}},\\ \hat{s}&=\|\bm{s}\|-\frac{\delta\gamma}{1+\frac{\eta}{2\hat{\mu}\Delta t}},\end{aligned} \tag{15}$$

where $d$ is the spatial dimension. The $\mathbf{U},\bm{\Sigma}$ and $\mathbf{V}$ can be obtained by performing Singular Value Decomposition (SVD) on deformation gradient $\mathbf{F}$ .