
Gaussian-Informed Continuum for Physical Property Identification and Simulation

Junhao Cai1∗  Yuji Yang2∗  Weihao Yuan3  Yisheng He3
Zilong Dong3  Liefeng Bo3  Hui Cheng2  Qifeng Chen1
1The Hong Kong University of Science and Technology, 2Sun Yat-sen University, 3Alibaba Group
∗Equal contribution, order determined by coin toss
Abstract

This paper studies the problem of estimating physical properties (system identification) through visual observations. To facilitate geometry-aware guidance in physical property estimation, we introduce a novel hybrid framework that leverages 3D Gaussian representation to not only capture explicit shapes but also enable the simulated continuum to deduce implicit shapes during training. We propose a new dynamic 3D Gaussian framework based on motion factorization to recover the object as 3D Gaussian point sets across different time states. Furthermore, we develop a coarse-to-fine filling strategy to generate the density fields of the object from the Gaussian reconstruction, allowing for the extraction of object continuums along with their surfaces and the integration of Gaussian attributes into these continuums. In addition to the extracted object surfaces, the Gaussian-informed continuum also enables the rendering of object masks during simulations, serving as implicit shape guidance for physical property estimation. Extensive experimental evaluations demonstrate that our pipeline achieves state-of-the-art performance across multiple benchmarks and metrics. Additionally, we illustrate the effectiveness of the proposed method through real-world demonstrations, showcasing its practical utility. Our project page is at https://jukgei.github.io/project/gic.

1 Introduction

Identifying the physical properties of objects (i.e., system identification) is essential for numerous applications such as games, digital twins, and robotic manipulation [1, 2, 3]. Although humans can intuitively deduce the underlying physical properties with a single glance when the object undergoes deformation, estimating the properties with only visual observations remains challenging for computational perceptual algorithms.

To tackle this challenge, many established methods [4, 5, 6] adopt the assumption of elastic material [7] and perform physics-based modeling based on mass-spring systems (MSS) or the finite element method (FEM) to model and simulate the dynamics of the objects. Such an assumption inevitably restricts the ability to simulate more general material types beyond elastic ones, such as fluids or granular media. Another problem is that many previous methods [8, 9, 10] require full ground-truth knowledge of the object geometry for identification, which limits their practicality. Some subsequent methods [5, 4] instead recover the geometries and physical properties from observations in a decoupled manner. Specifically, these methods first extract object geometries by making use of stereo observations or dynamic neural reconstruction [11] from RGB video sequences, and then perform simulation directly on the point clouds or after the tetrahedral mesh conversion. While these methods introduce explicit geometries to guide the estimation of physical properties, the noisy reconstruction results usually lead to degraded system identification performance.

Recently, PAC-NeRF [12] integrates neural radiance fields (NeRF) [13] with a continuum dynamic model to tackle the above problems. The object geometries and physical properties are captured in a unified framework. Despite its effectiveness, this method possesses two limitations. Firstly, the implicit shapes represented by NeRF often lead to inferior geometries, which might cause inaccurate trajectories during simulation. Secondly, PAC-NeRF renders the novel views of deformed objects based on the appearance radiance field reconstructed from the static scene, which might introduce texture distortion, particularly when objects undergo significant deformations, resulting in discrepancies between the rendered and the observed images [14].

To address these limitations, this paper proposes a novel hybrid solution based on 3D Gaussians [15, 16] and material point method (MPM) [17, 18]. The core strength of this work is that we make use of both explicit shapes from dynamic 3D Gaussian reconstruction and implicit shapes rendered by the Gaussian-informed continuum for physical property estimation.

To generate more precise shapes for reasoning about physical properties, we first propose a motion-factorized dynamic 3D Gaussian network to conduct dynamic scene reconstruction. We then extract the continuum from the recovered 3D Gaussians at each frame by leveraging a coarse-to-fine filling strategy to generate the density field of the object progressively. The resulting density fields can be used to sample continuum particles for simulation and extract object surfaces as explicit-shape supervision in physical property estimation. To eliminate the appearance distortion caused by large deformation in PAC-NeRF, we further assign Gaussian attributes to the continuum particles where the opacity and scale attributes are evaluated from the density field. Such a Gaussian-informed continuum is able to render object masks during simulation, which can be regarded as an implicit-shape representation to guide the estimation and effectively avoid using inferior rendering results for learning physical properties.

To demonstrate the superiority of the proposed method over other baselines, we conduct three types of experiments, including evaluations of physical properties, dynamic reconstruction, and future state simulation. We also demonstrate a real-world application in digital twins and robotic manipulation, showing the applicability of the proposed method in real-world scenarios.

Our contributions are summarized as follows.

  • We propose a novel hybrid pipeline that takes advantage of the 3D Gaussian representation of the object to both acquire explicit shapes and empower the simulated continuum to infer implicit shapes for physical property estimation.

  • We propose a novel dynamic 3D Gaussian framework with motion factorization to achieve more precise dynamic reconstruction. We also propose a coarse-to-fine filling strategy to generate the density field of the object, which can be utilized to extract object surfaces and obtain Gaussian-informed continuum particles.

  • Extensive experiments show that our pipeline attains state-of-the-art performance on existing benchmarks with a wide range of metrics. We also present a real-world demonstration to show the efficiency of the proposed method.

2 Related Work

Dynamic reconstruction. Reconstructing dynamic scenes from monocular or multi-view video(s) is a long-standing problem in the computer vision community [19, 20]. Previous works exploit neural implicit representation [21, 22] for non-rigid reconstruction. These methods either reconstruct the scene in a frame-wise manner [23, 24] or maintain a canonical shape and model the deformation with a neural network [25, 26, 11, 27]. While effective for novel view synthesis, these methods often require extensive training time and can result in noisy deformations owing to the implicit representation, which may compromise the utility of the recovered geometries for physical property estimation [12]. The recent 3D Gaussian Splatting (3DGS) technique [15] has become a prevalent method for 3D reconstruction and novel view synthesis because of its explicit shape modeling and extremely fast view rendering. Similar to non-rigid NeRF, many follow-up works extend 3DGS to 4D by treating each frame separately [28] or decomposing a scene into a canonical 3D Gaussian point cloud and a deformation model that warps the canonical shape into a specific scene [16, 29, 30]. In this paper, we draw upon these prior studies [16, 29] and propose a novel motion-factorized dynamic 3D Gaussian network to achieve better performance on reconstruction and novel view synthesis.

System identification. Understanding the physical laws of the 3D world is beneficial for simulation [31, 32, 6] and manipulation [2, 3, 33]. However, unveiling these properties from visual information is an extremely difficult task due to the ambiguity introduced by incomplete observation and the high degrees of freedom of the scene. Early works [34, 35] study the problem by learning physical properties via interactions. With recent improvements in differentiable physics simulation [17, 18, 36, 37, 38, 39, 40], many methods instead estimate the physical properties by comparing the rendering results with 2D ground truth given prior knowledge about the object geometry. VEO [5] presents a differentiable simulator to learn patterns from 4D reconstruction and force-displacement measurements. Another approach [4] eliminates the dependence on captured forces by proposing an iterative framework that alternates between deformation tracking and parameter optimization. While these methods demonstrate promising results, the inferior reconstruction might lead to degraded performance, and the assumption of elastic material restricts the applicability. PAC-NeRF [12] instead proposes a single framework to recover both the unknown geometry and physical properties of deformable objects from multi-view video sequences. However, the inferior geometries and blurry rendered images might have detrimental effects on physical property reasoning. In this work, we adopt MPM as our simulation framework following the approach used in PAC-NeRF due to its ability to simulate a variety of materials [6, 41, 42, 43]. Unlike previous approaches, we utilize dynamic 3D Gaussians to reconstruct explicit 3D geometries and generate simulatable continuum particles. Furthermore, we enhance the particles with Gaussian attributes, facilitating the rendering of implicit 2D shapes and thereby improving physical parameter estimation.

Figure 1: Overview. (a) Continuum Generation: Given a series of multi-view images capturing a moving object, the motion-factorized dynamic 3D Gaussian network is trained to reconstruct the dynamic object as 3D Gaussian point sets across different time states. From the reconstructed results, we employ a coarse-to-fine strategy to generate density fields to recover the continuums and extract object surfaces. The continuum is endowed with Gaussian attributes to allow mask rendering. (b) Identification: The MPM simulates the trajectory with the initial continuum $\mathbb{P}(0)$ and the physical parameters $\Theta$. The simulated object surfaces and the rendered masks are then compared against the previously extracted surfaces (colored in blue) and the corresponding masks from the dataset. The differences are quantified to guide the parameter estimation process. (c) Simulation: Digital twin demonstrations are displayed. Simulated objects (colored by stress increasing from blue to red), characterized by the properties estimated from observation, exhibit behavior consistent with real-world objects.

3 Preliminary

In this section, we briefly review the core idea of 3D Gaussian Splatting (3DGS) [15] and introduce its point-based alpha blending to render depth maps and foreground masks. Typically, 3DGS utilizes 3D Gaussians, each defined by a central point $\mu_0$, a covariance matrix $\Sigma_0$, a density value $\sigma$, and a color attribute $c$, to efficiently render images from specific viewpoints. Each point is denoted as

$$G(x)=\exp\left(-\frac{1}{2}(x-\mu_{0})^{T}\Sigma_{0}^{-1}(x-\mu_{0})\right), \tag{1}$$

where $\Sigma_0$ can be factorized as $\Sigma_0 = R_0 S_0 S_0^T R_0^T$, in which $R_0$ is a rotation matrix represented by a quaternion vector $r_0 \in \mathbb{R}^4$, and $S_0$ is a diagonal scaling matrix characterized by a 3D vector $s_0 \in \mathbb{R}^3$. If we consider an isotropic Gaussian representation, the scaling matrix can be written as $s_0 I$, where $s_0$ is a scalar and $I$ is the identity matrix.
When performing splatting, the 3D Gaussians are projected into 2D with the covariance matrix defined as $\Sigma_0' = J W \Sigma_0 W^T J^T$, where $J$ is the Jacobian of the affine approximation of the projective transformation [44], and $W$ is the viewing transformation matrix. The rendered color $I(u)$ and its foreground mask $A(u)$ at pixel $u$ are then evaluated by integrating $N$ ordered splatted Gaussians via point-based alpha blending. Since the depth of each Gaussian point at a specific view can be obtained according to its transformation matrix, we can further render the depth map $D$ using the same blending method [16, 45], as

$$I(u)=\sum_{i\in N}T_{i}\alpha_{i}c_{i},\qquad A(u)=\sum_{i\in N}T_{i}\alpha_{i},\qquad D(u)=\sum_{i\in N}T_{i}\alpha_{i}d_{i}, \tag{2}$$

where $T_i = \prod_{j=1}^{i-1}(1-\alpha_j)$ is the accumulated transmittance, $\alpha_i$ is the probability of termination at point $i$, and $d_i$ is the depth of the Gaussian point at the specific view.
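
As an illustration of Eq. 2, the following NumPy sketch blends the Gaussians that cover a single pixel. It assumes the per-Gaussian values $\alpha_i$, $c_i$, and $d_i$ have already been evaluated at that pixel and sorted front to back; the function name `alpha_blend` and the toy inputs are hypothetical and not taken from the 3DGS rasterizer.

```python
import numpy as np

def alpha_blend(alphas, colors, depths):
    """Blend N >= 1 depth-sorted Gaussians at one pixel (front to back)."""
    alphas = np.asarray(alphas, dtype=float)
    colors = np.asarray(colors, dtype=float)
    depths = np.asarray(depths, dtype=float)
    # T_i = prod_{j<i} (1 - alpha_j), with T_1 = 1 for the front-most Gaussian.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = transmittance * alphas
    color = weights @ colors   # I(u)
    mask = weights.sum()       # A(u)
    depth = weights @ depths   # D(u)
    return color, mask, depth

# Toy input: a red Gaussian in front of a blue one, both with alpha 0.5.
color, mask, depth = alpha_blend(
    alphas=[0.5, 0.5],
    colors=[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    depths=[1.0, 2.0],
)
```

Note that $D(u)$ is an opacity-weighted sum rather than the depth of the first surface hit, so it is not normalized by $A(u)$ in this sketch.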

4 Method

4.1 Problem Definition and Overview

In this work, we aim to reconstruct the geometries and the physical properties of various object types from multi-view videos. Formally, given a set of video sequences $\{V_i \mid i = 1 \ldots n\}$ of a moving object and the corresponding camera extrinsic and intrinsic parameters $\{(T_i, K_i) \mid i = 1 \ldots n\}$, the goal of this task is to recover the explicit geometries of the object represented by continuum particles $P(t)$ and its corresponding physical parameters $\Theta$ (e.g., Young’s modulus $E$ and Poisson’s ratio $\nu$ for elastic objects). We follow the assumption in PAC-NeRF and PhysGaussian [12, 46] that the object types (e.g., elastic, granular, Newtonian/non-Newtonian, plastic) are known and the physical phenomenon follows continuum mechanics [17, 47].

The overview of the proposed pipeline is illustrated in Fig. 1, which consists of three modules: a motion-factorized dynamic 3D Gaussian network (Sec. 4.2) for 4D reconstruction of the object, a coarse-to-fine density field generation strategy (Sec. 4.3) for continuum generation, surface extraction, and Gaussian attribute assignment, and a procedure (Sec. 4.4) showing how we leverage Gaussian-informed continuum and extracted surfaces to estimate physical properties.

4.2 Motion-factorized Dynamic 3D Gaussian Network

Figure 2: The pipeline of the proposed dynamic 3D Gaussian network. The motion network backbone consists of 8 fully connected (FC) layers. The output of the motion block is fed to $N_m$ heads to generate motion residuals. The coefficient network contains 4 FC layers.

Our dynamic 3D Gaussian network follows existing frameworks [16, 29, 30] that simultaneously maintain a canonical 3D Gaussian set and a deformation field modeled by a neural network to warp the canonical shape into object states at specific times. The core idea of this pipeline, presented in Fig. 2, is that the motion of every point in the object can be decomposed into a small range of motion bases.

Architecture. We first factorize the entire motion into $N_m$ bases that are modeled by a fully connected neural network, where every basis shares a common backbone except the final layer. The output of each basis consists of the deformations in position $d\mu_i(t) \in \mathbb{R}^3$ and in scale $ds_i(t) \in \mathbb{R}$. To model the exact deformation for each position, we next propose a lightweight coefficient network that maps the positions in canonical space at a specific time to their corresponding motion coefficients $w(\mu_0, t) \in \mathbb{R}^{N_m}$. Therefore, the deformed position and the scale for each Gaussian point are evaluated by the linear combination of the motion bases according to the motion coefficients:

$$\mu(t)=\mu_{0}+\sum_{i=1}^{N_{m}}w_{i}(\mu_{0},t)\,d\mu_{i}(t),\qquad s(t)=s_{0}+\sum_{i=1}^{N_{m}}w_{i}(\mu_{0},t)\,ds_{i}(t). \tag{3}$$

In this work, we regard all the Gaussians as isotropic kernels, which has been demonstrated as an efficient way to simplify the model and better reconstruct the scene [6, 48]. We should note that although previous works [29, 49] also perform motion decomposition modeling, our pipeline shows two major differences: 1) instead of modeling each basis with an independent neural network, our module shares a common backbone. Our key observation is that for reconstructing a dynamic object, all points on the object should follow a similar moving tendency, and the final heads of the neural network are sufficient to model the details of different parts of the object; 2) to better fit the high-rank motion of the dynamic scene [16], we model the motion coefficients as time-variant variables rather than constant Gaussian attributes [29].
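
The linear combination in Eq. 3 can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the network outputs are replaced by plain arrays, the residuals are assumed to be available per Gaussian point (a basis shared by all points could be broadcast to the same shape), and the function name `deform` and the array shapes are hypothetical.

```python
import numpy as np

def deform(mu0, s0, w, d_mu, d_s):
    """Apply Eq. 3 to P isotropic Gaussians with Nm motion bases.

    mu0:  (P, 3)     canonical centers
    s0:   (P,)       canonical isotropic scales
    w:    (P, Nm)    motion coefficients w_i(mu0, t)
    d_mu: (P, Nm, 3) position residuals of each basis at time t
    d_s:  (P, Nm)    scale residuals of each basis at time t
    """
    mu_t = mu0 + np.einsum("pm,pmk->pk", w, d_mu)  # weighted sum over bases
    s_t = s0 + np.einsum("pm,pm->p", w, d_s)
    return mu_t, s_t

# Toy example: two points, two bases.
mu0 = np.zeros((2, 3))
s0 = np.array([0.1, 0.1])
w = np.array([[1.0, 0.0], [0.5, 0.5]])
d_mu = np.tile(np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]), (2, 1, 1))
d_s = np.tile(np.array([0.02, -0.02]), (2, 1))
mu_t, s_t = deform(mu0, s0, w, d_mu, d_s)
```

In the toy example the first point follows only the first basis, while the second point averages the two bases, which is how time-variant coefficients let different parts of the object move differently with shared bases.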

Optimization. We employ the same setting as in [16] to train our pipeline. Concretely, the canonical 3D Gaussians are initialized with points randomly sampled from the given bounding box of the scene. We start training the deformation network after 3,000 iterations of warm-up for the 3D Gaussians. Similar to previous works [16, 29], we optimize the pipeline by computing the L1 norm and Structural Similarity Index Measure (SSIM) between the rendered image $I$ and the ground truth image $\tilde{I}$. Moreover, since large scales may lead to inaccurate reconstructed shapes [50], we apply an L1 norm to the scale attributes of all the points to recover more fine-grained shapes of the object. Therefore, the overall loss function is defined as:

$$\mathcal{L}_{gs}=\mathcal{L}_{1}(I,\tilde{I})+\lambda_{1}\mathcal{L}_{ssim}(I,\tilde{I})+\lambda_{2}\mathcal{L}_{1}(s(t)), \tag{4}$$

where $\lambda_1$ and $\lambda_2$ are balancing hyperparameters. A more in-depth analysis of the proposed pipeline, including implementation details and effects of scale regularization, is presented in Appendix A.1.
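
A rough NumPy sketch of the structure of Eq. 4 is given below. Several parts are assumptions made for illustration only: the SSIM term is taken as $1 - \mathrm{SSIM}$ and computed from global image statistics instead of the usual sliding window, the L1 terms are averaged rather than summed, and the weights `lam1` and `lam2` are placeholder values, not the ones used in the paper.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Simplified SSIM from whole-image statistics (no sliding window)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )

def gs_loss(rendered, target, scales, lam1=0.2, lam2=0.01):
    """L1 image term + lam1 * SSIM term + lam2 * L1 scale regularizer."""
    l1_img = np.abs(rendered - target).mean()
    l_ssim = 1.0 - global_ssim(rendered, target)  # assumed form of L_ssim
    l1_scale = np.abs(scales).mean()              # penalizes large Gaussians
    return l1_img + lam1 * l_ssim + lam2 * l1_scale

rng = np.random.default_rng(0)
target = rng.random((8, 8, 3))
noisy = np.clip(target + 0.1 * rng.standard_normal(target.shape), 0.0, 1.0)
scales = np.array([0.1, 0.3])
loss_same = gs_loss(target, target, scales)   # only the scale term remains
loss_noisy = gs_loss(noisy, target, scales)
```

When the rendering matches the target exactly, only the scale regularizer contributes, which makes its role as a pressure toward smaller Gaussians easy to see.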

4.3 Gaussian-informed Continuum Generation

Figure 3: Sketch illustration of the coarse-to-fine filling strategy. Gaussian and internal particles are depicted in green and blue, respectively. (a) Voxels containing particles are assigned high densities. (b) Following the upsampling and smoothing of the field, densities near boundaries become blurred (indicated in light yellow). (c) The particles are again used to correct the voxels that contain particles with high densities. (d) and (e) repeat the previous operations to achieve a more detailed shape.

Coarse-to-fine density field generation. Since the reconstructed Gaussian particles serve rendering only and are therefore not evenly distributed over the object, they cannot be directly used for simulation [46]. Therefore, we propose a novel coarse-to-fine filling strategy to iteratively generate density fields of the object based on the reconstructed Gaussian particles from Eqn. 3 and the internal particles filtered by the rendered depth maps. The proposed strategy is presented in Alg. 1. The implementation details and visual results are illustrated in Appendix A.2.

Concretely, the internal particles, initialized by uniform sampling from the bounding box of Gaussian particles, are filtered by projecting the particles to various images to compare the projected depth with rendered depth values (lines 1-6 in Alg. 1). The resulting particles can roughly represent the shape of the object. However, as denoted in Eqn. 2, the rendered depth maps are evaluated in an accumulated manner, making them less precise in representing the object surface.
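
The depth-based filtering of the internal particles (lines 1-6 in Alg. 1) might look like the following NumPy sketch. It assumes a pinhole camera with a 4x4 world-to-camera matrix `T` and a 3x3 intrinsic matrix `K`, nearest-pixel lookup, and a depth map in which background pixels hold `inf` so that particles projecting onto the background are rejected; these conventions and the name `filter_by_depth` are illustrative assumptions rather than details given in the text.

```python
import numpy as np

def filter_by_depth(points, depth_map, T, K):
    """Keep points that lie at or behind the rendered depth in one view."""
    n = len(points)
    homog = np.concatenate([points, np.ones((n, 1))], axis=1)
    cam = (T @ homog.T).T[:, :3]           # world -> camera coordinates
    d = cam[:, 2]                          # depth of each particle
    in_front = d > 1e-8
    safe_d = np.where(in_front, d, 1.0)    # avoid division by zero
    proj = (K @ cam.T).T
    u = np.round(proj[:, 0] / safe_d).astype(int)
    v = np.round(proj[:, 1] / safe_d).astype(int)
    h, w = depth_map.shape
    inside = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    keep = np.zeros(n, dtype=bool)
    # Condition from line 5 of Alg. 1: rendered depth <= particle depth.
    keep[inside] = depth_map[v[inside], u[inside]] <= d[inside]
    return points[keep]

# Toy view: a single foreground pixel at (2, 2) with rendered depth 1.0.
K = np.array([[10.0, 0.0, 2.0], [0.0, 10.0, 2.0], [0.0, 0.0, 1.0]])
T = np.eye(4)
depth = np.full((5, 5), np.inf)
depth[2, 2] = 1.0
particles = np.array([
    [0.0, 0.0, 0.5],    # in front of the surface
    [0.0, 0.0, 1.5],    # behind the surface
    [0.15, 0.0, 1.5],   # projects onto a background pixel
    [0.0, 0.0, -1.0],   # behind the camera
])
for depth_i, T_i, K_i in [(depth, T, K)]:   # loop over all views
    particles = filter_by_depth(particles, depth_i, T_i, K_i)
```

Applying the filter over all views successively carves the uniformly sampled box down to a rough interior of the object.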

Therefore, we employ a coarse-to-fine filling strategy by iteratively upsampling the density field and reassigning the densities on the indices computed from both the Gaussian and internal particles (lines 8-15 in Alg. 1). Fig. 3 provides a sketch illustration of the proposed strategy. Specifically, due to the large grid size at the initial stage, the object is completely inside the voxels with high densities. Next, we sequentially perform upsampling (line 10), mean filtering (line 13), and reassigning the field (line 14) at each iteration. The first two operations produce more fine-grained shapes, and the reassigning operation ensures high densities at the surface to avoid over-erosion caused by the first two steps. Finally, the continuum particles with the corresponding object surfaces can be extracted by thresholding the density field (lines 16-17 in Alg. 1).

Algorithm 1 Pseudo code for coarse-to-fine filling

Input:
   Gaussian particles at time t𝑡titalic_t: G(t)={(μ(t),s(t),σ,c)}subscript𝐺𝑡𝜇𝑡𝑠𝑡𝜎𝑐\mathbb{P}_{G}(t)=\{(\mu(t),s(t),\sigma,c)\}blackboard_P start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT ( italic_t ) = { ( italic_μ ( italic_t ) , italic_s ( italic_t ) , italic_σ , italic_c ) };
   n𝑛nitalic_n pairs of camera extrinsic and intrinsic parameters: {(Ti,Ki)|i=1n}conditional-setsubscript𝑇𝑖subscript𝐾𝑖𝑖1𝑛\{(T_{i},K_{i})|i=1...n\}{ ( italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) | italic_i = 1 … italic_n };
   parameters: grid size ΔxΔ𝑥\Delta xroman_Δ italic_x; number of upsampling steps nusubscript𝑛𝑢n_{u}italic_n start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT; thresholds thmin𝑡subscript𝑚𝑖𝑛th_{min}italic_t italic_h start_POSTSUBSCRIPT italic_m italic_i italic_n end_POSTSUBSCRIPT, thmin𝑡subscript𝑚𝑖𝑛th_{min}italic_t italic_h start_POSTSUBSCRIPT italic_m italic_i italic_n end_POSTSUBSCRIPT;
Output:
   Continuum particles P~(t)~𝑃𝑡\tilde{P}(t)over~ start_ARG italic_P end_ARG ( italic_t ) and the corresponding surface S~(t)~𝑆𝑡\tilde{S}(t)over~ start_ARG italic_S end_ARG ( italic_t );

1:Randomly sample an initial particle set Pinsubscript𝑃𝑖𝑛P_{in}italic_P start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT from the bounding box of {μ(t)}𝜇𝑡\{\mu(t)\}{ italic_μ ( italic_t ) };
2:for i1,n𝑖1𝑛i\leftarrow 1,nitalic_i ← 1 , italic_n do
3:    D~i=GaussianSplatting(G(t),Ti,Ki)subscript~𝐷𝑖𝐺𝑎𝑢𝑠𝑠𝑖𝑎𝑛𝑆𝑝𝑙𝑎𝑡𝑡𝑖𝑛𝑔subscript𝐺𝑡subscript𝑇𝑖subscript𝐾𝑖\tilde{D}_{i}=GaussianSplatting(\mathbb{P}_{G}(t),T_{i},K_{i})over~ start_ARG italic_D end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_G italic_a italic_u italic_s italic_s italic_i italic_a italic_n italic_S italic_p italic_l italic_a italic_t italic_t italic_i italic_n italic_g ( blackboard_P start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT ( italic_t ) , italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ); \triangleright render depth map at view i𝑖iitalic_i
4:    (uin,vin),dinProj(Pin,Ti,Ki)subscript𝑢𝑖𝑛subscript𝑣𝑖𝑛subscript𝑑𝑖𝑛𝑃𝑟𝑜𝑗subscript𝑃𝑖𝑛subscript𝑇𝑖subscript𝐾𝑖(u_{in},v_{in}),d_{in}\leftarrow Proj(P_{in},T_{i},K_{i})( italic_u start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT ) , italic_d start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT ← italic_P italic_r italic_o italic_j ( italic_P start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT , italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_K start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ); \triangleright obtain image indices and depths of Pinsubscript𝑃𝑖𝑛P_{in}italic_P start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT at view i𝑖iitalic_i
5:    $P_{in} \leftarrow P_{in}[\tilde{D}_{i}(u_{in}, v_{in}) \leq d_{in}]$; $\triangleright$ filter out particles that are outside the object
6:end for
7:Initialize the zero-value density field $F(t)$ with $\Delta x$ and the bounding box of $\{\mu(t)\}$;
8:for $j \leftarrow 1, n_{u}$ do
9:    if $j \neq 1$ then
10:       $F(t) \leftarrow TrilinearInterpolation(F(t), 2)$; $\triangleright$ upsample $F(t)$ with scale factor 2
11:       $F(t)[p, q, r] = 1$, where $p, q, r \leftarrow Discretize(P_{in} \cup \{\mu(t)\})$;
12:    end if
13:    $F(t) \leftarrow MeanFiltering(F(t))$;
14:    $F(t)[p, q, r] = 1$, where $p, q, r \leftarrow Discretize(P_{in} \cup \{\mu(t)\})$;
15:end for
16:$\tilde{P}(t) \leftarrow GetPosition(th_{min} \leq F(t))$;
17:$\tilde{S}(t) \leftarrow GetPosition(th_{min} \leq F(t) \leq th_{max})$;
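
As a rough illustration of steps 7 to 17 of the algorithm, the following NumPy sketch fills a density field from a set of interior points. It is only a sketch under stated assumptions, not the authors' implementation: the function names, the `origin` argument, the 3×3×3 filter window, the edge padding, and the convention of returning voxel centres are all hypothetical choices, and the depth-based filtering of the earlier steps is assumed to have produced `points` already.

```python
import numpy as np

def upsample2_linear(field):
    """Upsample a 3D grid by 2 per axis using separable linear (trilinear) interpolation."""
    for axis in range(3):
        n = field.shape[axis]
        # Fine-grid cell centres expressed in coarse-grid index units, clamped at the border.
        x = np.clip((np.arange(2 * n) + 0.5) / 2.0 - 0.5, 0.0, n - 1.0)
        i0 = np.floor(x).astype(int)
        i1 = np.minimum(i0 + 1, n - 1)
        w = (x - i0).reshape([-1 if a == axis else 1 for a in range(3)])
        field = (1.0 - w) * np.take(field, i0, axis=axis) + w * np.take(field, i1, axis=axis)
    return field

def mean_filter3(field):
    """3x3x3 mean filter with edge replication (window size is an assumption)."""
    padded = np.pad(field, 1, mode="edge")
    nx, ny, nz = field.shape
    out = np.zeros_like(field)
    for dx in range(3):
        for dy in range(3):
            for dz in range(3):
                out += padded[dx:dx + nx, dy:dy + ny, dz:dz + nz]
    return out / 27.0

def discretize(points, origin, dx, shape):
    """Map world-space points to voxel indices, clamped to the grid."""
    idx = np.floor((points - origin) / dx).astype(int)
    idx = np.clip(idx, 0, np.array(shape) - 1)
    return idx[:, 0], idx[:, 1], idx[:, 2]

def fill_density_field(points, origin, dx, shape, n_u, th_min, th_max):
    """Coarse-to-fine filling: returns (field, continuum points, surface points, final dx)."""
    field = np.zeros(shape, dtype=np.float64)            # step 7
    for j in range(1, n_u + 1):                          # step 8
        if j != 1:                                       # steps 9-12
            field = upsample2_linear(field)
            dx = dx / 2.0
            field[discretize(points, origin, dx, field.shape)] = 1.0
        field = mean_filter3(field)                      # step 13
        field[discretize(points, origin, dx, field.shape)] = 1.0  # step 14

    def centres(mask):
        # Voxel centres of the selected cells in world coordinates.
        return origin + (np.argwhere(mask) + 0.5) * dx

    inside = field >= th_min                             # step 16
    surface = inside & (field <= th_max)                 # step 17
    return field, centres(inside), centres(surface), dx
```

Re-marking the known interior voxels after every smoothing pass keeps them at density 1, so only the voxels filled in by the filter take intermediate values, and the band between the two thresholds can be read off as the surface.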

Gaussian-informed continuum. In PAC-NeRF, the particles are equipped with appearance features to enable image rendering of the continuum at different states. We could achieve the same function by treating the particles as Gaussian kernels and re-training them on the visual data. However, this process is cumbersome and suffers from the same issue as PAC-NeRF: distorted RGB images are rendered when large deformation occurs. Therefore, instead of injecting appearance attributes, we assign density and scale attributes to the particles, where the densities originate from the density field and the scale attributes are obtained directly from the field grid size. The Gaussian-informed continuum is defined as a set of triplets:

$$\mathbb{P}_{\tilde{P}} = \{(\tilde{p}, s_{\Delta x}, \sigma_{F})\}, \qquad (5)$$

where $\tilde{p} \in \tilde{P}$, $s_{\Delta x} = \Delta x / 2^{n_{u}}$, and $\sigma_{F} = F[Discretize(\tilde{p})]$ (we neglect $t$ in the notation for simplicity). Therefore, we only render object masks as 2D shape surrogates for supervision.
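
The triplets of Eq. (5) can be assembled in a few lines. The sketch below is illustrative only: `build_continuum`, `origin`, and `grid_dx` are hypothetical names, and the voxel size of the final field is passed separately from $\Delta x$ and $n_u$ because its relation to $s_{\Delta x}$ is an assumption left to the caller.

```python
import numpy as np

def build_continuum(field, positions, origin, grid_dx, dx0, n_u):
    """Return (positions, scales, densities): one triplet per particle, as in Eq. (5).

    field     : final density field F, shape (X, Y, Z)
    positions : particle centres, shape (N, 3)
    origin    : world-space corner of the grid (hypothetical argument)
    grid_dx   : voxel size of `field` (hypothetical argument)
    dx0, n_u  : initial grid size and number of upsampling steps
    """
    idx = np.floor((positions - origin) / grid_dx).astype(int)
    idx = np.clip(idx, 0, np.array(field.shape) - 1)
    densities = field[idx[:, 0], idx[:, 1], idx[:, 2]]  # sigma_F = F[Discretize(p)]
    scales = np.full(len(positions), dx0 / 2 ** n_u)    # s = dx0 / 2^{n_u}
    return positions, scales, densities
```

Because every particle shares the same isotropic scale and carries no colour, the resulting set can only be rasterized into an occupancy-like image, which is consistent with supervising on masks rather than RGB.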

4.4 Geometry-aware Physical Property Estimation

With the Gaussian-informed continuum at the initial state $\mathbb{P}_{\tilde{P}}(0)$ and the extracted surfaces $\tilde{S}(t)$ in place, we can employ MPM to simulate the continuum and evaluate the difference in terms of both the explicit and implicit shapes. Concretely, after a rollout by MPM given the current estimate of the physical parameters, we obtain a trajectory $P(t)$ with corresponding object surfaces $S(t)$. We can thus render object masks over the trajectory. The loss of the current rollout is then computed as:

$$\mathcal{L}_{ppe} = \frac{1}{m}\sum_{i=1}^{m}\Big[\mathcal{L}_{CD}(S(t_{i}), \tilde{S}(t_{i})) + \frac{1}{n}\sum_{j=1}^{n}\mathcal{L}_{1}(A_{j}(t_{i}), \tilde{A}_{j}(t_{i}))\Big], \qquad (6)$$

where $\mathcal{L}_{CD}$ and $\mathcal{L}_{1}$ are the chamfer distance and the L1 norm, respectively, $S(t_{i})$ denotes the simulated surface at time $t_{i}$, $A_{j}(t_{i})$ is the rendered mask at view $j$, and $\tilde{A}_{j}(t_{i})$ represents the object mask of the image extracted from video $V_{j}$ at time $t_{i}$. Because the simulator is differentiable, the evaluated loss can be used to optimize the target physical parameters $\Theta$.
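
To make the structure of Eq. (6) concrete, here is a small NumPy sketch of the two terms and their averaging over $m$ time steps and $n$ views. Several details are assumptions rather than facts from the paper: the chamfer distance is taken as the symmetric mean of squared nearest-neighbour distances, the L1 term is averaged per pixel, and all function names are hypothetical. Plain NumPy is not differentiable, so an actual optimization would need the same computation inside an autodiff framework coupled to the simulator.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric chamfer distance between point sets a (N, 3) and b (M, 3).

    Assumed form: mean squared nearest-neighbour distance in both directions.
    """
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)  # (N, M) pairwise
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def mask_l1(rendered, target):
    """Per-pixel mean absolute difference between two masks of equal shape."""
    return np.abs(rendered - target).mean()

def ppe_loss(sim_surfaces, ref_surfaces, sim_masks, ref_masks):
    """Eq. (6): average over time steps of [chamfer term + mean over views of mask L1].

    sim_surfaces[i], ref_surfaces[i] : surface points at time t_i
    sim_masks[i][j], ref_masks[i][j] : masks at time t_i, view j
    """
    total = 0.0
    for s, s_ref, views, views_ref in zip(sim_surfaces, ref_surfaces, sim_masks, ref_masks):
        shape_term = chamfer_distance(s, s_ref)
        mask_term = np.mean([mask_l1(a, a_ref) for a, a_ref in zip(views, views_ref)])
        total += shape_term + mask_term
    return total / len(sim_surfaces)
```

The chamfer term compares 3D surfaces directly (explicit shape), while the mask term compares 2D silhouettes from each camera (implicit shape), so an error that is hard to see in one term can still be penalized by the other.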

Figure 4: Comparison between rendered and ground-truth images. (a) Rendered RGB images by PAC-NeRF. (b) Rendered masks by our method. (c)-(d) Ground-truth RGB images and masks. The mask-based supervision can introduce fewer discrepancies compared with the RGB-based guidance when the estimated shapes are correct.

5 Experiments

Datasets. To thoroughly assess our proposed method, we employ two sources of data introduced by PAC-NeRF [12] and Spring-Gaus [6]. Concretely, PAC-NeRF contributes two synthetic datasets generated by the MLS-MPM framework [18]. Each object in both datasets includes RGB images from 11 distinct viewpoints, with approximately 14 frames per viewpoint. The datasets feature a range of materials, including elastic and plastic objects, granular media, and both Newtonian and non-Newtonian fluids. The first dataset contains 9 objects with different shapes, while the second consists of 45 cross-shaped objects with different initial conditions and ground-truth values of physical properties. The interpretation of the physical parameters is given in Appendix A.6 and A.7. Spring-Gaus generates a synthetic dataset of elastic objects and collects a real-world dataset containing both static and dynamic scenes. The synthetic data contains 30 frames in each of 10 viewpoints. The real-world data contains only 3 viewpoints per object for the dynamic scene, but 50-70 images from various viewpoints for the static scene. Moreover, we follow previous works [12, 6] and use off-the-shelf matting [51] or segmentation [52] techniques to obtain object masks.

Table 1: System Identification Performance on PAC-NeRF Dataset

| Object | PAC-NeRF [12] | Ours | Ground Truth |
|---|---|---|---|
| Droplet | $\mu = 2.09 \times 10^{2}$, $\kappa = \bm{1.08 \times 10^{5}}$ | $\mu = \bm{2.01 \times 10^{2}}$, $\kappa = 0.18 \times 10^{5}$ | $\mu = 200$, $\kappa = 10^{5}$ |
| Letter | $\mu = 83.85$, $\kappa = 1.35 \times 10^{5}$ | $\mu = \bm{95.05}$, $\kappa = \bm{1.00 \times 10^{5}}$ | $\mu = 100$, $\kappa = 10^{5}$ |
| Cream | $\mu = 1.21 \times 10^{5}$, $\kappa = 1.57 \times 10^{6}$, $\tau_{Y} = 3.16 \times 10^{3}$, $\eta = 5.6$ | $\mu = \bm{1.03 \times 10^{4}}$, $\kappa = \bm{1.48 \times 10^{6}}$, $\tau_{Y} = \bm{2.98 \times 10^{3}}$, $\eta = \bm{6.6}$ | $\mu = 10^{4}$, $\kappa = 10^{6}$, $\tau_{Y} = 3 \times 10^{3}$, $\eta = 10$ |
| Toothpaste | $\mu = 6.51 \times 10^{3}$, $\kappa = 2.22 \times 10^{5}$, $\tau_{Y} = 228$, $\eta = \bm{9.77}$ | $\mu = \bm{4.19 \times 10^{3}}$, $\kappa = \bm{9.24 \times 10^{4}}$, $\tau_{Y} = \bm{226}$, $\eta = 9.1$ | $\mu = 5 \times 10^{3}$, $\kappa = 10^{5}$, $\tau_{Y} = 200$, $\eta = 10$ |
| Torus | $E = 1.04 \times 10^{6}$, $\nu = 0.322$ | $E = \bm{0.99 \times 10^{6}}$, $\nu = \bm{0.295}$ | $E = 10^{6}$, $\nu = 0.3$ |
| Bird | $E = 2.78 \times 10^{5}$, $\nu = 0.273$ | $E = \bm{3.08 \times 10^{5}}$, $\nu = \bm{0.284}$ | $E = 3 \times 10^{5}$, $\nu = 0.3$ |
| Playdoh | $E = 3.84 \times 10^{6}$, $\nu = 0.272$, $\tau_{Y} = 1.69 \times 10^{4}$ | $E = \bm{1.58 \times 10^{6}}$, $\nu = \bm{0.322}$, $\tau_{Y} = \bm{1.56 \times 10^{4}}$ | $E = 2 \times 10^{6}$, $\nu = 0.3$, $\tau_{Y} = 1.54 \times 10^{4}$ |
| Cat | $E = 1.61 \times 10^{5}$, $\nu = 0.293$, $\tau_{Y} = 3.57 \times 10^{3}$ | $E = \bm{0.98 \times 10^{6}}$, $\nu = \bm{0.296}$, $\tau_{Y} = \bm{3.76 \times 10^{3}}$ | $E = 10^{6}$, $\nu = 0.3$, $\tau_{Y} = 3.85 \times 10^{3}$ |
| Trophy | $\theta_{fric}^{0} = 36.1^{\circ}$ | $\theta_{fric}^{0} = \bm{38.0^{\circ}}$ | $\theta_{fric}^{0} = 40^{\circ}$ |

Table 2: Dynamic Reconstruction on PAC-NeRF Dataset

| Type | CD $\downarrow$: PAC-NeRF [12] | CD $\downarrow$: DefGS [16] | CD $\downarrow$: Ours | EMD $\downarrow$: PAC-NeRF [12] | EMD $\downarrow$: DefGS [16] | EMD $\downarrow$: Ours |
|---|---|---|---|---|---|---|
| Newtonian | 0.277 | 0.269 | 0.243 | 0.027 | 0.027 | 0.025 |
| Non-Newtonian | 0.236 | 0.216 | 0.195 | 0.025 | 0.024 | 0.022 |
| Elasticity | 0.238 | 0.191 | 0.178 | 0.025 | 0.022 | 0.02 |
| Plasticine | 0.429 | 0.213 | 0.196 | 0.029 | 0.024 | 0.022 |
| Sand | 0.212 | 0.281 | 0.25 | 0.025 | 0.028 | 0.025 |
| Mean | 0.278 | 0.234 | 0.212 | 0.026 | 0.025 | 0.023 |

Baselines. For dynamic reconstruction, we compare with PAC-NeRF and the current state-of-the-art deformable 3D Gaussian method DefGS [16] on the PAC-NeRF synthetic dataset. Further comparisons of our dynamic 3D Gaussian pipeline on other widely used datasets, such as D-NeRF [25], are presented in Appendix A.1.3. For system identification, we employ PAC-NeRF as the baseline and evaluate the performance using the two datasets introduced in PAC-NeRF. To further demonstrate the precision of the proposed method in terms of geometry recovery and future prediction, we perform experiments on the Spring-Gaus synthetic dataset and compare the results with PAC-NeRF and Spring-Gaus.

Metrics. The evaluation metrics in the experiments include 1) Chamfer Distance (CD), with units expressed in $10^{3}mm^{2}$; 2) Earth Mover’s Distance (EMD); 3) Peak Signal-to-Noise Ratio (PSNR); 4) Structural Similarity Index Metric (SSIM) [53]; and 5) Mean Absolute Error (MAE), with values scaled by a factor of $100$. The first two metrics are used to evaluate discrepancies between the reconstructed and ground-truth point clouds. PSNR and SSIM are leveraged on the Spring-Gaus dataset to validate the precision of future state prediction. We compute the mean absolute error for the evaluation of physical property estimation.
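
For reference, two of these metrics are simple enough to state in code. The sketch below uses the standard PSNR definition and a plain mean absolute error multiplied by 100. The function names and the assumption that image intensities lie in $[0, 1]$ are illustrative choices, not details taken from the evaluation code.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB; images are assumed to lie in [0, max_val]."""
    mse = np.mean((np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

def mae_x100(estimates, ground_truth):
    """Mean absolute error scaled by 100.

    The caller is expected to apply log10 first for parameters reported in log scale.
    """
    diff = np.abs(np.asarray(estimates, dtype=float) - np.asarray(ground_truth, dtype=float))
    return 100.0 * diff.mean()
```

As a usage example, a uniform intensity error of 0.1 on a unit-range image gives an MSE of 0.01 and hence a PSNR of about 20 dB, and an estimate that is off by one order of magnitude on one of two log-scale parameters gives a scaled MAE of 50.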

5.1 Evaluation on PAC-NeRF Synthetic Dataset

Table 3: System identification performance on PAC-NeRF cross-shaped object Dataset

| Type | Parameters | PAC-NeRF | Ours* | Ours |
|---|---|---|---|---|
| Newtonian | $\log_{10}(\mu)$ | 11.6 $\pm$ 6.60 | 1.53 $\pm$ 1.45 | 1.53 $\pm$ 1.31 |
| | $\log_{10}(\kappa)$ | 16.7 $\pm$ 5.37 | 16.0 $\pm$ 22.4 | 14.8 $\pm$ 19.2 |
| | $v$ | 0.86 $\pm$ 1.45 | 0.20 $\pm$ 0.08 | 0.20 $\pm$ 0.07 |
| Non-Newtonian | $\log_{10}(\mu)$ | 24.1 $\pm$ 21.9 | 32.9 $\pm$ 44.6 | 13.5 $\pm$ 18.2 |
| | $\log_{10}(\kappa)$ | 44.0 $\pm$ 26.3 | 17.7 $\pm$ 20.2 | 12.9 $\pm$ 16.8 |
| | $\log_{10}(\tau_{Y})$ | 5.09 $\pm$ 7.41 | 3.74 $\pm$ 3.72 | 4.80 $\pm$ 3.92 |
| | $\log_{10}(\eta)$ | 28.7 $\pm$ 23.3 | 34.9 $\pm$ 24.1 | 40.7 $\pm$ 24.6 |
| | $v$ | 0.29 $\pm$ 0.13 | 0.68 $\pm$ 0.28 | 0.19 $\pm$ 0.09 |
| Elasticity | $\log_{10}(E)$ | 3.02 $\pm$ 3.72 | 3.27 $\pm$ 4.13 | 2.43 $\pm$ 3.29 |
| | $\nu$ | 4.35 $\pm$ 5.08 | 3.10 $\pm$ 2.00 | 2.52 $\pm$ 2.03 |
| | $v$ | 0.50 $\pm$ 0.23 | 0.78 $\pm$ 0.26 | 0.82 $\pm$ 0.32 |
| Plasticine | $\log_{10}(E)$ | 83.8 $\pm$ 68.4 | 28.1 $\pm$ 24.4 | 25.6 $\pm$ 29.4 |
| | $\log_{10}(\tau_{Y})$ | 11.2 $\pm$ 14.5 | 1.24 $\pm$ 0.90 | 1.67 $\pm$ 1.21 |
| | $\nu$ | 18.9 $\pm$ 15.7 | 10.2 $\pm$ 5.34 | 9.59 $\pm$ 5.00 |
| | $v$ | 0.56 $\pm$ 0.17 | 0.13 $\pm$ 0.04 | 0.22 $\pm$ 0.10 |
| Sand | $\theta_{fric}$ | 4.89 $\pm$ 1.10 | 4.21 $\pm$ 0.08 | 4.18 $\pm$ 0.52 |
| | $v$ | 0.21 $\pm$ 0.08 | 0.24 $\pm$ 0.08 | 0.17 $\pm$ 0.05 |

Comparison on dynamic reconstruction. In this experiment, we first perform dynamic Gaussian reconstruction on the cross-shaped object dataset using DefGS and our proposed method, respectively. We then apply the same filling strategy to the reconstructed Gaussians at each time state to generate the continuum, which is regarded as the final recovered geometry of the object and compared with the oracle shape to compute CD and EMD. Since PAC-NeRF jointly recovers both geometries and physical parameters, we use its final estimated results to generate the trajectory for evaluation. The results, reported in Tab. 2, show that our method outperforms the baselines on both metrics on average and achieves more precise reconstruction on most object types. Specifically, we find that the NeRF representation used by PAC-NeRF usually produces overly large shapes. While DefGS performs well on elastic objects, its performance degrades when modeling objects with large deformations, such as granular media and fluids. Our method handles these objects better due to the flexibility of its trajectory representation.

Table 4: Future State Simulation on Spring-Gaus Synthetic Dataset

| Metric | Method | torus | cross | cream | apple | paste | chess | banana | Mean |
|---|---|---|---|---|---|---|---|---|---|
| CD $\downarrow$ | Spring-Gaus [6] | 2.38 | 1.57 | 2.22 | 1.87 | 7.03 | 2.59 | 18.48 | 5.16 |
| | PAC-NeRF [12] | 2.47 | 3.87 | 2.21 | 4.69 | 37.7 | 8.2 | 66.43 | 17.94 |
| | Ours | 0.75 | 1.09 | 0.94 | 0.22 | 2.79 | 0.77 | 0.12 | 0.95 |
| EMD $\downarrow$ | Spring-Gaus [6] | 0.087 | 0.051 | 0.094 | 0.076 | 0.126 | 0.095 | 0.135 | 0.095 |
| | PAC-NeRF [12] | 0.055 | 0.111 | 0.083 | 0.108 | 0.192 | 0.155 | 0.234 | 0.134 |
| | Ours | 0.034 | 0.058 | 0.050 | 0.030 | 0.096 | 0.059 | 0.017 | 0.049 |
| PSNR $\uparrow$ | Spring-Gaus [6] | 16.83 | 16.93 | 15.42 | 21.55 | 14.71 | 16.08 | 17.89 | 17.06 |
| | PAC-NeRF [12] | 17.46 | 14.15 | 15.37 | 19.94 | 12.32 | 15.08 | 16.04 | 15.77 |
| | Ours | 20.24 | 30.51 | 19.15 | 26.89 | 16.31 | 18.44 | 29.29 | 22.98 |
| SSIM $\uparrow$ | Spring-Gaus [6] | 0.919 | 0.940 | 0.862 | 0.902 | 0.872 | 0.881 | 0.904 | 0.897 |
| | PAC-NeRF [12] | 0.913 | 0.906 | 0.858 | 0.878 | 0.819 | 0.848 | 0.886 | 0.870 |
| | Ours | 0.942 | 0.939 | 0.909 | 0.948 | 0.894 | 0.912 | 0.964 | 0.930 |

Comparison on system identification. We evaluate system identification performance on the two datasets proposed by PAC-NeRF. For the first dataset, we run our method 10 times with different random seeds for each object instance and report the mean of the estimated values. For the second dataset, we compute the MAE of the parameters for each type of object. To demonstrate the effectiveness of the implicit shape representation, we also conduct experiments on the second dataset using only masks for supervision in our method, denoted “Ours*”. The training details are given in Appendix A.3.

The results, reported in Tab. 1 and Tab. 3, show that the proposed hybrid pipeline achieves more accurate estimation over a wide range of entries and objects, demonstrating the effectiveness of the geometry-aware guidance. Fig. 4 visualizes the RGB images rendered by PAC-NeRF and the masks rendered by our method. When large deformation occurs, the rendered RGB image becomes distorted, whereas the rendered mask is much less affected and yields better performance. By leveraging both explicit and implicit shape guidance, our method obtains the best results on most entries. More qualitative results are available in the supplementary video.

5.2 Evaluation on Spring-Gaus Synthetic Dataset

Comparison on future state simulation. To further demonstrate the performance of our proposed method, we follow the setting in Spring-Gaus [6], which uses the first 20 frames as training data and the subsequent 10 frames for evaluation. Concretely, we first perform system identification with our method and then use the estimated physical parameters and the continuum to simulate a trajectory that includes the states of all 30 frames. We can therefore compute CD and EMD between the simulated continuum and the ground-truth point cloud. Since we know the exact position of the continuum at each time state after estimation, we can assign time-invariant Gaussian attributes by training Gaussians on the continuum using the first 20 frames of RGB images, which enables image rendering at novel views and states. This in turn allows us to compute PSNR and SSIM at any time state.

The results of future state prediction are presented in Tab. 4, and the results of reconstruction on the training states are reported in Appendix A.4. We observe that our method significantly outperforms the baselines on the CD and EMD metrics over almost all object instances, which shows the superiority of our method for both geometry recovery and system identification. The PSNR and SSIM results show that leveraging dynamic visual data to train the Gaussian attributes on the continuum improves rendering quality. This further indicates that the generated trajectories are precise enough that each particle consistently contributes to rendering the same region of the object across time states.

5.3 Real-world Application: Digital Twins in Robotic Grasping Scenario

Figure 5: Real-world application. Left: Identification and future state simulation. Right: Grasping simulation. The stress on the simulated object is indicated by blue (low) to red (high). The gripper widths from top to bottom are set to 6cm, 4.5cm, and 3.5cm, respectively.

To demonstrate the efficacy of the proposed method in real-world scenarios, we perform system identification on the real-world dataset collected by Spring-Gaus [6], as shown in Fig. 5. Since the real-world dataset consists of static and dynamic scenes for each object, we follow the procedure introduced by Spring-Gaus to progressively 1) reconstruct a Gaussian set of the object from the static scene, 2) transform the static Gaussian set to the initial state of the dynamic scene using a registration network similar to iNeRF [6, 54], and 3) perform system identification from the dynamic observation with the mask-only variant “Ours*”, because there are too few images for dynamic reconstruction. Subsequently, we establish robotic platforms in both simulated and real-world environments, each equipped with UR10 robot arms configured identically. We then execute grasp attempts on both the reconstructed objects with the estimated properties in simulation and the corresponding real-world objects under the same configuration. Results on more objects and further details about the training and the experimental setting are presented in Appendix A.5. The results in Fig. 5 show that our method effectively models the deformation experienced by the objects upon impact with a surface. Furthermore, by applying identical gripper forces to both the simulated and real-world versions of the objects, we observe similar deformation behaviors. This consistency in deformation under identical conditions supports that the estimated physical parameters closely mirror the real-world properties of the objects.

6 Conclusion and Limitations

This paper proposes a novel solution that leverages the 3D Gaussian representation of objects to acquire explicit shapes while concurrently enabling the simulated continuum to infer implicit shapes to facilitate the estimation of physical properties. A novel motion-factorized dynamic 3D Gaussian framework is proposed to reconstruct precise dynamic scenes. Object surfaces and Gaussian-informed continuum are obtained by utilizing the proposed coarse-to-fine density field generation strategy. Extensive experiments demonstrate the efficacy and applicability of our method.

Despite the performance we achieve, this method still suffers from limitations, such as the assumption of continuum mechanics, the requirement of multi-view images with known camera poses, and the need for prior knowledge of object constitutive models. Integrating a pose-free method [55] or a generalized constitutive model [56] with our method will be an interesting direction for future work.

References

  • [1] Hang Yin, Anastasia Varava, and Danica Kragic. Modeling, learning, perception, and control methods for deformable object manipulation. Science Robotics, 6(54):eabd8803, 2021.
  • [2] Haochen Shi, Huazhe Xu, Samuel Clarke, Yunzhu Li, and Jiajun Wu. Robocook: Long-horizon elasto-plastic object manipulation with diverse tools. In Proceedings of the Conference on Robot Learning (CoRL), pages 642–660. PMLR, 2023.
  • [3] Haochen Shi, Huazhe Xu, Zhiao Huang, Yunzhu Li, and Jiajun Wu. Robocraft: Learning to see, simulate, and shape elasto-plastic objects in 3d with graph networks. The International Journal of Robotics Research (IJRR), 43(4):533–549, 2024.
  • [4] Bin Wang, Longhua Wu, KangKang Yin, Uri Ascher, Libin Liu, and Hui Huang. Deformation capture and modeling of soft objects. ACM Transactions on Graphics (TOG), 34(4):1–12, 2015.
  • [5] Hsiao-yu Chen, Edith Tretschk, Tuur Stuyck, Petr Kadlecek, Ladislav Kavan, Etienne Vouga, and Christoph Lassner. Virtual elastic objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15827–15837, 2022.
  • [6] Licheng Zhong, Hong-Xing Yu, Jiajun Wu, and Yunzhu Li. Reconstruction and simulation of elastic objects with spring-mass 3d gaussians. arXiv preprint arXiv:2403.09434, 2024.
  • [7] Matthias Müller and Markus H Gross. Interactive virtual materials. In Graphics interface, volume 2004, pages 239–246, 2004.
  • [8] Miguel Jaques, Michael Burke, and Timothy Hospedales. Physics-as-inverse-graphics: Unsupervised physical parameter estimation from video. In Proceedings of the International Conference on Learning Representations (ICLR), 2020.
  • [9] Pingchuan Ma, Tao Du, Joshua B Tenenbaum, Wojciech Matusik, and Chuang Gan. Risp: Rendering-invariant state predictor with differentiable simulation and rendering for cross-domain parameter estimation. In Proceedings of the International Conference on Learning Representations (ICLR), 2021.
  • [10] Pingchuan Ma, Peter Yichen Chen, Bolei Deng, Joshua B Tenenbaum, Tao Du, Chuang Gan, and Wojciech Matusik. Learning neural constitutive laws from motion observations for generalizable pde dynamics. In International Conference on Machine Learning (ICML), pages 23279–23300. PMLR, 2023.
  • [11] Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollhöfer, Christoph Lassner, and Christian Theobalt. Non-rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12959–12970, 2021.
  • [12] Xuan Li, Yi-Ling Qiao, Peter Yichen Chen, Krishna Murthy Jatavallabhula, Ming Lin, Chenfanfu Jiang, and Chuang Gan. Pac-nerf: Physics augmented continuum neural radiance fields for geometry-agnostic system identification. In Proceedings of the International Conference on Learning Representations (ICLR), 2022.
  • [13] Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5459–5469, 2022.
  • [14] Yutao Feng, Yintong Shang, Xuan Li, Tianjia Shao, Chenfanfu Jiang, and Yin Yang. Pie-nerf: Physics-based interactive elastodynamics with nerf. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  • [15] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (TOG), 42(4):1–14, 2023.
  • [16] Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  • [17] Chenfanfu Jiang, Craig Schroeder, Joseph Teran, Alexey Stomakhin, and Andrew Selle. The material point method for simulating continuum materials. In ACM SIGGRAPH 2016 Courses, pages 1–52. ACM New York, NY, USA, 2016.
  • [18] Yuanming Hu, Yu Fang, Ziheng Ge, Ziyin Qu, Yixin Zhu, Andre Pradhana, and Chenfanfu Jiang. A moving least squares material point method with displacement discontinuity and two-way rigid body coupling. ACM Transactions on Graphics (TOG), 37(4):1–14, 2018.
  • [19] Li Zhang, Brian Curless, and Steven M Seitz. Spacetime stereo: Shape recovery for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages II–367. IEEE, 2003.
  • [20] Richard A Newcombe, Dieter Fox, and Steven M Seitz. Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 343–352, 2015.
  • [21] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
  • [22] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. Advances in Neural Information Processing Systems (NeurIPS), 34:27171–27183, 2021.
  • [23] Tianye Li, Mira Slavcheva, Michael Zollhoefer, Simon Green, Christoph Lassner, Changil Kim, Tanner Schmidt, Steven Lovegrove, Michael Goesele, Richard Newcombe, et al. Neural 3d video synthesis from multi-view video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5521–5531, 2022.
  • [24] Yiming Wang, Qin Han, Marc Habermann, Kostas Daniilidis, Christian Theobalt, and Lingjie Liu. Neus2: Fast learning of neural implicit surfaces for multi-view reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3295–3306, 2023.
  • [25] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10318–10327, 2021.
  • [26] Keunhong Park, Utkarsh Sinha, Peter Hedman, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Ricardo Martin-Brualla, and Steven M Seitz. Hypernerf: a higher-dimensional representation for topologically varying neural radiance fields. ACM Transactions on Graphics (TOG), 40(6):1–12, 2021.
  • [27] Ang Cao and Justin Johnson. Hexplane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 130–141, 2023.
  • [28] Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. In Proceedings of the International Conference on 3D Vision (3DV), 2024.
  • [29] Agelos Kratimenos, Jiahui Lei, and Kostas Daniilidis. Dynmf: Neural motion factorization for real-time dynamic view synthesis with 3d gaussian splatting. arXiv preprint arXiv:2312.00112, 2023.
  • [30] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Wang Xinggang. 4d gaussian splatting for real-time dynamic scene rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  • [31] Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational physics, 378:686–707, 2019.
  • [32] Jinxi Li, Ziyang Song, and Bo Yang. Nvfi: Neural velocity fields for 3d physics learning from dynamic videos. Advances in Neural Information Processing Systems (NeurIPS), 36, 2024.
  • [33] Yi-Ling Qiao, Alexander Gao, and Ming Lin. Neuphysics: Editable neural geometry and physics from monocular videos. Advances in Neural Information Processing Systems (NeurIPS), 35:12841–12854, 2022.
  • [34] Barbara Frank, Rüdiger Schmedding, Cyrill Stachniss, Matthias Teschner, and Wolfram Burgard. Learning the elasticity parameters of deformable objects with a manipulation robot. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1877–1883. IEEE, 2010.
  • [35] Zhenjia Xu, Jiajun Wu, Andy Zeng, Joshua B Tenenbaum, and Shuran Song. Densephysnet: Learning dense physical object representations via multi-step dynamic interactions. In Proceedings of the Robotics: Science and Systems, 2019.
  • [36] J Krishna Murthy, Miles Macklin, Florian Golemo, Vikram Voleti, Linda Petrini, Martin Weiss, Breandan Considine, Jérôme Parent-Lévesque, Kevin Xie, Kenny Erleben, et al. gradsim: Differentiable simulation for system identification and visuomotor control. In Proceedings of the International Conference on Learning Representations (ICLR), 2020.
  • [37] Moritz Geilinger, David Hahn, Jonas Zehnder, Moritz Bächer, Bernhard Thomaszewski, and Stelian Coros. Add: Analytically differentiable dynamics for multi-body systems with frictional contact. ACM Transactions on Graphics (TOG), 39(6):1–15, 2020.
  • [38] Eric Heiden, Miles Macklin, Yashraj Narang, Dieter Fox, Animesh Garg, and Fabio Ramos. Disect: A differentiable simulation engine for autonomous robotic cutting. In Proceedings of the Robotics: Science and Systems, 2021.
  • [39] Tao Du, Kui Wu, Pingchuan Ma, Sebastien Wah, Andrew Spielberg, Daniela Rus, and Wojciech Matusik. Diffpd: Differentiable projective dynamics. ACM Transactions on Graphics (TOG), 41(2):1–21, 2021.
  • [40] Yiling Qiao, Junbang Liang, Vladlen Koltun, and Ming Lin. Differentiable simulation of soft multi-body systems. Advances in Neural Information Processing Systems (NeurIPS), 34:17123–17135, 2021.
  • [41] Chenfanfu Jiang, Craig Schroeder, Andrew Selle, Joseph Teran, and Alexey Stomakhin. The affine particle-in-cell method. ACM Transactions on Graphics (TOG), 34(4):1–10, 2015.
  • [42] Yonghao Yue, Breannan Smith, Christopher Batty, Changxi Zheng, and Eitan Grinspun. Continuum foam: A material point method for shear-dependent flows. ACM Transactions on Graphics (TOG), 34(5):1–20, 2015.
  • [43] Gergely Klár, Theodore Gast, Andre Pradhana, Chuyuan Fu, Craig Schroeder, Chenfanfu Jiang, and Joseph Teran. Drucker-prager elastoplasticity for sand animation. ACM Transactions on Graphics (TOG), 35(4):1–12, 2016.
  • [44] Matthias Zwicker, Hanspeter Pfister, Jeroen Van Baar, and Markus Gross. Surface splatting. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 371–378, 2001.
  • [45] Youtian Lin, Zuozhuo Dai, Siyu Zhu, and Yao Yao. Gaussian-flow: 4d reconstruction with dynamic 3d gaussian particle. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  • [46] Tianyi Xie, Zeshun Zong, Yuxin Qiu, Xuan Li, Yutao Feng, Yin Yang, and Chenfanfu Jiang. Physgaussian: Physics-integrated 3d gaussians for generative dynamics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  • [47] Eduardo WV Chaves. Notes on continuum mechanics. Springer Science & Business Media, 2013.
  • [48] Vladimir Yugay, Yue Li, Theo Gevers, and Martin R Oswald. Gaussian-slam: Photo-realistic dense slam with gaussian splatting. arXiv preprint arXiv:2312.10070, 2023.
  • [49] Kai Katsumata, Duc Minh Vo, and Hideki Nakayama. An efficient 3d gaussian representation for monocular/multi-view dynamic scenes. arXiv preprint arXiv:2311.12897, 2023.
  • [50] Hanlin Chen, Chen Li, and Gim Hee Lee. Neusg: Neural implicit surface reconstruction with 3d gaussian splatting guidance. arXiv preprint arXiv:2312.00846, 2023.
  • [51] Shanchuan Lin, Andrey Ryabtsev, Soumyadip Sengupta, Brian L Curless, Steven M Seitz, and Ira Kemelmacher-Shlizerman. Real-time high-resolution background matting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8762–8771, 2021.
  • [52] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4015–4026, 2023.
  • [53] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • [54] Lin Yen-Chen, Pete Florence, Jonathan T Barron, Alberto Rodriguez, Phillip Isola, and Tsung-Yi Lin. inerf: Inverting neural radiance fields for pose estimation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1323–1330. IEEE, 2021.
  • [55] Yang Fu, Sifei Liu, Amey Kulkarni, Jan Kautz, Alexei A Efros, and Xiaolong Wang. Colmap-free 3d gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  • [56] Haozhe Su, Xuan Li, Tao Xue, Chenfanfu Jiang, and Mridul Aanjaneya. A generalized constitutive model for versatile mpm simulation and inverse learning with differentiable physics. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 6(3):1–20, 2023.
  • [57] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
  • [58] Ruizhi Shao, Zerong Zheng, Hanzhang Tu, Boning Liu, Hongwen Zhang, and Yebin Liu. Tensor4d: Efficient neural 4d decomposition for high-fidelity dynamic reconstruction and rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16632–16642, 2023.
  • [59] Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K-planes: Explicit radiance fields in space, time, and appearance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12479–12488, 2023.
  • [60] Jiemin Fang, Taoran Yi, Xinggang Wang, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Matthias Nießner, and Qi Tian. Fast dynamic radiance fields with time-aware neural voxels. In SIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022.
  • [61] Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021.
  • [62] Isabella Huang, Yashraj Narang, Clemens Eppner, Balakumar Sundaralingam, Miles Macklin, Ruzena Bajcsy, Tucker Hermans, and Dieter Fox. Defgraspsim: Physics-based simulation of grasp outcomes for 3d deformable objects. IEEE Robotics and Automation Letters, 7(3):6274–6281, 2022.
  • [63] William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3d surface construction algorithm. ACM SIGGRAPH Computer Graphics, 21(4):163–169, 1987.
  • [64] Yixin Hu, Teseo Schneider, Bolun Wang, Denis Zorin, and Daniele Panozzo. Fast tetrahedral meshing in the wild. ACM Transactions on Graphics (TOG), 39(4):117–1, 2020.

Appendix A

A.1 Motion-factorized Dynamic 3D Gaussian Network

A.1.1 Implementation details

All the modules within the proposed network are composed of fully connected layers. The intermediate layers are uniformly designed, featuring both input and output channels configured to 256, and employ ReLU activation. For training, we adhere to the protocol established in [16], utilizing the Adam optimizer [57] with the same learning rate as specified in [16]. The total number of iterations is set at 40,000, with densification and pruning operations conducted every 500 steps until reaching 15,000 iterations. Additionally, the number of motions $N_m$ is set to 8 for all objects in our network. $\lambda_1$ and $\lambda_2$ in Eqn. 4 are both set to 1. All the experiments are conducted on a single A10 GPU.

A.1.2 Effects of scale regularization

Figure 6: Visualization of trophy sequences. Row 1: rendering results from the network trained without scale regularization. Row 2: rendering results from the network trained with scale regularization.

When addressing the deformation of objects such as fluids or granular media, the network may struggle to fit transformations accurately due to significant discrepancies between the canonical and target shapes. To compensate, the network may employ Gaussians with enlarged scales to hide shape distortions during image rendering. This effect is visualized in the top row of Fig. 6. To rectify this issue, we apply scale regularization during network training, which encourages the Gaussian kernels to keep small scales. The efficacy of this operation is demonstrated in the second row of Fig. 6, where scale regularization enables the reconstruction of more precise shapes for rendering.

A.1.3 Evaluation on D-NeRF Dataset

To further evaluate the performance of our method in terms of novel view synthesis, we conduct an experiment on the D-NeRF [25] dataset, a widely used benchmark consisting of moving objects captured by a monocular camera. We compute PSNR on the D-NeRF test set and compare our method with previous dynamic approaches, including Tensor4D [58], K-Planes [59], TiNeuVox [60], and DefGS [16]. The results, reported in Tab. 5, show that the proposed dynamic 3D Gaussian pipeline achieves the highest mean PSNR and the best result on every scene except T-Rex.

Table 5: PSNR (↑) on the D-NeRF [25] dataset.
| Method | Hell Warrior | Mutant | Hook | Bouncing Balls | T-Rex | Stand Up | Jumping Jacks | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tensor4D [58] | 31.26 | 29.11 | 28.63 | 24.47 | 23.86 | 30.56 | 24.2 | 27.44 |
| K-Planes [59] | 24.58 | 32.5 | 28.12 | 40.05 | 30.43 | 33.1 | 31.11 | 31.41 |
| TiNeuVox [60] | 27.1 | 31.87 | 30.61 | 40.23 | 31.25 | 34.61 | 33.49 | 32.74 |
| DefGS [16] | 41.54 | 42.63 | 37.42 | 41.01 | 38.1 | 44.62 | 37.72 | 40.43 |
| Ours | 41.97 | 42.93 | 38.04 | 41.26 | 37.54 | 45.32 | 38.86 | 40.85 |
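
For reference, the PSNR metric reported in Tab. 5 can be sketched as follows. This is a minimal illustration under the usual definition, assuming images stored as floating-point arrays with values in [0, 1]; the function name `psnr` and the `max_val` parameter are our own choices and are not taken from the evaluation code used in the paper.

```python
import numpy as np


def psnr(rendered: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio (dB) between two images in [0, max_val]."""
    diff = rendered.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0.0:
        # Identical images: PSNR is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))


# Toy example: a constant per-pixel error of 0.1 gives an MSE of 0.01,
# i.e. a PSNR of approximately 20 dB.
target = np.zeros((4, 4, 3))
rendered = np.full((4, 4, 3), 0.1)
print(psnr(rendered, target))
```

A per-scene score would typically be the average of this value over all test views of that scene.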

A.2 Gaussian-informed Continuum Generation

A.2.1 Implementation details

In Alg. 1, the number of iterations, denoted as $n_u$, is uniformly set to 4 for all objects. We set the initial grid size $\Delta x$ according to the volume of the object. For most objects, $\Delta x = 0.1$, while for small items such as the toothpaste in the PAC-NeRF dataset, $\Delta x = 0.01$. The parameters $th_{min}$ and $th_{max}$ are set to 0.5 and 0.8, respectively. The resulting particle count ranges from approximately 50,000 to 100,000.

A.2.2 Visualization of coarse-to-fine filling

Fig. 7 visualizes the filling results of our proposed coarse-to-fine strategy with different numbers of iterations, along with the results from PAC-NeRF and the ground-truth shapes. The qualitative results show that our method generates more accurate shapes than PAC-NeRF, which tends to recover overly large shapes. Note that we could not reproduce the cat-shaped object reported in [12], even though we used the official PAC-NeRF code without any modification.

Figure 7: Visualization of coarse-to-fine filling. (a)-(d) show the filling results of our method with $n_u = 2$, $3$, $4$, and $5$ upsampling operations, respectively. (e) visualizes the point clouds recovered by PAC-NeRF. (f) shows the ground-truth shapes (oracle).

A.3 Training details on PAC-NeRF Dataset

The training process is divided into two stages: we first estimate the initial velocity of the object using the first three frames of data, and then perform system identification. Both stages use the Adam optimizer [57] to tune the parameters.

A.4 More Experiments on Spring-Gaus Synthetic Dataset

Besides evaluating the simulated future states in Sec. 5.2, we also evaluate CD and EMD on states present in the training data, and the results are reported in Tab. 6. Our method achieves the lowest mean CD and EMD and outperforms both baselines on every object except for EMD on paste, where Spring-Gaus is slightly better. This further demonstrates the performance of our method in terms of reconstruction and identification.

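Table 6: Dynamic reconstruction on the Spring-Gaus synthetic dataset.
| Metric | Method | torus | cross | cream | apple | paste | chess | banana | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CD ↓ | Spring-Gaus [6] | 0.17 | 0.48 | 0.36 | 0.38 | 0.19 | 1.80 | 2.60 | 0.85 |
| CD ↓ | PAC-NeRF [12] | 4.92 | 1.10 | 0.77 | 1.11 | 3.14 | 0.96 | 2.77 | 2.11 |
| CD ↓ | Ours | 0.13 | 0.13 | 0.14 | 0.15 | 0.17 | 0.41 | 0.03 | 0.17 |
| EMD ↓ | Spring-Gaus [6] | 0.040 | 0.037 | 0.031 | 0.033 | 0.022 | 0.063 | 0.052 | 0.040 |
| EMD ↓ | PAC-NeRF [12] | 0.056 | 0.052 | 0.041 | 0.045 | 0.054 | 0.052 | 0.062 | 0.052 |
| EMD ↓ | Ours | 0.020 | 0.020 | 0.019 | 0.020 | 0.025 | 0.036 | 0.011 | 0.022 |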
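
The two point-cloud metrics in Tab. 6 can be illustrated with the following sketch. It follows one common convention (squared nearest-neighbor distances for CD, optimal one-to-one matching for EMD) and is an assumption on our part: the exact normalization and units used in the evaluation are not specified in this appendix, and the helper names are hypothetical. The EMD variant shown here requires both point sets to have the same size.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def pairwise_dist(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Euclidean distances between every point of a (N, 3) and b (M, 3)."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance using mean squared nearest-neighbor distances."""
    d = pairwise_dist(a, b)
    return float(np.mean(d.min(axis=1) ** 2) + np.mean(d.min(axis=0) ** 2))


def earth_movers_distance(a: np.ndarray, b: np.ndarray) -> float:
    """EMD for equal-sized sets: mean distance under the optimal one-to-one matching."""
    if a.shape != b.shape:
        raise ValueError("this sketch assumes point sets of equal size")
    d = pairwise_dist(a, b)
    rows, cols = linear_sum_assignment(d)  # minimum-cost bipartite matching
    return float(d[rows, cols].mean())


# Toy example: the second cloud is the first one shifted by 0.5 along z,
# so every point lies 0.5 away from its nearest neighbor and its match.
cloud_a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
cloud_b = cloud_a + np.array([0.0, 0.0, 0.5])
print(chamfer_distance(cloud_a, cloud_b), earth_movers_distance(cloud_a, cloud_b))
```

The dense distance matrix makes this sketch quadratic in memory, so clouds of the size used for simulation (tens of thousands of particles) would normally be subsampled or handled with a spatial index.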

A.5 Experiment Setting for Spring-Gaus Real-world Dataset

Training details. The dynamic scenes in Spring-Gaus [6] contain only three viewpoints, which are insufficient for dynamic 3D Gaussian reconstruction. Conversely, the static scenes incorporate 50 to 70 images captured from various viewpoints. Following the protocol established in Spring-Gaus, we reconstruct 3D Gaussian points from the static scenes using the traditional 3D Gaussian Splatting (3DGS) technique [15]. Subsequently, we transform the static Gaussian set to the initial configuration of the dynamic scene, guided by the relative pose between the two scenes. The pose is estimated iteratively based on the discrepancies observed between the rendered images and the actual images at the initial state of the dynamic scene. After pose estimation, we implement our methodology, which leverages only implicit shape guidance, to conduct system identification.

Experimental setting. We conducted grasping experiments using the UR10 robotic arm equipped with the Robotiq140 dexterous gripper in both simulated and real-world settings, ensuring consistency in the mass of the objects and their grasping poses across both environments. For the simulations, we employed the FEM-based Isaac Gym simulator [61] for its advanced capabilities in realistically simulating deformable objects [62]. To facilitate the simulation of deformable objects, we apply the Marching Cubes algorithm [63] to the generated density fields to derive the object meshes. Subsequently, we utilize fTetWild [64] for the tetrahedralization of these meshes.

More results. Qualitative results of grasp demonstrations on pig and dog objects are shown in Fig. 8.

Figure 8: The stress on the simulated objects is indicated by blue (low) to red (high). The gripper widths from left to right for each object are set to 5.5cm, 4.5cm, and 3.5cm, respectively.

A.6 Physical Properties

In this work, we simulate five types of materials: elastic materials, plasticine, granular media, Newtonian fluids, and non-Newtonian fluids. Each material exhibits distinct physical properties. We provide a brief introduction to the properties of each material.

Elasticity: The Young’s modulus ($E$) is a measure of the stiffness of a solid material, quantifying the relationship between stress and strain in a material under elastic deformation. The Poisson’s ratio ($\nu$) describes the tendency of a material to expand or contract along its width when it is stretched or compressed along its length.

Plasticine: The yield stress ($\tau_Y$) is the minimum stress that a material requires to transition from elastic deformation to plastic deformation, marking the onset of permanent deformation. Both Young’s modulus ($E$) and Poisson’s ratio ($\nu$) exhibit characteristics similar to those of elastic materials.

Granular Media: The friction angle ($\theta_{fric}$) is a measure of the inherent resistance of a granular material to sliding or shearing, directly related to the angle at which a material can be piled without slumping.

Newtonian fluids: The bulk modulus ($\kappa$) is a measure of a material’s resistance to uniform compression, quantifying how much it compresses under a given amount of external pressure. Fluid viscosity ($\mu$) describes a fluid’s resistance to flow, quantifying how much it resists deformation at a given rate.

Non-Newtonian fluids: The plasticity viscosity ($\eta$) refers to the measure of a viscoplastic material’s resistance to deformation, which defines how it behaves under stress beyond its yield point. The bulk modulus ($\kappa$) and fluid viscosity ($\mu$) are comparable to those of Newtonian fluids, while the yield stress ($\tau_Y$) is akin to that of plasticine.

A.7 Constitutive Models

A constitutive model describes how a material responds to stress, strain, or other external forces. It defines the material’s behavior by relating stress and strain through constitutive equations, which can capture complex behaviors such as elasticity, plasticity, and fracture. The MPM simulator is capable of modeling a diverse range of materials by employing various constitutive models. In this work, we have implemented simulations for five distinct types of materials: elastic materials, plasticine, granular media, Newtonian fluids, and non-Newtonian fluids.

Elasticity. We use the Neo-Hookean model, which is a common nonlinear hyperelastic model, to simulate the elasticity of materials and predict deformations. The Cauchy stress for this model is defined by

$$J\bm{\sigma} = \mu\left(\mathbf{F}\mathbf{F}^{\intercal}\right) + \left[\lambda\log(J) - \mu\right]\mathbf{I}, \tag{7}$$

where $\mathbf{F}$ is the deformation gradient, $J = \det(\mathbf{F})$, and $\mu, \lambda$ are the Lamé parameters, which are related to the material properties of Young’s modulus ($E$) and Poisson’s ratio ($\nu$) as:

$$\mu = \frac{E}{2(1+\nu)}, \qquad \lambda = \frac{E\nu}{(1+\nu)(1-2\nu)}. \tag{8}$$
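
As a minimal sketch of Eqs. (7) and (8), the snippet below converts Young’s modulus and Poisson’s ratio to the Lamé parameters and evaluates the Neo-Hookean stress $J\bm{\sigma}$ for a single deformation gradient. The function names and the example values of $E$ and $\nu$ are illustrative assumptions; the actual simulator evaluates this per particle inside the MPM loop.

```python
import numpy as np


def lame_parameters(E: float, nu: float) -> tuple[float, float]:
    """Lamé parameters (mu, lambda) from Young's modulus and Poisson's ratio, Eq. (8)."""
    mu = E / (2.0 * (1.0 + nu))
    lam = E * nu / ((1.0 + nu) * (1.0 - 2.0 * nu))
    return mu, lam


def neo_hookean_stress(F: np.ndarray, E: float, nu: float) -> np.ndarray:
    """J * sigma for the Neo-Hookean model, Eq. (7). Requires det(F) > 0."""
    mu, lam = lame_parameters(E, nu)
    J = np.linalg.det(F)
    identity = np.eye(F.shape[0])
    return mu * (F @ F.T) + (lam * np.log(J) - mu) * identity


# Illustrative values only (not taken from the experiments).
E_example, nu_example = 1.0e5, 0.3
rest_stress = neo_hookean_stress(np.eye(3), E_example, nu_example)
stretched_stress = neo_hookean_stress(np.diag([1.1, 1.0, 1.0]), E_example, nu_example)
print(rest_stress)       # the undeformed state carries no stress
print(stretched_stress)  # stretching along x produces tension along x
```

Note that $\lambda$ diverges as $\nu \to 0.5$, which is why nearly incompressible materials are numerically stiff in this parameterization.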

Plasticine. We use the Saint Venant-Kirchhoff (StVK) model together with the von Mises yield criterion to simulate plasticine. For this model, the stress is defined as:

$$J\bm{\sigma} = \mathbf{F}\left[2\mu\mathbf{G} + \lambda\mathrm{Tr}(\mathbf{G})\mathbf{I}\right]\mathbf{F}^{\intercal}, \tag{9}$$

where $\mathbf{G} = \frac{1}{2}\left(\mathbf{F}^{\intercal}\mathbf{F} - \mathbf{I}\right)$ is the Green strain. The von Mises yield criterion serves as a tool to assess whether the deformation exceeds the recoverable limit. The deformation gradient will be mapped back onto the boundary of the elastic region using the following projection:

$$\mathcal{Z}(\mathbf{F}) = \begin{cases}\mathbf{F} & \delta\gamma \leq 0 \\ \mathbf{U}\exp\left(\bm{\epsilon} - \delta\gamma\frac{\hat{\bm{\epsilon}}}{\|\hat{\bm{\epsilon}}\|}\right)\mathbf{V}^{\intercal} & \text{otherwise}\end{cases}, \tag{10}$$

where $\delta\gamma = \|\hat{\bm{\epsilon}}\| - \frac{\tau_Y}{2\mu}$, $\bm{\epsilon} = \log(\bm{\Sigma})$ is the Hencky strain, and $\hat{\bm{\epsilon}}$ is its normalized form. $\mathbf{U}$, $\bm{\Sigma}$, and $\mathbf{V}$ can be obtained by performing Singular Value Decomposition (SVD) on the deformation gradient $\mathbf{F}$.
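
The StVK stress of Eq. (9) and the von Mises projection of Eq. (10) can be sketched as below. One point is an assumption on our part: we take the normalized Hencky strain $\hat{\bm{\epsilon}}$ to be the deviatoric part of $\bm{\epsilon}$, i.e. $\hat{\bm{\epsilon}} = \bm{\epsilon} - \frac{1}{d}\mathrm{Tr}(\bm{\epsilon})\mathbf{I}$, as is common in MPM plasticity formulations. Function names and the example parameters are hypothetical.

```python
import numpy as np


def stvk_stress(F: np.ndarray, mu: float, lam: float) -> np.ndarray:
    """J * sigma for the StVK model, Eq. (9)."""
    identity = np.eye(F.shape[0])
    G = 0.5 * (F.T @ F - identity)  # Green strain
    return F @ (2.0 * mu * G + lam * np.trace(G) * identity) @ F.T


def von_mises_return_mapping(F: np.ndarray, mu: float, tau_y: float) -> np.ndarray:
    """Project F back onto the von Mises yield surface, Eq. (10)."""
    U, sigma, Vt = np.linalg.svd(F)
    eps = np.log(sigma)                 # principal Hencky strains
    eps_hat = eps - eps.mean()          # ASSUMPTION: deviatoric part of eps
    eps_hat_norm = np.linalg.norm(eps_hat)
    delta_gamma = eps_hat_norm - tau_y / (2.0 * mu)
    if delta_gamma <= 0.0:
        return F                        # still inside the elastic region
    eps_new = eps - delta_gamma * eps_hat / eps_hat_norm
    return U @ np.diag(np.exp(eps_new)) @ Vt


# Illustrative values: a strongly sheared, volume-preserving deformation.
F_sheared = np.diag([2.0, 1.0, 0.5])
F_projected = von_mises_return_mapping(F_sheared, mu=1.0, tau_y=1.0)
print(F_projected)
```

Under the deviatoric assumption the projection leaves the volume change $J$ untouched and only shrinks the shear part of the strain until its norm equals $\tau_Y / (2\mu)$.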

Granular Media. Similar to plasticine, the StVK constitutive model is used to simulate granular media. The Drucker-Prager yield criterion [43] is selected as the yielding condition. It is defined as follows:

$$\mathrm{Tr}(\bm{\epsilon}) > 0, \quad \text{or} \quad \delta\gamma = \|\hat{\bm{\epsilon}}\|_F + \alpha\frac{(d\lambda + 2\mu)\mathrm{Tr}(\bm{\epsilon})}{2\mu} > 0, \tag{11}$$

where $d$ is the spatial dimension, $\alpha = \sqrt{\frac{2}{3}}\frac{2\sin\theta_{fric}}{3 - \sin\theta_{fric}}$, and $\theta_{fric}$ is the friction angle. The deformation gradient return mapping is defined by

$$\mathcal{Z}(\mathbf{F}) = \begin{cases}\mathbf{U}\mathbf{V}^{\intercal} & \mathrm{Tr}(\bm{\epsilon}) > 0 \\ \mathbf{F} & \delta\gamma \leq 0,\ \mathrm{Tr}(\bm{\epsilon}) \leq 0 \\ \mathbf{U}\exp\left(\bm{\epsilon} - \delta\gamma\frac{\hat{\bm{\epsilon}}}{\|\hat{\bm{\epsilon}}\|}\right)\mathbf{V}^{\intercal} & \text{otherwise}\end{cases}. \tag{12}$$
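
The three cases of the Drucker-Prager return mapping in Eqs. (11) and (12) can be sketched as follows. As in the von Mises case, we assume that $\hat{\bm{\epsilon}}$ denotes the deviatoric part of the Hencky strain; the function name, the friction angle, and the Lamé parameters in the example are illustrative only.

```python
import numpy as np


def drucker_prager_return_mapping(
    F: np.ndarray, mu: float, lam: float, friction_angle: float
) -> np.ndarray:
    """Drucker-Prager projection of F, Eqs. (11)-(12). friction_angle is in radians."""
    d = F.shape[0]
    U, sigma, Vt = np.linalg.svd(F)
    eps = np.log(sigma)                       # principal Hencky strains
    trace = eps.sum()
    if trace > 0.0:
        # Expansion: the material cannot resist tension, so all strain is removed.
        return U @ Vt
    eps_hat = eps - trace / d                 # ASSUMPTION: deviatoric part of eps
    eps_hat_norm = np.linalg.norm(eps_hat)
    sin_phi = np.sin(friction_angle)
    alpha = np.sqrt(2.0 / 3.0) * 2.0 * sin_phi / (3.0 - sin_phi)
    delta_gamma = eps_hat_norm + alpha * (d * lam + 2.0 * mu) * trace / (2.0 * mu)
    if delta_gamma <= 0.0:
        return F                              # inside the yield surface: elastic
    eps_new = eps - delta_gamma * eps_hat / eps_hat_norm
    return U @ np.diag(np.exp(eps_new)) @ Vt


# Illustrative values: 30 degree friction angle, unit Lamé parameters.
phi = np.deg2rad(30.0)
Z_expand = drucker_prager_return_mapping(1.1 * np.eye(3), 1.0, 1.0, phi)
Z_compress = drucker_prager_return_mapping(0.9 * np.eye(3), 1.0, 1.0, phi)
F_shear = np.diag([1.2, 1.0, 0.7])
Z_shear = drucker_prager_return_mapping(F_shear, 1.0, 1.0, phi)
print(Z_expand, Z_compress, Z_shear, sep="\n")
```

In this example, uniform expansion is projected to the identity, uniform compression stays elastic, and the compressed-and-sheared state is pulled back onto the yield surface while its volume is preserved.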

Newtonian Fluid. We adopt the approach used in PAC-NeRF [12], which employs a J-based fluid model combined with a viscosity term to simulate Newtonian fluids. The stress for this model is defined by

$$J\bm{\sigma} = \frac{1}{2}\mu\left(\nabla\mathbf{v} + \nabla\mathbf{v}^{\intercal}\right) + \kappa\left(J - \frac{1}{J^{6}}\right), \tag{13}$$

where $\mu$ and $\kappa$ represent the fluid viscosity and the bulk modulus, respectively.
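
A minimal sketch of the fluid stress in Eq. (13) is given below. We assume that the scalar pressure-like term $\kappa(J - J^{-6})$ acts isotropically, i.e. that it multiplies the identity matrix; this factor is not written explicitly in Eq. (13). The function name and the parameter values are hypothetical.

```python
import numpy as np


def newtonian_fluid_stress(
    grad_v: np.ndarray, J: float, mu: float, kappa: float
) -> np.ndarray:
    """J * sigma for the J-based Newtonian fluid model, Eq. (13)."""
    d = grad_v.shape[0]
    viscous = 0.5 * mu * (grad_v + grad_v.T)   # symmetric velocity-gradient term
    volumetric = kappa * (J - 1.0 / J ** 6)    # vanishes at J = 1
    # ASSUMPTION: the volumetric term is isotropic (multiplies the identity).
    return viscous + volumetric * np.eye(d)


# Illustrative values only.
stress_rest = newtonian_fluid_stress(np.zeros((3, 3)), J=1.0, mu=0.1, kappa=1.0e4)
stress_squeezed = newtonian_fluid_stress(np.zeros((3, 3)), J=0.95, mu=0.1, kappa=1.0e4)
print(stress_rest)      # fluid at rest with no volume change: zero stress
print(stress_squeezed)  # compression (J < 1) gives negative diagonal entries
```

The $J^{-6}$ term grows quickly as $J$ drops below one, so even small compressions are strongly penalized, which keeps the simulated fluid nearly incompressible.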

Non-Newtonian Fluid. We employ the viscoplastic model [42] to simulate non-Newtonian fluids. Although we continue to use the von Mises criterion to delineate the elastic region, the presence of viscoplasticity implies that the deformation is not projected back onto the yield surface immediately. The return mapping is defined as follows:

$$\mathcal{Z}(\mathbf{F}) = \begin{cases}\mathbf{F} & \delta\gamma \leq 0 \\ \mathbf{U}\exp\left(\frac{\hat{s}}{2\mu}\hat{\bm{\epsilon}} + \frac{1}{d}\mathrm{Tr}(\bm{\epsilon})\mathbf{I}\right)\mathbf{V}^{\intercal} & \text{otherwise}\end{cases}, \tag{14}$$
$$\begin{aligned}\hat{\mu} &= \frac{\mu}{d}\mathrm{Tr}(\bm{\Sigma}^{2}), \\ \bm{s} &= 2\mu\hat{\bm{\epsilon}}, \\ \hat{s} &= \|\bm{s}\| - \frac{\delta\gamma}{1 + \frac{\eta}{2\hat{\mu}\Delta t}},\end{aligned} \tag{15}$$

where $d$ is the spatial dimension. $\mathbf{U}$, $\bm{\Sigma}$, and $\mathbf{V}$ can be obtained by performing Singular Value Decomposition (SVD) on the deformation gradient $\mathbf{F}$.