
CT-NeRF: Incremental Optimizing Neural Radiance Field and Poses with Complex Trajectory

Yunlong Ran (Zhejiang University, Hangzhou, China), Yanxu Li (Zhejiang University, Hangzhou, China), Qi Ye (Zhejiang University, Hangzhou, China), Yuchi Huo (Zhejiang University, Hangzhou, China), Zechun Bai (Shandong University, Jinan, China), Jiahao Sun (Zhejiang University, Hangzhou, China), and Jiming Chen (Zhejiang University, Hangzhou, China)
Abstract.

Neural radiance field (NeRF) has achieved impressive results in high-quality 3D scene reconstruction. However, NeRF heavily relies on precise camera poses. While recent works like BARF have introduced camera pose optimization within NeRF, their applicability is limited to simple trajectory scenes. Existing methods struggle while tackling complex trajectories involving large rotations. To address this limitation, we propose  CT-NeRF, an incremental reconstruction optimization pipeline using only RGB images without pose and depth input. In this pipeline, we first propose a local-global bundle adjustment under a pose graph connecting neighboring frames to enforce the consistency between poses to escape the local minima caused by only pose consistency with the scene structure. Further, we instantiate the consistency between poses as a reprojected geometric image distance constraint resulting from pixel-level correspondences between input image pairs. Through the incremental reconstruction,  CT-NeRF enables the recovery of both camera poses and scene structure and is capable of handling scenes with complex trajectories. We evaluate the performance of  CT-NeRF on two real-world datasets, NeRFBuster and Free-Dataset, which feature complex trajectories. Results show  CT-NeRF outperforms existing methods in novel view synthesis and pose estimation accuracy.

Pose estimation, Implicit representation, Structure from motion, SLAM
Figure 1. Comparison on Free-dataset (Wang et al., 2023a). Top: novel view synthesis; Middle: depth maps; Bottom: the estimated trajectory errors (red rectangles for poses from COLMAP, blue rectangles for estimated, gray lines between them for errors). Our method enables more robust pose estimation, renders better novel views, and constructs better geometry than the state-of-the-arts.

1. Introduction

Figure 2. (a) Left: center-based pose graph to force pose consistent to the scene; Right: our pose graph to enable consistency between the camera poses in addition to the consistency to the scene. (b) The reprojection loss (bottom) provides an accurate gradient towards alignment, while the photometric loss (top) provides inconsistent gradients. (c) For a pair of correspondence $(q,p)$ between $I_1$ and $I_2$, $q$ is reprojected to the image plane of $I_2$ via the depth value of $q$. The reprojection loss is the geometric image distance between the reprojected point $p'$ and the ground truth corresponding point $p$.

Reconstructing high-fidelity, high-quality 3D scenes holds significant importance for the development of virtual reality / augmented reality, autonomous driving, and other domains. Recently, implicit representations such as neural radiance fields (NeRF) (Mildenhall et al., 2021) have achieved remarkable progress in reconstructing photo-realistic scenes given a sequence of RGB images and their corresponding camera poses. The camera poses for these high-fidelity reconstructions with implicit representations are primarily obtained through the structure from motion (SfM) methods, with the off-the-shelf tool COLMAP (Schonberger and Frahm, 2016) being the most popular choice. These SfM methods estimate the camera poses through local registration between images and global bundle adjustment (BA) on both all camera poses and sparse 3D scene points. Therefore, accurate camera poses can only be acquired after all images are processed. Also, matching and registration of the methods are sensitive to image variations.

Recent works such as NeRFmm (Wang et al., 2021), SC-NeRF (Jeong et al., 2021), BARF (Lin et al., 2021), GARF (Chng et al., 2022) and L2G-NeRF (Chen et al., 2023) tackle the dependency on the camera pose priors by treating the camera pose as learnable parameters and jointly optimizing poses and scenes offline. However, using images only to constrain the 3D space encounters many problems, such as wrong geometry, blurry textures, or floaters, when very dense multiview images are not available; adding degrees of freedom for the camera poses to the optimization leads to worse results. Therefore, these methods often require initial camera parameters close to the ground truth poses in object-centered scenes with dense multiview observations, or small camera movements. NoPe-NeRF (Bian et al., 2023) incorporates monocular depth to impose further constraints on adjacent images, enabling pose estimation for trajectories with relatively small camera motions and rotations. However, its initialization of all poses as identity matrices leads to local optima when facing complex trajectories.

On the other hand, following the classic SfM pipelines, CF-NeRF (Yan et al., 2023) adds images incrementally, initializes the pose for a newly added image with the pose for the previous one, and optimizes the poses (and the scene). With the incremental strategy, the method is capable of reconstructing the real-world scene under complex camera trajectories. However, it still suffers from large pose errors and inferior reconstruction quality as shown in Fig. 1,  Fig. 4 and  Fig. 5. The reason can be attributed to two aspects. First, the bundle adjustment constructs a center-based graph as shown in Fig. 2 (a) left, optimizing only the consistency between the camera poses and the center implicit global scene while neglecting the consistency between the pose and the multiview images. When the global structure falls into local minima, the camera poses cannot be recovered; in turn, the structure cannot find a way to escape the local minima as the poses and structure are optimized jointly. Secondly, the method only uses the visual difference between the rendering images and raw images whereas BARF (Lin et al., 2021) observes that as natural images are typically complex signals, gradient-based registration with pixel value differences is susceptible to suboptimal solutions if poorly initialized, as shown in the top figure of  Fig. 2 (b). The coarse-to-fine pose estimation proposed by BARF (Lin et al., 2021) mitigates this issue but requires good initialization or dense forward-facing images. In complex trajectories, the camera typically exhibits large motions, and views covering a region are much sparser than the forward-facing scenario.

To tackle the issues and enable accurate pose estimation and reconstruction for complex trajectories, we propose a novel incremental joint optimization method for implicit radiance fields and camera poses named CT-NeRF. For the first issue, as shown in Fig. 2 (a) right, we propose a joint incremental reconstruction and pose estimation pipeline with pose graphs that add edges between camera poses on top of the center-based pose graphs for a local-global bundle adjustment (BA). The graph forms many subgraphs and forces consistency between the camera poses, which helps to recover the poses when the scene and poses are consistent but the scene is actually in a local minimum during BA. For the second issue, and also to instantiate the pose consistency on the pose edges, we introduce a geometric image distance, i.e. the reprojected Euclidean distance between the correspondences of two input images. In addition to providing consistency constraints for pose edges, the reprojected distance benefits the pose and scene optimization in three aspects: 1) it provides a direct direction to align the poses, whereas the pixel value differences do not necessarily correlate with the pose error: as shown in Fig. 2 (b), gradients based on the pixel value difference are not consistent while the gradients from the reprojected geometric image distance are; 2) the reprojected distance requires the depth of the scene to warp a pixel in one image to the other, as shown in Fig. 2 (c), and therefore the gradient can help the convergence of the geometry of the scene directly; 3) correspondence learning networks typically leverage large-scale pair-wise image datasets, and the reprojected loss based on the correspondences is robust to occlusion, lighting variation, textureless regions and large motions compared to raw image losses.

In summary, our main contributions are threefold:

  • We design an incremental reconstruction pipeline for neural radiance fields using only RGB images under complex camera trajectories, without pose and depth input.

  • We propose to construct pose graphs with in-between pose consistency edges for BA and instantiate the consistency as a reprojected geometric image distance constraint from the learned correspondences between input images for robust pose and scene optimization.

  • We achieve significant improvements in pose estimation accuracy and reconstruction quality compared to state-of-the-art methods in complex trajectories.

2. Related Work

SFM and SLAM In the field of computer vision, given a set of input images, SfM and SLAM aim to concurrently estimate camera poses and reconstruct the scene. The distinction lies in the fact that SLAM operates online, emphasizing runtime performance, while SfM does not require online operation but demands higher accuracy. SfM methods can be categorized into incremental (Schonberger and Frahm, 2016; Wu, 2013; Snavely et al., 2008), global (Jiang et al., 2013; Cui and Tan, 2015), and hierarchical (Gherardi et al., 2010) approaches: The incremental approach initializes with two images and progressively registers and reconstructs additional images one by one. The global approach registers and reconstructs all images simultaneously. The hierarchical approach first groups images, performs registration and reconstruction for each group, and then conducts a global optimization. SLAM methods are primarily divided into filter-based (Bailey et al., 2006; Castellanos et al., 2007; Abdelrasoul et al., 2016) and graph optimization-based (Engel et al., 2014; Mur-Artal et al., 2015; Campos et al., 2021) approaches. Filter-based methods mainly utilize state estimation strategies such as Kalman filtering and particle filtering to incrementally estimate the posterior distributions of camera poses and key point locations. Graph optimization-based methods abstract camera poses at different times as nodes and the observation constraints at different robot locations as edges connecting the nodes, then employ bundle adjustment (BA) algorithms for global optimization. Our proposed incremental pipeline is inspired by the incremental SfM and SLAM approaches.

NeRF-based SFM and SLAM Implicit neural representations have gained prominence since 2019 (Park et al., 2019). Compared to traditional explicit representations that store geometric information in a relatively fixed and simple manner, implicit neural representations can better handle complex topological structures and geometric details. The classic algorithm for implicit neural representations, Vanilla NeRF (Mildenhall et al., 2021) (Neural Radiance Fields), is based on the theory of volume rendering and utilizes a Multi-Layer Perceptron (MLP) to learn the implicit neural representation of a static scene, achieving high-quality novel view synthesis. Consequently, some researchers have considered combining SfM and SLAM with NeRF, not only to reduce NeRF’s dependence on the accuracy of input image poses but also to enhance the scene representation capability of SfM and SLAM. BARF (Lin et al., 2021) was the first to integrate the core bundle adjustment (BA) algorithm from SfM with NeRF and adopted a coarse-to-fine reconstruction strategy, progressively aligning camera poses during the reconstruction process. To address the issue of camera poses being prone to local optima in BARF, L2G-NeRF (Chen et al., 2023) proposed a Local-to-Global alignment strategy, allowing camera poses to converge more easily to the global optimum. NoPe-NeRF (Bian et al., 2023) introduced monocular depth information and key point matching information, respectively, to constrain the relative camera pose relationships, ensuring global consistency of camera poses. LocalRF (Meuleman et al., 2023) proposed a progressive strategy based on video sequences to gradually optimize local regions. CF-NeRF (Yan et al., 2023) employed an incremental learning approach to enable reconstruction under complex trajectories. LU-NeRF (Cheng et al., 2023) introduced a Local-to-Global pose estimation strategy, enabling pose estimation and scene reconstruction from datasets with completely unknown camera poses. 
As for NeRF-based SLAM methods, iMAP (Sucar et al., 2021) adopted two threads: Tracking and Mapping. The Tracking thread optimizes the camera pose using the current model and performs key frame selection, while the Mapping thread jointly optimizes the poses of the keyframes and the model. iMAP uses a single MLP to represent the entire scene, limiting its scalability. Nice-SLAM (Zhu et al., 2022) improved upon iMAP by combining Hierarchical Feature Grids and MLP as the scene representation, enabling the application to large-scale scenes. Co-SLAM (Wang et al., 2023c) and e-SLAM (Johari et al., 2023) introduced multi-resolution hash encoding and tri-plane representations, respectively, to improve system frame rate and scene representation capability, building upon Nice-SLAM. Regrettably, current NeRF-based SLAM methods necessitate dense image sequences, while NeRF-based SfM techniques face difficulties in accommodating complex camera trajectories. To tackle these limitations, we introduce  CT-NeRF, an incremental optimization framework that leverages additional correspondence constraints. CT-NeRF employs an incremental optimization process, iteratively refining the reconstruction as new images are integrated.

Correspondence in pose-estimate Local feature matching plays a crucial role in SfM and SLAM. The traditional feature matching pipeline consists of three main steps: feature detection, feature descriptor computation, and feature matching. Through feature detectors, the search space for matching can be effectively reduced, and the generated sparse matches are often sufficient to handle most tasks. However, in low-texture regions or repetitive patterns, these methods often fail due to the inability to detect sufficient feature points. With the flourishing development of deep learning, some researchers have started to leverage data-driven dense feature matching to enhance the accuracy and robustness of SfM and SLAM. For instance, Droid-SLAM (Teed and Deng, 2021) utilizes the dense optical flow learned by RAFT (Teed and Deng, 2020) for feature matching, achieving higher accuracy and robustness compared to traditional SLAM, and rarely failing in experimental scenarios. Detector-free SfM (He et al., 2023) leverages the matching strategy of Loftr (Sun et al., 2021) without feature detectors, exhibiting significant advantages in low-texture regions and winning multiple competitions. Droid-SLAM and Detector-free SfM focus on pose estimation and reconstructing the scene with sparse points. On the other hand, SPARF (Truong et al., 2023) utilizes the correspondences from DKM matching for implicit neural reconstruction under noisy poses with several views. Different from these works, our method focuses on an incremental pose and implicit scene joint optimization pipeline with complex trajectories.

3. Method

Figure 3. Our incremental optimization pipeline for neural radiance fields and pose estimation.

We first present the formulation of incremental scene reconstruction and pose estimation: given a set of sequential images $\mathcal{I}=\{I_{1},I_{2},\dots,I_{n}\}$, where $n$ represents the $n^{th}$ frame captured in a camera trajectory, we aim to jointly optimize the poses $\mathcal{P}=\{P_{1},P_{2},\dots,P_{n}\}$ for the images and a neural radiance field model $\Theta$ representing the 3D scene captured by the images by adding one image at a time sequentially. To achieve the goal, we design an incremental reconstruction pipeline for the neural radiance field without pose priors. Our pipeline consists of five parts as shown in Fig. 3. The scene is initialized using a small set of images. Subsequently, for each input image, tracking is applied to estimate the rough camera pose for a new image. A window optimization is followed to refine the poses of images within the window and also reconstruct the local structural components. To further incorporate the consistency of all visited camera poses, global optimization (bundle adjustment) is performed on all images to optimize the global camera poses and the overall scene structure. Tracking, window, and global optimization repeat until all images are added. After all images are added, post optimization iteratively refines the entire scene and all camera poses until convergence.
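The control flow of the five stages can be sketched as a schedule of which frames each stage touches. This is only an illustration of the ordering described above: the function name `run_pipeline`, the initial set size `num_init`, and the window length `window` are hypothetical, and no optimization is actually performed.

```python
def run_pipeline(num_images, num_init=3, window=5):
    """Return the ordered list of (stage, frame indices) visited by the pipeline.

    num_init and window are illustrative assumptions, not values from the paper.
    """
    schedule = []
    added = list(range(num_init))
    # 1) Initialization on a small set of images.
    schedule.append(("init", tuple(added)))
    for i in range(num_init, num_images):
        added.append(i)
        # 2) Tracking: rough pose for the newly added frame only.
        schedule.append(("tracking", (i,)))
        # 3) Window optimization: refine the most recent frames.
        schedule.append(("window", tuple(added[-window:])))
        # 4) Global optimization (bundle adjustment) over all visited frames.
        schedule.append(("global", tuple(added)))
    # 5) Post optimization once every image has been added.
    schedule.append(("post", tuple(added)))
    return schedule


if __name__ == "__main__":
    for stage, frames in run_pipeline(6, num_init=3, window=3):
        print(stage, frames)
```

In this sketch the window is simply the last `window` frames added; the actual selection rule for the window is not specified in this section.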

3.1. Preliminary: NeRF with Pose Optimization

We define the camera projection function $\pi$ that projects a space point $\mathbf{x}\in\mathbb{R}^{3}$ to a pixel $p\in\mathbb{R}^{2}$ as

(1) $p=\pi(\mathbf{x},P,K),$

where camera pose $P$ is the camera-to-world transformation of image $I$ and $K$ is the intrinsic (we assume all images share the same intrinsic in a trajectory). $P=[R,t]$, where $R\in SO(3)$ represents rotation and $t\in\mathbb{R}^{3}$ translation. The homogenization operations are omitted for clarity. The backprojection function $\pi^{-1}$ projects the pixel coordinate location $p$ into a space point with depth $z$

(2) $\mathbf{x}=\pi^{-1}(p,P,K,z).$
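As a concrete reading of Eqs. (1) and (2), the following NumPy sketch implements $\pi$ and $\pi^{-1}$ for a pinhole camera with a camera-to-world pose $P=[R,t]$. It assumes that $z$ is the depth along the camera's optical axis and that $K$ is a standard $3\times 3$ pinhole intrinsic matrix; the function names are hypothetical.

```python
import numpy as np


def project(x, R, t, K):
    """pi: world point x (3,) -> pixel (2,), with [R, t] camera-to-world."""
    x_cam = R.T @ (x - t)          # invert the camera-to-world transform
    uvw = K @ x_cam                # homogeneous pixel coordinates
    return uvw[:2] / uvw[2]        # perspective division


def backproject(p, R, t, K, z):
    """pi^{-1}: pixel p (2,) with depth z -> world point (3,)."""
    ray_cam = np.linalg.inv(K) @ np.array([p[0], p[1], 1.0])  # z-component is 1
    return R @ (z * ray_cam) + t   # scale to depth z, then map to world


if __name__ == "__main__":
    K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
    R, t = np.eye(3), np.array([0.1, -0.2, 0.5])
    x = backproject(np.array([100.0, 200.0]), R, t, K, z=2.5)
    print(project(x, R, t, K))     # recovers the input pixel up to rounding
```

Under these assumptions, projecting a backprojected pixel with the same pose returns the original pixel, which is the identity the reprojection loss later exploits with two different poses.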

NeRF maps a 3D location $\mathbf{x}\in\mathbb{R}^{3}$ and a view direction $\mathbf{d}\in\mathbb{R}^{3}$ to a radiance color $\mathbf{c}\in\mathbb{R}^{3}$ and volume density $\sigma\in\mathbb{R}$ with an MLP parameterized by $\Theta$. It optimizes the model $\Theta$ and camera poses $\mathcal{P}$ by minimizing the photometric loss between rendered images $\hat{\mathcal{I}}$ and input images $\mathcal{I}$

(3) $\Theta^{*},P^{*}=\arg\min_{\Theta,P}\mathcal{L}_{photo}(\hat{\mathcal{I}},\mathcal{P}\mid\mathcal{I}),$

where $\mathcal{L}_{photo}=\frac{1}{n}\sum_{1}^{n}\|\hat{I}_{i}(P_{i},\Theta)-I_{i}\|_{2}^{2}$ and $\hat{I}_{i}$ can be obtained through volume rendering. For each pixel $q$ in image $\hat{I}_{i}$, its color is rendered by aggregating predicted colors $\mathbf{c}$ and densities $\sigma$ along the ray $r=\mathbf{o}+\mathbf{d}s$, where $\mathbf{o}$ and $\mathbf{d}$ can be obtained by $\mathbf{o},\mathbf{d}=\pi^{-1}(q,P,K)$ and $s\in(s_{near},s_{far}]$ represents the sample distance.

(4) $\hat{I}_{i}(q)=\int_{s_{near}}^{s_{far}}T(s)\,\sigma(r(s))\,\mathbf{c}(r(s),\mathbf{d})\,ds,$

where $T(s)=\exp\left(-\int_{s_{near}}^{s}\sigma(r(h))\,dh\right)$ indicates how much light is transmitted along the ray up to $s$. In the same way, the depth map $\hat{D}_{i}$ can be rendered.

(5) $\hat{D}_{i}(q)=\int_{s_{near}}^{s_{far}}T(s)\,\sigma(r(s))\,s\,ds.$
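In practice the integrals in Eqs. (4) and (5) are evaluated numerically. The sketch below uses the piecewise-constant quadrature that is standard for NeRF-style rendering; whether CT-NeRF uses exactly this discretization is an assumption, and the function name `render_ray` and the handling of the last interval are illustrative choices.

```python
import numpy as np


def render_ray(sigma, color, s):
    """Discretize Eqs. (4) and (5) along one ray.

    sigma: (N,) densities, color: (N, 3) colors, s: (N,) increasing sample distances.
    """
    delta = np.diff(s)
    delta = np.append(delta, delta[-1])   # assumption: last interval reuses the previous spacing
    alpha = 1.0 - np.exp(-sigma * delta)  # opacity of each interval
    # T_k: product of (1 - alpha_j) over all earlier intervals j < k.
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))
    weights = trans * alpha               # discrete counterpart of T(s) * sigma(r(s)) ds
    rgb = weights @ color                 # Eq. (4)
    depth = weights @ s                   # Eq. (5)
    return rgb, depth, weights


if __name__ == "__main__":
    s = np.linspace(2.0, 6.0, 9)
    sigma = np.zeros(9)
    sigma[5] = 1e4                        # an opaque surface at the sixth sample
    color = np.linspace(0.0, 1.0, 27).reshape(9, 3)
    rgb, depth, _ = render_ray(sigma, color, s)
    print(rgb, depth)
```

Because color and depth share the same weights, a gradient on the rendered depth (as used by the reprojection loss) propagates to the same densities that determine the rendered color.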

3.2. Reprojected Geometric Image Distance

As aforementioned, we aim to incorporate the geometric distance constraint from correspondences between images to jointly optimize camera poses and 3D scene model parameters under complex trajectories. For correspondence generation, many existing correspondence learning networks can be exploited. In this work we choose DKM (Edstedt et al., 2023) as it is the current state-of-the-art work in dense correspondence matching, achieving high matching accuracy and strong robustness. Given a pair of adjacent images $I_{1},I_{2}$ in the input trajectory $\mathcal{I}$, a pre-trained DKM model can generate full-resolution pixel-level correspondences between them while predicting the confidence $\alpha$ of each pixel match. We use a set $M$ to represent the output correspondences for two input images

(6) $M=\{m=(q,p,\alpha)\mid q\in I_{1},\,p\in I_{2},\,\alpha\in(0,1]\}.$

According to the multiview geometry theory (Hartley and Zisserman, 2003), four pairs of correspondences can solve the relative pose between the images and, using the triangulation technique, the 3D scene points for the pairs can be acquired. Though the correspondences only produce sparse 3D points and our representation is implicit, the theory provides important information for our problem: 1) correspondences provide a way to solve the pose estimation problem without knowing a 3D scene; 2) pose estimation requires only sparse correspondences; 3) the correspondences embed the 3D information.

We define our geometric image distance for the pose estimation between two images under implicit scene representation as follows. As correspondence in practice often contains noise, we randomly sample $N_{m}$ correspondences from the correspondence set $M$ with confidence above a threshold $t_{m}$, to serve as a correspondence set $M_{sparse}$ for pose estimation,

(7) $M_{sparse}=\{m_{1},m_{2},\dots,m_{N_{m}}\}\sim\{m\mid m\in M,\,\alpha>t_{m}\}.$
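The selection in Eq. (7) amounts to a confidence filter followed by random subsampling. A minimal sketch is given below, assuming the matcher output has already been flattened into arrays `q`, `p` and `alpha`; sampling without replacement and the function name `sample_sparse` are assumptions, since the text only states that the correspondences are sampled randomly.

```python
import numpy as np


def sample_sparse(q, p, alpha, t_m, n_m, rng):
    """Draw up to n_m correspondences whose confidence exceeds t_m (Eq. (7)).

    q, p: (N, 2) matched pixel coordinates in I_1 and I_2; alpha: (N,) confidences.
    """
    keep = np.flatnonzero(alpha > t_m)          # indices passing the confidence threshold
    n = min(n_m, keep.size)                     # cannot draw more than are available
    idx = rng.choice(keep, size=n, replace=False)
    return q[idx], p[idx], alpha[idx]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.uniform(0, 640, size=(1000, 2))
    p = q + rng.normal(0, 1, size=(1000, 2))
    alpha = rng.uniform(0.01, 1.0, size=1000)
    qs, ps, a = sample_sparse(q, p, alpha, t_m=0.8, n_m=64, rng=rng)
    print(qs.shape, a.min())
```

The per-iteration draw of $N_s$ pairs from $M_{sparse}$ described next can reuse the same routine with the threshold set below the smallest retained confidence.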

In classic SfM pipelines, algorithms like RANSAC (Fischler and Bolles, 1981) are exploited to remove outliers and get robust poses and 3D points. Though differentiable RANSAC (Wei et al., 2023) may be deployed in our implicit joint optimization, we choose a simpler strategy: 1) weighting the reprojection error according to the confidence of correspondence estimation; 2) randomly sampling a small set of samples in each iteration. In every training iteration for the scene model $\Theta$, $N_{s}$ pairs of correspondences are randomly fetched from $M_{sparse}$. Optimizing the pose and/or scene parameters with confidence-weighted correspondence sets sampled across many iterations serves a purpose similar to that of RANSAC.

For each correspondence $(q,p,\alpha)$, its depth value $z(q;\Theta)$ can be rendered by Eq. (5), and $q$ can be backprojected into a 3D point and reprojected to $I_{2}$. Then we have the geometric image distance

(8) $\mathcal{L}_{rp}(I_{1},I_{2})=\frac{1}{N_{s}}\sum_{1}^{N_{s}}\alpha\,\left|\pi\left(\pi^{-1}(q,P_{1},K,z(q;\Theta)),P_{2},K\right)-p\right|,$

which is the reprojection error. $P_{1},P_{2}$ are the poses to be estimated for images $I_{1},I_{2}$ respectively. The gradient with respect to the camera poses $P_{1},P_{2}$, the rendered depths, and further the model $\Theta$ can be obtained from Eq. (8), indicating that the gradients can help the depth and scene reconstruction in addition to pose estimation.
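Eq. (8) can be evaluated in a few vectorized steps. The sketch below is a plain NumPy forward pass under stated assumptions: poses are camera-to-world, $z$ is the depth along the optical axis of the first camera, and $|\cdot|$ is taken as the Euclidean pixel distance. In the actual method $z$ comes from the rendered depth of Eq. (5) and the loss is differentiated with respect to the poses and $\Theta$, which requires an autodiff framework rather than NumPy.

```python
import numpy as np


def reprojection_loss(q, p, alpha, z, R1, t1, R2, t2, K):
    """Confidence-weighted reprojection error of Eq. (8).

    q, p: (N, 2) matched pixels in I_1 and I_2; alpha, z: (N,) confidences and depths of q.
    [R1, t1] and [R2, t2] are camera-to-world poses (assumption, following Sec. 3.1).
    """
    ones = np.ones((q.shape[0], 1))
    rays = np.hstack([q, ones]) @ np.linalg.inv(K).T   # camera-1 rays with unit z
    x_world = (z[:, None] * rays) @ R1.T + t1          # backproject: pi^{-1}(q, P1, K, z)
    x_cam2 = (x_world - t2) @ R2                       # row-wise R2^T (x - t2)
    uvw = x_cam2 @ K.T
    p_hat = uvw[:, :2] / uvw[:, 2:3]                   # reprojected pixels p'
    return np.mean(alpha * np.linalg.norm(p_hat - p, axis=1))


if __name__ == "__main__":
    K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
    X = np.array([[0.2, -0.1, 3.0], [-0.5, 0.3, 2.5], [0.4, 0.4, 4.0]])
    R1, t1 = np.eye(3), np.zeros(3)
    R2, t2 = np.eye(3), np.array([0.3, 0.0, 0.0])
    q = (X @ K.T)[:, :2] / X[:, 2:3]
    Xc2 = (X - t2) @ R2
    p = (Xc2 @ K.T)[:, :2] / Xc2[:, 2:3]
    print(reprojection_loss(q, p, np.ones(3), X[:, 2], R1, t1, R2, t2, K))
```

With exact depths and poses the loss vanishes, and a perturbation of either pose or of the depth produces a pixel-space residual, which is why the same term constrains both the poses and the geometry.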

The reprojection error is referred to as the Gold Standard (Hartley and Zisserman, 2003). The error is a quadratic convex function of pose parameters, pointing in the right direction for pose optimization without local minima. However, the pixel value difference does not necessarily correlate with the correct direction for the pose optimization, as shown in BARF (Lin et al., 2021). Compared with the pixel value difference, the reprojection error also provides more robust gradients for the scene estimation, as the pixel value difference is susceptible to lighting, occlusion, sparse views, etc., while the correspondence learning network for the reprojection error learns these factors during training.

3.3. Tracking

When a new frame Iisubscript𝐼𝑖I_{i}italic_I start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is added to the training process, tracking provides a rough estimate of the camera pose, which becomes particularly crucial when there exist violent changes in camera motion. The pose of the new frame is initialized based on the previous frame Pi=Pi1subscript𝑃𝑖subscript𝑃𝑖1P_{i}=P_{i-1}italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_P start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT. Tracking performs pose estimation through the reprojection error with adjacent frames and photometric loss with the scene. The loss function can be formulated as:

(9) $\mathcal{L}(\mathcal{E}) = \mathcal{L}_{nrp}(\mathcal{E}) + \mathcal{L}_{photo}(\mathcal{E}),$
(10) $\mathcal{L}_{nrp}(\mathcal{E}) = \frac{1}{N_e * 2 - 2} \sum_{1}^{N_e} \left( \mathcal{L}_{rp}(I_i, I_{i-1}) + \mathcal{L}_{rp}(I_i, I_{i+1}) \right),$

where $\mathcal{L}_{nrp}$ is the reprojection loss for pairs of neighboring images and $\mathcal{E} = \{I_1, I_2, ..., I_{N_e}\}$ is an optimization frame set with $N_e$ frames.
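One way to read the normalizer $N_e * 2 - 2$ in Eq. (10): every frame has two neighbors except the first and last frames of the set, so a set of $N_e$ frames contains $2 N_e - 2$ ordered neighbor pairs. The sketch below assumes that pairs whose neighbor falls outside the set are simply skipped, which the text does not state explicitly. `pair_loss` is a hypothetical stand-in for $\mathcal{L}_{rp}$.

```python
def neighbor_pairs(n_frames):
    """Ordered index pairs (k, j) with j = k - 1 or k + 1 inside the set."""
    pairs = []
    for k in range(n_frames):
        for j in (k - 1, k + 1):
            if 0 <= j < n_frames:  # assumption: out-of-set neighbors are skipped
                pairs.append((k, j))
    return pairs


def neighbor_reprojection_loss(frames, pair_loss):
    """Average a pairwise loss over all ordered neighbor pairs in `frames`."""
    if len(frames) < 2:
        raise ValueError("at least two frames are needed to form a neighbor pair")
    pairs = neighbor_pairs(len(frames))
    # The number of pairs equals the normalizer 2 * N_e - 2 of Eq. (10).
    assert len(pairs) == 2 * len(frames) - 2
    total = sum(pair_loss(frames[k], frames[j]) for k, j in pairs)
    return total / len(pairs)
```

For the tracking set of two frames this gives two ordered pairs, one in each direction between the new frame and its predecessor.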

For tracking, $\mathcal{E}_{tracking} = \{I_{i-1}, I_i\}$ consists of the new frame and the preceding frame ($N_e = 2$), and we optimize only the pose of the newly added frame $P_i$, keeping the other optimizable pose parameters and the network parameters $\Theta$ fixed.

Initialization is crucial as it affects the subsequent tracking performance and the final pose estimation and reconstruction quality. In our approach, we select the first $N_{init}$ images in the sequence as the initialization image set $\mathcal{E}_{init} = \{I_1, I_2, ..., I_{N_{init}}\}$. The initialization is achieved by minimizing Eq. (9). Due to the difficulty of optimizing the rotation parameters $R$ in the initial stages, we fix the rotation parameters and do not optimize them during initialization.

3.4. Joint Optimization

Using the center-based graph and the photo loss in Eq. (3) for joint optimization, or bundle adjustment, is susceptible to local minima when only RGB inputs are available. It only constrains the poses to be consistent with the scene, while reconstructing the scene with implicit neural radiance fields from only RGB images (especially sparse images) is prone to converging to wrong geometry, as demonstrated in many previous works (Deng et al., 2022; Niemeyer et al., 2022; Wang et al., 2023b; Song et al., 2024). If the scene is stuck in a local minimum, bundle adjustment cannot recover the correct poses as long as the poses conform with the twisted scene. As shown in Table 1, though the poses from BARF (Lin et al., 2021), L2G-NeRF (Chen et al., 2023), and Nope-NeRF (Bian et al., 2023) deviate considerably, visual metrics such as PSNR for the scene remain high, so the poses are unable to escape the local minima. In contrast, we construct pose graphs with constraints on the edges between poses for the joint optimization. The subgraphs formed between the poses and between the poses and the scene (shown in Fig. 2 (a), right) enforce consistency among all the in-between poses in the image set. The reprojection error forces the adjacent poses to be consistent with the correspondences of the input images. Further, we design a combination of local window and global BA strategies to balance the integration of new information and consistency with existing estimates.

Our joint optimization uses the same loss as tracking, Eq. (9). However, the scene model $\Theta$ is learnable, and different frame sets $\mathcal{E}$ are maintained for the optimization.

Window optimization To jointly optimize the scene and the camera pose for a newly added image, window optimization selects the most recent $N_{window}$ frames as the optimization set $\mathcal{E}_{window} = \{I_{i-N_{window}+1}, ..., I_{i-1}, I_i\}$. The window optimization process fixes the camera poses outside the window and optimizes the camera poses within the window and the network $\Theta$ using Eq. (9), i.e. it performs local bundle adjustment. Unlike existing implicit SfM and SLAM methods, which consist of tracking for the newly added image and global bundle adjustment for all input images, the extra window optimization improves pose estimation accuracy by enhancing consistency with the most recent frames, and it makes the network learn faster from the new frame by leaving older frames out.

Global optimization Relying solely on tracking and window optimization can lead to cumulative errors and even failure in pose estimation. To address this issue, global optimization incorporates all frames added to the training process so far into the optimization set $\mathcal{E}_{global} = \{I_1, I_2, ..., I_i\}$. By applying Eq. (9) to optimize all frames simultaneously, global optimization significantly enhances the robustness and accuracy of pose estimation.
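The three frame sets used for a newly added frame $I_i$ can be summarized with a short sketch. Frame indices are 1-based as in the text. Clamping the window at the first frame is an assumption for early frames, and the text does not say whether any pose is held fixed as a reference during global optimization, so the sketch treats all of them as trainable. Function names are hypothetical.

```python
def tracking_set(i):
    """Frames used when tracking new frame i: {I_{i-1}, I_i}."""
    if i < 2:
        raise ValueError("tracking needs a preceding frame")
    return [i - 1, i]


def window_set(i, n_window):
    """The most recent n_window frames ending at frame i."""
    start = max(1, i - n_window + 1)  # assumption: clamp for early frames
    return list(range(start, i + 1))


def global_set(i):
    """All frames added to training so far."""
    return list(range(1, i + 1))


# Whether the scene network is updated in each stage, as described in the text.
SCENE_TRAINABLE = {"tracking": False, "window": True, "global": True}


def trainable_poses(stage, i, n_window):
    """Indices of the poses that receive updates in the given stage."""
    if stage == "tracking":
        return [i]  # only the newly added pose is optimized
    if stage == "window":
        return window_set(i, n_window)
    if stage == "global":
        return global_set(i)
    raise ValueError("unknown stage: " + stage)
```

With $N_{window} = 4$ and $i = 10$, for example, window optimization updates the poses of frames 7 to 10 while tracking updates only the pose of frame 10.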

Figure 4. Qualitative comparison on Free-Dataset (Wang et al., 2023a): rendered views and depths (top left corner of each image).

Post optimization Before all frames are added, the learning rates of the network $lr_{\Theta}$ and of the poses $lr_{\hat{\mathcal{P}}}$ are fixed. The positional encoding control parameter $\alpha_{pe}$ is also fixed. After all frames are added, the post optimization process gradually reduces the learning rates and increases the frequency of the positional encoding to iteratively refine the entire scene and the camera poses through Eq. (9), ultimately obtaining the final results.
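As an illustration of the learning-rate reduction during post optimization, the sketch below assumes an exponential decay between a start and an end value. The shape of the decay is an assumption (the text only says the rate is reduced gradually), and the endpoints in the usage note come from the implementation details in Section 4.1. `decayed_lr` is a hypothetical helper.

```python
def decayed_lr(step, total_steps, lr_start, lr_end):
    """Exponentially interpolate from lr_start (step 0) to lr_end (last step)."""
    step = min(max(step, 0), total_steps)  # keep the step inside the schedule
    return lr_start * (lr_end / lr_start) ** (step / total_steps)
```

Under this assumption, a network learning rate going from $1 \times 10^{-3}$ to $1 \times 10^{-4}$ over 200K post optimization iterations would pass through roughly $3.16 \times 10^{-4}$ at the halfway point.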

3.5. Training procedure

Training Pipeline The pipeline is initialized with $N_{init}$ frames, and this initialization stage is optimized for $\beta_{init}$ iterations. Afterwards, for each subsequent frame added, tracking is performed for $\beta_{tracking}$ iterations, window optimization for $\beta_{window}$ iterations, and global optimization for $\beta_{global}$ iterations. These three stages repeat with each new frame until all frames have been added. Finally, the post optimization stage, consisting of $\beta_{post}$ iterations, is conducted to further refine the reconstruction.
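The order of stages described above can be written down as a small generator. This is a sketch of the schedule only, with no optimization performed; the function name and the dictionary keys are hypothetical, and the iteration counts in the usage note are toy values chosen for illustration.

```python
def training_schedule(n_frames, n_init, beta):
    """Yield (stage, newest_frame_index, iterations) in execution order.

    beta maps the stage names "init", "tracking", "window", "global" and
    "post" to their iteration counts.
    """
    # Initialization on the first n_init frames.
    yield ("init", n_init, beta["init"])
    # Each later frame goes through tracking, window and global optimization.
    for i in range(n_init + 1, n_frames + 1):
        yield ("tracking", i, beta["tracking"])
        yield ("window", i, beta["window"])
        yield ("global", i, beta["global"])
    # Final refinement once every frame has been added.
    yield ("post", n_frames, beta["post"])


def total_iterations(n_frames, n_init, beta):
    """Total number of optimization iterations implied by the schedule."""
    return sum(iters for _, _, iters in training_schedule(n_frames, n_init, beta))
```

For a toy run with 5 frames, 3 initialization frames, and iteration counts of 20, 1, 2, 5, and 100 for the five stages, the schedule has 8 entries and 136 iterations in total.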

Positional Encoding Coarse-to-fine positional encoding plays an important role in accurate pose estimation (Lin et al., 2021), as excessively high frequencies can hinder this process. To address this, we employ the positional encoding frequency control method of BARF (Lin et al., 2021). Specifically, before the post optimization stage, we keep the positional encoding control parameter $\alpha_{pe}$ at a low-frequency setting. During post optimization, we follow the same coarse-to-fine strategy as BARF.
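For reference, the coarse-to-fine control in BARF weights the $k$-th frequency band of the positional encoding by a smooth function of $\alpha$: zero while $\alpha < k$, a cosine ramp while $0 \le \alpha - k < 1$, and one afterwards. The sketch below follows that weighting. The number of frequency bands (`n_freqs`) is a placeholder, and the linear ramp in the usage note uses the 20K to 100K iteration range reported in Section 4.1; it is an illustration, not the authors' code.

```python
import math


def barf_weight(alpha, k):
    """Weight of the k-th frequency band for control parameter alpha."""
    x = min(max(alpha - k, 0.0), 1.0)  # clamp alpha - k to [0, 1]
    return (1.0 - math.cos(x * math.pi)) / 2.0


def alpha_schedule(iteration, start, end, n_freqs):
    """Linearly raise alpha from 0 to n_freqs between two iterations."""
    progress = min(max((iteration - start) / (end - start), 0.0), 1.0)
    return progress * n_freqs


def band_weights(alpha, n_freqs):
    """Weights of all frequency bands, lowest frequency first."""
    return [barf_weight(alpha, k) for k in range(n_freqs)]
```

With a hypothetical `n_freqs` of 10, `alpha_schedule(60000, 20000, 100000, 10)` is halfway through the ramp, so the five lowest bands are fully enabled and the five highest are still switched off.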

4. Experiments

4.1. Experiment Settings

Dataset We evaluate our method on two challenging datasets with complex trajectories, NeRFBuster (Warburg et al., 2023) and Free-Dataset (Wang et al., 2023a). NeRFBuster consists of a total of 12 scenes, with most trajectories revolving around a central object. We employ the sequences selected by CF-NeRF (Yan et al., 2023), with approximately 50 images per scene. We choose every 8th image from each sequence as the test set for novel view synthesis. All images are downsampled to a resolution of $480 \times 270$. Ground truth poses are estimated using COLMAP, as provided by CF-NeRF. Free-Dataset comprises 7 scenes with arbitrary trajectories, predominantly in outdoor environments characterized by highly dynamic camera motions. We select 50 images per scene in sequential order, and every 8th image is designated as the test set. The images are downsampled to a resolution of $312 \times 487$, and the ground truth poses are obtained through COLMAP (Schonberger and Frahm, 2016).
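The train/test split described above can be sketched as follows. Which image starts the every-8th pattern is not stated, so the `offset` parameter (defaulting to the first image) is an assumption, and `split_every_kth` is a hypothetical helper.

```python
def split_every_kth(frames, k=8, offset=0):
    """Hold out every k-th frame as test data and keep the rest for training."""
    test = [f for idx, f in enumerate(frames) if idx % k == offset]
    train = [f for idx, f in enumerate(frames) if idx % k != offset]
    return train, test
```

Under this assumption, a 50-image sequence yields 7 test images and 43 training images.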

Implementation Details Our approach is implemented on top of the BARF (Lin et al., 2021) framework. The majority of the hyperparameters in our network model align with the BARF Real-World Scenes setting, including a network learning rate $lr_{\Theta}$ decaying from $1 \times 10^{-3}$ to $1 \times 10^{-4}$, a pose learning rate $lr_{\mathcal{P}}$ decaying from $3 \times 10^{-3}$ to $1 \times 10^{-5}$, inverse sampling of 128 points along each ray with an inverse range of $[1, 0)$, a batch size of 1024, and a linear adjustment of $\alpha$ for the post optimization phase from iteration 20K to 100K. We randomly select $N_m = 10000$ correspondences with confidence scores higher than $t_m = 0.2$ from the dense correspondences as sparse correspondences. During each iteration, $N_s = \frac{256}{len(\mathcal{E})}$ correspondences are randomly chosen from this set for the reprojection loss.
For NeRFBuster scenes, we set $N_{init} = 3$, $N_{window} = 4$, $\beta_{init} = 2000$, $\beta_{tracking} = 100$, $\beta_{window} = 200$, $\beta_{global} = 500$, and $\beta_{post} = 200K$. As the frames in Free-Dataset exhibit larger camera motions and smaller overlap, the network requires more iterations to estimate poses accurately and achieve convergence.
For Free-Dataset scenes, we set $N_{init} = 3$, $N_{window} = 4$, $\beta_{init} = 4000$, $\beta_{tracking} = 200$, $\beta_{window} = 400$, $\beta_{global} = 900$, and $\beta_{post} = 200K$.
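A short worked calculation shows what these settings imply. The sketch computes $N_s = 256 / len(\mathcal{E})$ for frame sets of different sizes and the number of iterations spent on each newly added frame. Rounding down when 256 is not divisible by the set size is an assumption, since the text does not specify it; the names are hypothetical and the numbers are taken from the settings above.

```python
def correspondences_per_iteration(set_size, budget=256):
    """N_s = budget / len(E), rounded down (the rounding is an assumption)."""
    if set_size < 1:
        raise ValueError("the frame set must not be empty")
    return budget // set_size


# Iteration counts per stage as reported above for the two datasets.
BETA = {
    "NeRFBuster": {"tracking": 100, "window": 200, "global": 500},
    "Free-Dataset": {"tracking": 200, "window": 400, "global": 900},
}


def iterations_per_new_frame(dataset):
    """Tracking + window + global iterations spent on each added frame."""
    return sum(BETA[dataset].values())
```

The tracking set of 2 frames gives $N_s = 128$ and a window of 4 frames gives $N_s = 64$, while each added frame costs 800 iterations on NeRFBuster and 1500 on Free-Dataset before post optimization starts.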

Table 1. Evaluation of the pose accuracy (first two metrics) and the novel view quality (last three metrics) on NeRFBuster (Warburg et al., 2023). $\Delta T$ is the translation error in ground truth scale and $\Delta R$ is the rotation error in degrees.
| Metric | Method | aloe | art | car | century | garbage | flowers | picnic | pikachu | pipe | plant | roses | table | mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\Delta R$ | BARF (Lin et al., 2021) | 128.599 | 45.739 | 97.148 | 114.261 | 103.259 | 79.702 | 81.418 | 112.996 | 166.701 | 140.270 | 125.974 | 139.675 | 111.312 |
| | L2G-NeRF (Chen et al., 2023) | 117.475 | 28.247 | 161.862 | 59.730 | 90.889 | 88.750 | 99.076 | 123.543 | 106.838 | 71.551 | 139.057 | 144.848 | 102.656 |
| | Nope-NeRF (Bian et al., 2023) | 101.589 | 32.345 | 113.063 | 150.253 | 149.459 | 161.859 | 148.710 | 158.059 | 99.836 | 138.816 | 150.050 | 114.783 | 126.569 |
| | CF-NeRF (Yan et al., 2023) | 6.703 | 76.306 | 29.079 | 11.013 | 74.163 | 10.672 | 109.868 | 13.243 | 122.345 | 18.664 | 3.903 | 3.835 | 39.983 |
| | Ours | 3.163 | 3.151 | 0.701 | 2.343 | 0.902 | 0.481 | 1.938 | 7.708 | 2.302 | 6.302 | 0.570 | 1.154 | 2.560 |
| $\Delta T$ | BARF | 6.039 | 4.040 | 5.043 | 5.434 | 4.663 | 4.693 | 3.007 | 3.772 | 3.763 | 5.865 | 4.952 | 4.076 | 4.612 |
| | L2G-NeRF | 4.986 | 4.402 | 4.764 | 5.895 | 4.272 | 4.926 | 4.214 | 6.451 | 5.592 | 2.764 | 5.055 | 4.199 | 4.795 |
| | Nope-NeRF | 5.151 | 5.302 | 5.401 | 3.202 | 5.571 | 4.742 | 4.819 | 3.757 | 4.983 | 5.896 | 5.399 | 5.817 | 5.004 |
| | CF-NeRF | 0.637 | 1.549 | 1.621 | 0.497 | 0.548 | 0.745 | 1.285 | 0.879 | 5.757 | 0.685 | 0.182 | 0.274 | 1.222 |
| | Ours | 0.168 | 0.030 | 0.035 | 0.134 | 0.039 | 0.039 | 0.106 | 0.548 | 0.164 | 0.225 | 0.038 | 0.045 | 0.131 |
| PSNR | BARF | 23.56 | 20.55 | 23.69 | 18.73 | 19.92 | 23.14 | 22.91 | 31.58 | 23.43 | 29.38 | 21.87 | 25.88 | 23.72 |
| | L2G-NeRF | 23.47 | 22.58 | 23.98 | 19.32 | 20.36 | 24.52 | 22.18 | 33.66 | 21.99 | 29.63 | 21.74 | 25.60 | 24.09 |
| | Nope-NeRF | 22.42 | 21.53 | 22.62 | 19.55 | 20.59 | 21.91 | 22.97 | 27.39 | 21.33 | 25.69 | 19.91 | 26.83 | 22.73 |
| | CF-NeRF | 23.32 | 23.50 | 22.04 | 21.35 | 21.30 | 23.91 | 23.31 | 31.51 | 22.24 | 25.89 | 23.42 | 26.71 | 24.04 |
| | Ours neighbor | 23.17 | 25.76 | 24.90 | 21.64 | 21.62 | 26.14 | 23.04 | 30.50 | 23.02 | 27.19 | 22.14 | 30.85 | 25.00 |
| | Ours sim(3) | 24.36 | 26.73 | 27.41 | 22.56 | 22.69 | 27.37 | 23.04 | 22.91 | 23.13 | 22.64 | 29.63 | 32.73 | 25.43 |
| SSIM | BARF | 0.59 | 0.68 | 0.72 | 0.52 | 0.54 | 0.72 | 0.54 | 0.92 | 0.62 | 0.85 | 0.69 | 0.84 | 0.69 |
| | L2G-NeRF | 0.58 | 0.74 | 0.74 | 0.56 | 0.53 | 0.76 | 0.50 | 0.94 | 0.55 | 0.87 | 0.69 | 0.83 | 0.69 |
| | Nope-NeRF | 0.52 | 0.73 | 0.69 | 0.57 | 0.56 | 0.67 | 0.53 | 0.88 | 0.53 | 0.80 | 0.64 | 0.86 | 0.67 |
| | CF-NeRF | 0.56 | 0.74 | 0.66 | 0.63 | 0.57 | 0.72 | 0.53 | 0.93 | 0.54 | 0.80 | 0.72 | 0.85 | 0.69 |
| | Ours neighbor | 0.56 | 0.82 | 0.74 | 0.63 | 0.58 | 0.78 | 0.52 | 0.91 | 0.58 | 0.83 | 0.71 | 0.89 | 0.71 |
| | Ours sim(3) | 0.61 | 0.83 | 0.79 | 0.65 | 0.63 | 0.81 | 0.52 | 0.76 | 0.59 | 0.71 | 0.88 | 0.91 | 0.72 |
| LPIPS | BARF | 0.36 | 0.20 | 0.29 | 0.40 | 0.38 | 0.27 | 0.44 | 0.09 | 0.33 | 0.14 | 0.24 | 0.23 | 0.28 |
| | L2G-NeRF | 0.37 | 0.18 | 0.28 | 0.35 | 0.42 | 0.14 | 0.49 | 0.10 | 0.39 | 0.14 | 0.24 | 0.24 | 0.29 |
| | Nope-NeRF | 0.51 | 0.33 | 0.43 | 0.48 | 0.50 | 0.47 | 0.54 | 0.30 | 0.52 | 0.34 | 0.48 | 0.35 | 0.44 |
| | CF-NeRF | 0.43 | 0.24 | 0.41 | 0.37 | 0.46 | 0.33 | 0.55 | 0.11 | 0.50 | 0.20 | 0.21 | 0.23 | 0.34 |
| | Ours neighbor | 0.36 | 0.14 | 0.26 | 0.50 | 0.44 | 0.22 | 0.49 | 0.11 | 0.41 | 0.22 | 0.14 | 0.17 | 0.29 |
| | Ours sim(3) | 0.35 | 0.14 | 0.25 | 0.50 | 0.43 | 0.22 | 0.49 | 0.29 | 0.40 | 0.30 | 0.09 | 0.17 | 0.30 |

Metrics We primarily evaluate our method by assessing the quality of novel view synthesis and the accuracy of pose estimation. As the variables for the scene and the cameras are determined only up to a 3D similarity transformation, existing work (Lin et al., 2021; Chen et al., 2023; Bian et al., 2023) aligns the optimized poses to the ground truth using Sim(3) with Procrustes analysis on the camera locations, both for pose error computation and for test pose initialization (termed Sim(3) below), and then runs an additional test-time optimization on the trained model to reduce the pose error that may influence the view synthesis quality. Since all existing methods compared below, except ours, struggle to obtain reasonable initial test poses through Sim(3) and fail to perform the test-time optimization under complex trajectories, we adopt the approach of Nope-NeRF (Bian et al., 2023) for these methods, i.e. initializing a test image pose with the estimated pose of the training frame that is closest to it (termed neighbor below). For our method, we provide results using both initialization methods. We report PSNR, SSIM, and LPIPS for view synthesis, and rotation and translation errors for pose estimation.
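The pose and image metrics named above have standard definitions, sketched below for a single frame. The code assumes that poses have already been aligned to the ground truth, that rotations are 3x3 matrices given as lists of rows, and that images are flat lists of intensities in [0, 1]. How per-frame errors are aggregated over a scene is not specified in this section, so that step is left out. Function names are hypothetical.

```python
import math


def rotation_error_deg(R_est, R_gt):
    """Angle in degrees of the relative rotation between two 3x3 rotations."""
    # trace(R_est^T R_gt) equals the sum of element-wise products.
    trace = sum(R_est[r][c] * R_gt[r][c] for r in range(3) for c in range(3))
    cos_angle = min(1.0, max(-1.0, (trace - 1.0) / 2.0))  # guard against round-off
    return math.degrees(math.acos(cos_angle))


def translation_error(t_est, t_gt):
    """Euclidean distance between two camera positions."""
    return math.dist(t_est, t_gt)


def psnr(pred, target):
    """Peak signal-to-noise ratio in dB for intensities in [0, 1]."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return float("inf") if mse == 0 else -10.0 * math.log10(mse)
```

A rotation error of 90 degrees, for instance, corresponds to an estimated camera that is turned a quarter of a full revolution away from the ground truth orientation.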

4.2. Comparison with Pose-Unknown Methods

We compare with the state-of-the-art methods for joint optimization of scenes and poses from RGB images, i.e. BARF (Lin et al., 2021), L2G-NeRF (Chen et al., 2023), Nope-NeRF (Bian et al., 2023), and CF-NeRF (Yan et al., 2023).

Figure 5. Trajectory comparison. We visualize both the estimated (blue) and COLMAP (red) camera poses. Sparse 3D points for the scenes are from COLMAP. While there are abrupt changes in the trajectories of BARF, L2G-NeRF, and Nope-NeRF, the trajectories of CF-NeRF and ours change steadily. The bottom row shows frames rendered between the two frames of abrupt change denoted by green rectangles.
Figure 6. Qualitative comparison on NeRFBuster (Warburg et al., 2023): rendered views and depths (top left corner of each image).

Results on the Object Centered Dataset Table 1 presents the pose evaluation results on the NeRFBuster dataset. BARF, L2G-NeRF, and Nope-NeRF initialize all poses as identity matrices and then perform bundle adjustment to jointly optimize the poses and the scene. With pose initialization far from the actual poses and sparse views unable to effectively constrain the scenes, these methods frequently fail to recover the camera poses and geometry. CF-NeRF manages to estimate poses but still suffers from large errors, as it only constrains poses through the photo loss.

Our method achieves significantly smaller errors compared to these methods.

Despite the significant pose errors and poor structural quality of methods such as BARF, they still achieve surprisingly "good" results on PSNR, SSIM, and LPIPS in Table 1, almost on par with our view synthesis results. We attribute this to overfitting in both the training stage and the test-time optimization. In Fig. 5, we visualize the trajectory of the NeRFBuster garbage scene and select two frames with an abrupt change in the estimated camera trajectory. We then render novel views by interpolating between these poses. As shown in Fig. 5 (b), the rendering results are unreasonable, while CF-NeRF and our method can render smooth view transitions (in the supplementary video, we further show that BARF, L2G-NeRF, Nope-NeRF, and CF-NeRF render view-inconsistent effects). Further evidence is that these methods struggle to reconstruct the geometry, as shown in Fig. 6 and Fig. 4. During test-time optimization, as the estimated trajectories of these methods diverge far from the ground truth and they fail to perform test-time optimization using Sim(3), the poses of the neighboring frames are used to initialize the test frames. This initialization causes the pose of a test frame to converge to a "pseudo" pose close to the estimated pose of the closest neighboring training frame. The network for the scene then further overfits to the "pseudo" ground truth pairs of test pose and image ($P, I$ for example) and renders an image $I'$ using $P$ to calculate visual metrics against $I$.

Results on the Free Trajectory Dataset We also conduct our experiments on the Free-Dataset, which consists of more challenging scenarios with arbitrary trajectory variations and reduced frame overlap (please refer to the supplementary material for visualization of the sequences). In Table 2, we report the results on the Free-Dataset, which demonstrates that our method exhibits more significant advantages under more challenging scenes. Fig. 4 shows that in addition to the superior quality of the novel view synthesis, our method can produce depths of good quality while most existing methods fail to.

Table 2. Evaluation of the pose accuracy and the novel view quality on Free-Dataset (Wang et al., 2023a). $\Delta T$ is the translation error in ground truth scale and $\Delta R$ is the rotation error in degrees.
| Method | $\Delta R$ | $\Delta T$ | PSNR | SSIM | LPIPS |
|---|---|---|---|---|---|
| BARF | 61.098 | 3.498 | 19.56 | 0.52 | 0.45 |
| L2G-NeRF | 110.303 | 6.587 | 19.95 | 0.54 | 0.45 |
| Nope-NeRF | 144.202 | 4.693 | 18.67 | 0.51 | 0.66 |
| CF-NeRF | 55.329 | 2.385 | 18.30 | 0.42 | 0.72 |
| Ours neighbor | 2.805 | 0.161 | 18.69 | 0.49 | 0.49 |
| Ours sim(3) | 2.805 | 0.161 | 22.46 | 0.59 | 0.43 |

4.3. Ablation Study

In this subsection, we conduct ablation studies to investigate the impact of the various components of our method. We ablate the reprojection loss, tracking, window optimization, and global optimization individually. Table 3 shows that removing tracking, window optimization, or global optimization leads to performance degradation, but the method remains functional. However, removing the reprojection loss leads to dramatic pose errors. We refer readers to the supplementary material for more ablation results.

Table 3. Ablation study on reprojection loss, tracking, window optimization, and global optimization.
| Method | $\Delta R$ | $\Delta T$ | PSNR | SSIM | LPIPS |
|---|---|---|---|---|---|
| Ours w/o reproj. loss | 56.040 | 1.904 | 16.18 | 0.44 | 0.63 |
| Ours w/o tracking | 5.302 | 0.280 | 23.66 | 0.67 | 0.35 |
| Ours w/o window opt. | 3.562 | 0.182 | 24.57 | 0.70 | 0.33 |
| Ours w/o global opt. | 6.189 | 0.234 | 22.66 | 0.64 | 0.40 |
| Ours | 2.560 | 0.131 | 25.43 | 0.72 | 0.30 |

5. Conclusion

We present CT-NeRF, a method capable of recovering poses and reconstructing scenes from image sequences captured along complex trajectories. We first introduce correspondences and a reprojected geometric image distance to impose extra constraints on the optimization graph, enabling robust and accurate pose estimation and scene structure reconstruction. Subsequently, we detail our incremental learning process for pose recovery, including initialization, tracking, window optimization, and global optimization. Through comparative and ablation experiments, we demonstrate the superiority of our method and the necessity of its individual components. Although our method enables joint pose estimation and reconstruction under complex camera trajectories, we only explore simple pose graphs; more sophisticated graph optimization is required for very long trajectories. In addition, evaluation datasets, protocols, and metrics are required for complex camera trajectories because, as discussed in the paper, the current visual metrics cannot fully reflect the reconstruction quality.

References

  • Abdelrasoul et al. (2016) Yassin Abdelrasoul, Abu Bakar Sayuti HM Saman, and Patrick Sebastian. 2016. A quantitative study of tuning ROS gmapping parameters and their effect on performing indoor 2D SLAM. In 2016 2nd IEEE international symposium on robotics and manufacturing automation (ROMA). IEEE, 1–6.
  • Bailey et al. (2006) Tim Bailey, Juan Nieto, Jose Guivant, Michael Stevens, and Eduardo Nebot. 2006. Consistency of the EKF-SLAM algorithm. In 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 3562–3568.
  • Bian et al. (2023) Wenjing Bian, Zirui Wang, Kejie Li, Jia-Wang Bian, and Victor Adrian Prisacariu. 2023. Nope-nerf: Optimising neural radiance field with no pose prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4160–4169.
  • Campos et al. (2021) Carlos Campos, Richard Elvira, Juan J Gómez Rodríguez, José MM Montiel, and Juan D Tardós. 2021. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Transactions on Robotics 37, 6 (2021), 1874–1890.
  • Castellanos et al. (2007) José A Castellanos, Ruben Martinez-Cantin, Juan D Tardós, and José Neira. 2007. Robocentric map joining: Improving the consistency of EKF-SLAM. Robotics and autonomous systems 55, 1 (2007), 21–29.
  • Chen et al. (2023) Yue Chen, Xingyu Chen, Xuan Wang, Qi Zhang, Yu Guo, Ying Shan, and Fei Wang. 2023. Local-to-global registration for bundle-adjusting neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8264–8273.
  • Cheng et al. (2023) Zezhou Cheng, Carlos Esteves, Varun Jampani, Abhishek Kar, Subhransu Maji, and Ameesh Makadia. 2023. LU-NeRF: Scene and pose estimation by synchronizing local unposed nerfs. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 18312–18321.
  • Chng et al. (2022) Shin-Fang Chng, Sameera Ramasinghe, Jamie Sherrah, and Simon Lucey. 2022. Gaussian activated neural radiance fields for high fidelity reconstruction and pose estimation. In European Conference on Computer Vision. Springer, 264–280.
  • Cui and Tan (2015) Zhaopeng Cui and Ping Tan. 2015. Global structure-from-motion by similarity averaging. In Proceedings of the IEEE International Conference on Computer Vision. 864–872.
  • Deng et al. (2022) Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. 2022. Depth-supervised nerf: Fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12882–12891.
  • Edstedt et al. (2023) Johan Edstedt, Ioannis Athanasiadis, Mårten Wadenbäck, and Michael Felsberg. 2023. DKM: Dense kernelized feature matching for geometry estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 17765–17775.
  • Engel et al. (2014) Jakob Engel, Thomas Schöps, and Daniel Cremers. 2014. LSD-SLAM: Large-scale direct monocular SLAM. In European conference on computer vision. Springer, 834–849.
  • Fischler and Bolles (1981) Martin A Fischler and Robert C Bolles. 1981. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24, 6 (1981), 381–395.
  • Gherardi et al. (2010) Riccardo Gherardi, Michela Farenzena, and Andrea Fusiello. 2010. Improving the efficiency of hierarchical structure-and-motion. In 2010 IEEE computer society conference on computer vision and pattern recognition. IEEE, 1594–1600.
  • Hartley and Zisserman (2003) Richard Hartley and Andrew Zisserman. 2003. Multiple view geometry in computer vision. Cambridge university press.
  • He et al. (2023) Xingyi He, Jiaming Sun, Yifan Wang, Sida Peng, Qixing Huang, Hujun Bao, and Xiaowei Zhou. 2023. Detector-Free Structure from Motion. arXiv preprint.
  • Jeong et al. (2021) Yoonwoo Jeong, Seokjun Ahn, Christopher Choy, Anima Anandkumar, Minsu Cho, and Jaesik Park. 2021. Self-calibrating neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5846–5854.
  • Jiang et al. (2013) Nianjuan Jiang, Zhaopeng Cui, and Ping Tan. 2013. A global linear method for camera pose registration. In Proceedings of the IEEE international conference on computer vision. 481–488.
  • Johari et al. (2023) Mohammad Mahdi Johari, Camilla Carta, and François Fleuret. 2023. Eslam: Efficient dense slam system based on hybrid representation of signed distance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 17408–17419.
  • Lin et al. (2021) Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, and Simon Lucey. 2021. Barf: Bundle-adjusting neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5741–5751.
  • Meuleman et al. (2023) Andreas Meuleman, Yu-Lun Liu, Chen Gao, Jia-Bin Huang, Changil Kim, Min H Kim, and Johannes Kopf. 2023. Progressively optimized local radiance fields for robust view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16539–16548.
  • Mildenhall et al. (2021) Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. 2021. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 65, 1 (2021), 99–106.
  • Mur-Artal et al. (2015) Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. 2015. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE transactions on robotics 31, 5 (2015), 1147–1163.
  • Niemeyer et al. (2022) Michael Niemeyer, Jonathan T Barron, Ben Mildenhall, Mehdi SM Sajjadi, Andreas Geiger, and Noha Radwan. 2022. RegNeRF: Regularizing neural radiance fields for view synthesis from sparse inputs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5480–5490.
  • Park et al. (2019) Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. 2019. DeepSDF: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 165–174.
  • Schonberger and Frahm (2016) Johannes L Schonberger and Jan-Michael Frahm. 2016. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4104–4113.
  • Snavely et al. (2008) Noah Snavely, Steven M Seitz, and Richard Szeliski. 2008. Modeling the world from internet photo collections. International journal of computer vision 80 (2008), 189–210.
  • Song et al. (2024) Jiuhn Song, Seonghoon Park, Honggyu An, Seokju Cho, Min-Seop Kwak, Sungjin Cho, and Seungryong Kim. 2024. DäRF: Boosting Radiance Fields from Sparse Input Views with Monocular Depth Adaptation. Advances in Neural Information Processing Systems 36 (2024).
  • Sucar et al. (2021) Edgar Sucar, Shikun Liu, Joseph Ortiz, and Andrew J Davison. 2021. iMAP: Implicit mapping and positioning in real-time. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6229–6238.
  • Sun et al. (2021) Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. 2021. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8922–8931.
  • Teed and Deng (2020) Zachary Teed and Jia Deng. 2020. RAFT: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16. Springer, 402–419.
  • Teed and Deng (2021) Zachary Teed and Jia Deng. 2021. DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras. Advances in neural information processing systems 34 (2021), 16558–16569.
  • Truong et al. (2023) Prune Truong, Marie-Julie Rakotosaona, Fabian Manhardt, and Federico Tombari. 2023. SPARF: Neural radiance fields from sparse and noisy poses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4190–4200.
  • Wang et al. (2023b) Chen Wang, Jiadai Sun, Lina Liu, Chenming Wu, Zhelun Shen, Dayan Wu, Yuchao Dai, and Liangjun Zhang. 2023b. Digging into depth priors for outdoor neural radiance fields. In Proceedings of the 31st ACM International Conference on Multimedia. 1221–1230.
  • Wang et al. (2023c) Hengyi Wang, Jingwen Wang, and Lourdes Agapito. 2023c. Co-SLAM: Joint coordinate and sparse parametric encodings for neural real-time SLAM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13293–13302.
  • Wang et al. (2023a) Peng Wang, Yuan Liu, Zhaoxi Chen, Lingjie Liu, Ziwei Liu, Taku Komura, Christian Theobalt, and Wenping Wang. 2023a. F2-NeRF: Fast neural radiance field training with free camera trajectories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4150–4159.
  • Wang et al. (2021) Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. 2021. NeRF--: Neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064 (2021).
  • Warburg et al. (2023) Frederik Warburg, Ethan Weber, Matthew Tancik, Aleksander Holynski, and Angjoo Kanazawa. 2023. Nerfbusters: Removing Ghostly Artifacts from Casually Captured NeRFs. arXiv:2304.10532 [cs.CV]
  • Wei et al. (2023) Tong Wei, Yash Patel, Alexander Shekhovtsov, Jiří Matas, and Daniel Barath. 2023. Generalized Differentiable RANSAC. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV). 17603–17614. https://doi.org/10.1109/ICCV51070.2023.01618
  • Wu (2013) Changchang Wu. 2013. Towards linear-time incremental structure from motion. In 2013 International Conference on 3D Vision-3DV 2013. IEEE, 127–134.
  • Yan et al. (2023) Qingsong Yan, Qiang Wang, Kaiyong Zhao, Jie Chen, Bo Li, Xiaowen Chu, and Fei Deng. 2023. CF-NeRF: Camera Parameter Free Neural Radiance Fields with Incremental Learning. arXiv:2312.08760 [cs.CV]
  • Zhu et al. (2022) Zihan Zhu, Songyou Peng, Viktor Larsson, Weiwei Xu, Hujun Bao, Zhaopeng Cui, Martin R Oswald, and Marc Pollefeys. 2022. NICE-SLAM: Neural implicit scalable encoding for SLAM. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12786–12796.
Figure 7. Consecutive frames in the NeRFBuster dataset (top) and the Free Dataset (bottom). The Free Dataset exhibits more pronounced camera motion, posing greater challenges.

Appendix A More Implementation Details

Our network architecture follows BARF (Lin et al., 2021): a single 8-layer MLP with a width of 128. For all compared state-of-the-art methods, we use their official open-source implementations. For test-time optimization, NoPe-NeRF uses its official implementation, while every other method initializes each test image from its neighbor and then runs 100 iterations of test-time optimization before evaluation. The Sim(3) alignment is likewise taken from the official open-source implementation of BARF.

Dataset We evaluate on two datasets: NeRFBuster (Warburg et al., 2023), which is also used in CF-NeRF (Yan et al., 2023), and Free-Dataset (Wang et al., 2023a), which consists of more challenging scenarios with arbitrary trajectory variations and reduced frame overlap, as shown in Fig. 7. We use the NeRFBuster sequences processed by CF-NeRF: for each scene, CF-NeRF selects approximately 50 images based on their overlap, ordered sequentially. For Free-Dataset, the sky scene comprises the images with indices 50 to 100, while all other scenes comprise the images with indices 0 to 50. All selected sequences are considerably challenging.

Table 4. Pose accuracy under the alignment approaches of BARF (Sim(3)) and CF-NeRF (Sim(3) with rotation). $\Delta T$ is the translation error in ground-truth scale and $\Delta R$ is the rotation error in degrees. Sorted in descending order by $\Delta R$.
| Metric | Method | Alignment | aloe | art | car | century | garbage | flowers | picnic | pikachu | pipe | plant | roses | table | mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\Delta R$ | CF-NeRF (Yan et al., 2023) | Sim(3) with rotation | 21.918 | 25.702 | 22.653 | 11.245 | 9.061 | 9.915 | 13.489 | 12.046 | 173.343 | 11.056 | 7.002 | 3.837 | 26.772 |
| $\Delta R$ | CF-NeRF | Sim(3) | 6.703 | 76.306 | 29.079 | 11.013 | 74.163 | 10.672 | 109.868 | 13.243 | 122.345 | 18.664 | 3.903 | 3.835 | 39.983 |
| $\Delta R$ | Ours | Sim(3) with rotation | 3.618 | 0.469 | 0.545 | 2.237 | 0.921 | 0.596 | 2.118 | 7.698 | 2.320 | 5.212 | 1.919 | 1.223 | 2.406 |
| $\Delta R$ | Ours | Sim(3) | 3.163 | 3.151 | 0.701 | 2.343 | 0.902 | 0.481 | 1.938 | 7.708 | 2.302 | 6.008 | 0.570 | 1.154 | 2.560 |
| $\Delta T$ | CF-NeRF | Sim(3) with rotation | 3.858 | 5.064 | 8.423 | 3.655 | 4.018 | 4.305 | 3.372 | 4.949 | 36.930 | 5.154 | 1.130 | 2.200 | 6.921 |
| $\Delta T$ | CF-NeRF | Sim(3) | 0.637 | 1.549 | 1.621 | 0.497 | 0.548 | 0.745 | 1.285 | 0.879 | 5.757 | 0.685 | 0.182 | 0.274 | 1.222 |
| $\Delta T$ | Ours | Sim(3) with rotation | 0.701 | 0.237 | 0.086 | 0.517 | 0.256 | 0.215 | 0.347 | 0.983 | 0.515 | 0.519 | 0.371 | 0.247 | 0.416 |
| $\Delta T$ | Ours | Sim(3) | 0.168 | 0.030 | 0.035 | 0.134 | 0.039 | 0.039 | 0.106 | 0.548 | 0.164 | 0.225 | 0.038 | 0.045 | 0.131 |
Table 5. Comparison to NeRF with COLMAP(GT) pose.
| Scene | $\Delta R$ (Ours) | $\Delta T$ (Ours) | PSNR (Ours) | SSIM (Ours) | LPIPS (Ours) | PSNR (NeRF + Our pose) | SSIM (NeRF + Our pose) | LPIPS (NeRF + Our pose) | PSNR (NeRF + COLMAP pose) | SSIM (NeRF + COLMAP pose) | LPIPS (NeRF + COLMAP pose) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| pikachu | 7.708 | 0.548 | 22.91 | 0.76 | 0.29 | 23.62 | 0.79 | 0.28 | 37.06 | 0.97 | 0.05 |
| plant | 6.302 | 0.225 | 22.64 | 0.71 | 0.30 | 20.51 | 0.63 | 0.39 | 28.27 | 0.85 | 0.24 |
| aloe | 3.163 | 0.168 | 24.36 | 0.61 | 0.35 | 25.51 | 0.68 | 0.26 | 24.04 | 0.58 | 0.40 |
| art | 3.151 | 0.030 | 26.73 | 0.83 | 0.14 | 13.77 | 0.34 | 0.56 | 12.90 | 0.30 | 0.60 |
| century | 2.343 | 0.134 | 22.56 | 0.65 | 0.50 | 14.08 | 0.25 | 0.69 | 14.28 | 0.28 | 0.68 |
| pipe | 2.302 | 0.164 | 23.13 | 0.59 | 0.40 | 21.90 | 0.55 | 0.43 | 23.05 | 0.63 | 0.37 |
| picnic | 1.938 | 0.106 | 23.04 | 0.52 | 0.49 | 22.52 | 0.51 | 0.45 | 25.25 | 0.67 | 0.31 |
| table | 1.154 | 0.045 | 32.73 | 0.91 | 0.17 | 25.64 | 0.84 | 0.28 | 23.82 | 0.82 | 0.27 |
| flowers | 0.902 | 0.039 | 22.69 | 0.63 | 0.43 | 16.29 | 0.30 | 0.66 | 15.38 | 0.27 | 0.67 |
| car | 0.701 | 0.035 | 27.41 | 0.79 | 0.25 | 21.52 | 0.67 | 0.33 | 18.93 | 0.60 | 0.40 |
| roses | 0.570 | 0.038 | 29.63 | 0.88 | 0.09 | 30.09 | 0.90 | 0.09 | 27.77 | 0.84 | 0.16 |
| garbage | 0.481 | 0.039 | 27.37 | 0.81 | 0.22 | 18.34 | 0.59 | 0.41 | 13.53 | 0.36 | 0.67 |
| mean | 2.560 | 0.131 | 25.43 | 0.72 | 0.30 | 21.14 | 0.59 | 0.40 | 22.02 | 0.60 | 0.40 |
Figure 8. Visualization of segments and interpolated views. (a), (c), and (e) are novel views within segments; (b) and (d) are interpolated novel views between segments.

Figure 9. Trajectory comparison. We visualize both the estimated camera poses (blue) and the COLMAP poses (red). Sparse 3D points for the scenes are from COLMAP. While there are abrupt changes in the trajectories of BARF, L2G-NeRF, and Nope-NeRF, the trajectories of CF-NeRF and ours change steadily. The bottom row shows rendered intermediate frames between the two frames of each abrupt change, denoted by green rectangles.

Table 6. Evaluations of the pose accuracy (top 2 row groups) and the novel view quality (bottom 3 row groups) on Free-Dataset (Wang et al., 2023a). $\Delta T$ is the translation error in ground-truth scale and $\Delta R$ is the rotation error in degrees.
| Metric | Method | grass | hydrant | lab | pillar | road | sky | stair | mean |
|---|---|---|---|---|---|---|---|---|---|
| $\Delta R$ | BARF (Lin et al., 2021) | 124.875 | 74.091 | 124.754 | 16.908 | 64.433 | 22.197 | 0.425 | 61.098 |
| $\Delta R$ | L2G-NeRF (Chen et al., 2023) | 114.356 | 170.250 | 56.227 | 131.588 | 109.558 | 27.245 | 162.898 | 110.303 |
| $\Delta R$ | Nope-NeRF (Bian et al., 2023) | 158.408 | 140.245 | 165.086 | 153.613 | 144.478 | 67.749 | 179.836 | 144.202 |
| $\Delta R$ | CF-NeRF (Yan et al., 2023) | 36.875 | 36.129 | 150.882 | 18.282 | 49.790 | 94.082 | 1.260 | 55.329 |
| $\Delta R$ | Ours | 7.785 | 0.454 | 6.126 | 0.124 | 3.001 | 2.054 | 0.089 | 2.805 |
| $\Delta T$ | BARF | 7.273 | 3.890 | 6.675 | 1.475 | 4.587 | 0.583 | 0.004 | 3.498 |
| $\Delta T$ | L2G-NeRF | 7.962 | 7.203 | 4.786 | 7.707 | 7.107 | 0.849 | 10.498 | 6.587 |
| $\Delta T$ | Nope-NeRF | 2.920 | 4.904 | 2.062 | 4.044 | 7.201 | 1.156 | 10.564 | 4.693 |
| $\Delta T$ | CF-NeRF | 2.850 | 2.018 | 5.998 | 1.382 | 1.808 | 2.518 | 0.121 | 2.385 |
| $\Delta T$ | Ours | 0.304 | 0.032 | 0.526 | 0.008 | 0.201 | 0.046 | 0.007 | 0.161 |
| PSNR | BARF | 18.00 | 17.33 | 18.73 | 20.17 | 19.01 | 15.66 | 28.00 | 19.56 |
| PSNR | L2G-NeRF | 18.29 | 17.46 | 21.18 | 19.93 | 20.49 | 17.90 | 24.41 | 19.95 |
| PSNR | Nope-NeRF | 17.02 | 18.33 | 17.55 | 18.99 | 19.08 | 15.39 | 24.35 | 18.67 |
| PSNR | CF-NeRF | 18.15 | 17.85 | 16.25 | 20.25 | 18.85 | 15.23 | 21.51 | 18.30 |
| PSNR | Ours neighbor | 17.57 | 17.95 | 17.94 | 21.91 | 19.30 | 15.12 | 21.07 | 18.69 |
| PSNR | Ours Sim(3) | 16.96 | 22.54 | 14.88 | 26.23 | 24.06 | 24.37 | 28.16 | 22.46 |
| SSIM | BARF | 0.40 | 0.33 | 0.58 | 0.52 | 0.47 | 0.49 | 0.83 | 0.52 |
| SSIM | L2G-NeRF | 0.42 | 0.32 | 0.67 | 0.53 | 0.52 | 0.55 | 0.74 | 0.54 |
| SSIM | Nope-NeRF | 0.40 | 0.37 | 0.63 | 0.47 | 0.44 | 0.60 | 0.66 | 0.51 |
| SSIM | CF-NeRF | 0.34 | 0.30 | 0.44 | 0.44 | 0.40 | 0.43 | 0.56 | 0.42 |
| SSIM | Ours neighbor | 0.40 | 0.35 | 0.53 | 0.56 | 0.49 | 0.45 | 0.64 | 0.49 |
| SSIM | Ours Sim(3) | 0.36 | 0.50 | 0.41 | 0.67 | 0.61 | 0.77 | 0.83 | 0.59 |
| LPIPS | BARF | 0.51 | 0.60 | 0.38 | 0.50 | 0.51 | 0.48 | 0.18 | 0.45 |
| LPIPS | L2G-NeRF | 0.51 | 0.61 | 0.26 | 0.51 | 0.57 | 0.41 | 0.25 | 0.45 |
| LPIPS | Nope-NeRF | 0.75 | 0.69 | 0.56 | 0.68 | 0.79 | 0.63 | 0.52 | 0.66 |
| LPIPS | CF-NeRF | 0.77 | 0.82 | 0.57 | 0.73 | 0.83 | 0.70 | 0.59 | 0.72 |
| LPIPS | Ours neighbor | 0.59 | 0.57 | 0.39 | 0.42 | 0.65 | 0.50 | 0.30 | 0.49 |
| LPIPS | Ours Sim(3) | 0.61 | 0.50 | 0.55 | 0.37 | 0.48 | 0.31 | 0.21 | 0.43 |

Appendix B Testing methods

As mentioned in the main paper, to calculate the metrics for test images, two sequential steps during testing are required: alignment of trajectories for pose quality assessment and test-time optimization for view synthesis quality assessment.

Alignment A 3D similarity transformation Sim(3) for the scene and the cameras can be obtained through different methods.

  • Sim(3) BARF (Lin et al., 2021) and L2G-NeRF (Chen et al., 2023) align estimated poses to the ground truth through Sim(3) obtained by Procrustes analysis on the camera pose locations.

  • Sim(3) with rotation CF-NeRF (Yan et al., 2023) finds that the Procrustes analysis used for Sim(3) is unreliable when all cameras lie on a line or the camera translations contain noise. To overcome this problem, CF-NeRF adds a virtual point $(0,0,1)$ in the camera coordinate frame of each image, transforms it to world coordinates using the camera parameters, and thereby uses the camera rotation during alignment (hence "with rotation"). However, we find that the approach of CF-NeRF causes larger translation errors.
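The two alignment variants above can be sketched in a few lines of NumPy. The snippet is a minimal illustration, not the code of BARF or CF-NeRF: `procrustes_sim3` is a hypothetical helper that computes a closed-form least-squares similarity transform (in the style of Umeyama's solution) between corresponding camera centers, and `with_virtual_points` is a hypothetical helper that appends the virtual point $(0,0,1)$ of each camera, mapped to world coordinates under the assumption that rotations are camera-to-world. How CF-NeRF weights or combines the virtual points is not specified here, so simply stacking them with the camera centers is an assumption.

```python
import numpy as np


def procrustes_sim3(src, dst):
    """Least-squares similarity transform (s, R, t) with dst ~ s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding 3D points (e.g. camera centers).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0
    R = U @ D @ Vt

    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(S) @ D) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t


def with_virtual_points(centers, rotations):
    """Append one virtual point per camera: (0, 0, 1) in camera coordinates,
    mapped to world coordinates (assumes camera-to-world rotations)."""
    centers = np.asarray(centers, dtype=float)
    z_axis = np.array([0.0, 0.0, 1.0])
    virtual = centers + np.asarray(rotations) @ z_axis  # (N, 3, 3) @ (3,) -> (N, 3)
    return np.concatenate([centers, virtual], axis=0)
```

With `s, R, t = procrustes_sim3(est_centers, gt_centers)`, an estimated center `c` maps to `s * R @ c + t`. Passing the augmented point sets from `with_virtual_points` instead lets the camera orientations influence the fit.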

We list the pose errors on NeRFBuster under both alignment approaches, Sim(3) and Sim(3) with rotation, in Table 4. The results show that Sim(3) with rotation can reduce rotation errors while causing larger translation errors. When the poses are accurate, Sim(3) with rotation brings little benefit to $\Delta R$ but harms $\Delta T$. We therefore employ the Sim(3) approach for all methods in the main paper to align the two trajectories before calculating pose errors.
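For concreteness, the pose errors after alignment can be computed as in the following sketch. It assumes camera-to-world rotations and camera centers, an alignment `(s, R_align, t_align)` that maps the estimated frame to the ground-truth frame, and per-scene errors averaged over frames. The function names are illustrative, and the exact evaluation code of the compared methods may differ.

```python
import numpy as np


def rotation_error_deg(R_est, R_gt):
    """Per-frame geodesic angle (degrees) between rotation matrices of shape (N, 3, 3)."""
    R_rel = R_est @ np.swapaxes(R_gt, -1, -2)
    cos = (np.trace(R_rel, axis1=-2, axis2=-1) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from round-off.
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))


def translation_error(c_est, c_gt):
    """Per-frame Euclidean distance between camera centers of shape (N, 3)."""
    return np.linalg.norm(c_est - c_gt, axis=-1)


def aligned_pose_errors(R_est, c_est, R_gt, c_gt, s, R_align, t_align):
    """Apply a Sim(3) alignment to the estimated poses, then return (mean dR, mean dT)."""
    c_aligned = s * c_est @ R_align.T + t_align  # centers: scale, rotate, translate
    R_aligned = R_align @ R_est                  # rotations: only the rotation applies
    d_r = rotation_error_deg(R_aligned, R_gt).mean()
    d_t = translation_error(c_aligned, c_gt).mean()
    return d_r, d_t
```

Because the centers are mapped into the ground-truth frame before differencing, the translation error is reported in ground-truth scale, matching the convention of the tables.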

Test-time optimization Here we outline previous testing methods with different combinations of initialization and test-time optimization.

  • Sim(3) + opt. In BARF (Lin et al., 2021), the poses are first initialized using Sim(3) alignment with Procrustes analysis on the camera pose locations. Then, an additional test-time optimization is used to further adjust the test poses. This initialization works well when the estimated poses can be aligned precisely to COLMAP poses. However, incorrect pose estimations can affect the Sim(3) alignment.

  • Estimated + no opt. CF-NeRF (Yan et al., 2023) recovers all poses without employing a test/train split and then tests every 8th image. However, with this approach it is impossible to tell whether good rendered results come from overfitting or from successful reconstruction.

  • Neighbor + opt. Nope-NeRF (Bian et al., 2023) initializes the test image pose with the estimated pose of the training frame that is closest to it. Neighbor initialization works well when the frame rate is high and the test pose is near the neighbor pose. When facing complex trajectories and reduced overlap, it struggles to supply a good initialization, as shown in Table 1 in the main paper and in Table 6.
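Neighbor initialization can be sketched as follows. The example assumes that "closest" means closest in frame index within the ordered sequence and that poses are stored as 4×4 camera-to-world matrices. Both are assumptions for illustration, as is the helper name `neighbor_init`. The subsequent test-time optimization of these poses is not shown.

```python
import numpy as np


def neighbor_init(test_ids, train_ids, train_poses):
    """Initialize each test pose with the estimated pose of the nearest training frame.

    test_ids:    iterable of frame indices held out for testing
    train_ids:   (M,) frame indices of the training frames
    train_poses: (M, 4, 4) estimated poses of the training frames
    Ties are resolved in favor of the earlier training frame.
    """
    train_ids = np.asarray(train_ids)
    init = []
    for i in test_ids:
        nearest = int(np.argmin(np.abs(train_ids - i)))
        init.append(np.array(train_poses[nearest], dtype=float, copy=True))
    return np.stack(init)
```

When the camera moves little between frames, the copied pose is already close to the true test pose. With large motion between consecutive frames, the starting point can be far off, which is the failure case described above.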

Due to the substantial alignment errors, all methods except ours struggle to obtain reasonable initial test poses through Sim(3). For the testing results in the main paper, we therefore adopt Neighbor + opt. for all methods and also provide results using Sim(3) + opt.

Overfitting As described in Table 1 and Section 4.2 of the main text and in Table 6, although methods like BARF converge to significant pose errors and poor structural quality, they still achieve novel view synthesis metrics comparable to those of our method. We attribute this to the network converging to local optima. The left part of Fig. 8 illustrates the poses estimated by BARF, where the three green boxes indicate three pose segments after fitting. To render a video, we fit B-spline functions to the estimated poses to obtain a smooth camera trajectory. The right part visualizes novel views synthesized for poses within the segments (a, c, e) and between the segments (b, d) on the B-spline trajectory. The novel views synthesized for poses within the three segments (a, c, e) appear normal, whereas those for the interpolated poses between segments (b, d) are implausible. It seems that each segment fits a sub-scene, and the images for the interpolated poses between segments stitch different sub-scenes together. Fig. 9 shows more results of this issue. These methods do not recover correct poses and scene geometry.

During test-time optimization, poses for test images are initialized with the estimated poses of neighboring training images. With the twisted scene and fragmented pose trajectories after training, the test-time optimization drives the pose of a test frame to a "pseudo" pose close to the estimated pose of the closest neighboring training frame. The network for the scene then further overfits to the "pseudo" ground-truth pairs of test pose and image, $(P, I)$ for example, and renders an image $I'$ using $P$ to calculate visual metrics against $I$, leading to high view synthesis metrics. Notice that during this process, when the scene geometry and camera poses exhibit large errors, $P$ diverges far from its true pose, in the direction that minimizes the difference between $I'$ and $I$.

Appendix C Comparison to NeRF with COLMAP pose

We additionally compare the novel view synthesis quality of NeRF models trained with our estimated poses and with COLMAP poses (which we use as GT) to demonstrate the accuracy of the estimated poses (Table 5). On average, NeRF + Our pose achieves novel view quality close to that of NeRF + COLMAP pose. In some scenes, such as pikachu and plant, our poses have large estimation errors. Both the pose error during training and the test pose misalignment lead to worse view quality of NeRF + Our pose in these scenes. In scenes with small pose errors, NeRF + Our pose achieves view quality similar to that of NeRF + COLMAP pose, and even better in many scenes. In many scenes, our method also achieves better view quality than both NeRF + Our pose and NeRF + COLMAP pose. We attribute this to the coarse-to-fine positional encoding, the reprojection loss, and the joint optimization of poses and scene.

Appendix D More Results

Detailed results on Free-Dataset are shown in Table 6. We provide more qualitative comparisons with state-of-the-art works. Fig. 10, Fig. 13, Fig. 14, and Fig. 15 compare novel view synthesis quality. Fig. 11, Fig. 16, Fig. 17, and Fig. 18 compare depth map rendering quality. Fig. 12, Fig. 19, and Fig. 20 compare the trajectories reconstructed by the various methods, where the red boxes represent the COLMAP poses and the blue boxes depict the estimated poses. The point cloud, processed from the COLMAP output, serves as a reference for relative positioning.

Figure 10. Comparison of novel views on Free-Dataset (Wang et al., 2023a).

Figure 11. Comparison of rendered depths on Free-Dataset (Wang et al., 2023a).

Figure 12. Trajectory comparison on Free-Dataset (Wang et al., 2023a). We visualize camera poses of both estimated (blue) and COLMAP (red). Sparse 3D points for the scenes are from COLMAP.

Figure 13. Comparison of novel views on NeRFBuster (Warburg et al., 2023). Part one.

Figure 14. Comparison of novel views on NeRFBuster (Warburg et al., 2023). Part two.

Figure 15. Comparison of novel views on NeRFBuster (Warburg et al., 2023). Part three.

Figure 16. Comparison of rendered depths on NeRFBuster (Warburg et al., 2023). Part one.

Figure 17. Comparison of rendered depths on NeRFBuster (Warburg et al., 2023). Part two.

Figure 18. Comparison of rendered depths on NeRFBuster (Warburg et al., 2023). Part three.

Figure 19. Trajectory comparison on NeRFBuster (Warburg et al., 2023). We visualize camera poses of both estimated (blue) and COLMAP (red). Sparse 3D points for the scenes are from COLMAP. Part one.

Figure 20. Trajectory comparison on NeRFBuster (Warburg et al., 2023). We visualize camera poses of both estimated (blue) and COLMAP (red). Sparse 3D points for the scenes are from COLMAP. Part two.