CT-NeRF: Incremental Optimizing Neural Radiance Field and Poses with Complex Trajectory
Abstract.
Neural radiance field (NeRF) has achieved impressive results in high-quality 3D scene reconstruction. However, NeRF relies heavily on precise camera poses. While recent works like BARF have introduced camera pose optimization within NeRF, their applicability is limited to scenes with simple trajectories. Existing methods struggle with complex trajectories involving large rotations. To address this limitation, we propose CT-NeRF, an incremental reconstruction and optimization pipeline that uses only RGB images, without pose or depth input. In this pipeline, we first propose a local-global bundle adjustment under a pose graph connecting neighboring frames, which enforces consistency between poses and thereby escapes the local minima that arise when poses are only required to be consistent with the scene structure. Further, we instantiate the consistency between poses as a reprojected geometric image distance constraint derived from pixel-level correspondences between input image pairs. Through incremental reconstruction, CT-NeRF recovers both camera poses and scene structure and is capable of handling scenes with complex trajectories. We evaluate CT-NeRF on two real-world datasets, NeRFBuster and Free-Dataset, which feature complex trajectories. Results show that CT-NeRF outperforms existing methods in novel view synthesis and pose estimation accuracy.
1. Introduction
Reconstructing high-fidelity, high-quality 3D scenes is important for the development of virtual reality / augmented reality, autonomous driving, and other domains. Recently, implicit representations such as neural radiance fields (NeRF) (Mildenhall et al., 2021) have achieved remarkable progress in reconstructing photo-realistic scenes given a sequence of RGB images and their corresponding camera poses. The camera poses for these high-fidelity reconstructions are primarily obtained through structure from motion (SfM) methods, with the off-the-shelf tool COLMAP (Schonberger and Frahm, 2016) being the most popular choice. These SfM methods estimate the camera poses through local registration between images and global bundle adjustment (BA) over all camera poses and sparse 3D scene points. Therefore, accurate camera poses can only be acquired after all images are processed. In addition, the matching and registration steps of these methods are sensitive to image variations.
Recent works such as NeRFmm (Wang et al., 2021), SC-NeRF (Jeong et al., 2021), BARF (Lin et al., 2021), GARF (Chng et al., 2022) and L2G-NeRF (Chen et al., 2023) tackle the dependency on camera pose priors by treating the camera poses as learnable parameters and jointly optimizing poses and scenes offline. However, using only images to constrain the 3D space already leads to problems such as wrong geometry, blurry textures, or floaters when very dense multiview images are not available, and adding the camera poses as extra degrees of freedom in the optimization makes the results worse. Therefore, these methods often require initial camera parameters close to the ground truth poses in object-centered scenes with dense multiview observations, or small camera movements. Nope-NeRF (Bian et al., 2023) incorporates monocular depth to impose further constraints on adjacent images, enabling pose estimation for trajectories with relatively small camera motions and rotations. However, its initialization of all poses as identity matrices leads to local optima when facing complex trajectories.
On the other hand, following the classic SfM pipelines, CF-NeRF (Yan et al., 2023) adds images incrementally, initializes the pose of a newly added image with the pose of the previous one, and optimizes the poses (and the scene). With the incremental strategy, the method is capable of reconstructing real-world scenes under complex camera trajectories. However, it still suffers from large pose errors and inferior reconstruction quality, as shown in Fig. 1, Fig. 4 and Fig. 5. The reason can be attributed to two aspects. First, the bundle adjustment constructs a center-based graph as shown in Fig. 2 (a) left, optimizing only the consistency between the camera poses and the central implicit global scene while neglecting the consistency between the poses and the multiview images. When the global structure falls into a local minimum, the camera poses cannot be recovered; in turn, the structure cannot escape the local minimum because the poses and structure are optimized jointly. Second, the method only uses the visual difference between the rendered images and the raw images, whereas BARF (Lin et al., 2021) observes that, as natural images are typically complex signals, gradient-based registration with pixel value differences is susceptible to suboptimal solutions if poorly initialized, as shown in the top figure of Fig. 2 (b). The coarse-to-fine pose estimation proposed by BARF (Lin et al., 2021) mitigates this issue but requires good initialization or dense forward-facing images. In complex trajectories, the camera typically exhibits large motions, and the views covering a region are much sparser than in the forward-facing scenario.
To tackle these issues and enable accurate pose estimation and reconstruction for complex trajectories, we propose a novel incremental joint optimization method for implicit radiance fields and camera poses named CT-NeRF. For the first issue, as shown in Fig. 2 (a) right, we propose a joint incremental reconstruction and pose estimation pipeline whose pose graphs add edges between camera poses on top of the center-based pose graphs, enabling a local-global bundle adjustment (BA). The graph forms many subgraphs and enforces consistency between the camera poses, which helps to recover the poses when the scene and poses are consistent with each other but the scene is actually in a local minimum during BA. For the second issue, and also to instantiate the pose consistency on the pose edges, we introduce a geometric image distance, i.e. the reprojected Euclidean distance between the correspondences of two input images. In addition to providing consistency constraints for pose edges, the reprojected distance benefits the pose and scene optimization in three aspects: 1) it provides a direct direction to align the poses, whereas the pixel value differences do not necessarily correlate with the pose error: as shown in Fig. 2 (b), gradients based on the pixel value difference are not consistent while the gradients from the reprojected geometric image distance are; 2) the reprojected distance requires the depth of the scene to warp a pixel in one image to the other, as shown in Fig. 2 (c), and therefore the gradient directly helps the scene geometry converge; 3) correspondence learning networks typically leverage large-scale pair-wise image datasets, so the reprojected loss based on the correspondences is more robust to occlusion, lighting variation, textureless regions and large motions than raw image losses.
In summary, our main contributions are threefold:
• We design an incremental reconstruction pipeline for neural radiance fields using only RGB images under complex camera trajectories, without pose and depth input.
• We propose to construct pose graphs with in-between pose consistency edges for BA and instantiate the consistency as a reprojected geometric image distance constraint from the learned correspondences between input images for robust pose and scene optimization.
• We achieve significant improvements in pose estimation accuracy and reconstruction quality compared to state-of-the-art methods in complex trajectories.
2. Related Work
SfM and SLAM In the field of computer vision, given a set of input images, SfM and SLAM aim to concurrently estimate camera poses and reconstruct the scene. The distinction is that SLAM operates online and emphasizes runtime performance, while SfM does not require online operation but demands higher accuracy. SfM methods can be categorized into incremental (Schonberger and Frahm, 2016; Wu, 2013; Snavely et al., 2008), global (Jiang et al., 2013; Cui and Tan, 2015), and hierarchical (Gherardi et al., 2010) approaches. The incremental approach initializes with two images and progressively registers and reconstructs additional images one by one. The global approach registers and reconstructs all images simultaneously. The hierarchical approach first groups images, performs registration and reconstruction for each group, and then conducts a global optimization. SLAM methods are primarily divided into filter-based (Bailey et al., 2006; Castellanos et al., 2007; Abdelrasoul et al., 2016) and graph optimization-based (Engel et al., 2014; Mur-Artal et al., 2015; Campos et al., 2021) approaches. Filter-based methods mainly utilize state estimation strategies such as Kalman filtering and particle filtering to incrementally estimate the posterior distributions of camera poses and key point locations. Graph optimization-based methods abstract camera poses at different times as nodes and the observation constraints at different robot locations as edges connecting the nodes, then employ bundle adjustment (BA) algorithms for global optimization. Our proposed incremental pipeline is inspired by the incremental SfM and SLAM approaches.
NeRF-based SFM and SLAM Implicit neural representations have gained prominence since 2019 (Park et al., 2019). Compared to traditional explicit representations that store geometric information in a relatively fixed and simple manner, implicit neural representations can better handle complex topological structures and geometric details. The classic algorithm for implicit neural representations, Vanilla NeRF (Mildenhall et al., 2021) (Neural Radiance Fields), is based on the theory of volume rendering and utilizes a Multi-Layer Perceptron (MLP) to learn the implicit neural representation of a static scene, achieving high-quality novel view synthesis. Consequently, some researchers have considered combining SfM and SLAM with NeRF, not only to reduce NeRF’s dependence on the accuracy of input image poses but also to enhance the scene representation capability of SfM and SLAM. BARF (Lin et al., 2021) was the first to integrate the core bundle adjustment (BA) algorithm from SfM with NeRF and adopted a coarse-to-fine reconstruction strategy, progressively aligning camera poses during the reconstruction process. To address the issue of camera poses being prone to local optima in BARF, L2G-NeRF (Chen et al., 2023) proposed a Local-to-Global alignment strategy, allowing camera poses to converge more easily to the global optimum. NoPe-NeRF (Bian et al., 2023) introduced monocular depth information and key point matching information, respectively, to constrain the relative camera pose relationships, ensuring global consistency of camera poses. LocalRF (Meuleman et al., 2023) proposed a progressive strategy based on video sequences to gradually optimize local regions. CF-NeRF (Yan et al., 2023) employed an incremental learning approach to enable reconstruction under complex trajectories. LU-NeRF (Cheng et al., 2023) introduced a Local-to-Global pose estimation strategy, enabling pose estimation and scene reconstruction from datasets with completely unknown camera poses. 
As for NeRF-based SLAM methods, iMAP (Sucar et al., 2021) adopted two threads: Tracking and Mapping. The Tracking thread optimizes the camera pose using the current model and performs key frame selection, while the Mapping thread jointly optimizes the poses of the keyframes and the model. iMAP uses a single MLP to represent the entire scene, limiting its scalability. Nice-SLAM (Zhu et al., 2022) improved upon iMAP by combining Hierarchical Feature Grids and MLP as the scene representation, enabling the application to large-scale scenes. Co-SLAM (Wang et al., 2023c) and e-SLAM (Johari et al., 2023) introduced multi-resolution hash encoding and tri-plane representations, respectively, to improve system frame rate and scene representation capability, building upon Nice-SLAM. Regrettably, current NeRF-based SLAM methods necessitate dense image sequences, while NeRF-based SfM techniques face difficulties in accommodating complex camera trajectories. To tackle these limitations, we introduce CT-NeRF, an incremental optimization framework that leverages additional correspondence constraints. CT-NeRF employs an incremental optimization process, iteratively refining the reconstruction as new images are integrated.
Correspondence in pose estimation Local feature matching plays a crucial role in SfM and SLAM. The traditional feature matching pipeline consists of three main steps: feature detection, feature descriptor computation, and feature matching. Feature detectors effectively reduce the search space for matching, and the resulting sparse matches are often sufficient for most tasks. However, in low-texture regions or under repetitive patterns, these methods often fail because they cannot detect sufficient feature points. With the flourishing development of deep learning, some researchers have started to leverage data-driven dense feature matching to enhance the accuracy and robustness of SfM and SLAM. For instance, Droid-SLAM (Teed and Deng, 2021) utilizes the dense optical flow learned by RAFT (Teed and Deng, 2020) for feature matching, achieving higher accuracy and robustness than traditional SLAM and rarely failing in experimental scenarios. Detector-free SfM (He et al., 2023) leverages the matching strategy of Loftr (Sun et al., 2021) without feature detectors, exhibiting significant advantages in low-texture regions and winning multiple competitions. Droid-SLAM and Detector-free SfM focus on pose estimation and on reconstructing the scene with sparse points. On the other hand, SPARF (Truong et al., 2023) utilizes the correspondences from DKM matching for implicit neural reconstruction under noisy poses with several views. Different from these works, our method focuses on an incremental pipeline that jointly optimizes poses and an implicit scene under complex trajectories.
3. Method
We first present the formulation of incremental scene reconstruction and pose estimation: given a set of sequential images $\{I_i\}_{i=1}^{N}$, where $I_i$ represents the $i$-th frame captured in a camera trajectory, we aim to jointly optimize the poses of the images and a neural radiance field model representing the 3D scene captured by the images, adding one image at a time sequentially. To achieve this goal, we design an incremental reconstruction pipeline for the neural radiance field without pose priors. Our pipeline consists of five parts as shown in Fig. 3. The scene is initialized using a small set of images. Subsequently, for each input image, tracking is applied to estimate a rough camera pose for the new image. Window optimization then refines the poses of the images within the window and reconstructs the local structural components. To further incorporate the consistency of all visited camera poses, global optimization (bundle adjustment) is performed on all images to optimize the global camera poses and the overall scene structure. Tracking, window optimization, and global optimization repeat until all images are added. After all images are added, post optimization iteratively refines the entire scene and all camera poses until convergence.
3.1. Preliminary: NeRF with Pose Optimization
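We define the camera projection function that projects a space point $\mathbf{P}$ to a pixel $\mathbf{p}$ as
(1) $\mathbf{p} = \pi(\mathbf{P}, T_i, K) = K\, T_i^{-1}\, \mathbf{P}$
where the camera pose $T_i$ is the camera-to-world transformation of image $I_i$ and $K$ is the intrinsic matrix (we assume all images in a trajectory share the same intrinsics). $T_i = [R_i \mid \mathbf{t}_i]$, where $R_i$ represents rotation and $\mathbf{t}_i$ translation. The homogenization operations are omitted for clarity. The backprojection function projects the pixel coordinate location $\mathbf{p}$ into a space point with depth $z$
(2) $\mathbf{P} = \pi^{-1}(\mathbf{p}, T_i, K, z) = T_i\,(z\, K^{-1}\, \mathbf{p})$
NeRF maps a 3D location $\mathbf{x}$ and a view direction $\mathbf{d}$ to a radiance color $\mathbf{c}$ and volume density $\sigma$ with an MLP $F_\theta$ parameterized by $\theta$. It optimizes the model and camera poses by minimizing the photometric loss between rendered images $\hat{I}_i$ and input images $I_i$
(3) $\mathcal{L}_{rgb} = \sum_i \| \hat{I}_i - I_i \|_2^2$
where $\hat{I}_i$ can be obtained through volume rendering. For each pixel $\mathbf{p}$ in image $I_i$, its color is rendered by aggregating predicted colors and densities along the ray $\mathbf{r}(h) = \mathbf{o} + h\,\mathbf{d}$, where $\mathbf{o}$ and $\mathbf{d}$ can be obtained from $T_i$ and $K$, and $h$ represents the sample distance.
(4) $\hat{I}_i(\mathbf{p}) = \int_{h_n}^{h_f} \mathcal{T}(h)\, \sigma(\mathbf{r}(h))\, \mathbf{c}(\mathbf{r}(h), \mathbf{d})\, dh$
where $\mathcal{T}(h) = \exp\left(-\int_{h_n}^{h} \sigma(\mathbf{r}(s))\, ds\right)$ indicates how much light is transmitted on ray $\mathbf{r}$ up to $h$. In the same way, the depth map $\hat{D}_i$ can be rendered.
(5) $\hat{D}_i(\mathbf{p}) = \int_{h_n}^{h_f} \mathcal{T}(h)\, \sigma(\mathbf{r}(h))\, h\, dh$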
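As a rough illustration of Eqs. (4) and (5), the sketch below applies the standard discrete quadrature used in NeRF-style volume rendering to a single ray. The function name `render_ray`, the very large final interval, and the toy inputs are assumptions made for illustration; they are not taken from the CT-NeRF implementation.

```python
import numpy as np

def render_ray(sigmas, colors, h):
    """Discrete volume rendering for one ray.

    sigmas: (S,) densities, colors: (S, 3) radiance, h: (S,) increasing sample distances.
    Returns the rendered color, the rendered depth, and the per-sample weights.
    """
    # Spacing between adjacent samples; the last interval is treated as very large.
    deltas = np.append(np.diff(h), 1e10)
    # Opacity contributed by each interval.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: fraction of light that reaches sample k without being absorbed.
    trans = np.cumprod(np.append(1.0, 1.0 - alphas[:-1]))
    weights = trans * alphas
    color = (weights[:, None] * colors).sum(axis=0)
    depth = (weights * h).sum()
    return color, depth, weights

# Toy ray: empty space followed by an opaque red surface at distance 3.
h = np.array([1.0, 2.0, 3.0, 4.0])
sigmas = np.array([0.0, 0.0, 1e3, 0.0])
colors = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
color, depth, weights = render_ray(sigmas, colors, h)
```

Because the depth is a weighted sum with the same weights as the color, any loss on the rendered depth sends gradients to the same densities that shape the image.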
3.2. Reprojected Geometric Image Distance
As aforementioned, we aim to incorporate the geometric distance constraint from correspondences between images to jointly optimize camera poses and 3D scene model parameters under complex trajectories. For correspondence generation, many existing correspondence learning networks can be exploited. In this work we choose DKM (Edstedt et al., 2023) as it is the current state-of-the-art work in dense correspondence matching, achieving high matching accuracy and strong robustness. Given a pair of adjacent images $(I_i, I_j)$ in the input trajectory, a pre-trained DKM model can generate full-resolution pixel-level correspondences between them while predicting the confidence $w$ of each pixel match. We use a set $\mathcal{M}_{ij}$ to represent the output correspondences for the two input images
(6) $\mathcal{M}_{ij} = \{(\mathbf{p}_i, \mathbf{p}_j, w)\}$
According to multiview geometry theory (Hartley and Zisserman, 2003), as few as five pairs of correspondences suffice to solve the relative pose between two calibrated images, and the 3D scene points for the pairs can then be acquired through triangulation. Though the correspondences only produce sparse 3D points and our representation is implicit, the theory provides important information for our problem: 1) correspondences provide a way to solve the pose estimation problem without knowing the 3D scene; 2) pose estimation requires only sparse correspondences; 3) the correspondences embed 3D information.
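We define our geometric image distance for the pose estimation between two images under implicit scene representation as follows. As correspondences in practice often contain noise, we randomly sample $N_s$ correspondences with confidence above a threshold $\tau$ from the correspondence set $\mathcal{M}_{ij}$, to serve as a correspondence set $\mathcal{S}_{ij}$ for pose estimation,
(7) $\mathcal{S}_{ij} = \{(\mathbf{p}_i, \mathbf{p}_j, w) \in \mathcal{M}_{ij} \mid w > \tau\}, \quad |\mathcal{S}_{ij}| = N_s$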
In classic SfM pipelines, algorithms like RANSAC (Fischler and Bolles, 1981) are exploited to remove outliers and obtain robust poses and 3D points. Though differentiable RANSAC (Wei et al., 2023) may be deployed in our implicit joint optimization, we choose a simpler strategy: 1) weighting the reprojection error according to the confidence of the correspondence estimation; 2) randomly sampling a small set of samples in each iteration. In every training iteration for the scene model , pairs of correspondences are randomly fetched from . Optimizing the pose and/or scene parameters with confidence-weighted correspondence sets sampled across many different iterations serves a purpose similar to RANSAC.
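The two-step strategy above (threshold and subsample once, then draw a small random batch per iteration) can be sketched as follows. The function names, the array layout, and the numeric values are hypothetical, and the dense matches are random stand-ins rather than DKM output.

```python
import numpy as np

def build_sparse_set(pix_i, pix_j, conf, tau, n_keep, rng):
    """Keep at most n_keep correspondences whose confidence exceeds tau."""
    keep = np.flatnonzero(conf > tau)
    if keep.size > n_keep:
        keep = rng.choice(keep, size=n_keep, replace=False)
    return pix_i[keep], pix_j[keep], conf[keep]

def sample_batch(sparse_set, n_batch, rng):
    """Draw a random batch of correspondences for one training iteration."""
    pix_i, pix_j, conf = sparse_set
    idx = rng.choice(len(conf), size=min(n_batch, len(conf)), replace=False)
    return pix_i[idx], pix_j[idx], conf[idx]

# Stand-in for dense matches: pixel pairs offset by one pixel, with random confidence.
rng = np.random.default_rng(0)
pix_i = rng.uniform(0.0, 100.0, size=(1000, 2))
pix_j = pix_i + 1.0
conf = rng.uniform(0.0, 1.0, size=1000)

sparse = build_sparse_set(pix_i, pix_j, conf, tau=0.5, n_keep=100, rng=rng)
batch = sample_batch(sparse, n_batch=16, rng=rng)
```

Since each iteration sees a different batch, a few wrong matches affect only some gradient steps, which is the sense in which repeated sampling plays a role similar to RANSAC.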
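For each correspondence $(\mathbf{p}_i, \mathbf{p}_j, w)$, the depth value $\hat{D}_i(\mathbf{p}_i)$ of $\mathbf{p}_i$ can be rendered by Eq. (5), and $\mathbf{p}_i$ can be backprojected into a 3D point and reprojected to image $I_j$. Then we have the geometric image distance
(8) $d(\mathbf{p}_i, \mathbf{p}_j) = \left\| \pi\left(\pi^{-1}(\mathbf{p}_i, T_i, K, \hat{D}_i(\mathbf{p}_i)),\, T_j,\, K\right) - \mathbf{p}_j \right\|_2$
which is the reprojection error. Here $\pi$ and $\pi^{-1}$ are the projection and backprojection functions of Eqs. (1) and (2), and $T_i, T_j$ are the poses to be estimated for images $I_i, I_j$ respectively. The gradients with respect to the camera poses $T_i, T_j$, the rendered depths, and further the model parameters can be obtained from Eq. (8), indicating that the gradients help the depth and scene reconstruction in addition to pose estimation.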
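A minimal NumPy sketch of the reprojection error, assuming a pinhole camera with the optical axis along +z and 4x4 camera-to-world pose matrices. The helper names and the normalization of the confidence-weighted mean are assumptions. In the actual pipeline the depth would come from the rendered depth map, whereas here it is taken from known toy points.

```python
import numpy as np

def backproject(p, depth, T_c2w, K):
    """Lift pixels (N, 2) with depths (N,) to world points (N, 3)."""
    ones = np.ones((p.shape[0], 1))
    rays_cam = (np.linalg.inv(K) @ np.hstack([p, ones]).T).T
    pts_cam = rays_cam * depth[:, None]
    return pts_cam @ T_c2w[:3, :3].T + T_c2w[:3, 3]

def project(P, T_c2w, K):
    """Project world points (N, 3) to pixels (N, 2)."""
    R, t = T_c2w[:3, :3], T_c2w[:3, 3]
    pts_cam = (P - t) @ R          # row-wise R^T (P - t)
    uvw = pts_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def reprojection_distance(p_i, p_j, conf, depth_i, T_i, T_j, K):
    """Confidence-weighted mean distance between warped pixels and their matches."""
    P = backproject(p_i, depth_i, T_i, K)
    p_i_to_j = project(P, T_j, K)
    dist = np.linalg.norm(p_i_to_j - p_j, axis=1)
    return (conf * dist).sum() / conf.sum()

# Toy setup: two cameras, the second shifted by 0.5 along x.
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
T_i = np.eye(4)
T_j = np.eye(4)
T_j[0, 3] = 0.5
P_world = np.array([[0.0, 0.0, 2.0], [1.0, -1.0, 4.0]])
p_i = project(P_world, T_i, K)
p_j = project(P_world, T_j, K)
depth_i = P_world[:, 2]            # camera i sits at the origin, so depth equals z
conf = np.array([0.9, 0.8])

err_correct = reprojection_distance(p_i, p_j, conf, depth_i, T_i, T_j, K)
err_wrong = reprojection_distance(p_i, p_j, conf, depth_i, T_i, T_i, K)
```

With the correct pose for the second camera the distance vanishes, while a wrong pose (here the identity) leaves a residual of several pixels.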
The reprojection error is referred to as the Gold Standard error (Hartley and Zisserman, 2003). The error is a quadratic convex function of pose parameters, pointing in the right direction for pose optimization without local minima. In contrast, the pixel value difference does not necessarily correlate with the correct direction for pose optimization, as shown in BARF (Lin et al., 2021). Compared with the pixel value difference, the reprojection error also provides more robust gradients for scene estimation, since the pixel value difference is susceptible to lighting, occlusion, sparse views, etc., while the correspondence learning network behind the reprojection error learns to handle these factors during training.
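The contrast between the two kinds of loss can be seen in a one-dimensional toy analogy, which is not the paper's experiment: the "pose" is a single shift, the "image" is a periodic signal, and the correspondences are assumed perfect. All names and values below are illustrative.

```python
import numpy as np

def signal(x):
    # Toy 1-D "image": two harmonics with period 1.
    return np.sin(2 * np.pi * 5 * x) + 0.5 * np.sin(2 * np.pi * 2 * x)

def photometric_loss(t, t_true, x):
    # Squared intensity difference between the signal shifted by t and by t_true.
    return np.mean((signal(x + t) - signal(x + t_true)) ** 2)

def geometric_loss(t, t_true, x):
    # Matched positions are x + t_true; the current estimate predicts x + t.
    return np.mean(np.abs((x + t) - (x + t_true)))

def count_local_minima(values):
    v = np.asarray(values)
    return int(np.sum((v[1:-1] < v[:-2]) & (v[1:-1] < v[2:])))

x = np.arange(256) / 256.0
t_true = 0.3
shifts = np.linspace(-0.2, 0.8, 401)
photo = [photometric_loss(t, t_true, x) for t in shifts]
geo = [geometric_loss(t, t_true, x) for t in shifts]

n_photo_minima = count_local_minima(photo)
n_geo_minima = count_local_minima(geo)
```

In this toy setting the intensity-based loss has several local minima spaced by the period of the dominant harmonic, so gradient descent started far from the true shift can stall, whereas the distance between matched positions has a single minimum at the true shift.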
3.3. Tracking
When a new frame is added to the training process, tracking provides a rough estimate of its camera pose, which becomes particularly crucial when the camera motion changes abruptly. The pose of the new frame is initialized with the pose of the previous frame. Tracking performs pose estimation through the reprojection error with adjacent frames and the photometric loss with the scene. The loss function can be formulated as:
(9) |
(10) |
where is the reprojection loss for pairs of neighboring images and is an optimization frame set with frames.
For the tracking, consists of a new frame and the preceding frame () and we optimize only the pose of the newly added frame , keeping other optimizable pose parameters and the network parameters fixed.
Initialization is crucial as it affects the subsequent tracking performance and the final pose estimation and reconstruction quality. In our approach, we select the first images in the sequence as the initialization images set . The initialization is achieved by minimizing Eq. (9). Due to the difficulty in optimizing the rotation parameters in the initial stages, we fix the rotation parameters and do not optimize them during the initialization.
3.4. Joint Optimization
Using the center-based graph and the photometric loss in Eq. (3) for joint optimization or bundle adjustment is susceptible to local minima when only RGB inputs are available. This formulation only constrains the poses to be consistent with the scene, while reconstructing the scene with implicit neural radiance fields from only RGB images (especially sparse images) is prone to converging to wrong geometry, as demonstrated in many previous works (Deng et al., 2022; Niemeyer et al., 2022; Wang et al., 2023b; Song et al., 2024). If the scene is stuck in a local minimum, bundle adjustment cannot correct the poses as long as they conform to the distorted scene. As shown in Table 1, though the poses from BARF (Lin et al., 2021), L2G-NeRF (Chen et al., 2023), and Nope-NeRF (Bian et al., 2023) deviate considerably from the ground truth, visual metrics such as PSNR remain high, so the poses cannot escape the local minima. In contrast, we construct pose graphs with constraints on the edges between poses for the joint optimization. The subgraphs formed between the poses and between the poses and the scene (shown in Fig. 2 (a) right) enable consistency of all the in-between poses in the image set. The reprojection error enforces consistency of adjacent poses with the correspondences of the input images. Further, we design a strategy combining local window and global BA to balance the integration of new information against consistency with existing estimates.
Our joint optimization uses the same equation Eq. (10) in tracking. However, the scene model is learnable, and different frame sets for the optimization are maintained.
Window optimization To jointly optimize the scene and the camera pose for a newly added image, window optimization selects the most recent frames as the optimization set . The window optimization process fixes the camera poses outside the window and optimizes the camera poses within the window and the network using Eq. (9) by performing local bundle adjustment. Unlike existing implicit SfM and SLAM methods, which consist of tracking for the newly added image and global bundle adjustment for all the input images, the extra window optimization improves pose estimation accuracy by enhancing consistency with the nearest previous frames, and it lets the network learn faster from the new frame by leaving older frames out.
Global optimization Relying solely on tracking and window optimization can lead to cumulative errors and even failure in pose estimation. To address this issue, global optimization incorporates all frames currently added to the training process into the optimization set . By applying Eq. (9) to optimize all frames simultaneously, global optimization significantly enhances the robustness and accuracy of pose estimation.
Post optimization Before all frames are added, the learning rates of the network and the poses are fixed. The positional encoding control parameter is also fixed. After all frames are added, the post optimization process gradually reduces the learning rates and increases the frequency of the positional encoding to iteratively refine the entire scene and camera poses through Eq. (9), ultimately obtaining the final results.
3.5. Training procedure
Training Pipeline The pipeline is initialized with frames. The stage is optimized for iterations. Afterwards, for each subsequent frame added, tracking is performed for iterations, window optimization for iterations, and global optimization for iterations. These three stages repeat with a new frame added and continue until all frames have been added. Finally, the post optimization stage consisting of iterations is conducted to further refine the reconstruction.
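The stage ordering described above can be sketched as a generator that lists, for each stage, the frame set used in the loss, the poses being optimized, and whether the scene network is trained. The stage names, dictionary keys, and iteration counts are hypothetical placeholders, and the fixed-rotation detail of the initialization is not modeled.

```python
def training_schedule(n_frames, n_init, window, iters):
    """Yield (stage, frame_set, optimized_poses, train_scene, n_iters) tuples."""
    init = list(range(n_init))
    yield ("init", init, init, True, iters["init"])
    for n in range(n_init, n_frames):
        # Tracking: new frame plus its predecessor; only the new pose is optimized.
        yield ("tracking", [n - 1, n], [n], False, iters["track"])
        # Window optimization: most recent frames; poses outside the window stay fixed.
        win = list(range(max(0, n - window + 1), n + 1))
        yield ("window", win, win, True, iters["window"])
        # Global optimization: every frame added so far.
        seen = list(range(n + 1))
        yield ("global", seen, seen, True, iters["global"])
    everything = list(range(n_frames))
    yield ("post", everything, everything, True, iters["post"])

# Placeholder iteration counts, chosen only to make the example run.
iters = {"init": 100, "track": 10, "window": 20, "global": 30, "post": 200}
schedule = list(training_schedule(n_frames=5, n_init=2, window=3, iters=iters))
stages = [step[0] for step in schedule]
```

A training loop would walk through this schedule and, for each entry, run the given number of optimizer steps on the listed poses and, where flagged, the scene network.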
Positional Encoding The coarse-to-fine positional encoding plays an important role in accurate pose estimation (Lin et al., 2021), as excessively high frequencies can hinder this process. To address this, we employ the BARF (Lin et al., 2021) positional encoding frequency control method. Specifically, before the post-optimization stage, we ensure a low-frequency setting for the positional encoding control parameter . During post optimization, we keep the same coarse-to-fine strategy as BARF.
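For reference, the frequency weighting from the BARF paper can be sketched as below: frequency band k is switched on smoothly as the control parameter alpha grows from 0 to the number of bands L. The function names and the concatenation layout are assumptions, and the schedule that drives alpha during training is left to the caller.

```python
import numpy as np

def barf_weights(alpha, L):
    """Per-band weights: 0 for alpha < k, cosine ramp on [k, k + 1), 1 afterwards."""
    k = np.arange(L)
    return (1.0 - np.cos(np.clip(alpha - k, 0.0, 1.0) * np.pi)) / 2.0

def positional_encoding(x, alpha, L):
    """Encode x of shape (..., D) into (..., D + 2 * D * L) with weighted bands."""
    x = np.asarray(x, dtype=float)
    freqs = (2.0 ** np.arange(L)) * np.pi
    angles = x[..., None] * freqs                      # (..., D, L)
    w = barf_weights(alpha, L)
    enc = np.concatenate([w * np.sin(angles), w * np.cos(angles)], axis=-1)
    return np.concatenate([x, enc.reshape(*x.shape[:-1], -1)], axis=-1)

w_low = barf_weights(0.0, 4)    # all bands off: only the raw input passes through
w_mid = barf_weights(1.5, 4)    # first band on, second half on, rest off
w_full = barf_weights(4.0, 4)   # all bands on: standard positional encoding
encoded = positional_encoding(np.zeros((2, 3)), alpha=0.0, L=4)
```

Keeping alpha small suppresses the high-frequency bands, which smooths the signal that the pose gradients see; raising it later restores the capacity for fine detail.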
4. Experiments
4.1. Experiment Settings
Dataset We evaluate our method on two challenging datasets with complex trajectories, NeRFBuster (Warburg et al., 2023) and Free-Dataset (Wang et al., 2023a). NeRFBuster consists of a total of 12 scenes, with most trajectories revolving around a central object. We employ the sequences selected by CF-NeRF (Yan et al., 2023), with approximately 50 images per scene. We choose every 8th image from each sequence as the test set for novel view synthesis. All images are downsampled to a resolution of . Ground truth poses are estimated using COLMAP, as provided by CF-NeRF. Free-Dataset comprises 7 scenes with arbitrary trajectories, predominantly in outdoor environments characterized by highly dynamic camera motions. We select 50 images per scene in sequential order, and every 8th image is designated as the test set. The images are downsampled to a resolution of , and the ground truth poses are obtained through COLMAP (Schonberger and Frahm, 2016).
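A small sketch of the every-8th-image split. The text does not say at which index the split starts, so starting from the first image is an assumption, as is the function name.

```python
def split_every_kth(n_images, k=8):
    """Hold out every k-th image (starting at index 0) as the test set."""
    test = list(range(0, n_images, k))
    train = [i for i in range(n_images) if i % k != 0]
    return train, test

train_ids, test_ids = split_every_kth(50, k=8)
```

With 50 images this holds out 7 test views and keeps 43 for training.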
Implementation Details Our approach is implemented based on the BARF (Lin et al., 2021) framework. The majority of the hyperparameters in our network model align with the BARF Real-World Scenes setting, including the network learning rate decay from to , pose learning rate decay from to , inverse sampling of 128 points along each ray with an inverse range of , a batch size of 1024, and a linear adjustment of for the post optimization phase from iteration 20K to 100K. We randomly select correspondences with confidence scores higher than from the dense correspondences as sparse correspondences. During each iteration, correspondences are randomly chosen from this set for the reprojection loss. For NeRFBuster scenes, we set , , , , , , and . As the frames in Free-Dataset exhibit larger camera motions and smaller overlap, the network requires more iterations to estimate poses accurately and achieve convergence. For Free-Dataset scenes, we set , , , , , , and .
Table 1: Pose estimation and novel view synthesis results on NeRFBuster.
Metrics | Method | aloe | art | car | century | garbage | flowers | picnic | pikachu | pipe | plant | roses | table | mean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Rotation error ↓ | BARF (Lin et al., 2021) | 128.599 | 45.739 | 97.148 | 114.261 | 103.259 | 79.702 | 81.418 | 112.996 | 166.701 | 140.270 | 125.974 | 139.675 | 111.312 |
Rotation error ↓ | L2G-NeRF (Chen et al., 2023) | 117.475 | 28.247 | 161.862 | 59.730 | 90.889 | 88.750 | 99.076 | 123.543 | 106.838 | 71.551 | 139.057 | 144.848 | 102.656 |
Rotation error ↓ | Nope-NeRF (Bian et al., 2023) | 101.589 | 32.345 | 113.063 | 150.253 | 149.459 | 161.859 | 148.710 | 158.059 | 99.836 | 138.816 | 150.050 | 114.783 | 126.569 |
Rotation error ↓ | CF-NeRF (Yan et al., 2023) | 6.703 | 76.306 | 29.079 | 11.013 | 74.163 | 10.672 | 109.868 | 13.243 | 122.345 | 18.664 | 3.903 | 3.835 | 39.983 |
Rotation error ↓ | Ours | 3.163 | 3.151 | 0.701 | 2.343 | 0.902 | 0.481 | 1.938 | 7.708 | 2.302 | 6.302 | 0.570 | 1.154 | 2.560 |
Translation error ↓ | BARF | 6.039 | 4.040 | 5.043 | 5.434 | 4.663 | 4.693 | 3.007 | 3.772 | 3.763 | 5.865 | 4.952 | 4.076 | 4.612 |
Translation error ↓ | L2G-NeRF | 4.986 | 4.402 | 4.764 | 5.895 | 4.272 | 4.926 | 4.214 | 6.451 | 5.592 | 2.764 | 5.055 | 4.199 | 4.795 |
Translation error ↓ | Nope-NeRF | 5.151 | 5.302 | 5.401 | 3.202 | 5.571 | 4.742 | 4.819 | 3.757 | 4.983 | 5.896 | 5.399 | 5.817 | 5.004 |
Translation error ↓ | CF-NeRF | 0.637 | 1.549 | 1.621 | 0.497 | 0.548 | 0.745 | 1.285 | 0.879 | 5.757 | 0.685 | 0.182 | 0.274 | 1.222 |
Translation error ↓ | Ours | 0.168 | 0.030 | 0.035 | 0.134 | 0.039 | 0.039 | 0.106 | 0.548 | 0.164 | 0.225 | 0.038 | 0.045 | 0.131 |
PSNR ↑ | BARF | 23.56 | 20.55 | 23.69 | 18.73 | 19.92 | 23.14 | 22.91 | 31.58 | 23.43 | 29.38 | 21.87 | 25.88 | 23.72 |
PSNR ↑ | L2G-NeRF | 23.47 | 22.58 | 23.98 | 19.32 | 20.36 | 24.52 | 22.18 | 33.66 | 21.99 | 29.63 | 21.74 | 25.60 | 24.09 |
PSNR ↑ | Nope-NeRF | 22.42 | 21.53 | 22.62 | 19.55 | 20.59 | 21.91 | 22.97 | 27.39 | 21.33 | 25.69 | 19.91 | 26.83 | 22.73 |
PSNR ↑ | CF-NeRF | 23.32 | 23.50 | 22.04 | 21.35 | 21.30 | 23.91 | 23.31 | 31.51 | 22.24 | 25.89 | 23.42 | 26.71 | 24.04 |
PSNR ↑ | Ours neighbor | 23.17 | 25.76 | 24.90 | 21.64 | 21.62 | 26.14 | 23.04 | 30.50 | 23.02 | 27.19 | 22.14 | 30.85 | 25.00 |
PSNR ↑ | Ours sim(3) | 24.36 | 26.73 | 27.41 | 22.56 | 22.69 | 27.37 | 23.04 | 22.91 | 23.13 | 22.64 | 29.63 | 32.73 | 25.43 |
SSIM ↑ | BARF | 0.59 | 0.68 | 0.72 | 0.52 | 0.54 | 0.72 | 0.54 | 0.92 | 0.62 | 0.85 | 0.69 | 0.84 | 0.69 |
SSIM ↑ | L2G-NeRF | 0.58 | 0.74 | 0.74 | 0.56 | 0.53 | 0.76 | 0.50 | 0.94 | 0.55 | 0.87 | 0.69 | 0.83 | 0.69 |
SSIM ↑ | Nope-NeRF | 0.52 | 0.73 | 0.69 | 0.57 | 0.56 | 0.67 | 0.53 | 0.88 | 0.53 | 0.80 | 0.64 | 0.86 | 0.67 |
SSIM ↑ | CF-NeRF | 0.56 | 0.74 | 0.66 | 0.63 | 0.57 | 0.72 | 0.53 | 0.93 | 0.54 | 0.80 | 0.72 | 0.85 | 0.69 |
SSIM ↑ | Ours neighbor | 0.56 | 0.82 | 0.74 | 0.63 | 0.58 | 0.78 | 0.52 | 0.91 | 0.58 | 0.83 | 0.71 | 0.89 | 0.71 |
SSIM ↑ | Ours sim(3) | 0.61 | 0.83 | 0.79 | 0.65 | 0.63 | 0.81 | 0.52 | 0.76 | 0.59 | 0.71 | 0.88 | 0.91 | 0.72 |
LPIPS ↓ | BARF | 0.36 | 0.20 | 0.29 | 0.40 | 0.38 | 0.27 | 0.44 | 0.09 | 0.33 | 0.14 | 0.24 | 0.23 | 0.28 |
LPIPS ↓ | L2G-NeRF | 0.37 | 0.18 | 0.28 | 0.35 | 0.42 | 0.14 | 0.49 | 0.10 | 0.39 | 0.14 | 0.24 | 0.24 | 0.29 |
LPIPS ↓ | Nope-NeRF | 0.51 | 0.33 | 0.43 | 0.48 | 0.50 | 0.47 | 0.54 | 0.30 | 0.52 | 0.34 | 0.48 | 0.35 | 0.44 |
LPIPS ↓ | CF-NeRF | 0.43 | 0.24 | 0.41 | 0.37 | 0.46 | 0.33 | 0.55 | 0.11 | 0.50 | 0.20 | 0.21 | 0.23 | 0.34 |
LPIPS ↓ | Ours neighbor | 0.36 | 0.14 | 0.26 | 0.50 | 0.44 | 0.22 | 0.49 | 0.11 | 0.41 | 0.22 | 0.14 | 0.17 | 0.29 |
LPIPS ↓ | Ours sim(3) | 0.35 | 0.14 | 0.25 | 0.50 | 0.43 | 0.22 | 0.49 | 0.29 | 0.40 | 0.30 | 0.09 | 0.17 | 0.30 |
Metrics We primarily evaluate our method by assessing the quality of novel view synthesis and the accuracy of pose estimation. As the variables for the scene and the cameras are determined only up to a 3D similarity transformation, existing work (Lin et al., 2021; Chen et al., 2023; Bian et al., 2023) aligns the optimized poses to the ground truth using Sim(3) with Procrustes analysis on the camera locations for pose error computation and test pose initialization (termed Sim(3) below) and then runs an additional test-time optimization on the trained model to reduce the pose error that may influence the view synthesis quality. Since all existing methods compared below except ours struggle to obtain reasonable initial test poses through Sim(3) and fail to perform the test-time optimization under complex trajectories, we adopt the approach of Nope-NeRF (Bian et al., 2023) for these methods, i.e. initializing a test image pose with the estimated pose of the training frame that is closest to it (termed neighbor below). For our method, we provide results using both initialization methods. We report PSNR, SSIM, and LPIPS for view synthesis, and rotation and translation errors for pose estimation.
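One common way to carry out the Sim(3) alignment on camera locations is the closed-form least-squares solution of Umeyama. The sketch below assumes that choice, since the text only says Procrustes analysis, and adds a rotation error measured as the angle of the relative rotation. Function names and the synthetic trajectory are illustrative.

```python
import numpy as np

def align_sim3(src, dst):
    """Find scale s, rotation R, translation t minimizing ||dst - (s R src + t)||^2."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                      # avoid returning a reflection
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t

def rotation_error_deg(R_est, R_gt):
    """Angle in degrees of the relative rotation between two rotation matrices."""
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Synthetic check: estimated camera centers differ from ground truth by a known Sim(3).
rng = np.random.default_rng(0)
centers_est = rng.normal(size=(10, 3))
theta = np.radians(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta), np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, 2.0, 3.0])
centers_gt = 2.0 * centers_est @ R_true.T + t_true

s, R, t = align_sim3(centers_est, centers_gt)
centers_aligned = s * centers_est @ R.T + t
```

After alignment, the translation error can be taken as the distance between `centers_aligned` and the ground truth centers, and the rotation error from `rotation_error_deg` applied to the aligned and ground truth rotations.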
4.2. Comparison with Pose-Unknown Methods
We compare with the state-of-the-art methods for joint optimization of scenes and poses from RGB images, i.e. BARF (Lin et al., 2021), L2G-NeRF (Chen et al., 2023), Nope-NeRF (Bian et al., 2023), and CF-NeRF (Yan et al., 2023).
Results on the Object Centered Dataset Table 1 presents the pose evaluation results on the NeRFBuster dataset. BARF, L2G-NeRF, and Nope-NeRF initialize all poses as identity matrices and then perform bundle adjustment to jointly optimize the poses and the scene. With the pose initialization far from the actual poses and sparse views unable to effectively constrain the scenes, these methods frequently fail to recover the camera poses and geometry. CF-NeRF manages to estimate poses but still suffers from larger errors, as it constrains the poses only through a photometric loss.
Our method achieves significantly smaller errors compared to these methods.
Despite the significant pose errors and poor structural quality of methods such as BARF, they still achieve surprisingly "good" PSNR, SSIM, and LPIPS results in Table 1, almost on par with our view synthesis results. We attribute this to overfitting in both the training stage and the test-time optimization. In Fig. 5, we visualize the trajectory of the NeRFBuster garbage scene and select two frames with an abrupt change in the estimated camera trajectory. We then render novel views by interpolating between these poses. As shown in Fig. 5 (b), the rendering results are unreasonable, while CF-NeRF and our method can render smooth view transitions (in the supplementary video, we further show that BARF, L2G-NeRF, Nope-NeRF, and CF-NeRF render view-inconsistent effects). Further evidence is that these methods struggle to reconstruct the geometry, as shown in Fig. 6 and Fig. 4. During test-time optimization, because the estimated trajectories of these methods diverge far from the ground truth and the methods fail to perform test-time optimization using Sim(3), the poses of the neighboring frames are used to initialize the test frames. This initialization causes the pose of a test frame to converge to a "pseudo" pose close to the estimated pose of the closest neighboring training frame. The scene network then further overfits to the pair formed by this "pseudo" ground-truth test pose and the test image, and the image rendered from the pseudo pose is compared against the test image to calculate the visual metrics.
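The interpolation probe described above can be sketched as follows. This is one simple way to interpolate between two estimated poses, assuming camera-to-world rotation matrices, a relative rotation below 180 degrees, geodesic interpolation of the rotation, and linear interpolation of the camera location. The helper names are hypothetical, and the paper does not state which interpolation scheme was used for Fig. 5.

```python
import numpy as np

def rodrigues(axis, angle):
    """Rotation matrix for a rotation of `angle` radians about the unit vector `axis`."""
    x, y, z = axis
    K = np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def interpolate_pose(R0, t0, R1, t1, alpha):
    """Pose at fraction alpha in [0, 1] on the path from (R0, t0) to (R1, t1)."""
    R_rel = R0.T @ R1
    cos = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    angle = np.arccos(cos)
    if angle < 1e-8:
        R = R0.copy()
    else:
        # Rotation axis from the skew-symmetric part of the relative rotation.
        # Assumption: angle is well below pi, otherwise sin(angle) is close to zero.
        axis = np.array([R_rel[2, 1] - R_rel[1, 2],
                         R_rel[0, 2] - R_rel[2, 0],
                         R_rel[1, 0] - R_rel[0, 1]]) / (2.0 * np.sin(angle))
        R = R0 @ rodrigues(axis, alpha * angle)
    t = (1.0 - alpha) * t0 + alpha * t1
    return R, t

# Five poses between two frames that differ by a 90 degree turn and a 2 unit shift.
R_a, t_a = np.eye(3), np.zeros(3)
R_b, t_b = rodrigues(np.array([0.0, 0.0, 1.0]), np.pi / 2), np.array([2.0, 0.0, 0.0])
path = [interpolate_pose(R_a, t_a, R_b, t_b, a) for a in np.linspace(0.0, 1.0, 5)]
```

Rendering the scene from each pose in `path` is the kind of check that exposes fragmented reconstructions: a consistent scene yields smooth transitions, whereas a scene fitted as disjoint sub-scenes does not.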
Results on the Free Trajectory Dataset We also conduct experiments on the Free-Dataset, which consists of more challenging scenarios with arbitrary trajectory variations and reduced frame overlap (please refer to the supplementary material for a visualization of the sequences). Table 2 reports the results on the Free-Dataset, which demonstrate that the advantages of our method become more significant in more challenging scenes. Fig. 4 shows that, in addition to the superior quality of the novel view synthesis, our method produces depths of good quality, while most existing methods fail to do so.
Method | Rotation error ↓ | Translation error ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
---|---|---|---|---|---|
BARF | 61.098 | 3.498 | 19.56 | 0.52 | 0.45 |
L2G-NeRF | 110.303 | 6.587 | 19.95 | 0.54 | 0.45 |
Nope-NeRF | 144.202 | 4.693 | 18.67 | 0.51 | 0.66 |
CF-NeRF | 55.329 | 2.385 | 18.30 | 0.42 | 0.72 |
Ours neighbor | 2.805 | 0.161 | 18.69 | 0.49 | 0.49 |
Ours Sim(3) | 2.805 | 0.161 | 22.46 | 0.59 | 0.43 |
4.3. Ablation Study
In this subsection, we conduct ablation studies to investigate the impact of the individual components of our method. We ablate the reprojection loss, tracking, window optimization, and global optimization components one at a time. Table 3 shows that removing tracking, window optimization, or global optimization degrades performance, but the method remains functional. In contrast, removing the reprojection loss leads to dramatic pose errors. We refer readers to the supplementary material for more ablation results.
Method | Rotation error ↓ | Translation error ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
---|---|---|---|---|---|
Ours w/o reproj. loss | 56.040 | 1.904 | 16.18 | 0.44 | 0.63 |
Ours w/o tracking | 5.302 | 0.280 | 23.66 | 0.67 | 0.35 |
Ours w/o window opt. | 3.562 | 0.182 | 24.57 | 0.70 | 0.33 |
Ours w/o global opt. | 6.189 | 0.234 | 22.66 | 0.64 | 0.40 |
Ours | 2.560 | 0.131 | 25.43 | 0.72 | 0.30 |
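To make the role of the ablated reprojection loss concrete, the sketch below computes a reprojected geometric image distance for pixel correspondences between two frames. It is only an illustration under stated assumptions: a pinhole camera with intrinsics `K`, camera-to-world poses, z-depth for each pixel of frame i (for instance rendered from the radiance field), and a plain Euclidean pixel distance. The function name and all parameters are hypothetical, and the exact loss used in CT-NeRF may differ, for example in robust weighting.

```python
import numpy as np

def reprojection_distance(pix_i, depth_i, pix_j, K, R_i, t_i, R_j, t_j):
    """Pixel distance between matches in frame j and reprojections from frame i.

    pix_i, pix_j: (N, 2) matched pixel coordinates; depth_i: (N,) z-depths in frame i.
    (R_i, t_i), (R_j, t_j): camera-to-world rotation and camera centre (assumption).
    """
    ones = np.ones((len(pix_i), 1))
    # Back-project pixels of frame i to rays with z = 1, then scale by depth.
    rays = (np.linalg.inv(K) @ np.hstack([pix_i, ones]).T).T
    X_cam_i = rays * depth_i[:, None]
    # Frame i camera coordinates -> world -> frame j camera coordinates.
    X_world = X_cam_i @ R_i.T + t_i
    X_cam_j = (X_world - t_j) @ R_j
    # Perspective projection into frame j.
    proj = X_cam_j @ K.T
    pix_proj = proj[:, :2] / proj[:, 2:3]
    return np.linalg.norm(pix_proj - pix_j, axis=1)

# Toy example: frame j is shifted by 0.1 along x, and the point lies at depth 2.
K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
pix_i = np.array([[50.0, 40.0]])
depth_i = np.array([2.0])
pix_j = np.array([[45.0, 40.0]])  # consistent match: shift of -f * b / d = -5 pixels
dist = reprojection_distance(pix_i, depth_i, pix_j, K,
                             np.eye(3), np.zeros(3),
                             np.eye(3), np.array([0.1, 0.0, 0.0]))
```

Because the distance depends on both poses and on the depth, minimizing it over correspondences constrains pairs of poses directly, which is consistent with the large pose errors observed when this term is removed.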
5. Conclusion
We present CT-NeRF, a method capable of recovering poses and reconstructing scenes from image sequences captured along complex trajectories. We first introduce correspondences and a reprojected geometric image distance to impose extra constraints on the optimization graph, enabling robust and accurate pose estimation and scene structure reconstruction. Subsequently, we detail our incremental learning process for pose recovery, including initialization, tracking, window optimization, and global optimization. Through comparative and ablation experiments, we demonstrate the superiority of our method and the necessity of its individual components. Although our method enables joint pose estimation and reconstruction under complex camera trajectories, we only explore simple pose graphs. More sophisticated graph optimization is required for very long trajectories. In addition, new evaluation datasets, protocols, and metrics are required for complex camera trajectories since, as discussed in the paper, the current visual metrics cannot fully reflect the reconstruction quality.
References
- Abdelrasoul et al. (2016) Yassin Abdelrasoul, Abu Bakar Sayuti HM Saman, and Patrick Sebastian. 2016. A quantitative study of tuning ROS gmapping parameters and their effect on performing indoor 2D SLAM. In 2016 2nd IEEE international symposium on robotics and manufacturing automation (ROMA). IEEE, 1–6.
- Bailey et al. (2006) Tim Bailey, Juan Nieto, Jose Guivant, Michael Stevens, and Eduardo Nebot. 2006. Consistency of the EKF-SLAM algorithm. In 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 3562–3568.
- Bian et al. (2023) Wenjing Bian, Zirui Wang, Kejie Li, Jia-Wang Bian, and Victor Adrian Prisacariu. 2023. Nope-nerf: Optimising neural radiance field with no pose prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4160–4169.
- Campos et al. (2021) Carlos Campos, Richard Elvira, Juan J Gómez Rodríguez, José MM Montiel, and Juan D Tardós. 2021. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Transactions on Robotics 37, 6 (2021), 1874–1890.
- Castellanos et al. (2007) José A Castellanos, Ruben Martinez-Cantin, Juan D Tardós, and José Neira. 2007. Robocentric map joining: Improving the consistency of EKF-SLAM. Robotics and autonomous systems 55, 1 (2007), 21–29.
- Chen et al. (2023) Yue Chen, Xingyu Chen, Xuan Wang, Qi Zhang, Yu Guo, Ying Shan, and Fei Wang. 2023. Local-to-global registration for bundle-adjusting neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8264–8273.
- Cheng et al. (2023) Zezhou Cheng, Carlos Esteves, Varun Jampani, Abhishek Kar, Subhransu Maji, and Ameesh Makadia. 2023. LU-NeRF: Scene and pose estimation by synchronizing local unposed nerfs. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 18312–18321.
- Chng et al. (2022) Shin-Fang Chng, Sameera Ramasinghe, Jamie Sherrah, and Simon Lucey. 2022. Gaussian activated neural radiance fields for high fidelity reconstruction and pose estimation. In European Conference on Computer Vision. Springer, 264–280.
- Cui and Tan (2015) Zhaopeng Cui and Ping Tan. 2015. Global structure-from-motion by similarity averaging. In Proceedings of the IEEE International Conference on Computer Vision. 864–872.
- Deng et al. (2022) Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. 2022. Depth-supervised nerf: Fewer views and faster training for free. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12882–12891.
- Edstedt et al. (2023) Johan Edstedt, Ioannis Athanasiadis, Mårten Wadenbäck, and Michael Felsberg. 2023. DKM: Dense kernelized feature matching for geometry estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 17765–17775.
- Engel et al. (2014) Jakob Engel, Thomas Schöps, and Daniel Cremers. 2014. LSD-SLAM: Large-scale direct monocular SLAM. In European conference on computer vision. Springer, 834–849.
- Fischler and Bolles (1981) Martin A Fischler and Robert C Bolles. 1981. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24, 6 (1981), 381–395.
- Gherardi et al. (2010) Riccardo Gherardi, Michela Farenzena, and Andrea Fusiello. 2010. Improving the efficiency of hierarchical structure-and-motion. In 2010 IEEE computer society conference on computer vision and pattern recognition. IEEE, 1594–1600.
- Hartley and Zisserman (2003) Richard Hartley and Andrew Zisserman. 2003. Multiple view geometry in computer vision. Cambridge university press.
- He et al. (2023) Xingyi He, Jiaming Sun, Yifan Wang, Sida Peng, Qixing Huang, Hujun Bao, and Xiaowei Zhou. 2023. Detector-Free Structure from Motion. In arxiv.
- Jeong et al. (2021) Yoonwoo Jeong, Seokjun Ahn, Christopher Choy, Anima Anandkumar, Minsu Cho, and Jaesik Park. 2021. Self-calibrating neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5846–5854.
- Jiang et al. (2013) Nianjuan Jiang, Zhaopeng Cui, and Ping Tan. 2013. A global linear method for camera pose registration. In Proceedings of the IEEE international conference on computer vision. 481–488.
- Johari et al. (2023) Mohammad Mahdi Johari, Camilla Carta, and François Fleuret. 2023. Eslam: Efficient dense slam system based on hybrid representation of signed distance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 17408–17419.
- Lin et al. (2021) Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, and Simon Lucey. 2021. Barf: Bundle-adjusting neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5741–5751.
- Meuleman et al. (2023) Andreas Meuleman, Yu-Lun Liu, Chen Gao, Jia-Bin Huang, Changil Kim, Min H Kim, and Johannes Kopf. 2023. Progressively optimized local radiance fields for robust view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16539–16548.
- Mildenhall et al. (2021) Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. 2021. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 65, 1 (2021), 99–106.
- Mur-Artal et al. (2015) Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. 2015. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE transactions on robotics 31, 5 (2015), 1147–1163.
- Niemeyer et al. (2022) Michael Niemeyer, Jonathan T Barron, Ben Mildenhall, Mehdi SM Sajjadi, Andreas Geiger, and Noha Radwan. 2022. Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5480–5490.
- Park et al. (2019) Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. 2019. Deepsdf: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 165–174.
- Schonberger and Frahm (2016) Johannes L Schonberger and Jan-Michael Frahm. 2016. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4104–4113.
- Snavely et al. (2008) Noah Snavely, Steven M Seitz, and Richard Szeliski. 2008. Modeling the world from internet photo collections. International journal of computer vision 80 (2008), 189–210.
- Song et al. (2024) Jiuhn Song, Seonghoon Park, Honggyu An, Seokju Cho, Min-Seop Kwak, Sungjin Cho, and Seungryong Kim. 2024. DäRF: Boosting Radiance Fields from Sparse Input Views with Monocular Depth Adaptation. Advances in Neural Information Processing Systems 36 (2024).
- Sucar et al. (2021) Edgar Sucar, Shikun Liu, Joseph Ortiz, and Andrew J Davison. 2021. imap: Implicit mapping and positioning in real-time. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6229–6238.
- Sun et al. (2021) Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. 2021. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8922–8931.
- Teed and Deng (2020) Zachary Teed and Jia Deng. 2020. Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16. Springer, 402–419.
- Teed and Deng (2021) Zachary Teed and Jia Deng. 2021. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in neural information processing systems 34 (2021), 16558–16569.
- Truong et al. (2023) Prune Truong, Marie-Julie Rakotosaona, Fabian Manhardt, and Federico Tombari. 2023. Sparf: Neural radiance fields from sparse and noisy poses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4190–4200.
- Wang et al. (2023b) Chen Wang, Jiadai Sun, Lina Liu, Chenming Wu, Zhelun Shen, Dayan Wu, Yuchao Dai, and Liangjun Zhang. 2023b. Digging into depth priors for outdoor neural radiance fields. In Proceedings of the 31st ACM International Conference on Multimedia. 1221–1230.
- Wang et al. (2023c) Hengyi Wang, Jingwen Wang, and Lourdes Agapito. 2023c. Co-slam: Joint coordinate and sparse parametric encodings for neural real-time slam. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13293–13302.
- Wang et al. (2023a) Peng Wang, Yuan Liu, Zhaoxi Chen, Lingjie Liu, Ziwei Liu, Taku Komura, Christian Theobalt, and Wenping Wang. 2023a. F2-nerf: Fast neural radiance field training with free camera trajectories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4150–4159.
- Wang et al. (2021) Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. 2021. NeRF–: Neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064 (2021).
- Warburg et al. (2023) Frederik Warburg, Ethan Weber, Matthew Tancik, Aleksander Holynski, and Angjoo Kanazawa. 2023. Nerfbusters: Removing Ghostly Artifacts from Casually Captured NeRFs. arXiv:2304.10532 [cs.CV]
- Wei et al. (2023) Tong Wei, Yash Patel, Alexander Shekhovtsov, Jiří Matas, and Daniel Barath. 2023. Generalized Differentiable RANSAC. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV). 17603–17614. https://doi.org/10.1109/ICCV51070.2023.01618
- Wu (2013) Changchang Wu. 2013. Towards linear-time incremental structure from motion. In 2013 International Conference on 3D Vision-3DV 2013. IEEE, 127–134.
- Yan et al. (2023) Qingsong Yan, Qiang Wang, Kaiyong Zhao, Jie Chen, Bo Li, Xiaowen Chu, and Fei Deng. 2023. CF-NeRF: Camera Parameter Free Neural Radiance Fields with Incremental Learning. arXiv:2312.08760 [cs.CV]
- Zhu et al. (2022) Zihan Zhu, Songyou Peng, Viktor Larsson, Weiwei Xu, Hujun Bao, Zhaopeng Cui, Martin R Oswald, and Marc Pollefeys. 2022. Nice-slam: Neural implicit scalable encoding for slam. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12786–12796.
Appendix A More Implementation Details
Our network architecture follows BARF (Lin et al., 2021), utilizing a single 8-layer MLP with a width of 128. For all state-of-the-art methods, we use their official open-source implementations. For test-time optimization, Nope-NeRF uses its official implementation, while all other methods undergo 100 iterations of test-time optimization after per-image neighbor initialization before evaluation. The Sim(3) alignment is also taken from the official open-source implementation of BARF.
Dataset We choose two datasets: NeRFBuster (Warburg et al., 2023), which is used in CF-NeRF (Yan et al., 2023), and Free-Dataset (Wang et al., 2023a), which consists of more challenging scenarios with arbitrary trajectory variations and reduced frame overlap, as shown in Fig. 7. We utilize the NeRFBuster sequences processed by CF-NeRF. For each scene, CF-NeRF selects approximately 50 images based on their overlap, ordered sequentially. Regarding the Free-Dataset, the sky scene comprises images with indexes from 50 to 100, while all other scenes consist of images with indexes from 0 to 50. All selected sequences present considerable challenges.
Metrics | Method | aloe | art | car | century | garbage | flowers | picnic | pikachu | pipe | plant | roses | table | mean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Rotation error ↓ | CF-NeRF (Yan et al., 2023) Sim(3) with rotation | 21.918 | 25.702 | 22.653 | 11.245 | 9.061 | 9.915 | 13.489 | 12.046 | 173.343 | 11.056 | 7.002 | 3.837 | 26.772 |
Rotation error ↓ | CF-NeRF Sim(3) | 6.703 | 76.306 | 29.079 | 11.013 | 74.163 | 10.672 | 109.868 | 13.243 | 122.345 | 18.664 | 3.903 | 3.835 | 39.983 |
Rotation error ↓ | Ours Sim(3) with rotation | 3.618 | 0.469 | 0.545 | 2.237 | 0.921 | 0.596 | 2.118 | 7.698 | 2.320 | 5.212 | 1.919 | 1.223 | 2.406 |
Rotation error ↓ | Ours Sim(3) | 3.163 | 3.151 | 0.701 | 2.343 | 0.902 | 0.481 | 1.938 | 7.708 | 2.302 | 6.008 | 0.570 | 1.154 | 2.560 |
Translation error ↓ | CF-NeRF Sim(3) with rotation | 3.858 | 5.064 | 8.423 | 3.655 | 4.018 | 4.305 | 3.372 | 4.949 | 36.930 | 5.154 | 1.130 | 2.200 | 6.921 |
Translation error ↓ | CF-NeRF Sim(3) | 0.637 | 1.549 | 1.621 | 0.497 | 0.548 | 0.745 | 1.285 | 0.879 | 5.757 | 0.685 | 0.182 | 0.274 | 1.222 |
Translation error ↓ | Ours Sim(3) with rotation | 0.701 | 0.237 | 0.086 | 0.517 | 0.256 | 0.215 | 0.347 | 0.983 | 0.515 | 0.519 | 0.371 | 0.247 | 0.416 |
Translation error ↓ | Ours Sim(3) | 0.168 | 0.030 | 0.035 | 0.134 | 0.039 | 0.039 | 0.106 | 0.548 | 0.164 | 0.225 | 0.038 | 0.045 | 0.131 |
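scenes | Ours: rotation error ↓ | Ours: translation error ↓ | Ours: PSNR ↑ | Ours: SSIM ↑ | Ours: LPIPS ↓ | NeRF + Our pose: PSNR ↑ | NeRF + Our pose: SSIM ↑ | NeRF + Our pose: LPIPS ↓ | NeRF + COLMAP pose: PSNR ↑ | NeRF + COLMAP pose: SSIM ↑ | NeRF + COLMAP pose: LPIPS ↓ |
---|---|---|---|---|---|---|---|---|---|---|---|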
pikachu | 7.708 | 0.548 | 22.91 | 0.76 | 0.29 | 23.62 | 0.79 | 0.28 | 37.06 | 0.97 | 0.05 |
plant | 6.302 | 0.225 | 22.64 | 0.71 | 0.30 | 20.51 | 0.63 | 0.39 | 28.27 | 0.85 | 0.24 |
aloe | 3.163 | 0.168 | 24.36 | 0.61 | 0.35 | 25.51 | 0.68 | 0.26 | 24.04 | 0.58 | 0.40 |
art | 3.151 | 0.030 | 26.73 | 0.83 | 0.14 | 13.77 | 0.34 | 0.56 | 12.90 | 0.30 | 0.60 |
century | 2.343 | 0.134 | 22.56 | 0.65 | 0.50 | 14.08 | 0.25 | 0.69 | 14.28 | 0.28 | 0.68 |
pipe | 2.302 | 0.164 | 23.13 | 0.59 | 0.40 | 21.90 | 0.55 | 0.43 | 23.05 | 0.63 | 0.37 |
picnic | 1.938 | 0.106 | 23.04 | 0.52 | 0.49 | 22.52 | 0.51 | 0.45 | 25.25 | 0.67 | 0.31 |
table | 1.154 | 0.045 | 32.73 | 0.91 | 0.17 | 25.64 | 0.84 | 0.28 | 23.82 | 0.82 | 0.27 |
flowers | 0.902 | 0.039 | 22.69 | 0.63 | 0.43 | 16.29 | 0.30 | 0.66 | 15.38 | 0.27 | 0.67 |
car | 0.701 | 0.035 | 27.41 | 0.79 | 0.25 | 21.52 | 0.67 | 0.33 | 18.93 | 0.60 | 0.40 |
roses | 0.570 | 0.038 | 29.63 | 0.88 | 0.09 | 30.09 | 0.90 | 0.09 | 27.77 | 0.84 | 0.16 |
garbage | 0.481 | 0.039 | 27.37 | 0.81 | 0.22 | 18.34 | 0.59 | 0.41 | 13.53 | 0.36 | 0.67 |
mean | 2.560 | 0.131 | 25.43 | 0.72 | 0.30 | 21.14 | 0.59 | 0.40 | 22.02 | 0.60 | 0.40 |
Metrics | Method | grass | hydrant | lab | pillar | road | sky | stair | mean |
---|---|---|---|---|---|---|---|---|---|
Rotation error ↓ | BARF (Lin et al., 2021) | 124.875 | 74.091 | 124.754 | 16.908 | 64.433 | 22.197 | 0.425 | 61.098 |
Rotation error ↓ | L2G-NeRF (Chen et al., 2023) | 114.356 | 170.250 | 56.227 | 131.588 | 109.558 | 27.245 | 162.898 | 110.303 |
Rotation error ↓ | Nope-NeRF (Bian et al., 2023) | 158.408 | 140.245 | 165.086 | 153.613 | 144.478 | 67.749 | 179.836 | 144.202 |
Rotation error ↓ | CF-NeRF (Yan et al., 2023) | 36.875 | 36.129 | 150.882 | 18.282 | 49.790 | 94.082 | 1.260 | 55.329 |
Rotation error ↓ | Ours | 7.785 | 0.454 | 6.126 | 0.124 | 3.001 | 2.054 | 0.089 | 2.805 |
Translation error ↓ | BARF | 7.273 | 3.890 | 6.675 | 1.475 | 4.587 | 0.583 | 0.004 | 3.498 |
Translation error ↓ | L2G-NeRF | 7.962 | 7.203 | 4.786 | 7.707 | 7.107 | 0.849 | 10.498 | 6.587 |
Translation error ↓ | Nope-NeRF | 2.920 | 4.904 | 2.062 | 4.044 | 7.201 | 1.156 | 10.564 | 4.693 |
Translation error ↓ | CF-NeRF | 2.850 | 2.018 | 5.998 | 1.382 | 1.808 | 2.518 | 0.121 | 2.385 |
Translation error ↓ | Ours | 0.304 | 0.032 | 0.526 | 0.008 | 0.201 | 0.046 | 0.007 | 0.161 |
PSNR ↑ | BARF | 18.00 | 17.33 | 18.73 | 20.17 | 19.01 | 15.66 | 28.00 | 19.56 |
PSNR ↑ | L2G-NeRF | 18.29 | 17.46 | 21.18 | 19.93 | 20.49 | 17.90 | 24.41 | 19.95 |
PSNR ↑ | Nope-NeRF | 17.02 | 18.33 | 17.55 | 18.99 | 19.08 | 15.39 | 24.35 | 18.67 |
PSNR ↑ | CF-NeRF | 18.15 | 17.85 | 16.25 | 20.25 | 18.85 | 15.23 | 21.51 | 18.30 |
PSNR ↑ | Ours neighbor | 17.57 | 17.95 | 17.94 | 21.91 | 19.30 | 15.12 | 21.07 | 18.69 |
PSNR ↑ | Ours Sim(3) | 16.96 | 22.54 | 14.88 | 26.23 | 24.06 | 24.37 | 28.16 | 22.46 |
SSIM ↑ | BARF | 0.40 | 0.33 | 0.58 | 0.52 | 0.47 | 0.49 | 0.83 | 0.52 |
SSIM ↑ | L2G-NeRF | 0.42 | 0.32 | 0.67 | 0.53 | 0.52 | 0.55 | 0.74 | 0.54 |
SSIM ↑ | Nope-NeRF | 0.40 | 0.37 | 0.63 | 0.47 | 0.44 | 0.60 | 0.66 | 0.51 |
SSIM ↑ | CF-NeRF | 0.34 | 0.30 | 0.44 | 0.44 | 0.40 | 0.43 | 0.56 | 0.42 |
SSIM ↑ | Ours neighbor | 0.40 | 0.35 | 0.53 | 0.56 | 0.49 | 0.45 | 0.64 | 0.49 |
SSIM ↑ | Ours Sim(3) | 0.36 | 0.50 | 0.41 | 0.67 | 0.61 | 0.77 | 0.83 | 0.59 |
LPIPS ↓ | BARF | 0.51 | 0.60 | 0.38 | 0.50 | 0.51 | 0.48 | 0.18 | 0.45 |
LPIPS ↓ | L2G-NeRF | 0.51 | 0.61 | 0.26 | 0.51 | 0.57 | 0.41 | 0.25 | 0.45 |
LPIPS ↓ | Nope-NeRF | 0.75 | 0.69 | 0.56 | 0.68 | 0.79 | 0.63 | 0.52 | 0.66 |
LPIPS ↓ | CF-NeRF | 0.77 | 0.82 | 0.57 | 0.73 | 0.83 | 0.70 | 0.59 | 0.72 |
LPIPS ↓ | Ours neighbor | 0.59 | 0.57 | 0.39 | 0.42 | 0.65 | 0.50 | 0.30 | 0.49 |
LPIPS ↓ | Ours Sim(3) | 0.61 | 0.50 | 0.55 | 0.37 | 0.48 | 0.31 | 0.21 | 0.43 |
Appendix B Testing methods
As mentioned in the main paper, to calculate the metrics for test images, two sequential steps during testing are required: alignment of trajectories for pose quality assessment and test-time optimization for view synthesis quality assessment.
Alignment A 3D similarity transformation Sim(3) for the scene and the cameras can be obtained through different methods.
- Sim(3) with rotation CF-NeRF (Yan et al., 2023) finds that the Procrustes analysis used for Sim(3) is unreliable when all cameras lie on a line or the camera translations contain noise. To overcome this problem, CF-NeRF adds a virtual point in the camera coordinate frame of each image, uses the camera parameters to transform it to the world coordinate frame, and thereby uses the camera rotation during the alignment process (termed rotation). However, we find that the approach of CF-NeRF causes larger translation errors.
We list the pose errors on NeRFBuster under both the Sim(3) and the Sim(3) with rotation alignment in Table 4. The results show that Sim(3) with rotation can reduce rotation errors while causing larger translation errors. When the poses are accurate, Sim(3) with rotation brings little benefit to the rotation error but harms the translation error. As a result, we employ the Sim(3) approach in the main paper for all methods to align two trajectories and then calculate pose errors.
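The degeneracy that motivates the virtual point can be illustrated with a short NumPy sketch. It assumes camera-to-world rotations and a hypothetical virtual point placed at unit distance along each camera's z-axis; the actual choice made by CF-NeRF may differ. When all camera locations lie on a line, the centred point set has rank 1, so the rotation about that line is not determined by Procrustes analysis. Adding the virtual points, whose world positions depend on the camera rotations, raises the rank.

```python
import numpy as np

def with_virtual_points(R_c2w, centers, offset=1.0):
    """Append one virtual point per camera, fixed in the camera frame, in world coordinates."""
    p_cam = np.array([0.0, 0.0, offset])  # hypothetical choice of virtual point
    virtual = np.einsum('nij,j->ni', R_c2w, p_cam) + centers
    return np.vstack([centers, virtual])

def spread_rank(points, tol=1e-9):
    """Rank of the centred point set: 1 for a line, 2 for a plane, 3 otherwise."""
    centred = points - points.mean(axis=0)
    return np.linalg.matrix_rank(centred, tol=tol)

def rot_x(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

# Five cameras moving along the x-axis while tilting about it.
centers = np.array([[float(i), 0.0, 0.0] for i in range(5)])
R_c2w = np.stack([rot_x(a) for a in (0.0, 0.2, 0.4, 0.6, 0.8)])
rank_locations_only = spread_rank(centers)
rank_with_virtual = spread_rank(with_virtual_points(R_c2w, centers))
print(rank_locations_only, rank_with_virtual)
```

The trade-off reported in Table 4 is consistent with this construction: the virtual points inject rotation information into the alignment, so rotation errors also pull on the fitted similarity and can worsen the fit to the camera locations.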
Test-time optimization Here we outline previous testing methods with different combinations of initialization and test-time optimization.
- Sim(3) + opt. In BARF (Lin et al., 2021), the test poses are first initialized using Sim(3) alignment with Procrustes analysis on the camera locations. Then, an additional test-time optimization further adjusts the test poses. This initialization works well when the estimated poses can be aligned precisely to the COLMAP poses. However, incorrect pose estimates degrade the Sim(3) alignment.
- Estimated + no opt. CF-NeRF (Yan et al., 2023) recovers all poses without a train/test split and then tests on every 8th image. However, with this approach it is impossible to tell whether the rendered results stem from overfitting or from successful reconstruction.
- Neighbor + opt. Nope-NeRF (Bian et al., 2023) initializes the test image pose with the estimated pose of the training frame that is closest to it. Neighbor initialization works well when the frame rate is high and the test pose is near the neighbor pose. When facing complex trajectories and reduced overlap, it struggles to supply a good initialization, as shown in Table 1 in the main paper and Table 6.
Due to the substantial alignment errors, all methods except ours struggle to obtain reasonable initial test poses through Sim(3). For the testing results in our main paper, we therefore adopt Neighbor + opt. for all methods and additionally provide results using Sim(3) + opt. for our method.
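The neighbor initialization can be sketched in a few lines of Python. The sketch assumes that "closest" means nearest in frame index, with ties resolved toward the earlier frame; both are assumptions made for illustration, as are the function name and the toy split.

```python
def neighbor_init(test_ids, train_ids, train_poses):
    """Initial pose per test frame: the estimated pose of the nearest training frame."""
    init = {}
    for tid in test_ids:
        # Nearest training index; ties go to the smaller index (assumption).
        nearest = min(train_ids, key=lambda k: (abs(k - tid), k))
        init[tid] = train_poses[nearest]
    return init

# Hypothetical split of a 17-frame sequence with every 8th frame held out.
all_ids = list(range(17))
test_ids = [i for i in all_ids if i % 8 == 0]
train_ids = [i for i in all_ids if i % 8 != 0]
train_poses = {i: f"pose_{i}" for i in train_ids}  # placeholders for estimated poses
init_poses = neighbor_init(test_ids, train_ids, train_poses)
```

The initial pose is then refined by test-time optimization against the test image. If the estimated trajectory is fragmented, the refined pose tends to stay near this neighbor, which is the behavior examined in the overfitting discussion.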
Overfitting As described in Table 1 and Section 4.2 of the main text and in Table 6, although methods like BARF converge to significant pose errors and poor structural quality, they still achieve novel view synthesis metrics comparable to those of our method. We attribute this to the network converging to local optima. The left part of Fig. 8 illustrates the poses estimated by BARF, where the three green boxes indicate three pose segments after fitting. To render a video, we fit B-spline functions to the estimated poses to obtain a smooth camera trajectory. The right part visualizes novel views synthesized for poses within the segments (a, c, e) and between segments (b, d) on the B-spline trajectory. The novel views synthesized for poses in the three segments (a, c, e) appear normal. However, the visualization results for the interpolated poses (b, d) between segments are unreasonable. It seems that each segment fits a sub-scene, and the images for the interpolated poses between segments stitch different scenes together. Fig. 9 shows more results of this issue. These methods do not recover correct poses and scene geometry. During test-time optimization, poses for testing images are initialized with the estimated poses of neighboring training images. With the twisted scene and fragmented pose trajectories after training, the test-time optimization causes the pose of a test frame to converge to a "pseudo" pose close to the estimated pose of the closest neighboring training frame. The scene network then further overfits to the pair formed by this "pseudo" ground-truth test pose and the test image, and the image rendered from the pseudo pose is compared against the test image to calculate the visual metrics, leading to high view synthesis metrics. Notice that during this process, the test pose diverges far from its true pose, moving in the direction that minimizes the discrepancy between the rendered image and the test image, when the scene geometry and camera poses exhibit large errors.
Appendix C Comparison to NeRF with COLMAP pose
We additionally compare the novel view synthesis quality of NeRF models trained with our estimated poses and with COLMAP poses (which we use as ground truth) to demonstrate the accuracy of the estimated poses. On average, NeRF + Our pose achieves novel view quality close to that of NeRF + COLMAP pose. In some scenes, such as pikachu and plant, our poses have large estimation errors. Both the pose error during training and the test pose misalignment lead to worse view quality for NeRF + Our pose in these scenes. In scenes with small pose errors, NeRF + Our pose attains view quality similar to that of NeRF + COLMAP pose, and even better quality in many scenes. In many scenes, our method achieves better view quality than both NeRF + Our pose and NeRF + COLMAP pose. We attribute this to the coarse-to-fine positional encoding, the reprojection loss, and the joint optimization of poses and scenes.
Appendix D More Results
Detailed results on the Free-Dataset are shown in Table 6. We also provide more qualitative comparisons with state-of-the-art works. Fig. 13, Fig. 14, Fig. 15, and Fig. 10 illustrate a comparison of novel view synthesis quality. Fig. 16, Fig. 17, Fig. 18, and Fig. 11 demonstrate a comparison of depth map rendering quality. Fig. 19, Fig. 20, and Fig. 12 present a comparison of the reconstructed trajectories obtained by the various methods, where the red boxes represent the COLMAP poses and the blue boxes depict the estimated poses. The point cloud, processed from the COLMAP output, serves as a reference for relative positioning.