
1 Introduction

In this paper, we are interested in relative camera pose estimation, the task of accurately estimating the location and orientation of one camera with respect to another camera's reference system. Relative pose estimation is an essential task for many computer vision problems, such as Structure from Motion (SfM), Simultaneous Localisation And Mapping (SLAM), etc. Traditionally, this task is accomplished by (i) extracting sparse keypoints (e.g. SIFT, SURF), (ii) establishing 2D correspondences between keypoints and (iii) estimating the essential matrix using the 5-point or 8-point algorithm [13]. RANSAC is very often used to reject outliers in a robust manner.

Although this technique has been considered the de facto standard for many years, it presents two main drawbacks. First, the quality of the estimation depends heavily on the correspondence assignment: too few correspondences (textureless objects) or too many noisy correspondences (repetitive texture or large viewpoint changes) can lead to surprisingly bad results. Second, the traditional method can estimate the translation vector only up to scale (a directional vector).

In this paper, our objective is threefold: (i) we propose a system producing more stable results, (ii) we recover the full translation vector, and (iii) we provide insights regarding relative pose inference (i.e. from absolute poses, from a regressor, etc.).

As pointed out in [20], CNN-based methods are able to produce good results in some cases where SIFT-based methods fail (e.g. textureless images). This is the reason why we opted for a global method based on CNNs. Inspired by the success of PoseNet [9], we propose a modified Siamese PoseNet for relative camera pose estimation, dubbed RPNet, with different ways to infer the relative pose. To the best of our knowledge, [12] is the only end-to-end system aiming at solving relative camera pose estimation with a deep learning approach. However, their system estimates the translation vector only up to scale, while ours produces full translation vectors.

The rest of the paper is organized as follows: Sect. 2 presents the related work. Section 3 introduces the network architecture and the training methodology. Section 4 discusses the datasets and presents the experimental validation of the approach. Finally, Sect. 5 concludes the paper.

2 State of the Art

Local Keypoint-Based Approaches. These methods address relative camera pose estimation using the epipolar geometry between 2D-2D correspondences of keypoints. Early attempts aimed at better engineering interest point detectors to focus on particular image properties such as corners [6], blobs in scale-space [10], regions [11], or speed [2, 16, 18]. More recently, there has been growing interest in training interest point detectors together with the matching function [4, 5, 17, 19, 23]. LIFT [21] adopted the traditional pipeline combining a detector, an orientation estimator, and a descriptor, tied together with differentiable operations and learned end-to-end. [1] proposed a multitask network with different sub-branches to operate on varying input sizes. [4] proposed a bootstrapping strategy, first learning on simple synthetic data and then enlarging the training set with real images in a second stage.

End-to-End Pose Estimation. The first end-to-end neural network for camera pose estimation from single RGB images is PoseNet [9]. It is based on GoogLeNet with two output branches to regress translations and rotations. PoseNet follow-ups include Bayesian PoseNet [7], PoseNet-LSTM [20], where an LSTM is used to model the context of the images, and Geometric-PoseNet, where the loss is computed from the re-projection error of scene coordinates under the predicted and ground-truth poses [8]. Since all the 3D models used for comparison are created with SIFT-based techniques, the traditional approaches seem more accurate. [20] showed that the classical approaches completely fail on less textured datasets such as the proposed TMU-LSI dataset. [14] is an end-to-end system for pose regression taking sparse keypoints as input. Regarding relative pose estimation, [12] is the only system we are aware of. Their network is based on ResNet35 with FC layers acting as a pose regressor. Similar to the previous networks, the authors formulate the loss function as minimising the L2-distances between the ground truth and the estimated pose. Unfortunately, several aspects of their results (including their label generation, experimental methodology and baseline system) make comparisons difficult. Alongside pose regression, another promising work [15] showed that an end-to-end neural network can effectively be trained to infer the homography between two images. Finally, two recent papers [3, 22] made useful contributions to the training of end-to-end systems for pose estimation. [22] proposed a regressor network producing an essential matrix, which can then be used to find the relative pose. However, their system recovers the translation only up to scale, which differs from our objective. In [3], a differentiable RANSAC is proposed for outlier rejection, which can be used as a plug-and-play component in an end-to-end system.

Fig. 1. Illustration of the proposed system

3 Relative Pose Inference with RPNet

Architecture. The architecture of the proposed RPNet, illustrated in Fig. 1, is made of two building blocks: (i) a Siamese network with two branches regressing one pose per image, and (ii) a pose inference module computing the relative pose between the cameras. We provide three variants of the pose inference module: (1) a parameter-free module, (2) a parameter-free module with additional losses (the same as the PoseNet loss [9]) aiming at regressing the two camera poses as well as the relative pose, and (3) a relative pose regressor based on FC layers. The whole network is trained end-to-end for relative pose estimation. Inspired by PoseNet [9], the feature extraction network is based on the GoogLeNet architecture with 22 CNN layers and 6 inception modules. It outputs one pose per image; we normalize the quaternion only at test time.
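For concreteness, the sketch below shows how such a weight-sharing Siamese pose branch can be assembled. It is a rough approximation, not the released code: it assumes a PyTorch/torchvision GoogLeNet backbone and a 2048-dimensional feature (the dimension mentioned below, which is reduced to 256 for RPNetFC), with a 3-d translation head and a 4-d quaternion head.

```python
import torch.nn as nn
import torchvision

class PoseBranch(nn.Module):
    """One branch of the Siamese network: backbone features -> (translation, quaternion)."""
    def __init__(self, feat_dim=2048):
        super().__init__()
        backbone = torchvision.models.googlenet(aux_logits=False)
        backbone.fc = nn.Linear(1024, feat_dim)  # replace the 1000-way classifier
        self.backbone = backbone
        self.fc_t = nn.Linear(feat_dim, 3)       # translation head
        self.fc_q = nn.Linear(feat_dim, 4)       # quaternion head (normalized at test time only)

    def forward(self, x):
        f = self.backbone(x)
        return self.fc_t(f), self.fc_q(f)

class SiamesePoseNet(nn.Module):
    """Both images go through the same weight-shared branch, yielding one pose per image."""
    def __init__(self):
        super().__init__()
        self.branch = PoseBranch()

    def forward(self, img1, img2):
        return self.branch(img1), self.branch(img2)
```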

For RPNet and RPNet\(^{+}\), the module computing the relative pose between the cameras is straightforward and relies on simple geometry. Following the OpenCV convention, the relative pose is expressed in the reference system of the second camera. Let \((R_1, t_1, R_2, t_2)\) be the rotation matrices and translation vectors projecting a point X from the world coordinate system into each camera's coordinate system (cameras 1 and 2). Let \((q_1, q_2)\) be the quaternions corresponding to \((R_1, R_2)\). The relative pose is calculated as follows:

$$\begin{aligned} q_{1,2} = q_2 \times q_1^* \quad \text {and}\quad T_{1,2} = R_2 (-R_1^T t_1) + t_2 \end{aligned}$$
(1)

where \(q_1^*\) is the conjugate of \(q_1\), and \(\times \) denotes multiplication in the quaternion domain. Both equations are differentiable. For RPNetFC, the pose inference module is a simple stack of fully connected layers with ReLU activations. To limit over-fitting, we modified the output of the Siamese network by reducing its dimension from 2048 to 256. This results in an almost 50% reduction in the number of parameters compared to the PoseNet, RPNet and RPNet\(^{+}\) networks. The pose regressor network contains two FC layers (both of dimension 128) (Table 1).
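As an illustration, the snippet below is a minimal numerical sketch of the parameter-free module of Eq. (1), written with SciPy's quaternion utilities (the (x, y, z, w) ordering is SciPy's convention, not necessarily the one used in our implementation); inside the network, the same operations are expressed with differentiable tensor ops.

```python
import numpy as np
from scipy.spatial.transform import Rotation as Rot

def relative_pose(q1, t1, q2, t2):
    """(q_i, t_i): world-to-camera-i pose, quaternions in SciPy's (x, y, z, w) order.
    Returns the pose of camera 1 expressed in camera 2's frame, as in Eq. (1)."""
    r1, r2 = Rot.from_quat(q1), Rot.from_quat(q2)
    q_12 = (r2 * r1.inv()).as_quat()                                   # q_12 = q2 x q1*
    T_12 = r2.apply(-r1.inv().apply(np.asarray(t1))) + np.asarray(t2)  # T_12 = R2(-R1^T t1) + t2
    return q_12, T_12
```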

Losses. The loss function uses the Euclidean distance to compare the predicted relative rotation \(\hat{q}_{1,2}\) and translation \(\hat{T}_{1,2}\) with the ground truth \(q_{1,2}\) and \(T_{1,2}\): \(loss = \sum _i \left( ||\hat{T}^i_{1,2} - T_{1,2}^i||_2 + \beta \, ||\hat{q}_{1,2}^i - q_{1,2}^i||_2 \right) \). Quaternions are unit quaternions. As in the original PoseNet, the \(\beta \) term in front of the quaternion distance balances the loss values between translation and rotation. To find the most suitable value of \(\beta \), we cross-validated it on our validation set. Please refer to our code for the hyper-parameter values used on the different subsets.
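A minimal sketch of this weighted loss is given below, assuming a PyTorch implementation; the default value of beta is only a placeholder, since the actual value is cross-validated per subset.

```python
import torch

def relative_pose_loss(t_pred, q_pred, t_gt, q_gt, beta=100.0):
    """t_*: (B, 3) translations, q_*: (B, 4) unit quaternions; beta balances the two terms."""
    t_err = torch.norm(t_pred - t_gt, dim=1)   # ||T_hat - T||_2 per pair
    q_err = torch.norm(q_pred - q_gt, dim=1)   # ||q_hat - q||_2 per pair
    return (t_err + beta * q_err).sum()        # summed over the batch, as in the loss above
```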

4 Experimentations

4.1 Experimental Setup

Dataset. Experimental validation is done on the Cambridge Landmark dataset. Each image is associated with a ground-truth pose. We provide results on 4 of the 5 subsets (scenes); as discussed by several authors, the ‘street’ scene raises several issues.

Table 1. Number of training and testing pairs for the Cambridge Landmark dataset. SE stands for spatial extent, measured in meters.

Pair Generation. For each sequence of each scene, we randomly pair each image with eight different images of the same sequence. For a fair comparison with SURF, pair generation ensures that the two images overlap sufficiently. We follow the train-test splits defined with the dataset. Images are scaled so that the smallest dimension is 256 pixels, keeping the original aspect ratio. During training, we feed 224 * 224 random crops into the network; at test time, we use a center crop.
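The sketch below illustrates this pairing step with a hypothetical helper (not the released code); the overlap check used for the fair SURF comparison is omitted.

```python
import random

def make_pairs(sequence_images, pairs_per_image=8, seed=0):
    """Pair each image of a sequence with `pairs_per_image` other images of the same sequence."""
    rng = random.Random(seed)
    pairs = []
    for img in sequence_images:
        others = [o for o in sequence_images if o is not img]
        for partner in rng.sample(others, min(pairs_per_image, len(others))):
            pairs.append((img, partner))
    return pairs
```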

Baseline. The baseline is a traditional keypoint-based method (SURF). The focal length and principal point are provided by the dataset. The other parameters are cross-validated on the validation set. For a fair comparison, we provide two scenarios for the baseline: (1) the images are scaled to 256 * 455 pixels, followed by a 224 * 224 center crop, producing the same image pairs as those tested with our networks, and (2) the original images without down-sampling. We name these two scenarios ‘SURFSmall’ and ‘SURFFull’. All the camera parameters are adapted to the scaling and cropping applied.
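For reference, a minimal OpenCV sketch of such a baseline is shown below; our exact parameter choices may differ, and SURF requires the opencv-contrib package. Note that the recovered translation is only a unit direction, which is precisely the limitation discussed in the introduction.

```python
import cv2
import numpy as np

def surf_relative_pose(img1, img2, K):
    """Relative pose from SURF matches: RANSAC essential-matrix estimation, then pose recovery."""
    surf = cv2.xfeatures2d.SURF_create()
    kp1, des1 = surf.detectAndCompute(img1, None)
    kp2, des2 = surf.detectAndCompute(img2, None)
    matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t  # t is a unit direction vector (translation up to scale)
```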

Evaluation Metric. We measure 3 different errors: (i) translation errors, in meters, (ii) rotation errors, in degrees, and (iii) translation direction errors, in degrees (the angle between the predicted and ground-truth translation vectors). We report the median of each measurement.
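A possible implementation of these three metrics is sketched below; it follows a standard formulation that we assume here, and the exact definitions used in the paper may differ slightly.

```python
import numpy as np

def pose_errors(t_pred, q_pred, t_gt, q_gt):
    """Returns (translation error in meters, rotation error in degrees, translation angle in degrees)."""
    t_err_m = np.linalg.norm(t_pred - t_gt)
    q_p = q_pred / np.linalg.norm(q_pred)
    q_g = q_gt / np.linalg.norm(q_gt)
    r_err_deg = 2.0 * np.degrees(np.arccos(np.clip(abs(np.dot(q_p, q_g)), -1.0, 1.0)))
    cos_t = np.dot(t_pred, t_gt) / (np.linalg.norm(t_pred) * np.linalg.norm(t_gt) + 1e-12)
    t_err_deg = np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0)))
    return t_err_m, r_err_deg, t_err_deg
```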

4.2 Experimental Results

Relative Pose Inference Module. Figure 2 compares the performance of the different systems and test scenarios. Based on these experimental results, RPNetFC and RPNet\(^{+}\) are the most effective ways to recover the relative pose. On easy scenes (i.e. KingsCollege and OldHospital), where there are no ambiguous textures, using a pose regressor (RPNetFC) produces slightly better results than inferring the relative pose from the two estimated camera poses (PoseNet/RPNet/RPNet\(^+\)). On the contrary, on hard scenes (i.e. ShopFacade and StMarysChurch), the RPNet family outperforms RPNetFC. This behaviour also holds for the relative rotation and relative translation measured in degrees. Globally, RPNetFC produces the best results, followed by RPNet\(^+\), PoseNet and, finally, RPNet. The differences between their results are between 0 and 8\(^\circ \). On the practical side, RPNetFC is much easier to train than RPNet\(^+\)/RPNet, since it does not involve multiple hyper-parameters to balance the different losses. It also converges faster.

Fig. 2. Translation and rotation errors (median) of the different approaches

Comparison with Traditional Approaches. We start by discussing the SURFSmall scenario. In general, the error on both translation and rotation is reduced by 5 to 70% using the RPNet family, except on KingsCollege, where the traditional approach slightly outperforms the RPNet-based methods. We observed that the performance of the traditional approaches varies widely from one subset to another, while RPNet\(^+\)/RPNetFC are more stable. In addition, the traditional approach requires camera information for each image in order to correctly estimate the pose, whereas the RPNet-based methods do not require any such information. Using the original image size (SURFFull) significantly boosts the performance of the traditional approach. However, RPNetFC still enjoys a significant gain in performance on OldHospital and ShopFacade, while performing slightly worse than SURFFull on KingsCollege and StMarysChurch. The difference in performance between SURFFull and RPNetFC is even more significant when the images contain large viewpoint changes (see Fig. 3).

Fig. 3. Cumulative histograms of errors in rotation (1st row, degrees) and translation (2nd row, meters).

Full Translation Vector. One of our objectives is to provide a system able to estimate the full translation vector. On average, we observe that the median error ranges from 2 to 4 m using RPNetFC. Figure 4 gives an idea of the ground-truth translations w.r.t. the reference axes (x, y, z). For instance, on KingsCollege, the values along the X-axis range from −29 m to 30 m with an STD of 7 m. Interestingly, our network has a translation error of only 2.88 m.

Fig. 4. Min/Max/Mean/STD relative translations (ground truth), w.r.t. XYZ axis (m).

5 Conclusions

This paper proposed a novel architecture for estimating full relative poses with an end-to-end neural network. The network is based on a Siamese architecture, combined with different ways of inferring the relative pose. In addition to producing results that are competitive with or better than the traditional SURF-based approaches, our system produces an accurate full translation vector. We hope this paper provides more insight and motivates other researchers to focus on global end-to-end systems for relative pose regression problems.