1. Introduction
The fusion of monocular cameras and inertial measurement units (IMUs) has become very popular in recent years, thanks to great improvements in the computing capacity of low-power, lightweight computers and the increasing demand for accurate motion tracking and positioning in unmanned aerial vehicles (UAVs), augmented reality, and driverless cars. This fusion problem has been studied for years [1,2,3,4], and as a result, one can now build a visual-inertial odometry (VIO) module simply with cheap sensors and open-source software [2,5,6,7,8,9].
Fusion frameworks are divided into two main branches: filter-based and optimization-based [10,11]. Optimization-based methods are so far widely recognized as performing better in terms of precision [12] due to their iterative mechanism, which essentially solves the well-known bundle adjustment (BA) problem [13]. The BA problem was considered computationally costly in earlier years, until the literature recognized and exploited its sparse structure [13,14] to develop real-time algorithms. Kümmerle et al. [7] modeled BA as a graph optimization problem. Kaess et al. [8] introduced a factor graph model to further illustrate the Bayesian nature of BA. Kaess et al. [8] also found that the incremental nature of the Hessian matrix in the normal equations can be exploited when solving BA, speeding up the calculation further. With these insights, researchers also made efforts to overcome the inconsistency in fixed-lag fusion algorithms, which have the advantages of bounded computation with less information loss and a preserved sparse structure [15]. Several open-source libraries are available for building the back-end for algorithms of this branch, based on the different mathematical descriptions listed above and providing convenient application program interfaces (APIs) [7,8,9,16]. Although they reduce computation by leveraging sparse matrix factorization, optimization-based VIO systems sometimes still need to be tailored before they can be deployed on a computation-limited platform, which sometimes degrades performance [17].
Before the flowering of optimization-based methods, fusion problems were predominantly solved by filtering [18]. The ordinary procedure is to include the IMU pose and map point positions in the filter state, propagating the state as IMU measurements arrive and updating it as camera measurements arrive [10]. Accurate estimation of map point positions is the key to an unbiased IMU pose update. In traditional filter-based solutions, the filter state has a very large dimension because it preserves many map points, which increases the computation requirements [1]. The multi-state constraint Kalman filter (MSCKF) was proposed as an effective and optimal filter-based solution that does not maintain map points in the filter state [19]. By properly handling the camera measurements, MSCKF can achieve performance competitive with optimization-based algorithms while demanding far less computation [17].
By correcting observability properties [19,20,21] and incorporating camera–IMU extrinsic parameters into the filter state [22], the performance of MSCKF was further improved. Many follow-up works emerged, including an open-source monocular implementation [23], an extension to a stereo camera rig [24], and schemes using direct visual front-ends [25] or adding line features [26].
It should be emphasized that all members of the MSCKF family so far have been developed based on Shuster’s notation of quaternion [27], whereas most of the community uses the traditional Hamilton notation, which causes researchers unnecessary difficulty in understanding [28].
Visual front-ends clearly play an important role in VIO. They typically fall into two categories. Feature-based methods use descriptors to match features between consecutive images [6], while direct methods minimize photometric residuals to accomplish data association [5,25]. Sparse optical flow tracking is an efficient direct method that is widely used [2,23,24]. It provides sub-pixel accuracy but produces more outliers than feature descriptor matching [29]. An optimization-based back-end can eliminate outliers during iteration [30]. Filter-based back-ends, by contrast, are vulnerable to outliers if only a one-off update is applied [23]. Using an iterated update scheme mitigates this problem at the cost of additional computation [31].
To recap, in order to make VIO algorithms more practical, it is desirable to develop algorithms with lower computation while maintaining high precision.
In this paper, we developed a filter-based monocular visual-inertial odometry that can be regarded as a member of the MSCKF family and that balances high precision and computational efficiency. The main contributions of this paper are as follows:
We deduced a closed-form IMU error state transition equation based on the more intuitive Hamilton notation of quaternion. By solving the integration terms analytically, we further obtained a novel, fully linear formulation that is also closed-form and readily implemented.
By analyzing the statistical properties of ORB descriptor [32] distances of matched and unmatched feature points, we introduced a novel descriptor-assisted sparse optical flow tracking technique, which enhances feature tracking robustness while adding very little computational complexity.
Further improvements are made to the usability and performance of the filter. An initialization procedure is developed that automatically detects stationary scenes by analyzing tracked features and initializes the filter state based on static IMU data. The feature triangulation mechanism is carefully refined to provide efficient measurement updates.
A filter-based monocular VIO using the proposed state transition equation, visual front-end, and initialization procedure is implemented under Sun et al.’s [24] framework. The performance of our VIO is compared with that of MSCKF-MONO [23], an open-source monocular implementation of MSCKF, with parameters set as similarly as possible. Ours is also compared with other state-of-the-art open-source VIOs including ROVIO [5], OKVIS [6], and VINS-MONO [2]. In addition, we analyze the processing time of our algorithm. All of the evaluations above are done on the EuRoC datasets [33]. Detailed evaluations are reported.
The rest of this paper is organized as follows. The problem of quaternion notation confusion is illustrated in Section 2. Section 3 deduces the error state differential equation based on Hamilton’s notation. Section 4 gives a closed-form error state transition formulation and then solves the integration terms in it, obtaining a fully linear closed-form formulation. Section 5 presents the descriptor-assisted sparse optical flow tracking front-end. Other implementation details and improvements are presented in Section 6, including the overall filter model, automatic initialization procedure, and refined feature triangulation mechanism. Section 7 presents the experimental results in detail. Finally, conclusions are drawn in Section 8.
2. Quaternion Notation Confusion
The quaternion is one of the widely used representations of rotation in numerical calculations [34]. In the related literature, there are mainly two different notations: Hamilton’s notation and Shuster’s notation [35]. The difference between them lies in their flipped rules for the multiplication of the imaginary units $i$, $j$, and $k$. Hamilton utilizes $ijk = -1$, while Shuster advocates for $ijk = 1$ to maintain the order of the chain rule when transferring to direction cosine matrices (DCMs). Sommer et al. [28] surveyed this notation confusion problem in detail and argued for entirely abandoning Shuster’s notation. In this section, we present the original problem that Shuster’s notation was designed to solve and a solution for maintaining the chain rule order while still using Hamilton’s notation.
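The two multiplication rules can be compared numerically. The following is a minimal sketch, not code from this paper: it assumes quaternions stored as `(w, x, y, z)` tuples, and the function names `hamilton_product` and `shuster_product` are chosen here for illustration. It relies on the fact that the Shuster product equals the Hamilton product with its operands swapped.

```python
def hamilton_product(p, q):
    """Hamilton product of quaternions stored as (w, x, y, z); here i*j = k."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    )


def shuster_product(p, q):
    """Shuster (JPL) product: Hamilton product with swapped operands, so i*j = -k."""
    return hamilton_product(q, p)


# Imaginary units as pure quaternions.
I = (0, 1, 0, 0)
J = (0, 0, 1, 0)
K = (0, 0, 0, 1)

ij_hamilton = hamilton_product(I, J)                          # k
ij_shuster = shuster_product(I, J)                            # -k
ijk_hamilton = hamilton_product(hamilton_product(I, J), K)    # -1
ijk_shuster = shuster_product(shuster_product(I, J), K)       # +1
```

Swapping the operands is the only difference between the two functions, which is why formulas derived under one notation cannot be reused under the other without reordering every product.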
A quaternion of rotation $\mathbf{q}$ is basically a unit quaternion; it can be defined as
$$\mathbf{q} = \begin{bmatrix} \cos(\theta/2) \\ \mathbf{u}^A \sin(\theta/2) \end{bmatrix} \tag{1}$$
where $\mathbf{u}^A$ is the unit vector of the rotation axis in frame A, and $\theta$ is the angle of rotation. In the rest of this article, the term “quaternion” will be used to refer to a quaternion of rotation, for the sake of simplicity.
Equation (1) shows how to construct a quaternion $\mathbf{q}$ from an axis-angle $\theta\mathbf{u}^A$, which describes the anticlockwise rotation of an angle $\theta$ about the axis $\mathbf{u}^A$. If the original frame A is rotated to a new frame B after this rotation, as illustrated in Figure 1, then we can use a quaternion $\mathbf{q}_{AB}$ or a DCM $\mathbf{C}_{AB}$ to describe this rotation.
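Note that $\mathbf{C}_{AB}$ can be used to compute the coordinate of a vector $\mathbf{v}$ in frame B given its coordinate in frame A, that is $\mathbf{v}^B = \mathbf{C}_{AB}\mathbf{v}^A$. $\mathbf{C}_{AB}$ can be written as a function of $\mathbf{q}_{AB}$:
$$\mathbf{C}_{AB} = C(\mathbf{q}_{AB}) \tag{2}$$
where $C(\cdot)$ is an operator mapping $\mathbf{q}$ to $\mathbf{C}$.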
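Let $\mathbf{q}_{BC}$ be the quaternion describing the rotation from frame B to frame C and $\mathbf{q}_{AC}$ the rotation from frame A to frame C. Then, according to Equation (2), we have
$$\mathbf{C}_{BC} = C(\mathbf{q}_{BC}), \quad \mathbf{C}_{AC} = C(\mathbf{q}_{AC}) \tag{3}$$
Coordinate transformation of vectors can also be done by applying the triple product of quaternions:
$$\mathbf{v}^B = \mathbf{q}_{AB}^{-1} \otimes \mathbf{v}^A \otimes \mathbf{q}_{AB}, \quad \mathbf{v}^C = \mathbf{q}_{BC}^{-1} \otimes \mathbf{v}^B \otimes \mathbf{q}_{BC} \tag{4}$$
Here we abuse the notation of $\mathbf{v}^A$, $\mathbf{v}^B$, and $\mathbf{v}^C$ to describe quaternions with zero real part such that $\mathbf{v} = [0, \mathbf{v}^\top]^\top$. Combining the two relations in Equation (4) yields:
$$\mathbf{v}^C = (\mathbf{q}_{AB} \otimes \mathbf{q}_{BC})^{-1} \otimes \mathbf{v}^A \otimes (\mathbf{q}_{AB} \otimes \mathbf{q}_{BC}) \tag{5}$$
Referring to Equation (5), there is
$$\mathbf{q}_{AC} = \mathbf{q}_{AB} \otimes \mathbf{q}_{BC} \tag{6}$$
At the same time, by applying the chain rule in DCM multiplication, it follows that
$$\mathbf{C}_{AC} = \mathbf{C}_{BC}\mathbf{C}_{AB} \tag{7}$$
Now we can conclude that $C(\mathbf{q}_{AB} \otimes \mathbf{q}_{BC}) = C(\mathbf{q}_{BC})\,C(\mathbf{q}_{AB})$, in which the order of the factors is reversed, which means the mapping $C(\cdot)$ is not a homomorphism. One would prefer a homomorphic mapping between DCMs and quaternions to maintain the chain-rule order, which is convenient to manipulate. Shuster utilized a flipped multiplication rule to avoid this problem. This notation was adopted by the Jet Propulsion Laboratory (JPL) and thus introduced into the spacecraft literature, while other research fields continued to use the traditional Hamilton notation. As researchers exchanged ideas between research fields, Shuster’s notation came to be used in robotics for rotation representation [28]. So far, all of the theories about MSCKF are deduced based on this notation [1].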
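As Sommer et al. [28] pointed out, a homomorphic mapping can be obtained even under Hamilton’s notation. Let $R(\cdot)$ be an operator that satisfies $R(\mathbf{q}) = C(\mathbf{q})^\top$. Thus, we have
$$R(\mathbf{q}_{AC}) = C(\mathbf{q}_{AC})^\top = \mathbf{C}_{AC}^\top \tag{8}$$
According to Equations (6) and (7), we now have $R(\mathbf{q}_{AB} \otimes \mathbf{q}_{BC}) = R(\mathbf{q}_{AB})\,R(\mathbf{q}_{BC})$, which proves $R(\cdot)$ to be a homomorphism.
Given a quaternion $\mathbf{q} = [q_w, \mathbf{q}_v^\top]^\top$, the operator $R(\cdot)$ is defined as a function mapping the quaternion $\mathbf{q}$ to a DCM $\mathbf{R}$ as
$$\mathbf{R} = R(\mathbf{q}) = (q_w^2 - \mathbf{q}_v^\top\mathbf{q}_v)\,\mathbf{I}_3 + 2\mathbf{q}_v\mathbf{q}_v^\top + 2q_w[\mathbf{q}_v]_\times \tag{9}$$
where $[\cdot]_\times$ denotes the skew-symmetric cross-product matrix. This is, in fact, the classical Rodrigues Rotation Formula. There is a more thorough discussion about this mapping in [36].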
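The composition-order argument of this section can be checked numerically. The sketch below is illustrative only and uses plain Python lists; the helper names are hypothetical. It assumes Hamilton quaternions stored as `(w, x, y, z)`, takes the homomorphic mapping to be the standard rotation matrix of a unit Hamilton quaternion, and takes the coordinate-transforming DCM to be its transpose.

```python
import math


def quat_from_axis_angle(axis, angle):
    """Unit quaternion (w, x, y, z) for a rotation of `angle` rad about unit `axis`."""
    s = math.sin(angle / 2.0)
    return (math.cos(angle / 2.0), axis[0] * s, axis[1] * s, axis[2] * s)


def hamilton_product(p, q):
    """Hamilton product (i*j = k) of two (w, x, y, z) quaternions."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    )


def rotation_matrix(q):
    """R(q) = (w^2 - |v|^2) I + 2 v v^T + 2 w [v]x for q = (w, v)."""
    w, x, y, z = q
    return [
        [w * w + x * x - y * y - z * z, 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), w * w - x * x + y * y - z * z, 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), w * w - x * x - y * y + z * z],
    ]


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def coordinate_dcm(q):
    """C(q) = R(q)^T: maps coordinates in the original frame to the rotated frame."""
    return transpose(rotation_matrix(q))


def close(a, b, tol=1e-12):
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(3) for j in range(3))


# Two non-commuting rotations: A -> B about x, then B -> C about y.
q_ab = quat_from_axis_angle((1.0, 0.0, 0.0), 0.3)
q_bc = quat_from_axis_angle((0.0, 1.0, 0.0), 0.5)
q_ac = hamilton_product(q_ab, q_bc)

# R keeps the factor order, C reverses it.
r_keeps_order = close(rotation_matrix(q_ac),
                      matmul(rotation_matrix(q_ab), rotation_matrix(q_bc)))
c_reverses_order = close(coordinate_dcm(q_ac),
                         matmul(coordinate_dcm(q_bc), coordinate_dcm(q_ab)))
c_keeps_order = close(coordinate_dcm(q_ac),
                      matmul(coordinate_dcm(q_ab), coordinate_dcm(q_bc)))
```

With these inputs `r_keeps_order` and `c_reverses_order` hold while `c_keeps_order` does not, which is the behavior the homomorphism argument predicts for rotations that do not commute.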
5. ORB Descriptor-Assisted Optical Flow Front-End
In this section, we propose a sparse visual front-end using descriptor-assisted optical flow feature tracking.
Different kinds of feature descriptors are used in several VIOs to accomplish feature extraction and matching [1,6,30]. In contrast, other solutions choose optical flow feature tracking as their front-end since it is less time-consuming than descriptor-based methods [2,23,24]. However, optical flow tracking produces more wrong matches than descriptor-based methods, and some of these wrong matches survive even after outlier-rejection algorithms such as random sample consensus (RANSAC). Filter-based VIOs are very sensitive to feature outliers since they do not eliminate outliers through iteration as the optimization-based ones do. Wrong matches left behind will participate in measurement updates, which may result in deteriorating estimates or even failure. In conclusion, a robust front-end is needed to achieve stable performance for filter-based VIOs, while a real-time solution also calls for fast data association.
Yang et al. [29] refined ORB-SLAM [38] by using a sparse optical flow algorithm. The key idea was to correct the image coordinates of ORB features by optical flow tracking results to achieve sub-pixel precision. The method proposed here is somewhat different: we first use optical flow to conduct fast tracking, then compute the descriptor distance between the two members of each matched feature pair and judge whether they are a good match.
There exist plenty of feature descriptor algorithms. We chose the ORB descriptor in our proposed method for two reasons:
The ORB descriptor is a binary string, so the distance between two descriptors can be expressed as a Hamming distance, which can be computed efficiently.
The rotation between consecutive images in a real-time application is usually very gentle, so invariance to rotation is not very important for a descriptor.
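The first reason can be made concrete with a short sketch. It is only an illustration of Hamming distance on binary strings, not the paper's implementation; it assumes each descriptor is a `bytes` object of 32 bytes (256 bits, a common ORB descriptor size), and the example bit patterns are made up.

```python
def hamming_distance(desc_a: bytes, desc_b: bytes) -> int:
    """Number of differing bits between two equal-length binary descriptors."""
    if len(desc_a) != len(desc_b):
        raise ValueError("descriptors must have the same length")
    # XOR leaves a 1 wherever the bits differ; count those 1s byte by byte.
    return sum(bin(a ^ b).count("1") for a, b in zip(desc_a, desc_b))


# Two made-up 256-bit descriptors that differ in 2 bits per byte.
desc_a = bytes([0b10110000] * 32)
desc_b = bytes([0b10010001] * 32)
distance = hamming_distance(desc_a, desc_b)
```

Because the distance reduces to an XOR followed by a bit count, it needs no floating-point arithmetic, which is what keeps the extra cost of checking each tracked pair small.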
Descriptor Distance Analysis for General Corner Features
The basic visual front-end is based on Shi-Tomasi corner detection [39] and optical flow tracking [40]. It is important to figure out whether the ORB descriptor is meaningful for general Shi-Tomasi corner features. An experiment showed that it is indeed statistically meaningful. We calculated the feature angle for each Shi-Tomasi feature and then used it to compute the ORB descriptor [32]. Several tests were conducted in the experiment. For each test, feature pairs from every two adjacent images of a continuous image stream were stored separately in two sequences. These tests analyzed the statistical properties of the ORB descriptor distances of the following kinds of feature pairs:
Coarsely matched feature pairs based on Shi-Tomasi corner detection and optical flow tracking.
Relatively strictly matched feature pairs based on ORB descriptor matching and RANSAC.
Randomly constructed feature pairs.
Unmatched feature pairs generated by inverse order of one of the strictly matched feature sequences.
One feature sequence from the strictly matched pairs was inverted to generate strictly unmatched feature pairs. The experimental results are shown in Figure 2.
We strongly suspect that the very long tail in Figure 2a is due to wrong matches, because no further outlier rejection method was applied after optical flow tracking in this test. In Figure 2b, apart from the dominant Gaussian-like distribution, a small bump centered at about 17 appears, which is framed by a red rectangular border. This is because the random pairs were constructed in two adjacent images, and thus two matched features have a considerable probability of being coincidentally formed into a pair. These two experiments show that ORB descriptors and descriptor distances are meaningful for general Shi-Tomasi corners, from a statistical standpoint.
In order to clearly analyze the statistical properties of matched and unmatched pairs, two further tests were conducted. First, descriptor-based matching and RANSAC were applied to obtain relatively strictly matched feature pairs. Then, the order of one of the feature sequences was reversed, which is a simple yet effective way to make two sequences unmatched. Descriptor distances before and after order reversal were computed, and the statistical results are shown in Figure 2c,d. It can be seen from the figures that the long tail and small bump disappear because of the relatively strict pairing rule. The two distributions are plotted together in Figure 3 to make a clear comparison.
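The order-reversal trick can be illustrated on synthetic data. The sketch below does not reproduce the experiment or its figures: the descriptors are random 256-bit integers, a "match" is simulated by flipping up to 30 random bits, and every name and parameter is a hypothetical choice made for this example.

```python
import random
import statistics

BITS = 256  # assumed descriptor length in bits


def hamming(a: int, b: int) -> int:
    """Hamming distance between two descriptors stored as Python integers."""
    return bin(a ^ b).count("1")


def perturb(desc: int, rng: random.Random, max_flips: int = 30) -> int:
    """Flip a few distinct random bits to imitate the same corner in the next image."""
    for bit in rng.sample(range(BITS), rng.randint(0, max_flips)):
        desc ^= 1 << bit
    return desc


rng = random.Random(0)
seq_a = [rng.getrandbits(BITS) for _ in range(200)]  # synthetic features, image k
seq_b = [perturb(d, rng) for d in seq_a]             # their matches, image k+1

# Matched pairs: element i of one sequence with element i of the other.
matched = [hamming(a, b) for a, b in zip(seq_a, seq_b)]

# Unmatched pairs: reverse one sequence, so feature i meets feature N-1-i.
# With an even N, no feature stays paired with its true match.
unmatched = [hamming(a, b) for a, b in zip(seq_a, reversed(seq_b))]

matched_mean = statistics.mean(matched)
matched_std = statistics.pstdev(matched)
unmatched_mean = statistics.mean(unmatched)
unmatched_std = statistics.pstdev(unmatched)
```

On real images the matched distribution is broader than in this toy model, but the separation between the two sets of statistics is the property that the filtering strategy relies on.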
The experimental results show that the descriptor distances of unmatched and matched feature pairs possess significantly different statistical properties. As shown in
Figure 3, descriptor distances of unmatched features approximately follow the Gaussian distribution with a mean, or we can say peak, at about
and with a standard deviation of
. For matched pairs, the distribution shows a sharper peak at about
. There is still a tail in the matched distribution, but it is much smaller than the one in
Figure 2a. The difference between matched and unmatched pairs is significant enough to design a strategy to filter out wrong matches.
We use the following heuristic to filter out wrong matches:
For feature pairs with distances lower than the smaller peak value, classify them as inliers.
For feature pairs with distances higher than the bigger peak value, classify them as outliers.
For feature pairs whose distances are between the two peaks, calculate and compare the Mahalanobis distances to both peaks to decide their classification.
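The three rules above can be sketched as a small function. This is a hedged illustration rather than the authors' code: the name `classify_pair` is hypothetical, the peak and spread values in the usage lines are placeholders and not the statistics measured in this paper, and the handling of ties (treated as outliers) is an assumption. In one dimension the Mahalanobis distance to a peak reduces to the absolute deviation divided by the standard deviation.

```python
def classify_pair(distance, matched_peak, matched_std, unmatched_peak, unmatched_std):
    """Return True if a tracked feature pair is kept as an inlier.

    `matched_peak`/`matched_std` describe the matched-pair distribution and
    `unmatched_peak`/`unmatched_std` the unmatched one (matched_peak < unmatched_peak).
    """
    if distance < matched_peak:
        return True       # below the smaller peak: inlier
    if distance > unmatched_peak:
        return False      # above the bigger peak: outlier
    # Between the peaks: compare 1-D Mahalanobis distances to both peaks.
    d_matched = abs(distance - matched_peak) / matched_std
    d_unmatched = abs(distance - unmatched_peak) / unmatched_std
    return d_matched < d_unmatched  # assumption: a tie is rejected


# Placeholder statistics, chosen only to exercise the three branches.
params = dict(matched_peak=20.0, matched_std=12.0,
              unmatched_peak=120.0, unmatched_std=20.0)

below_small_peak = classify_pair(10, **params)   # first rule
above_big_peak = classify_pair(130, **params)    # second rule
nearer_matched = classify_pair(50, **params)     # 2.5 vs 3.5 standard deviations
nearer_unmatched = classify_pair(90, **params)   # about 5.8 vs 1.5 standard deviations
```

Normalizing by each standard deviation matters because the two distributions have different widths, so the decision boundary between the peaks is generally not the midpoint.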
8. Conclusions
In this paper, we first deduced a closed-form IMU error state transition equation from scratch. By using Hamilton’s notation of quaternion, we tried to eliminate notation ambiguity. We then solved the integration terms left in the transition equation by introducing a two-sample fitting method to approximate the axis-angle, resulting in a fully linear closed-form formulation that is unbiased up to the fitting resolution. This formulation also has the potential to incorporate IMU intrinsics into the filter state, since it is a linear function of the IMU measurements. An automatic initialization procedure was developed and the feature triangulation mechanism was carefully refined. The ORB descriptor distance between Shi-Tomasi corner pairs was analyzed, and we found that there is a statistical difference in descriptor distances between matched and unmatched feature pairs. As outliers are sometimes fatal for filter-based VIOs, this inspired us to propose a visual front-end based on optical flow tracking that additionally uses ORB descriptors to eliminate outliers. We implemented a monocular VIO under the framework of MSCKF with the proposed novelties.
Through a comparison between estimation results with and without the proposed outlier elimination method, we demonstrated its effectiveness. Furthermore, an experiment was done to compare the proposed method with several state-of-the-art VIOs, both in terms of precision and computation. Results show that the proposed VIO is a visual-inertial fusion solution with precision comparable to the state of the art but which demands fewer computational resources.
Future work includes adding a robust initialization procedure that adapts to versatile scenes and analyzing the point selection mechanism in detail.