Matas Bmvc02
Center for Machine Perception, Dept. of Cybernetics, CTU Prague, Karlovo nam 13, CZ 121 35
[matas, chum]@cmp.felk.cvut.cz
Abstract
The wide-baseline stereo problem, i.e. the problem of establishing correspondences between a pair of images taken from different viewpoints, is studied.
A new set of image elements that are put into correspondence, the so-called
extremal regions, is introduced. Extremal regions possess highly desirable properties: the set is closed under 1. continuous (and thus projective)
transformation of image coordinates and 2. monotonic transformation of image intensities. An efficient (near linear complexity) and practically fast (near frame rate) detection algorithm is presented for an affinely-invariant stable
subset of extremal regions, the maximally stable extremal regions (MSER).
A new robust similarity measure for establishing tentative correspondences is proposed. The robustness ensures that invariants from multiple
measurement regions (regions obtained by invariant constructions from extremal regions), some of which are significantly larger (and hence more discriminative)
than the MSERs, may be used to establish tentative correspondences.
The high utility of MSERs, multiple measurement regions and the robust
metric is demonstrated in wide-baseline experiments on image pairs from
both indoor and outdoor scenes. Significant change of scale (3.5), illumination conditions, out-of-plane rotation, occlusion, locally anisotropic scale
change and 3D translation of the viewpoint are all present in the test problems. Good estimates of epipolar geometry (average distance from corresponding points to the epipolar line below 0.09 of the inter-pixel distance)
are obtained.
1 Introduction
Finding reliable correspondences in two images of a scene taken from arbitrary viewpoints, with possibly different cameras and in different illumination conditions, is a
difficult and critical step towards fully automatic reconstruction of 3D scenes [5]. A crucial issue is the choice of elements whose correspondence is sought. In the wide-baseline
set-up, local image deformations cannot be realistically approximated by translation or
translation with rotation, and a full affine model is required. Correspondence therefore cannot be
established by comparing regions of a fixed (Euclidean) shape like rectangles or
circles, since their shape is not preserved under affine transformation.
In most images there are regions that can be detected with high repeatability since they
possess some distinguishing, invariant and stable property. We argue that such regions of,
in general, data-dependent shape, called distinguished regions (DRs) in the paper, may
serve as the elements to be put into correspondence either in stereo matching or object
recognition.
The first contribution of the paper is the introduction of a new set of distinguished
regions, the so-called extremal regions. Extremal regions have two desirable properties.
The set is closed under continuous (and thus perspective) transformation of image coordinates and, secondly, it is closed under monotonic transformation of image intensities.
An efficient (near linear complexity) and practically fast detection algorithm is presented
for an affinely-invariant stable subset of extremal regions, the maximally stable extremal
regions (MSER). Robustness of a particular type of DR depends on the image data and
must be tested experimentally. Successful wide-baseline experiments on indoor and outdoor datasets presented in Section 4 demonstrate the potential of MSERs.
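
The closure under monotonic intensity transformation can be checked on a toy image. The sketch below is only an illustration and not the detection algorithm of the paper: it assumes that extremal regions can be enumerated as the 4-connected components of pixels not brighter than a threshold, taken over all thresholds, and it uses a brute-force flood fill. The function names and the toy image are hypothetical.

```python
def components_below(img, t):
    """4-connected components of pixels with intensity <= t (set of frozensets)."""
    h, w = len(img), len(img[0])
    seen, regions = set(), set()
    for r in range(h):
        for c in range(w):
            if img[r][c] > t or (r, c) in seen:
                continue
            # Flood fill from an unvisited pixel that passes the threshold.
            stack, comp = [(r, c)], set()
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                comp.add((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and (ny, nx) not in seen and img[ny][nx] <= t):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            regions.add(frozenset(comp))
    return regions


def extremal_regions(img):
    """Union of thresholded components over every intensity level in the image."""
    levels = sorted({v for row in img for v in row})
    regions = set()
    for t in levels:
        regions |= components_below(img, t)
    return regions


# Hypothetical toy image (assumption, not data from the paper).
image = [
    [10, 10, 200, 40],
    [10, 90, 200, 40],
    [200, 200, 200, 40],
    [30, 30, 200, 250],
]

# A strictly increasing intensity mapping for non-negative values.
brighter = [[2 * v * v + 3 for v in row] for row in image]

same = extremal_regions(image) == extremal_regions(brighter)
print("region sets identical under monotonic transform:", same)
```

Because the mapping preserves the ordering of intensities, every threshold on the original image has a counterpart on the transformed image that selects the same pixels, so the two region sets coincide.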
Reliable extraction of a manageable number of potentially corresponding image elements is a necessary but certainly not a sufficient prerequisite for successful wide-baseline
matching. With two sets of distinguished regions, the matching problem can be posed as
a search in the correspondence space [3]. Forming a complete bipartite graph on the two
sets of DRs and searching for a globally consistent subset of correspondences is clearly
out of the question for computational reasons. Recently, a whole class of stereo matching
and object recognition algorithms with common structure has emerged [9, 15, 1, 16, 2,
13, 7, 6]. These methods exploit local invariant descriptors to limit the number of tentative correspondences. Important design decisions at this stage include: 1. the choice of
measurement regions, i.e. the parts of the image on which invariants are computed, 2. the
method of selecting tentative correspondences given the invariant description and 3. the
choice of invariants.
Typically, distinguished regions or their scaled version serve as measurement regions
and tentative correspondences are established by comparing invariants using Mahalanobis
distance [10, 16, 11]. As a second novelty of the presented approach, a robust similarity measure for establishing tentative correspondences is proposed to replace the Mahalanobis distance. The robustness of the proposed similarity measure allows us to use
invariants from a collection of measurement regions, even some that are much larger than
the associated distinguished region. Measurements from large regions are either very
discriminative (it is very unlikely that two large parts of the image are identical) or completely wrong (e.g. if orientation or depth discontinuity becomes part of the region). The
former helps to establish reliable tentative (local) correspondences; the influence of the
latter is limited due to the robustness of the approach.
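
Why robustness matters when large measurement regions are used can be seen in a small numerical sketch. The vote-counting score below is NOT the similarity measure proposed in the paper (which is not specified in this part of the text); it is a hypothetical stand-in that merely bounds the influence of any single measurement. The descriptor values, the per-component standard deviations and the tolerance `k` are all assumptions made for illustration.

```python
def mahalanobis_sq(a, b, sigma):
    """Squared Mahalanobis distance for a diagonal covariance (per-component sigma)."""
    return sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, sigma))


def agreement_votes(a, b, sigma, k=2.0):
    """Hypothetical robust score: number of components agreeing within k * sigma.

    A grossly wrong component costs at most one vote, however wrong it is.
    """
    return sum(1 for x, y, s in zip(a, b, sigma) if abs(x - y) <= k * s)


# Assumed toy descriptors: one value per measurement region.
query = [1.0, 2.0, 3.0, 4.0]
sigma = [0.1, 0.1, 0.1, 0.1]

# Correct match whose last measurement is completely wrong, e.g. because a
# depth discontinuity entered the largest measurement region.
true_match = [1.05, 2.05, 2.95, 9.0]
# Wrong candidate that is moderately off in every component.
impostor = [1.5, 2.5, 3.5, 4.5]

for name, cand in (("true match", true_match), ("impostor", impostor)):
    print(name,
          "mahalanobis^2 =", round(mahalanobis_sq(query, cand, sigma), 2),
          "votes =", agreement_votes(query, cand, sigma))
```

In this toy case the single corrupted component dominates the Mahalanobis distance and makes the impostor look closer, while the bounded score still ranks the true match first.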
Finding epipolar geometry consistent with the largest number of tentative (local) correspondences is the final step of all wide-baseline algorithms. RANSAC has been by far
the most widely adopted method since [14]. The presented algorithm takes novel steps
to increase the number of matched regions and the precision of the epipolar geometry.
The rough epipolar geometry estimated from tentative correspondences is used to guide
the search for further region matches. It restricts location to epipolar lines and provides
an estimate of affine mapping between corresponding regions. This mapping allows the
use of correlation to filter out mismatches. The process significantly increases precision
of the EG estimate; the final average inlier distance-from-epipolar-line is below 0.1 pixel.
For details see Section 3.
Related work. Since the influential paper by Schmid and Mohr [11] many image
matching and wide-baseline stereo algorithms have been proposed, most commonly using
Figure 1: BOOKSHELF: Estimated epipolar geometry on indoor scene with significant scale
change. In the cutouts the change in the resolution of detected DRs is clearly visible.
Tentative correspondences using correlation. Invariant description is used as a preliminary test. The final selection of tentative correspondences is based on correlation.
First, transformations that diagonalise the covariance matrix of the DRs are applied. The
resulting circular regions are correlated (for all relative rotations). This procedure is done
efficiently in polar coordinates for different sizes of circles.
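
The normalisation step can be sketched as follows. This is a minimal illustration, assuming a DR is given as a list of pixel coordinates and that the normalising transformation is taken from the Cholesky factor of the region covariance; the paper does not state which factorisation is used, and any transformation A with A Σ A^T = I differs from this one only by a rotation, which is why the subsequent correlation has to consider all relative rotations. All function names and the toy region are hypothetical.

```python
import math


def covariance(points):
    """Centre of gravity and covariance (sxx, sxy, syy) of a set of 2D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    return (mx, my), (sxx, sxy, syy)


def whitening_matrix(cov):
    """Inverse of the Cholesky factor L of the covariance (Sigma = L L^T)."""
    sxx, sxy, syy = cov
    l11 = math.sqrt(sxx)
    l21 = sxy / l11
    l22 = math.sqrt(syy - l21 * l21)
    # Inverse of the lower-triangular matrix [[l11, 0], [l21, l22]].
    return ((1.0 / l11, 0.0), (-l21 / (l11 * l22), 1.0 / l22))


def normalise(points):
    """Map the region so that its covariance becomes the identity matrix."""
    (mx, my), cov = covariance(points)
    (a, b), (c, d) = whitening_matrix(cov)
    return [(a * (x - mx) + b * (y - my), c * (x - mx) + d * (y - my))
            for x, y in points]


# Assumed toy region: pixels inside a sheared ellipse.
region = [(x, y) for x in range(-12, 13) for y in range(-12, 13)
          if (x - 0.5 * y) ** 2 / 64.0 + y ** 2 / 9.0 <= 1.0]

_, (sxx, sxy, syy) = covariance(normalise(region))
print("normalised covariance:", round(sxx, 6), round(sxy, 6), round(syy, 6))
```

After normalisation the second-order moments of the region are those of a circle, so two affinely related regions can be compared by correlation up to an unknown rotation.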
Rough epipolar geometry (EG) is estimated by applying RANSAC to the centers of
gravity of DRs. Subsequently, the precision of the EG estimate is significantly improved
by the following process. First, an affine transformation between pairs of potentially corresponding DRs, i.e. the DRs consistent with the rough EG, is computed. Correspondence
of covariance matrices defines an affine transformation up to a rotation. The rotation is
determined from epipolar lines. Next, DR correspondences are pruned and only those
with correlation of their transformed images above a threshold are selected. In the next
step, RANSAC is applied again, but this time with a very narrow threshold. The final improvement of the EG is achieved by adding to the RANSAC inliers those DR pairs whose convex
hull centres are EG-consistent. Commonly, corresponding DRs differ in minute details that render
their centres of gravity inconsistent with the fine EG, but the centres of the convex hulls
are precise enough. The precision of the final EG, estimated linearly by the eight point
algorithm (without bundle adjustment or radial distortion correction), is surprisingly high.
The average distance of inliers from the epipolar line is below 0.1 pixel, see Table 3.
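
The precision figure quoted above is a point-to-line distance. As a minimal sketch, assuming a fundamental matrix F that maps a point x in the first image to the epipolar line l' = F x in the second image, the distance of the corresponding point x' from l' = (a, b, c) is |a u' + b v' + c| / sqrt(a^2 + b^2). The code computes this one-sided distance only; whether the reported averages are one-sided or symmetric over both images is not stated here. The matrix and the matches are hypothetical example values.

```python
import math


def epipolar_line(F, x):
    """Epipolar line l' = F x in the second image for x = (u, v) in the first."""
    u, v = x
    return tuple(F[i][0] * u + F[i][1] * v + F[i][2] for i in range(3))


def distance_to_epipolar_line(F, x, x_prime):
    """Perpendicular distance (in pixels) of x_prime from the line F x."""
    a, b, c = epipolar_line(F, x)
    u, v = x_prime
    return abs(a * u + b * v + c) / math.hypot(a, b)


def mean_epipolar_distance(F, matches):
    """Average one-sided distance over a list of (x, x_prime) correspondences."""
    return sum(distance_to_epipolar_line(F, x, xp) for x, xp in matches) / len(matches)


# Assumed example: fundamental matrix of a purely horizontal camera shift,
# for which the epipolar line of (u, v) is the image row y = v.
F_rectified = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, -1.0],
    [0.0, 1.0, 0.0],
]

# Hypothetical correspondences with small vertical errors.
matches = [((10.0, 20.0), (5.0, 20.05)), ((30.0, 40.0), (12.0, 39.9))]

print("mean distance [pixels]:", mean_epipolar_distance(F_rectified, matches))
```

For this rectified configuration the distance reduces to the absolute row difference, which makes the example easy to verify by hand.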
4 Experiments
The following experiments were conducted:
Bookshelf, (Fig. 1). The BOOKSHELF scene tests performance under a very large scale
change. The corresponding DRs in the left view are confined only to a small part of the
image since the rest of the scene is not visible in the second view. Different resolution of detected
features is evident in the close-up.
Valbonne, (Fig. 2). This outdoor scene has been analysed in the literature [10, 9]. Repetitive patterns
such as bricks are present. The part of the scene visible in both views covers a small fraction of
the image.

Figure 2: VALBONNE: Estimated epipolar geometry and points associated to the matched regions
are shown in the first row. Cutouts in the second row show matched bricks.

number of:      MSER-           MSER+
             left   right    left   right     TC
Bookshelf     511     908     349     488     85
Valbonne      906    1012     761     950     49
Wash         1026     714     542     448    171
Kampa        1015     914     659     652    303
Cyl. Box     1043     627     788      39     63
Shout         298     348      80      93    151

Table 2: Number of DRs detected in images. The number of tentative correspondences is given in the TC column.

Wash, (Fig. 3). Results on this image set have been presented in [16]. The camera undergoes significant translation and rotation. The ordering constraint is notably violated:
objects appear on different backgrounds.
Kampa, (Fig. 4), is an example of an urban outdoor scene. A relatively large fraction of
the images is covered by changing sky. Repeating windows made matching difficult.
Cylindrical Box, (Fig. 5, top and bottom left), shows a metal box on a textured floor.
The regions matched on the box demonstrate performance on a non-planar surface. A
significant change of illumination and a strong specular reflection are present in the second
image, which was taken with a flash (this strongly decreases the number of MSER+).
Shout, (Fig. 5, bottom right). This scene has been used in [16]. Since the spectral power
distribution of the illumination and the position of light sources are significantly different,
we included the test to demonstrate performance in variable illumination conditions.

Figure 3: WASH: Epipolar geometry and dense matched regions with fully affine distortion.

             TC   rough EG   rough d   EG + corr   fine EG   fine d   miss
Bookshelf    85         25      0.48         151        63     0.09      1
Valbonne     49         27      0.17         180        82     0.08      0
Wash        171         42      0.34         220        86     0.08      2
Kampa       303         78      0.34         422       185     0.08      2
Cyl. Box     63         23      0.15         102        67     0.09      3
Shout       151         44      0.43         220        86     0.08      1

Table 3: Experimental results. For details see the text, at the beginning of Section 4.

Results are summarized in Tables 2 and 3. Table 2 shows the number of detected DRs
in the left and right images for both types of DRs (MSER- and MSER+). The number
of tentative correspondences is given in the last column of Table 2. Table 3 shows the
number of correspondences established in different stages of the algorithm. Column TC
repeats the number of tentative correspondences. Column rough EG displays the number
of tentative correspondences consistent with the rough estimate of the epipolar geometry.
The ratio of TC and rough EG determines the speed of the RANSAC algorithm. The
column headed EG + corr gives the number of correspondences consistent with rough
EG that passed the correlation test. Notice that the numbers are much higher than those in
the rough EG column. The final number of correspondences is given in the penultimate
column fine EG. Average distances from epipolar lines are presented in columns rough
d and fine d. The precision of the final estimated epipolar geometry is
very high, much higher than the precision of the rough EG. The last column shows the
number of mismatches (found manually).
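
The remark that the ratio of TC and rough EG determines the speed of RANSAC can be made concrete with the standard sample-count formula N = log(1 - p) / log(1 - w^s), where w is the inlier fraction, s the sample size and p the required confidence. The sketch below treats rough EG / TC from Table 3 as an estimate of w; the sample size of 7 and the confidence of 0.99 are assumptions, since the paper does not state them in this section, and the function name is hypothetical.

```python
import math


def ransac_iterations(inlier_ratio, sample_size=7, confidence=0.99):
    """Number of random samples needed to draw an all-inlier sample
    at least once with the given confidence (standard RANSAC bound)."""
    if not 0.0 < inlier_ratio <= 1.0:
        raise ValueError("inlier_ratio must lie in (0, 1]")
    p_good = inlier_ratio ** sample_size   # probability that one sample is all inliers
    if p_good >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good))


# (TC, rough EG) pairs copied from Table 3.
table3 = {
    "Bookshelf": (85, 25),
    "Valbonne": (49, 27),
    "Wash": (171, 42),
    "Kampa": (303, 78),
    "Cyl. Box": (63, 23),
    "Shout": (151, 44),
}

for scene, (tc, rough_eg) in table3.items():
    w = rough_eg / tc
    print(f"{scene:10s} inlier ratio {w:.2f} -> about {ransac_iterations(w)} samples")
```

Since the required number of samples grows roughly as w^(-s), scenes with a low fraction of correct tentative correspondences need many more samples than scenes with a high fraction.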
5 Conclusions
In the paper, a new method for wide-baseline matching was proposed. The three main
novelties are: the introduction of MSERs, robust matching of local features and the use
of multiple scaled measurement regions.
Figure 5: CYLINDRICAL BOX: Epipolar geometry (top) and matched regions (bottom left). Fully
affine distortion, a non-planar object, textured surface and a strong specular reflection are present in
the scene. SHOUT (bottom right), a scene with a change of illumination spectral power distribution.

References
[1] A. Baumberg. Reliable feature matching across widely separated views. In CVPR00, pages I:774-781, 2000.
[2] Y. Dufournaud, C. Schmid, and R. Horaud. Matching images with different resolutions. In CVPR00, pages I:612-618, 2000.
[3] W. Eric L. Grimson. Object Recognition. MIT Press, 1990.
[4] R. Hartley. In defence of the 8-point algorithm. In ICCV95, pages 1064-1070, 1995.
[5] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge, UK, 2000.
[6] D.G. Lowe. Object recognition from local scale-invariant features. In ICCV99, pages 1150-1157, 1999.
[7] K. Mikolajczyk and C. Schmid. Indexing based on scale invariant interest points. In Eighth Int. Conference on Computer Vision (Vancouver, Canada), 2001.
[8] F. Mindru, T. Moons, and L.J. van Gool. Recognizing color patterns irrespective of viewpoint and illumination. In CVPR99, pages I:368-373, 1999.
[9] P. Pritchett and A. Zisserman. Wide baseline stereo matching. In Proc. 6th International Conference on Computer Vision, Bombay, India, pages 754-760, January 1998.
[10] F. Schaffalitzky and A. Zisserman. Viewpoint invariant texture matching and wide baseline stereo. In Eighth Int. Conference on Computer Vision (Vancouver, Canada), 2001.
[11] C. Schmid and R. Mohr. Local grayvalue invariants for image retrieval. PAMI, 19(5):530-535, May 1997.
[12] R. Sedgewick. Algorithms. Addison-Wesley, 2nd edition, 1988.
[13] D. Tell and S. Carlsson. Wide baseline point matching using affine invariants computed from intensity profiles. In ECCV00, 2000.
[14] P.H.S. Torr and A. Zisserman. Robust parameterization and computation of the trifocal tensor. In BMVC96, page Motion and Active Vision, 1996.
[15] T. Tuytelaars and L. Van Gool. Content-based image retrieval based on local affinely invariant regions. In Proc Third Intl Conf. on Visual Information Systems, pages 493-500, 1999.
[16] T. Tuytelaars and L. Van Gool. Wide baseline stereo based on local, affinely invariant regions. In M. Mirmehdi and B. Thomas, editors, Proc British Machine Vision Conference BMVC2000, pages 412-422, London, UK, 2000.
[17] L. Vincent and P. Soille. Watersheds in digital spaces: an efficient algorithm based on immersion simulations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(6):583-598, June 1991.
Acknowledgments. The authors were supported by the European Union under project IST-2001-32184 and
by the Grant Agency of the Czech Republic under projects GACR 102/01/0971 and GACR 102/02/1539. The
SHOUT and WASH images were kindly made available by Tinne Tuytelaars.