Abstract

We present a structure-from-motion (SfM) pipeline for visual 3D modeling of a large city area using 360◦ field of view Google Street View images. The core of the pipeline combines state of the art techniques such as SURF feature detection, tentative matching by an approximate nearest neighbour search, relative camera motion estimation by solving the 5-pt minimal camera pose problem, and sparse bundle adjustment. The robust and stable camera poses estimated by PROSAC with soft voting and by scale selection using a visual cone test bring a high quality initial structure for bundle adjustment. Furthermore, searching for trajectory loops based on co-occurring visual words and closing them by adding new constraints for the bundle adjustment enforces the global consistency of camera poses and 3D structure in the sequence. We present a large-scale reconstruction computed from 4,799 images of the Google Street View Pittsburgh Research Data Set.

1. Introduction

Large scale 3D models of cities built from video sequences acquired by car mounted cameras provide richer 3D content than those built from aerial images only. A virtual reality system covering the whole world can be brought about by embedding such 3D content into Google Earth or Microsoft Virtual Earth in the near future. In this paper, we present a structure-from-motion (SfM) pipeline for visual 3D modeling of such a large city area using 360◦ field of view omnidirectional images.

Recently, work [27] demonstrated 3D modeling from perspective images exported from Google Street View images using piecewise planar structure constraints. Another recent related work [38] demonstrated the performance of an SfM pipeline which employs guided matching using epipolar geometries computed in previous frames, and robust camera trajectory estimation by computing camera orientations and positions individually for the calibrated perspective images acquired by the Point Grey Ladybug Spherical Digital Video Camera System [32]. This paper shows a large scale sparse 3D reconstruction using the original omnidirectional panoramic images.

Previously, city reconstruction has been addressed using aerial images [9, 3, 10, 22, 40, 41], which allowed reconstructing large areas from a small number of images. The resulting models, however, often lacked visual realism when viewed from the ground level since it was impossible to texture the facades of the buildings.

A framework for city modeling from ground-level image sequences working in real-time has been developed, e.g. in [1] and [5]. Work [5] uses SfM to reconstruct camera trajectories and 3D key points in the scene and fast dense image matching, assuming that there is a single gravity vector in the scene and all the building facades are ruled surfaces parallel to it. The system gives good results but the 3D reconstruction could not survive sharp camera turns when a large part of the scene moved away from the limited field of view of the cameras. A recent extension of [5] using a pair of calibrated fisheye lens cameras [12], which have hemispherical fields of view, could successfully reconstruct a trajectory with sharp turns. In this work, we assume a single moving camera which provides sparse image sequences only.

Short baseline SfM using simple image features [5], which performs real-time detection and matching, recovers camera poses and trajectory sufficiently well when all camera motions between consecutive frames in the sequence are small. On the other hand, wide baseline SfM based methods, which use richer features such as MSER [25], Laplacian-Affine, Hessian-Affine [28], SIFT [21], and SURF [2], are capable of producing feasible tentative matches under large changes of visual appearance between images induced by rapid changes of camera pose and illumination. Work [7] presented SfM based on wide baseline matching of SIFT features using a single omnidirectional camera and demonstrated the performance on indoor environments. We use SURF features [2] since they are the fastest among those features used for wide baseline matching and produce sufficiently robust tentative matches even on distorted omnidirectional images.
Figure 1. Camera trajectory computed by SfM. (a) Camera positions (red circles) exported into Google Earth [8]. To increase the visibility,
every 12th camera position in the original sequence is plotted. (b) The 3D model representing 4,799 camera positions (red circles) and
123,035 3D points (color dots).
Secondly, sets of tentative matches are constructed between pairs of consecutive images. The matching is achieved by finding features with the closest descriptors between the pair of images, which is done for each feature independently. When conflicts appear, we select the most discriminative match by computing the ratio between the first and the second best match. We use the Fast Library for Approximate Nearest Neighbors (FLANN) [29], which delivers approximate nearest neighbours significantly faster than exact matching thanks to using several random kd-trees.
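As an illustration, the following sketch shows how such ratio-test matching of SURF descriptors could look in Python, using OpenCV's FLANN-based matcher as a stand-in for a direct FLANN binding; the 0.8 ratio threshold is an illustrative value, not a parameter taken from the paper. The returned ratio scores can later serve as the discriminativity scores used for the PROSAC ordering described below.

```python
import cv2
import numpy as np

def tentative_matches(desc1, desc2, ratio=0.8):
    """Ratio-test matching of SURF descriptors with FLANN kd-trees.

    desc1, desc2 : float32 arrays of shape (n_i, 64), one row per SURF descriptor.
    Returns tentative matches (index_in_image1, index_in_image2) together with
    their first-to-second distance ratios (lower = more discriminative).
    """
    index_params = dict(algorithm=1, trees=4)         # FLANN_INDEX_KDTREE with 4 random kd-trees
    matcher = cv2.FlannBasedMatcher(index_params, dict(checks=64))
    matches, scores = [], []
    for m, n in matcher.knnMatch(desc1, desc2, k=2):   # two nearest neighbours per feature
        r = m.distance / (n.distance + 1e-12)          # ratio of first to second best distance
        if r < ratio:                                  # keep only discriminative matches
            matches.append((m.queryIdx, m.trainIdx))
            scores.append(r)
    return matches, scores
```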
Thirdly, tentative matches between each pair of consecutive images are verified through the epipolar geometry (EG) computed by solving the 5-point minimal relative pose problem for calibrated cameras [30]. The tentative matches are verified with a RANSAC based robust estimation [6] which searches for the largest subset of the set of tentative matches consistent with the given epipolar geometry. We use PROSAC [4], a simple modification of RANSAC, which brings good performance [33] by reducing the number of samples through ordered sampling [4]. The 5-tuples of tentative matches are drawn from the list ordered ascendingly by their discriminativity scores, which are the ratios between the distances of the first and the second nearest neighbours in the feature space. Finally, the tracks are constructed by concatenating inlier matches.
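A minimal sketch of this verification step is given below, assuming normalized perspective coordinates and using OpenCV's five-point solver with plain RANSAC as a stand-in for the PROSAC ordered sampling described above; the paper itself works on omnidirectional rays, and the threshold and confidence values here are illustrative assumptions.

```python
import cv2
import numpy as np

def verify_epipolar_geometry(x1, x2, thresh=1e-3):
    """Estimate the essential matrix between two calibrated views and keep inliers.

    x1, x2 : float64 arrays of shape (n, 2) with normalized image coordinates
             (pixel coordinates premultiplied by K^-1), one row per tentative match.
    Returns the essential matrix E and a boolean inlier mask.
    """
    # With normalized coordinates the calibration matrix is the identity.
    E, mask = cv2.findEssentialMat(x1, x2,
                                   cameraMatrix=np.eye(3),
                                   method=cv2.RANSAC,
                                   prob=0.999, threshold=thresh)
    return E, mask.ravel().astype(bool)
```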
The pairwise matches, obtained by the epipolar geometry validation, often contain incorrect matches lying on epipolar lines or in the vicinity of epipoles, since such matches may support the epipolar geometry without violating geometric consistency. In practice, such incorrect matches can be mostly filtered out by selecting only the tracks having a longer length. We reject tracks containing less than three features.

2.3. Robust Initial Camera Pose Estimation

Initial camera poses and positions in a canonical coordinate system are recovered by using the epipolar geometries of pairs of consecutive images computed in the stage of verifying tracks. The essential matrix E_{ij}, encoding the relative camera pose between frames i and j = i + 1, can be decomposed into E_{ij} = [t_{ij}]× R_{ij}. Although there exist four possible decompositions, the right one can be selected as the one which reconstructs the largest number of 3D points in front of both cameras. Having the normalized camera matrix [11] of the i-th frame P_i = [R_i | T_i], the normalized camera matrix P_j can be computed as

P_j = [R_{ij} R_i | R_{ij} T_i + γ t_{ij}]     (4)

where γ is the scale of the translation between frames i and j in the canonical coordinate system. The scale γ can be computed from any 3D point seen in at least three consecutive frames, but the precision depends on the uncertainty of the reconstructed 3D point. Therefore, a robust selection from the possible candidate scales has to be done while evaluating the quality of the computed camera position. The best scale is found by RANSAC maximizing the number of points that pass the "cone test" [13], which checks the intersection of pixel ray cones in a similar way as the feasibility test of L1- or L∞-triangulation [14, 15], see Algorithm 1. During the cone test, one pixel wide cones formed by four planes (up, down, left, and right) are cast around the matches and we test whether the intersection of the cones is empty or not using the LP feasibility test [23] or an exhaustive test [13], which is faster when the number of intersected cones is smaller than four.
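The decomposition of E and the cheirality selection, together with the pose chaining of Eq. (4), can be sketched as follows. This uses the perspective approximation (OpenCV's triangulation) purely for illustration; in the actual pipeline the test is performed on omnidirectional rays, and the function names are illustrative.

```python
import cv2
import numpy as np

def select_relative_pose(E, x1, x2):
    """Pick the (R, t) decomposition of E that puts the most points in front of both cameras.

    x1, x2 : arrays of shape (n, 2) with normalized image coordinates of the inlier matches.
    Returns (R_ij, t_ij) with ||t_ij|| = 1.
    """
    R1, R2, t = cv2.decomposeEssentialMat(E)          # the four classical candidates
    P0 = np.hstack([np.eye(3), np.zeros((3, 1))])     # reference camera [I | 0]
    best, best_count = None, -1
    for R, tt in [(R1, t), (R1, -t), (R2, t), (R2, -t)]:
        P1 = np.hstack([R, tt])
        X = cv2.triangulatePoints(P0, P1, x1.T, x2.T) # homogeneous 3D points, 4 x n
        X = X[:3] / X[3]
        depth1 = X[2]                                 # depth in the first camera
        depth2 = (R @ X + tt)[2]                      # depth in the second camera
        count = int(np.sum((depth1 > 0) & (depth2 > 0)))
        if count > best_count:
            best, best_count = (R, tt), count
    return best

def chain_pose(P_i, R_ij, t_ij, gamma):
    """Eq. (4): P_j = [R_ij R_i | R_ij T_i + gamma * t_ij]."""
    R_i, T_i = P_i[:, :3], P_i[:, 3:4]
    return np.hstack([R_ij @ R_i, R_ij @ T_i + gamma * t_ij])
```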
Algorithm 1 Construction of the Initial Camera Poses by Chaining Epipolar Geometries
Input:  {E_{i,i+1}}, i = 1, ..., n−1 ... Epipolar geometries of pairs of consecutive images.
        {m_i}, i = 1, ..., n−1 ... Matches (tracks) supporting the epipolar geometries.
Output: {P_i}, i = 1, ..., n ... Normalized camera matrices.
1:  P_1 := [I_{3×3} | 0_{3×1}] ... Set the first camera to be the origin of the canonical coordinates.
2:  for i := 1, ..., n − 1 do
3:    Decompose E_{i,i+1} and select the right rotation R and translation t where ||t|| = 1.
4:    {U_i} := 3D points computed by triangulating the matches {m_i} between frames i and i+1 using R and t.
5:    if i = 1 then
6:      P_{i+1} := [R A | R b + t] where P_i = [A | b].
7:      {X} := {U_i} ... Update 3D points.
8:    else
9:      Find the 3D points {U_{i−1,i+1}} in {U_i}, expressed in the i-th camera coordinates, seen in three images.
10:     Find the 3D points {X_{i−1,i+1}} in {X}, expressed in the canonical coordinates, seen in three images.
11:     k := 0, S_max := 0, N := |{X_{i−1,i+1}}| ... Initialization for the RANSAC cone test.
12:     while k ≤ N do
13:       k := k + 1 ... New sample.
14:       γ := ||X_{i−1,i+1}|| / ||A (U_{i−1,i+1} − b)|| ... The scale to be tested.
15:       P_k := [R A | R b + γ t] where P_i = [A | b].
16:       S_k := the number of matches spanning frames i−1, i, i+1 which are consistent with the motions P_{i−1}, P_i, and P_k.
17:       if S_k > S_max then
18:         P_{i+1} := P_k ... The best motion with scale so far.
19:         S_max := S_k ... The maximum number of supports so far.
20:         Update the termination length N.
21:       end if
22:     end while
23:     Update {X} by merging {U_{i−1,i+1}} and adding {U_i} \ {U_{i−1,i+1}}.
24:   end if
25: end for
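A simplified sketch of the scale search in lines 11–22 of Algorithm 1 is shown below, assuming the inputs described in the docstring. It replaces the LP-based cone test of [13] with a simple angular consistency check, tests every scale hypothesis exhaustively instead of using the adaptive RANSAC termination of line 20, and uses illustrative names and thresholds that are not from the paper.

```python
import numpy as np

def select_scale(X_canon, U_cam_i, P_prev, P_i, R, t, rays, max_angle_deg=0.5):
    """Simplified sketch of the scale search in Algorithm 1 (lines 11-22).

    X_canon  : (m, 3) 3D points in canonical coordinates seen in frames i-1, i, i+1.
    U_cam_i  : (m, 3) the same points triangulated in the i-th camera coordinates.
    P_prev, P_i : 3x4 normalized camera matrices of frames i-1 and i.
    R, t     : relative rotation/translation of frame i+1 w.r.t. frame i, ||t|| = 1.
    rays     : (m, 3, 3) observed unit ray directions of each match in frames i-1, i, i+1.
    """
    A, b = P_i[:, :3], P_i[:, 3]
    cos_tol = np.cos(np.deg2rad(max_angle_deg))
    best_P, best_support = None, -1

    def support(P_next):
        # Count matches whose canonical 3D point is consistent with all three motions,
        # i.e. reprojects within a small angle of its observed ray in every frame.
        s = 0
        for j in range(len(X_canon)):
            ok = True
            for P, r in zip((P_prev, P_i, P_next), rays[j]):
                v = P[:, :3] @ X_canon[j] + P[:, 3]      # point in that camera's coordinates
                n = np.linalg.norm(v)
                ok &= n > 0 and (v / n) @ r > cos_tol
            s += ok
        return s

    for k in range(len(X_canon)):                        # one scale hypothesis per point (line 14)
        gamma = np.linalg.norm(X_canon[k]) / np.linalg.norm(A @ (U_cam_i[k] - b))
        P_cand = np.hstack([R @ A, (R @ b + gamma * t.ravel()).reshape(3, 1)])
        s = support(P_cand)
        if s > best_support:
            best_P, best_support = P_cand, s
    return best_P, best_support
```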
2.4. Bundle Adjustment Enforcing Global Camera Pose Consistency

Even though the Google Street View data is not primarily acquired by driving the same street several times, there are some overlaps suitable for constructing loops that can compensate for the drift errors induced while processing the trajectory sequentially. We construct loops by searching for pairs of images observing the same 3D structure at different times in the sequence.

The knowledge of the GPS locations of Street View images truly alleviates the problem of image matching for loop closing, but does not remove it completely since common 3D structures can be seen even among relatively distant images. In this paper, we do not rely on GPS locations because the image matching achieved by using the image similarity matrix is potentially capable of matching such distant images, and it is always important for the vision community to see that a certain problem can be solved entirely using vision.
Building Image Similarity Matrix   SURF descriptors of each image are quantized into visual words using a visual vocabulary containing 130,000 words computed from urban area omnidirectional images. Next, term frequency–inverse document frequency (tf-idf) vectors [36, 17], which weight words occurring often in a particular document and downweight words that appear often in the database, are computed for each image with more than 50 detected visual words. Finally, the image similarity matrix M is constructed by computing the image similarities, which we define as the cosines of the angles between normalized tf-idf vectors, between all pairs of images.
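A compact sketch of how such a tf-idf similarity matrix could be built from per-image visual-word assignments is given below; the vocabulary size and the 50-word threshold follow the text, while the function name and the dense-array layout are illustrative assumptions.

```python
import numpy as np

def tfidf_similarity_matrix(word_ids_per_image, vocab_size=130000, min_words=50):
    """Build the cosine similarity matrix M between images described by visual words.

    word_ids_per_image : list of 1-D integer arrays, the quantized SURF words of each image.
    Returns M (n x n) with M[i, j] = cosine of the angle between the tf-idf vectors.
    Images with fewer than `min_words` detected words keep a zero vector.
    """
    n = len(word_ids_per_image)
    counts = np.zeros((n, vocab_size))
    for i, words in enumerate(word_ids_per_image):
        if len(words) >= min_words:
            np.add.at(counts[i], words, 1.0)           # term frequencies
    df = np.count_nonzero(counts, axis=0)              # document frequency of each word
    idf = np.log(n / np.maximum(df, 1))                # downweight words common in the database
    tfidf = counts * idf
    norms = np.linalg.norm(tfidf, axis=1, keepdims=True)
    tfidf = tfidf / np.maximum(norms, 1e-12)           # L2-normalize each tf-idf vector
    return tfidf @ tfidf.T                             # cosine similarities
```

For thousands of images and a 130,000-word vocabulary, a sparse matrix representation (e.g. scipy.sparse) would be preferable to the dense array shown here.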
Figure 3. Results of SfM with loop closing. (a) Trajectory before bundle adjustment. (b) Trajectory after bundle adjustment with loop
closing. Examples of the images used for the loop closing: (c) Frames 6597 and 8643. (d) Frames 6711 and 6895.
Loop Finding and Closing   First, we take the upper triangular part of M to avoid a duplicate search. Since the diagonal entries of M, which correspond to neighbouring frames in the sequence, essentially have high scores, the 1st to 50th diagonals are zeroed in order to exclude very small loops. Next, for the image I_i in the sequence, we select the image I_j as the one having the highest similarity score in the i-th row of M. Image I_j is a candidate for the endpoint of the loop which starts from I_i. Note that the use of an upper triangular matrix constrains j > i.

Next, the candidate image I_j is verified by solving the camera resectioning [31]. Triplets of the tentative 2D-3D matches, constructed by matching the descriptors of the 3D points associated to the images I_i and I_{i+1} with the descriptors of the features detected in the image I_j, are sampled by RANSAC to find the camera pose having the largest support, evaluated by the cone test again. The image I_{i+1}, which is the successive frame of I_i, is additionally used for performing the cone test with three images in order to enforce geometric consistency in the support evaluation of the RANSAC. Local optimization is achieved by repeated camera pose computation from all inliers [35] via SDP and SeDuMi [37]. If the inlier ratio is higher than 70%, the camera resectioning is considered successful and the candidate image I_j is accepted as the endpoint of the loop. The inlier matches are used to give additional constraints on the final bundle adjustment. We perform this loop search for every image in the sequence and test only the pair of images having the highest similarity score. If one increased the number of candidates to be tested, our pipeline would approach SfM [24, 19, 26] for unorganized images based on exhaustive pairwise matching.
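The candidate selection on the similarity matrix can be written directly as a few matrix operations; the sketch below follows the 50-diagonal exclusion described above, with an illustrative function name.

```python
import numpy as np

def loop_candidates(M, min_gap=50):
    """Propose one loop-closure candidate (i, j) per image from the similarity matrix M.

    Keeps only the upper triangle with the near-diagonal band zeroed, so that
    j > i + min_gap, and returns for each row i the column j with the highest score.
    """
    n = M.shape[0]
    upper = np.triu(M, k=min_gap + 1)          # drop the lower part and very small loops
    candidates = []
    for i in range(n):
        j = int(np.argmax(upper[i]))
        if upper[i, j] > 0:                    # skip rows with no admissible candidate
            candidates.append((i, j, float(upper[i, j])))
    return candidates
```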
Finally, very distant points, i.e. likely outliers, are filtered out, and a sparse bundle adjustment [20], modified in order to work with unit vectors in an approach similar to [18], refines both the points and the cameras.
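The kind of angular (unit-vector) residual that such a bundle adjustment can minimize is sketched below, here with SciPy's generic least-squares solver rather than the sparse bundle adjustment package [20] used in the paper; the rotation-vector parameterization and variable names are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation
from scipy.optimize import least_squares

def angular_residuals(params, n_cams, n_pts, cam_idx, pt_idx, rays):
    """Residuals between observed unit rays and the directions of the 3D points.

    params : concatenation of per-camera (rotation vector, translation) and 3D points.
    rays   : (n_obs, 3) observed unit ray directions; cam_idx/pt_idx index each observation.
    """
    cams = params[:n_cams * 6].reshape(n_cams, 6)
    pts = params[n_cams * 6:].reshape(n_pts, 3)
    R = Rotation.from_rotvec(cams[cam_idx, :3])        # one rotation per observation
    v = R.apply(pts[pt_idx]) + cams[cam_idx, 3:]       # points in camera coordinates
    v = v / np.linalg.norm(v, axis=1, keepdims=True)   # predicted unit rays
    return (v - rays).ravel()                          # small when the rays are explained

# Example call (x0 stacks the initial cameras and points; a sparse Jacobian pattern
# would be supplied via jac_sparsity for a reconstruction of this size):
# result = least_squares(angular_residuals, x0, method="trf",
#                        args=(n_cams, n_pts, cam_idx, pt_idx, rays))
```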
3. Experimental Results

We used 4,799 omnidirectional images of the Google Street View Pittsburgh Research Data Set. Since the input omnidirectional images have large distortion at the top and bottom, we clipped the original images by cropping 230 pixels from the top and 410 pixels from the bottom to obtain 3,328 × 1,024 pixel images, see Figure 2(b). Since the tracks are generated based on wide baseline matching, it is possible to save computation time by constructing the initial camera poses and 3D structure from a sparser image sequence. Our SfM was run on every second image in the sequence, i.e. 2,400 images were used to create a global reconstruction. The remaining 2,399 images were attached to the reconstruction in the final stage.
Figure 4. Resulting 3D model consisting of 2,400 camera positions (red circles) and 124,035 3D points (blue dots) recovered by our pipeline. (a) Initial estimation. (b) After bundle adjustment with loop closing.
The initial camera poses were estimated by computing the epipolar geometries of pairs of successive images and chaining them by finding the global scale of the camera translation, see Algorithm 1. The resulting trajectory is shown in Figure 3(a). After estimating the initial camera poses and reconstructing the 3D points, the pairs of images acquired at the same location at different times were searched for. The red lines in Figure 3(a) indicate links between the accepted image pairs. Figure 3(b) shows the camera trajectory after the bundle adjustment with the additional constraints obtained from loop closing. Figures 3(c) and (d) show examples of pairs of images used for closing the loops at frames (6597, 8643) and (6711, 6895), respectively. Furthermore, Figure 4 shows the camera positions and the 3D points of the initial recovery (a) and after the loop closing (b) in different views. In Figure 5, the recovered trajectory is compared to the GPS positions provided in the Google Street View Pittsburgh Research Data Set. The computational time spent in the different steps of the pipeline, implemented in MATLAB+MEX and running on a standard Core2Duo PC, is shown in Table 1. Since the method is scalable and therefore stores the intermediate results of the computation on a hard drive instead of in RAM, its performance could be improved by using a fast SSD drive instead of a standard SATA drive.
Figure 5. Comparison to the GPS provided in the Google Street View Pittsburgh Research Data Set. Camera trajectory by GPS (red line) and estimated camera trajectory by our SfM (blue line).

Step           Time [hours]
Detection      12.8
Matching        4.5
Chaining        1.0
Loop Closing    6.3
Bundle         14.5

Table 1. Computational time in hours. (Detection) SURF detection and description. (Matching) Tentative matching and computing EGs. (Chaining) Chaining EGs and computing scales. (Loop Closing) Searching and testing loops. (Bundle) Final sparse bundle adjustment.

Finally, the remaining 2,383 camera poses were computed by solving the camera resectioning in the same manner as used in the loop verification. Linear interpolation was used for the 16 cameras that could not be resectioned successfully. Figure 1(b) shows the 4,799 camera positions (red circles) and the 124,035 world 3D points (color dots) of the resulting 3D model.
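Attaching the remaining cameras amounts to absolute pose estimation from tentative 2D-3D matches. The sketch below uses OpenCV's RANSAC PnP solver on calibrated perspective observations as a stand-in for the generalized 3-point solver [31] and the cone-test support used in the paper; the reprojection threshold is illustrative, while the 70% inlier-ratio criterion follows the loop verification described above.

```python
import cv2
import numpy as np

def resection_camera(pts_3d, pts_2d, K, min_inlier_ratio=0.7):
    """Estimate the pose of an extra camera from tentative 2D-3D matches.

    pts_3d : (n, 3) world points, pts_2d : (n, 2) pixel observations, K : 3x3 calibration.
    Returns (R, t) mapping world to camera coordinates, or None if the support is too low.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d.astype(np.float64), pts_2d.astype(np.float64), K, None,
        reprojectionError=2.0, confidence=0.999)
    if not ok or inliers is None or len(inliers) < min_inlier_ratio * len(pts_3d):
        return None                                   # resectioning considered unsuccessful
    R, _ = cv2.Rodrigues(rvec)                        # rotation vector -> rotation matrix
    return R, tvec
```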
4. Conclusions

We demonstrated the recovery of the camera trajectory and 3D structure of a large city area from omnidirectional images and showed that the world can in principle be reconstructed from Google Street View images. We also showed that finding loops and using the additional constraints in the final bundle adjustment significantly improves the quality of the resulting camera trajectory and 3D structure. Since the street view images on Google Maps are approximately 10 times sparser than the original sequence from the Google Street View Pittsburgh Research Data Set, testing the performance of the proposed pipeline on such sparse sequences will be our next challenge.

Acknowledgment

The authors were supported by EC project FP6-IST-027787 DIRAC. T. Pajdla was supported by the Czech Government under the research program MSM-684 0770038. Any opinions expressed in this paper do not necessarily reflect the views of the European Community. The Community is not liable for any use that may be made of the information contained herein.

References

[1] A. Akbarzadeh, J.-M. Frahm, P. Mordohai, B. Clipp, C. Engels, D. Gallup, P. Merrell, M. Phelps, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewénius, R. Yang, G. Welch, H. Towles, D. Nistér, and M. Pollefeys. Towards urban 3D reconstruction from video. In 3DPVT06, May 2006.
[2] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (SURF). CVIU, 110(3):346–359, June 2008.
[3] C. Brenner and N. Haala. Fast production of virtual reality city models. IAPRS98, 32(4):77–84, 1998.
[4] O. Chum and J. Matas. Matching with PROSAC: Progressive sample consensus. In CVPR05, pages I: 220–226, 2005.
[5] N. Cornelis, K. Cornelis, and L. Van Gool. Fast compact city modeling for navigation pre-visualization. In CVPR06, pages 1339–1344, 2006.
[6] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, June 1981.
[7] T. Goedemé, M. Nuttin, T. Tuytelaars, and L. Van Gool. Omnidirectional vision based topological navigation. IJCV, 74(3):219–236, 2007.
[8] Google. Google Earth - http://earth.google.com/, 2004.
[9] A. Grün. Automation in building reconstruction. In Photogrammetric Week '97, pages 175–186, 1997.
[10] N. Haala, C. Brenner, and C. Stätter. An integrated system for urban model generation. In ISPRS Congress Comm. II, pages 96–103, 1998.
[11] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2003.
[12] M. Havlena, T. Pajdla, and K. Cornelis. Structure from omnidirectional stereo rig motion for city modeling. In VISAPP08, pages II: 407–414, 2008.
[13] M. Havlena, A. Torii, and T. Pajdla. Randomized structure from motion based on atomic 3D models from camera triplets. In CVPR09, 2009.
[14] F. Kahl. Multiple view geometry and the L∞-norm. In ICCV05, pages II: 1002–1009, 2005.
[15] Q. Ke and T. Kanade. Quasiconvex optimization for robust geometric reconstruction. PAMI, 29(10):1834–1847, 2007.
[16] M. Klopschitz, C. Zach, A. Irschara, and D. Schmalstieg. Generalized detection and merging of loop closures for video sequences. In 3DPVT, 2008.
[17] J. Knopp, J. Sivic, and T. Pajdla. Location recognition using large vocabularies and fast spatial matching. Research Report CTU–CMP–2009–01, CMP Prague, January 2009.
[18] M. Lhuillier. Effective and generic structure from motion using angular error. In ICPR06, pages I: 67–70, 2006.
[19] X. Li, C. Wu, C. Zach, S. Lazebnik, and J. Frahm. Modeling and recognition of landmark image collections using iconic scene graphs. In ECCV08, pages I: 427–440, 2008.
[20] M. Lourakis and A. Argyros. The design and implementation of a generic sparse bundle adjustment software package based on the Levenberg-Marquardt algorithm. Tech. Report 340, Institute of Computer Science – FORTH, August 2004.
[21] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, November 2004.
[22] H. Maas. The suitability of airborne laser scanner data for automatic 3D object reconstruction. In Ascona01, pages 291–296, 2001.
[23] A. Makhorin. GLPK: GNU linear programming kit - http://www.gnu.org/software/glpk, 2000.
[24] D. Martinec and T. Pajdla. Robust rotation and translation estimation in multiview reconstruction. In CVPR07, 2007.
[25] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. IVC, 22(10):761–767, September 2004.
[26] Microsoft. Photosynth - http://livelabs.com/photosynth, 2008.
[27] B. Micusik and J. Kosecka. Piecewise planar city 3D modeling from street view panoramic sequences. In CVPR09, 2009.
[28] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. IJCV, 65(1-2):43–72, 2005.
[29] M. Muja and D. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In VISAPP09, 2009.
[30] D. Nistér. An efficient solution to the five-point relative pose problem. PAMI, 26(6):756–770, June 2004.
[31] D. Nistér. A minimal solution to the generalized 3-point pose problem. In CVPR04, pages I: 560–567, 2004.
[32] Point Grey Research, Inc. Ladybug2 - http://www.ptgrey.com/products/ladybug2/index.asp, 2005.
[33] R. Raguram, J.-M. Frahm, and M. Pollefeys. A comparative analysis of RANSAC techniques leading to adaptive real-time random sample consensus. In ECCV08, pages 500–513, 2008.
[34] D. Scaramuzza, F. Fraundorfer, R. Siegwart, and M. Pollefeys. Closing the loop in appearance guided SfM for omnidirectional cameras. In OMNIVIS08, 2008.
[35] G. Schweighofer and A. Pinz. Globally optimal O(n) solution to the PnP problem for general camera models. In BMVC08, 2008.
[36] J. Sivic and A. Zisserman. Video Google: Efficient visual search of videos. In CLOR06, pages 127–144, 2006.
[37] J. Sturm. SeDuMi: A software package to solve optimization problems - http://sedumi.ie.lehigh.edu, 2006.
[38] J. Tardif, Y. Pavlidis, and K. Daniilidis. Monocular visual odometry in urban environments using an omnidirectional camera. In IROS08, 2008.
[39] A. Torii, M. Havlena, and T. Pajdla. Omnidirectional image stabilization by computing camera trajectory. In PSIVT09, pages 71–82, 2009.
[40] C. Vestri and F. Devernay. Using robust methods for automatic extraction of buildings. In CVPR01, pages I: 133–138, 2001.
[41] G. Vosselman and S. Dijkman. Reconstruction of 3D building models from laser altimetry data. IAPRS01, 34(3):22–24, 2001.