Parallel Tracking and Mapping for Small AR Workspaces
Georg Klein and David Murray
Active Vision Laboratory, Department of Engineering Science, University of Oxford
e-mail: gk@robots.ox.ac.uk, dwm@robots.ox.ac.uk
ABSTRACT
This paper presents a method of estimating camera pose in an unknown scene. While this has previously been attempted by adapting SLAM algorithms developed for robotic exploration, we propose a system specifically designed to track a hand-held camera in a small AR workspace. We propose to split tracking and mapping into two separate tasks, processed in parallel threads on a dual-core computer: one thread deals with the task of robustly tracking erratic hand-held motion, while the other produces a 3D map of point features from previously observed video frames. This allows the use of computationally expensive batch optimisation techniques not usually associated with real-time operation: the result is a system that produces detailed maps with thousands of landmarks which can be tracked at frame-rate, with an accuracy and robustness rivalling that of state-of-the-art model-based systems.
1 INTRODUCTION

The majority of Augmented Reality (AR) systems operate with prior knowledge of the user's environment - i.e., some form of map. This could be a map of a city, a CAD model of a component requiring maintenance, or even a sparse map of fiducials known to be present in the scene. The application then allows the user to interact with this environment based on prior information on salient parts of this model (e.g. "this location is of interest" or "remove this nut from this component"). If the map or model provided is comprehensive, registration can be performed directly from it, and this is the common approach to camera-based AR tracking.

Figure 1: Typical operation of the system: Here, a desktop is tracked. The on-line generated map contains close to 3000 point features, of which the system attempted to find 1000 in the current frame. The 660 successful observations are shown as dots. Also shown is the map's dominant plane, drawn as a grid, on which virtual characters can interact. This frame was tracked in 18ms.

Unfortunately, a comprehensive map is often not available; often a small map of only an object of interest is available - for example, a single physical object in a room or a single fiducial marker. Tracking is then limited to the times when this known feature can be measured by some sensor, and this limits the range and quality of registration. This has led to the development of a class of techniques known (in the AR context) as extensible tracking [21, 14, 4, 28, 2] in which the system attempts to add previously unknown scene elements to its initial map, and these then provide registration even when the original map is out of sensing range. In [4], the initial map is minimal, consisting only of a template which provides metric scale; later versions of this monocular SLAM algorithm can now operate without this initialisation template.

The logical extension of extensible tracking is to track in scenes without any prior map, and this is the focus of this paper. Specifically, we aim to track a calibrated hand-held camera in a previously unknown scene without any known objects or initialisation target, while building a map of this environment. Once a rudimentary map has been built, it is used to insert virtual objects into the scene, and these should be accurately registered to real objects in the environment.

Since we do not use a prior map, the system has no deep understanding of the user's environment and this precludes many task-based AR applications. One approach for providing the user with meaningful augmentations is to employ a remote expert [4, 16] who can annotate the generated map. In this paper we take a different approach: we treat the generated map as a sandbox in which virtual simulations can be created. In particular, we estimate a dominant plane (a virtual ground plane) from the mapped points - an example of this is shown in Figure 1 - and allow this to be populated with virtual characters. In essence, we would like to transform any flat (and reasonably textured) surface into a playing field for VR simulations (at this stage, we have developed a simple but fast-paced action game). The hand-held camera then becomes both a viewing device and a user interface component.
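The paper's own plane-estimation procedure appears in a later section that is not part of this excerpt. Purely as an illustration of the idea of fitting a dominant plane to mapped points, a minimal RANSAC-style sketch (all names, thresholds and iteration counts are ours) might look like:

```python
import numpy as np

def fit_dominant_plane(points, iters=200, tol=0.01, rng=np.random.default_rng(0)):
    """Estimate a dominant plane (n, d) with n.p + d = 0 from an (N, 3) point cloud.

    Illustrative RANSAC sketch only; not the procedure used in the paper.
    """
    best_inliers = None
    for _ in range(iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:                       # degenerate (collinear) sample
            continue
        n /= norm
        d = -n.dot(sample[0])
        inliers = np.abs(points @ n + d) < tol
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # refine with an SVD fit over the inliers
    inlier_pts = points[best_inliers]
    centroid = inlier_pts.mean(axis=0)
    _, _, vt = np.linalg.svd(inlier_pts - centroid)
    n = vt[-1]                                # normal = direction of least variance
    return n, -n.dot(centroid), best_inliers
```

The grid and virtual characters of Figure 1 would then be drawn in the recovered plane.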
To further provide the user with the freedom to interact with the simulation, we require fast, accurate and robust camera tracking, all while refining the map and expanding it if new regions are explored. This is a challenging problem, and to simplify the task somewhat we have imposed some constraints on the scene to be tracked: it should be mostly static, i.e. not deformable, and it should be small. By small we mean that the user will spend most of his or her time in the same place: for example, at a desk, in one corner of a room, or in front of a single building. We consider this to be compatible with a large number of workspace-related AR applications, where the user is anyway often tethered to a computer. Exploratory tasks such as running around a city are not supported.

The next section outlines the proposed method and contrasts this to previous methods. Subsequent sections describe in detail the method used, present results and evaluate the method's performance.
2 METHOD OVERVIEW IN THE CONTEXT OF SLAM

Our method can be summarised by the following points:

- Tracking and Mapping are separated, and run in two parallel threads.
- Mapping is based on keyframes, which are processed using batch techniques (Bundle Adjustment).
- The map is densely initialised from a stereo pair (5-Point Algorithm).
- New points are initialised with an epipolar search.
- Large numbers (thousands) of points are mapped.

To put the above in perspective, it is helpful to compare this approach to the current state-of-the-art. To our knowledge, the two most convincing systems for tracking-while-mapping a single hand-held camera are those of Davison et al. [5] and Eade and Drummond [8, 7]. Both systems can be seen as adaptations of algorithms developed for SLAM in the robotics domain (respectively, these are EKF-SLAM [26] and FastSLAM 2.0 [17]) and both are incremental mapping methods: tracking and mapping are intimately linked, so the current camera pose and the position of every landmark are updated together at every single video frame.

Here, we argue that tracking a hand-held camera is more difficult than tracking a moving robot: firstly, a robot usually receives some form of odometry; secondly, a robot can be driven at arbitrarily slow speeds. By contrast, this is not the case for hand-held monocular SLAM, and so data-association errors become a problem and can irretrievably corrupt the maps generated by incremental systems. For this reason, both monocular SLAM methods mentioned go to great lengths to avoid data-association errors. Starting with covariance-driven gating ("active search"), they must further perform binary inlier/outlier rejection with Joint Compatibility Branch and Bound (JCBB) [19] (in the case of [5]) or Random Sample Consensus (RANSAC) [10] (in the case of [7]). Despite these efforts, neither system provides the robustness we would like for AR use.
This motivates a split between tracking and mapping. If these two processes are separated, tracking is no longer probabilistically slaved to the map-making procedure, and any robust tracking method desired can be used (here, we use a coarse-to-fine approach with a robust estimator). Indeed, data association between tracking and mapping need not even be shared. Also, since modern computers now typically come with more than one processing core, we can split tracking and mapping into two separately-scheduled threads. Freed from the computational burden of updating a map at every frame, the tracking thread can perform more thorough image processing, further increasing performance.
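As a minimal, purely illustrative sketch of this two-thread arrangement (the class and function names are ours, and the placeholder bodies stand in for the tracker and mapper described in the remainder of the paper):

```python
import queue
import threading

class Map:
    """Stand-in for the shared point map; the real tracker and mapper are far richer."""
    def track(self, frame):                 # robust, coarse-to-fine pose estimation
        return "pose"
    def needs_keyframe(self, frame, pose):  # decide whether to hand the frame to the mapper
        return False
    def add_keyframe(self, frame, pose):    # epipolar search for new points, etc.
        pass
    def bundle_adjust(self):                # expensive batch optimisation
        pass

keyframes = queue.Queue()                   # tracker -> mapper hand-off

def tracking_thread(world_map, frames):
    for frame in frames:                    # runs at frame-rate, never waits on the mapper
        pose = world_map.track(frame)
        if world_map.needs_keyframe(frame, pose):
            keyframes.put((frame, pose))

def mapping_thread(world_map):
    while True:                             # free of per-frame deadlines
        frame, pose = keyframes.get()
        world_map.add_keyframe(frame, pose)
        world_map.bundle_adjust()

world_map = Map()
threading.Thread(target=mapping_thread, args=(world_map,), daemon=True).start()
tracking_thread(world_map, frames=range(10))   # stand-in for the live video stream
```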
Next, if mapping is not tied to tracking, it is not necessary to use every single video frame for mapping. Many video frames contain redundant information, particularly when the camera is not moving. While most incremental systems will waste their time re-filtering the same data frame after frame, we can concentrate on processing some smaller number of more useful keyframes. These new keyframes then need not be processed within strict real-time limits (although processing should be finished by the time the next keyframe is added) and this allows operation with a larger numerical map size. Finally, we can replace incremental mapping with a computationally expensive but highly accurate batch method, i.e. bundle adjustment.
While bundle adjustment has long been a proven method for off-line Structure-from-Motion (SfM), we are more directly inspired by its recent successful applications to real-time visual odometry and tracking [20, 18, 9]. These methods build an initial map from five-point stereo [27] and then track a camera using local bundle adjustment over the N most recent camera poses (where N is selected to maintain real-time performance), achieving exceptional accuracy over long distances. While we adopt the stereo initialisation, and occasionally make use of local bundle updates, our method is different in that we attempt to build a long-term map in which features are constantly re-visited, and we can afford expensive full-map optimisations. Finally, in our hand-held camera scenario, we cannot rely on long 2D feature tracks being available to initialise features and we replace this with an epipolar feature search.
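To make the notion of batch refinement concrete, the following is a generic reprojection-error bundle adjustment sketch built on an off-the-shelf least-squares solver. It is only an illustration under simplifying assumptions (plain pin-hole model, no robust objective, no exploitation of sparsity) and is not the implementation used in the paper.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def project(points_w, rvec, tvec, fu, fv, u0, v0):
    """Plain pin-hole projection of world points into one camera (no distortion)."""
    p_cam = Rotation.from_rotvec(rvec).apply(points_w) + tvec
    u = u0 + fu * p_cam[:, 0] / p_cam[:, 2]
    v = v0 + fv * p_cam[:, 1] / p_cam[:, 2]
    return np.column_stack([u, v])

def residuals(params, n_cams, n_pts, cam_idx, pt_idx, obs_uv, intrinsics):
    """Reprojection error of every (keyframe, map point) measurement, stacked."""
    cams = params[:n_cams * 6].reshape(n_cams, 6)        # rotation vector + translation
    pts = params[n_cams * 6:].reshape(n_pts, 3)
    pred = np.empty_like(obs_uv, dtype=float)
    for c in range(n_cams):
        sel = cam_idx == c
        pred[sel] = project(pts[pt_idx[sel]], cams[c, :3], cams[c, 3:], *intrinsics)
    return (pred - obs_uv).ravel()

def bundle_adjust(cams0, pts0, cam_idx, pt_idx, obs_uv, intrinsics):
    """Jointly refine all keyframe poses and map points against their observations."""
    x0 = np.hstack([cams0.ravel(), pts0.ravel()])
    sol = least_squares(residuals, x0,
                        args=(len(cams0), len(pts0), cam_idx, pt_idx, obs_uv, intrinsics))
    n = len(cams0) * 6
    return sol.x[:n].reshape(-1, 6), sol.x[n:].reshape(-1, 3)
```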
3 FURTHER RELATED WORK

Efforts to improve the robustness of monocular SLAM have recently been made by [22] and [3]. [22] replace the EKF typical of many SLAM problems with a particle filter which is resilient to rapid camera motions; however, the mapping procedure does not in any way consider feature-to-feature or camera-to-feature correlations. An alternative approach is taken by [3], who replace correlation-based search with a more robust image descriptor which greatly reduces the probability of outlier measurements. This allows the system to operate with large feature search regions without compromising robustness. The system is based on the unscented Kalman filter, which scales poorly (O(N^3)) with map size, and hence no more than a few dozen points can be mapped. However, the replacement of intensity-patch descriptors with a more robust alternative appears to have merit.

Extensible tracking using batch techniques has previously been attempted by [11, 28]. An external tracking system or fiducial markers are used in a learning stage to triangulate new feature points, which can later be used for tracking. [11] employs classic bundle adjustment in the training stage and achieves respectable tracking performance when later tracking the learned features, but no attempt is made to extend the map after the learning stage. [28] introduces a different estimator which is claimed to be more robust and accurate; however, this comes at a severe performance penalty, slowing the system to unusable levels. It is not clear if the latter system continues to grow the map after the initial training phase.

Most recently, [2] triangulate new patch features on-line while tracking a previously known CAD model. The system is most notable for the evident high-quality patch tracking, which uses a high-DOF minimisation technique across multiple scales, yielding convincingly better patch-tracking results than the NCC search often used in SLAM. However, it is also computationally expensive, so the authors simplify map-building by discarding feature-feature covariances - effectively an attempt at FastSLAM 2.0 with only a single particle.

We notice that [15] have recently described a system which also employs SfM techniques to map and track an unknown environment - indeed, it also employs two processors, but in a different way: the authors decouple 2D feature tracking from 3D pose estimation. Robustness to motion is obtained through the use of inertial sensors and a fish-eye lens. Finally, our implementation of an AR application which takes place on a planar playing field may invite a comparison with [25], in which the authors specifically choose to track and augment a planar structure: it should be noted that while the AR game described in this paper uses a plane, the focus here lies on the tracking and mapping strategy, which makes no fundamental assumption of planarity.

4 THE MAP

This section describes the system's representation of the user's environment. Section 5 will describe how this map is tracked, and Section 6 will describe how the map is built and updated.

The map consists of a collection of M point features located in a world coordinate frame W. Each point feature represents a locally planar textured patch in the world. The j-th point in the map (p_j) has coordinates p_{jW} = (x_{jW}  y_{jW}  z_{jW}  1)^T in coordinate frame W. Each point also has a unit patch normal n_j and a reference to the patch source pixels.

The map also contains N keyframes: these are snapshots taken by the hand-held camera at various points in time. Each keyframe has an associated camera-centred coordinate frame, denoted K_i for the i-th keyframe. The transformation between this coordinate frame and the world is then E_{K_i W}. Each keyframe also stores a four-level pyramid of greyscale 8bpp images; level zero stores the full 640×480 pixel camera snapshot, and this is sub-sampled down to level three at 80×60 pixels.

The pixels which make up each patch feature are not stored individually; rather, each point feature has a source keyframe - typically the first keyframe in which this point was observed. Thus each map point stores a reference to a single source keyframe, a single source pyramid level within this keyframe, and a pixel location within this level. In the source pyramid level, patches correspond to 8×8 pixel squares; in the world, the size and shape of a patch depends on the pyramid level, the distance from the source keyframe's camera centre, and the orientation of the patch normal.

In the examples shown later the map might contain some M = 2000 to 6000 points and N = 40 to 120 keyframes.
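One compact way to picture this representation (a sketch only; the field names are ours, not the system's):

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Keyframe:
    E_kw: np.ndarray        # 4x4 transformation from world frame W to this keyframe's frame K_i
    pyramid: list           # four greyscale levels, 640x480 at level zero down to 80x60 at level three

@dataclass
class MapPoint:
    p_w: np.ndarray         # homogeneous position (x, y, z, 1) in the world frame W
    normal: np.ndarray      # unit patch normal n_j
    source_kf: Keyframe     # keyframe the patch was first observed in
    source_level: int       # pyramid level of the 8x8 source patch
    source_px: tuple        # pixel location of the patch within that level

@dataclass
class Map:
    points: list = field(default_factory=list)      # typically 2000-6000 MapPoints
    keyframes: list = field(default_factory=list)   # typically 40-120 Keyframes
```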
5 TRACKING

This section describes the operation of the point-based tracking system, with the assumption that a map of 3D points has already been created. The tracking system receives images from the hand-held video camera and maintains a real-time estimate of the camera pose relative to the built map. Using this estimate, augmented graphics can then be drawn on top of the video frame. At every frame, the system performs the following two-stage tracking procedure (a minimal sketch of this loop is given after the list):
1. A new frame is acquired from the camera, and a prior pose estimate is generated from a motion model.

2. Map points are projected into the image according to the frame's prior pose estimate.

3. A small number (50) of the coarsest-scale features are searched for in the image.

4. The camera pose is updated from these coarse matches.

5. A larger number (1000) of points is re-projected and searched for in the image.

6. A final pose estimate for the frame is computed from all the matches found.
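The per-frame loop above, sketched as code (illustrative only; the projection, patch search, motion model and robust pose update are described in the remainder of Section 5 and are passed in here as placeholders):

```python
def track_frame(frame, world_map, prev_pose, motion_model,
                project_points, search_patch, update_pose):
    """One iteration of the coarse-to-fine tracking procedure (steps 1-6)."""
    pose = motion_model.predict(prev_pose)                    # 1. prior pose estimate

    coarse = project_points(world_map, pose, max_points=50,   # 2./3. coarsest-scale points
                            coarsest_only=True)
    coarse_matches = [m for p in coarse
                      if (m := search_patch(frame, p)) is not None]
    pose = update_pose(pose, coarse_matches)                  # 4. coarse pose update

    fine = project_points(world_map, pose, max_points=1000)   # 5. re-project many points
    fine_matches = [m for p in fine
                    if (m := search_patch(frame, p)) is not None]
    pose = update_pose(pose, fine_matches)                    # 6. final robust pose estimate

    motion_model.update(pose)
    return pose
```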
5.1 Image acquisition

Images are captured from a Unibrain Fire-i video camera equipped with a 2.1mm wide-angle lens. The camera delivers 640×480 pixel YUV411 frames at 30Hz. These frames are converted to 8bpp greyscale for tracking and an RGB image for augmented display.

The tracking system constructs a four-level image pyramid as described in Section 4. Further, we run the FAST-10 [23] corner detector on each pyramid level. This is done without non-maximal suppression.
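As an illustration of this step only (the paper's implementation is its own), the pyramid construction and FAST detection could be written with OpenCV as:

```python
import cv2

def pyramid_and_corners(grey):
    """Four-level half-sampled pyramid with FAST-10 corners per level (OpenCV sketch)."""
    fast = cv2.FastFeatureDetector_create(threshold=10, nonmaxSuppression=False)
    levels = [grey]
    for _ in range(3):                       # 640x480 -> 320x240 -> 160x120 -> 80x60
        levels.append(cv2.pyrDown(levels[-1]))
    corners = [[kp.pt for kp in fast.detect(img, None)] for img in levels]
    return levels, corners
```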
5.2 Camera pose and projection

The subscript CW may be read as "frame C from frame W". The matrix E_CW contains a rotation and a translation component and is a member of the Lie group SE(3), the set of 3D rigid-body transformations.

To project points in the camera frame into the image, a calibrated camera projection model CamProj() is used:

\begin{pmatrix} u_i \\ v_i \end{pmatrix} = \mathrm{CamProj}(E_{CW}\, p_{iW}) \qquad (2)

We employ a pin-hole camera projection function which supports lenses exhibiting barrel radial distortion. The radial distortion model which transforms r -> r' is the FOV model of [6]. The camera parameters for focal length (f_u, f_v), principal point (u_0, v_0) and distortion (\omega) are assumed to be known:

\mathrm{CamProj}\begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} = \begin{pmatrix} u_0 \\ v_0 \end{pmatrix} + \begin{bmatrix} f_u & 0 \\ 0 & f_v \end{bmatrix} \frac{r'}{r} \begin{pmatrix} x/z \\ y/z \end{pmatrix} \qquad (3)

r = \sqrt{\frac{x^2 + y^2}{z^2}} \qquad (4)

r' = \frac{1}{\omega} \arctan\!\left(2 r \tan\frac{\omega}{2}\right) \qquad (5)
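A direct numerical transcription of Eqs. 3-5 (a sketch; the example parameter values at the end are ours, not calibration values from the paper):

```python
import numpy as np

def cam_proj(p_cam, fu, fv, u0, v0, omega):
    """Pin-hole projection with FOV radial distortion, following Eqs. 3-5."""
    x, y, z = p_cam[0], p_cam[1], p_cam[2]
    r = np.sqrt(x * x + y * y) / z                              # Eq. 4 (z > 0 assumed)
    if r > 1e-12:
        scale = np.arctan(2.0 * r * np.tan(omega / 2.0)) / (omega * r)   # r'/r, Eq. 5
    else:
        scale = 2.0 * np.tan(omega / 2.0) / omega               # limit as r -> 0
    u = u0 + fu * scale * x / z                                 # Eq. 3
    v = v0 + fv * scale * y / z
    return np.array([u, v])

# e.g. cam_proj(np.array([0.1, -0.05, 1.0, 1.0]), fu=385.0, fv=385.0,
#               u0=320.0, v0=240.0, omega=0.9)
```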
A fundamental requirement of the tracking (and also the mapping) system is the ability to differentiate Eq. 2 with respect to changes in camera pose E_CW. Changes to camera pose are represented by left-multiplication with a 4×4 camera motion M:

E'_{CW} = M E_{CW} = \exp(\mu)\, E_{CW} \qquad (6)

where the camera motion is also a member of SE(3) and can be minimally parametrised with a six-vector \mu using the exponential map. Typically the first three elements of \mu represent a translation and the latter three elements represent a rotation axis and magnitude. This representation of camera state and motion allows for trivial differentiation of Eq. 6, and from this, partial differentials of Eq. 2 of the form \partial u / \partial \mu_i, \partial v / \partial \mu_i are readily obtained in closed form. Details of the Lie group SE(3) and its representation may be found in [29].
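For reference, the exponential map of Eq. 6 can be written out numerically as below (an illustrative implementation using Rodrigues' formula; it is not taken from the paper):

```python
import numpy as np

def skew(w):
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_exp(mu):
    """exp() of a six-vector mu = (translation, rotation axis * angle) -> 4x4 matrix."""
    v, w = np.asarray(mu[:3], float), np.asarray(mu[3:], float)
    theta = np.linalg.norm(w)
    W = skew(w)
    if theta < 1e-9:                                   # small-angle approximation
        R, V = np.eye(3) + W, np.eye(3) + 0.5 * W
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta ** 2
        C = (1.0 - A) / theta ** 2
        R = np.eye(3) + A * W + B * (W @ W)            # Rodrigues' rotation
        V = np.eye(3) + B * W + C * (W @ W)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, V @ v
    return T

# pose update of Eq. 6:  E_cw_new = se3_exp(mu) @ E_cw
```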
5.3 Patch Search

To find a single map point p in the current frame, we perform a fixed-range image search around the point's predicted image location. To perform this search, the corresponding patch must first be warped to take account of viewpoint changes between the patch's first observation and the current camera position. We perform an affine warp characterised by a warping matrix A, where

A = \begin{bmatrix} \frac{\partial u_c}{\partial u_s} & \frac{\partial u_c}{\partial v_s} \\[4pt] \frac{\partial v_c}{\partial u_s} & \frac{\partial v_c}{\partial v_s} \end{bmatrix} \qquad (7)
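One simple way to evaluate such a warp matrix is by finite differences of the source-to-current pixel mapping, as sketched below. Here `project_to_current` is a hypothetical callable standing in for the full chain (back-projection of a source pixel onto the patch plane and re-projection into the current view), which is not reproduced in this excerpt.

```python
import numpy as np

def affine_warp_matrix(project_to_current, us, vs, step=1.0):
    """Approximate Eq. 7 by differencing the source -> current pixel mapping."""
    c0 = np.asarray(project_to_current(us, vs), float)
    cu = np.asarray(project_to_current(us + step, vs), float)   # one pixel along u_s
    cv = np.asarray(project_to_current(us, vs + step), float)   # one pixel along v_s
    return np.column_stack([(cu - c0) / step, (cv - c0) / step])
```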
[Figure 4 plot: "Map size and processing time for the desk video" - tracking time in ms (left axis) and map size (right axis) plotted against frame number.]
Figure 4: Map size (right axis) and tracking timings (left axis) for the desk video included in the video attachment. The timing spike occurs when tracking is lost and the system is attempting relocalisation.
[Figure 5 plots: maps produced by the two systems, and "3D trajectories from synthetic data (all axes metric)" showing ground truth, EKF-SLAM and the proposed method.]
Figure 5: Comparison with EKF-SLAM on a synthetic sequence. The left image shows the map produced by the system described here,
the centre image shows the map produced by an up-to-date implementation of EKF-SLAM [30]. Trajectories compared to ground truth are
shown on the right. NB. the different scale of the z-axis, as ground truth lies on z=3.
up-to-date enhancements such as JCBB [19]. We use a synthetic scene produced in a 3D rendering package. The scene consists of two textured walls at right angles, plus the initialisation target for the SLAM system. The camera moves sideways along one wall toward the corner, then along the next wall, for a total of 600 frames at 600×480 resolution.

The synthetic scenario tested here (continual exploration with no re-visiting of old features, pauses or "SLAM wiggles") is neither system's strong point, and not typical usage in an AR context. Nevertheless it effectively demonstrates some of the differences in the systems' behaviours. Figure 5 illustrates the maps output by the two systems. The method proposed here produces a relatively dense map of 6600 features, of which several are clearly outliers. By contrast, EKF-SLAM produces a sparse map of 114 features with fully accessible covariance information (our system also implicitly encodes the full covariance, but it is not trivial to access), of which all appear to be inliers.

To compare the calculated trajectories, these are first aligned so as to minimise their sum-squared error to ground truth. This is necessary because our system uses an (almost) arbitrary coordinate frame and scale. Both trajectories are aligned by minimising error over a 6-DOF rigid body transformation and a 1-DOF scale change. The resulting trajectories are shown in the right panel of Figure 5. For both trajectories, the error is predominantly in the z-direction (whose scale is exaggerated in the plot), although EKF-SLAM also fractionally underestimates the angle between the walls. Numerically, the standard deviation from ground truth is 135mm for EKF-SLAM and 6mm for our system (the camera travels 18.2m through the virtual sequence). Frames are tracked in a relatively constant 20ms by our system, whereas EKF-SLAM scales quadratically from 3ms when the map is empty to 40ms at the end of the sequence (although of course this includes mapping as well as tracking).
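This kind of alignment is a standard similarity fit (rotation, translation and scale). A minimal closed-form sketch in the style of Umeyama/Procrustes is shown below; it is our illustration, not necessarily the exact procedure behind the numbers above.

```python
import numpy as np

def align_similarity(est, gt):
    """Find s, R, t minimising sum ||s * R @ est_i + t - gt_i||^2 over trajectory points."""
    est, gt = np.asarray(est, float), np.asarray(gt, float)
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g
    U, D, Vt = np.linalg.svd(G.T @ E / len(est))       # cross-covariance SVD
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:       # avoid a reflection
        S[2, 2] = -1.0
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) * len(est) / (E ** 2).sum()
    t = mu_g - s * R @ mu_e
    aligned = (s * (R @ est.T)).T + t
    rmse = np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean())
    return s, R, t, rmse
```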
7.4 Subjective comparison with EKF-SLAM

When used on live video with a hand-held camera, our system handles quite differently from iterative SLAM implementations, and this affects the way in which an operator will use the system to achieve effective mapping and tracking.

This system does not require the "SLAM wiggle": incremental systems often need continual smooth camera motion to effectively initialise new features at their correct depth. If the camera is stationary, tracking jitter can initialise features at the wrong depth. By contrast, our system works best if the camera is focused on a point of interest, the user then pauses briefly, and then proceeds (not necessarily smoothly) to the next point of interest, or a different view of the same point.

The use of multiple pyramid levels greatly increases the system's tolerance to rapid motions and associated motion blur. Further, it allows mapped points to be useful across a wide range of distances. In practice, this means that our system allows a user to zoom in much closer (and more rapidly) to objects in the environment. This is illustrated in Figure 6 and also in the accompanying video file. At the same time, the use of a larger number of features reduces visible tracking jitter and improves performance when some features are occluded or otherwise corrupted.

Figure 6: The system can easily track across multiple scales. Here, the map is initialised at the top-right scale; the user moves closer in and places a label, which is still accurately registered when viewed from far away.

The system scales with map size in a different way. In EKF-SLAM, the frame-rate will start to drop; in our system, the frame-rate is not as affected, but the rate at which new parts of the environment can be explored slows down.

7.5 AR with a hand-held camera

To investigate the suitability of the proposed system for AR tasks, we have developed two simple table-top applications. Both assume a flat operating surface, and use the hand-held camera as a tool for interaction. AR applications are usable as soon as the map has been initialised from stereo; mapping proceeds in the background in a manner transparent to the user, unless particularly rapid exploration causes tracking failure.

The first application is Ewok Rampage, which gives the player control over Darth Vader, who is assaulted by a horde of ewoks. The player can control Darth Vader's movements using the keyboard, while a laser pistol can be aimed with the camera: the projection of the camera's optical axis onto the playing surface forms the player's cross-hairs. This game demonstrates the system's ability to cope with fast camera translations as the user rapidly changes aim.
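The cross-hair placement just described reduces to a ray-plane intersection. A small sketch, under the assumption that the estimated playing surface has been mapped to the plane z = 0 of the world frame (our simplification):

```python
import numpy as np

def crosshair_on_ground(E_cw):
    """Intersect the camera's optical axis with the plane z = 0.

    E_cw maps world points into the camera frame, so the camera centre and
    viewing direction in world coordinates come from its inverse.
    """
    E_wc = np.linalg.inv(E_cw)
    centre = E_wc[:3, 3]              # camera centre in world coordinates
    axis = E_wc[:3, 2]                # optical (+z) axis, expressed in world coordinates
    if abs(axis[2]) < 1e-9:
        return None                   # looking parallel to the plane
    lam = -centre[2] / axis[2]
    return centre + lam * axis if lam > 0 else None   # None if the plane is behind the camera
```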
The second application simulates the effects of a virtual magnifying glass and sun. A virtual convex lens is placed at the camera centre and simple ray-tracing is used to render the caustics onto the playing surface. When the light converges onto a small enough dot - i.e., when the user has the camera at the correct height and angle - virtual burn-marks (along with smoke) are added to the surface. In this way the user can annotate the environment using just the camera. This game demonstrates tracking accuracy.

These applications are illustrated in Figure 7 and are also demonstrated in the accompanying video file.

8 LIMITATIONS AND FUTURE WORK

This section describes some of the known issues with the system presented. This system requires fairly powerful computing hardware and this has so far limited live experiments to a single office; we expect that with some optimisations we will be able to run at frame-rate on mobile platforms and perform experiments in a wider range of environments. Despite current experimental limitations, some failure modes and some avenues for further work have become evident.
8.1 Failure modes

There are various ways in which tracking can fail. Some of these are due to the system's dependence on corner features: rapid camera motions produce large levels of motion blur which can decimate most corner features in the image, and this will cause tracking failure. In general, tracking can only proceed when the FAST corner detector fires, and this limits the types of textures and environments supported. Future work might aim to include other types of features - for example, image intensity edges are not as affected by motion blur, and often conveniently delineate geometric entities in the map.

The system is somewhat robust to repeated structure and lighting changes (as illustrated by figures showing a keyboard and CD-ROM disc being tracked) but this is purely the happy result of the system using many features with a robust estimator. Repeated structure in particular still produces large numbers of outliers in the map (due to the epipolar search making incorrect correspondences) and can make the whole system fragile: if tracking falls into a local minimum and a keyframe is then inserted, the whole map could be corrupted.

We experience three types of mapping failure: the first is a failure of the initial stereo algorithm. This is merely a nuisance, as it is immediately noticed by the user, who then just repeats the procedure; nevertheless it is an obstacle to a fully automatic initialisation of the whole system. The second is the insertion of incorrect information into the map. This happens if tracking has failed (or reached an incorrect local minimum as described above) but the tracking quality heuristics have not detected this failure. A more robust tracking quality assessment might prevent such failures; alternatively, a method of automatically removing outlier keyframes from the map might be viable. Finally, while the system is very tolerant of temporary partial occlusions, it will fail if the real-world scene is substantially and permanently changed.

8.2 Mapping inadequacies

Currently, the system's map consists only of a point cloud. While the statistics of feature points are linked through common observations in the bundle adjustment, the system currently makes little effort to extract any geometric understanding from the map: after initially extracting the dominant plane as an AR surface, the map becomes purely a tool for camera tracking. This is not ideal: virtual entities should be able to interact with features in the map in some way. For example, out-of-plane real objects should block and occlude virtual characters running into or behind them. This is a very complex and important area for future research.

Several aspects of mapping could be improved to aid tracking performance: the system currently has no notion of self-occlusion by the map. While the tracking system is robust enough to track a map despite self-occlusion, the unexplained absence of features it expects to be able to measure impacts tracking quality estimates, and may unnecessarily remove features as outliers. Further, an efficient on-line estimation of patch normals would likely be of benefit (our initial attempts at this have been too slow).

Finally, the system is not designed to close large loops in the SLAM sense. While the mapping module is statistically able to handle loop closure (and loops can indeed be closed by judicious placement of the camera near the boundary), the problem lies in the fact that the tracker's M-Estimator is not informed of feature-map uncertainties. In practical AR use, this is not an issue.

9 CONCLUSION

This paper has presented an alternative to the SLAM approaches previously employed to track and map unknown environments. Rather than being limited by the frame-to-frame scalability of incremental mapping approaches which mandate a sparse map of high-quality features [5], we implement the alternative approach, using a far denser map of lower-quality features.

Results show that on modern hardware, the system is capable of providing tracking quality adequate for small-workspace AR applications - provided the scene tracked is reasonably textured, relatively static, and not substantially self-occluding. No prior model of the scene is required, and the system imposes only a minimal initialisation burden on the user (the procedure takes three seconds). We believe the level of tracking robustness and accuracy we achieve significantly advances the state of the art.

Nevertheless, performance is not yet good enough for any untrained user to simply pick up and use in an arbitrary environment. Future work will attempt to address some of the shortcomings of the system and expand its potential applications.

ACKNOWLEDGEMENTS

This work was supported by EPSRC grant GR/S97774/01.
Figure 7: Sample AR applications using tracking as a user interface. Left: Darth Vader's laser gun is aimed by the camera's optical axis to defend against a rabid ewok horde. Right: The user employs a virtual camera-centred magnifying glass and the heat of a virtual sun to burn a tasteful image onto a CD-R. These applications are illustrated in the accompanying video.
REFERENCES

[1] S. Baker and I. Matthews. Equivalence and efficiency of image alignment algorithms. In Proc. IEEE Intl. Conference on Computer Vision and Pattern Recognition (CVPR'01), Hawaii, Dec 2001.
[2] G. Bleser, H. Wuest, and D. Stricker. Online camera pose estimation in partially known and dynamic scenes. In Proc. 5th IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'06), San Diego, CA, October 2006.
[3] D. Chekhlov, M. Pupilli, W. Mayol-Cuevas, and A. Calway. Real-time and robust monocular SLAM using predictive multi-resolution descriptors. In 2nd International Symposium on Visual Computing, November 2006.
[4] A. Davison, W. Mayol, and D. Murray. Real-time localisation and mapping with wearable active vision. In Proc. 2nd IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'03), Tokyo, October 2003.
[5] A. Davison, I. Reid, N. D. Molton, and O. Stasse. MonoSLAM: Real-time single camera SLAM. To appear in IEEE Trans. Pattern Analysis and Machine Intelligence, 2007.
[6] F. Devernay and O. D. Faugeras. Straight lines have to be straight. Machine Vision and Applications, 13(1):14-24, 2001.
[7] E. Eade and T. Drummond. Edge landmarks in monocular slam. In Proc. British Machine Vision Conference (BMVC'06), Edinburgh, September 2006. BMVA.
[8] E. Eade and T. Drummond. Scalable monocular slam. In Proc. IEEE Intl. Conference on Computer Vision and Pattern Recognition (CVPR'06), pages 469-476, New York, NY, 2006.
[9] C. Engels, H. Stewenius, and D. Nister. Bundle adjustment rules. In Photogrammetric Computer Vision (PCV'06), 2006.
[10] M. Fischler and R. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381-395, June 1981.
[11] Y. Genc, S. Riedel, F. Souvannavong, C. Akinlar, and N. Navab. Marker-less tracking for AR: A learning-based approach. In Proc. IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'02), Darmstadt, Germany, September 2002.
[12] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2004.
[13] P. Huber. Robust Statistics. Wiley, 1981.
[14] B. Jiang and U. Neumann. Extendible tracking by line auto-calibration. In Proc. IEEE and ACM International Symposium on Augmented Reality (ISAR'01), pages 97-103, New York, October 2001.
[15] R. Koch, K. Koeser, B. Streckel, and J.-F. Evers-Senne. Markerless image-based 3d tracking for real-time augmented reality applications. In WIAMIS, Montreux, 2005.
[16] T. Kurata, N. Sakata, M. Kourogi, H. Kuzuoka, and M. Billinghurst. Remote collaboration using a shoulder-worn active camera/laser. In 8th International Symposium on Wearable Computers (ISWC'04), pages 62-69, Arlington, VA, USA, 2004.
[17] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit. FastSLAM 2.0: An improved particle filtering algorithm for simultaneous localization and mapping that provably converges. In Proc. International Joint Conference on Artificial Intelligence, pages 1151-1156, 2003.
[18] E. Mouragnon, F. Dekeyser, P. Sayd, M. Lhuillier, and M. Dhome. Real time localization and 3d reconstruction. In Proc. IEEE Intl. Conference on Computer Vision and Pattern Recognition (CVPR'06), pages 363-370, New York, NY, 2006.
[19] J. Neira and J. Tardos. Data association in stochastic mapping using the joint compatibility test. In IEEE Trans. on Robotics and Automation, 2001.
[20] D. Nister, O. Naroditsky, and J. R. Bergen. Visual odometry. In Proc. IEEE Intl. Conference on Computer Vision and Pattern Recognition (CVPR'04), pages 652-659, Washington, D.C., June 2005. IEEE Computer Society.
[21] J. Park, S. You, and U. Neumann. Natural feature tracking for extendible robust augmented realities. In Proc. Int. Workshop on Augmented Reality, 1998.
[22] M. Pupilli and A. Calway. Real-time camera tracking using a particle filter. In Proc. British Machine Vision Conference (BMVC'05), pages 519-528, Oxford, September 2005. BMVA.
[23] E. Rosten and T. Drummond. Machine learning for high-speed corner detection. In Proc. 9th European Conference on Computer Vision (ECCV'06), Graz, May 2006.
[24] J. Shi and C. Tomasi. Good features to track. In Proc. IEEE Intl. Conference on Computer Vision and Pattern Recognition (CVPR'94), pages 593-600. IEEE Computer Society, 1994.
[25] G. Simon, A. Fitzgibbon, and A. Zisserman. Markerless tracking using planar structures in the scene. In Proc. IEEE and ACM International Symposium on Augmented Reality (ISAR'00), Munich, October 2000.
[26] R. C. Smith and P. Cheeseman. On the representation and estimation of spatial uncertainty. International Journal of Robotics Research, 5(4):56-68, 1986.
[27] H. Stewenius, C. Engels, and D. Nister. Recent developments on direct relative orientation. ISPRS Journal of Photogrammetry and Remote Sensing, 60:284-294, June 2006.
[28] R. Subbarao, P. Meer, and Y. Genc. A balanced approach to 3d tracking from image streams. In Proc. 4th IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR'05), pages 70-78, Vienna, October 2005.
[29] V. Varadarajan. Lie Groups, Lie Algebras and Their Representations. Number 102 in Graduate Texts in Mathematics. Springer-Verlag, 1974.
[30] B. Williams, G. Klein, and I. Reid. Real-time SLAM relocalisation. In Proc. 11th IEEE International Conference on Computer Vision (ICCV'07), Rio de Janeiro, October 2007.