Autonomous Robots (2021) 45:407–420
https://doi.org/10.1007/s10514-021-09973-w

Scale-invariant localization using quasi-semantic object landmarks

Andrew Holliday1 · Gregory Dudek2

Received: 21 January 2020 / Accepted: 23 January 2021 / Published online: 25 February 2021
© The Author(s) 2021

Abstract
This work presents Object Landmarks, a new type of visual feature designed for visual localization over major changes in distance and scale. An Object Landmark consists of a bounding box b defining an object, a descriptor q of that object produced by a Convolutional Neural Network, and a set of classical point features within b. We evaluate Object Landmarks on visual odometry and place-recognition tasks, and compare them against several modern approaches. We find that Object Landmarks enable superior localization over major scale changes, reducing error by as much as 18% and increasing robustness to failure by as much as 80% versus the state of the art. They allow localization under scale change factors up to 6, where state-of-the-art approaches break down at factors of 3 or more.

Keywords  Visual features · Visual odometry · Place recognition · Robotic localization

Supplementary Information  The online version contains supplementary material available at https://doi.org/10.1007/s10514-021-09973-w.

Andrew Holliday  ahollid@cim.mcgill.ca
Gregory Dudek  greg.dudek@samsung.com
1 McGill University Center for Intelligent Machines, 3480 University Street, McConnell Engineering Building, Room 410, Montréal, QC H3A 0E9, Canada
2 Samsung AI Center, 1250 René-Lévesque, 37th floor, Montréal, Canada

1 Introduction

Visual localization is an important capability in mobile robotics. In order for a robot to operate "in the wild" in unstructured environments, it has to be able to perceive and understand its surroundings. It must recognize previously-visited locations under new conditions and perspectives, and estimate its own position in the world. Vision sensing is a modality well-suited to this task due to its low power requirements and richness of information. Much research (Brown and Lowe 2002; Bay et al. 2008; Simo-Serra et al. 2015; Yi et al. 2016, among others) has looked at how to make visual localization and place recognition more robust to changes in viewing angle and appearance, such as under variations in lighting or weather. But comparatively little attention has been given to the problem of localizing under large differences in scale, when a scene is viewed from two (or more) very different distances.

This problem can arise in numerous cases. One is that of repeated missions carried out by aquatic robots over a coral reef. A high-altitude robot might build a map of a reef, which is then used by a low-altitude robot on a subsequent mission to navigate in the reef. Another case is indoor navigation, in which a robot may first see a location from far away, and then later visit that location, viewing it much more closely, without having moved directly between those two views. The robot must recognize that these two disparate viewpoints contain the same scene, and determine the implied spatial relationship between the two views. As we will show in Sect. 4, state-of-the-art techniques such as the Scale-Invariant Feature Transform (SIFT) have poor robustness to scale changes greater than about 3×.
By contrast, humans can recognize known landmarks and accurately estimate their own positions over a very wide range of visual scales, and may possess this ability from as early as three years of age (Spencer and Darvizeh 1981). This work builds on the hypothesis that a key to human navigation over large scale changes is the use of semantically-meaningful objects as landmarks (Fig. 1). In our prior work (Holliday and Dudek 2018), we proposed an image feature we refer to as an Object Landmark. Object Landmarks combine learned object features like those used in Sünderhauf et al. (2015) with more traditional point features. They are composed of "off-the-shelf" components and require no environment-specific pre-training. We demonstrated that by using matches between object features to guide the search for point feature matches, accurate metric pose estimation could be achieved on image pairs exhibiting major scale changes.

Fig. 1  A homographic mapping between two images computed using Object Landmarks, as described in Sect. 3.2. Despite a 6× difference in visual scale, the highly-distinctive foreground object allows our system to determine an accurate homography between the images. See also Fig. 8

The present work builds on Holliday and Dudek (2018) in the following ways:

– We refine Object Landmarks by substituting new object-proposal and object-description components that improve speed and accuracy, and present new results using these enhancements.
– We evaluate two recent learned feature point extractors from the literature, the Learned Invariant Feature Transform (LIFT) and D2Net, and compare their performance with Object Landmarks.
– We show that Object Landmarks can be used for long-range place recognition, and report new results showing that they improve on the state of the art.

2 Background and related work

2.1 Classical visual features

Most classical methods for producing descriptors of image content are based on image gradients and responses to engineered filter functions. Most can be categorized as either point-feature methods, where a set of keypoints in an image are detected with associated descriptor vectors, or whole-image methods, where a single vector is computed to describe the entire image.

One notable point-feature method is SIFT, originally proposed by Lowe (1999). As the name implies, SIFT features are robust to some visual scale change. Their extraction is based on the principles of scale-space theory (Lindeberg 1994). Keypoints are taken as the extrema of difference-of-Gaussian functions applied at various scales to the image, and descriptors are computed from image patches rotated and scaled accordingly. Other widely-used point feature types include Speeded-Up Robust Features (SURF) (Bay et al. 2008) and Oriented FAST and Rotated BRIEF (ORB) (Rublee et al. 2011). Point features have the advantage of retaining explicit geometric information about an image.

Whole-image methods include GIST (Oliva and Torralba 2001) and bag-of-words approaches, among others. GIST describes an image based on its responses to a variety of Gabor-wavelet filters. In bag-of-words, the vector space of the descriptors of some point feature type, such as SIFT, is discretized to form a dictionary of "visual words". One then extracts point features from an image, finds the nearest word in the dictionary to each feature, and computes a histogram of word frequencies to serve as a whole-image descriptor. Classical visual feature methods like these tend to break down under large changes in perspective and appearance, since different perspectives on a scene can produce very different patterns of gradients in the image. Object Landmarks address this weakness by basing descriptors of image components on learned high-level abstractions.

A third category of classical methods partitions an image into components that can be interpreted as objects. Uijlings et al. (2013) propose Selective Search, which hierarchically groups pixels based on their low-level similarities. Zitnick and Dollar (2014) propose Edge Boxes, which detects edges in an image and outputs boxes that tightly enclose many edges. Both approaches propose image regions to treat as objects, but do not provide descriptors for those regions. Object Landmarks make use of Edge Box object proposals as the first step in the process, and build representations of these proposed objects that can be used for visual localization.
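As an illustration of the bag-of-words pipeline described earlier in this subsection, the following is a minimal sketch, not the authors' implementation; it assumes OpenCV's SIFT and scikit-learn's KMeans are available, and the vocabulary size is an arbitrary placeholder.

```python
# Minimal bag-of-words whole-image descriptor (illustrative sketch only).
import cv2
import numpy as np
from sklearn.cluster import KMeans

def sift_descriptors(image_path):
    """Extract SIFT descriptors from a grayscale image."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, desc = cv2.SIFT_create().detectAndCompute(img, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

def build_vocabulary(training_paths, n_words=256):
    """Cluster training descriptors into a dictionary of visual words."""
    all_desc = np.vstack([sift_descriptors(p) for p in training_paths])
    return KMeans(n_clusters=n_words, n_init=4).fit(all_desc)

def bow_descriptor(image_path, vocabulary):
    """Histogram of nearest visual words, normalized, as a whole-image descriptor."""
    words = vocabulary.predict(sift_descriptors(image_path))
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(np.float32)
    return hist / max(hist.sum(), 1.0)
```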
2.2 Visual localization

As described in Dudek and Jenkin (2010), robotic localization can broadly be divided into two problems. In "local localization", some prior on the robot's pose estimate is given, while in global localization, the robot's position must be estimated without any prior. Early work on these problems was carried out by Leonard and Durrant-Whyte (1991), MacKenzie and Dudek (1994), and Fox et al. (1999), among others. Most contemporary visual localization approaches are based on classical low-level methods of image description, and they suffer from the deficiencies of these methods described in Sect. 2.1.

Visual place recognition is a case of global localization. In this problem, the environment is modeled as a set of previously-observed scenes. New observations are determined either to match some previous scene, or to show a new scene. Fast Appearance-Based Mapping, or FAB-MAP, is a classic method proposed by Cummins and Newman (2010). It uses a bag-of-words scene representation, building a Bayesian model of visual-word occurrence in scenes. It performs well only under very small changes in perspective between the query and its nearest match. More recent systems, such as the Convolutional Autoencoder for Loop Closure (CALC) of Merrill and Huang (2018) and Region-VLAD of Khaliq et al. (2019), as well as the work of Chen et al. (2018) and Garg et al. (2018), have used Convolutional Neural Networks (CNNs) to improve on FAB-MAP's performance under appearance and perspective change.

Visual odometry is a case of "local localization" in which a robot attempts to estimate its trajectory from two or more images captured by its camera(s). Given a pair of images and certain other data, points or regions in one image can be matched to the other, and the matches can be used to triangulate the relative camera poses. Most approaches to date have established matches either using point-feature matching, as in PTAM by Klein and Murray (2007) and ORB-SLAM by Mur-Artal et al. (2015), or direct alignment of image intensities, as in LSD-SLAM by Engel et al. (2014). Both such approaches are quite limited in their robustness to changes in scale and perspective. Our proposed Object Landmark feature is designed to provide much greater robustness to scale change.

2.3 Learning visual features and localization

Much work has explored ways of learning to extract useful localization features from images. Linegar et al. (2016) and Li et al.
(2015) propose similar approaches: both identify distinctive image patches as "landmarks" over a traversal of an environment, and train Support Vector Machines (SVMs) to recognize each patch. The notion of using semantically-based observables for mapping or navigation can be traced to Kriegman et al. (1989), and there are many examples of its use in robotics, such as the Simultaneous Localization And Mapping (SLAM) system SLAM++ of Salas-Moreno et al. (2013), and the work of Galindo et al. (2005). Bowman et al. (2017) train Deformable Parts Models to detect several known object types, such as doorways and office chairs, and use these to propose semantic landmarks for use in SLAM, demonstrating state-of-the-art performance on small office environments. More recently, Li et al. (2019) use semantic landmarks to localize over image pairs with very wide baselines, focusing on changes in perspective (front vs. side or rear views) rather than changes in scale. All of these systems rely on some form of pre-training, either on the environment in question or on a known class of objects. Kaeli et al. (2014) use the distinctiveness of low-level feature histograms to cluster images from a robotic traversal of an environment, and their system requires no pre-training. However, they explicitly limit observed scale changes to 2.67×, ignoring images outside that range.

Other approaches make use of the semantic capabilities of Deep Neural Networks (DNNs) to provide visual descriptors. Sünderhauf et al. (2015) use the intermediate activations of an ImageNet-trained CNN (Russakovsky et al. 2015) as whole-image descriptors to perform place recognition. In their follow-up work (Sünderhauf et al. 2015), they use the same CNN feature extractor on image patches proposed by an object detector. They define a similarity score based on matching objects between images. Using this for place recognition, they report better precision and recall than FAB-MAP. Simo-Serra et al. (2015) train a CNN to produce local descriptors for keypoints from 64 × 64-pixel patches around those keypoints. By detecting SIFT keypoints and using their CNN to generate the descriptors, they show improved point-matching accuracy vs. plain SIFT under large rotations, lateral translations, and appearance changes. The LIFT of Yi et al. (2016), SuperPoint of DeTone et al. (2018), and D2Net of Dusmanu et al. (2019) all train CNNs that both detect keypoints and compute descriptors, and demonstrate further improvements over SIFT and Simo-Serra et al. (2015)'s method. None of this work reports experiments involving large scale changes, however.

Learning systems have also been trained to localize directly from images. One of the earliest efforts was that of Dudek and Zhang (1995, 1996), wherein a neural network was used to map edge statistics directly to 3D pose in a small indoor environment. Kendall and Cipolla (2016) train a CNN (PoseNet) to predict camera pose from an image of an outdoor environment. Mirowski et al. (2018) train a neural-network system to navigate through the graph of a city extracted from Google Street View. Because this system operates only on the limited set of viewpoints in the Street View graph, it is not suitable for real-world deployment. Both this system and PoseNet require a neural network to be trained from scratch for each new environment in which they operate, making deployment costly.
In all of this work, the proposed systems are either unsuited by design to localizing across large scale changes, or their evaluation does not include large scale changes, making their performance in such cases unknown.

3 Proposed system

The core idea of this work is that objects can be used as landmarks for coarse localization under major scale changes, because their semantics are robust to such changes. But more precise localization between viewpoints requires consideration not just of the object, but of its parts. If enough points on an object's surface in one view can be matched to points in the other, that one object can be enough to precisely localize a robot from a new viewpoint, even over significant scale change (Fig. 2).

Fig. 2  A schematic of the Object Landmarks extraction process. Blue boxes represent data, red boxes are operations. Bounding boxes b for objects are detected. The image patches enclosed by these boxes are passed through a CNN to get a descriptor q for each object, and a set P of SIFT features are also extracted from each patch. The Object Landmark feature consists of b, q, and P taken together

3.1 Object landmarks

We define an Object Landmark as a triplet, o = (b, q, P), where:

– b = ⟨x_left, y_top, w, h⟩ is a 2D bounding box defining the object's location in the image,
– q is a quasi-semantic descriptor vector of dimension k_q,
– P is a set of point features inside b. Each point feature p = (l, v) consists of a location l = ⟨x, y⟩ and a descriptor vector v of dimension k_v.

The Object Landmark extraction pipeline is as follows: first, object-proposal bounding boxes are computed from an image. The rectangular image patch corresponding to each box b is extracted and resized to a fixed-size square. This distorts the contents, but helps normalize for perspective change. Each resized patch is passed through a CNN trained for image classification on ImageNet, and the activation of an intermediate layer (flattened to a vector) is taken as the descriptor q of the object. We refer to q as a "quasi-semantic" descriptor, because it is a highly abstract representation of the image used to construct the CNN's semantic output, but lacks its own semantic interpretation. We use ImageNet-trained CNN features because they have been demonstrated to be useful for localization tasks in works such as Sünderhauf et al. (2015). Point features are also extracted from the original image. The set of point features inside an Object Landmark's b are designated the sub-features of the landmark, P.

3.1.1 Object proposals

In general, the more unique an object is, the better it will serve as a localization landmark, but the less likely it is to belong to any object class known in advance. For this reason, we prefer an object-detection mechanism that is class-agnostic and based on low-level image features, rather than a DNN-based detection method that may be biased by the contents of its training data. Selective Search (Uijlings et al. 2013) was shown to be effective for robust scale-invariant localization in Holliday and Dudek (2018), but in this work, we use Edge Boxes (Zitnick and Dollar 2014), which we found to be faster and more accurate.
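A minimal sketch of the extraction pipeline of Sects. 3.1–3.1.1 is given below. It is an illustration, not the authors' code: the Edge Boxes step is abstracted behind a `propose_boxes` callable and the CNN descriptor behind a `describe_patch` callable, both of which are assumptions of this sketch (one possible `describe_patch` is sketched after Sect. 4.1.2).

```python
# Illustrative Object Landmark extraction, o = (b, q, P); not the authors' code.
from dataclasses import dataclass
from typing import Callable, List, Tuple

import cv2
import numpy as np

Box = Tuple[int, int, int, int]  # (x_left, y_top, w, h)

@dataclass
class ObjectLandmark:
    b: Box                    # bounding box
    q: np.ndarray             # quasi-semantic descriptor (dimension k_q)
    P: List[Tuple[Tuple[float, float], np.ndarray]]  # sub-features (l, v)

def extract_object_landmarks(
    image_bgr: np.ndarray,
    propose_boxes: Callable[[np.ndarray], List[Box]],   # e.g. an Edge Boxes wrapper
    describe_patch: Callable[[np.ndarray], np.ndarray],  # CNN descriptor of Sect. 3.1.2
    patch_size: int = 64,
) -> List[ObjectLandmark]:
    # SIFT sub-features are extracted once from the full image (Sect. 3.1.3);
    # a point may fall inside several overlapping proposal boxes.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create(nOctaveLayers=3, contrastThreshold=0.04,
                           edgeThreshold=10, sigma=1.6)
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    if descriptors is None:
        keypoints, descriptors = [], np.empty((0, 128), np.float32)

    landmarks = []
    for (x, y, w, h) in propose_boxes(image_bgr):
        # Resize each patch to a fixed square before computing q (Sect. 3.1).
        patch = cv2.resize(image_bgr[y:y + h, x:x + w], (patch_size, patch_size))
        q = describe_patch(patch)
        P = [((kp.pt[0], kp.pt[1]), d)
             for kp, d in zip(keypoints, descriptors)
             if x <= kp.pt[0] < x + w and y <= kp.pt[1] < y + h]
        landmarks.append(ObjectLandmark((x, y, w, h), q, P))
    return landmarks
```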
3.1.2 Object descriptors

Our prior work (Holliday and Dudek 2018) employed a ResNet CNN (He et al. 2015) to extract q. We reported results for a range of input sizes and network layers on a held-out dataset. The best-performing descriptors ranged from k_q ≈ 32k to k_q ≈ 100k. For the large-scale place-recognition experiments in this work, we sought to reduce k_q, both to speed up the computation of distances between different qs and to limit our memory footprint. Our final choice and rationale are presented in Sect. 4.1.2.

3.1.3 Sub-features

We use SIFT for the landmark sub-features in our experiments, because of their well-demonstrated robustness to scale changes. We use the recommended SIFT configuration proposed in Lowe (1999): 3 octave layers, σ = 1.6, contrast threshold 0.04, and edge threshold 10. As the bounding boxes of object proposals can overlap one another, we allow one point feature to be associated with multiple Object Landmarks.

3.2 Transform estimation

Transform estimation is a limiting case of visual odometry with just two frames. Given a pair of images I_1 and I_2 of a scene, we use Object Landmarks to estimate the transform between the camera poses as follows. After extracting landmarks from both images, for every landmark i ∈ I_1 and every landmark j ∈ I_2, we compute the cosine distance between their qs:

d_{ij} = d_{\cos}(q_i, q_j) = 1 - \frac{q_i \cdot q_j}{\|q_i\| \, \|q_j\|}    (1)

The matches are all pairs of landmarks (i, j) for which:

j = \arg\min_{j'} d_{ij'}    (2)
i = \arg\min_{i'} d_{i'j}    (3)

Once these matches are found, we match the sub-features P_i, P_j of each match (i, j) in the same way, but using Euclidean distance between SIFT descriptors v instead of cosine distance. This produces a set of point matches of the form (l^1_{i,m}, l^2_{j,n}) between sub-features m ∈ P_i and n ∈ P_j. If no sub-feature matches can be found for (i, j), the pair produces a single point match between the centroids of b_i and b_j.

The point matches are then used to estimate either an essential matrix E or a homography matrix H that relates the two images. A homography H projects points from one 2D space to another, and is used when we know that the scene being viewed is roughly planar. Given four point matches, H is computed via least-squares so as to minimize their reprojection error. If the scene is not expected to be planar, we instead compute E via the five-point algorithm of Nistér (2004). Faugeras et al. (2001) describes the meaning of homographies and essential matrices. In either case, a Random Sample Consensus (RANSAC) process is used that estimates matrices from many different subsets of the point matches, and returns the matrix that is consistent with the largest number of them. Once H or E has been calculated, a set of possible transforms consistent with the matrix can be derived. Cheirality checking (Hartley 1993) is performed on each transform to eliminate those that would place any matched points behind either camera, leaving a single transform.
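A condensed sketch of this matching and estimation procedure follows, assuming the ObjectLandmark representation sketched in Sect. 3.1 and OpenCV's RANSAC-based estimators; K denotes the camera intrinsics. This is an illustration under those assumptions, not the authors' implementation.

```python
# Illustrative sketch of Sect. 3.2: mutual-nearest-neighbour matching of q
# descriptors guides SIFT sub-feature matching, and the pooled point matches
# feed OpenCV's RANSAC estimators. Not the authors' code.
import cv2
import numpy as np

def cosine_distances(Q1, Q2):
    """Pairwise cosine distance (Eq. 1) between rows of Q1 and Q2."""
    Q1n = Q1 / np.linalg.norm(Q1, axis=1, keepdims=True)
    Q2n = Q2 / np.linalg.norm(Q2, axis=1, keepdims=True)
    return 1.0 - Q1n @ Q2n.T

def mutual_matches(D):
    """Pairs (i, j) that are each other's nearest neighbours (Eqs. 2-3)."""
    return [(i, j) for i, j in enumerate(D.argmin(axis=1))
            if D[:, j].argmin() == i]

def point_matches(lm1, lm2):
    pts1, pts2 = [], []
    D = cosine_distances(np.stack([o.q for o in lm1]),
                         np.stack([o.q for o in lm2]))
    bf = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)  # mutual NN for SIFT
    for i, j in mutual_matches(D):
        Pi, Pj = lm1[i].P, lm2[j].P
        if Pi and Pj:
            ms = bf.match(np.stack([v for _, v in Pi]).astype(np.float32),
                          np.stack([v for _, v in Pj]).astype(np.float32))
            for m in ms:
                pts1.append(Pi[m.queryIdx][0])
                pts2.append(Pj[m.trainIdx][0])
        else:  # fall back to matching the box centroids
            (x1, y1, w1, h1), (x2, y2, w2, h2) = lm1[i].b, lm2[j].b
            pts1.append((x1 + w1 / 2, y1 + h1 / 2))
            pts2.append((x2 + w2 / 2, y2 + h2 / 2))
    return np.float32(pts1), np.float32(pts2)

def estimate_transform(lm1, lm2, K, planar=False, ransac_thresh=6.0):
    p1, p2 = point_matches(lm1, lm2)
    if len(p1) < 5:
        return ("H" if planar else "Rt", None)  # too few matches: failure
    if planar:
        H, _ = cv2.findHomography(p1, p2, cv2.RANSAC, ransac_thresh)
        return ("H", H)
    E, _ = cv2.findEssentialMat(p1, p2, K, cv2.RANSAC, 0.999, ransac_thresh)
    if E is None:
        return ("Rt", None)
    # recoverPose applies the cheirality check over the candidate transforms.
    _, R, t, _ = cv2.recoverPose(E, p1, p2, K)
    return ("Rt", (R, t))
```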
3.3 Place recognition

To perform place recognition using Object Landmarks, we propose a modification of the technique of Sünderhauf et al. (2015). This technique consists of matching the Object Landmarks in a query image against those of each candidate in the map set, and computing a similarity score for each candidate:

S_{q,c} = \frac{1}{\sqrt{n_q n_c}} \sum_{(i,j)} \left( 1 - d_{ij} s_{ij} \right)    (4)

where q is the query, c is a candidate, n_q and n_c are the number of Object Landmarks detected in q and c, and s_{ij} is a shape similarity score for b_i and b_j. The match to the query is \arg\max_c S_{q,c}. In Sünderhauf et al. (2015), s_{ij} was defined to reflect the difference in the size of the two boxes. Under large scale changes, landmarks corresponding to the same world object will have different sizes, so we instead base s_{ij} on the difference in aspect ratios:

s_{ij} = \exp\left( \left| \frac{w_i}{h_i} - \frac{w_j}{h_j} \right| \Big/ \max\left( \frac{w_i}{h_i}, \frac{w_j}{h_j} \right) \right)    (5)

The other significant modification we make is the addition of a geometric-consistency check. This is done by attempting transform estimation between q and c as described in Sect. 3.2: if this fails to estimate a valid transform, the candidate is rejected, and the candidate c′ with the next-highest S_{q,c′} is considered, and so on until a consistent match is found or all possibilities are exhausted. The system reports the matched candidate if one was found, as well as the estimated transform.
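The following sketch implements this scoring and the geometric-consistency check, using Eqs. 4–5 as reconstructed above and reusing the `cosine_distances`, `mutual_matches`, and `estimate_transform` helpers from the sketch in Sect. 3.2. It is illustrative only, under those assumptions, and is not the authors' implementation.

```python
# Illustrative place-recognition scoring (Sect. 3.3); not the authors' code.
import numpy as np

def shape_score(b_i, b_j):
    """Aspect-ratio term s_ij of Eq. 5."""
    r_i = b_i[2] / b_i[3]   # w / h
    r_j = b_j[2] / b_j[3]
    return np.exp(abs(r_i - r_j) / max(r_i, r_j))

def similarity(query_lms, cand_lms):
    """S_{q,c} of Eq. 4 over mutual nearest-neighbour landmark matches."""
    D = cosine_distances(np.stack([o.q for o in query_lms]),
                         np.stack([o.q for o in cand_lms]))
    total = sum(1.0 - D[i, j] * shape_score(query_lms[i].b, cand_lms[j].b)
                for i, j in mutual_matches(D))
    return total / np.sqrt(len(query_lms) * len(cand_lms))

def recognize(query_lms, map_lms, K):
    """Return the best geometrically-consistent candidate and its transform."""
    scores = [similarity(query_lms, c) for c in map_lms]
    for idx in np.argsort(scores)[::-1]:      # best score first
        result = estimate_transform(query_lms, map_lms[idx], K)
        if result[1] is not None:             # passed the consistency check
            return idx, scores[idx], result
    return None                               # all candidates rejected
```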
4 Transform estimation experiments

In this section, we evaluate our transform estimation approach against two datasets: the KITTI urban-driving dataset, and the Montreal scale-change dataset published in Holliday and Dudek (2018). The KITTI dataset consists of stereo image pairs with precise ground truth poses, allowing a strict metric evaluation of accuracy, but also contains considerable variation in viewing angle. The Montreal scale-change dataset varies much less in viewing angle, allowing an analysis more closely focused on scale, and its scene contents are very different from KITTI's, captured by a pedestrian in a dense urban environment.

4.1 KITTI odometry

The KITTI Odometry dataset (Geiger et al. 2012) consists of data gathered during 22 traversals of urban and suburban environments in the city of Karlsruhe, Germany, by a car equipped with a sensor package. The data includes stereo colour image pairs captured at a rapid rate, and the first eleven traversals have ground-truth poses for every stereo pair. We sample every fifth frame from the first eleven traversals to reduce the scope of our experiments. For each frame in a subsampled traversal, we estimate the transforms between that frame and the next ten frames using the images from the left-hand colour camera. Frame pairs in which the true angle between gaze directions is greater than the camera's FOV are discarded, as such pairs rarely share any content.

Fig. 3  A set of images from one of the subsampled KITTI odometry traversals. The second, third, and last images are from one, five, and ten frames after the first. This gives a sense of the range of visual scale changes that are present in our KITTI evaluation

The resulting frame pairs cover a wide range of scale changes and scene types (Fig. 3). When estimating transforms from monocular image pairs, the translation's magnitude cannot be determined. But the stereo image pairs of each KITTI frame allow us to estimate the magnitude. A disparity map D is computed for images I_left and I_right via the block-matching algorithm of Konolige (1998), such that D_{x,y} is the distance (in pixel space) between the pixel at location ⟨x, y⟩ in I_left and its match in I_right. An unscaled 3D world point L′ = ⟨x′, y′, z′⟩ returned by our transform-estimation algorithm is related to the true world point L by a scale factor: L = L′s. Given the corresponding pixel location l_left = ⟨x, y⟩ of L′ in I_left, the baseline b between the left and right cameras in meters, and the focal length f of the left camera in pixels, we can compute s:

s = \frac{b f}{z' D_{x,y}}    (6)

In principle, s is the same for any L′ and L, and relates the estimated unitless translation t′_est to the scaled translation t_est: t_est = t′_est s. We estimate s separately using l^{1,left}_{i,m} for each point match (l^{1,left}_{i,m}, l^{2,left}_{j,n}) produced by the transform-estimation technique in question, and we compute t_est as:

t_{est} = t'_{est} \times \mathrm{median}_k(s_k)    (7)

4.1.1 Error metrics

We examine two metrics of error in these experiments. The first is the error in the estimated pose between the two cameras 1 and 2, which we call the pose error:

t_{err} = \| {}^2_1 t_{est} - {}^2_1 t_{gt} \|_2    (8)
r_{err} = d_{\cos}({}^1_2 r_{est}, {}^1_2 r_{gt})    (9)
\text{pose error} = t_{err} + r_{err} \cdot \| {}^2_1 t_{gt} \|_2    (10)

where {}^j_i t_{est}, {}^j_i t_{gt}, and {}^j_i r_{est}, {}^j_i r_{gt} are the estimated and true positions and orientation quaternions of camera j in camera i's frame of reference. r_err is unitless and ranges from 0 to 2, while t_err has units of meters. To compose a single error metric, we scale r_err by the length of the ground-truth translation. Our rationale is that an error in estimated orientation at the beginning of a motion will contribute to an error in estimated translation at the end of the motion proportional to the real distance travelled.

The second error metric is the localization failure rate. Localization failure means that no transform could be estimated for an image pair. This can occur when too few point matches are discovered to estimate E or H, or when none of the candidate transforms pass the cheirality check. Such failure can be catastrophic in applications such as SLAM. Depending on the robot's sensors, no recovery may be possible, so failure rate is a very important metric.
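The following is a minimal sketch of the scale recovery (Eqs. 6–7) and the error metrics (Eqs. 8–10) as written above; quaternions are assumed to be unit-norm numpy arrays. It is an illustration, not the authors' evaluation code.

```python
# Illustrative computation of Eqs. 6-10; not the authors' code.
import numpy as np

def recover_scale(z_unscaled, disparities, baseline_m, focal_px):
    """Median per-match scale factor (Eqs. 6-7): s = b*f / (z' * D_xy)."""
    s = baseline_m * focal_px / (np.asarray(z_unscaled) * np.asarray(disparities))
    return float(np.median(s))

def cos_dist(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def pose_error(t_est, t_gt, r_est, r_gt):
    """Translation error plus rotation error scaled by the ground-truth
    translation length (Eq. 10)."""
    t_err = np.linalg.norm(t_est - t_gt)          # Eq. 8, meters
    r_err = cos_dist(r_est, r_gt)                 # Eq. 9, unitless in [0, 2]
    return t_err + r_err * np.linalg.norm(t_gt)   # Eq. 10
```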
4.1.2 Parameter search

Before evaluating our method, we performed a parameter-tuning phase on a "tuning set" made from traversals 01, 06, and 07. This contained 4,055 frame pairs, about 11% of the whole pose-annotated KITTI odometry dataset.

The first part was a search over network architectures and layers for a suitable extractor for q. We considered four families of networks: AlexNet (Krizhevsky et al. 2012), VGGNet (Simonyan and Zisserman 2014), ResNet (He et al. 2015), and DenseNet (Huang et al. 2017), testing several variants of each. To constrain k_q and allow fast q matching, inputs to the networks were resized to 64 × 64 pixels, and we did not consider any network layers with k_q > 2^16. For all experiments, we used the pre-trained ImageNet weights available through the PyTorch deep learning framework (Paszke et al. 2017). The full details of this architecture search and analysis of its results were presented in Holliday and Dudek (2020), but are omitted here for brevity.

Fig. 4  A schematic illustration of a DenseNet. The green block represents a composite dense-block. The contents of the first dense-block are displayed in the insert; the circled 'C' means concatenation of all inputs along the channel dimension. This basic structure, in which each layer's input is the concatenated outputs of every previous layer, is common to all dense-blocks, but they differ in their number of layers. Only convolutional (blue) and pooling (red) layers are shown

We found that most layers provided comparable accuracy, but varied widely in k_q. DenseNets produced much smaller k_q than other networks, while being among the most accurate. We settled on an intermediate layer of a DenseNet with 169 total layers, specifically the pooling layer labelled "transition3" in Fig. 4, as the q extractor for all subsequent experiments with Object Landmarks and object features. It has k_q = 2560. Its pose error on the tuning set was only 5% higher than the lowest-error layer, and its k_q was 40% smaller than the smallest layer with lower error.

We then ran a grid search over the parameters α and β of Edge Boxes (Zitnick and Dollar 2014), as well as the number of object proposals to use, n_p, and an upper limit on aspect ratio r_max, where the aspect ratio is defined as r = max(w/h, h/w). Our best results were obtained with α = 0.55, β = 0.55, and r_max = 6. Accuracy increased monotonically with n_p, but so did running time (since more boxes needed to be matched for each image pair), and the gains in accuracy diminished rapidly after n_p = 500, so 500 was settled on.

Each sub-feature p may be associated with multiple landmarks, so the final set of point matches used to estimate H or E may include matches of one point in I_1 to multiple points in I_2, or may include multiple instances of the same point match. We tested three approaches to handling this on our tuning set:

1. Score match (p_1, p_2) as 1/d(v_1, v_2), and keep only the highest-score match for each p_1,
2. Score match (p_1, p_2) as n/d(v_1, v_2), where n is the number of times that match occurs, and keep only the highest-score match for each p,
3. Simply keep all matches to each point.

Approach 3 outperformed 1 and 2 by 18% and 9% on pose error, respectively, and had no failures. We believe this is because it preserves the most information from the matching process. RANSAC handles this naturally: if some p_1 ∈ I_1 has multiple matches in I_2, RANSAC finds which match is most consistent with other matches. If (p_1, p_2) occurs multiple times, this suggests the system has more confidence in the match. It will be counted multiply by RANSAC, which has the effect of "up-weighting" the pair in proportion to the system's confidence. In all of our KITTI experiments, a RANSAC threshold of 6 was used to estimate the essential matrix E.
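As a concrete illustration of the q extractor settled on earlier in this subsection, the sketch below pulls the flattened "transition3" activation from torchvision's pre-trained DenseNet-169 with a forward hook; for a 64 × 64 input this yields a 2560-dimensional vector. This is an assumed re-implementation for illustration (torchvision ≥ 0.13), not the authors' code.

```python
# Illustrative q-descriptor extraction via the DenseNet-169 "transition3"
# activation (Sect. 4.1.2); an assumed re-implementation, not the authors' code.
import torch
import torchvision.models as models
import torchvision.transforms as T

_weights = models.DenseNet169_Weights.IMAGENET1K_V1
_model = models.densenet169(weights=_weights).eval()

_preprocess = T.Compose([
    T.ToTensor(),
    T.Resize((64, 64), antialias=True),   # 64 x 64 inputs, as in Sect. 4.1.2
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

_activation = {}
def _hook(_module, _inputs, output):
    _activation["q"] = output.detach()

# "transition3" is the named transition layer after the third dense block;
# its output ends with a pooling step.
_model.features.transition3.register_forward_hook(_hook)

@torch.no_grad()
def describe_patch(patch_rgb_uint8):
    """Flattened transition3 activation of an object patch (k_q = 2560).
    Expects an RGB uint8 patch (convert from OpenCV's BGR if needed)."""
    x = _preprocess(patch_rgb_uint8).unsqueeze(0)
    _model(x)
    return _activation["q"].flatten().numpy()
```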
4.1.3 Final evaluation

The final evaluation set consists of 35,744 frame pairs taken from traversals 00, 02 to 05, and 08 to 10. We evaluate the following approaches: SIFT feature matching alone; Object Landmark matching as described in Sect. 4; and plain object feature matching, where we use the same set of bs and qs as Object Landmark matching, but ignore P, using the centroids of matched objects as point matches. As the object-feature and SIFT approaches are components of Object Landmarks, comparing them to Object Landmarks serves as an ablation analysis.

To provide a comparison with contemporary learned image features, we also evaluate LIFT (Yi et al. 2016) and D2Net (Dusmanu et al. 2019) feature matching, the latter of which was published as this work was in the final stages of preparation. LIFT features were extracted with the network models trained with rotation augmentation. The D2Net weights used were those trained only on D2Net's authors' "MegaDepth" dataset, and D2Net was run in pyramidal mode to extract features at multiple scales. Only the 3500 highest-score D2Net features on an image were used, since this was roughly the average number of SIFT features detected in each KITTI image. Table 1 summarizes the performance of each approach.

Table 1  Performance of the five methods evaluated on the KITTI odometry set
Feature type             Mean pose error (m)    # failures
Object Landmarks         40.402                 1
Plain object features    67.918                 94
SIFT                     127.062                56
LIFT                     104.279                105
D2Net                    16.578                 5
Object Landmarks are by far the best. They outperform the next-best method by 68% on pose error, and result in only one localization failure, while the next-best method results in 56

Fig. 5  Comparison of pose error (a) and failure rates (b) on image pairs, clustered by the true distance between the images. Upper and lower error bars indicate the mean of only the errors above and below the whole cluster's mean error, respectively. Image pairs for which failure occurred under some method are not included when computing the mean pose error for that method on that cluster

As Fig. 5a shows, for frame separation d_sep > 20 m, transforms estimated from Object Landmarks are 50–70% more accurate than SIFT features alone. Plain object features give lower errors than SIFT at d_sep ≥ 30 m, but their error is consistently greater than complete Object Landmarks when d_sep ≥ 20 m, and the difference grows with d_sep. This shows that the quasi-semantic descriptors provided by the CNN are much more robust to changes in scale than are SIFT descriptors. Figure 5b reveals a greater disparity: the failure rate of plain object features is about double that of SIFT features at separations greater than 10 m, while over the 35,744 frame pairs, just one failure occurs with Object Landmarks. This shows that the more numerous and precise matches provided by Object Landmarks' sub-features are key to reliable transform estimation, and the use of q descriptors to guide SIFT matching greatly improves accuracy. At all d_sep, LIFT feature matching is notably less accurate than Object Landmarks, and has the highest failure rate of any evaluated method. D2Net features show 59% better accuracy than Object Landmarks on this evaluation, concentrated at small values of d_sep. But they fail 5 times as often as Object Landmarks, a crucial fact for some applications. In light of the results in Sect. 4.2, it appears D2Net's reduced error is mostly due to better robustness to viewing angle.

4.2 Montreal scale-change dataset

Collected from around McGill University's campus in downtown Montreal, this dataset is designed to capture large, nearly uniform changes in visual scale in real-world environments. The dataset consists of 31 image pairs formed from two to three images of 11 diverse scenes. Each scene contains a central subject chosen to be both approximately planar and visually rich. Each pair of images has ground truth in the form of ten manually-annotated point matches (g_near, g_far), where g = ⟨x_gt, y_gt⟩. We have made the dataset available online¹. Figure 6 displays some examples.

Fig. 6  Examples from the Montreal scale-change dataset. Each row depicts all images of one scene, ordered from far to near

¹ http://www.cim.mcgill.ca/~mrl/montreal_scale_pairs/

We evaluate the same five feature types as on KITTI: SIFT, plain object features, Object Landmarks, LIFT, and D2Net. Since these scenes are roughly planar, we use the point matches from each method to compute a homography H, which projects points from one image into the space of the other. For each image pair, we project each g_near to the far image, and vice-versa, and compute the sum of the distance of each projected point to its true match. This is the symmetric transfer error:

STE = \sum_{i}^{N} \| g^{far}_i - H g^{near}_i \|_2 + \| g^{near}_i - H^{-1} g^{far}_i \|_2    (11)

If no valid H was found (localization failure), we used a fixed error value of 10^7, which was approximately double the maximum error we observed on any image pair for any method giving a valid H. We define the scale change between a pair of ground-truth point matches i, j as:

\text{scale change}_{i,j} = \frac{\| g^{near}_i - g^{near}_j \|_2}{\| g^{far}_i - g^{far}_j \|_2}    (12)
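A minimal sketch of these two measures, as reconstructed in Eqs. 11–12 above, is given below; it is illustrative only and not the authors' evaluation code.

```python
# Illustrative symmetric transfer error (Eq. 11) and scale change (Eq. 12).
import numpy as np

def _project(H, pts):
    """Apply homography H to an (N, 2) array of points."""
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

def symmetric_transfer_error(H, g_near, g_far):
    """Eq. 11: forward plus backward reprojection distances, summed."""
    fwd = np.linalg.norm(g_far - _project(H, g_near), axis=1)
    bwd = np.linalg.norm(g_near - _project(np.linalg.inv(H), g_far), axis=1)
    return float(np.sum(fwd + bwd))

def median_scale_change(g_near, g_far):
    """Median of Eq. 12 over all pairs of ground-truth matches."""
    ratios = [np.linalg.norm(g_near[i] - g_near[j]) /
              np.linalg.norm(g_far[i] - g_far[j])
              for i in range(len(g_near)) for j in range(i + 1, len(g_near))]
    return float(np.median(ratios))
```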
For each image pair, we plot the median scale change over all pairs of ground-truth point matches versus the logarithmic STE in Fig. 7. Because Edge Boxes is a stochastic algorithm, and because we are here dealing with a very small dataset, we ran both object-based methods ten times on the dataset, and plotted the mean of the ten trials. All input images were resized to 900 × 1200 pixels before being processed. The parameters for SIFT, Edge Boxes, LIFT, and D2Net were those described in Sect. 4.1. A RANSAC threshold of 75 was used for plain object features, as it was found to give better performance, while a threshold of 6 was used for all other methods.

Table 2  Performance of the five methods on the Montreal dataset
Feature type             Mean log STE    # failures
Object Landmarks         3.767           0
Plain object features    4.061           0
SIFT                     4.641           8
LIFT                     4.934           10
D2Net                    4.523           4
Object Landmarks here have the lowest errors, and again show far fewer total failures than any other method

Fig. 7  Logarithmic STE of the homographies computed using each method on each image pair over the Montreal dataset. Best-fit lines to the results are plotted for each feature type. The black horizontal line indicates 10^7, the STE value that was substituted when no transform could be estimated. Points lying on this line indicate that the method in question failed to estimate a transform for this image pair

Table 2 summarizes the performance of each method. Figure 7 shows that Object Landmarks consistently outperform SIFT, LIFT, and D2Net features at scale changes greater than 2.5×. Both SIFT and LIFT provide wildly inaccurate estimates, or fail outright, on all but one image pair with scale change above 4×. D2Net fares better but, surprisingly in light of the results in Sect. 4.1.3, still has 17% greater logarithmic STE than Object Landmarks, and fails on four instances. Object Landmarks also outperform plain object features. The difference is greatest at small scale changes, where SIFT on its own performs well. This is expected, as plain object features lack spatial precision, and their more scale-robust descriptors make little difference to matching accuracy under small scale changes.

A homography H can also be used to map a whole image into the space of another. When this is done with the H matrices computed for these image pairs, they ought to transform the farther image of the pair into a "zoomed-in" image that resembles the nearer image. This allows a qualitative assessment of the results of these experiments. Some example homographies for a large-scale-change pair are displayed in Fig. 8. The improvement of Object Landmarks over other methods is especially striking in this assessment. Due to space constraints, we provide the whole set as supplementary material.

5 Place recognition experiments

For each place-recognition experiment, we use images with ground truth camera poses captured over a trajectory by a sensor-equipped vehicle. We sample the images from the vehicle's trajectory such that every consecutive pair of frames is separated by either some minimum distance, d_min, or a minimum angle between gaze directions, θ_min. These form the sampled set T. We then treat each image in T as a query q against a "map" set M = T \ {q}. The method under evaluation is used to find a match m ∈ M. We repeat this process over a range of different values of d_min; larger values of d_min require a place-recognition method to recognize scenes across greater scale changes. For each value of d_min, we perform multiple subsamplings so that similar numbers of queries are performed for each value of d_min. θ_min is set to the horizontal field-of-view (FOV) of the dataset's images.
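A minimal sketch of this subsampling protocol is given below, under the simplifying assumption that the gaze direction is summarized by a yaw angle; it is illustrative only, not the authors' protocol code.

```python
# Illustrative trajectory subsampling: keep a frame once the robot has moved
# at least d_min meters or turned at least theta_min radians since the last
# kept frame. Not the authors' code; yaw stands in for the gaze direction.
import numpy as np

def subsample_trajectory(positions, yaws, d_min, theta_min):
    """positions: (N, 3) camera positions; yaws: (N,) gaze angles in radians."""
    kept = [0]
    for i in range(1, len(positions)):
        moved = np.linalg.norm(positions[i] - positions[kept[-1]])
        turned = abs((yaws[i] - yaws[kept[-1]] + np.pi) % (2 * np.pi) - np.pi)
        if moved >= d_min or turned >= theta_min:
            kept.append(i)
    return kept  # indices of the sampled set T
```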
In a place-recognition context, localization across large changes in scale can mean that multiple map images are "correct" matches to a query, if they view parts of the same scene. Evaluating place recognition by counting correct and incorrect matches, or computing precision and recall, is therefore inappropriate. Instead, where possible, we perform transform estimation between q and m as described in Sect. 3.2, and compute the error in the resulting pose estimate of q. This is more relevant to real robotic localization than a simple correct-incorrect tally: what matters is that the robot have an accurate estimate of its pose, not which specific image from its database it used to estimate that pose.

In these experiments, we evaluate the place recognition method described in Sect. 3.3. We use the same Object Landmark configuration as in the transform estimation experiments, except that we set n_p = 250 for Edge Boxes, to reduce the computational burden of matching queries over a dataset. We compare this method against FAB-MAP 2.0 (Cummins and Newman 2011), using the open-source implementation openFABMAP (Glover et al. 2011) with SURF features, as they were used in the original work on FAB-MAP 2.0. We tested using FAB-MAP 2.0 with SIFT features, but found this had reduced accuracy vs. SURF. We also evaluate two recent CNN-based place recognition schemes: CALC (Merrill and Huang 2018), and Region-VLAD (Khaliq et al. 2019). In fairness, we modify CALC by concatenating the descriptors produced for each colour channel instead of using only the descriptor of a grayscale image. This improves its performance slightly. For all methods, the transform estimation step is performed using Object Landmarks, so that differences in accuracy will be due only to the place recognition method. We perform these experiments on the KITTI odometry benchmark, as well as on a subset of the COLD Saarbrucken dataset (Pronobis and Caputo 2009).

Fig. 8  Homographies computed by each of the methods for an image pair with 6× scale change (panel labels: Far Image, Near Image, SIFT, LIFT, D2Net, Object Landmarks, Object features w/o sub-features, Failure). SIFT failed to produce any valid homography, so is not displayed here

5.1 Outdoor: KITTI odometry

Our primary place-recognition experiment was performed on traversal 02 from the KITTI odometry benchmark (Geiger et al. 2012). It is one of the longest traversals and has very little self-overlap. FAB-MAP 2.0 was pre-trained on images sampled with a minimum separation of 30 meters from KITTI odometry traversals 01, 04, 05, 06, 08, 09, and 10 (none of which overlap with 02). The other methods required no pre-training. We conducted experiments with d_min values of 2, 5, 10, 20, 40, and 80 meters.
To more evenly compare FAB-MAP 2.0 with the other methods, which cannot declare the query to be a new location, the most probable match reported by FAB-MAP 2.0 was used, regardless of whether it indicated the query was more likely a new location. The metric of main interest is the pose error defined in Eq. 10. The results are summarized in the "KITTI" section of Table 3, showing that Object Landmarks outperform the other methods significantly. t_err and r_err are also presented to show how much each contributes to the pose error: for all methods, about 74–79% of error is translational, and both components show the same relationship between the methods as the mean pose error.

Table 3  Results from the place-recognition experiments
                     KITTI                                                       COLD
Method         Mean pose error (m)   Mean t_err (m)   Mean r_err   # failures    Mean dist. to match (m)   Failure rate (%)
OL             128.9                 101.8            0.077        0             5.796                     2.43
FAB-MAP 2.0    279.2                 208.2            0.169        0             6.565                     1.42
CALC           238.8                 189.7            0.132        2             7.168                     18.33
Region-VLAD    156.2                 123.8            0.097        0             6.392                     1.55
Failures are queries where no spatially-consistent match was found. Object Landmarks have the best accuracy on both the KITTI and COLD experiments, although on COLD, both FAB-MAP 2.0 and Region-VLAD have a lower failure rate. This is because the COLD trajectory that we use has some segments where the robot sees only untextured wall, a case to which the Object Landmark method is brittle, since it cannot detect any objects

Fig. 9  Pose errors on the KITTI odometry place-recognition experiments, separated by step distance d_min. The y-axis is log-scaled. Each box shows the quartiles of errors for one method at one d_min value. The whiskers indicate the minimum and maximum errors observed. Results from matching queries randomly are included as a baseline. Object Landmarks achieves much lower median error than both FAB-MAP and CALC at d_min values from 5 to 40 m, and notably outperforms Region-VLAD at d_min = 10 and 20 m. No method performs much better than chance at d_min = 80

As shown in Fig. 9, our method substantially outperforms FAB-MAP and CALC from d_min = 5 to 40 m. Region-VLAD is more competitive with our method, but is still substantially outperformed from d_min = 10 to 40 m: at d_min = 40, our method's first error quartile is lower than Region-VLAD's by a factor of 2. The difference is most pronounced at d_min = 20 m, while by d_min = 80 m all methods perform very poorly; mean errors are not much better than random matches for any method, suggesting that all methods give mostly incorrect matches.

Table 4  Statistics of S_{q,m} scores for each d_min in the KITTI experiments
d_min   Mean S_{q,m}   SD
2       0.34           0.044
5       0.25           0.052
10      0.19           0.068
20      0.14           0.068
40      0.13           0.061
80      0.11           0.025
The scores decline with d_min

Fig. 10  Impact of a range of S_{q,m} thresholds. At each threshold, queries with no match above the threshold are considered new locations. FAB-MAP 2.0 does not depend on a threshold, so its values are constant. FAB-MAP's mean error is much higher than our system's at any threshold. Even at a threshold that produces the same number of "new" declarations from both systems, FAB-MAP's mean pose error is 223% greater than that of our system
This experiment assumes that all queries have a correct match in the map. But a robot may also need to consider whether it is visiting an entirely new location. FAB-MAP 2.0 outputs a probability that q represents a new location along with its match probability for each c ∈ M. Our place-recognition scheme can accomplish the same by establishing a threshold on the similarity score. Table 4 and Fig. 10 illustrate the trade-offs of different settings of the similarity threshold, and the distances at which this place-recognition method performs well. A threshold of 0.1, which would make the "new" rate about equal to that of FAB-MAP 2.0, would cut off about half of the d_min = 80 m and 40 m queries. A threshold of 0.16 gives a mean error very close to 0 m, but a "new" rate of about 50%: at this threshold, most of the queries in the d_min = 20, 40, and 80 m experiments are cut off, as are approximately half of the queries in the d_min = 10 m experiments.

It should be noted that in these experiments, we do not exploit any priors on robotic motion. Such priors could greatly constrain the search space for matches and thus increase the robustness of localizing over large distances, allowing higher thresholds to be used while maintaining a desired accuracy. Ultimately, the choice of threshold must depend on the application, balancing tolerable error against feasible loop-closure distance and map size.

5.2 Indoor: COLD Saarbrucken

To evaluate the performance of our method on an indoor environment, we conducted place-recognition experiments on the COsy Localization Database (COLD) produced by Pronobis and Caputo (2009). COLD is an indoor navigation dataset consisting of sensor data collected over traversals of various university buildings. The data was gathered by a robot equipped with a forward-facing monocular camera, and has precise ground truth poses associated with each image, as well as annotations indicating which room in the building the robot is in at each frame.

For COLD, we ran experiments with d_min values of 0.1, 2, 4, 6, 8, and 10 meters. These values were chosen because of the small size of these traversals, and because changes in visual scale observed under a given motion in the z direction will tend to be larger in an enclosed indoor space than outdoors. The monocular images in COLD are unrectified, and the intrinsic parameters of the cameras used are not available, so we cannot accurately estimate camera poses, though geometric consistency checking via the cheirality check can still be performed by assuming arbitrary camera parameters. For this reason, in our experiments on COLD we consider the ground truth distance of q from the proposed match m as our metric of interest. By construction, this cannot be less than d_min for a given experiment, but it is a useful proxy for the "correctness" of the pose estimate, as smaller baselines between image pairs make transform estimation more accurate, as shown in our other experiments.

Fig. 11  Box plots of ground-truth distances between queries and matches proposed by each method, for each value of d_min. At d_min = 8 m, FAB-MAP 2.0's second quartile is almost equal to the third quartile, and at d_min = 8, the second is almost equal to the first, making it hard to discern in each case

The results are summarized in the "COLD" section of Table 3. Figure 11 shows that Object Landmarks usually outperforms the other methods at d_min ≥ 2 m.
At d_min = 4 and 6 m, Region-VLAD has a slightly lower second quartile, but a much higher third quartile, than Object Landmarks, and FAB-MAP 2.0 and CALC are strictly dominated. Region-VLAD's performance degrades at d_min = 8 m, while FAB-MAP 2.0 and CALC anomalously perform as well or better than Object Landmarks. At d_min ≥ 10 m, where the most extreme scale changes are experienced, Object Landmarks dominate all other methods. It is notable that of the methods other than Object Landmarks, the only one that operates by detecting and matching object-like regions between images is Region-VLAD, which has the best performance of the three.

6 Conclusions

We have proposed Object Landmarks, a new feature type for visual localization. An Object Landmark consists of an object detection with a quasi-semantic descriptor from intermediate activations of a CNN, as well as a set of point features such as SIFT on the object. This design is motivated by a holistic philosophy of image processing, which considers that information at both high and low levels of abstraction from the raw image data is important to the task. We have demonstrated that Object Landmarks offer major improvements over the state of the art when used to localize over changes in visual scale greater than 3×. These results are supportive of our holistic approach, and we believe that future work on visual localization, and on image processing more broadly, may benefit from taking such an approach.

Our experimental evaluation did not investigate the robustness of Object Landmarks to large changes in viewing angle and scene appearance, both of which are of considerable importance in visual localization. In future work, we would like to investigate these aspects of the feature's performance in more detail.

The off-the-shelf construction of Object Landmarks has the advantage that they do not require any re-training to apply them to new environments. Nonetheless, one direction for future research would be to attempt to train a neural network to replace some or all of the stages of this pipeline. For such a system to perform reliably over large scale changes would require training data exhibiting such changes over a wide variety of types of environment, which may be difficult to obtain. Nonetheless, this could provide gains in accuracy and speed, and would remove the need for some of the parameter-tuning associated with components like Edge Boxes.

Acknowledgements  This work was conducted at the Samsung AI Center Montreal.

Open Access  This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

Bay, H., Ess, A., Tuytelaars, T., & Van Gool, L. (2008). Speeded-up robust features (SURF). Computer Vision and Image Understanding, 110(3), 346–359. https://doi.org/10.1016/j.cviu.2007.09.014.
Bowman, S. L., Atanasov, N., Daniilidis, K., & Pappas, G. J. (2017). Probabilistic data association for semantic SLAM. In: 2017 IEEE international conference on robotics and automation (ICRA), pp. 1722–1729.
Brown, M., & Lowe, D. G. (2002). Invariant features from interest point groups. In: BMVC, Vol. 4.
Chen, Z., Liu, L., Sa, I., Ge, Z., & Chli, M. (2018). Learning context flexible attention model for long-term visual place recognition. IEEE Robotics and Automation Letters, 3(4), 4015–4022.
Cummins, M., & Newman, P. (2010). Invited applications paper: FAB-MAP: Appearance-based place recognition and mapping using a learned visual vocabulary model. In: 27th international conference on machine learning (ICML 2010).
Cummins, M., & Newman, P. (2011). Appearance-only SLAM at large scale with FAB-MAP 2.0. The International Journal of Robotics Research, 30(9), 1100–1123. https://doi.org/10.1177/0278364910385483.
DeTone, D., Malisiewicz, T., & Rabinovich, A. (2018). SuperPoint: Self-supervised interest point detection and description. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 224–236.
Dudek, G., & Jenkin, M. (2010). Computational principles of mobile robotics. Cambridge: Cambridge University Press.
Dudek, G., & Zhang, C. (1995). Pose estimation from image data without explicit object models. In: Research in computer and robot vision. World Scientific, pp. 19–35.
Dudek, G., & Zhang, C. (1996). Vision-based robot localization without explicit object models. In: IEEE international conference on robotics and automation, Vol. 1. IEEE, pp. 76–82.
Dusmanu, M., Rocco, I., Pajdla, T., Pollefeys, M., Sivic, J., Torii, A., & Sattler, T. (2019). D2-Net: A trainable CNN for joint detection and description of local features. In: Proceedings of the 2019 IEEE/CVF conference on computer vision and pattern recognition.
Engel, J., Schöps, T., & Cremers, D. (2014). LSD-SLAM: Large-scale direct monocular SLAM. In: European conference on computer vision. Springer, pp. 834–849.
Faugeras, O., Luong, Q.-T., & Papadopoulo, T. (2001). The geometry of multiple images: The laws that govern the formation of multiple images of a scene and some of their applications. MIT Press.
Fox, D., Burgard, W., & Thrun, S. (1999). Markov localization for mobile robots in dynamic environments. Journal of Artificial Intelligence Research, 11, 391–427.
Galindo, C., Saffiotti, A., Coradeschi, S., Buschka, P., Fernandez-Madrigal, J. A., & Gonzalez, J. (2005). Multi-hierarchical semantic maps for mobile robotics. In: 2005 IEEE/RSJ international conference on intelligent robots and systems, pp. 2278–2283.
Garg, S., Suenderhauf, N., & Milford, M. (2018). Lost? Appearance-invariant place recognition for opposite viewpoints using visual semantics. arXiv preprint arXiv:1804.05526.
Geiger, A., Lenz, P., & Urtasun, R. (2012). Are we ready for autonomous driving? The KITTI vision benchmark suite. In: Conference on computer vision and pattern recognition (CVPR).
Glover, A., Maddern, W., Warren, M., Reid, S., Milford, M., & Wyeth, G. (2011). OpenFABMAP: An open source toolbox for appearance-based loop closure detection. In: The international conference on robotics and automation. St Paul, Minnesota: IEEE.
Hartley, R. I. (1993). Cheirality invariants. In: Proceedings of DARPA image understanding workshop, pp. 745–753.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385.
Holliday, A., & Dudek, G. (2018). Scale-robust localization using general object landmarks. In: 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp. 1688–1694.
Holliday, A., & Dudek, G. (2020). Pre-trained CNNs as visual feature extractors: A broad evaluation. In: 2020 17th conference on computer and robot vision (CRV). IEEE, pp. 78–84.
Huang, G., Liu, Z., Maaten, L. v. d., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR), pp. 2261–2269.
Kaeli, J. W., Leonard, J. J., & Singh, H. (2014). Visual summaries for low-bandwidth semantic mapping with autonomous underwater vehicles. In: 2014 IEEE/OES autonomous underwater vehicles (AUV), pp. 1–7.
Kendall, A., & Cipolla, R. (2016). Modelling uncertainty in deep learning for camera relocalization. In: Proceedings of the international conference on robotics and automation (ICRA).
Khaliq, A., Ehsan, S., Chen, Z., Milford, M., & McDonald-Maier, K. (2019). A holistic visual place recognition approach using lightweight CNNs for significant viewpoint and appearance changes. IEEE Transactions on Robotics, pp. 1–9.
Klein, G., & Murray, D. (2007). Parallel tracking and mapping for small AR workspaces. In: Proceedings of the 2007 6th IEEE and ACM international symposium on mixed and augmented reality, ser. ISMAR '07. Washington, DC: IEEE Computer Society, pp. 1–10. https://doi.org/10.1109/ISMAR.2007.4538852.
Konolige, K. (1998). Small vision systems: Hardware and implementation. In: Robotics research. Springer, pp. 203–212.
Kriegman, D. J., Triendl, E., & Binford, T. O. (1989). Stereo vision and navigation in buildings for mobile robots. IEEE Transactions on Robotics and Automation, 5(6), 792–803.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in neural information processing systems (Vol. 25, pp. 1097–1105). Curran Associates Inc.
Leonard, J. J., & Durrant-Whyte, H. F. (1991). Mobile robot localization by tracking geometric beacons. IEEE Transactions on Robotics and Automation, 7(3), 376–382.
Li, J., Eustice, R. M., & Johnson-Roberson, M. (2015). High-level visual features for underwater place recognition.
Li, J., Meger, D., & Dudek, G. (2019). Semantic mapping for view-invariant relocalization. In: Proceedings of the 2019 IEEE international conference on robotics and automation (ICRA), Montreal, Canada.
Lindeberg, T. (1994). Scale-space theory: A basic tool for analyzing structures at different scales. Journal of Applied Statistics, 21(1–2), 225–270. https://doi.org/10.1080/757582976.
Linegar, C., Churchill, W., & Newman, P. (2016). Made to measure: Bespoke landmarks for 24-hour, all-weather localisation with a camera. In: 2016 IEEE international conference on robotics and automation (ICRA), pp. 787–794.
Lowe, D. G. (1999). Object recognition from local scale-invariant features. In: Proceedings of the international conference on computer vision, Vol. 2, ser. ICCV '99. Washington, DC: IEEE Computer Society, p. 1150. http://dl.acm.org/citation.cfm?id=850924.851523.
MacKenzie, P., & Dudek, G. (1994). Precise positioning using model-based maps. In: 1994 IEEE international conference on robotics and automation. IEEE, pp. 1615–1621.
Merrill, N., & Huang, G. (2018). Lightweight unsupervised deep loop closure. In: Proceedings of robotics: Science and systems (RSS), Pittsburgh.
Mirowski, P., Grimes, M. K., Malinowski, M., Hermann, K. M., Anderson, K., Teplyashin, D., Simonyan, K., Kavukcuoglu, K., Zisserman, A., & Hadsell, R. (2018). Learning to navigate in cities without a map.
Mur-Artal, R., Montiel, J. M. M., & Tardós, J. D. (2015). ORB-SLAM: A versatile and accurate monocular SLAM system. CoRR.
Nistér, D. (2004). An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6), 756–777. https://doi.org/10.1109/TPAMI.2004.17.
Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3), 145–175. https://doi.org/10.1023/A:1011139631724.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., & Lerer, A. (2017). Automatic differentiation in PyTorch. In: NIPS autodiff workshop.
Pronobis, A., & Caputo, B. (2009). COLD: COsy localization database. The International Journal of Robotics Research (IJRR), 28(5), 588–594. http://www.pronobis.pro/publications/pronobis2009ijrr.
Rublee, E., Rabaud, V., Konolige, K., & Bradski, G. (2011). ORB: An efficient alternative to SIFT or SURF. In: Proceedings of the 2011 international conference on computer vision, ser. ICCV '11. Washington, DC: IEEE Computer Society, pp. 2564–2571. https://doi.org/10.1109/ICCV.2011.6126544.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3), 211–252.
Salas-Moreno, R. F., Newcombe, R. A., Strasdat, H., Kelly, P. H., & Davison, A. J. (2013). SLAM++: Simultaneous localisation and mapping at the level of objects. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1352–1359.
Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Fua, P., & Moreno-Noguer, F. (2015). Discriminative learning of deep convolutional feature point descriptors. In: 2015 IEEE international conference on computer vision (ICCV), pp. 118–126.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556. arXiv:1409.1556.
Spencer, C., & Darvizeh, Z. (1981). The case for developing a cognitive environmental psychology that does not underestimate the abilities of young children. Journal of Environmental Psychology, 1(1), 21–31.
Sünderhauf, N., Dayoub, F., Shirazi, S., Upcroft, B., & Milford, M. (2015). On the performance of convnet features for place recognition. CoRR, arXiv:1501.04158.
Sünderhauf, N., Shirazi, S., Jacobson, A., Dayoub, F., Pepperell, E., Upcroft, B., & Milford, M. (2015). Place recognition with convnet landmarks: Viewpoint-robust, condition-robust, training-free. In: Proceedings of robotics: Science and systems (RSS).
Uijlings, J. R. R., van de Sande, K. E. A., Gevers, T., & Smeulders, A. W. M. (2013). Selective search for object recognition. International Journal of Computer Vision, 104(2), 154–171. https://ivi.fnwi.uva.nl/isis/publications/2013/UijlingsIJCV2013.
Yi, K. M., Trulls, E., Lepetit, V., & Fua, P. (2016). LIFT: Learned invariant feature transform. In: European conference on computer vision (ECCV), pp. 467–483.
Zitnick, L., & Dollar, P. (2014). Edge boxes: Locating object proposals from edges. In: ECCV: European conference on computer vision. https://www.microsoft.com/en-us/research/publication/edge-boxes-locating-object-proposals-from-edges/.

Publisher's Note  Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Andrew Holliday is a PhD candidate at the Mobile Robotics Laboratory of McGill University. He is interested in visual navigation, neural representations for computer vision, and applying machine learning to combinatorial optimization problems.

Gregory Dudek does research on sensing for mobile robotics, including vision, robot pose estimation (position estimation), recognition, and path planning. He is also looking into a few non-robotic problems, such as those related to recommender systems.