Search | arXiv e-print repository

Online Tree Reconstruction and Forest Inventory on a Mobile Robotic System

Authors: Leonard Freißmuth, Matias Mattamala, Nived Chebrolu, Simon Schaefer, Stefan Leutenegger, Maurice Fallon

Abstract: Terrestrial laser scanning (TLS) is the standard technique used to create accurate point clouds for digital forest inventories. However, the measurement process is demanding, requiring up to two days per hectare for data collection, significant data storage, as well as resource-heavy post-processing of 3D data. In this work, we present a real-time mapping and analysis system that enables online generation of forest inventories using mobile laser scanners that can be mounted, e.g., on mobile robots. Given incrementally created and locally accurate submaps (data payloads), our approach extracts tree candidates using a custom, Voronoi-inspired clustering algorithm. Tree candidates are reconstructed using an adapted Hough algorithm, which enables robust modeling of the tree stem. Further, we explicitly incorporate the incremental nature of the data collection by consistently updating the database using a pose graph LiDAR SLAM system. This enables us to refine our estimates of the tree traits if an area is revisited later during a mission. We demonstrate competitive accuracy to TLS or manual measurements using laser scanners that we mounted on backpacks or mobile robots operating in conifer, broad-leaf and mixed forests. Our results achieve an RMSE of 1.93 cm, a bias of 0.65 cm and a standard deviation of 1.81 cm (averaged across these sequences), with no post-processing required after the mission is complete.

Submitted 26 March, 2024; originally announced March 2024.

arXiv:2403.11370 [pdf, other]

DynamicGlue: Epipolar and Time-Informed Data Association in Dynamic Environments using Graph Neural Networks

Authors: Theresa Huber, Simon Schaefer, Stefan Leutenegger

Abstract: The assumption of a static environment is common in many geometric computer vision tasks like SLAM but limits their applicability in highly dynamic scenes. Since these tasks rely on identifying point correspondences between input images within the static part of the environment, we propose a graph neural network-based sparse feature matching network designed to perform robust matching under challenging conditions while excluding keypoints on moving objects. We employ a similar scheme of attentional aggregation over graph edges to enhance keypoint representations as state-of-the-art feature-matching networks but augment the graph with epipolar and temporal information and vastly reduce the number of graph edges. Furthermore, we introduce a self-supervised training scheme to extract pseudo labels for image pairs in dynamic environments from exclusively unprocessed visual-inertial data. A series of experiments show the superior performance of our network as it excludes keypoints on moving objects compared to state-of-the-art feature matching networks while still achieving similar results regarding conventional matching metrics. When integrated into a SLAM system, our network significantly improves performance, especially in highly dynamic scenes.

Submitted 1 July, 2024; v1 submitted 17 March, 2024; originally announced March 2024.

arXiv:2403.09596 [pdf, other]

Scalable Autonomous Drone Flight in the Forest with Visual-Inertial SLAM and Dense Submaps Built without LiDAR

Authors: Sebastián Barbas Laina, Simon Boche, Sotiris Papatheodorou, Dimos Tzoumanikas, Simon Schaefer, Hanzhi Chen, Stefan Leutenegger

Abstract: Forestry constitutes a key element for a sustainable future, while it is supremely challenging to introduce digital processes to improve efficiency. The main limitation is the difficulty of obtaining accurate maps at high temporal and spatial resolution as a basis for informed forestry decision-making, due to the vast area forests extend over and the sheer number of trees. To address this challenge, we present an autonomous Micro Aerial Vehicle (MAV) system which purely relies on cost-effective and light-weight passive visual and inertial sensors to perform under-canopy autonomous navigation. We leverage visual-inertial simultaneous localization and mapping (VI-SLAM) for accurate MAV state estimates and couple it with a volumetric occupancy submapping system to achieve a scalable mapping framework which can be directly used for path planning. As opposed to a monolithic map, submaps inherently deal with inevitable drift and corrections from VI-SLAM, since they move with pose estimates as they are updated. To ensure the safety of the MAV during navigation, we also propose a novel reference trajectory anchoring scheme that moves and deforms the reference trajectory the MAV is tracking upon state updates from the VI-SLAM system in a consistent way, even upon large changes in state estimates due to loop-closures. We thoroughly validate our system in both real and simulated forest environments with high tree densities in excess of 400 trees per hectare and at speeds up to 3 m/s, while not encountering a single collision or system failure. To the best of our knowledge, this is the first system which achieves this level of performance in such an unstructured environment using low-cost passive visual sensors and fully on-board computation including VI-SLAM.

Submitted 14 March, 2024; originally announced March 2024.

Comments: 8 pages, 7 figures

arXiv:2403.04331 [pdf, other]

Control-Barrier-Aided Teleoperation with Visual-Inertial SLAM for Safe MAV Navigation in Complex Environments

Authors: Siqi Zhou, Sotiris Papatheodorou, Stefan Leutenegger, Angela P. Schoellig

Abstract: In this paper, we consider a Micro Aerial Vehicle (MAV) system teleoperated by a non-expert and introduce a perceptive safety filter that leverages Control Barrier Functions (CBFs) in conjunction with Visual-Inertial Simultaneous Localization and Mapping (VI-SLAM) and dense 3D occupancy mapping to guarantee safe navigation in complex and unstructured environments. Our system relies solely on onboard IMU measurements, stereo infrared images, and depth images and autonomously corrects teleoperated inputs when they are deemed unsafe. We define a point in 3D space as unsafe if it satisfies either of two conditions: (i) it is occupied by an obstacle, or (ii) it remains unmapped. At each time step, an occupancy map of the environment is updated by the VI-SLAM by fusing the onboard measurements, and a CBF is constructed to parameterize the (un)safe region in the 3D space. Given the CBF and state feedback from the VI-SLAM module, a safety filter computes a certified reference that best matches the teleoperation input while satisfying the safety constraint encoded by the CBF. In contrast to existing perception-based safe control frameworks, we directly close the perception-action loop and demonstrate the full capability of safe control in combination with real-time VI-SLAM without any external infrastructure or prior knowledge of the environment. We verify the efficacy of the perceptive safety filter in real-time MAV experiments using exclusively onboard sensing and computation and show that the teleoperated MAV is able to safely navigate through unknown environments despite arbitrary inputs sent by the teleoperator.

Submitted 7 March, 2024; originally announced March 2024.

Comments: Accepted to the IEEE International Conference on Robotics and Automation (ICRA) 2024, 7 pages, 7 figures, supplementary video is available at https://youtu.be/rCxbWY4PIfQ?si=DC-9mg7g1WooNdaV

arXiv:2403.02280 [pdf, other]

Tightly-Coupled LiDAR-Visual-Inertial SLAM and Large-Scale Volumetric Occupancy Mapping

Authors: Simon Boche, Sebastián Barbas Laina, Stefan Leutenegger

Abstract: Autonomous navigation is one of the key requirements for every potential application of mobile robots in the real world. Besides high-accuracy state estimation, a suitable and globally consistent representation of the 3D environment is indispensable. We present a fully tightly-coupled LiDAR-Visual-Inertial SLAM system and 3D mapping framework applying local submapping strategies to achieve scalability to large-scale environments. A novel and correspondence-free, inherently probabilistic, formulation of LiDAR residuals is introduced, expressed only in terms of the occupancy fields and their respective gradients. These residuals can be added to a factor graph optimisation problem, either as frame-to-map factors for the live estimates or as map-to-map factors aligning the submaps with respect to one another. Experimental validation demonstrates that the approach achieves state-of-the-art pose accuracy and furthermore produces globally consistent volumetric occupancy submaps which can be directly used in downstream tasks such as navigation or exploration.

Submitted 4 March, 2024; originally announced March 2024.

Comments: IEEE International Conference on Robotics and Automation (ICRA) 2024

arXiv:2402.05644 [pdf, other]

FuncGrasp: Learning Object-Centric Neural Grasp Functions from Single Annotated Example Object

Authors: Hanzhi Chen, Binbin Xu, Stefan Leutenegger

Abstract: We present FuncGrasp, a framework that can infer dense yet reliable grasp configurations for unseen objects using one annotated object and a single-view RGB-D observation via categorical priors. Unlike previous works that only transfer a set of grasp poses, FuncGrasp aims to transfer infinite configurations parameterized by an object-centric continuous grasp function across varying instances. To ease the transfer process, we propose Neural Surface Grasping Fields (NSGF), an effective neural representation defined on the surface to densely encode grasp configurations. Further, we exploit function-to-function transfer using sphere primitives to establish semantically meaningful categorical correspondences, which are learned in an unsupervised fashion without any expert knowledge. We showcase the effectiveness through extensive experiments in both simulators and the real world. Remarkably, our framework significantly outperforms several strong baseline methods in terms of density and reliability for generated grasps.

Submitted 22 February, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

Comments: Accepted to ICRA 2024

arXiv:2312.13471 [pdf, other]

NeRF-VO: Real-Time Sparse Visual Odometry with Neural Radiance Fields

Authors: Jens Naumann, Binbin Xu, Stefan Leutenegger, Xingxing Zuo

Abstract: We introduce a novel monocular visual odometry (VO) system, NeRF-VO, that integrates learning-based sparse visual odometry for low-latency camera tracking and a neural radiance scene representation for fine-detailed dense reconstruction and novel view synthesis. Our system initializes camera poses using sparse visual odometry and obtains view-dependent dense geometry priors from a monocular prediction network. We harmonize the scale of poses and dense geometry, treating them as supervisory cues to train a neural implicit scene representation. NeRF-VO demonstrates exceptional performance in both photometric and geometric fidelity of the scene representation by jointly optimizing a sliding window of keyframed poses and the underlying dense geometry, which is accomplished through training the radiance field with volume rendering. We surpass SOTA methods in pose estimation accuracy, novel view synthesis fidelity, and dense reconstruction quality across a variety of synthetic and real-world datasets while achieving a higher camera tracking frequency and consuming less GPU memory.

Submitted 16 July, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

Comments: Project page: https://xingxingzuo.github.io/nerfvo/

Journal ref: IEEE Robotics and Automation Letters (RA-L), 2024

arXiv:2312.05247 [pdf, other]

Dynamic LiDAR Re-simulation using Compositional Neural Fields

Authors: Hanfeng Wu, Xingxing Zuo, Stefan Leutenegger, Or Litany, Konrad Schindler, Shengyu Huang

Abstract: We introduce DyNFL, a novel neural field-based approach for high-fidelity re-simulation of LiDAR scans in dynamic driving scenes. DyNFL processes LiDAR measurements from dynamic environments, accompanied by bounding boxes of moving objects, to construct an editable neural field. This field, comprising separately reconstructed static background and dynamic objects, allows users to modify viewpoints, adjust object positions, and seamlessly add or remove objects in the re-simulated scene. A key innovation of our method is the neural field composition technique, which effectively integrates reconstructed neural assets from various scenes through a ray drop test, accounting for occlusions and transparent surfaces. Our evaluation with both synthetic and real-world environments demonstrates that DyNFL substantially improves dynamic scene LiDAR simulation, offering a combination of physical fidelity and flexible editing capabilities.

Submitted 3 April, 2024; v1 submitted 8 December, 2023; originally announced December 2023.

Comments: Project page: https://shengyuh.github.io/dynfl

arXiv:2311.18610 [pdf, other]

DiffCAD: Weakly-Supervised Probabilistic CAD Model Retrieval and Alignment from an RGB Image

Authors: Daoyi Gao, Dávid Rozenberszki, Stefan Leutenegger, Angela Dai

Abstract: Perceiving 3D structures from RGB images based on CAD model primitives can enable an effective, efficient 3D object-based representation of scenes. However, current approaches rely on supervision from expensive annotations of CAD models associated with real images, and encounter challenges due to the inherent ambiguities in the task, both in depth-scale ambiguity in monocular perception and in inexact matches of CAD database models to real observations. We thus propose DiffCAD, the first weakly-supervised probabilistic approach to CAD retrieval and alignment from an RGB image. We formulate this as a conditional generative task, leveraging diffusion to learn implicit probabilistic models capturing the shape, pose, and scale of CAD objects in an image. This enables multi-hypothesis generation of different plausible CAD reconstructions, requiring only a few hypotheses to characterize ambiguities in depth/scale and inexact shape matches. Our approach is trained only on synthetic data, leveraging monocular depth and mask estimates to enable robust zero-shot adaptation to various real target domains. Despite being trained solely on synthetic data, our multi-hypothesis approach can even surpass the supervised state-of-the-art on the Scan2CAD dataset by 5.9% with 8 hypotheses.

Submitted 6 June, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

Comments: SIGGRAPH 2024, Project page: https://daoyig.github.io/DiffCAD/

arXiv:2311.02510 [pdf, other]

Anthropomorphic Grasping with Neural Object Shape Completion

Authors: Diego Hidalgo-Carvajal, Hanzhi Chen, Gemma C. Bettelani, Jaesug Jung, Melissa Zavaglia, Laura Busse, Abdeldjallil Naceri, Stefan Leutenegger, Sami Haddadin

Abstract: The progressive prevalence of robots in human-suited environments has given rise to a myriad of object manipulation techniques, in which dexterity plays a paramount role. It is well-established that humans exhibit extraordinary dexterity when handling objects. Such dexterity seems to derive from a robust understanding of object properties (such as weight, size, and shape), as well as a remarkable capacity to interact with them. Hand postures commonly demonstrate the influence of specific regions on objects that need to be grasped, especially when objects are partially visible. In this work, we leverage human-like object understanding by reconstructing and completing their full geometry from partial observations, and manipulating them using a 7-DoF anthropomorphic robot hand. Our approach has significantly improved the grasping success rates of baselines with only partial reconstruction by nearly 30% and achieved over 150 successful grasps with three different object categories. This demonstrates our approach's consistent ability to predict and execute grasping postures based on the completed object shapes from various directions and positions in real-world scenarios. Our work opens up new possibilities for enhancing robotic applications that require precise grasping and manipulation skills of real-world reconstructed objects.

Submitted 9 November, 2023; v1 submitted 4 November, 2023; originally announced November 2023.

Comments: Accepted to RA-L 2023

arXiv:2310.09982 [pdf, other]

AP$n$P: A Less-constrained P$n$P Solver for Pose Estimation with Unknown Anisotropic Scaling or Focal Lengths

Authors: Jiaxin Wei, Stefan Leutenegger, Laurent Kneip

Abstract: Perspective-$n$-Point (P$n$P) stands as a fundamental algorithm for pose estimation in various applications. In this paper, we present a new approach to the P$n$P problem with relaxed constraints, eliminating the need for precise 3D coordinates or complete calibration data. We refer to it as AP$n$P due to its ability to handle unknown anisotropic scaling factors of 3D coordinates or alternatively two distinct focal lengths in addition to the conventional rigid transformation. Through algebraic manipulations and a novel parametrization, both cases are brought into similar forms that distinguish themselves primarily by the order of a rotation and an anisotropic scaling operation. AP$n$P then boils down to one unique polynomial problem, which is solved by the Gröbner basis approach. Experimental results on both simulated and real datasets demonstrate the effectiveness of AP$n$P as a more flexible and practical solution to camera pose estimation. Code: https://github.com/goldoak/APnP.

Submitted 9 November, 2023; v1 submitted 15 October, 2023; originally announced October 2023.

arXiv:2309.14514 [pdf, other]

Accurate and Interactive Visual-Inertial Sensor Calibration with Next-Best-View and Next-Best-Trajectory Suggestion

Authors: Christopher L. Choi, Binbin Xu, Stefan Leutenegger

Abstract: Visual-Inertial (VI) sensors are popular in robotics, self-driving vehicles, and augmented and virtual reality applications. In order to use them for any computer vision or state-estimation task, a good calibration is essential. However, collecting informative calibration data in order to render the calibration parameters observable is not trivial for a non-expert. In this work, we introduce a novel VI calibration pipeline that guides a non-expert with the use of a graphical user interface and information theory in collecting informative calibration data with Next-Best-View and Next-Best-Trajectory suggestions to calibrate the intrinsics, extrinsics, and temporal misalignment of a VI sensor. We show through experiments that our method is faster, more accurate, and more consistent than state-of-the-art alternatives. Specifically, we show how calibrations with our proposed method achieve higher accuracy estimation results when used by state-of-the-art VI Odometry as well as VI-SLAM approaches. The source code of our software can be found at: https://github.com/chutsu/yac.

Submitted 25 September, 2023; originally announced September 2023.

Comments: 8 pages, 11 figures, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2023)

arXiv:2309.10369 [pdf, other]

GloPro: Globally-Consistent Uncertainty-Aware 3D Human Pose Estimation & Tracking in the Wild

Authors: Simon Schaefer, Dorian F. Henning, Stefan Leutenegger

Abstract: An accurate and uncertainty-aware 3D human body pose estimation is key to enabling truly safe but efficient human-robot interactions. Current uncertainty-aware methods in 3D human pose estimation are limited to predicting the uncertainty of the body posture, while effectively neglecting the body shape and root pose. In this work, we present GloPro, which is, to the best of our knowledge, the first framework to predict an uncertainty distribution of a 3D body mesh including its shape, pose, and root pose, by efficiently fusing visual clues with a learned motion model. We demonstrate that it vastly outperforms state-of-the-art methods in terms of human trajectory accuracy in a world coordinate system (even in the presence of severe occlusions), yields consistent uncertainty distributions, and can run in real-time.

Submitted 20 September, 2023; v1 submitted 19 September, 2023; originally announced September 2023.

Comments: IEEE International Conference on Intelligent Robots and Systems (IROS) 2023

arXiv:2309.01236 [pdf, other]

BodySLAM++: Fast and Tightly-Coupled Visual-Inertial Camera and Human Motion Tracking

Authors: Dorian F. Henning, Christopher Choi, Simon Schaefer, Stefan Leutenegger

Abstract: Robust, fast, and accurate human state (6D pose and posture) estimation remains a challenging problem. For real-world applications, the ability to estimate the human state in real-time is highly desirable. In this paper, we present BodySLAM++, a fast, efficient, and accurate human and camera state estimation framework relying on visual-inertial data. BodySLAM++ extends an existing visual-inertial state estimation framework, OKVIS2, to solve the dual task of estimating camera and human states simultaneously. Our system improves the accuracy of both human and camera state estimation with respect to baseline methods by 26% and 12%, respectively, and achieves real-time performance at 15+ frames per second on an Intel i7-model CPU. Experiments were conducted on a custom dataset containing both ground truth human and camera poses collected with an indoor motion tracking system.

Submitted 3 September, 2023; originally announced September 2023.

Comments: IROS 2023. Video: https://youtu.be/UcutiHQwbGk

arXiv:2306.11483 [pdf, other]

Int-HRL: Towards Intention-based Hierarchical Reinforcement Learning

Authors: Anna Penzkofer, Simon Schaefer, Florian Strohm, Mihai Bâce, Stefan Leutenegger, Andreas Bulling

Abstract: While deep reinforcement learning (RL) agents outperform humans on an increasing number of tasks, training them requires data equivalent to decades of human gameplay. Recent hierarchical RL methods have increased sample efficiency by incorporating information inherent to the structure of the decision problem but at the cost of having to discover or use human-annotated sub-goals that guide the learning process. We show that intentions of human players, i.e., the precursor of goal-oriented decisions, can be robustly predicted from eye gaze even for the long-horizon sparse rewards task of Montezuma's Revenge, one of the most challenging RL tasks in the Atari2600 game suite. We propose Int-HRL: Hierarchical RL with intention-based sub-goals that are inferred from human eye gaze. Our novel sub-goal extraction pipeline is fully automatic and replaces the need for manual sub-goal annotation by human experts. Our evaluations show that replacing hand-crafted sub-goals with automatically extracted intentions leads to an HRL agent that is significantly more sample efficient than previous methods.

Submitted 20 June, 2023; originally announced June 2023.

arXiv:2306.08648 [pdf, other]

SimpleMapping: Real-Time Visual-Inertial Dense Mapping with Deep Multi-View Stereo

Authors: Yingye Xin, Xingxing Zuo, Dongyue Lu, Stefan Leutenegger

Abstract: We present a real-time visual-inertial dense mapping method capable of performing incremental 3D mesh reconstruction with high quality using only sequential monocular images and inertial measurement unit (IMU) readings. 6-DoF camera poses are estimated by a robust feature-based visual-inertial odometry (VIO), which also generates noisy sparse 3D map points as a by-product. We propose a sparse point aided multi-view stereo neural network (SPA-MVSNet) that can effectively leverage the informative but noisy sparse points from the VIO system. The sparse depth from VIO is first completed by a single-view depth completion network. This dense depth map, although naturally limited in accuracy, is then used as a prior to guide our MVS network in the cost volume generation and regularization for accurate dense depth prediction. Predicted depth maps of keyframe images by the MVS network are incrementally fused into a global map using TSDF-Fusion. We extensively evaluate both the proposed SPA-MVSNet and the entire visual-inertial dense mapping system on several public datasets as well as our own dataset, demonstrating the system's impressive generalization capabilities and its ability to deliver high-quality 3D mesh reconstruction online. Our proposed dense mapping system achieves a 39.7% improvement in F-score over existing systems when evaluated on the challenging scenarios of the EuRoC dataset.

Submitted 27 August, 2023; v1 submitted 14 June, 2023; originally announced June 2023.

arXiv:2305.14918 [pdf, other]

doi 10.1109/LRA.2023.3273509

Incremental Dense Reconstruction from Monocular Video with Guided Sparse Feature Volume Fusion

Authors: Xingxing Zuo, Nan Yang, Nathaniel Merrill, Binbin Xu, Stefan Leutenegger

Abstract: Incrementally recovering 3D dense structures from monocular videos is of paramount importance since it enables various robotics and AR applications. Feature volumes have recently been shown to enable efficient and accurate incremental dense reconstruction without the need to first estimate depth, but they are not able to achieve as high a resolution as depth-based methods due to the large memory consumption of high-resolution feature volumes. This letter proposes a real-time feature volume-based dense reconstruction method that predicts TSDF (Truncated Signed Distance Function) values from a novel sparsified deep feature volume, which is able to achieve higher resolutions than previous feature volume-based methods, and is favorable in large-scale outdoor scenarios where the majority of voxels are empty. An uncertainty-aware multi-view stereo (MVS) network is leveraged to infer initial voxel locations of the physical surface in a sparse feature volume. Then, to refine the recovered 3D geometry, deep features are attentively aggregated from multiview images at potential surface locations, and temporally fused. Besides achieving higher resolutions than before, our method is shown to produce more complete reconstructions with finer detail in many cases. Extensive evaluations on both public and self-collected datasets demonstrate a very competitive real-time reconstruction result for our method compared to state-of-the-art reconstruction methods in both indoor and outdoor settings.

Submitted 24 May, 2023; originally announced May 2023.

Comments: 8 pages, 5 figures, RA-L 2023

arXiv:2302.14569 [pdf, other]

doi 10.1109/ICRA48891.2023.10160490

Finding Things in the Unknown: Semantic Object-Centric Exploration with an MAV

Authors: Sotiris Papatheodorou, Nils Funk, Dimos Tzoumanikas, Christopher Choi, Binbin Xu, Stefan Leutenegger

Abstract: Exploration of unknown space with an autonomous mobile robot is a well-studied problem. In this work we broaden the scope of exploration, moving beyond the pure geometric goal of uncovering as much free space as possible. We believe that for many practical applications, exploration should be contextualised with semantic and object-level understanding of the environment for task-specific exploration. Here, we study the task of both finding specific objects in unknown space as well as reconstructing them to a target level of detail. We therefore extend our environment reconstruction to not only consist of a background map, but also object-level and semantically fused submaps. Importantly, we adapt our previous objective function of uncovering as much free space as possible in as little time as possible with two additional elements: first, we require a maximum observation distance of background surfaces to ensure target objects are not missed by image-based detectors because they are too small to be detected. Second, we require an even smaller maximum distance to the found objects in order to reconstruct them with the desired accuracy. We further created a Micro Aerial Vehicle (MAV) semantic exploration simulator based on Habitat in order to quantitatively demonstrate how our framework can be used to efficiently find specific objects as part of exploration. Finally, we showcase that this capability can be deployed in real-world scenes involving our drone equipped with an Intel RealSense D455 RGB-D camera.

Submitted 3 March, 2023; v1 submitted 28 February, 2023; originally announced February 2023.

Comments: 7 pages, 9 figures, accepted in ICRA 2023

arXiv:2210.06270 [pdf, other]

Event-based Non-Rigid Reconstruction from Contours

Authors: Yuxuan Xue, Haolong Li, Stefan Leutenegger, Jörg Stückler

Abstract: Visual reconstruction of fast non-rigid object deformations over time is a challenge for conventional frame-based cameras. In this paper, we propose a novel approach for reconstructing such deformations using measurements from event-based cameras. Under the assumption of a static background, where all events are generated by the motion, our approach estimates the deformation of objects from events generated at the object contour in a probabilistic optimization framework. It associates events to mesh faces on the contour and maximizes the alignment of the line of sight through the event pixel with the associated face. In experiments on synthetic and real data, we demonstrate the advantages of our method over state-of-the-art optimization and learning-based approaches for reconstructing the motion of human hands. A video of the experiments is available at https://youtu.be/gzfw7i5OKjg

Submitted 13 November, 2022; v1 submitted 12 October, 2022; originally announced October 2022.

Comments: Accepted for BMVC2022

arXiv:2208.05067 [pdf, other]

Learning to Complete Object Shapes for Object-level Mapping in Dynamic Scenes

Authors: Binbin Xu, Andrew J. Davison, Stefan Leutenegger

Abstract: In this paper, we propose a novel object-level mapping system that can simultaneously segment, track, and reconstruct objects in dynamic scenes. It can further predict and complete their full geometries by conditioning on reconstructions from depth inputs and a category-level shape prior with the aim that completed object geometry leads to better object reconstruction and tracking accuracy. For each incoming RGB-D frame, we perform instance segmentation to detect objects and build data associations between the detection and the existing object maps. A new object map will be created for each unmatched detection. For each matched object, we jointly optimise its pose and latent geometry representations using geometric residual and differential rendering residual towards its shape prior and completed geometry. Our approach shows better tracking and reconstruction performance compared to methods using traditional volumetric mapping or learned shape prior approaches. We evaluate its effectiveness by quantitatively and qualitatively testing it in both synthetic and real-world sequences.

Submitted 9 August, 2022; originally announced August 2022.

Comments: International Conference on Intelligent Robots and Systems (IROS) 2022

arXiv:2208.04274 [pdf, other]

Visual-Inertial Multi-Instance Dynamic SLAM with Object-level Relocalisation

Authors: Yifei Ren, Binbin Xu, Christopher L. Choi, Stefan Leutenegger

Abstract: In this paper, we present a tightly-coupled visual-inertial object-level multi-instance dynamic SLAM system. Even in extremely dynamic scenes, it can robustly optimise for the camera pose, velocity, IMU biases and build a dense 3D reconstruction object-level map of the environment. Our system can robustly track and reconstruct the geometries of arbitrary objects, their semantics and motion by incrementally fusing associated colour, depth, semantic, and foreground object probabilities into each object model thanks to its robust sensor and object tracking. In addition, when an object is lost or moved outside the camera field of view, our system can reliably recover its pose upon re-observation. We demonstrate the robustness and accuracy of our method by quantitatively and qualitatively testing it in real-world data sequences.

Submitted 8 August, 2022; originally announced August 2022.

Comments: International Conference on Intelligent Robots and Systems (IROS) 2022

arXiv:2208.00709 [pdf, other]

Visual-Inertial SLAM with Tightly-Coupled Dropout-Tolerant GPS Fusion

Authors: Simon Boche, Xingxing Zuo, Simon Schaefer, Stefan Leutenegger

Abstract: Robotic applications are continuously striving towards higher levels of autonomy. To achieve that goal, a highly robust and accurate state estimation is indispensable. Combining visual and inertial sensor modalities has proven to yield accurate and locally consistent results in short-term applications. Unfortunately, visual-inertial state estimators suffer from the accumulation of drift for long-term trajectories. To eliminate this drift, global measurements can be fused into the state estimation pipeline. The best known and most widely available source of global measurements is the Global Positioning System (GPS). In this paper, we propose a novel approach that fully combines stereo Visual-Inertial Simultaneous Localisation and Mapping (SLAM), including visual loop closures, with the fusion of global sensor modalities in a tightly-coupled and optimisation-based framework. Incorporating measurement uncertainties, we provide a robust criterion to solve the global reference frame initialisation problem. Furthermore, we propose a loop-closure-like optimisation scheme to compensate drift accumulated during outages in receiving GPS signals. Experimental validation on datasets and in a real-world experiment demonstrates the robustness of our approach to GPS dropouts as well as its capability to estimate highly accurate and globally consistent trajectories compared to existing state-of-the-art methods.

Submitted 1 August, 2022; originally announced August 2022.

Comments: International Conference on Intelligent Robots and Systems (IROS) 2022

arXiv:2207.13464 [pdf, other]

Towards the Probabilistic Fusion of Learned Priors into Standard Pipelines for 3D Reconstruction

Authors: Tristan Laidlow, Jan Czarnowski, Andrea Nicastro, Ronald Clark, Stefan Leutenegger

Abstract: The best way to combine the results of deep learning with standard 3D reconstruction pipelines remains an open problem. While systems that pass the output of traditional multi-view stereo approaches to a network for regularisation or refinement currently seem to get the best results, it may be preferable to treat deep neural networks as separate components whose results can be probabilistically fused into geometry-based systems. Unfortunately, the error models required to do this type of fusion are not well understood, with many different approaches being put forward. Recently, a few systems have achieved good results by having their networks predict probability distributions rather than single values. We propose using this approach to fuse a learned single-view depth prior into a standard 3D reconstruction system. Our system is capable of incrementally producing dense depth maps for a set of keyframes. We train a deep neural network to predict discrete, nonparametric probability distributions for the depth of each pixel from a single image. We then fuse this "probability volume" with another probability volume based on the photometric consistency between subsequent frames and the keyframe image. We argue that combining the probability volumes from these two sources will result in a volume that is better conditioned. To extract depth maps from the volume, we minimise a cost function that includes a regularisation term based on network predicted surface normals and occlusion boundaries. Through a series of experiments, we demonstrate that each of these components improves the overall performance of the system.

Submitted 27 July, 2022; originally announced July 2022.

Comments: Accepted at ICRA 2020
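
The abstract above describes fusing a network-predicted "probability volume" with a photometric one. The abstract does not say how the two are combined, so the following is only a minimal sketch for a single pixel, assuming the two sources are independent and fused by an element-wise product followed by renormalisation. The function name `fuse_depth_distributions` and the example numbers are hypothetical, not taken from the paper.

```python
def fuse_depth_distributions(p_net, p_photo, eps=1e-12):
    """Fuse two discrete depth distributions for one pixel.

    Assumption (not from the paper): the sources are independent, so the
    fused distribution is the normalised element-wise product.
    """
    if len(p_net) != len(p_photo):
        raise ValueError("both distributions must use the same depth bins")
    product = [a * b for a, b in zip(p_net, p_photo)]
    total = sum(product)
    if total < eps:
        # The sources disagree completely; fall back to a uniform distribution.
        return [1.0 / len(product)] * len(product)
    return [v / total for v in product]


# Hypothetical distributions over three depth bins.
p_net = [0.1, 0.6, 0.3]    # single-view network prior
p_photo = [0.2, 0.4, 0.4]  # photometric-consistency likelihood
fused = fuse_depth_distributions(p_net, p_photo)
best_bin = max(range(len(fused)), key=fused.__getitem__)
```

In this toy case both sources favour the middle bin, so the fused distribution is more peaked there than either input, which is one informal way to read "better conditioned".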

arXiv:2207.12244 [pdf, other]

DeepFusion: Real-Time Dense 3D Reconstruction for Monocular SLAM using Single-View Depth and Gradient Predictions

Authors: Tristan Laidlow, Jan Czarnowski, Stefan Leutenegger

Abstract: While the keypoint-based maps created by sparse monocular simultaneous localisation and mapping (SLAM) systems are useful for camera tracking, dense 3D reconstructions may be desired for many robotic tasks. Solutions involving depth cameras are limited in range and to indoor spaces, and dense reconstruction systems based on minimising the photometric error between frames are typically poorly constrained and suffer from scale ambiguity. To address these issues, we propose a 3D reconstruction system that leverages the output of a convolutional neural network (CNN) to produce fully dense depth maps for keyframes that include metric scale. Our system, DeepFusion, is capable of producing real-time dense reconstructions on a GPU. It fuses the output of a semi-dense multiview stereo algorithm with the depth and gradient predictions of a CNN in a probabilistic fashion, using learned uncertainties produced by the network. While the network only needs to be run once per keyframe, we are able to optimise for the depth map with each new frame so as to constantly make use of new geometric constraints. Based on its performance on synthetic and real-world datasets, we demonstrate that DeepFusion is capable of performing at least as well as other comparable systems.

Submitted 25 July, 2022; originally announced July 2022.

Comments: Accepted at ICRA 2019

arXiv:2207.10940 [pdf, other]

Dense RGB-D-Inertial SLAM with Map Deformations

Authors: Tristan Laidlow, Michael Bloesch, Wenbin Li, Stefan Leutenegger

Abstract: While dense visual SLAM methods are capable of estimating dense reconstructions of the environment, they suffer from a lack of robustness in their tracking step, especially when the optimisation is poorly initialised. Sparse visual SLAM systems have attained high levels of accuracy and robustness through the inclusion of inertial measurements in a tightly-coupled fusion. Inspired by this performance, we propose the first tightly-coupled dense RGB-D-inertial SLAM system. Our system has real-time capability while running on a GPU. It jointly optimises for the camera pose, velocity, IMU biases and gravity direction while building up a globally consistent, fully dense surfel-based 3D reconstruction of the environment. Through a series of experiments on both synthetic and real world datasets, we show that our dense visual-inertial SLAM system is more robust to fast motions and periods of low texture and low geometric variation than a related RGB-D-only SLAM system.

Submitted 22 July, 2022; originally announced July 2022.

Comments: Accepted at IROS 2017; supplementary video available at https://youtu.be/-gUdQ0cxDh0

arXiv:2205.02301 [pdf, other]

BodySLAM: Joint Camera Localisation, Mapping, and Human Motion Tracking

Authors: Dorian F. Henning, Tristan Laidlow, Stefan Leutenegger

Abstract: Estimating human motion from video is an active research area due to its many potential applications. Most state-of-the-art methods predict human shape and posture estimates for individual images and do not leverage the temporal information available in video. Many "in the wild" sequences of human motion are captured by a moving camera, which adds the complication of conflated camera and human motion to the estimation. We therefore present BodySLAM, a monocular SLAM system that jointly estimates the position, shape, and posture of human bodies, as well as the camera trajectory. We also introduce a novel human motion model to constrain sequential body postures and observe the scale of the scene. Through a series of experiments on video sequences of human motion captured by a moving monocular camera, we demonstrate that BodySLAM improves estimates of all human body parameters and camera poses when compared to estimating these separately.

Submitted 24 July, 2022; v1 submitted 4 May, 2022; originally announced May 2022.

Comments: ECCV 2022. Video: https://youtu.be/0-SL3VeWEvU

arXiv:2205.01823 [pdf, other]

Symmetry and Uncertainty-Aware Object SLAM for 6DoF Object Pose Estimation

Authors: Nathaniel Merrill, Yuliang Guo, Xingxing Zuo, Xinyu Huang, Stefan Leutenegger, Xi Peng, Liu Ren, Guoquan Huang

Abstract: We propose a keypoint-based object-level SLAM framework that can provide globally consistent 6DoF pose estimates for symmetric and asymmetric objects alike. To the best of our knowledge, our system is among the first to utilize the camera pose information from SLAM to provide prior knowledge for tracking keypoints on symmetric objects -- ensuring that new measurements are consistent with the current 3D scene. Moreover, our semantic keypoint network is trained to predict the Gaussian covariance for the keypoints that captures the true error of the prediction, and thus is not only useful as a weight for the residuals in the system's optimization problems, but also as a means to detect harmful statistical outliers without choosing a manual threshold. Experiments show that our method provides competitive performance to the state of the art in 6DoF object pose estimation, and at a real-time speed. Our code, pre-trained models, and keypoint labels are available at https://github.com/rpng/suo_slam.

Submitted 13 July, 2022; v1 submitted 3 May, 2022; originally announced May 2022.

Comments: Accepted to CVPR2022

arXiv:2103.16442 [pdf, other]

SIMstack: A Generative Shape and Instance Model for Unordered Object Stacks

Authors: Zoe Landgraf, Raluca Scona, Tristan Laidlow, Stephen James, Stefan Leutenegger, Andrew J. Davison

Abstract: By estimating 3D shape and instances from a single view, we can capture information about an environment quickly, without the need for comprehensive scanning and multi-view fusion. Solving this task for composite scenes (such as object stacks) is challenging: occluded areas are not only ambiguous in shape but also in instance segmentation; multiple decompositions could be valid. We observe that physics constrains decomposition as well as shape in occluded regions and hypothesise that a latent space learned from scenes built under physics simulation can serve as a prior to better predict shape and instances in occluded regions. To this end we propose SIMstack, a depth-conditioned Variational Auto-Encoder (VAE), trained on a dataset of objects stacked under physics simulation. We formulate instance segmentation as a centre voting task which allows for class-agnostic detection and doesn't require setting the maximum number of objects in the scene. At test time, our model can generate 3D shape and instance segmentation from a single depth view, probabilistically sampling proposals for the occluded region from the learned latent space. Our method has practical applications in providing robots some of the ability humans have to make rapid intuitive inferences of partially observed scenes. We demonstrate an application for precise (non-disruptive) object grasping of unknown objects from a single depth view.

Submitted 26 September, 2021; v1 submitted 30 March, 2021; originally announced March 2021.

Journal ref: ICCV 2021

arXiv:2103.15875 [pdf, other]

In-Place Scene Labelling and Understanding with Implicit Scene Representation

Authors: Shuaifeng Zhi, Tristan Laidlow, Stefan Leutenegger, Andrew J. Davison

Abstract: Semantic labelling is highly correlated with geometry and radiance reconstruction, as scene entities with similar shape and appearance are more likely to come from similar classes. Recent implicit neural reconstruction techniques are appealing as they do not require prior training data, but the same fully self-supervised approach is not possible for semantics because labels are human-defined properties. We extend neural radiance fields (NeRF) to jointly encode semantics with appearance and geometry, so that complete and accurate 2D semantic labels can be achieved using a small amount of in-place annotations specific to the scene. The intrinsic multi-view consistency and smoothness of NeRF benefit semantics by enabling sparse labels to efficiently propagate. We show the benefit of this approach when labels are either sparse or very noisy in room-scale scenes. We demonstrate its advantageous properties in various interesting applications such as an efficient scene labelling tool, novel semantic view synthesis, label denoising, super-resolution, label interpolation and multi-view semantic label fusion in visual semantic mapping systems.

Submitted 21 August, 2021; v1 submitted 29 March, 2021; originally announced March 2021.

Comments: Camera ready version. To be published in Proceedings of IEEE International Conference on Computer Vision (ICCV 2021) as Oral Presentation. Project page with more videos: https://shuaifengzhi.com/Semantic-NeRF/

arXiv:2012.03023 [pdf, other]

doi 10.1109/LRA.2021.3070308

Volumetric Occupancy Mapping With Probabilistic Depth Completion for Robotic Navigation

Authors: Marija Popovic, Florian Thomas, Sotiris Papatheodorou, Nils Funk, Teresa Vidal-Calleja, Stefan Leutenegger

Abstract: In robotic applications, a key requirement for safe and efficient motion planning is the ability to map obstacle-free space in unknown, cluttered 3D environments. However, commodity-grade RGB-D cameras commonly used for sensing fail to register valid depth values on shiny, glossy, bright, or distant surfaces, leading to missing data in the map. To address this issue, we propose a framework leveraging probabilistic depth completion as an additional input for spatial mapping. We introduce a deep learning architecture providing uncertainty estimates for the depth completion of RGB-D images. Our pipeline exploits the inferred missing depth values and depth uncertainty to complement raw depth images and improve the speed and quality of free space mapping. Evaluations on synthetic data show that our approach maps significantly more correct free space with relatively low error when compared against using raw data alone in different indoor environments; thereby producing more complete maps that can be directly used for robotic navigation tasks. The performance of our framework is validated using real-world data.

Submitted 22 March, 2021; v1 submitted 5 December, 2020; originally announced December 2020.

Comments: 8 pages, 10 figures, submission to IEEE Robotics and Automation Letters (revised)

arXiv:2010.09232 [pdf, other]

doi 10.1109/ICRA48506.2021.9561736

Elastic and Efficient LiDAR Reconstruction for Large-Scale Exploration Tasks

Authors: Yiduo Wang, Nils Funk, Milad Ramezani, Sotiris Papatheodorou, Marija Popovic, Marco Camurri, Stefan Leutenegger, Maurice Fallon

Abstract: We present an efficient, elastic 3D LiDAR reconstruction framework which can reconstruct up to maximum LiDAR ranges (60 m) at multiple frames per second, thus enabling robot exploration in large-scale environments. Our approach only requires a CPU. We focus on three main challenges of large-scale reconstruction: integration of long-range LiDAR scans at high frequency, the capacity to deform the reconstruction after loop closures are detected, and scalability for long-duration exploration. Our system extends upon a state-of-the-art efficient RGB-D volumetric reconstruction technique, called supereight, to support LiDAR scans and a newly developed submapping technique to allow for dynamic correction of the 3D reconstruction. We then introduce a novel pose graph clustering and submap fusion feature to make the proposed system more scalable for large environments. We evaluate the performance using two public datasets including outdoor exploration with a handheld device and a drone, and with a mobile robot exploring an underground room network. Experimental results demonstrate that our system can reconstruct at 3 Hz with 60 m sensor range and ~5 cm resolution, while state-of-the-art approaches can only reconstruct to 25 cm resolution or 20 m range at the same frequency.

Submitted 9 April, 2021; v1 submitted 19 October, 2020; originally announced October 2020.

Comments: 7 pages, 7 figures

arXiv:2010.07929 [pdf, other]

doi 10.1109/LRA.2021.3061989

Multi-Resolution 3D Mapping with Explicit Free Space Representation for Fast and Accurate Mobile Robot Motion Planning

Authors: Nils Funk, Juan Tarrio, Sotiris Papatheodorou, Marija Popovic, Pablo F. Alcantarilla, Stefan Leutenegger

Abstract: With the aim of bridging the gap between high quality reconstruction and mobile robot motion planning, we propose an efficient system that leverages the concept of adaptive-resolution volumetric mapping, which naturally integrates with the hierarchical decomposition of space in an octree data structure. Instead of a Truncated Signed Distance Function (TSDF), we adopt mapping of occupancy probabilities in log-odds representation, which allows us to represent both surfaces and the entire free (i.e. observed) space, as opposed to unobserved space. We introduce a method for choosing resolution -- on the fly -- in real-time by means of a multi-scale max-min pooling of the input depth image. The notion of explicit free space mapping paired with the spatial hierarchy in the data structure, as well as map resolution, allows for collision queries, as needed for robot motion planning, at unprecedented speed. We quantitatively evaluate mapping accuracy, memory, runtime performance, and planning performance showing improvements over the state of the art, particularly in cases requiring high resolution maps.

Submitted 30 January, 2021; v1 submitted 15 October, 2020; originally announced October 2020.

Comments: 8 pages, 9 figures, 5 tables
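
The abstract above refers to occupancy probabilities in log-odds representation. As a rough illustration of that standard idea (not the paper's implementation), the sketch below updates a single voxel: a Bayesian update becomes an addition in log-odds space, and a value of 0 corresponds to probability 0.5, which is treated here as "unobserved". The function names, the clamping bounds, and the measurement probability 0.3 are all hypothetical choices for the example.

```python
import math


def prob_to_log_odds(p):
    """Convert an occupancy probability in (0, 1) to log-odds."""
    return math.log(p / (1.0 - p))


def log_odds_to_prob(l):
    """Convert log-odds back to an occupancy probability."""
    return 1.0 / (1.0 + math.exp(-l))


def update_voxel(log_odds, p_meas, l_min=-5.0, l_max=5.0):
    """Add the measurement's log-odds to the voxel and clamp the result.

    The clamping bounds are illustrative; they keep a voxel from becoming
    so certain that later measurements can no longer change it.
    """
    return max(l_min, min(l_max, log_odds + prob_to_log_odds(p_meas)))


UNOBSERVED = 0.0  # log-odds 0 is probability 0.5 (assumed initial state)

voxel = UNOBSERVED
for _ in range(3):
    # Three measurements that each suggest the voxel is probably free.
    voxel = update_voxel(voxel, 0.3)

p_occupied = log_odds_to_prob(voxel)
```

After the three "free" observations the voxel's log-odds is negative, so it can be distinguished both from occupied space (positive) and from space that was never observed (exactly zero under this assumption).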

arXiv:2008.13504 [pdf, other]

Deep Probabilistic Feature-metric Tracking

Authors: Binbin Xu, Andrew J. Davison, Stefan Leutenegger

Abstract: Dense image alignment from RGB-D images remains a critical issue for real-world applications, especially under challenging lighting conditions and in a wide baseline setting. In this paper, we propose a new framework to learn a pixel-wise deep feature map and a deep feature-metric uncertainty map predicted by a Convolutional Neural Network (CNN), which together formulate a deep probabilistic feature-metric residual of the two-view constraint that can be minimised using Gauss-Newton in a coarse-to-fine optimisation framework. Furthermore, our network predicts a deep initial pose for faster and more reliable convergence. The optimisation steps are differentiable and unrolled to train in an end-to-end fashion. Due to its probabilistic essence, our approach can easily couple with other residuals, where we show a combination with ICP. Experimental results demonstrate state-of-the-art performances on the TUM RGB-D dataset and the 3D rigid object tracking dataset. We further demonstrate our method's robustness and convergence qualitatively.

Submitted 25 November, 2020; v1 submitted 31 August, 2020; originally announced August 2020.

Comments: RAL 2020. 8 pages, 9 figures, video link: https://youtu.be/6pMosl6ZAPE

arXiv:2006.02116 [pdf, other]

Aerial Manipulation Using Hybrid Force and Position NMPC Applied to Aerial Writing

Authors: Dimos Tzoumanikas, Felix Graule, Qingyue Yan, Dhruv Shah, Marija Popovic, Stefan Leutenegger

Abstract: Aerial manipulation aims at combining the manoeuvrability of aerial vehicles with the manipulation capabilities of robotic arms. This, however, comes at the cost of the additional control complexity due to the coupling of the dynamics of the two systems. In this paper we present an NMPC specifically designed for MAVs equipped with a robotic arm. We formulate a hybrid control model for the combined MAV-arm system which incorporates interaction forces acting on the end effector. We explain the practical implementation of our algorithm and show extensive experimental results of our custom built system performing multiple aerial-writing tasks on a whiteboard, revealing accuracy in the order of millimetres.

Submitted 3 June, 2020; originally announced June 2020.

Comments: Accepted for publication in Robotics: Science and Systems (RSS) 2020. Video: https://youtu.be/iE--MO0YF0o

arXiv:2003.03134 [pdf, other]

Bundle Adjustment on a Graph Processor

Authors: Joseph Ortiz, Mark Pupilli, Stefan Leutenegger, Andrew J. Davison

Abstract: Graph processors such as Graphcore's Intelligence Processing Unit (IPU) are part of the major new wave of novel computer architecture for AI, and have a general design with massively parallel computation, distributed on-chip memory and very high inter-core communication bandwidth which allows breakthrough performance for message passing algorithms on arbitrary graphs. We show for the first time that the classical computer vision problem of bundle adjustment (BA) can be solved extremely fast on a graph processor using Gaussian Belief Propagation. Our simple but fully parallel implementation uses the 1216 cores on a single IPU chip to, for instance, solve a real BA problem with 125 keyframes and 1919 points in under 40ms, compared to 1450ms for the Ceres CPU library. Further code optimisation will surely increase this difference on static problems, but we argue that the real promise of graph processing is for flexible in-place optimisation of general, dynamically changing factor graphs representing Spatial AI problems. We give indications of this with experiments showing the ability of GBP to efficiently solve incremental SLAM problems, and deal with robust cost functions and different types of factors.

Submitted 30 March, 2020; v1 submitted 6 March, 2020; originally announced March 2020.

Comments: Published in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2020). Video: https://www.youtube.com/watch?v=TqeN8aQNgd0

arXiv:2002.10342 [pdf, other]

Comparing View-Based and Map-Based Semantic Labelling in Real-Time SLAM

Authors: Zoe Landgraf, Fabian Falck, Michael Bloesch, Stefan Leutenegger, Andrew Davison

Abstract: Generally capable Spatial AI systems must build persistent scene representations where geometric models are combined with meaningful semantic labels. The many approaches to labelling scenes can be divided into two clear groups: view-based which estimate labels from the input view-wise data and then incrementally fuse them into the scene model as it is built; and map-based which label the generated scene model. However, there has so far been no attempt to quantitatively compare view-based and map-based labelling. Here, we present an experimental framework and comparison which uses real-time height map fusion as an accessible platform for a fair comparison, opening up the route to further systematic research in this area.

Submitted 24 February, 2020; originally announced February 2020.

Comments: ICRA 2020

arXiv:2002.07705 [pdf, other]

Towards Bounding-Box Free Panoptic Segmentation

Authors: Ujwal Bonde, Pablo F. Alcantarilla, Stefan Leutenegger

Abstract: In this work we introduce a new Bounding-Box Free Network (BBFNet) for panoptic segmentation. Panoptic segmentation is an ideal problem for proposal-free methods as it already requires per-pixel semantic class labels. We use this observation to exploit class boundaries from off-the-shelf semantic segmentation networks and refine them to predict instance labels. Towards this goal BBFNet predicts coarse watershed levels and uses them to detect large instance candidates where boundaries are well defined. For smaller instances, whose boundaries are less reliable, BBFNet also predicts instance centers by means of Hough voting followed by mean-shift to reliably detect small objects. A novel triplet loss network helps merging fragmented instances while refining boundary pixels. Our approach is distinct from previous works in panoptic segmentation that rely on a combination of a semantic segmentation network with a computationally costly instance segmentation network based on bounding box proposals, such as Mask R-CNN, to guide the prediction of instance labels using a Mixture-of-Expert (MoE) approach. We benchmark our proposal-free method on Cityscapes and Microsoft COCO datasets and show competitive performance with other MoE based approaches while outperforming existing non-proposal based methods on the COCO dataset. We show the flexibility of our method using different semantic segmentation backbones.

Submitted 27 July, 2020; v1 submitted 18 February, 2020; originally announced February 2020.

Comments: 15 pages, 6 figures

arXiv:2002.06598 [pdf, other]

Nonlinear MPC with Motor Failure Identification and Recovery for Safe and Aggressive Multicopter Flight

Authors: Dimos Tzoumanikas, Qingyue Yan, Stefan Leutenegger

Abstract: Safe and precise reference tracking is a crucial characteristic of MAVs that have to operate under the influence of external disturbances in cluttered environments. In this paper, we present a NMPC that exploits the fully physics based non-linear dynamics of the system. We furthermore show how the moment and thrust control inputs can be transformed into feasible actuator commands. In order to guarantee safe operation despite potential loss of a motor under which we show our system keeps operating safely, we developed an EKF based motor failure identification algorithm. We verify the effectiveness of the developed pipeline in flight experiments with and without motor failures.

Submitted 16 February, 2020; originally announced February 2020.

Comments: Accepted in the International Conference on Robotics and Automation (ICRA) 2020. 7 (6 + 1) pages. Video link: https://youtu.be/cAQeSZ3tIqY

arXiv:2002.04440 [pdf, other]

doi 10.1109/ICRA40945.2020.9196707

Fast Frontier-based Information-driven Autonomous Exploration with an MAV

Authors: Anna Dai, Sotiris Papatheodorou, Nils Funk, Dimos Tzoumanikas, Stefan Leutenegger

Abstract: Exploration and collision-free navigation through an unknown environment is a fundamental task for autonomous robots. In this paper, a novel exploration strategy for Micro Aerial Vehicles (MAVs) is presented. The goal of the exploration strategy is the reduction of map entropy regarding occupancy probabilities, which is reflected in a utility function to be maximised. We achieve fast and efficient exploration performance with tight integration between our octree-based occupancy mapping approach, frontier extraction, and motion planning, as a hybrid between frontier-based and sampling-based exploration methods. The computationally expensive frontier clustering employed in classic frontier-based exploration is avoided by exploiting the implicit grouping of frontier voxels in the underlying octree map representation. Candidate next-views are sampled from the map frontiers and are evaluated using a utility function combining map entropy and travel time, where the former is computed efficiently using sparse raycasting. These optimisations along with the targeted exploration of frontier-based methods result in a fast and computationally efficient exploration planner. The proposed method is evaluated using both simulated and real-world experiments, demonstrating clear advantages over state-of-the-art approaches.

Submitted 13 February, 2020; v1 submitted 11 February, 2020; originally announced February 2020.

Comments: Accepted in the International Conference on Robotics and Automation (ICRA) 2020, 7 pages, 8 figures, for the accompanying video see https://youtu.be/tH2VkVony38

arXiv:1904.08405 [pdf, other]

doi 10.1109/TPAMI.2020.3008413

Event-based Vision: A Survey

Authors: Guillermo Gallego, Tobi Delbruck, Garrick Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger, Andrew Davison, Joerg Conradt, Kostas Daniilidis, Davide Scaramuzza

Abstract: Event cameras are bio-inspired sensors that differ from conventional frame cameras: Instead of capturing images at a fixed rate, they asynchronously measure per-pixel brightness changes, and output a stream of events that encode the time, location and sign of the brightness changes. Event cameras offer attractive properties compared to traditional cameras: high temporal resolution (in the order of microseconds), very high dynamic range (140 dB vs. 60 dB), low power consumption, and high pixel bandwidth (on the order of kHz) resulting in reduced motion blur. Hence, event cameras have a large potential for robotics and computer vision in challenging scenarios for traditional cameras, such as low-latency, high speed, and high dynamic range. However, novel methods are required to process the unconventional output of these sensors in order to unlock their potential. This paper provides a comprehensive overview of the emerging field of event-based vision, with a focus on the applications and the algorithms developed to unlock the outstanding properties of event cameras. We present event cameras from their working principle, the actual sensors that are available and the tasks that they have been used for, from low-level vision (feature detection and tracking, optic flow, etc.) to high-level vision (reconstruction, segmentation, recognition). We also discuss the techniques developed to process events, including learning-based techniques, as well as specialized processors for these novel sensors, such as spiking neural networks. Additionally, we highlight the challenges that remain to be tackled and the opportunities that lie ahead in the search for a more efficient, bio-inspired way for machines to perceive and interact with the world.

Submitted 8 August, 2020; v1 submitted 17 April, 2019; originally announced April 2019.

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020

arXiv:1903.06482 [pdf, other]

SceneCode: Monocular Dense Semantic Reconstruction using Learned Encoded Scene Representations

Authors: Shuaifeng Zhi, Michael Bloesch, Stefan Leutenegger, Andrew J. Davison

Abstract: Systems which incrementally create 3D semantic maps from image sequences must store and update representations of both geometry and semantic entities. However, while there has been much work on the correct formulation for geometrical estimation, state-of-the-art systems usually rely on simple semantic representations which store and update independent label estimates for each surface element (depth pixels, surfels, or voxels). Spatial correlation is discarded, and fused label maps are incoherent and noisy. We introduce a new compact and optimisable semantic representation by training a variational auto-encoder that is conditioned on a colour image. Using this learned latent space, we can tackle semantic label fusion by jointly optimising the low-dimensional codes associated with each of a set of overlapping images, producing consistent fused label maps which preserve spatial correlation. We also show how this approach can be used within a monocular keyframe based semantic mapping system where a similar code approach is used for geometry. The probabilistic formulation allows a flexible formulation where we can jointly estimate motion, geometry and semantics in a unified optimisation.

Submitted 18 March, 2019; v1 submitted 15 March, 2019; originally announced March 2019.

Comments: To be published in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019)

arXiv:1903.00987 [pdf, other]

X-Section: Cross-Section Prediction for Enhanced RGBD Fusion

Authors: Andrea Nicastro, Ronald Clark, Stefan Leutenegger

Abstract: Detailed 3D reconstruction is an important challenge with application to robotics, augmented and virtual reality, which has seen impressive progress throughout the past years. Advancements were driven by the availability of depth cameras (RGB-D), as well as increased compute power, e.g. in the form of GPUs -- but also thanks to inclusion of machine learning in the process. Here, we propose X-Section, an RGB-D 3D reconstruction approach that leverages deep learning to make object-level predictions about thicknesses that can be readily integrated into a volumetric multi-view fusion process, where we propose an extension to the popular KinectFusion approach. In essence, our method allows to complete shape in general indoor scenes behind what is sensed by the RGB-D camera, which may be crucial e.g. for robotic manipulation tasks or efficient scene exploration. Predicting object thicknesses rather than volumes allows us to work with comparably high spatial resolution without exploding memory and training data requirements on the employed Convolutional Neural Networks. In a series of qualitative and quantitative evaluations, we demonstrate how we accurately predict object thickness and reconstruct general 3D scenes containing multiple objects.

Submitted 12 August, 2019; v1 submitted 3 March, 2019; originally announced March 2019.

arXiv:1812.07976 [pdf, other]

MID-Fusion: Octree-based Object-Level Multi-Instance Dynamic SLAM

Authors: Binbin Xu, Wenbin Li, Dimos Tzoumanikas, Michael Bloesch, Andrew Davison, Stefan Leutenegger

Abstract: We propose a new multi-instance dynamic RGB-D SLAM system using an object-level octree-based volumetric representation. It can provide robust camera tracking in dynamic environments and at the same time, continuously estimate geometric, semantic, and motion properties for arbitrary objects in the scene. For each incoming frame, we perform instance segmentation to detect objects and refine mask boundaries using geometric and motion information. Meanwhile, we estimate the pose of each existing moving object using an object-oriented tracking method and robustly track the camera pose against the static scene. Based on the estimated camera pose and object poses, we associate segmented masks with existing models and incrementally fuse corresponding colour, depth, semantic, and foreground object probabilities into each object model. In contrast to existing approaches, our system is the first system to generate an object-level dynamic volumetric map from a single RGB-D camera, which can be used directly for robotic tasks. Our method can run at 2-3 Hz on a CPU, excluding the instance segmentation part. We demonstrate its effectiveness by quantitatively and qualitatively testing it on both synthetic and real-world sequences.

Submitted 21 March, 2019; v1 submitted 19 December, 2018; originally announced December 2018.

Comments: Accepted to International Conference on Robotics and Automation (ICRA) 2019. 7 (6 + 1) pages. Please also see video Link: https://youtu.be/gturboNl9gg

arXiv:1809.02966 [pdf, other]

LS-Net: Learning to Solve Nonlinear Least Squares for Monocular Stereo

Authors: Ronald Clark, Michael Bloesch, Jan Czarnowski, Stefan Leutenegger, Andrew J. Davison

Abstract: Sum-of-squares objective functions are very popular in computer vision algorithms. However, these objective functions are not always easy to optimize. The underlying assumptions made by solvers are often not satisfied and many problems are inherently ill-posed. In this paper, we propose LS-Net, a neural nonlinear least squares optimization algorithm which learns to effectively optimize these cost functions even in the presence of adversities. Unlike traditional approaches, the proposed solver requires no hand-crafted regularizers or priors as these are implicitly learned from the data. We apply our method to the problem of motion stereo, i.e. jointly estimating the motion and scene geometry from pairs of images of a monocular sequence. We show that our learned optimizer is able to efficiently and effectively solve this challenging optimization problem.

Submitted 9 September, 2018; originally announced September 2018.

Comments: ECCV 2018. Video: https://youtu.be/5bZbMm8UqbA

arXiv:1809.00716 [pdf, other]

InteriorNet: Mega-scale Multi-sensor Photo-realistic Indoor Scenes Dataset

Authors: Wenbin Li, Sajad Saeedi, John McCormac, Ronald Clark, Dimos Tzoumanikas, Qing Ye, Yuzhong Huang, Rui Tang, Stefan Leutenegger

Abstract: Datasets have gained an enormous amount of popularity in the computer vision community, from training and evaluation of Deep Learning-based methods to benchmarking Simultaneous Localization and Mapping (SLAM). Without a doubt, synthetic imagery bears a vast potential due to scalability in terms of amounts of data obtainable without tedious manual ground truth annotations or measurements. Here, we present a dataset with the aim of providing a higher degree of photo-realism, larger scale, more variability as well as serving a wider range of purposes compared to existing datasets. Our dataset leverages the availability of millions of professional interior designs and millions of production-level furniture and object assets -- all coming with fine geometric details and high-resolution texture. We render high-resolution and high frame-rate video sequences following realistic trajectories while supporting various camera types as well as providing inertial measurements. Together with the release of the dataset, we will make an executable program of our interactive simulator software as well as our renderer available at https://interiornetdataset.github.io. To showcase the usability and uniqueness of our dataset, we show benchmarking results of both sparse and dense SLAM algorithms.

Submitted 3 September, 2018; originally announced September 2018.

Comments: British Machine Vision Conference (BMVC) 2018

arXiv:1808.08378 [pdf, other]

Fusion++: Volumetric Object-Level SLAM

Authors: John McCormac, Ronald Clark, Michael Bloesch, Andrew J. Davison, Stefan Leutenegger

Abstract: We propose an online object-level SLAM system which builds a persistent and accurate 3D graph map of arbitrary reconstructed objects. As an RGB-D camera browses a cluttered indoor scene, Mask-RCNN instance segmentations are used to initialise compact per-object Truncated Signed Distance Function (TSDF) reconstructions with object size-dependent resolutions and a novel 3D foreground mask. Reconstructed objects are stored in an optimisable 6DoF pose graph which is our only persistent map representation. Objects are incrementally refined via depth fusion, and are used for tracking, relocalisation and loop closure detection. Loop closures cause adjustments in the relative pose estimates of object instances, but no intra-object warping. Each object also carries semantic information which is refined over time and an existence probability to account for spurious instance predictions. We demonstrate our approach on a hand-held RGB-D sequence from a cluttered office scene with a large number and variety of object instances, highlighting how the system closes loops and makes good use of existing objects on repeated loops. We quantitatively evaluate the trajectory error of our system against a baseline approach on the RGB-D SLAM benchmark, and qualitatively compare reconstruction quality of discovered objects on the YCB video dataset. Performance evaluation shows our approach is highly memory efficient and runs online at 4-8Hz (excluding relocalisation) despite not being optimised at the software level.

Submitted 28 August, 2018; v1 submitted 25 August, 2018; originally announced August 2018.

arXiv:1807.10561 [pdf, other]

Towards an Embodied Semantic Fovea: Semantic 3D scene reconstruction from ego-centric eye-tracker videos

Authors: Mickey Li, Noyan Songur, Pavel Orlov, Stefan Leutenegger, A Aldo Faisal

Abstract: Incorporating the physical environment is essential for a complete understanding of human behavior in unconstrained every-day tasks. This is especially important in ego-centric tasks where obtaining 3 dimensional information is both limiting and challenging with the current 2D video analysis methods proving insufficient. Here we demonstrate a proof-of-concept system which provides real-time 3D mapping and semantic labeling of the local environment from an ego-centric RGB-D video-stream with 3D gaze point estimation from head mounted eye tracking glasses. We augment existing work in Semantic Simultaneous Localization And Mapping (Semantic SLAM) with collected gaze vectors. Our system can then find and track objects both inside and outside the user field-of-view in 3D from multiple perspectives with reasonable accuracy. We validate our concept by producing a semantic map from images of the NYUv2 dataset while simultaneously estimating gaze position and gaze classes from recorded gaze data of the dataset images.

Submitted 27 July, 2018; originally announced July 2018.

arXiv:1804.00874 [pdf, other]

CodeSLAM - Learning a Compact, Optimisable Representation for Dense Visual SLAM

Authors: Michael Bloesch, Jan Czarnowski, Ronald Clark, Stefan Leutenegger, Andrew J. Davison

Abstract: The representation of geometry in real-time 3D perception systems continues to be a critical research issue. Dense maps capture complete surface shape and can be augmented with semantic labels, but their high dimensionality makes them computationally costly to store and process, and unsuitable for rigorous probabilistic inference. Sparse feature-based representations avoid these problems, but capture only partial scene information and are mainly useful for localisation only. We present a new compact but dense representation of scene geometry which is conditioned on the intensity data from a single image and generated from a code consisting of a small number of parameters. We are inspired by work both on learned depth from images, and auto-encoders. Our approach is suitable for use in a keyframe-based monocular dense SLAM system: While each keyframe with a code can produce a depth map, the code can be optimised efficiently jointly with pose variables and together with the codes of overlapping keyframes to attain global consistency. Conditioning the depth map on the image allows the code to only represent aspects of the local geometry which cannot directly be predicted from the image. We explain how to learn our code representation, and demonstrate its advantageous properties in monocular SLAM.

Submitted 14 April, 2019; v1 submitted 3 April, 2018; originally announced April 2018.

Comments: Published in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018)

arXiv:1708.08844 [pdf, other]

Semantic Texture for Robust Dense Tracking

Authors: Jan Czarnowski, Stefan Leutenegger, Andrew Davison

Abstract: We argue that robust dense SLAM systems can make valuable use of the layers of features coming from a standard CNN as a pyramid of 'semantic texture' which is suitable for dense alignment while being much more robust to nuisance factors such as lighting than raw RGB values. We use a straightforward Lucas-Kanade formulation of image alignment, with a schedule of iterations over the coarse-to-fine levels of a pyramid, and simply replace the usual image pyramid by the hierarchy of convolutional feature maps from a pre-trained CNN. The resulting dense alignment performance is much more robust to lighting and other variations, as we show by camera rotation tracking experiments on time-lapse sequences captured over many hours. Looking towards the future of scene representation for real-time visual SLAM, we further demonstrate that a selection using simple criteria of a small number of the total set of features output by a CNN gives just as accurate but much more efficient tracking performance.

Submitted 29 August, 2017; originally announced August 2017.

arXiv:1612.05079 [pdf, other]

SceneNet RGB-D: 5M Photorealistic Images of Synthetic Indoor Trajectories with Ground Truth

Authors: John McCormac, Ankur Handa, Stefan Leutenegger, Andrew J. Davison

Abstract: We introduce SceneNet RGB-D, expanding the previous work of SceneNet to enable large scale photorealistic rendering of indoor scene trajectories. It provides pixel-perfect ground truth for scene understanding problems such as semantic segmentation, instance segmentation, and object detection, and also for geometric computer vision problems such as optical flow, depth estimation, camera pose estimation, and 3D reconstruction. Random sampling permits virtually unlimited scene configurations, and here we provide a set of 5M rendered RGB-D images from over 15K trajectories in synthetic layouts with random but physically simulated object poses. Each layout also has random lighting, camera trajectories, and textures. The scale of this dataset is well suited for pre-training data-driven computer vision techniques from scratch with RGB-D inputs, which previously has been limited by relatively small labelled datasets in NYUv2 and SUN RGB-D. It also provides a basis for investigating 3D scene labelling tasks by providing perfect camera poses and depth data as proxy for a SLAM system. We host the dataset at http://robotvault.bitbucket.io/scenenet-rgbd.html

Submitted 30 January, 2017; v1 submitted 15 December, 2016; originally announced December 2016.

Showing 1–50 of 55 results for author: Leutenegger, S